Learning Mixtures of Linear Dynamical Systems

Yanxi Chen, H. Vincent Poor

Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, USA. Correspondence to: Yanxi Chen.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\widetilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.

1. Introduction

Imagine that we are asked to learn multiple linear dynamical systems (LDSs) from a mixture of unlabeled sample trajectories: each sample trajectory is generated by one of the LDSs of interest, but we have no idea which system it is. To set the stage and facilitate discussion, recall that in a classical LDS, one might observe a sample trajectory $\{x_t\}_{0 \le t \le T}$ generated by an LDS obeying $x_{t+1} = A x_t + w_t$, where $A \in \mathbb{R}^{d \times d}$ determines the system dynamics in the noiseless case, and $\{w_t\}_{t \ge 0}$ denote independent zero-mean noise vectors with covariance $\mathrm{cov}(w_t) = W \succ 0$. The mixed LDSs setting considered herein extends classical LDSs by allowing for mixed measurements as described below; see Figure 1 for a visualization of the scenario.

Multiple linear systems. Suppose that there are $K$ different LDSs represented by $\{(A^{(k)}, W^{(k)})\}_{1 \le k \le K}$, where $A^{(k)} \in \mathbb{R}^{d \times d}$ and $W^{(k)} \in \mathbb{R}^{d \times d}$ represent the state transition matrix and noise covariance matrix of the $k$-th LDS, respectively. Here and throughout, we shall refer to $(A^{(k)}, W^{(k)})$ as the system matrix for the $k$-th LDS. We only assume that $(A^{(k)}, W^{(k)}) \neq (A^{(\ell)}, W^{(\ell)})$ for any $k \neq \ell$, whereas $A^{(k)} = A^{(\ell)}$ or $W^{(k)} = W^{(\ell)}$ is allowed.

Mixed sample trajectories. We collect a total number of $M$ unlabeled trajectories from these LDSs. More specifically, the $m$-th sample trajectory is drawn from one of the LDSs in the following manner: set $(A, W)$ to be $(A^{(k)}, W^{(k)})$ for some $1 \le k \le K$, and generate a trajectory of length $T_m$ obeying $x_{t+1} = A x_t + w_t$, where the $w_t$'s are i.i.d. with $\mathbb{E}[w_t] = 0$ and $\mathrm{cov}(w_t) = W \succ 0$. Note, however, that the label $k$ associated with each sample trajectory is a latent variable not revealed to us, resulting in a mixture of unlabeled trajectories. The current paper focuses on the case where the length of each trajectory is somewhat short, making it infeasible to estimate the system matrix from a single trajectory.

Goal. The aim is to jointly learn the system matrices $\{(A^{(k)}, W^{(k)})\}_{1 \le k \le K}$ from the mixture of sample trajectories. In particular, we seek to accomplish this task in a sample-efficient manner, where the total sample size is defined to be the aggregate trajectory length $\sum_{m=1}^{M} T_m$.
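To make the data-generating process concrete, the following minimal simulation sketch (our own illustration, not the paper's code; the function and parameter names are ours) produces such a mixture of unlabeled short trajectories, driven here by Gaussian noise and started from the zero state:

```python
import numpy as np

def simulate_mixed_lds(models, M, T, seed=None):
    """Generate M unlabeled length-T trajectories, each from a randomly chosen LDS.

    models: list of (A, W) pairs with A, W of shape (d, d) and W positive definite.
    Returns the trajectories (arrays of shape (T + 1, d)) and the hidden labels.
    """
    rng = np.random.default_rng(seed)
    d = models[0][0].shape[0]
    labels = rng.integers(len(models), size=M)      # latent labels, unknown to the learner
    trajectories = []
    for k in labels:
        A, W = models[k]
        L = np.linalg.cholesky(W)                   # draw noise with covariance W
        x = np.zeros((T + 1, d))                    # zero initial state
        for t in range(T):
            x[t + 1] = A @ x[t] + L @ rng.standard_normal(d)
        trajectories.append(x)
    return trajectories, labels

# Two toy LDS models sharing the same noise covariance but different dynamics.
d = 4
A1, A2 = 0.5 * np.eye(d), 0.7 * np.eye(d)
W = np.eye(d)
trajs, labels = simulate_mixed_lds([(A1, W), (A2, W)], M=100, T=10, seed=0)
```

Recovering the underlying system matrices (and, implicitly, which trajectory came from which system) from `trajs` alone, without access to `labels`, is the task studied in this paper.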
Motivations. The mixed LDSs setting described above is motivated by many real-world scenarios where a single time-series model is insufficient to capture the complex and heterogeneous patterns in temporal data. For instance, in psychology (Bulteel et al., 2016), researchers collect multiple time-series trajectories (e.g. depression-related symptoms over a period of time) from different patients. Fitting these data with multi-modal LDSs (instead of a single model) not only achieves better fitting performance, but also helps identify subgroups of patients, which further inspires interpretations of the results and tailored treatments for patients from different subgroups. Another example concerns an automobile sensor dataset, which consists of a single continuous (but possibly time-varying) trajectory of measurements from the sensors (Hallac et al., 2017). Through segmentation of the trajectory, clustering of the short pieces, and learning within each cluster, one can discover, and obtain meaningful interpretations for, a small number of key driving modes (such as "driving straight", "slowing down", "turning"). See Section 4 for a longer list of applications.

[Figure 1. A high-level visualization of the mixed LDSs formulation, and the algorithmic idea of combining clustering, classification, and model estimation: a mixture of trajectories is clustered and classified, and a model $(\widehat{A}^{(k)}, \widehat{W}^{(k)})$ is estimated for each cluster. Here, we consider the special case where the multiple short trajectories come from the segments of a single continuous trajectory. The black dots within the continuous trajectory represent the time steps when the latent variable (i.e. the unknown label) changes.]

Challenges. While there is no shortage of potential applications, a mixture of LDSs is far more challenging to learn compared to the classical setting with a single LDS. In particular, the presence of the latent variables, i.e. the unknown labels of the sample trajectories, significantly complicates matters. One straightforward idea is to learn a coarse model for each trajectory, followed by proper clustering of these coarse models (to be utilized for refined model estimation); however, this idea becomes infeasible in the high-dimensional setting unless all trajectories are sufficiently long. Another popular approach is to alternate between model estimation and clustering of trajectories, based on, say, the expectation-maximization (EM) algorithm; unfortunately, there is no theoretical support for such EM-type algorithms, and we cannot preclude the possibility that the algorithms get stuck at undesired local optima. The lack of theoretical guarantees in prior literature motivates us to come up with algorithms that enjoy provable performance guarantees. Finally, the present paper is also inspired by the recent progress in meta-learning for mixed linear regression (Kong et al., 2020b;a), where the goal is to learn multiple linear models from a mixture of independent samples; note, however, that the temporal dependence underlying time-series data in our case poses substantial challenges and calls for the development of new algorithmic ideas.

Main contributions.
In this work, we take an important step towards guaranteed learning of mixed LDSs, focusing on algorithm design that comes with end-to-end theoretical guarantees. In particular, we propose a two-stage metaalgorithm to tackle the challenge of mixed LDSs: 1. Coarse estimation: perform a coarse-level clustering of the unlabeled sample trajectories (assisted by dimension reduction), and compute initial model estimation for each cluster; 2. Refinement: classify additional trajectories (and add each of them to the corresponding cluster) based on the above coarse model estimates, followed by refined model estimation with the updated clusters. This two-stage meta approach, as well as the specific methods for individual steps, will be elucidated in Section 2. Encouragingly, assuming that the noise vectors {wm,t} are independent Gaussian random vectors, the proposed twostage algorithm is not only computationally efficient, but also guaranteed to succeed in the presence of a polynomial sample size. Informally, our algorithm achieves exact clustering/classification of the sample trajectories as well as faithful model estimation, under the following conditions (with a focus on the dependency on the dimension d): (1) each short trajectory length Tm is allowed to be much smaller than d; (2) the total trajectory length of each stage is linear in d (up to logarithmic factors); (3) to achieve a final model estimation error ϵ 0 (in the spectral norm), it suffices to have a total trajectory length of order d/ϵ2 for each LDS model. See Section 3 (in particular, Corollary 3.7) for the precise statements of our main results, which will also be validated numerically. It is worth noting that, although we focus on mixed LDSs for concreteness, we will make clear that the proposed modular algorithm is fairly flexible and can be adapted to learning mixtures of other time-series models, as long as certain technical conditions are satisfied; see Remark 2.3 at the end of Section 2 for a detailed discussion. Notation. Throughout this paper, vectors and matrices are represented by boldface letters. For a vector x Rd, we define (x)i R as the i-th entry of x, and x 2 as its ℓ2 norm; for a matrix X Rm n, we define (X)i Rn as the transpose of the i-th row of X, and X (resp. X F) as its spectral (resp. Frobenius) norm. For a symmetric matrix X Rd d, we denote its maximal (resp. minimal) eigenvalue as λmax(X) (resp. λmin(X)); if in addition X is positive definite, we denote its condition number as κ(X) := Learning Mixtures of Linear Dynamical Systems Algorithm 1 A two-stage algorithm for mixed LDSs 1: Input: M short trajectories {Xm}1 m M (where Xm = {xm,t}0 t Tm); parameters τ, G. 2: // Stage 1: coarse estimation. 3: Run Algorithm 2 with {Xm}m Msubspace to obtain subspaces {Vi, Ui}1 i d . 4: Run Algorithm 3 with {Xm}m Mclustering, {Vi, Ui}1 i d, τ, G, to obtain clusters {Ck}1 k K. 5: Run Algorithm 4 with {Ck}1 k K to obtain coarse models { b A(k), c W (k)}1 k K. 6: // Stage 2: refinement. 7: Run Algorithm 5 with {Xm}m Mclassification, { b A(k), c W (k)}1 k K, {Ck}1 k K, to update clusters {Ck}. 8: Run Algorithm 4 with {Ck}1 k K to obtain refined models { b A(k), c W (k)}1 k K. 9: Output: final model estimation { b A(k), c W (k)}1 k K and clusters {Ck}1 k K. λmax(X)/λmin(X) 1. If A = [Aij]1 i m,1 j n and B = [Bij]1 i m,1 j n are matrices of the same dimension, we denote by A, B := Pm i=1 Pn j=1 Aij Bij their inner product. Given vectors x1, . . . 
, xn Rd where n < d, let span{xi, 1 i N} Rd n represent the subspace spanned by these vectors. Let Id be the d d identity matrix. We always use the superscript (k) to indicate the k-th model , as in A(k) and W (k); this is to be distinguished from the superscript k without the parentheses, which simply means the power of k. For a discrete set Ω, we denote by |Ω| its cardinality. Define 1(E) to be the indicator function, which takes value 1 if the event E happens, and 0 otherwise. Let an bn indicate that an C0bn for all n = 1, 2, . . . , where C0 > 0 is some universal constant; moreover, an bn is equivalent to bn an, and an bn indicates that an bn and an bn hold simultaneously. 2. Algorithms We propose a two-stage paradigm for solving the mixed LDSs problem, as summarized in Algorithm 1. It consists of several subroutines as described in Algorithms 2 5; due to space limitation, in this section we will only illustrate the key ideas behind these subroutines, with full details deferred to Appendix A. Note that Algorithm 1 is stated in a modular manner, and one might replace certain subroutines by alternative schemes in order to handle different settings and model assumptions. Let us first introduce some additional notation and assumptions that will be useful for our presentation, without much loss of generality. To begin with, we augment the notation for each sample trajectory with its trajectory index; that is, for each 1 m M, the m-th trajectory denoted by Xm := {xm,t}0 t Tm starts with some initial state xm,0 Rd, and evolves according to the km-th LDS for some unknown label 1 km K such that xm,t+1 = A(km)xm,t + wm,t, where the wm,t s are i.i.d., E[wm,t] = 0, cov(wm,t) = W (km) 0 (1) for all 0 t Tm 1. Next, we divide the M sample tra- jectories {Xm}1 m M in hand into three disjoint subsets Msubspace, Mclustering, Mclassification satisfying Msubspace Mclustering Mclassification = {1, 2, . . . , M}, where each subset of samples will be employed to perform one subroutine. We assume that all trajectories within each subset have the same length, namely, Tsubspace if m Msubspace, Tclustering if m Mclustering, Tclassification if m Mclassification. (2) Finally, we assume that K d, so that performing subspace estimation in Algorithm 1 will be helpful (otherwise one might simply eliminate this step). The interested readers are referred to Appendix E for discussions of some potential extensions of our algorithms (e.g. adapting to the case where the trajectories within a subset have different lengths, or the case where certain parameters are a priori unknown). 2.1. Preliminary facts We first introduce some preliminary background on the autocovariance structures and mixing property of linear dynamical systems, which form the basis of our algorithms for subspace estimation and clustering of trajectories. Stationary covariance matrices. Consider first a single LDS model (A, W ), with xt+1 = Axt + wt. If E[xt] = 0, cov(xt) = Γ and E[wt] = 0, cov(wt) = W , then it follows that E[xt+1] = 0, and cov(xt+1) = A cov(xt) A +cov(wt) = AΓA +W . Under certain assumption on stability, this leads to the order-0 stationary autocovariance matrix defined as (Kailath et al., 2000) Γ(A, W ) := E xtxt |A, W = A Γ(A, W ) A + W = t=0 At W (At) , (3) and the order-1 stationary autocovariance matrix Y (A, W ) := E xt+1xt |A, W = A Γ(A, W ). (4) Learning Mixtures of Linear Dynamical Systems For the mixed LDSs case (1) with K models, we abbreviate Γ(k) := Γ(A(k), W (k)), Y (k) := Y (A(k), W (k)). (5) Mixing. 
Expanding the LDS recursion $x_{t+1} = A x_t + w_t$ with $\mathrm{cov}(w_t) = W$, we have
$$x_{t+s} = A^s x_t + \sum_{i=1}^{s} A^{i-1} w_{t+s-i}, \qquad s = 1, 2, \dots \tag{6}$$
If $A$ satisfies certain assumptions regarding stability and if $s$ is larger than a certain mixing time of the LDS, then the first term on the right-hand side of (6) approaches zero, while the second term is independent of the history up to time $t$, with covariance close to the stationary autocovariance $\Gamma(A, W)$. This suggests that, for two samples within the same trajectory that are sufficiently far apart, we can treat them as being (almost) independent of each other; this simple fact shall inspire our algorithmic design and streamline statistical analysis later on. It is noteworthy that the proposed algorithms do not require prior knowledge about the mixing times of the LDS models.

2.2. Subspace estimation

Recall that the notation $(\Gamma^{(k)})_i \in \mathbb{R}^d$ (resp. $(Y^{(k)})_i$) represents the transpose of the $i$-th row of $\Gamma^{(k)}$ (resp. $Y^{(k)}$) defined in (5). Consider the following subspaces:
$$V_i^{\star} := \mathrm{span}\big\{(\Gamma^{(k)})_i, \ 1 \le k \le K\big\}, \qquad U_i^{\star} := \mathrm{span}\big\{(Y^{(k)})_i, \ 1 \le k \le K\big\}, \qquad 1 \le i \le d,$$
each of which has rank at most $K$. We develop a spectral method to estimate these subspaces, which will in turn allow for proper dimension reduction in subsequent steps. Here, we only introduce the idea for estimating $\{U_i^{\star}\}$, since the idea for $\{V_i^{\star}\}$ is very similar.

We first divide $\{0, 1, \dots, T_{\mathrm{subspace}}\}$ into four segments of the same size, and denote the 2nd (resp. 4th) segment as $\Omega_1$ (resp. $\Omega_2$). According to the mixing property of LDSs, if $T_{\mathrm{subspace}}$ is larger than some appropriately defined mixing time, then (1) each sample trajectory in $\mathcal{M}_{\mathrm{subspace}}$ will mix sufficiently and nearly reach stationarity (when constrained to $t \in \Omega_1 \cup \Omega_2$), and (2) the samples in $\Omega_1$ are nearly independent of those in $\Omega_2$. Therefore, defining
$$g_{m,i,j} := \frac{1}{|\Omega_j|} \sum_{t \in \Omega_j} (x_{m,t+1})_i \, x_{m,t} \tag{7}$$
for $1 \le i \le d$ and $j \in \{1, 2\}$, it is easy to check that $\mathbb{E}[g_{m,i,1} g_{m,i,2}^{\top}] \approx \mathbb{E}[g_{m,i,1}] \, \mathbb{E}[g_{m,i,2}]^{\top} \approx (Y^{(k_m)})_i (Y^{(k_m)})_i^{\top}$. Taking the average over $m \in \mathcal{M}_{\mathrm{subspace}}$, we have constructed a matrix $\widehat{G}_i$ such that $\mathbb{E}[\widehat{G}_i] \approx \sum_{k=1}^{K} p^{(k)} (Y^{(k)})_i (Y^{(k)})_i^{\top} =: G_i$, where $p^{(k)}$ denotes the fraction of sample trajectories generated by the $k$-th model. As a consequence, if $T_{\mathrm{subspace}}$ and $|\mathcal{M}_{\mathrm{subspace}}|$ are sufficiently large, then one might expect $\widehat{G}_i$ to be a good approximation of $G_i$, the latter of which is symmetric, has rank at most $K$, and has $U_i^{\star}$ as its eigenspace. All this motivates us to compute the rank-$K$ eigenspace of $\widehat{G}_i + \widehat{G}_i^{\top}$.

2.3. Clustering

In this step, we propose to construct a similarity matrix $S$ for the sample trajectories in $\mathcal{M}_{\mathrm{clustering}}$, based on their pairwise comparisons; then we can apply any mainstream clustering algorithm (e.g. spectral clustering (Chen et al., 2021b)) to $S$, dividing $\mathcal{M}_{\mathrm{clustering}}$ into $K$ disjoint clusters $\{\mathcal{C}_k\}_{1 \le k \le K}$ such that the trajectories in each cluster are primarily generated by the same LDS model. The remaining question is how to perform the pairwise comparisons; we intend to achieve this by comparing the autocovariance matrices associated with the sample trajectories. Note that, even though $(A^{(k)}, W^{(k)}) \neq (A^{(\ell)}, W^{(\ell)})$ for $k \neq \ell$, it is indeed possible that $\Gamma^{(k)} = \Gamma^{(\ell)}$ or $Y^{(k)} = Y^{(\ell)}$. Fortunately, the following fact ensures the separation of $(\Gamma^{(k)}, Y^{(k)})$ versus $(\Gamma^{(\ell)}, Y^{(\ell)})$.

Fact 2.1. If $(A^{(k)}, W^{(k)}) \neq (A^{(\ell)}, W^{(\ell)})$, then we have either $\Gamma^{(k)} \neq \Gamma^{(\ell)}$ or $Y^{(k)} \neq Y^{(\ell)}$ (or both).
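Underlying Fact 2.1 is the observation that $(A, W)$ can be recovered from $(\Gamma, Y)$ via $A = Y \Gamma^{-1}$ and $W = \Gamma - A \Gamma A^{\top}$ (cf. equation (17) in the appendix), so two distinct models cannot share both stationary autocovariances. Below is a quick numerical sanity check of this recovery map; this is our own sketch, and it assumes SciPy's discrete Lyapunov solver to compute $\Gamma$ from equations (3)-(4).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def stationary_autocov(A, W):
    """Order-0 and order-1 stationary autocovariances of x_{t+1} = A x_t + w_t, cov(w_t) = W."""
    Gamma = solve_discrete_lyapunov(A, W)   # solves Gamma = A Gamma A^T + W, cf. equation (3)
    Y = A @ Gamma                           # equation (4): Y(A, W) = A Gamma(A, W)
    return Gamma, Y

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
A = 0.5 * A / np.linalg.norm(A, 2)          # rescale so that ||A|| = 0.5 < 1 (stable)
W = np.eye(d)

Gamma, Y = stationary_autocov(A, W)
A_rec = Y @ np.linalg.inv(Gamma)            # A = Y Gamma^{-1}
W_rec = Gamma - A_rec @ Gamma @ A_rec.T     # W = Gamma - A Gamma A^T
assert np.allclose(A_rec, A) and np.allclose(W_rec, W)
```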
Now, let us compare the $m$-th and $n$-th trajectories for some $m, n \in \mathcal{M}_{\mathrm{clustering}}$. In order to determine whether they have the same label (namely $k_m = k_n$), we propose to estimate the quantity $\|\Gamma^{(k_m)} - \Gamma^{(k_n)}\|_{\mathrm{F}}^2 + \|Y^{(k_m)} - Y^{(k_n)}\|_{\mathrm{F}}^2$ using the data samples $\{x_{m,t}\}$ and $\{x_{n,t}\}$, which is expected to be small (resp. large) if $k_m = k_n$ (resp. $k_m \neq k_n$). To do so, let us divide $\{0, 1, \dots, T_{\mathrm{clustering}}\}$ evenly into four segments, and denote by $\Omega_1$ (resp. $\Omega_2$) the 2nd (resp. 4th) segment. Recall our earlier definition of the $g$ vectors in (7). Assuming sufficient mixing and utilizing near independence between samples from $\Omega_1$ and those from $\Omega_2$, we might resort to the following statistic:
$$\mathsf{stat}_Y := \sum_{i=1}^{d} \Big\langle U_i^{\top} \big(g_{m,i,1} - g_{n,i,1}\big), \ U_i^{\top} \big(g_{m,i,2} - g_{n,i,2}\big) \Big\rangle,$$
whose expectation is given by
$$\mathbb{E}[\mathsf{stat}_Y] \approx \sum_{i=1}^{d} \big\| U_i^{\top} \big( (Y^{(k_m)})_i - (Y^{(k_n)})_i \big) \big\|_2^2 \approx \sum_{i=1}^{d} \big\| (Y^{(k_m)})_i - (Y^{(k_n)})_i \big\|_2^2 = \big\| Y^{(k_m)} - Y^{(k_n)} \big\|_{\mathrm{F}}^2;$$
here, the second approximation holds if each subspace $U_i$ is sufficiently close to $U_i^{\star} = \mathrm{span}\{(Y^{(j)})_i, 1 \le j \le K\}$. The purpose of utilizing $\{U_i\}$ is to reduce the variance of $\mathsf{stat}_Y$. Similarly, we can compute another statistic $\mathsf{stat}_{\Gamma}$ with $\mathbb{E}[\mathsf{stat}_{\Gamma}] \approx \|\Gamma^{(k_m)} - \Gamma^{(k_n)}\|_{\mathrm{F}}^2$. Consequently, we shall declare $k_m \neq k_n$ if $\mathsf{stat}_{\Gamma} + \mathsf{stat}_Y$ exceeds some appropriate threshold.

Remark 2.2. For subspace estimation and clustering, we split each trajectory evenly into four parts, and only utilize the 2nd and 4th. In theory, this only affects sample complexities by a constant factor. The key benefit is that no prior knowledge of the mixing time $t_{\mathrm{mix}}$ is needed. In cases where $t_{\mathrm{mix}}$ is known or estimated, one might improve data efficiency by letting the 1st and 3rd parts have length only $t_{\mathrm{mix}}$.

2.4. Model estimation

Suppose that we have obtained a good clustering accuracy in the previous step. Then, for each $k$, we use the samples $\{\{x_{m,t}\}_{0 \le t \le T_m}\}_{m \in \mathcal{C}_k}$ to obtain an estimate $\widehat{A}^{(k)}$ of the state transition matrix by solving a least-squares problem (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019). Finally, we can estimate the noise vectors via $\widehat{w}_{m,t} := x_{m,t+1} - \widehat{A}^{(k)} x_{m,t} \approx w_{m,t}$, and use the empirical covariance of $\{\widehat{w}_{m,t}\}$ as our estimate of the noise covariance matrix.

2.5. Classification

In the previous steps, we have obtained initial clusters $\{\mathcal{C}_k\}_{1 \le k \le K}$ and coarse model estimates $\{\widehat{A}^{(k)}, \widehat{W}^{(k)}\}_{1 \le k \le K}$. With the assistance of additional sample trajectories in $\mathcal{M}_{\mathrm{classification}}$, we can augment the clusters in the following way: for each new trajectory $\{x_t\}_{0 \le t \le T}$, we infer its label as $\widehat{k} = \arg\min_k \mathcal{L}(\widehat{A}^{(k)}, \widehat{W}^{(k)})$, and then assign this trajectory to the $\widehat{k}$-th cluster; here, the loss function
$$\mathcal{L}(A, W) := T \log \det(W) + \sum_{t=0}^{T-1} (x_{t+1} - A x_t)^{\top} W^{-1} (x_{t+1} - A x_t)$$
coincides with the negative log-likelihood under a Gaussian assumption. Once this is done, we can run Algorithm 4 again with the updated clusters $\{\mathcal{C}_k\}$ to refine our model estimates, which is exactly the last step of Algorithm 1.

Remark 2.3. While this work focuses on mixtures of LDSs, we emphasize that the general principles of Algorithm 1 are applicable under much weaker assumptions. For Algorithms 2 and 3 to work, we essentially only require that each sample trajectory satisfies a certain mixing property, and that the autocovariances of different models are sufficiently separated (and hence distinguishable). As for Algorithms 4 and 5, we essentially require a well-specified parametric form of the time-series models. This observation might inspire future extensions (in theory or applications) of Algorithm 1 to much broader settings.

3. Main results

3.1.
Model assumptions To streamline the theoretical analysis, we focus on the case where the trajectories are driven by Gaussian noise; that is, for each 1 m M, 0 t Tm, the noise vector wm,t in (1) is independently generated from the Gaussian distribution N(0, W (km)). Next, we assume for simplicity that the labels {km}1 m M of the trajectories are pre-determined and fixed, although one might equivalently regard {km} as being random and independent of the noise vectors {wm,t}. Moreover, while our algorithms and analysis are largely insensitive to the initial states {xm,0}1 m M, we focus on two canonical cases for concreteness: (i) the trajectories start at zero state, or (ii) they are segments of one continuous long trajectory. This is formalized as follows: Case 0: xm,0 = 0, 1 m M; (8a) Case 1: x1,0 = 0, and xm+1,0 = xm,Tm, 1 m M 1. (8b) We further define the total trajectory length as Ttotal := P 1 m M Tm = P o Ttotal,o, where Ttotal,o := To |Mo|, o {subspace, clustering, classification}. Additionally, we assume that each model occupies a nondegenerate fraction of the data; in other words, there exists some 0 < pmin 1/K such that for all 1 k K and o {subspace, clustering, classification}, pmin p(k) o := 1 |Mo| m Mo 1(km = k). Finally, we make the following assumptions about the ground-truth LDS models, where we recall that the autocovariance matrices {Γ(k), Y (k)} have been defined in (5). Assumption 3.1. The LDS models {A(k), W (k)}1 k K satisfy the following conditions: 1. There exist κA 1 and 0 ρ < 1 such that for any 1 k K, (A(k))t κA ρt, t = 1, 2, . . . ; 2. There exist Γmax Wmax Wmin > 0 and κw,cross κw 1 such that for any 1 k K, (i) λmax(Γ(k)) Γmax, (ii) Wmin λmin(W (k)) λmax(W (k)) Wmax, (iii) Wmax/Wmin =: κw,cross, (iv) κ(W (k)) = λmax(W (k))/λmin(W (k)) κw; 3. There exist Γ,Y , A,W > 0 such that for any 1 k < ℓ K, Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F 2 Γ,Y , A(k) A(ℓ) 2 F + W (k) W (ℓ) 2 F W 2max 2 A,W . Learning Mixtures of Linear Dynamical Systems The first assumption states that each state transition matrix A(k) is exponentially stable, which is a quantified version of stability and has appeared in various forms in the literature of LDS (Kailath et al., 2000; Cohen et al., 2018); here, κA can be regarded as a condition number, while ρ is a contraction rate. The second assumption ensures that the noise covariance matrices {W (k)} are well conditioned, and the autocovariance matrices {Γ(k)} are bounded. The third assumption quantifies the separation between different LDS models. It is important that we consider the separation of (Γ(k), Y (k)) versus (Γ(ℓ), Y (ℓ)) jointly, which guarantees that Γ,Y is always strictly positive (thanks to Fact 2.1), despite the possibility of Γ(k) = Γ(ℓ) or Y (k) = Y (ℓ); the reasoning for our definition of A,W is similar. Remark 3.2. Since the separation parameters A,W , Γ,Y are defined with respect to the Frobenius norm, we may naturally regard them as A,W = dδA,W , Γ,Y = dδΓ,Y , where δA,W , δΓ,Y are the canonical separation parameters (in terms of the spectral norm). For example, in the simple setting where K = 2, W (1) = W (2), A(1) = 0.5Id and A(2) = (0.5 + δ)Id for some canonical separation parameter δ, we have A,W = A(1) A(2) F = δId F = dδ. This observation will be crucial for obtaining the correct dependence on the dimension d in our subsequent analysis. Remark 3.3. Most of the parameters in Assumption 3.1 come directly from the ground-truth LDS models {A(k), W (k)}, except Γmax and Γ,Y . 
In fact, they can be bounded by Γmax Wmax and Γ,Y A,W ; see Fact D.4 for the formal statements. However, the pre-factors in these bounds can be pessimistic in general cases, therefore we choose to preserve Γmax and Γ,Y in our analysis. 3.2. Theoretical guarantees We are now ready to present our end-to-end performance guarantees for Algorithm 1. Our main results for Cases 0 and 1 (defined in (8)) are summarized in Theorems 3.4 and 3.5 below, with proofs deferred to Appendix B. Theorem 3.4 (Case 0). There exist positive constants {Ci}1 i 8 such that the following holds for any fixed 0 < δ < 1/2. Consider the model (1) under the assumptions in Sections 2 and 3.1, with a focus on Case 0 (8a). Suppose that we run Algorithm 1 with parameters τ, G that satisfy 1/8 < τ/ 2 Γ,Y < 3/8, G C1ι1, and data {Xm}1 m M (where Xm = {xm,t}0 t Tm) that satisfies Tsubspace C2 ι1 1 ρ, Ttotal,subspace C3 d 1 ρ Tclustering C4 G 1 ρ d K 2 Γ,Y + 1 Ttotal,clustering C5 dκ2 w,cross pmin Γmax Wmin + 1 ι2 3, Tclassification C6 κ2 w + κ6 w,cross 2 A,W where we define the logarithmic terms ι1 := log( dκATtotal δ ), ι2 := log ( Γmax Γ,Y + 2) dκATtotal δ , ι3 := log( Γmax Wmin dκATtotal δ ). Then, with probability at least 1 δ, Algorithm 1 achieves exact clustering in Line 4 and exact classification in Line 7; moreover, there exists some permutation π : {1, . . . , K} {1, . . . , K} such that the final model estimation { b A(k), c W (k)}1 k K in Line 8 obeys 1 k K, b A(k) A(π(k)) C7 c W (k) W (π(k)) W (π(k)) C8 dι3 pmin T , (10) where T := Ttotal,clustering + Ttotal,classification. Theorem 3.5 (Case 1). Consider the same setting of Theorem 3.4, except that we focus on Case 1 (cf. (8b)) instead. Then the same performance guarantees continue to hold, if we replace the conditions on Ttotal,clustering and Tclassification in (9) with Ttotal,clustering C5 dκ2 w,cross pmin Γmax Wmin κ2 A + 1 ι2 3, Tclassification C6 κ2 w + κ6 w,cross 2 A,W ι2 1 + 1 1 ρ log(2κA). Remark 3.6. It is worth noting that all the short trajectory lengths {To} in the above theorems have sublinear dependence on the dimension d. In particular, the d K dependence in Tclustering (9) is due to variance reduction achieved by subspace projection (see Section 2.3), without which this dependence would have been linear in d. While Theorems 3.4 and 3.5 guarantee that Algorithm 1 successfully learns a mixture of LDS models with a polynomial number of samples, these results involve many parameters and may be somewhat difficult to interpret. In the following corollary, we make some simplifications and focus on the most important parameters. Corollary 3.7. Consider the same setting of Theorems 3.4 and 3.5. For simplicity, suppose that the condition numbers κA, κw, κw,cross 1, and the fractions of data generated by different LDS models are balanced, namely pmin 1/K. Moreover, define the canonical separation parameters δA,W := A,W / d and δΓ,Y := Γ,Y / Learning Mixtures of Linear Dynamical Systems as suggested by Remark 3.2. Finally, define the mixing time tmix := 1/(1 ρ). 
Then we can rewrite the sample complexities in Theorems 3.4 and 3.5 as follows (where we hide the logarithmic terms {ιi}1 i 3): if Tsubspace tmix, Ttotal,subspace tmixd Tclustering tmix Ttotal,clustering Kd 1 δ2 A,W Γmax Wmin + 1 , Tclassification 1 dδ2 A,W + 1 for Case 0, 1 dδ2 A,W + tmix for Case 1, Ttotal,clustering + Ttotal,classification Kd then with high probability, Algorithm 1 achieves exact clustering in Line 4, exact classification in Line 7, and final model estimation errors b A(k) A(π(k)) ϵ, c W (k) W (π(k)) W (π(k)) ϵ. Below are some key implications of Corollary 3.7. Dimension d and targeted error ϵ: (1) Our algorithms allow the To s to be much smaller than (and even decrease with) d. (2) Each Ttotal,o shall grow linearly with d, and it takes an order of Kd/ϵ2 samples in total to learn K models in up to ϵ 0 errors (in the spectral norm), which is just K times the usual parametric rate (d/ϵ2) of estimating a single model. Mixing time tmix: (1) Tsubspace and Tclustering are linear in tmix, which ensures sufficient mixing and thus facilitates our algorithms for Stage 1. In contrast, Tclassification depends on tmix only for Case 1, with the sole and different purpose of ensuring that the states {xm,t} are bounded throughout (see Example D.5 for an explanation). (2) While Ttotal,subspace needs to grow linearly with tmix, this is not required for Ttotal,clustering and Ttotal,classification, because our methods for model estimation (Algorithm 4) do not rely on mixing (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019). Canonical separation parameters δA,W , δΓ,Y : (1) Tclustering 1/δ2 Γ,Y guarantees exact clustering of the trajectories, while Tclassification 1/δ2 A,W guarantees exact classification. (2) Ttotal,subspace 1/δ4 Γ,Y leads to sufficiently accurate subspaces, while Figure 2. The model estimation errors of Algorithm 1 versus the total sample size (excluding Msubspace). Each curve is an average over 12 independent trials. Ttotal,clustering 1/δ2 A,W leads to accurate initial model estimation.1 Remark 3.8. It is worth noting that Tclustering Tclassification in Corollary 3.7, i.e. clustering requires a larger trajectory length than classification does. This justifies the benefit of the proposed two-stage procedure, compared with simply doing a large clustering of all trajectories and then one-shot model estimation. 3.3. Numerical experiments We now validate our theoretical findings with a series of numerical experiments, confirming that Algorithm 1 successfully solves the mixed LDSs problem. In these experiments, we fix d = 80, K = 4; moreover, let Tsubspace = 20, Tclustering = 20 and Tclassification = 5, all of which are much smaller than d. We take |Msubspace| = 30 d, |Mclustering| = 10 d, and vary |Mclassification| between [0, 5000 d]. Our experiments focus on Case 1 as defined in (8b), and we generate the labels of the sample trajectories uniformly at random. The ground-truth LDS models are generated in the following manner: A(k) = ρR(k), where ρ = 0.5 and R(k) Rd d is a random orthogonal matrix; W (k) has eigendecomposition U (k)Λ(k)(U (k)) , where U (k) is a random orthogonal matrix, and the diagonal entries of Λ(k) are independently drawn from the uniform distribution on [1, 2]. Our experimental results are illustrated in Figure 2. 
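For reproducibility, the random ground-truth models just described can be generated as in the following sketch (our own code, not the paper's; random orthogonal matrices are drawn here via QR factorization of Gaussian matrices):

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random orthogonal matrix via QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

def generate_ground_truth(d=80, K=4, rho=0.5, seed=0):
    """Ground-truth models as in Section 3.3: A^(k) = rho R^(k), W^(k) = U diag(lam) U^T."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(K):
        A = rho * random_orthogonal(d, rng)
        U = random_orthogonal(d, rng)
        lam = rng.uniform(1.0, 2.0, size=d)   # eigenvalues of W^(k), drawn from Unif[1, 2]
        models.append((A, U @ np.diag(lam) @ U.T))
    return models

models = generate_ground_truth()   # K = 4 models in dimension d = 80
```

These models, together with labels drawn uniformly at random, can then be fed into a trajectory simulator such as the one sketched in Section 1.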
Here, the horizontal axis represents the sample size T = Ttotal,clustering + Ttotal,classification for model estimation, and the vertical axis represents the estimation errors, measured by maxk b A(k) A(π(k)) (plotted in blue) and maxk c W (k) W (π(k)) / W (π(k)) (plotted in orange). The results confirm our main theoretical prediction: Algorithm 1 recovers the LDS models based on a mixture of short trajectories with length To d, and achieves an error rate of 1/ T. In addition, we observe in our experiments 1It is possible to improve the 1/δ4 Γ,Y factor in Ttotal,subspace to 1/δ2 Γ,Y , if one is willing to pay for some extra factors of eigengaps; see Appendix B for a detailed discussion. Learning Mixtures of Linear Dynamical Systems that the final outputs of Algorithm 1 are robust to a small number of mis-clustered or mis-classified trajectories during the intermediate steps; this is clearly an appealing property to have, especially when the To s and |Mo| s in reality are slightly smaller than what our theory requires. Additional experimental results can be found in Appendix F. 4. Related works Meta-learning for mixed linear regression. Our work is closely related to the recent papers (Kong et al., 2020b;a) that bridge mixed linear regression (Quandt & Ramsey, 1978; Yi et al., 2014; Chen et al., 2017; Li & Liang, 2018; Chen et al., 2020; Kwon et al., 2021b; Diakonikolas & Kane, 2020; Yin et al., 2018; Pal & Mazumdar, 2020; Chen et al., 2021c; Diamandis et al., 2021) and meta-learning (Finn et al., 2017; Harrison et al., 2020; Snell et al., 2017; Du et al., 2021; Pan & Yang, 2009; Tripuraneni et al., 2020; Baxter, 2000; Maurer et al., 2016). In this setting of metalearning for mixed linear regression (Kong et al., 2020b;a), one has access to multiple tasks, each containing a few independent and identically distributed samples generated by an unknown model; the inductive bias (i.e. the common structure underlying these tasks) for meta-learning is that there is only a small discrete set of ground-truth linear regression models, akin to the case of mixed linear regression. While the high-level idea of our Algorithm 1 is largely inspired by the work of (Kong et al., 2020b;a), our detailed implementations are substantially different, due to the lack of ideal i.i.d. assumption (among others) in our case of time-series data; in addition, we improve upon some of the analyses in (Kong et al., 2020b;a).2 Mixtures of time-series models and trajectories. Mixture models for time series have achieved empirical success in the study of psychology (Bulteel et al., 2016; Takano et al., 2020), neuroscience (Albert, 1991; Mezer et al., 2009), biology (Wong & Li, 2000), air pollution (D Urso et al., 2015), economics (Mc Culloch & Tsay, 1994; Maharaj, 2000; Kalliovirta et al., 2016), automobile sensors (Hallac et al., 2017), and many other domains. Some specific aspects of mixture models include hypothesis testing for a pair of trajectories (Maharaj, 2000), or clustering of multiple trajectories (Liao, 2005; Aghabozorgi et al., 2015; Pathak 2There is a concurrent preprint (Modi et al., 2021) that extends multi-task learning (Du et al., 2021; Tripuraneni et al., 2020) to time-series data, and the setting therein includes mixed LDSs as a special case. 
However, the authors of (Modi et al., 2021) assume oracle access to the global optimum of a non-convex optimization problem, without providing a practical algorithm that can provably find it; moreover, with the short trajectory length fixed, the estimation error bounds in that work will remain bounded away from zero, even if the number of trajectories grows to infinity. In comparison, we consider a simpler problem setting, and propose computationally efficient algorithms with better error bounds. et al., 2021; Huang et al., 2021). In addition to mixture models, other related yet different models include time-varying systems (Qu et al., 2021; Minasyan et al., 2021), systems with random parameters (Du et al., 2020), switching systems (Sun, 2006; Sarkar et al., 2019; Ansari et al., 2021), switching state-space models (Ghahramani & Hinton, 2000; Linderman et al., 2017), Markovian jump systems (Shi & Li, 2015; Zhao et al., 2019), and event-triggered systems (Sedghi et al., 2020; Schluter et al., 2020), to name just a few. There are even more related models in reinforcement learning (RL), such as latent bandit (Maillard & Mannor, 2014; Hong et al., 2020), multi-task learning (Wilson et al., 2007; Brunskill & Li, 2013; Liu et al., 2016; Sodhani et al., 2021) / meta-learning (Finn et al., 2017) / transfer-learning (Taylor & Stone, 2009; Tirinzoni et al., 2020) for RL, latent Markov decision processes (Kwon et al., 2021a; Brunskill et al., 2009; Steimle et al., 2021), and so on. What distinguishes our work from this extensive literature is that, we design algorithms and prove non-asymptotic sample complexities for model estimation, in the specific setting of mixture models that features (1) a finite set of underlying time-series models, and (2) unknown labels of the trajectories, with no probabilistic assumptions imposed on these latent variables. Linear dynamical systems. LDS (also called vector autoregressive models) is one of the most fundamental models in system identification and optimal control (Ljung, 1998; Khalil et al., 1996). Recently, there has been a surge of studies about non-asymptotic theoretical analyses of various learning procedures for the basic LDS model (Faradonbeh et al., 2018; Simchowitz et al., 2018; Sarkar & Rakhlin, 2019), linear-quadratic regulators (Dean et al., 2020; Cohen et al., 2018; Jedra & Proutiere, 2019; Faradonbeh et al., 2020; Mania et al., 2019; Simchowitz & Foster, 2020; Fazel et al., 2018; Malik et al., 2019), and LDSs with partial observations (Oymak & Ozay, 2019; Simchowitz et al., 2019; Sarkar et al., 2021; Sun et al., 2020; Tsiamis et al., 2020; Lale et al., 2020; Zheng et al., 2021). In particular, it was only until recently that the authors of (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019) proved sharp error bounds of least squares for estimating the state transition matrix of a LDS model, using a single trajectory; our analysis of Algorithm 4 is largely inspired by their techniques. 5. Discussion This paper has developed a theoretical and algorithmic framework for learning multiple LDS models from a mixture of short, unlabeled sample trajectories. Our key contributions include a modular two-stage meta-algorithm, as well as theoretical analysis demonstrating its computational and statistical efficiency. We would like to invite the readers to contribute to this important topic by, say, further strengthening the theoretical analysis and algorithmic design. 
For exam- Learning Mixtures of Linear Dynamical Systems ple, in certain cases Tclustering can be a bottleneck compared with Tsubspace and Tclassification, and thus one might hope to achieve a better dependence on Ttotal,clustering (e.g. allowing Ttotal,clustering d), by replacing Line 5 in Algorithm 1 with a different method (e.g. adapting Algorithm 3 of (Kong et al., 2020b) to our setting). As another example, in some practical scenarios, the data is a single continuous trajectory, and the time steps when the underlying model changes are unknown (Hallac et al., 2017; Harrison et al., 2020); in order to accommodate such a case, one might need to incorporate change-point detection into the learning process. Moving beyond the current setting of mixed LDSs, we remark that there are plenty of opportunities for future studies. For instance, while our methods in Stage 1 rely on the mixing property of the LDS models, it is worth exploring whether it is feasible to handle the non-mixing case (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019). Another potential direction is to consider the robustness against outliers and adversarial noise (Chen et al., 2021a; Kong et al., 2020a). One might even go further and extend the ideas to learning mixtures of other time-series models (see Remark 2.3), such as LDS with partial or nonlinear observations (Mhammedi et al., 2020), or nonlinear dynamical systems (Mania et al., 2020; Kakade et al., 2020; Foster et al., 2020). Ultimately, it would be of great importance to consider the case with controlled inputs, such as learning mixtures of linear-quadratic regulators, or latent Markov decision processes (Kwon et al., 2021a) that arises in reinforcement learning. Acknowledgements Y. Chen is supported in part by the ARO grant W911NF20-1-0097, the NSF grants CCF-1907661 and IIS-1900140, and the AFOSR grant FA9550-19-1-0030. H. V. Poor is supported in part by the NSF under Grant CCF-1908308. We would like to thank Yuxin Chen and Gen Li for numerous helpful discussions. Aghabozorgi, S., Shirkhorshidi, A. S., and Wah, T. Y. Timeseries clustering a decade review. Information Systems, 53:16 38, 2015. Albert, P. S. A two-state markov mixture model for a time series of epileptic seizure counts. Biometrics, pp. 1371 1381, 1991. Ansari, A. F., Benidis, K., Kurle, R., Turkmen, A. C., Soh, H., Smola, A. J., Wang, B., and Januschowski, T. Deep explicit duration switching models for time series. In Advances in Neural Information Processing Systems, volume 34, 2021. Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149 198, 2000. Brunskill, E. and Li, L. Sample complexity of multi-task reinforcement learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 122 131, Arlington, Virginia, USA, 2013. AUAI Press. Brunskill, E., Leffler, B. R., Li, L., Littman, M. L., and Roy, N. Provably efficient learning with typed parametric models. Journal of Machine Learning Research, 10(68): 1955 1988, 2009. Bulteel, K., Tuerlinckx, F., Brose, A., and Ceulemans, E. Clustering vector autoregressive models: Capturing qualitative differences in within-person dynamics. Frontiers in Psychology, 7:1540, 2016. Chen, S., Li, J., and Song, Z. Learning mixtures of linear regressions in subexponential time via fourier moments. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 587 600, 2020. Chen, S., Koehler, F., Moitra, A., and Yau, M. Kalman filtering with adversarial corruptions. 
ar Xiv preprint ar Xiv:2111.06395, 2021a. Chen, Y., Yi, X., and Caramanis, C. Convex and nonconvex formulations for mixed regression with two components: Minimax optimal rates. IEEE Transactions on Information Theory, 64(3):1738 1766, 2017. Chen, Y., Chi, Y., Fan, J., Ma, C., et al. Spectral methods for data science: A statistical perspective. Foundations and Trends R in Machine Learning, 14(5):566 806, 2021b. Chen, Y., Ma, C., Poor, H. V., and Chen, Y. Learning mixtures of low-rank models. IEEE Transactions on Information Theory, 67(7):4613 4636, 2021c. doi: 10. 1109/TIT.2021.3065700. Cohen, A., Hasidim, A., Koren, T., Lazic, N., Mansour, Y., and Talwar, K. Online linear quadratic control. In Proceedings of the International Conference on Machine Learning, pp. 1029 1038. PMLR, 2018. Davis, C. and Kahan, W. M. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1 46, 1970. Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, 20(4):633 679, 2020. Diakonikolas, I. and Kane, D. M. Small covers for near-zero sets of polynomials and learning latent variable models. In Proceedings of the 2020 IEEE 61st Annual Symposium Learning Mixtures of Linear Dynamical Systems on Foundations of Computer Science (FOCS), pp. 184 195. IEEE, 2020. Diamandis, T., Eldar, Y., Fallah, A., Farnia, F., and Ozdaglar, A. A Wasserstein minimax framework for mixed linear regression. In Meila, M. and Zhang, T. (eds.), Proceedings of the International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2697 2706. PMLR, 18 24 Jul 2021. Du, K., Meng, Q., and Zhang, F. A q-learning algorithm for discrete-time linear-quadratic control with random parameters of unknown distribution: convergence and stabilization. ar Xiv preprint ar Xiv:2011.04970, 2020. Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. In International Conference on Learning Representations, 2021. D Urso, P., De Giovanni, L., and Massari, R. Time series clustering by a robust autoregressive metric with application to air pollution. Chemometrics and Intelligent Laboratory Systems, 141:107 124, 2015. Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. Finite time identification in unstable linear systems. Automatica, 96:342 353, 2018. Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. On adaptive linear quadratic regulators. Automatica, 117: 108982, 2020. Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. In Proceedings of the International Conference on Machine Learning, pp. 1467 1476. PMLR, 2018. Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning - Volume 70, pp. 1126 1135. JMLR.org, 2017. Foster, D., Sarkar, T., and Rakhlin, A. Learning nonlinear dynamical systems from a single trajectory. In Learning for Dynamics and Control, pp. 851 861. PMLR, 2020. Ghahramani, Z. and Hinton, G. E. Variational learning for switching state-space models. Neural computation, 12 (4):831 864, 2000. Hallac, D., Vare, S., Boyd, S., and Leskovec, J. Toeplitz inverse covariance-based clustering of multivariate time series data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 
215 223, 2017. Harrison, J., Sharma, A., Finn, C., and Pavone, M. Continuous meta-learning without tasks. In Advances in Neural Information Processing Systems, volume 33, 2020. Hong, J., Kveton, B., Zaheer, M., Chow, Y., Ahmed, A., and Boutilier, C. Latent bandits revisited. In Advances in Neural Information Processing Systems, volume 33, pp. 13423 13433. Curran Associates, Inc., 2020. Huang, L., Sudhir, K., and Vishnoi, N. Coresets for time series clustering. In Advances in Neural Information Processing Systems, volume 34, 2021. Jedra, Y. and Proutiere, A. Sample complexity lower bounds for linear system identification. In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 2676 2681. IEEE, 2019. Kailath, T., Sayed, A. H., and Hassibi, B. Linear Estimation. Prentice Hall, 2000. Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. Information theoretic regret bounds for online nonlinear control. In Advances in Neural Information Processing Systems, volume 33, pp. 15312 15325. Curran Associates, Inc., 2020. Kalliovirta, L., Meitz, M., and Saikkonen, P. Gaussian mixture vector autoregression. Journal of Econometrics, 192(2):485 498, 2016. Khalil, I. S., Doyle, J., and Glover, K. Robust and Optimal Control. Prentice Hall, New Jersey, 1996. Kong, W., Somani, R., Kakade, S., and Oh, S. Robust metalearning for mixed linear regression with small batches. In Advances in Neural Information Processing Systems, volume 33, pp. 4683 4696. Curran Associates, Inc., 2020a. Kong, W., Somani, R., Song, Z., Kakade, S., and Oh, S. Meta-learning for mixed linear regression. In Proceedings of the International Conference on Machine Learning, pp. 5394 5404. PMLR, 2020b. Kwon, J., Efroni, Y., Caramanis, C., and Mannor, S. Rl for latent MDPs: Regret guarantees and a lower bound. ar Xiv preprint ar Xiv:2102.04939, 2021a. Kwon, J., Ho, N., and Caramanis, C. On the minimax optimality of the em algorithm for learning two-component mixed linear regression. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 1405 1413. PMLR, 2021b. Lale, S., Azizzadenesheli, K., Hassibi, B., and Anandkumar, A. Logarithmic regret bound in partially observable linear dynamical systems. In Advances in Neural Information Processing Systems, volume 33, pp. 20876 20888. Curran Associates, Inc., 2020. Learning Mixtures of Linear Dynamical Systems Li, Y. and Liang, Y. Learning mixtures of linear regressions with nearly optimal complexity. In Proceedings of the Conference On Learning Theory, pp. 1125 1144, 2018. Liao, T. W. Clustering of time series data a survey. Pattern Recognition, 38(11):1857 1874, 2005. Liberzon, D. Switching in Systems and Control. Springer Science & Business Media, 2003. Linderman, S., Johnson, M., Miller, A., Adams, R., Blei, D., and Paninski, L. Bayesian learning and inference in recurrent switching linear dynamical systems. In Artificial Intelligence and Statistics, pp. 914 922. PMLR, 2017. Liu, Y., Guo, Z., and Brunskill, E. Pac continuous state online multitask reinforcement learning with identification. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 438 446, 2016. Ljung, L. System identification. In Signal analysis and prediction, pp. 163 173. Springer, 1998. Lugosi, G. and Mendelson, S. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145 1190, 2019. Maharaj, E. A. Cluster of time series. 
Journal of Classification, 17(2):297 314, 2000. Maillard, O.-A. and Mannor, S. Latent bandits. In Proceedings of the International Conference on Machine Learning, pp. 136 144. PMLR, 2014. Malekzadeh, M., Clegg, R. G., Cavallaro, A., and Haddadi, H. Mobile sensor data anonymization. In Proceedings of the International Conference on Internet of Things Design and Implementation, Io TDI 19, pp. 49 58. ACM, 2019. Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P., and Wainwright, M. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 2916 2925. PMLR, 2019. Mania, H., Tu, S., and Recht, B. Certainty equivalence is efficient for linear quadratic control. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 10154 10164, 2019. Mania, H., Jordan, M. I., and Recht, B. Active learning for nonlinear system identification with guarantees. ar Xiv preprint ar Xiv:2006.10277, 2020. Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853 2884, 2016. Mc Culloch, R. E. and Tsay, R. S. Statistical analysis of economic time series via markov switching models. Journal of Time Series Analysis, 15(5):523 539, 1994. Mezer, A., Yovel, Y., Pasternak, O., Gorfine, T., and Assaf, Y. Cluster analysis of resting-state fmri time series. Neuroimage, 45(4):1117 1125, 2009. Mhammedi, Z., Foster, D. J., Simchowitz, M., Misra, D., Sun, W., Krishnamurthy, A., Rakhlin, A., and Langford, J. Learning the linear quadratic regulator from nonlinear observations. ar Xiv preprint ar Xiv:2010.03799, 2020. Minasyan, E., Gradu, P., Simchowitz, M., and Hazan, E. Online control of unknown time-varying dynamical systems. In Advances in Neural Information Processing Systems, volume 34, 2021. Modi, A., Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. Joint learning of linear time-invariant dynamical systems. ar Xiv preprint ar Xiv:2112.10955, 2021. Oymak, S. and Ozay, N. Non-asymptotic identification of LTI systems from a single trajectory. In Proceedings of the 2019 American Control Conference (ACC), pp. 5655 5661. IEEE, 2019. Pal, S. and Mazumdar, A. Recovery of sparse signals from a mixture of linear samples. In Proceedings of the International Conference on Machine Learning, pp. 7466 7475. PMLR, 2020. Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22 (10):1345 1359, 2009. Pathak, R., Sen, R., Rao, N., Erichson, N. B., Jordan, M. I., and Dhillon, I. S. Cluster-and-conquer: A framework for time-series forecasting. ar Xiv preprint ar Xiv:2110.14011, 2021. Qu, G., Shi, Y., Lale, S., Anandkumar, A., and Wierman, A. Stable online control of linear time-varying systems. In Learning for Dynamics and Control, pp. 742 753. PMLR, 2021. Quandt, R. E. and Ramsey, J. B. Estimating mixtures of normal distributions and switching regressions. Journal of the American Statistical Association, 73(364):730 738, 1978. Sarkar, T. and Rakhlin, A. Near optimal finite time identification of arbitrary linear dynamical systems. In Proceedings of the International Conference on Machine Learning, pp. 5610 5618. PMLR, 2019. Learning Mixtures of Linear Dynamical Systems Sarkar, T., Rakhlin, A., and Dahleh, M. Nonparametric system identification of stochastic switched linear systems. 
In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 3623 3628. IEEE, 2019. Sarkar, T., Rakhlin, A., and Dahleh, M. A. Finite time LTI system identification. Journal of Machine Learning Research, 22(26):1 61, 2021. Schluter, H., Solowjow, F., and Trimpe, S. Event-triggered learning for linear quadratic control. IEEE Transactions on Automatic Control, 2020. Sedghi, L., Ijaz, Z., Witheephanich, K., Pesch, D., et al. Machine learning in event-triggered control: Recent advances and open issues. ar Xiv preprint ar Xiv:2009.12783, 2020. Shi, P. and Li, F. A survey on markovian jump systems: modeling and design. International Journal of Control, Automation and Systems, 13(1):1 16, 2015. Simchowitz, M. and Foster, D. Naive exploration is optimal for online lqr. In Proceedings of the International Conference on Machine Learning, pp. 8937 8948. PMLR, 2020. Simchowitz, M., Mania, H., Tu, S., Jordan, M. I., and Recht, B. Learning without mixing: Towards a sharp analysis of linear system identification. In Proceedings of the Conference On Learning Theory, pp. 439 473. PMLR, 2018. Simchowitz, M., Boczar, R., and Recht, B. Learning linear dynamical systems with semi-parametric least squares. In Proceedings of the Conference on Learning Theory, pp. 2714 2802. PMLR, 2019. Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077 4087, 2017. Sodhani, S., Zhang, A., and Pineau, J. Multi-task reinforcement learning with context-based representations. In Meila, M. and Zhang, T. (eds.), Proceedings of the International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 9767 9779. PMLR, 18 24 Jul 2021. Steimle, L. N., Kaufman, D. L., and Denton, B. T. Multimodel markov decision processes. IISE Transactions, pp. 1 16, 2021. Sun, Y., Oymak, S., and Fazel, M. Finite sample system identification: Optimal rates and the role of regularization. In Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pp. 16 25. PMLR, 10 11 Jun 2020. Sun, Z. Switched Linear Systems: Control and Design. Springer Science & Business Media, 2006. Takano, K., Stefanovic, M., Rosenkranz, T., and Ehring, T. Clustering individuals on limited features of a vector autoregressive model. Multivariate Behavioral Research, pp. 1 19, 2020. Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009. Tirinzoni, A., Poiani, R., and Restelli, M. Sequential transfer in reinforcement learning with a generative model. In Proceedings of the International Conference on Machine Learning, pp. 9481 9492. PMLR, 2020. Tripuraneni, N., Jordan, M. I., and Jin, C. On the theory of transfer learning: The importance of task diversity. In Advances in Neural Information Processing Systems, volume 33, 2020. Tsiamis, A., Matni, N., and Pappas, G. Sample complexity of kalman filtering for unknown systems. In Learning for Dynamics and Control, pp. 435 444. PMLR, 2020. Vershynin, R. High-dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge university press, 2018. Wilson, A., Fern, A., Ray, S., and Tadepalli, P. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the International Conference on Machine Learning, pp. 1015 1022, 2007. Wong, C. S. and Li, W. K. 
On a mixture autoregressive model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):95 115, 2000. Yi, X., Caramanis, C., and Sanghavi, S. Alternating minimization for mixed linear regression. In Proceedings of the International Conference on Machine Learning, pp. 613 621, 2014. Yin, D., Pedarsani, R., Chen, Y., and Ramchandran, K. Learning mixtures of sparse linear regressions using sparse graph codes. IEEE Transactions on Information Theory, 65(3):1430 1451, 2018. Zhao, P., Kang, Y., and Zhao, Y.-B. A brief tutorial and survey on markovian jump systems: Stability and control. IEEE Systems, Man, and Cybernetics Magazine, 5(2): 37 C3, 2019. Learning Mixtures of Linear Dynamical Systems Zheng, Y., Furieri, L., Kamgarpour, M., and Li, N. Sample complexity of linear quadratic gaussian (lqg) control for output feedback systems. In Learning for Dynamics and Control, pp. 559 570. PMLR, 2021. Learning Mixtures of Linear Dynamical Systems Table 1. A list of notation and parameters. In the subscripts of Mo, To, Ttotal,o, the symbol o takes value in {subspace, clustering, classification}. Notation Explanation d State dimension K Number of LDS models km The unknown label (latent variable) of the m-th trajectory Mo Subsets of M trajectories, {1, . . . , M} = Msubspace Mclustering Mclassification pmin A lower bound for the fraction of trajectories generated by each model To Short trajectory length for Mo Ttotal,o Total trajectory length, Ttotal,o = To |Mo| A(k), W (k) State transition matrix and noise covariance matrix of the k-th LDS model Γ(k), Y (k) Order-0 and order-1 stationary autocovariance matrices of the k-th LDS model κA, ρ (A(k))t κA ρt, t = 1, 2, . . . Γ,Y , A,W Model separation parameters (see Assumption 3.1) Wmin, Wmax Wmin λmin(W (k)) λmax(W (k)) Wmax, 1 k K κw,cross, κw κw,cross = Wmax/Wmin; κ(W (k)) κw, 1 k K Γmax Γ(k) Γmax, 1 k K This appendix is organized as follows. Appendix A complements Section 2, providing full details of the proposed Algorithms 2 5. Appendix B includes detailed theoretical results for the algorithms, with proofs postponed to Appendix C; it also provides a proof for our main results in Section 3. Appendix D collects some miscellaneous results. In Appendix E we discuss on some potential extensions of our methods. Finally, Appendix F includes additional experimental results. For the readers convenience, we include Table 1 for a quick reference to the key notation and parameters used in our analysis. Additional notation. Let vec( ) denote the vectorization of a matrix. For two matrices A, B, let A B denote their Kronecker product. Given a sequence of real numbers {xi}1 i N, we denote its median as median{xi, 1 i N}. We shall also let poly(n) denote some polynomial in n of a constant degree. For a positive integer n, we denote [n] := {1, 2, . . . , n}. A. Detailed Algorithms A.1. Subspace estimation Procedure. Recall that the notation (Γ(k))i Rd (resp. (Y (k))i) represents the transpose of the i-th row of Γ(k) (resp. Y (k)) defined in (5). With this set of notation in place, let us define the following subspaces: V i := span n (Γ(k))i, 1 k K o , U i := span n (Y (k))i, 1 k K o , 1 i d. (12) It is easily seen from the construction that each of these subspaces has rank at most K. As it turns out, the collection of 2d subspaces defined in (12) provides crucial low-dimensional information about the linear dynamical systems of interest. 
This motivates us to develop a data-driven method to estimate these subspaces, which will in turn allow for proper dimension reduction in subsequent steps. Towards this end, we propose to employ a spectral method for subspace estimation using sample trajectories in Msubspace: (i) divide {0, 1, . . . , Tsubspace} into four segments of the same size, and denote the 2nd (resp. 4th) segment as Ω1 (resp. Ω2); Learning Mixtures of Linear Dynamical Systems (ii) for each m Msubspace, 1 i d, j {1, 2}, compute hm,i,j := 1 |Ωj| t Ωj (xm,t)i xm,t, gm,i,j := 1 |Ωj| t Ωj (xm,t+1)i xm,t, ; (13) (iii) for each 1 i d, compute the following matrices c Hi := 1 |Msubspace| m Msubspace hm,i,1h m,i,2, b Gi := 1 |Msubspace| m Msubspace gm,i,1g m,i,2, (14) and let Vi Rd K (resp. Ui) be the top-K eigenspace of c Hi + c H i (resp. b Gi + b G i ). The output {Vi, Ui}1 i d will then serve as our estimate of {V i , U i }1 i d. This spectral approach is summarized in Algorithm 2. Rationale. According to the mixing property of LDS, if Tsubspace is larger than some appropriately defined mixing time, then each sample trajectory in Msubspace will mix sufficiently and nearly reach stationarity (when constrained to t Ω1 Ω2). In this case, it is easy to check that E hm,i,j] (Γ(km))i, E gm,i,j] (Y (km))i, 1 i d, j {1, 2}. Moreover, the samples in Ω1 are nearly independent of those in Ω2 as long as the spacing between them exceeds the mixing time, and therefore, E h c Hi i 1 |Msubspace| m Msubspace (Γ(km))i(Γ(km)) i = k=1 p(k) subspace(Γ(k))i(Γ(k))i =: Hi, (15a) E h b Gi i 1 |Msubspace| m Msubspace (Y (km))i(Y (km)) i = k=1 p(k) subspace(Y (k))i(Y (k))i =: Gi, (15b) where p(k) subspace denotes the fraction of sample trajectories generated by the k-th model, namely, p(k) subspace := 1 |Msubspace| m Msubspace 1(km = k), 1 k K. (16) As a consequence, if Tsubspace and |Msubspace| are both sufficiently large, then one might expect c Hi (resp. b Gi) to be a reasonably good approximation of Hi (resp. Gi), the latter of which has rank at most K and has V i (resp. U i ) as its eigenspace. All this motivates us to compute the rank-K eigenspaces of c Hi + c H i and b Gi + b G i in Algorithm 2. A.2. Clustering This step seeks to divide the sample trajectories in Mclustering into K clusters (albeit not perfectly), such that the trajectories in each cluster are primarily generated by the same LDS model. We intend to achieve this by performing pairwise comparisons of the autocovariance matrices associated with the sample trajectories. Key observation. Even though (A(k), W (k)) = (A(ℓ), W (ℓ)) for k = ℓ, it is indeed possible that Γ(k) = Γ(ℓ) or Y (k) = Y (ℓ). Therefore, in order to differentiate sample trajectories generated by different systems based on Γ(A, W ) and Y (A, W ), it is important to ensure separation of (Γ(k), Y (k)) and (Γ(ℓ), Y (ℓ)) when k = l, which can be guaranteed by the following fact. Fact A.1. If (A(k), W (k)) = (A(ℓ), W (ℓ)), then we have either Γ(k) = Γ(ℓ) or Y (k) = Y (ℓ) (or both). Proof. First, observe that the definitions of {Γ(k), Y (k)} in (5) suggest that we can recover A(k), W (k) from Γ(k), Y (k) as follows: A(k) = Y (k)Γ(k) 1, W (k) = Γ(k) A(k)Γ(k)A(k) . (17) Learning Mixtures of Linear Dynamical Systems Algorithm 2 Subspace estimation 1: Input: short trajectories {Xm}m Msubspace, where Xm = {xm,t}0 t Tsubspace. 2: Let N Tsubspace/4 , and Ω1 {N + j, 1 j N}, Ω2 {3N + j, 1 j N}. 3: for (m, i, j) Msubspace [d] [2] do 4: Compute hm,i,j 1 |Ωj| t Ωj (xm,t)i xm,t, gm,i,j 1 |Ωj| t Ωj (xm,t+1)i xm,t. 
5: end for 6: for i = 1, . . . , d do 7: Compute c Hi 1 |Msubspace| m Msubspace hm,i,1h m,i,2, b Gi 1 |Msubspace| m Msubspace gm,i,1g m,i,2, 8: Let Vi Rd K (resp. Ui) be the top-K eigenspace of c Hi + c H i (resp. b Gi + b G i ). 9: end for 10: Output: subspaces {Vi, Ui}1 i d. With this, we can prove the fact by contradiction. Suppose instead that Γ(k) = Γ(ℓ) and Y (k) = Y (ℓ), then (17) yields A(k) = Y (k)Γ(k) 1 = Y (ℓ)Γ(ℓ) 1 = A(ℓ), and W (k) = Γ(k) A(k)Γ(k)A(k) = Γ(ℓ) A(ℓ)Γ(ℓ)A(ℓ) = W (ℓ), which is contradictory to the assumption that (A(k), W (k)) = (A(ℓ), W (ℓ)). Idea. Let us compare a pair of sample trajectories {xt}0 t Tclustering and {zt}0 t Tclustering, where {xt} is generated by the system (A(k), W (k)) and {zt} by the system (A(ℓ), W (ℓ)). In order to determine whether k = ℓ, we propose to estimate the quantity Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F using the data samples {xt} and {zt}, which is expected to be small (resp. large) if k = ℓ(resp. k = ℓ). To do so, let us divide {0, 1, . . . , Tclustering} evenly into four segments, and denote by Ω1 (resp. Ω2) the 2nd (resp. 4th) segment, akin to what we have done in the previous step. Observe that Ui E h (xt+1)ixt (zt+1)izt i Ui (Y (k))i (Y (ℓ))i for all 1 i d and t Ω1 Ω2. Assuming sufficient mixing and utilizing (near) statistical independence due to sample splitting, we might resort to the following statistic (xt+1)ixt (zt+1)izt , Ui 1 (xt+1)ixt (zt+1)izt E , (19) whose expectation is given by D Ui (Y (k))i (Y (ℓ))i , Ui (Y (k))i (Y (ℓ))i E Ui (Y (k))i (Y (ℓ))i 2 (Y (k))i (Y (ℓ))i 2 2 = Y (k) Y (ℓ) 2 F; here, the first approximation is due to the near independence between samples from Ω1 and those from Ω2, whereas the second approximation holds if each subspace Ui is sufficiently close to U i = span{(Y (j))i, 1 j K}. The purpose of Learning Mixtures of Linear Dynamical Systems Algorithm 3 Clustering 1: Input: short trajectories {Xm}m Mclustering, where Xm = {xm,t}0 t Tclustering; subspaces {Vi, Ui}1 i d; testing threshold τ; number of copies G. 2: Let N Tclustering/4G , and Ωg,1 n (4g 3)N + j, 1 j N o , Ωg,2 n (4g 1)N + j, 1 j N o , 1 g G. 3: for (m, i, g, j) Mclustering [d] [G] [2] do 4: Compute hm,i,g,j 1 |Ωg,j| t Ωg,j (xm,t)i xm,t, gm,i,g,j 1 |Ωg,j| t Ωg,j (xm,t+1)i xm,t. 5: end for 6: // Compute the similarity matrix S: 7: for (m, n) Mclustering Mclustering do 8: for g = 1, . . . , G do 9: Compute D Vi hm,i,g,1 hn,i,g,1 , Vi hm,i,g,2 hn,i,g,2 E , (18a) D Ui gm,i,g,1 gn,i,g,1 , Ui gm,i,g,2 gn,i,g,2 E , (18b) 10: end for 11: Set Sm,n 1 median statΓ,g, 1 g G + median stat Y,g, 1 g G τ . 12: end for 13: Divide Mclustering into K clusters {Ck}1 k K according to {Sm,n}m,n Mclustering. 14: Output: clusters {Ck}1 k K. utilizing {Ui} is to reduce the variance of stat Y . Similarly, we can compute another statistic (by replacing {Ui} with {Vi} and (xt+1)i, (zt+1)i with (xt)i, (zt)i) as follows: (xt)ixt (zt)izt , Vi 1 (xt)ixt (zt)izt E , (20) which has expectation i=1 Vi ((Γ(k))i (Γ(ℓ))i) 2 2 Γ(k) Γ(ℓ) 2 F. Consequently, we shall declare k = ℓif statΓ + stat Y exceeds some appropriate threshold τ. Procedure. We are now positioned to describe the proposed clustering procedure. We first compute the statistics statΓ and stat Y for each pair of sample trajectories in Mclustering by means of the method described above, and then construct a similarity matrix S, in a way that Sm,n is set to 0 if statΓ +stat Y (computed for the m-th and n-th trajectories) is larger than a threshold τ, or set to 1 otherwise. 
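As a simplified illustration of the pairwise comparison just described, the sketch below (our own; array conventions and names are assumptions) computes the single-copy statistics statΓ and statY from (19)-(20) for one pair of trajectories, given estimated subspaces {V_i, U_i}. The robust version with G median copies is introduced next and summarized in Algorithm 3.

import numpy as np

def segment_mean(x, z, Om, i, lag):
    # average over t in Om of (x_{t+lag})_i * x_t - (z_{t+lag})_i * z_t
    return (x[Om + lag, i][:, None] * x[Om] - z[Om + lag, i][:, None] * z[Om]).mean(axis=0)

def pair_statistic(x, z, V, U):
    # x, z: trajectories of shape (T+1, d); V[i], U[i]: d x K orthonormal bases.
    T, d = x.shape[0] - 1, x.shape[1]
    N = T // 4
    Om1, Om2 = np.arange(N, 2 * N), np.arange(3 * N, 4 * N)   # 2nd and 4th quarters
    stat = 0.0
    for i in range(d):
        for B, lag in ((V[i], 0), (U[i], 1)):                 # lag 0 gives stat_Gamma, lag 1 gives stat_Y
            stat += (B.T @ segment_mean(x, z, Om1, i, lag)) @ (B.T @ segment_mean(x, z, Om2, i, lag))
    return stat

# Similarity entry for a pair (m, n): S[m, n] = 1 if pair_statistic(X[m], X[n], V, U) <= tau else 0.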
In order to enhance the robustness of these statistics, we divide {0, 1, . . . , Tclustering} into 4G (instead of 4) segments, compute G copies of statΓ and stat Y , and take the medians of these values. Next, we apply a mainstream clustering algorithm (e.g. spectral clustering (Chen et al., 2021b)) to the similarity matrix S, and divide Mclustering into K disjoint clusters {Ck}1 k K. The complete clustering procedure is provided in Algorithm 3. The threshold τ shall be chosen to be on the order of min1 k<ℓ K{ Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F} (which is strictly positive due to Fact 2.1), and it suffices to set the number of copies G to be on some logarithmic order. Our choice of these parameters are specified in Theorem 3.4. Learning Mixtures of Linear Dynamical Systems Algorithm 4 Least squares and covariance estimation 1: Input: clusters {Ck}1 k K. 2: for k = 1, . . . , K do 3: Compute 0 t Tm 1 xm,t+1xm,t X 0 t Tm 1 xm,txm,t 1 , and (22a) c W (k) 1 P 0 t Tm 1 b wm,t b w m,t, where b wm,t = xm,t+1 b A(k)xm,t. (22b) 4: end for 5: Output: estimated models { b A(k), c W (k)}1 k K. A.3. Model estimation Suppose that we have obtained reasonably good clustering accuracy in the previous step, namely for each 1 k K, the cluster Ck output by Algorithm 3 contains mostly trajectories generated by the same model (A(π(k)), W (π(k))), with π representing some permutation function of {1, . . . , K}. We propose to obtain a coarse model estimation and covariance estimation, as summarized in Algorithm 4. More specifically, for each k, we use the samples {{xm,t}0 t Tm}m Ck to obtain an estimate of A(π(k)) by solving the following least-squares problem (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019) b A(k) := arg min A 0 t Tm 1 xm,t+1 Axm,t 2 2, (21) whose closed-form solution is given in (22a). Next, we can use b A(k) to estimate the noise vector b wm,t := xm,t+1 b A(k)xm,t wm,t, and finally estimate the noise covariance W (π(k)) with the empirical covariance of { b wm,t}, as shown in (22b). A.4. Classification Procedure. In the previous steps, we have obtained initial clusters {Ck}1 k K and coarse model estimates { b A(k), c W (k)}1 k K. With the assistance of additional sample trajectories {Xm}m Mclassification, we can infer their latent labels and assign them to the corresponding clusters; the procedure is stated in Algorithm 5 and will be explained shortly. Once this is done, we can run Algorithm 4 again with the updated clusters {Ck} to refine our model estimates, which is exactly the last step of Algorithm 1. Rationale. The strategy of inferring labels in Algorithm 5 can be derived from the maximum likelihood estimator, under the assumption that the noise vectors {wm,t} follow Gaussian distributions. Note, however, that even in the absence of Gaussian assumptions, Algorithm 5 remains effective in principle. To see this, consider a short trajectory {xt}0 t T generated by model (A(k), W (k)), i.e. xt+1 = A(k)xt + wt where E[wt] = 0, cov(wt) = W (k). Let us define the following loss function L(A, W ) := T log det(W ) + t=0 (xt+1 Axt) W 1(xt+1 Axt). 
(23) With some elementary calculation, we can easily check that for any incorrect label ℓ = k, it holds that E[L(A(ℓ), W (ℓ)) L(A(k), W (k))] > 0, with the proviso that (A(k), W (k)) = (A(ℓ), W (ℓ)) and {xt} are non-degenerate in some sense; in other words, the correct model (A(k), W (k)) achieves the minimal expected loss (which, due to the quadratic form of the loss function, depends solely on the first and second moments of the distributions of {wt}, as well as the initial state x0). This justifies the proposed procedure for inferring unknown labels in Algorithm 5, provided that Tm is large enough and that the estimated LDS models are sufficiently reliable. Learning Mixtures of Linear Dynamical Systems Algorithm 5 Classification 1: Input: short trajectories {Xm}m Mclassification, where Xm = {xm,t}0 t Tm; coarse models { b A(k), c W (k)}1 k K; clusters {Ck}1 k K. 2: for m Mclassification do 3: Infer the label of the m-th trajectory by bkm arg min ℓ n Tm log det(c W (ℓ)) + t=0 (xm,t+1 b A(ℓ)xm,t) (c W (ℓ)) 1(xm,t+1 b A(ℓ)xm,t) o , then add m to cluster Cbkm. 4: end for 5: Output: updated clusters {Ck}1 k K. B. Detailed analysis This section provides detailed, modular theoretical results for the performance of Algorithms 2 5, and concludes with a proof for the main theorems in Section 3.2 (i.e. the performance guarantees of Algorithm 1). Subspace estimation. The following theorem provides upper bounds on the errors of subspaces {Vi, Ui} output by Algorithm 2, assuming sufficient mixing of the short trajectories. Theorem B.1. Consider the model (1) under the assumptions in Sections 2 and 3.1. There exist some universal constants C1, C2, C3 > 0 such that the following holds. Suppose that we run Algorithm 2 with data {Xm}m Msubspace obeying Tsubspace C1 tmix, Ttotal,subspace C2 tmixd log Ttotald δ , where tmix := 1 1 ρ log dκATtotal Then with probability at least 1 δ, Algorithm 2 outputs {Vi, Ui}1 i d satisfying the following: for all 1 k K and 1 i d, max n (Γ(k))i Vi Vi (Γ(k))i 2, (Y (k))i Ui Ui (Y (k))i 2 o 1/2 tmixd Ttotal,subspace log3 Ttotald Our proof (deferred to Appendix C.2) includes a novel perturbation analysis; the resulted error bound (24) has a 1/T 1/4 total,subspace dependence and is gap-free (i.e. independent of the eigenvalue gaps of the ground-truth low-rank matrices, which can be arbitrarily close to zero in the worst case). It is possible to adapt the existing perturbation results in (Kong et al., 2020b;a) to our setting (which we include in Lemma C.3 in the appendix for completeness); however, one of them is dependent on the eigenvalue gaps, while the other one incurs a worse 1/T 1/6 total,subspace dependence. It would be interesting future work to investigate whether a gap-free bound with a 1/T 1/2 total,subspace dependence is possible. Clustering. Our next theorem shows that Algorithm 3 achieves exact clustering of Mclustering, if Tclustering is sufficiently large and subspaces {Vi, Ui} are accurate. The proof is deferred to Appendix C.3. Theorem B.2. Consider the model (1) under the assumptions in Sections 2 and 3.1. There exist universal constants C1, C2, c3 > 0 such that the following holds. Suppose that we run Algorithm 3 with data {Xm}m Mclustering, independent subspaces {Vi, Ui}1 i d and parameters τ, G that satisfy the following: The threshold τ obeys 1/8 < τ/ 2 Γ,Y < 3/8; The short trajectory length Tclustering = NG, where G C1 log |Mclustering| d K 2 Γ,Y + 1 ! 
1 1 ρ log Γ,Y + 2 dκATtotal Learning Mixtures of Linear Dynamical Systems The subspaces {Vi, Ui}1 i d satisfy that, for all 1 i d and 1 k K, max (Γ(k))i Vi Vi (Γ(k))i 2, (Y (k))i Ui Ui (Y (k))i 2 Then with probability at least 1 δ, Algorithm 3 achieves exact clustering: for all m1, m2 Mclustering, Sm1,m2 = 1 if and only if the m1-th and m2-th trajectories are generated by the same model, i.e. they have the same label km1 = km2. Least squares and covariance estimation. The next result controls the model estimation errors of Algorithm 4, under the assumption that every cluster is pure. Theorem B.3. Consider the model (1) under the assumptions in Section 3.1. There exist universal constants C1, C2, C3 > 0 such that the following holds. Let {Ck}1 k K be subsets of Mclustering Mclassification such that for all 1 k K, Ck contains only short trajectories generated by model (A(k), W (k)), namely km = k for all m Ck. Suppose that for all m Mclustering Mclassification, Tm 4, and for all 1 k K, T (k) total := X m Ck Tm C1dκ2 wι, where ι := log Γmax Let b A(k), c W (k) be computed by (22) in Algorithm 4. Then with probability at least 1 δ, one has b A(k) A(k) C2 T (k) total , c W (k) W (k) T (k) total , 1 k K. Our proof (postponed to Appendix C.4) is based on the techniques of (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019), but with two major differences. First, the authors of (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019) consider the setting where ordinary least squares is applied to a single continuous trajectory generated by a single LDS model; this is not the case for our setting, and thus our proof and results are different from theirs. Second, the noise covariance matrix W is assumed to be σ2Id in (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019), while in our case, {W (k)}1 k K are unknown and need to be estimated. Classification. Our last theorem shows that Algorithm 5 correctly classifies all trajectories in Mclassification, as long as the coarse models are sufficiently accurate and the short trajectory lengths are large enough; these conditions are slightly different for Cases 0 and 1 defined in (8). See Appendix C.5 for the proof. Theorem B.4. Consider the model (1) under the assumptions in Section 3.1. There exist universal constants c1, c2, C3 > 0 such that the following holds. Suppose that we run Algorithm 5 with data {Xm}m Mclassification and independent coarse models { b A(k), c W (k)}1 k K satisfying b A(k) A(k) ϵA, c W (k) W (k) ϵW for all k. Then with probability at least 1 δ, Algorithm 5 correctly classifies all trajectories in Mclassification, provided that For Case 0: ϵA c1 A,W Wmin Γmaxκw,cross(d + ι), ϵW Wmin c2 min n 1, A,W p Tm C3 κ2 w + κ6 w,cross 2 A,W ι2, m Mclassification; (26b) For Case 1: ϵA c1 A,W Wmin Γmaxκw,crossκ2 A(d + ι), ϵW Wmin c2 min n 1, A,W p Tm C3 κ2 w + κ6 w,cross 2 A,W ι2 + 1 1 ρ log(2κA), m Mclassification, (26d) where ι := log Ttotal δ is a logarithmic term. Learning Mixtures of Linear Dynamical Systems Proof of Theorems 3.4 and 3.5. Our main theorems are direct implications of the above guarantees for the individual steps. Stage 1: To begin with, according to Theorem B.1, if the condition (9a) on Tsubspace and Ttotal,subspace hold, then the subspaces {Vi, Ui} output by Line 3 of Algorithm 1 are guaranteed to satisfy the error bounds (25) required by Theorem B.2. This together with the condition (9b) on Tclustering ensures exact clustering in Line 4. 
Stage 2: Based on this, if we further know that Ttotal,clustering obeys condition (9b) for Case 0 or (11a) for Case 1, then Theorem B.3 tells us that the coarse model estimation in Line 5 satisfies the error bounds (26a) or (26c) required by Theorem B.4. Together with the assumption (9c) or (11b) on Tclassification, this guarantees exact classification in Line 7 of Algorithm 1, for either Case 0 or 1. At the end, the final model estimation errors (10) follow immediately from Theorem B.3. Note that all the statements above are high-probability guarantees; it suffices to take the union bound over these steps, so that the performance guarantees of Theorems 3.4 and 3.5 hold with probability at least 1 δ. This finishes the proof of our main theorems. Remark B.5. The step of subspace estimation in Algorithm 1 is non-essential and optional; it allows for a smaller Tclustering, but comes at the price of complicating the overall algorithm. For practitioners who prefer a simpler algorithm, they might simply remove this step (i.e. Line 3 of Algorithm 1), and replace the rank-K subspaces {Vi, Ui} with Id (i.e. no dimensionality reduction for the clustering step). The theoretical guarantees continue to hold with minor modification: in Corollary 3.7, one simply needs to remove the conditions on Tsubspace, Ttotal,subspace, and in the condition for Tclustering, replace the factor p K/d (where K is due to dimensionality reduction) with p d/d = 1. This is one example regarding how our modular algorithms and theoretical analysis can be easily modified and adapted to accommodate different situations. C. Proofs for Appendix B This section starts with some preliminaries about linear dynamical systems that will be helpful later. Then, it provides the main proofs for the theorems in Appendix B. C.1. Preliminaries Truncation of autocovariance. Recall the notation Γ(k) = P i=0 A(k)i W (k)(A(k)i) from (3) and (5). We add a subscript t to represent its t-step truncation: i=0 A(k)i W (k)(A(k)i) . (27) Also recall the assumption of exponential stability in Assumption 3.1, namely (A(k))t κAρt. As a result, Γ(k) t is close to Γ(k): 0 Γ(k) Γ(k) t = i=t A(k)i W (k)(A(k)i) = A(k)tΓ(k)(A(k)t) , Γ(k) Γ(k) t Γ(k) A(k)t 2 Γ(k) κ2 Aρ2t Γmaxκ2 Aρ2t. (28) Moreover, let Y (k) t := A(k)Γ(k) t , then Y (k) t is also close to Y (k): Y (k) Y (k) t A(k) Γ(k) Γ(k) t Γmaxκ3 Aρ2t. (29) Independent version of states. Given some mixing time tmix, we define exm,t(tmix) as the (tmix 1)-step approximation of xm,t : exm,t = exm,t(tmix) := i=0 (A(km))iwm,t i 1 N(0, Γ(km) tmix 1), tmix t Tm, 1 m M. (30) Learning Mixtures of Linear Dynamical Systems Since exm,t consists of only the most recent noise vectors, it is independent of the history up to xm,t tmix+1. Our proofs for Stage 1 will rely on this independent version {exm,t} of states {xm,t}; the basic idea is that, for an appropriately chosen tmix, a trajectory of length T can be regarded as a collection of T/tmix independent samples. We will often use the notatione to represent the independent version of other variables as well. Boundedness of states. The following lemma provides upper bounds for { exm,t 2} and { xm,t 2}. This will help to control the effects of mixing errors and model estimation errors in the analyses later. Lemma C.1 (Bounded states). Consider the model (1) under the assumptions in Sections 2 and 3.1. Fix any tmix 3. 
Then with probability at least 1 δ, we have exm,t 2 C0 p Γmax(d + log(Ttotal/δ)) for all 1 m M, tmix t Tm, where C0 > 0 is some universal constant; moreover, for both cases of initial states defined in (8), all states {{xm,t}0 t Tm}1 m M are bounded throughout: Case 0: with probability at least 1 δ, we have xm,t 2 C0 p Γmax(d + log(Ttotal/δ)) for all m, t. Case 1: suppose that tmix 1 1 ρ log( 2κA), and Tm tmix for all m, then with probability at least 1 δ, we have xm,t 2 3C0κA p Γmax(d + log(Ttotal/δ)) for all m, t, and xm,t 2 2C0 p Γmax(d + log(Ttotal/δ)) for all t tmix or t = 0. Proof. First, recall from Corollary 7.3.3 of (Vershynin, 2018) that, if random vector a N(0, Id), then for all u 0, we have P( a 2 2 d + u) 2 exp( cu2). Since exm,t N(0, Γ(km) tmix 1), where Γ(km) tmix 1 Γ(km) Γmax Id, we have P( exm,t 2 Γmax(2 d + u)) 2 exp( cu2). Taking the union bound, we have P there exists m, t such that exm,t 2 p t=tmix P exm,t 2 p d + u) 2Ttotal exp( cu2) δ, where the last inequality holds if we pick u q 1 c log 2Ttotal δ . This finishes the proof of the first claim in the lemma. Next, we prove the boundedness of { xm,t 2}. Case 0: It is easy to check that xm,t N(0, Γ(km) t ), where Γ(km) t Γ(km) Γmax Id. The boundedness of { xm,t 2} can be proved by a similar argument as before, which we omit for brevity. Case 1: Define ξm,t := xm,t (A(km))txm,0 N(0, Γ(km) t ). By a similar argument as before, we have with probability at least 1 δ, ξm,t 2 C0 p Γmax(d + log(Ttotal/δ)) for all m, t. Morever, for any t tmix 1 1 ρ log( 2κA), we have (A(km))t κAρt 1/2. With this in place, we have xm+1,0 = xm,Tm = (A(km))Tmxm,0 + ξm,Tm, and thus xm+1,0 (A(km))Tm xm,0 2 + ξm,Tm 2 1 2 xm,0 2 + C0 Γmax(d + log Ttotal Recall the assumption that x1,0 = 0; by induction, we have xm,0 2 2C0 p Γmax(d + log(Ttotal/δ)) for all 1 m M. Now, we have for all m, t, xm,t 2 (A(km))txm,0 2 + ξm,t 2 κA xm,0 2 + C0 Γmax(d + log Ttotal Γmax(d + log Ttotal moreover, for t tmix, since (A(km))t 1/2, we obtain a better bound xm,t 2 (A(km))txm,0 2 + ξm,t 2 2 xm,0 2 + C0 Γmax(d + log Ttotal Γmax(d + log Ttotal This shows the boundedness of { xm,t 2} and completes our proof of the lemma. Learning Mixtures of Linear Dynamical Systems C.2. Proof of Theorem B.1 Theorem B.1 is an immediate consequence of Lemmas C.2 and C.3 below. The former shows the concentration of c Hi, b Gi around the targeted low-rank matrices Hi, Gi, while the latter is a result of perturbation analysis. Lemma C.2. Under the setting of Theorem B.1, with probability at least 1 δ, we have for all 1 i d, max n c Hi Hi , b Gi Gi o Γ2 max tmixd Ttotal,subspace log3 Ttotald Lemma C.3. Consider the matrix M = PK k=1 p(k)y(k)y(k) Rd d, where 0 < p(k) < 1, PK k=1 p(k) = 1, and y(k) Rd. Let M be symmetric and satisfy M M ϵ, and U Rd K be the top-K eigenspace of M. Then we have K X k=1 p(k) y(k) UU y(k) 2 2 2Kϵ, (31) and for all 1 k K, it holds that y(k) UU y(k) 2 min 1/2 , 2ϵ λmin(M) y(k) 2, p(k) y(k) 2 For our main analyses in Sections 3.2 and B, we choose the first term on the right-hand side of (32). C.2.1. PROOF OF LEMMA C.2 We first analyze the idealized case with i.i.d. samples; then we make a connection between this i.i.d. case and the actual case of mixed LDSs, by utilizing the mixing property of linear dynamical systems. We prove the result of Lemma C.2 for b Gi Gi only, since the analysis for c Hi Hi is mostly the same (and simpler in fact). Step 1: the idealized i.i.d. case. With some abuse of notation, suppose that for all 1 m M, we have for some km {1, . . . 
, K}, xm,t, zm,t i.i.d. N(0, eΓ(km)), wm,t, vm,t i.i.d. N(0, W (km)), xm,t = A(km)xm,t + wm,t, zm,t = A(km)zm,t + vm,t, 1 t N, where for all 1 k K, it holds that W (k), eΓ(k) Γ(k) Γmax Id. Notice that cov(xm,t ) = A(km)eΓ(km)(A(km)) + W (km) A(km)Γ(km)(A(km)) +W (km) = Γ(km) Γmax Id. Consider the i.i.d. version of matrix b Gi and its expectation Gi defined as follows: b Gi := 1 MN izm,t , Gi = k=1 p(k)( e Y (k))i( e Y (k)) i , where ( e Y (k))i is the transpose of the i-th row of e Y (k) := A(k)eΓ(k). For this i.i.d. setting, we claim that, if the i.i.d. sample size MN satisfies MN d log(MNd/δ), then with probability at least 1 δ, b Gi Gi Γ2 max d MN log3 MNd δ , 1 i d. (33) This claim can be proved by a standard covering argument with truncation; we will provide a proof later for completeness. Step 2: back to the actual case of mixed LDSs. Now we turn to the analysis of b Gi defined in (14) versus its expectation Gi defined in (15b), for the mixed LDSs setting. We first show that b Gi can be writte as a weighted average of some matrices, each of which can be further decomposed into an i.i.d. part (as in Step 1) plus a negligible mixing error term. Then we analyze each term in the decomposition, and finally put pieces together to show that b Gi Gi. Learning Mixtures of Linear Dynamical Systems Step 2.1: decomposition of the index set Ω1 Ω2. Recall the definition of index sets Ω1, Ω2 in Algorithm 2. Denote the first index of Ω1 (resp. Ω2) as τ1 + 1 (resp. τ2 + 1), and let := τ2 τ1 be their distance. Also denote N := |Ω1| = |Ω2| Tsubspace. For any t Ω1 and 1 j N, define sj(t) := Cycle(t + + j; Ω2) = ( t + + j if t + + j τ2 + N, t + + j N otherwise, where Cycle(i; Ω) represents the cyclic indexing of value i on set Ω. Then we have Ω1 Ω2 = n (t1, t2), t1 Ω1, t2 Ω2 o = N j=1 n t, sj(t) , t Ω1 o . We further define Sτ := {τ1 + τ + f tmix : f 0, τ + f tmix N}, 1 τ tmix, so that Ω1 = tmix τ=1Sτ. Notice that for each τ, the elements of Sτ are at least tmix far apart. Putting together, we have Ω1 Ω2 = N j=1 tmix τ=1 n t, sj(t) , t Sτ o . (34) Step 2.2: decomposition of b Gi. In the remaining proof, we denote xm,t := xm,t+1 for notational consistency. Using the decomposition (34) of Ω1 Ω2, we can rewrite b Gi defined in (14) as a weighted average of Ntmix matrices: b Gi = 1 |Msubspace| m Msubspace (t1,t2) Ω1 Ω2 (x m,t1)i xm,t1 (x m,t2)i xm,t2 = N 2 Fi,j,τ, Fi,j,τ := 1 |Msubspace| |Sτ| m Msubspace t Sτ (xm,t )i xm,t (x m,sj(t))i xm,sj(t) . (35) Recalling the definition of exm,t in (30) and Γ(k) t in (27), we have xm,t = exm,t + (A(km))tmix 1xm,t tmix+1 | {z } =:δm,t = exm,t + δm,t, where δm,t 2 (A(km))tmix 1 xm,t tmix+1 2 κAρtmix 1 xm,t tmix+1 2, (36) and exm,t N(0, Γ(k) tmix 1) is independent of δm,t. Moreover, xm,t = A(km)xm,t + wm,t = ex m,t + A(km)δm,t, where ex m,t := A(km)exm,t + wm,t. We can rewrite xm,sj(t) = exm,sj(t) + δm,sj(t) and x m,sj(t) = ex m,sj(t) + A(km)δm,sj(t) in a similar manner. Putting this back to (35), one has Fi,j,τ = 1 |Msubspace| |Sτ| m Msubspace t Sτ (xm,t )i xm,t (x m,sj(t))i xm,sj(t) = 1 |Msubspace| |Sτ| m Msubspace (ex m,t + A(km)δm,t)i (exm,t + δm,t) (ex m,sj(t) + A(km)δm,sj(t))i (exm,sj(t) + δm,sj(t)) = 1 |Msubspace| |Sτ| m Msubspace t Sτ (ex m,t)i exm,t (ex m,sj(t))i exm,sj(t) =: e Fi,j,τ + i,j,τ, (37) where i,j,τ contains all the {δm,t} terms in the expansion. The key observation here is that, by our definition of index set Sτ, the {exm,t} terms in e Fi,j,τ are independent, and thus we can utilize our earlier analysis of the i.i.d. case in Step 1 to study e Fi,j,τ. 
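The index bookkeeping of Step 2.1 above can be pictured with a few lines of code (a toy sketch with hypothetical values of τ1, N, and t_mix): each subset S_τ keeps one index out of every t_mix consecutive indices of Ω1, so that states restricted to a single subset are at least t_mix steps apart and hence nearly independent.

import numpy as np

def split_by_mixing_time(tau1, N, t_mix):
    # S_tau = {tau1 + tau + f * t_mix : f >= 0, tau + f * t_mix <= N}, for tau = 1, ..., t_mix,
    # so that Omega_1 = {tau1 + 1, ..., tau1 + N} is the disjoint union of S_1, ..., S_{t_mix}.
    return [np.arange(tau1 + tau, tau1 + N + 1, t_mix) for tau in range(1, t_mix + 1)]

# Toy example: Omega_1 = {11, ..., 30} with t_mix = 4.
S = split_by_mixing_time(tau1=10, N=20, t_mix=4)
# S[0] = [11, 15, 19, 23, 27], S[1] = [12, 16, 20, 24, 28], and so on;
# within each subset, consecutive indices are exactly t_mix apart.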
Learning Mixtures of Linear Dynamical Systems Step 2.3: analysis for each term of the decomposition. Towards showing b Gi Gi, we prove in the following that e Fi,j,τ concentrates around its expectation e Gi, which is close to Gi; moreover, the error term i,j,τ becomes exponentially small as tmix grows. More specifically, we have the following: Recall the notation Y (k) tmix 1 = A(k)Γ(k) tmix 1. It holds that E[ e Fi,j,τ] = e Gi := k=1 p(k)(Y (k) tmix 1)i(Y (k) tmix 1)i . According to our result (33) for the i.i.d. case, for fixed j, τ, we have with probability at least 1 δ, 1 i d, e Fi,j,τ e Gi Γ2 max d |Msubspace| |Sτ| log3 |Msubspace| |Sτ|d tmixd Ttotal,subspace log3 Ttotal,subspaced Recall from (29) that Y (k) Y (k) tmix 1 Γmaxκ3 Aρ2(tmix 1). Therefore, (Y (k))i(Y (k))i (Y (k) tmix 1)i(Y (k) tmix 1)i (Y (k))i 2 + (Y (k) tmix 1)i 2 (Y (k))i (Y (k) tmix 1)i 2 2 Y (k) + Y (k) Y (k) tmix 1 Y (k) Y (k) tmix 1 2ΓmaxκA + Γmaxκ3 Aρ2(tmix 1) Γmaxκ3 Aρ2(tmix 1) = 2 + κ2 Aρ2(tmix 1) Γ2 maxκ4 Aρ2(tmix 1) 3Γ2 maxκ4 Aρ2(tmix 1), where we use the mild assumption that tmix 1+ log κA 1 ρ , and the fact that Y (k) , Y (k) tmix 1 ΓmaxκA. Consequently, k=1 p(k) (Y (k))i(Y (k))i (Y (k) tmix 1)i(Y (k) tmix 1)i 3Γ2 maxκ4 Aρ2(tmix 1). (39) By Lemma C.1, if tmix 1 1 ρ log(2κA), then we have with probability at least 1 δ, all the xm,t s and exm,t s involved in the definition of i,j,τ in (37) have ℓ2 norm bounded by Γmaxpoly(d, κA, log(Ttotal/δ)). This together with the upper bound on δm,t 2 in (36) implies that for all i, j, τ, it holds that i,j,τ Γ2 max poly d, κA, log Ttotal ρtmix 1. (40) Step 2.4: putting pieces together. With (38), (39) and (40) in place and taking the union bound, we have with probability at least 1 δ, for all 1 i d, N 2 Fi,j,τ Gi max j,τ e Fi,j,τ e Gi + e Gi Gi + max j,τ i,j,τ tmixd Ttotal,subspace log3 Ttotal,subspaced δ + Γ2 maxκ4 Aρ2(tmix 1) + Γ2 max poly d, κA, log d Ttotal tmixd Ttotal,subspace log3 Ttotald where the last inequality holds if tmix 1 1 ρ log dκATtotal δ . This finishes the proof of Lemma C.2. Learning Mixtures of Linear Dynamical Systems Proof of (33). Define the truncating operator Trunc(x; D) := x 1(|x| D), x R, D 0. Consider the following truncated version of b Gi: b GTrunc i := 1 MN (the truncating level D0 will be specified later), and let ETrunc i := E b GTrunc i be its expectation. In the following, we first show that b GTrunc i concentrates around ETrunc i , and then prove that ETrunc i Gi. By a standard covering argument, we have b GTrunc i ETrunc i = sup u,v Sd 1 u b GTrunc i ETrunc i v 4 sup u,v N1/8 u b GTrunc i ETrunc i v, where N1/8 denotes the 1/8-covering of the unit sphere Sd 1 and has cardinality |N1/8| 32d. For fixed u, v N1/8, one has u b GTrunc i v = 1 MN t=1 Trunc xm,t i; D0 u xm,t Trunc zm,t i; D0 v zm,t, where (cf. Chapter 2 of (Vershynin, 2018) for the definitions of subgaussian norm ψ2 and subexponential norm ψ1) Trunc xm,t i; D0 , Trunc zm,t i; D0 D0, u xm,t ψ2, v zm,t ψ2 p Hence Trunc xm,t i; D0 u xm,t Trunc zm,t i; D0 v zm,t ψ1 D2 0Γmax, and by Bernstein s inequality (Corollary 2.8.3 of (Vershynin, 2018)), we have P u b GTrunc i ETrunc i v τ 2 exp c0MN τ D2 0Γmax for all 0 τ D2 0Γmax. 
Taking the union bound over u, v N1/8, we have with probability at least 1 δ/2, b GTrunc i ETrunc i D2 0Γmax δ MN , provided that MN d + log 1 Gi ETrunc i = k=1 p(k)Ext,zt N(0,eΓ(k)) (xt )i(zt )i Trunc (xt )i Trunc (zt )i xtzt , where for each k, Ext,zt N(0,eΓ(k)) (xt )i(zt )i Trunc (xt )i Trunc (zt )i xtzt = sup u,v Sd 1 Ext,zt N(0,eΓ(k)) (xt )i(zt )iu xtv zt 1 1 |(xt )i| D0, |(zt )i| D0 Learning Mixtures of Linear Dynamical Systems sup u,v Sd 1 E (xt )i(zt )iu xtv zt 2r E 1 1 |(xt )i| D0, |(zt )i| D0 2 P(|N(0, Γmax)| > D0) Γ2 max exp c1 D2 0 Γmax Finally, notice that if the truncating level D0 is sufficiently large, then we have b Gi = b GTrunc i with high probability. More formally, we have shown that for fixed 1 i d, if MN d + log(1/δ), then P b Gi Gi D2 0Γmax δ MN + Γ2 max exp c1 D2 0 Γmax m,t P |(xm,t )i| > D0 or |(zm,t )i| > D0 . If we pick the truncating level D0 p Γmax log(MN/δ), then it is easy to check that with probability at least 1 δ, b Gi Gi D2 0Γmax δ MN Γ2 max d MN log3 MN Taking the union bound over 1 i d finishes our proof of (33). C.2.2. PROOF OF LEMMA C.3 Define := M M , and denote the eigendecomposition of M and M as M = U Λ U and M = UΛU + U Λ U , where diagonal matrix Λ (resp. Λ ) contains the top-K eigenvalues of M (resp. M ). Then we have Λ = U MU = U M U + U U = k=1 p(k)U y(k)y(k) U + U U, Λ = U M U = k=1 p(k)U y(k)y(k) U . Substracting these two equations gives k=1 p(k) U y(k)y(k) U U y(k)y(k) U = Λ Λ + U U. Taking the trace of both sides, we get k=1 p(k) U y(k) 2 2 U y(k) 2 2 = Tr Λ Λ + Tr U U . On the left-hand side, U y(k) 2 2 U y(k) 2 2 = y(k) 2 2 U y(k) 2 2 = y(k) UU y(k) 2 2 0, while on the right-hand side, λk(Λ ) λk(Λ) (i) K Kϵ, Tr(U U) Tr(U U) = Tr(IK) Kϵ, where (i) follows from Weyl s inequality. Putting things together, we have proved (31), which immediately leads to the first upper bound in (32). The second upper bound in (32) follows from a simple application of Davis-Kahan s sin Θ theorem (Davis & Kahan, 1970), and the third term is due to Lemma A.11 of (Kong et al., 2020b); we skip the details for brevity. Learning Mixtures of Linear Dynamical Systems C.3. Proof of Theorem B.2 Our proof follows the three steps below: 1. Consider the idealized i.i.d. case, and characterize the expectations and variances of the testing statistics computed by Algorithm 3; 2. Go back to the actual case of mixed LDSs, and analyze one copy of statΓ,g or stat Y,g defined in (18) for some fixed 1 g G, by decomposing it into an i.i.d. part plus a negligible mixing error term; 3. Analyze median{statΓ,g, 1 g G} and median{stat Y,g, 1 g G}, and prove the correct testing for each pair of trajectories, which implies that Algorithm 3 achieves exact clustering. Step 1: the idealized i.i.d. case. Recall the definition of stat Y in (19) when we first introduce our method for clustering. For notational convenience, we drop the subscript in stat Y , and replace xt+1, zt+1 with xt , zt ; then, with some elementary linear algebra, (19) can be rewritten as (xt )ixt (zt )izt , Ui 1 (xt )ixt (zt )izt E t Ω1 vec xt(xt ) zt(zt ) , U 1 t Ω2 vec xt(xt ) zt(zt ) E , where we define a large orthonormal matrix U1 0 . . . 0 0 U2 ... ... ... ... ... 0 0 . . . 0 Ud Rd2 d K, (41) In this step, we consider the idealized i.i.d. case: t Ω1 Ω2, xt i.i.d. N(0, eΓ(k)), wt i.i.d. N(0, f W (k)), xt = e A(k)xt + wt, zt i.i.d. N(0, eΓ(l)), vt i.i.d. N(0, f W (l)), zt = e A(l)zt + vt, where eΓ(k), eΓ(l), f W (k), f W (l) are d d covariance matrices, and e A(k), e A(l) are d d state transition matrix. 
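Before turning to the moments of stat, it may help to see the vectorized form above in code. The following sketch (our own, with assumed NumPy conventions) assembles the block-diagonal matrix U from (41) and evaluates the statistic as an inner product of projected, vectorized outer-product averages; with column-stacking vectorization, the i-th length-d block of vec(x_t (x'_t)^T) is exactly (x'_t)_i x_t, so this agrees with the coordinate-wise form in (18b)/(19) up to floating-point error.

import numpy as np
from scipy.linalg import block_diag

def big_U(U_list):
    # The d^2 x dK block-diagonal matrix U in (41), built from the d subspace bases U_1, ..., U_d.
    return block_diag(*U_list)

def vec(M):
    # Column-stacking vectorization: the i-th length-d block of vec(x x'^T) equals (x')_i * x.
    return M.flatten(order="F")

def stat_Y_vectorized(x, z, U_list):
    # x, z: trajectories of shape (T+1, d); U_list: the d estimated bases, each d x K.
    T = x.shape[0] - 1
    N = T // 4
    Om1, Om2 = np.arange(N, 2 * N), np.arange(3 * N, 4 * N)
    U = big_U(U_list)
    avg = lambda Om: np.mean([vec(np.outer(x[t], x[t + 1]) - np.outer(z[t], z[t + 1])) for t in Om], axis=0)
    return (U.T @ avg(Om1)) @ (U.T @ avg(Om2))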
Our goal is to characterize the expectation and variance of stat in this i.i.d. case. Before we present the results, we need some additional notation. First, let {ei}1 i d be the canonical basis of Rd, and define e Y (k) := e A(k)eΓ(k), Σ(k) := e A(k) Id (eΓ(k))1/2 (eΓ(k))1/2 Id2 + P (eΓ(k))1/2 (eΓ(k))1/2 ( e A(k)) Id , where P Rd2 d2 is a symmetric permutation matrix, whose (i, j)-th block is eje i Rd d, 1 i, j d. Let e Y (l), Σ(ℓ) be defined similarly, with e A(k), eΓ(k) replaced by e A(l), eΓ(l). Moreover, define µk,l := U vec ( e Y (k) e Y (l)) Rd K, (42a) Σk,l := U Σ(k) + f W (k) eΓ(k) + Σ(ℓ) + f W (l) eΓ(l) U Rd K d K. (42b) Now we are ready to present our results for the i.i.d. case. The first lemma below gives a precise characterization of E[stat] and var(stat), in terms of µk,l and Σk,l; the second lemma provides some upper and lower bounds, which will be handy for our later analyses. Learning Mixtures of Linear Dynamical Systems Lemma C.4. Denote N = min{|Ω1|, |Ω2|}. For the i.i.d. case just described, one has E[stat] = µk,l 2 2, var(stat) 1 N 2 Tr(Σ2 k,l) + 2 N µ k,lΣk,lµk,l, and the inequality becomes an equality if |Ω1| = |Ω2| = N. Lemma C.5. Consider the same setting of Lemma C.4. Furthermore, suppose that eΓ(k), eΓ(l) Γmax Id, f W (k), f W (l) Wmax Id, e A(k) , e A(l) κA for some 0 < Wmax Γmax and κA 1. Then the following holds true. (Upper bound on expectation) It holds that E[stat] = µk,l 2 2 e Y (k) e Y (l) 2 F. (43) (Lower bound on expectation) If e Y (k) = e Y (l) and subspaces {Ui}1 i d satisfy 1 i d, max n ( e Y (k))i Ui Ui ( e Y (k))i 2, ( e Y (l))i Ui Ui ( e Y (l))i 2 o ϵ, (44) for some ϵ 0, then we have E[stat] = µk,l 2 2 e Y (k) e Y (l) 2 F 4ϵ ( e Y (k))i + ( e Y (l))i 2. (Upper bound on variance) The matrix Σk,l is symmetric and satisfies 0 Σk,l 6Γ2 maxκ2 AId K; this, together with the earlier upper bound (43) on µk,l 2 2, implies that var(stat) 1 N 2 Tr(Σ2 k,l) + 2 N µ k,lΣk,lµk,l Γ2 maxκ2 A N 2 d K + Γ2 maxκ2 A N e Y (k) e Y (l) 2 F. Step 2: one copy of stat Y,g, statΓ,g for a fixed g. Now we turn back to the mixed LDSs setting and prove Theorem B.2. Let us focus on the testing of one pair of short trajectories {xm1,t}, {xm2,t} for some m1, m2 Mclustering, m1 = m2. For notational consistency, in this proof we rewrite these two trajectories as {xt} and {zt}, their labels km1, km2 as k, ℓ, and the trajectory length Tclustering as T, respectively. Also denote xt := xt+1 and zt := zt+1. Recall the definition of {stat Y,g}1 g G in (18b); for now, we consider one specific element and ignore the subscript g. Recalling the definition of U Rd2 d K in (41), we have t Ω1 Ui (xt )ixt (zt )izt , 1 t Ω2 Ui (xt )ixt (zt )izt E t Ω1 vec xt(xt ) zt(zt ) , U 1 t Ω2 vec xt(xt ) zt(zt ) E , where N = T/4G = |Ω1| = |Ω2|. In the following, we show how to decompose stat Y into an i.i.d. term plus a negligible mixing error term, and then analyze each component of this decomposition; finally, we put pieces together to give a characterization of stat Y , or {stat Y,g}1 g G when we put the subscript g back in at the end of this step. Learning Mixtures of Linear Dynamical Systems Step 2.1: decomposition of stat Y . Define S1,τ1 := {t1 + τ1 + f tmix : f 0, τ1 + f tmix |Ω1|}, where t1 + 1 is the first index of Ω1, and the mixing time tmix will be specified later; define S2,τ2 similarly. 
Note that for each τ1, the elements of S1,τ1 are at least tmix far apart; moreover, we have Ω1 = tmix τ1=1S1,τ1, Ω2 = tmix τ2=1S2,τ2, and thus t Ω1 vec xt(xt ) zt(zt ) = N 1 |S1,τ1| t S1,τ1 vec xt(xt ) zt(zt ) , t Ω2 vec xt(xt ) zt(zt ) = N 1 |S2,τ2| t S2,τ2 vec xt(xt ) zt(zt ) . Therefore, we can rewrite stat Y as a weighted average τ2=1 wτ1,τ2 statτ1,τ2 Y , where wτ1,τ2 := |S1,τ1||S2,τ2| τ2=1 wτ1,τ2 = 1, and statτ1,τ2 Y := D U 1 |S1,τ1| t S1,τ1 vec xt(xt ) zt(zt ) , U 1 |S2,τ2| t S2,τ2 vec xt(xt ) zt(zt ) E . (45) We can further decompose statτ1,τ2 Y into an i.i.d. term plus a small error term. To do this, recalling the definition of exm,t in (30) and dropping the subscript m, we have xt = A(k)tmix 1xt tmix+1 | {z } =:δx,t + ext = δx,t + ext, xt = A(k)xt + wt = A(k)δx,t + (A(k)ext + wt) | {z } =:ex t = A(k)δx,t + ex t, where ext N(0, Γ(k) tmix 1). Similarly, we rewrite zt = δz,t + ezt, zt = A(ℓ)δz,t + ez t. Plugging these into the right-hand side of (45) and expanding it, one has statτ1,τ2 Y = g stat τ1,τ2 Y + τ1,τ2 Y , where g stat τ1,τ2 Y := D U 1 |S1,τ1| t S1,τ1 vec ext(ex t) ezt(ez t) , U 1 |S2,τ2| t S2,τ2 vec ext(ex t) ezt(ez t) E , and τ1,τ2 Y involves {δx,t, δz,t} terms. Step 2.2: analysis of each component. First, notice that the {ext, ezt} terms in the definition of g stat τ1,τ2 Y are independent; this suggests that we can characterize g stat τ1,τ2 Y by applying our earlier analysis for the i.i.d. case in Step 1. Second, τ1,τ2 Y involves {δx,t, δz,t} terms, which in turn involve A(k)tmix 1, A(ℓ)tmix 1 and thus will be exponentially small as tmix increases, thanks to Assumption 3.1. More formally, we have the following: Applying Lemmas C.4 and C.5 with (eΓ(k), eΓ(l)) = (Γ(k) tmix 1, Γ(ℓ) tmix 1), (f W (k), f W (l)) = (W (k), W (ℓ)) and ( e A(k), e A(l)) = (A(k), A(ℓ)), we have E g stat τ1,τ2 Y = U vec((Y (k) tmix 1 Y (ℓ) tmix 1) ) 2 2, var g stat τ1,τ2 Y Γ2 maxκ2 A e N 2 d K + Γ2 maxκ2 A e N Y (k) tmix 1 Y (ℓ) tmix 1 2 F, where e N := min{|S1,τ1|, |S2,τ2|} N/tmix. Learning Mixtures of Linear Dynamical Systems By Lemma C.1, with probability at least 1 δ, all {xt, ext, zt, ezt} terms involved in the definition of τ1,τ2 Y has ℓ2 norm bounded by Γmaxpoly(d, κA, log(Ttotal/δ)). This implies that τ1,τ2 Y Γ2 max poly d, κA, log Ttotal Step 2.3: putting pieces together. Putting the subscript g back in, we have already shown that τ2=1 wτ1,τ2 statτ1,τ2 Y,g τ2=1 wτ1,τ2 g stat τ1,τ2 Y,g | {z } =:g stat Y,g τ2=1 wτ1,τ2 τ1,τ2 Y,g | {z } =: Y,g = g stat Y,g + Y,g, E g stat Y,g = U vec((Y (k) tmix 1 Y (ℓ) tmix 1) ) 2 2, var g stat Y,g max τ1,τ2 var(g stat τ1,τ2 Y,g ) Γ2 maxκ2 A e N 2 d K + Γ2 maxκ2 A e N Y (k) tmix 1 Y (ℓ) tmix 1 2 F, Y,g max τ1,τ2 | τ1,τ2 Y,g | Γ2 max poly d, κA, log Ttotal So far in Step 2, we have focused on the analysis of stat Y,g. We can easily adapt the argument to study statΓ,g as well; the major difference is that, in Step 2.2, we should apply Lemmas C.4 and C.5 with (eΓ(k), eΓ(l)) = (Γ(k) tmix 1, Γ(ℓ) tmix 1), (f W (k), f W (l)) = (0, 0), ( e A(k), e A(l)) = (Id, Id), and subspaces {Ui} replaced by {Vi} instead. The final result is that, for all 1 g G, statΓ,g = g statΓ,g + Γ,g, E g statΓ,g = V vec((Γ(k) tmix 1 Γ(ℓ) tmix 1) ) 2 2, var g statΓ,g Γ2 maxκ2 A e N 2 d K + Γ2 maxκ2 A e N Γ(k) tmix 1 Γ(ℓ) tmix 1 2 F, Γ,g Γ2 max poly d, κA, log Ttotal Step 3: analysis of the medians, and final results. Recall the following standard result on the concentration of medians (or median-of-means in general; see Theorem 2 of (Lugosi & Mendelson, 2019)). 
Proposition C.6 (Concentration of medians). Let X1, . . . XG be i.i.d. random variables with mean µ and variance σ2. Then we have |median{Xg, 1 g G} µ| 2σ with probability at least 1 e c0G for some constant c0. Notice that by construction, {g stat Y,g}1 g G are i.i.d. (and so are {g statΓ,g}1 g G). Applying Proposition C.6 to our case, we know that if G log(1/δ), then with probability at least 1 δ, the following holds: If k = ℓ, i.e. the two trajectories are generated by the same LDS model, then median{statΓ,g, 1 g G} + median{stat Y,g, 1 g G} median{g statΓ,g} + median{g stat Y,g} + max g | Γ,g| + max g | Y,g| var(g statΓ,g) + 2 q var(g stat Y,g) + Γ2 max poly d, κA, log Ttotal d K e N + Γ2 max poly d, κA, log Ttotal ρtmix 1; (46) Learning Mixtures of Linear Dynamical Systems On the other hand, if k = ℓ, then median{statΓ,g, 1 g G} + median{stat Y,g, 1 g G} median{g statΓ,g} + median{g stat Y,g} max g | Γ,g| + max g | Y,g| E[g statΓ,g] + E[g stat Y,g] 2 q var(g statΓ,g) + q var(g stat Y,g) Γ2 max poly d, κA, log Ttotal V vec((Γ(k) tmix 1 Γ(ℓ) tmix 1) ) 2 2 + U vec((Y (k) tmix 1 Y (ℓ) tmix 1) ) 2 Γ2maxκ2 A e N Γ(k) tmix 1 Γ(ℓ) tmix 1 F + Y (k) tmix 1 Y (ℓ) tmix 1 F Γ2 max poly d, κA, log Ttotal ρtmix 1. (47) We need to further simplify the result (47) for the k = ℓcase. According to (28) and (29), we have Y (k) Y (k) tmix 1 F Γmax dκ3 Aρ2(tmix 1) =: ϵmix, Γ(k) Γ(k) tmix 1 F Γmax dκ2 Aρ2(tmix 1) ϵmix, which implies that Γ(k) tmix 1 Γ(ℓ) tmix 1 F Γ(k) Γ(ℓ) F + 2ϵmix, V vec (Γ(k) tmix 1 Γ(ℓ) tmix 1) 2 2 max n V vec (Γ(k) Γ(ℓ)) 2 2ϵmix, 0 o2 V vec (Γ(k) Γ(ℓ)) 2 2 4ϵmix V vec((Γ(k) Γ(ℓ)) ) 2 V vec (Γ(k) Γ(ℓ)) 2 2 4ϵmix Γ(k) Γ(ℓ) F. We can do a similar analysis for Y (k) tmix 1 Y (ℓ) tmix 1 F and U vec((Y (k) tmix 1 Y (ℓ) tmix 1) ) 2 2. Moreover, we claim (and prove later) that if the subspaces {Vi, Ui} satisfy the condition (25) in the theorem, then V vec((Γ(k) Γ(ℓ)) ) 2 2 + U vec((Y (k) Y (ℓ)) ) 2 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F . (48) Putting these back to (47), we have for the k = ℓcase, median{statΓ,g, 1 g G} + median{stat Y,g, 1 g G} V vec (Γ(k) Γ(ℓ)) 2 2 + U vec (Y (k) Y (ℓ)) 2 4ϵmix Γ(k) Γ(ℓ) F + Y (k) Y (ℓ) F Γ2maxκ2 A e N Γ(k) Γ(ℓ) F + Y (k) Y (ℓ) F + 4ϵmix Γ2 max poly d, κA, log Ttotal Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F 0.01 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F Γ2maxκ2 A e N Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F Γ2 max poly d, κA, log Ttotal Learning Mixtures of Linear Dynamical Systems (ii) 0.48 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F Γ2maxκ2 A e N Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F where (i) holds if tmix 1 1 ρ log(( Γmax Γ,Y + 2)dκA) so that ϵmix 10 3 Γ,Y , and (ii) holds if tmix 1 1 ρ log(( Γmax 2) dκATtotal δ ) so that Γ2 max poly(d, κA, log Ttotal δ ) ρtmix 1 10 3 2 Γ,Y . Putting (46) and (49) together, we can finally check that, if it further holds that e N N/tmix Γ2 maxκ2 A d K 2 Γ,Y + 1, then we have with probability at least 1 δ, median{statΓ,g} + median{stat Y,g} 8 2 Γ,Y if k = ℓ, 3 8 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F 3 8 2 Γ,Y if k = ℓ. This together with our choice of testing threshold τ [ 2 Γ,Y /8, 3 2 Γ,Y /8] in Algorithm 3 implies correct testing of the two trajectories {xt}, {zt}. Finally, taking the union bound over all pairs of trajectories in Mclustering leads to correct pairwise testing, which in turn implies exact clustering of Mclustering; this completes our proof of Theorem (B.2). C.3.1. PROOF OF LEMMA C.4 We first assume |Ω1| = |Ω2| = N for simplicity. 
Recall that stat = D U 1 t Ω1 vec xt(xt ) zt(zt ) t Ω2 vec xt(xt ) zt(zt ) where a, b Rd K are i.i.d., and E[a] = E h U vec xt(xt ) zt(zt ) i = U vec ( e Y (k) e Y (l)) = µk,l. Therefore, we have the expectation E[stat] = E[a], E[b] = µk,l 2 2. It remains to compute the variance var(stat) = E[stat2] E[stat]2, where E[stat2] = E (a b)2 = Tr E[bb ]E[aa ] = Tr E[aa ]2 . (50) Here E[aa ] = E[a]E[a] + cov(a), and since a is an empirical average of N i.i.d. random vectors, we have N cov(f), where f := U vec xt(xt ) zt(zt ) Rd K. For now, we claim that cov(f) = U (Σ(k) + f W (k) eΓ(k) + Σ(ℓ) + f W (l) eΓ(l))U = Σk,l, (51) which will be proved soon later. Putting these back to (50), one has E[aa ] = cov(a) + E[a]E[a] = 1 N Σk,l + µk,lµ k,l, E[aa ]2 = 1 N 2 Σ2 k,l + 1 N (Σk,lµk,lµ k,l + µk,lµ k,lΣk,l) + µk,l 2 2µk,lµ k,l, var(stat) = E[stat2] E[stat]2 = Tr(E[aa ]2) µk,l 4 2 = 1 N 2 Tr(Σ2 k,l) + 2 N µ k,lΣk,lµk,l, which completes our calculation of the variance for the case of |Ω1| = |Ω2| = N. For the more general case where (without loss of generality) |Ω1| = N |Ω2|, we simply need to modify the equation (50) to an inequality E[stat2] = Tr(E[bb ]E[aa ]) Tr(E[aa ]2), and the remaining analysis is the same. This finishes the proof of Lemma C.4. Learning Mixtures of Linear Dynamical Systems Proof of (51). For notational simplicity, we drop the subscripts t in the definition of f. Then we have f = U vec x(x ) z(z ) , and hence cov(f) = U cov vec x(x ) z(z ) U = U cov vec(x(x ) ) + cov vec(z(z ) ) U, (52) where the second equality uses the independence between (x, x ) and (z, z ). Let us focus on cov vec(x(x ) ) . For notational simplicity, for now we rewrite e A(k), f W (k), eΓ(k) as A, W , Γ. Define y := Γ 1/2x N(0, Id), and recall that x = Ax + w where w N(0, W ). Using the fact that vec(ABC) = (C A)vec(B) for any matrices of compatible shape, we have g := vec x x = vec(xx A ) + vec(xw ) = (A Id)vec(xx ) + vec(xw ) = (A Id)vec(Γ1/2yy Γ1/2) + vec(xw ) = (A Id)(Γ1/2 Γ1/2)vec(yy ) | {z } =:g1 + vec(xw ) | {z } =:g2 Note that E[g2] = 0, E[g] = E[g1], and E[g1g 2 ] = 0. Hence cov(g) = E[gg ] E[g]E[g] = cov(g1) + E[g2g 2 ]. (53) For the second term on the right-hand side, we have g2 = vec(xw ) = w1x ... wdx , E[g2g 2 ] = E h [wiwjxx ]1 i,j d i = W Γ; as for the first term, we claim (and prove soon later) that cov vec(yy ) = Id2 + P , (54) which implies that cov(g1) = cov (A Id)(Γ1/2 Γ1/2)vec(yy ) = (A Id)(Γ1/2 Γ1/2)cov vec(yy ) (Γ1/2 Γ1/2)(A Id) = (A Id)(Γ1/2 Γ1/2)(Id2 + P )(Γ1/2 Γ1/2)(A Id). Putting these back to (53), one has cov vec(x(x ) ) = cov(g) = (A Id)(Γ1/2 Γ1/2)(Id2 + P )(Γ1/2 Γ1/2)(A Id) + W Γ, which is equal to Σ(k) + f W (k) eΓ(k) if we return to the original notation of e A(k), f W (k), eΓ(k). By a similar analysis, we can show that cov vec(z(z ) ) = Σ(ℓ) + f W (l) eΓ(l). Putting these back to (52) finishes our calculation of cov(f). Finally, it remains to prove (54). Denote u := vec(yy ) = y1y ... ydy Then E[u] = vec(Id), and thus E[u]E[u] = [eie j ]1 i,j d. Next, consider E[uu ] = E h [yiyjyy ]1 i,j d i . Learning Mixtures of Linear Dynamical Systems E y2 i ykyℓ = 3 if k = ℓ= i, 1 if k = ℓ = i, 0 if k = ℓ, and hence E[y2 i yy ] = Id + 2eie i . E yiyjykyℓ = ( 1 if k = i, ℓ= j or k = j, ℓ= i, 0 otherwise, and hence E[yiyjyy ] = eie j + eje i . Putting together, the (i, j)-th d d block of cov(u) = E[uu ] E[u]E[u] is equal to Id + eie i if i = j, and eje i if i = j. 
In other words, cov(vec(yy )) = cov(u) = Id2 + P , where P = [eje i ]1 i,j d is a symmetric permutation matrix; this completes our proof of (54). C.3.2. PROOF OF LEMMA C.5 First, it holds that Ui ( e Y (k))i ( e Y (l))i 2 ( e Y (k))i ( e Y (l))i 2 2 = e Y (k) e Y (l) 2 F, which gives the upper bound on E[stat] = µk,l 2 2. For the lower bound, the triangle inequality tells us that Ui ( e Y (k))i ( e Y (l))i 2 = Ui Ui ( e Y (k))i Ui Ui ( e Y (l))i 2 max n ( e Y (k))i ( e Y (l))i 2 2ϵ, 0 o , which implies that Ui ( e Y (k))i ( e Y (l))i 2 2 ( e Y (k))i ( e Y (l))i 2 2 4ϵ ( e Y (k))i ( e Y (l))i 2, Ui ( e Y (k))i ( e Y (l))i 2 i=1 ( e Y (k))i ( e Y (l))i 2 2 4ϵ i=1 ( e Y (k))i ( e Y (l))i 2 = e Y (k) e Y (l) 2 F 4ϵ i=1 ( e Y (k))i ( e Y (l))i 2. It remains to upper bound Σk,l. Recall the definition Σk,l = U Σ(k) + f W (k) eΓ(k) + Σ(ℓ) + f W (l) eΓ(l) U. We will utilize the following basic facts: (1) for square matrices A and B with eigenvalues {λi} and {µj} respectively, their Kronecker product A B has eigenvalues {λiµj}; (2) For matrices A, B, C, D of compatible shapes, it holds that (A B)(C D) = (AC) (BD). These imply that 0 f W (k) eΓ(k) f W (k) eΓ(k) Id2 WmaxΓmax Id2, 0 Σ(k) = e A(k) Id (eΓ(k))1/2 (eΓ(k))1/2 Id2 + P (eΓ(k))1/2 (eΓ(k))1/2 ( e A(k)) Id 2 e A(k) Id (eΓ(k))1/2 (eΓ(k))1/2 (eΓ(k))1/2 (eΓ(k))1/2 ( e A(k)) Id Learning Mixtures of Linear Dynamical Systems = 2 e A(k) Id eΓ(k) eΓ(k) ( e A(k)) Id 2Γ2 max e A(k) Id ( e A(k)) Id = 2Γ2 max e A(k)( e A(k)) Id 2Γ2 maxκ2 AId2. Using the conditions Wmax Γmax and κA 1, we have Σ(k) + f W (k) eΓ(k) (WmaxΓmax + 2Γ2 maxκ2 A)Id2 3Γ2 maxκ2 AId2. We can upper bound Σ(ℓ) + f W (l) eΓ(l) by the same analysis. As a result, Σk,l U (6Γ2 maxκ2 AId2)U = 6Γ2 maxκ2 AU U = 6Γ2 maxκ2 AId K, which finishes the proof of Lemma C.5. C.3.3. PROOF OF (48) Let ϵ be the right-hand side of the condition (25) on the subspaces. Then, applying the second point of Lemma C.5 (with e Y (k) = Y (k), e Y (l) = Y (ℓ)) tells us that U vec (Y (k) Y (ℓ)) 2 2 Y (k) Y (ℓ) 2 F 4ϵ (Y (k))i (Y (ℓ))i 2 Y (k) Y (ℓ) 2 F 4ϵ d Y (k) Y (ℓ) F, where the last line follows from the Cauchy-Schwarz inequality: (Y (k))i (Y (ℓ))i 2 (Y (k))i (Y (ℓ))i 2 2 = d Y (k) Y (ℓ) F. We can lower bound V vec((Γ(k) Γ(ℓ)) ) 2 2 similarly. Putting pieces together, we have V vec (Γ(k) Γ(ℓ)) 2 2 + U vec (Y (k) Y (ℓ)) 2 2 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F 4ϵ d Γ(k) Γ(ℓ) F + Y (k) Y (ℓ) F Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F , where the last inequality is due to the assumption ϵ Γ,Y / d. This completes our proof of (48). C.4. Proof of Theorem B.3 It suffices to prove the error bounds for one specific value of k, and then take the union bound over 1 k K. For notational convenience, in this proof we rewrite T (k) total, A(k), b A(k), W (k), c W (k) as T, A, b A, W , c W , respectively. We will investigate the close-form solution b A, and prepare ourselves with a self-normalized concentration bound; this will be helpful in finally proving the error bounds for b A A and c W W . Step 1: preparation. Recall the least-squares solution 0 t Tm 1 xm,t+1xm,t X 0 t Tm 1 xm,txm,t 1 . Using the notation ... xm,t ... 0 t Tm 1,m Ck RT d, X+ := ... xm,t+1 ... ... wm,t ... Learning Mixtures of Linear Dynamical Systems we have X + = AX + N , b A = X +X(X X) 1 = A + N X(X X) 1, namely A := b A A = N X(X X) 1. (55) We will utilize the following matrix form of self-normalized concentration; see Lemma C.4 of (Kakade et al., 2020). Lemma C.7 (Self-normalized concentration, matrix form). 
Consider filtrations {Ft}, and random vectors {xt, zt} satisfying xt Ft 1 and zt|Ft 1 N(0, Id). Let V Rd d be a fixed, symmetric, positive definite matrix, and denote V T := PT t=1 xtxt + V . Then with probability at least 1 δ, (V T ) 1/2 T X δ + log det(V T ) Step 2: estimation error of b A. Let us rewrite (55) as A = (X X) 1/2 (X X) 1/2X N. (56) It is obvious that X X plays a crucial role. Recall from Lemma C.1 that with probability at least 1 δ, xm,t 2 Dvec for some Dvec Γmax poly(d, κA, log(Ttotal/δ)). Then trivially we have the upper bound X X D2 vec T Id =: Vup. For a lower bound, we claim (and prove later) that with probability at least 1 δ, 5T W =: Vlb, provided that T κ2 wd log Γmax Now we are ready to control b A A = A . From (56), we have A (X X) 1/2 (X X) 1/2X N . First, the lower bound (57) on X X tells us that (X X) 1/2 1/ p T λmin(W ). Moreover, applying Lemma C.7 with V = Vlb, one has with probability at least 1 δ, (X X) 1/2X N (X X + V ) 1/2X N W 1/2 δ + log det(Vup + Vlb) Putting these together, we have T λmin(W ) p which proves our upper bound for b A A in the theorem. Step 3: estimation error of c W . By the definition of b wm,t = xm,t+1 b Axm,t = wm,t Axm,t, we have ... b w m,t ... c N = N AX = N N X(X X) 1X = N IT X(X X) 1X . Learning Mixtures of Linear Dynamical Systems Notice that IT X(X X) 1X is a symmetric projection matrix. Therefore, T c N c N = 1 T N IT X(X X) 1X N, and thus c W W = 1 T N X(X X) 1X N. (59) The first term on the right-hand side can be controlled by a standard result for covariance estimation (see Proposition D.1): with probability at least 1 δ, 1 As for the second term, we rewrite it as 1 T N X(X X) 1X N = 1 (X X) 1/2X N (X X) 1/2X N . Applying our earlier self-normalized concentration bound (58), we have T N X(X X) 1X N W d log( Γmax Wmin dκATtotal Putting these back to (59), we have T N N W + 1 T N X(X X) 1X N δ T + W d log( Γmax Wmin dκATtotal d log( Γmax Wmin dκATtotal where the last inequality uses T d log( Γmax Wmin dκATtotal δ ). This finishes our proof of Theorem B.3. C.4.1. PROOF OF (57) We start with the following decomposition: 0 t Tm 1 xm,txm,t X 1 t Tm 1 xm,txm,t 0 t Tm 2 xm,t+1xm,t+1 = X 0 t Tm 2 (Axm,t + wm,t)(Axm,t + wm,t) 0 t Tm 2 wm,twm,t Axm,txm,t A + Axm,twm,t + wm,txm,t A | {z } =:Q = P + Q. (60) Lower bound for P . By Proposition D.1, we have with probability at least 1 δ, 1 T |Ck|P W W δ T λmin(W ) Id, provided that T κ2 w(d + log 1 As a result, we have 1 T |Ck|P 1 2W , which implies P 1 Learning Mixtures of Linear Dynamical Systems Lower bound for Q. Let Nϵ be an ϵ-net of the unit sphere Sd 1 (where the value of 0 < ϵ < 1 will be specified later), and let π be the projection onto Nϵ. Recall the standard result that |Nϵ| (9/ϵ)d. Moreover, for any v Sd 1, denote v := v π(v), which satisfies v 2 ϵ. Then we have λmin(Q) = inf v Sd 1 v Qv = inf v Sd 1(π(v) + v) Q(π(v) + v) inf v Sd 1 π(v) Qπ(v) (2ϵ + ϵ2) Q inf v Nϵ v Qv 3ϵ Q . (61) For Q , we simply use a crude upper bound, based on the boundedness of { xm,t 2} (Lemma C.1): with probability at least 1 δ, one has Axm,t 2 2 + 2 Axm,t 2 wm,t 2 Γmax poly κA, d, Ttotal, 1 Next, we lower bound infv Nϵ v Qv. First, consider a fixed v Nϵ; denoting ym,t := Axm,t and um,t := W 1/2wm,t N(0, Id), we have 0 t Tm 2 (v ym,t)2 + 2 X 0 t Tm 2 v ym,t u m,t W 1/2v, where u m,t W 1/2v N(0, v W v). 
Lemma D.2 (the scalar version of self-normalized concentration) tells us that, for any fixed λ > 0, with probability at least 1 δ, X 0 t Tm 2 v ym,t u m,t W 1/2v 0 t Tm 2 (v ym,t)2 + 1 0 t Tm 2 (v ym,t)2 + 1 Replacing δ with δ/(9/ϵ)d and taking the union bound, we have with probability at least 1 δ, for any v Nϵ, 0 t Tm 2 (v ym,t)2 p 0 t Tm 2 (v ym,t)2 + 2 0 t Tm 2 (v ym,t)2 p With the choice of λ = 1/ p W , this implies v Qv 2 W (d log 9 δ ), for all v Nϵ. (63) Putting (62) and (63) back to (61), we have with probability at least 1 δ, λmin(Q) inf v Nϵ v Qv 3ϵ Q C0 δ + ϵ Γmax poly κA, d, Ttotal, 1 for some universal constant C0 > 0. Putting things together. Recall the decomposition X X P + Q in (60). We have already shown that if T κ2 w(d + log 1 δ ), then with probability at least 1 δ, X X P + Q 1 δ + ϵ Γmax poly κA, d, Ttotal, 1 It is easy to check that, if we further choose ϵ 1 poly(κA, d, Ttotal, 1 Wmin ), T κ2 wd log Γmax then we have X X 1 5T W , which finishes the proof of (57). Learning Mixtures of Linear Dynamical Systems C.5. Proof of Theorem B.4 In this proof, we show the correct classification of one short trajectory {xm,t}0 t Tm (with true label km) for some m Mclassification; then it suffices to take the union bound to prove the correct classification of all trajectories in Mclassification. For notational simplicity, we drop the subscript m and rewrite {xm,t}, Tm, km as {xt}, T, k, respectively. The basic idea of this proof is to show that L( b A(ℓ), c W (ℓ)) > L( b A(k), c W (k)) (64) for any incorrect label ℓ = k, where L is the loss function defined in (23) and used by Algorithm 5 for classification. Our proof below is simply a sequence of arguments that finally transform (64) into a sufficient condition in terms of the coarse model errors ϵA, ϵW and short trajectory length T. Before we proceed, we record a few basic facts that will be useful later. First, the assumption c W (k) W (k) ϵW 0.1Wmin implies that λmin(c W (k)) 0.9λmin(W (k)), and c W (k) is well conditioned with κ(c W (k)) κ(W (k)) κw. Morever, by Lemma C.1, with probability at least 1 δ, we have for all 0 t T, In Case 0, xt 2 Dx q Γmax(d + log Ttotal δ ), provided that T 1; In Case 1, xt 2 Dx κA q Γmax(d + log Ttotal δ ), provided that T 1 1 ρ log(2κA). Now we are ready to state our proof. Throughout our analyses, we will make some intermediate claims, whose proofs will be deferred to the end of this subsection. Step 1: a sufficient condition for correct classification. In the following, we prove that for a fixed ℓ = k, the condition (64) holds with high probability; at the end of the proof, we simply take the union bound over ℓ = k. Using xt+1 = A(k)xt + wt where wt N(0, W (k)), we can rewrite the loss function L as L(A, W ) = T log det(W ) + t=0 wt W 1wt t=0 xt (A(k) A) W 1(A(k) A)xt + 2 t=0 wt W 1(A(k) A)xt. After some basic calculation, (64) can be equivalently written as L( b A(ℓ), c W (ℓ)) L( b A(k), c W (k)) = (A) + (B) (C) > 0, where (A) := T log det(c W (ℓ)) log det(c W (k)) + t=0 wt (c W (ℓ)) 1 (c W (k)) 1 wt (B) := T 1 X t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1(A(k) b A(ℓ))xt + 2 t=0 wt (c W (ℓ)) 1(A(k) b A(ℓ))xt (C) := T 1 X t=0 xt (A(k) b A(k)) (c W (k)) 1(A(k) b A(k))xt + 2 t=0 wt (c W (k)) 1(A(k) b A(k))xt Step 2: a lower bound for (A) + (B) (C). Intuitively, we expect that (A) + (B) should be large because the LDS models (A(k), W (k)) and (A(ℓ), W (ℓ)) are well separated, while (C) should be small if ( b A(k), c W (k)) (A(k), W (k)). 
More formally, we claim that the following holds for some universal constants C1, C2, C3 > 0: With probability at least 1 δ, (A) T log det(c W (ℓ)) log det(c W (k)) + Tr (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 T (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F log 1 Learning Mixtures of Linear Dynamical Systems With probability at least 1 δ, T A(k) b A(ℓ) 2 F κw,cross κwκw,cross log 1 provided that T κ2 w log2(1/δ); With probability at least 1 δ, T D2 x A(k) b A(k) 2 Wmin + κw log 1 provided that T 1 (under Case 0) or T 1 1 ρ log(2κA) (under Case 1). Putting these together, we have with probability at least 1 δ, (A) + (B) (C) C4 T log det(c W (ℓ)) det(c W (k)) + Tr W (k) (c W (ℓ)) 1 (c W (k)) 1 + A(k) b A(ℓ) 2 F κw,cross D2 x A(k) b A(k) 2 T (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F + κwκw,cross for some universal constants C4, C5 > 0, provided that T κ2 w log2 1 δ (under Case 0), or T κ2 w log2 1 δ + 1 1 ρ log(2κA) (under Case 1). Now we have a lower bound of (A) + (B) (C) as an order-T term minus a low-order term. Therefore, to show (A) + (B) (C) > 0, it suffices to prove that (a) the leading factor of order-T term is positive and large, and (b) the low-order term is negligible compared with the order-T term. More specifically, under the assumption that the coarse models { b A(j), c W (j)} satisfy b A(j) A(j) ϵA, c W (j) W (j) ϵW 0.1Wmin for all 1 j K, we make the following claims: (Order-T term is large.) Define the leading factor b Dk,ℓof the order-T term and a related parameter Dk,ℓas follows: b Dk,ℓ:= log det(c W (ℓ)) det(c W (k)) + Tr W (k) (c W (ℓ)) 1 (c W (k)) 1 + A(k) b A(ℓ) 2 F κw,cross D2 x A(k) b A(k) 2 Dk,ℓ:= 1 κw,cross W (k) W (ℓ) 2 F W 2max + A(k) A(ℓ) 2 F 2 A,W κw,cross , (68) where the last inequality follows from Assumption 3.1. They are related in the sense that Dk,ℓ e Dk,ℓ:= log det(W (ℓ)) det(W (k)) + Tr W (k) (W (ℓ)) 1 (W (k)) 1 + A(k) A(ℓ) 2 F κw,cross , (69) where e Dk,ℓis defined in the same way as b Dk,ℓ, except that the coarse models are replaced with the accurate ones; moreover, we have Dk,ℓ b Dk,ℓ, provided that ϵA D2x , ϵW Wmin (Low-order term is negligible.) With (70) in place, we have (A) + (B) (C) C6T Dk,ℓ C5 T (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F + κwκw,cross We claim that if T κ5 w,cross Dk,ℓ + 1 log2 1 δ , then (A) + (B) (C) C7T Dk,ℓ> 0. (72) Learning Mixtures of Linear Dynamical Systems Step 3: putting things together. So far, we have proved that for a short trajectory generated by (A(k), W (k)) and for a fixed ℓ = k, it holds with probability at least 1 δ that L( b A(ℓ), c W (ℓ)) > L( b A(k), c W (k)), provided that D2x , ϵW Wmin min n 1, κ2 w + κ5 w,cross Dk,ℓ δ for Case 0, κ2 w + κ5 w,cross Dk,ℓ δ + 1 1 ρ log(2κA) for Case 1. Plugging in the relation Dk,ℓ 2 A,W /κw,cross and Dx q Γmax(d + log Ttotal δ ) (for Case 0) or Dx Γmax(d + log Ttotal δ ) (for Case 1), the above conditions become For Case 0: ϵA Wmin 2 A,W Γmaxκw,cross(d + log Ttotal δ ), ϵW Wmin min n 1, A,W p T κ2 w + κ6 w,cross 2 A,W For Case 1: ϵA Wmin 2 A,W Γmaxκw,crossκ2 A(d + log Ttotal δ ), ϵW Wmin min n 1, A,W p T κ2 w + κ6 w,cross 2 A,W δ + 1 1 ρ log(2κA). Finally, taking the union bound over all ℓ = k as well as over all trajectories in Mclassification finishes the proof of Theorem B.4. C.5.1. PROOF OF (65). Since wt i.i.d. N(0, W (k)), we have t=0 wt (c W (ℓ)) 1 (c W (k)) 1 wt = z Mz, where z N(0, IT d), and M RT d T d is a block-diagonal matrix with Q := (W (k))1/2((c W (ℓ)) 1 (c W (k)) 1)(W (k))1/2 Rd d as its diagonal blocks. 
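As a quick numerical sanity check of the concentration that the Hanson–Wright inequality (invoked next) guarantees for this block-diagonal quadratic form, one can verify by simulation that z^⊤ M z fluctuates around T · Tr(Q) at scale on the order of √T ‖Q‖_F. A minimal sketch with an arbitrary symmetric Q (all choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, trials = 6, 200, 2000
B = rng.standard_normal((d, d))
Q = (B + B.T) / 2                     # an arbitrary symmetric d x d diagonal block

# Since M = diag(Q, ..., Q), we have z^T M z = sum_{t=1}^T z_t^T Q z_t with z_t ~ N(0, I_d).
samples = np.array([
    sum(z @ Q @ z for z in rng.standard_normal((T, d)))
    for _ in range(trials)
])
print("empirical mean:", samples.mean(), " vs  T * Tr(Q) =", T * np.trace(Q))
# The empirical standard deviation is of the same order as sqrt(T) * ||Q||_F:
print("empirical std :", samples.std(), " vs  sqrt(T) * ||Q||_F =", np.sqrt(T) * np.linalg.norm(Q))
```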
Therefore, by the Hanson-Wright inequality (Theorem 6.2.1 of (Vershynin, 2018)), we have P z Mz E[z Mz] u 2 exp c min n u2 M 2 F , u M where M 2 F = T Q 2 F, M = Q , and E[z Mz] = Tr(M) = T Tr(Q). Choosing u T Q F log 1 δ, we have with probability at least 1 δ, |z Mz T Tr(Q)| C1 T Q F log 1 δ , which immediately leads to our lower bound (65) for (A). C.5.2. PROOF OF (66). Denote ut = (W (k)) 1/2wt N(0, Id) and yt = (W (k))1/2(c W (ℓ)) 1(A(k) b A(ℓ))xt. Then we have t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1(A(k) b A(ℓ))xt + 2 By Lemma D.2, we have with probability at least 1 δ, PT 1 t=0 ut yt ( λ 2 PT 1 t=0 yt 2 2 + 1 δ ) for any fixed λ > 0. This implies that t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1(A(k) b A(ℓ))xt λ t=0 yt yt 2 Learning Mixtures of Linear Dynamical Systems t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1 λ (c W (ℓ)) 1W (k)(c W (ℓ)) 1 (A(k) b A(ℓ))xt 2 Choosing λ = 0.05/(κwκw,cross), we have λ (c W (ℓ)) 1W (k)(c W (ℓ)) 1 0.1(c W (ℓ)) 1, and thus t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1(A(k) b A(ℓ))xt 40κwκw,cross log 2 0.9λmin (c W (ℓ)) 1 T 1 X t=0 xt (A(k) b A(ℓ)) (A(k) b A(ℓ))xt 40κwκw,cross log 2 Now it remains to lower bound PT 1 t=0 xt xt, where := A A and A := A(k) b A(ℓ). Since 0, we have t=0 xt+1 xt+1 = t=0 (A(k)xt + wt) (A(k)xt + wt) t=0 xt A(k) A(k)xt + 2 t=0 wt A(k)xt | {z } (ii) We can lower bound (i) by the Hanson-Wright inequality, similar to our previous proof of (65); the result is that, with probability at least 1 δ, one has (i) T Tr (W (k))1/2 (W (k))1/2 C0 T (W (k))1/2 (W (k))1/2 F log 1 To lower bound (ii), we apply Lemma D.2, which shows that with probability at least 1 δ, t=0 wt A(k)xt t=0 xt (A(k)) W (k) A(k)xt + 1 for any fixed λ > 0, hence t=0 xt A(k) ( λ W (k) )A(k)xt 2 Recall = A A, and thus λ W (k) = A(Id λ AW (k) A) A 0 if we choose λ = 1/( W (k) A 2) = 1/( W (k) ); this implies that δ = 2 W (k) log 2 Putting these back to (74), we have t=0 xt xt (i) + (ii) T Tr (W (k))1/2 (W (k))1/2 C0 T (W (k))1/2 (W (k))1/2 F log 1 δ 2 W (k) log 2 T λmin(W (k))Tr( A A) C0 T W (k) A A F log 1 δ 2 W (k) A A log 2 T λmin(W (k)) A(k) b A(ℓ) 2 F 1 C0 T κ(W (k)) log 1 T κ(W (k)) log 2 0.9T λmin(W (k)) A(k) b A(ℓ) 2 F, Learning Mixtures of Linear Dynamical Systems where the last inequality holds if T κ2 w log2 1 Going back to (73), we have (B) 0.9λmin((c W (ℓ)) 1) t=0 xt xt 40κwκw,cross log 2 0.81T λmin((c W (ℓ)) 1)λmin(W (k)) A(k) b A(ℓ) 2 F 40κwκw,cross log 2 0.7T A(k) b A(ℓ) 2 F κw,cross 40κwκw,cross log 2 which finishes the proof of (66). C.5.3. PROOF OF (67). Denote = A(k) b A(k), ut = (W (k)) 1/2wt N(0, Id) and yt = (W (k))1/2(c W (k)) 1 xt. Then one has t=0 xt (c W (k)) 1 xt + 2 t=0 wt (c W (k)) 1 xt = t=0 xt (c W (k)) 1 xt + 2 By Lemma D.2, we have with probability at least 1 δ, PT 1 t=0 ut yt λ 2 PT 1 t=0 yt 2 2 + 1 δ for any fixed λ > 0. This implies that t=0 xt (c W (k)) 1 xt + λ t=0 yt yt + 2 t=0 xt (c W (k)) 1 xt + λ t=0 xt (c W (k)) 1W (k)(c W (k)) 1 xt + 2 t=0 xt (c W (k)) 1 + λ (c W (k)) 1W (k)(c W (k)) 1 xt + 2 1.5 1 λmin(W (k)) + λ W (k) λmin(W (k))2 t=0 xt xt + 2 Choosing λ 1/κw and recalling xt 2 Dx, we have (C) 1 Wmin T D2 x 2 + κw log 1 δ , which finishes the proof of (67). C.5.4. PROOF OF (69). Denote := W (k) W (ℓ). 
Then the right-hand side of (69) becomes e Dk,ℓ= log det W (ℓ) log det W (k) + Tr W (k)(W (ℓ)) 1 Id + A(k) A(ℓ) 2 F κw,cross = log det W (ℓ) log det(W (ℓ) + ) + Tr (W (ℓ) + )(W (ℓ)) 1 Id + A(k) A(ℓ) 2 F κw,cross = log det W (ℓ) log det (W (ℓ))1/2 Id + (W (ℓ)) 1/2 (W (ℓ)) 1/2 (W (ℓ))1/2 + Tr (W (ℓ)) 1/2 (W (ℓ)) 1/2 + A(k) A(ℓ) 2 F κw,cross = Tr(X) log det(Id + X) + A(k) A(ℓ) 2 F κw,cross , where we define X := (W (ℓ)) 1/2 (W (ℓ)) 1/2. Notice that X is symmetric and satisfies X + Id = (W (ℓ)) 1/2W (k)(W (ℓ)) 1/2 0, Learning Mixtures of Linear Dynamical Systems X (W (ℓ)) 1/2 2 W (k) W (ℓ) 2Wmax Wmin = 2κw,cross, X 2 F = (W (ℓ)) 1/2 (W (ℓ)) 1/2 2 F 2 F W 2max . Therefore, by Lemma D.3, we have Tr(X) log det(Id + X) X 2 F 6κw,cross W (k) W (ℓ) 2 F 6κw,cross W 2max , e Dk,ℓ= Tr(X) log det(Id + X) + A(k) A(ℓ) 2 F κw,cross W (k) W (ℓ) 2 F 6κw,cross W 2max + A(k) A(ℓ) 2 F κw,cross Dk,ℓ, which finishes the proof of (69). C.5.5. PROOF OF (70). Recall the definition b Dk,ℓ= log det(c W (ℓ)) det(c W (k)) + Tr W (k) (c W (ℓ)) 1 (c W (k)) 1 + A(k) b A(ℓ) 2 F κw,cross D2 x A(k) b A(k) 2 First, we have log det(c W (ℓ)) det(c W (k)) + Tr W (k) (c W (ℓ)) 1 (c W (k)) 1 = log det(c W (ℓ)) det(W (k)) + Tr W (k) (c W (ℓ)) 1 (W (k)) 1 log det(c W (k)) det(W (k)) + Tr W (k) (c W (k)) 1 (W (k)) 1 | {z } (ii) We can lower bound (i) by the same idea of our earlier proof for (69), except that we replace W (ℓ) in that proof with c W (ℓ); this gives us (i) W (k) c W (ℓ) 2 F κw,cross W 2max . As for (ii), applying Lemma D.3 with X = (W (k)) 1/2(c W (k) W (k))(W (k)) 1/2, one has (ii) = Tr(X) log det(X + Id) X 2 F ϵW 2d Putting things together, we have b Dk,ℓ= (i) (ii) + A(k) b A(ℓ) 2 F κw,cross D2 x A(k) b A(k) 2 C1 1 κw,cross W (k) c W (ℓ) 2 F W 2max + A(k) b A(ℓ) 2 F | {z } (iii) W 2 min + D2 xϵA2 | {z } (iv) If ϵW , ϵA satisfy (70), then (iv) c0Dk,ℓfor some sufficiently small constant c0 > 0. As for (iii), according to the definition of Dk,ℓin (68), there are two possible cases: Learning Mixtures of Linear Dynamical Systems If 1 κw,cross W (k) W (ℓ) 2 F W 2 max Dk,ℓ 2 , then it is easy to check that (70) implies that 4 W (k) W (ℓ) F (iii) 1 κw,cross W (k) c W (ℓ) 2 F W 2max 1 κw,cross W (k) W (ℓ) F W (k) W (ℓ) 2 F W 2max Dk,ℓ. On the other hand, if 1 κw,cross A(k) A(ℓ) 2 F Dk,ℓ 2 , then one can check that (70) implies that 4 A(k) A(ℓ) F, (iii) A(k) b A(ℓ) 2 F κw,cross ( A(k) A(ℓ) F ϵA κw,cross A(k) A(ℓ) 2 F κw,cross Dk,ℓ. In sum, it is always guaranteed that (iii) Dk,ℓ. Going back to (75), we claim that b Dk,ℓ (iii) (iv) Dk,ℓas long as ϵW and ϵA satisfy (70), which finishes the proof of (70). C.5.6. PROOF OF (72). Recall the lower bound (71) for (A) + (B) (C). We want to show that the low-order term is dominated by the order-T term, namely T Dk,ℓ. First, if T κwκw,cross δ , then κwκw,cross log 1 δ T Dk,ℓ. 
Next, we have (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F W (k) (c W (ℓ)) 1 (c W (k)) 1 F = W (k) (c W (ℓ)) 1(c W (k) c W (ℓ))(c W (k)) 1 F W (k) (c W (ℓ)) 1 (c W (k)) 1 W (k) W (ℓ) F + 2 W (k) W (ℓ) F + 2 dϵW = 2κw,cross W (k) W (ℓ) F + 2 Notice that the definition (68) of Dk,ℓimplies that W (k) W (ℓ) F p κw,cross W 2max Dk,ℓ, and thus (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F κw,cross κw,cross W 2max Dk,ℓ+ Now it is easy to checked that if T κ5 w,cross Dk,ℓ + κw,cross dϵW Wmin Dk,ℓ T (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F log 1 Due to our assumption (70) on ϵW , we have ( κw,cross dϵW Wmin Dk,ℓ )2 κ2 w,cross Dk,ℓ κ5 w,cross Dk,ℓ, and thus it suffices to have T ( κ5 w,cross Dk,ℓ+ 1) log2 1 δ , which finishes the proof of (72). Learning Mixtures of Linear Dynamical Systems D. Miscellaneous results Proposition D.1. Consider at, bt i.i.d. N(0, Id), 1 t N, where N d + log(1/δ). Then with probability at least 1 δ, 1 N t=1 ata t Id The proof follows from a standard covering argument (cf. (Vershynin, 2018)), which we skip for brevity. Lemma D.2 (Self-normalized concentration, scalar version). Suppose that random vectors {ut, yt}1 t T and filtrations {Ft}0 t T 1 satisfy Ft = σ(ui, 1 i t), yt Ft 1, and ut|Ft 1 N(0, Id). Then for any fixed λ > 0, with probability at least 1 δ, we have Proof. Using basic properties of the Gaussian distribution, we have for all fixed λ R, = E exp T 1 X 2λ2 yt 2 2 E h exp λ u T y T 1 2λ2 y T 2 2 |FT 1 i E exp T 1 X 2λ2 yt 2 2 . Continuing this expansion leads to the result E[exp(PT t=1(λ ut yt 1 2λ2 yt 2 2))] 1. Now, letting z = log(2/δ) and using Markov s inequality, we have 2λ2 yt 2 2 z = P exp T X 2λ2 yt 2 2 exp(z) exp( z) E exp T X 2λ2 yt 2 2 exp( z) = δ In other words, with probability at least 1 δ/2, we have PT t=1(λ ut yt 1 2λ2 yt 2 2) < z = log(2/δ), which implies PT t=1 ut yt < λ 2 PT t=1 yt 2 2 + 1 δ if λ > 0. By a similar argument (but with λ replaced by λ), we have with probability at least 1 δ/2, PT t=1( λ ut yt 1 2λ2 yt 2 2) < log(2/δ), which implies PT t=1 ut yt > ( λ 2 PT t=1 yt 2 2 + 1 δ ) if λ > 0. Finally, taking the union bound finishes the proof of the lemma. Lemma D.3. Consider a symmetric matrix X Rd d that satisfies Id + X 0. If X B for some B 2, then we have Tr(X) log det(Id + X) X 2 F 3B . On the other hand, if X 1 2Id, then we have Tr(X) log det(Id + X) X 2 F. Proof. Denote {λi}1 i d as the eigenvalues of X, which satisfies λi > 1 for all 1 i d. Then Tr(X) log det(Id + X) = i=1 (1 + λi) = λi log(1 + λi) . Learning Mixtures of Linear Dynamical Systems It can be checked (via elementary calculus) that, for all 1 < λ B where B 2, it holds that λ log(1 + λ) λ2/(3B). Therefore, Tr(X) log det(Id + X) λ2 i 3B = X 2 F 3B , which completes the proof of our first claim. Similarly, it can be checked that, for all λ 1/2, one has λ log(1+λ) λ2, which implies that Tr(X) log det(Id + X) i=1 λ2 i = X 2 F; this completes the proof of our second claim. Fact D.4. In the setting of Section 3.1, it holds that Γmax Wmaxκ2 A/(1 ρ) and Γ,Y A,W Wmax Wmin/(4κ2 AΓmax). Proof. First, consider Γ = P i=0 Ai W (Ai) , where W Wmax and Ai κAρi. Then we have Γ P i=0 Ai 2 W Wmaxκ2 A P i=0 ρ2i Wmaxκ2 A/(1 ρ), which proves our upper bound for Γmax. Next, let us turn to Γ,Y . Our proof below can be viewed as a quantitative version of the earlier proof for Fact 2.1. 
We will show that, if the autocovariance matrices of the k-th and the ℓ-th models are close (in Frobenius norm), then the models themselves must also be close; our lower bound on Δ_{Γ,Y} in terms of Δ_{A,W} then follows by contraposition.

Consider two LDS models (A^(k), W^(k)) ≠ (A^(ℓ), W^(ℓ)) and their autocovariance matrices (Γ^(k), Y^(k)), (Γ^(ℓ), Y^(ℓ)). Recall from (17) that A^(k) = Y^(k) (Γ^(k))^{-1} and W^(k) = Γ^(k) − A^(k) Γ^(k) (A^(k))^⊤ = Γ^(k) − A^(k) (Y^(k))^⊤; A^(ℓ) and W^(ℓ) can be expressed similarly.

First, regarding A^(k) − A^(ℓ), one has
A^(k) − A^(ℓ) = Y^(k)(Γ^(k))^{-1} − Y^(ℓ)(Γ^(ℓ))^{-1} = (Y^(k) − Y^(ℓ))(Γ^(k))^{-1} + Y^(ℓ)((Γ^(k))^{-1} − (Γ^(ℓ))^{-1}),
where
Y^(ℓ)((Γ^(k))^{-1} − (Γ^(ℓ))^{-1}) = Y^(ℓ)(Γ^(ℓ))^{-1}(Γ^(ℓ) − Γ^(k))(Γ^(k))^{-1} = A^(ℓ)(Γ^(ℓ) − Γ^(k))(Γ^(k))^{-1}.
Therefore, we have
‖A^(k) − A^(ℓ)‖_F ≤ ‖Y^(k) − Y^(ℓ)‖_F ‖(Γ^(k))^{-1}‖ + ‖A^(ℓ)‖ ‖Γ^(ℓ) − Γ^(k)‖_F ‖(Γ^(k))^{-1}‖
≤ (1/W_min) ‖Y^(k) − Y^(ℓ)‖_F + (κ_A/W_min) ‖Γ^(ℓ) − Γ^(k)‖_F,   (76)
where the last line follows from Γ^(k) ⪰ W^(k) ⪰ W_min I_d and ‖A^(ℓ)‖ ≤ κ_A ρ ≤ κ_A.

Next, we turn to the analysis of W^(k) − W^(ℓ), which satisfies
W^(k) − W^(ℓ) = (Γ^(k) − Γ^(ℓ)) − (A^(k)(Y^(k))^⊤ − A^(ℓ)(Y^(ℓ))^⊤),
‖W^(k) − W^(ℓ)‖_F ≤ ‖Γ^(k) − Γ^(ℓ)‖_F + ‖A^(k)(Y^(k))^⊤ − A^(ℓ)(Y^(ℓ))^⊤‖_F.
Notice that
‖A^(k)(Y^(k))^⊤ − A^(ℓ)(Y^(ℓ))^⊤‖_F = ‖A^(k)(Y^(k) − Y^(ℓ))^⊤ + (A^(k) − A^(ℓ))(Y^(ℓ))^⊤‖_F
≤ ‖A^(k)(Y^(k) − Y^(ℓ))^⊤‖_F + ‖(A^(k) − A^(ℓ))(Y^(ℓ))^⊤‖_F
≤ κ_A ‖Y^(k) − Y^(ℓ)‖_F + κ_A Γ_max ‖A^(k) − A^(ℓ)‖_F,
where the last line is due to ‖Y^(ℓ)‖ = ‖A^(ℓ)Γ^(ℓ)‖ ≤ ‖A^(ℓ)‖ ‖Γ^(ℓ)‖ ≤ κ_A Γ_max. Therefore, we have
‖W^(k) − W^(ℓ)‖_F ≤ ‖Γ^(k) − Γ^(ℓ)‖_F + κ_A ‖Y^(k) − Y^(ℓ)‖_F + κ_A Γ_max ‖A^(k) − A^(ℓ)‖_F.   (77)

For notational simplicity, denote Δ_A := A^(k) − A^(ℓ), Δ_W := W^(k) − W^(ℓ), Δ_Γ := Γ^(k) − Γ^(ℓ), Δ_Y := Y^(k) − Y^(ℓ). From (77), one has
‖Δ_A‖²_F + ‖Δ_W‖²_F / W²_max ≤ ‖Δ_A‖²_F + (3/W²_max)(‖Δ_Γ‖²_F + κ²_A ‖Δ_Y‖²_F + κ²_A Γ²_max ‖Δ_A‖²_F)
= (3/W²_max) ‖Δ_Γ‖²_F + (3κ²_A/W²_max) ‖Δ_Y‖²_F + (1 + 3κ²_A Γ²_max/W²_max) ‖Δ_A‖²_F
≤ (3/W²_max) ‖Δ_Γ‖²_F + (3κ²_A/W²_max) ‖Δ_Y‖²_F + (4κ²_A Γ²_max/W²_max) ‖Δ_A‖²_F,
where the last line follows from κ²_A Γ²_max/W²_max ≥ 1. Moreover, (76) tells us that
‖Δ_A‖²_F ≤ (2/W²_min) ‖Δ_Y‖²_F + (2κ²_A/W²_min) ‖Δ_Γ‖²_F.
Putting these together, we have
‖Δ_A‖²_F + ‖Δ_W‖²_F / W²_max ≤ (3/W²_max) ‖Δ_Γ‖²_F + (3κ²_A/W²_max) ‖Δ_Y‖²_F + (4κ²_A Γ²_max/W²_max) [(2/W²_min) ‖Δ_Y‖²_F + (2κ²_A/W²_min) ‖Δ_Γ‖²_F]
= (3/W²_max + 8κ⁴_A Γ²_max/(W²_max W²_min)) ‖Δ_Γ‖²_F + (3κ²_A/W²_max + 8κ²_A Γ²_max/(W²_max W²_min)) ‖Δ_Y‖²_F
≤ (11κ⁴_A Γ²_max/(W²_max W²_min)) (‖Δ_Γ‖²_F + ‖Δ_Y‖²_F).

In sum, we have just shown that, if ‖Δ_Γ‖²_F + ‖Δ_Y‖²_F < Δ²_{Γ,Y}, then ‖Δ_A‖²_F + ‖Δ_W‖²_F/W²_max < (11κ⁴_A Γ²_max/(W²_max W²_min)) Δ²_{Γ,Y}. Equivalently (by contraposition), if ‖Δ_A‖²_F + ‖Δ_W‖²_F/W²_max ≥ Δ²_{A,W}, then ‖Δ_Γ‖²_F + ‖Δ_Y‖²_F ≥ (W²_max W²_min/(11κ⁴_A Γ²_max)) Δ²_{A,W}. This proves our lower bound on Δ_{Γ,Y} in terms of Δ_{A,W}.

Example D.5. It is well known that in our Case 1 (i.e. a single continuous trajectory), quick switching between multiple LDS models may lead to exponentially large states, even if each individual model is stable (Liberzon, 2003). We give a quick example for completeness. Consider
A^(k) = 0.99 [0, 2; 1/2, 0],   A^(ℓ) = 0.99 [0, 3; 1/3, 0],
both satisfying the stability condition in Assumption 3.1 with ρ = 0.99 < 1, and hence t_mix ≍ 1/(1 − ρ) = 100. Suppose that each short trajectory has a length of only 2, and the m-th (resp. (m+1)-th) trajectory has label k_m = ℓ (resp. k_{m+1} = k). Then x_{m+2,0} is equal to A^(k)A^(ℓ) x_{m,0} plus a mean-zero noise term, where A^(k)A^(ℓ) = 0.99² diag(2/3, 3/2) has spectral radius 0.99² · 3/2 > 1; this causes the exponential explosion of the states. (A short simulation of this switching scenario is sketched below.)
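The following minimal Python sketch simulates the switching scenario of Example D.5; the noise level, horizon, and variable names are illustrative choices, not taken from the paper. Each individual system is stable, yet alternating between the two models at every transition makes the state norm grow exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
A_k = 0.99 * np.array([[0.0, 2.0], [0.5, 0.0]])        # spectral radius 0.99 < 1
A_l = 0.99 * np.array([[0.0, 3.0], [1.0 / 3.0, 0.0]])  # spectral radius 0.99 < 1

# The two-step composition A_k @ A_l has spectral radius 0.99^2 * 3/2 > 1:
print(max(abs(np.linalg.eigvals(A_k @ A_l))))           # ~ 1.47

def simulate(choose_A, T=80):
    """Run x_{t+1} = A_t x_t + w_t with w_t ~ N(0, I_d), where A_t = choose_A(t)."""
    x = np.zeros(d)
    norms = []
    for t in range(T):
        x = choose_A(t) @ x + rng.standard_normal(d)
        norms.append(np.linalg.norm(x))
    return norms

single = simulate(lambda t: A_k)                           # one stable model: ||x_t|| stays bounded
switched = simulate(lambda t: A_l if t % 2 == 0 else A_k)  # label changes at every transition
print("final state norm, single model :", single[-1])
print("final state norm, switched     :", switched[-1])    # explodes, roughly like 1.47^(T/2)
```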
E. Extensions of Algorithm 1

Different trajectory lengths. Recall that in Section 2, we assume that all short trajectories within each subset of data M_o have the same length T_m = T_o. If this is not the case, we can easily modify our algorithms in the following ways:

For subspace estimation, the easiest way to handle different T_m's is to simply truncate the trajectories in M_subspace so that they all have the same length T_subspace = min_{m ∈ M_subspace} T_m, and then apply Algorithm 2 without modification. However, this might waste many samples when some trajectories in M_subspace are much longer than others; one way to resolve this is to manually divide the longer trajectories into shorter segments of comparable lengths before truncation. A more refined method is to modify Algorithm 2 itself, by re-defining the index sets Ω_1, Ω_2 separately for each trajectory; moreover, in the definitions of Ĥ_i and Ĝ_i, one might consider assigning larger weights to longer trajectories, instead of using the uniform weight 1/|M_subspace|.

For clustering (or pairwise testing) of M_clustering, we can handle varying T_m's similarly, by either truncating each pair of trajectories to the same length, or modifying Algorithm 3 itself (via re-defining the index sets {Ω_{g,1}, Ω_{g,2}}_{1 ≤ g ≤ G} separately for each trajectory).

Our methods for model estimation and classification, namely Algorithms 4 and 5, are already adaptive to different T_m's in M_clustering and M_classification, and hence need no modification.

Unknown parameters. Next, we show how to handle the case where certain parameters are unknown to the algorithms:

In Algorithm 2, we set the dimension of the output subspaces {V_i, U_i} to be K (the number of models). If K is unknown, we might instead examine the eigenvalues of Ĥ_i + Ĥ_i^⊤ and Ĝ_i + Ĝ_i^⊤, and pick the subspace dimension that covers most of the energy in the eigenvalues.

In Algorithm 3, we need to know the separation parameter Δ_{Γ,Y} (in order to choose the testing threshold τ appropriately) and the number of models K (for clustering). If either Δ_{Γ,Y} or K is unknown, we might instead try different values of the threshold τ, and pick the one that (after permutation) makes the similarity matrix S block-diagonal with as few blocks as possible.

F. Additional experiments

F.1. Synthetic experiments: clustering and classification

First, we take a closer look at the performance of our clustering method (Algorithm 3) through synthetic experiments. We set the parameters d = 40, K = 2, ρ = 0.5, δ = 0.12. The LDS models are generated by A^(k) = (ρ ± δ)R and W^(k) = I_d, where R is a random orthogonal matrix. We let |M_clustering| = 5d, and vary T_clustering ∈ [10, 60]. We run our clustering method on the dataset M_clustering, either with or without the assistance of subspace estimation (Algorithm 2) and dimensionality reduction. For the former case, we use the same dataset M_clustering for subspace estimation, without sample splitting, which is closer to practice; for the latter, we simply replace the subspaces {V_i, U_i} with I_d. The numerical results are illustrated in Figure 3 (left), confirming that (1) in both cases, the clustering error decreases as T_clustering increases, and (2) subspace estimation and dimensionality reduction significantly improve the clustering accuracy. (A sketch of how such a synthetic mixture can be generated is given below.)
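To make the setup above concrete, here is a minimal sketch of generating such a synthetic mixture, assuming the two models are A^(1) = (ρ + δ)R and A^(2) = (ρ − δ)R with W^(k) = I_d; all function names, the seed, and the trajectory counts are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Draw a Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs so the distribution is Haar

def make_mixture(d=40, rho=0.5, delta=0.12, M=200, T=30, seed=0):
    """Generate M unlabeled length-T trajectories from K = 2 LDS models
    A^(1) = (rho + delta) R, A^(2) = (rho - delta) R, with W^(k) = I_d."""
    rng = np.random.default_rng(seed)
    R = random_orthogonal(d, rng)
    models = [(rho + delta) * R, (rho - delta) * R]
    trajectories, labels = [], []
    for _ in range(M):
        k = rng.integers(len(models))          # latent label, hidden from the learner
        x = np.zeros(d)
        traj = [x]
        for _ in range(T):
            x = models[k] @ x + rng.standard_normal(d)   # w_t ~ N(0, I_d)
            traj.append(x)
        trajectories.append(np.stack(traj))    # shape (T + 1, d)
        labels.append(k)
    return trajectories, np.array(labels), models

trajs, labels, (A1, A2) = make_mixture()
print(len(trajs), trajs[0].shape)              # 200 trajectories, each of shape (31, 40)
```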
Next, we examine the performance of our classification method (Algorithm 5) in the same setting as above. We first obtain a coarse model estimate by running Stage 1 of Algorithm 1 on the dataset M_clustering, with |M_clustering| = 10d and T_clustering = 30. Then, we run classification on the dataset M_classification, with varying T_classification ∈ [4, 50]. The numerical results are included in Figure 3 (right), showing that the classification error rapidly decreases to zero as T_classification grows.

Figure 3. Left: mis-clustering rate versus T_clustering. Right: mis-classification rate versus T_classification.

F.2. Real-data experiments: Motion Sense

To show the practical relevance of the proposed algorithms, we work with the Motion Sense dataset (Malekzadeh et al., 2019). This dataset consists of multivariate time series of dimension d = 12, collected (at a rate of 50 Hz) by the accelerometer and gyroscope sensors of a mobile phone while a person performs various activities, such as "jogging", "walking", "sitting", and so on. In our experiments, we break the data into 8-second short trajectories, and treat the human activities as latent variables. Figure 4 (left) illustrates what the data looks like. Notice that the time series do not satisfy the mixing property assumed in our theory, but are rather periodic.

As a preliminary attempt to apply our algorithms in the real world, we show that the proposed clustering method (one of the most crucial steps in our overall approach), without any modification, works reasonably well even on this dataset. To be concrete, we apply Algorithm 3 (without dimensionality reduction, i.e. {V_i, U_i} are set to I_d) to a mixture of 12 "jogging" and 12 "walking" trajectories. Figure 4 (right) shows the resulting distance matrix, which is defined in the same way as Line 11 of Algorithm 3, but without thresholding. Its clear block structure confirms that, with an appropriate choice of the threshold τ, Algorithm 3 will return an accurate (or even exact) clustering of the mixed trajectories. These results are a strong indication that the proposed algorithms might generalize to much broader settings than what our current theory suggests, and we hope that this will inspire further extensions and applications of the proposed methods.

Figure 4. Left: examples of "jogging" and "walking" trajectories from the Motion Sense dataset. Right: the distance matrix constructed by Algorithm 3 for 12 "jogging" and 12 "walking" trajectories.
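For readers who wish to reproduce a qualitative version of Figure 4 (right), the following sketch computes a pairwise distance matrix between short trajectories from their empirical autocovariance statistics (Γ̂_m, Ŷ_m). This is a simplified stand-in for the statistic in Line 11 of Algorithm 3 (whose exact construction, including the index sets and normalization, differs); the Motion Sense loading step is omitted and the array shapes are assumptions.

```python
import numpy as np

def autocov_stats(traj):
    """Empirical lag-0 and lag-1 autocovariances of one trajectory of shape (T, d)."""
    x0, x1 = traj[:-1], traj[1:]
    gamma = x0.T @ x0 / len(x0)   # estimate of Gamma = E[x_t x_t^T]
    y = x1.T @ x0 / len(x0)       # estimate of Y     = E[x_{t+1} x_t^T]
    return gamma, y

def distance_matrix(trajectories):
    """Pairwise squared Frobenius distances between (Gamma_hat, Y_hat) of the trajectories."""
    stats = [autocov_stats(t) for t in trajectories]
    M = len(stats)
    D = np.zeros((M, M))
    for i in range(M):
        for j in range(i + 1, M):
            d_ij = np.linalg.norm(stats[i][0] - stats[j][0]) ** 2 \
                 + np.linalg.norm(stats[i][1] - stats[j][1]) ** 2
            D[i, j] = D[j, i] = d_ij
    return D

# Usage: `trajectories` is a list of (T_m, d) arrays, e.g. 12 jogging + 12 walking segments;
# trajectories sharing the same latent activity should form low-distance diagonal blocks.
# D = distance_matrix(trajectories)
```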