Learning Mixtures of Linear Dynamical Systems

Yanxi Chen, H. Vincent Poor

Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, USA. Correspondence to: Yanxi Chen.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We study the problem of learning a mixture of multiple linear dynamical systems (LDSs) from unlabeled short sample trajectories, each generated by one of the LDS models. Despite the wide applicability of mixture models for time-series data, learning algorithms that come with end-to-end performance guarantees are largely absent from existing literature. There are multiple sources of technical challenges, including but not limited to (1) the presence of latent variables (i.e. the unknown labels of trajectories); (2) the possibility that the sample trajectories might have lengths much smaller than the dimension $d$ of the LDS models; and (3) the complicated temporal dependence inherent to time-series data. To tackle these challenges, we develop a two-stage meta-algorithm, which is guaranteed to efficiently recover each ground-truth LDS model up to error $\widetilde{O}(\sqrt{d/T})$, where $T$ is the total sample size. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.

1. Introduction

Imagine that we are asked to learn multiple linear dynamical systems (LDSs) from a mixture of unlabeled sample trajectories: each sample trajectory is generated by one of the LDSs of interest, but we have no idea which system it is. To set the stage and facilitate discussion, recall that in a classical LDS, one might observe a sample trajectory $\{x_t\}_{0 \le t \le T}$ generated by an LDS obeying $x_{t+1} = A x_t + w_t$, where $A \in \mathbb{R}^{d \times d}$ determines the system dynamics in the noiseless case, and $\{w_t\}_{t \ge 0}$ denote independent zero-mean noise vectors with covariance $\mathrm{cov}(w_t) = W \succ 0$. The mixed LDSs setting considered herein extends classical LDSs by allowing for mixed measurements as described below; see Figure 1 for a visualization of the scenario.

Multiple linear systems. Suppose that there are $K$ different LDSs represented by $\{(A^{(k)}, W^{(k)})\}_{1 \le k \le K}$, where $A^{(k)} \in \mathbb{R}^{d \times d}$ and $W^{(k)} \in \mathbb{R}^{d \times d}$ represent the state transition matrix and noise covariance matrix of the $k$-th LDS, respectively. Here and throughout, we shall refer to $(A^{(k)}, W^{(k)})$ as the system matrix for the $k$-th LDS. We only assume that $(A^{(k)}, W^{(k)}) \neq (A^{(\ell)}, W^{(\ell)})$ for any $k \neq \ell$, whereas $A^{(k)} = A^{(\ell)}$ or $W^{(k)} = W^{(\ell)}$ is allowed.

Mixed sample trajectories. We collect a total number of $M$ unlabeled trajectories from these LDSs. More specifically, the $m$-th sample trajectory is drawn from one of the LDSs in the following manner: set $(A, W)$ to be $(A^{(k)}, W^{(k)})$ for some $1 \le k \le K$, and generate a trajectory of length $T_m$ obeying $x_{t+1} = A x_t + w_t$, where the $w_t$'s are i.i.d. with $\mathbb{E}[w_t] = 0$ and $\mathrm{cov}(w_t) = W \succ 0$. Note, however, that the label $k$ associated with each sample trajectory is a latent variable not revealed to us, resulting in a mixture of unlabeled trajectories. The current paper focuses on the case where the length of each trajectory is somewhat short, making it infeasible to estimate the system matrix from a single trajectory.

Goal. The aim is to jointly learn the system matrices $\{(A^{(k)}, W^{(k)})\}_{1 \le k \le K}$ from the mixture of sample trajectories. In particular, we seek to accomplish this task in a sample-efficient manner, where the total sample size is defined to be the aggregate trajectory length $\sum_{m=1}^{M} T_m$.
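To make the data-generating process concrete, the following minimal simulation sketch (our own illustration, not the paper's code; the function and parameter names are ours) produces such a mixture of unlabeled short trajectories, driven here by Gaussian noise and started from the zero state:

```python
import numpy as np

def simulate_mixed_lds(models, M, T, seed=None):
    """Generate M unlabeled length-T trajectories, each from a randomly chosen LDS.

    models: list of (A, W) pairs with A, W of shape (d, d) and W positive definite.
    Returns the trajectories (arrays of shape (T + 1, d)) and the hidden labels.
    """
    rng = np.random.default_rng(seed)
    d = models[0][0].shape[0]
    labels = rng.integers(len(models), size=M)      # latent labels, unknown to the learner
    trajectories = []
    for k in labels:
        A, W = models[k]
        L = np.linalg.cholesky(W)                   # draw noise with covariance W
        x = np.zeros((T + 1, d))                    # zero initial state
        for t in range(T):
            x[t + 1] = A @ x[t] + L @ rng.standard_normal(d)
        trajectories.append(x)
    return trajectories, labels

# Two toy LDS models sharing the same noise covariance but different dynamics.
d = 4
A1, A2 = 0.5 * np.eye(d), 0.7 * np.eye(d)
W = np.eye(d)
trajs, labels = simulate_mixed_lds([(A1, W), (A2, W)], M=100, T=10, seed=0)
```

Recovering the underlying system matrices (and, implicitly, which trajectory came from which system) from `trajs` alone, without access to `labels`, is the task studied in this paper.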
Motivations. The mixed LDSs setting described above is motivated by many real-world scenarios where a single time-series model is insufficient to capture the complex and heterogeneous patterns in temporal data. For instance, in psychology (Bulteel et al., 2016), researchers collect multiple time-series trajectories (e.g. depression-related symptoms over a period of time) from different patients. Fitting these data with multi-modal LDSs (instead of a single model) not only achieves better fitting performance, but also helps identify subgroups of patients, which further inspires interpretations of the results and tailored treatments for patients from different subgroups. Another example concerns an automobile sensor dataset, which consists of a single continuous (but possibly time-varying) trajectory of measurements from the sensors (Hallac et al., 2017). Through segmentation of the trajectory, clustering of the short pieces, and learning within each cluster, one can discover, and obtain meaningful interpretations for, a small number of key driving modes (such as "driving straight", "slowing down", "turning"). See Section 4 for a longer list of applications.

[Figure 1. A high-level visualization of the mixed LDSs formulation, and the algorithmic idea of combining clustering, classification, and model estimation: a mixture of trajectories is clustered and classified, and a model $(\widehat{A}^{(k)}, \widehat{W}^{(k)})$ is estimated for each cluster. Here, we consider the special case where the multiple short trajectories come from the segments of a single continuous trajectory. The black dots within the continuous trajectory represent the time steps when the latent variable (i.e. the unknown label) changes.]

Challenges. While there is no shortage of potential applications, a mixture of LDSs is far more challenging to learn compared to the classical setting with a single LDS. In particular, the presence of the latent variables, i.e. the unknown labels of the sample trajectories, significantly complicates matters. One straightforward idea is to learn a coarse model for each trajectory, followed by proper clustering of these coarse models (to be utilized for refined model estimation); however, this idea becomes infeasible in the high-dimensional setting unless all trajectories are sufficiently long. Another popular approach is to alternate between model estimation and clustering of trajectories, based on, say, the expectation-maximization (EM) algorithm; unfortunately, there is no theoretical support for such EM-type algorithms, and we cannot preclude the possibility that the algorithms get stuck at undesired local optima. The lack of theoretical guarantees in prior literature motivates us to come up with algorithms that enjoy provable performance guarantees. Finally, the present paper is also inspired by the recent progress in meta-learning for mixed linear regression (Kong et al., 2020b;a), where the goal is to learn multiple linear models from a mixture of independent samples; note, however, that the temporal dependence underlying time-series data in our case poses substantial challenges and calls for the development of new algorithmic ideas.

Main contributions.
In this work, we take an important step towards guaranteed learning of mixed LDSs, focusing on algorithm design that comes with end-to-end theoretical guarantees. In particular, we propose a two-stage metaalgorithm to tackle the challenge of mixed LDSs: 1. Coarse estimation: perform a coarse-level clustering of the unlabeled sample trajectories (assisted by dimension reduction), and compute initial model estimation for each cluster; 2. Refinement: classify additional trajectories (and add each of them to the corresponding cluster) based on the above coarse model estimates, followed by refined model estimation with the updated clusters. This two-stage meta approach, as well as the specific methods for individual steps, will be elucidated in Section 2. Encouragingly, assuming that the noise vectors {wm,t} are independent Gaussian random vectors, the proposed twostage algorithm is not only computationally efficient, but also guaranteed to succeed in the presence of a polynomial sample size. Informally, our algorithm achieves exact clustering/classification of the sample trajectories as well as faithful model estimation, under the following conditions (with a focus on the dependency on the dimension d): (1) each short trajectory length Tm is allowed to be much smaller than d; (2) the total trajectory length of each stage is linear in d (up to logarithmic factors); (3) to achieve a final model estimation error ϵ 0 (in the spectral norm), it suffices to have a total trajectory length of order d/ϵ2 for each LDS model. See Section 3 (in particular, Corollary 3.7) for the precise statements of our main results, which will also be validated numerically. It is worth noting that, although we focus on mixed LDSs for concreteness, we will make clear that the proposed modular algorithm is fairly flexible and can be adapted to learning mixtures of other time-series models, as long as certain technical conditions are satisfied; see Remark 2.3 at the end of Section 2 for a detailed discussion. Notation. Throughout this paper, vectors and matrices are represented by boldface letters. For a vector x Rd, we define (x)i R as the i-th entry of x, and x 2 as its ℓ2 norm; for a matrix X Rm n, we define (X)i Rn as the transpose of the i-th row of X, and X (resp. X F) as its spectral (resp. Frobenius) norm. For a symmetric matrix X Rd d, we denote its maximal (resp. minimal) eigenvalue as λmax(X) (resp. λmin(X)); if in addition X is positive definite, we denote its condition number as κ(X) := Learning Mixtures of Linear Dynamical Systems Algorithm 1 A two-stage algorithm for mixed LDSs 1: Input: M short trajectories {Xm}1 m M (where Xm = {xm,t}0 t Tm); parameters τ, G. 2: // Stage 1: coarse estimation. 3: Run Algorithm 2 with {Xm}m Msubspace to obtain subspaces {Vi, Ui}1 i d . 4: Run Algorithm 3 with {Xm}m Mclustering, {Vi, Ui}1 i d, τ, G, to obtain clusters {Ck}1 k K. 5: Run Algorithm 4 with {Ck}1 k K to obtain coarse models { b A(k), c W (k)}1 k K. 6: // Stage 2: refinement. 7: Run Algorithm 5 with {Xm}m Mclassification, { b A(k), c W (k)}1 k K, {Ck}1 k K, to update clusters {Ck}. 8: Run Algorithm 4 with {Ck}1 k K to obtain refined models { b A(k), c W (k)}1 k K. 9: Output: final model estimation { b A(k), c W (k)}1 k K and clusters {Ck}1 k K. λmax(X)/λmin(X) 1. If A = [Aij]1 i m,1 j n and B = [Bij]1 i m,1 j n are matrices of the same dimension, we denote by A, B := Pm i=1 Pn j=1 Aij Bij their inner product. Given vectors x1, . . . 
, xn Rd where n < d, let span{xi, 1 i N} Rd n represent the subspace spanned by these vectors. Let Id be the d d identity matrix. We always use the superscript (k) to indicate the k-th model , as in A(k) and W (k); this is to be distinguished from the superscript k without the parentheses, which simply means the power of k. For a discrete set Ω, we denote by |Ω| its cardinality. Define 1(E) to be the indicator function, which takes value 1 if the event E happens, and 0 otherwise. Let an bn indicate that an C0bn for all n = 1, 2, . . . , where C0 > 0 is some universal constant; moreover, an bn is equivalent to bn an, and an bn indicates that an bn and an bn hold simultaneously. 2. Algorithms We propose a two-stage paradigm for solving the mixed LDSs problem, as summarized in Algorithm 1. It consists of several subroutines as described in Algorithms 2 5; due to space limitation, in this section we will only illustrate the key ideas behind these subroutines, with full details deferred to Appendix A. Note that Algorithm 1 is stated in a modular manner, and one might replace certain subroutines by alternative schemes in order to handle different settings and model assumptions. Let us first introduce some additional notation and assumptions that will be useful for our presentation, without much loss of generality. To begin with, we augment the notation for each sample trajectory with its trajectory index; that is, for each 1 m M, the m-th trajectory denoted by Xm := {xm,t}0 t Tm starts with some initial state xm,0 Rd, and evolves according to the km-th LDS for some unknown label 1 km K such that xm,t+1 = A(km)xm,t + wm,t, where the wm,t s are i.i.d., E[wm,t] = 0, cov(wm,t) = W (km) 0 (1) for all 0 t Tm 1. Next, we divide the M sample tra- jectories {Xm}1 m M in hand into three disjoint subsets Msubspace, Mclustering, Mclassification satisfying Msubspace Mclustering Mclassification = {1, 2, . . . , M}, where each subset of samples will be employed to perform one subroutine. We assume that all trajectories within each subset have the same length, namely, Tsubspace if m Msubspace, Tclustering if m Mclustering, Tclassification if m Mclassification. (2) Finally, we assume that K d, so that performing subspace estimation in Algorithm 1 will be helpful (otherwise one might simply eliminate this step). The interested readers are referred to Appendix E for discussions of some potential extensions of our algorithms (e.g. adapting to the case where the trajectories within a subset have different lengths, or the case where certain parameters are a priori unknown). 2.1. Preliminary facts We first introduce some preliminary background on the autocovariance structures and mixing property of linear dynamical systems, which form the basis of our algorithms for subspace estimation and clustering of trajectories. Stationary covariance matrices. Consider first a single LDS model (A, W ), with xt+1 = Axt + wt. If E[xt] = 0, cov(xt) = Γ and E[wt] = 0, cov(wt) = W , then it follows that E[xt+1] = 0, and cov(xt+1) = A cov(xt) A +cov(wt) = AΓA +W . Under certain assumption on stability, this leads to the order-0 stationary autocovariance matrix defined as (Kailath et al., 2000) Γ(A, W ) := E xtxt |A, W = A Γ(A, W ) A + W = t=0 At W (At) , (3) and the order-1 stationary autocovariance matrix Y (A, W ) := E xt+1xt |A, W = A Γ(A, W ). (4) Learning Mixtures of Linear Dynamical Systems For the mixed LDSs case (1) with K models, we abbreviate Γ(k) := Γ(A(k), W (k)), Y (k) := Y (A(k), W (k)). (5) Mixing. 
Expanding the LDS recursion $x_{t+1} = A x_t + w_t$ with $\mathrm{cov}(w_t) = W$, we have
$$x_{t+s} = A^s x_t + \sum_{i=1}^{s} A^{i-1} w_{t+s-i}, \qquad s = 1, 2, \dots \tag{6}$$
If $A$ satisfies certain assumptions regarding stability and if $s$ is larger than a certain mixing time of the LDS, then the first term on the right-hand side of (6) approaches zero, while the second term is independent of the history up to time $t$, with covariance close to the stationary autocovariance $\Gamma(A, W)$. This suggests that, for two samples within the same trajectory that are sufficiently far apart, we can treat them as being (almost) independent of each other; this simple fact shall inspire our algorithmic design and streamline statistical analysis later on. It is noteworthy that the proposed algorithms do not require prior knowledge about the mixing times of the LDS models.

2.2. Subspace estimation

Recall that the notation $(\Gamma^{(k)})_i \in \mathbb{R}^d$ (resp. $(Y^{(k)})_i$) represents the transpose of the $i$-th row of $\Gamma^{(k)}$ (resp. $Y^{(k)}$) defined in (5). Consider the following subspaces:
$$V_i^{\star} := \mathrm{span}\big\{(\Gamma^{(k)})_i, \ 1 \le k \le K\big\}, \qquad U_i^{\star} := \mathrm{span}\big\{(Y^{(k)})_i, \ 1 \le k \le K\big\}, \qquad 1 \le i \le d,$$
each of which has rank at most $K$. We develop a spectral method to estimate these subspaces, which will in turn allow for proper dimension reduction in subsequent steps. Here, we only introduce the idea for estimating $\{U_i^{\star}\}$, since the idea for $\{V_i^{\star}\}$ is very similar.

We first divide $\{0, 1, \dots, T_{\mathrm{subspace}}\}$ into four segments of the same size, and denote the 2nd (resp. 4th) segment as $\Omega_1$ (resp. $\Omega_2$). According to the mixing property of LDSs, if $T_{\mathrm{subspace}}$ is larger than some appropriately defined mixing time, then (1) each sample trajectory in $\mathcal{M}_{\mathrm{subspace}}$ will mix sufficiently and nearly reach stationarity (when constrained to $t \in \Omega_1 \cup \Omega_2$), and (2) the samples in $\Omega_1$ are nearly independent of those in $\Omega_2$. Therefore, defining
$$g_{m,i,j} := \frac{1}{|\Omega_j|} \sum_{t \in \Omega_j} (x_{m,t+1})_i \, x_{m,t} \tag{7}$$
for $1 \le i \le d$ and $j \in \{1, 2\}$, it is easy to check that $\mathbb{E}[g_{m,i,1} g_{m,i,2}^{\top}] \approx \mathbb{E}[g_{m,i,1}] \, \mathbb{E}[g_{m,i,2}]^{\top} \approx (Y^{(k_m)})_i (Y^{(k_m)})_i^{\top}$. Taking the average over $m \in \mathcal{M}_{\mathrm{subspace}}$, we have constructed a matrix $\widehat{G}_i$ such that $\mathbb{E}[\widehat{G}_i] \approx \sum_{k=1}^{K} p^{(k)} (Y^{(k)})_i (Y^{(k)})_i^{\top} =: G_i$, where $p^{(k)}$ denotes the fraction of sample trajectories generated by the $k$-th model. As a consequence, if $T_{\mathrm{subspace}}$ and $|\mathcal{M}_{\mathrm{subspace}}|$ are sufficiently large, then one might expect $\widehat{G}_i$ to be a good approximation of $G_i$, the latter of which is symmetric, has rank at most $K$, and has $U_i^{\star}$ as its eigenspace. All this motivates us to compute the rank-$K$ eigenspace of $\widehat{G}_i + \widehat{G}_i^{\top}$.

2.3. Clustering

In this step, we propose to construct a similarity matrix $S$ for the sample trajectories in $\mathcal{M}_{\mathrm{clustering}}$, based on their pairwise comparisons; then we can apply any mainstream clustering algorithm (e.g. spectral clustering (Chen et al., 2021b)) to $S$, dividing $\mathcal{M}_{\mathrm{clustering}}$ into $K$ disjoint clusters $\{\mathcal{C}_k\}_{1 \le k \le K}$ such that the trajectories in each cluster are primarily generated by the same LDS model. The remaining question is how to perform the pairwise comparisons; we intend to achieve this by comparing the autocovariance matrices associated with the sample trajectories. Note that, even though $(A^{(k)}, W^{(k)}) \neq (A^{(\ell)}, W^{(\ell)})$ for $k \neq \ell$, it is indeed possible that $\Gamma^{(k)} = \Gamma^{(\ell)}$ or $Y^{(k)} = Y^{(\ell)}$. Fortunately, the following fact ensures the separation of $(\Gamma^{(k)}, Y^{(k)})$ versus $(\Gamma^{(\ell)}, Y^{(\ell)})$.

Fact 2.1. If $(A^{(k)}, W^{(k)}) \neq (A^{(\ell)}, W^{(\ell)})$, then we have either $\Gamma^{(k)} \neq \Gamma^{(\ell)}$ or $Y^{(k)} \neq Y^{(\ell)}$ (or both).
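Underlying Fact 2.1 is the observation that $(A, W)$ can be recovered from $(\Gamma, Y)$ via $A = Y \Gamma^{-1}$ and $W = \Gamma - A \Gamma A^{\top}$ (cf. equation (17) in the appendix), so two distinct models cannot share both stationary autocovariances. Below is a quick numerical sanity check of this recovery map; this is our own sketch, and it assumes SciPy's discrete Lyapunov solver to compute $\Gamma$ from equations (3)-(4).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def stationary_autocov(A, W):
    """Order-0 and order-1 stationary autocovariances of x_{t+1} = A x_t + w_t, cov(w_t) = W."""
    Gamma = solve_discrete_lyapunov(A, W)   # solves Gamma = A Gamma A^T + W, cf. equation (3)
    Y = A @ Gamma                           # equation (4): Y(A, W) = A Gamma(A, W)
    return Gamma, Y

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
A = 0.5 * A / np.linalg.norm(A, 2)          # rescale so that ||A|| = 0.5 < 1 (stable)
W = np.eye(d)

Gamma, Y = stationary_autocov(A, W)
A_rec = Y @ np.linalg.inv(Gamma)            # A = Y Gamma^{-1}
W_rec = Gamma - A_rec @ Gamma @ A_rec.T     # W = Gamma - A Gamma A^T
assert np.allclose(A_rec, A) and np.allclose(W_rec, W)
```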
Now, let us compare the $m$-th and $n$-th trajectories for some $m, n \in \mathcal{M}_{\mathrm{clustering}}$. In order to determine whether they have the same label (namely $k_m = k_n$), we propose to estimate the quantity $\|\Gamma^{(k_m)} - \Gamma^{(k_n)}\|_{\mathrm{F}}^2 + \|Y^{(k_m)} - Y^{(k_n)}\|_{\mathrm{F}}^2$ using the data samples $\{x_{m,t}\}$ and $\{x_{n,t}\}$, which is expected to be small (resp. large) if $k_m = k_n$ (resp. $k_m \neq k_n$). To do so, let us divide $\{0, 1, \dots, T_{\mathrm{clustering}}\}$ evenly into four segments, and denote by $\Omega_1$ (resp. $\Omega_2$) the 2nd (resp. 4th) segment. Recall our earlier definition of the $g$ vectors in (7). Assuming sufficient mixing and utilizing near independence between samples from $\Omega_1$ and those from $\Omega_2$, we might resort to the following statistic:
$$\mathsf{stat}_Y := \sum_{i=1}^{d} \Big\langle U_i^{\top} \big(g_{m,i,1} - g_{n,i,1}\big), \ U_i^{\top} \big(g_{m,i,2} - g_{n,i,2}\big) \Big\rangle,$$
whose expectation is given by
$$\mathbb{E}[\mathsf{stat}_Y] \approx \sum_{i=1}^{d} \big\| U_i^{\top} \big( (Y^{(k_m)})_i - (Y^{(k_n)})_i \big) \big\|_2^2 \approx \sum_{i=1}^{d} \big\| (Y^{(k_m)})_i - (Y^{(k_n)})_i \big\|_2^2 = \big\| Y^{(k_m)} - Y^{(k_n)} \big\|_{\mathrm{F}}^2;$$
here, the second approximation holds if each subspace $U_i$ is sufficiently close to $U_i^{\star} = \mathrm{span}\{(Y^{(j)})_i, 1 \le j \le K\}$. The purpose of utilizing $\{U_i\}$ is to reduce the variance of $\mathsf{stat}_Y$. Similarly, we can compute another statistic $\mathsf{stat}_{\Gamma}$ with $\mathbb{E}[\mathsf{stat}_{\Gamma}] \approx \|\Gamma^{(k_m)} - \Gamma^{(k_n)}\|_{\mathrm{F}}^2$. Consequently, we shall declare $k_m \neq k_n$ if $\mathsf{stat}_{\Gamma} + \mathsf{stat}_Y$ exceeds some appropriate threshold.

Remark 2.2. For subspace estimation and clustering, we split each trajectory evenly into four parts, and only utilize the 2nd and 4th. In theory, this only affects sample complexities by a constant factor. The key benefit is that no prior knowledge of the mixing time $t_{\mathrm{mix}}$ is needed. In cases where $t_{\mathrm{mix}}$ is known or estimated, one might improve data efficiency by letting the 1st and 3rd parts have length only $t_{\mathrm{mix}}$.

2.4. Model estimation

Suppose that we have obtained a good clustering accuracy in the previous step. Then, for each $k$, we use the samples $\{\{x_{m,t}\}_{0 \le t \le T_m}\}_{m \in \mathcal{C}_k}$ to obtain an estimate $\widehat{A}^{(k)}$ of the state transition matrix by solving a least-squares problem (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019). Finally, we can estimate the noise vectors via $\widehat{w}_{m,t} := x_{m,t+1} - \widehat{A}^{(k)} x_{m,t} \approx w_{m,t}$, and use the empirical covariance of $\{\widehat{w}_{m,t}\}$ as our estimate of the noise covariance matrix.

2.5. Classification

In the previous steps, we have obtained initial clusters $\{\mathcal{C}_k\}_{1 \le k \le K}$ and coarse model estimates $\{\widehat{A}^{(k)}, \widehat{W}^{(k)}\}_{1 \le k \le K}$. With the assistance of additional sample trajectories in $\mathcal{M}_{\mathrm{classification}}$, we can augment the clusters in the following way: for each new trajectory $\{x_t\}_{0 \le t \le T}$, we infer its label as $\widehat{k} = \arg\min_k \mathcal{L}(\widehat{A}^{(k)}, \widehat{W}^{(k)})$, and then assign this trajectory to the $\widehat{k}$-th cluster; here, the loss function
$$\mathcal{L}(A, W) := T \log \det(W) + \sum_{t=0}^{T-1} (x_{t+1} - A x_t)^{\top} W^{-1} (x_{t+1} - A x_t)$$
coincides with the negative log-likelihood under a Gaussian assumption. Once this is done, we can run Algorithm 4 again with the updated clusters $\{\mathcal{C}_k\}$ to refine our model estimates, which is exactly the last step of Algorithm 1.

Remark 2.3. While this work focuses on mixtures of LDSs, we emphasize that the general principles of Algorithm 1 are applicable under much weaker assumptions. For Algorithms 2 and 3 to work, we essentially only require that each sample trajectory satisfies a certain mixing property, and that the autocovariances of different models are sufficiently separated (and hence distinguishable). As for Algorithms 4 and 5, we essentially require a well-specified parametric form of the time-series models. This observation might inspire future extensions (in theory or applications) of Algorithm 1 to much broader settings.

3. Main results

3.1.
Model assumptions To streamline the theoretical analysis, we focus on the case where the trajectories are driven by Gaussian noise; that is, for each 1 m M, 0 t Tm, the noise vector wm,t in (1) is independently generated from the Gaussian distribution N(0, W (km)). Next, we assume for simplicity that the labels {km}1 m M of the trajectories are pre-determined and fixed, although one might equivalently regard {km} as being random and independent of the noise vectors {wm,t}. Moreover, while our algorithms and analysis are largely insensitive to the initial states {xm,0}1 m M, we focus on two canonical cases for concreteness: (i) the trajectories start at zero state, or (ii) they are segments of one continuous long trajectory. This is formalized as follows: Case 0: xm,0 = 0, 1 m M; (8a) Case 1: x1,0 = 0, and xm+1,0 = xm,Tm, 1 m M 1. (8b) We further define the total trajectory length as Ttotal := P 1 m M Tm = P o Ttotal,o, where Ttotal,o := To |Mo|, o {subspace, clustering, classification}. Additionally, we assume that each model occupies a nondegenerate fraction of the data; in other words, there exists some 0 < pmin 1/K such that for all 1 k K and o {subspace, clustering, classification}, pmin p(k) o := 1 |Mo| m Mo 1(km = k). Finally, we make the following assumptions about the ground-truth LDS models, where we recall that the autocovariance matrices {Γ(k), Y (k)} have been defined in (5). Assumption 3.1. The LDS models {A(k), W (k)}1 k K satisfy the following conditions: 1. There exist κA 1 and 0 ρ < 1 such that for any 1 k K, (A(k))t κA ρt, t = 1, 2, . . . ; 2. There exist Γmax Wmax Wmin > 0 and κw,cross κw 1 such that for any 1 k K, (i) λmax(Γ(k)) Γmax, (ii) Wmin λmin(W (k)) λmax(W (k)) Wmax, (iii) Wmax/Wmin =: κw,cross, (iv) κ(W (k)) = λmax(W (k))/λmin(W (k)) κw; 3. There exist Γ,Y , A,W > 0 such that for any 1 k < ℓ K, Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F 2 Γ,Y , A(k) A(ℓ) 2 F + W (k) W (ℓ) 2 F W 2max 2 A,W . Learning Mixtures of Linear Dynamical Systems The first assumption states that each state transition matrix A(k) is exponentially stable, which is a quantified version of stability and has appeared in various forms in the literature of LDS (Kailath et al., 2000; Cohen et al., 2018); here, κA can be regarded as a condition number, while ρ is a contraction rate. The second assumption ensures that the noise covariance matrices {W (k)} are well conditioned, and the autocovariance matrices {Γ(k)} are bounded. The third assumption quantifies the separation between different LDS models. It is important that we consider the separation of (Γ(k), Y (k)) versus (Γ(ℓ), Y (ℓ)) jointly, which guarantees that Γ,Y is always strictly positive (thanks to Fact 2.1), despite the possibility of Γ(k) = Γ(ℓ) or Y (k) = Y (ℓ); the reasoning for our definition of A,W is similar. Remark 3.2. Since the separation parameters A,W , Γ,Y are defined with respect to the Frobenius norm, we may naturally regard them as A,W = dδA,W , Γ,Y = dδΓ,Y , where δA,W , δΓ,Y are the canonical separation parameters (in terms of the spectral norm). For example, in the simple setting where K = 2, W (1) = W (2), A(1) = 0.5Id and A(2) = (0.5 + δ)Id for some canonical separation parameter δ, we have A,W = A(1) A(2) F = δId F = dδ. This observation will be crucial for obtaining the correct dependence on the dimension d in our subsequent analysis. Remark 3.3. Most of the parameters in Assumption 3.1 come directly from the ground-truth LDS models {A(k), W (k)}, except Γmax and Γ,Y . 
In fact, they can be bounded by Γmax Wmax and Γ,Y A,W ; see Fact D.4 for the formal statements. However, the pre-factors in these bounds can be pessimistic in general cases, therefore we choose to preserve Γmax and Γ,Y in our analysis. 3.2. Theoretical guarantees We are now ready to present our end-to-end performance guarantees for Algorithm 1. Our main results for Cases 0 and 1 (defined in (8)) are summarized in Theorems 3.4 and 3.5 below, with proofs deferred to Appendix B. Theorem 3.4 (Case 0). There exist positive constants {Ci}1 i 8 such that the following holds for any fixed 0 < δ < 1/2. Consider the model (1) under the assumptions in Sections 2 and 3.1, with a focus on Case 0 (8a). Suppose that we run Algorithm 1 with parameters τ, G that satisfy 1/8 < τ/ 2 Γ,Y < 3/8, G C1ι1, and data {Xm}1 m M (where Xm = {xm,t}0 t Tm) that satisfies Tsubspace C2 ι1 1 ρ, Ttotal,subspace C3 d 1 ρ Tclustering C4 G 1 ρ d K 2 Γ,Y + 1 Ttotal,clustering C5 dκ2 w,cross pmin Γmax Wmin + 1 ι2 3, Tclassification C6 κ2 w + κ6 w,cross 2 A,W where we define the logarithmic terms ι1 := log( dκATtotal δ ), ι2 := log ( Γmax Γ,Y + 2) dκATtotal δ , ι3 := log( Γmax Wmin dκATtotal δ ). Then, with probability at least 1 δ, Algorithm 1 achieves exact clustering in Line 4 and exact classification in Line 7; moreover, there exists some permutation π : {1, . . . , K} {1, . . . , K} such that the final model estimation { b A(k), c W (k)}1 k K in Line 8 obeys 1 k K, b A(k) A(π(k)) C7 c W (k) W (π(k)) W (π(k)) C8 dι3 pmin T , (10) where T := Ttotal,clustering + Ttotal,classification. Theorem 3.5 (Case 1). Consider the same setting of Theorem 3.4, except that we focus on Case 1 (cf. (8b)) instead. Then the same performance guarantees continue to hold, if we replace the conditions on Ttotal,clustering and Tclassification in (9) with Ttotal,clustering C5 dκ2 w,cross pmin Γmax Wmin κ2 A + 1 ι2 3, Tclassification C6 κ2 w + κ6 w,cross 2 A,W ι2 1 + 1 1 ρ log(2κA). Remark 3.6. It is worth noting that all the short trajectory lengths {To} in the above theorems have sublinear dependence on the dimension d. In particular, the d K dependence in Tclustering (9) is due to variance reduction achieved by subspace projection (see Section 2.3), without which this dependence would have been linear in d. While Theorems 3.4 and 3.5 guarantee that Algorithm 1 successfully learns a mixture of LDS models with a polynomial number of samples, these results involve many parameters and may be somewhat difficult to interpret. In the following corollary, we make some simplifications and focus on the most important parameters. Corollary 3.7. Consider the same setting of Theorems 3.4 and 3.5. For simplicity, suppose that the condition numbers κA, κw, κw,cross 1, and the fractions of data generated by different LDS models are balanced, namely pmin 1/K. Moreover, define the canonical separation parameters δA,W := A,W / d and δΓ,Y := Γ,Y / Learning Mixtures of Linear Dynamical Systems as suggested by Remark 3.2. Finally, define the mixing time tmix := 1/(1 ρ). 
Then we can rewrite the sample complexities in Theorems 3.4 and 3.5 as follows (where we hide the logarithmic terms {ιi}1 i 3): if Tsubspace tmix, Ttotal,subspace tmixd Tclustering tmix Ttotal,clustering Kd 1 δ2 A,W Γmax Wmin + 1 , Tclassification 1 dδ2 A,W + 1 for Case 0, 1 dδ2 A,W + tmix for Case 1, Ttotal,clustering + Ttotal,classification Kd then with high probability, Algorithm 1 achieves exact clustering in Line 4, exact classification in Line 7, and final model estimation errors b A(k) A(π(k)) ϵ, c W (k) W (π(k)) W (π(k)) ϵ. Below are some key implications of Corollary 3.7. Dimension d and targeted error ϵ: (1) Our algorithms allow the To s to be much smaller than (and even decrease with) d. (2) Each Ttotal,o shall grow linearly with d, and it takes an order of Kd/ϵ2 samples in total to learn K models in up to ϵ 0 errors (in the spectral norm), which is just K times the usual parametric rate (d/ϵ2) of estimating a single model. Mixing time tmix: (1) Tsubspace and Tclustering are linear in tmix, which ensures sufficient mixing and thus facilitates our algorithms for Stage 1. In contrast, Tclassification depends on tmix only for Case 1, with the sole and different purpose of ensuring that the states {xm,t} are bounded throughout (see Example D.5 for an explanation). (2) While Ttotal,subspace needs to grow linearly with tmix, this is not required for Ttotal,clustering and Ttotal,classification, because our methods for model estimation (Algorithm 4) do not rely on mixing (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019). Canonical separation parameters δA,W , δΓ,Y : (1) Tclustering 1/δ2 Γ,Y guarantees exact clustering of the trajectories, while Tclassification 1/δ2 A,W guarantees exact classification. (2) Ttotal,subspace 1/δ4 Γ,Y leads to sufficiently accurate subspaces, while Figure 2. The model estimation errors of Algorithm 1 versus the total sample size (excluding Msubspace). Each curve is an average over 12 independent trials. Ttotal,clustering 1/δ2 A,W leads to accurate initial model estimation.1 Remark 3.8. It is worth noting that Tclustering Tclassification in Corollary 3.7, i.e. clustering requires a larger trajectory length than classification does. This justifies the benefit of the proposed two-stage procedure, compared with simply doing a large clustering of all trajectories and then one-shot model estimation. 3.3. Numerical experiments We now validate our theoretical findings with a series of numerical experiments, confirming that Algorithm 1 successfully solves the mixed LDSs problem. In these experiments, we fix d = 80, K = 4; moreover, let Tsubspace = 20, Tclustering = 20 and Tclassification = 5, all of which are much smaller than d. We take |Msubspace| = 30 d, |Mclustering| = 10 d, and vary |Mclassification| between [0, 5000 d]. Our experiments focus on Case 1 as defined in (8b), and we generate the labels of the sample trajectories uniformly at random. The ground-truth LDS models are generated in the following manner: A(k) = ρR(k), where ρ = 0.5 and R(k) Rd d is a random orthogonal matrix; W (k) has eigendecomposition U (k)Λ(k)(U (k)) , where U (k) is a random orthogonal matrix, and the diagonal entries of Λ(k) are independently drawn from the uniform distribution on [1, 2]. Our experimental results are illustrated in Figure 2. 
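For reproducibility, the random ground-truth models just described can be generated as in the following sketch (our own code, not the paper's; random orthogonal matrices are drawn here via QR factorization of Gaussian matrices):

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random orthogonal matrix via QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

def generate_ground_truth(d=80, K=4, rho=0.5, seed=0):
    """Ground-truth models as in Section 3.3: A^(k) = rho R^(k), W^(k) = U diag(lam) U^T."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(K):
        A = rho * random_orthogonal(d, rng)
        U = random_orthogonal(d, rng)
        lam = rng.uniform(1.0, 2.0, size=d)   # eigenvalues of W^(k), drawn from Unif[1, 2]
        models.append((A, U @ np.diag(lam) @ U.T))
    return models

models = generate_ground_truth()   # K = 4 models in dimension d = 80
```

These models, together with labels drawn uniformly at random, can then be fed into a trajectory simulator such as the one sketched in Section 1.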
Here, the horizontal axis represents the sample size T = Ttotal,clustering + Ttotal,classification for model estimation, and the vertical axis represents the estimation errors, measured by maxk b A(k) A(π(k)) (plotted in blue) and maxk c W (k) W (π(k)) / W (π(k)) (plotted in orange). The results confirm our main theoretical prediction: Algorithm 1 recovers the LDS models based on a mixture of short trajectories with length To d, and achieves an error rate of 1/ T. In addition, we observe in our experiments 1It is possible to improve the 1/δ4 Γ,Y factor in Ttotal,subspace to 1/δ2 Γ,Y , if one is willing to pay for some extra factors of eigengaps; see Appendix B for a detailed discussion. Learning Mixtures of Linear Dynamical Systems that the final outputs of Algorithm 1 are robust to a small number of mis-clustered or mis-classified trajectories during the intermediate steps; this is clearly an appealing property to have, especially when the To s and |Mo| s in reality are slightly smaller than what our theory requires. Additional experimental results can be found in Appendix F. 4. Related works Meta-learning for mixed linear regression. Our work is closely related to the recent papers (Kong et al., 2020b;a) that bridge mixed linear regression (Quandt & Ramsey, 1978; Yi et al., 2014; Chen et al., 2017; Li & Liang, 2018; Chen et al., 2020; Kwon et al., 2021b; Diakonikolas & Kane, 2020; Yin et al., 2018; Pal & Mazumdar, 2020; Chen et al., 2021c; Diamandis et al., 2021) and meta-learning (Finn et al., 2017; Harrison et al., 2020; Snell et al., 2017; Du et al., 2021; Pan & Yang, 2009; Tripuraneni et al., 2020; Baxter, 2000; Maurer et al., 2016). In this setting of metalearning for mixed linear regression (Kong et al., 2020b;a), one has access to multiple tasks, each containing a few independent and identically distributed samples generated by an unknown model; the inductive bias (i.e. the common structure underlying these tasks) for meta-learning is that there is only a small discrete set of ground-truth linear regression models, akin to the case of mixed linear regression. While the high-level idea of our Algorithm 1 is largely inspired by the work of (Kong et al., 2020b;a), our detailed implementations are substantially different, due to the lack of ideal i.i.d. assumption (among others) in our case of time-series data; in addition, we improve upon some of the analyses in (Kong et al., 2020b;a).2 Mixtures of time-series models and trajectories. Mixture models for time series have achieved empirical success in the study of psychology (Bulteel et al., 2016; Takano et al., 2020), neuroscience (Albert, 1991; Mezer et al., 2009), biology (Wong & Li, 2000), air pollution (D Urso et al., 2015), economics (Mc Culloch & Tsay, 1994; Maharaj, 2000; Kalliovirta et al., 2016), automobile sensors (Hallac et al., 2017), and many other domains. Some specific aspects of mixture models include hypothesis testing for a pair of trajectories (Maharaj, 2000), or clustering of multiple trajectories (Liao, 2005; Aghabozorgi et al., 2015; Pathak 2There is a concurrent preprint (Modi et al., 2021) that extends multi-task learning (Du et al., 2021; Tripuraneni et al., 2020) to time-series data, and the setting therein includes mixed LDSs as a special case. 
However, the authors of (Modi et al., 2021) assume oracle access to the global optimum of a non-convex optimization problem, without providing a practical algorithm that can provably find it; moreover, with the short trajectory length fixed, the estimation error bounds in that work will remain bounded away from zero, even if the number of trajectories grows to infinity. In comparison, we consider a simpler problem setting, and propose computationally efficient algorithms with better error bounds. et al., 2021; Huang et al., 2021). In addition to mixture models, other related yet different models include time-varying systems (Qu et al., 2021; Minasyan et al., 2021), systems with random parameters (Du et al., 2020), switching systems (Sun, 2006; Sarkar et al., 2019; Ansari et al., 2021), switching state-space models (Ghahramani & Hinton, 2000; Linderman et al., 2017), Markovian jump systems (Shi & Li, 2015; Zhao et al., 2019), and event-triggered systems (Sedghi et al., 2020; Schluter et al., 2020), to name just a few. There are even more related models in reinforcement learning (RL), such as latent bandit (Maillard & Mannor, 2014; Hong et al., 2020), multi-task learning (Wilson et al., 2007; Brunskill & Li, 2013; Liu et al., 2016; Sodhani et al., 2021) / meta-learning (Finn et al., 2017) / transfer-learning (Taylor & Stone, 2009; Tirinzoni et al., 2020) for RL, latent Markov decision processes (Kwon et al., 2021a; Brunskill et al., 2009; Steimle et al., 2021), and so on. What distinguishes our work from this extensive literature is that, we design algorithms and prove non-asymptotic sample complexities for model estimation, in the specific setting of mixture models that features (1) a finite set of underlying time-series models, and (2) unknown labels of the trajectories, with no probabilistic assumptions imposed on these latent variables. Linear dynamical systems. LDS (also called vector autoregressive models) is one of the most fundamental models in system identification and optimal control (Ljung, 1998; Khalil et al., 1996). Recently, there has been a surge of studies about non-asymptotic theoretical analyses of various learning procedures for the basic LDS model (Faradonbeh et al., 2018; Simchowitz et al., 2018; Sarkar & Rakhlin, 2019), linear-quadratic regulators (Dean et al., 2020; Cohen et al., 2018; Jedra & Proutiere, 2019; Faradonbeh et al., 2020; Mania et al., 2019; Simchowitz & Foster, 2020; Fazel et al., 2018; Malik et al., 2019), and LDSs with partial observations (Oymak & Ozay, 2019; Simchowitz et al., 2019; Sarkar et al., 2021; Sun et al., 2020; Tsiamis et al., 2020; Lale et al., 2020; Zheng et al., 2021). In particular, it was only until recently that the authors of (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019) proved sharp error bounds of least squares for estimating the state transition matrix of a LDS model, using a single trajectory; our analysis of Algorithm 4 is largely inspired by their techniques. 5. Discussion This paper has developed a theoretical and algorithmic framework for learning multiple LDS models from a mixture of short, unlabeled sample trajectories. Our key contributions include a modular two-stage meta-algorithm, as well as theoretical analysis demonstrating its computational and statistical efficiency. We would like to invite the readers to contribute to this important topic by, say, further strengthening the theoretical analysis and algorithmic design. 
For exam- Learning Mixtures of Linear Dynamical Systems ple, in certain cases Tclustering can be a bottleneck compared with Tsubspace and Tclassification, and thus one might hope to achieve a better dependence on Ttotal,clustering (e.g. allowing Ttotal,clustering d), by replacing Line 5 in Algorithm 1 with a different method (e.g. adapting Algorithm 3 of (Kong et al., 2020b) to our setting). As another example, in some practical scenarios, the data is a single continuous trajectory, and the time steps when the underlying model changes are unknown (Hallac et al., 2017; Harrison et al., 2020); in order to accommodate such a case, one might need to incorporate change-point detection into the learning process. Moving beyond the current setting of mixed LDSs, we remark that there are plenty of opportunities for future studies. For instance, while our methods in Stage 1 rely on the mixing property of the LDS models, it is worth exploring whether it is feasible to handle the non-mixing case (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019). Another potential direction is to consider the robustness against outliers and adversarial noise (Chen et al., 2021a; Kong et al., 2020a). One might even go further and extend the ideas to learning mixtures of other time-series models (see Remark 2.3), such as LDS with partial or nonlinear observations (Mhammedi et al., 2020), or nonlinear dynamical systems (Mania et al., 2020; Kakade et al., 2020; Foster et al., 2020). Ultimately, it would be of great importance to consider the case with controlled inputs, such as learning mixtures of linear-quadratic regulators, or latent Markov decision processes (Kwon et al., 2021a) that arises in reinforcement learning. Acknowledgements Y. Chen is supported in part by the ARO grant W911NF20-1-0097, the NSF grants CCF-1907661 and IIS-1900140, and the AFOSR grant FA9550-19-1-0030. H. V. Poor is supported in part by the NSF under Grant CCF-1908308. We would like to thank Yuxin Chen and Gen Li for numerous helpful discussions. Aghabozorgi, S., Shirkhorshidi, A. S., and Wah, T. Y. Timeseries clustering a decade review. Information Systems, 53:16 38, 2015. Albert, P. S. A two-state markov mixture model for a time series of epileptic seizure counts. Biometrics, pp. 1371 1381, 1991. Ansari, A. F., Benidis, K., Kurle, R., Turkmen, A. C., Soh, H., Smola, A. J., Wang, B., and Januschowski, T. Deep explicit duration switching models for time series. In Advances in Neural Information Processing Systems, volume 34, 2021. Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149 198, 2000. Brunskill, E. and Li, L. Sample complexity of multi-task reinforcement learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 122 131, Arlington, Virginia, USA, 2013. AUAI Press. Brunskill, E., Leffler, B. R., Li, L., Littman, M. L., and Roy, N. Provably efficient learning with typed parametric models. Journal of Machine Learning Research, 10(68): 1955 1988, 2009. Bulteel, K., Tuerlinckx, F., Brose, A., and Ceulemans, E. Clustering vector autoregressive models: Capturing qualitative differences in within-person dynamics. Frontiers in Psychology, 7:1540, 2016. Chen, S., Li, J., and Song, Z. Learning mixtures of linear regressions in subexponential time via fourier moments. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 587 600, 2020. Chen, S., Koehler, F., Moitra, A., and Yau, M. Kalman filtering with adversarial corruptions. 
ar Xiv preprint ar Xiv:2111.06395, 2021a. Chen, Y., Yi, X., and Caramanis, C. Convex and nonconvex formulations for mixed regression with two components: Minimax optimal rates. IEEE Transactions on Information Theory, 64(3):1738 1766, 2017. Chen, Y., Chi, Y., Fan, J., Ma, C., et al. Spectral methods for data science: A statistical perspective. Foundations and Trends R in Machine Learning, 14(5):566 806, 2021b. Chen, Y., Ma, C., Poor, H. V., and Chen, Y. Learning mixtures of low-rank models. IEEE Transactions on Information Theory, 67(7):4613 4636, 2021c. doi: 10. 1109/TIT.2021.3065700. Cohen, A., Hasidim, A., Koren, T., Lazic, N., Mansour, Y., and Talwar, K. Online linear quadratic control. In Proceedings of the International Conference on Machine Learning, pp. 1029 1038. PMLR, 2018. Davis, C. and Kahan, W. M. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1 46, 1970. Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, 20(4):633 679, 2020. Diakonikolas, I. and Kane, D. M. Small covers for near-zero sets of polynomials and learning latent variable models. In Proceedings of the 2020 IEEE 61st Annual Symposium Learning Mixtures of Linear Dynamical Systems on Foundations of Computer Science (FOCS), pp. 184 195. IEEE, 2020. Diamandis, T., Eldar, Y., Fallah, A., Farnia, F., and Ozdaglar, A. A Wasserstein minimax framework for mixed linear regression. In Meila, M. and Zhang, T. (eds.), Proceedings of the International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2697 2706. PMLR, 18 24 Jul 2021. Du, K., Meng, Q., and Zhang, F. A q-learning algorithm for discrete-time linear-quadratic control with random parameters of unknown distribution: convergence and stabilization. ar Xiv preprint ar Xiv:2011.04970, 2020. Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. In International Conference on Learning Representations, 2021. D Urso, P., De Giovanni, L., and Massari, R. Time series clustering by a robust autoregressive metric with application to air pollution. Chemometrics and Intelligent Laboratory Systems, 141:107 124, 2015. Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. Finite time identification in unstable linear systems. Automatica, 96:342 353, 2018. Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. On adaptive linear quadratic regulators. Automatica, 117: 108982, 2020. Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. In Proceedings of the International Conference on Machine Learning, pp. 1467 1476. PMLR, 2018. Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning - Volume 70, pp. 1126 1135. JMLR.org, 2017. Foster, D., Sarkar, T., and Rakhlin, A. Learning nonlinear dynamical systems from a single trajectory. In Learning for Dynamics and Control, pp. 851 861. PMLR, 2020. Ghahramani, Z. and Hinton, G. E. Variational learning for switching state-space models. Neural computation, 12 (4):831 864, 2000. Hallac, D., Vare, S., Boyd, S., and Leskovec, J. Toeplitz inverse covariance-based clustering of multivariate time series data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 
215 223, 2017. Harrison, J., Sharma, A., Finn, C., and Pavone, M. Continuous meta-learning without tasks. In Advances in Neural Information Processing Systems, volume 33, 2020. Hong, J., Kveton, B., Zaheer, M., Chow, Y., Ahmed, A., and Boutilier, C. Latent bandits revisited. In Advances in Neural Information Processing Systems, volume 33, pp. 13423 13433. Curran Associates, Inc., 2020. Huang, L., Sudhir, K., and Vishnoi, N. Coresets for time series clustering. In Advances in Neural Information Processing Systems, volume 34, 2021. Jedra, Y. and Proutiere, A. Sample complexity lower bounds for linear system identification. In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 2676 2681. IEEE, 2019. Kailath, T., Sayed, A. H., and Hassibi, B. Linear Estimation. Prentice Hall, 2000. Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. Information theoretic regret bounds for online nonlinear control. In Advances in Neural Information Processing Systems, volume 33, pp. 15312 15325. Curran Associates, Inc., 2020. Kalliovirta, L., Meitz, M., and Saikkonen, P. Gaussian mixture vector autoregression. Journal of Econometrics, 192(2):485 498, 2016. Khalil, I. S., Doyle, J., and Glover, K. Robust and Optimal Control. Prentice Hall, New Jersey, 1996. Kong, W., Somani, R., Kakade, S., and Oh, S. Robust metalearning for mixed linear regression with small batches. In Advances in Neural Information Processing Systems, volume 33, pp. 4683 4696. Curran Associates, Inc., 2020a. Kong, W., Somani, R., Song, Z., Kakade, S., and Oh, S. Meta-learning for mixed linear regression. In Proceedings of the International Conference on Machine Learning, pp. 5394 5404. PMLR, 2020b. Kwon, J., Efroni, Y., Caramanis, C., and Mannor, S. Rl for latent MDPs: Regret guarantees and a lower bound. ar Xiv preprint ar Xiv:2102.04939, 2021a. Kwon, J., Ho, N., and Caramanis, C. On the minimax optimality of the em algorithm for learning two-component mixed linear regression. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 1405 1413. PMLR, 2021b. Lale, S., Azizzadenesheli, K., Hassibi, B., and Anandkumar, A. Logarithmic regret bound in partially observable linear dynamical systems. In Advances in Neural Information Processing Systems, volume 33, pp. 20876 20888. Curran Associates, Inc., 2020. Learning Mixtures of Linear Dynamical Systems Li, Y. and Liang, Y. Learning mixtures of linear regressions with nearly optimal complexity. In Proceedings of the Conference On Learning Theory, pp. 1125 1144, 2018. Liao, T. W. Clustering of time series data a survey. Pattern Recognition, 38(11):1857 1874, 2005. Liberzon, D. Switching in Systems and Control. Springer Science & Business Media, 2003. Linderman, S., Johnson, M., Miller, A., Adams, R., Blei, D., and Paninski, L. Bayesian learning and inference in recurrent switching linear dynamical systems. In Artificial Intelligence and Statistics, pp. 914 922. PMLR, 2017. Liu, Y., Guo, Z., and Brunskill, E. Pac continuous state online multitask reinforcement learning with identification. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 438 446, 2016. Ljung, L. System identification. In Signal analysis and prediction, pp. 163 173. Springer, 1998. Lugosi, G. and Mendelson, S. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145 1190, 2019. Maharaj, E. A. Cluster of time series. 
Journal of Classification, 17(2):297 314, 2000. Maillard, O.-A. and Mannor, S. Latent bandits. In Proceedings of the International Conference on Machine Learning, pp. 136 144. PMLR, 2014. Malekzadeh, M., Clegg, R. G., Cavallaro, A., and Haddadi, H. Mobile sensor data anonymization. In Proceedings of the International Conference on Internet of Things Design and Implementation, Io TDI 19, pp. 49 58. ACM, 2019. Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P., and Wainwright, M. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 2916 2925. PMLR, 2019. Mania, H., Tu, S., and Recht, B. Certainty equivalence is efficient for linear quadratic control. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 10154 10164, 2019. Mania, H., Jordan, M. I., and Recht, B. Active learning for nonlinear system identification with guarantees. ar Xiv preprint ar Xiv:2006.10277, 2020. Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853 2884, 2016. Mc Culloch, R. E. and Tsay, R. S. Statistical analysis of economic time series via markov switching models. Journal of Time Series Analysis, 15(5):523 539, 1994. Mezer, A., Yovel, Y., Pasternak, O., Gorfine, T., and Assaf, Y. Cluster analysis of resting-state fmri time series. Neuroimage, 45(4):1117 1125, 2009. Mhammedi, Z., Foster, D. J., Simchowitz, M., Misra, D., Sun, W., Krishnamurthy, A., Rakhlin, A., and Langford, J. Learning the linear quadratic regulator from nonlinear observations. ar Xiv preprint ar Xiv:2010.03799, 2020. Minasyan, E., Gradu, P., Simchowitz, M., and Hazan, E. Online control of unknown time-varying dynamical systems. In Advances in Neural Information Processing Systems, volume 34, 2021. Modi, A., Faradonbeh, M. K. S., Tewari, A., and Michailidis, G. Joint learning of linear time-invariant dynamical systems. ar Xiv preprint ar Xiv:2112.10955, 2021. Oymak, S. and Ozay, N. Non-asymptotic identification of LTI systems from a single trajectory. In Proceedings of the 2019 American Control Conference (ACC), pp. 5655 5661. IEEE, 2019. Pal, S. and Mazumdar, A. Recovery of sparse signals from a mixture of linear samples. In Proceedings of the International Conference on Machine Learning, pp. 7466 7475. PMLR, 2020. Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22 (10):1345 1359, 2009. Pathak, R., Sen, R., Rao, N., Erichson, N. B., Jordan, M. I., and Dhillon, I. S. Cluster-and-conquer: A framework for time-series forecasting. ar Xiv preprint ar Xiv:2110.14011, 2021. Qu, G., Shi, Y., Lale, S., Anandkumar, A., and Wierman, A. Stable online control of linear time-varying systems. In Learning for Dynamics and Control, pp. 742 753. PMLR, 2021. Quandt, R. E. and Ramsey, J. B. Estimating mixtures of normal distributions and switching regressions. Journal of the American Statistical Association, 73(364):730 738, 1978. Sarkar, T. and Rakhlin, A. Near optimal finite time identification of arbitrary linear dynamical systems. In Proceedings of the International Conference on Machine Learning, pp. 5610 5618. PMLR, 2019. Learning Mixtures of Linear Dynamical Systems Sarkar, T., Rakhlin, A., and Dahleh, M. Nonparametric system identification of stochastic switched linear systems. 
In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 3623 3628. IEEE, 2019. Sarkar, T., Rakhlin, A., and Dahleh, M. A. Finite time LTI system identification. Journal of Machine Learning Research, 22(26):1 61, 2021. Schluter, H., Solowjow, F., and Trimpe, S. Event-triggered learning for linear quadratic control. IEEE Transactions on Automatic Control, 2020. Sedghi, L., Ijaz, Z., Witheephanich, K., Pesch, D., et al. Machine learning in event-triggered control: Recent advances and open issues. ar Xiv preprint ar Xiv:2009.12783, 2020. Shi, P. and Li, F. A survey on markovian jump systems: modeling and design. International Journal of Control, Automation and Systems, 13(1):1 16, 2015. Simchowitz, M. and Foster, D. Naive exploration is optimal for online lqr. In Proceedings of the International Conference on Machine Learning, pp. 8937 8948. PMLR, 2020. Simchowitz, M., Mania, H., Tu, S., Jordan, M. I., and Recht, B. Learning without mixing: Towards a sharp analysis of linear system identification. In Proceedings of the Conference On Learning Theory, pp. 439 473. PMLR, 2018. Simchowitz, M., Boczar, R., and Recht, B. Learning linear dynamical systems with semi-parametric least squares. In Proceedings of the Conference on Learning Theory, pp. 2714 2802. PMLR, 2019. Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077 4087, 2017. Sodhani, S., Zhang, A., and Pineau, J. Multi-task reinforcement learning with context-based representations. In Meila, M. and Zhang, T. (eds.), Proceedings of the International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 9767 9779. PMLR, 18 24 Jul 2021. Steimle, L. N., Kaufman, D. L., and Denton, B. T. Multimodel markov decision processes. IISE Transactions, pp. 1 16, 2021. Sun, Y., Oymak, S., and Fazel, M. Finite sample system identification: Optimal rates and the role of regularization. In Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pp. 16 25. PMLR, 10 11 Jun 2020. Sun, Z. Switched Linear Systems: Control and Design. Springer Science & Business Media, 2006. Takano, K., Stefanovic, M., Rosenkranz, T., and Ehring, T. Clustering individuals on limited features of a vector autoregressive model. Multivariate Behavioral Research, pp. 1 19, 2020. Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009. Tirinzoni, A., Poiani, R., and Restelli, M. Sequential transfer in reinforcement learning with a generative model. In Proceedings of the International Conference on Machine Learning, pp. 9481 9492. PMLR, 2020. Tripuraneni, N., Jordan, M. I., and Jin, C. On the theory of transfer learning: The importance of task diversity. In Advances in Neural Information Processing Systems, volume 33, 2020. Tsiamis, A., Matni, N., and Pappas, G. Sample complexity of kalman filtering for unknown systems. In Learning for Dynamics and Control, pp. 435 444. PMLR, 2020. Vershynin, R. High-dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge university press, 2018. Wilson, A., Fern, A., Ray, S., and Tadepalli, P. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the International Conference on Machine Learning, pp. 1015 1022, 2007. Wong, C. S. and Li, W. K. 
On a mixture autoregressive model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):95 115, 2000. Yi, X., Caramanis, C., and Sanghavi, S. Alternating minimization for mixed linear regression. In Proceedings of the International Conference on Machine Learning, pp. 613 621, 2014. Yin, D., Pedarsani, R., Chen, Y., and Ramchandran, K. Learning mixtures of sparse linear regressions using sparse graph codes. IEEE Transactions on Information Theory, 65(3):1430 1451, 2018. Zhao, P., Kang, Y., and Zhao, Y.-B. A brief tutorial and survey on markovian jump systems: Stability and control. IEEE Systems, Man, and Cybernetics Magazine, 5(2): 37 C3, 2019. Learning Mixtures of Linear Dynamical Systems Zheng, Y., Furieri, L., Kamgarpour, M., and Li, N. Sample complexity of linear quadratic gaussian (lqg) control for output feedback systems. In Learning for Dynamics and Control, pp. 559 570. PMLR, 2021. Learning Mixtures of Linear Dynamical Systems Table 1. A list of notation and parameters. In the subscripts of Mo, To, Ttotal,o, the symbol o takes value in {subspace, clustering, classification}. Notation Explanation d State dimension K Number of LDS models km The unknown label (latent variable) of the m-th trajectory Mo Subsets of M trajectories, {1, . . . , M} = Msubspace Mclustering Mclassification pmin A lower bound for the fraction of trajectories generated by each model To Short trajectory length for Mo Ttotal,o Total trajectory length, Ttotal,o = To |Mo| A(k), W (k) State transition matrix and noise covariance matrix of the k-th LDS model Γ(k), Y (k) Order-0 and order-1 stationary autocovariance matrices of the k-th LDS model κA, ρ (A(k))t κA ρt, t = 1, 2, . . . Γ,Y , A,W Model separation parameters (see Assumption 3.1) Wmin, Wmax Wmin λmin(W (k)) λmax(W (k)) Wmax, 1 k K κw,cross, κw κw,cross = Wmax/Wmin; κ(W (k)) κw, 1 k K Γmax Γ(k) Γmax, 1 k K This appendix is organized as follows. Appendix A complements Section 2, providing full details of the proposed Algorithms 2 5. Appendix B includes detailed theoretical results for the algorithms, with proofs postponed to Appendix C; it also provides a proof for our main results in Section 3. Appendix D collects some miscellaneous results. In Appendix E we discuss on some potential extensions of our methods. Finally, Appendix F includes additional experimental results. For the readers convenience, we include Table 1 for a quick reference to the key notation and parameters used in our analysis. Additional notation. Let vec( ) denote the vectorization of a matrix. For two matrices A, B, let A B denote their Kronecker product. Given a sequence of real numbers {xi}1 i N, we denote its median as median{xi, 1 i N}. We shall also let poly(n) denote some polynomial in n of a constant degree. For a positive integer n, we denote [n] := {1, 2, . . . , n}. A. Detailed Algorithms A.1. Subspace estimation Procedure. Recall that the notation (Γ(k))i Rd (resp. (Y (k))i) represents the transpose of the i-th row of Γ(k) (resp. Y (k)) defined in (5). With this set of notation in place, let us define the following subspaces: V i := span n (Γ(k))i, 1 k K o , U i := span n (Y (k))i, 1 k K o , 1 i d. (12) It is easily seen from the construction that each of these subspaces has rank at most K. As it turns out, the collection of 2d subspaces defined in (12) provides crucial low-dimensional information about the linear dynamical systems of interest. 
This motivates us to develop a data-driven method to estimate these subspaces, which will in turn allow for proper dimension reduction in subsequent steps. Towards this end, we propose to employ a spectral method for subspace estimation using sample trajectories in Msubspace: (i) divide {0, 1, . . . , Tsubspace} into four segments of the same size, and denote the 2nd (resp. 4th) segment as Ω1 (resp. Ω2); Learning Mixtures of Linear Dynamical Systems (ii) for each m Msubspace, 1 i d, j {1, 2}, compute hm,i,j := 1 |Ωj| t Ωj (xm,t)i xm,t, gm,i,j := 1 |Ωj| t Ωj (xm,t+1)i xm,t, ; (13) (iii) for each 1 i d, compute the following matrices c Hi := 1 |Msubspace| m Msubspace hm,i,1h m,i,2, b Gi := 1 |Msubspace| m Msubspace gm,i,1g m,i,2, (14) and let Vi Rd K (resp. Ui) be the top-K eigenspace of c Hi + c H i (resp. b Gi + b G i ). The output {Vi, Ui}1 i d will then serve as our estimate of {V i , U i }1 i d. This spectral approach is summarized in Algorithm 2. Rationale. According to the mixing property of LDS, if Tsubspace is larger than some appropriately defined mixing time, then each sample trajectory in Msubspace will mix sufficiently and nearly reach stationarity (when constrained to t Ω1 Ω2). In this case, it is easy to check that E hm,i,j] (Γ(km))i, E gm,i,j] (Y (km))i, 1 i d, j {1, 2}. Moreover, the samples in Ω1 are nearly independent of those in Ω2 as long as the spacing between them exceeds the mixing time, and therefore, E h c Hi i 1 |Msubspace| m Msubspace (Γ(km))i(Γ(km)) i = k=1 p(k) subspace(Γ(k))i(Γ(k))i =: Hi, (15a) E h b Gi i 1 |Msubspace| m Msubspace (Y (km))i(Y (km)) i = k=1 p(k) subspace(Y (k))i(Y (k))i =: Gi, (15b) where p(k) subspace denotes the fraction of sample trajectories generated by the k-th model, namely, p(k) subspace := 1 |Msubspace| m Msubspace 1(km = k), 1 k K. (16) As a consequence, if Tsubspace and |Msubspace| are both sufficiently large, then one might expect c Hi (resp. b Gi) to be a reasonably good approximation of Hi (resp. Gi), the latter of which has rank at most K and has V i (resp. U i ) as its eigenspace. All this motivates us to compute the rank-K eigenspaces of c Hi + c H i and b Gi + b G i in Algorithm 2. A.2. Clustering This step seeks to divide the sample trajectories in Mclustering into K clusters (albeit not perfectly), such that the trajectories in each cluster are primarily generated by the same LDS model. We intend to achieve this by performing pairwise comparisons of the autocovariance matrices associated with the sample trajectories. Key observation. Even though (A(k), W (k)) = (A(ℓ), W (ℓ)) for k = ℓ, it is indeed possible that Γ(k) = Γ(ℓ) or Y (k) = Y (ℓ). Therefore, in order to differentiate sample trajectories generated by different systems based on Γ(A, W ) and Y (A, W ), it is important to ensure separation of (Γ(k), Y (k)) and (Γ(ℓ), Y (ℓ)) when k = l, which can be guaranteed by the following fact. Fact A.1. If (A(k), W (k)) = (A(ℓ), W (ℓ)), then we have either Γ(k) = Γ(ℓ) or Y (k) = Y (ℓ) (or both). Proof. First, observe that the definitions of {Γ(k), Y (k)} in (5) suggest that we can recover A(k), W (k) from Γ(k), Y (k) as follows: A(k) = Y (k)Γ(k) 1, W (k) = Γ(k) A(k)Γ(k)A(k) . (17) Learning Mixtures of Linear Dynamical Systems Algorithm 2 Subspace estimation 1: Input: short trajectories {Xm}m Msubspace, where Xm = {xm,t}0 t Tsubspace. 2: Let N Tsubspace/4 , and Ω1 {N + j, 1 j N}, Ω2 {3N + j, 1 j N}. 3: for (m, i, j) Msubspace [d] [2] do 4: Compute hm,i,j 1 |Ωj| t Ωj (xm,t)i xm,t, gm,i,j 1 |Ωj| t Ωj (xm,t+1)i xm,t. 
5: end for 6: for i = 1, . . . , d do 7: Compute c Hi 1 |Msubspace| m Msubspace hm,i,1h m,i,2, b Gi 1 |Msubspace| m Msubspace gm,i,1g m,i,2, 8: Let Vi Rd K (resp. Ui) be the top-K eigenspace of c Hi + c H i (resp. b Gi + b G i ). 9: end for 10: Output: subspaces {Vi, Ui}1 i d. With this, we can prove the fact by contradiction. Suppose instead that Γ(k) = Γ(ℓ) and Y (k) = Y (ℓ), then (17) yields A(k) = Y (k)Γ(k) 1 = Y (ℓ)Γ(ℓ) 1 = A(ℓ), and W (k) = Γ(k) A(k)Γ(k)A(k) = Γ(ℓ) A(ℓ)Γ(ℓ)A(ℓ) = W (ℓ), which is contradictory to the assumption that (A(k), W (k)) = (A(ℓ), W (ℓ)). Idea. Let us compare a pair of sample trajectories {xt}0 t Tclustering and {zt}0 t Tclustering, where {xt} is generated by the system (A(k), W (k)) and {zt} by the system (A(ℓ), W (ℓ)). In order to determine whether k = ℓ, we propose to estimate the quantity Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F using the data samples {xt} and {zt}, which is expected to be small (resp. large) if k = ℓ(resp. k = ℓ). To do so, let us divide {0, 1, . . . , Tclustering} evenly into four segments, and denote by Ω1 (resp. Ω2) the 2nd (resp. 4th) segment, akin to what we have done in the previous step. Observe that Ui E h (xt+1)ixt (zt+1)izt i Ui (Y (k))i (Y (ℓ))i for all 1 i d and t Ω1 Ω2. Assuming sufficient mixing and utilizing (near) statistical independence due to sample splitting, we might resort to the following statistic (xt+1)ixt (zt+1)izt , Ui 1 (xt+1)ixt (zt+1)izt E , (19) whose expectation is given by D Ui (Y (k))i (Y (ℓ))i , Ui (Y (k))i (Y (ℓ))i E Ui (Y (k))i (Y (ℓ))i 2 (Y (k))i (Y (ℓ))i 2 2 = Y (k) Y (ℓ) 2 F; here, the first approximation is due to the near independence between samples from Ω1 and those from Ω2, whereas the second approximation holds if each subspace Ui is sufficiently close to U i = span{(Y (j))i, 1 j K}. The purpose of Learning Mixtures of Linear Dynamical Systems Algorithm 3 Clustering 1: Input: short trajectories {Xm}m Mclustering, where Xm = {xm,t}0 t Tclustering; subspaces {Vi, Ui}1 i d; testing threshold τ; number of copies G. 2: Let N Tclustering/4G , and Ωg,1 n (4g 3)N + j, 1 j N o , Ωg,2 n (4g 1)N + j, 1 j N o , 1 g G. 3: for (m, i, g, j) Mclustering [d] [G] [2] do 4: Compute hm,i,g,j 1 |Ωg,j| t Ωg,j (xm,t)i xm,t, gm,i,g,j 1 |Ωg,j| t Ωg,j (xm,t+1)i xm,t. 5: end for 6: // Compute the similarity matrix S: 7: for (m, n) Mclustering Mclustering do 8: for g = 1, . . . , G do 9: Compute D Vi hm,i,g,1 hn,i,g,1 , Vi hm,i,g,2 hn,i,g,2 E , (18a) D Ui gm,i,g,1 gn,i,g,1 , Ui gm,i,g,2 gn,i,g,2 E , (18b) 10: end for 11: Set Sm,n 1 median statΓ,g, 1 g G + median stat Y,g, 1 g G τ . 12: end for 13: Divide Mclustering into K clusters {Ck}1 k K according to {Sm,n}m,n Mclustering. 14: Output: clusters {Ck}1 k K. utilizing {Ui} is to reduce the variance of stat Y . Similarly, we can compute another statistic (by replacing {Ui} with {Vi} and (xt+1)i, (zt+1)i with (xt)i, (zt)i) as follows: (xt)ixt (zt)izt , Vi 1 (xt)ixt (zt)izt E , (20) which has expectation i=1 Vi ((Γ(k))i (Γ(ℓ))i) 2 2 Γ(k) Γ(ℓ) 2 F. Consequently, we shall declare k = ℓif statΓ + stat Y exceeds some appropriate threshold τ. Procedure. We are now positioned to describe the proposed clustering procedure. We first compute the statistics statΓ and stat Y for each pair of sample trajectories in Mclustering by means of the method described above, and then construct a similarity matrix S, in a way that Sm,n is set to 0 if statΓ +stat Y (computed for the m-th and n-th trajectories) is larger than a threshold τ, or set to 1 otherwise. 
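As a simplified illustration of the pairwise comparison just described, the sketch below (our own; array conventions and names are assumptions) computes the single-copy statistics statΓ and statY from (19)-(20) for one pair of trajectories, given estimated subspaces {V_i, U_i}. The robust version with G median copies is introduced next and summarized in Algorithm 3.

import numpy as np

def segment_mean(x, z, Om, i, lag):
    # average over t in Om of (x_{t+lag})_i * x_t - (z_{t+lag})_i * z_t
    return (x[Om + lag, i][:, None] * x[Om] - z[Om + lag, i][:, None] * z[Om]).mean(axis=0)

def pair_statistic(x, z, V, U):
    # x, z: trajectories of shape (T+1, d); V[i], U[i]: d x K orthonormal bases.
    T, d = x.shape[0] - 1, x.shape[1]
    N = T // 4
    Om1, Om2 = np.arange(N, 2 * N), np.arange(3 * N, 4 * N)   # 2nd and 4th quarters
    stat = 0.0
    for i in range(d):
        for B, lag in ((V[i], 0), (U[i], 1)):                 # lag 0 gives stat_Gamma, lag 1 gives stat_Y
            stat += (B.T @ segment_mean(x, z, Om1, i, lag)) @ (B.T @ segment_mean(x, z, Om2, i, lag))
    return stat

# Similarity entry for a pair (m, n): S[m, n] = 1 if pair_statistic(X[m], X[n], V, U) <= tau else 0.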
In order to enhance the robustness of these statistics, we divide {0, 1, . . . , Tclustering} into 4G (instead of 4) segments, compute G copies of statΓ and stat Y , and take the medians of these values. Next, we apply a mainstream clustering algorithm (e.g. spectral clustering (Chen et al., 2021b)) to the similarity matrix S, and divide Mclustering into K disjoint clusters {Ck}1 k K. The complete clustering procedure is provided in Algorithm 3. The threshold τ shall be chosen to be on the order of min1 k<ℓ K{ Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F} (which is strictly positive due to Fact 2.1), and it suffices to set the number of copies G to be on some logarithmic order. Our choice of these parameters are specified in Theorem 3.4. Learning Mixtures of Linear Dynamical Systems Algorithm 4 Least squares and covariance estimation 1: Input: clusters {Ck}1 k K. 2: for k = 1, . . . , K do 3: Compute 0 t Tm 1 xm,t+1xm,t X 0 t Tm 1 xm,txm,t 1 , and (22a) c W (k) 1 P 0 t Tm 1 b wm,t b w m,t, where b wm,t = xm,t+1 b A(k)xm,t. (22b) 4: end for 5: Output: estimated models { b A(k), c W (k)}1 k K. A.3. Model estimation Suppose that we have obtained reasonably good clustering accuracy in the previous step, namely for each 1 k K, the cluster Ck output by Algorithm 3 contains mostly trajectories generated by the same model (A(π(k)), W (π(k))), with π representing some permutation function of {1, . . . , K}. We propose to obtain a coarse model estimation and covariance estimation, as summarized in Algorithm 4. More specifically, for each k, we use the samples {{xm,t}0 t Tm}m Ck to obtain an estimate of A(π(k)) by solving the following least-squares problem (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019) b A(k) := arg min A 0 t Tm 1 xm,t+1 Axm,t 2 2, (21) whose closed-form solution is given in (22a). Next, we can use b A(k) to estimate the noise vector b wm,t := xm,t+1 b A(k)xm,t wm,t, and finally estimate the noise covariance W (π(k)) with the empirical covariance of { b wm,t}, as shown in (22b). A.4. Classification Procedure. In the previous steps, we have obtained initial clusters {Ck}1 k K and coarse model estimates { b A(k), c W (k)}1 k K. With the assistance of additional sample trajectories {Xm}m Mclassification, we can infer their latent labels and assign them to the corresponding clusters; the procedure is stated in Algorithm 5 and will be explained shortly. Once this is done, we can run Algorithm 4 again with the updated clusters {Ck} to refine our model estimates, which is exactly the last step of Algorithm 1. Rationale. The strategy of inferring labels in Algorithm 5 can be derived from the maximum likelihood estimator, under the assumption that the noise vectors {wm,t} follow Gaussian distributions. Note, however, that even in the absence of Gaussian assumptions, Algorithm 5 remains effective in principle. To see this, consider a short trajectory {xt}0 t T generated by model (A(k), W (k)), i.e. xt+1 = A(k)xt + wt where E[wt] = 0, cov(wt) = W (k). Let us define the following loss function L(A, W ) := T log det(W ) + t=0 (xt+1 Axt) W 1(xt+1 Axt). 
(23) With some elementary calculation, we can easily check that for any incorrect label ℓ = k, it holds that E[L(A(ℓ), W (ℓ)) L(A(k), W (k))] > 0, with the proviso that (A(k), W (k)) = (A(ℓ), W (ℓ)) and {xt} are non-degenerate in some sense; in other words, the correct model (A(k), W (k)) achieves the minimal expected loss (which, due to the quadratic form of the loss function, depends solely on the first and second moments of the distributions of {wt}, as well as the initial state x0). This justifies the proposed procedure for inferring unknown labels in Algorithm 5, provided that Tm is large enough and that the estimated LDS models are sufficiently reliable. Learning Mixtures of Linear Dynamical Systems Algorithm 5 Classification 1: Input: short trajectories {Xm}m Mclassification, where Xm = {xm,t}0 t Tm; coarse models { b A(k), c W (k)}1 k K; clusters {Ck}1 k K. 2: for m Mclassification do 3: Infer the label of the m-th trajectory by bkm arg min ℓ n Tm log det(c W (ℓ)) + t=0 (xm,t+1 b A(ℓ)xm,t) (c W (ℓ)) 1(xm,t+1 b A(ℓ)xm,t) o , then add m to cluster Cbkm. 4: end for 5: Output: updated clusters {Ck}1 k K. B. Detailed analysis This section provides detailed, modular theoretical results for the performance of Algorithms 2 5, and concludes with a proof for the main theorems in Section 3.2 (i.e. the performance guarantees of Algorithm 1). Subspace estimation. The following theorem provides upper bounds on the errors of subspaces {Vi, Ui} output by Algorithm 2, assuming sufficient mixing of the short trajectories. Theorem B.1. Consider the model (1) under the assumptions in Sections 2 and 3.1. There exist some universal constants C1, C2, C3 > 0 such that the following holds. Suppose that we run Algorithm 2 with data {Xm}m Msubspace obeying Tsubspace C1 tmix, Ttotal,subspace C2 tmixd log Ttotald δ , where tmix := 1 1 ρ log dκATtotal Then with probability at least 1 δ, Algorithm 2 outputs {Vi, Ui}1 i d satisfying the following: for all 1 k K and 1 i d, max n (Γ(k))i Vi Vi (Γ(k))i 2, (Y (k))i Ui Ui (Y (k))i 2 o 1/2 tmixd Ttotal,subspace log3 Ttotald Our proof (deferred to Appendix C.2) includes a novel perturbation analysis; the resulted error bound (24) has a 1/T 1/4 total,subspace dependence and is gap-free (i.e. independent of the eigenvalue gaps of the ground-truth low-rank matrices, which can be arbitrarily close to zero in the worst case). It is possible to adapt the existing perturbation results in (Kong et al., 2020b;a) to our setting (which we include in Lemma C.3 in the appendix for completeness); however, one of them is dependent on the eigenvalue gaps, while the other one incurs a worse 1/T 1/6 total,subspace dependence. It would be interesting future work to investigate whether a gap-free bound with a 1/T 1/2 total,subspace dependence is possible. Clustering. Our next theorem shows that Algorithm 3 achieves exact clustering of Mclustering, if Tclustering is sufficiently large and subspaces {Vi, Ui} are accurate. The proof is deferred to Appendix C.3. Theorem B.2. Consider the model (1) under the assumptions in Sections 2 and 3.1. There exist universal constants C1, C2, c3 > 0 such that the following holds. Suppose that we run Algorithm 3 with data {Xm}m Mclustering, independent subspaces {Vi, Ui}1 i d and parameters τ, G that satisfy the following: The threshold τ obeys 1/8 < τ/ 2 Γ,Y < 3/8; The short trajectory length Tclustering = NG, where G C1 log |Mclustering| d K 2 Γ,Y + 1 ! 
1 1 ρ log Γ,Y + 2 dκATtotal Learning Mixtures of Linear Dynamical Systems The subspaces {Vi, Ui}1 i d satisfy that, for all 1 i d and 1 k K, max (Γ(k))i Vi Vi (Γ(k))i 2, (Y (k))i Ui Ui (Y (k))i 2 Then with probability at least 1 δ, Algorithm 3 achieves exact clustering: for all m1, m2 Mclustering, Sm1,m2 = 1 if and only if the m1-th and m2-th trajectories are generated by the same model, i.e. they have the same label km1 = km2. Least squares and covariance estimation. The next result controls the model estimation errors of Algorithm 4, under the assumption that every cluster is pure. Theorem B.3. Consider the model (1) under the assumptions in Section 3.1. There exist universal constants C1, C2, C3 > 0 such that the following holds. Let {Ck}1 k K be subsets of Mclustering Mclassification such that for all 1 k K, Ck contains only short trajectories generated by model (A(k), W (k)), namely km = k for all m Ck. Suppose that for all m Mclustering Mclassification, Tm 4, and for all 1 k K, T (k) total := X m Ck Tm C1dκ2 wι, where ι := log Γmax Let b A(k), c W (k) be computed by (22) in Algorithm 4. Then with probability at least 1 δ, one has b A(k) A(k) C2 T (k) total , c W (k) W (k) T (k) total , 1 k K. Our proof (postponed to Appendix C.4) is based on the techniques of (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019), but with two major differences. First, the authors of (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019) consider the setting where ordinary least squares is applied to a single continuous trajectory generated by a single LDS model; this is not the case for our setting, and thus our proof and results are different from theirs. Second, the noise covariance matrix W is assumed to be σ2Id in (Simchowitz et al., 2018; Sarkar & Rakhlin, 2019), while in our case, {W (k)}1 k K are unknown and need to be estimated. Classification. Our last theorem shows that Algorithm 5 correctly classifies all trajectories in Mclassification, as long as the coarse models are sufficiently accurate and the short trajectory lengths are large enough; these conditions are slightly different for Cases 0 and 1 defined in (8). See Appendix C.5 for the proof. Theorem B.4. Consider the model (1) under the assumptions in Section 3.1. There exist universal constants c1, c2, C3 > 0 such that the following holds. Suppose that we run Algorithm 5 with data {Xm}m Mclassification and independent coarse models { b A(k), c W (k)}1 k K satisfying b A(k) A(k) ϵA, c W (k) W (k) ϵW for all k. Then with probability at least 1 δ, Algorithm 5 correctly classifies all trajectories in Mclassification, provided that For Case 0: ϵA c1 A,W Wmin Γmaxκw,cross(d + ι), ϵW Wmin c2 min n 1, A,W p Tm C3 κ2 w + κ6 w,cross 2 A,W ι2, m Mclassification; (26b) For Case 1: ϵA c1 A,W Wmin Γmaxκw,crossκ2 A(d + ι), ϵW Wmin c2 min n 1, A,W p Tm C3 κ2 w + κ6 w,cross 2 A,W ι2 + 1 1 ρ log(2κA), m Mclassification, (26d) where ι := log Ttotal δ is a logarithmic term. Learning Mixtures of Linear Dynamical Systems Proof of Theorems 3.4 and 3.5. Our main theorems are direct implications of the above guarantees for the individual steps. Stage 1: To begin with, according to Theorem B.1, if the condition (9a) on Tsubspace and Ttotal,subspace hold, then the subspaces {Vi, Ui} output by Line 3 of Algorithm 1 are guaranteed to satisfy the error bounds (25) required by Theorem B.2. This together with the condition (9b) on Tclustering ensures exact clustering in Line 4. 
Stage 2: Based on this, if we further know that Ttotal,clustering obeys condition (9b) for Case 0 or (11a) for Case 1, then Theorem B.3 tells us that the coarse model estimation in Line 5 satisfies the error bounds (26a) or (26c) required by Theorem B.4. Together with the assumption (9c) or (11b) on Tclassification, this guarantees exact classification in Line 7 of Algorithm 1, for either Case 0 or 1. At the end, the final model estimation errors (10) follow immediately from Theorem B.3. Note that all the statements above are high-probability guarantees; it suffices to take the union bound over these steps, so that the performance guarantees of Theorems 3.4 and 3.5 hold with probability at least 1 δ. This finishes the proof of our main theorems. Remark B.5. The step of subspace estimation in Algorithm 1 is non-essential and optional; it allows for a smaller Tclustering, but comes at the price of complicating the overall algorithm. For practitioners who prefer a simpler algorithm, they might simply remove this step (i.e. Line 3 of Algorithm 1), and replace the rank-K subspaces {Vi, Ui} with Id (i.e. no dimensionality reduction for the clustering step). The theoretical guarantees continue to hold with minor modification: in Corollary 3.7, one simply needs to remove the conditions on Tsubspace, Ttotal,subspace, and in the condition for Tclustering, replace the factor p K/d (where K is due to dimensionality reduction) with p d/d = 1. This is one example regarding how our modular algorithms and theoretical analysis can be easily modified and adapted to accommodate different situations. C. Proofs for Appendix B This section starts with some preliminaries about linear dynamical systems that will be helpful later. Then, it provides the main proofs for the theorems in Appendix B. C.1. Preliminaries Truncation of autocovariance. Recall the notation Γ(k) = P i=0 A(k)i W (k)(A(k)i) from (3) and (5). We add a subscript t to represent its t-step truncation: i=0 A(k)i W (k)(A(k)i) . (27) Also recall the assumption of exponential stability in Assumption 3.1, namely (A(k))t κAρt. As a result, Γ(k) t is close to Γ(k): 0 Γ(k) Γ(k) t = i=t A(k)i W (k)(A(k)i) = A(k)tΓ(k)(A(k)t) , Γ(k) Γ(k) t Γ(k) A(k)t 2 Γ(k) κ2 Aρ2t Γmaxκ2 Aρ2t. (28) Moreover, let Y (k) t := A(k)Γ(k) t , then Y (k) t is also close to Y (k): Y (k) Y (k) t A(k) Γ(k) Γ(k) t Γmaxκ3 Aρ2t. (29) Independent version of states. Given some mixing time tmix, we define exm,t(tmix) as the (tmix 1)-step approximation of xm,t : exm,t = exm,t(tmix) := i=0 (A(km))iwm,t i 1 N(0, Γ(km) tmix 1), tmix t Tm, 1 m M. (30) Learning Mixtures of Linear Dynamical Systems Since exm,t consists of only the most recent noise vectors, it is independent of the history up to xm,t tmix+1. Our proofs for Stage 1 will rely on this independent version {exm,t} of states {xm,t}; the basic idea is that, for an appropriately chosen tmix, a trajectory of length T can be regarded as a collection of T/tmix independent samples. We will often use the notatione to represent the independent version of other variables as well. Boundedness of states. The following lemma provides upper bounds for { exm,t 2} and { xm,t 2}. This will help to control the effects of mixing errors and model estimation errors in the analyses later. Lemma C.1 (Bounded states). Consider the model (1) under the assumptions in Sections 2 and 3.1. Fix any tmix 3. 
Then with probability at least 1 δ, we have exm,t 2 C0 p Γmax(d + log(Ttotal/δ)) for all 1 m M, tmix t Tm, where C0 > 0 is some universal constant; moreover, for both cases of initial states defined in (8), all states {{xm,t}0 t Tm}1 m M are bounded throughout: Case 0: with probability at least 1 δ, we have xm,t 2 C0 p Γmax(d + log(Ttotal/δ)) for all m, t. Case 1: suppose that tmix 1 1 ρ log( 2κA), and Tm tmix for all m, then with probability at least 1 δ, we have xm,t 2 3C0κA p Γmax(d + log(Ttotal/δ)) for all m, t, and xm,t 2 2C0 p Γmax(d + log(Ttotal/δ)) for all t tmix or t = 0. Proof. First, recall from Corollary 7.3.3 of (Vershynin, 2018) that, if random vector a N(0, Id), then for all u 0, we have P( a 2 2 d + u) 2 exp( cu2). Since exm,t N(0, Γ(km) tmix 1), where Γ(km) tmix 1 Γ(km) Γmax Id, we have P( exm,t 2 Γmax(2 d + u)) 2 exp( cu2). Taking the union bound, we have P there exists m, t such that exm,t 2 p t=tmix P exm,t 2 p d + u) 2Ttotal exp( cu2) δ, where the last inequality holds if we pick u q 1 c log 2Ttotal δ . This finishes the proof of the first claim in the lemma. Next, we prove the boundedness of { xm,t 2}. Case 0: It is easy to check that xm,t N(0, Γ(km) t ), where Γ(km) t Γ(km) Γmax Id. The boundedness of { xm,t 2} can be proved by a similar argument as before, which we omit for brevity. Case 1: Define ξm,t := xm,t (A(km))txm,0 N(0, Γ(km) t ). By a similar argument as before, we have with probability at least 1 δ, ξm,t 2 C0 p Γmax(d + log(Ttotal/δ)) for all m, t. Morever, for any t tmix 1 1 ρ log( 2κA), we have (A(km))t κAρt 1/2. With this in place, we have xm+1,0 = xm,Tm = (A(km))Tmxm,0 + ξm,Tm, and thus xm+1,0 (A(km))Tm xm,0 2 + ξm,Tm 2 1 2 xm,0 2 + C0 Γmax(d + log Ttotal Recall the assumption that x1,0 = 0; by induction, we have xm,0 2 2C0 p Γmax(d + log(Ttotal/δ)) for all 1 m M. Now, we have for all m, t, xm,t 2 (A(km))txm,0 2 + ξm,t 2 κA xm,0 2 + C0 Γmax(d + log Ttotal Γmax(d + log Ttotal moreover, for t tmix, since (A(km))t 1/2, we obtain a better bound xm,t 2 (A(km))txm,0 2 + ξm,t 2 2 xm,0 2 + C0 Γmax(d + log Ttotal Γmax(d + log Ttotal This shows the boundedness of { xm,t 2} and completes our proof of the lemma. Learning Mixtures of Linear Dynamical Systems C.2. Proof of Theorem B.1 Theorem B.1 is an immediate consequence of Lemmas C.2 and C.3 below. The former shows the concentration of c Hi, b Gi around the targeted low-rank matrices Hi, Gi, while the latter is a result of perturbation analysis. Lemma C.2. Under the setting of Theorem B.1, with probability at least 1 δ, we have for all 1 i d, max n c Hi Hi , b Gi Gi o Γ2 max tmixd Ttotal,subspace log3 Ttotald Lemma C.3. Consider the matrix M = PK k=1 p(k)y(k)y(k) Rd d, where 0 < p(k) < 1, PK k=1 p(k) = 1, and y(k) Rd. Let M be symmetric and satisfy M M ϵ, and U Rd K be the top-K eigenspace of M. Then we have K X k=1 p(k) y(k) UU y(k) 2 2 2Kϵ, (31) and for all 1 k K, it holds that y(k) UU y(k) 2 min 1/2 , 2ϵ λmin(M) y(k) 2, p(k) y(k) 2 For our main analyses in Sections 3.2 and B, we choose the first term on the right-hand side of (32). C.2.1. PROOF OF LEMMA C.2 We first analyze the idealized case with i.i.d. samples; then we make a connection between this i.i.d. case and the actual case of mixed LDSs, by utilizing the mixing property of linear dynamical systems. We prove the result of Lemma C.2 for b Gi Gi only, since the analysis for c Hi Hi is mostly the same (and simpler in fact). Step 1: the idealized i.i.d. case. With some abuse of notation, suppose that for all 1 m M, we have for some km {1, . . . 
, K}, xm,t, zm,t i.i.d. N(0, eΓ(km)), wm,t, vm,t i.i.d. N(0, W (km)), xm,t = A(km)xm,t + wm,t, zm,t = A(km)zm,t + vm,t, 1 t N, where for all 1 k K, it holds that W (k), eΓ(k) Γ(k) Γmax Id. Notice that cov(xm,t ) = A(km)eΓ(km)(A(km)) + W (km) A(km)Γ(km)(A(km)) +W (km) = Γ(km) Γmax Id. Consider the i.i.d. version of matrix b Gi and its expectation Gi defined as follows: b Gi := 1 MN izm,t , Gi = k=1 p(k)( e Y (k))i( e Y (k)) i , where ( e Y (k))i is the transpose of the i-th row of e Y (k) := A(k)eΓ(k). For this i.i.d. setting, we claim that, if the i.i.d. sample size MN satisfies MN d log(MNd/δ), then with probability at least 1 δ, b Gi Gi Γ2 max d MN log3 MNd δ , 1 i d. (33) This claim can be proved by a standard covering argument with truncation; we will provide a proof later for completeness. Step 2: back to the actual case of mixed LDSs. Now we turn to the analysis of b Gi defined in (14) versus its expectation Gi defined in (15b), for the mixed LDSs setting. We first show that b Gi can be writte as a weighted average of some matrices, each of which can be further decomposed into an i.i.d. part (as in Step 1) plus a negligible mixing error term. Then we analyze each term in the decomposition, and finally put pieces together to show that b Gi Gi. Learning Mixtures of Linear Dynamical Systems Step 2.1: decomposition of the index set Ω1 Ω2. Recall the definition of index sets Ω1, Ω2 in Algorithm 2. Denote the first index of Ω1 (resp. Ω2) as τ1 + 1 (resp. τ2 + 1), and let := τ2 τ1 be their distance. Also denote N := |Ω1| = |Ω2| Tsubspace. For any t Ω1 and 1 j N, define sj(t) := Cycle(t + + j; Ω2) = ( t + + j if t + + j τ2 + N, t + + j N otherwise, where Cycle(i; Ω) represents the cyclic indexing of value i on set Ω. Then we have Ω1 Ω2 = n (t1, t2), t1 Ω1, t2 Ω2 o = N j=1 n t, sj(t) , t Ω1 o . We further define Sτ := {τ1 + τ + f tmix : f 0, τ + f tmix N}, 1 τ tmix, so that Ω1 = tmix τ=1Sτ. Notice that for each τ, the elements of Sτ are at least tmix far apart. Putting together, we have Ω1 Ω2 = N j=1 tmix τ=1 n t, sj(t) , t Sτ o . (34) Step 2.2: decomposition of b Gi. In the remaining proof, we denote xm,t := xm,t+1 for notational consistency. Using the decomposition (34) of Ω1 Ω2, we can rewrite b Gi defined in (14) as a weighted average of Ntmix matrices: b Gi = 1 |Msubspace| m Msubspace (t1,t2) Ω1 Ω2 (x m,t1)i xm,t1 (x m,t2)i xm,t2 = N 2 Fi,j,τ, Fi,j,τ := 1 |Msubspace| |Sτ| m Msubspace t Sτ (xm,t )i xm,t (x m,sj(t))i xm,sj(t) . (35) Recalling the definition of exm,t in (30) and Γ(k) t in (27), we have xm,t = exm,t + (A(km))tmix 1xm,t tmix+1 | {z } =:δm,t = exm,t + δm,t, where δm,t 2 (A(km))tmix 1 xm,t tmix+1 2 κAρtmix 1 xm,t tmix+1 2, (36) and exm,t N(0, Γ(k) tmix 1) is independent of δm,t. Moreover, xm,t = A(km)xm,t + wm,t = ex m,t + A(km)δm,t, where ex m,t := A(km)exm,t + wm,t. We can rewrite xm,sj(t) = exm,sj(t) + δm,sj(t) and x m,sj(t) = ex m,sj(t) + A(km)δm,sj(t) in a similar manner. Putting this back to (35), one has Fi,j,τ = 1 |Msubspace| |Sτ| m Msubspace t Sτ (xm,t )i xm,t (x m,sj(t))i xm,sj(t) = 1 |Msubspace| |Sτ| m Msubspace (ex m,t + A(km)δm,t)i (exm,t + δm,t) (ex m,sj(t) + A(km)δm,sj(t))i (exm,sj(t) + δm,sj(t)) = 1 |Msubspace| |Sτ| m Msubspace t Sτ (ex m,t)i exm,t (ex m,sj(t))i exm,sj(t) =: e Fi,j,τ + i,j,τ, (37) where i,j,τ contains all the {δm,t} terms in the expansion. The key observation here is that, by our definition of index set Sτ, the {exm,t} terms in e Fi,j,τ are independent, and thus we can utilize our earlier analysis of the i.i.d. case in Step 1 to study e Fi,j,τ. 
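The index bookkeeping of Step 2.1 above can be pictured with a few lines of code (a toy sketch with hypothetical values of τ1, N, and t_mix): each subset S_τ keeps one index out of every t_mix consecutive indices of Ω1, so that states restricted to a single subset are at least t_mix steps apart and hence nearly independent.

import numpy as np

def split_by_mixing_time(tau1, N, t_mix):
    # S_tau = {tau1 + tau + f * t_mix : f >= 0, tau + f * t_mix <= N}, for tau = 1, ..., t_mix,
    # so that Omega_1 = {tau1 + 1, ..., tau1 + N} is the disjoint union of S_1, ..., S_{t_mix}.
    return [np.arange(tau1 + tau, tau1 + N + 1, t_mix) for tau in range(1, t_mix + 1)]

# Toy example: Omega_1 = {11, ..., 30} with t_mix = 4.
S = split_by_mixing_time(tau1=10, N=20, t_mix=4)
# S[0] = [11, 15, 19, 23, 27], S[1] = [12, 16, 20, 24, 28], and so on;
# within each subset, consecutive indices are exactly t_mix apart.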
Learning Mixtures of Linear Dynamical Systems Step 2.3: analysis for each term of the decomposition. Towards showing b Gi Gi, we prove in the following that e Fi,j,τ concentrates around its expectation e Gi, which is close to Gi; moreover, the error term i,j,τ becomes exponentially small as tmix grows. More specifically, we have the following: Recall the notation Y (k) tmix 1 = A(k)Γ(k) tmix 1. It holds that E[ e Fi,j,τ] = e Gi := k=1 p(k)(Y (k) tmix 1)i(Y (k) tmix 1)i . According to our result (33) for the i.i.d. case, for fixed j, τ, we have with probability at least 1 δ, 1 i d, e Fi,j,τ e Gi Γ2 max d |Msubspace| |Sτ| log3 |Msubspace| |Sτ|d tmixd Ttotal,subspace log3 Ttotal,subspaced Recall from (29) that Y (k) Y (k) tmix 1 Γmaxκ3 Aρ2(tmix 1). Therefore, (Y (k))i(Y (k))i (Y (k) tmix 1)i(Y (k) tmix 1)i (Y (k))i 2 + (Y (k) tmix 1)i 2 (Y (k))i (Y (k) tmix 1)i 2 2 Y (k) + Y (k) Y (k) tmix 1 Y (k) Y (k) tmix 1 2ΓmaxκA + Γmaxκ3 Aρ2(tmix 1) Γmaxκ3 Aρ2(tmix 1) = 2 + κ2 Aρ2(tmix 1) Γ2 maxκ4 Aρ2(tmix 1) 3Γ2 maxκ4 Aρ2(tmix 1), where we use the mild assumption that tmix 1+ log κA 1 ρ , and the fact that Y (k) , Y (k) tmix 1 ΓmaxκA. Consequently, k=1 p(k) (Y (k))i(Y (k))i (Y (k) tmix 1)i(Y (k) tmix 1)i 3Γ2 maxκ4 Aρ2(tmix 1). (39) By Lemma C.1, if tmix 1 1 ρ log(2κA), then we have with probability at least 1 δ, all the xm,t s and exm,t s involved in the definition of i,j,τ in (37) have ℓ2 norm bounded by Γmaxpoly(d, κA, log(Ttotal/δ)). This together with the upper bound on δm,t 2 in (36) implies that for all i, j, τ, it holds that i,j,τ Γ2 max poly d, κA, log Ttotal ρtmix 1. (40) Step 2.4: putting pieces together. With (38), (39) and (40) in place and taking the union bound, we have with probability at least 1 δ, for all 1 i d, N 2 Fi,j,τ Gi max j,τ e Fi,j,τ e Gi + e Gi Gi + max j,τ i,j,τ tmixd Ttotal,subspace log3 Ttotal,subspaced δ + Γ2 maxκ4 Aρ2(tmix 1) + Γ2 max poly d, κA, log d Ttotal tmixd Ttotal,subspace log3 Ttotald where the last inequality holds if tmix 1 1 ρ log dκATtotal δ . This finishes the proof of Lemma C.2. Learning Mixtures of Linear Dynamical Systems Proof of (33). Define the truncating operator Trunc(x; D) := x 1(|x| D), x R, D 0. Consider the following truncated version of b Gi: b GTrunc i := 1 MN (the truncating level D0 will be specified later), and let ETrunc i := E b GTrunc i be its expectation. In the following, we first show that b GTrunc i concentrates around ETrunc i , and then prove that ETrunc i Gi. By a standard covering argument, we have b GTrunc i ETrunc i = sup u,v Sd 1 u b GTrunc i ETrunc i v 4 sup u,v N1/8 u b GTrunc i ETrunc i v, where N1/8 denotes the 1/8-covering of the unit sphere Sd 1 and has cardinality |N1/8| 32d. For fixed u, v N1/8, one has u b GTrunc i v = 1 MN t=1 Trunc xm,t i; D0 u xm,t Trunc zm,t i; D0 v zm,t, where (cf. Chapter 2 of (Vershynin, 2018) for the definitions of subgaussian norm ψ2 and subexponential norm ψ1) Trunc xm,t i; D0 , Trunc zm,t i; D0 D0, u xm,t ψ2, v zm,t ψ2 p Hence Trunc xm,t i; D0 u xm,t Trunc zm,t i; D0 v zm,t ψ1 D2 0Γmax, and by Bernstein s inequality (Corollary 2.8.3 of (Vershynin, 2018)), we have P u b GTrunc i ETrunc i v τ 2 exp c0MN τ D2 0Γmax for all 0 τ D2 0Γmax. 
Taking the union bound over u, v N1/8, we have with probability at least 1 δ/2, b GTrunc i ETrunc i D2 0Γmax δ MN , provided that MN d + log 1 Gi ETrunc i = k=1 p(k)Ext,zt N(0,eΓ(k)) (xt )i(zt )i Trunc (xt )i Trunc (zt )i xtzt , where for each k, Ext,zt N(0,eΓ(k)) (xt )i(zt )i Trunc (xt )i Trunc (zt )i xtzt = sup u,v Sd 1 Ext,zt N(0,eΓ(k)) (xt )i(zt )iu xtv zt 1 1 |(xt )i| D0, |(zt )i| D0 Learning Mixtures of Linear Dynamical Systems sup u,v Sd 1 E (xt )i(zt )iu xtv zt 2r E 1 1 |(xt )i| D0, |(zt )i| D0 2 P(|N(0, Γmax)| > D0) Γ2 max exp c1 D2 0 Γmax Finally, notice that if the truncating level D0 is sufficiently large, then we have b Gi = b GTrunc i with high probability. More formally, we have shown that for fixed 1 i d, if MN d + log(1/δ), then P b Gi Gi D2 0Γmax δ MN + Γ2 max exp c1 D2 0 Γmax m,t P |(xm,t )i| > D0 or |(zm,t )i| > D0 . If we pick the truncating level D0 p Γmax log(MN/δ), then it is easy to check that with probability at least 1 δ, b Gi Gi D2 0Γmax δ MN Γ2 max d MN log3 MN Taking the union bound over 1 i d finishes our proof of (33). C.2.2. PROOF OF LEMMA C.3 Define := M M , and denote the eigendecomposition of M and M as M = U Λ U and M = UΛU + U Λ U , where diagonal matrix Λ (resp. Λ ) contains the top-K eigenvalues of M (resp. M ). Then we have Λ = U MU = U M U + U U = k=1 p(k)U y(k)y(k) U + U U, Λ = U M U = k=1 p(k)U y(k)y(k) U . Substracting these two equations gives k=1 p(k) U y(k)y(k) U U y(k)y(k) U = Λ Λ + U U. Taking the trace of both sides, we get k=1 p(k) U y(k) 2 2 U y(k) 2 2 = Tr Λ Λ + Tr U U . On the left-hand side, U y(k) 2 2 U y(k) 2 2 = y(k) 2 2 U y(k) 2 2 = y(k) UU y(k) 2 2 0, while on the right-hand side, λk(Λ ) λk(Λ) (i) K Kϵ, Tr(U U) Tr(U U) = Tr(IK) Kϵ, where (i) follows from Weyl s inequality. Putting things together, we have proved (31), which immediately leads to the first upper bound in (32). The second upper bound in (32) follows from a simple application of Davis-Kahan s sin Θ theorem (Davis & Kahan, 1970), and the third term is due to Lemma A.11 of (Kong et al., 2020b); we skip the details for brevity. Learning Mixtures of Linear Dynamical Systems C.3. Proof of Theorem B.2 Our proof follows the three steps below: 1. Consider the idealized i.i.d. case, and characterize the expectations and variances of the testing statistics computed by Algorithm 3; 2. Go back to the actual case of mixed LDSs, and analyze one copy of statΓ,g or stat Y,g defined in (18) for some fixed 1 g G, by decomposing it into an i.i.d. part plus a negligible mixing error term; 3. Analyze median{statΓ,g, 1 g G} and median{stat Y,g, 1 g G}, and prove the correct testing for each pair of trajectories, which implies that Algorithm 3 achieves exact clustering. Step 1: the idealized i.i.d. case. Recall the definition of stat Y in (19) when we first introduce our method for clustering. For notational convenience, we drop the subscript in stat Y , and replace xt+1, zt+1 with xt , zt ; then, with some elementary linear algebra, (19) can be rewritten as (xt )ixt (zt )izt , Ui 1 (xt )ixt (zt )izt E t Ω1 vec xt(xt ) zt(zt ) , U 1 t Ω2 vec xt(xt ) zt(zt ) E , where we define a large orthonormal matrix U1 0 . . . 0 0 U2 ... ... ... ... ... 0 0 . . . 0 Ud Rd2 d K, (41) In this step, we consider the idealized i.i.d. case: t Ω1 Ω2, xt i.i.d. N(0, eΓ(k)), wt i.i.d. N(0, f W (k)), xt = e A(k)xt + wt, zt i.i.d. N(0, eΓ(l)), vt i.i.d. N(0, f W (l)), zt = e A(l)zt + vt, where eΓ(k), eΓ(l), f W (k), f W (l) are d d covariance matrices, and e A(k), e A(l) are d d state transition matrix. 
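Before turning to the moments of stat, it may help to see the vectorized form above in code. The following sketch (our own, with assumed NumPy conventions) assembles the block-diagonal matrix U from (41) and evaluates the statistic as an inner product of projected, vectorized outer-product averages; with column-stacking vectorization, the i-th length-d block of vec(x_t (x'_t)^T) is exactly (x'_t)_i x_t, so this agrees with the coordinate-wise form in (18b)/(19) up to floating-point error.

import numpy as np
from scipy.linalg import block_diag

def big_U(U_list):
    # The d^2 x dK block-diagonal matrix U in (41), built from the d subspace bases U_1, ..., U_d.
    return block_diag(*U_list)

def vec(M):
    # Column-stacking vectorization: the i-th length-d block of vec(x x'^T) equals (x')_i * x.
    return M.flatten(order="F")

def stat_Y_vectorized(x, z, U_list):
    # x, z: trajectories of shape (T+1, d); U_list: the d estimated bases, each d x K.
    T = x.shape[0] - 1
    N = T // 4
    Om1, Om2 = np.arange(N, 2 * N), np.arange(3 * N, 4 * N)
    U = big_U(U_list)
    avg = lambda Om: np.mean([vec(np.outer(x[t], x[t + 1]) - np.outer(z[t], z[t + 1])) for t in Om], axis=0)
    return (U.T @ avg(Om1)) @ (U.T @ avg(Om2))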
Our goal is to characterize the expectation and variance of stat in this i.i.d. case. Before we present the results, we need some additional notation. First, let {ei}1 i d be the canonical basis of Rd, and define e Y (k) := e A(k)eΓ(k), Σ(k) := e A(k) Id (eΓ(k))1/2 (eΓ(k))1/2 Id2 + P (eΓ(k))1/2 (eΓ(k))1/2 ( e A(k)) Id , where P Rd2 d2 is a symmetric permutation matrix, whose (i, j)-th block is eje i Rd d, 1 i, j d. Let e Y (l), Σ(ℓ) be defined similarly, with e A(k), eΓ(k) replaced by e A(l), eΓ(l). Moreover, define µk,l := U vec ( e Y (k) e Y (l)) Rd K, (42a) Σk,l := U Σ(k) + f W (k) eΓ(k) + Σ(ℓ) + f W (l) eΓ(l) U Rd K d K. (42b) Now we are ready to present our results for the i.i.d. case. The first lemma below gives a precise characterization of E[stat] and var(stat), in terms of µk,l and Σk,l; the second lemma provides some upper and lower bounds, which will be handy for our later analyses. Learning Mixtures of Linear Dynamical Systems Lemma C.4. Denote N = min{|Ω1|, |Ω2|}. For the i.i.d. case just described, one has E[stat] = µk,l 2 2, var(stat) 1 N 2 Tr(Σ2 k,l) + 2 N µ k,lΣk,lµk,l, and the inequality becomes an equality if |Ω1| = |Ω2| = N. Lemma C.5. Consider the same setting of Lemma C.4. Furthermore, suppose that eΓ(k), eΓ(l) Γmax Id, f W (k), f W (l) Wmax Id, e A(k) , e A(l) κA for some 0 < Wmax Γmax and κA 1. Then the following holds true. (Upper bound on expectation) It holds that E[stat] = µk,l 2 2 e Y (k) e Y (l) 2 F. (43) (Lower bound on expectation) If e Y (k) = e Y (l) and subspaces {Ui}1 i d satisfy 1 i d, max n ( e Y (k))i Ui Ui ( e Y (k))i 2, ( e Y (l))i Ui Ui ( e Y (l))i 2 o ϵ, (44) for some ϵ 0, then we have E[stat] = µk,l 2 2 e Y (k) e Y (l) 2 F 4ϵ ( e Y (k))i + ( e Y (l))i 2. (Upper bound on variance) The matrix Σk,l is symmetric and satisfies 0 Σk,l 6Γ2 maxκ2 AId K; this, together with the earlier upper bound (43) on µk,l 2 2, implies that var(stat) 1 N 2 Tr(Σ2 k,l) + 2 N µ k,lΣk,lµk,l Γ2 maxκ2 A N 2 d K + Γ2 maxκ2 A N e Y (k) e Y (l) 2 F. Step 2: one copy of stat Y,g, statΓ,g for a fixed g. Now we turn back to the mixed LDSs setting and prove Theorem B.2. Let us focus on the testing of one pair of short trajectories {xm1,t}, {xm2,t} for some m1, m2 Mclustering, m1 = m2. For notational consistency, in this proof we rewrite these two trajectories as {xt} and {zt}, their labels km1, km2 as k, ℓ, and the trajectory length Tclustering as T, respectively. Also denote xt := xt+1 and zt := zt+1. Recall the definition of {stat Y,g}1 g G in (18b); for now, we consider one specific element and ignore the subscript g. Recalling the definition of U Rd2 d K in (41), we have t Ω1 Ui (xt )ixt (zt )izt , 1 t Ω2 Ui (xt )ixt (zt )izt E t Ω1 vec xt(xt ) zt(zt ) , U 1 t Ω2 vec xt(xt ) zt(zt ) E , where N = T/4G = |Ω1| = |Ω2|. In the following, we show how to decompose stat Y into an i.i.d. term plus a negligible mixing error term, and then analyze each component of this decomposition; finally, we put pieces together to give a characterization of stat Y , or {stat Y,g}1 g G when we put the subscript g back in at the end of this step. Learning Mixtures of Linear Dynamical Systems Step 2.1: decomposition of stat Y . Define S1,τ1 := {t1 + τ1 + f tmix : f 0, τ1 + f tmix |Ω1|}, where t1 + 1 is the first index of Ω1, and the mixing time tmix will be specified later; define S2,τ2 similarly. 
Note that for each τ1, the elements of S1,τ1 are at least tmix far apart; moreover, we have Ω1 = tmix τ1=1S1,τ1, Ω2 = tmix τ2=1S2,τ2, and thus t Ω1 vec xt(xt ) zt(zt ) = N 1 |S1,τ1| t S1,τ1 vec xt(xt ) zt(zt ) , t Ω2 vec xt(xt ) zt(zt ) = N 1 |S2,τ2| t S2,τ2 vec xt(xt ) zt(zt ) . Therefore, we can rewrite stat Y as a weighted average τ2=1 wτ1,τ2 statτ1,τ2 Y , where wτ1,τ2 := |S1,τ1||S2,τ2| τ2=1 wτ1,τ2 = 1, and statτ1,τ2 Y := D U 1 |S1,τ1| t S1,τ1 vec xt(xt ) zt(zt ) , U 1 |S2,τ2| t S2,τ2 vec xt(xt ) zt(zt ) E . (45) We can further decompose statτ1,τ2 Y into an i.i.d. term plus a small error term. To do this, recalling the definition of exm,t in (30) and dropping the subscript m, we have xt = A(k)tmix 1xt tmix+1 | {z } =:δx,t + ext = δx,t + ext, xt = A(k)xt + wt = A(k)δx,t + (A(k)ext + wt) | {z } =:ex t = A(k)δx,t + ex t, where ext N(0, Γ(k) tmix 1). Similarly, we rewrite zt = δz,t + ezt, zt = A(ℓ)δz,t + ez t. Plugging these into the right-hand side of (45) and expanding it, one has statτ1,τ2 Y = g stat τ1,τ2 Y + τ1,τ2 Y , where g stat τ1,τ2 Y := D U 1 |S1,τ1| t S1,τ1 vec ext(ex t) ezt(ez t) , U 1 |S2,τ2| t S2,τ2 vec ext(ex t) ezt(ez t) E , and τ1,τ2 Y involves {δx,t, δz,t} terms. Step 2.2: analysis of each component. First, notice that the {ext, ezt} terms in the definition of g stat τ1,τ2 Y are independent; this suggests that we can characterize g stat τ1,τ2 Y by applying our earlier analysis for the i.i.d. case in Step 1. Second, τ1,τ2 Y involves {δx,t, δz,t} terms, which in turn involve A(k)tmix 1, A(ℓ)tmix 1 and thus will be exponentially small as tmix increases, thanks to Assumption 3.1. More formally, we have the following: Applying Lemmas C.4 and C.5 with (eΓ(k), eΓ(l)) = (Γ(k) tmix 1, Γ(ℓ) tmix 1), (f W (k), f W (l)) = (W (k), W (ℓ)) and ( e A(k), e A(l)) = (A(k), A(ℓ)), we have E g stat τ1,τ2 Y = U vec((Y (k) tmix 1 Y (ℓ) tmix 1) ) 2 2, var g stat τ1,τ2 Y Γ2 maxκ2 A e N 2 d K + Γ2 maxκ2 A e N Y (k) tmix 1 Y (ℓ) tmix 1 2 F, where e N := min{|S1,τ1|, |S2,τ2|} N/tmix. Learning Mixtures of Linear Dynamical Systems By Lemma C.1, with probability at least 1 δ, all {xt, ext, zt, ezt} terms involved in the definition of τ1,τ2 Y has ℓ2 norm bounded by Γmaxpoly(d, κA, log(Ttotal/δ)). This implies that τ1,τ2 Y Γ2 max poly d, κA, log Ttotal Step 2.3: putting pieces together. Putting the subscript g back in, we have already shown that τ2=1 wτ1,τ2 statτ1,τ2 Y,g τ2=1 wτ1,τ2 g stat τ1,τ2 Y,g | {z } =:g stat Y,g τ2=1 wτ1,τ2 τ1,τ2 Y,g | {z } =: Y,g = g stat Y,g + Y,g, E g stat Y,g = U vec((Y (k) tmix 1 Y (ℓ) tmix 1) ) 2 2, var g stat Y,g max τ1,τ2 var(g stat τ1,τ2 Y,g ) Γ2 maxκ2 A e N 2 d K + Γ2 maxκ2 A e N Y (k) tmix 1 Y (ℓ) tmix 1 2 F, Y,g max τ1,τ2 | τ1,τ2 Y,g | Γ2 max poly d, κA, log Ttotal So far in Step 2, we have focused on the analysis of stat Y,g. We can easily adapt the argument to study statΓ,g as well; the major difference is that, in Step 2.2, we should apply Lemmas C.4 and C.5 with (eΓ(k), eΓ(l)) = (Γ(k) tmix 1, Γ(ℓ) tmix 1), (f W (k), f W (l)) = (0, 0), ( e A(k), e A(l)) = (Id, Id), and subspaces {Ui} replaced by {Vi} instead. The final result is that, for all 1 g G, statΓ,g = g statΓ,g + Γ,g, E g statΓ,g = V vec((Γ(k) tmix 1 Γ(ℓ) tmix 1) ) 2 2, var g statΓ,g Γ2 maxκ2 A e N 2 d K + Γ2 maxκ2 A e N Γ(k) tmix 1 Γ(ℓ) tmix 1 2 F, Γ,g Γ2 max poly d, κA, log Ttotal Step 3: analysis of the medians, and final results. Recall the following standard result on the concentration of medians (or median-of-means in general; see Theorem 2 of (Lugosi & Mendelson, 2019)). 
Proposition C.6 (Concentration of medians). Let X1, . . . XG be i.i.d. random variables with mean µ and variance σ2. Then we have |median{Xg, 1 g G} µ| 2σ with probability at least 1 e c0G for some constant c0. Notice that by construction, {g stat Y,g}1 g G are i.i.d. (and so are {g statΓ,g}1 g G). Applying Proposition C.6 to our case, we know that if G log(1/δ), then with probability at least 1 δ, the following holds: If k = ℓ, i.e. the two trajectories are generated by the same LDS model, then median{statΓ,g, 1 g G} + median{stat Y,g, 1 g G} median{g statΓ,g} + median{g stat Y,g} + max g | Γ,g| + max g | Y,g| var(g statΓ,g) + 2 q var(g stat Y,g) + Γ2 max poly d, κA, log Ttotal d K e N + Γ2 max poly d, κA, log Ttotal ρtmix 1; (46) Learning Mixtures of Linear Dynamical Systems On the other hand, if k = ℓ, then median{statΓ,g, 1 g G} + median{stat Y,g, 1 g G} median{g statΓ,g} + median{g stat Y,g} max g | Γ,g| + max g | Y,g| E[g statΓ,g] + E[g stat Y,g] 2 q var(g statΓ,g) + q var(g stat Y,g) Γ2 max poly d, κA, log Ttotal V vec((Γ(k) tmix 1 Γ(ℓ) tmix 1) ) 2 2 + U vec((Y (k) tmix 1 Y (ℓ) tmix 1) ) 2 Γ2maxκ2 A e N Γ(k) tmix 1 Γ(ℓ) tmix 1 F + Y (k) tmix 1 Y (ℓ) tmix 1 F Γ2 max poly d, κA, log Ttotal ρtmix 1. (47) We need to further simplify the result (47) for the k = ℓcase. According to (28) and (29), we have Y (k) Y (k) tmix 1 F Γmax dκ3 Aρ2(tmix 1) =: ϵmix, Γ(k) Γ(k) tmix 1 F Γmax dκ2 Aρ2(tmix 1) ϵmix, which implies that Γ(k) tmix 1 Γ(ℓ) tmix 1 F Γ(k) Γ(ℓ) F + 2ϵmix, V vec (Γ(k) tmix 1 Γ(ℓ) tmix 1) 2 2 max n V vec (Γ(k) Γ(ℓ)) 2 2ϵmix, 0 o2 V vec (Γ(k) Γ(ℓ)) 2 2 4ϵmix V vec((Γ(k) Γ(ℓ)) ) 2 V vec (Γ(k) Γ(ℓ)) 2 2 4ϵmix Γ(k) Γ(ℓ) F. We can do a similar analysis for Y (k) tmix 1 Y (ℓ) tmix 1 F and U vec((Y (k) tmix 1 Y (ℓ) tmix 1) ) 2 2. Moreover, we claim (and prove later) that if the subspaces {Vi, Ui} satisfy the condition (25) in the theorem, then V vec((Γ(k) Γ(ℓ)) ) 2 2 + U vec((Y (k) Y (ℓ)) ) 2 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F . (48) Putting these back to (47), we have for the k = ℓcase, median{statΓ,g, 1 g G} + median{stat Y,g, 1 g G} V vec (Γ(k) Γ(ℓ)) 2 2 + U vec (Y (k) Y (ℓ)) 2 4ϵmix Γ(k) Γ(ℓ) F + Y (k) Y (ℓ) F Γ2maxκ2 A e N Γ(k) Γ(ℓ) F + Y (k) Y (ℓ) F + 4ϵmix Γ2 max poly d, κA, log Ttotal Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F 0.01 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F Γ2maxκ2 A e N Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F Γ2 max poly d, κA, log Ttotal Learning Mixtures of Linear Dynamical Systems (ii) 0.48 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F Γ2maxκ2 A e N Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F where (i) holds if tmix 1 1 ρ log(( Γmax Γ,Y + 2)dκA) so that ϵmix 10 3 Γ,Y , and (ii) holds if tmix 1 1 ρ log(( Γmax 2) dκATtotal δ ) so that Γ2 max poly(d, κA, log Ttotal δ ) ρtmix 1 10 3 2 Γ,Y . Putting (46) and (49) together, we can finally check that, if it further holds that e N N/tmix Γ2 maxκ2 A d K 2 Γ,Y + 1, then we have with probability at least 1 δ, median{statΓ,g} + median{stat Y,g} 8 2 Γ,Y if k = ℓ, 3 8 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F 3 8 2 Γ,Y if k = ℓ. This together with our choice of testing threshold τ [ 2 Γ,Y /8, 3 2 Γ,Y /8] in Algorithm 3 implies correct testing of the two trajectories {xt}, {zt}. Finally, taking the union bound over all pairs of trajectories in Mclustering leads to correct pairwise testing, which in turn implies exact clustering of Mclustering; this completes our proof of Theorem (B.2). C.3.1. PROOF OF LEMMA C.4 We first assume |Ω1| = |Ω2| = N for simplicity. 
Recall that stat = D U 1 t Ω1 vec xt(xt ) zt(zt ) t Ω2 vec xt(xt ) zt(zt ) where a, b Rd K are i.i.d., and E[a] = E h U vec xt(xt ) zt(zt ) i = U vec ( e Y (k) e Y (l)) = µk,l. Therefore, we have the expectation E[stat] = E[a], E[b] = µk,l 2 2. It remains to compute the variance var(stat) = E[stat2] E[stat]2, where E[stat2] = E (a b)2 = Tr E[bb ]E[aa ] = Tr E[aa ]2 . (50) Here E[aa ] = E[a]E[a] + cov(a), and since a is an empirical average of N i.i.d. random vectors, we have N cov(f), where f := U vec xt(xt ) zt(zt ) Rd K. For now, we claim that cov(f) = U (Σ(k) + f W (k) eΓ(k) + Σ(ℓ) + f W (l) eΓ(l))U = Σk,l, (51) which will be proved soon later. Putting these back to (50), one has E[aa ] = cov(a) + E[a]E[a] = 1 N Σk,l + µk,lµ k,l, E[aa ]2 = 1 N 2 Σ2 k,l + 1 N (Σk,lµk,lµ k,l + µk,lµ k,lΣk,l) + µk,l 2 2µk,lµ k,l, var(stat) = E[stat2] E[stat]2 = Tr(E[aa ]2) µk,l 4 2 = 1 N 2 Tr(Σ2 k,l) + 2 N µ k,lΣk,lµk,l, which completes our calculation of the variance for the case of |Ω1| = |Ω2| = N. For the more general case where (without loss of generality) |Ω1| = N |Ω2|, we simply need to modify the equation (50) to an inequality E[stat2] = Tr(E[bb ]E[aa ]) Tr(E[aa ]2), and the remaining analysis is the same. This finishes the proof of Lemma C.4. Learning Mixtures of Linear Dynamical Systems Proof of (51). For notational simplicity, we drop the subscripts t in the definition of f. Then we have f = U vec x(x ) z(z ) , and hence cov(f) = U cov vec x(x ) z(z ) U = U cov vec(x(x ) ) + cov vec(z(z ) ) U, (52) where the second equality uses the independence between (x, x ) and (z, z ). Let us focus on cov vec(x(x ) ) . For notational simplicity, for now we rewrite e A(k), f W (k), eΓ(k) as A, W , Γ. Define y := Γ 1/2x N(0, Id), and recall that x = Ax + w where w N(0, W ). Using the fact that vec(ABC) = (C A)vec(B) for any matrices of compatible shape, we have g := vec x x = vec(xx A ) + vec(xw ) = (A Id)vec(xx ) + vec(xw ) = (A Id)vec(Γ1/2yy Γ1/2) + vec(xw ) = (A Id)(Γ1/2 Γ1/2)vec(yy ) | {z } =:g1 + vec(xw ) | {z } =:g2 Note that E[g2] = 0, E[g] = E[g1], and E[g1g 2 ] = 0. Hence cov(g) = E[gg ] E[g]E[g] = cov(g1) + E[g2g 2 ]. (53) For the second term on the right-hand side, we have g2 = vec(xw ) = w1x ... wdx , E[g2g 2 ] = E h [wiwjxx ]1 i,j d i = W Γ; as for the first term, we claim (and prove soon later) that cov vec(yy ) = Id2 + P , (54) which implies that cov(g1) = cov (A Id)(Γ1/2 Γ1/2)vec(yy ) = (A Id)(Γ1/2 Γ1/2)cov vec(yy ) (Γ1/2 Γ1/2)(A Id) = (A Id)(Γ1/2 Γ1/2)(Id2 + P )(Γ1/2 Γ1/2)(A Id). Putting these back to (53), one has cov vec(x(x ) ) = cov(g) = (A Id)(Γ1/2 Γ1/2)(Id2 + P )(Γ1/2 Γ1/2)(A Id) + W Γ, which is equal to Σ(k) + f W (k) eΓ(k) if we return to the original notation of e A(k), f W (k), eΓ(k). By a similar analysis, we can show that cov vec(z(z ) ) = Σ(ℓ) + f W (l) eΓ(l). Putting these back to (52) finishes our calculation of cov(f). Finally, it remains to prove (54). Denote u := vec(yy ) = y1y ... ydy Then E[u] = vec(Id), and thus E[u]E[u] = [eie j ]1 i,j d. Next, consider E[uu ] = E h [yiyjyy ]1 i,j d i . Learning Mixtures of Linear Dynamical Systems E y2 i ykyℓ = 3 if k = ℓ= i, 1 if k = ℓ = i, 0 if k = ℓ, and hence E[y2 i yy ] = Id + 2eie i . E yiyjykyℓ = ( 1 if k = i, ℓ= j or k = j, ℓ= i, 0 otherwise, and hence E[yiyjyy ] = eie j + eje i . Putting together, the (i, j)-th d d block of cov(u) = E[uu ] E[u]E[u] is equal to Id + eie i if i = j, and eje i if i = j. 
In other words, cov(vec(yy )) = cov(u) = Id2 + P , where P = [eje i ]1 i,j d is a symmetric permutation matrix; this completes our proof of (54). C.3.2. PROOF OF LEMMA C.5 First, it holds that Ui ( e Y (k))i ( e Y (l))i 2 ( e Y (k))i ( e Y (l))i 2 2 = e Y (k) e Y (l) 2 F, which gives the upper bound on E[stat] = µk,l 2 2. For the lower bound, the triangle inequality tells us that Ui ( e Y (k))i ( e Y (l))i 2 = Ui Ui ( e Y (k))i Ui Ui ( e Y (l))i 2 max n ( e Y (k))i ( e Y (l))i 2 2ϵ, 0 o , which implies that Ui ( e Y (k))i ( e Y (l))i 2 2 ( e Y (k))i ( e Y (l))i 2 2 4ϵ ( e Y (k))i ( e Y (l))i 2, Ui ( e Y (k))i ( e Y (l))i 2 i=1 ( e Y (k))i ( e Y (l))i 2 2 4ϵ i=1 ( e Y (k))i ( e Y (l))i 2 = e Y (k) e Y (l) 2 F 4ϵ i=1 ( e Y (k))i ( e Y (l))i 2. It remains to upper bound Σk,l. Recall the definition Σk,l = U Σ(k) + f W (k) eΓ(k) + Σ(ℓ) + f W (l) eΓ(l) U. We will utilize the following basic facts: (1) for square matrices A and B with eigenvalues {λi} and {µj} respectively, their Kronecker product A B has eigenvalues {λiµj}; (2) For matrices A, B, C, D of compatible shapes, it holds that (A B)(C D) = (AC) (BD). These imply that 0 f W (k) eΓ(k) f W (k) eΓ(k) Id2 WmaxΓmax Id2, 0 Σ(k) = e A(k) Id (eΓ(k))1/2 (eΓ(k))1/2 Id2 + P (eΓ(k))1/2 (eΓ(k))1/2 ( e A(k)) Id 2 e A(k) Id (eΓ(k))1/2 (eΓ(k))1/2 (eΓ(k))1/2 (eΓ(k))1/2 ( e A(k)) Id Learning Mixtures of Linear Dynamical Systems = 2 e A(k) Id eΓ(k) eΓ(k) ( e A(k)) Id 2Γ2 max e A(k) Id ( e A(k)) Id = 2Γ2 max e A(k)( e A(k)) Id 2Γ2 maxκ2 AId2. Using the conditions Wmax Γmax and κA 1, we have Σ(k) + f W (k) eΓ(k) (WmaxΓmax + 2Γ2 maxκ2 A)Id2 3Γ2 maxκ2 AId2. We can upper bound Σ(ℓ) + f W (l) eΓ(l) by the same analysis. As a result, Σk,l U (6Γ2 maxκ2 AId2)U = 6Γ2 maxκ2 AU U = 6Γ2 maxκ2 AId K, which finishes the proof of Lemma C.5. C.3.3. PROOF OF (48) Let ϵ be the right-hand side of the condition (25) on the subspaces. Then, applying the second point of Lemma C.5 (with e Y (k) = Y (k), e Y (l) = Y (ℓ)) tells us that U vec (Y (k) Y (ℓ)) 2 2 Y (k) Y (ℓ) 2 F 4ϵ (Y (k))i (Y (ℓ))i 2 Y (k) Y (ℓ) 2 F 4ϵ d Y (k) Y (ℓ) F, where the last line follows from the Cauchy-Schwarz inequality: (Y (k))i (Y (ℓ))i 2 (Y (k))i (Y (ℓ))i 2 2 = d Y (k) Y (ℓ) F. We can lower bound V vec((Γ(k) Γ(ℓ)) ) 2 2 similarly. Putting pieces together, we have V vec (Γ(k) Γ(ℓ)) 2 2 + U vec (Y (k) Y (ℓ)) 2 2 Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F 4ϵ d Γ(k) Γ(ℓ) F + Y (k) Y (ℓ) F Γ(k) Γ(ℓ) 2 F + Y (k) Y (ℓ) 2 F , where the last inequality is due to the assumption ϵ Γ,Y / d. This completes our proof of (48). C.4. Proof of Theorem B.3 It suffices to prove the error bounds for one specific value of k, and then take the union bound over 1 k K. For notational convenience, in this proof we rewrite T (k) total, A(k), b A(k), W (k), c W (k) as T, A, b A, W , c W , respectively. We will investigate the close-form solution b A, and prepare ourselves with a self-normalized concentration bound; this will be helpful in finally proving the error bounds for b A A and c W W . Step 1: preparation. Recall the least-squares solution 0 t Tm 1 xm,t+1xm,t X 0 t Tm 1 xm,txm,t 1 . Using the notation ... xm,t ... 0 t Tm 1,m Ck RT d, X+ := ... xm,t+1 ... ... wm,t ... Learning Mixtures of Linear Dynamical Systems we have X + = AX + N , b A = X +X(X X) 1 = A + N X(X X) 1, namely A := b A A = N X(X X) 1. (55) We will utilize the following matrix form of self-normalized concentration; see Lemma C.4 of (Kakade et al., 2020). Lemma C.7 (Self-normalized concentration, matrix form). 
Consider filtrations {Ft}, and random vectors {xt, zt} satisfying xt Ft 1 and zt|Ft 1 N(0, Id). Let V Rd d be a fixed, symmetric, positive definite matrix, and denote V T := PT t=1 xtxt + V . Then with probability at least 1 δ, (V T ) 1/2 T X δ + log det(V T ) Step 2: estimation error of b A. Let us rewrite (55) as A = (X X) 1/2 (X X) 1/2X N. (56) It is obvious that X X plays a crucial role. Recall from Lemma C.1 that with probability at least 1 δ, xm,t 2 Dvec for some Dvec Γmax poly(d, κA, log(Ttotal/δ)). Then trivially we have the upper bound X X D2 vec T Id =: Vup. For a lower bound, we claim (and prove later) that with probability at least 1 δ, 5T W =: Vlb, provided that T κ2 wd log Γmax Now we are ready to control b A A = A . From (56), we have A (X X) 1/2 (X X) 1/2X N . First, the lower bound (57) on X X tells us that (X X) 1/2 1/ p T λmin(W ). Moreover, applying Lemma C.7 with V = Vlb, one has with probability at least 1 δ, (X X) 1/2X N (X X + V ) 1/2X N W 1/2 δ + log det(Vup + Vlb) Putting these together, we have T λmin(W ) p which proves our upper bound for b A A in the theorem. Step 3: estimation error of c W . By the definition of b wm,t = xm,t+1 b Axm,t = wm,t Axm,t, we have ... b w m,t ... c N = N AX = N N X(X X) 1X = N IT X(X X) 1X . Learning Mixtures of Linear Dynamical Systems Notice that IT X(X X) 1X is a symmetric projection matrix. Therefore, T c N c N = 1 T N IT X(X X) 1X N, and thus c W W = 1 T N X(X X) 1X N. (59) The first term on the right-hand side can be controlled by a standard result for covariance estimation (see Proposition D.1): with probability at least 1 δ, 1 As for the second term, we rewrite it as 1 T N X(X X) 1X N = 1 (X X) 1/2X N (X X) 1/2X N . Applying our earlier self-normalized concentration bound (58), we have T N X(X X) 1X N W d log( Γmax Wmin dκATtotal Putting these back to (59), we have T N N W + 1 T N X(X X) 1X N δ T + W d log( Γmax Wmin dκATtotal d log( Γmax Wmin dκATtotal where the last inequality uses T d log( Γmax Wmin dκATtotal δ ). This finishes our proof of Theorem B.3. C.4.1. PROOF OF (57) We start with the following decomposition: 0 t Tm 1 xm,txm,t X 1 t Tm 1 xm,txm,t 0 t Tm 2 xm,t+1xm,t+1 = X 0 t Tm 2 (Axm,t + wm,t)(Axm,t + wm,t) 0 t Tm 2 wm,twm,t Axm,txm,t A + Axm,twm,t + wm,txm,t A | {z } =:Q = P + Q. (60) Lower bound for P . By Proposition D.1, we have with probability at least 1 δ, 1 T |Ck|P W W δ T λmin(W ) Id, provided that T κ2 w(d + log 1 As a result, we have 1 T |Ck|P 1 2W , which implies P 1 Learning Mixtures of Linear Dynamical Systems Lower bound for Q. Let Nϵ be an ϵ-net of the unit sphere Sd 1 (where the value of 0 < ϵ < 1 will be specified later), and let π be the projection onto Nϵ. Recall the standard result that |Nϵ| (9/ϵ)d. Moreover, for any v Sd 1, denote v := v π(v), which satisfies v 2 ϵ. Then we have λmin(Q) = inf v Sd 1 v Qv = inf v Sd 1(π(v) + v) Q(π(v) + v) inf v Sd 1 π(v) Qπ(v) (2ϵ + ϵ2) Q inf v Nϵ v Qv 3ϵ Q . (61) For Q , we simply use a crude upper bound, based on the boundedness of { xm,t 2} (Lemma C.1): with probability at least 1 δ, one has Axm,t 2 2 + 2 Axm,t 2 wm,t 2 Γmax poly κA, d, Ttotal, 1 Next, we lower bound infv Nϵ v Qv. First, consider a fixed v Nϵ; denoting ym,t := Axm,t and um,t := W 1/2wm,t N(0, Id), we have 0 t Tm 2 (v ym,t)2 + 2 X 0 t Tm 2 v ym,t u m,t W 1/2v, where u m,t W 1/2v N(0, v W v). 
Lemma D.2 (the scalar version of self-normalized concentration) tells us that, for any fixed λ > 0, with probability at least 1 δ, X 0 t Tm 2 v ym,t u m,t W 1/2v 0 t Tm 2 (v ym,t)2 + 1 0 t Tm 2 (v ym,t)2 + 1 Replacing δ with δ/(9/ϵ)d and taking the union bound, we have with probability at least 1 δ, for any v Nϵ, 0 t Tm 2 (v ym,t)2 p 0 t Tm 2 (v ym,t)2 + 2 0 t Tm 2 (v ym,t)2 p With the choice of λ = 1/ p W , this implies v Qv 2 W (d log 9 δ ), for all v Nϵ. (63) Putting (62) and (63) back to (61), we have with probability at least 1 δ, λmin(Q) inf v Nϵ v Qv 3ϵ Q C0 δ + ϵ Γmax poly κA, d, Ttotal, 1 for some universal constant C0 > 0. Putting things together. Recall the decomposition X X P + Q in (60). We have already shown that if T κ2 w(d + log 1 δ ), then with probability at least 1 δ, X X P + Q 1 δ + ϵ Γmax poly κA, d, Ttotal, 1 It is easy to check that, if we further choose ϵ 1 poly(κA, d, Ttotal, 1 Wmin ), T κ2 wd log Γmax then we have X X 1 5T W , which finishes the proof of (57). Learning Mixtures of Linear Dynamical Systems C.5. Proof of Theorem B.4 In this proof, we show the correct classification of one short trajectory {xm,t}0 t Tm (with true label km) for some m Mclassification; then it suffices to take the union bound to prove the correct classification of all trajectories in Mclassification. For notational simplicity, we drop the subscript m and rewrite {xm,t}, Tm, km as {xt}, T, k, respectively. The basic idea of this proof is to show that L( b A(ℓ), c W (ℓ)) > L( b A(k), c W (k)) (64) for any incorrect label ℓ = k, where L is the loss function defined in (23) and used by Algorithm 5 for classification. Our proof below is simply a sequence of arguments that finally transform (64) into a sufficient condition in terms of the coarse model errors ϵA, ϵW and short trajectory length T. Before we proceed, we record a few basic facts that will be useful later. First, the assumption c W (k) W (k) ϵW 0.1Wmin implies that λmin(c W (k)) 0.9λmin(W (k)), and c W (k) is well conditioned with κ(c W (k)) κ(W (k)) κw. Morever, by Lemma C.1, with probability at least 1 δ, we have for all 0 t T, In Case 0, xt 2 Dx q Γmax(d + log Ttotal δ ), provided that T 1; In Case 1, xt 2 Dx κA q Γmax(d + log Ttotal δ ), provided that T 1 1 ρ log(2κA). Now we are ready to state our proof. Throughout our analyses, we will make some intermediate claims, whose proofs will be deferred to the end of this subsection. Step 1: a sufficient condition for correct classification. In the following, we prove that for a fixed ℓ = k, the condition (64) holds with high probability; at the end of the proof, we simply take the union bound over ℓ = k. Using xt+1 = A(k)xt + wt where wt N(0, W (k)), we can rewrite the loss function L as L(A, W ) = T log det(W ) + t=0 wt W 1wt t=0 xt (A(k) A) W 1(A(k) A)xt + 2 t=0 wt W 1(A(k) A)xt. After some basic calculation, (64) can be equivalently written as L( b A(ℓ), c W (ℓ)) L( b A(k), c W (k)) = (A) + (B) (C) > 0, where (A) := T log det(c W (ℓ)) log det(c W (k)) + t=0 wt (c W (ℓ)) 1 (c W (k)) 1 wt (B) := T 1 X t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1(A(k) b A(ℓ))xt + 2 t=0 wt (c W (ℓ)) 1(A(k) b A(ℓ))xt (C) := T 1 X t=0 xt (A(k) b A(k)) (c W (k)) 1(A(k) b A(k))xt + 2 t=0 wt (c W (k)) 1(A(k) b A(k))xt Step 2: a lower bound for (A) + (B) (C). Intuitively, we expect that (A) + (B) should be large because the LDS models (A(k), W (k)) and (A(ℓ), W (ℓ)) are well separated, while (C) should be small if ( b A(k), c W (k)) (A(k), W (k)). 
More formally, we claim that the following holds for some universal constants C1, C2, C3 > 0: With probability at least 1 δ, (A) T log det(c W (ℓ)) log det(c W (k)) + Tr (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 T (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F log 1 Learning Mixtures of Linear Dynamical Systems With probability at least 1 δ, T A(k) b A(ℓ) 2 F κw,cross κwκw,cross log 1 provided that T κ2 w log2(1/δ); With probability at least 1 δ, T D2 x A(k) b A(k) 2 Wmin + κw log 1 provided that T 1 (under Case 0) or T 1 1 ρ log(2κA) (under Case 1). Putting these together, we have with probability at least 1 δ, (A) + (B) (C) C4 T log det(c W (ℓ)) det(c W (k)) + Tr W (k) (c W (ℓ)) 1 (c W (k)) 1 + A(k) b A(ℓ) 2 F κw,cross D2 x A(k) b A(k) 2 T (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F + κwκw,cross for some universal constants C4, C5 > 0, provided that T κ2 w log2 1 δ (under Case 0), or T κ2 w log2 1 δ + 1 1 ρ log(2κA) (under Case 1). Now we have a lower bound of (A) + (B) (C) as an order-T term minus a low-order term. Therefore, to show (A) + (B) (C) > 0, it suffices to prove that (a) the leading factor of order-T term is positive and large, and (b) the low-order term is negligible compared with the order-T term. More specifically, under the assumption that the coarse models { b A(j), c W (j)} satisfy b A(j) A(j) ϵA, c W (j) W (j) ϵW 0.1Wmin for all 1 j K, we make the following claims: (Order-T term is large.) Define the leading factor b Dk,ℓof the order-T term and a related parameter Dk,ℓas follows: b Dk,ℓ:= log det(c W (ℓ)) det(c W (k)) + Tr W (k) (c W (ℓ)) 1 (c W (k)) 1 + A(k) b A(ℓ) 2 F κw,cross D2 x A(k) b A(k) 2 Dk,ℓ:= 1 κw,cross W (k) W (ℓ) 2 F W 2max + A(k) A(ℓ) 2 F 2 A,W κw,cross , (68) where the last inequality follows from Assumption 3.1. They are related in the sense that Dk,ℓ e Dk,ℓ:= log det(W (ℓ)) det(W (k)) + Tr W (k) (W (ℓ)) 1 (W (k)) 1 + A(k) A(ℓ) 2 F κw,cross , (69) where e Dk,ℓis defined in the same way as b Dk,ℓ, except that the coarse models are replaced with the accurate ones; moreover, we have Dk,ℓ b Dk,ℓ, provided that ϵA D2x , ϵW Wmin (Low-order term is negligible.) With (70) in place, we have (A) + (B) (C) C6T Dk,ℓ C5 T (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F + κwκw,cross We claim that if T κ5 w,cross Dk,ℓ + 1 log2 1 δ , then (A) + (B) (C) C7T Dk,ℓ> 0. (72) Learning Mixtures of Linear Dynamical Systems Step 3: putting things together. So far, we have proved that for a short trajectory generated by (A(k), W (k)) and for a fixed ℓ = k, it holds with probability at least 1 δ that L( b A(ℓ), c W (ℓ)) > L( b A(k), c W (k)), provided that D2x , ϵW Wmin min n 1, κ2 w + κ5 w,cross Dk,ℓ δ for Case 0, κ2 w + κ5 w,cross Dk,ℓ δ + 1 1 ρ log(2κA) for Case 1. Plugging in the relation Dk,ℓ 2 A,W /κw,cross and Dx q Γmax(d + log Ttotal δ ) (for Case 0) or Dx Γmax(d + log Ttotal δ ) (for Case 1), the above conditions become For Case 0: ϵA Wmin 2 A,W Γmaxκw,cross(d + log Ttotal δ ), ϵW Wmin min n 1, A,W p T κ2 w + κ6 w,cross 2 A,W For Case 1: ϵA Wmin 2 A,W Γmaxκw,crossκ2 A(d + log Ttotal δ ), ϵW Wmin min n 1, A,W p T κ2 w + κ6 w,cross 2 A,W δ + 1 1 ρ log(2κA). Finally, taking the union bound over all ℓ = k as well as over all trajectories in Mclassification finishes the proof of Theorem B.4. C.5.1. PROOF OF (65). Since wt i.i.d. N(0, W (k)), we have t=0 wt (c W (ℓ)) 1 (c W (k)) 1 wt = z Mz, where z N(0, IT d), and M RT d T d is a block-diagonal matrix with Q := (W (k))1/2((c W (ℓ)) 1 (c W (k)) 1)(W (k))1/2 Rd d as its diagonal blocks. 
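As a quick numerical sanity check of the concentration that the Hanson–Wright inequality (invoked next) guarantees for this block-diagonal quadratic form, one can verify by simulation that z^⊤ M z fluctuates around T · Tr(Q) at scale on the order of √T ‖Q‖_F. A minimal sketch with an arbitrary symmetric Q (all choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, trials = 6, 200, 2000
B = rng.standard_normal((d, d))
Q = (B + B.T) / 2                     # an arbitrary symmetric d x d diagonal block

# Since M = diag(Q, ..., Q), we have z^T M z = sum_{t=1}^T z_t^T Q z_t with z_t ~ N(0, I_d).
samples = np.array([
    sum(z @ Q @ z for z in rng.standard_normal((T, d)))
    for _ in range(trials)
])
print("empirical mean:", samples.mean(), " vs  T * Tr(Q) =", T * np.trace(Q))
# The empirical standard deviation is of the same order as sqrt(T) * ||Q||_F:
print("empirical std :", samples.std(), " vs  sqrt(T) * ||Q||_F =", np.sqrt(T) * np.linalg.norm(Q))
```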
Therefore, by the Hanson-Wright inequality (Theorem 6.2.1 of (Vershynin, 2018)), we have P z Mz E[z Mz] u 2 exp c min n u2 M 2 F , u M where M 2 F = T Q 2 F, M = Q , and E[z Mz] = Tr(M) = T Tr(Q). Choosing u T Q F log 1 δ, we have with probability at least 1 δ, |z Mz T Tr(Q)| C1 T Q F log 1 δ , which immediately leads to our lower bound (65) for (A). C.5.2. PROOF OF (66). Denote ut = (W (k)) 1/2wt N(0, Id) and yt = (W (k))1/2(c W (ℓ)) 1(A(k) b A(ℓ))xt. Then we have t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1(A(k) b A(ℓ))xt + 2 By Lemma D.2, we have with probability at least 1 δ, PT 1 t=0 ut yt ( λ 2 PT 1 t=0 yt 2 2 + 1 δ ) for any fixed λ > 0. This implies that t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1(A(k) b A(ℓ))xt λ t=0 yt yt 2 Learning Mixtures of Linear Dynamical Systems t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1 λ (c W (ℓ)) 1W (k)(c W (ℓ)) 1 (A(k) b A(ℓ))xt 2 Choosing λ = 0.05/(κwκw,cross), we have λ (c W (ℓ)) 1W (k)(c W (ℓ)) 1 0.1(c W (ℓ)) 1, and thus t=0 xt (A(k) b A(ℓ)) (c W (ℓ)) 1(A(k) b A(ℓ))xt 40κwκw,cross log 2 0.9λmin (c W (ℓ)) 1 T 1 X t=0 xt (A(k) b A(ℓ)) (A(k) b A(ℓ))xt 40κwκw,cross log 2 Now it remains to lower bound PT 1 t=0 xt xt, where := A A and A := A(k) b A(ℓ). Since 0, we have t=0 xt+1 xt+1 = t=0 (A(k)xt + wt) (A(k)xt + wt) t=0 xt A(k) A(k)xt + 2 t=0 wt A(k)xt | {z } (ii) We can lower bound (i) by the Hanson-Wright inequality, similar to our previous proof of (65); the result is that, with probability at least 1 δ, one has (i) T Tr (W (k))1/2 (W (k))1/2 C0 T (W (k))1/2 (W (k))1/2 F log 1 To lower bound (ii), we apply Lemma D.2, which shows that with probability at least 1 δ, t=0 wt A(k)xt t=0 xt (A(k)) W (k) A(k)xt + 1 for any fixed λ > 0, hence t=0 xt A(k) ( λ W (k) )A(k)xt 2 Recall = A A, and thus λ W (k) = A(Id λ AW (k) A) A 0 if we choose λ = 1/( W (k) A 2) = 1/( W (k) ); this implies that δ = 2 W (k) log 2 Putting these back to (74), we have t=0 xt xt (i) + (ii) T Tr (W (k))1/2 (W (k))1/2 C0 T (W (k))1/2 (W (k))1/2 F log 1 δ 2 W (k) log 2 T λmin(W (k))Tr( A A) C0 T W (k) A A F log 1 δ 2 W (k) A A log 2 T λmin(W (k)) A(k) b A(ℓ) 2 F 1 C0 T κ(W (k)) log 1 T κ(W (k)) log 2 0.9T λmin(W (k)) A(k) b A(ℓ) 2 F, Learning Mixtures of Linear Dynamical Systems where the last inequality holds if T κ2 w log2 1 Going back to (73), we have (B) 0.9λmin((c W (ℓ)) 1) t=0 xt xt 40κwκw,cross log 2 0.81T λmin((c W (ℓ)) 1)λmin(W (k)) A(k) b A(ℓ) 2 F 40κwκw,cross log 2 0.7T A(k) b A(ℓ) 2 F κw,cross 40κwκw,cross log 2 which finishes the proof of (66). C.5.3. PROOF OF (67). Denote = A(k) b A(k), ut = (W (k)) 1/2wt N(0, Id) and yt = (W (k))1/2(c W (k)) 1 xt. Then one has t=0 xt (c W (k)) 1 xt + 2 t=0 wt (c W (k)) 1 xt = t=0 xt (c W (k)) 1 xt + 2 By Lemma D.2, we have with probability at least 1 δ, PT 1 t=0 ut yt λ 2 PT 1 t=0 yt 2 2 + 1 δ for any fixed λ > 0. This implies that t=0 xt (c W (k)) 1 xt + λ t=0 yt yt + 2 t=0 xt (c W (k)) 1 xt + λ t=0 xt (c W (k)) 1W (k)(c W (k)) 1 xt + 2 t=0 xt (c W (k)) 1 + λ (c W (k)) 1W (k)(c W (k)) 1 xt + 2 1.5 1 λmin(W (k)) + λ W (k) λmin(W (k))2 t=0 xt xt + 2 Choosing λ 1/κw and recalling xt 2 Dx, we have (C) 1 Wmin T D2 x 2 + κw log 1 δ , which finishes the proof of (67). C.5.4. PROOF OF (69). Denote := W (k) W (ℓ). 
Then the right-hand side of (69) becomes e Dk,ℓ= log det W (ℓ) log det W (k) + Tr W (k)(W (ℓ)) 1 Id + A(k) A(ℓ) 2 F κw,cross = log det W (ℓ) log det(W (ℓ) + ) + Tr (W (ℓ) + )(W (ℓ)) 1 Id + A(k) A(ℓ) 2 F κw,cross = log det W (ℓ) log det (W (ℓ))1/2 Id + (W (ℓ)) 1/2 (W (ℓ)) 1/2 (W (ℓ))1/2 + Tr (W (ℓ)) 1/2 (W (ℓ)) 1/2 + A(k) A(ℓ) 2 F κw,cross = Tr(X) log det(Id + X) + A(k) A(ℓ) 2 F κw,cross , where we define X := (W (ℓ)) 1/2 (W (ℓ)) 1/2. Notice that X is symmetric and satisfies X + Id = (W (ℓ)) 1/2W (k)(W (ℓ)) 1/2 0, Learning Mixtures of Linear Dynamical Systems X (W (ℓ)) 1/2 2 W (k) W (ℓ) 2Wmax Wmin = 2κw,cross, X 2 F = (W (ℓ)) 1/2 (W (ℓ)) 1/2 2 F 2 F W 2max . Therefore, by Lemma D.3, we have Tr(X) log det(Id + X) X 2 F 6κw,cross W (k) W (ℓ) 2 F 6κw,cross W 2max , e Dk,ℓ= Tr(X) log det(Id + X) + A(k) A(ℓ) 2 F κw,cross W (k) W (ℓ) 2 F 6κw,cross W 2max + A(k) A(ℓ) 2 F κw,cross Dk,ℓ, which finishes the proof of (69). C.5.5. PROOF OF (70). Recall the definition b Dk,ℓ= log det(c W (ℓ)) det(c W (k)) + Tr W (k) (c W (ℓ)) 1 (c W (k)) 1 + A(k) b A(ℓ) 2 F κw,cross D2 x A(k) b A(k) 2 First, we have log det(c W (ℓ)) det(c W (k)) + Tr W (k) (c W (ℓ)) 1 (c W (k)) 1 = log det(c W (ℓ)) det(W (k)) + Tr W (k) (c W (ℓ)) 1 (W (k)) 1 log det(c W (k)) det(W (k)) + Tr W (k) (c W (k)) 1 (W (k)) 1 | {z } (ii) We can lower bound (i) by the same idea of our earlier proof for (69), except that we replace W (ℓ) in that proof with c W (ℓ); this gives us (i) W (k) c W (ℓ) 2 F κw,cross W 2max . As for (ii), applying Lemma D.3 with X = (W (k)) 1/2(c W (k) W (k))(W (k)) 1/2, one has (ii) = Tr(X) log det(X + Id) X 2 F ϵW 2d Putting things together, we have b Dk,ℓ= (i) (ii) + A(k) b A(ℓ) 2 F κw,cross D2 x A(k) b A(k) 2 C1 1 κw,cross W (k) c W (ℓ) 2 F W 2max + A(k) b A(ℓ) 2 F | {z } (iii) W 2 min + D2 xϵA2 | {z } (iv) If ϵW , ϵA satisfy (70), then (iv) c0Dk,ℓfor some sufficiently small constant c0 > 0. As for (iii), according to the definition of Dk,ℓin (68), there are two possible cases: Learning Mixtures of Linear Dynamical Systems If 1 κw,cross W (k) W (ℓ) 2 F W 2 max Dk,ℓ 2 , then it is easy to check that (70) implies that 4 W (k) W (ℓ) F (iii) 1 κw,cross W (k) c W (ℓ) 2 F W 2max 1 κw,cross W (k) W (ℓ) F W (k) W (ℓ) 2 F W 2max Dk,ℓ. On the other hand, if 1 κw,cross A(k) A(ℓ) 2 F Dk,ℓ 2 , then one can check that (70) implies that 4 A(k) A(ℓ) F, (iii) A(k) b A(ℓ) 2 F κw,cross ( A(k) A(ℓ) F ϵA κw,cross A(k) A(ℓ) 2 F κw,cross Dk,ℓ. In sum, it is always guaranteed that (iii) Dk,ℓ. Going back to (75), we claim that b Dk,ℓ (iii) (iv) Dk,ℓas long as ϵW and ϵA satisfy (70), which finishes the proof of (70). C.5.6. PROOF OF (72). Recall the lower bound (71) for (A) + (B) (C). We want to show that the low-order term is dominated by the order-T term, namely T Dk,ℓ. First, if T κwκw,cross δ , then κwκw,cross log 1 δ T Dk,ℓ. 
Next, we have (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F W (k) (c W (ℓ)) 1 (c W (k)) 1 F = W (k) (c W (ℓ)) 1(c W (k) c W (ℓ))(c W (k)) 1 F W (k) (c W (ℓ)) 1 (c W (k)) 1 W (k) W (ℓ) F + 2 W (k) W (ℓ) F + 2 dϵW = 2κw,cross W (k) W (ℓ) F + 2 Notice that the definition (68) of Dk,ℓimplies that W (k) W (ℓ) F p κw,cross W 2max Dk,ℓ, and thus (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F κw,cross κw,cross W 2max Dk,ℓ+ Now it is easy to checked that if T κ5 w,cross Dk,ℓ + κw,cross dϵW Wmin Dk,ℓ T (W (k))1/2 (c W (ℓ)) 1 (c W (k)) 1 (W (k))1/2 F log 1 Due to our assumption (70) on ϵW , we have ( κw,cross dϵW Wmin Dk,ℓ )2 κ2 w,cross Dk,ℓ κ5 w,cross Dk,ℓ, and thus it suffices to have T ( κ5 w,cross Dk,ℓ+ 1) log2 1 δ , which finishes the proof of (72). Learning Mixtures of Linear Dynamical Systems D. Miscellaneous results Proposition D.1. Consider at, bt i.i.d. N(0, Id), 1 t N, where N d + log(1/δ). Then with probability at least 1 δ, 1 N t=1 ata t Id The proof follows from a standard covering argument (cf. (Vershynin, 2018)), which we skip for brevity. Lemma D.2 (Self-normalized concentration, scalar version). Suppose that random vectors {ut, yt}1 t T and filtrations {Ft}0 t T 1 satisfy Ft = σ(ui, 1 i t), yt Ft 1, and ut|Ft 1 N(0, Id). Then for any fixed λ > 0, with probability at least 1 δ, we have Proof. Using basic properties of the Gaussian distribution, we have for all fixed λ R, = E exp T 1 X 2λ2 yt 2 2 E h exp λ u T y T 1 2λ2 y T 2 2 |FT 1 i E exp T 1 X 2λ2 yt 2 2 . Continuing this expansion leads to the result E[exp(PT t=1(λ ut yt 1 2λ2 yt 2 2))] 1. Now, letting z = log(2/δ) and using Markov s inequality, we have 2λ2 yt 2 2 z = P exp T X 2λ2 yt 2 2 exp(z) exp( z) E exp T X 2λ2 yt 2 2 exp( z) = δ In other words, with probability at least 1 δ/2, we have PT t=1(λ ut yt 1 2λ2 yt 2 2) < z = log(2/δ), which implies PT t=1 ut yt < λ 2 PT t=1 yt 2 2 + 1 δ if λ > 0. By a similar argument (but with λ replaced by λ), we have with probability at least 1 δ/2, PT t=1( λ ut yt 1 2λ2 yt 2 2) < log(2/δ), which implies PT t=1 ut yt > ( λ 2 PT t=1 yt 2 2 + 1 δ ) if λ > 0. Finally, taking the union bound finishes the proof of the lemma. Lemma D.3. Consider a symmetric matrix X Rd d that satisfies Id + X 0. If X B for some B 2, then we have Tr(X) log det(Id + X) X 2 F 3B . On the other hand, if X 1 2Id, then we have Tr(X) log det(Id + X) X 2 F. Proof. Denote {λi}1 i d as the eigenvalues of X, which satisfies λi > 1 for all 1 i d. Then Tr(X) log det(Id + X) = i=1 (1 + λi) = λi log(1 + λi) . Learning Mixtures of Linear Dynamical Systems It can be checked (via elementary calculus) that, for all 1 < λ B where B 2, it holds that λ log(1 + λ) λ2/(3B). Therefore, Tr(X) log det(Id + X) λ2 i 3B = X 2 F 3B , which completes the proof of our first claim. Similarly, it can be checked that, for all λ 1/2, one has λ log(1+λ) λ2, which implies that Tr(X) log det(Id + X) i=1 λ2 i = X 2 F; this completes the proof of our second claim. Fact D.4. In the setting of Section 3.1, it holds that Γmax Wmaxκ2 A/(1 ρ) and Γ,Y A,W Wmax Wmin/(4κ2 AΓmax). Proof. First, consider Γ = P i=0 Ai W (Ai) , where W Wmax and Ai κAρi. Then we have Γ P i=0 Ai 2 W Wmaxκ2 A P i=0 ρ2i Wmaxκ2 A/(1 ρ), which proves our upper bound for Γmax. Next, let us turn to Γ,Y . Our proof below can be viewed as a quantitative version of the earlier proof for Fact 2.1. 
We will show that, if the autocovariance matrices of the k-th and the ℓ-th models are close (in Frobenius norm), then the models themselves must also be close; our lower bound on Δ_{Γ,Y} in terms of Δ_{A,W} then follows by contraposition.

Consider two LDS models (A^(k), W^(k)) ≠ (A^(ℓ), W^(ℓ)) and their autocovariance matrices (Γ^(k), Y^(k)), (Γ^(ℓ), Y^(ℓ)). Recall from (17) that A^(k) = Y^(k) (Γ^(k))^{-1} and W^(k) = Γ^(k) − A^(k) Γ^(k) (A^(k))^⊤ = Γ^(k) − A^(k) (Y^(k))^⊤; A^(ℓ) and W^(ℓ) can be expressed similarly.

First, regarding A^(k) − A^(ℓ), one has
A^(k) − A^(ℓ) = Y^(k)(Γ^(k))^{-1} − Y^(ℓ)(Γ^(ℓ))^{-1} = (Y^(k) − Y^(ℓ))(Γ^(k))^{-1} + Y^(ℓ)((Γ^(k))^{-1} − (Γ^(ℓ))^{-1}),
where
Y^(ℓ)((Γ^(k))^{-1} − (Γ^(ℓ))^{-1}) = Y^(ℓ)(Γ^(ℓ))^{-1}(Γ^(ℓ) − Γ^(k))(Γ^(k))^{-1} = A^(ℓ)(Γ^(ℓ) − Γ^(k))(Γ^(k))^{-1}.
Therefore, we have
‖A^(k) − A^(ℓ)‖_F ≤ ‖Y^(k) − Y^(ℓ)‖_F ‖(Γ^(k))^{-1}‖ + ‖A^(ℓ)‖ ‖Γ^(ℓ) − Γ^(k)‖_F ‖(Γ^(k))^{-1}‖
≤ (1/W_min) ‖Y^(k) − Y^(ℓ)‖_F + (κ_A/W_min) ‖Γ^(ℓ) − Γ^(k)‖_F,   (76)
where the last line follows from Γ^(k) ⪰ W^(k) ⪰ W_min I_d and ‖A^(ℓ)‖ ≤ κ_A ρ ≤ κ_A.

Next, we turn to the analysis of W^(k) − W^(ℓ), which satisfies
W^(k) − W^(ℓ) = (Γ^(k) − Γ^(ℓ)) − (A^(k)(Y^(k))^⊤ − A^(ℓ)(Y^(ℓ))^⊤),
‖W^(k) − W^(ℓ)‖_F ≤ ‖Γ^(k) − Γ^(ℓ)‖_F + ‖A^(k)(Y^(k))^⊤ − A^(ℓ)(Y^(ℓ))^⊤‖_F.
Notice that
‖A^(k)(Y^(k))^⊤ − A^(ℓ)(Y^(ℓ))^⊤‖_F = ‖A^(k)(Y^(k) − Y^(ℓ))^⊤ + (A^(k) − A^(ℓ))(Y^(ℓ))^⊤‖_F
≤ ‖A^(k)(Y^(k) − Y^(ℓ))^⊤‖_F + ‖(A^(k) − A^(ℓ))(Y^(ℓ))^⊤‖_F
≤ κ_A ‖Y^(k) − Y^(ℓ)‖_F + κ_A Γ_max ‖A^(k) − A^(ℓ)‖_F,
where the last line is due to ‖Y^(ℓ)‖ = ‖A^(ℓ)Γ^(ℓ)‖ ≤ ‖A^(ℓ)‖ ‖Γ^(ℓ)‖ ≤ κ_A Γ_max. Therefore, we have
‖W^(k) − W^(ℓ)‖_F ≤ ‖Γ^(k) − Γ^(ℓ)‖_F + κ_A ‖Y^(k) − Y^(ℓ)‖_F + κ_A Γ_max ‖A^(k) − A^(ℓ)‖_F.   (77)

For notational simplicity, denote Δ_A := A^(k) − A^(ℓ), Δ_W := W^(k) − W^(ℓ), Δ_Γ := Γ^(k) − Γ^(ℓ), Δ_Y := Y^(k) − Y^(ℓ). From (77), one has
‖Δ_A‖²_F + ‖Δ_W‖²_F / W²_max ≤ ‖Δ_A‖²_F + (3/W²_max)(‖Δ_Γ‖²_F + κ²_A ‖Δ_Y‖²_F + κ²_A Γ²_max ‖Δ_A‖²_F)
= (3/W²_max) ‖Δ_Γ‖²_F + (3κ²_A/W²_max) ‖Δ_Y‖²_F + (1 + 3κ²_A Γ²_max/W²_max) ‖Δ_A‖²_F
≤ (3/W²_max) ‖Δ_Γ‖²_F + (3κ²_A/W²_max) ‖Δ_Y‖²_F + (4κ²_A Γ²_max/W²_max) ‖Δ_A‖²_F,
where the last line follows from κ²_A Γ²_max/W²_max ≥ 1. Moreover, (76) tells us that
‖Δ_A‖²_F ≤ (2/W²_min) ‖Δ_Y‖²_F + (2κ²_A/W²_min) ‖Δ_Γ‖²_F.
Putting these together, we have
‖Δ_A‖²_F + ‖Δ_W‖²_F / W²_max ≤ (3/W²_max) ‖Δ_Γ‖²_F + (3κ²_A/W²_max) ‖Δ_Y‖²_F + (4κ²_A Γ²_max/W²_max) [(2/W²_min) ‖Δ_Y‖²_F + (2κ²_A/W²_min) ‖Δ_Γ‖²_F]
= (3/W²_max + 8κ⁴_A Γ²_max/(W²_max W²_min)) ‖Δ_Γ‖²_F + (3κ²_A/W²_max + 8κ²_A Γ²_max/(W²_max W²_min)) ‖Δ_Y‖²_F
≤ (11κ⁴_A Γ²_max/(W²_max W²_min)) (‖Δ_Γ‖²_F + ‖Δ_Y‖²_F).

In sum, we have just shown that, if ‖Δ_Γ‖²_F + ‖Δ_Y‖²_F < Δ²_{Γ,Y}, then ‖Δ_A‖²_F + ‖Δ_W‖²_F/W²_max < (11κ⁴_A Γ²_max/(W²_max W²_min)) Δ²_{Γ,Y}. Equivalently (by contraposition), if ‖Δ_A‖²_F + ‖Δ_W‖²_F/W²_max ≥ Δ²_{A,W}, then ‖Δ_Γ‖²_F + ‖Δ_Y‖²_F ≥ (W²_max W²_min/(11κ⁴_A Γ²_max)) Δ²_{A,W}. This proves our lower bound on Δ_{Γ,Y} in terms of Δ_{A,W}.

Example D.5. It is well known that in our Case 1 (i.e. a single continuous trajectory), quick switching between multiple LDS models may lead to exponentially large states, even if each individual model is stable (Liberzon, 2003). We give a quick example for completeness. Consider
A^(k) = 0.99 [0, 2; 1/2, 0],   A^(ℓ) = 0.99 [0, 3; 1/3, 0],
both satisfying the stability condition in Assumption 3.1 with ρ = 0.99 < 1, and hence t_mix ≍ 1/(1 − ρ) = 100. Suppose that each short trajectory has a length of only 2, and the m-th (resp. (m+1)-th) trajectory has label k_m = ℓ (resp. k_{m+1} = k). Then x_{m+2,0} is equal to A^(k)A^(ℓ) x_{m,0} plus a mean-zero noise term, where A^(k)A^(ℓ) = 0.99² diag(2/3, 3/2) has spectral radius 0.99² · 3/2 > 1; this causes the exponential explosion of the states. (A short simulation of this switching scenario is sketched below.)
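The following minimal Python sketch simulates the switching scenario of Example D.5; the noise level, horizon, and variable names are illustrative choices, not taken from the paper. Each individual system is stable, yet alternating between the two models at every transition makes the state norm grow exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
A_k = 0.99 * np.array([[0.0, 2.0], [0.5, 0.0]])        # spectral radius 0.99 < 1
A_l = 0.99 * np.array([[0.0, 3.0], [1.0 / 3.0, 0.0]])  # spectral radius 0.99 < 1

# The two-step composition A_k @ A_l has spectral radius 0.99^2 * 3/2 > 1:
print(max(abs(np.linalg.eigvals(A_k @ A_l))))           # ~ 1.47

def simulate(choose_A, T=80):
    """Run x_{t+1} = A_t x_t + w_t with w_t ~ N(0, I_d), where A_t = choose_A(t)."""
    x = np.zeros(d)
    norms = []
    for t in range(T):
        x = choose_A(t) @ x + rng.standard_normal(d)
        norms.append(np.linalg.norm(x))
    return norms

single = simulate(lambda t: A_k)                           # one stable model: ||x_t|| stays bounded
switched = simulate(lambda t: A_l if t % 2 == 0 else A_k)  # label changes at every transition
print("final state norm, single model :", single[-1])
print("final state norm, switched     :", switched[-1])    # explodes, roughly like 1.47^(T/2)
```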
E. Extensions of Algorithm 1

Different trajectory lengths. Recall that in Section 2, we assume that all short trajectories within each subset of data M_o have the same length T_m = T_o. If this is not the case, we can easily modify our algorithms in the following ways:

For subspace estimation, the easiest way to handle different T_m's is to simply truncate the trajectories in M_subspace so that they all have the same length T_subspace = min_{m ∈ M_subspace} T_m, and then apply Algorithm 2 without modification. However, this might waste many samples when some trajectories in M_subspace are much longer than others; one way to resolve this is to manually divide the longer trajectories into shorter segments of comparable lengths before truncation. A more refined method is to modify Algorithm 2 itself, by re-defining the index sets Ω_1, Ω_2 separately for each trajectory; moreover, in the definitions of Ĥ_i and Ĝ_i, one might consider assigning larger weights to longer trajectories, instead of using the uniform weight 1/|M_subspace|.

For clustering (or pairwise testing) of M_clustering, we can handle varying T_m's similarly, by either truncating each pair of trajectories to the same length, or modifying Algorithm 3 itself (via re-defining the index sets {Ω_{g,1}, Ω_{g,2}}_{1 ≤ g ≤ G} separately for each trajectory).

Our methods for model estimation and classification, namely Algorithms 4 and 5, are already adaptive to different T_m's in M_clustering and M_classification, and hence need no modification.

Unknown parameters. Next, we show how to handle the case where certain parameters are unknown to the algorithms:

In Algorithm 2, we set the dimension of the output subspaces {V_i, U_i} to be K (the number of models). If K is unknown, we might instead examine the eigenvalues of Ĥ_i + Ĥ_i^⊤ and Ĝ_i + Ĝ_i^⊤, and pick the subspace dimension that covers most of the energy in the eigenvalues.

In Algorithm 3, we need to know the separation parameter Δ_{Γ,Y} (in order to choose the testing threshold τ appropriately) and the number of models K (for clustering). If either Δ_{Γ,Y} or K is unknown, we might instead try different values of the threshold τ, and pick the one that (after permutation) makes the similarity matrix S block-diagonal with as few blocks as possible.

F. Additional experiments

F.1. Synthetic experiments: clustering and classification

First, we take a closer look at the performance of our clustering method (Algorithm 3) through synthetic experiments. We set the parameters d = 40, K = 2, ρ = 0.5, δ = 0.12. The LDS models are generated by A^(k) = (ρ ± δ)R and W^(k) = I_d, where R is a random orthogonal matrix. We let |M_clustering| = 5d, and vary T_clustering ∈ [10, 60]. We run our clustering method on the dataset M_clustering, either with or without the assistance of subspace estimation (Algorithm 2) and dimensionality reduction. For the former case, we use the same dataset M_clustering for subspace estimation, without sample splitting, which is closer to practice; for the latter, we simply replace the subspaces {V_i, U_i} with I_d. The numerical results are illustrated in Figure 3 (left), confirming that (1) in both cases, the clustering error decreases as T_clustering increases, and (2) subspace estimation and dimensionality reduction significantly improve the clustering accuracy. (A sketch of how such a synthetic mixture can be generated is given below.)
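To make the setup above concrete, here is a minimal sketch of generating such a synthetic mixture, assuming the two models are A^(1) = (ρ + δ)R and A^(2) = (ρ − δ)R with W^(k) = I_d; all function names, the seed, and the trajectory counts are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Draw a Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs so the distribution is Haar

def make_mixture(d=40, rho=0.5, delta=0.12, M=200, T=30, seed=0):
    """Generate M unlabeled length-T trajectories from K = 2 LDS models
    A^(1) = (rho + delta) R, A^(2) = (rho - delta) R, with W^(k) = I_d."""
    rng = np.random.default_rng(seed)
    R = random_orthogonal(d, rng)
    models = [(rho + delta) * R, (rho - delta) * R]
    trajectories, labels = [], []
    for _ in range(M):
        k = rng.integers(len(models))          # latent label, hidden from the learner
        x = np.zeros(d)
        traj = [x]
        for _ in range(T):
            x = models[k] @ x + rng.standard_normal(d)   # w_t ~ N(0, I_d)
            traj.append(x)
        trajectories.append(np.stack(traj))    # shape (T + 1, d)
        labels.append(k)
    return trajectories, np.array(labels), models

trajs, labels, (A1, A2) = make_mixture()
print(len(trajs), trajs[0].shape)              # 200 trajectories, each of shape (31, 40)
```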
Next, we examine the performance of our classification method (Algorithm 5) in the same setting as above. We first obtain a coarse model estimate by running Stage 1 of Algorithm 1 on the dataset M_clustering, with |M_clustering| = 10d and T_clustering = 30. Then, we run classification on the dataset M_classification, with varying T_classification ∈ [4, 50]. The numerical results are included in Figure 3 (right), showing that the classification error rapidly decreases to zero as T_classification grows.

Figure 3. Left: mis-clustering rate versus T_clustering. Right: mis-classification rate versus T_classification.

F.2. Real-data experiments: Motion Sense

To show the practical relevance of the proposed algorithms, we work with the Motion Sense dataset (Malekzadeh et al., 2019). This dataset consists of multivariate time series of dimension d = 12, collected (at a rate of 50 Hz) by the accelerometer and gyroscope sensors of a mobile phone while a person performs various activities, such as "jogging", "walking", "sitting", and so on. In our experiments, we break the data into 8-second short trajectories, and treat the human activities as latent variables. Figure 4 (left) illustrates what the data looks like. Notice that the time series do not satisfy the mixing property assumed in our theory, but are rather periodic.

As a preliminary attempt to apply our algorithms in the real world, we show that the proposed clustering method (one of the most crucial steps in our overall approach), without any modification, works reasonably well even on this dataset. To be concrete, we apply Algorithm 3 (without dimensionality reduction, i.e. {V_i, U_i} are set to I_d) to a mixture of 12 "jogging" and 12 "walking" trajectories. Figure 4 (right) shows the resulting distance matrix, which is defined in the same way as Line 11 of Algorithm 3, but without thresholding. Its clear block structure confirms that, with an appropriate choice of the threshold τ, Algorithm 3 will return an accurate (or even exact) clustering of the mixed trajectories. These results are a strong indication that the proposed algorithms might generalize to much broader settings than what our current theory suggests, and we hope that this will inspire further extensions and applications of the proposed methods.

Figure 4. Left: examples of "jogging" and "walking" trajectories from the Motion Sense dataset. Right: the distance matrix constructed by Algorithm 3 for 12 "jogging" and 12 "walking" trajectories.
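For readers who wish to reproduce a qualitative version of Figure 4 (right), the following sketch computes a pairwise distance matrix between short trajectories from their empirical autocovariance statistics (Γ̂_m, Ŷ_m). This is a simplified stand-in for the statistic in Line 11 of Algorithm 3 (whose exact construction, including the index sets and normalization, differs); the Motion Sense loading step is omitted and the array shapes are assumptions.

```python
import numpy as np

def autocov_stats(traj):
    """Empirical lag-0 and lag-1 autocovariances of one trajectory of shape (T, d)."""
    x0, x1 = traj[:-1], traj[1:]
    gamma = x0.T @ x0 / len(x0)   # estimate of Gamma = E[x_t x_t^T]
    y = x1.T @ x0 / len(x0)       # estimate of Y     = E[x_{t+1} x_t^T]
    return gamma, y

def distance_matrix(trajectories):
    """Pairwise squared Frobenius distances between (Gamma_hat, Y_hat) of the trajectories."""
    stats = [autocov_stats(t) for t in trajectories]
    M = len(stats)
    D = np.zeros((M, M))
    for i in range(M):
        for j in range(i + 1, M):
            d_ij = np.linalg.norm(stats[i][0] - stats[j][0]) ** 2 \
                 + np.linalg.norm(stats[i][1] - stats[j][1]) ** 2
            D[i, j] = D[j, i] = d_ij
    return D

# Usage: `trajectories` is a list of (T_m, d) arrays, e.g. 12 jogging + 12 walking segments;
# trajectories sharing the same latent activity should form low-distance diagonal blocks.
# D = distance_matrix(trajectories)
```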