# learning_mixtures_of_markov_chains_and_mdps__01b307b0.pdf

Learning Mixtures of Markov Chains and MDPs

Chinmaya Kausik 1 Kevin Tan 2 Ambuj Tewari 2

Abstract We present an algorithm for learning mixtures of Markov chains and Markov decision processes (MDPs) from short unlabeled trajectories. Specifically, our method handles mixtures of Markov chains with optional control input by going through a multi-step process, involving (1) a subspace estimation step, (2) spectral clustering of trajectories using "pairwise distance estimators," along with refinement using the EM algorithm, (3) a model estimation step, and (4) a classification step for predicting labels of new trajectories. We provide end-to-end performance guarantees, where we only explicitly require the length of trajectories to be linear in the number of states and the number of trajectories to be linear in a mixing time parameter. Experimental results support these guarantees, where we attain 96.6% average accuracy on a mixture of two MDPs in gridworld, outperforming the EM algorithm with random initialization (73.2% average accuracy). We also significantly outperform the EM algorithm on real data from the Last FM song dataset.

1. Introduction Efficiently clustering a mixture of time series data, especially with access to only short trajectories, is a problem that pervades sequential decision making and prediction (Liao (2005), Huang et al. (2021), Maharaj (2000)). This is motivated by various real-world problems, ranging through psychology (Bulteel et al. (2016)), economics (Mc Culloch & Tsay (1994)), automobile sensing (Hallac et al. (2017)), biology (Wong & Li (2000)), neuroscience (Albert (1991)), to name a few. One natural and important time series model is that of a mixture of K MDPs, which includes the case of a mixture of K Markov chains. We want to cluster from a set of short trajectories where (1) one does not know which

1Department of Mathematics, University of Michigan, Ann Arbor, USA 2Department of Statistics, University of Michigan, Ann Arbor, USA. Correspondence to: Chinmaya Kausik <ckausik@umich.edu>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

MDP or Markov chain any trajectory comes from and (2) one does not know the transition structures of any of the K MDPs or Markov chains. Previous literature like Kwon et al. (2021) and Gupta et al. (2016) has stated and underlined the importance of this problem, but so far, the literature on methods to solve it with theoretical guarantees and empirical results has been sparse.

Broadly, there are three threads of literature on problems related to ours. Within reinforcement learning literature, there has been a sustained interest in frameworks very similar to mixtures of MDPs latent MDPs (Kwon et al. (2021)), multi-task RL (Brunskill & Li (2013)), hidden model MDPs (Chades et al. (2021)), to name a few. However, most effort in this thread has been towards regret minimization in the online setting, where the agent interacts with an MDP from a set of unknown MDPs. The framework of latent MDPs in Kwon et al. (2021) is equivalent to adding reward information to ours. They have shown that one can only learn latent MDPs online with number of episodes required polynomial in states and actions to the power of trajectory length (under a reachability assumption similar to our mixing time assumption). On the other hand, our method learns latent MDPs offline with number of episodes needed only linear in the number of states (in no small part due to the subspace estimation step we make).

The other thread of literature deals with using a "subspace estimation" idea to efficiently cluster mixture models, from which we gain inspiration for our algorithm. Vempala & Wang (2004) first introduce the idea of using subspace estimation and clustering steps, with application to learning mixtures of Gaussians. Kong et al. (2020) adapt these ideas to the setting of meta-learning for mixed linear regression, adding a classification step. Chen & Poor (2022) bring these ideas to the time-series setting to learn mixtures of linear dynamical systems. They leave open the problems of (1) adapting the method to handle control inputs (mentioning mixtures of MDPs as an important example) and (2) handling other time series models (like autoregressive models and Markov chains), and state that the former is of great importance. There are many technical and algorithmic subtleties in adapting the ideas developed so far to MDPs and Markov Chains. The most obvious one comes from the following observation: in linear dynamical systems, the deviation from the predicted next-state value under the lin-

Learning Mixtures of Markov Chains and MDPs

ear model occurs with additive i.i.d. noise. In MDPs and Markov chains, we are sampling from the next-state probability simplex at each timestep, and this cannot be cast as a deterministic function of the current state with additive i.i.d. noise.

Gupta et al. (2016) also provide a method for learning a mixture of Markov chains using only 3-trails, and compare its performance to the EM algorithm. While the requirement on trajectory length is as lax as can be, their method needs to estimate the distribution of 3-trails using all available data, incurring an estimation error in estimating S3A3 parameters, while providing no finite-sample theoretical guarantees. If the method can be shown to enjoy finite sample guarantees, the need to estimate S3A3 parameters indicates that the guarantees will scale poorly with S and A.

The problem that we aim to solve is the following.

Is there a method with finite-sample guarantees that can learn both mixtures of Markov chains and MDPs offline, with only data on trajectories and the number of elements in the mixture K?

1.1. Summary of Contributions We provide such a method, with trajectory length requirements free from an S, A dependence. The method performs (1) subspace estimation, (2) spectral clustering, an optional step of using clusters to initialize the EM algorithm, (3) estimating models, and finally (4) classifying future trajectories.

Theorem (Informal). Ignoring logarithmic terms, we can recover all labels exactly with K2S trajectories of length K3/2tmix, up to logarithmic terms and instance-dependent constants characterizing the models but not explicitly dependent on S, A, tmix or K.

Other contributions include:

This is the first method, to our knowledge, that can cluster MDPs with finite-sample guarantees where the length of trajectories does not depend explicitly on S, A. The length only explicitly depends linearly on the mixing time tmix, and the number of trajectories only explicitly depends linearly on S.

We are able to provide theoretical guarantees while making no explicit demands on the policies and rewards used to collect the data, only relying on a difference in the transition structures at frequently occurring (s, a) pairs.

Chen & Poor (2022) work under deterministic transitions with i.i.d. additive Gaussian noise, and we need to bring in non-trivial tools to analyse systems like ours, determined by transitions with non-i.i.d. additive

noise. Our use of the blocking technique of Yu (1994) opens the door for the analysis of such systems.

Empirical results in our experiments show that our method outperforms, outperforming the EM algorithm by a significant margin (73.2% for soft EM and 96.6% for us on gridworld).

2. Background and Problem Setup We work in the scenario where we have K unknown models, either K Markov chains or K MDPs, and data of Ntraj trajectories collected offline. Throughout the rest of the paper, we work with the case of MDPs, as we can think of Markov chains as an MDP where there is only one action (A = { }) and rewards are ignored by our algorithm anyway.

We have a tuple (S, A, {Pk}K k=1, {fk}K k=1, pk) describing our mixture. Here, S, A are the state and action sets respectively. Pk(s | s, a) describes the probability of an s, a, s transition under label k. At the start of each trajectory, we draw k Categorical(f1, ..., f K), and starting state according to pk, and generate the rest of the trajectory under policies πk(a | s). We have stationary distributions on the state-action pairs dk(s, a) for πk interacting with Pk. We do not know (1) the parameters Pk, fk, pk, πk( | s) of each model or the policies, and (2) k, i.e., which model each trajectory comes from.

This coincides with the setup in Gupta et al. (2016) in the case of Markov chains (|A| = 1). It also overlaps with the setup of learning latent MDPs offline, in the case of MDPs. However, one difference is that we make no assumptions about the reward structure once trajectories are clustered, we can learn the models, including the rewards. It is also possible to learn the rewards with a term in the distance measure that is alike to the model separation term. However, this would require extra assumptions on reward separation that are not necessary for clustering.

Assumption 1 (Mixing). The K Markov chains on S A induced by the behaviour policies πk, each achieve mixing to a stationary distribution dk(s, a) with mixing time tmix,k. Define the overall mixing time of the mixture of MDPs to be tmix := maxk tmix,k.

Assumption 2 (Model Separation). There exist α, so that for each pair k1, k2 of hidden labels, there exists a state action pair (s, a) (possibly depending on k1, k2) so that dk1(s, a), dk2(s, a) α and Pk1( | s, a) Pk2( | s, a) 2 .

Assumption 2 is merely saying that for any pair of labels, at least one visible state action pair witnesses a model difference . Call this the separating state-action pair. If no visible pair witnesses a model difference between the labels, then one certainly cannot hope to distinguish them using

Learning Mixtures of Markov Chains and MDPs

trajectories.

Remark 1. Why is there no assumption about policies? Notice that we make no explicit assumptions about policies. The nature of our algorithm allows us to work with the transition structure directly, and so we only demand that we observe a state action pair that witnesses a difference in transition structures. The policy is implicitly involved in this assumption through the stationary distribution dk(s, a) it induces, but our results demonstrate that this is the minimal demand we need to make in relation to the policies.

It is important to note that in Assumption 2, we assume a separation of the MDP models themselves, not the induced Markov chain models. To directly apply a method designed only for clustering mixtures of Markov chains to mixtures of MDPs, one would need to assume that the induced Markov chain models are separated. While our method handles Markov chains as described earlier, it does not need to use the induced Markov chain models for MDPs, and instead relies directly on the MDP models for clustering. This allows us to make a more natural assumption on separation.

Additionally, Assumption 1, which establishes the existence of a mixing time, is not a strong assumption (outside of the implicit hope that tmix is small). This is because any irreducible aperiodic finite state space Markov chain mixes to a unique stationary distribution. If the Markov chain is not irreducible, it mixes to a unique distribution determined by the irreducible component of the starting distribution.

The only requirement is thus aperiodicity, which is also technically superficial, as we now clarify. If the induced Markov chains were periodic with period L, we would have a finite set of stationary distributions du,l(s, a) that the chain would cycle through over a single period, indexed by l = 1 L. One can follow the proofs to verify that the guarantees continue to hold if we modify α in Assumption 2 to be a lower bound for mini,l dui,l(s, a) instead of just mini dui(s, a).

3. Algorithm 3.1. Setup and Notation We have short trajectories of length Tn, divided into 4 segments of equal length. We call the second and fourth segment Ω1 and Ω2 respectively.1 We further sub-divide Ωi into G blocks, and focus only on the first state-action observation in each sub-block and its transition (discard all other observations). We often refer to these observations as "single-step sub-blocks." See Figure 1 for an illustration of this. Divide the set of trajectory indices into two sets and call them Nsub and Nclust, for subspace estimation

1As the proofs demonstrate, we do not technically need all segments to be equal. If the order of tmix is known a priori, then we only need the first and third segment to be Ω(tmix) to allow for sufficient mixing before and after Ω1.

and clustering. Denote their sizes by Nsub and Nclust respectively. Let Ntraj(s, a) be the set of trajectory indices in either Nsub or Nclust (to be inferred from the context) where (s, a) is observed in both Ω1 and Ω2. Let Ntraj(s, a) be the size of this set. Denote by N(n, i, s, a) the number of times (s, a) is recorded in segment i of trajectory n, and let N(n, i, s, a, ) be the vector of next-state counts. We denote by Pk( | s, a) the vector of next state transition probabilities. We denote by Freqβ the set of all state action pairs whose occurrence frequency in our observations is higher than β.

We will call the predicted clusters returned by the clustering algorithm Ck. For model estimation and classification, we do not use segments, and merely split the entire trajectory into G blocks, discarding all but the last observation in each block. We call this observation the corresponding single-step sub-block. We denote the total count of s, a observations in trajectory n by N(n, s, a) and that of s , s, a triples by N(n, s, a, s ).

In practice, we choose to not be wasteful and observations are not discarded while computing the transition probability estimates. To clarify, in that case N(n, i, s, a) is just the count of (s, a) in segment i and similarly for N(n, i, s, a, ), N(n, s, a) and N(n, s, a, ). Estimators in both cases, that is both with and without discarding observations, are MLE estimates of the transition probabilities. One of them maximizes the likelihood of just the single-step sub-blocks and the other maximizes the likelihood of the entire segment. We need the latter for good finite-sample guarantees (using mixing). However, the former satisfies asymptotic normality, which is not enough for finite-sample guarantees, but it often makes it a good and less wasteful estimator in practice.

0 T 2T 3T 4T

Figure 1. Breaking up a trajectory into 4 segments and G blocks per segment (G = 4) for the single-step estimator. Observations are only recorded at the orange points.

3.2. Overview The algorithm amounts to (1) a PCA-like subspace estimation step, (2) spectral clustering of trajectories using "thresholded pairwise distance estimates," along with an optional step of using clusters to initialize the EM algorithm, (3) estimating models (MDP transition probailities) and finally (4) classifying any trajectories not in Nclust (for example, Nsub). We provide performance guarantees for each step of the algorithm in section 4.

Learning Mixtures of Markov Chains and MDPs

3.3. Subspace Estimation The aim of this algorithm is to estimate for each (s, a) pair a matrix Vs,a satisfying rowspan VT s,a span(Pk( |s, a))k=1,..,K. That is, we want to obtain an orthogonal projector to the subspace spanned by the next-state distributions Pk( |s, a) for 1 k K.

Summarizing the algorithm in natural language, we perform subspace estimation via 3 steps. We first estimate the next state distribution given state and action for each trajectory. We then obtain the outer product of the next state distributions thus estimated. These outer product matrices are averaged over trajectories, and the average is used to find the orthogonal projectors V T s,a to the top K eigenvectors.

Algorithm 1 Subspace Estimation

1: Compute Ntraj(s, a) for all s, a. Initialize the S S matrix ˆMs,a 0 and the SA SA matrix ˆD 0. 2: ˆdn,1, ˆdn,2 0 RSA for all n Nsub 3: for (i, s, a) {1, 2} S A do 4: Compute N(n, i, s, a, ), N(n, i, s, a), n Nsub 5: ˆPn,i( |s, a) N(n,i,s,a, )

N(n,i,s,a) 1N(n,i,s,a) =0, n

6: [ˆdn,i]s,a N(n,i,s,a)

7: ˆMs,a ˆMs,a+P n Ntraj(s,a) ˆPn,1( |s,a)ˆPn,2( |s,a)T

Ntraj(s,a) 8: end for 9: ˆD ˆD + P n Nsub 1 Nsub ˆdn,1ˆd T n,2 10: Using SVD, return the orthogonal projectors (VT s,a)K S to the top K eigenspaces of ˆMs,a + ˆMT s,a for each (s, a) where Ntraj(s, a) = 0 (set the others to 0), along with the orthogonal projector (UT )K SA to the top K eigenspace of ˆD + ˆDT .

Remark 2. Why do we split the trajectories? We use two approximately independent segments Ω1 and Ω2 time separated by a multiple of the mixing time tmix to estimate the next state distributions. The reduced correlation between the two estimates obtained allows us to give theoretical guarantees for concentration, despite using dependent data within each trajectory n in the estimation of the rank 1 matrices (Pkn( |s, a))(Pkn( |s, a))T . The key point is that the double estimator ˆPn,1( | s, a)ˆPn,2( | s, a) is in expectation very close to this matrix.

Notice that our estimator ˆMs,a is in expectation then given approximately by PK k=1 fk(Pk( |s, a))(Pk( |s, a))T . The eigenspace of this matrix is clearly span(Pk( |s, a))k=1,..,K. The deviation from the expectation is controlled by the total number of trajectories, while the "approximation error" separating the expectation from the desired matrix is controlled by the separation between Ω1 and Ω2.

Remark 3. Why is this not PCA? This procedure has many linear-algebraic similarities to uncentered PCA on the

dataset of (trajectories, next state frequencies), but statistically has a very different target. Crucially, (centered) PCA is concerned with the variance E[XT X], while we are interested in a decent estimate of the target E[XT ]E[X] above and thus use a double estimator. Our theoretical analysis also has nothing to do with analyses of PCA due to this difference in the statistical target.

3.4. Clustering Using the subspace estimation algorithm s output, we can embed estimates from trajectories in a low dimensional subspace. For the clustering algorithm, we aim to compute the pairwise distances of these estimates from trajectories in this embedding. A double estimator is used yet again, to reduce the covariance between the two terms in the inner product used to compute such a distance.

This embedding is crucial because it reduces the variance of the pairwise distance estimators from a dependence on SA to a dependence on K. This is the intuition for how we can shift the onus of good clustering from being heavily dependent on the length of trajectories to being more dependent on the subspace estimate and thus on the number of trajectories.

There are many ways to use such "pairwise distance estimates" for clustering trajectories. In one successful example, we use a test: if the squared distances are below some threshold (details provided later), then we can conclude that they come from the same element of the mixture, and different ones otherwise. This allows us to construct (the adjacency matrix of) a graph with vertices as trajectories, and we can feed the results into a clustering algorithm like spectral clustering. Alternatively, one can use other graph partitioning methods or agglomerative methods on the distance estimates themselves.

We present the algorithm formally in Algorithm 2. Choosing hyperparameters β, λ and the threshold τ involve heuristic choices, much like how choosing the threshold in Chen & Poor (2022) needs heuristics. However, our methods are very different, and we describe them in more detail in Section 5.

3.4.1. REFINEMENT USING EM Our guarantees in section 4 will show that we can recover exact clusters with high probability at the end of Algorithm 2. However, in practice, it makes sense to refine the clusters if trajectories are not long enough for exact clustering. Remember that an instance of the EM algorithm for any model is specified by choosing the observations Y , the hidden variables Z and the parameters θ.

If we consider observations to be next-state transitions from (s, a) Freqβ, hidden variables to be the hidden labels and the parameters θ to include both next-state transition

Learning Mixtures of Markov Chains and MDPs

Algorithm 2 Clustering

1: Compute the set Freqβ by picking (s, a) pairs with occurrence more than β. 2: dn,1, dn,2 0 RSA

3: for (i, s, a) {1, 2} S A do 4: Compute N(n, i, s, a, ), N(n, i, s, a), n Nclust 5: ˆPn,i( |s, a) N(n,i,s,a, )

N(n,i,s,a) 1N(n,i,s,a) =0, n

6: [ˆdn,i]s,a N(n,i,s,a)

G , n 7: end for 8: for (n, m) Nclust Nclust do 9: for (i, s, a) {1, 2} S A do 10: ˆ i,s,a := VT s,a(ˆPn,i( | s, a) ˆPm,i( |s, a)) 11: end for 12: dist1(n, m) := max(s,a) Freqβ ˆ T 1,s,a ˆ 2,s,a

13: dist2(n, m) := (ˆdn,1 ˆdm,1)T UUT (ˆdn,2 ˆdm,2) 14: dist(n, m) := λ dist1(n, m) + (1 λ) dist2(n, m) 15: end for 16: Plot a histogram of dist to determine threshold τ and cluster trajectories sim(n, m) := 1dist(n,m) τ

probabilities for (s, a) Freqβ and cluster weights ˆfk, then one can now refine the clusters using the EM algorithm on this setup, which enjoys monotonicity guarantees in log-likelihood if one uses soft EM. The details of the EM algorithm are quite straightforward, described in Appendix C.

We hope that this is a step towards unifying the discussion on spectral and EM methods for learning mixture models, highlighting that we need not choose between one or the other spectral methods can initialize the EM algorithm, in one reinterpretation of the refinement step.

Note that refinement using EM is not unique to our algorithm. The model estimation and classification steps in Kong et al. (2020) (under the special case of Gaussian noise) and Chen & Poor (2022) (who already assume Gaussian noise) are exactly the E-step and M-step of the hard EM algorithm as well.

3.5. Model Estimation and Classification Given clusters from the clustering and refinement step, 2 tasks remain, namely those of estimating the models from them and correctly classifying any future trajectories. We can estimate the models exactly as in the M-step of hard EM.

ˆPk(s |s, a)

n Ck N(n, s, a, s ) P n Ck N(n, s, a)

ˆfk |Ck| Nclust

For classification, given a set Nclass of trajectories with size Nclass generated independently of Nclust, we can run a process very similar to Algorithm 2 to identify which cluster to assign each new trajectory to. It is worth noting that we can run the classification step on the subspace estimation dataset itself and recover true labels for those trajectories, since trajectories in Nsub and Nclust are independent.

We describe the algorithm in natural language here. The algorithm is presented formally as Algorithm 3 in Appendix D. We first compute an orthogonal projector Vs,a to the subspace spanned by the now known approximate models ˆPk( | s, a). For any new trajectory n and label k, we estimate a distance dist(n, k) between the model ˆPn,i( | s, a) estimated from n and the model ˆPk( | s, a) for k, after embedding both in the subspace mentioned above using Vs,a. Again, we use a double estimator as hinted at by the use of the subscript i, similar to Algorithm 2. In practice dist(n, k) could also include occupancy measure differences. Each trajectory n gets the label kn that minimizes dist(n, k).

Previous work like Chen & Poor (2022) and Kong et al. (2020) uses the word refinement for its model estimation and classification algorithms themselves. However, we posit that the monotonic improvement in log-likelihood offered by EM makes it well-suited for repeated application and refinement, while in our case, the clear theoretical guarantees for the model estimation and classification algorithms make them well-suited for single-step classification. Note that we can also apply repeated refinement using EM to the labels obtained by single-step classification, which should combine the best of both worlds.

4. Analysis We have the following end-to-end guarantee for correctly classifying all data.

Theorem 1 (End-to-End Guarantee). Let both Nsub and Nclust be Ω K2S log(1/δ) f 2 minα3 8 and let Tn =

Ω K3/2tmix log4((Nclust+Nsub)/δ) log3(1/ ) log4(1/α)

6α3 . If we execute algorithms 1, 2 and model estimation, and then apply algorithm 3 to Nsub with λ = 1, α/3 β < α and 2/4 τ 2/2 for clustering and classification, then we can recover the true labels for the entire dataset (Nclust Nsub) with probability at least 1 δ.

Proof. This follows directly from Theorems 2, 3, 4 and 5 upon combining the conditions on Nsub, Nclust, and Tn in both theorems. We also use the brief discussion after the statement of Theorem 5.

The dependence on model-specific parameters like α, and fmin is conservative and can be easily improved upon by

Learning Mixtures of Markov Chains and MDPs

following the proofs carefully. We chose the form of the guarantees in this section to present a clearer message. In one example, there are versions of these theorems that depend on both G and Tn. We choose G = (Tn/tmix)2/3 to present crisper guarantees. For understanding how the guarantees would behave depending on both G and Tn, or how to improve the dependence on model-specific parameters, the reader can follow the proofs in the appendix.

4.1. Techniques and Proofs We make a few remarks on the technical novelty of our proofs. As mentioned in Section 1, we are dealing with two kinds of non-independence. While we borrow some ideas in our analysis from Chen & Poor (2022) to deal with the temporal dependence, we crucially need new technical inputs to deal with the fact that we cannot cast the temporal evolution as a deterministic function with additive i.i.d. noise, unlike in linear dynamical systems.

We identify the blocking technique in Yu (1994) as a general method to leverage the "near-independence" in observations made in a mixing process when they are separated by a multiple of the mixing time. Our proofs involve first showing that estimates made from a single trajectory would concentrate if the observations were independent, and then we bound the "mixing error" to account for the non-independence of the observations. We first choose a distribution (often labelled as a variant of Q or Ξ) with desirable properties, and then bound the difference between probabilities of undesirable events under Q and under the true joint distribution of observations χ, using the blocking technique due to Yu (1994).

There are many other technical subtleties here. In one example, the number of (s, a) observations made in a single trajectory is itself a random variable and so our estimator takes a ratio of two random variables. To resolve this, we have to condition on the random set of (s, a) observations recorded in a trajectory and use a conditional version of Hoeffding s inequality (different from the Azuma-Hoeffding inequality), followed by a careful argument to get unconditional concentration bounds, all under Q.

4.2. Subspace Estimation For subspace estimation, we have the following guarantee.

Theorem 2 (Subspace Estimation Guarantee). Consider 2 models with labels k1, k2 and a state-action pair s, a with dmin(s, a) α/3. Consider the output VT s,a of Algorithm 1. Let fmin = min(fk1, fk2) be the lower of the label prevalences. Remember that each trajectory has length Tn.

Then given that Nsub = Ω log(1/δ)

Ω(tmix log4(1/α)), with probability at least 1 δ, for

Pk( | s, a) Vs,a VT s,a Pk( | s, a) 2 ϵsub(δ)

For Tn = Ω tmix log3 fmin Nsubα KS log(1/δ)

ϵsub(δ) = O

S Nsub α3 log 1

While for Tn = O tmix log3 fmin Nsubα KS log(1/δ)

ϵsub(δ) = O

Alternatively, we only need Nsub = Ω K2S log(1/δ)

f 2 minα3ϵ4 and

Tn = Ω tmix log3(1/ϵ) log4(1/α) trajectories for ϵ accuracy in subspace estimation with probability at least 1 δ.

Remark 4. Why are short trajectories enough? Notice that the length of trajectories only affects the bound as a multiple of tmix with some logarithmic terms. This is because intuitively, the onus of estimating the correct subspace lies on aggregating information across trajectories. So, as long as there are enough trajectories, each trajectory does not have to be long.

4.3. Clustering Remember that is the model separation and α is the corresponding "stationary occupancy measure" from Assumption 2. We give guarantees for choosing λ = 1, which corresponds to using only model difference information instead of also using occupancy measure information. This is unavoidable since we have no guarantees on the separation of occupancy measures. See Section 5.2 for a discussion. Here, we provide a high-probability guarantee for exact clustering.

Theorem 3 (Exact Clustering Guarantee). Pick any pair of trajectories n, m. Then for Freqβ so that it contains (s, a) with dmin(s, a) Ω(α), Tn = Ω(tmix log4(1/δ)/α3), with probability at least 1 δ, dist1(m, n) m,n 2 2

+ 4ϵsub(δ/2)

Learning Mixtures of Markov Chains and MDPs

This means that if we choose λ = 1, then if ϵsub(δ)

2/32 and Tn = Ω K3/2tmix log4(Nclust/(αδ))

6α3 , no dis-

tance estimate attains a value between 2/4 and 2/2. So, Algorithm 2 attains exact clustering using a threshold of say 2/3 with probability at least 1 δ.

Since we already have high probability guarantees for exact clustering before refinement of the clusters, guarantees for the EM step analogous to the single-step guarantees for refinement in Chen & Poor (2022) are not useful here. However, we do still present single-step guarantees for the EM algorithm in our case using a combination of Theorem 4 for the M-step and Theorem 6 in Appendix G.

4.4. Model Estimation and Classification We also have guarantees for correctly estimating the relevant parts of the models and classifying sets of trajectories different from Nclust.

Theorem 4 (Model Estimation Guarantee). For any state action pair (s, a) with dmin(s, a) α/3, and for GNclust

f 2 minα2 and Tn Ω(Gtmix log(G/δ)), with probability greater than 1 δ,

ˆPk( | s, a) Pk( | s, a) 1

is bounded above by

1/3 r 1 Nclustfminα(S + log(1

Note that since the 1-norm is greater than the 2-norm, the same bound holds in the 2-norm as well. Also notice that since our assumptions do not say anything about observing all (s, a) pairs often enough, we can only given guarantees in terms of the occurrence frequency of (s, a) pairs.

Theorem 5 (Classification Guarantee). Let ϵmod(δ) be a high probability bound on the model estimation error ˆPk( | s, a) Pk( | s, a) 2. Then there is a universal constant C3 so that Algorithm 3 can identify the true labels for trajectories in Nclass with probability at least 1 δ for Tn = Ω K3/2tmix log4(Nclass/(αδ))

6α3 , whenever

ϵmod(δ/2) C3 4fminα

K and Nclust Ω log(1/δ)

f 2 minα2 .

Note that by Theorem 4, a sufficient condition for ϵmod(δ/2) C3 4fminα

K is Nclust T 2/3 n

Ω K2t2/3 mix S log(1/δ) 8f 3 minα3 . Under the conditions on Tn in Theorem 5, a suboptimal but sufficient condition on Nclust is Nclust = Ω K2S log(1/δ) f 2 minα3 8 , which matches that for Nsub.

5. Practical Considerations 5.1. Subspace Estimation Heuristics for choosing K: One often does not know K beforehand and often wants temporal features to guide the process of determining K, for example in identifying the number of groups of similar people represented in a medical study. We suggest a heuristic for this. One can examine how many large eigenvalues there are in the decomposition, via (1) ordering the eigenvalues of ˆMsa s, a by magnitude, (2) taking the square of each to obtain the eigenvalue energy, (3) taking the mean or average over states and actions, and (4) plotting a histogram. See Figure 6 in the appendix.

One can also consider running the whole process with different values of K and choose the value of K that maximises the likelihood or the AIC of the data (if one wishes the mixture to be sparse). However, Fitzpatrick & Stewart (2022) points out that such likelihood-based methods can lead to incorrect predictions for K even with infinite data.

5.2. Clustering Picking β: Choosing β involves heuristically picking stateaction pairs that have high frequency and "witness" enough model separation. We propose one method for this. For each (s, a) pair, one first executes subspace estimation and then averages the value of dist1(m, n) across all pairs of trajectories. Call this estimate s,a, since it is a measure of how much model separation (s, a) can "witness". We then compute the occupancy measure value d(s, a) of (s, a) in the entire set of observations. Making a scatter-plot of s,a against d(s, a), we want a value of β so that there are enough pairs from Freqβ in the top right.

Picking thresholds τ: The histogram of dist plotted will have many modes. The one at 0 reflects distance estimates between trajectories belonging to the same hidden label, while all the other modes reflect distance between trajectories coming from various pairs of hidden labels. The threshold should thus be chosen between the first two modes. See Figure 8 in the appendix.

Picking λ: In general, occupancy measures are different for generic policies interacting with MDPs and should be included in the implementation by choosing λ < 1. The histogram for dist2 should indicate whether or not occupancy measures allow for better clustering (if they have the right number of well-separated modes).

Versions of the EM algorithm: In our description of the EM algorithm, we only use next-state transitions as observations instead of the whole trajectory. So, we do not learn other parameters like the policy and the starting state s distribution for the EM algorithm. This makes sense in principle, because our minimal assumptions only talk about separation in next-state transition probabilities, and there is no guarantee that other information will help with classification. In

Learning Mixtures of Markov Chains and MDPs

practice, one should make a domain-specific decision on whether or not to include them.

Initializing soft EM with cluster labels: We also recommend that when one initializes the soft EM algorithm with results from the clustering step, one introduces some degree of uncertainty instead of directly feeding in the 1-0 clustering labels. That is, for trajectory m, instead of assigning 1(i = km) to be the responsibilities, make them say 0.8 1(i Ck) + 0.2/K instead. We find that this can aid convergence to the global maximum, and do so in our experiments.

6. Experiments We perform two sets of experiments, one considering a mixture of MDPs, and another considering a mixture of Markov chains.2 In all experiments, the error is determined by matching clusters to labels in a way that minimizes the proportion of misclassified trajectories, and then reporting this proportion.

6.1. Gridworld MDPs, K = 2 We perform our experiments for MDPs on an 8x8 gridworld with K = 2 elements in the mixture (from (Bruns-Smith, 2021)). Unlike Bruns-Smith (2021), the behavior policy here is the same across both elements of the mixture to eliminate any favorable effects that a different behavior policy might have on clustering, so that we evaluate the algorithm on fair grounds. The mixing time of this system is roughly tmix 25. We only use dist1 for the clustering, omitting the occupancy measures to parallel the theoretical guarantees. Including them would likely improve performance. We chose to perform the experiments with 1000 trajectories, given the difficulty of obtaining large numbers of trajectories in important real-life scenarios that often arise in areas like healthcare.

Figure 2 plots the error at the end of Algorithm 2 (before refinement) while either using the projectors VT s,a determined in Algorithm 1 ("With Subspaces"), replacing them with a random projector ("Random Subspaces") or with the identity matrix ("Without Subspaces"). The difference in performance demonstrates the importance of our structured subspace estimation step. Also note that past a certain point, between Tn = 60 and Tn = 70 3tmix, the performance of our method drastically improves, showing that the dependence of our theoretical guarantees on the mixing time is reflected in practice as well. We briefly discuss the poor performance of choosing a random subspace in Appendix B.

In Figure 3, we benchmark our method s end-to-end performance against the most natural benchmark, the randomly

2Code is available at https://github.com/hetankevin/mdpmix.

Figure 2. Clustering error v.s. trajectory length on 1000 trajectories in gridworld, with a comparison between using VT s,a in Algorithm 2 and using IS S. The same threshold was used for each trajectory length. Results averaged over 30 trials. The mixing time of this system is roughly tmix 25.

initialized EM algorithm. We use the version of the soft EM algorithm that considers the entire trajectory to be our observation, and thus also includes policies and starting state distributions. So, we are comparing our method against the full power of the EM algorithm. We have three different plots, corresponding to (1) soft EM with random initialization, (2) refining models obtained from the model estimation step applied to Nclust using soft EM on Nclust Nsub, and (3) refining labels for Nclust and Nsub using soft EM (the latter obtained from applying Algorithm 3 to Nsub). We report the final label accuracies over the entire dataset, Nclust Nsub. Remember that we can view refinement using soft EM as initializing soft EM with the outputs of our algorithms. Note that the plot for (3), which reflects the true end-to-end version of our algorithm, almost always outperforms randomly initialized soft EM. Also, for Tn > 60, both variants of our method outperform randomly initialized soft EM. We present a variant of Figure 3 with hard EM included as Figure 10 in the appendix.

6.2. Last.fm Markov chains, K = 10 For our experiments with real-life data, we work with the Last.fm 1K dataset (Celma, 2010b; Lamere, 2008; Celma, 2010a). Like Gupta et al. (2016), we consider the listening history of individual users, modeled as a Markov chain with states given by the top 100 genres (S = 100). For each of the top 10 users, we chop up their listening history into 75 trajectories (Nsub +Nclust = 75) each, of varying horizons. The user generating a trajectory is then the hidden label to be inferred. We try to infer both the user corresponding to each trajectory and a Markovian model of each user s listening dynamics, using the Hungarian algorithm to compute the

Learning Mixtures of Markov Chains and MDPs

Figure 3. End-to-end error v.s. trajectory length on 1000 trajectories in gridworld, comparing initializations of the soft EM algorithm using (1) random initializations, (2) models from Nclust, and (3) classification and clustering labels from Nclust and Nsub. Results averaged over 30 trials, with 30 random initializations for randomly-initialized EM within each trial.

clustering error. The results are qualitatively similar to the results in the gridworld experiment above. Figure 4 demonstrates the importance of the subspace estimation step, and Figure 5 demonstrates that our method s end-toend performance improves upon that of randomly initialized EM algorithm.

Figure 4. Clustering error v.s. trajectory length on 750 trajectories obtained from the Last.fm 1K dataset, comparing an implementation of Algorithm 2 using VT s,a with one using IS S. Results averaged over 30 trials.

7. Discussion We have shown that we can recover the true trajectory labels with (1) the number of trajectories having only a linear dependence in the size of the state space, and (2) the length

Figure 5. End-to-end error v.s. trajectory length on 750 trajectories obtained from the Last.fm 1K dataset, comparing initializations of the soft EM algorithm using (1) random initializations and (2) models from Nclust. Results averaged over 30 trials, with 30 random initializations for randomly-initialized EM within each trial.

of the trajectories depending only linearly in the mixing time even before initializing the EM algorithm with these clusters (which would further improve the log-likelihood, and potentially cluster accuracy). End-to-end performance guarantees are provided in Theorem 1, and experimental results are both promising and in line with the theory.

7.1. Future Work Matrix sketching: The computation of dist1(m, n) is computationally intensive, amounting to computing about S A distance matrices. We could alternatively approximate the thresholded version of the matrix dist(m, n) (which in the ideal case is a rank-K binary matrix) with ideas from Musco & Musco (2016).

Function approximation: The question of the right extension of our ideas to Markov chains and MDPs with large, infinite, or uncountable state spaces is very much open (at least, those whose transition kernel is not described by a linear dynamical systems). This is important, as many applications often rely on continuous state spaces.

Other controlled processes: Chen & Poor (2022) learn a mixture of linear dynamical systems without control input. An extension to the case with control input will be very valuable. We believe that the techniques used in our work may prove useful in this, as well as for extensions to other controlled processes that may neither be linear nor Gaussian.

8. Acknowledgements Ambuj Tewari would like to acknowledge the support of NSF via grant IIS-2007055.

Learning Mixtures of Markov Chains and MDPs

Albert, P. S. A two-state markov mixture model for a time series of epileptic seizure counts. Biometrics, 47(4):1371 1381, 1991. ISSN 0006341X, 15410420. URL http: //www.jstor.org/stable/2532392.

Bruns-Smith, D. A. Model-free and model-based policy evaluation when causality is uncertain. In International Conference on Machine Learning, pp. 1116 1126. PMLR, 2021.

Brunskill, E. and Li, L. Sample complexity of multi-task reinforcement learning. Uncertainty in Artificial Intelligence - Proceedings of the 29th Conference, UAI 2013, 09 2013.

Bulteel, K., Tuerlinckx, F., Brose, A., and Ceulemans, E. Clustering vector autoregressive models: Capturing qualitative differences in withinperson dynamics. Frontiers in Psychology, 7, 2016. ISSN 1664-1078. doi: 10.3389/fpsyg.2016. 01540. URL https://www.frontiersin.org/ articles/10.3389/fpsyg.2016.01540.

Celma, O. Music Recommendation and Discovery in the Long Tail. Springer, 2010a.

Celma, O. Last.fm Dataset 1K users. http://www.dtic.upf.edu/~ocelma/ Music Recommendation Dataset/lastfm-1K. html, 2010b.

Chades, I., Carwardine, J., Martin, T., Nicol, S., Sabbadin, R., and Buffet, O. Momdps: A solution for modelling adaptive management problems. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1):267 273, Sep. 2021. doi: 10.1609/ aaai.v26i1.8171. URL https://ojs.aaai.org/ index.php/AAAI/article/view/8171.

Chen, Y. and Poor, H. V. Learning mixtures of linear dynamical systems. Co RR, abs/2201.11211, 2022. URL https://arxiv.org/abs/2201.11211.

Fitzpatrick, M. and Stewart, M. Asymptotics for markov chain mixture detection. Econometrics and Statistics, 22:56 66, 2022. ISSN 2452-3062. doi: https://doi.org/10.1016/j.ecosta.2021.11.004. URL https://www.sciencedirect.com/ science/article/pii/S2452306221001337. The 2nd Special issue on Mixture Models.

Gupta, R., Kumar, R., and Vassilvitskii, S. On mixtures of markov chains. In NIPS, pp. 3441 3449, 2016. URL http://papers.nips.cc/paper/ 6078-on-mixtures-of-markov-chains.

Hallac, D., Vare, S., Boyd, S., and Leskovec, J. Toeplitz inverse covariance-based clustering of multivariate time series data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 17, pp. 215 223, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450348874. doi: 10.1145/3097983. 3098060. URL https://doi.org/10.1145/ 3097983.3098060.

Huang, L., Sudhir, K., and Vishnoi, N. Coresets for time series clustering. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 22849 22862. Curran Associates, Inc., 2021. URL https://proceedings. neurips.cc/paper/2021/file/ c115ba9e04ab27fbbb664f932112246d-Paper. pdf.

Kong, W., Somani, R., Song, Z., Kakade, S. M., and Oh, S. Meta-learning for mixed linear regression. Co RR, abs/2002.08936, 2020. URL https://arxiv.org/ abs/2002.08936.

Kwon, J., Efroni, Y., Caramanis, C., and Mannor, S. RL for latent mdps: Regret guarantees and a lower bound. Co RR, abs/2102.04939, 2021. URL https://arxiv. org/abs/2102.04939.

Lamere, P. Last FM-Artist Tags2007 dataset. http://musicmachinery.com/2010/11/ 10/lastfm-artisttags2007/, 2008.

Larsen, K. G. and Nelson, J. Optimality of the johnsonlindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 633 638, 2017. doi: 10.1109/FOCS.2017.64.

Liao, T. W. Clustering of time series data a survey. Pattern Recognition, 38(11):1857 1874, nov 2005. doi: 10.1016/ j.patcog.2005.01.025. URL https://doi.org/10. 1016%2Fj.patcog.2005.01.025.

Maharaj, E. A. Cluster of time series. Journal of Classification, 17(2):297 314, Jul 2000. ISSN 1432-1343. doi: 10.1007/s003570000023. URL https://doi.org/ 10.1007/s003570000023.

Mc Culloch, R. and Tsay, R. Statistical analysis of economic time series via markov switching models. Journal of Time Series Analysis, 15(5):523 539, 1994. ISSN 0143-9782. doi: 10.1111/j.1467-9892.1994.tb00208.x.

Musco, C. and Musco, C. Recursive sampling for the nyström method. 2016. doi: 10.48550/ARXIV.1605.07583. URL https://arxiv.org/abs/1605.07583.

Learning Mixtures of Markov Chains and MDPs

Vempala, S. and Wang, G. A spectral algorithm for learning mixture models. J. Comput. Syst. Sci, 68:2004, 2004.

Vidyasagar, M. Learning and Generalization: With Applications to Neural Networks. Springer Publishing Company, Incorporated, 2nd edition, 2010. ISBN 1849968675.

Wong, C. S. and Li, W. K. On a mixture autoregressive model. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 62(1):95 115, 2000. ISSN 13697412, 14679868. URL http://www.jstor. org/stable/2680680.

Wong, W. H. and Shen, X. Probability Inequalities for Likelihood Ratios and Convergence Rates of Sieve MLES. The Annals of Statistics, 23(2):339 362, 1995. doi: 10.1214/aos/1176324524. URL https://doi.org/ 10.1214/aos/1176324524.

Yu, B. Rates of Convergence for Empirical Processes of Stationary Mixing Sequences. The Annals of Probability, 22(1):94 116, 1994. doi: 10.1214/aop/ 1176988849. URL https://doi.org/10.1214/ aop/1176988849.

Learning Mixtures of Markov Chains and MDPs

A. Additional Figures All figures here pertain to the gridworld experiment in Section 6.1.

A.1. Determining K See Figure 6 below, following the discussion in section 5.1.

Figure 6. Histogram of the average ordered eigenvalue energy (the square of the eigenvalue) where the mean is taken over states and actions. There are two large eigenvalues, corresponding to K = 2.

A.2. Block Matrix of Raw Distance Estimates See Figure 7 below, which presents the raw distance matrix before thresholding, to provide a sense of the quality of the pairwise distance estimates themselves. These could also be used for agglomerative clustering, for example.

Figure 7. Block structure of the matrix of squared pairwise distance estimates (after sorting).

A.3. Determining The Threshold τ See Figure 8 below, following the discussion in section 5.2.

Learning Mixtures of Markov Chains and MDPs

Figure 8. Histogram (and KDE) of pairwise squared distance estimates in projected subspace above, and accuracy against thresholds below. Note how there is a spurious mode around the 0.00015 mark, and picking any threshold past it yields a significant drop in accuracy.

A.4. Local Extrema in EM See Figure 9 below, illustrating how EM often gets stuck in suboptimal local extrema, given by the low final log-likelihood values recorded in the scatterplot.

Figure 9. Scatter-plot of likelihoods v.s. clustering accuracy achieved by the randomly-initialized soft EM algorithm over 30 trials on gridworld. Randomly-initialized soft EM does not achieve the global maximum all of the time.

A.5. Comparing End-To-End Performance Using Soft and Hard EM We compare various initializations of EM (1) random initializations, (2) models from Nclust, and (3) classification and clustering labels from Nclust and Nsub this time using both soft and hard EM.

Learning Mixtures of Markov Chains and MDPs

Figure 10. End-to-end error v.s. trajectory length on (left) 1000 MDP trajectories from the gridworld dataset and (right) 750 Markov chain trajectories from the Last.fm dataset, comparing various initializations of the soft and the hard EM algorithm. Results averaged over 30 trials, with 30 random initializations for randomly-initialized EM within each trial.

Learning Mixtures of Markov Chains and MDPs

B. Discussion on Using Random Projections We note that those familiar with the intuition behind the Johnson-Lindenstrauss lemma would guess that a projection to a random n-dimensional subspace for low n would preserve distances with good accuracy. However, note that the bound on the dimension n needed to preserve distances between our Nclust estimators up to a multiplicative distortion of 1 ϵ is log(Nclust)

ϵ2 . This bound is known to be tight, see for example Larsen & Nelson (2017). Upon thought, this shows that to get good distortion bounds (which will contribute to the deviation between distance estimates and the thresholds), we need a large dimension, interpreted as being affected by the 1/ϵ2. In fact, as soon as log(Nclust) exceeds 1, we will need a dimension of order 1/ 2, while K can be arbitrarily small compared to this.

In the gridworld case, K = 2, and we see that we don t get good performance using a random subspace until we hit dimension 50, where the maximum dimension is S = 64. Clearly, the 1/ϵ2 term in the Johnson-Lindenstrauss lemma drastically affects the performance of using random subspaces. Using a random subspace of dimension 50 for S = 64 is much closer to not projecting at all than to using a subspace of dimension 2.

Figure 11. Clustering error using random projections of varying dimension for a trajectory length of 100, benchmarked against the performance of the "with subspace" and "without subspace" versions. The gridworld MDP dataset is on the left, while the Last.fm Markov chain dataset is on the right.

Learning Mixtures of Markov Chains and MDPs

C. Details of the EM Algorithm We describe the E and M steps for hard EM below first, for simplicity.

M-step: Given the cluster labels, we can estimate each model with the MLE as:

ˆPk(s |s, a)

P n Nclust 1n Ck N(n, s, a, s ) P

n Nclust 1n Ck N(n, s, a)

n Nclust 1n Ck

Nclust = |Ck| Nclust

Readers can convince themselves that this is truly the MLE estimate by making the following observation. We can write the log-likelihood of the predicted clusters Ck and estimated models as PK k=1 P

n Nclust 1n Ckℓ(ˆPk, ˆfk, n), where

ℓ(ˆPk, ˆfk, n) = log fk Q

s,s ,a(ˆPk(s | s, a))N(n,s,a,s ) . The rest of the derivation mimics the well-known and straightforward computation for Markov chains, using Lagrange multipliers to constrain the estimates to probability distributions.

E-step: On new or unseen data, assign cluster membership according to the following rule:

km argmaxk ℓ(ˆPk, ˆfk, m) + log( ˆfi) (1)

where ℓ(ˆPk, m) is as above.

Note that for soft EM, we can replace every occurrence of 1n Ck in the M-step with pn(k), where pn( ) is the posterior for trajectory n having label k, which is constantly updated during soft EM. For the E-step, we replace the argmax computation by a computation of pn(k) = P(kn = k | ˆPk, ˆfk, 1 k K). Intuitively described, in hard EM, we recompute the values of 1n Ck using the argmax during the E-step, while in soft EM, we recompute the values of pn(k).

Learning Mixtures of Markov Chains and MDPs

D. The Classification Algorithm Note that we define a new quantity, ˆfk,s,a, which is the proportion of trajectories with label k among all trajectories in Nclust where s, a is observed. Quantities N(n, s, a), N(n, i, s, a, ) and N(n, i, s, a) carry their usual meanings with respect to either Nclust until step 5 and with respect to Nclass after that.

Algorithm 3 Classification

1: Input: Clusters Ck Nclust, models ˆPk( | s, a) estimated from Ck, and a set Nclass of trajectories to classify. 2: Compute ˆfk,s,a for all k, s, a.

3: Compute Ms,a = PK k=1 ˆfk,s,aˆPk( |s, a)ˆPk( |s, a)T and store the orthogonal projector VT s,a to its top-K eigenspace, for each (s, a).

4: Compute ˆdk = 1 |Ck| P n Ck N(n,s,a)

G for all k.

5: Compute D = PK k=1 ˆdkˆd T k and store the orthogonal projector UT to its top-K eigenspace. 6: Compute the set SAβ by picking (s, a) pairs with occurrence more than β 7: dn,1, dn,2 0 RSA

8: for (i, s, a) {1, 2} S A do 9: Compute N(n, i, s, a, ), N(n, i, s, a), n

10: ˆPn,i( |s, a) N(n,i,s,a, )

N(n,i,s,a) 1N(n,i,s,a) =0, n

11: [ˆdn,i]s,a N(n,i,s,a)

G , n 12: end for 13: for (n, k) Nclust {1, 2, . . . K} do 14: for (i, s, a) {1, 2} S A do 15: ˆ i,s,a := (ˆPn,i( | s, a) ˆPk( |s, a)) VT s,a 16: end for 17: dist1(n, k) := maxs,a ˆ T 1,s,a ˆ 2,s,a 18: dist2(n, k) := (ˆdn,1 ˆdk)T UUT (ˆdn,2 ˆdk) 19: dist(n, k) := λ dist1(n, k) + (1 λ) dist2(n, k) 20: end for 21: Assign kn argmink dist(n, k) for each n.

Learning Mixtures of Markov Chains and MDPs

E. Proof of Theorem 2 E.1. Proof of the theorem We recall the theorem here.

Theorem 2 (Subspace Estimation Guarantee). Consider 2 models with labels k1, k2 and a state-action pair s, a with dmin(s, a) α/3. Consider the output VT s,a of Algorithm 1. Let fmin = min(fk1, fk2) be the lower of the label prevalences. Remember that each trajectory has length Tn.

Then given that Nsub = Ω log(1/δ)

α2 , Tn = Ω(tmix log4(1/α)), with probability at least 1 δ, for k = k1, k2

Pk( | s, a) Vs,a VT s,a Pk( | s, a) 2 ϵsub(δ)

For Tn = Ω tmix log3 fmin Nsubα KS log(1/δ)

ϵsub(δ) = O

S Nsub α3 log 1

While for Tn = O tmix log3 fmin Nsubα KS log(1/δ)

ϵsub(δ) = O

Alternatively, we only need Nsub = Ω K2S log(1/δ)

f 2 minα3ϵ4 and Tn = Ω tmix log3(1/ϵ) log4(1/α) trajectories for ϵ accuracy in subspace estimation with probability at least 1 δ.

Remark 5. We can convert the α3 in the denominator to an α at the cost of making Tn more heavily dependent on α (more than just log(1/α)). Intuitively, α accounts for the probability of not observing s, a, so this is just saying that we can shift the onus for that from the number of trajectories to their length. We chose not to do that since we are trying to minimize the length of trajectories needed, and assume that we have access to many trajectories.

Proof. The main input is the proposition below, proved in the next section.

Proposition 1. Consider L < K models with labels jl, 1 l L, with dmin(s, a) := minl djl(s, a). Consider the output VT s,a of Algorithm 1. Let fmin = minl fjl be the minimum frequency across these models in the mixture. Remember that each trajectory has length Tn. Then we have the guarantee that with probability at least 1 δ

Pj( | s, a) Vs,a VT s,a Pj( | s, a) 2

is bounded above by v u u t 4K fmindmin(s, a)

128 Nsub dmin(s, a)(2S log(12) + log(4/δ)) + 1

Tn 8Gtmix !

for all j {jl | 1 l L}, when Nsub 32 dmin(s,a)2 log 1

δ and Tn 8tmix > G log(48G/dmin(s,a))

For a state-action pair with dmin(s, a) α/3, the conditions simplify to Nsub Ω log(1/δ)

Ω(Gtmix log(G/α)). We set G = Tn tmix

3 to get bounds that only depend on Tn. Note that this means a sufficient

Learning Mixtures of Markov Chains and MDPs

condition on Tn is Tn Ω(tmix log4(1/α)) (one can show this with an elementary computation). Also note that

S + log(1/δ)

Then with probability at least 1 δ, the following bound holds for any label j = jl for some l.

Pj( | s, a) Vs,a VT s,a Pj( | s, a) 2 O

v u u u t K fminα

So, there is a universal constant C2 so that for Tn > C2tmix log3 fmin Nsubα KS log(1/δ) ,

S Nsub α3 log 1

While for Tn = O tmix log3 fmin Nsubα KS log(1/δ) ,

S Nsub α3 log 1

So, combining all these, for Nsub = Ω log(1/δ)

α2 , Tn = Ω(tmix log4(1/α))

For Tn = Ω tmix log3 fmin Nsubα KS log(1/δ)

ϵsub(δ) = O

S Nsub α3 log 1

While for Tn = O tmix log3 fmin Nsubα KS log(1/δ)

ϵsub(δ) = O

E.2. Proof of the Proposition 1 We recall the proposition here.

Proposition 1. Consider L < K models with labels jl, 1 l L, with dmin(s, a) := minl djl(s, a). Consider the output VT s,a of Algorithm 1. Let fmin = minl fjl be the minimum frequency across these models in the mixture. Remember that each trajectory has length Tn. Then we have the guarantee that with probability at least 1 δ

Pj( | s, a) Vs,a VT s,a Pj( | s, a) 2

Learning Mixtures of Markov Chains and MDPs

is bounded above by

v u u t 4K fmindmin(s, a)

128 Nsub dmin(s, a)(2S log(12) + log(4/δ)) + 1

Tn 8Gtmix !

for all j {jl | 1 l L}, when Nsub 32 dmin(s,a)2 log 1

δ and Tn 8tmix > G log(48G/dmin(s,a))

Remark 6. We should point out that we will only need L = 2 for subsequent theorems. Also, remember that only s, a with dmin(s, a) > α will be relevant in subsequent theorems, with α as in our assumption.

Proof. For brevity of notation, we will denote cn,i := N(n, i, s, a), wn,i := N(n, i, s, a, ) and suppress the (s, a) dependence. We will first need the following lemma which guarantees that we can get past mixing and concentration hurdles with our estimator, modulo actually observing s, a in both segments.

Lemma 1. Let Bn be the event given by n Ntraj(s, a), which is the same as cn,1cn,2 = 0 and let

j=1 P(kn = j | Bn)Pj( | s, a)Pj( | s, a)T

Call our estimator ˆMs,a. Then we know that

ˆMs,a = 1 Ntraj(s, a)

n ˆPn,1( | s, a)ˆPn,2( | s, a)T

and we have

ˆMs,a Ms,a <

32 Ntraj(s, a)(2S log(12) + log(2

δ )) + 48G dmin(s, a)

Remark 7. Note that since all trajectories are generated independently of each other and the process that generates them is identical, P(kn = j Bn) is the same for all n. A similar observation holds for many conditional/unconditional probabilities and conditional/unconditional expectations in this proof, and will not be stated again.

Assume the lemma for now. The proof is delayed to after the proof of the theorem. We will combine this lemma with Lemma 3 from Chen & Poor (2022). In the context of their lemma, p(j) = P(kn = j | Bn), y(j) = Pj( | s, a). Now, we can use the first term on the right-hand side of the bound in Lemma 3 of Chen & Poor (2022) to get that for any 1 l L

Pjl( | s, a) Vs,a VT s,a Pjl( | s, a) 2

2K minl(P(kn = jl | Bn)) ˆMs,a Ms,a (2)

E.2.1. LOWER BOUNDING P(kn = jl | Bn) Note that

P(kn = jl | Bn) = P(kn = jl)P(Bn | kn = jl)

P(Bn) fjl P(Bn | kn = jl)

So, we need only lower bound P(Bn | kn = jl), for which we will need a lemma. We will use the following crucial lemma several times in our proofs. This is where we use (Yu, 1994) s blocking technique.

Learning Mixtures of Markov Chains and MDPs

Lemma 2. Consider a function h on segments of a Markov chain with mixing time tmix = tmix( 1

4) with C = sup h. Consider the joint distribution χ over the product of the σ-algebras of n such segments, with marginals χi. Let the product distribution of the marginals χi be called Ξ. Then for λ = ( 1

1 tmix and for the minimum distance between consecutive segments being an, we have |Eχh EΞh| 4C(n 1)λan

Proof. Remember that each of our Markov processes is mixing, so there exists tmix,j = tmix,j( 1

4) and a stationary distribution dj so that TV (P n j (x, ), dj) < 1

4 for n tmix,j. Let tmix = maxj tmix,j. Since the decay in total variation distance is multiplicative, TV (P n j (x, ), dj) < ( 1

4)c for all j and n ctmix. This implies that

max j TV (P n j (x, ), dj) < 1

Tn 4tmix 1 = 4λn

where λ = ( 1

This means that we satisfy the definition of V -geometric ergodicity from Vidyasagar (2010), with V being the constant function with value 1, µ = 4 and λ as above. That means that any of our processes is beta-mixing by (the proof of) Theorem 3.10 from the text and βn µλn = 4λn

we employ an argument analogous to the setup and argument used to prove Lemma 4.1 of Yu (1994), merely with Hi s replaced by the segments of arbitrary length instead of an-sized blocks while Ti s stay at an sized blocks. Then, Q from Corollary 2.7 is the probability distribution of the segments here, Ωi from Corollary 2.7 is the real vector space of the same dimension as the length of the ith segment, Σi is the product Borel field on this vector space and m in the theorem is the number of segments n here (note that n is called µn in Lemma 4.1). Q is the product distribution over the marginals of Q, as in the theorem. Note that β(Q) from Corollary 2.7 used in the proof remains less than βan. Now we can directly quote Corollary 2.7 to conclude that |Eχh EΞh| C(n 1)βan 4C(n 1)λan

Define h = 1(cn,1cn,2=0)

We are now ready to bound P(Bn | kn = j) = P(cn,1cn,2 = 0 | kn = j). Consider the joint distribution over the segments Ω1 and Ω2 of a trajectory sampled from hidden label j. Call this χ and let its marginals on Ωi be χi. Let the product distribution of its marginals be Ξ := χ1 χ2. Notice that then

EΞh = P(cn,1 = 0 | kn = j)P(cn,2 = 0 | kn = j)

by definition of Ξ. Also, clearly we have

Eχh = P(cn,1cn,2 = 0 | kn = j)

Now, using Lemma 2, we get that for C = sup h = 1 and n = 2, we have the following inequality.

|P(cn,1cn,2 = 0 | kn = j) P(cn,1 = 0 | kn = j)P(cn,2 = 0 | kn = j)| = |Eχh EΞh| 4λ Tn

Additionally, for i = 1, 2, if dt,j(s, a) is the distribution at time t, the following is obtained by the definition of mixing times.

P(cn,i = 0 | kn = j) (1 d(2i 1)T,j(s, a))

(1 dj(s, a) + TV (d(2i 1)T,j, π))

Learning Mixtures of Markov Chains and MDPs

(1 dmin(s, a) + 4λ Tn

1 dmin(s, a)

where the last inequality holds for Tn > 4tmix log(8/dmin(s,a))

log 4 . This allows us to use inequality 3 and

P(cn,1cn,2 = 0 | kn = j) 4λ Tn

4 + P(cn,1 = 0 | kn = j)P(cn,2 = 0 | kn = j)

4 + 1 dmin(s, a)

1 dmin(s, a) + dmin(s, a)2

1 dmin(s, a) + dmin(s, a)

1 dmin(s, a)

where the last inequality holds for Tn > 4tmix log(16/dmin(s,a))

log 4 . We conclude that for Tn > 4tmix log(16/dmin(s,a))

log 4 , and j = jl for some l,

P(Bn | kn = j) dmin(s, a)

min l fjl(P(kn = jl | Bn)) min l fjl(P(kn = jl Bn))

min l fjl(P(Bn | kn = j)P(kn = j))

fmindmin(s, a)

We can thus conclude that for Tn > 4tmix log(16/dmin(s,a))

Pjl( | s, a) Vs,a VT s,a Pjl( | s, a) 2

4K fmindmin(s, a) ˆMs,a Ms,a (6)

E.2.2. ABSORBING THE EXTRA TERMS INTO THE EXPONENT OF 1/4 Now remember from Lemma 1 that

ˆMs,a Ms,a <

32 Ntraj(s, a)(2S log(12) + log(2

δ )) + 48G dmin(s, a)

Notice that for Tn 8tmix > G log(48G/dmin(s,a))

log 2 > log(16/dmin(s,a))

2 log 4 , we have that

48G dmin(s, a)

Tn 8Gtmix = 48G dmin(s, a)

Tn 16Gtmix 1

= 48G dmin(s, a)

Tn 8Gtmix 1

Learning Mixtures of Markov Chains and MDPs

E.2.3. BOUNDING THE CONCENTRATION TERM We finally need to bound Ntraj(s, a) from below to bound the first term in this sum. Note that E[Ntraj(s, a)] Nsub(1 P(cn,1cn,2 = 0)) Nsub dmin(s,a)

2 from equation 5 above. Now, by Hoeffding s inequality, we have

P Ntraj(s, a) < Nsub dmin(s, a)

= P Ntraj(s, a) < Nsub dmin(s, a)

2 Nsub dmin(s, a)

P Ntraj(s, a) < E[Ntraj(s, a)] Nsub dmin(s, a)

n Nsub 1cn,1cn,2 =0 < Nsub E[1cn,1cn,2 =0] Nsub dmin(s, a)

exp dmin(s, a)2Nsub

This is less than δ for Nsub 8 dmin(s,a)2 log 1

Combining this with equation 6 and splitting the two δ, we have our result that

Pj( | s, a) Vs,a VT s,a Pj( | s, a) 2

is bounded above by v u u t 4K fmindmin(s, a)

128 Nsub dmin(s, a)(2S log(12) + log(4/δ)) + 1

Tn 8Gtmix !

for Nsub 8 dmin(s,a)2 log 1

δ and Tn 8tmix > G log(48G/dmin(s,a))

E.3. Proof of Lemma 1 We recall Lemma 1.

Lemma 1. Let Bn be the event given by n Ntraj(s, a), which is the same as cn,1cn,2 = 0 and let

j=1 P(kn = j | Bn)Pj( | s, a)Pj( | s, a)T

Call our estimator ˆMs,a. Then we know that

ˆMs,a = 1 Ntraj(s, a)

n ˆPn,1( | s, a)ˆPn,2( | s, a)T

and we have

ˆMs,a Ms,a <

32 Ntraj(s, a)(2S log(12) + log(2

δ )) + 48G dmin(s, a)

Proof. We divide the proof into subsections. We first remind ourselves that the estimator ˆMs,a is given by the matrix

ˆMs,a = 1 Ntraj(s, a)

n Ntraj(s,a)

ˆPn,1( |s, a)ˆPn,2( |s, a)T

Learning Mixtures of Markov Chains and MDPs

E.3.1. ESTIMATING E[ ˆMs,a] We will split the expectation into the desired term and the error coming from correlation between the two segments Ω1 and Ω2. Remember that for brevity of notation, let cn,i := N(n, i, s, a), wn,i := N(n, i, s, a, ). Call the estimate from each trajectory a random variable ˆMn,s,a, that is

ˆMn,s,a = wn,1w T n,2 cn,1cn,2

Now ˆMs,a = X

n Ntraj(s,a)

1 Ntraj(s, a) ˆMn,s,a

Remember that ˆPn,i( | s, a) := wn,i

cn,i 1cn,i =0

Let kn be the hidden label for trajectory n, as usual. Define the event Bn to be n Ntraj(s, a), which is the same as cn,1cn,2 = 0. We establish the following equality, essentially just defining the quantity Mix(j).

E[ ˆMn,s,a | Bn] = E

" wn,1w T n,2 cn,1cn,2

j=1 P(kn = j | Bn)E

" wn,1w T n,2 cn,1cn,2

j=1 P(kn = j | Bn)Pj( | s, a)Pj( | s, a)T +

j=1 P(kn = j | Bn) Mix(j) (7)

where Mix(j) = E wn,1w T n,2 cn,1cn,2

Pj( | s, a)Pj( | s, a)T . Notice that this has connotations of covariance. Now

note the following chain of equations.

E[ ˆMs,a | Ntraj(s, a)] = E

n Ntraj(s,a)

1 Ntraj(s, a) ˆMn,s,a

Ntraj(s, a)

n Ntraj(s,a)

1 Ntraj(s, a)E h ˆMn,s,a Ntraj(s, a) i

n Ntraj(s,a)

1 Ntraj(s, a)E h ˆMn,s,a 1B1, 1B2, . . . 1BNsub

n Ntraj(s,a)

1 Ntraj(s, a)E h ˆMn,s,a Bn i

= E h ˆMn,s,a Bn i

j=1 P(kn = j | Bn)Pj( | s, a)Pj( | s, a)T +

j=1 P(kn = j | Bn) Mix(j)

j=1 P(kn = j | Bn) Mix(j) (8)

Here, the third equality is because the set Ntraj(s, a) is exactly described by the indicators listed, and they generate the same σ-algebra, The fourth equality holds since all trajectories are independent and so conditioning on events in other trajectories doesn t affect the expectation of ˆMn,s,a. The fifth equality is because E[ ˆMn,s,a | Bn] is the same for all n as determined above (in fact, we have shown that it is a constant random variable).

Learning Mixtures of Markov Chains and MDPs

E.3.2. SETUP FOR THE MAIN BOUND We have that

j=1 P(kn = j | Bn)Pj( | s, a)Pj( | s, a)T

By equation 8,

ˆMs,a Ms,a ˆMs,a E[ ˆMs,a | Ntraj(s, a)] +

j=1 P(kn = j | Bn) Mix(j)

ˆMs,a E[ ˆMs,a | Ntraj(s, a)] +

j=1 P(kn = j | Bn)

max j Mix(j)

= ˆMs,a E[ ˆMs,a | Ntraj(s, a)] + max j Mix(j)

The first term represents the error in concentration across trajectories and the second term represents the correlation between the two segments Ω1 and Ω2 in the same trajectory. We bound the first using a covering argument and use Bin Yu s work to bound the other.

E.3.3. COVERING ARGUMENT TO BOUND ˆMs,a E[ ˆMs,a] We will need this conditional version of Hoeffding s inequality for this section. Note that this is not quite the Azuma Hoeffding inequality with a constant filtration due to the conditional probability involved, as well as due to the conditional independence needed.

Lemma 3. Consider a σ-algebra F and let Ai Bi be random variables measurable over it. If random variables Xi are almost surely bounded in [Ai, Bi] and are conditionally independent over some σ-algebra F, then the following inequalities hold for Sn = Pn i=1 Xi

P (Sn E[Sn | F] > ϵ|F) exp 2ϵ P

P (Sn E[Sn | F] < ϵ|F) exp 2ϵ2 P

Proof. The proof is essentially a repeat of one of the standard proofs of Hoeffding s inequality. Note that we have the conditional Markov inequality P(X a | F) 1

a E[X a | F], shown exactly the way Markov s inequality is shown. Now, we have the following chain of inequalities.

P((Sn E[Sn | F] > ϵ|F) = e sϵE[e Sn E[Sn|F] | F]

i=1 E[e Xi E[Xi|F] | F]

We now show a conditional expectation version of Hoeffding s lemma by repeating the steps for a standard proof to show that E[e X E[X|F] | F] λ2(B A)2

8 for random variables A B measurable over F and X [A, B] almost surely. Note that by convexity of eλx, we have the following for x [A, B] at any value of A and B.

B AeλA + x A

WLOG, we can replace X by X E[X | F] and assume E[X | F] = 0. In that case, we note the following inequality,

where we define for any fixed value of A and B the function L(y) := y A B A + log 1 + A ey B

E[eλX | F] B E[X | F]

B A eλA + E[X | F] A

Learning Mixtures of Markov Chains and MDPs

= B B AeλA + A B AeλB

= e L(λ(B A))

Basic computations involving Taylor s theorem from a standard proof of Hoeffding s inequality show that L(y) y2

any value of A, B. This gives us the condition version of Hoeffding s lemma, E[e X E[X|F] | F] λ2(B A)2

8 . This allows us to establish the following chain of inequalities.

P((Sn E[Sn | F] > ϵ|F) = e sϵ n Y

i=1 E[e Xi E[Xi|F] | F]

i=1 exp s2(Bi Ai)2

Since s is arbitrary, we can pick s = 4ϵ P

i(Bi Ai)2 above to get an upper bound of exp 2ϵ2 Pn i=1(Bi Ai)2 . The other inequality is proved analogously.

We now show that the first term from the previous section concentrates. Pick u, v SS 1, that is they lie in the unit Euclidean norm sphere in RS. We need only bound this term when Ntraj(s, a) = 0, as otherwise the lemma holds vacuously.

ˆMs,a E[ ˆMs,a | Ntraj(s, a)] = X

n Ntraj(s,a)

ˆMn,s,a E[ ˆMn,s,a | Ntraj(s, a)]

Ntraj(s, a)

Now we set up our covering argument. Consider a covering of SS 1 by balls of radius 1

4. We will need at most 12S such balls and if C is the set of their centers, then for any matrix X, the following holds in regard to its norm.

X = sup u,v SS 1 |u T Xv| 2 max u,v C |u T Xv| 2 X (9)

For any pair u,v C, note that

|u T ˆMn,s,av| = u T wn,1

w T n,2 cn,2 v

1cn,1cn,2 =0

and so |u T E[ ˆMn,s,a]v| E[|u T ˆMn,s,av|] 1. A little thought shows that the estimates ˆMn,s,a are independent for n Ntraj(s, a) when conditioned on the Ntraj(s, a).

n Ntraj(s,a)

1 Ntraj(s, a)u T ( ˆMn,s,a E[ ˆMn,s,a | Ntraj(s, a)])v

Ntraj(s, a)

< 2e ϵ2Ntraj (s,a)

Learning Mixtures of Markov Chains and MDPs

Doing this for all 122S pairs u,v, we use inequality 9 to get that the conditionalprobability given by

n Ntraj(s,a)

1 Ntraj(s, a) ˆMn,s,a E[ ˆMn,s,a | Ntraj(s, a)]

Ntraj(s, a)

is bounded above by the following expression.

n Ntraj(s,a)

1 Ntraj(s, a)u T ( ˆMn,s,a E[ ˆMn,s,a | Ntraj(s, a)])v

Ntraj(s, a)

n Ntraj(s,a)

1 Ntraj(s, a)u T ( ˆMn,s,a E[ ˆMn,s,a | Ntraj(s, a)])v

Ntraj(s, a)

< 2 122S e ϵ2Ntraj (s,a)

This is less than δ for Ntraj(s, a) > 32

ϵ2 (2S log(12) + log( 2

δ )). Since this holds for such values of Ntraj(s, a) irrespective of Ntraj(s, a), we can conclude that for Ntraj(s, a) > 32

ϵ2 (2S log(12) + log( 2

δ )), with probability universally greater than 1 δ,

ˆMs,a Ms,a < ϵ

2 + max j Mix(j)

Alternatively, this establishes that with probability greater than 1 δ, we have the following inequality involving the random variables ˆMs,a and Ntraj(s, a).

ˆMs,a Ms,a <

32 Ntraj(s, a)(2S log(12) + log(2

δ )) + max j Mix(j)

E.3.4. BOUNDING THE MIXING TERM We now resolve the last remaining thread, which is that of bounding the mixing term. Let s fix a j for this section, since proving our upper bounds for arbitrary j is sufficient. Let the joint distribution of the observations under label j be χ. Let its marginal on the segment Ωi be χi. Let the marginals on each of the G single-step sub-blocks be χi,g. Denote the product distribution Q

g χi,g by Qi.

" wn,1w T n,2 cn,1cn,2

Pj( | s, a)Pj( | s, a)T

" wn,1w T n,2 cn,1cn,2 1Bn

Pj( | s, a)Pj( | s, a)T

" wn,1w T n,2 cn,1cn,2 1Bn

cn,1 1cn,1 =0

" w T n,2 cn,2 1cn,2 =0

cn,1 1cn,1 =0

" w T n,2 cn,2 1cn,2 =0

PQ1(cn,1 = 0)PQ2(cn,2 = 0)Pj( | s, a)Pj( | s, a)T

(PQ1(cn,1 = 0)PQ2(cn,2 = 0) P(cn,1 = 0)P(cn,2 = 0))Pj( | s, a)Pj( | s, a)T (

(P(cn,1 = 0)P(cn,2 = 0) P(Bn)) Pj( | s, a)Pj( | s, a)T

" wn,1w T n,2 cn,1cn,2 1Bn

cn,1 1cn,1 =0

" w T n,2 cn,2 1cn,2 =0

Learning Mixtures of Markov Chains and MDPs

cn,1 1cn,1 =0

" w T n,2 cn,2 1cn,2 =0

PQ1(cn,1 = 0)PQ2(cn,2 = 0)Pj( | s, a)Pj( | s, a)T

+ 1 P(Bn) |PQ1(cn,1 = 0)PQ2(cn,2 = 0) Pχ1(cn,1 = 0)Pχ2(cn,2 = 0)|

+ 1 P(Bn) |Pχ1(cn,1 = 0)Pχ2(cn,2 = 0) P(Bn)| (10)

Here, in the last inequality, we used the fact that Pj( | s, a) 2 Pj( | s, a) 1 = 1 and aa T a 2 a 2. Also note that Pχi(cn,i = 0) = Pχ(cn,i = 0) = P(cn,i = 0).

Intuitively, the first term represents mixing of the expectation across the two segments, the second term represents mixing of the expectations across the single-step sub-blocks inside segments, the third term represents mixing of the observation probabilities across the single-step sub-blocks inside segments, and the fourth term represents mixing of the observation probabilities across the two segments. In short, the first and fourth represent segment-level mixing while the second and third represent sub-block-level mixing.

Bounding the first term (segment-level mixing)

We will use (Yu, 1994) s blocking technique again, invoking Lemma 2. Pick an arbitrary u,v SS 1. Recall that

ˆPn,i( | s, a) := N(n, i, s, a, )

N(n, i, s, a) 1N(n,i,s,a) =0 = wn,i

cn,i 1cn,i =0

Consider the real-valued random variable

hu,v := u T wn,1w T n,2 cn,1cn,2 1Bn

We have the following basic computations for expectations. Remember that 1Bn = 1cn,1cn,2 =0 = 1cn,1 =01cn,2 =0.

Eχhu,v = u T

" wn,1w T n,2 cn,1cn,2 1Bn

Eχ1 χ2hu,v = u T

cn,1 1cn,1 =0

" w T n,2 cn,2 1cn,2 =0

This allows us to establish the following relation.

sup u,v SS 1 |Eχhu,v Eχ1 χ2hu,v| =

" wn,1w T n,2 cn,1cn,2 1Bn

cn,1 1cn,1 =0

" w T n,2 cn,2 1cn,2 =0

Now, we want to use Lemma 2. Note the following upper bound.

So, we can use Lemma 2 with C = Cu,v := sup hu,v and n = 2 for any u,v SS 1, giving us the following inequality.

|Eχhu,v Eχ1 χ2hu,v| 4λ Tn

Learning Mixtures of Markov Chains and MDPs

Since inequality 11 holds for any u, v SS 1, we can take the supremum over such u, v to get the desired inequality below. We also recall that P(Bn) dmins,a

2 from equation 5.

" wn,1w T n,2 cn,1cn,2 1Bn

cn,1 1cn,1 =0

" w T n,2 cn,2 1cn,2 =0

# 1 P(Bn)4λ Tn

4 dmin(s, a)

Bounding the second term (sub-block-level mixing)

Remember that the product distribution Q

g χi,g is Qi. First note that, since under Qi, each observation is independent, we have the following expectation.

cn,i 1cn,i =0

EQi[wn,i | cn,i]

cn,i 1cn,i =0

Pj( | s, a)cn,i

cn,i 1cn,i =0

= Pj( | s, a)PQi(cn,i = 0) (12)

Remark 8. Note that this holds crucially because we are working with the product distribution Qi over the single-step sub-blocks.

Also, let hu = u T wn,1

cn,1 1cn,1 and let gv = w T n,2 cn,2 1cn,2v. Then the second term is exactly given by the following expression.

1 P(Bn) sup u,v SS 1 |Eχ1[hu]Eχ2[gv] EQ1[hu]EQ2[gv]|

Also note that both |hu| and |gv| are bounded by 1. We then have the following chain of inequalities.

|Eχ1[hu]Eχ2[gv] EQ1[hu]EQ2[gv]| |Eχ1[hu] EQ1[hu]| |Eχ2[gv]| + |Eχ2[gv] EQ2[gv]||EQ1[hu]|

|Eχ1[hu] EQ1[hu]| + |Eχ2[gv] EQ2[gv]|

Since the single step sub-blocks are separated by at least Tn

8G timesteps, we can apply Lemma 2 with C = 1 and n = G to get bounds on both terms here, since Qi = Q

g χi,g. Also remember that P(Bn) dmin(s,a)

2 from equation 5.

1 P(Bn) sup u,v SS 1 |Eχ1[hu]Eχ2[gv] EQ1[hu]EQ2[gv]| 1 P(Bn)

4Gλ Tn 8G + 4Gλ Tn 8G

16Gλ Tn 8G dmin(s, a)

Bounding the third term (sub-block-level mixing)

Again, note that the third term is given by the following expression.

EQ1[1cn,1 =0]EQ2[1cn,2 =0] Eχ1[1cn,1 =0]Eχ2[1cn,2 =0]

Learning Mixtures of Markov Chains and MDPs

We can bound this above using the fact that |ab cd| |b||a c| + |c||b d|, to get the following upper bound.

EQ2[1cn,2 =0] EQ1[1cn,1 =0] Eχ1[1cn,1 =0] + Eχ1[1cn,1 =0] EQ2[1cn,2 =0] Eχ2[1cn,2 =0]

This in turn is bounded above by the expression below.

|EQ1[1cn,1 =0] Eχ1[1cn,1 =0]| + |EQ2[1cn,2 =0] Eχ2[1cn,2 =0]|

Since indicator functions are bounded above by 1, we can apply Lemma 2 as in the second term (C = 1, n = G) to bound both the differences above. Skipping the routine details, we finally get the following inequality, analogous to the second term.

EQ1[1cn,1 =0]EQ2[1cn,2 =0] Eχ1[1cn,1 =0]Eχ2[1cn,2 =0] 16Gλ Tn 8G dmin(s, a)

Bounding the fourth term (segment-level mixing)

Now note that the fourth term is the same as the expression below.

1 P(Bn)|Eχ1[1cn,1 =0]Eχ2[1cn,2 =0] Eχ[1cn,1 =01cn,2 =0]| = 1 P(Bn)|Eχ1 χ2[1cn,1 =01cn,2 =0] Eχ[1cn,1 =01cn,2 =0]|

We can now apply Lemma 2 with C = 1 and n = 2. The segments are separated by T and P(Bn) dmin(s,a)

2 , giving us the following bound. 1 P(Bn) |Pχ1(cn,1 = 0)Pχ2(cn,2 = 0) P(Bn)| 8λ Tn

4 dmin(s, a)

Combining all these, we get that

ˆMs,a Ms,a <

32G Ntraj(s, a)(2S log(12) + log(2

δ )) + 16 dmin(s, a)

Tn 4tmix + 32G dmin(s, a)

32 Ntraj(s, a)(2S log(12) + log(2

δ )) + 48G dmin(s, a)

Tn 8Gtmix (13)

as desired.

Learning Mixtures of Markov Chains and MDPs

F. Proof of Theorem 3

Theorem 3 (Exact Clustering Guarantee). Pick any pair of trajectories n, m. Then for Freqβ so that it contains (s, a) with dmin(s, a) Ω(α), Tn = Ω(tmix log4(1/δ)/α3), with probability at least 1 δ, dist1(m, n) m,n 2 2

+ 4ϵsub(δ/2)

This means that if we choose λ = 1, then if ϵsub(δ) 2/32 and Tn = Ω K3/2tmix log4(Nclust/(αδ))

6α3 , no distance

estimate attains a value between 2/4 and 2/2. So, Algorithm 2 attains exact clustering using a threshold of say 2/3 with probability at least 1 δ.

Proof. Consider the testing of trajectories m and n. Recall that we defined

dist1(m, n) := max (s,a) SAα

" ˆPn,1( | s, a) ˆPm,1( | s, a) T Vs,a

ˆPn,2( | s, a) ˆPm,2( | s, a) T Vs,a

Let km be the label of trajectory m and kn the label of trajectory n. According to our assumptions, if km = kn, then we have an s, a so that dkm(s, a), dkn(s, a) α and Pkm( | s, a) Pkn( | s, a) 2 . We will make s, a implicit in our notation except in Pj( | s, a). Let cn,i := N(n, i, s, a), wn,i := N(n, i, s, a, ). Recall that we have two nested partitions: (1) of the entire trajectory into the two Ωi and (2) of each segment Ωi into G blocks. Finally, define dist1,(s,a) as below, suppressing m and n. Note that dist1(m, n) is the maximum of dist1,(s,a) over all (s, a) Freqβ, for the given two trajectories m and n.

dist1,(s,a) := (ˆPn,1( | s, a) ˆPm,1( | s, a))T Vs,a (ˆPn,2( | s, a) ˆPm,2( | s, a))T Vs,a T

We want to show that this is close to m,n(s, a) 2 2 for the (s, a) pairs that we search over, where

m,n(s, a) = Pkm( | s, a) Pkn( | s, a)

Assume the lemma below for now, we prove it in the next subsection.

Lemma 4. We claim that there is a universal constant C1 so that for any (s, a) with dmin(s, a) α/3, with probability at least 1 δ,

dist1,(s,a) m,n(s, a) 2 2 C1

K + log(1/δ)

+ 4ϵsub(δ/2)

whenever Tn Ω(Gtmix log(G/δ) log(1/α)) and G Ω log(1/δ)

α2 . Here, ϵsub(δ) is the high probability bound on

Pj( | s, a) Vs,a VT s,a Pj( | s, a) 2 with j = kn, km, from Theorem 2 (satisfied with probability > 1 δ).

We now set G = Tn tmix

3 . Then a sufficient condition on Tn to meet the conditions of the lemma is Tn =

Ω(tmix log4(1/δ)/α3), under which, with probability at lest 1 δ, we have the following bound for (s, a) with dmin(s, a) α/3.

dist1,(s,a) m,n(s, a) 2 2 O

+ 4ϵsub(δ/2) (14)

Learning Mixtures of Markov Chains and MDPs

It is now easy to see that the first term on the right-hand side is less than 2/8 when Tn = Ω K3/2tmix log3/2(1/δ)

Tn = Ω(tmix log4(1/δ)/α3). We can combine these to have the guarantee that the first term on the right-hand side is less

2/8 with probability at least 1 δ when Tn = Ω K3/2tmix log4(1/δ)

Now note that if β α/3, then a separating state action pair always lies in Freqβ and thus, the maximum over the m,n(s, a) 2 2 values corresponding to Freqβ is in fact either 0 if km = kn or larger than 2 if km = kn. So, if ϵsub(δ/2) 2/32 and for each of the (s, a) pairs, the first term on the right-hand side of inequality 20 is less than 2/8, then our distance estimate dist1(m, n) is on the right side of any threshold as long as 2/4 τ 2/2. That is, the distance estimate is then less than the threshold if km = kn, and larger than it if km = kn.

Note that upon choosing an occurrence threshold of order α, we will have at most O(1/α) many (s, a) pairs in Freqβ to maximize dist1,(s,a) over to get dist1(m, n). By applying a union bound over all (s, a) pairs in Freqβ and using the conclusion

of the previous paragraph, we correctly determine if km = kn with probability 1 δ for Tn = Ω K3/2tmix log4(1/(αδ))

as long as ϵsub(δ/2) 2/32 and 2/4 τ 2/2.

By applying a union bound over incorrectly deciding whether or not km = kn for any of the Nclust(Nclust 1)/2 pairs, we

get that we can recover the true clusters with probability at least 1 δ for Tn = Ω K3/2tmix log4(Nclust/(αδ))

6α3 , whenever

ϵsub 2/32 and as long as ϵsub(δ/2) 2/32 and 2/4 τ 2/2.

F.1. Proof of Lemma 4 We recall the statement of the lemma.

Lemma 4. We claim that there is a universal constant C1 so that for any (s, a) with dmin(s, a) α/3, with probability at least 1 δ,

dist1,(s,a) m,n(s, a) 2 2 C1

K + log(1/δ)

+ 4ϵsub(δ/2)

whenever Tn Ω(Gtmix log(G/δ) log(1/α)) and G Ω log(1/δ)

α2 . Here, ϵsub(δ) is the high probability bound on

Pj( | s, a) Vs,a VT s,a Pj( | s, a) 2 with j = kn, km, from Theorem 2 (satisfied with probability > 1 δ).

Notation: We say cn,i = N(n, i, s, a) as in the statement of the lemma and wn,i = N(n, i, s, a, ). Let the joint distribution of the observations over the pair of trajectories (m, n) be χ. This means that χ is the product of the joint distribution of the observations over the trajectory m and that of the observations over the trajectory n, since trajectories are generated independently. Let its marginals on the segments Ωi be χi. Let the marginals on each of the G single-step sub-blocks along with their next states be χi,g. Denote the product distribution Q

g χi,g by Qi. Let G(s, a) denote the two sets of indices where the state-action pair (s, a) is observed in trajectories n and m. For brevity, we will abbreviate G(s, a) to G. Note that the sizes of these two sets are exactly cn,i and cm,i respectively.

We first prove some preliminary lemmas.

F.1.1. DECOMPOSITION OF | dist1,(s,a) m,n(s, a) 2 2 |

Lemma 5. We claim that for each fixed value of G(s, a) (abbreviated to G), with probability at least 1 δ, the following bound holds. dist1,(s,a) m,n(s, a) 2 2

i=1 2 i EQi[ i | G] 2 + 4ϵsub(δ) + 4 max i 1cn,i=0 + max i 1cm,i=0 (15)

Here cn,i = N(n, i, s, a), ϵsub(δ) is the high probability bound on Pjl( | s, a) Vs,a VT s,a Pjl( | s, a) 2 from Theorem 2 (satisfied with probability > 1 δ), and

T i = (ˆPn,i( | s, a) ˆPm,i( | s, a))T Vs,a

Remark 9. In the inequality,

Learning Mixtures of Markov Chains and MDPs

The first term is a concentration-type term, which will be broken into an independent concentration" error and a mixing error to account for the low but non-zero dependence across blocks.

The second term accounts for subspace estimation error.

The third term accounts for actually observing s, a in our blocks.

Proof. We first establish a simple inequality.

| dist1,(s,a) EQ1[ T 1 | G]EQ2[ 2 | G]| = | T 1 2 EQ1[ T 1 | G]EQ2[ 2| | G]

|( T 1 EQ1[ T 1 | G])EQ2[ 2 | G]| + | T 1 ( 2 EQ2[ 2 | G])|

1 EQ1[ 1 | G] 2 EQ2[ 2 | G] 2 + 1 2 2 EQ2[ 2 | G] 2 2 1 EQ1[ 1 | G] 2 + 2 2 EQ2[ 2 | G] 2 (16)

Remark 10. Notice that because of this inequality, the double estimator does not impact any theoretical guarantees for exact clustering w.h.p, which is the form of the guarantees in both Kong et al. (2020) and Chen & Poor (2022). However, we find that using a double estimator allows for better performance in real life. This makes sense because while exact clustering doesn t need a double estimator, approximate clustering w.h.p. does depend on the expectation of the distances across pairs of trajectories. This expectation is controlled by the covariance of 1 and 2.

We define the following quantity.

diffi = 1cn,i =0Pkm( | s, a) 1cm,i =0Pkn( | s, a)

Note that diffi 2 2. Note the following expectation, which uses the dieas from equation 12.

EQi[ i | G] = EQi h VT s,a(ˆPn,i( | s, a) ˆPm,i( | s, a)) i

= VT s,a EQi[ˆPn,i( | s, a) | G] EQi[ˆPm,i( | s, a) | G]

cn,i 1cn,i =0 | G EQi

cm,i 1cm,i =0 | G

EQi[wn,i | G]

cn,i 1cn,i =0 EQi[wm,i | G]

cm,i 1cm,i =0

Pkn( | s, a)cn,i

cn,i 1cn,i =0 Pkm( | s, a)cm,i

cm,i 1cm,i =0

= VT s,a Pkn( | s, a)1cn,i =0 Pkm( | s, a)1cm,i =0

= VT s,adiffi

We recall the following definition before proceeding to show the main inequality.

m,n(s, a) = Pkm( | s, a) Pkn( | s, a)

EQ1[ T 1 | G]EQ2[ 2 | G] m,n(s, a) 2 2 = diff T 1 Vs,a VT s,adiff2 diff T 1 diff2 + diff T 1 diff2 m,n(s, a) 2 2

diff1 2 diff2 Vs,a VT s,adiff2 2 + diff1 m,n(s, a) 2 diff2 2

+ diff1 2 diff2 m,n(s, a) 2

diff1 1 diff2 Vs,a VT s,adiff2 2 + diff1 m,n(s, a) 2 diff2 1

+ diff1 1 diff2 m,n(s, a) 2

Learning Mixtures of Markov Chains and MDPs

2 diff2 Vs,a VT s,adiff2 2 + 2 diff1 m,n(s, a) 2

+ 2 diff2 m,n(s, a) 2

2 Pkm( | s, a) Vs,a VT s,a Pkm( | s, a) 2

+ 2 Pkn( | s, a) Vs,a VT s,a Pkn( | s, a) 2 + 2 1cm,1=0Pkm( | s, a) 1cn,1=0Pkn( | s, a) 2 + 2 1cm,2=0Pkm( | s, a) 1cn,2=0Pkn( | s, a) 2 4ϵsub(δ) + 2 1cm,1=0 Pkm( | s, a) 2 + 1cn,1=0 Pkn( | s, a) 2

+ 2 1cm,2=0 Pkm( | s, a) 2 + 1cn,2=0 Pkn( | s, a) 2

4ϵsub(δ) + 4 max i 1cn,i=0 + max i 1cm,i=0

Combining this with inequality 16, we have the following final bound.

dist1,(s,a) m,n(s, a) 2 2

i=1 2 i EQi[ i | G] 2 + 4ϵsub(δ) + 4 max i 1cn,i=0 + max i 1cm,i=0 (17)

where we remind the reader that cn,i = N(n, i, s, a) and recall the definition of i.

T i = (ˆPn,i( | s, a) ˆPm,i( | s, a))T Vs,a

F.1.2. BOUNDING THE CONCENTRATION-TYPE TERM We bound the first term in the decomposition lemma (Lemma 5) with high probability.

Lemma 6. With probability at least 1 δ, when Tn Ω Gtmix log G

δ log(1/α) and G Ω log(1/δ)

α2 , we have the following bound.

i EQi[ i | G]) 2 O

K + log(1/δ)

Proof. Recall that the joint distribution of the observations over the pair of trajectories (m, n) is χ. Its marginals on the segments Ωi are χi. The marginals on each of the G single-step sub-blocks is χi,g. The product distribution Q

g χi,g is Qi. Recall that G(n, s, a) denotes the two sets of indices where (s, a) is observed in trajectory n and m respectively, and the sets have sizes cn,i and cm,i respectively.

Let wn,i,g be the one hot vector of the next state if the (i, g) sub-block witnesses (s, a), and the zero vector otherwise. Let cn,i,g be the indicator of (s, a) in the (i, g) sub-block. Then wn,i = P

g wn,i,g and cn,i = P

1. Covering argument for the product distribution

Pick a unit vector u RK and consider the following inequality. Remember that we abbreviate G(n, s, a) to G.

|u T ( i EQi[ i | G])| |u T Vs,a(ˆPn,i( | s, a) EQi[ˆPn,i( | s, a) | G])|

+ |u T Vs,a(ˆPm,i( | s, a) EQi[ˆPm,i( | s, a) | G])|

We work with the term for trajectory n, WLOG. Any bounds thus obtained will also apply to trajectory m. Notice the following equation.

|u T VT s,a(ˆPn,i( | s, a) EQi[ˆPn,i( | s, a) | G])| =

u T VT s,awn,i,g EQi[u T VT s,awn,i,g | G])

Learning Mixtures of Markov Chains and MDPs

Note that |u T VT s,awn,i,g| u 2 VT s,awn,i,g 2 1. Note that conditioned on the set of (s, a) observations in trajectory n, the next states are independent under the product distribution Qi (but not under χi, of course). Now, using the conditional version of Hoeffding s inequality from Lemma 3, we get the following bound.

u T VT s,awn,i,g EQi[u T VT s,awn,i,g | G]) > ϵ

Note that if X Y + Z, then P(X > ϵ

8) + P(Z > ϵ

8) by a union bound. We apply this to the inequalities above with X = |u T ( i EQi[ i])| to get the following concentration inequality.

PQi |u T ( i EQi[ i | G])| > ϵ

4 | G 2e ϵ2cn,i

32 + 2e ϵ2cn,i

32 = 4e ϵ2cn,i

Consider a covering of SK 1 by balls of radius 1/4. We will need at most 12K such balls. Call the set of their centers C. We know that for any vector v, the following holds.

sup u 2 1 u T v = v 2 2 sup u C u T v

We use this to arrive at the concentration inequality below.

PQi i EQi[ i | G]) 2 > ϵ

2 | G PQi u C; |u T ( i EQi[ i | G])| > ϵ

u C PQi |u T ( i EQi[ i | G])| > ϵ

< 4 12K e ϵ2cn,i

2. Bounding cn,i

We bound cn,i with high probability under the distribution Qi, using the regular Hoeffding s inequality, noting that EQi[cn,i] = P

g PQi(cn,i,g = 0) = P

g Pχi(cn,i,g = 0). We can show that Pχi(cn,i,g = 0) dmin(s,a)

2 for Tn Ω(Gtmix log(1/α)) by using the same kind of computation as in equation 4.

cn,i < Gdmin(s, a)

cn,i < Gdmin(s, a)

2 Gdmin(s, a)

cn,i < EQi[cn,i] Gdmin(s, a)

g 1cn,i,g =0 < X

g EQi[1cn,i,g =0] Gdmin(s, a)

exp dmin(s, a)2G

This is less than δ/2 for G Ω log(2/δ)

α2 Ω log(2/δ) dmin(s,a)2 . So for such G, remembering that G was an abbreviation for

the random set G(n, s, a),

PQi i EQi[ i | G]) 2 > ϵ

2 | G(n, s, a) 4 12K e ϵ2Gdmin(s,a)

Since this holds for all possible G(n, s, a) values and the right hand side doesn t depend on G(n, s, a), we can take the expectation over the random set G(n, s, a) to get the following inequality.

Learning Mixtures of Markov Chains and MDPs

PQi i EQi[ i | G]) 2 > ϵ

4 12K e ϵ2Gdmin(s,a)

3. Accounting for non-independence (mixing error)

We know that we can bound the difference in the probability of any event E between χi and Qi by applying Lemma 2 to the function h = 1E with n = G and C = 1 as we have before, giving us the following inequality.

Pχi i EQi[ i | G]) 2 > ϵ

PQi i EQi[ i | G]) 2 > ϵ

4 12K e ϵ2Gdmin(s,a)

We know that both terms are less than δ

4 when Tn Ω Gtmix log G

δ and G Ω K+log(1/δ)

ϵ2α , since dmin(s, a)

α/3. We thus have the following bound with probability at least 1 δ, when Tn Ω Gtmix log G

δ log(1/α) and

G Ω log(1/δ)

i EQi[ i | G]) 2 O

K + log(1/δ)

F.1.3. BOUNDING THE PROBABILITY OF NOT OBSERVING s, a We bound the third term in the decomposition lemma (Lemma 5) with high probability. We first need an auxiliary lemma for this.

Lemma 7. For Tn Ω(Gtmix log(1/α)), we have the following bound.

P(cn,i = 0) 1 dmin(s, a)

Remark 11. Again, we can think of this sum as a bound on the probability of not observing s, a in the blocks if they were independent (the first term) versus a mixing error between blocks to account for their non-independence (the second term).

Proof. Recall that the joint distribution of the observations over the pair of trajectories (m, n) is χ. Its marginals on the segments Ωi are χi. The marginals on each of the G single-step sub-blocks is χi,g. The product distribution Q

g χi,g is Qi. Recall that G(n, s, a) denotes the two sets of indices where (s, a) is observed in trajectory n and m respectively, and the sets have sizes cn,i and cm,i respectively.

Remember that wn,i,g is the one hot vector of the next state if the (i, g) sub-block witnesses (s, a), and the zero vector otherwise, and that cn,i,g is the indicator of (s, a) in the (i, g) sub-block. Also recall that then wn,i = P

g wn,i,g and cn,i = P g cn,i,g.

Define h := QG g=1(1 cn,i,g). Under any distribution Q over these sub-blocks, EQh is the probability of not observing s, a in any of them. Let di,g,n be the distribution of state-action pairs at the first observation of sub-block (i, g). Let dkn( , ) be the stationary distribution under label kn for state-action pairs. We use Lemma 2 with h as above, C = 1, n = G and an = Tn

8G to note the following chain of inequalities.

P(cn,i = 0) = Eχih

Learning Mixtures of Markov Chains and MDPs

EQih + |EQih Eχih|

g=1 EQi(1 cn,i,g)

+ 4Gλ Tn 8G

g=1 (1 dkn(s, a) + TV (di,g,n, dkn)

+ 4Gλ Tn 8G

g=1 (1 dkn(s, a) + 4λ Tn 8G )

+ 4Gλ Tn 8G

= 1 dkn(s, a) + 4λ Tn 8G G + 4Gλ Tn 8G

1 dkn(s, a)

G + 4Gλ Tn 8G

1 dmin(s, a)

G + 4Gλ Tn 8G

where the inequality in the second to last line holds for Tn Ω(Gtmix log(1/α)) Ω(Gtmix log(1/dmin(s, a))).

From the above lemma, the following corollary immediately follows by getting conditions to bound each term on the right hand side by δ/2, upon also noting that log(1 x) x, so log 1 1 α/2 α/2.

Corollary 1. For Tn Ω(Gtmix log(G/δ) log(1/α)) and G Ω log(1/δ)

α , we have with probability at least 1 δ that

4 max i 1cn,i=0 + max i 1cm,i=0 = 0

F.1.4. COMBINING THE BOUNDS We finally combine these lemmas to prove Lemma 4 the lemma that this section was dedicated to. The conditions of the lemmas combine to ask that Tn Ω(Gtmix log(G/δ) log(1/α)) and G Ω log(1/δ)

Proof of Lemma 4. Combining the decomposition from Lemma 5 with the bounds in Lemma 6 and Corollary 1, we conclude using union bounds on the low probability events that we are excluding that there is a universal constant C1 so that with probability at least 1 δ,

dist1,(s,a) m,n(s, a) 2 2 C1

K + log(1/δ)

+ 4ϵsub(δ/2)

whenever Tn Ω(Gtmix log(G/δ) log(1/α)) and G Ω log(1/δ)

Learning Mixtures of Markov Chains and MDPs

G. Guarantees for one step of the EM Algorithm for mixtures of MDPs Remember that the M-step is just the model estimation step, so Theorem 4 provides guarantees for that. We also have the following guarantees for the E-step of hard EM.

Theorem 6. Consider any (s, a) with dmin(s, a) α/3 where model estimation accuracy is ϵ with ϵ min( /4, 2gmin/64) where gmin is the least non-zero value of Pk(s | s, a) across k, s . Using log-likelihood ratios of transitions of all such (s, a) pairs, we can classify any set of N new trajectories with probability 1 δ if it has length Tn = Ω(tmix log4(N/δ) log3(1/fmin)/α3 3).

Remark 12. The dependence on gmin is unavoidable. For example, if the estimate for the models was only off at the value of k, s attaining gmin and our estimate for gmin was ˆPk(s | s, a) = 0, then no trajectory from label k witnessing s will get correctly classified. This event will happen roughly with probability gmin, up to a mixing error, and gmin cannot be made less than some arbitrary δ chosen to bound the probability of all undesirable events.

Proof. We are inspired by the lower bound obtained in Lemma 1 of Wong & Shen (1995) for obtaining our sample complexity bounds. Consider a separating state-action pair s, a. We first establish Hellinger distance lower bounds between the distributions ˆPk( | s, a) and ˆPl( | s, a). Notice that

TV (ˆPk( | s, a), Pk( | s, a)) = 1

2 ˆPk( | s, a) Pk( | s, a) 1 ϵ/2 /4

The same holds for l as well. Combining the latter with Pk( | s, a) Pl( | s, a) 1 Pk( | s, a) Pl( | s, a) 2 and using the inequality H(P, Q) TV (P, Q)/

2, we get the following bound.

H(Pk( | s, a), ˆPl( | s, a)) 1

2TV (ˆPk( | s, a), ˆPl( | s, a)) 4

We now recall notation from the previous section. Again, we modify notation slightly, in a natural way. Let χn be the joint distribution of observations recorded in trajectory n, with their marginals on each single-element sub-block being χn,g. Let Qn be the product distribution Qn = Q

n,g χn,g. Let G(n, s, a) be the set of sub-blocks (n, g) in which (s, a) is observed in trajectory n. Let cn be the size of this set. We have the following lemma.

Lemma 8. Let the random variables for the next states following each (s, a) observation given by S1, S2, . . . Scn and let the true label be kn = k. Then for any l = k, consider the likelihood ratio over next state transitions from (s, a).

LRn(s, a) =

ˆPk(Si | s, a)

ˆPl(Si | s, a)

We claim that LRn(s, a) > 0 with probability at least 1 δ for Tn Ω Gtmix log G

δ log(1/α) and G

Ω log(1/fmin) log(1/δ)

Just like in the proof of Theorem 3, now set G = Tn tmix

3 . Then a sufficient condition on Tn to meet the conditions of the

lemma is Tn = Ω(tmix log4(1/δ) log3(1/fmin)/α3 3).

Now remember that upon choosing an occurrence threshold β of order α, we will have at most O(1/α) many (s, a) pairs in Freqβ. By applying a union bound over all (s, a) pairs in Freqβ, we get that with probability 1 δ, we get that the sum of the log-likelihood ratios of next-state transitions starting in Freqβ between the true label s model estimate and any other label s model estimate is positive whenever Tn = Ω(tmix log4(1/δ) log3(1/fmin)/α3 3).

We now take another union bound over the N new trajectories to get that we can exactly classify all of them with probability at least 1 δ whenever Tn Ω(tmix log4(N/δ) log3(1/fmin)/α3 3).

Learning Mixtures of Markov Chains and MDPs

G.1. Proof of Lemma 8 We first perform a computation analogous to Lemma 1 in Wong & Shen (1995). Let D1 = Pk( | s, a), D2 = Pl( | s, a), ˆD1 = ˆPk( | s, a), ˆD2 = ˆPl( | s, a). Fix b > 0. We use the conditional Markov inequality and the fact that conditioned on G(n, s, a) and under the product distribution ˆQn, the Hellinger distance between the next-state distributions at any (s, a) observation is H( ˆD1, ˆD2), which satisfies H( ˆD1, ˆD2) /4

2. This is crucially due to the independence and the fact that we are fixing G(n, s, a) by conditioning on it. As usual, abbreviate G(n, s, a) to G for brevity.

PQn(LRn(s, a) ecnb/2 | G) = PQn

!1/2 ˆD2(Si)

1 + 2/64 1/2 ˆD2(Si)

= ecnb/2 1 + 2/64 cn/2 1 H(D1, D2)2

ecnb/2 1 2/128 cn/2

ecnb/2e cn 2/128

Setting b = 2/256, we get that PQn(LRn(s, a) ecn 2/256 | G) e cn 2/256. Now by following a very similar

computation to that in point 2 in section F.1.2, we get that for Tn Ω(Gtmix log(1/α)) and G Ω log(1/δ)

Gdmin(s, a)/4 with probability at least 1 δ/2. That is, for such Tn and G,

PQn(LRn(s, a) e Gdmin(s,a) 2/512 | G) PQn(LRn(s, a) ecn 2/128 | G) e Gdmin(s,a) 2/512 + δ

Since this holds for any value of G = G(n, s, a), we can say that with probability at least 1 δ, for Tn Ω(Gtmix log(1/α))

and G Ω log(1/δ)

α2 , cn Gdmin(s, a)/4, we have the following bound.

PQn(LRn(s, a) e Gdmin(s,a) 2/512) e Gdmin(s,a) 2/512 + δ

After following a computation very similar to that in point 3 of section F.1.2, we get that for Tn Ω Gtmix log G

δ log(1/α) and G Ω log(1/δ)

Pχ(LRn(s, a) e Gdmin(s,a) 2/512) δ

Note that we want e Gdmin(s,a) 2/512 fl/fk, in which case it suffices to ask e Gdmin(s,a) 2/512 1/fmin. Combining

this with earlier conditions, for G Ω log(1/δ) log(1/fmin)

α2 2 and Tn Ω Gtmix log G

δ log(1/α) ,

Learning Mixtures of Markov Chains and MDPs

fl LRn(s, a) 1 δ

Learning Mixtures of Markov Chains and MDPs

H. Proof of Theorem 4 Theorem 4 (Model Estimation Guarantee). For any state action pair (s, a) with dmin(s, a) α/3, and for GNclust

f 2 minα2 and Tn Ω(Gtmix log(G/δ)), with probability greater than 1 δ,

ˆPk( | s, a) Pk( | s, a) 1

is bounded above by

1/3 r 1 Nclustfminα(S + log(1

Proof. The proof is quite straightforward and employs the techniques used so far, especially those used in section F.1.2. Let k be the (now known) label that we re working with.

We modify previous notation a bit for this proof. For brevity of notation, we denote by cn,g the indicator variable for observing (s, a) in the gth single-step sub-block of the trajectory n. Denote by wn,g one-hot vector of the next state observed if the currect state-action pair is (s, a), and set it to the zero-vector otherwise. Note that P

g cn,g = N(n, s, a) and P

g wn,g = N(n, s, a, ). We denote the set of indices (n, g) of all s, a observations that come from label k (across the GNclust observations recorded) by N(s, a, k). Let the size of this set be N(s, a, k). Note that N(s, a, k) = P n Ck N(n, s, a) = P n,g cn,g. Also note the following alternate expression for ˆPk( | s, a).

ˆPk( | s, a) :=

(n,g) N(s,a,k) wn,g P

(n,g) N(s,a,k) cn,g 1N(s,a,k) =0 =

(n,g) N(s,a,k) wn,g

N(s, a, k) 1N(s,a,k) =0 (18)

Let χn be the joint distribution of observations recorded in trajectory n, with their marginals on each single-element sub-block being χn,g. Let χ be the joint distribution of all observations recorded across all trajectories. Since the trajectories are independent, we know that χ = Q

n χn. Let Qg be the joint distribution of the observations at the gth sub-block. Note that this is also the marginal of the joint distribution χ on the gth sub-block, and since the trajectories are independent, Qg = Q

n χg,n. Finally, denote by Q the product distribution Q

n χg,n. This would be the distribution if all observations recorded were independent (across sub-blocks).

1. Concentration under the product distribution

We have the following computation.

EQ[ˆPk( | s, a) | N(s, a, k)] = EQ

n Nclust wn N(s, a, k) 1N(s,a,k) =0

n N(s,a,k) wn

1N(s,a,k) =0

= P n EQ[wn | N(s, a, k)]

N(s, a, k) 1N(s,a,k) =0

n Pk( | s, a)cn

N(s, a, k) 1N(s,a,k) =0

= Pk( | s, a)(P

n cn) N(s, a, k) 1N(s,a,k) =0

= Pk( | s, a)N(s, a, k)

N(s, a, k) 1N(s,a,k) =0

= Pk( | s, a)1N(s,a,k) =0

Now we set up our covering argument. Remember that [ 1, 1]S is the set of all vectors u RS with u 1. Consider a covering of [ 1, 1]S by boxes of side length 1

4 and centers lying in [ 1, 1]S. We will need at most 12S such boxes and if C is the set of their centers, then for any vector v

v 1 = sup u [ 1,1]S 1 |u T v| 2 max u C |u T v| 2 v 1

Learning Mixtures of Markov Chains and MDPs

Also, for any u C, note that

|u T ˆPn,i( | s, a)| u

and so |u T EQ[ˆPk( | s, a) | N(s, a, k)]| E[|u T ˆPk( | s, a)| | N(s, a, k)] 1. Again, note that conditioned on the set of all (s, a) observations recorded, the next states wn,g are all independent under the product distribution Q (but not under χ, of course). Recalling the expression for ˆPk( | s, a) from equation 18, this means that we can use the conditional version of Hoeffding s inequality, giving us the following bound.

PQ u T (ˆPk( | s, a) EQ[ˆPk( | s, a) | N(s, a, k)]) > ϵ

N(s, a, k) < 2e ϵ2N(s,a,k)

Doing this for all 12S vectors u C, we get the following inequality.

PQ (ˆPn,i( | s, a) EQ[ˆPn,i( | s, a) | N(s, a, k)]) 1 > ϵ

is bounded above by

PQ u C; u T (ˆPk( | s, a) EQ[ˆPk( | s, a) | N(s, a, k)]) > ϵ

u C PQ u T (ˆPk( | s, a) EQ[ˆPk( | s, a) | N(s, a, k)]) > ϵ

< 12S e ϵ2N(s,a,k)

2. Bounding N(s, a, k) under the product distribution

Now note that N(s, a, k) = P

(n,g) Nclust [G] cn,g. So,

EQ[N(s, a, k)] = X

(n,g) Nclust [G] EQ[cn,g] = X

(n,g) Nclust [G] Pχ(cn,g = 0)

We can show the following inequality.

Pχ(cn,g = 0) = Pχ(cn,g = 0 | kn = k)P(kn = k) dmin(s, a)

for Tn Ω(Gtmix log(1/α)), getting the last inequality by using a computation very similar to the one in equation 4, along with the fact that P(kn = k) = fk. So, EQ[N(s, a, k)] GNclustfmindmin(s,a)

N(s, a, k) < GNclust fmindmin(s, a)

N(s, a, k) < GNclust fmindmin(s, a)

2 GNclust fmindmin(s, a)

N(s, a, k) < E[N(s, a, k)] GNclust fmindmin(s, a)

(n,g) Nclust [G] cn,g < E[N(s, a, k)] GNclust fmindmin(s, a)

exp f 2 mindmin(s, a)2GNclust

Learning Mixtures of Markov Chains and MDPs

This is less than δ/2 for GNclust Ω log(1/δ)

f 2 minα2 . So, with probability at least 1 δ/2, for GNclust Ω log(1/δ)

f 2 minα2 and

Tn Ω(Gtmix log(1/α)), we have the following bound.

PQ (ˆPn,i( | s, a) EQ[ˆPn,i( | s, a) | N(s, a, k)]) 1 > ϵ

12Se ϵ2GNclustfmindmin(s,a)

3. Mixing error to account for non-independence in the true joint distribution

Note that we can think of the combined dataset as a Markov chain over the tuple of n observations, with a joint distribution χ over observations. Its marginal over the gth single-step sub-blocks is Qg and Q = Q

g Qg. We now want to apply Lemma 2, noting that the relevant function of this Markov chain is 1E where E is the event ˆPk( | s, a) Pk( | s, a) 1 < ϵ

2. Clearly, in this case, n from the lemma is G and C from the lemma is 1. We use this to get the following bound.

Pχ (ˆPn,i( | s, a) EQ[ˆPn,i( | s, a) | N(s, a, k)]) 1 > ϵ

is bounded above by

PQ (ˆPn,i( | s, a) EQ[ˆPn,i( | s, a) | N(s, a, k)]) 1 > ϵ

12Se ϵ2GNclustfmindmin(s,a)

Each term is less than δ/4 for GNclust Ω 1 ϵ2fminα(S + log( 1

δ ) and Tn Ω(Gtmix log(G/δ)). So for such G, Nclust, Tn, with probability greater than 1 δ,

ˆPk( | s, a) Pk( | s, a) 1 < ϵ

Alternatively, for GNclust Ω log(1/δ)

f 2 minα2 and Tn Ω(Gtmix log(G/δ) log(1/α)), with probability greater than 1 δ,

ˆPk( | s, a) Pk( | s, a) 1 O r 1 GNclustfminα(S + log(1

Letting G = Tn tmix

2/3 , for Tn tmix

2/3 Nclust Ω log(1/δ)

f 2 minα2 and Tn Ω(tmix log4(1/δ) log4(1/α)), with probability greater than 1 δ,

ˆPk( | s, a) Pk( | s, a) 1 O

1/3 r 1 Nclustfminα(S + log(1

Learning Mixtures of Markov Chains and MDPs

I. Proof of Theorem 5 We recall the theorem here.

Theorem 5 (Classification Guarantee). Let ϵmod(δ) be a high probability bound on the model estimation error ˆPk( | s, a) Pk( | s, a) 2. Then there is a universal constant C3 so that Algorithm 3 can identify the true labels for trajectories

in Nclass with probability at least 1 δ for Tn = Ω K3/2tmix log4(Nclass/(αδ))

6α3 , whenever ϵmod(δ/2) C3 4fminα

Nclust Ω log(1/δ)

f 2 minα2 .

Proof. The proof is very similar to the proof of theorem 3. Consider the testing of trajectory n. Recall that in algorithm 3, we defined

dist1(n, k) := max (s,a) SAα

" ˆPn,1( | s, a) ˆPk( | s, a) T Vs,a

ˆPn,2( | s, a) ˆPk( | s, a) T Vs,a

Let kn the label of trajectory n. According to our assumptions, if kn = k, then we have an s, a so that dkn(s, a) α and Pkn( | s, a) Pk( | s, a) 2 . Again, we will make s, a implicit in our notation except in Pj( | s, a). Let cn,i := N(n, i, s, a), wn,i := N(n, i, s, a, ). Recall that we have two nested partitions: (1) of the entire trajectory into the two Ωi and (2) of each segment Ωi into G blocks. Finally, define dist1,(s,a) as below, suppressing n and k. Note that dist1(n, k) is the maximum of dist1,(s,a) over all (s, a) Freqβ, for the given trajectory n and label k.

dist1,(s,a) := (ˆPn,1( | s, a) ˆPk( | s, a))T Vs,a (ˆPn,2( | s, a) ˆPk( | s, a))T Vs,a T

We want to show that this is close to n,k(s, a) 2 2 for the (s, a) pairs that we search over, where

n,k(s, a) = Pkn( | s, a) Pk( | s, a)

Recall that ˆPk( | s, a) Pk( | s, a) 2 ϵmod(δ) for any 1 k K. Let Mtrue s,a = P

1 k K ˆfk,s,a Pk( | s, a)Pk( | s, a)T . We use the fact that aa T bb T ( a 2 + b 2) a b 2 in the bound below.

Mtrue s,a Ms,a X

1 k K ˆfk,s,a Pk( | s, a)Pk( | s, a)T ˆPk( | s, a)ˆPk( | s, a)T

1 k K ˆfk,s,a( ˆPk( | s, a) 2 + Pk( | s, a) 2) ˆPk( | s, a) Pk( | s, a) 2

1 k K 2 ˆfk,s,a ˆPk( | s, a) Pk( | s, a) 2

Also note that if we redefine Bn to be the event of observing (s, a) in a trajectory (instead of in both segments as in the notation in previous proofs), then ˆfk,s,a =

n 1kn=k1Bn P

Nclust . So, E[ ˆfk,s,a] P(kn = k Bn) = P(Bn | kn = k)P(kn = k) fmin P(Bn | kn = k). Using a computation very similar to the one leading up to inequality 5, we note that P(Bn | kn = k) dmin(s, a)/2 for Tn Ω(tmix log(1/α)). In that case, E[ ˆfk,s,a] fmindmin(s, a)/2 fminα/2.

Additionally, using a standard concentration argument, ˆfk,s,a E[ ˆfk,s,a]/2 fminα/4 for Nclust Ω log(1/δ)

Ω log(1/δ) E[ ˆ fk,s,a]2

We now apply Lemma 3 of Chen & Poor (2022), with p(k) = ˆfk,s,a, y(k) = Pk( | s, a), M = Mtrue s,a and M = Ms,a. We use the right-hand side of the bound in the lemma to get the bound below for all 1 k K, which holds for a universal constant C2 with probability at least 1 δ whenever Nclust Ω log(1/δ)

f 2 minα2 and Tn Ω(tmix log(1/α)).

Pk( | s, a) Vs,a VT s,a Pk( | s, a) 2

Learning Mixtures of Markov Chains and MDPs

Assume the lemma below for now, we prove it in the next subsection.

Lemma 9. We claim that there is a universal constant C1 so that for any (s, a) with dmin(s, a) α/3, with probability at least 1 δ,

dist1,(s,a) n,k(s, a) 2 2 O

K + log(1/δ)

whenever Tn Ω(Gtmix log(G/δ) log(1/α)) and G Ω log(1/δ)

α2 . Here, ϵmod(δ) is a high probability bound on

Pk( | s, a) Vs,a VT s,a Pk( | s, a) 2 for all 1 k K (which holds with probability at least 1 δ).

We now set G = Tn tmix

3 . Then a sufficient condition on Tn to meet the conditions of the lemma is Tn =

Ω(tmix log4(1/δ)/α3), under which, with probability at lest 1 δ, we have the following bound for (s, a) with dmin(s, a) α/3.

dist1,(s,a) n,k(s, a) 2 2 O

It is now easy to see that the first term on the right-hand side is less than 2/8 when Tn = Ω K3/2tmix log3/2(1/δ)

Tn = Ω(tmix log4(1/δ)/α3). We can combine these to have the guarantee that the first term on the right-hand side is less

2/8 with probability at least 1 δ when Tn = Ω K3/2tmix log4(1/δ)

Now note that if β α/3, then a separating state action pair always lies in Freqβ and thus, the maximum over the n,k(s, a) 2 2 values corresponding to Freqβ is in fact either 0 if k = kn or larger than 2 if k = kn. So, if

fminα 2/32 and for each of the (s, a) pairs, the first term on the right-hand side of inequality 20 is

less than 2/8, then our distance estimate dist1(n, k) is on the right side of 2/3. That is, the distance estimate is then less than 2/4 if k = kn, and larger than it if k = kn. As a consequence, the output of the argmin in algorithm 3 is kn in this situation.

Note that upon choosing an occurrence threshold of order α, we will have at most O(1/α) many (s, a) pairs in Freqβ to maximize dist1,(s,a) over to get dist1(n, k). By applying a union bound over all (s, a) pairs in Freqβ and using the conclusion of the previous paragraph, algorithm 3 correctly predicts the label kn for trajectory n with probability 1 δ

whenever Tn = Ω K3/2tmix log4(1/(αδ))

6α3 and 8C2 q

fminα 2/32.

By applying a union bound over incorrectly predicting kn for any of the Nclass(Nclass 1)/2 pairs, we get that algorithm 3 can recover the true labels with probability at least 1 δ for Tn = Ω K3/2tmix log4(Nclass/(αδ))

6α3 , whenever

fminα 2/32.

Finally note that due to inequality 19, we get that algorithm 3 can recover the true labels with probability at least 1 δ for Tn = Ω K3/2tmix log4(Nclass/(αδ))

6α3 , whenever ϵmod(δ/2) C3 4fminα

I.1. Proof of Lemma 9 We recall the lemma here.

Lemma 9. We claim that there is a universal constant C1 so that for any (s, a) with dmin(s, a) α/3, with probability at

Learning Mixtures of Markov Chains and MDPs

dist1,(s,a) n,k(s, a) 2 2 O

K + log(1/δ)

whenever Tn Ω(Gtmix log(G/δ) log(1/α)) and G Ω log(1/δ)

α2 . Here, ϵmod(δ) is a high probability bound on

Pk( | s, a) Vs,a VT s,a Pk( | s, a) 2 for all 1 k K (which holds with probability at least 1 δ).

Proof. The proof of this lemma is very similar to the proof of Lemma 4.

Notation: We say cn,i = N(n, i, s, a) as in the statement of the lemma and wn,i = N(n, i, s, a, ). Let the joint distribution of the observations over trajectory n be χ. Let its marginals on the segments Ωi be χi. Let the marginals on each of the G single-step sub-blocks along with their next states be χi,g. Denote the product distribution Q

g χi,g by Qi. Let G(n, s, a) denote the set of indices where the state-action pair (s, a) is observed in trajectory n. For brevity, we will abbreviate G(n, s, a) to G. Note that the size of this set is exactly cn,i.

We first prove a preliminary lemma, similar to lemma 5.

I.1.1. DECOMPOSITION OF | dist1,(s,a) n,k(s, a) 2 2 |

Lemma 10. We claim that for each fixed value of G(s, a) (abbreviated to G), with probability at least 1 δ, the following bound holds.

dist1,(s,a) n,k(s, a) 2 2

i=1 2 ˆPn,i( | s, a) EQi[ˆPn,i( | s, a) | G] 2 + 8C2

fminα + 4 max i 1cn,i=0

(21) Here cn,i = N(n, i, s, a) and ϵmod(δ) is a high probability bound on Pk( | s, a) Vs,a VT s,a Pk( | s, a) 2 (satisfied with probability > 1 δ).

Remark 13. In the inequality,

The first term is a concentration-type term, which will be broken into an independent concentration" error and a mixing error to account for the low but non-zero dependence across blocks.

The second term accounts for subspace estimation error.

The third term accounts for actually observing s, a in our blocks.

Proof. Define the following quantities.

T i = (ˆPn,i( | s, a) ˆPk( | s, a))T Vs,a T i = (EQi[ˆPn,i( | s, a) | G] Pk( | s, a))T Vs,a

We first establish a simple inequality, using the fact that |a T b c T d| b 2 a c 2 + c 2 b d 2

| dist1,(s,a) T 1 2| = | T 1 2 T 1 2|

1 1 2 2 2 + T 1 2 2 2 2 2 1 1 2 + 2 2 2 2

i=1 2 ˆPn,i( | s, a) EQi[ˆPn,i( | s, a) | G] 2 +

i=1 2 ˆPk( | s, a) Pk( | s, a) 2

i=1 2 ˆPn,i( | s, a) EQi[ˆPn,i( | s, a) | G] 2 + 4ϵmod(δ) (22)

Learning Mixtures of Markov Chains and MDPs

Also note the following computation.

EQi[ˆPn,i( | s, a) | G] = EQi

cn,i 1cn,i =0 | G

= EQi[wn,i | G]

cn,i 1cn,i =0

= Pkn( | s, a)cn,i

cn,i 1cn,i =0

= 1cn,i =0Pkn( | s, a)

We define the following quantity, overloading notation from Lemma 5.

diffi = 1cn,i =0Pkn( | s, a) Pk( | s, a)

Note that i = diff T i Vs,a. We recall the following definition before proceeding to show the main inequality.

n,k(s, a) = Pkn( | s, a) Pk( | s, a)

T 1 2 n,k(s, a) 2 2 = diff T 1 Vs,a VT s,adiff2 diff T 1 diff2 + diff T 1 diff2 n,k(s, a) 2 2

diff1 2 diff2 Vs,a VT s,adiff2 2 + diff1 n,k(s, a) 2 diff2 2

+ diff1 2 diff2 n,k(s, a) 2

diff1 1 diff2 Vs,a VT s,adiff2 2 + diff1 n,k(s, a) 2 diff2 1

+ diff1 1 diff2 n,k(s, a) 2

2 diff2 Vs,a VT s,adiff2 2 + 2 diff1 n,k(s, a) 2 + 2 diff2 n,k(s, a) 2

2 Pkn( | s, a) Vs,a VT s,a Pkn( | s, a) 2

+ 2 Pk( | s, a) Vs,a VT s,a Pk( | s, a) 2 + 21cn,1=0 Pkn( | s, a) 2 + 21cn,2=0 Pkn( | s, a) 2

fminα + 4 max i 1cn,i=0

Notice that 4C2 q

fminα 4ϵmod(δ) since ϵmod(δ) 2, C2 2, K 1, fmin, α 1. Combining this and the computation above with inequality 16, we have the following final bound.

dist1,(s,a) n,k(s, a) 2 2

i=1 2 ˆPn,i( | s, a) EQi[ˆPn,i( | s, a) | G] 2 + 8C2

fminα + 4 max i 1cn,i=0

(23) where we remind the reader that cn,i = N(n, i, s, a).

I.1.2. BOUNDING THE CONCENTRATION-TYPE TERM We bound the first term in the decomposition lemma (Lemma 10) with high probability.

Lemma 11. With probability at least 1 δ, when Tn Ω Gtmix log G

δ log(1/α) and G Ω log(1/δ)

α2 , we have the following bound. ˆPn,i( | s, a) EQi[ˆPn,i( | s, a) | G] 2 O

K + log(1/δ)

Proof. The proof of this lemma is verbatim the proof of Lemma 6 after the first inequality.

Learning Mixtures of Markov Chains and MDPs

I.1.3. COMBINING THE BOUNDS We reuse Corollary 1 along with Lemma 11 applied to Lemma 10 to get the following bound with probability at least 1 δ,

dist1,(s,a) n,k(s, a) 2 2 O

K + log(1/δ)

whenever Tn Ω(Gtmix log(G/δ) log(1/α)) and G Ω log(1/δ)