# State Aggregation Learning from Markov Transition Data

Yaqi Duan, Princeton University, yaqid@princeton.edu; Zheng Tracy Ke, Harvard University, zke@fas.harvard.edu; Mengdi Wang, Princeton University, mengdiw@princeton.edu

Abstract

State aggregation is a popular model reduction method rooted in optimal control. It reduces the complexity of engineering systems by mapping the system's states into a small number of meta-states. The choice of aggregation map often depends on the data analyst's knowledge and is largely ad hoc. In this paper, we propose a tractable algorithm that estimates the probabilistic aggregation map from the system's trajectory. We adopt a soft-aggregation model, where each meta-state has a signature raw state, called an anchor state. This model includes several common state aggregation models as special cases. Our proposed method is a simple two-step algorithm: the first step is spectral decomposition of the empirical transition matrix, and the second step conducts a linear transformation of singular vectors to find their approximate convex hull. It outputs the aggregation distributions and disaggregation distributions for each meta-state in explicit forms, which are not obtainable by classical spectral methods. On the theoretical side, we prove sharp error bounds for estimating the aggregation and disaggregation distributions and for identifying anchor states. The analysis relies on a new entry-wise deviation bound for singular vectors of the empirical transition matrix of a Markov process, which is of independent interest and cannot be deduced from existing literature. The application of our method to Manhattan traffic data successfully generates a data-driven state aggregation map with nice interpretations.

1 Introduction

State aggregation is a long-existing approach for model reduction of complicated systems.
It is widely used as a heuristic to reduce the complexity of control systems and reinforcement learning (RL). The earliest idea of state aggregation is to aggregate similar states into a small number of subsets through a partition map. However, the partition is often handpicked by practitioners based on domain-specific knowledge or exact knowledge about the dynamical system [31, 6]. Alternatively, the partition can be chosen via discretization of the state space in accordance with some previously known similarity metric or feature functions [33]. Prior knowledge of the dynamical system is often required in order to handpick the aggregation without deteriorating its performance. There is no principled approach to finding the best state aggregation structure from data.

In this paper, we propose a model-based approach to learning a probabilistic aggregation structure. We adopt the soft state aggregation model, a flexible model for Markov systems. It allows one to represent each state using a membership distribution over latent variables (see Section 2 for details). Such models have been used for modeling large Markov decision processes, where the membership can be used as state features to significantly reduce complexity [32, 36]. When the membership distributions are degenerate, it reduces to the more conventional hard state aggregation model, and so our method is also applicable to finding a hard partition of the state space.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The soft aggregation model is parameterized by p aggregation distributions and r disaggregation distributions, where p is the total number of states in the Markov chain and r is the number of (latent) meta-states. Each aggregation distribution contains the probabilities of transiting from one raw state to different meta-states, and each disaggregation distribution contains the probabilities of transiting from one meta-state to different raw states.
Our goal is to use sample trajectories of a Markov process to estimate these aggregation/disaggregation distributions. The obtained distributions can be used to estimate the transition kernel, sample from the Markov process, and plug into downstream tasks in optimal control and reinforcement learning (see Section 5 for an example). In the special case when the system admits a hard aggregation, these distributions naturally produce a partition map of states.

Our method is a two-step algorithm. The first step is the same as the vanilla spectral method, where we extract the first r left and right singular vectors of the empirical transition matrix. The second step is a novel linear transformation of the singular vectors. The rationale of the second step is as follows: although the left (right) singular vectors are not valid estimates of the aggregation (disaggregation) distributions, their linear span is a valid estimate of the linear span of the aggregation (disaggregation) distributions. Consequently, the left (right) singular vectors differ from the targeted aggregation (disaggregation) distributions only by a linear transformation. We estimate this linear transformation by leveraging a geometrical structure associated with the singular vectors. Our method requires no prior knowledge of the meta-states and provides a data-driven approach to learning the aggregation map.

Our contributions.

1. We introduce the notion of an anchor state for the identifiability of soft state aggregation. It is an analog of separable features in nonnegative matrix factorization [13] and anchor words in topic modeling [4]. The introduction of anchor states not only ensures model identifiability but also greatly improves the interpretability of meta-states (see Section 2 and Section 5). Interestingly, hard aggregation implicitly assumes that all states are anchor states. Our framework instead assumes there exists one anchor state for each meta-state.

2.
We propose an efficient method for estimating the aggregation/disaggregation distributions of a soft state aggregation model from Markov transition data. In contrast, classical spectral methods are not able to estimate these distributions directly.

3. We prove statistical error bounds on the total variation distance between the estimated aggregation/disaggregation distributions and the ground truth. The estimation errors depend on the size of the state space, the number of meta-states, and the mixing time of the process. We also prove a sample complexity bound for exactly recovering all anchor states. To the best of our knowledge, these are the first statistical guarantees for soft state aggregation learning.

4. At the core of our analysis is an entry-wise large deviation bound for singular vectors of the empirical transition matrix of a Markov process. This connects to the recent interest in entry-wise analysis of empirical eigenvectors [2, 23, 24, 38, 11, 14]. Such analysis is known to be challenging, and techniques that work for one type of random matrix often do not work for another. Unfortunately, our desired results cannot be deduced from any existing literature, and we have to derive everything from scratch (see Section 4). Our large-deviation bound provides a convenient technical tool for the analysis of spectral methods on Markov data, and is of independent interest.

5. We applied our method to a Manhattan taxi-trip dataset, with interesting discoveries. The estimated state aggregation model extracts meaningful traffic modes, and the output anchor regions capture popular landmarks of Manhattan, such as Times Square and WTC-Exchange. Plugging the aggregation map into a reinforcement learning (RL) experiment, taxi-driving policy optimization, significantly improves performance. These findings validate that our method is practically useful.

Connection to literature.
While classical spectral methods have been used for aggregating states in Markov processes [34, 37], these methods do not directly estimate a state aggregation model. It was shown in [37] that spectral decomposition can reliably recover the principal subspace of a Markov transition kernel. Unfortunately, singular vectors themselves are not valid estimates of the aggregation/disaggregation distributions: the population quantity that singular vectors estimate is strictly different from the targeted aggregation/disaggregation distributions (see Section 3).

Our method is inspired by the connection between soft state aggregation, nonnegative matrix factorization (NMF) [26, 16], and topic modeling [7]. Our algorithm is connected to the spectral method in [22] for topic modeling. The method in [22] is a general approach to using spectral decomposition for nonnegative matrix factorization. Whether it can be adapted to state aggregation learning, and how accurately it estimates the soft-aggregation model, was unknown. In particular, [22] worked on a topic model, where the data matrix has column-wise independence, and their analysis heavily relies on this property. Unfortunately, our data matrix has column-wise dependence, so we are unable to use their techniques. We build our analysis from the ground up.

There are recent works on statistical guarantees for learning a Markov model [15, 18, 28, 37, 35]. They target estimating the transition matrix, whereas our focus is to estimate the aggregation/disaggregation distributions. Given an estimator of the transition matrix, obtaining the aggregation/disaggregation distributions is non-trivial (for example, simply performing a vanilla PCA on the estimator doesn't work). To the best of our knowledge, our result is the first statistical guarantee for estimating the aggregation/disaggregation distributions.
We can also use our estimators of the aggregation/disaggregation distributions to form an estimator of the transition matrix, and it achieves the known minimax rate for a range of parameter settings. See Section 4 for more discussion.

2 Soft State Aggregation Model with Anchor States

We say that a Markov chain X_0, X_1, . . . , X_n admits a soft state aggregation with r meta-states if there exist random variables Z_0, Z_1, . . . , Z_{n-1} ∈ {1, 2, . . . , r} such that

P(X_{t+1} | X_t) = Σ_{k=1}^r P(Z_t = k | X_t) · P(X_{t+1} | Z_t = k),    (1)

for all t with probability 1. Here, P(Z_t | X_t) and P(X_{t+1} | Z_t) are independent of t and referred to as the aggregation distributions and disaggregation distributions, respectively. The soft state aggregation model has been discussed in various places in the literature (e.g., [32, 37]), where r is presumably much smaller than p. See [5, Section 6.3.7] for a textbook review. This decomposition means that one can map the states into meta-states while preserving the system's dynamics. In the special case where each aggregation distribution is degenerate (we say a discrete distribution is degenerate if only one outcome is possible), it reduces to the hard state aggregation model, or lumpable Markov model.

The soft state aggregation model has a matrix form. Let P ∈ R^{p×p} be the transition matrix, where P_{ij} = P(X_{t+1} = j | X_t = i). We introduce U ∈ R^{p×r} and V ∈ R^{p×r}, where U_{ik} = P(Z_t = k | X_t = i) and V_{jk} = P(X_{t+1} = j | Z_t = k). Each row of U is an aggregation distribution, and each column of V is a disaggregation distribution. Then, (1) is equivalent to (1_m denotes the m-dimensional all-ones vector)

P = UV^⊤, where U 1_r = 1_p and V^⊤ 1_p = 1_r.    (2)

Here, U and V are not identifiable without additional conditions. We assume that each meta-state has a signature raw state, defined either through the aggregation process or the disaggregation process.

Definition 1 (Anchor State). A state i is called an aggregation anchor state of the meta-state k if U_{ik} = 1 and U_{is} = 0 for all s ≠ k.
A state j is called a disaggregation anchor state of the meta-state k if V_{jk} > 0 and V_{js} = 0 for all s ≠ k.

An aggregation anchor state transits to only one meta-state, and a disaggregation anchor state can be reached from only one meta-state. Since (2) is indeed a nonnegative matrix factorization (NMF), the definition of anchor states is a natural analog of separable features in NMF [13]. Anchor states are also natural analogs of pure documents and anchor words in topic modeling. We note that in a hard state aggregation model, every state is an anchor state by default. Throughout this paper, we take the following assumption:

Assumption 1. There exists at least one disaggregation anchor state for each meta-state.

By well-known results in NMF [13], this assumption guarantees that U and V are uniquely defined by (2), provided that P has rank r. Our results can be extended to the case where each meta-state has an aggregation anchor state (see the remark in Section 3). For simplicity, from now on, we call a disaggregation anchor state an anchor state for short.

The introduction of anchor states not only guarantees identifiability but also enhances the interpretability of the model. This is demonstrated in an application to New York City taxi traffic data. We model the taxi traffic by a finite-state Markov chain, where each state is a pick-up/drop-off location in the city. Figure 1 illustrates the estimated soft state aggregation model (see Section 5 for details). The estimated anchor states coincide with notable landmarks in Manhattan, such as the Times Square area, the museum area on Park Avenue, etc. Hence, each meta-state (whose disaggregation distribution is plotted via a heat map over Manhattan in (c)) can be nicely interpreted as a representative traffic mode with exclusive destinations (e.g., the traffic to Times Square, the traffic to WTC-Exchange, the traffic to the museum area, etc.).
In contrast, if we do not exploit anchor states but simply use PCA to conduct state aggregation, the obtained meta-states have no clear association with notable landmarks, and so are hard to interpret. The interpretability of our model also translates to better performance in downstream reinforcement learning tasks (see Section 5).

Figure 1: Soft state aggregation learned from NYC taxi data. Left to right: (a) Illustration of 100 taxi trips (pick-up and drop-off locations marked). (b) A principal component of P (heat map), lacking interpretability. (c) Disaggregation distribution of P corresponding to the Times Square anchor region. (d) Ten anchor regions identified by Alg. 1, coinciding with landmarks of NYC.

3 An Unsupervised State Aggregation Algorithm

Given a Markov trajectory {X_0, X_1, . . . , X_n} from a state-transition system, let N ∈ R^{p×p} be the matrix of empirical state-transition counts, i.e., N_{ij} = Σ_{t=0}^{n-1} 1{X_t = i, X_{t+1} = j}. Our algorithm takes as input the matrix N and the number of meta-states r, and it estimates (a) the disaggregation distributions V, (b) the aggregation distributions U, and (c) the anchor states. See Algorithm 1. Part (a) is the core of the algorithm, which we explain below.

Insight 1: Disaggregation distributions are linear transformations of right singular vectors. We consider an oracle case, where the transition matrix P is given and we hope to retrieve V from P. Let H = [h_1, . . . , h_r] ∈ R^{p×r} be the matrix containing the first r right singular vectors of P. Let Span(·) denote the column space of a matrix. By linear algebra, Span(H) = Span(P^⊤) = Span(VU^⊤) = Span(V). Hence, the columns of H and the columns of V are two different bases of the same r-dimensional subspace. It follows immediately that there exists L ∈ R^{r×r} such that H = VL.
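Insight 1 is easy to check numerically. The sketch below (all names and the synthetic setup are illustrative, not from the paper's experiments) builds a rank-r soft-aggregation chain with one anchor row per meta-state and confirms that the top-r right singular vectors of P span Span(V), so H = VL for an invertible L:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 30, 3

# Disaggregation matrix V: columns are distributions over raw states,
# with one anchor state per meta-state (rows 0..r-1 here, by construction).
V = rng.random((p, r))
V[:r, :] = 0.0
V[np.arange(r), np.arange(r)] = 1.0   # anchor rows: mass on a single meta-state
V /= V.sum(axis=0, keepdims=True)     # each column sums to 1

# Aggregation matrix U: rows are distributions over meta-states.
U = rng.random((p, r))
U /= U.sum(axis=1, keepdims=True)     # each row sums to 1

P = U @ V.T                           # rank-r, row-stochastic transition matrix
assert np.allclose(P.sum(axis=1), 1.0)

# Right singular vectors of P span the same subspace as the columns of V.
H = np.linalg.svd(P)[2][:r].T         # p x r, top-r right singular vectors
proj = H @ H.T                        # orthogonal projector onto Span(H)
assert np.allclose(proj @ V, V)       # Span(V) is contained in Span(H), hence equal

# So H = V L for an invertible L: recover L by least squares and check.
L, *_ = np.linalg.lstsq(V, H, rcond=None)
assert np.allclose(V @ L, H)
```

The last assertion is exactly the identity H = VL that the two-step procedure inverts.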
On the one hand, this indicates that singular vectors are not valid estimates of disaggregation distributions, as each singular vector h_k is a linear combination of multiple disaggregation distributions. On the other hand, it suggests a promising two-step procedure for recovering V from P: (i) obtain the right singular vectors H; (ii) identify the matrix L and retrieve V = HL^{-1}.

Insight 2: The linear transformation L is estimable given the existence of anchor states. The estimation of L hinges on a geometrical structure induced by the anchor state assumption [13]: let C be a simplicial cone with r supporting rays, where the directions of the supporting rays are specified by the r rows of the matrix L. If j is an anchor state, then the j-th row of H lies on one supporting ray of this simplicial cone. If j is not an anchor state, then the j-th row of H lies in the interior of the simplicial cone. See the left panel of Figure 2 for an illustration with r = 3. Once we identify the r supporting rays of this simplicial cone, we immediately obtain the desired matrix L.

Insight 3: Normalization of the singular vectors is the key to estimating L under noise corruption. In the real case, where N instead of P is given, we can only obtain a noisy version of H. Under noise corruption, estimating the supporting rays of a simplicial cone is very challenging. [22] discovered that a particular row-wise normalization of H projects the simplicial cone to a simplex with r vertices. Then, for all anchor states of one meta-state, the corresponding rows collapse to a single point in the noiseless case (and to a tight cluster in the noisy case). The task reduces to estimating the vertices of a simplex, which is much easier to handle under noise corruption. This particular normalization is called SCORE [20]. It re-scales each row of H by the first coordinate of that row. After re-scaling, the first coordinate is always 1, so it is eliminated; the normalized rows then have (r − 1) coordinates. See the right panel of Figure 2. Once we identify the r vertices of this simplex, we can use them to reconstruct L in closed form [22].

Figure 2: Geometrical structure of anchor states. Left: Each dot is a row of the matrix H = [h_1, h_2, h_3]; the rows are contained in a simplicial cone with three supporting rays, and the rows of anchor states lie on the rays. Right: Each dot is a row of H re-scaled by SCORE; the first coordinate is dropped, so every point lies in a plane, and the points are contained in a simplex with three vertices, which reveal the meta-states.

Algorithm 1 Learning the Soft State Aggregation Model.
Input: empirical state-transition counts N, number of meta-states r, anchor state threshold δ_0.
1. Estimate the matrix of disaggregation distributions V.
 (i) Conduct SVD on Ñ = N [diag(N^⊤ 1_p)]^{-1/2} ∈ R^{p×p}, and let ĥ_1, ..., ĥ_r denote the first r right singular vectors. Obtain the matrix D̂ = [diag(ĥ_1)]^{-1} [ĥ_2, . . . , ĥ_r] ∈ R^{p×(r−1)}.
 (ii) Run an existing vertex finding algorithm on the rows of D̂. Let b̂_1, . . . , b̂_r be the output vertices. (In our numerical experiments, we use the vertex hunting algorithm in [21].)
 (iii) For 1 ≤ j ≤ p, compute ŵ*_j = argmin_{q ∈ R^r: Σ_k q_k = 1} ‖d̂_j − Σ_{k=1}^r q_k b̂_k‖^2. Set the negative entries of ŵ*_j to zero and renormalize it to have unit ℓ_1-norm; denote the resulting vector by ŵ_j, and let Ŵ = [ŵ_1, ŵ_2, . . . , ŵ_p]^⊤ ∈ R^{p×r}. Obtain the matrix [diag(ĥ_1)] [diag(N^⊤ 1_p)]^{1/2} Ŵ and re-normalize each of its columns to have unit ℓ_1-norm. The resulting matrix is V̂.
2. Estimate the matrix of aggregation distributions U. Let P̂ = [diag(N 1_p)]^{-1} N be the empirical transition probability matrix. Estimate U by Û = P̂ V̂ (V̂^⊤ V̂)^{-1}.
3. Estimate the set of anchor states: A = { 1 ≤ j ≤ p : max_{1≤k≤r} ŵ_j(k) ≥ 1 − δ_0 }.
Output: estimates V̂ and Û, and the set of anchor states A.
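A compact NumPy sketch of this pipeline is below. It is a simplified rendering under stated assumptions, not the authors' implementation: the vertex hunting subroutine of [21] is replaced by a basic successive-projection stand-in, a pseudo-inverse is used in Step 2 for numerical safety, and all function and variable names are illustrative.

```python
import numpy as np

def successive_projection(D, r):
    """Pick r rows of D that approximately sit at the simplex vertices
    (a simple stand-in for the vertex hunting subroutine)."""
    X = np.column_stack([D, np.ones(len(D))])   # lift: hull vertices maximize the norm
    idx = []
    for _ in range(r):
        j = int(np.argmax((X ** 2).sum(axis=1)))
        idx.append(j)
        u = X[j] / np.linalg.norm(X[j])
        X = X - np.outer(X @ u, u)              # project out the direction just found
    return D[idx]

def soft_state_aggregation(N, r, delta0=0.2):
    p = N.shape[0]
    colsum = N.sum(axis=0).clip(min=1).astype(float)
    rowsum = N.sum(axis=1).clip(min=1).astype(float)
    # Step 1(i): SVD of the re-scaled counts, then SCORE normalization.
    Ntil = N / np.sqrt(colsum)[None, :]
    h = np.linalg.svd(Ntil, full_matrices=False)[2][:r].T   # p x r right singular vectors
    h1 = np.where(np.abs(h[:, 0]) > 1e-12, h[:, 0], 1e-12)
    D = h[:, 1:] / h1[:, None]                              # p x (r-1) SCORE rows
    # Step 1(ii): vertex finding.
    B = successive_projection(D, r)
    # Step 1(iii): barycentric weights of each row, clipped back to the simplex.
    A = np.vstack([B.T, np.ones(r)])            # solve sum_k q_k b_k = d_j, sum_k q_k = 1
    W = np.zeros((p, r))
    for j in range(p):
        q = np.linalg.lstsq(A, np.append(D[j], 1.0), rcond=None)[0].clip(min=0)
        W[j] = q / q.sum() if q.sum() > 0 else np.full(r, 1.0 / r)
    Vhat = np.abs(h1[:, None] * np.sqrt(colsum)[:, None] * W)
    Vhat /= Vhat.sum(axis=0, keepdims=True)     # columns are distributions
    # Step 2: aggregation distributions by least squares.
    Phat = N / rowsum[:, None]
    Uhat = Phat @ Vhat @ np.linalg.pinv(Vhat.T @ Vhat)
    # Step 3: anchor states.
    anchors = np.where(W.max(axis=1) >= 1 - delta0)[0]
    return Uhat, Vhat, anchors
```

On a count matrix generated from a soft-aggregation chain with clear anchor states, this sketch returns column-stochastic V̂, the least-squares Û, and the indices of states whose barycentric weight concentrates on a single vertex.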
These insights cast the estimation of V as a simplex finding problem: given data d̂_1, . . . , d̂_p ∈ R^{r−1}, suppose they are noisy observations of non-random vectors d_1, . . . , d_p ∈ R^{r−1} (in our case, each d_j is a row of the matrix H after SCORE normalization), where these non-random vectors are contained in a simplex with r vertices b_1, . . . , b_r, with at least one d_j located on each vertex. We aim to estimate the vertices b_1, . . . , b_r. This is a well-studied problem in the literature, also known as linear unmixing or archetypal analysis. There are a number of existing algorithms [8, 3, 29, 9, 21]. For example, the successive projection algorithm [3] has a complexity of O(pr^4). Algorithm 1 follows the insights above but is more sophisticated. Due to space limits, we relegate the detailed explanation to the Appendix. It has a complexity of O(p^3), where the main cost comes from the SVD.

Remark (The case with aggregation anchor states). The anchor states here are the disaggregation anchor states in Definition 1. If, instead, each meta-state has an aggregation anchor state, there is a similar geometrical structure associated with the left singular vectors. We can modify our algorithm to first use the left singular vectors to estimate U, and then estimate V and the anchor states.

4 Main Statistical Results

Our main results are twofold. The first is a set of row-wise large-deviation bounds for empirical singular vectors. The second is statistical guarantees for learning the soft state aggregation model. Throughout this section, suppose we observe a trajectory {X_0, X_1, . . . , X_n} from an ergodic Markov chain with p states, where the transition matrix P satisfies (2) with r meta-states. Let π ∈ R^p denote the stationary distribution. Define the mixing time τ = min{ k ≥ 1 : max_{1≤i≤p} ‖(P^k)_{i·} − π‖_1 ≤ 1/2 }. We assume there are constants c_1, C_1, c_2, c_3, C_4 > 0 such that the following conditions hold:

(a) The stationary distribution π satisfies c_1 p^{-1} ≤ π_j ≤ C_1 p^{-1}, for j = 1, 2, . . . , p.
(b) The stationary distribution on the meta-states satisfies c_1 r^{-1} ≤ (U^⊤π)_k ≤ C_1 r^{-1}, for k = 1, 2, . . . , r.

(c) λ_min(U^⊤U) ≥ c_2 p r^{-1} and λ_min(V^⊤V) ≥ c_2 p^{-1} r.

(d) The first two singular values of [diag(π)] P [diag(π)]^{-1/2} satisfy σ_1 − σ_2 ≥ c_3 p^{-1/2}.

(e) The entries of the r-by-r matrix U^⊤PV satisfy max_{k,l} (U^⊤PV)_{kl} ≤ C_4 min_{k,l} (U^⊤PV)_{kl}.

Conditions (a)-(b) require that the Markov chain has a balanced number of visits to each state and to each meta-state when reaching stationarity. Such conditions are often imposed for learning a Markov model [37, 28]. Condition (c) prevents the aggregation (disaggregation) distributions from being highly collinear, so that each of them can be accurately identified from the rest. Condition (d) is a mild eigen-gap condition, which is necessary for consistent estimation of eigenvectors by PCA [2, 22]. Condition (e) says that the meta-states are reachable from one another and that meta-state transitions cannot be overly imbalanced.

4.1 Row-wise large-deviation bounds for singular vectors

At the core of the analysis of any spectral method (classical PCA or our unsupervised algorithm) is a characterization of the errors of approximating population eigenvectors by empirical eigenvectors. If we choose the loss function to be the Euclidean norm between two vectors, the problem reduces to deriving a bound on the spectral norm of the noise matrix [12] and is often manageable. However, a Euclidean norm bound is useless for our problem: in order to obtain total-variation bounds for estimating the aggregation/disaggregation distributions, we need sharp error bounds for each entry of the eigenvector. Recently, there has been active research on entry-wise analysis of eigenvectors [2, 23, 24, 38, 11, 14]. Since eigenvectors depend on the data matrix in a complicated and highly nonlinear way, such analysis is well known to be challenging.
More importantly, there is no universal technique that works for all problems, and such bounds are obtained in a problem-by-problem manner (e.g., for Gaussian covariance matrices [23, 24], network adjacency matrices [2, 21], and topic matrices [22]). As an addition to this nascent literature, we develop such results for the transition count matrix of a Markov chain. The analysis is challenging because the entries of the count matrix are dependent on each other.

Recall that Ñ is the re-scaled transition count matrix introduced in Algorithm 1 and ĥ_1, . . . , ĥ_r are its first r right singular vectors (our technique also applies to the original count matrix N and the empirical transition matrix P̂). Theorem 1 and Theorem 2 deal with the leading singular vector and the remaining ones, respectively. (Õ and Ω̃ hide logarithmic factors of n and p.)

Theorem 1 (Entry-wise perturbation bound for ĥ_1). Suppose the regularity conditions (a)-(e) hold. There exists a sign ω ∈ {±1} such that, if n = Ω̃(τ p), then with probability at least 1 − n^{-1},
max_{1≤j≤p} |ω ĥ_1(j) − h_1(j)| = Õ( (σ_1 − σ_2)^{-1} (1 + √(pτ/n)) √(τ/(np)) ).

Theorem 2 (Row-wise perturbation bound for Ĥ). Suppose the regularity conditions (a)-(e) hold. For 1 ≤ s ≤ t ≤ r, let H = [h_s, . . . , h_t], Ĥ = [ĥ_s, . . . , ĥ_t], and Δ = min{σ_{s−1} − σ_s, σ_t − σ_{t+1}}, where σ_0 = +∞ and σ_{r+1} = 0. If n = Ω̃(τ p), then with probability at least 1 − n^{-1}, there is an orthogonal matrix Ω such that
max_{1≤j≤p} ‖e_j^⊤(ĤΩ − H)‖_2 = Õ( Δ^{-1} (1 + √(pτ/n)) √(τ/(np)) ).

4.2 Statistical guarantees for soft state aggregation

We study the error of estimating U and V, as well as of recovering the set of anchor states. Algorithm 1 plugs in an existing algorithm for the simplex finding problem. We make the following assumption:

Assumption 2 (Efficiency of simplex finding). Given data d̂_1, ..., d̂_p ∈ R^{r−1}, suppose they are noisy observations of non-random vectors d_1, ..., d_p, where these non-random vectors are contained in a simplex with r vertices b_1, . . . , b_r, with at least one d_j located on each vertex.
The simplex finding algorithm outputs b̂_1, ..., b̂_r such that max_{1≤k≤r} ‖b̂_k − b_k‖ ≤ C max_{1≤j≤p} ‖d̂_j − d_j‖.

Several existing simplex finding algorithms satisfy this assumption, such as the successive projection algorithm [3], the vertex hunting algorithm [21, 22], and the algorithm for archetypal analysis in [19]. Since this is not the main contribution of this paper, we refer the reader to the above references for details. In our numerical experiments, we use the vertex hunting algorithm in [21, 22].

First, we provide total-variation bounds between the estimated individual aggregation/disaggregation distributions and the ground truth. Write V = [V_1, . . . , V_r] and U = [u_1, . . . , u_p]^⊤, where each V_k ∈ R^p is a disaggregation distribution and each u_i ∈ R^r is an aggregation distribution.

Theorem 3 (Error bound for estimating V). Suppose the regularity conditions (a)-(e) hold and Assumptions 1 and 2 are satisfied. When n = Ω̃(τ p^{3/2} r), with probability at least 1 − n^{-1}, the estimate V̂ given by Algorithm 1 satisfies
(1/r) Σ_{k=1}^r ‖V̂_k − V_k‖_1 = Õ( (1 + √(pτ/n)) √(τ p r / n) ).

Theorem 4 (Error bound for estimating U). Suppose the regularity conditions (a)-(e) hold and Assumptions 1 and 2 are satisfied. When n = Ω̃(τ p^{3/2} r), with probability at least 1 − n^{-1}, the estimate Û given by Algorithm 1 satisfies
(1/p) Σ_{j=1}^p ‖û_j − u_j‖_1 = Õ( r^{3/2} (1 + √(pτ/n)) √(τ p r / n) ).

Second, we provide a sample complexity guarantee for the exact recovery of anchor states. To rule out false positives, we need a condition that the non-anchor states are not too close to an anchor state; this is captured by the quantity δ* below. (Note that δ_j = 0 for anchor states j ∈ A*.)

Theorem 5 (Exact recovery of anchor states). Suppose the regularity conditions (a)-(e) hold and Assumptions 1 and 2 are satisfied. Let A* be the set of (disaggregation) anchor states. Define δ_j = 1 − max_{1≤k≤r} P_{X_0∼π}(Z_0 = k | X_1 = j) and δ* = min_{j∉A*} δ_j. Suppose the threshold δ_0 in Algorithm 1 satisfies δ_0 = O(δ*). If n = Ω̃(δ_0^{-2} τ p^{3/2} r), then P(A = A*) ≥ 1 − n^{-1}.
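The mixing time τ that enters these bounds is defined directly from powers of P (Section 4), so for a known transition matrix it can be computed by iterating the chain; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def mixing_time(P, max_k=10_000):
    """tau = min{ k >= 1 : max_i || (P^k)_{i,.} - pi ||_1 <= 1/2 }."""
    p = P.shape[0]
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    w, vec = np.linalg.eig(P.T)
    pi = np.real(vec[:, np.argmax(np.real(w))])
    pi = pi / pi.sum()                           # normalize (also fixes the sign)
    Pk = np.eye(p)
    for k in range(1, max_k + 1):
        Pk = Pk @ P                              # Pk = P^k
        if np.abs(Pk - pi[None, :]).sum(axis=1).max() <= 0.5:
            return k
    return None                                  # did not mix within max_k steps
```

For example, the two-state chain with P = [[0.9, 0.1], [0.1, 0.9]] has row-wise ℓ1 distance 0.8^k from its uniform stationary distribution, so it mixes at k = 4.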
We connect our results to several lines of work in the literature. First, in the special case r = 1, our problem reduces to learning a discrete distribution with p outcomes, where the minimax rate of the total-variation distance is O(√(p/n)) [17]. Our bound matches this rate when p = O(√n). However, our problem is much harder: each row of P is a mixture of r discrete distributions. Second, our setting is connected to that of learning a mixture of discrete distributions [30, 27] but differs in important ways. Those works consider learning one mixture distribution, and the data are iid observations. Our problem is to estimate p mixture distributions, which share the same basis distributions but have different mixing proportions, and our data are a single trajectory of a Markov chain. Third, our problem is connected to topic modeling [4, 22], where we may view the empirical transition profile of each raw state as a "document". However, in topic modeling the documents are independent of each other, whereas the documents here are highly dependent, as they are generated from a single trajectory of a Markov chain. Last, we compare with the literature on estimating the transition matrix P of a Markov model. Without low-rank assumptions on P, the minimax rate of the total-variation error is O(p/√n) [35] (see also [25] and references therein for related settings in hidden Markov models); with a low-rank structure on P, the minimax rate becomes O(√(rp/n)) [37]. To compare, we use our estimators of (U, V) to construct an estimator of P by P̂ = Û V̂^⊤. When r is bounded and p = O(√n), this estimator achieves a total-variation error of O(√(rp/n)), which is optimal. At the same time, we emphasize that estimating (U, V) is a more challenging problem than estimating P, and we are not aware of any existing theoretical results for the former.

Simulation. We test our method on simulations (settings are in the appendix). The results are summarized in Figure 3.
It suggests: (a) the rate of convergence in Theorem 3 is confirmed by numerical evidence, and (b) our method compares favorably with existing methods for estimating P (to the best of our knowledge, no other method directly estimates U and V, so we instead compare the estimates of P).

5 Analysis of NYC Taxi Data and Application to Reinforcement Learning

We analyze a dataset of 1.1 × 10^7 New York City yellow cab trips collected in January 2016 [1]. We treat each taxi trip as a sample transition generated by a city-wide Markov process over NYC, where the transition is from a pick-up location to a drop-off location. We discretize the map into p = 1922 cells so that the Markov process becomes a finite-state one.

Figure 3: Simulation results. Left: Total variation error max_{1≤k≤r} ‖V̂_k − V_k‖_1 on V, with p fixed and n varying (number of anchors 25, 50, 75, 100); fitted trend 5.0 − 0.51 ln n, squared correlation coefficient 0.998. Middle: Total variation error on V with p/n fixed (number of anchors 0.025p, 0.050p, 0.075p, 0.100p); fitted trend 1.8 − 0.03 ln p, squared correlation coefficient 0.8. Both panels validate the √(p/n) scaling in Theorem 3. Right: Recovery error of P, where Û V̂^⊤ is our method and P̄ is the spectral estimator of [37] (note: that method cannot estimate U, V).

Figure 4: State aggregation results. (a) Estimated anchor regions and partition of NYC. (b) Estimated disaggregation distributions (V̂_k) and aggregation likelihoods (Û_k) for two meta-states (downtown and midtown).

Anchor regions and partition. We apply Alg. 1 to the taxi-trip data with r = 4 and r = 10. The algorithm identifies sets of anchor states that are close to the vertices, as well as the columns of Û and V̂ corresponding to each vertex (anchor region). We further use the estimated Û, V̂ to find a partition of the city.
Recall that in Algorithm 1, each state is projected onto a simplex and can be represented as a convex combination of the simplex's vertices (see Figure 2). We assign each state to the cluster corresponding to the largest weight in this convex combination. In this way, we cluster the 1922 locations into a small number of regions. The partition results are shown in Figure 4 (a), where anchor regions are marked in each cluster.

Estimated aggregation and disaggregation distributions. Let Û, V̂ be the estimated aggregation and disaggregation matrices. We use heat maps to visualize their columns. Take r = 10 as an example. We pick two meta-states, with anchor states in the downtown and midtown areas, respectively. We plot in Figure 4 (b) the corresponding columns of Û and V̂. Each column of V̂ is a disaggregation distribution, and each column of Û can be thought of as a likelihood function for transiting to the corresponding meta-state. The heat maps reveal the leading modes of the traffic dynamics.

Figure 5: The optimal driving policy learned by feature-based RL with estimated aggregation distributions as state features. Arrows point out the most favorable directions, and their thickness is proportional to the favorability of the direction.

Aggregation distributions used as features for RL. Soft state aggregation can be used to reduce the complexity of reinforcement learning (RL) [32]. Aggregation/disaggregation distributions provide features to parameterize high-dimensional policies, in conjunction with feature-based RL methods [10, 36]. Next, we experiment with using the aggregation distributions as features for RL. Consider the taxi-driving policy optimization problem. The driver's objective is to maximize daily revenue; this is a Markov decision process in which the driver chooses driving directions in real time based on location. We compute the optimal policy using feature-based RL [10] and simulated NYC traffic.
The algorithm takes as input 27 estimated aggregation distributions as state features. For comparison, we also use a hard partition of the city, handpicked according to 27 NYC districts. RL using the aggregation distributions as features achieves a daily revenue of $230.57, while the method using the handpicked partition achieves $209.14. Figure 5 plots the optimal driving policy. This experiment suggests that (1) state aggregation learning provides features for RL automatically, and (2) using aggregation distributions as features leads to better RL performance than using handpicked features.

References

[1] NYC Taxi and Limousine Commission (TLC) trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. Accessed June 11, 2018.
[2] Emmanuel Abbe, Jianqing Fan, Kaizheng Wang, and Yiqiao Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565, 2017.
[3] Mário César Ugulino Araújo, Teresa Cristina Bezerra Saldanha, Roberto Kawakami Harrop Galvao, Takashi Yoneyama, Henrique Caldas Chame, and Valeria Visani. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemometrics and Intelligent Laboratory Systems, 57(2):65–73, 2001.
[4] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models: going beyond SVD. In Foundations of Computer Science (FOCS), pages 1–10, 2012.
[5] Dimitri P Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 2007.
[6] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[7] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[8] Joseph W Boardman, Fred A Kruse, and Robert O Green. Mapping target signatures via partial unmixing of AVIRIS data. 1995.
[9] C-I Chang, C-C Wu, Weimin Liu, and Y-C Ouyang.
A new growing method for simplex-based endmember extraction algorithm. IEEE Transactions on Geoscience and Remote Sensing, 44(10):2804–2819, 2006.
[10] Yichen Chen, Lihong Li, and Mengdi Wang. Scalable bilinear π learning using state and action features. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[11] Yuxin Chen, Jianqing Fan, Cong Ma, and Kaizheng Wang. Spectral method and regularized MLE are both optimal for top-K ranking. arXiv preprint arXiv:1707.09971, 2017.
[12] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[13] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems, pages 1141–1148, 2004.
[14] Justin Eldridge, Mikhail Belkin, and Yusu Wang. Unperturbed: spectral analysis beyond Davis-Kahan. arXiv preprint arXiv:1706.06516, 2017.
[15] Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Learning Markov distributions: Does estimation trump compression? In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2689–2693. IEEE, 2016.
[16] Nicolas Gillis. The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12(257), 2014.
[17] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions under ℓ1 loss. IEEE Transactions on Information Theory, 61(11):6343–6354, 2015.
[18] Yi Hao, Alon Orlitsky, and Venkatadheeraj Pichapati. On learning Markov chains. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 648–657. Curran Associates, Inc., 2018.
[19] Hamid Javadi and Andrea Montanari. Non-negative matrix factorization via archetypal analysis.
Journal of the American Statistical Association, (just-accepted):1–27, 2019.
[20] Jiashun Jin. Fast community detection by SCORE. Annals of Statistics, 43(1):57–89, 2015.
[21] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Estimating network memberships by simplex vertex hunting. arXiv preprint arXiv:1708.07852, 2017.
[22] Zheng Tracy Ke and Minzhe Wang. A new SVD approach to optimal topic estimation. arXiv preprint arXiv:1704.07016, 2017.
[23] Vladimir Koltchinskii and Karim Lounici. Asymptotics and concentration bounds for bilinear forms of spectral projectors of sample covariance. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 52, pages 1976–2013. Institut Henri Poincaré, 2016.
[24] Vladimir Koltchinskii and Dong Xia. Perturbation of linear forms of singular vectors under Gaussian noise. In High Dimensional Probability VII, pages 397–423. Springer, 2016.
[25] Aryeh Kontorovich, Boaz Nadler, and Roi Weiss. On learning parametric-output HMMs. In International Conference on Machine Learning, pages 702–710, 2013.
[26] Daniel Lee and Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[27] Jian Li, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. Learning arbitrary statistical mixtures of discrete distributions. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 743–752. ACM, 2015.
[28] Xudong Li, Mengdi Wang, and Anru Zhang. Estimation of Markov chain via rank-constrained likelihood. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[29] José MP Nascimento and José MB Dias. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing, 43(4):898–910, 2005.
[30] Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. Learning mixtures of arbitrary distributions over large discrete domains.
In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, pages 207–224. ACM, 2014.
[31] David F Rogers, Robert D Plante, Richard T Wong, and James R Evans. Aggregation and disaggregation techniques and methodology in optimization. Operations Research, 39(4):553–582, 1991.
[32] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, pages 361–368, 1995.
[33] John N Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.
[34] Marcus Weber and Susanna Kube. Robust Perron cluster analysis for various applications in computational life science. In International Symposium on Computational Life Science, pages 57–66. Springer, 2005.
[35] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic Markov chains. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, pages 904–930, 2019.
[36] Lin F Yang and Mengdi Wang. Sample-optimal parametric Q-learning with linear transition models. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[37] Anru Zhang and Mengdi Wang. Spectral state compression of Markov processes. arXiv preprint arXiv:1802.02920, 2018.
[38] Yiqiao Zhong and Nicolas Boumal. Near-optimal bounds for phase synchronization. SIAM Journal on Optimization, 28(2):989–1016, 2018.