# Estimation of Markov Chain via Rank-Constrained Likelihood

Xudong Li 1, Mengdi Wang 1, Anru Zhang 2

1 Department of Operations Research and Financial Engineering, Princeton University, Sherrerd Hall, Princeton, NJ 08544. 2 Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706. Correspondence to: Xudong Li, Mengdi Wang, Anru Zhang.

This paper studies the estimation of low-rank Markov chains from empirical trajectories. We propose a non-convex estimator based on rank-constrained likelihood maximization. Statistical upper bounds are provided for the Kullback-Leibler divergence and the ℓ2 risk between the estimator and the true transition matrix. The estimator reveals a compressed state space of the Markov chain. We also develop a novel DC (difference of convex functions) programming algorithm to tackle the rank-constrained non-smooth optimization problem, and we establish its convergence. Experiments show that the proposed estimator achieves better empirical performance than other popular approaches.

1. Introduction

In scientific studies, engineering systems, and social environments, one often has to interact with complex processes with uncertainties, collect data, and learn to make predictions and decisions. A critical question is how to model the unknown random process from data. To answer this question, we focus on discrete Markov chains with a finite but potentially huge state space. Suppose we are given a sequence of observations generated by an unknown Markov process. The true transition matrix is unknown, and the physical states of the system and its law of transition are hidden under massive noisy observations. We are interested in the estimation and state compression of Markov processes, i.e., in identifying compact representations of the state space and recovering a reduced-dimensional Markov model.

To be more specific, we focus in this paper on the estimation of a class of Markov chains with low-rank transition matrices. Low-rankness is ubiquitous in practice. For example, the random walk of a taxi turns out to correspond to a nearly low-rank Markov chain: each taxi trip can be viewed as a sample transition of a city-wide random walk that characterizes the traffic dynamics. For another example, control and reinforcement learning problems, which are typically modeled as Markov decision processes, suffer from the curse of dimensionality (Sutton & Barto, 1998). It is of vital interest to identify a compressed representation of the state of the game in order to reduce the dimensionality of policy and value functions. Multiple efforts show that, as long as a reduced model or a compact state representation is given, one can solve high-dimensional control problems in sample-efficient ways; see (Singh et al., 1995; Van Roy, 2006; Malek et al., 2014) for examples.

The low-rank Markov chain is a basic model for analyzing data generated by stochastic processes. It is closely related to a variety of latent-variable models, including Markov processes with rich observations (Azizzadenesheli et al., 2016), state aggregation models (Singh et al., 1995), and hidden Markov models (Hsu et al., 2012). Low-rankness of Markov chains can also be used to recover a network partition under a lumpability assumption (E et al., 2008).
In particular, a low-rank Markov chain can be viewed as a special form of hidden Markov model in which the observations contain all the information about the hidden states.

In this article, we propose a rank-constrained maximum likelihood approach for transition matrix estimation from empirical trajectories. Statistical guarantees are developed for the proposed estimator. In particular, upper bounds on the Kullback-Leibler (KL) divergence between the estimator and the true transition matrix, as well as on the ℓ2 risk, are established. We further provide a minimax lower bound showing that the proposed estimator is nearly rate-optimal within a class of low-rank Markov chains. A related recent work (Zhang & Wang, 2017) considered low-rank estimation by a spectral method and established an upper bound in total variation. Compared to the existing results, our maximum likelihood estimator and the KL-divergence bounds appear to be the first ones of this type.

The proposed estimator requires solving an optimization problem with both rank and polyhedral constraints. Due to the non-convex rank constraint, the optimization problem is difficult; to the best of our knowledge, no efficient approach directly applies to such a non-convex problem. In this paper, we use a penalty approach to relax the rank constraint and transform the original problem into a DC (difference of convex functions) program. Furthermore, we develop a corresponding DC algorithm to solve the problem. The algorithm proceeds by solving a sequence of inner subproblems, for which we develop an efficient subroutine based on a multi-block alternating direction method of multipliers (ADMM). As a byproduct of this research, we develop a new class of DC algorithms and a unified convergence analysis for solving non-convex non-smooth problems. Finally, the performance of the proposed estimator and algorithm is verified through simulation studies.

2. Related literature

Model reduction for complicated systems has a long history that traces back to variable-resolution dynamic programming (Moore, 1991) and state aggregation for decision processes (Sutton & Barto, 1998). In particular, (Deng et al., 2011; Deng & Huang, 2012) considered low-rank reduction of Markov models with an explicitly known transition probability matrix, but not the estimation of the reduced models. Low-rank models and spectral methods have proved powerful in extracting features from high-dimensional data, with numerous applications including network analysis (E et al., 2008), community detection (Newman, 2013), ranking (Negahban et al., 2016), product recommendation (Keshavan et al., 2010), and many more. Estimation of hidden Markov models is related to spectral matrix/tensor decomposition, which was considered in (Hsu et al., 2012). Later, (Anandkumar et al., 2014) studied a more general tensor-decomposition-based spectral approach, which can be applied to a variety of latent variable models including hidden Markov models. (Huang et al., 2016) considered the estimation of a rank-two probabilistic matrix from a small number of independent samples. Estimation of low-rank Markov chains was considered for the first time in (Zhang & Wang, 2017); in particular, a spectral method via truncated singular value decomposition was introduced and upper and lower error bounds in total variation were established.
(Yang et al., 2017) developed an online stochastic gradient method for computing the leading singular space of a transition matrix from random walk data. Online stochastic approximation algorithms were also analyzed recently in (Li et al., 2017; 2018) for principal component estimation. The estimation of discrete distributions in the form of vectors is another line of related research, where procedures and estimation risks have been considered in both the classic and recent literature (Steinhaus, 1957; Lehmann & Casella, 2006; Han et al., 2015).

From the perspective of optimization, DC programming is the backbone of our treatment of the rank-constrained problem. First introduced by (Pham Dinh & Le Thi, 1997), the DC algorithm has become a prominent tool for handling a class of non-convex optimization problems (see also (Pham Dinh & Le Thi, 2005; Le Thi et al., 2012; 2017; Le Thi & Pham Dinh, 2018)). In particular, (Van Dinh et al., 2015; Wen et al., 2017) considered majorized proximal DC algorithms, which are closely related to the optimization methods in this paper. However, both (Van Dinh et al., 2015) and (Wen et al., 2017) used the majorization technique with restricted choices of majorants, neither considered the introduction of indefinite proximal terms, and (Wen et al., 2017) further requires the smooth part of the objective to be convex. A more flexible algorithmic framework and a unified convergence analysis are provided in Section 5.4.

3. Low-rank Markov chains and latent variable models

Consider a discrete-time Markov chain on p states {s_1, . . . , s_p} with transition matrix P ∈ ℜ^{p×p}. Then P satisfies P_ij = P(s_j | s_i), 0 ≤ P_ij ≤ 1, and Σ_{j=1}^p P_ij = 1 for 1 ≤ i, j ≤ p. Let µ be the corresponding stationary distribution. Suppose rank(P) = r. Motivated by the examples discussed below, we focus on the case where r ≪ p, i.e., the rank of P is significantly smaller than the total number of states.

The first example of a low-rank Markov chain arising from reduced-order models is state aggregation, a basic model for describing complicated systems (Bertsekas, 1995; Bertsekas & Tsitsiklis, 1995). If a process admits an inherent state aggregation structure, its states can be clustered into a small number of disjoint subsets such that states from the same cluster have identical transition distributions. We say that a Markov chain is state aggregatable if there exists a partition S_1, S_2, . . . , S_r of the state space such that P(· | s) = P(· | s′) for all s, s′ ∈ S_i and i ∈ [r]. It can be shown that state aggregation yields a special factorization of the transition matrix. In particular, there exist U, V ∈ ℜ^{p×r} such that

P = U V^⊤,

where each row of U has all zeros except exactly one 1 and V is nonnegative.

Another related model is referred to as the Markov decision process with rich observations (Azizzadenesheli et al., 2016) in the context of reinforcement learning. It is essentially a latent-variable model for Markov chains. We say that a Markov chain admits an r-latent-variable model if there exists a stochastic process {z_t} ⊆ [r] such that

P(z_t | s_t) = P(z_t | s_1, . . . , s_t)  and  P(s_{t+1} | z_t) = P(s_{t+1} | s_1, . . . , s_t, z_t).

The latent-variable model, another instance of low-rank Markov chains, is slightly more general than the state aggregation model. One can show that the transition matrix P of an r-latent-variable Markov chain has nonnegative rank at most r, and that there exist U, V ∈ ℜ^{p×r} and a matrix P̃ ∈ ℜ^{r×r} such that

P = U P̃ V^⊤,

where P̃ is a stochastic matrix and the rows of U and the columns of V are probability distributions.
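To make the factorizations above concrete, here is a small illustrative sketch (ours, not from the paper): it builds an r-latent-variable transition matrix of the form P = U P̃ V^⊤ for hypothetical sizes p and r and checks that P is row-stochastic with rank at most r. The state-aggregation case corresponds to replacing U by a 0/1 membership matrix and P̃ by the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 12, 3  # hypothetical sizes: p observed states, r latent states

# Rows of U: distributions over latent states given the current state.
U = rng.dirichlet(np.ones(r), size=p)            # shape (p, r)
# P_tilde: an r x r stochastic matrix on the latent states.
P_tilde = rng.dirichlet(np.ones(r), size=r)      # shape (r, r)
# Columns of V: distributions over next states given a latent state.
V = rng.dirichlet(np.ones(p), size=r).T          # shape (p, r)

P = U @ P_tilde @ V.T                            # p x p transition matrix

assert np.allclose(P.sum(axis=1), 1.0)           # row-stochastic
assert np.linalg.matrix_rank(P) <= r             # rank at most r
print("rank(P) =", np.linalg.matrix_rank(P))
```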
A third related notion is the lumpability of a Markov chain (Buchholz, 1994). We say that a Markov chain is lumpable with respect to a partition S_1, S_2, . . . , S_r if for any i, j ∈ [r] and s, s′ ∈ S_i, P(S_j | s) = P(S_j | s′). Lumpability characterizes a partition of the state space that preserves the strong Markov property. It implies that some eigenvectors of the transition matrix are equal to the indicator functions of the subsets. However, it is a weaker assumption than state aggregation and does not necessarily imply low-rankness. As pointed out by (E et al., 2008), the lumpable partition can be recovered by clustering the singular vectors of the transition matrix. The readers are also referred to (Zhang & Wang, 2017) for more discussion of low-rank Markov process examples.

In summary, low-rankness is ubiquitous in a broad class of reduced-order Markov models. Estimation of low-rank transition matrices involves estimating the leading subspace, which provides an avenue towards state compression.

4. Minimax rate-optimal estimation of low-rank Markov chains

We now consider the estimation of the transition matrix based on a single state-transition trajectory. Given a Markov chain X = {X_0, X_1, . . . , X_n}, we aim to estimate the transition matrix P under the assumption that rank(P) = r and r ≪ p. For 1 ≤ i, j ≤ p, denote the number of transitions from state s_i to state s_j by n_ij:

n_ij := |{1 ≤ k ≤ n : X_{k−1} = s_i, X_k = s_j}|.

Let n_i := Σ_{j=1}^p n_ij for i = 1, . . . , p and n = Σ_{i=1}^p n_i.

Proposition 1. The negative log-likelihood of P based on the state-transition trajectory {x_0, . . . , x_n} is

L(P) = −(1/n) Σ_{i=1}^p Σ_{j=1}^p n_ij log(P_ij).  (1)

We propose the following rank-constrained maximum likelihood estimator:

P̂ = arg min_Q L(Q)  s.t.  Q 1_p = 1_p,  rank(Q) ≤ r,  0 ≤ Q_ij ≤ 1 for 1 ≤ i, j ≤ p.  (2)

We then aim to analyze the theoretical properties of P̂. Since a direct analysis of (2) for general transition matrices is technically challenging, we consider a slightly different setting: we assume P is entry-wise bounded and analyze the rank-constrained optimization with the corresponding constraints. As we will show later in the numerical analysis, such additional assumptions and constraints may not be necessary in practice.

Theorem 1 (Statistical Recovery Guarantee). Suppose the transition matrix P satisfies rank(P) ≤ r and α/p ≤ P_ij ≤ β/p for constants 0 < α < 1 < β. Consider the following program:

P̂ = arg min_Q L(Q)  s.t.  Q 1_p = 1_p,  rank(Q) ≤ r,  α/p ≤ Q_ij ≤ β/p for 1 ≤ i, j ≤ p.  (3)

If we observe a total number of n independent state transitions generated from the stationary distribution and n ≥ C p log(p), the optimal solution P̂ of problem (3) satisfies

D_KL(P, P̂) = Σ_{i,j=1}^p µ_i P_ij log(P_ij / P̂_ij) ≤ C (pr log p)/n  and  Σ_{i=1}^p ‖P_{i·} − P̂_{i·}‖_2^2 ≤ C (pr log p)/n

with probability at least 1 − Cp^{−c}. Here, C, c > 0 are universal constants. The proof is given in Section 2 of the supplementary material.
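As a concrete illustration of the quantities behind Proposition 1, the sketch below (ours; a simulated trajectory stands in for real data) tabulates the transition counts n_ij and row sums n_i from a trajectory and evaluates L(·). Without the rank constraint, the minimizer of L over row-stochastic matrices is the empirical estimator n_ij / n_i, which is the plain mle baseline used later in Section 6.

```python
import numpy as np

def transition_counts(traj, p):
    """n_ij = number of observed transitions s_i -> s_j along the trajectory."""
    N = np.zeros((p, p))
    for a, b in zip(traj[:-1], traj[1:]):
        N[a, b] += 1
    return N

def neg_log_likelihood(Q, N):
    """L(Q) = -(1/n) * sum_ij n_ij * log(Q_ij); pairs with n_ij = 0 contribute nothing."""
    n = N.sum()
    mask = N > 0
    return -(N[mask] * np.log(Q[mask])).sum() / n

# Illustrative data: simulate a short trajectory from a random row-stochastic matrix.
rng = np.random.default_rng(1)
p = 6
P_true = rng.dirichlet(np.ones(p), size=p)
traj = [0]
for _ in range(5000):
    traj.append(rng.choice(p, p=P_true[traj[-1]]))

N = transition_counts(traj, p)
n_i = np.maximum(N.sum(axis=1, keepdims=True), 1)
P_mle = N / n_i                      # unconstrained MLE: n_ij / n_i
print(neg_log_likelihood(P_mle, N) <= neg_log_likelihood(P_true, N))   # True
```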
Remark 1. Note that the result in Theorem 1 applies to data in which each state transition is sampled independently from the stationary distribution of the Markov chain. In practice, samples from Markov chains are typically dependent. We can then artificially introduce independence by down-sampling the random walk data. Specifically, let τ(ϵ) be the ϵ-mixing time of the Markov chain (see, e.g., (Levin & Peres, 2017)). We pick a sufficiently small ϵ and keep one transition out of every τ(ϵ) transitions, so that the down-sampled data are nearly independent (their distribution is within ϵ in total-variation distance from independent data). Note that τ(ϵ) ≤ τ(1/4) log(1/ϵ) scales logarithmically in 1/ϵ, so we can pick ϵ to be sufficiently small without deteriorating the analysis much. Therefore, the overall sample complexity in the case of random walk data is roughly τ(1/4) times the sample complexity in the independent case (up to polylog factors).

Remark 2. In the proof of Theorem 1, in order to bound the difference between the empirical and population loss functions for the rank-constrained estimator, we apply concentration inequalities for empirical processes together with a peeling scheme. In particular, we consider a partition of the set of all possible P̂ and derive estimation-loss upper bounds on each of these subsets based on concentration inequalities. The techniques used here are related to recent work on matrix completion (Negahban & Wainwright, 2012), although our problem setting, methodology, and sampling scheme are all very different from matrix completion.

Remark 3. In Theorem 1, we use the entry-wise boundedness condition (with constants α and β) instead of an incoherence condition, to ensure a bounded Hessian of the KL-divergence. The constants α and β do not appear in the bounds since they are treated as constants; the precise dependence on α and β is complicated and beyond the scope of this paper. A similar entry-wise boundedness condition was used in matrix completion (Recht, 2011, Assumption A1) and in discrete distribution estimation (Kamath et al., 2015). In fact, based on (Kamath et al., 2015, Theorem 10), the entry-wise lower bound is necessary for deriving reasonable statistical recovery bounds in KL-divergence.

Define the following set of transition matrices:

𝒫 = { P : P 1_p = 1_p, 0 ≤ P_ij ≤ 1, rank(P) ≤ r }.

We further have the following lower bound result.

Theorem 2 (Lower Bound). Suppose n ≥ Cpr and let P̂ be any estimator of the transition matrix P based on a sample trajectory of length n. Then

inf_{P̂} sup_{P ∈ 𝒫} E D_KL(P, P̂) ≥ c pr/n  and  inf_{P̂} sup_{P ∈ 𝒫} E Σ_{i=1}^p ‖P̂_{i·} − P_{i·}‖_2^2 ≥ c pr/n,

where C, c are universal constants. Its proof is given in Section 3 of the supplementary material. Theorems 1 and 2 together show that the proposed estimator is rate-optimal up to a logarithmic factor, with sample complexity n = O(pr log p).

Remark 4. In the special case r = 1, P can be written as 1_p v^⊤ for some vector v ∈ ℜ^p, and estimating P essentially reduces to estimating a discrete distribution from multinomial count data. Our upper bound in Theorem 1 nearly matches (up to a log factor) the classical ℓ2-risk results for discrete distribution estimation (see, e.g., (Lehmann & Casella, 2006, Pg. 349)).

In addition to the full transition matrix P, the leading left and right singular subspaces of P, say U, V ∈ ℜ^{p×r}, also play important roles in Markov chain analysis. For example, by performing k-means clustering on reliable estimates of U or V for a state-aggregatable Markov chain, one can achieve good state aggregation (Zhang & Wang, 2017). Based on the previous discussion, one can further establish error bounds for singular subspace estimation of the Markov transition matrix. The proof of the following theorem is given in Section 4 of the supplementary material.

Theorem 3. Under the setting of Theorem 1, let Û ∈ ℜ^{p×r} and V̂ ∈ ℜ^{p×r} span the leading left and right singular subspaces of P̂, where P̂ is the optimizer of (3). Then

max{ ‖sin Θ(Û, U)‖_F^2, ‖sin Θ(V̂, V)‖_F^2 } ≤ min{ ( C(pr log p)/n + C√((p log p)/n) ) / σ_r^2(P), r }

with probability at least 1 − Cp^{−c}. Here, σ_r(P) is the r-th singular value of P and ‖sin Θ(Û, U)‖_F = (r − ‖Û^⊤ U‖_F^2)^{1/2}.

Remark 5. Similarly to Theorem 2, one can develop a corresponding lower bound to show the optimality of the upper bound in Theorem 3.
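The subspace error appearing in Theorem 3 (and again as η_U, η_V in Section 6) can be computed directly from the identity ‖sin Θ(Û, U)‖_F = (r − ‖Û^⊤U‖_F^2)^{1/2} when both matrices have orthonormal columns. A minimal sketch (ours; the perturbation used to produce Û is purely illustrative):

```python
import numpy as np

def sin_theta_fro(U_hat, U):
    """||sin Theta(U_hat, U)||_F for two p x r matrices with orthonormal columns."""
    r = U.shape[1]
    return np.sqrt(max(r - np.linalg.norm(U_hat.T @ U, "fro") ** 2, 0.0))

rng = np.random.default_rng(2)
p, r = 50, 5
P = rng.dirichlet(np.ones(p), size=p)          # a row-stochastic matrix
U = np.linalg.svd(P)[0][:, :r]                 # leading left singular vectors
U_hat = np.linalg.svd(P + 1e-2 * rng.normal(size=(p, p)))[0][:, :r]
print(sin_theta_fro(U_hat, U))                 # small for a small perturbation
```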
5. Optimization methods for the rank-constrained likelihood problem

In this section, we develop optimization methods for computing the rank-constrained likelihood maximizer (2). In Section 5.1, a penalty approach is applied to transform the original intractable rank-constrained problem into a DC programming problem. We then solve this problem by a proximal DC (PDC) algorithm in Section 5.2. We discuss the solver for the subproblems arising in the proximal DC algorithm in Section 5.3. Lastly, a unified convergence analysis of a class of majorized indefinite-proximal DC (majorized iPDC) algorithms is provided in Section 5.4.

5.1. A penalty approach for problem (2)

Recall that (2) is intractable due to the non-convex rank constraint. We introduce a penalty approach to relax this constraint and study the following general optimization problem:

min { f(X) | A(X) = b, rank(X) ≤ r },  (6)

where f : ℜ^{p×p} → (−∞, +∞] is a closed, convex, but possibly non-smooth function, A : ℜ^{p×p} → ℜ^m is a linear map, and b ∈ ℜ^m and r > 0 are given data. In particular, when f(X) = −(1/n) Σ_{i=1}^p Σ_{j=1}^p n_ij log(X_ij) + δ(X | X ≥ 0), A(X) = X 1_p, and b = 1_p, where δ(· | X ≥ 0) is the indicator function of the closed convex set {X ∈ ℜ^{p×p} | X ≥ 0}, the general problem (6) becomes the original rank-constrained maximum likelihood problem (2).

[Figure 1. Comparison between rank and mle with accuracy measures (η_F, η_U, η_V) versus number of state jumps n = C·2rp log(p).]

Given X ∈ ℜ^{p×p}, let σ_1(X) ≥ · · · ≥ σ_p(X) ≥ 0 be the singular values of X. Since rank(X) ≤ r if and only if σ_{r+1}(X) + · · · + σ_p(X) = ‖X‖_* − ‖X‖_{(r)} = 0, where ‖X‖_* is the nuclear norm and ‖X‖_{(r)} = Σ_{i=1}^r σ_i(X) is the Ky Fan r-norm of X, problem (6) can be equivalently formulated as

min { f(X) | ‖X‖_* − ‖X‖_{(r)} = 0, A(X) = b }.

See also (Gao & Sun, 2010, equation (29)). The penalized formulation of problem (6) is

min_{X ∈ ℜ^{p×p}} { f(X) + c(‖X‖_* − ‖X‖_{(r)}) | A(X) = b },  (7)

where c > 0 is a penalty parameter. Since ‖·‖_{(r)} is convex, the objective in problem (7) is a difference of two convex functions, f(X) + c‖X‖_* and c‖X‖_{(r)}, i.e., (7) is a DC program. Let X̄_c be an optimal solution to the penalized problem (7). The following proposition shows that X̄_c is also an optimizer of (6) when it is low-rank.

Proposition 2. If rank(X̄_c) ≤ r, then X̄_c is also an optimal solution to the original problem (6).

In practice, one can gradually increase the penalty parameter c to obtain a sufficiently low-rank solution X̄_c. In our numerical experiments, we obtain solutions of the desired rank with a properly chosen parameter c.

5.2. A PDC algorithm for the penalized problem (7)

The central idea of the DC algorithm (Pham Dinh & Le Thi, 1997) is as follows: at each iteration, one approximates the concave part of the objective function by its affine majorant and then solves the resulting convex optimization problem. In this subsection, we present a variant of the classic DC algorithm for solving (7). For the execution of the algorithm, we recall that the subdifferential of the Ky Fan r-norm at a point X ∈ ℜ^{p×p} (Watson, 1993) is

∂‖X‖_{(r)} = { U Diag(q*) V^⊤ | q* ∈ Q* },

where U and V are the matrices of left and right singular vectors of X and Q* is the optimal solution set of the problem

max_q { Σ_{i=1}^p σ_i(X) q_i | ⟨1_p, q⟩ = r, 0 ≤ q ≤ 1 }.

Note that one can efficiently obtain an element of ∂‖X‖_{(r)} by computing the SVD and picking the singular vectors corresponding to the r largest singular values.
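The two quantities that drive the penalty, ‖X‖_* − ‖X‖_{(r)}, and the subgradient element used by the DC step can be computed with a single SVD. The sketch below (ours) takes q* to be the indicator of the r largest singular values, which is a valid choice whenever σ_r(X) > σ_{r+1}(X):

```python
import numpy as np

def kyfan_r_norm(X, r):
    """||X||_(r): sum of the r largest singular values of X."""
    return np.linalg.svd(X, compute_uv=False)[:r].sum()

def kyfan_subgradient(X, r):
    """One element of the subdifferential of ||.||_(r) at X, namely U_r V_r^T,
    i.e. q* puts weight 1 on the r largest singular values (valid if sigma_r > sigma_{r+1})."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ Vt[:r, :]

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 8))
r = 3
penalty = np.linalg.svd(X, compute_uv=False).sum() - kyfan_r_norm(X, r)
print("||X||_* - ||X||_(r) =", penalty)   # zero iff rank(X) <= r
W = kyfan_subgradient(X, r)               # usable as W^k in the DC step below
print(W.shape)
```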
After these preparations, we are ready to state the PDC algorithm for problem (7). Different from the classic DC algorithm, an additional proximal term is added to ensure the existence of solutions to the subproblems (8) and the convergence of the difference of consecutive iterates generated by the algorithm. See Theorem 4 and Remark 6 for more details.

Algorithm 1: A PDC algorithm for solving problem (7)
Given c > 0, α ≥ 0, and a stopping tolerance η, choose an initial point X^0 ∈ ℜ^{p×p}. Iterate the following steps for k = 0, 1, . . .:
1. Choose W^k ∈ ∂‖X^k‖_{(r)}. Compute
   X^{k+1} = arg min_{A(X)=b} { f(X) + c(‖X‖_* − ⟨W^k, X − X^k⟩ − ‖X^k‖_{(r)}) + (α/2)‖X − X^k‖_F^2 }.  (8)
2. If ‖X^{k+1} − X^k‖_F ≤ η, stop.

We say that X is a critical point of problem (7) if ∂(f(X) + c‖X‖_*) ∩ ∂(c‖X‖_{(r)}) ≠ ∅. We state the following convergence results for Algorithm 1.

Theorem 4. Let {X^k} be the sequence generated by Algorithm 1 with α ≥ 0. Then {f(X^k) + c(‖X^k‖_* − ‖X^k‖_{(r)})} is a non-increasing sequence. If X^{k+1} = X^k for some integer k ≥ 0, then X^k is a critical point of (7). Otherwise, it holds that

f(X^{k+1}) + c(‖X^{k+1}‖_* − ‖X^{k+1}‖_{(r)}) ≤ f(X^k) + c(‖X^k‖_* − ‖X^k‖_{(r)}) − (α/2)‖X^{k+1} − X^k‖_F^2.

Moreover, any accumulation point of the bounded sequence {X^k} is a critical point of problem (7). In addition, if α > 0, it holds that lim_{k→∞} ‖X^{k+1} − X^k‖_F = 0.

[Figure 2. Comparison between rank and svd with accuracy measures (η_F, η_U, η_V) versus number of state jumps n = C·2rp log(p).]

Remark 6 (Selection of Parameters). In practice, a small α > 0 is suggested to ensure strict decrease of the objective value and convergence of {‖X^{k+1} − X^k‖_F}; if f is strongly convex, one obtains these properties even with α = 0, based on the results of Theorem 6. The penalty parameter c can be adaptively adjusted according to the rank of the sequence generated by Algorithm 1.

5.3. A multi-block ADMM for subproblem (8)

It is important and non-trivial to solve the convex subproblem (8) in Algorithm 1. Since (8) is a nuclear norm penalized convex optimization problem, we propose to apply an efficient multi-block alternating direction method of multipliers (ADMM) to the dual of (8). A comprehensive numerical study conducted in (Li et al., 2016) justifies this choice. Instead of working directly on (8), we study the following slightly more general model. Given W ∈ ℜ^{p×p} and α ≥ 0, consider

(P)  min f(X) + ⟨W, X⟩ + c‖X‖_* + (α/2)‖X‖_F^2  s.t.  A(X) = b.

Its dual problem can be written as

(D)  min f*(−Ξ) − ⟨b, y⟩ + (α/2)‖Z‖_F^2  s.t.  Ξ + A*(y) + S + αZ = W,  ‖S‖_2 ≤ c,

where f* is the conjugate function of f and ‖·‖_2 denotes the spectral norm. When we set f to be the negative likelihood function from (1), f* becomes

f*(−Ξ) = −(1/n) Σ_{(i,j) ∈ Ω} n_ij (1 + log(n Ξ_ij / n_ij)) + Σ_{(i,j) ∈ Ω̄} δ(Ξ_ij | Ξ_ij ≥ 0),  Ξ ∈ ℜ^{p×p},

where Ω = {(i, j) | n_ij ≠ 0} and Ω̄ = {(i, j) | n_ij = 0}. Given σ > 0, the augmented Lagrangian function L_σ associated with (D) is

L_σ(Ξ, y, S, Z; X) = f*(−Ξ) − ⟨b, y⟩ + (α/2)‖Z‖_F^2 + (σ/2)‖Ξ + A*(y) + S + αZ − W + X/σ‖_F^2 − (1/(2σ))‖X‖_F^2.

We consider ADMM-type methods for solving problem (D). Since there are four separable blocks in (D) (namely Ξ, y, S, and Z), the directly extended ADMM is not applicable. Fortunately, the functions corresponding to the blocks y and Z in the objective of (D) are linear-quadratic. Thus we can apply the multi-block symmetric Gauss-Seidel based ADMM (sGS-ADMM) (Li et al., 2016). In the literature (Chen et al., 2017; Ferreira et al., 2017; Lam et al., 2018; Li et al., 2016; Wang & Zou, 2018), extensive numerical experiments demonstrate that sGS-ADMM is not only convergent but also faster than the directly extended multi-block ADMM and many of its variants.
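Several of the block updates in the sGS-ADMM described next admit closed forms. For instance, since S enters L_σ only through the quadratic coupling term and the constraint ‖S‖_2 ≤ c, its update is the Euclidean projection of a residual matrix onto the spectral-norm ball, obtained by clipping singular values at c. A minimal sketch (ours; the matrix R stands in for the combination W − Ξ − A*(y) − αZ − X/σ of the other blocks at the current iterate):

```python
import numpy as np

def project_spectral_ball(R, c):
    """Frobenius-norm projection of R onto {S : ||S||_2 <= c}: clip singular values at c."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U @ np.diag(np.minimum(s, c)) @ Vt

rng = np.random.default_rng(4)
R = rng.normal(size=(10, 10))          # stand-in residual matrix
S = project_spectral_ball(R, c=1.0)
print(np.linalg.norm(S, 2))            # <= 1.0 up to rounding
```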
Specifically, the algorithmic framework of sGS-ADMM for solving (D) is presented in Algorithm 2. Note that when α = 0, the computation steps corresponding to the block Z are not performed. When implementing Algorithm 2, only a partial SVD, which is much cheaper than a full SVD, is needed since r ≪ p.

Algorithm 2: An sGS-ADMM for solving (D)
Input: initial point (Ξ^0, y^0, S^0, Z^0, X^0), penalty parameter σ > 0, maximum iteration number K, and step length γ ∈ (0, (1 + √5)/2).
for k = 0 to K do
  y^{k+1/2} = arg min_y L_σ(Ξ^k, y, S^k, Z^k; X^k)
  Ξ^{k+1} = arg min_Ξ L_σ(Ξ, y^{k+1/2}, S^k, Z^k; X^k)
  y^{k+1} = arg min_y L_σ(Ξ^{k+1}, y, S^k, Z^k; X^k)
  Z^{k+1/2} = arg min_Z L_σ(Ξ^{k+1}, y^{k+1}, S^k, Z; X^k)
  S^{k+1} = arg min_S L_σ(Ξ^{k+1}, y^{k+1}, S, Z^{k+1/2}; X^k)
  Z^{k+1} = arg min_Z L_σ(Ξ^{k+1}, y^{k+1}, S^{k+1}, Z; X^k)
  X^{k+1} = X^k + γσ(Ξ^{k+1} + A*(y^{k+1}) + S^{k+1} + αZ^{k+1} − W)
end for

The following convergence results can be obtained directly from (Li et al., 2016); a sketch of the proof is given in the supplementary material.

Theorem 5. Suppose that the solution sets of (P) and (D) are nonempty. Let {(Ξ^k, y^k, S^k, Z^k, X^k)} be the sequence generated by Algorithm 2. If γ ∈ (0, (1 + √5)/2), then the sequence {(Ξ^k, y^k, S^k, Z^k)} converges to an optimal solution of (D) and {X^k} converges to an optimal solution of (P).

5.4. A unified analysis for the majorized iPDC algorithm

Due to the presence of the proximal term (α/2)‖X − X^k‖_F^2 in Algorithm 1, the classical DC analyses cannot be applied directly. Hence, in this subsection, we provide a unified convergence analysis for the majorized indefinite-proximal DC (majorized iPDC) algorithm, which includes Algorithm 1 as a special instance. Let X be a finite-dimensional real Euclidean space endowed with an inner product ⟨·, ·⟩ and induced norm ‖·‖. Consider the following optimization problem:

min_{x ∈ X} θ(x) := g(x) + p(x) − q(x),  (9)

where g : X → ℜ is a continuously differentiable function (not necessarily convex) with a Lipschitz continuous gradient of modulus L_g > 0, i.e., ‖∇g(x) − ∇g(x′)‖ ≤ L_g ‖x − x′‖ for all x, x′ ∈ X, and p : X → (−∞, +∞] and q : X → (−∞, +∞] are two proper closed convex functions. It is not difficult to observe that the penalized problem (7) is a special instance of problem (9). For the general model (9), one can only expect the DC algorithm to converge to a critical point x̄ ∈ X of (9), i.e., a point satisfying (∇g(x̄) + ∂p(x̄)) ∩ ∂q(x̄) ≠ ∅. Since g is continuously differentiable with a Lipschitz continuous gradient, there exists a self-adjoint positive semidefinite linear operator G : X → X such that for any x, x′ ∈ X,

g(x) ≤ ĝ(x; x′) := g(x′) + ⟨∇g(x′), x − x′⟩ + (1/2)⟨x − x′, G(x − x′)⟩.

The majorized iPDC algorithm for solving (9) is presented in Algorithm 3. We further provide the following convergence results.

Algorithm 3: A majorized indefinite-proximal DC algorithm for solving problem (9)
Given an initial point x^0 ∈ X and a stopping tolerance η, choose self-adjoint linear operators T : X → X and G as above. Iterate the following steps for k = 0, 1, . . .:
1. Choose ξ^k ∈ ∂q(x^k). Compute
   x^{k+1} ∈ arg min_{x ∈ X} { θ̂(x; x^k) + (1/2)⟨x − x^k, T(x − x^k)⟩ },
   where θ̂(x; x^k) := ĝ(x; x^k) + p(x) − q(x^k) − ⟨x − x^k, ξ^k⟩.
2. If ‖x^{k+1} − x^k‖ ≤ η, stop.

Theorem 6. Assume that inf_{x ∈ X} θ(x) > −∞. Let {x^k} be the sequence generated by Algorithm 3. If x^{k+1} = x^k for some k ≥ 0, then x^k is a critical point of (9). If G + 2T ⪰ 0, then any accumulation point of {x^k}, if one exists, is a critical point of (9). In addition, if G + 2T ≻ 0, it holds that lim_{k→∞} ‖x^{k+1} − x^k‖ = 0.

The proof of Theorem 6 and more discussions are provided in the supplementary material.
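As a concrete instance of the iteration in Algorithm 3, the toy sketch below (ours, not the paper's solver) applies the majorized iPDC step to the DC program min_x ½‖x − a‖² + λ‖x‖_1 − λ‖x‖_2, where g(x) = ½‖x − a‖² is quadratic (so G = I majorizes it exactly), p = λ‖·‖_1, q = λ‖·‖_2, and T = τI ≻ 0; every convex subproblem then reduces to a soft-thresholding step. All problem choices and parameter values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (entry-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ipdc_l1_minus_l2(a, lam=0.5, tau=1e-3, eta=1e-8, max_iter=500):
    """Majorized iPDC iteration on min_x 0.5*||x - a||^2 + lam*||x||_1 - lam*||x||_2."""
    x = np.zeros_like(a)
    for _ in range(max_iter):
        # Step 1: pick a subgradient xi^k of q at x^k (0 is a valid choice at the origin).
        nrm = np.linalg.norm(x)
        xi = lam * x / nrm if nrm > 0 else np.zeros_like(x)
        # Convex subproblem:
        #   min_x 0.5||x - a||^2 + lam||x||_1 - <xi, x - x^k> + (tau/2)||x - x^k||^2,
        # whose minimizer is a soft-thresholding of (a + xi + tau*x^k)/(1 + tau).
        x_new = soft_threshold((a + xi + tau * x) / (1.0 + tau), lam / (1.0 + tau))
        # Step 2: stop when successive iterates are close.
        if np.linalg.norm(x_new - x) <= eta:
            x = x_new
            break
        x = x_new
    return x

rng = np.random.default_rng(5)
a = rng.normal(size=20)
x_star = ipdc_l1_minus_l2(a)
print("nonzero entries:", int(np.count_nonzero(x_star)))
```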
6. Simulation Results

We first generate a rank-r transition matrix as P = UΣV^⊤, where U, V ∈ ℜ^{p×r} have orthonormal columns and Σ = diag(σ_i) ∈ ℜ^{r×r} with σ_i > 0 for i = 1, . . . , r. We then simulate a Markov chain trajectory of length n = C·2rp log(p). Here, r = 10, p = 500, and C is a constant varying from 10 to 10^2. We compare four estimation procedures: the maximum likelihood estimator (mle), the truncated SVD estimator (svd) (Zhang & Wang, 2017), the nuclear norm penalized estimator (nu), and our rank-constrained maximum likelihood estimator (rank). For each estimator P̂, let Û and V̂ be the leading r left and right singular vectors of P̂, respectively. We measure the performance of P̂ through three quantities:

η_F = ‖P − P̂‖_F / √p,  η_U = ‖sin Θ(U, Û)‖_F,  η_V = ‖sin Θ(V, V̂)‖_F.

The comparison between mle and rank is presented in Figure 1. As one can observe, the empirical error of our rank-constrained maximum likelihood estimator is significantly smaller than that of the plain maximum likelihood estimator. From Figure 2, one can see that our rank-constrained estimator outperforms svd in terms of all three accuracy measures. Recently, (Zhang & Wang, 2017) studied the svd approach and developed total variation error bounds. The rank-constrained estimator and the KL-divergence error bounds studied in this paper are harder to analyze and more meaningful for discrete distribution estimation.

[Figure 3. Comparison between rank and nu with accuracy measures (η_F, η_U, η_V) versus number of state jumps n = C·2rp log(p).]

In the implementation of the nuclear norm penalized estimator nu, the penalty parameter is chosen via 5-fold cross-validation. The comparison between rank and nu is plotted in Figure 3. Clearly, the estimation error of rank is much smaller than that of nu. In fact, as one can see from Algorithm 1, the nuclear norm penalized estimator is essentially a single step of our rank-constrained procedure, and rank gradually refines the solution obtained by nu. Theoretically, one could expect that, similarly to matrix recovery (Recht et al., 2010), the rank-constrained estimator enjoys better statistical performance and requires weaker assumptions than the nuclear norm penalized estimator. Moreover, Theorems 1 and 2 indicate that the rank-constrained maximum likelihood estimator is statistically optimal for the estimation of Markov chains, whereas such a result is not clear for the nuclear norm approach.

7. Conclusion

This paper studies the recovery and state compression of low-rank Markov chains from empirical trajectories via a rank-constrained likelihood approach. We provide statistical upper bounds for the ℓ2 risk and the Kullback-Leibler divergence between the proposed estimator and the true probability transition matrix. Then, a novel DC programming algorithm is developed to solve the associated rank-constrained optimization problem. The proposed algorithm non-trivially combines several recent optimization techniques, such as the penalty approach, the proximal DC algorithm, and the multi-block sGS-ADMM. We further study a new class of majorized indefinite-proximal DC algorithms for solving general non-convex non-smooth DC programming problems and provide a unified convergence analysis. Experiments on simulated data illustrate the merits of our approach.
References

Anandkumar, Animashree, Ge, Rong, Hsu, Daniel, Kakade, Sham M, and Telgarsky, Matus. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773-2832, 2014.

Azizzadenesheli, Kamyar, Lazaric, Alessandro, and Anandkumar, Animashree. Reinforcement learning in rich-observation MDPs using spectral methods. arXiv preprint arXiv:1611.03907, 2016.

Bertsekas, Dimitri P. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

Bertsekas, Dimitri P and Tsitsiklis, John N. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pp. 560-564. IEEE, 1995.

Buchholz, Peter. Exact and ordinary lumpability in finite Markov chains. Journal of Applied Probability, 31(1):59-75, 1994.

Chen, Liang, Sun, Defeng, and Toh, Kim-Chuan. An efficient inexact symmetric Gauss-Seidel based majorized ADMM for high-dimensional convex composite conic programming. Mathematical Programming, 161(1-2):237-270, 2017.

Deng, Kun and Huang, Dayu. Model reduction of Markov chains via low-rank approximation. In American Control Conference (ACC), 2012, pp. 2651-2656. IEEE, 2012.

Deng, Kun, Mehta, Prashant G, and Meyn, Sean P. Optimal Kullback-Leibler aggregation via spectral theory of Markov chains. IEEE Transactions on Automatic Control, 56(12):2793-2808, 2011.

E, Weinan, Li, Tiejun, and Vanden-Eijnden, Eric. Optimal partition and effective dynamics of complex networks. Proceedings of the National Academy of Sciences, 105(23):7907-7912, 2008.

Ferreira, José FS Bravo, Khoo, Yuehaw, and Singer, Amit. Semidefinite programming approach for the quadratic assignment problem with a sparse graph. Computational Optimization and Applications, pp. 1-36, 2017.

Gao, Yan and Sun, Defeng. A majorized penalty approach for calibrating rank constrained correlation matrix problems. Technical report, 2010.

Han, Yanjun, Jiao, Jiantao, and Weissman, Tsachy. Minimax estimation of discrete distributions under ℓ1 loss. IEEE Transactions on Information Theory, 61(11):6343-6354, 2015.

Hsu, Daniel, Kakade, Sham M, and Zhang, Tong. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460-1480, 2012.

Huang, Qingqing, Kakade, Sham M, Kong, Weihao, and Valiant, Gregory. Recovering structured probability matrices. arXiv preprint arXiv:1602.06586, 2016.

Kamath, Sudeep, Orlitsky, Alon, Pichapati, Dheeraj, and Suresh, Ananda Theertha. On learning distributions from their samples. In Conference on Learning Theory, pp. 1066-1100, 2015.

Keshavan, Raghunandan H, Montanari, Andrea, and Oh, Sewoong. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, 2010.

Lam, Xin Yee, Marron, J. S., Sun, Defeng, and Toh, Kim-Chuan. Fast algorithms for large-scale generalized distance weighted discrimination. Journal of Computational and Graphical Statistics, 27(2):368-379, 2018. doi: 10.1080/10618600.2017.1366915.

Le Thi, Hoai An and Pham Dinh, Tao. DC programming and DCA: thirty years of developments. Mathematical Programming, pp. 1-64, 2018. doi: 10.1007/s10107-018-1235-y.

Le Thi, Hoai An, Pham Dinh, Tao, and Van Ngai, Huynh. Exact penalty and error bounds in DC programming. Journal of Global Optimization, 52(3):509-535, 2012.
Le Thi, Hoai An, Le, Hoai Minh, Phan, Duy Nhat, and Tran, Bach. Stochastic DCA for the large-sum of non-convex functions problem and its application to group variable selection in classification. In International Conference on Machine Learning, pp. 3394-3403, 2017.

Lehmann, Erich L and Casella, George. Theory of Point Estimation. Springer Science & Business Media, 2006.

Levin, David A and Peres, Yuval. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

Li, Chris Junchi, Wang, Mengdi, Liu, Han, and Zhang, Tong. Diffusion approximations for online principal component estimation and global convergence. In Advances in Neural Information Processing Systems, pp. 645-655, 2017.

Li, Chris Junchi, Wang, Mengdi, Liu, Han, and Zhang, Tong. Near-optimal stochastic approximation for online principal component estimation. Mathematical Programming, 167(1):75-97, 2018. doi: 10.1007/s10107-017-1182-z.

Li, Xudong, Sun, Defeng, and Toh, Kim-Chuan. A Schur complement based semi-proximal ADMM for convex quadratic conic programming and extensions. Mathematical Programming, 155(1-2):333-373, 2016.

Malek, Alan, Abbasi-Yadkori, Yasin, and Bartlett, Peter. Linear programming for large-scale Markov decision problems. In International Conference on Machine Learning, pp. 496-504, 2014.

Moore, Andrew W. Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. In Machine Learning Proceedings 1991, pp. 333-337. Elsevier, 1991.

Negahban, Sahand and Wainwright, Martin J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(May):1665-1697, 2012.

Negahban, Sahand, Oh, Sewoong, and Shah, Devavrat. Rank centrality: Ranking from pairwise comparisons. Operations Research, 65(1):266-287, 2016.

Newman, Mark EJ. Spectral methods for community detection and graph partitioning. Physical Review E, 88(4):042822, 2013.

Pham Dinh, Tao and Le Thi, Hoai An. Convex analysis approach to DC programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289-355, 1997.

Pham Dinh, Tao and Le Thi, Hoai An. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1-4):23-46, 2005.

Recht, Benjamin. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413-3430, 2011.

Recht, Benjamin, Fazel, Maryam, and Parrilo, Pablo A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.

Singh, Satinder P, Jaakkola, Tommi, and Jordan, Michael I. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, pp. 361-368, 1995.

Steinhaus, Hugo. The problem of estimation. The Annals of Mathematical Statistics, 28(3):633-648, 1957.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

Van Dinh, Bui, Kim, Do Sang, and Jiao, Liguo. Convergence analysis of algorithms for DC programming. arXiv preprint arXiv:1508.03899, 2015.

Van Roy, Benjamin. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234-244, 2006.

Wang, Boxiang and Zou, Hui. Another look at distance-weighted discrimination. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):177-198, 2018.
Watson, G. A. On matrix approximation problems with Ky Fan k norms. Numerical Algorithms, 5(5):263-272, 1993.

Wen, Bo, Chen, Xiaojun, and Pong, Ting Kei. A proximal difference-of-convex algorithm with extrapolation. Computational Optimization and Applications, pp. 1-28, 2017.

Yang, Lin F, Braverman, Vladimir, Zhao, Tuo, and Wang, Mengdi. Dynamic partition of complex networks. arXiv preprint arXiv:1705.07881, 2017.

Zhang, Anru and Wang, Mengdi. Spectral state compression of Markov processes. arXiv preprint arXiv:1802.02920, 2017.