# Learning Parametric-Output HMMs with Two Aliased States

Roi Weiss (ROIWEI@CS.BGU.AC.IL), Department of Computer Science, Ben-Gurion University, Beer Sheva, 84105, Israel.

Boaz Nadler (BOAZ.NADLER@WEIZMANN.AC.IL), Department of Computer Science and Applied Mathematics, The Weizmann Institute of Science, Rehovot, 76100, Israel.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

In various applications involving hidden Markov models (HMMs), some of the hidden states are aliased, having identical output distributions. The minimality, identifiability and learnability of such aliased HMMs have been long-standing problems, with only partial solutions provided thus far. In this paper we focus on parametric-output HMMs, whose output distributions come from a parametric family, and that have exactly two aliased states. For this class, we present a complete characterization of their minimality and identifiability. Furthermore, for a large family of parametric output distributions, we derive computationally efficient and statistically consistent algorithms to detect the presence of aliasing and learn the aliased HMM transition and emission parameters. We illustrate our theoretical analysis by several simulations.

1. Introduction

HMMs are a fundamental tool in the analysis of time series. A discrete-time HMM with $n$ hidden states is characterized by an $n \times n$ transition matrix and by the emission probabilities from these $n$ states. In several applications, the HMMs, or more general processes such as partially observable Markov decision processes, are aliased, with some states having identical output distributions. In modeling of ion channel gating, for example, a common assumption is that at any given time an ion channel can be in only one of a finite number of hidden states, some of which are open and conducting current while others are closed, see e.g. Fredkin & Rice (1992). Fitting an aliased HMM to electric current measurements allows biologists to gain important insights regarding the gating process. Other examples appear in the fields of reinforcement learning (Chrisman, 1992; McCallum, 1995; Brafman & Shani, 2004; Shani et al., 2005) and robot navigation (Jefferies & Yeap, 2008; Zatuchna & Bagnall, 2009). In the latter case, aliasing occurs whenever different spatial locations appear (statistically) identical to the robot, given its limited sensing devices. As a last example, HMMs with several silent states that do not emit any output (Leggetter & Woodland, 1994; Stanke & Waack, 2003; Brejova et al., 2007) can also be viewed as aliased.

Key notions related to the study of HMMs, be they aliased or not, are their minimality, identifiability and learnability:

Minimality. Is there an HMM with fewer states that induces the same distribution over all output sequences?

Identifiability. Does the distribution over all output sequences uniquely determine the HMM's parameters, up to a permutation of its hidden states?

Learning. Given a long output sequence from a minimal and identifiable HMM, efficiently learn its parameters.

For non-aliased HMMs, these notions have been intensively studied and by now are relatively well understood, see for example Petrie (1969); Finesso (1990); Leroux (1992); Allman et al. (2009) and Cappé et al. (2005).
The most common approach to learn the parameters of an HMM is via the Baum-Welch iterative algorithm (Baum et al., 1970). Recently, tensor decompositions and other computationally efficient spectral methods have been developed to learn non-aliased HMMs (Hsu et al., 2009; Siddiqi et al., 2010; Anandkumar et al., 2012; Kontorovich et al., 2013).

In contrast, the minimality, identifiability and learnability of aliased HMMs have been long-standing problems, with only partial solutions provided thus far. For example, Blackwell & Koopmans (1957) characterized the identifiability of a specific aliased HMM with 4 states. The identifiability of deterministic-output HMMs, where each hidden state outputs a deterministic symbol, was partially resolved by Ito et al. (1992). To the best of our knowledge, precise characterizations of the minimality, identifiability and learnability of probabilistic-output HMMs with aliased states are still open problems. In particular, the recently developed tensor and spectral methods mentioned above explicitly require the HMM to be non-aliased, and are not directly applicable to learning aliased HMMs.

Main results. In this paper we study the minimality, identifiability and learnability of parametric-output HMMs that have exactly two aliased states. This is the simplest possible class of aliased HMMs, and as shown below, even its analysis is far from trivial. Our main contributions are as follows: First, we provide a complete characterization of their minimality and identifiability, deriving necessary and sufficient conditions for each of these notions to hold. Our identifiability conditions are easy to check for any given 2-aliased HMM, and extend those derived by Ito et al. (1992) for deterministic outputs. Second, we address the problem of learning a possibly aliased HMM from a long sequence of its outputs. To this end, we first derive an algorithm to detect whether an observed output sequence corresponds to a non-aliased HMM or to an aliased one. In the former case, the HMM can be learned by various methods, such as Anandkumar et al. (2012); Kontorovich et al. (2013). In the latter case we show how the aliased states can be identified and present a method to recover the HMM parameters. Our approach is applicable to any family of output distributions whose mixtures are efficiently learnable. Examples include high-dimensional Gaussians and product distributions, see Feldman et al. (2008); Belkin & Sinha (2010); Anandkumar et al. (2012) and references therein. After learning the output mixture parameters, our moment-based algorithm requires only a single pass over the data. As far as we know, it is the first statistically consistent and computationally efficient scheme to handle 2-aliased HMMs. While our approach may be extended to more complicated aliasing, such cases are beyond the scope of this paper. We conclude with some simulations illustrating the performance of our proposed algorithms.

2. Definitions & Problem Setup

Notation. We denote by $I_n$ the $n \times n$ identity matrix and $\mathbf{1}_n = (1, \ldots, 1)^T \in \mathbb{R}^n$. For $v \in \mathbb{R}^n$, $\operatorname{diag}(v)$ is the $n \times n$ diagonal matrix with entries $v_i$ on its diagonal. The $i$-th row and column of a matrix $A \in \mathbb{R}^{n \times n}$ are denoted by $A_{[i,\cdot]}$ and $A_{[\cdot,i]}$, respectively. We also denote $[n] = \{1, 2, \ldots, n\}$. For a discrete random variable $X$ we abbreviate $P(x)$ for $\Pr(X = x)$.
For a second random variable $Z$, the quantity $P(z \mid x)$ denotes either $\Pr(Z = z \mid X = x)$ or the conditional density $p(Z = z \mid X = x)$, depending on whether $Z$ is discrete or continuous.

Hidden Markov Models. Consider a discrete-time HMM with $n$ hidden states $\{1, \ldots, n\}$, whose output alphabet $\mathcal{Y}$ is either discrete or continuous. Let $\mathcal{F}_\Theta = \{f_\theta : \mathcal{Y} \to \mathbb{R} \mid \theta \in \Theta\}$ be a family of parametric probability density functions, where $\Theta$ is a suitable parameter space. A parametric-output HMM is defined by a tuple $H = (A, \theta, \pi^0)$, where $A$ is the $n \times n$ transition matrix between the hidden states, $A_{ij} = \Pr(X_{t+1} = i \mid X_t = j) = P(i \mid j)$, $\pi^0 \in \mathbb{R}^n$ is the distribution of the initial state, and the vector of parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_n) \in \Theta^n$ determines the $n$ probability density functions $(f_{\theta_1}, f_{\theta_2}, \ldots, f_{\theta_n})$.

To produce the HMM's output sequence, first a Markov sequence of hidden states $x = (x_t)_{t=0}^{T-1}$ is generated according to the distribution
$$P(x) = \pi^0_{x_0} \prod_{t=1}^{T-1} P(x_t \mid x_{t-1}).$$
Next, the output sequence $y = (y_t)_{t=0}^{T-1}$, where the output $y_t$ at time $t$ depends only on $x_t$, is generated according to
$$P(y \mid x) = \prod_{t=0}^{T-1} P(y_t \mid x_t) = \prod_{t=0}^{T-1} f_{\theta_{x_t}}(y_t).$$
We denote by $P_{H,k} : \mathcal{Y}^k \to \mathbb{R}$ the joint distribution of the first $k$ consecutive outputs of the HMM $H$. For $y = (y_0, \ldots, y_{k-1}) \in \mathcal{Y}^k$ this distribution is given by
$$P_{H,k}(y) = \sum_{x \in [n]^k} P(y \mid x)\, P(x).$$
Further, we denote by $\mathcal{P}_H = \{P_{H,k} \mid k \geq 1\}$ the set of all these distributions.

2-Aliased HMMs. For an HMM $H$ with output parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_n) \in \Theta^n$ we say that states $i$ and $j$ are aliased if $\theta_i = \theta_j$. In this paper we consider the special case where $H$ has exactly two aliased states, denoted as 2A-HMM. Without loss of generality, we assume the aliased states are the two last ones, $n-1$ and $n$. Thus, $\theta_i \neq \theta_j$ for all $1 \leq i < j \leq n-1$, whereas $\theta_{n-1} = \theta_n$. We denote the vector of the $n-1$ unique output parameters of $H$ by $\bar\theta = (\theta_1, \theta_2, \ldots, \theta_{n-2}, \theta_{n-1}) \in \Theta^{n-1}$. For future use, we define the aliased kernel $\bar K \in \mathbb{R}^{(n-1) \times (n-1)}$ as the matrix of inner products between the $n-1$ different $f_{\bar\theta_i}$'s,
$$\bar K_{ij} = \langle f_{\bar\theta_i}, f_{\bar\theta_j} \rangle = \int_{\mathcal{Y}} f_{\bar\theta_i}(y) f_{\bar\theta_j}(y)\, dy, \qquad i, j \in [n-1]. \quad (1)$$

Assumptions. As in previous works (Leroux, 1992; Kontorovich et al., 2013), we make the following standard assumptions:

(A1) The parametric family $\mathcal{F}_\Theta$ of the output distributions is linearly independent of order $n$: for any distinct $\{\theta_i\}_{i=1}^n$, $\sum_{i=1}^n a_i f_{\theta_i} \equiv 0$ iff $a_i = 0$ for all $i \in [n]$.

(A2) The transition matrix $A$ is ergodic and its unique stationary distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$ is positive.

Note that assumption (A1) implies that the parametric family $\mathcal{F}_\Theta$ is identifiable, namely $f_\theta = f_{\theta'}$ iff $\theta = \theta'$. It also implies that the kernel matrix $\bar K$ of (1) is full rank $n-1$.
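To make this setup concrete, here is a minimal sketch that samples an output sequence from a parametric-output HMM with univariate Gaussian emissions and evaluates the aliased kernel of Eq. (1) in closed form for that family. The helper names `sample_hmm` and `gaussian_kernel` are ours (not the paper's), and the column convention $A_{ij} = P(i \mid j)$ follows the definition above.

```python
import numpy as np

def sample_hmm(A, mus, sigmas, T, pi0, seed=None):
    """Sample T outputs from a parametric-output HMM with univariate Gaussian
    emissions. A is column-stochastic: A[i, j] = P(X_{t+1} = i | X_t = j)."""
    rng = np.random.default_rng(seed)
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    n = A.shape[0]
    x = np.empty(T, dtype=int)
    x[0] = rng.choice(n, p=pi0)
    for t in range(1, T):
        x[t] = rng.choice(n, p=A[:, x[t - 1]])   # next state drawn from column x[t-1]
    y = rng.normal(mus[x], sigmas[x])            # emission depends only on the current state
    return x, y

def gaussian_kernel(mus, sigmas):
    """Aliased kernel of Eq. (1) for distinct Gaussian components:
    K[i, j] = integral of N(y; mu_i, s_i^2) * N(y; mu_j, s_j^2) dy,
    which equals the value of N(0, s_i^2 + s_j^2) at mu_i - mu_j."""
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    var = sigmas[:, None] ** 2 + sigmas[None, :] ** 2
    diff = mus[:, None] - mus[None, :]
    return np.exp(-diff ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
```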
3. Decomposing the transition matrix A

The main tool in our analysis is a novel decomposition of the 2A-HMM's transition matrix into its non-aliased and aliased parts. As shown in Lemma 1 below, the aliased part consists of three rank-one matrices that correspond to the exit from, entrance to, and dynamics within the two aliased states. This decomposition is used to derive the conditions for minimality and identifiability (Section 4), and plays a key role in learning the HMM (Section 5).

To this end, we introduce a pseudo-state $\bar n$, combining the two aliased states $n-1$ and $n$. We define
$$\bar\pi_{\bar n} = \pi_{n-1} + \pi_n \quad \text{and} \quad \beta = \pi_{n-1}/\bar\pi_{\bar n}. \quad (2)$$
We shall make extensive use of the following two matrices:
$$B = \begin{pmatrix} I_{n-2} & \begin{matrix} 0 & 0 \end{matrix} \\ 0 \cdots 0 & \begin{matrix} 1 & 1 \end{matrix} \end{pmatrix} \in \mathbb{R}^{(n-1) \times n}, \qquad C_\beta = \begin{pmatrix} I_{n-2} & 0 \\ 0 \cdots 0 & \beta \\ 0 \cdots 0 & 1-\beta \end{pmatrix} \in \mathbb{R}^{n \times (n-1)}.$$
As explained below, these matrices can be viewed as projection and lifting operators, mapping between non-aliased and aliased quantities.

Non-aliased part. The non-aliased part of $A$ is a stochastic matrix $\bar A \in \mathbb{R}^{(n-1) \times (n-1)}$, obtained by merging the two aliased states $n-1$ and $n$ into the pseudo-state $\bar n$. Its entries are given by
$$\bar A = \begin{pmatrix} A_{[1:n-2],[1:n-2]} & \begin{matrix} P(1 \mid \bar n) \\ \vdots \\ P(n-2 \mid \bar n) \end{matrix} \\ P(\bar n \mid 1) \ \cdots \ P(\bar n \mid n-2) & P(\bar n \mid \bar n) \end{pmatrix}, \quad (3)$$
where the transition probabilities into the pseudo-state are
$$P(\bar n \mid j) = P(n-1 \mid j) + P(n \mid j), \qquad j \in [n],$$
the transition probabilities out of the pseudo-state are defined with respect to the stationary distribution by
$$P(i \mid \bar n) = \beta P(i \mid n-1) + (1-\beta) P(i \mid n), \qquad i \in [n],$$
and lastly, the probability to stay in the pseudo-state is
$$P(\bar n \mid \bar n) = \beta P(\bar n \mid n-1) + (1-\beta) P(\bar n \mid n).$$
It is easy to check that the unique stationary distribution of $\bar A$ is $\bar\pi = (\pi_1, \pi_2, \ldots, \pi_{n-2}, \bar\pi_{\bar n}) \in \mathbb{R}^{n-1}$. Finally, note that $\bar A = B A C_\beta$, $\bar\pi = B\pi$ and $\pi = C_\beta \bar\pi$, justifying the lifting and projection interpretation of the matrices $B$, $C_\beta$.

Aliased part. Next we introduce some key quantities that distinguish between the two aliased states. Let $\mathrm{supp}_{in} = \{j \in [n] \mid P(\bar n \mid j) > 0\}$ be the set of states that can move into at least one of the aliased states. We define
$$\alpha_j = \begin{cases} \dfrac{P(n-1 \mid j)}{P(\bar n \mid j)} & j \in \mathrm{supp}_{in} \\ 0 & \text{otherwise}, \end{cases} \quad (4)$$
as the relative probability of moving from state $j$ to state $n-1$, conditional on moving to either $n-1$ or $n$. We define the two vectors $\delta^{out}, \delta^{in} \in \mathbb{R}^{n-1}$ as follows: for $i, j \in [n-1]$,
$$\delta^{out}_i = \begin{cases} P(i \mid n-1) - P(i \mid n) & i < n-1 \\ P(\bar n \mid n-1) - P(\bar n \mid n) & i = n-1, \end{cases} \quad (5)$$
$$\delta^{in}_j = \begin{cases} (\alpha_j - \beta) P(\bar n \mid j) & j < n-1 \\ \beta(\alpha_{n-1} - \beta) P(\bar n \mid n-1) + (1-\beta)(\alpha_n - \beta) P(\bar n \mid n) & j = n-1. \end{cases} \quad (6)$$
In other words, $\delta^{out}$ captures differences in the transition probabilities out of the aliased states. In particular, if $\delta^{out} = 0$ then starting from either one of the two aliased states, the Markov chain evolution is identical. As proven in Theorem 1 below, such an HMM is not minimal, as its two aliased states can be lumped together. Similarly, $\delta^{in}$ compares the relative probabilities into the aliased states, $\alpha_j$, to the stationary one $\beta = \pi_{n-1}/\bar\pi_{\bar n}$. This quantity also plays a role in the minimality of the HMM. Lastly, for our decomposition, we define the scalar
$$\kappa = (\alpha_{n-1} - \beta) P(\bar n \mid n-1) - (\alpha_n - \beta) P(\bar n \mid n). \quad (7)$$

Decomposing A. The following lemma provides a decomposition of the transition matrix in terms of $\bar A$, $\delta^{out}$, $\delta^{in}$, $\kappa$ and $\beta$ (all proofs are given in the Appendix).

Lemma 1. The transition matrix $A$ of a 2A-HMM can be decomposed as
$$A = C_\beta \bar A B + C_\beta \delta^{out} c_\beta^T + b (\delta^{in})^T B + \kappa\, b c_\beta^T, \quad (8)$$
where $c_\beta^T = (0, \ldots, 0, 1-\beta, -\beta) \in \mathbb{R}^n$ and $b = (0, \ldots, 0, 1, -1)^T \in \mathbb{R}^n$.

In (8), the first term is the merged transition matrix $\bar A \in \mathbb{R}^{(n-1) \times (n-1)}$ lifted back into $\mathbb{R}^{n \times n}$. This term captures all of the non-aliased transitions. The second matrix is zero except in the last two columns, accounting for the exit transition probabilities from the two aliased states. Similarly, the third matrix is zero except in the last two rows, differentiating the entry probabilities. The fourth term is non-zero only on the lower-right $2 \times 2$ block involving the aliased states $n-1, n$. This term corresponds to the internal dynamics between them. Note that each of the last three terms is at most a rank-1 matrix; together they can be seen as a perturbation due to the presence of aliasing.

In Section 4 we will show the importance of Eq. (8) for the minimality and identifiability of two-aliased HMMs. In Section 5 we shall see that given a long output sequence from the HMM, the presence of aliasing can be detected and the quantities $\bar A$, $\delta^{out}$, $\delta^{in}$, $\kappa$ and $\beta$ can all be estimated from it. An estimate for $A$ is then obtained via Eq. (8).
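As a numerical companion to Lemma 1, the sketch below computes $\bar A$, $\delta^{out}$, $\delta^{in}$, $\kappa$ and $\beta$ from a given column-stochastic $A$ (the aliased states are assumed to be the last two) and checks that the four terms of Eq. (8) reassemble $A$. The function name `decompose_2a` is ours, not the paper's.

```python
import numpy as np

def decompose_2a(A, pi):
    """Decomposition (8) of a 2A-HMM transition matrix. A is column-stochastic
    (A[i, j] = P(i | j)), the aliased states are the last two, and pi is the
    stationary distribution of A."""
    n = A.shape[0]
    pi_bar_n = pi[-2] + pi[-1]
    beta = pi[-2] / pi_bar_n                                  # eq. (2)
    # projection B (merge the last two states) and lifting C_beta (split them)
    B = np.zeros((n - 1, n)); B[:n - 2, :n - 2] = np.eye(n - 2); B[-1, -2:] = 1.0
    C = np.zeros((n, n - 1)); C[:n - 2, :n - 2] = np.eye(n - 2)
    C[-2, -1], C[-1, -1] = beta, 1.0 - beta
    A_bar = B @ A @ C                                         # merged matrix, eq. (3)
    p_in = A[-2, :] + A[-1, :]                                # P(n_bar | j)
    alpha = np.divide(A[-2, :], p_in, out=np.zeros(n), where=p_in > 0)   # eq. (4)
    d_out = np.append(A[:n - 2, -2] - A[:n - 2, -1],          # eq. (5)
                      p_in[-2] - p_in[-1])
    d_in = np.append((alpha[:n - 2] - beta) * p_in[:n - 2],   # eq. (6)
                     beta * (alpha[-2] - beta) * p_in[-2]
                     + (1 - beta) * (alpha[-1] - beta) * p_in[-1])
    kappa = (alpha[-2] - beta) * p_in[-2] - (alpha[-1] - beta) * p_in[-1]  # eq. (7)
    # sanity check: the four terms of eq. (8) reassemble A
    c_beta = np.zeros(n); c_beta[-2:] = [1.0 - beta, -beta]
    b = np.zeros(n); b[-2:] = [1.0, -1.0]
    A_rec = (C @ A_bar @ B + np.outer(C @ d_out, c_beta)
             + np.outer(b, d_in) @ B + kappa * np.outer(b, c_beta))
    assert np.allclose(A_rec, A)
    return A_bar, d_out, d_in, kappa, beta
```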
4. Minimality and Identifiability

Two HMMs $H$ and $H'$ are said to be equivalent if their observed output sequences are statistically indistinguishable, namely $\mathcal{P}_H = \mathcal{P}_{H'}$. Similarly, an HMM $H$ is minimal if there is no equivalent HMM with fewer states. Note that if $H$ is non-aliased then Assumptions (A1-A2) readily imply that it is also minimal (Leroux, 1992). In this section we present necessary and sufficient conditions for a 2A-HMM to be minimal, and for two minimal 2A-HMMs to be equivalent. Finally, we derive necessary and sufficient conditions for a minimal 2A-HMM to be identifiable.

4.1. Minimality

The minimality of an HMM is closely related to the notion of lumpability: can hidden states be merged without changing the distribution $\mathcal{P}_H$ (Fredkin & Rice, 1986; White et al., 2000; Huang et al., 2014)? Obviously, an HMM is minimal iff no subset of hidden states can be merged. The following theorem gives precise conditions for the minimality of a 2A-HMM.

Theorem 1. Let $H$ be a 2A-HMM satisfying Assumptions (A1-A2) whose initial state $X_0$ is distributed according to $\pi^0 = (\pi^0_1, \pi^0_2, \ldots, \pi^0_{n-2}, \beta^0 \bar\pi^0_{\bar n}, (1-\beta^0)\bar\pi^0_{\bar n})$. Then,
(i) If $\bar\pi^0_{\bar n} \neq 0$ and $\beta^0 \neq \beta$ then $H$ is minimal iff $\delta^{out} \neq 0$.
(ii) If $\bar\pi^0_{\bar n} = 0$ or $\beta^0 = \beta$ then $H$ is minimal iff both $\delta^{out} \neq 0$ and $\delta^{in} \neq 0$.

By Theorem 1, a necessary condition for minimality of a 2A-HMM is that the two aliased states have different exit probabilities, $\delta^{out} \neq 0$. Namely, there exists a non-aliased state $i \in [n-2]$ such that $P(i \mid n-1) \neq P(i \mid n)$. Otherwise the two aliased states can be merged. If the 2A-HMM is started from its stationary distribution, then an additional necessary condition is $\delta^{in} \neq 0$. This last condition implies that there is a non-aliased state $j \in \mathrm{supp}_{in} \setminus \{n-1, n\}$ with relative entrance probability $\alpha_j \neq \beta$.

4.2. Identifiability

Recall that an HMM $H$ is (strictly) identifiable if $\mathcal{P}_H$ uniquely determines the transition matrix $A$ and the output parameters $\theta$, up to a permutation of the hidden states. We establish the conditions for identifiability of a 2A-HMM in two steps. First we derive a novel geometric characterization of the set of all minimal HMMs that are equivalent to $H$, up to a permutation of the hidden states (Theorem 2). Then we give necessary and sufficient conditions for $H$ to be identifiable, namely for this set to be the singleton set consisting of only $H$ itself (Appendix C). In the process, we provide a simple procedure (Algorithm 1) to determine whether a given minimal 2A-HMM is identifiable or not.

Equivalence between minimal 2A-HMMs. Necessary and sufficient conditions for the equivalence of two minimal HMMs were studied in several works (Finesso, 1990; Ito et al., 1992; Vanluyten et al., 2008). We now provide analogous conditions for parametric-output 2A-HMMs. Toward this end, we define the following 2-dimensional family of matrices $S(\tau_{n-1}, \tau_n) \in \mathbb{R}^{n \times n}$, given by
$$S(\tau_{n-1}, \tau_n) = \begin{pmatrix} I_{n-2} & \begin{matrix} 0 & 0 \end{matrix} \\ 0 \cdots 0 & \begin{matrix} \tau_{n-1} & \tau_n \\ 1-\tau_{n-1} & 1-\tau_n \end{matrix} \end{pmatrix}.$$
Clearly, for $\tau_{n-1} \neq \tau_n$, $S$ is invertible. As in (Ito et al., 1992), consider then the following similarity transformation of the transition matrix $A$,
$$A_H(\tau_{n-1}, \tau_n) = S(\tau_{n-1}, \tau_n)^{-1} A\, S(\tau_{n-1}, \tau_n). \quad (9)$$
It is easy to verify that $\mathbf{1}_n^T A_H = \mathbf{1}_n^T$. However, $A_H$ is not necessarily stochastic, as depending on $\tau_{n-1}, \tau_n$ it may have negative entries. The following lemma resolves the equivalence of 2A-HMMs in terms of this transformation.

Lemma 2. Let $H = (A, \theta, \pi)$ be a minimal 2A-HMM satisfying Assumptions (A1-A2).
Then a minimal HMM $H' = (A', \theta', \pi')$ with $n'$ states is equivalent to $H$ iff $n' = n$ and there exist a permutation matrix $\Pi \in \mathbb{R}^{n \times n}$ and $\tau_{n-1} > \tau_n$ such that
$$\theta' = \Pi \theta, \qquad \pi' = \Pi S(\tau_{n-1}, \tau_n)^{-1} \pi, \qquad A' = \Pi A_H(\tau_{n-1}, \tau_n) \Pi^{-1} \geq 0.$$

The feasible region. By Lemma 2, any matrix $A_H(\tau_{n-1}, \tau_n)$ whose entries are all non-negative yields an HMM equivalent to the original one. We thus define the feasible region of $H$ by
$$\Gamma_H = \{(\tau_{n-1}, \tau_n) \in \mathbb{R}^2 \mid A_H(\tau_{n-1}, \tau_n) \geq 0,\ \tau_{n-1} > \tau_n\}. \quad (10)$$
By definition, $\Gamma_H$ is non-empty, since $(\tau_{n-1}, \tau_n) = (1, 0)$ recovers the original matrix $A$. As we show below, $\Gamma_H$ is determined by three simpler regions $\Gamma_1, \Gamma_2, \Gamma_3 \subset \mathbb{R}^2$. The region $\Gamma_1$ ensures that all entries of $A_H$ are non-negative except possibly in the lower-right $2 \times 2$ block corresponding to the two aliased states. The regions $\Gamma_2$ and $\Gamma_3$ ensure non-negativity of the latter, depending on whether the aliased relative probabilities of (4) satisfy $\alpha_{n-1} \geq \alpha_n$ or $\alpha_{n-1} < \alpha_n$, respectively. For ease of exposition we assume as a convention that $P(\bar n \mid n-1) \geq P(\bar n \mid n)$.

Theorem 2. Let $H$ be a minimal 2A-HMM satisfying Assumptions (A1-A2). There exist $(\tau^{\min}_{n-1}, \tau^{\min}_n), (\tau^{\max}_{n-1}, \tau^{\max}_n), (\tau^-, \tau^+) \in \mathbb{R}^2$, and convex monotonically decreasing functions $f, g : \mathbb{R} \to \mathbb{R}$, such that
$$\Gamma_H = \begin{cases} \Gamma_1 \cap \Gamma_2 & \alpha_{n-1} \geq \alpha_n \\ \Gamma_1 \cap \Gamma_3 & \alpha_{n-1} < \alpha_n, \end{cases}$$
where the regions $\Gamma_1, \Gamma_2, \Gamma_3 \subset \mathbb{R}^2$ are given by
$$\Gamma_1 = [\tau^{\min}_{n-1}, \tau^{\max}_{n-1}] \times [\tau^{\max}_n, \tau^{\min}_n], \qquad \Gamma_2 = [\tau^+, \infty) \times [\tau^-, \tau^+],$$
$$\Gamma_3 = \{(\tau_{n-1}, \tau_n) \in \Gamma_1 \mid f(\tau_{n-1}) \leq \tau_n \leq g(\tau_{n-1})\}.$$
In addition, the set $\Gamma_H$ is connected.

The feasible regions in the two possible cases ($\alpha_{n-1} \geq \alpha_n$ or $\alpha_{n-1} < \alpha_n$) are illustrated in Appendix C, Fig. 4.

Strict Identifiability. By Lemma 2, for strict identifiability of $H$, $\Gamma_H$ should be the singleton set $\Gamma_H = \{(1, 0)\}$. Due to lack of space, sufficient and necessary conditions for this to hold, as well as a corresponding simple procedure to determine whether a 2A-HMM is identifiable, are given in Appendix C.2.

Remark. While beyond the scope of this paper, we note that instead of strict identifiability of a given HMM, several works studied a different concept of generic identifiability (Allman et al., 2009), proving that under mild conditions the class of HMMs is generically identifiable. In contrast, if we restrict ourselves to the class of 2A-HMMs, then our Theorem 2 implies that this class is generically non-identifiable. The reason is that by Theorem 2, for any 2A-HMM whose matrix $A$ has all its entries positive, there are an infinite number of equivalent 2A-HMMs, implying its non-identifiability.
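The following sketch makes the feasible region tangible: it forms $S(\tau_{n-1}, \tau_n)$, applies the similarity transformation (9), and scans a grid for pairs with $A_H \geq 0$, i.e. numerical members of $\Gamma_H$. The names `S_matrix` and `feasible_grid` are illustrative choices of ours, and a grid scan is only a crude stand-in for the exact characterization of Theorem 2 and the identifiability procedure of Appendix C.

```python
import numpy as np

def S_matrix(n, tau1, tau2):
    """S(tau_{n-1}, tau_n): identity on the first n-2 coordinates and the
    2x2 block [[tau1, tau2], [1 - tau1, 1 - tau2]] on the two aliased states."""
    S = np.eye(n)
    S[-2:, -2:] = [[tau1, tau2], [1.0 - tau1, 1.0 - tau2]]
    return S

def feasible_grid(A, taus1, taus2, tol=1e-12):
    """Return the grid pairs (tau_{n-1}, tau_n) for which the similarity
    transform A_H = S^{-1} A S of eq. (9) is entrywise non-negative."""
    n, feasible = A.shape[0], []
    for t1 in taus1:
        for t2 in taus2:
            if t1 <= t2:
                continue                                 # Gamma_H requires tau_{n-1} > tau_n
            AH = np.linalg.solve(S_matrix(n, t1, t2), A @ S_matrix(n, t1, t2))
            if AH.min() >= -tol:
                feasible.append((t1, t2))
    return feasible
```

For instance, `feasible_grid(A, np.linspace(0.5, 1.5, 101), np.linspace(-0.5, 0.5, 101))` always contains $(1, 0)$, which recovers $A$ itself; any additional grid points correspond to equivalent 2A-HMMs and hence indicate non-identifiability.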
5. Learning a 2A-HMM

Let $(Y_t)_{t=0}^{T-1}$ be an output sequence generated by a parametric-output HMM that satisfies Assumptions (A1-A2) and is initialized with its stationary distribution, $X_0 \sim \pi$. We assume the HMM is either non-aliased, with $n-1$ states, or 2-aliased with $n$ states. We further assume that the HMM is minimal and identifiable, as otherwise its parameters cannot be uniquely determined. In this section we study the problems of detecting whether the HMM is aliased and recovering its output parameters $\theta$ and transition matrix $A$, all in terms of $(Y_t)_{t=0}^{T-1}$.

Figure 1. Learning a 2A-HMM (flowchart: from the output sequence $(Y_t)_{t=0}^{T-1}$, determine the model order $n-1$ and fit $\bar\theta = (\bar\theta_1, \ldots, \bar\theta_{n-1})$; test whether the HMM is 2-aliased; if non-aliased ($n-1$ states), estimate $\bar A$; if 2-aliased ($n$ states), identify the aliased component $\bar\theta_{n-1}$ and estimate $A$).

As outlined in Fig. 1, the proposed learning procedure consists of the following steps:
(i) Determine the number of output components $n-1$ and estimate the $n-1$ unique output distribution parameters $\bar\theta$ and the projected stationary distribution $\bar\pi$.
(ii) Detect if the HMM is 2-aliased.
(iii) In case of a non-aliased HMM, estimate the $(n-1) \times (n-1)$ transition matrix $\bar A$, as for example in Kontorovich et al. (2013) or Anandkumar et al. (2012).
(iv) In case of a 2-aliased HMM, identify the component $\bar\theta_{n-1}$ corresponding to the two aliased states, and estimate the $n \times n$ transition matrix $A$.
We now describe in detail each of these steps. As far as we know, our learning procedure is the first to consistently learn a 2A-HMM in a computationally efficient way. In particular, the solutions for problems (ii) and (iv) are new.

Estimating the output distribution parameters. As the HMM is stationary, each observable $Y_t$ is a random realization from the following parametric mixture model,
$$P(y) = \sum_{i=1}^{n-1} \bar\pi_i f_{\bar\theta_i}(y). \quad (11)$$
Hence, the number of unique output components $n-1$, the corresponding output parameters $\bar\theta$ and the projected stationary distribution $\bar\pi$ can be estimated by fitting a mixture model (11) to the observed output sequence $(Y_t)_{t=0}^{T-1}$. Consistent methods to determine the number of components in a mixture are well known in the literature (Titterington et al., 1985). Estimating $\bar\theta$ and $\bar\pi$ can be done by either the EM algorithm or any recently developed spectral method (Dasgupta, 1999; Achlioptas & McSherry, 2005; Anandkumar et al., 2012). As our focus is on the aliasing aspects of the HMM, in what follows we assume that the number of unique output components $n-1$, the output parameters $\bar\theta$ and the projected stationary distribution $\bar\pi$ are exactly known. As in Kontorovich et al. (2013), it is possible to show that our method is robust to small errors in these quantities (not presented).

5.1. Moments

To solve problems (ii), (iii) and (iv) above, we first introduce the moment-based quantities we shall make use of. Given $\bar\theta$ and $\bar\pi$, or estimates of them, for any $i, j \in [n-1]$ we define the second-order moments with time lag $t$ by
$$M^{(t)}_{ij} = \mathbb{E}\big[f_{\bar\theta_i}(Y_t)\, f_{\bar\theta_j}(Y_0)\big], \qquad t \in \{1, 2, 3\}. \quad (12)$$
The consecutive-in-time third-order moments are defined by
$$G^{(c)}_{ij} = \mathbb{E}\big[f_{\bar\theta_i}(Y_2)\, f_{\bar\theta_c}(Y_1)\, f_{\bar\theta_j}(Y_0)\big], \qquad c \in [n-1]. \quad (13)$$
We also define the lifted kernel $\tilde K = B^T \bar K B \in \mathbb{R}^{n \times n}$. One can easily verify that for a 2A-HMM,
$$M^{(t)} = \bar K B A^t C_\beta \operatorname{diag}(\bar\pi) \bar K, \quad (14)$$
$$G^{(c)} = \bar K B A \operatorname{diag}(\tilde K_{[\cdot,c]}) A C_\beta \operatorname{diag}(\bar\pi) \bar K. \quad (15)$$
Next we define the kernel-free moments $\bar M^{(t)}, \bar G^{(c)} \in \mathbb{R}^{(n-1) \times (n-1)}$ as follows:
$$\bar M^{(t)} = \bar K^{-1} M^{(t)} \bar K^{-1} \operatorname{diag}(\bar\pi)^{-1}, \quad (16)$$
$$\bar G^{(c)} = \bar K^{-1} G^{(c)} \bar K^{-1} \operatorname{diag}(\bar\pi)^{-1}. \quad (17)$$
Note that by Assumption (A1) the kernel $\bar K$ is full rank, and thus $\bar K^{-1}$ exists. Similarly, by (A2), $\bar\pi > 0$, so $\operatorname{diag}(\bar\pi)^{-1}$ also exists. Thus, (16)-(17) are well defined. Let $R^{(2)}, R^{(3)}, F^{(c)} \in \mathbb{R}^{(n-1) \times (n-1)}$ be given by
$$R^{(2)} = \bar M^{(2)} - (\bar M^{(1)})^2, \quad (18)$$
$$R^{(3)} = \bar M^{(3)} - \bar M^{(2)} \bar M^{(1)} - \bar M^{(1)} \bar M^{(2)} + (\bar M^{(1)})^3, \quad (19)$$
$$F^{(c)} = \bar G^{(c)} - \bar M^{(1)} \operatorname{diag}(\bar K_{[\cdot,c]}) \bar M^{(1)}. \quad (20)$$
The following key lemma relates the moments (18)-(20) to the decomposition (8) of the transition matrix $A$.

Lemma 3. Let $H$ be a minimal 2A-HMM with aliased states $n-1$ and $n$. Let $\bar A$, $\delta^{out}$, $\delta^{in}$ and $\kappa$ be defined in (3), (5), (6), (7) respectively. Then the following relations hold:
$$\bar M^{(1)} = \bar A, \quad (21)$$
$$R^{(2)} = \delta^{out} (\delta^{in})^T, \quad (22)$$
$$R^{(3)} = \kappa R^{(2)}, \quad (23)$$
$$F^{(c)} = \bar K_{n-1,c}\, R^{(2)}, \qquad c \in [n-1]. \quad (24)$$
In the following, these relations will be used to detect aliasing, identify the aliased states and recover the aliased transition matrix $A$.

Empirical moments. In practice, the unknown moments (12)-(13) are estimated from the output sequence $(Y_t)_{t=0}^{T-1}$ by
$$\hat M^{(t)}_{ij} = \frac{1}{T-t} \sum_{l=0}^{T-t-1} f_{\bar\theta_i}(Y_{l+t}) f_{\bar\theta_j}(Y_l), \qquad \hat G^{(c)}_{ij} = \frac{1}{T-2} \sum_{l=0}^{T-3} f_{\bar\theta_i}(Y_{l+2}) f_{\bar\theta_c}(Y_{l+1}) f_{\bar\theta_j}(Y_l).$$
With known (or estimated) $\bar K$, $\bar\pi$, the corresponding empirical kernel-free moments are given by
$$\hat{\bar M}^{(t)} = \bar K^{-1} \hat M^{(t)} \bar K^{-1} \operatorname{diag}(\bar\pi)^{-1}, \quad (25)$$
$$\hat{\bar G}^{(c)} = \bar K^{-1} \hat G^{(c)} \bar K^{-1} \operatorname{diag}(\bar\pi)^{-1}. \quad (26)$$
The empirical estimates for (18)-(20) similarly follow. To analyze the error between the empirical and population quantities, we make the following additional assumption:

(A3) The output distributions are bounded: there exists $L > 0$ such that for all $i \in [n]$ and $y \in \mathcal{Y}$, $f_{\theta_i}(y) \leq L$.

Lemma 4. Let $(Y_t)_{t=0}^{T-1}$ be an output sequence generated by an HMM satisfying Assumptions (A1-A3). Then, as $T \to \infty$, for any $t \in \{1, 2, 3\}$ and $c \in [n-1]$, all error terms $\hat{\bar M}^{(t)} - \bar M^{(t)}$, $\hat R^{(t)} - R^{(t)}$ and $\hat F^{(c)} - F^{(c)}$ are $O_P(T^{-1/2})$.

In fact, due to strong mixing, all of the above quantities are asymptotically normally distributed (Bradley, 2005).
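A direct implementation of the empirical statistics above is a single pass over the outputs. The sketch below (our own naming, assuming $\bar\theta$, $\bar\pi$ and $\bar K$ are known or pre-estimated, as in the text) computes $\hat{\bar M}^{(1)}$, $\hat R^{(2)}$, $\hat R^{(3)}$ and the $\hat F^{(c)}$'s.

```python
import numpy as np

def empirical_statistics(y, thetas_bar, pi_bar, K_bar, f):
    """Empirical kernel-free moments and the statistics of eqs. (18)-(20).
    f(theta, y) evaluates the output density f_theta at the array y."""
    T, m = len(y), len(thetas_bar)                       # m = n - 1 components
    Fy = np.stack([f(th, y) for th in thetas_bar])       # m x T array of f_{theta_i}(Y_t)
    Kinv, Dinv = np.linalg.inv(K_bar), np.diag(1.0 / pi_bar)

    def M_hat(t):                                        # second-order moments, lag t
        return Fy[:, t:] @ Fy[:, :T - t].T / (T - t)     # row index goes with the later time

    def G_hat(c):                                        # consecutive third-order moments
        w = Fy[c, 1:T - 1]                               # middle factor f_{theta_c}(Y_{l+1})
        return (Fy[:, 2:] * w) @ Fy[:, :T - 2].T / (T - 2)

    def kernel_free(M):                                  # eqs. (16)-(17), (25)-(26)
        return Kinv @ M @ Kinv @ Dinv

    M1, M2, M3 = [kernel_free(M_hat(t)) for t in (1, 2, 3)]
    R2 = M2 - M1 @ M1                                    # eq. (18)
    R3 = M3 - M2 @ M1 - M1 @ M2 + M1 @ M1 @ M1           # eq. (19)
    Fc = [kernel_free(G_hat(c)) - M1 @ np.diag(K_bar[:, c]) @ M1   # eq. (20)
          for c in range(m)]
    return M1, R2, R3, Fc
```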
5.2. Detection of aliasing

We now proceed to detect if the HMM is aliased (step (ii) in Fig. 1). We pose this as a hypothesis testing problem:
$$\mathcal{H}_0: H \text{ is non-aliased with } n-1 \text{ states} \quad \text{vs.} \quad \mathcal{H}_1: H \text{ is 2-aliased with } n \text{ states}.$$
We begin with the following simple observation:

Lemma 5. Let $H$ be a minimal non-aliased HMM with $n-1$ states, satisfying Assumptions (A1-A3). Then $R^{(2)} = 0$.

In contrast, if $H$ is 2-aliased then according to (22) we have $R^{(2)} = \delta^{out}(\delta^{in})^T$. In addition, since the HMM is assumed to be minimal and started from the stationary distribution, Theorem 1 implies that both $\delta^{out} \neq 0$ and $\delta^{in} \neq 0$. Thus $R^{(2)}$ is exactly a rank-1 matrix, which we write as
$$R^{(2)} = \sigma u v^T \quad \text{with } \|u\|_2 = \|v\|_2 = 1,\ \sigma > 0, \quad (27)$$
where $\sigma$ is the unique non-zero singular value of $R^{(2)}$. Hence, our hypothesis testing problem takes the form:
$$\mathcal{H}_0: R^{(2)} = 0 \quad \text{vs.} \quad \mathcal{H}_1: R^{(2)} = \sigma u v^T \text{ with } \sigma > 0.$$
In practice, we only have the empirical estimate $\hat R^{(2)}$. Even if $\sigma = 0$, this matrix is typically full rank with $n-1$ non-zero singular values. Our problem is thus detecting the rank of a matrix from a noisy version of it. There are multiple methods to do so. In this paper, motivated by Kritchman & Nadler (2009), we adopt the largest singular value $\hat\sigma_1$ of $\hat R^{(2)}$ as our test statistic. The resulting test is
$$\text{if } \hat\sigma_1 \geq h_T \text{ return } \mathcal{H}_1, \text{ otherwise return } \mathcal{H}_0, \quad (28)$$
where $h_T$ is a predefined threshold. By Lemma 4, as $T \to \infty$ the singular values of $\hat R^{(2)}$ converge to those of $R^{(2)}$. Thus, as the following lemma shows, with a suitable threshold this test is asymptotically consistent.

Lemma 6. Let $H$ be a minimal HMM satisfying Assumptions (A1-A3) which is either non-aliased or 2-aliased. Then for any $0 < \epsilon < \tfrac{1}{2}$, the test (28) with $h_T = \Omega(T^{-\frac{1}{2}+\epsilon})$ is consistent. Namely, as $T \to \infty$,
$$P(\text{reject } \mathcal{H}_1 \mid \mathcal{H}_1 \text{ holds}) + P(\text{reject } \mathcal{H}_0 \mid \mathcal{H}_0 \text{ holds}) \to 0,$$
and asymptotically the test correctly detects whether the HMM is non-aliased or 2-aliased.

If the HMM was detected as non-aliased, then its $(n-1) \times (n-1)$ transition matrix can be estimated, for example, by the spectral methods of Kontorovich et al. (2013) or Anandkumar et al. (2012). We now turn our attention to the case where the HMM was detected as an aliased one.

Estimating the non-aliased transition matrix $\bar A$. It is easy to show that in the 2-aliased case, the $(n-1) \times (n-1)$ transition matrix most consistent with the first two moments is nothing but the non-aliased transition matrix $\bar A$. Hence, applying for example the spectral method of Kontorovich et al. (2013) yields an estimate $\hat{\bar A}$ which is not only strongly consistent, but also satisfies that as $T \to \infty$, $\hat{\bar A} = \bar A + O_P(T^{-1/2})$.
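In code, the test (28) is a one-liner on the top singular value of $\hat R^{(2)}$; the threshold $h_T = 2T^{-1/3}$ used below is the one from the simulations of Section 6 and satisfies the $\Omega(T^{-1/2+\epsilon})$ condition of Lemma 6. This is a sketch with our own function name, not the paper's.

```python
import numpy as np

def detect_aliasing(R2_hat, T, h=lambda T: 2.0 * T ** (-1.0 / 3.0)):
    """Test (28): declare 2-aliasing iff the largest singular value of the
    empirical R^(2) exceeds the threshold h_T."""
    sigma1 = np.linalg.svd(R2_hat, compute_uv=False)[0]
    return sigma1 >= h(T), sigma1
```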
5.3. Identifying the aliased component $\bar\theta_{n-1}$

Assuming the HMM was detected as 2-aliased, our next task, step (iv), is to identify the aliased component. Recall that if the aliased component is $\bar\theta_{n-1}$, then by (24), $F^{(c)} = \bar K_{n-1,c}\, R^{(2)}$ for all $c \in [n-1]$. We thus estimate the index $i \in [n-1]$ of the aliased component by solving the following least-squares problem:
$$\hat i = \operatorname*{argmin}_{i \in [n-1]} \sum_{c \in [n-1]} \big\| \hat F^{(c)} - \bar K_{i,c}\, \hat R^{(2)} \big\|^2.$$
The following result shows this method is consistent.

Lemma 7. For a minimal 2A-HMM satisfying Assumptions (A1-A3) with aliased states $n-1$ and $n$,
$$\lim_{T \to \infty} \Pr(\hat i \neq n-1) = 0.$$

5.4. Learning the aliased transition matrix A

Given the aliased component, we estimate the $n \times n$ transition matrix $A$ using the decomposition (8). First, recall that by (22), $R^{(2)} = \delta^{out}(\delta^{in})^T = \sigma u v^T$. As singular vectors are determined only up to scaling, we have that
$$\delta^{out} = \gamma u \quad \text{and} \quad \delta^{in} = \frac{\sigma}{\gamma} v,$$
where $\gamma \in \mathbb{R}$ is a yet undetermined constant. Thus, the decomposition (8) of $A$ takes the form:
$$A = C_\beta \bar A B + \gamma C_\beta u c_\beta^T + \frac{\sigma}{\gamma} b v^T B + \kappa\, b c_\beta^T. \quad (31)$$
Since $\bar A$, $\sigma$, $u$ and $v$ were estimated in previous steps, we are left to determine the scalars $\gamma$, $\beta$ and the $\kappa$ of Eq. (7). As for $\kappa$, according to (23) we have $R^{(3)} = \kappa R^{(2)}$. Thus, plugging in the empirical versions, $\hat\kappa$ is estimated by
$$\hat\kappa = \operatorname*{argmin}_{r \in \mathbb{R}} \big\| \hat R^{(3)} - r \hat R^{(2)} \big\|^2.$$
To determine $\gamma$ and $\beta$ we turn to the similarity transformation $A_H(\tau_{n-1}, \tau_n)$ given in (9). As shown in Section 4, this transformation characterizes all transition matrices equivalent to $A$. To relate $A_H$ to the form of the decomposition (31), we reparametrize $\tau_{n-1}$ and $\tau_n$ as follows:
$$\gamma' = \gamma(\tau_{n-1} - \tau_n), \qquad \beta' = \frac{\beta - \tau_n}{\tau_{n-1} - \tau_n}.$$
Replacing $\tau_{n-1}, \tau_n$ with $\gamma', \beta'$, we find that $A_H$ is given by
$$A_H = C_{\beta'} \bar A B + \gamma' C_{\beta'} u c_{\beta'}^T + \frac{\sigma}{\gamma'} b v^T B + \kappa\, b c_{\beta'}^T. \quad (33)$$
Note that putting $\gamma' = \gamma$ and $\beta' = \beta$ recovers the decomposition (31) for the original transition matrix $A$. Now, since $H$ is assumed identifiable, the constraint $A_H(\tau_{n-1}, \tau_n) \geq 0$ has the unique solution $(\tau_{n-1}, \tau_n) = (1, 0)$, or equivalently $(\gamma', \beta') = (\gamma, \beta)$. Thus, with exact knowledge of the various moments, only a single pair of values $(\gamma', \beta')$ will yield a non-negative matrix (33). This perfectly recovers $\gamma$, $\beta$ and the original transition matrix $A$.

In practice we plug into (33) the empirical versions $\hat{\bar A}$, $\hat\kappa$, $\hat\sigma_1$, $\hat u_1$ and $\hat v_1$, where $\hat u_1, \hat v_1$ are the left and right singular vectors of $\hat R^{(2)}$ corresponding to the singular value $\hat\sigma_1$. As described in Appendix D.5, the values $(\hat\gamma, \hat\beta)$ are found by maximizing a simple two-dimensional smooth function. The resulting estimate for the aliased transition matrix is
$$\hat A = C_{\hat\beta} \hat{\bar A} B + \hat\gamma C_{\hat\beta} \hat u_1 c_{\hat\beta}^T + \frac{\hat\sigma_1}{\hat\gamma} b \hat v_1^T B + \hat\kappa\, b c_{\hat\beta}^T.$$
The following theorem proves our method is consistent.

Theorem 3. Let $H$ be a 2A-HMM satisfying Assumptions (A1-A3) with aliased states $n-1$ and $n$. Then as $T \to \infty$, $\hat A = A + o_P(1)$.
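The sketch below strings steps (iii)-(iv) together: the least-squares identification of the aliased component, the closed-form $\hat\kappa$, and the assembly of $\hat A$ via (31)/(33). The grid search over $(\gamma', \beta')$ that picks the "most non-negative" candidate is a naive stand-in for the smooth two-dimensional maximization of Appendix D.5, and the final clip-and-renormalize projection is our own crude choice; all function names here are ours.

```python
import numpy as np

def identify_aliased_component(Fc_hats, R2_hat, K_bar):
    """Least-squares index of the aliased component (Section 5.3):
    argmin_i sum_c || F_hat^(c) - K_bar[i, c] * R_hat^(2) ||_F^2."""
    costs = [sum(np.linalg.norm(Fc - K_bar[i, c] * R2_hat) ** 2
                 for c, Fc in enumerate(Fc_hats)) for i in range(K_bar.shape[0])]
    return int(np.argmin(costs))

def assemble_A(A_bar, u, v, sigma, kappa, gamma, beta):
    """Candidate n x n transition matrix from the decomposition (31)/(33)."""
    m = A_bar.shape[0]; n = m + 1
    B = np.zeros((m, n)); B[:m - 1, :m - 1] = np.eye(m - 1); B[-1, -2:] = 1.0
    C = np.zeros((n, m)); C[:m - 1, :m - 1] = np.eye(m - 1)
    C[-2, -1], C[-1, -1] = beta, 1.0 - beta
    c_beta = np.zeros(n); c_beta[-2:] = [1.0 - beta, -beta]
    b = np.zeros(n); b[-2:] = [1.0, -1.0]
    return (C @ A_bar @ B + gamma * np.outer(C @ u, c_beta)
            + (sigma / gamma) * np.outer(b, v) @ B + kappa * np.outer(b, c_beta))

def estimate_A(A_bar_hat, R2_hat, R3_hat, gammas, betas):
    """Assemble A_hat, scanning (gamma, beta) for the most non-negative candidate
    (include both signs of gamma to absorb the SVD sign ambiguity)."""
    U, s, Vt = np.linalg.svd(R2_hat)
    u, v, sigma = U[:, 0], Vt[0, :], s[0]
    kappa = np.sum(R2_hat * R3_hat) / np.sum(R2_hat * R2_hat)  # LS solution for kappa, eq. (23)
    best, best_neg = None, np.inf
    for g in gammas:
        for bt in betas:
            cand = assemble_A(A_bar_hat, u, v, sigma, kappa, g, bt)
            neg = -np.minimum(cand, 0.0).sum()                  # total negative mass
            if neg < best_neg:
                best, best_neg = cand, neg
    A_hat = np.clip(best, 0.0, None)
    return A_hat / A_hat.sum(axis=0)                            # project back to column-stochastic
```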
Figure 2. The aliased HMM (left) and its corresponding non-aliased version with states 3 and 4 merged (right).

6. Numerical simulations

The following simulation results illustrate the consistency of our methods to detect aliasing, identify the aliased component and learn the transition matrix $A$. As our focus is on the aliasing, we assume for simplicity that the output parameters $\bar\theta$ and the projected stationary distribution $\bar\pi$ are exactly known. Motivated by ion channel gating (Crouzy & Sigworth, 1990; Rosales et al., 2001; Witkoskie & Cao, 2004), we consider the following HMM $H$ with $n = 4$ hidden states (Fig. 2, left). The output distributions are univariate Gaussians $N(\mu_i, \sigma_i^2)$; the matrix $A$ and $(f_{\theta_i})_{i=1}^4$ are given by
$$A = \begin{pmatrix} 0.3 & 0.25 & 0.0 & 0.8 \\ 0.6 & 0.25 & 0.2 & 0.0 \\ 0.0 & 0.5 & 0.1 & 0.1 \\ 0.1 & 0.0 & 0.7 & 0.1 \end{pmatrix}, \qquad \begin{aligned} f_{\theta_1} &= N(3, 1), \\ f_{\theta_2} &= N(6, 1), \\ f_{\theta_3} &= N(0, 1), \\ f_{\theta_4} &= N(0, 1). \end{aligned}$$
States 3 and 4 are aliased, and by Procedure 1 in Appendix C.3 this 2A-HMM is identifiable. The rank-1 matrix $R^{(2)}$ has a singular value $\sigma = 0.33$. Fig. 2 (right) shows its non-aliased version $\bar H$ with states 3 and 4 merged.

To test our aliasing detection algorithm, we generated $T$ outputs from the original aliased HMM and from its non-aliased version $\bar H$. Fig. 3 (top left) shows the empirical densities (averaged over 1000 independent runs) of the largest singular value of $\hat R^{(2)}$, for both $H$ and $\bar H$. Fig. 3 (top right) shows similar results for a 2A-HMM with $\sigma = 0.22$. When $\sigma = 0.33$, already $T = 1000$ outputs suffice for essentially perfect detection of aliasing. For the smaller $\sigma = 0.22$, more samples are required.

Fig. 3 (middle left) shows the false alarm and misdetection probabilities vs. sample size $T$ of the aliasing detection test (28) with threshold $h_T = 2T^{-1/3}$. The consistency of our method is evident. Fig. 3 (middle right) shows the probability of misidentifying the aliased component $\bar\theta_3$. We considered the same 2A-HMM $H$ but with different means for the Gaussian output distribution of the aliased states, $\bar\mu_3 \in \{0, 1, 2\}$. As expected, when $f_{\bar\theta_3}$ is closer to the output distribution of the non-aliased state $f_{\theta_1}$ (with mean $\mu_1 = 3$), identifying the aliased component is more difficult.

Finally, we considered the following methods to estimate $A$: the Baum-Welch algorithm with a random initial guess of the HMM parameters (BW); our method of moments with exactly known $\bar\theta$ (MoM+Exact); BW initialized with the output of our method (BW+MoM+Exact); and BW with exactly known output distributions but a random initial guess of the transition matrix (BW+Exact).

Figure 3. Top: Empirical density of the largest singular value of $\hat R^{(2)}$ with $\sigma = 0.33$ (left) and $\sigma = 0.22$ (right). Middle: Misdetection probability of aliasing/non-aliasing (left) and probability of misidentifying the correct aliased component (right). Bottom: Average error $\mathbb{E}\|\hat A - A\|_F^2$ and running-time comparison of the different algorithms vs. sample size $T$.

Fig. 3 (bottom left) shows on a logarithmic scale the mean square error $\mathbb{E}\|\hat A - A\|_F^2$ vs. sample size $T$, averaged over 100 independent realizations. Fig. 3 (bottom right) shows the running time as a function of $T$. In both figures, the number of iterations of BW was set to 20. These results show that with a random initial guess of the HMM parameters, BW requires far more than 20 iterations to converge. Even with exact knowledge of the output distributions but a random initial guess of the matrix $A$, BW still fails to converge after 20 iterations. In contrast, our method yields a relatively accurate estimator in only a fraction of the running time. Furthermore, using this estimator as an initial guess for BW yields even better accuracy.
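For completeness, here is a hypothetical end-to-end run on the 4-state example above, reusing the sketches from the previous sections (`sample_hmm`, `gaussian_kernel`, `empirical_statistics`, `detect_aliasing`, `identify_aliased_component`, `estimate_A`, all our own names, assumed to be in scope). As in the experiments, it assumes $\bar\theta$, $\bar\pi$ and $\bar K$ are known exactly; using $\hat{\bar M}^{(1)}$ as the estimate of $\bar A$ relies on Eq. (21) rather than on the spectral method cited in the text.

```python
import numpy as np

# The 4-state example of Section 6; columns are "from" states, A[i, j] = P(i | j).
A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
mus, sigmas = np.array([3.0, 6.0, 0.0, 0.0]), np.ones(4)          # states 3 and 4 aliased

# stationary distribution, projected quantities and the (assumed known) kernel
w, V = np.linalg.eig(A)
pi = np.real(V[:, np.argmax(np.real(w))]); pi /= pi.sum()
pi_bar = np.append(pi[:2], pi[2] + pi[3])
K_bar = gaussian_kernel(mus[:3], sigmas[:3])
gauss = lambda th, y: np.exp(-(y - th) ** 2 / 2) / np.sqrt(2 * np.pi)

_, y = sample_hmm(A, mus, sigmas, T=50_000, pi0=pi, seed=0)
M1, R2, R3, Fc = empirical_statistics(y, mus[:3], pi_bar, K_bar, gauss)
aliased, sigma1 = detect_aliasing(R2, T=len(y))                   # expect: aliasing detected
i_hat = identify_aliased_component(Fc, R2, K_bar)                 # expect index 2 (theta_bar_3)
A_hat = estimate_A(M1, R2, R3,                                    # M1 estimates A_bar by eq. (21)
                   gammas=np.concatenate([np.linspace(-1, -0.02, 50),
                                          np.linspace(0.02, 1, 50)]),
                   betas=np.linspace(0.02, 0.98, 49))
```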
Acknowledgments

We thank Aryeh Kontorovich for the helpful discussions. This research was supported by the Frankel Center for Computer Science.

References

Achlioptas, Dimitris and McSherry, Frank. On spectral learning of mixtures of distributions. In Learning Theory, pp. 458-469. Springer, 2005.

Allman, Elizabeth S., Matias, Catherine, and Rhodes, John A. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, pp. 3099-3132, 2009.

Anandkumar, Animashree, Hsu, Daniel, and Kakade, Sham M. A method of moments for mixture models and hidden Markov models. In COLT, 2012.

Baum, L. E., Petrie, T., Soules, G., and Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164-171, 1970.

Belkin, M. and Sinha, K. Polynomial learning of distribution families. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pp. 103-112, 2010.

Blackwell, David and Koopmans, Lambert. On the identifiability problem for functions of finite Markov chains. The Annals of Mathematical Statistics, pp. 1011-1015, 1957.

Bradley, Richard C. Basic properties of strong mixing conditions. A survey and some open questions. Probab. Surveys, 2:107-144, 2005.

Brafman, R. I. and Shani, G. Resolving perceptual aliasing in the presence of noisy sensors. In NIPS, pp. 1249-1256, 2004.

Brejova, B., Brown, D. G., and Vinar, T. The most probable annotation problem in HMMs and its application to bioinformatics. Journal of Computer and System Sciences, 73(7):1060-1077, 2007.

Cappé, Olivier, Moulines, Eric, and Rydén, Tobias. Inference in hidden Markov models. Springer Series in Statistics. Springer, New York, 2005.

Chrisman, L. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In AAAI, pp. 183-188. Citeseer, 1992.

Crouzy, Serge C. and Sigworth, Frederick J. Yet another approach to the dwell-time omission problem of single-channel analysis. Biophysical Journal, 58(3):731, 1990.

Dasgupta, Sanjoy. Learning mixtures of Gaussians. In Foundations of Computer Science, 1999. 40th Annual Symposium on, pp. 634-644, 1999.

Feldman, Jon, O'Donnell, Ryan, and Servedio, Rocco A. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 37(5):1536-1564, 2008.

Finesso, Lorenzo. Consistent estimation of the order for Markov and hidden Markov chains. Technical report, DTIC Document, 1990.

Fredkin, Donald R. and Rice, John A. On aggregated Markov processes. Journal of Applied Probability, pp. 208-214, 1986.

Fredkin, Donald R. and Rice, John A. Maximum likelihood estimation and identification directly from single-channel recordings. Proceedings of the Royal Society of London. Series B: Biological Sciences, 249(1325):125-132, 1992.

Hsu, Daniel, Kakade, Sham M., and Zhang, Tong. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

Huang, Qingqing, Ge, Rong, Kakade, Sham, and Dahleh, Munther. Minimal realization problems for hidden Markov models. arXiv preprint arXiv:1411.3698, 2014.

Ito, Hisashi, Amari, S.-I., and Kobayashi, Kingo. Identifiability of hidden Markov information sources and their minimum degrees of freedom. Information Theory, IEEE Transactions on, 38(2):324-333, 1992.

Jaeger, Herbert. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398, 2000.

Jefferies, M. E. and Yeap, W. Robotics and cognitive approaches to spatial mapping, volume 38. Springer, 2008.

Kontorovich, A. and Weiss, R.
Uniform Chernoff and Dvoretzky-Kiefer-Wolfowitz-type inequalities for Markov chains and related processes. Journal of Applied Probability, 51:1-14, 2014.

Kontorovich, Aryeh, Nadler, Boaz, and Weiss, Roi. On learning parametric-output HMMs. In Proceedings of The 30th International Conference on Machine Learning, pp. 702-710, 2013.

Kritchman, Shira and Nadler, Boaz. Non-parametric detection of the number of signals: Hypothesis testing and random matrix theory. IEEE Transactions on Signal Processing, 57(10):3930-3941, 2009.

Leggetter, C. J. and Woodland, P. C. Speaker adaptation of continuous density HMMs using multivariate linear regression. In ICSLP, volume 94, pp. 451-454, 1994.

Leroux, Brian G. Maximum-likelihood estimation for hidden Markov models. Stochastic Processes and their Applications, 40(1):127-143, 1992.

McCallum, R. Andrew. Instance-based utile distinctions for reinforcement learning with hidden state. In ICML, pp. 387-395, 1995.

Newey, Whitney K. Uniform convergence in probability and stochastic equicontinuity. Econometrica, 59:1161-1167, 1991.

Petrie, T. Probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, pp. 97-115, 1969.

Rosales, Rafael, Stark, J. Alex, Fitzgerald, William J., and Hladky, Stephen B. Bayesian restoration of ion channel records using hidden Markov models. Biophysical Journal, 80(3):1088-1103, 2001.

Shani, G., Brafman, R. I., and Shimony, S. E. Model-based online learning of POMDPs. In ECML, pp. 353-364. Springer, 2005.

Siddiqi, Sajid M., Boots, Byron, and Gordon, Geoffrey J. Reduced-rank hidden Markov models. In AISTATS, 2010.

Stanke, M. and Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19(suppl 2):ii215-ii225, 2003.

Stewart, G. W. and Sun, Ji-guang. Matrix Perturbation Theory. Academic Press, 1990.

Titterington, D. Michael, Smith, Adrian F. M., Makov, Udi E., et al. Statistical analysis of finite mixture distributions, volume 7. Wiley, New York, 1985.

Vanluyten, Bart, Willems, Jan C., and De Moor, Bart. Equivalence of state representations for hidden Markov models. Systems & Control Letters, 57(5):410-419, 2008.

White, Langford B., Mahony, Robert, and Brushe, Gary D. Lumpable hidden Markov models: model reduction and reduced complexity filtering. Automatic Control, IEEE Transactions on, 45(12):2297-2306, 2000.

Witkoskie, James B. and Cao, Jianshu. Single molecule kinetics. I. Theoretical analysis of indicators. The Journal of Chemical Physics, 121(13):6361-6372, 2004.

Zatuchna, Z. V. and Bagnall, A. Learning mazes with aliasing states: An LCS algorithm with associative perception. Adaptive Behavior, 17(1):28-57, 2009.