# learning_mixed_latent_tree_models__fd066b09.pdf

Journal of Machine Learning Research 21 (2020) 1-35 Submitted 4/20; Revised 9/20; Published 10/20

Learning Mixed Latent Tree Models

Can Zhou zhouc574@nenu.edu.cn Xiaofei Wang wangxf341@nenu.edu.cn Jianhua Guo jhguo@nenu.edu.cn KLAS and School of Mathematics and Statistics Northeast Normal University Changchun, China.

Editor: Zhihua Zhang

Latent structural learning has attracted more attention in recent years. But most related works only focuses on pure continuous or pure discrete data. In this paper, we consider mixed latent tree models for mixed data mining. We address the latent structural learning and parameter estimation for those mixed models. For structural learning, we propose a consistent bottom-up algorithm, and give a ﬁnite sample bound guarantee for the exact structural recovery. For parameter estimation, we suggest a moment estimator by exploiting matrix decomposition, and prove asymptotic normality of the estimator. Experiments on the simulated and real data support that our method is valid for mining the hierarchical structure and latent information. Keywords: Information distance, Latent variables, Mixed latent tree, Parameter estimation, Structural learning

1. Introduction

Latent variable models are important tools for probabilistic modeling, and have been widely applied to various domains, such as speech analysis and bioinformatics. As one of classical latent variable models, the latent class model can eﬀectively deal with clustering analysis problems (Goodman, 1974; Lazarsfeld and Henry, 1968). The latent class model has a simple assumption that all observed variables are conditionally independent given the latent class variable. But this assumption can t catch the potential complex mechanism behind observed variables. For further extension, Zhang (2004) ﬁrst investigated the latent tree model, in which all leaf nodes are observed variables and all internal nodes are latent variables. The latent tree model addresses the local dependence problem with a principled manner. It can capture a hierarchical generating mechanism for observed variables, and provide more illuminating explanations on the variable groups compared to the latent class model. The learning and application of latent tree models developed a lot over the past decades. The original work (Zhang, 2004) proposed a scoring-based algorithm for structural learning. Some further works (Chen et al., 2012; Liu et al., 2015) extended the scoring-based learning algorithm, and suggested the multidimensional clustering for the multiple partitions of data. Besides the multidimensional clustering, the model also has applications in computer vision,

. Corresponding authors

c 2020 Can Zhou, Xiaofei Wang, Jianhua Guo.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/20-365.html.

Zhou, Wang and Guo

probabilistic inference, and hierarchical topic detection. Wang and Li (2013) proposed a more ﬂexible and eﬀective human pose estimation in computer vision based on latent tree models. This method can combine two parts which are unlimited to the physical connections in the human body. Furthermore, it eﬀectively exploits the interactions between combined parts and single parts. Wang et al. (2008) applied the latent tree model to the approximate inference of the Bayesian network. They learned a latent tree model from data which is sampled from a Bayesian network, and made inference with the latent tree model instead of the original Bayesian network. It achieved good approximation accuracy at low online computational cost. Chen et al. (2016) used the latent tree models to handle hierarchical topic detection. Bottom-level variables are observed binary variables that represent whether the words appear in the document, and high-level variables are latent binary variables that give soft partitions of the documents and furthermore represent topics. The latent tree model can discover substantially better topics and topic hierarchies. In this paper, we mainly focus on learning latent tree models from samples. The learning of models includes two aspects. One is structural learning and the other is parameter estimation. The scoring-based algorithms (Zhang, 2004; Chen et al., 2016) search the optimal structure by hill-climbing with a scoring metric such as AIC (Akaike, 1974) or BIC (Schwarz, 1978). And the distance-based algorithms (Choi et al., 2011; Wang et al., 2017) can reconstruct the latent structure with an additive information distance. They are usually faster than the scoring-based ones and have theoretical guarantees for the structural learning. Once the latent structure is given, the matrix decomposition (Chang, 1996; Wang et al., 2017) and tensor decomposition (Anandkumar et al., 2014) are eﬃcient for the parameter estimation of discrete latent tree models. Moreover, Song et al. (2011) proposed a method based on the kernel embedding of distributions for latent tree models with continuous non-Gaussian observation. Though above algorithms contribute a lot to learning latent tree models, they can only handle pure discrete or continuous data. Typical multivariate problems may contain both continuous and discrete variables in the population survey data, biological and biomedical data, etc. In the graphical model setting, Lee and Hastie (2014) proposed a pseudo-likelihood method for handling mixed Gaussian and multinomial data. Fan et al. (2017) assumed that the observed binary data are obtained by dichotomizing a latent continuous variable, and proposed a semi-parametric model for modelling mixed continuous and binary data. To our best knowledge, there are no existing works on learning the hierarchical tree structure with latent variables for mixed data. In this paper, we address the latent hierarchical information mining for mixed data, and contribute three points for this mining. Firstly, we introduce a mixed latent tree model for modeling mixed data, and deﬁne an information distance between discrete and continuous variables. Secondly, we propose a bottom-up structural learning algorithm basing on this information distance. This structural learning algorithm has the probabilistic approximate consistency. Thirdly, we suggest a moment method for parameter estimation by exploiting matrix decomposition. Those moment estimators are asymptotically normal. The rest of this paper is organized as follows. In Section 2, some notions on latent tree model are reviewed and some assumptions used in this paper are given. In Section 3, we deﬁne a new information distance and give a ﬁnite sample bound required for the exact structural recovery. In Section 4, we propose a parameter estimation method and prove its

Learning Mixed Latent Tree Models

asymptotic normality for mixed latent tree models. The simulation studies and the real data analysis are conducted in Section 5.

2. Preliminaries

Let G = (W, E) be a simple graph, where W is the set of nodes and E is the set of edges. An edge between node u and node v is denoted by (u, v), and we call that u is adjacent to v. If edge (vj 1, vj) E for any j = 1, , k, the set of distinct nodes [v0, v1, , vk] is referred to as a length-k path from v0 to vk in G. Furthermore, a path [v0, v1, , vk] is referred to as a cycle in G if v0 = vk. We call G a connected graph if for any nodes u, v W, there is a path [v0 = u, , vk = v] in G. Let A, B be two disjoint node subsets. We call that A, B are separated by a node subset S if for any nodes u A, v B, every path in G from u to v contains a node in S. A connected simple acyclic graph is called a tree and denoted as T. A pair of leaves {u, v} is a sibling pair on T if nodes u and v in T are adjacent to a same node. The number of nodes on the longest path of a tree T is referred to as the diameter of the tree and we denote it as diam(T). Let XW = {Xv}v W be a random vector where W corresponds to a set of nodes, and let X(1) W , . . . , X(n) W denote i.i.d. samples of size n. A family of probability distribution over G is referred to as a graphical model (Lauritzen, 1996), if it satisﬁes the conditional independence: XA, XB are conditionally independent given XS when two disjoint node subsets A, B are separated by a node subset S in G . Let T = (W, E) be a tree. If the leaves of T are all observed variables and the internal nodes are latent variables, the graphical model T on T is referred to as a latent tree model (Zhang, 2004). Furthermore, the graphical model T is called the mixed latent tree model, if the tree T contains both discrete and continuous variables. In this paper, we mainly discuss one mixed case that latent variables are binary, and observed variables are binary or conditional Gaussian given its adjacent variable. We denote the set of observed nodes in W as V(with m = |V|), and the set of latent nodes in W as H. Hence W = V H. Furthermore, we denote the set of continuous nodes in V as Vc, and the set of discrete nodes in V as Vd. Hence V = Vc Vd. We refer to u, v as bifurcation nodes of w if nodes u, v are observed and the path from u to v contains w. If we choose a node on tree T as the root, we can obtain a directed tree T = (W, E) by assigning the edge direction from the root to the leaves. The element u v in E represents a directed edge from u to v. Node u is called as a parent of node v and node v is called as a child of node u. We denote all child nodes of u as ch(u) and denote the parent node of v as pa(v). An ordered set of distinct nodes L = [v0, v1, , vk] is a length-k directed path from v0 to vk in T if the directed edge vj 1 vj E for all j = 1, , k. If there exists a directed path in T from u to an observed node v, we say v is a bifurcation node of u in T . Here we consider three assumptions to characterize the relationship between variables.

(A1) The correlation coeﬃcient of any two variables is nonzero. (A2) Each latent variable has three neighbors at least. (A3) Any two variables connected by an edge on the tree are not completely dependent.

These assumptions are routinely used in the graphical model setting because they can provide a guarantee for the identiﬁability of the graphical tree model. If Assumption (A1) is violated, our learning model may be disconnected, thus the model is not a graphical tree model. This

Zhou, Wang and Guo

assumption can be relaxed when we consider a graphical forest model with several connected components. Assumptions (A2) and (A3) ensure that a latent tree does not include a redundant latent node (Choi et al., 2011). If (A2) or (A3) is violated, there may exist a redundant latent variable in the tree model. This can further cause the non-identiﬁability of the model.

3. Structural Learning for Mixed Latent Tree Models

In this section, we ﬁrst deﬁne an information distance between binary and continuous variables. And then we propose a bottom-up structural learning algorithm basing on this information distance. Finally, we give a ﬁnite sample bound for the exact structural recovery of this learning algorithm.

3.1 Information Distance

In this subsection, we deﬁne the information distance between two variables, and prove that the distance has an additivity along paths on mixed latent trees. This additivity is an important tool for designing a structural learning algorithm of latent tree models. The learning algorithm will be suggested in the next subsection. Let T = (W, E) be a tree and T be a mixed latent tree model on T. For variable Xu and Xv where u, v W, we deﬁne the information distance between them:

duv := log(|ρuv|), (1)

where ρuv = Cov(Xu,Xv)

Var(Xu) Var(Xv) is the correlation coeﬃcient of variables Xu and Xv.

The correlation coeﬃcient relies on the covariance of two random variables. By the double expectation formula, the covariance may be further decomposed into a product of several quantities in term of the conditional independence on the graphical tree model. For some speciﬁc distributions, the correlation coeﬃcient ρuv could be presented as the product of two correlation coeﬃcients ρuh and ρhv when variables Xu and Xv are conditional independent given variable Xh. Thus the information distance (1) has the additivity on the graphical tree model for an appropriate distribution. For binary variable Xu and Xv, the form (1) is equivalent to the information distance (Chang, 1996)

| det(Puv)| p

det(Puu) det(Pvv)

where Puv is the joint probability matrix (P(Xu = a, Xv = b))a,b=0,1 of Xu and Xv. For Gaussian variables Xu and Xv, the form in (1) is also the information distance (Choi et al., 2011). In this paper, we mainly consider that variables are binary or conditional Gaussian. According to the Assumptions (A1) and (A3), we know that the correlation coeﬃcient 0 < |ρuv| < 1 and further the information distance 0 < duv < + for any u, v W. The following theorem establishes that the information distance is additive along paths. The proof can be found in the Appendix.

Learning Mixed Latent Tree Models

Theorem 1 Let T = (W, E) be a tree and T be a mixed latent tree model on T. If node set [v0 = u, v1, , vk = v] is a path from u to v on T, we obtain that:

l=0 dvlvl+1.

The proof of Theorem 1 mainly employs conditional independence in the tree model and the restriction that latent variables are binary. So the additivity of the information distance does not rely on the speciﬁc distributions of observed variables. The structural learning algorithm in the next subsection could also apply to mixed data from multinomial distributions or conditional non-Gaussian distributions.

3.2 Structural Learning Algorithm for Mixed Latent Tree Models

From Theorem 1, the information distance (1) has the additivity in the mixed latent tree model. Our structural learning method is from the bottom-up SLLT algorithm (Wang et al., 2017). This algorithm itself does not rely on the type of data. It only requires information distances as its inputs for recovering latent trees. This algorithm can output the mixed latent tree correctly within the time O(diam(T)m3) if the true information distances are available, where m is the number of observed variables. In the following, we illustrate the SLLT algorithm in detail by using the mixed latent tree T shown in Figure 1, where v1, , v12 are observed variables and h1, , h8 are latent variables. We refer to V = {v1, , v12} and H = {h1, , h8} as the observed variable set and the latent variable set respectively, and let W = V H. We use node symbols , , to represent a latent variable, a discrete observed variable, a continuous observed variable respectively. The SLLT algorithm uses the information distances among the observed variable set V of T to reconstruct the unknown latent tree structure.

(a) The mixed latent tree structure

(b) The process of structural learning

Figure 1: Example of structural learning.

For any three observed variables vi, vj and vk, we compute all the information distance diﬀerences Φvivjvk = dvivk dvjvk. When Φv1v2v3 = Φv1v2v4 = = Φv1v2v12, then

Zhou, Wang and Guo

Algorithm 1: Structural Learning for Latent Trees (SLLT)

Input : Observed variables V and information distances duv for any u, v V; Output: A tree structure T;

1. A V. D φ. For any u V, D(u) φ; 2. If |A| 3, compute Φv1v2v3 = dv1v3 dv2v3 for any three variables v1, v2, v3 A;

1 . For any v1, v2 A, if Φv1v2z is constant for any z A\{v1, v2}, then {v1, v2} are a sibling pair in T. 2 . Denote maximal sibling groups by {Πl}L l=1,

A A\ L l=1 Πl.

3 . For any l = 1, , L, add a new latent variable hl and connect hl to every node in Πl. A A {hl}, D(hl) u Πl{u}, D D ( u Πl{u}).

3. While |A| 3,

1 . For any v A V and u A\V, Z V\(D(u) {v}) and choose bifurcation variables v1, v2 of u in D. If Φvv1z is constant and Φvv1z = Φvv1v2 for any z Z, then {v, u} is a sibling pair in T(W\D). 2 . For any two variables u, w A\V, Z V\(D(u) D(w)) and choose bifurcation variables v1, v2 of u and v3, v4 of w in D. If Φv1v3z = Φv1v3v4 and Φv3v1z = Φv3v1v2 for any z Z, then u is a remaining child of w in T(W\D). For any two variables u, w A\V, if neither u nor w is a remaining child, Z V\(D(u) D(w)) and choose bifurcation variables v1, v2 of u and v3, v4 of w in D. If Φv1v3z is constant and Φv1v3z = Φv1v3v4, Φv3v1z = Φv3v1v2 for any z Z, then {u, w} is a sibling pair in T(W\D). 3 . Denote the remaining child relations and maximal sibling groups by {Πl}L l=1.

A A\ L l=1 Πl.

4 . For any l = 1, , L, if Πl = {u, w} and u is remaining child of w, then connect u and w. A A {w}, D(w) D(w) D(u) {u} and D D {u}. if Πl is a sibling group, then add a new latent variable hl and connect hl to every node in Πl. A A {hl}, D(hl) u Πl(D(u) {u}) and D D ( u Πl{u}).

4. If |A| = 2, connect the two remaining variables in A; 5. Return the structure generated;

Learning Mixed Latent Tree Models

{v1, v2} is a sibling pair in T. Similarly, we obtain that {v1, v3}, {v2, v3}, {v4, v5}, {v6, v7}, {v8, v9}, {v10, v11} are also sibling pairs in T. Thus {v1, v2, v3}, {v4, v5}, {v6, v7}, {v8, v9}, {v10, v11} are ﬁve maximal sibling groups, and ﬁve latent variables h5, h2, h6, h7, h8 are detected as their parent variables respectively. Then we have D1 = {v1, , v11} and A1 = {h2, h5, h6, h7, h8, v12}, where the subscripts of D and A are used to indicate the iterative step. We construct a subtree T(W\D1) by discarding all variables in D1 from T, and A1 contains all of the leaf variables {h5, h6, h7, h8, v12} of T(W\D1). We can recover local structures among observed variables at step 2 of the SLLT algorithm. At step 3, we reconstruct the structures with latent variables. Firstly, we ﬁnd out the observed-latent sibling pairs in A1. For v12, h8 A1, we choose the observed variables v10, v11 as bifurcation variables of h8 in D1. Since Φv12v10v is constant and Φv12v10v = Φv12v10v11 for v = v1, v2, , v9, we have that {v12, h8} is a sibling pair in T(W\D1). Secondly, we ﬁnd out the remaining-child relationship in A1. For {h2, h5} in A1, we choose {v4, v5} as bifurcation variables of h2 and {v1, v2} as bifurcation variables of h5. Since Φv1v4v = Φv1v4v5 and Φv4v1v = Φv4v1v2 for v = v6, , v12, we ﬁnd that h5 is a remaining child variable of h2. Thirdly, we judge the latent-latent sibling pair relationship in A1. For {h6, h7} in A1, we choose {v6, v7} as bifurcation variables of h6 and {v8, v9} as bifurcation variables of h7. Since Φv6v8v is constant and Φv6v8v = Φv6v8v9, Φv8v6v = Φv8v6v7 for v = v1, , v5, v10, v11, v12, we have that {h6, h7} is a sibling pair in T(W\D1). Thus, {h6, h7} and {h8, v12} are two maximal sibling groups, and two latent variables h3, h4 are added as their parent variables respectively. Then, we have D2 = {v1, , v12, h5, , h8} and A2 = {h2, h3, h4}. Similarly, A2 contains all the leaf variables {h2, h3, h4} in the subtree T(W\D2). Finally, we obtain that {h2, h3, h4} forms a sibling group through similar steps. Then we add a latent variable h1 as its parent variable and the algorithm ends.

3.3 Finite Sample Bound for the Structural Learning Algorithm

In this subsection, we give a ﬁnite sample bound for the exact structural recovery of this learning algorithm. To apply the SLLT algorithm to data, we replace the correlation coefﬁcient ρuv with its sample version

X(k) u Xu X(k) v Xv

X(k) u Xu 2 s

X(k) v Xv 2

for nodes u, v V. Furthermore, we compute the sample information distances ˆduv = log |ˆρuv|, and put them into the SLLT algorithm. In the algorithm, we identify the relations between variables by checking whether the information distance diﬀerence Φuvw is equal to some constant or not. However, in the samplebased SLLT algorithm, ˆΦuvw and ˆΦuvz are almost impossible to be exactly equal even if the true diﬀerences are equal due to the error of estimate. Since |ˆΦuvw ˆΦuvz| |Φuvw Φuvz| when n , we determine the equality of Φuvw and Φuvz if |ˆΦuvw ˆΦuvz| < ε, where ε is a prescribed positive threshold. Moreover, we deﬁne a lower bound notation φmin := min {|Φuvw Φuvz| : Φuvw = Φuvz, u, v, w, z V} and take a threshold ε min{1

2φmin, 1}. If the diﬀerence |(ˆΦuvw ˆΦuvz) (Φuvw Φuvz)| < ε when the sample size n is suﬃciently

Zhou, Wang and Guo

large, we obtain that Φuvw = Φuvz if and only if |ˆΦuvw ˆΦuvz| < ε. Therefore, if the event {|(ˆΦuvw ˆΦuvz) (Φuvw Φuvz)| < ε for any u, v, w, z V} occurs with a high probability when the sample size n is suﬃciently large, we can learn the true latent tree structure from the sample-based SLLT algorithm with a high probability. The consistency of the sample-based SLLT algorithm is built on the tail probability inequality on the sample covariance

l=1 (X(l) u Xu)(X(l) v Xv),

where u, v are two leaf nodes on the latent tree. We need the following notations:

µ = max |E Xv|Xpa(v) = x | : v Vc, x = 0, 1 ,

σ2 = max Var Xv|Xpa(i) = x : v Vc, x = 0, 1 ,

c = C max{σ4, σ2µ2, µ4, σ2, µ2, 1},

where the constant C does not depend on nodes on the tree.

Theorem 2 The inequality

Suv Cov(Xu, Xv) >

c(t + log 48)

holds for any two leaf nodes u, v in T and any t > 0.

The consistency of the SLLT algorithm relies on two intrinsic parameters φmin := min {|Φuvw Φuvz| : Φuvw = Φuvz, u, v, w, z V} and cmin := minu,v V |Cov(Xu, Xv)| where V is the set of observed nodes. The following theorem shows the relationship between the sample size and the intrinsic parameters of the model when the true latent tree structure is learned.

Theorem 3 Let η (0, 1). The SLLT algorithm can return the true mixed latent tree with a probability of at least 1 η, if the sample size n is suﬃciently large such that the inequality r

c(log(48m2) log η)

n < cmin min{1

2φmin, 1} 16 , (3)

Theorem 3 is obtained from Theorem 2, and it provides a lower bound

256 c(log(48m2) log η)

c2 min min{1

4φ2 min, 1}

of the sample size n for recovering the true structure with a probability of at least 1 η. Intrinsic parameters φmin and cmin depend on the number m of observed nodes and the true joint distributions. When the number m is large, intrinsic parameters may be small. Furthermore, a large sample size is required for recovering mixed latent tree structures with a high probability.

Learning Mixed Latent Tree Models

4. Parameter Estimation for Mixed Latent Tree Models

In this section, we consider the parameter estimation for the mixed latent tree model T with its tree structure T given. We choose a latent variable r as the root and further construct a directed tree T from the tree T and the root r. Model parameters (Chen et al., 2017) in the latent tree model consist of a marginal distribution for the root Xr, and all the conditional distributions for variables given their parents on the directed tree T . In Subsections 4.1 and 4.2, we ﬁrst assume that the expectation µu := EXu is zero for any continuous variable u Vc. Under this assumption, we propose a moment method to estimate all the model parameters along the directed tree T by using matrix decomposition. Finally, we also discuss the parameter estimates for non-zero expectation continuous variables. First, we introduce some notations. For a latent node h H, let

ph := P(Xh = 1),

µv|Xh=xh := E(Xv|Xh = xh) for a continuous observed node v,

µ(2) v|Xh=xh := E(X2 v|Xh = xh) for a continuous observed node v,

pv|Xh=xh := P(Xv = 1|Xh = xh) for a discrete observed node v,

where xh = 0, 1. In particular, if node h is a parent of node v on the tree T , we simplify the notations µv|Xh=xh, µ(2) v|Xh=xh, pv|Xh=xh by µv|xh, µ(2) v|xh, pv|xh respectively.

4.1 Parameter Representation for Mixed Latent Tree Models

Motivated by Chang (1996), we study a local structure consisting of three observed nodes u, v, w V and a latent node h H. u and v are bifurcation nodes of h. v and w are also bifurcation nodes of h. Figure 2 illustrates this local structure, which implies that observed variables Xu, Xv, Xw are conditionally independent given the latent variable Xh. Unlike Chang s work (Chang, 1996) only considering discrete variables, we allow variables Xu, Xv, Xw to be continuous. So there are eight cases of three observed variables depending on whether the variable is binary or continuous. In the following, we ﬁrst provide the parameter representation of models in the case that Xu, Xv, Xw are all continuous. And then we show the general representation in other cases.

Figure 2: A local structure of three observed nodes and one latent node.

If nodes u, v, w Vc, the moment EXlu u Xlv v Xw can be decomposed into

EXlu u Xlv v Xw = µ(lu) u|Xh=0 µ(lu) u|Xh=1 (1 ph) µw|Xh=0 0 0 ph µw|Xh=1

µ(lv) v|Xh=0 µ(lv) v|Xh=1

Zhou, Wang and Guo

for exponents lu, lv {1, 2}, since Xu, Xv, Xw are conditionally independent given Xh. We

denote Euvw, Γu|h, and Γv|h as matrices EXu Xv Xw EXu X2 v Xw EX2 u Xv Xw EX2 u X2 v Xw

µu|Xh=0 µu|Xh=1 µ(2) u|Xh=0 µ(2) u|Xh=1

µv|Xh=0 µv|Xh=1 µ(2) v|Xh=0 µ(2) v|Xh=1

, respectively. We further have that

Euvw = Γu|h

(1 ph) µw|Xh=0 0 0 ph µw|Xh=1

ΓT v|h. (4)

Denote Euv as EXu Xv EXu X2 v EX2 u Xv EX2 u X2 v

. We also have that

Euv = Γu|h 1 ph 0 0 ph

ΓT v|h. (5)

By Assumption (A1), the matrix Euv is invertible since continuous variables have zero means. Thus

Auvw := Euvw E 1 uv = Γu|h µw|Xh=0 0 0 µw|Xh=1

This eigen-decomposition of matrix Auvw, determined by the joint distribution of observed variables (Xu, Xv, Xw), forms the representation of model parameters. Speciﬁcally, conditional means µw|Xh=0, µw|Xh=1 are eigenvalues of matrix Auvw. Columns of conditional moment matrix Γu|h are eigenvectors of matrix Auvw. To compute Γu|h from the eigenvector space of matrix Auvw, we still need two other restrictions:

EXw = (1 ph) µw|Xh=0 + ph µw|Xh=1, (6)

and EXu EX2 u

= Γu|h 1 ph ph

Combining equations (4), (5), (6) and (7), we can compute the conditional expectation matrix Γu|h and the marginal probability ph = P(Xh = 1) using the moments of observed variables. Particularly, if node h is the root r, we can actually obtain the marginal probability of the root. In above discussion, we show that in the case that Xu, Xv, Xw are all continuous, model parameters can be obtained by solving the moment equations (4), (5), (6) and (7). In the following part, we handle the general case that allows observed variables binary. For any

Learning Mixed Latent Tree Models

EXu Xv Xw EXu X2 v Xw EX2 u Xv Xw EX2 u X2 v Xw

, for u, v Vc;

EXu(1 Xv)Xw EXu Xv Xw EX2 u(1 Xv)Xw EX2 u Xv Xw

, for u Vc and v Vd;

E(1 Xu)Xv Xw E(1 Xu)X2 v Xw EXu Xv Xw EXu X2 v Xw

, for u Vd and v Vc;

E(1 Xu)(1 Xv)Xw E(1 Xu)Xv Xw EXu(1 Xv)Xw EXu Xv Xw

, for u, v Vd,

EXu Xv EXu X2 v EX2 u Xv EX2 u X2 v

, for u, v Vc;

EXu(1 Xv) EXu Xv EX2 u(1 Xv) EX2 u Xv

, for u Vc and v Vd;

E(1 Xu)Xv E(1 Xu)X2 v EXu Xv EXu X2 v

, for u Vd and v Vc;

E(1 Xu)(1 Xv) E(1 Xu)Xv EXu(1 Xv) EXu Xv

, for u, v Vd,

and also let Auvw denote the matrix Euvw E 1 uv . Let

µu|Xh=0 µu|Xh=1 µ(2) u|Xh=0 µ(2) u|Xh=1

, u Vc; 1 pu|Xh=0 1 pu|Xh=1 pu|Xh=0 pu|Xh=1

, u W\(Vc {r}),

µw|Xh=0 0 0 µw|Xh=1

, w Vc; pw|Xh=0 0 0 pw|Xh=1

Similar to the case that Xu, Xv, Xw are all continuous, the eigen-decomposition

Auvw = Γu|h Λw|h Γ 1 u|h (8)

holds for any three observed nodes u, v, w V. If node u Vc, the matrix Γu|h can be computed by a similar way in the case that Xu, Xv, Xw are all continuous. If node u Vd, we can also obtain the matrix Γu|h by replacing the restriction equations (6) and (7) with a natural restriction 1T Γu|h = (1, 1) (Wang et al., 2017). So for any observed node u V, we can obtain the matrix Γu|h using the eigen-decomposition. If node h is just a parent of node

u on the tree T , the matrix Γu|h is exactly model parameters of conditional distribution for variable Xu given its parent variable Xh.

Zhou, Wang and Guo

4.2 Parameter Estimation Algorithm for Mixed Latent Tree Models

Beyond the local structure shown in Figure 2, we consider a little more complex structure shown in Figure 3 (a) of four observed nodes and two latent nodes. Let node h2 be the root of the tree. Nodes u, v1 are two bifurcation nodes of h1, and nodes v2, w are two bifurcation nodes of h2. By the separation in Figure 3 (b), variables Xu, Xv1, Xw are conditionally independent given Xh1, and variables Xu, Xv2, Xw are conditionally independent given Xh2. Hence parameter matrices Γu|h1 and Γu|h2 can be computed as discussed in Subsection 4.1. Furthermore, we get the parameter matrix

Γh1|h2 = Γ 1 u|h1 Γu|h2 (9)

related to two latent variables.

Figure 3: A local structure of four observed nodes and two latent nodes.

As discussed above, we can compute all the model parameters by using the joint distributions of three observed variables. Then we further suggest the PEMT algorithm for the parameter estimation of mixed latent tree models. According to Assumption (A2), every latent variable h has three neighbors at least. If |C| = 2 in step 8 of the PEMT algorithm, there exists an observed variable c V such that the path from h to c on T does not contain any child of h. To guarantee that the column states of all Γu1|h, Γu2|h, are matched for h, we need to record the label states of u1 and u2 from h and perform the corresponding matrix decomposition in equation (8) according to the label states. Furthermore, since the parameter matrix Γu|h1 is invertible, we can compute the parameter matrix Γh1|h2 related to two latent variables by the equation (9). Our parameter estimation algorithm can reduce to the PELT algorithm (Wang et al., 2017) if the data is pure binary. Speciﬁcally, the matrices

Euvw = E(1 Xu)(1 Xv)Xw E(1 Xu)Xv Xw EXu(1 Xv)Xw EXu Xv Xw

= Pr(Xu = 0, Xv = 0, Xw = 1) Pr(Xu = 0, Xv = 1, Xw = 1) Pr(Xu = 1, Xv = 0, Xw = 1) Pr(Xu = 1, Xv = 1, Xw = 1)

Euv = E(1 Xu)(1 Xv) E(1 Xu)Xv EXu(1 Xv) EXu Xv

= Pr(Xu = 0, Xv = 0) Pr(Xu = 0, Xv = 1) Pr(Xu = 1, Xv = 0) Pr(Xu = 1, Xv = 1)

Learning Mixed Latent Tree Models

Algorithm 2: Parameter Estimation for Mixed Latent Trees (PEMT)

Input : A latent tree with a root, the ﬁrst order moments EXu for u V, the second order moments EX2 u for u Vc and the moment matrices Euvw and Euv for u, v, w V. Output: All conditional probability matrices on edges in T.

1: Construct a directed tree T and compute the matrices Auvw for u, v, w V.

2: for h H do

3: ﬁnd all child variables ch(h) of h in T ;

4: for z ch(h) do

5: ﬁnd a directed bifurcation variable u of z.

7: Collect the set C of all the bifurcation variables {u1, u2, } of all child variables ch(h) of h.

8: if |C| = 2 then

9: ﬁnd an observed variable c V such that the path from h to c on T does not contain any child of h.

11: Compute Γu1|h, Γu2|h, and ph by matrix decomposition in equation (8).

12: end for

13: for h2 H and h1 ch(h2) do

14: if h1 H then

15: choose a common directed bifurcation variable u of h1 and h2, and compute parameter matrix Γh1|h2 by equation (9) ;

17: end for

18: Return the parameters pr and Γu|pa(u), Γh|pa(h) for u V, h H\{r}.

By direct computation, the matrix Euvw E 1 uv has the spectral decomposition form used in the PELT algorithm. For the sample-based version of the PELT, the work (Wang et al., 2017) lacks of the asymptotic normality guarantee for the algorithm s output. For the PEMT, we provide the asymptotic normality in the following Theorem 4, which is also applied to the PELT for handling pure binary data. For obtaining a sample-based PEMT algorithm, we replace the moments by their sample moments. For the basic structure in Figure 2, we take the case that u, v, w Vc as an example. Moments Euvw, Euv EXu, EX2 u, and EXw are replaced by their sample moments ˆEuvw, ˆEuv, Xu, X2u, and Xw respectively, where

1 n Pn l=1 X(l) u X(l) v X(l) w 1 n Pn l=1 X(l) u (X(l) v )2X(l) w 1 n Pn l=1(X(l) u )2X(l) v X(l) w 1 n Pn l=1(X(l) u )2(X(l) v )2X(l) w

1 n Pn l=1 X(l) u X(l) v 1 n Pn l=1 X(l) u (X(l) v )2

1 n Pn l=1(X(l) u )2X(l) v 1 n Pn l=1(X(l) u )2(X(l) v )2

Zhou, Wang and Guo

l=1 X(l) u , X2u = 1

l=1 (X(l) u )2, Xw = 1

l=1 X(l) w .

By further solving equations (4), (5), (6) and (7), we obtain a moment estimator ˆΓu|h of the true parameter matrix Γu|h. From the property of moment estimators in van der Vaart (2000), we have the following theorem illustrating that our estimator converges to the true one in the meaning of asymptotic normality. The detailed proof is put into the Appendix.

Theorem 4 Assume that the expectation EXu is zero for any continuous node u Vc. In the sample-based PEMT algorithm, the moment estimators ˆpr, ˆΓu|pa(u) and ˆΓh|pa(h) satisfy n(ˆpr pr), n(ˆΓu|pa(u) Γu|pa(u)) and n(ˆΓh|pa(h) Γh|pa(h)) are asymptotically normal, where n is the sample size, r is the root node and u V, h H\{r}.

Note that Theorem 4 requires a zero-expectation assumption. For node u Vc, the zero expectation of variable Xu guarantees the non-singularity of the conditional moment matrix Γu|h and the diagonal matrix Λw|h. So it also ensures that the eigen-decomposition (8) for matrix Auvw is valid. Theoretically, we can replace non-zero mean continuous variables with zero mean continuous variables. Speciﬁcally, for a continuous node u Vc, let e Xu = Xu µu where µu = EXu. Variable e Xu has a zero expectation. Moreover, the conditional distribution e Xu|Xpa(u) = x N eµu|x, eσ2 u|x , where the parameters eµu|x = µu|x µu and eσ2 u|x = σ2 u|x.

So there is an one-to-one map between the parameters for Xu and those for e Xu, if the expectation µu is known. By replacing all the continuous variables with their centralized variables, we obtain new matrices e Euvw, e Euv and e Auvw for any observed nodes u, v, w V. By the similar eigen-decomposition in Subsection 4.1, we can obtain eµu|0, eµu|1, eµ(2) u|0, eµ(2) u|1 for any node u Vc. We can further compute the original model parameters µu|0, µu|1, σ2 u|0, σ2 u|1 by equations:

µu|x = eµu|x + µu, σ2 u|x = eσ2 u|x = eµ(2) u|x eµ2 u|x, and µ(2) u|x = σ2 u|x + µ2 u|x, (10)

where x = 0, 1. In the numerical computation, we also suggest to centralize all the continuous variables before using the PEMT algorithm. For any node u Vc, let e X(l) u = X(l) u Xu and replace the original observation X(l) u with e X(l) u where l = 1, , n. We can obtain the

parameter estimation {ˆeµu|0, ˆeµu|1, ˆeµ (2) u|0, ˆeµ (2) u|1, u Vc} by the sample-based PEMT algorithm. Furthermore, the original parameters can be computed using equations in (10).

5. Numerical Experiment

In this section, we performed numerical experiments on both simulated and real data sets. In Subsection 5.1, we show the consistency of the SLLT algorithm and the PEMT algorithm on the simulated data, which was generated from four mixed latent tree structures. For parameter estimation, we compared the PEMT algorithm with the conventional EM algorithm. For structural learning, we further performed a simulated experiment for highdimensional data with one thousand observed variables. In Subsection 5.2, we apply our

Learning Mixed Latent Tree Models

algorithm to a Forest Cover Type dataset for mining the hierarchical structure and latent information. All of the experiments were performed using R on a desktop with an Intel Core i5-3470 CPU 3.2 GHz and 16 GB RAM.

5.1 Simulation Study

We generated data sets from four mixed latent tree models shown in Figure 4. We use node symbols , , to represent a latent variable, a discrete observed variable, a continuous observed variable respectively. Models 1, 3 and 4 have similar structures, but the ratios of the number of continuous variables to the number of observed variables set V are diﬀerent. And the structure of model 1 is similar to that of model 2, where we restricted every latent variable to have three observed neighbors. The model parameters were generated randomly such that |pu|0 pu|1| 0.3 for u Vd and |µu|0 µu|1| 0.5 for u Vc, which ensure that the information distances are limited.

(a) model 1

(b) model 2

(c) model 3

(d) model 4

Figure 4: Four mixed latent tree models used in the simulation study.

As shown in Subsection 3.2, we determine the basic sibling pair by using a prescribed threshold ε > 0. Particularly, if the diﬀerence |ˆΦuvw ˆΦuvz| < ε for w, z V\{u, v}, we judge that {u, v} is a sibling pair. According to Wang et al. (2017), since a longer distance estimate is less accurate for a given number of samples, not all estimated distances can be used for structural learning reliably. We only considered the possible sibling pair {u, v} whose estimated distances ˆduv, ˆduw, ˆdvw are controlled by two thresholds τ1, τ2. In particular, for each pair of nodes {u, v} satisﬁed ˆduv < τ1, the estimated diﬀerence ˆΦuvw was computed only for node w Kuv = {w V\{u, v} max{ ˆduw, ˆdvw} < τ2}. Furthermore, if the diﬀerence |ˆΦuvw ˆΦuvz| < ε for any w, z Kuv and ˆduv < τ1, we consider {u, v} to be a sibling pair. When we increase the threshold ε, it is apparent that the number of observed

Zhou, Wang and Guo

nodes belonging to the same sibling group tends to increase, while the number of individual nodes tends to decrease. So a larger ε makes it easier to obtain a tree structure. In the structural learning simulation, we started the value of the threshold ε from 0.1 and let it increase with the step size 0.1 until the SLLT algorithm obtained a tree. We set τ1 = 3 and τ2 = 5.

To assess the consistency of the SLLT algorithm and the PEMT algorithm, we varied the sample size among 10k, 30k, 60k, 100k, 300k, 600k, 1000k. For each sample size, we did 500 experiments with randomly generated parameters in four mixed latent tree models. The SLLT algorithm may fail in one experiment if it does not ﬁnd the real latent tree structure. The performance of the SLLT was evaluated by its failure rate. For the PEMT algorithm, its performance was assessed by the average estimate error in 500 experiments. Figure 5 shows that the failure rate and the estimation error decreased as the sample size increased.

10k 30k 60k 100k 300k 600k 1000k Sample Size

Failure Rate

Model1 Model2 Model3 Model4

(a) The SLLT algorithm

10k 30k 60k 100k 300k 600k 1000k Sample Size

Estimation Error

Model1 Model2 Model3 Model4

(b) The PEMT algorithm

Figure 5: The consistency of algorithms.

As shown in Figure 5 (a), the SLLT algorithm performed much better with model 3 than model 4, which indicates that the structure of continuous variables is better to be recovered than that of discrete ones. The SLLT algorithm worked better with model 1 than model 2, since the additional nodes in model 2 relative to model 1 increase the probability of misjudgment. Similarly, as shown in Figure 5 (b), the PEMT algorithm had a better performance on the model 3 than the model 1. Moreover, the algorithm performed better with model 1 than model 4. As the proportion of the continuous parts increased in the observed variables, the performance of the algorithm could get better. The PEMT algorithm performed best with model 2, since the additional leaves in model 2 relative to model 1 can enhance the parameter estimation accuracy (Wang et al., 2017).

We also compared the performance of the EM algorithm and the PEMT algorithm using those mixed latent tree models. We varied the sample size among 10k, 30k, 60k, 100k. Given a group of model parameters, we generated 100 random datasets for each sample size and each of the four structures in Figure 4, where node symbols , , represent a latent variable, a discrete observed variable, a continuous observed variable respectively. We ran the EM with random initialization and 100 iterations. In Table 1, we list the average estimation accuracy and the average time costs for both algorithms. The estimation

Learning Mixed Latent Tree Models

accuracy is measured by the mean absolute error (MAE) and the mean squared error (MSE). The MSEs are listed in the brackets of Table 1.

Table 1: Estimation Accuracy

Model Sample Size Estimation Accuracy ( 10 2) Time Costs (seconds) EM PEMT EM PEMT

10k 3.38542(0.45116) 5.96665(1.79734) 245.27 1.48 30k 3.18377(0.43035) 3.74196(0.79397) 706.07 1.50 60k 3.12205(0.43014) 2.78707(0.439) 1380.06 1.54 100k 3.09361(0.42536) 2.01234(0.20053) 2328.86 1.61

10k 0.71304(0.01021) 1.93799(0.21545) 379.61 1.75 30k 0.41582(0.00354) 1.14539(0.03704) 1103.09 1.80 60k 0.30062(0.00183) 0.79623(0.01761) 2187.71 1.88 100k 0.23239(0.00114) 0.62655(0.01071) 3701.19 1.97

10k 1.72369(0.17865) 5.34733(1.54988) 379.64 1.46 30k 1.49627(0.17578) 3.18196(0.43873) 1098.84 1.49 60k 1.42979(0.1789) 2.23121(0.19593) 2193.00 1.54 100k 1.38426(0.17731) 1.6833(0.11438) 3664.98 1.60

10k 6.93965(1.06076) 6.73677(1.4187) 86.61 1.47 30k 6.81237(1.05467) 4.37126(0.67131) 225.16 1.50 60k 6.77787(1.04613) 3.39273(0.40884) 443.07 1.54 100k 6.77533(1.04971) 2.58861(0.24045) 742.43 1.61

As the sample size went up, the estimation accuracy of the PEMT improved quickly with a little increasing time cost, while the time cost of the EM increased rapidly with a slow accuracy improvement for some models. Speciﬁcally, the EM algorithm had a slow improvement on the MSE for models 1, 3, and 4 as the sample size increased. So the EM failed to provide the maximum likelihood estimates in some experiments for models 1, 3, and 4 since the asymptotic variance of the maximum likelihood estimate depends on the reciprocal of the sample size. The PEMT algorithm performed better than the EM with models 1 and 4 when the sample size is large enough. For model 2, the EM outperformed the PEMT, and both algorithms had a low MAE and a low MSE. The reason may lie in that compared to models 1, 3, and 4, more observed variables in model 2 can provide more information from data for the parameter estimation. If the sample size is equal to 100k, the MAE and the MSE of the PEMT algorithm were 0.62655 10 2 and 0.01071 10 2

respectively. The execution speed was much faster with the PEMT algorithm, and the EM algorithm had huge time costs with a large sample size because the EM algorithm updates the statistics based on every sample and numerous iterations of all the samples are required. We designed an experiment on learning a latent tree structure with 1000 observed variables. A schematic diagram is shown in Figure 6 for the latent tree structure. We considered three cases of observed variables for this structure. The ﬁrst case is that observed variables are pure discrete. The second one is that the ﬁrst half of observed variables {v1, , v500} are continuous and the others are discrete. The last one is that observed variables are pure continuous.

Zhou, Wang and Guo

Figure 6: A schematic diagram for latent tree with pure discrete observed variables.

10k 30k 60k 100k 300k 600k 1000k Sample Size

RF Distance

Pure_Discrete Half_Discrete_Half_Continuous Pure_Continuous

(a) RF Distance

10k 30k 60k 100k 300k 600k 1000k Sample Size

Error in Latent Variables

Pure_Discrete Half_Discrete_Half_Continuous Pure_Continuous

(b) Error in Latent Variables

Figure 7: The consistency of the SLLT algorithm.

The performance of our method was assessed using the Robinson Foulds (RF) distance (Robinson and Foulds, 1981) and the error in the number of latent variables. Figure 7 illustrates the consistency performance of the SLLT algorithm for this high-dimensional structural recovery. In three cases, both the RF distance and the error in latent variables decreased as the sample size increased. So our method can still work well for learning high-dimensional mixed latent tree models when the sample size is large enough.

5.2 Real Data Analysis

In this part, we applied the SLLT algorithm to a Forest Cover Type dataset, which is available from the University of California, Irvine (UCI) machine learning data set repository. This dataset with 581012 samples includes ten continuous variables, forty soil types, and seven forest cover types in the Roosevelt National Forest of northern Colorado. Since the cover type Lodgepole pine accounts for more than 48% of the total samples, we studied the latent hierarchical structure related to the type Lodgepole pine.

Learning Mixed Latent Tree Models

Table 2: Observed variables and their Ids Id Variable Id Variable 1 Elevation 30 Haplocryolls 2 Aspect 31 Haplustalfs 3 Slope 32 Haplustolls 4 Vertical distance to nearest surface water features 33 Histic Cryaquolls 5 Horizontal distance to nearest surface water features 34 Hiwan family 6 Horizontal distance to nearest roadway 35 Legault family 7 Horizontal distance to nearest wildﬁre ignition points 36 Leighcan family 8 Hillshade index at 9am 37 Lithic Cryorthents 9 Hillshade index at Noon 38 Matcher family 10 Hillshade index at 3pm 39 Moran family 11 Aquic Argiudolls 40 Pachic Argiustolls 12 Argicryolls 41 Pachic Haplustolls 13 Argiustolls 42 Ratake family 14 Barrett family 43 Rock land 15 Bross family 44 Rock outcrop 16 Bullwark family 45 Rogert family 17 Catamount family 46 Rubble land 18 Cathedral family 47 Scout family 19 Cerro family 48 Supervisor family 20 Cryaquepts 49 Tolby family 21 Cryaquolls 50 Troutville family 22 Cryoﬂuvents 51 Typic Argiustolls 23 Cryohemists 52 Typic Cryaquepts 24 Cryorthents 53 Typic Cryorthents 25 Cypher family 54 Typic Haplocryolls 26 Dystrocryepts 55 Typic Haplustolls 27 Eutrocryepts 56 Water 28 Frisco family 57 Wetmore family 29 Gateview family 58 Cover Type - Lodgepole pine

The original soil types are tagged using the US Forest Service Ecological Landtype Units (ELUs) , and each ELU consists of one or more basic soil components . We replaced the original types by those basic soil components, and considered the total 58 observed variables shown in Table 2 after preprocessing. Those observed variables are the elevation, the aspect, the slope, the distances (4), the hillshade indices (3), the soil components (47), and the forest cover type. The ﬁrst 10 variables are continuous. The following 47 soil components are binary. The forest cover type is also binary depending on whether the Lodgepole pine exists or not. Figure 8 presents the learned mixed latent tree structure for this Forest Cover Type dataset. The tree consists of 58 observed variables and nine latent variables. The diameter of the tree is seven. The node set {1, 4, 5, 6, 7, 9, 13, 16, 17, 18, 20, 21, 23, 25, 26, 27, 30, 31, 33, 34, 35, 36, 37, 39, 41, 42, 43, 44, 47, 49, 50, 52, 53, 55, 56, 58} is the largest sibling group in the tree. From the additivity of the information distance, the cover type on Lodgepole

. For more details about ELU, please visit the following link: https://archive.ics.uci.edu/ml/datasets/Covertype . For more details about basic soil components, please visit the following link: https://casoilresource.lawr.ucdavis.edu/soil_web/ssurgo.php?action=list_mapunits& areasymbol=co645

Zhou, Wang and Guo

Figure 8: Mixed latent tree structure for the Forest Cover Type dataset.

Pine is close to six continuous variables and 29 soil types. Latent node H1 synthesizes all the information close to Lodgepole pine. Further statistical inferences on Lodgepole pine can be done on this largest sibling group. The Aspect (node 2) and the Hillshade index at 3pm (node 10) form a sibling pair with a latent node H2, which may be related to the solar incident angle. Soil components {11, 19, 32, 40, 51} form another sibling group. These ﬁve components belong to a common soil order Mollisols. Similarly, soil components {12,48} is from a soil suborder Cryolls. Observed variables {14,45,54} are basic components in two Ecological Landtype Units ELU3502 and ELU6731, and observed variables {15,24,38} are components in the ELU8707. So the latent nodes for soil components may reveal the potential background information on the soil taxonomy, and are related to the US Forest Ecological Landtype Units.

6. Discussion and Conclusion

To provide moderate theoretical proofs on mixed latent tree learning, we mainly consider that discrete observed variables are binary in this paper. But our algorithms also work for observed variables with more than two categories. For structural learning, we have mentioned in Subsection 3.1 that the information distance does not rely on the speciﬁc distributions of observed variables. So the structural learning algorithm can also handle general categorical observed variables. For parameter estimation, our strategy is to view categorical observed variables as binary variables. Assume that a categorical observed variable X has categories in {C1, , CK}. For any ﬁxed category Ck, variable X can be viewed as a binary variable X(k) with two states. One is that variable X takes the category Ck and the other is that X takes a category in the remaining set {C1, , Ck 1, Ck+1, , CK}. Furthermore, for the ﬁxed category Ck, we replace the original variable X by the binary variable X(k). This replacement can neither change the original tree structure, nor impact the conditional independence relations in the graphical tree model. So with this replacement, the Assumption

Learning Mixed Latent Tree Models

(A2) also holds that each latent variable has three neighbors at least. Similarly, we can also convert other categorical observed variables into binary variables. Thus the proposed PEMT algorithm can estimate the parameter on the categorical variable X for the category Ck. The parameters on variable X for other categories can be computed in a similar way. In summary, we propose an information distance for the mixed latent tree model and prove that the distance is additive along paths. Furthermore, we suggest a consistent bottomup algorithm and give a ﬁnite sample bound guarantee for the exact structural recovery. For estimating model parameters, we study the moment estimator using matrix decomposition and prove that this estimator is asymptotically normal. The simulations support that our algorithms perform well when the sample size is large. In the real data application, we show that our structural learning algorithm can detect the latent hierarchical structure of observed variables.

Acknowledgments

The authors would like to thank the Action Editor and the reviewers for their constructive suggestions and comments that improved the quality of this paper. This work is supported by the National Key Research and Development Program of China (SQ2020YFA070264), the National Natural Science Foundation of China (11690012, 11631003, 11726629, 11701079), the Fundamental Research Funds for the Central Universities (2412019FZ030).

Appendix A. Proof of Theorem 1

Here we give a proof of Theorem 1. Proof Consider a path [v0 = u, v1 = w, v2 = v] on the latent tree T between two nodes u, v W. By the conditional independence on T and the double expectation formula, we have that

Cov(Xu, Xv) = Var(Xw) E(Xu|Xw = 1) E(Xu|Xw = 0) E(Xv|Xw = 1) E(Xv|Xw = 0) .

Similarly, we have

Cov(Xu, Xw) = Var(Xw) E(Xu|Xw = 1) E(Xu|Xw = 0) ,

Cov(Xv, Xw) = Var(Xw) E(Xv|Xw = 1) E(Xv|Xw = 0) .

Cov(Xu, Xv) = Cov(Xu, Xw)Cov(Xv, Xw)

and duv = duw + dwv.

Appendix B. Proof of Theorem 2

A random variable X with zero expectation is sub-exponential if there is a positive number

λ such that its moment function satisﬁes E(et X) e λ2t2

2 for any |t| 1

λ. Denote a sub-

Zhou, Wang and Guo

exponential distribution by sub E(λ). We further have the following proposition (Boucheron et al., 2013) for controlling the tail probability of the sample mean:

Proposition 1 If independent random variables Z1, , Zn with mean zero and Zi sub E(λ) for all i, we have that

P(Z > t) P(Z < t) exp n

for any t > 0, where Z = 1

n Pn i=1 Zi.

Now we give the proof of Theorem 2.

Proof Let X, Y be two observed variables in T and denote ε = q

c(t+log 48)

n , then the inequality (2) is equivalent to

P SXY Cov(X, Y ) > ε 48 exp nε2

for any ε > 0. We prove this inequality in two cases. Case I: X and Y are the same variable. Case II: X and Y are not the same variable. Furthermore, we divide Case I into Case I.i, variable X is continuous, and Case I.ii, X is binary. Similarly, we divide Case II into Case II.i, both variables X and Y are continuous, and Case II.ii, X is continuous and Y is binary, and Case II.iii, both X and Y are binary. Case I.i:

P(|SXX Var(X)| > ε)

i=1 (Xi EXi + EXi X)2 E(Xi EXi)2 > ε

i=1 (Xi EXi)2 E(Xi EXi)2 > ε

+ P (X EXi) > rε

=: P1 + P2. (12)

Then we compute the upper bounds of P1 and P2 in the following. Let variable U be a parent of X on the tree. Then we have X|U=u N(µX|u, σ2 X|u) for

any u = 0, 1. Since Xi = P1 u=0 Xi I(Ui = u) and EXi = P1 u=0 P(Ui = u)µi|u, let

Ai,u := Xi I(Ui = u) P(Ui = u)µi|u

for u = 0, 1, hence Xi EXi = Ai,0 + Ai,1. For any u = 0, 1, we obtain that

Ai,u = Xi I(Ui = u) µX|u I(Ui = u) + µX|u I(Ui = u) P(Ui = u)µX|u =: Ci,u,1 + Ci,u,2.

Learning Mixed Latent Tree Models

(Xi EXi)2 =

= C2 i,0,1 + C2 i,1,1 + 2Ci,0,1Ci,0,2 + 2Ci,1,1Ci,1,2 + 0 + 2Ci,0,1Ci,1,2+

2Ci,1,1Ci,0,2 + C2 i,0,2 + C2 i,1,2 + 2Ci,0,2Ci,1,2,

E(Xi EXi)2 = EC2 i,0,1 + EC2 i,1,1 + 0 + 0 + 0 + 0+

0 + E 2C2 i,0,2 + 2C2 i,1,2 + E (2Ci,0,2Ci,1,2) .

P1 P1.1 + P1.2 + P1.3 + P1.4 + P1.5 + P1.6 + P1.7 + P1.8, (13)

P2 P2.1 + P2.2 + P2.3 + P2.4, (14)

i=1 C2 i,0,1 EC2 i,0,1

, P1.2 := P

i=1 C2 i,1,1 EC2 i,1,1

i=1 Ci,0,1Ci,0,2

, P1.4 := P

i=1 Ci,1,1Ci,1,2

i=1 Ci,0,1Ci,1,2

, P1.6 := P

i=1 Ci,0,2Ci,1,1

i=1 (C2 i,0,2 + C2 i,1,2) E(C2 i,0,2 + C2 i,1,2)

i=1 Ci,0,2Ci,1,2 ECi,0,2Ci,1,2

, P2.2 := P

, P2.4 := P

We only show the computation of the upper bound of P1.1, P1.3, P1.7, and the others can be obtained similarly. Firstly, we compute the upper bound of P1.3. Since

σX|0 I(Ui = 0)

Zhou, Wang and Guo

hence we denote the variable Xi µX|0

σX|0 I(Ui = 0) as Zi for any i = 1, . . . , n. Then ( EZi = 0

Eet Zi = (1 P(Ui = 0)) 1 + P(Ui = 0) e t2

therefore Z1, . . . , Zn i.i.d. g sub E(1). By Proposition 1, a constant C exists such that

P1.3 2 exp nε2

Similarly, we obtain that

P1.4, P1.5, P1.6 2 exp nε2

P2.1, P2.2 2 exp nε

Secondly, we compute the bound of P1.7. Let

Zi := C2 i,0,2 + C2 i,1,2 =

u=0 (I(Ui = u) P(Ui = u))2µ2 X|u.

P 2(Ui = 1) µ2 X|0 + µ2 X|1 , for Ui = 0

P 2(Ui = 0) µ2 X|0 + µ2 X|1 , for Ui = 1 ,

and Zi [ai, bi], where bi ai = |P(Ui = 1) P(Ui = 0)| µ2 X|0 + µ2 X|1 2µ2 m. By Hoeﬀding inequality, a constant C exists such that

Similarly, we obtain that

P1.8 2 exp nε2

P2.3, P2.4 2 exp nε2

Thirdly, we compute the bound of P1.1. Since,

Xi µX|0 2 I(Ui = 0) P(Ui = 0)σ2 X|0

> ε 32σ2 X|0

i=1 I(Ui = 0) P(Ui = 0)

> ε 32σ2 X|0

Learning Mixed Latent Tree Models

hence let Zi = Xi µX|0

2 1 I(Ui = 0). Then we have EZi = 0 and

Eet Zi = (1 P(Ui = 0)) + P(Ui = 0) Eet Zi|Ui=0

= (1 P(Ui = 0)) + P(Ui = 0) e t (1 2t) 1

e 1 2 32t2, ( |t| < 1

where the last inequality follows from the inequality 9

2t2 + t + 1

2 log(1 2t) 0 for any |t| < 1

3. Therefore Z1, . . . , Zn i.i.d. g sub E(3). By Proposition 1 and the Hoeﬀding inequality, a constant C exists such that

> ε 32σ2 X|0

i=1 I(Ui = 0) P(Ui = 0)

> ε 32σ2 X|0

Furthermore,

P1.1, P1.2 4 exp nε2

By (12), (13), (14), (15), (16), (17), (18) and (19), we obtain that

P(|SXX Var(X)| > ε) 28 exp nε2

where c := C max{σ4 m, σ2 mµ2 m, µ4 m, σ2 m, µ2 m, 1}.

Case I.ii: by the Hoeﬀding inequality, we obtain that

P (|SXX Var(X)| > ε)

i=1 X2 i X 2 !

EX2 i (EXi)2 > ε

Zhou, Wang and Guo

Case II.i: we divide it into case II.i.a, X, Y are not sibling pair, and case II.i.b, X, Y are sibling pair. Case II.i.a:

P(|SXY Cov(X, Y )| > ε)

i=1 (Xi EX)(Yi EY ) E(X EX)(Y EY )

P (X EX)(Y EY ) > ε

=: P1 + P2. (21)

Let U and V be parent of X and Y respectively. Then X|U=u N(µX|u, σ2 X|u), Y |V =v

N(µY |v, σ2 Y |v), where u, v {0, 1}. Since Xi = P1 u=0 Xi I(Ui = u), EXi = P1 u=0 P(Ui = u)µX|u, and the similar to Yi and EYi, hence let

Ai,u = Xi I(Ui = u) P(Ui = u)µX|u,

Bi,v = Yi I(Vi = v) P(Vi = v)µY |v.

Furthermore, let

Ai,u = Xi I(Ui = u) µX|u I(Ui = u) + µX|u I(Ui = u) P(Ui = u)µX|u =: Ci,u,1 + Ci,u,2, Bi,v = Yi I(Vi = v) µY |v I(Vi = v) + µY |v I(Vi = v) P(Vi = v)µY |v =: Di,v,1 + Di,v,2.

Then we have

l=1 (Ci,u,k Di,v,l ECi,u,k Di,v,l)

i=1 Ci,u,k Di,v,l ECi,u,k Di,v,l

Thus, we only need to ﬁnd the upper bounds of these equations:

P1,u,v,k,l := P

i=1 Ci,u,k Di,v,l ECi,u,k Di,v,l

, u, v {0, 1}, k, l {1, 2},

P2,u,k := P

, u {0, 1}, k {1, 2}.

Learning Mixed Latent Tree Models

From the arguments in case I.i, u, v {0, 1}, we have

P1,u,v,1,1 2 exp nε2

, P1,u,v,1,2, P1,u,v,2,1 2 exp nε2

P1,u,v,2,2 2 exp nε2

, P2,u,1 2 exp nε2

, P2,u,2 2 exp nε2

By (21), (22), (23) and (24), we have

P(|SXY Cov(X, Y )| > ε) 48 exp nε2

Case II.i.b: since X, Y are sibling pair, hence let U be the parent of X and Y . Replace Vi in case II.i.b with Ui, then we can obtain the desired result (11). Case II.ii: Let U be the parent of X. Then X|U=u N(µX|u, σ2 X|u), Y |U=u b(1, p Y |u=u), where u {0, 1}. From similar arguments, we have

P(|SXY Cov(X, Y )| > ε) 48 exp nε2

Case II.iii: Let U be the parent of X, then X|U=u b(1, p X|u), Y |U=u b(1, p Y |U=u), where u = 0, 1. From similar arguments, we have

P(|SXY Cov(X, Y )| > ε) 24 exp nε2

In summary, for any X, Y V and t > 0, we have

SXY Cov(X, Y ) >

c(t + log 48)

where c = C max{σ4 m, σ2 mµ2 m, µ4 m, σ2 m, µ2 m, 1}.

Appendix C. Proof of Theorem 3

In this section, we give the proof of Theorem 3. For an threshold ϵ min{1

2φmin, 1} , we only need to show that the probability of the event n ˆΦuvw ˆΦuvz (Φuvw Φuvz) < ϵ, u, v, w, z V o

is suﬃciently large if the sample size is suﬃciently large. Since ˆΦuvw ˆΦuvz (Φuvw Φuvz)

ˆduw duw + ˆdvw dvw + ˆduz duz + ˆdvz dvz

Zhou, Wang and Guo

for any u, v, w, z V, we have

P ˆΦuvw ˆΦuvz (Φuvw Φuvz) < ϵ, u, v, w, z V

P ˆduv duv < 1

4ϵ, u, v V .

Thus we only need to show that the probability of the event n ˆduv duv < 1

4ϵ, u, v V o

is suﬃciently large if the sample size is suﬃciently large. In the following, we show how to make ˆduv duv < 1

4ϵ. For any u, v V, we con-

sider the information distance duv. Since duv = log |Cov(Xu, Xv)| + 1

2 log(Var(Xu)) + 1 2 log(Var(Xv)), hence ˆduv duv < |log |Suv| log |Cov(Xu, Xv)|| + 1

2 |log(Suu) log(Var(Xu))| +

1 2 |log(Svv) log(Var(Xv))| .

We denote cmin as minu,v V{|Cov(Xu, Xv)|} (this allows u = v). If > 0 exists such that < 1

2cmin and for any u, v V, Suv Cov(Xu, Xv) , we obtain

|Suv| |Cov(Xu, Xv)| |Suv| |Cov(Xu, Xv)|

|Cov(Xu, Xv)| Suv Cov(Xu, Xv) 1

Since |Cov(Xu, Xv)|, |Suv| > 1

2cmin, then |log |Suv| log |Cov(Xu, Xv)|| < 2 cmin . Fur-

thermore, we have ˆduv duv < 4 cmin . Since ϵ 1, we obtain that < cminϵ

16 implies <

1 2cmin. Thus if an appropriate exists such that < cminϵ

16 and Suv Cov(Xu, Xv)

for any u, v V, then ˆduv duv < 1

Next, we show how to select such that P T u,v Suv Cov(Xu, Xv) is sufﬁciently large and < cminϵ

16 . If we show these successfully, the proof of Theorem 3 is completed. According to Theorem 2, for any u, v V and any t > 0, we have

Suv Cov(Xu, Xv) >

c(t + log 48)

Thus, for any t > 0, we have

( Suv Cov(Xu, Xv)

c(t + log 48)

( Suv Cov(Xu, Xv) >

c(t + log 48)

Suv Cov(Xu, Xv) >

c(t + log 48)

1 m2 e t =: 1 η,

Learning Mixed Latent Tree Models

where η (0, 1). Therefore, η = m2 e t and let t be log η m2 . Hence, we obtain

( Suv Cov(Xu, Xv)

c(log(48m2) log η)

Thus we select = q

c(log(48m2) log η)

n . For any η, if the sample size n is large enough such that < cminϵ

16 , the algorithm returns the true latent tree structure with a probability of at least 1 η. Furthermore, if 16

cmin < min{1

2φmin, 1}, there exists an appropriate threshold ϵ min{1

2φmin, 1}. So we get the conclusion.

Appendix D. Proof of Theorem 4

Take the case that u, v, w Vc in Figure 2 as an example, and denote θuvw as the parameter vector

(ph, µu|Xh=0, µu|Xh=1, µ(2) u|Xh=0, µ(2) u|Xh=1, µv|Xh=0, µv|Xh=1, µ(2) v|Xh=0, µ(2) v|Xh=1,

µw|Xh=0, µw|Xh=1)T ,

and denote f = (fl)11 l=1 as a function of random vector Xuvw := (Xu, Xv, Xw):

f1(xuvw) = xw, f2(xuvw) = xu, f3(xuvw) = x2 u, f4(xuvw) = xuxv, f5(xuvw) = xux2 v,

f6(xuvw) = x2 uxv, f7(xuvw) = x2 ux2 v, f8(xuvw) = xuxvxw, f9(xuvw) = xux2 vxw,

f10(xuvw) = x2 uxvxw, f11(xuvw) = x2 ux2 vxw.

It is obvious that EXw = Ef1(Xuvw), EXu = Ef2(Xuvw), EX2 u = Ef3(Xuvw),

Euv = Ef4(Xuvw) Ef5(Xuvw) Ef6(Xuvw) Ef7(Xuvw)

and Euvw = Ef8(Xuvw) Ef9(Xuvw) Ef10(Xuvw) Ef11(Xuvw)

Furthermore, we denote e = (el)11 l=1 as a function of the parameter vector θuvw:

e1(θuvw) = (1 ph) µw|Xh=0 + ph µw|Xh=1,

e2(θuvw) = (1 ph) µu|Xh=0 + ph µu|Xh=1,

e3(θuvw) = (1 ph) µ(2) u|Xh=0 + ph µ(2) u|Xh=1,

e4(θuvw) = (1 ph) µu|Xh=0 µv|Xh=0 + ph µu|Xh=1 µv|Xh=1,

e5(θuvw) = (1 ph) µu|Xh=0 µ(2) v|Xh=0 + ph µu|Xh=1 µ(2) v|Xh=1,

e6(θuvw) = (1 ph) µ(2) u|Xh=0 µv|Xh=0 + ph µ(2) u|Xh=1 µv|Xh=1,

e7(θuvw) = (1 ph) µ(2) u|Xh=0 µ(2) v|Xh=0 + ph µ(2) u|Xh=1 µ(2) v|Xh=1,

e8(θuvw) = (1 ph) µu|Xh=0 µv|Xh=0 µw|Xh=0 + ph µu|Xh=1 µv|Xh=1 µw|Xh=1,

e9(θuvw) = (1 ph) µu|Xh=0 µ(2) v|Xh=0 µw|Xh=0 + ph µu|Xh=1 µ(2) v|Xh=1 µw|Xh=1,

e10(θuvw) = (1 ph) µ(2) u|Xh=0 µv|Xh=0 µw|Xh=0 + ph µ(2) u|Xh=1 µv|Xh=1 µw|Xh=1,

e11(θuvw) = (1 ph) µ(2) u|Xh=0 µ(2) v|Xh=0 µw|Xh=0 + ph µ(2) u|Xh=1 µ(2) v|Xh=1 µw|Xh=1.

Zhou, Wang and Guo

Then our eigen-decomposition methods are equivalent to solve the equations

e(θuvw) = Eθuvwf(Xuvw). (29)

By using the sample moment 1

n Pn l=1 f(X(l) uvw) to replace the right side of the equations (29), we obtain the moment estimation equations

e(θuvw) = 1

l=1 f(X(l) uvw). (30)

The solution of the equations (30) is the moment estimation ˆθuvw for θuvw. Before proving Theorem 4, we need the following proposition which states the asymptotic normality of the moment estimation (van der Vaart, 2000).

Proposition 2 Suppose that e(θ) = Pθf is one-to-one on an open set Θ Rk and continuously diﬀerentiable at θ0 with nonsingular derivative e θ0. Moreover, assume that Pθ0 f 2 < . Then moment estimators ˆθn exist with probability tending to one and satisfy

n(ˆθn θ0) θ0 N 0, e 1 θ0 Pθ0ff T (e 1 θ0 )T .

Now we give the proof of Theorem 4. Proof We prove this theorem on the two cases: case I, the asymptotic normality of estimators for observed variables, and case II, the asymptotic normality of estimators for latent variables. Case I: as discussed above, we know that Θuvw = {θuvw : µw|Xh=0 > µw|Xh=1} is an open set. For any u, v V, w Vc, since µw|Xh=0 > µw|Xh=1, we obtain the unique parameters µw|Xh=0, µw|Xh=1 by the ﬁrst step of eigen-decomposition. Then, we obtain the unique parameters ph, Γu|h by the second and third step of eigen-decomposition. Furthermore, according to the equation

1 ph 0 0 ph

and the matrix Euv and the parameters ph, Γu|h are all unique, we obtain the unique parameters Γv|h. Thus e(θuvw) is an one-to-one on Θuvw for any u, v V, w Vc. Similarly, e(θuvw) is an one-to-one on Θuvw for u, v V, w Vd. And e(θuvw) is continuously diﬀerentiable at θ0. According to the proposition 2, we only need to prove that the Jacobi matrix of e(θuvw) at θ0 is nonsingular. For any u, v, w Vc, we have

det e(θuvw)

= (1 p1)5p5 1(µu|Xh=0µ(2) u|Xh=1 µu|Xh=1µ(2) u|Xh=0)3(µv|Xh=0µ(2) v|Xh=1 µv|Xh=1µ(2) v|Xh=0)2

(µw|Xh=1 µw|Xh=0)3

= (1 ph)2p2 h det 3(Γu|h) det 2(Γv|h)Cov3(Xw, Xh).

Learning Mixed Latent Tree Models

Thus, for any w V,

det e(θuvw)

(1 ph)2p2 h det3(Γu|h) det2(Γv|h)Cov3(Xw, Xh), u, v Vc;

(1 ph)ph det2(Γu|h) det2(Γv|h)Cov3(Xw, Xh), u Vc, v Vd;

(1 ph)ph det2(Γu|h) det2(Γv|h)Cov3(Xw, Xh), u Vd, v Vc;

(1 ph)ph det2(Γu|h) det2(Γv|h)Cov2(Xw, Xh), u, v Vd.

Since Assumption (A1) holds and continuous variables have mean zero, we have

det e(θuvw)

= 0, u, v, w V.

In the following, we prove E f 2 < + . We only need to ﬁnd an upper bound M < + such that Ef2 l (Xuvw) M for any l. For any u Vc and h H, the conditional moment E(Xl u|Xh = xh) with xh = 0, 1 and l = 2, 4 satisﬁes

E(Xl u|Xh = xh) =

xpa(u)=0 P Xpa(u) = xpa(u)|Xh = xh E Xl u|Xpa(u) = xpa(u) .

Since Xu|Xpa(u) = xpa(u) N(µu|xpa(i), σ2 u|xpa(u)) for xpa(u) = 0, 1, we have

E(X2 u|Xh = xh) =

xpa(u)=0 P Xpa(u) = xpa(u)|Xh = xh σ2 u|xpa(u) + µ2 u|xpa(u)

σ2 m + µ2 m,

E(X4 u|Xh = xh) =

xpa(u)=0 P Xpa(u) = xpa(u)|Xh = xh 3σ4 u|xpa(u)+

6σ2 u|xpa(u)µ2 u|xpa(u) + µ4 u|xpa(u)

3σ4 m + 6σ2 mµ2 m + µ4 m,

where µm = max{|µu|Xh=0|, |µu|Xh=1|, u Vc} and σ2 m = max{σ2 u|Xh=0, σ2 u|Xh=1, u Vc}. For any u Vd and h H, the conditional moment E(Xl u|Xh = xh) with xh = 0, 1 and l = 2, 4 satisﬁes E(Xl u|Xh = xh) = P(Xu = 1|Xh = xh) 1.

Let C = max{1, σ2 m + µ2 m, 3σ4 m + 6σ2 mµ2 m + µ4 m} < + . Thus, for any fl, we have

Ef2 l (X) =

xh=0 P(Xh = xh)E(Xlu u |Xh = xh)E(Xlv v |Xh = xh)E(Xlw w |Xh = xh)

Zhou, Wang and Guo

where lu, lv = 0, 2, 4 and lw = 2, 4. So, we have E f 2 < + . From the proposition 2, we have

n(ˆθuvw θ0) L N(0, e 1(θ0)Ef(Xuvw)f T (Xuvw)(e 1(θ0))T ).

Thus, ˆph, ˆµu|Xh=0, ˆµu|Xh=1, ˆµ(2) u|Xh=0, ˆµ(2) u|Xh=1 for u Vc (or ˆpu|Xh=0, ˆpu|Xh=1 for u Vd) are all asymptotically normal. Case II: ﬁrstly, we consider the estimating equations for the parameter of the latent variables. As shown in Figure 3, for any observed variables u, v1, v2, w V and any latent variables h1, h2 H, let θuv1v2w denote all the model parameters which appear in the decomposition method. As shown in Figure 3 (b), the parameter θuv1v2w consists of θuv1w and θuv2w. Since we estimate the parameters θuv1w and θuv2w separately when we estimate the parameter θuv1v2w, we treat them all as free parameters. Thus, all the parameters in θuv1v2w are free. Take the case where u, v1, v2, w Vc as an example,

θuv1v2w = (θT uv1w, θT uv2w)T

= (ph1, µu|Xh1=0, µu|Xh1=1, µ(2) u|Xh1=0, µ(2) u|Xh1=1, µv1|Xh1=0, µv1|Xh1=1, µ(2) v1|Xh1=0,

µ(2) v1|Xh1=1, µw|Xh1=0, µw|Xh1=1, ph2, µu|Xh2=0, µu|Xh2=1, µ(2) u|Xh2=0, µ(2) u|Xh2=1,

µv2|Xh2=0, µv2|Xh2=1, µ(2) v2|Xh2=0, µ(2) v2|Xh2=1, µw|Xh2=0, µw|Xh2=1)T ,

where µw|Xh1=0 > µw|Xh1=1 and µw|Xh2=0 > µw|Xh2=1. Furthermore, the estimating equations of the parameter θuv1v2w consist of the estimating equations of the parameters θuv1w and θuv2w (similar to equation (30)). Thus, we can obtain estimating equations of the parameter θuv1v2w:

l=1 F(X(l) uv1v2w) :=

1 n Pn l=1 f X(l) uv1w

1 n Pn l=1 f X(l) uv2w

= Ef (Xuv1w) Ef (Xuv2w)

= e(θuv1w) e(θuv2w)

=: g(θuv1v2w),

where f( ), e( ) are deﬁned in the Subsection 4.2. The solution of the above equations is the moment estimation ˆθuv1v2w obtained from the PEMT algorithm for θuv1v2w. By the multiple center limit theorem, we have

l=1 F(X(l) uv1v2w) E(F(Xuv1v2w))

L N(0, Cov(F(Xuv1v2w))).

As discussed above, we know that Θuv1v2w = {θuv1v2w : µw|Xh1=0 > µw|Xh1=1, µw|Xh2=0 > µw|Xh2=1} is an open set. For any u, v1, v2, w V, similar to Case I, we have

det g(θuv1v2w)

= det e(θuv1w)

det e(θuv2w)

Learning Mixed Latent Tree Models

Similar to Case I, we have E f 2 < . Thus, from proposition D.1, we have

n(ˆθuv1v2w θ0) L N(0, g 1(θ0)Ef(Xuv1v2w)f T (Xuv1v2w)(g 1(θ0))T ).

Let θu|h1h2 denote

(µu|Xh1=0, µu|Xh1=1, µ(2) u|Xh1=0, µ(2) u|Xh1=1, µu|Xh2=0, µu|Xh2=1, µ(2) u|Xh2=0, µ(2) u|Xh2=1)T

for u Vc and (pu|Xh1=0, pu|Xh1=1, pu|Xh2=0, pu|Xh1=1)T

for u Vd. Then, we have n(ˆθu|h1h2 θu|h1h2) L N(0, Σ),

where Σ is corresponding submatrix of the matrix g 1(θuv1v2w) Ef(Xuv1v2w)f T (Xuv1v2w) (g 1(θuv1v2w))T .

In the following, we consider the asymptotic normality of the estimation of parameters ph1|Xh2=0, ph1|Xh2=1. Take the case that u Vc as an example. Since

Γh1|h2 = Γ 1 u|h1 Γu|h2

= 1 det(Γu|h1)

µ(2) u|Xh1=1 µu|Xh1=1

µ(2) u|Xh1=0 µu|Xh1=0

µu|Xh2=0 µu|Xh2=1 µ(2) u|Xh2=0 µ(2) u|Xh2=1

ph1|Xh2=0 ph1|Xh2=1

µu|Xh1 =0 µ(2) u|Xh2 =0 µ(2) u|Xh1 =0 µu|Xh2 =0

µu|Xh1 =0 µ(2) u|Xh2 =1 µ(2) u|Xh1 =0 µu|Xh2 =1

=: ϕ(θu|h1h2).

Thus, ϕ(ˆθu|h1h2) is the parameter estimations (ˆph1|Xh2=0, ˆph1|Xh2=1)T obtained from sample PEMT algorithm. According to Assumption (A1) and µu = 0, we have det(Γu|h1) = 0 and µu|Xh=x = 0 for h = h1, h2 and x = 0, 1. Furthermore, the function ϕ( ) is Continuously diﬀerentiable. Thus, by the multiple delta theorem, we have

ˆph1|Xh2=0 ˆph1|Xh2=1

ph1|Xh2=0 ph1|Xh2=1

Then, ˆph1|Xh2=0, ˆph1|Xh2=1 are all asymptotically normal.

Zhou, Wang and Guo

H. Akaike. A new look at the statistical model identiﬁcation. IEEE Transactions on Automatic Control, 19:716 723, 1974.

A. Anandkumar, R. Ge, D. Hsu, S. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773 2832, 2014.

S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities. Oxford University Press, 2013.

J. T. Chang. Full reconstruction of markov models on evolutionary trees: identiﬁability and consistency. Mathematical biosciences, 137(1):51 73, 1996.

P. X. Chen, N. L. Zhang, K. M. Poon, and Z. R. Chen. Progressive em for latent tree models and hierarchical topic detection. Proceedings of the Thirtieth AAAI Conference on Artiﬁcial Intelligence, pages 1498 1504, 2016.

P. X. Chen, N. L. Zhang, T. F. Liu, L. K. M. Poon, Z. R. Chen, and F. Khawar. Latent tree models for hierarchical topic detection. Artiﬁcial Intelligence, 250:105 124, 2017.

T. Chen, N. L. Zhang, K. M. Poon T. F. Liu, and Y. Wang. Model-based multidimensional clustering of categorical data. Artiﬁcial Intelligence, 176:2246 2269, 2012.

M. J. Choi, V. Y. F. Tan, A. Anandkumar, and A. S. Willsky. Learning latent tree graphical models. Journal of Machine Learning Research, 12:1771 1812, 2011.

J. Fan, H. Liu, and Y. Ning. High dimensional semiparametric latent graphical model for mixed data. Journal of the Royal Statistical Society: Series B., 79:405 421, 2017.

L. A. Goodman. Exploratory latent structure analysis using both identiﬁable and unidentiﬁable models. Biometrika, 61:215 231, 1974.

S. L. Lauritzen. Graphical Models. Oxford Clarendon Press, 1996.

P. F. Lazarsfeld and N. W. Henry. Latent structure analysis. Boston: Houghton Miﬄin, 1968.

J. D. Lee and T. J. Hastie. Learning the structure of mixed graphical models. Journal of Machine Learning Research, 24:230 253, 2014.

T. F. Liu, N. L. Zhang, P. X. Chen, A. H. Liu, L. K. M. Poon, and Y. Wang. Greedy learning of latent tree models for multidimensional clustering. Machine Learning, 98: 301 330, 2015.

D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53:131 147, 1981.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461 464, 1978.

Learning Mixed Latent Tree Models

L. Song, A. P. Parikh, and E. P. Xing. Kernel embeddings of latent tree graphical models. Advances in Neural Information Processing Systems, 24:2708 2716, 2011.

A. W. van der Vaart. Asymptotic Statistics. Cambridge Univ. Press, 2000.

F. Wang and Y. Li. Beyond physical connections: Tree models in human pose estimation. The IEEE Conference on Computer Vision and Pattern Recognition, pages 596 603, 2013.

X. F. Wang, J. H. Guo, L. Z. Hao, and N. L. Zhang. Spectral methods for learning discrete latent tree models. Statistics and Its Interface, 10:677 698, 2017.

Y. Wang, N. L. Zhang, and T. Chen. Latent tree models and approximate inference in Bayesian networks. Journal of Artiﬁcial Intelligence Research, 32:879 900, 2008.

N. L. Zhang. Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5:697 723, 2004.