# Nonparametric Estimation of Multi-View Latent Variable Models

Le Song LSONG@CC.GATECH.EDU
Georgia Institute of Technology, Atlanta, GA 30345 USA

Animashree Anandkumar A.ANANDKUMAR@UCI.EDU
University of California, Irvine, CA 92697, USA

Bo Dai, Bo Xie BODAI,BXIE33@GATECH.EDU
Georgia Institute of Technology, Atlanta, GA 30345 USA

Abstract

Spectral methods have greatly advanced the estimation of latent variable models, generating a sequence of novel and efficient algorithms with strong theoretical guarantees. However, current spectral algorithms are largely restricted to mixtures of discrete or Gaussian distributions. In this paper, we propose a kernel method for learning multi-view latent variable models, allowing each mixture component to be nonparametric and learned from data in an unsupervised fashion. The key idea of our method is to embed the joint distribution of a multi-view latent variable model into a reproducing kernel Hilbert space, and then recover the latent parameters using a robust tensor power method. We establish that the sample complexity of the proposed method is quadratic in the number of latent components and a low-order polynomial in the other relevant parameters. Thus, our nonparametric tensor approach to learning latent variable models enjoys good sample and computational efficiency. As a special case of our framework, we also obtain the first unsupervised conditional density estimator of this kind with provable guarantees. On both synthetic and real-world datasets, the nonparametric tensor power method compares favorably to the EM algorithm and other spectral algorithms.

1. Introduction

Recently, there has been a surge of interest in designing spectral algorithms for estimating the parameters of latent variable models (Hsu et al., 2009; Song et al., 2010; Parikh et al., 2011; Song et al., 2011; Foster et al., 2012; Anandkumar et al., 2012a;b). Compared to the Expectation Maximization (EM) algorithm (Dempster et al., 1977) traditionally used for this task, spectral algorithms are better in terms of their computational efficiency and provable guarantees. However, current spectral algorithms are largely restricted to mixtures of discrete or Gaussian distributions, e.g., (Anandkumar et al., 2012a; Hsu & Kakade, 2013). When the mixture components are distributions other than these standard distributions, the theoretical guarantees for these algorithms no longer apply, and their empirical performance can be very poor.

We propose a kernel method for obtaining sufficient statistics of a multi-view latent variable model (for $\ell \geq 3$),

$$P(\{X_t\}_{t \in [\ell]} \mid h) = \prod_{t \in [\ell]} P(X_t \mid h), \qquad (1)$$

given samples only from the observed variables $\{X_t\}_{t \in [\ell]}$, but not the hidden variable $H$. These statistics allow us to answer integral queries, $\int_{\mathcal{X}} f(x_t)\, dP(x_t \mid h)$, for functions $f$ from a reproducing kernel Hilbert space (RKHS), without the need to assume any parametric form for the latent components $P(X_t \mid h)$ (we call this setting unsupervised). Note that this is a very challenging problem, since we do not have samples from which to directly estimate $P(X_t \mid h)$; hence the traditional kernel density estimator does not apply. Furthermore, the nonparametric form of $P(X_t \mid h)$ renders previous spectral methods inapplicable.
Our solution is to embed the distribution of the observed variables of such a model into a reproducing kernel Hilbert space, and to exploit a tensor decomposition of the embedded distribution (or covariance operators) to recover the unobserved embeddings $\mu_{X_t|h} = \int_{\mathcal{X}} \phi(x)\, dP(x \mid h)$ of the mixture components. The key computation of our algorithm involves a kernel singular value decomposition of the two-view covariance operator, followed by a robust tensor power method on the three-view covariance operator. These standard matrix operations make the algorithm very efficient and easy to deploy.

Although kernel methods have been previously applied to learning latent variable models, none of them can provably recover the exact latent component $P(X_t \mid h)$ or its sufficient statistics to support integral queries on this distribution. For instance, Song et al. (2010; 2011) and Song & Dai (2013) estimated an (unknown) invertible transformation of the sufficient statistics of the latent component $P(X_t \mid h)$, and only supported integral queries associated with the distribution of the observed variables. Sgouritsa et al. (2013) used a kernel independence measure to cluster data points, and treated each cluster as a latent component. Besides computational issues, it is also difficult to provide theoretical guarantees for such an approach, since the clustering step only finds a local minimum. Benaglia et al. (2009) designed an EM-like algorithm for learning the conditional densities in latent variable models. This algorithm alternates between the E-step, a proportional assignment of data points to components, and the M-step, kernel density estimation based on the weighted data points. Similarly, theoretical analysis of such a local search heuristic is difficult.

The kernel algorithm proposed in this paper is also significantly more general than previous spectral algorithms, which work only for distributions with parametric assumptions (Anandkumar et al., 2012a; Hsu & Kakade, 2013). In fact, when we use the delta kernel, our algorithm recovers the previous algorithm of Anandkumar et al. (2012a) for discrete mixture components as a special case. When we use universal kernels, such as the Gaussian RBF kernel, our algorithm can recover Gaussian mixture components as well as mixture components with other distributions. In this sense, our work also provides a unifying framework for previous spectral algorithms.

We prove sample complexity bounds for the nonparametric tensor power method and show that it is both computationally and sample efficient. As a special case of our framework, we also obtain the first unsupervised conditional density estimator of this kind with provable guarantees. Furthermore, our approach can be generalized to other latent variable learning tasks such as independent component analysis and latent variable models with Dirichlet priors. Experimentally, we corroborate our theoretical results by comparing our algorithm to the EM algorithm and previous spectral algorithms. We show that when the model assumptions are correct for the EM algorithm and previous spectral algorithms, our algorithm converges to these competitors in terms of estimation error. In the opposite case, when the model assumptions are incorrect, our algorithm is able to adapt to the nonparametric mixture components and beats the alternatives by a very large margin.

2. Notation

We denote by $X$ a random variable with domain $\mathcal{X}$, and refer to instantiations of $X$ by the lower case character $x$.
We endow $\mathcal{X}$ with some $\sigma$-algebra $\mathscr{A}$ and denote a distribution (with respect to $\mathscr{A}$) on $\mathcal{X}$ by $P(X)$. For the multi-view model in equation (1), we also deal with multiple random variables, $X_1, X_2, \ldots, X_\ell$, with joint distribution $P(X_1, X_2, \ldots, X_\ell)$. For simplicity of notation, we assume that the domains of all $X_t$, $t \in [\ell]$, are the same, but the methodology applies to cases where they have different domains. Furthermore, we denote by $H$ a hidden variable with domain $\mathcal{H}$ and distribution $P(H)$.

A reproducing kernel Hilbert space (RKHS) $\mathcal{F}$ on $\mathcal{X}$ with a kernel $\kappa(x, x')$ is a Hilbert space of functions $f(\cdot): \mathcal{X} \mapsto \mathbb{R}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$. Its element $\kappa(x, \cdot)$ satisfies the reproducing property, $\langle f(\cdot), \kappa(x, \cdot) \rangle_{\mathcal{F}} = f(x)$, and consequently $\langle \kappa(x, \cdot), \kappa(x', \cdot) \rangle_{\mathcal{F}} = \kappa(x, x')$, meaning that we can view the evaluation of a function $f$ at any point $x \in \mathcal{X}$ as an inner product. Alternatively, $\kappa(x, \cdot)$ can be viewed as an implicit feature map $\phi(x)$, where $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$. In this paper, we will focus on $\mathcal{X} = \mathbb{R}^d$ and the normalized Gaussian RBF kernel

$$\kappa(x, x') = \exp\!\left(-\|x - x'\|^2 / (2s^2)\right) / \big(\sqrt{2\pi}\, s\big)^{d}. \qquad (2)$$

But kernel functions have also been defined on graphs, time series, dynamical systems, images and other structured objects (Schölkopf et al., 2004). Thus the methodology presented below can be readily generalized to a diverse range of data types as long as kernel functions are defined.

3. Kernel Embedding of Distributions

Kernel embeddings of distributions are implicit mappings of distributions into potentially infinite dimensional RKHSs. The kernel embedding approach represents a distribution by an element of the RKHS associated with a kernel function (Smola et al., 2007),

$$\mu_X := E_X[\phi(X)] = \int_{\mathcal{X}} \phi(x)\, dP(x), \qquad (3)$$

where the distribution is mapped to its expected feature map, i.e., to a point in a potentially infinite-dimensional and implicit feature space. By the reproducing property of an RKHS, the kernel embedding is a sufficient statistic for integral queries with respect to any $f \in \mathcal{F}$, i.e., $\int_{\mathcal{X}} f(x)\, dP(x) = \langle \mu_X, f \rangle_{\mathcal{F}}$.

Kernel embeddings of distributions have rich representational power. The mapping is injective for characteristic kernels (Sriperumbudur et al., 2008). That is, if two distributions, $P(X)$ and $Q(X)$, are different, they are mapped to two distinct points in the RKHS. For the domain $\mathbb{R}^d$, many commonly used kernels are characteristic, such as the normalized Gaussian RBF kernel. Kernel embeddings can be readily generalized to joint distributions of two or more variables using tensor product feature maps. We can embed the joint distribution of two variables $X_1$ and $X_2$ into the tensor product feature space $\mathcal{F} \otimes \mathcal{F}$ by $C_{X_1X_2} := \int_{\mathcal{X} \times \mathcal{X}} \phi(x_1) \otimes \phi(x_2)\, dP(x_1, x_2)$, where the reproducing kernel for the tensor product features satisfies $\langle \phi(x_1) \otimes \phi(x_2), \phi(x_1') \otimes \phi(x_2') \rangle_{\mathcal{F} \otimes \mathcal{F}} = \kappa(x_1, x_1')\, \kappa(x_2, x_2')$. By analogy, we can also define $C_{X_1X_2X_3} := E_{X_1X_2X_3}[\phi(X_1) \otimes \phi(X_2) \otimes \phi(X_3)]$.

Given a sample $\mathcal{D}_X = \{x_1, \ldots, x_m\}$ of size $m$ drawn i.i.d. from $P(X)$, the empirical kernel embedding can be estimated simply as $\hat\mu_X = \frac{1}{m}\sum_{i=1}^m \phi(x_i)$, with an error $\|\hat\mu_X - \mu_X\|_{\mathcal{F}}$ scaling as $O_p(m^{-1/2})$ (Smola et al., 2007). Similarly, $C_{X_1X_2}$ and $C_{X_1X_2X_3}$ can be estimated as $\hat C_{X_1X_2} = \frac{1}{m}\sum_{i=1}^m \phi(x_1^i) \otimes \phi(x_2^i)$ and $\hat C_{X_{1:3}} = \frac{1}{m}\sum_{i=1}^m \phi(x_1^i) \otimes \phi(x_2^i) \otimes \phi(x_3^i)$, respectively. Note that we never explicitly compute the feature maps $\phi(x)$ for each data point. Instead, most of the computation required for subsequent statistical inference with kernel embeddings can be reduced to Gram matrix manipulations.
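As a concrete illustration (ours, not from the paper), the following numpy sketch estimates $\hat\mu_X$ implicitly and answers an integral query $\langle \hat\mu_X, f\rangle_{\mathcal{F}}$ purely through Gram-matrix operations, assuming $f$ lies in the span of kernel functions centered at a few points $z_j$. The function names and toy data are hypothetical.

```python
import numpy as np

def gaussian_rbf(X, Z, s):
    """Normalized Gaussian RBF kernel matrix, kappa(x, z) as in Eq. (2)."""
    d = X.shape[1]
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s) ** d

def integral_query(X, Z, alpha, s):
    """<mu_hat_X, f> for f(.) = sum_j alpha_j kappa(z_j, .),
    i.e. (1/m) sum_i sum_j alpha_j kappa(x_i, z_j)."""
    K = gaussian_rbf(X, Z, s)
    return K.mean(axis=0) @ alpha

# Toy usage: estimate the integral of f with respect to P(X) from a sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))          # i.i.d. sample from P(X)
Z = np.array([[-1.0], [0.0], [1.0]])    # expansion points defining f
alpha = np.array([0.5, 1.0, -0.3])      # coefficients of f in the RKHS
print(integral_query(X, Z, alpha, s=0.5))
```

The feature maps $\phi(x_i)$ never appear explicitly; only the $m \times$ (number of expansion points) Gram matrix is formed.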
3.1. Kernel Embedding as a Multi-Linear Operator

The joint embedding can also be viewed as an uncentered covariance operator $C_{X_1X_2}: \mathcal{F} \mapsto \mathcal{F}$ by the standard equivalence between a tensor product feature and a linear map. That is, given two functions $f_1, f_2 \in \mathcal{F}$, their covariance can be computed as $E_{X_1X_2}[f_1(X_1) f_2(X_2)] = \langle f_1, C_{X_1X_2} f_2 \rangle_{\mathcal{F}}$, or equivalently $\langle f_1 \otimes f_2, C_{X_1X_2} \rangle_{\mathcal{F} \otimes \mathcal{F}}$, where in the former we view $C_{X_1X_2}$ as an operator while in the latter we view it as an element of the tensor product feature space. By analogy, $C_{X_1X_2X_3}$ (with shorthand $C_{X_{1:3}}$) can be regarded as a multi-linear operator from $\mathcal{F} \times \mathcal{F} \times \mathcal{F}$ to $\mathbb{R}$. It will be clear from the context whether we use $C_{X_{1:3}}$ as an operator between two spaces or as an element of a tensor product feature space. For a generic introduction to tensors, please see (Kolda & Bader, 2009).

In the multi-linear operator view, the application of $C_{X_{1:3}}$ to a set of elements $\{f_1, f_2, f_3\} \subset \mathcal{F}$ can be defined using the inner product from the tensor product feature space, i.e., $C_{X_{1:3}} \times_1 f_1 \times_2 f_2 \times_3 f_3 := \langle C_{X_{1:3}}, f_1 \otimes f_2 \otimes f_3 \rangle_{\mathcal{F}^{\otimes 3}}$, which is further equal to $E_{X_1X_2X_3}\!\big[\prod_{t \in [3]} \langle \phi(X_t), f_t \rangle_{\mathcal{F}}\big]$. Furthermore, we can define the Hilbert-Schmidt norm $\|\cdot\|$ via $\|C_{X_{1:3}}\|^2 = \sum_{i_1, i_2, i_3 = 1}^{\infty} \big(C_{X_{1:3}} \times_1 u_{i_1} \times_2 u_{i_2} \times_3 u_{i_3}\big)^2$, using three collections of orthonormal bases $\{u_{i_1}\}_{i_1=1}^{\infty}$, $\{u_{i_2}\}_{i_2=1}^{\infty}$, and $\{u_{i_3}\}_{i_3=1}^{\infty}$.

The joint embedding $C_{X_1X_2}$ can be viewed as an infinite dimensional matrix. For instance, we can perform the singular value decomposition $C_{X_1X_2} = \sum_{i=1}^{\infty} \sigma_i\, u_{i1} \otimes u_{i2}$, where the $\sigma_i \in \mathbb{R}$ are singular values ordered in nonincreasing manner, and $\{u_{i1}\}_{i_1=1}^{\infty} \subset \mathcal{F}$ and $\{u_{i2}\}_{i_2=1}^{\infty} \subset \mathcal{F}$ are singular vectors forming orthonormal bases. The rank of $C_{X_1X_2}$ is the smallest $k$ such that $\sigma_i = 0$ for $i > k$.

4. Multi-View Latent Variable Models

The multi-view latent variable models studied in this paper are a special class of Bayesian networks in which (i) the observed variables $X_1, X_2, \ldots, X_\ell$ are conditionally independent given a discrete latent variable $H$, and (ii) the conditional distributions $P(X_t \mid H)$ of the $X_t$, $t \in [\ell]$, given the hidden variable $H$ can be different.

Figure 1. Examples of multi-view latent variable models: (a) Naïve Bayes model; (b) hidden Markov model.

The conditional independence structure of a multi-view latent variable model is illustrated in Figure 1(a), and many complicated graphical models, such as the hidden Markov model in Figure 1(b), can be reduced to a multi-view latent variable model. For simplicity of exposition, we will explain our method using the model with symmetric views, that is, the conditional distributions are the same for each view, i.e., $P(X|h) = P(X_1|h) = P(X_2|h) = P(X_3|h)$. In Appendix 9, we will show that multi-view models with different views can be reduced to ones with symmetric views.

4.1. Conditional Embedding Operator

For simplicity of exposition, we focus on a simple model with three observed variables ($\ell = 3$). Suppose $H \in [k]$; then we can embed each conditional distribution $P(X|h)$ corresponding to a particular value $H = h$ as

$$\mu_{X|h} := \int_{\mathcal{X}} \phi(x)\, dP(x \mid h). \qquad (4)$$

If we vary the value of $H$, we obtain the kernel embedding of a different $P(X|h)$. Conceptually, we can tile these embeddings into a matrix (with an infinite number of rows)

$$C_{X|H} := \big(\mu_{X|h=1}, \mu_{X|h=2}, \ldots, \mu_{X|h=k}\big), \qquad (5)$$

which is called the conditional embedding operator. If we use the standard basis $e_h$ in $\mathbb{R}^k$ to represent each value of $h$, we can retrieve each $\mu_{X|h}$ from $C_{X|H}$ by

$$\mu_{X|h} = C_{X|H}\, e_h. \qquad (6)$$

Once we have the conditional embedding $\mu_{X|h}$, we can compute the conditional expectation of a function $f \in \mathcal{F}$ as $\int_{\mathcal{X}} f(x)\, dP(x \mid h) = \langle f, \mu_{X|h} \rangle_{\mathcal{F}}$.
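For intuition, here is a toy sketch (ours, not from the paper) of Eq. (6) in the simplest finite-dimensional case, the Kronecker delta kernel of Example 1 below, where the conditional embedding operator is just the conditional probability table.

```python
import numpy as np

# Toy illustration: with the Kronecker delta kernel, phi(x) = e_x, so the
# conditional embedding operator C_{X|H} is the conditional probability table
# whose columns are mu_{X|h} = P(.|h).
n, k = 5, 2
rng = np.random.default_rng(0)
C_XgivenH = rng.dirichlet(np.ones(n), size=k).T      # n x k, column h is P(X|h)

e_h = np.eye(k)[1]                                    # standard basis vector for h = 2
mu_X_given_h = C_XgivenH @ e_h                        # Eq. (6): retrieve one column

f = rng.normal(size=n)                                # a function on the discrete domain
cond_expectation = f @ mu_X_given_h                   # <f, mu_{X|h}> = sum_x f(x) P(x|h)
print(cond_expectation, f @ C_XgivenH[:, 1])          # identical
```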
Remarks. For data from $\mathbb{R}^d$ and the normalized Gaussian RBF kernel in (2), the conditional density $p(x|h)$ exists, and it can be approximated by the embedding as $\tilde p(x|h) := \langle \phi(x), \mu_{X|h} \rangle_{\mathcal{F}} = E_{X|h}[\kappa(x, X)]$. Essentially, this is the convolution of the conditional density with the kernel function. For a continuous density $p(x|h)$ with suitable smoothness conditions, the approximation error is of the order (Wasserman, 2006)

$$|p(x|h) - \tilde p(x|h)| = O(s^2). \qquad (7)$$

4.2. Factorized Kernel Embedding

For multi-view latent variable models, $P(X_1, X_2)$ and $P(X_1, X_2, X_3)$ can be factorized, respectively, as $P(x_1, x_2) = \sum_{h\in[k]} P(x_1|h)\, P(x_2|h)\, P(h)$ and $P(x_1, x_2, x_3) = \sum_{h\in[k]} P(x_1|h)\, P(x_2|h)\, P(x_3|h)\, P(h)$.

Since we assume the hidden variable $H \in [k]$ is discrete, we let $\pi_h := P(h)$. Furthermore, if we apply the Kronecker delta kernel $\delta(h, h')$ with feature map $e_h$, then the embeddings for $P(H)$,

$$C_{HH} = E_H[e_H \otimes e_H] = \big(\pi_h\, \delta(h, h')\big)_{h, h' \in [k]} \quad \text{and} \quad C_{HHH} = E_H[e_H \otimes e_H \otimes e_H] = \big(\pi_h\, \delta(h, h')\, \delta(h', h'')\big)_{h, h', h'' \in [k]},$$

are diagonal tensors. Making use of $C_{HH}$ and $C_{HHH}$, and the factorization of the distributions $P(X_1, X_2)$ and $P(X_1, X_2, X_3)$, we obtain the factorization of the embedding of $P(X_1, X_2)$ (second order embedding),

$$C_{X_1X_2} = \sum_{h\in[k]} P(h)\, \mu_{X_1|h} \otimes \mu_{X_2|h} = C_{X|H}\Big(\sum_{h\in[k]} P(h)\, e_h \otimes e_h\Big) C_{X|H}^\top = C_{X|H}\, C_{HH}\, C_{X|H}^\top, \qquad (8)$$

and that of $P(X_1, X_2, X_3)$ (third order embedding),

$$C_{X_1X_2X_3} = C_{HHH} \times_1 C_{X|H} \times_2 C_{X|H} \times_3 C_{X|H}. \qquad (9)$$

4.3. Identifiability of Parameters

We note that $C_{X|H} = (\mu_{X|h=1}, \mu_{X|h=2}, \ldots, \mu_{X|h=k})$, and the kernel embeddings $C_{X_1X_2}$ and $C_{X_1X_2X_3}$ can alternatively be written as

$$C_{X_1X_2} = \sum_{h\in[k]} \pi_h\, \mu_{X|h} \otimes \mu_{X|h}, \qquad (10)$$

$$C_{X_1X_2X_3} = \sum_{h\in[k]} \pi_h\, \mu_{X|h} \otimes \mu_{X|h} \otimes \mu_{X|h}. \qquad (11)$$

Allman et al. (2009) showed that, under mild conditions, a finite mixture of nonparametric product distributions is identifiable. The multi-view latent variable model in (10) and (11) has the same form as a finite mixture of nonparametric product distributions, and therefore we can adapt Allman et al.'s results to the current setting.

Proposition 1 (Identifiability) Let $P(X_1, X_2, X_3)$ be a multi-view latent variable model such that the conditional distributions $\{P(X|h)\}_{h\in[k]}$ are linearly independent. Then the set of parameters $\{\pi_h, \mu_{X|h}\}_{h\in[k]}$ is identifiable from $C_{X_1X_2X_3}$, up to label swapping of the hidden variable $H$.

Example 1. The probability vector of a discrete variable $X \in [n]$, and the joint probability table of two discrete variables $X_1 \in [n]$ and $X_2 \in [n]$, are both kernel embeddings. To see this, let the kernel be the Kronecker delta kernel $\kappa(x, x') = \delta(x, x')$, whose feature map $\phi(x)$ is the standard basis vector $e_x$ in $\mathbb{R}^n$: the $x$-th entry of $e_x$ is 1 and all other entries are 0. Then $\mu_X = \big(P(x=1), \ldots, P(x=n)\big)^\top$ and $C_{X_1X_2} = \big(P(x_1 = s, x_2 = t)\big)_{s,t\in[n]}$. We require the conditional probability table $\{P(X|h)\}_{h\in[k]}$ to have full column rank for identifiability in this case.

Example 2. Suppose we have a $k$-component mixture of one-dimensional spherical Gaussian distributions. The Gaussian components have identical variance $\sigma^2$, but their mean values are distinct. Note that this model is not identifiable under the framework of Hsu & Kakade (2013), since the mean values are just scalars and therefore rank deficient. However, if we embed the density functions using a universal kernel, such as the Gaussian RBF kernel, it can be shown that the mixture model becomes identifiable. This is because we are working with the entire density functions, which are linearly independent from each other in this case. Thus, the nonparametric framework allows us to incorporate a wider range of latent variable models.

Finally, we remark that the identifiability result in Proposition 1 can be extended to cases where the conditional distributions do not satisfy linear independence, i.e., they are overcomplete, e.g., (Kruskal, 1977; De Lathauwer et al., 2007; Anandkumar et al., 2013b). However, in general, it is not tractable to learn such overcomplete models, and we do not consider them here.
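As a quick numerical sanity check (ours, not from the paper), the following snippet verifies the factorizations (10) and (11) in the discrete, delta-kernel setting of Example 1, where $\mu_{X|h}$ is simply the probability vector $P(\cdot \mid h)$.

```python
import numpy as np

# Toy check of Eqs. (10) and (11) with the Kronecker delta kernel, where
# mu_{X|h} is the column P(.|h) of the conditional probability table.
n, k = 4, 2
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(k))                        # prior P(h)
M = rng.dirichlet(np.ones(n), size=k).T               # n x k, columns mu_{X|h}

# Second and third order embeddings built from the factorized form
C12 = sum(pi[h] * np.outer(M[:, h], M[:, h]) for h in range(k))            # Eq. (10)
C123 = sum(pi[h] * np.einsum('i,j,l->ijl', M[:, h], M[:, h], M[:, h])
           for h in range(k))                                              # Eq. (11)

# They coincide with the joint probability table / tensor of the mixture model
P12 = np.einsum('ih,jh,h->ij', M, M, pi)
P123 = np.einsum('ih,jh,lh,h->ijl', M, M, M, pi)
print(np.allclose(C12, P12), np.allclose(C123, P123))   # True True
```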
5. Kernel Algorithm

We first design a kernel algorithm to recover the parameters $\{\pi_h, \mu_{X|h}\}_{h\in[k]}$ of the multi-view latent variable model based on $C_{X_1X_2}$ and $C_{X_1X_2X_3}$. This can be easily extended to the sample version, as discussed in Section 5.2. Again for simplicity of exposition, the algorithm is explained for the symmetric view case; the more general version is presented in Appendix 9.

5.1. Population Case

We first derive the algorithm for the population case, as if we could access the true operators $C_{X_1X_2}$ and $C_{X_1X_2X_3}$. Its finite sample counterpart will be presented in the next section. The algorithm can be thought of as a kernel generalization of the algorithm in Anandkumar et al. (2013a) using embedding representations.

Step 1. We perform an eigen-decomposition of $C_{X_1X_2} = \sum_{i=1}^{\infty} \sigma_i\, u_i \otimes u_i$, where the eigenvalues are ordered in nonincreasing manner. According to the factorization in Eq. (8), $C_{X_1X_2}$ has rank $k$. Let the leading eigenvectors corresponding to the largest $k$ eigenvalues be $U_k := (u_1, u_2, \ldots, u_k)$, and the eigenvalue matrix be $S_k := \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$. We define the whitening operator $W := U_k S_k^{-1/2}$, which satisfies

$$W^\top C_{X_1X_2} W = \big(W^\top C_{X|H} C_{HH}^{1/2}\big)\big(W^\top C_{X|H} C_{HH}^{1/2}\big)^\top = I,$$

and hence $M := W^\top C_{X|H} C_{HH}^{1/2}$ is an orthogonal matrix.

Step 2. We apply the whitening operator to the third order kernel embedding $C_{X_1X_2X_3}$:

$$\mathcal{T} := C_{X_1X_2X_3} \times_1 W^\top \times_2 W^\top \times_3 W^\top.$$

According to the factorization in Eq. (9), $\mathcal{T} = C_{HHH}^{-1/2} \times_1 M \times_2 M \times_3 M$, which is a tensor with orthogonal factors. Essentially, each column $v_i$ of $M$ is an eigenvector of $\mathcal{T}$.

Step 3. We use the tensor power method to find the leading $k$ eigenvectors $M$ of $\mathcal{T}$ (Anandkumar et al., 2012a). The corresponding $k$ eigenvalues $\lambda = (\lambda_1, \ldots, \lambda_k)^\top$ will then be equal to $\big(P(h=1)^{-1/2}, \ldots, P(h=k)^{-1/2}\big)$. The tensor power method is provided in Algorithm 2 in the Appendix for completeness.

Step 4. We recover the conditional embedding operator by undoing the whitening step, $C_{X|H} = (W^\top)^{\dagger}\, M\, \mathrm{diag}(\lambda)$.

5.2. Finite Sample Case

Given $m$ observations $\mathcal{D}_{X_1X_2X_3} = \{(x_1^i, x_2^i, x_3^i)\}_{i\in[m]}$ drawn i.i.d. from a multi-view latent variable model $P(X_1, X_2, X_3)$, we now design a kernel algorithm to estimate the latent parameters from data. Although the empirical kernel embeddings can be infinite dimensional, we can carry out the decomposition using just the kernel matrices. We denote the implicit feature matrices by

$$\Phi := \big(\phi(x_1^1), \ldots, \phi(x_1^m), \phi(x_2^1), \ldots, \phi(x_2^m)\big) \quad \text{and} \quad \Psi := \big(\phi(x_2^1), \ldots, \phi(x_2^m), \phi(x_1^1), \ldots, \phi(x_1^m)\big),$$

and the corresponding kernel matrices by $K = \Phi^\top\Phi$ and $L = \Psi^\top\Psi$, respectively. We denote by $K_{:x} := \Phi^\top \phi(x)$ a column vector containing the kernel values between $x$ and the data points in $\Phi$. For three vectors $\xi_1$, $\xi_2$ and $\xi_3$, denote the symmetric tensor obtained from their outer products by $[\xi_1, \xi_2, \xi_3] := \xi_1 \otimes \xi_2 \otimes \xi_3 + \xi_3 \otimes \xi_1 \otimes \xi_2 + \xi_2 \otimes \xi_3 \otimes \xi_1$. Then the steps in the population case can be mapped one-by-one into kernel operations.

Step 1. We perform a kernel eigenvalue decomposition of the empirical second order embedding $\hat C_{X_1X_2} := \frac{1}{2m}\sum_{i\in[m]}\big(\phi(x_1^i)\otimes\phi(x_2^i) + \phi(x_2^i)\otimes\phi(x_1^i)\big)$, which can be expressed succinctly as $\hat C_{X_1X_2} = \frac{1}{2m}\Phi\Psi^\top$. Its leading $k$ eigenvectors $\hat U_k = (\hat u_1, \ldots, \hat u_k)$ lie in the span of the columns of $\Phi$, i.e., $\hat U_k = \Phi(\beta_1, \ldots, \beta_k)$ with $\beta \in \mathbb{R}^{2m}$. Then we can transform the eigenvalue decomposition problem for an infinite dimensional matrix into a problem involving finite dimensional kernel matrices,

$$\hat C_{X_1X_2}\hat C_{X_1X_2}^\top u = \hat\sigma^2 u \;\Rightarrow\; \frac{1}{4m^2}\,\Phi\Psi^\top\Psi\Phi^\top\Phi\beta = \hat\sigma^2 \Phi\beta \;\Rightarrow\; \frac{1}{4m^2}\, K L K \beta = \hat\sigma^2 K\beta.$$
Let the Cholesky decomposition of $K$ be $R^\top R$. Then, by redefining $\tilde\beta = R\beta$ and solving the eigenvalue problem

$$\frac{1}{4m^2}\, R L R^\top \tilde\beta = \hat\sigma^2 \tilde\beta, \qquad \text{we obtain} \quad \beta = R^{-1}\tilde\beta. \qquad (12)$$

The resulting eigenvectors satisfy $u_i^\top u_{i'} = \beta_i^\top \Phi^\top \Phi\, \beta_{i'} = \beta_i^\top K \beta_{i'} = \tilde\beta_i^\top \tilde\beta_{i'} = \delta_{ii'}$.

Step 2. We whiten the empirical third order embedding $\hat C_{X_1X_2X_3} := \frac{1}{3m}\sum_{i\in[m]}\big[\phi(x_1^i), \phi(x_2^i), \phi(x_3^i)\big]$ using $\hat W := \hat U_k \hat S_k^{-1/2}$, and obtain

$$\hat{\mathcal{T}} := \frac{1}{3m}\sum_{i\in[m]} \big[\hat\xi_1^i,\, \hat\xi_2^i,\, \hat\xi_3^i\big], \qquad \hat\xi_t^i := \hat S_k^{-1/2}(\beta_1, \ldots, \beta_k)^\top K_{:x_t^i}.$$

Step 3. We run the tensor power method (Anandkumar et al., 2012a) on the finite dimensional tensor $\hat{\mathcal{T}}$ to obtain its leading $k$ eigenvectors $\hat M := (\hat v_1, \ldots, \hat v_k)$ and the corresponding eigenvalues $\hat\lambda := (\hat\lambda_1, \ldots, \hat\lambda_k)^\top$.

Step 4. The estimates of the conditional embeddings are $\hat C_{X|H} = \Phi\,(\beta_1, \ldots, \beta_k)\,\hat S_k^{1/2}\,\hat M\, \mathrm{diag}(\hat\lambda)$.

The overall kernel algorithm is summarized in Algorithm 1.

Algorithm 1 Kernel Spectral Algorithm
In: Kernel matrices $K$ and $L$, and desired rank $k$
Out: A vector $\hat\pi \in \mathbb{R}^k$ and a matrix $A \in \mathbb{R}^{2m \times k}$
1: Cholesky decomposition: $K = R^\top R$
2: Eigen-decomposition: $\frac{1}{4m^2} R L R^\top \tilde\beta = \hat\sigma^2 \tilde\beta$
3: Use the $k$ leading eigenvalues: $\hat S_k = \mathrm{diag}(\hat\sigma_1, \ldots, \hat\sigma_k)$
4: Use the $k$ leading eigenvectors $(\tilde\beta_1, \ldots, \tilde\beta_k)$ to compute $(\beta_1, \ldots, \beta_k) = R^{-1}(\tilde\beta_1, \ldots, \tilde\beta_k)$
5: Form the tensor $\hat{\mathcal{T}} = \frac{1}{3m}\sum_{i\in[m]} [\hat\xi_1^i, \hat\xi_2^i, \hat\xi_3^i]$ with $\hat\xi_t^i = \hat S_k^{-1/2}(\beta_1, \ldots, \beta_k)^\top K_{:x_t^i}$
6: Tensor power method: eigenvectors $\hat M := (\hat v_1, \ldots, \hat v_k)$ and eigenvalues $\hat\lambda := (\hat\lambda_1, \ldots, \hat\lambda_k)^\top$ of $\hat{\mathcal{T}}$
7: $A = (\beta_1, \ldots, \beta_k)\, \hat S_k^{1/2}\, \hat M\, \mathrm{diag}(\hat\lambda)$
8: $\hat\pi = (\hat\lambda_1^{-2}, \ldots, \hat\lambda_k^{-2})^\top$
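To make the pipeline concrete, here is a condensed numpy sketch of the finite-sample procedure. It follows the steps of Algorithm 1 for the symmetric-view setting, but it is only an illustrative reimplementation, not the authors' code: the function names, the jitter added before the Cholesky factorization, the numbers of restarts and iterations, and the use of a plain (rather than robust) tensor power method in place of Algorithm 2 of the appendix are all our own choices.

```python
import numpy as np

def rbf(X, Z, s):
    """Normalized Gaussian RBF kernel matrix, Eq. (2)."""
    d = X.shape[1]
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s) ** d

def tensor_power(T, k, n_iter=100, n_restart=10, rng=None):
    """Plain tensor power method with deflation; a simplified stand-in for the
    robust version of Anandkumar et al. (2012a)."""
    rng = np.random.default_rng() if rng is None else rng
    T = T.copy()
    V, lams = [], []
    for _ in range(k):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restart):
            v = rng.normal(size=T.shape[0]); v /= np.linalg.norm(v)
            for _ in range(n_iter):
                v = np.einsum('abc,b,c->a', T, v, v); v /= np.linalg.norm(v)
            lam = np.einsum('abc,a,b,c->', T, v, v, v)
            if lam > best_lam:
                best_v, best_lam = v, lam
        V.append(best_v); lams.append(best_lam)
        T -= best_lam * np.einsum('a,b,c->abc', best_v, best_v, best_v)  # deflate
    return np.stack(V, axis=1), np.array(lams)

def kernel_spectral(X1, X2, X3, k, s, jitter=1e-8, rng=None):
    """Sketch of Algorithm 1 (symmetric views): returns pi_hat and the coefficient
    matrix A such that mu_hat_{X|h}(.) = sum_j A[j, h] * kappa(a_j, .)."""
    m = X1.shape[0]
    A_pts = np.vstack([X1, X2])                       # columns of Phi: phi(x_1^i), phi(x_2^i)
    B_pts = np.vstack([X2, X1])                       # columns of Psi: phi(x_2^i), phi(x_1^i)
    K = rbf(A_pts, A_pts, s)                          # K = Phi^T Phi
    L = rbf(B_pts, B_pts, s)                          # L = Psi^T Psi
    R = np.linalg.cholesky(K + jitter * np.eye(2 * m)).T          # K = R^T R (line 1)
    evals, evecs = np.linalg.eigh(R @ L @ R.T / (4 * m ** 2))     # line 2
    idx = np.argsort(evals)[::-1][:k]
    sigma = np.sqrt(np.maximum(evals[idx], 1e-12))                # line 3: S_k = diag(sigma)
    beta = np.linalg.solve(R, evecs[:, idx])                      # line 4: beta = R^{-1} beta_tilde
    xi = lambda X: (beta.T @ rbf(A_pts, X, s)) / np.sqrt(sigma)[:, None]  # S^{-1/2} beta^T K_{:x}
    Xi = [xi(X) for X in (X1, X2, X3)]
    T = np.zeros((k, k, k))
    for a, b, c in [(0, 1, 2), (2, 0, 1), (1, 2, 0)]:             # line 5: cyclic symmetrization
        T += np.einsum('ai,bi,ci->abc', Xi[a], Xi[b], Xi[c])
    T /= 3 * m
    M, lam = tensor_power(T, k, rng=rng)                          # line 6
    A = (beta * np.sqrt(sigma)[None, :]) @ (M * lam[None, :])     # line 7
    pi_hat = lam ** -2.0                                          # line 8
    return pi_hat, A, A_pts

# Conditional density estimate from the recovered embedding:
# p_hat(x|h) = <phi(x), mu_hat_{X|h}> = sum_j A[j, h] * kappa(a_j, x).
def cond_density(x, h, A, A_pts, s):
    return rbf(A_pts, np.atleast_2d(x), s)[:, 0] @ A[:, h]
```

The returned matrix A plays the role of $(\beta_1,\ldots,\beta_k)\,\hat S_k^{1/2}\,\hat M\,\mathrm{diag}(\hat\lambda)$ in Step 4, so that $\hat\mu_{X|h}(\cdot) = \sum_j A_{jh}\,\kappa(a_j, \cdot)$, which is the conditional density estimate discussed in Remark 3 below.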
6. Sample Complexity

Let $\kappa := \sup_{x\in\mathcal{X}} \kappa(x, x)$, let $\|\cdot\|$ be the Hilbert-Schmidt norm, let $\pi_{\min} := \min_{i\in[k]} \pi_i$, and let $\sigma_k(C_{X_1X_2})$ be the $k$-th largest singular value of $C_{X_1X_2}$. In the following, we provide sample complexity bounds for the estimated conditional embeddings $\mu_{X|h}$ and the corresponding prior distribution (the proof is in Appendix 11).

Theorem 2 Pick any $\delta \in (0, 1)$. Suppose the number of samples $m$ satisfies

$$m \;\geq\; \theta^2 \log\frac{2}{\delta}, \qquad \theta := \max\left\{\frac{C_3\, k^2\, \kappa}{\sigma_k(C_{X_1X_2})},\; C_4\, k^{2/3}\right\}$$

for some constants $C_3, C_4 > 0$, and the number of iterations $N$ and the number of random initialization vectors $L$ (drawn uniformly on the sphere $\mathbb{S}^{k-1}$) satisfy $N \geq C_2\big(\log(k) + \log\log(1/\varepsilon_{\mathcal{T}})\big)$ for a constant $C_2 > 0$ and $L = \mathrm{poly}(k)\log(1/\delta)$. Then the robust power method in (Anandkumar et al., 2012a) yields eigenpairs $(\hat\lambda_i, \hat v_i)$ such that there exists a permutation $\rho$ of $[k]$ for which, with probability $1 - 4\delta$,

$$\big\|\mu_{X|h=j} - \Phi\,(\beta_1, \ldots, \beta_k)\,\hat S_k^{1/2}\,\hat v_{\rho(j)}\big\|_{\mathcal{F}} \;\leq\; 8\,\varepsilon_{\mathcal{T}}\,\pi_j^{1/2} \quad \text{and} \quad \big|\pi_j^{-1/2} - \hat\lambda_{\rho(j)}\big| \;\leq\; 5\,\varepsilon_{\mathcal{T}}, \qquad \forall j \in [k],$$

and

$$\Big\|\hat{\mathcal{T}} - \sum\nolimits_{j=1}^{k} \hat\lambda_j\, \hat v_j^{\otimes 3}\Big\| \;\leq\; 55\,\varepsilon_{\mathcal{T}},$$

where $\varepsilon_{\mathcal{T}} := \|\hat{\mathcal{T}} - \mathcal{T}\|$ is the tensor perturbation error; it is bounded by a term of order $\sqrt{\log(2/\delta)/m}$ whose prefactor is polynomial in $\kappa$, $1/\sigma_k(C_{X_1X_2})$ and $1/\sqrt{\pi_{\min}}$ (see Appendix 11).

Proof Sketch: Our proof is different from those in Anandkumar et al. (2012a), which only analyze the perturbation of the tensor decomposition. Our proof further takes into account the error introduced by the approximate whitening step and its effect on the tensor decomposition.

Remark 1: We note that the sample complexity is $\mathrm{poly}(k, \kappa, 1/\pi_{\min}, 1/\sigma_k(C_{X_1X_2}))$ of a low order, and in particular, it is $O(k^2)$ when the other parameters are fixed. For the special case of discrete measurements, where the kernel is $\kappa(x, x') = \delta(x, x')$, we have $\kappa = 1$. Note that the sample complexity in this case depends only on the number of components $k$ and not on the dimensionality of the observed state space.

Remark 2: Theorem 2 also gives us an error bound for estimating the integral of a function $f \in \mathcal{F}$ with respect to a mixture component in an unsupervised fashion. Under the conditions specified in the theorem, we have

$$\left|\int_{\mathcal{X}} f(x)\, dP(x \mid h) - \langle f, \hat\mu_{X|h}\rangle_{\mathcal{F}}\right| \;\leq\; \|f\|_{\mathcal{F}}\, \big\|\mu_{X|h} - \hat\mu_{X|h}\big\|_{\mathcal{F}},$$

assuming $\|f\|_{\mathcal{F}}$ is bounded and $\kappa/\sigma_k = O(1)$. We are not aware of any other result of this type in the unsupervised setting.

Remark 3: For $x \in \mathbb{R}^d$ and the normalized Gaussian RBF kernel in (2), the recovered conditional embedding $\hat\mu_{X|h}$ can be used to estimate the conditional density, $\hat p(x|h) := \langle \phi(x), \hat\mu_{X|h}\rangle_{\mathcal{F}}$. In this case, the error can be decomposed into two terms,

$$|p(x|h) - \hat p(x|h)| \;\leq\; \underbrace{|p(x|h) - \tilde p(x|h)|}_{O(s^2)\ \text{bias as in (7)}} \;+\; \underbrace{|\tilde p(x|h) - \hat p(x|h)|}_{\text{estimation error}},$$

where $s$ is the kernel bandwidth and $\tilde p$ is the density convolved with the kernel function. The estimation error is bounded by

$$|\tilde p(x|h) - \hat p(x|h)| \;\leq\; \|\phi(x)\|_{\mathcal{F}}\, \|\mu_{X|h} - \hat\mu_{X|h}\|_{\mathcal{F}} = O(\kappa^{1/2} m^{-1/2}) = O(s^{-d/2} m^{-1/2}),$$

assuming $\kappa/\sigma_k = O(1)$ and using $\kappa = O(s^{-d})$. Under the conditions specified in Theorem 2, we combine the analysis of the two sources of error and obtain the bound $|p(x|h) - \hat p(x|h)| = O(s^2 + s^{-d/2} m^{-1/2})$. We then have $|p(x|h) - \hat p(x|h)| = O(m^{-2/(4+d)})$ if we balance the two terms by setting $s = O(m^{-1/(4+d)})$. We are not aware of any other result of this type in the unsupervised setting.

7. Discussion

Our algorithm and theoretical results can also be generalized to the settings of latent variable models with Dirichlet priors and nonparametric independent component analysis (ICA), as in Anandkumar et al. (2012a). In the first setting, a Dirichlet prior is placed on the mixing weights of the multi-view latent variables,

$$P(\pi) = \frac{\Gamma(\alpha_0)}{\prod_{i\in[k]}\Gamma(\alpha_i)}\, \prod_{i\in[k]} \pi_i^{\alpha_i - 1}, \qquad \alpha_0 := \sum_{i\in[k]} \alpha_i, \quad \alpha_i > 0,$$

where $\Gamma(\cdot)$ is the Gamma function. In this case, we only need to modify the second and third order kernel embeddings $C_{X_1X_2}$ and $\mathcal{T}$, respectively, and then Algorithm 1 applies. In the nonparametric ICA setting, the feature map $\phi(X)$ of an observed variable $X$ is assumed to be generated from a latent vector $H \in \mathbb{R}^k$ with independent coordinates via an operator $A: \mathbb{R}^k \mapsto \mathcal{F}$, i.e., $\phi(X) := A H + Z$, where $Z$ is a zero mean random vector independent of $H$. In this case, we need to start with a modified fourth order kernel embedding, and then reduce to a multi-view problem and estimate $A$ via Algorithm 1.

8. Experiments

Methods. We compared our kernel spectral algorithm with four alternatives:

1. The EM algorithm for mixtures of Gaussians. The EM algorithm is not guaranteed to find the global solution in each trial, so we randomly initialize it 10 times.
2. The EM-like algorithm for mixtures of nonparametric densities (Benaglia et al., 2009). We initialize the algorithm with k-means, as in Benaglia et al. (2009).
3. The spectral algorithm for mixtures of spherical Gaussians (Hsu & Kakade, 2013). Its assumption is restrictive: the centers of the Gaussians need to span a k-dimensional subspace, so it is not applicable in the rank-deficient case where $k > \ell$.
4. A discretization based spectral algorithm (Kasahara & Shimotsu, 2010). This algorithm approximates the joint distribution of the observed variables with a histogram and then applies the spectral algorithm to recover the discretized conditional densities.

Both our method and the algorithm of Benaglia et al. (2009) have a hyper-parameter, the kernel bandwidth, which we selected for each view separately using cross-validation.

8.1. Synthetic Data

We generated three-dimensional synthetic data from various mixture models. The variables corresponding to the dimensions are independent given the latent component indicator. More specifically, we explored two settings: (1) Gaussian conditional densities with different variances; (2) a mixture of Gaussian and shifted Gamma conditional densities. The shifted Gamma distribution has density

$$p(x|h) = \frac{(x - \mu)^{d-1}\, e^{-(x-\mu)}}{\Gamma(d)}, \qquad x \geq \mu,$$

where we chose the shape parameter $d$ such that the density is very skewed. Furthermore, we chose the mean and variance parameters of the Gaussian/Gamma densities such that the pairwise overlap between components is relatively small, as measured by the Fisher ratio $(\mu_1 - \mu_2)^2/(\sigma_1^2 + \sigma_2^2)$.
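To make the second setting concrete, here is a small generator for a Gaussian/shifted-Gamma mixture with $k = 2$ and three conditionally independent views. The specific means, variances and scales below are our own illustrative choices, not the paper's exact settings, which are described next.

```python
import numpy as np

# Illustrative generator (our own parameter values) for the Gaussian/Gamma setting:
# each of the three views is drawn independently given h from either a Gaussian
# or a shifted Gamma component.
def sample_multiview(m, pi, rng):
    h = rng.choice(len(pi), size=m, p=pi)                 # latent component indicator
    views = []
    for _ in range(3):                                    # conditionally independent views
        x = np.where(h == 0,
                     rng.normal(loc=0.0, scale=1.0, size=m),          # Gaussian component
                     5.0 + rng.gamma(shape=1.0, scale=2.0, size=m))   # shifted Gamma, mu = 5
        views.append(x[:, None])
    return views, h

rng = np.random.default_rng(0)
# Unbalanced weights (pi_h = 2h / (k(k+1)) for k = 2, as described next).
(X1, X2, X3), h = sample_multiview(2000, pi=[1/3, 2/3], rng=rng)
```

The latent labels h are kept only for evaluation; the algorithms see just X1, X2, X3.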
We also varied the number of samples $m$ for the observed variables $X_1, X_2$ and $X_3$ from 50 to 10,000, and experimented with $k = 2, 3, 4$ or $8$ mixture components. The mixture proportion of the $h$-th component is set to $\pi_h = \frac{2h}{k(k+1)}$, $\forall h \in [k]$ (unbalanced). It is worth noting that as $k$ becomes larger, it is more difficult to recover the parameters, because only a small amount of data is generated for the first several clusters. For every $(m, k)$ pair in each setting, we randomly generated 10 sets of samples and reported the average results. We note that the values of the latent variable are not given to the algorithms, and hence this is an unsupervised setting in which we recover the conditional densities $p(x|h)$ and the proportions $p(h)$.

Error measure. We measured the performance of the algorithms by the weighted $\ell_2$-norm difference $\sum_{h\in[k]} \pi_h \big(\sum_{j} (p(x_j|h) - \hat p(x_j|h))^2\big)^{1/2}$, where $\{x_j\}$ is a set of uniformly-spaced test points.

Results. We first illustrated the actual recovered conditional densities of our method and EM-GMM in Figure 2 as a concrete example. The kernel spectral algorithm nicely recovers both the Gaussian and the Gamma components, while EM-GMM fails to fit the Gamma component.

Figure 2. Recovered conditional densities (true and estimated components): (a) EM for a Gaussian mixture; (b) kernel spectral. The kernel spectral algorithm is able to adapt to the shape of the mixture components, while the EM algorithm for a mixture of Gaussians misfits the Gamma distribution.

More quantitative results are plotted in Figure 3. It is clear that the kernel spectral method converges rapidly as the sample size increases in all experimental settings. In the mixture of Gaussians setting, the EM algorithm is best, since the model is correctly specified. The spectral algorithm for spherical Gaussians does not perform well, since its assumptions are too restrictive. The performance of our kernel method converges to that of the EM algorithm. In the mixture of Gaussian and Gamma setting, our kernel spectral algorithm achieves superior results compared to the other algorithms. These results demonstrate that our algorithm is able to automatically adapt to the shape of the densities.

It is worth noting that both the discretized spectral algorithm and the nonparametric EM-like algorithm did not perform as well. In the discretized spectral method, the joint distribution is estimated by a histogram, and it is well known that histogram estimation suffers from poor performance even for 3-dimensional data. In the nonparametric EM-like algorithm, besides the issue of local minima, the performance also depends heavily on the initialization, and the flexibility of nonparametric densities without regularization makes overfitting quite severe, often causing the algorithm to collapse onto a single component.

We also note that our method outperforms EM-GMM by a larger margin as the number of components increases. This is a key advantage of our method in that it has favorable performance in higher dimensions, which agrees with the theoretical result in Theorem 2 that the sample complexity depends only quadratically on the number of components when the other parameters are held fixed.
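For completeness, the weighted $\ell_2$ error measure described at the beginning of this subsection can be computed as follows. This is a minimal sketch with our own normalization conventions, assuming the true densities, the estimates, and the mixing weights are available on the grid of test points.

```python
import numpy as np

def weighted_l2_error(p_true, p_hat, pi):
    """Weighted L2 difference between true and estimated conditional densities.
    p_true, p_hat: arrays of shape (k, n_test) evaluated on uniformly spaced
    test points; pi: mixing weights of shape (k,)."""
    per_component = np.sqrt(((p_true - p_hat) ** 2).sum(axis=1))
    return float(np.asarray(pi) @ per_component)
```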
8.2. Flow Cytometry Data

Flow cytometry (FCM) data are multivariate measurements from flow cytometers that record the light scatter and fluorescence emission properties of hundreds of thousands of individual cells. They are important for studying the cell structures of normal and abnormal cells and for diagnosing human diseases. Aghaeepour et al. (2013) introduced the FlowCAP challenge, whose main task is to group flow cytometry data automatically. Clustering FCM data is a difficult task because the distribution of the data is non-Gaussian and heavily skewed.

We use the DLBCL Lymphoma dataset collection from (Aghaeepour et al., 2013) to compare our kernel algorithm with the four alternatives. This collection contains 24 datasets with two or three clusters, and each dataset consists of tens of thousands of cell measurements in 5 dimensions. Each dataset is a separate clustering task, so we fit a multi-view model to each dataset separately and use the maximum-a-posteriori assignment to obtain the cluster labels. All the cell measurements have been manually labeled, so we can evaluate the clustering performance using the f-score (Aghaeepour et al., 2013).

Figure 3. Results as a function of sample size (×10³) for EM GMM, Nonparametric EM, Kernel Spectral, Histogram Spectral, and Hsu et al.: (a)-(d) mixtures of Gaussian distributions with k = 2, 3, 4, 8 components; (e)-(h) mixtures of Gaussian/Gamma distributions with k = 2, 3, 4, 8. In the former case, the performance of the kernel spectral algorithm converges to that of the EM algorithm for the mixture of Gaussians model. In the latter case, the kernel spectral algorithm is consistently much better than the EM algorithm for the mixture of Gaussians model. The spherical Gaussian spectral algorithm does not work for k = 4, 8, since k > ℓ (= 3) causes rank deficiency.

Figure 4. Clustering results (f-score) on the datasets from the DLBCL flow cytometry collection: (a) datasets with k = 2 clusters; (b) datasets with k = 3 clusters. The results for the spherical Gaussian spectral algorithm (Hsu et al.) are not plotted for datasets on which it has a rank deficiency problem. The datasets are ordered by increasing sample size.

We split the 5 dimensions into three views: dimensions 1 and 2 as the first view, dimensions 3 and 4 as the second, and dimension 5 as the third. The split is based on the correlation between views, since we would like the views to satisfy the conditional independence assumption and thus ensure good performance of the kernel spectral method. For each dataset, we select the best kernel bandwidth by 5-fold cross-validation using the log-likelihood.

Figure 4 presents the results sorted by the number of clusters. Since the data are collapsed in most cases, the centers cannot span a subspace with enough rank; thus the method of (Hsu & Kakade, 2013) is not applicable.
However, our method (kernel spectral) outperforms EM-GMM as well as the other algorithms on a majority of the datasets. There are also datasets where the kernel spectral algorithm has a large gap in performance compared to GMM. These are the datasets where the multi-view assumptions are heavily violated. For example, in some datasets, the correlation coefficient between dimensions 3 and 5 is as high as 0.927 given a particular cluster label, indicating strong correlation between the two views. Obtaining improved and robust performance on these datasets will be a subject of future study, in which we plan to develop even more robust kernel spectral algorithms.

Acknowledgement

L.S. is supported in part by NSF IIS1116886, NSF/NIH BIGDATA 1R01GM108341, NSF CAREER IIS1350983 and a Raytheon Faculty Fellowship. A.A. is supported in part by a Microsoft Faculty Fellowship, NSF CAREER CCF1254106, NSF CCF1219234, NSF BIGDATA IIS1251267 and ARO YIP Award W911NF-13-1-0084.

References

Aghaeepour, Nima, Finak, Greg, The FlowCAP Consortium, The DREAM Consortium, Hoos, Holger, Mosmann, Tim R., Brinkman, Ryan, Gottardo, Raphael, and Scheuermann, Richard H. Critical assessment of automated flow cytometry data analysis techniques. Nature Methods, 10(3):228-238, 2013.

Allman, Elizabeth, Matias, Catherine, and Rhodes, John. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099-3132, 2009.

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor methods for learning latent variable models. Available at arXiv:1210.7559, Oct. 2012a.

Anandkumar, A., Ge, R., Hsu, D., and Kakade, S. M. A tensor spectral approach to learning mixed membership community models. arXiv:1302.2684, Feb. 2013a.

Anandkumar, A., Hsu, D., Janzamin, M., and Kakade, S. M. When are overcomplete topic models identifiable? Uniqueness of tensor Tucker decompositions with structured sparsity. arXiv:1308.2853, Aug. 2013b.

Anandkumar, Animashree, Foster, Dean P., Hsu, Daniel, Kakade, Sham M., and Liu, Yi-Kai. A spectral algorithm for latent Dirichlet allocation. Available at arXiv:1204.6703, 2012b.

Benaglia, Tatiana, Chauveau, Didier, and Hunter, David R. An EM-like algorithm for semi- and nonparametric estimation in multivariate mixtures. Journal of Computational and Graphical Statistics, 18(2):505-526, 2009.

De Lathauwer, L., Castaing, J., and Cardoso, J.-F. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Trans. on Signal Processing, 55:2965-2973, June 2007.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1-22, 1977.

Foster, D. P., Rodu, J., and Ungar, L. H. Spectral dimensionality reduction for HMMs. arXiv preprint arXiv:1203.6130, 2012.

Hsu, D., Kakade, S., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Proc. Annual Conf. Computational Learning Theory, 2009.

Hsu, Daniel and Kakade, Sham M. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, ITCS '13, pp. 11-20, 2013.

Kasahara, Hiroyuki and Shimotsu, Katsumi. Nonparametric identification of multivariate mixtures. Journal of the Royal Statistical Society - Series B, 2010.

Kolda, Tamara G. and Bader, Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
Kruskal, J. B. Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95-138, 1977.

Parikh, A., Song, L., and Xing, E. P. A spectral algorithm for latent tree graphical models. In Proceedings of the International Conference on Machine Learning, 2011.

Rosasco, L., Belkin, M., and De Vito, E. On learning with integral operators. Journal of Machine Learning Research, 11:905-934, 2010.

Schölkopf, B., Tsuda, K., and Vert, J.-P. Kernel Methods in Computational Biology. MIT Press, Cambridge, MA, 2004.

Sgouritsa, Eleni, Janzing, Dominik, Peters, Jonas, and Schölkopf, Bernhard. Identifying finite mixtures of nonparametric product distributions and causal inference of confounders. In Conference on Uncertainty in Artificial Intelligence (UAI), 2013.

Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pp. 13-31. Springer, 2007.

Song, L. and Dai, B. Robust low rank kernel embedding of multivariate distributions. In Neural Information Processing Systems (NIPS), 2013.

Song, L., Boots, B., Siddiqi, S., Gordon, G., and Smola, A. J. Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning, 2010.

Song, L., Parikh, A., and Xing, E. P. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems, volume 25, 2011.

Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., and Schölkopf, B. Injective Hilbert space embeddings of probability measures. In Proc. Annual Conf. Computational Learning Theory, pp. 111-122, 2008.

Wasserman, L. All of Nonparametric Statistics. Springer, 2006.