# Learning Latent Variable Gaussian Graphical Models

Zhaoshi Meng (MENGZS@UMICH.EDU), Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

Brian Eriksson (BRIAN.ERIKSSON@TECHNICOLOR.COM), Technicolor Research Center, 735 Emerson Street, Palo Alto, CA 94301, USA

Alfred O. Hero III (HERO@EECS.UMICH.EDU), Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

## Abstract

Gaussian graphical models (GGM) have been widely used in many high-dimensional applications, ranging from biological and financial data to recommender systems. Sparsity in GGM plays a central role both statistically and computationally. Unfortunately, real-world data often do not fit sparse graphical models well. In this paper, we focus on a family of latent variable Gaussian graphical models (LVGGM), in which the model is conditionally sparse given latent variables but marginally non-sparse. In LVGGM, the inverse covariance matrix has a low-rank plus sparse structure and can be learned in a regularized maximum likelihood framework. We derive novel parameter estimation error bounds for LVGGM under mild conditions in the high-dimensional setting. These results complement the existing theory on structure learning and open up new possibilities for using LVGGM in statistical inference.

## 1. Introduction

Critical to many statistical inference tasks in complex real-world systems, such as prediction and detection, is the ability to extract and estimate distributional characteristics from the observations. Unfortunately, in the high-dimensional regime such model estimation often leads to ill-posed problems, particularly when the number of observations $n$ (the sample size) is comparable to or smaller than the ambient dimensionality $p$ of the model (i.e., the "large $p$, small $n$" problem). This challenge arises in many modern real-world applications, including recommender systems, gene microarray data, and financial data, to name a few. To perform accurate model parameter estimation and subsequent statistical inference, low-dimensional structure is often imposed for regularization (Negahban et al., 2012).

For Gaussian-distributed data, the central problem is often to estimate the inverse covariance matrix (alternatively known as the precision, concentration, or information matrix). Gaussian graphical models (GGM) provide an efficient representation of the precision matrix through a graph whose edges correspond to the non-zeros in the matrix (Lauritzen, 1996). In high-dimensional regimes, this graph can be forced to be sparse, imposing a low-dimensional structure on the GGM. For sufficiently sparse GGM, statistically consistent estimates of the model structure (i.e., sparsistency) can be achieved (e.g., Ravikumar et al. (2011)). On the computational side, sparsity also reduces the complexity of the estimator (Hsieh et al., 2013). However, when the true distribution cannot be well-approximated by a sparse GGM, the standard learning paradigm suffers from either large estimation bias due to enforcing an overly sparse model, or degraded computation time for a dense model. Both result in suboptimal performance in subsequent inference tasks. In this paper, we consider a new class of high-dimensional GGM that extends the standard sparse GGM.
The proposed model is motivated by many real-world applications in which certain exogenous, often latent, factors affect a large portion of the variables. Examples include the effect of the price of oil on airline stock prices (Choi et al., 2010), and the effect of genre on movie-rating variables. Conditional on these global effects, the variables are assumed to have highly localized interactions, which can be well fitted by a sparse GGM. However, due to the marginalization over the global effects, the observed (marginal) GGM, and its corresponding precision matrix, is not sparse. Unfortunately, in this regime, existing theoretical results and computational tools for sparse GGM are not applicable.

To address this problem, we propose to use latent variable Gaussian graphical models (LVGGM) for modeling and statistical inference. LVGGM introduce latent variables to capture the correlations due to the global effects, while the remaining effects are captured by a conditionally sparse graphical model. The resulting marginal precision matrix of the LVGGM has a sparse plus low-rank structure, and we therefore consider a regularized maximum likelihood (ML) approach for parameter estimation (previously considered by Chandrasekaran et al. (2012)). By utilizing the almost strong convexity (Kakade et al., 2010) of the log-likelihood, we derive a non-asymptotic parameter error bound for the regularized ML estimator. Our derived bounds apply to the high-dimensional setting of $p \gg n$ thanks to restricted strong convexity (Negahban et al., 2012) and a certain structural incoherence between the sparse and low-rank components of the precision matrix (Yang & Ravikumar, 2013). We show that for sufficiently large $n$, the Frobenius norm error of the precision matrix of the LVGGM converges at the rate $O\big(\sqrt{(s + r_{\mathrm{eff}}\, r)\log p \,/\, n}\big)$, where $s$ is the number of non-zeros in the conditionally sparse precision matrix, $r_{\mathrm{eff}}$ is the effective rank of the covariance matrix, and $r$ is the number of latent variables. This rate is in general significantly faster than the standard convergence rate of $O\big(\sqrt{p^2/n}\big)$ for an unstructured dense GGM. This result offers a compelling argument for using LVGGM over sparse GGM in many inference problems.

The paper is structured as follows. In Section 2 we review the relevant prior literature. In Section 3 we formulate the LVGGM estimation problem. In Section 4 the main theoretical results are presented. Experimental results are shown in Section 5, and we conclude in Section 6. We use boldface letters to denote vectors and matrices. $\|\cdot\|_1$, $\|\cdot\|_2$, $\|\cdot\|_F$, and $\|\cdot\|_*$ denote the elementwise $\ell_1$, spectral, Frobenius, and nuclear matrix norms, respectively.

## 2. Background and Related Work

The problem of learning GGM with sparse inverse covariance matrices using $\ell_1$-regularized maximum likelihood estimation, often referred to as the graphical lasso (Glasso) problem, has been studied in Friedman et al. (2008); Ravikumar et al. (2011); Rothman et al. (2008). In particular, the authors of Ravikumar et al. (2011) study the model selection consistency (i.e., sparsistency) under a certain incoherence condition. Beyond sparse GGM, Choi et al. (2010) propose a multi-resolution extension of a GGM augmented with sparse inter-level correlations, while in Choi et al. (2011) the authors consider latent tree-structured graphical models. Both models lead to computationally efficient inference and learning algorithms but restrict the latent structure to trees.
Recently, Liu & Willsky (2013) considered a computationally efficient learning algorithm for a class of conditionally tree-structured LVGGM. The work most relevant to ours is by Chandrasekaran et al. (2012), who study the LVGGM learning problem but focus on the simultaneous model selection consistency of both the sparse and low-rank components. In contrast, in this paper we focus on Frobenius norm error bounds for estimating the precision matrix of the LVGGM. Although structural consistency can be useful for deriving insights, parameter estimation error analysis is of equal or greater importance in practice, since it provides additional, and usually more direct, insight into the factors influencing the performance of subsequent statistical inference tasks, such as prediction and detection. Also, compared with Chandrasekaran et al. (2012), our Frobenius norm error bounds are derived under mild conditions on the Fisher information of the distribution.

We note that there is a fundamentally different line of work on estimating models with a similar structural composition, known as robust PCA (Candès et al., 2011). In robust PCA, the data matrix is modeled as low-rank plus sparse. This model has been applied to extracting the salient foreground from the background in videos, and to detecting malicious user ratings in recommender system data (Xu et al., 2012). In contrast, the equivalent covariance model of our LVGGM decomposes into a low-rank matrix plus a dense matrix whose inverse is sparse. A similar covariance model has recently been studied by Kalaitzis & Lawrence (2012), in which an EM algorithm is proposed for estimation but no theoretical error bounds are derived. In this paper, we instead focus on the precision matrix parameterization, which enables model estimation through convex optimization. This formulation is of both theoretical and computational importance.

## 3. Problem Setup

In this section, we review Gaussian graphical models and formulate the problem of latent variable Gaussian graphical model estimation via regularized maximum likelihood optimization.

### 3.1. Gaussian Graphical Models

Consider a $p$-dimensional random vector $x$ associated with an undirected graph $G = (V_G, E_G)$, where $V_G$ is a set of nodes corresponding to the elements of $x$ and $E_G$ is a set of edges connecting nodes (including self-edges for each node). Then $x$ follows a graphical model distribution if it satisfies the Markov property with respect to $G$: for any pair of non-adjacent nodes in $G$, the corresponding pair of variables in $x$ are conditionally independent given the remaining variables, i.e., $x_i \perp x_j \mid x_{\setminus\{i,j\}}$ for all $(i,j) \notin E_G$. If $x$ follows a multivariate Gaussian distribution, the corresponding graphical model is called a Gaussian graphical model (GGM). We assume without loss of generality that $x$ has zero mean. The Markov property in a GGM is manifested in the sparsity pattern of the inverse covariance matrix $J$: $J_{i,j} = 0$ for all $i \neq j$ with $(i,j) \notin E_G$. An example of this property for a sparse GGM is shown in Figures 1(a) and 1(b).

Figure 1. Illustrations of a sparse Gaussian graphical model (GGM) (left) and a latent variable Gaussian graphical model (LVGGM) (right). (A) Example of a sparse GGM with only observed variables; (B) sparsity pattern of the example sparse GGM's precision matrix; (C) example of a LVGGM with both observed and latent variables; (D) sparsity pattern of the example LVGGM's precision matrix.
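The correspondence between missing edges and zeros of the precision matrix is easy to check numerically. The following toy sketch (plain NumPy; the 4-node chain and its edge weights are illustrative choices, not taken from the paper) shows a dense covariance whose inverse is sparse, with zero partial correlation exactly at the non-edges.

```python
# Toy illustration of the GGM Markov property: a zero entry J[i, j] means
# x_i and x_j are conditionally independent given the remaining variables.
import numpy as np

# 4-node chain 1 - 2 - 3 - 4: the precision matrix J is tridiagonal.
J = np.array([[1.0, 0.4, 0.0, 0.0],
              [0.4, 1.0, 0.4, 0.0],
              [0.0, 0.4, 1.0, 0.4],
              [0.0, 0.0, 0.4, 1.0]])
Sigma = np.linalg.inv(J)

# The covariance is dense: x_1 and x_4 are marginally correlated ...
print(np.round(Sigma, 3))
# ... but their partial correlation given (x_2, x_3) is zero, since
# partial_corr(i, j) = -J[i, j] / sqrt(J[i, i] * J[j, j]).
print(-J[0, 3] / np.sqrt(J[0, 0] * J[3, 3]))   # 0.0  <=>  (1, 4) is not an edge
```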
The precision matrix parameterization arises in many statistical inference problems for Gaussian distributions, in areas such as belief propagation, linear prediction, portfolio selection in financial data, and anomaly detection. Estimation of the precision matrix of the GGM is the first step in these inference problems.

### 3.2. Latent Variable Gaussian Graphical Models

Unfortunately, due to the presence of global factors that destroy sparsity, real-world observations often do not conform exactly to a sparse GGM (Choi et al., 2010; 2011). By introducing latent variables (denoted as an $r$-dimensional random vector $x_L$) to capture the global factors, we can generalize the GGM. Specifically, we construct a model that is conditionally a GGM, i.e., one that has a sparse precision matrix given knowledge of the latent variables $x_L$. Denoting the $p$ observed variables by $x_O$, we assume that the joint distribution of the $(p+r)$-dimensional concatenated random vector $x = (x_O, x_L)$ is Gaussian with covariance matrix $\Omega$ and precision matrix $J = \Omega^{-1}$. An example of this structure can be seen in Figures 1(c) and 1(d).

Marginalizing over the latent variables $x_L$, the distribution of the observed variables $x_O$ remains Gaussian, with observed covariance matrix $\Sigma = \Omega_{O,O}$. The observed precision matrix $\Theta \in \mathbb{R}^{p \times p}$ satisfies

$$\Theta = \Sigma^{-1} = \underbrace{J_{O,O}}_{=:\,S} \ \underbrace{-\ J_{O,L}\, J_{L,L}^{-1} J_{L,O}}_{=:\,L}, \qquad (1)$$

where we have defined $S := J_{O,O}$ and $L := -J_{O,L} J_{L,L}^{-1} J_{L,O}$. Thus, the marginal precision matrix can be written as $\Theta = S + L$, the sum of a sparse and a low-rank matrix. Similar to the standard GGM, we parameterize the marginal distribution through the precision matrix. We refer to this model as the latent variable GGM, or LVGGM. The LVGGM is a hierarchical model that generalizes the (sparse) GGM.

Note that $S^{-1} = J_{O,O}^{-1} = \Omega_{O,O} - \Omega_{O,L}\Omega_{L,L}^{-1}\Omega_{L,O}$ is the covariance matrix of the conditional distribution of the observed variables. This matrix is not generally sparse, even though $S$ is assumed to be sparse. We will also assume that the number of latent variables is much smaller than the number of observed variables, i.e., $r \ll p$. We place no sparsity restrictions on the dependencies between the observed and latent variables: the submatrices $J_{O,L}$ and $J_{L,O}$ could be dense. As a result, the $p \times p$ matrix $L = -J_{O,L} J_{L,L}^{-1} J_{L,O}$ is low-rank (with rank at most $r$) and potentially dense. This sparse plus low-rank structure of the marginal precision matrix $\Theta$ is the key property that will be exploited for model estimation.

The structural assumptions on the precision matrix of the LVGGM can be further motivated and validated on real-world recommender system data and stock return data. Due to space limitations, we defer these two motivating examples to Section A in the Appendix.

### 3.3. Effective Rank of Covariance Matrix

We introduce the effective rank of a matrix, which will be useful for deriving high-dimensional error bounds. The effective rank of a matrix $\Sigma$ is defined as (Vershynin, 2010)

$$r_{\mathrm{eff}}(\Sigma) := \operatorname{tr}(\Sigma)/\|\Sigma\|_2. \qquad (2)$$

The effective rank can be considered a measure of the concentration level of the spectrum of $\Sigma$. As we will show in Section 5.1, in many situations the effective rank of the covariance matrix corresponding to a LVGGM is much smaller than $p$. Under this condition, our theoretical results in the sequel provide a tight Frobenius norm estimation error bound, which significantly improves upon the error bound derived without the effective rank assumption.
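As a small numerical companion to (1) and (2), the sketch below (plain NumPy; the chain precision, loading scale, and sizes are arbitrary illustrative choices, not the paper's settings) builds a joint precision matrix $J$, forms $S$ and $L$ via the Schur complement, and computes the effective rank of the resulting marginal covariance.

```python
# Check the sparse-plus-low-rank structure (1) and compute the effective rank (2).
import numpy as np

rng = np.random.default_rng(0)
p, r = 30, 2

# Joint precision J over (x_O, x_L): sparse chain J_OO, dense J_OL, identity J_LL.
J_OO = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
J_OL = 0.03 * rng.standard_normal((p, r))
J_LL = np.eye(r)
J = np.block([[J_OO, J_OL], [J_OL.T, J_LL]])

S = J_OO                                    # conditionally sparse component
L = -J_OL @ np.linalg.inv(J_LL) @ J_OL.T    # low-rank component, rank <= r
Theta = S + L                               # marginal precision, eq. (1)
Sigma = np.linalg.inv(J)[:p, :p]            # marginal covariance of x_O

# Theta equals the inverse marginal covariance (Schur complement identity).
assert np.allclose(Theta, np.linalg.inv(Sigma))

# Effective rank (2): tr(Sigma) / ||Sigma||_2.
r_eff = np.trace(Sigma) / np.linalg.norm(Sigma, 2)
print(f"rank(L) = {np.linalg.matrix_rank(L)}, effective rank = {r_eff:.2f}")
```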
### 3.4. Regularized ML Estimation of LVGGM

Available are $n$ samples $x_1, x_2, \ldots, x_n$ of the observed variables of a LVGGM, concatenated into a data matrix $X \in \mathbb{R}^{p \times n}$. The negative log-likelihood function is

$$\mathcal{L}(\Theta; X) = \langle \widehat{\Sigma}, \Theta \rangle - \log\det(\Theta), \qquad (3)$$

where $\widehat{\Sigma} := \frac{1}{n} X X^{T}$ is the sample covariance matrix. The regularized ML estimate minimizes the objective function $\mathcal{L}(\Theta; X) + \lambda R(\Theta)$, where the regularization parameter $\lambda > 0$ and the regularization function $R(\Theta)$ is designed to enforce the sparse plus low-rank structure of $\Theta$. Similar to Chandrasekaran et al. (2012), we consider the following regularized ML estimation problem:

$$\min_{S,\, L}\ \mathcal{L}(S + L; X) + \lambda \|S\|_1 + \mu \|L\|_* \quad \text{s.t.}\quad L \preceq 0,\ \ S + L \succ 0, \qquad (4)$$

where the corresponding regularization function is the sum of two regularizers, $R(\Theta) = \|S\|_1 + \frac{\mu}{\lambda}\|L\|_*$, each of which has been shown to promote sparse (respectively, low-rank) structure in $S$ (respectively, $L$) (Negahban et al., 2012). The constants $\lambda, \mu > 0$ are the regularization parameters corresponding to the two regularizers. The LVGGM estimator is defined as a solution to the convex optimization problem (4). Efficient convex solvers, such as the alternating direction method of Ma et al. (2013), can be used to solve (4).
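For concreteness, the following is a minimal sketch of the estimator (4) written with a generic convex modeling tool (CVXPY). The function name `fit_lvggm`, the choice of CVXPY, and the sign constraint on $L$ (matching the convention of (1)) are illustrative assumptions rather than the solver used in the paper; a dedicated method such as that of Ma et al. (2013) scales far better.

```python
# A minimal CVXPY sketch of the regularized ML problem (4); illustrative only.
import numpy as np
import cvxpy as cp

def fit_lvggm(X, lam, mu):
    """X: p-by-n data matrix. Returns (S_hat, L_hat) with Theta_hat = S_hat + L_hat."""
    p, n = X.shape
    Sigma_hat = (X @ X.T) / n                   # sample covariance, eq. (3)
    S = cp.Variable((p, p), symmetric=True)     # conditionally sparse component
    L = cp.Variable((p, p), symmetric=True)     # low-rank component (L <= 0 here)
    Theta = S + L
    # <Sigma_hat, Theta> - log det(Theta) plus the l1 and nuclear-norm penalties of (4).
    objective = (cp.trace(Sigma_hat @ Theta) - cp.log_det(Theta)
                 + lam * cp.sum(cp.abs(S)) + mu * cp.normNuc(L))
    # Positive definiteness of Theta is enforced by the domain of log_det.
    problem = cp.Problem(cp.Minimize(objective), [L << 0])
    problem.solve()                             # CVXPY picks a conic solver (e.g., SCS)
    return S.value, L.value
```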
## 4. Error Bounds on ML LVGGM Estimation

We analyze the regularized ML estimation problem (4) and provide Frobenius norm error bounds for estimating the precision matrix in the high-dimensional setting. We adopt the decomposable regularization framework of Negahban et al. (2012); Agarwal et al. (2012); Yang & Ravikumar (2013) to derive these bounds. In contrast to this prior work, here we focus on multiple decomposable regularizers interacting with the non-quadratic log-likelihood loss function encountered in the LVGGM. Two important ingredients in the derivations are the restricted strong convexity of the loss function, and an incoherence condition between the two structured subspaces containing the sparse and low-rank components ($S$ and $L$). We show that these two conditions are verified under assumptions on the Fisher information. In the following subsections, we first define the necessary notation, then introduce the assumptions and place them in the context of the prior literature, and finally state the main results in Theorem 1 and Theorem 2.

### 4.1. Decomposable Regularizers and Subspace Pairs

In this subsection we introduce the notion of decomposable regularizers and the corresponding subspace pairs. We refer the reader to Negahban et al. (2012) for more details. Consider a pair of subspaces $(\mathcal{M}, \overline{\mathcal{M}}^{\perp})$, where $\mathcal{M} \subseteq \overline{\mathcal{M}} \subseteq \mathbb{R}^{p \times p}$. $R(\cdot)$ is called a decomposable regularization function with respect to this subspace pair if, for any $u \in \mathcal{M}$ and $v \in \overline{\mathcal{M}}^{\perp}$, we have $R(u + v) = R(u) + R(v)$. For the sparse and low-rank matrix-valued parameters, the following two subspace pairs and their corresponding decomposable regularizers are considered.

Sparse matrices. Let $E \subseteq \{1, \ldots, p\} \times \{1, \ldots, p\}$ be a subset of index pairs (edges). Define $\mathcal{M}(E) = \overline{\mathcal{M}}(E)$ as the subspace of all sparse matrices in $\mathbb{R}^{p \times p}$ that are supported on subsets of $E$, i.e., $P_{\mathcal{M}(E)}(A) = A_E$. A decomposable regularizer is the $\ell_1$ norm, since $\|A\|_1 = \|A_E\|_1 + \|A_{E^c}\|_1$.

Low-rank PSD matrices. Consider a class of low-rank, positive semi-definite matrices $A \in \mathbb{S}^{p \times p}_{+}$ which have rank $r \ll p$. For any given matrix $A$ in this class, let $\mathrm{col}(A)$ denote its column space. Let $U \subseteq \mathbb{R}^{p}$ be an $r$-dimensional subspace and define the model subspace $\mathcal{M}(U)$ and the perturbation subspace $\overline{\mathcal{M}}(U)^{\perp}$ as
$$\mathcal{M}(U) := \{A \in \mathbb{R}^{p \times p} \mid \mathrm{col}(A) \subseteq U\}, \qquad \overline{\mathcal{M}}(U)^{\perp} := \{A \in \mathbb{R}^{p \times p} \mid \mathrm{col}(A) \subseteq U^{\perp}\}.$$
Then the nuclear norm $R_L(\cdot) = \|\cdot\|_*$ is a decomposable regularization function with respect to the subspace pair $(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp})$.

For the true model parameter $\Theta^\star$, we define its associated structural error set with respect to a subspace pair $(\mathcal{M}, \overline{\mathcal{M}}^{\perp})$ as (Negahban et al., 2012)

$$\mathcal{C}(\mathcal{M}, \overline{\mathcal{M}}^{\perp}; \Theta^\star) := \big\{\Delta \in \mathbb{R}^{p \times p} \ \big|\ R(\Delta_{\overline{\mathcal{M}}^{\perp}}) \le 3\, R(\Delta_{\overline{\mathcal{M}}}) + 4\, R(\Theta^\star_{\mathcal{M}^{\perp}})\big\}. \qquad (5)$$

By construction, if the norm of the projection of the true parameter $\Theta^\star$ onto $\mathcal{M}^{\perp}$ is small, then elements of this structural error set also have a limited projection onto the perturbation subspace $\overline{\mathcal{M}}^{\perp}$.

Now let $\Theta^\star$ be the true (marginal) precision matrix of the LVGGM, and let its sparse and low-rank components be $S^\star$ and $L^\star$, respectively. For the subspace pairs $(\mathcal{M}(E), \overline{\mathcal{M}}(E)^{\perp})$ and $(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp})$ defined above, we use $\mathcal{C}(E)$ and $\mathcal{C}(U)$ as shorthand for the corresponding structural error sets centered at $S^\star$ and $L^\star$, i.e., $\mathcal{C}(\mathcal{M}(E), \overline{\mathcal{M}}(E)^{\perp}; S^\star)$ and $\mathcal{C}(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp}; L^\star)$, respectively. Later, we will consider perturbations of $\Theta^\star$ along restricted directions in these two sets.

### 4.2. Assumptions on Fisher Information

We characterize the interaction between elements of the two subspaces through their inner products weighted by the Hessian of the loss function, also known as the Fisher information of the distribution. Denoting the Fisher information matrix of the Gaussian distribution (evaluated at $\Theta^\star$) by $F^\star$, we have $F^\star = \Theta^{\star -1} \otimes \Theta^{\star -1}$, where $\otimes$ is the Kronecker product. We define the Fisher inner product between two matrices $A$ and $B$ as

$$\langle A, B \rangle_{F^\star} := \mathrm{vec}(A)^{T} F^\star\, \mathrm{vec}(B) = \operatorname{Tr}\!\big(\Theta^{\star -1} A\, \Theta^{\star -1} B\big), \qquad (6),(7)$$

where $\mathrm{vec}(\cdot)$ denotes the vectorization of a matrix. Similar to the prior work of Kakade et al. (2010), we define the induced Fisher norm of a matrix $\Delta$ through

$$\|\Delta\|_{F^\star}^2 := \mathrm{vec}(\Delta)^{T} F^\star\, \mathrm{vec}(\Delta) = \operatorname{Tr}\!\big(\Theta^{\star -1} \Delta\, \Theta^{\star -1} \Delta\big). \qquad (8),(9)$$

The first assumption we make is the following Restricted Fisher Eigenvalue (RFE) condition on the true model with respect to the sparse and low-rank structural error sets.

**Assumption 1** (Restricted Fisher Eigenvalue). There exists a constant $\kappa_{\min} > 0$ such that, for all $\Delta \in \mathcal{C}(E) \cup \mathcal{C}(U)$,
$$\|\Delta\|_{F^\star}^2 \ \ge\ \kappa_{\min}\, \|\Delta\|_F^2. \qquad (10)$$

This RFE condition generalizes the restricted eigenvalue (RE) condition for sparsity-promoting linear regression problems (Bickel et al., 2009). It assumes that the minimum eigenvalue of the Fisher information is bounded away from zero along the directions in $\mathcal{C}(E)$ and $\mathcal{C}(U)$. Due to the identity (8) and properties of the Kronecker product, a trivial lower bound for $\kappa_{\min}$ is $\lambda_{\min}^2(\Theta^{\star -1})$, where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue. In the high-dimensional setting, the RFE parameter $\kappa_{\min}$, which is defined only with respect to the above restricted set of directions, can be substantially larger than $\lambda_{\min}^2(\Theta^{\star -1})$. As a result, the derived error bounds, which depend on $\kappa_{\min}$, are generally tighter than bounds depending on $\lambda_{\min}^2(\Theta^{\star -1})$ (cf. Theorem 1).
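The identities (6)-(9) and the trivial lower bound on $\kappa_{\min}$ are easy to verify numerically; the short sketch below (plain NumPy; the random test matrices and sizes are arbitrary choices, not from the paper) checks the Kronecker and trace forms against each other.

```python
# Numerical check of the Fisher inner product (6)-(7), the Fisher norm (8)-(9),
# and the bound ||A||_{F*}^2 >= lambda_min(Theta^{-1})^2 * ||A||_F^2.
import numpy as np

rng = np.random.default_rng(1)
p = 5
M = rng.standard_normal((p, p))
Theta = M @ M.T + p * np.eye(p)          # a positive definite "true precision matrix"
Theta_inv = np.linalg.inv(Theta)

A = rng.standard_normal((p, p)); A = (A + A.T) / 2   # symmetric test matrices
B = rng.standard_normal((p, p)); B = (B + B.T) / 2

F = np.kron(Theta_inv, Theta_inv)                     # Fisher information at Theta
vec = lambda X: X.flatten(order="F")                  # column-stacking vectorization
kron_form = vec(A) @ F @ vec(B)                       # vec(A)^T F vec(B), eq. (6)
trace_form = np.trace(Theta_inv @ A @ Theta_inv @ B)  # Tr(Theta^{-1} A Theta^{-1} B), eq. (7)
assert np.allclose(kron_form, trace_form)

fisher_norm_sq = np.trace(Theta_inv @ A @ Theta_inv @ A)       # eq. (9)
lam_min = np.linalg.eigvalsh(Theta_inv).min()
assert fisher_norm_sq >= lam_min**2 * np.linalg.norm(A, "fro")**2 - 1e-9
```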
Due to the sparse plus low-rank superposition structure, we impose a type of incoherence between the two structural error sets to ensure consistent estimation of the combined model. The incoherence condition limits the interaction between elements of the two sets. For our problem, this interaction occurs through their inner products weighted by the Fisher information, which motivates the following Structural Fisher Incoherence (SFI) assumption (a generalization of the C-linear assumption proposed in Yang & Ravikumar (2013)). Let $P_E := P_{\mathcal{M}(E)}$ denote the projection operator onto the subspace $\mathcal{M}(E)$. Similarly define $P_U := P_{\mathcal{M}(U)}$, $P_{E^{\perp}} := P_{\overline{\mathcal{M}}(E)^{\perp}}$, and $P_{U^{\perp}} := P_{\overline{\mathcal{M}}(U)^{\perp}}$. We assume the following condition on the Fisher information.

**Assumption 2** (Structural Fisher Incoherence). Given a constant $M > 6$, a set of regularization parameters $(\lambda, \mu)$, and the subspace pairs $(\mathcal{M}(E), \overline{\mathcal{M}}(E)^{\perp})$ and $(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp})$ defined above, let
$$\Lambda := 2 + 3 \max\left\{\frac{\lambda\sqrt{s}}{\mu\sqrt{r}},\ \frac{\mu\sqrt{r}}{\lambda\sqrt{s}}\right\},$$
where $s = |E|$ and $r = \operatorname{rank}(U)$. Then the Fisher information $F^\star$ satisfies
$$\max\big\{\sigma(P_E F^\star P_U),\ \sigma(P_{E^{\perp}} F^\star P_U),\ \sigma(P_E F^\star P_{U^{\perp}}),\ \sigma(P_{E^{\perp}} F^\star P_{U^{\perp}})\big\} \ \le\ \frac{\kappa_{\min}}{c_1 \Lambda^2},$$
where $\sigma(\cdot)$ denotes the maximum singular value and the constant $c_1$ is defined as $c_1 := 16M$.

The constant $M$ is related to a burn-in period after which the likelihood loss function has desirable properties in a small neighborhood of the true parameter. In particular, when $M = 7$, the constant $c_1 = 112$ suffices for our theory to hold. See the main theorem and its proof for more discussion of this quantity.

It is interesting to compare our SFI assumption with similar assumptions in the literature on GGM estimation. In Ravikumar et al. (2011), a form of irrepresentability condition is assumed, which limits the induced $\ell_1$ norm of a matrix similar to the projection of the Fisher information onto the sparse matrix subspace pair. In Chandrasekaran et al. (2012), the notion of irrepresentability is extended to two subspace pairs (i.e., sparse and low-rank), but more detailed behavior of the projected Fisher information is controlled (see the main assumption on page 1949 of Chandrasekaran et al. (2012)). A more general form of irrepresentability has been shown to be necessary for model selection consistency; see Lee et al. (2013) for a recent discussion. In contrast to the above line of work, the SFI assumption we make only controls the maximum singular values of the projected Fisher information. This is because we are interested in bounding a weaker quantity, the Frobenius norm of the parameter estimation error, instead of establishing the stronger model selection consistency of Ravikumar et al. (2011) or the algebraic consistency of Chandrasekaran et al. (2012).

### 4.3. Error Bounds for LVGGM Estimation

We have the following bound on the parameter error of the estimated precision matrix of the LVGGM, $\widehat{\Theta} = \widehat{S} + \widehat{L}$, obtained by solving the regularized ML problem (4).

**Theorem 1.** Suppose Assumptions 1 and 2 hold for the true marginal precision matrix $\Theta^\star$, and the regularization parameters are chosen such that
$$\lambda \ \ge\ 2\,\|\Sigma^\star - \widehat{\Sigma}\|_{\infty} \quad\text{and}\quad \mu \ \ge\ 2\,\|\Sigma^\star - \widehat{\Sigma}\|_{2}. \qquad (11)$$
Given a constant $M > 6$, if an optimal solution pair $(\widehat{S}, \widehat{L})$ of the convex program (4) satisfies
$$\max\big\{\|\widehat{S} - S^\star\|_F,\ \|\widehat{L} - L^\star\|_F\big\} \ \le\ \frac{1}{6M^2}, \qquad (12)$$
then we have the following error bound for the estimated precision matrix $\widehat{\Theta} = \widehat{S} + \widehat{L}$:
$$\|\widehat{\Theta} - \Theta^\star\|_F \ \le\ \frac{c}{\kappa_L}\big(\lambda\sqrt{s} + \mu\sqrt{r}\big) \ +\ \frac{c'}{\kappa_L}\, e_{\mathrm{approx}}, \qquad (13)$$
where $s = |E|$, $r = \operatorname{rank}(U)$, $c, c' > 0$ are constants depending only on $M$,
$$\kappa_L := \frac{M-2}{2(M-1)}\,\kappa_{\min}, \qquad (14)$$
and the approximation error is
$$e_{\mathrm{approx}} := \lambda\,\big\|P_{\overline{\mathcal{M}}(E)^{\perp}}(S^\star)\big\|_1 + \mu \sum_{j > r} \sigma_j(L^\star). \qquad (15)$$

Proof sketch. The proof is inspired by Yang & Ravikumar (2013), in which a parameter estimation error bound is proven for a class of superposition-structured parameters, such as sparse plus low-rank, estimated through M-estimation with decomposable regularizers. Critical to specializing this framework to our LVGGM estimation problem is verifying two conditions on the log-likelihood loss function (3): restricted strong convexity (RSC) and structural incoherence (SI). The RSC condition (originally proposed in Negahban et al. (2012)) requires the loss function to be sufficiently curved (i.e., lower bounded by a quadratic function) along a restricted set of directions (defined by $\mathcal{C}(E)$ and $\mathcal{C}(U)$). The SI condition, on the other hand, limits certain interactions between elements from the above two structural error sets.
In Yang & Ravikumar (2013), under certain C-linear assumptions, the RSC and SI conditions are verified for several problems with quadratic loss functions. For the LVGGM estimation problem, however, the technical difficulty lies in the non-quadratic log-likelihood loss (3), for which the previously established RSC and SI conditions do not hold. To deal with this difficulty, we leverage the almost strong convexity property (Kakade et al., 2010) to characterize the behavior of the sum of the higher-order terms in the Taylor series of the log-likelihood loss function. We show that, in the regime specified by condition (12), the loss function is well-approximated by the sum of a quadratic function and a residual term. Under this condition, the RFE assumption (Assumption 1) guarantees the RSC condition (cf. Lemma 2), and the SFI assumption (Assumption 2) implies the SI condition (cf. Lemma 4). Theorem 1 can then be proven using the general theorem of Yang & Ravikumar (2013). A detailed proof of Theorem 1 can be found in Appendix B.

We make the following remarks:

- The error bound (13) is a family of upper bounds indexed by the choice of the subspace pairs $(\mathcal{M}(E), \overline{\mathcal{M}}(E)^{\perp})$ and $(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp})$. The tightest bound is achieved by appropriately choosing $E$ and $U$. The first additive term in (13) captures the estimation error, while the second term captures the approximation error. In many cases it is reasonable to assume the approximation error is zero, in which case the error bound reduces to the first additive term.

- Similar derivations also apply to $\ell_1$-regularized estimation of a sparse GGM. For the sparse GGM, only Assumption 1 is required, and the derivations simplify considerably. The final error bound again contains estimation and approximation errors, now depending only on the sparse matrix subspace pair. However, when the true precision matrix $\Theta^\star$ cannot be well-approximated by a sparse matrix (as in the LVGGM case), the approximation error would be much larger, leading to an inefficient learning rate.

- Finally, the SFI assumption can be relaxed to an even milder incoherence condition, $\|L^\star\|_{\infty} \le \alpha$, as considered in Agarwal et al. (2012). Following derivations similar to the proof of Theorem 1, a corresponding error bound can be obtained. However, as a result of this incoherence assumption, the error bound would contain an additional incoherence term which does not vanish even with infinitely many samples. This disadvantage is overcome under the structural incoherence condition.

The statement of Theorem 1 is deterministic in nature and applies to any optimum of the convex program. However, the condition on the regularization parameters (11) and the error bound depend on the sampled data (in particular the sample covariance matrix $\widehat{\Sigma}$), which is random. Therefore the key to specifying the regularization parameters, and hence obtaining error bounds independent of the data, is to derive tight deviation bounds on the sample covariance matrix in the elementwise $\ell_\infty$ and spectral norms, such that condition (11) holds with high probability.
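These two deviation norms are easy to examine empirically; the sketch below (plain NumPy, an identity true covariance, and illustrative sample sizes, not the paper's experiments) compares them to the familiar $\sqrt{\log p / n}$ and $\sqrt{p / n}$ scalings that drive the choices of $\lambda$ and $\mu$ in the corollary that follows.

```python
# Monte Carlo look at ||Sigma_hat - Sigma*||_inf and ||Sigma_hat - Sigma*||_2
# for N(0, I_p) data: the elementwise error tracks sqrt(log p / n), while the
# spectral-norm error tracks sqrt(p / n).
import numpy as np

rng = np.random.default_rng(2)
p = 100
for n in (200, 800, 3200):
    X = rng.standard_normal((n, p))          # n i.i.d. samples from N(0, I_p)
    D = X.T @ X / n - np.eye(p)              # sample-covariance deviation
    print(f"n={n:5d}  ||D||_inf={np.abs(D).max():.3f} "
          f"(sqrt(log p/n)={np.sqrt(np.log(p)/n):.3f})   "
          f"||D||_2={np.linalg.norm(D, 2):.3f} (sqrt(p/n)={np.sqrt(p/n):.3f})")
```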
These bounds can be obtained by using concentration inequalities for Gaussian distributions, which leads to the following corollary.

**Corollary 1.** Let the same assumptions as in Theorem 1 hold. Given constants $C_1 > 1$ and $C_2 \ge 1$, assume that the number of samples satisfies $n \ge \max\{4C_1 \log p,\ C_2\, p\}$, and that the regularization parameters are chosen as
$$\lambda \asymp \sigma^\star\sqrt{\frac{\log p}{n}} \quad\text{and}\quad \mu = 16\, C_2\, \rho^\star\sqrt{\frac{p}{n}}, \qquad (16)$$
where $\sigma^\star = \max_i \Sigma^\star_{i,i}$ and $\rho^\star = \|\Sigma^\star\|_2$. Then with probability at least $1 - 4p^{-2(C_1 - 1)} - 2\exp(-C_2^2\, p/2)$, we have
$$\|\widehat{\Theta} - \Theta^\star\|_F \ \le\ c_1\sqrt{\frac{s\log p}{n}} + c_2\sqrt{\frac{p\, r}{n}}, \qquad (17)$$
where $c_1 = 960\,\sigma^\star/\kappa_L$ and $c_2$ is an analogous constant proportional to $\rho^\star/\kappa_L$.

Remark: The estimation error (17) consists of two terms corresponding to the sparse and low-rank components, respectively. Note its resemblance to the error bounds for robust PCA (e.g., Agarwal et al. (2012); Yang & Ravikumar (2013)) and to the bound derived in Chandrasekaran et al. (2012). In particular, the first term in (17) is of the same order as the estimation error of a sparse GGM (Ravikumar et al., 2011). However, due to the presence of latent variables, both the sample size requirement (i.e., $n \gtrsim p$) and the combined error bound are worse than those for learning the sparse conditional GGM. Next we consider a scenario under which this additional disadvantage is largely removed.

Assume that the true marginal covariance matrix $\Sigma^\star$ has an effective rank $r_{\mathrm{eff}} := r_{\mathrm{eff}}(\Sigma^\star)$ (recall $r_{\mathrm{eff}}(\Sigma^\star) := \operatorname{tr}(\Sigma^\star)/\|\Sigma^\star\|_2$) that is much smaller than $p$. Then, by using recent advances on the non-asymptotic behavior of the sample covariance matrix (Lounici, 2012), we can obtain a much tighter bound which depends on $p$ only logarithmically, as stated in the following theorem.

**Theorem 2.** Let the same assumptions as in Theorem 1 hold. Given a constant $C_1 > 1$, assume that the number of observations satisfies $n \ge \max\{4C_1\log p,\ C_3\, r_{\mathrm{eff}}\log^2(2p)\}$, and that the regularization parameters are chosen as
$$\lambda \asymp \sigma^\star\sqrt{\frac{\log p}{n}} \quad\text{and}\quad \mu = C_4\, \rho^\star\sqrt{\frac{r_{\mathrm{eff}}\log(2p)}{n}}, \qquad (18)$$
where $\sigma^\star = \max_i \Sigma^\star_{i,i}$, $\rho^\star = \|\Sigma^\star\|_2$, and $C_3, C_4 > 0$ are sufficiently large constants. Then with probability at least $1 - 2p^{-2(C_1 - 1)} - (2p)^{-1}$, we have
$$\|\widehat{\Theta} - \Theta^\star\|_F \ \le\ c_1\sqrt{\frac{s\log p}{n}} + c_2\sqrt{\frac{r_{\mathrm{eff}}\, r\log(2p)}{n}}, \qquad (19)$$
where $c_1 = 960\,\sigma^\star/\kappa_L$ and $c_2 = 8C_4\,\rho^\star/\kappa_L$.

Proof sketch. As in Corollary 1, we need to verify that the choices of regularization parameters in (18) satisfy condition (11) with high probability. Since the choice of $\lambda$ has already been verified in Corollary 1, it only remains to verify the condition on $\mu$. To this end, we make use of the following sharp bound on the spectral-norm deviation of the sample covariance matrix.

**Lemma 1** (Lounici (2012)). Let $\widehat{\Sigma}$ be a sample covariance matrix constructed from $n$ i.i.d. samples of a $p$-dimensional Gaussian distribution $N(0, \Sigma^\star)$. Then with probability at least $1 - (2p)^{-1}$,
$$\|\widehat{\Sigma} - \Sigma^\star\|_2 \ \le\ C\,\|\Sigma^\star\|_2 \max\left\{\sqrt{\frac{2\, r_{\mathrm{eff}}\log(2p)}{n}},\ \frac{2\, r_{\mathrm{eff}}\log(2p)\,\big(3/8 + \log(2pn)\big)}{n}\right\},$$
where $C > 0$ is an absolute constant.

Then, as noted in Lounici (2012) (Prop. 3), when the sample size is sufficiently large, namely $n \ge C_3\, r_{\mathrm{eff}}\log^2\max\{2p, n\}$ for a large constant $C_3 > 0$, the choice of the regularization parameter $\mu$ in (18) suffices for condition (11) to hold with high probability.

Notice that when $r_{\mathrm{eff}} \ll p$, the error bound (19) is significantly tighter than the bound (17). The sample size requirement $n \gtrsim r_{\mathrm{eff}}\log p$ is also much milder. This result implies the efficiency of LVGGM learning when the true covariance model has a low effective rank.

## 5. Experiments

We use a set of simulations on synthetic data to verify our reduced effective rank assumption on the covariance matrix of the LVGGM, and the error bounds derived in Theorem 2.

### 5.1. Effective Rank of Covariance of LVGGM

To better understand the effective rank of the covariance matrix of a LVGGM, it is convenient to consider a hierarchical generating process for the observed variables: $x_O = A x_L + z$, where $x_L \sim N(0, \Omega_{L,L})$ are the latent variables, $A := -J_{O,O}^{-1} J_{O,L} \in \mathbb{R}^{p \times r}$, and $z \sim N(0, S^{-1})$ captures the conditional (local) effects. The marginal covariance matrix of the observed variables can be represented as
$$\Sigma = \underbrace{A\,\Omega_{L,L} A^{T}}_{=:\,G} + S^{-1},$$
where $G$ is a low-rank covariance matrix (global effects) and $S^{-1}$ is a non-sparse covariance matrix (conditionally local effects) whose inverse is sparse.
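The hierarchical construction above is straightforward to simulate. The sketch below (plain NumPy; a chain $S$, an identity $\Omega_{L,L}$, and a loading-scale sweep are illustrative stand-ins for the paper's random 5%-sparse models) shows how the effective rank of $\Sigma = G + S^{-1}$ shrinks as the global/local energy ratio grows.

```python
# Effective rank of a synthetic LVGGM covariance versus the global/local energy ratio.
import numpy as np

def lvggm_covariance_parts(p, r, scale, rng):
    """Return (G, S_inv) with Sigma = G + S_inv = A Omega_LL A^T + S^{-1}."""
    S = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))   # sparse chain precision
    Omega_LL = np.eye(r)                                        # independent latent variables
    A = scale * rng.standard_normal((p, r))                     # dense observed-latent loadings
    return A @ Omega_LL @ A.T, np.linalg.inv(S)

rng = np.random.default_rng(3)
p, r = 200, 10
for scale in (0.05, 0.2, 1.0):                                  # sweeps the energy ratio
    G, S_inv = lvggm_covariance_parts(p, r, scale, rng)
    Sigma = G + S_inv
    ratio = np.trace(G) / np.trace(S_inv)                       # Tr(G) / Tr(S^{-1})
    r_eff = np.trace(Sigma) / np.linalg.norm(Sigma, 2)          # effective rank (2)
    print(f"energy ratio = {ratio:6.2f}   effective rank = {r_eff:6.1f}   (p = {p})")
```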
While the low-rank global effects naturally result in a concentrated spectrum, the sparse-inverse local effects generally contribute a diffuse spectrum. The effective rank, which is the sum of all eigenvalues divided by the magnitude of the largest one, therefore depends on the relative energy ratio between $G$ and $S^{-1}$. Since an exact characterization of the effective rank in terms of $A$, $\Omega_{L,L}$, and $S$ tends to be difficult, we use Monte Carlo simulations to investigate synthetic LVGGM that conform to our assumptions. We generate LVGGM with independent latent variables (i.e., diagonal $J_{L,L}$), a dense latent-observed submatrix $J_{L,O}$, and a sparse conditional GGM $J_{O,O}$ for the observed variables with a random sparsity pattern (sparsity level 5%). We fix the number of latent variables at 10 and vary the number of observed variables $p \in \{80, 120, 200, 500\}$. By scaling the magnitudes of the elements in the latent-variable submatrix, we sweep the relative energy ratio between the global and local factors, i.e., $\operatorname{tr}(G)/\operatorname{tr}(S^{-1})$, from 0.1 to 10. After 550 realizations for each value of $p$, we plot the empirical effective ranks of the observed covariance matrices in Figure 2.

Figure 2. Effective ranks of covariance matrices of LVGGM with various global/local energy ratios ($r = 10$; $p = 80, 120, 200, 500$).

As seen in the figure, when the global factor dominates (i.e., the ratio is large), the effective rank of the covariance matrix is very small, as expected. On the other hand, when the local effects become stronger (e.g., when the number of observed variables $p$ increases), the effective rank increases, but at a very mild rate. In particular, when $p$ increases from 80 to 500, the maximum empirical effective rank in our simulations only increases from 4 to 26. For all of our simulated LVGGM, the empirical effective ranks are at least an order of magnitude smaller than $p$. This mild growth rate of the effective rank (compared to $p$) is what allows the improved error bound of Theorem 2 to hold.

### 5.2. Frobenius Norm Error of LVGGM Estimation

We simulate LVGGM data with the number of observed variables $p \in \{160, 200, 320, 400\}$ and the number of latent variables $r \in \{0.1, 0.15, 0.2, 0.3\}\cdot p$. The sparse conditional GGM is a chain graph whose associated precision matrix is tridiagonal, with off-diagonal elements $S_{i,i-1} = S_{i,i+1} = 0.4\, S_{i,i}$ for $i \in \{2, \ldots, p-1\}$. For each configuration of $p$ and $r$, we draw $n$ samples from the LVGGM, where $n$ ranges from 200 to 1000. Using these samples, the precision matrix $\widehat{\Theta}$ is learned by solving the regularized ML estimation problem (4).
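A single configuration of this simulation can be sketched as follows, assuming the hypothetical `fit_lvggm()` helper sketched after Section 3.4; the sizes and regularization values are deliberately small, illustrative choices rather than the paper's cross-validated settings.

```python
# One small chain-graph LVGGM configuration: generate data, fit (4), report the error.
import numpy as np

rng = np.random.default_rng(4)
p, r, n = 60, 6, 400                                    # far smaller than the paper's settings

# Tridiagonal chain: S[i, i-1] = S[i, i+1] = 0.4 * S[i, i].
S_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
J_OL = 0.02 * rng.standard_normal((p, r))               # dense observed-latent block
L_true = -J_OL @ J_OL.T                                 # low-rank component (J_LL = I)
Theta_true = S_true + L_true                            # marginal precision, eq. (1)

# Draw n samples from N(0, Theta_true^{-1}) and estimate Theta by solving (4).
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta_true), size=n).T   # p x n
S_hat, L_hat = fit_lvggm(X, lam=0.05, mu=0.5)           # illustrative (lambda, mu)
err = np.linalg.norm(S_hat + L_hat - Theta_true, "fro")

# Rescaled sample size used on the x-axis of Figure 3; s ~ 3p nonzeros for the chain.
rescaled_n = n / (3 * p * np.log(p) + r * np.log(2 * p))
print(f"Frobenius error = {err:.3f} at rescaled sample size {rescaled_n:.3f}")
```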
As shown in Section 5.1, the effective rank of the covariance matrix grows only mildly with $p$. Theorem 2 then predicts that the Frobenius error of the estimated precision matrix of the LVGGM should scale as $\|\widehat{\Theta} - \Theta^\star\|_F \lesssim \sqrt{(s\log p + r\log(2p))/n}$, when the regularization parameters are chosen such that $\lambda \propto \sqrt{\log p / n}$ and $\mu \propto \rho\sqrt{r_{\mathrm{eff}}\log p / n}$. Guided by this theoretical result, we set the regularization parameters as $\lambda = C_a\sqrt{\log p / n}$ and $\mu = C_b\,\rho\sqrt{r_{\mathrm{eff}}\log p / n}$, where the constants $C_a$ and $C_b$ are cross-validated and then fixed for all test data sets with different configurations. We plot the Frobenius estimation errors against the rescaled sample size $n/(s\log p + r\log(2p))$ in Figure 3. Over a wide range of configurations, almost all of the empirical error curves align and take the form $f(t) \propto t^{-1/2}$ when the sample size is rescaled, as predicted by Theorem 2. In practice, when the true model is unknown, one can set the regularization parameters according to the sample versions of the quantities $\sigma^\star$ and $\rho^\star$, as discussed in Lounici (2012).

Figure 3. Simulations for chain graphical models with latent variables: plots of the Frobenius norm error $\|\widehat{\Theta} - \Theta^\star\|_F$ versus the rescaled sample size $n/(s\log(p) + r\log(2p))$, for $p \in \{160, 200, 320, 400\}$ and $r \in \{0.1, 0.15, 0.2, 0.3\}\cdot p$.

## 6. Conclusions

We consider a family of latent variable Gaussian graphical models whose precision matrix has a sparse plus low-rank structure. We derive parameter error bounds for regularized maximum likelihood estimation. Future work includes extending the framework to other distributions and applying it to tasks such as prediction and detection.

## Acknowledgement

This work is in part supported by ARO grant W911NF-11-1-0391. The authors thank the anonymous reviewers for their valuable comments, along with Jason Lee and Yuekai Sun for helpful discussions.

## References

Agarwal, Alekh, Negahban, Sahand, and Wainwright, Martin J. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.

Bickel, Peter J, Ritov, Yaacov, and Tsybakov, Alexandre B. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

Candès, Emmanuel J, Li, Xiaodong, Ma, Yi, and Wright, John. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

Chandrasekaran, Venkat, Parrilo, Pablo A, and Willsky, Alan S. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1935–1967, 2012.

Choi, Myung Jin, Chandrasekaran, Venkat, and Willsky, Alan S. Gaussian multiresolution models: Exploiting sparse Markov and covariance structure. IEEE Transactions on Signal Processing, 58(3):1012–1024, 2010.

Choi, Myung Jin, Tan, Vincent Y. F., Anandkumar, Animashree, and Willsky, Alan S. Learning latent tree graphical models. Journal of Machine Learning Research, 12:1729–1770, 2011.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

Hsieh, Cho-Jui, Sustik, Mátyás A, Dhillon, Inderjit, Ravikumar, Pradeep, and Poldrack, Russell. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems, pp. 3165–3173, 2013.

Kakade, Sham, Shamir, Ohad, Sridharan, Karthik, and Tewari, Ambuj. Learning exponential families in high-dimensions: Strong convexity and sparsity. In International Conference on Artificial Intelligence and Statistics, pp. 381–388, 2010.

Kalaitzis, Alfredo and Lawrence, Neil D. Residual component analysis: Generalising PCA for more flexible inference in linear-Gaussian models. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 209–216, 2012.

Lauritzen, S. L. Graphical Models, volume 17. Oxford University Press, USA, 1996.

Lee, Jason, Sun, Yuekai, and Taylor, Jonathan E. On model selection consistency of penalized M-estimators: a geometric theory. In Advances in Neural Information Processing Systems, pp. 342–350, 2013.
Liu, Ying and Willsky, Alan. Learning Gaussian graphical models with observed or latent FVSs. In Advances in Neural Information Processing Systems, pp. 1833–1841, 2013.

Lounici, Karim. High-dimensional covariance matrix estimation with missing observations. arXiv preprint arXiv:1201.2577, 2012.

Ma, Shiqian, Xue, Lingzhou, and Zou, Hui. Alternating direction methods for latent variable Gaussian graphical model selection. Neural Computation, 25(8):2172–2198, 2013.

Negahban, Sahand N, Ravikumar, Pradeep, Wainwright, Martin J, and Yu, Bin. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

Ravikumar, Pradeep, Wainwright, Martin J, Raskutti, Garvesh, and Yu, Bin. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

Rothman, Adam J, Bickel, Peter J, Levina, Elizaveta, and Zhu, Ji. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.

Vershynin, Roman. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Xu, Huan, Caramanis, Constantine, and Sanghavi, Sujay. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.

Yang, Eunho and Ravikumar, Pradeep. Dirty statistical models. In Advances in Neural Information Processing Systems 26, pp. 611–619, 2013.