# Learning Latent Variable Gaussian Graphical Models

Zhaoshi Meng (MENGZS@UMICH.EDU), Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

Brian Eriksson (BRIAN.ERIKSSON@TECHNICOLOR.COM), Technicolor Research Center, 735 Emerson Street, Palo Alto, CA 94301, USA

Alfred O. Hero III (HERO@EECS.UMICH.EDU), Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

## Abstract

Gaussian graphical models (GGM) have been widely used in many high-dimensional applications, ranging from biological and financial data to recommender systems. Sparsity in GGM plays a central role both statistically and computationally. Unfortunately, real-world data often do not fit sparse graphical models well. In this paper, we focus on a family of latent variable Gaussian graphical models (LVGGM), in which the model is conditionally sparse given latent variables but marginally non-sparse. In LVGGM, the inverse covariance matrix has a low-rank plus sparse structure and can be learned in a regularized maximum likelihood framework. We derive novel parameter estimation error bounds for LVGGM under mild conditions in the high-dimensional setting. These results complement the existing theory on structure learning and open up new possibilities for using LVGGM in statistical inference.

## 1. Introduction

Critical to many statistical inference tasks in complex real-world systems, such as prediction and detection, is the ability to extract and estimate distributional characteristics from the observations. Unfortunately, in the high-dimensional regime such model estimation often leads to ill-posed problems, particularly when the number of observations $n$ (the sample size) is comparable to or smaller than the ambient dimensionality $p$ of the model (i.e., the "large $p$, small $n$" problem). This challenge arises in many modern real-world applications, including recommender systems, gene microarray data, and financial data, to name a few. To perform accurate model parameter estimation and subsequent statistical inference, low-dimensional structure is often imposed for regularization (Negahban et al., 2012).

For Gaussian-distributed data, the central problem is often to estimate the inverse covariance matrix (alternatively known as the precision, concentration, or information matrix). Gaussian graphical models (GGM) provide an efficient representation of the precision matrix through a graph whose edges correspond to the non-zeros in the matrix (Lauritzen, 1996). In high-dimensional regimes, this graph can be forced to be sparse, imposing a low-dimensional structure on the GGM. For sufficiently sparse GGM, statistically consistent estimates of the model structure (i.e., sparsistency) can be achieved (e.g., Ravikumar et al. (2011)). On the computational side, sparsity also reduces the complexity of the estimator (Hsieh et al., 2013). However, when the true distribution cannot be well-approximated by a sparse GGM, the standard learning paradigm suffers from either large estimation bias due to enforcing an overly sparse model, or degraded computation time for a dense model. Both result in suboptimal performance in subsequent inference tasks. In this paper, we consider a new class of high-dimensional GGM that extends the standard sparse GGM.
The proposed model is motivated by many real-world applications in which certain exogenous, often latent, factors affect a large portion of the variables. Examples include the effect of the price of oil on airline stock prices (Choi et al., 2010), and the effect of genre on movie-rating variables. Conditional on these global effects, the variables are assumed to have highly localized interactions, which can be well fitted by a sparse GGM. However, due to the marginalization over the global effects, the observed (marginal) GGM, and its corresponding precision matrix, is not sparse. Unfortunately, in this regime, existing theoretical results and computational tools for sparse GGM are not applicable.

To address this problem, we propose to use latent variable Gaussian graphical models (LVGGM) for modeling and statistical inference. LVGGM introduce latent variables to capture the correlations due to the global effects, while the remaining effects are captured by a conditionally sparse graphical model. The resulting marginal precision matrix of the LVGGM has a sparse plus low-rank structure, and we therefore consider a regularized maximum likelihood (ML) approach for parameter estimation (previously considered by Chandrasekaran et al. (2012)). By utilizing the almost strong convexity (Kakade et al., 2010) of the log-likelihood, we derive a non-asymptotic parameter error bound for the regularized ML estimator. Our derived bounds apply to the high-dimensional setting of $p \gg n$ thanks to restricted strong convexity (Negahban et al., 2012) and a certain structural incoherence between the sparse and low-rank components of the precision matrix (Yang & Ravikumar, 2013). We show that for sufficiently large $n$, the Frobenius norm error of the precision matrix of the LVGGM converges at the rate $O\big(\sqrt{(s + r_{\mathrm{eff}}\, r)\log p \,/\, n}\big)$, where $s$ is the number of non-zeros in the conditionally sparse precision matrix, $r_{\mathrm{eff}}$ is the effective rank of the covariance matrix, and $r$ is the number of latent variables. This rate is in general significantly faster than the standard convergence rate of $O\big(\sqrt{p^2/n}\big)$ for an unstructured dense GGM. This result offers a compelling argument for using LVGGM over sparse GGM in many inference problems.

The paper is structured as follows. In Section 2 we review the relevant prior literature. In Section 3 we formulate the LVGGM estimation problem. In Section 4 the main theoretical results are presented. Experimental results are shown in Section 5, and we conclude in Section 6. We use boldface letters to denote vectors and matrices. $\|\cdot\|_1$, $\|\cdot\|_2$, $\|\cdot\|_F$, and $\|\cdot\|_*$ denote the elementwise $\ell_1$, spectral, Frobenius, and nuclear matrix norms, respectively.

## 2. Background and Related Work

The problem of learning GGM with sparse inverse covariance matrices using $\ell_1$-regularized maximum likelihood estimation, often referred to as the graphical lasso (Glasso) problem, has been studied in Friedman et al. (2008); Ravikumar et al. (2011); Rothman et al. (2008). In particular, the authors of Ravikumar et al. (2011) study the model selection consistency (i.e., sparsistency) under a certain incoherence condition. Beyond sparse GGM, Choi et al. (2010) propose a multi-resolution extension of a GGM augmented with sparse inter-level correlations, while in Choi et al. (2011) the authors consider latent tree-structured graphical models. Both models lead to computationally efficient inference and learning algorithms but restrict the latent structure to trees.
Recently, Liu & Willsky (2013) considered a computationally efficient learning algorithm for a class of conditionally tree-structured LVGGM. The work most relevant to ours is by Chandrasekaran et al. (2012), who study the LVGGM learning problem but focus on the simultaneous model selection consistency of both the sparse and low-rank components. In contrast, in this paper we focus on Frobenius norm error bounds for estimating the precision matrix of the LVGGM. Although structural consistency can be useful for deriving insights, parameter estimation error analysis is of equal or greater importance in practice, since it provides additional, and usually more direct, insight into the factors influencing the performance of subsequent statistical inference tasks, such as prediction and detection. Also, compared with Chandrasekaran et al. (2012), our Frobenius norm error bounds are derived under mild conditions on the Fisher information of the distribution.

We note that there is a fundamentally different line of work on estimating models with a similar structural composition, known as robust PCA (Candès et al., 2011). In robust PCA, the data matrix is modeled as low-rank plus sparse. This model has been applied to extracting the salient foreground from the background in videos, and to detecting malicious user ratings in recommender system data (Xu et al., 2012). In contrast, the equivalent covariance model of our LVGGM decomposes into a low-rank matrix plus a dense matrix whose inverse is sparse. A similar covariance model has recently been studied by Kalaitzis & Lawrence (2012), in which an EM algorithm is proposed for estimation but no theoretical error bounds are derived. In this paper, we instead focus on the precision matrix parameterization, which enables model estimation through convex optimization. This formulation is of both theoretical and computational importance.

## 3. Problem Setup

In this section, we review Gaussian graphical models and formulate the problem of latent variable Gaussian graphical model estimation via regularized maximum likelihood optimization.

### 3.1. Gaussian Graphical Models

Consider a $p$-dimensional random vector $x$ associated with an undirected graph $G = (V_G, E_G)$, where $V_G$ is a set of nodes corresponding to the elements of $x$ and $E_G$ is a set of edges connecting nodes (including self-edges for each node). Then $x$ follows a graphical model distribution if it satisfies the Markov property with respect to $G$: for any pair of non-adjacent nodes in $G$, the corresponding pair of variables in $x$ are conditionally independent given the remaining variables, i.e., $x_i \perp x_j \mid x_{\setminus\{i,j\}}$ for all $(i,j) \notin E_G$. If $x$ follows a multivariate Gaussian distribution, the corresponding graphical model is called a Gaussian graphical model (GGM). We assume without loss of generality that $x$ has zero mean. The Markov property in a GGM is manifested in the sparsity pattern of the inverse covariance matrix $J$: $J_{i,j} = 0$ for all $i \neq j$ with $(i,j) \notin E_G$. An example of this property for a sparse GGM is shown in Figures 1(a) and 1(b).

Figure 1. Illustrations of a sparse Gaussian graphical model (GGM) (left) and a latent variable Gaussian graphical model (LVGGM) (right). (A) Example of a sparse GGM with only observed variables; (B) sparsity pattern of the example sparse GGM's precision matrix; (C) example of a LVGGM with both observed and latent variables; (D) sparsity pattern of the example LVGGM's precision matrix.
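The correspondence between missing edges and zeros of the precision matrix is easy to check numerically. The following toy sketch (plain NumPy; the 4-node chain and its edge weights are illustrative choices, not taken from the paper) shows a dense covariance whose inverse is sparse, with zero partial correlation exactly at the non-edges.

```python
# Toy illustration of the GGM Markov property: a zero entry J[i, j] means
# x_i and x_j are conditionally independent given the remaining variables.
import numpy as np

# 4-node chain 1 - 2 - 3 - 4: the precision matrix J is tridiagonal.
J = np.array([[1.0, 0.4, 0.0, 0.0],
              [0.4, 1.0, 0.4, 0.0],
              [0.0, 0.4, 1.0, 0.4],
              [0.0, 0.0, 0.4, 1.0]])
Sigma = np.linalg.inv(J)

# The covariance is dense: x_1 and x_4 are marginally correlated ...
print(np.round(Sigma, 3))
# ... but their partial correlation given (x_2, x_3) is zero, since
# partial_corr(i, j) = -J[i, j] / sqrt(J[i, i] * J[j, j]).
print(-J[0, 3] / np.sqrt(J[0, 0] * J[3, 3]))   # 0.0  <=>  (1, 4) is not an edge
```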
The precision matrix parameterization arises in many statistical inference problems for Gaussian distributions, in areas such as belief propagation, linear prediction, portfolio selection in financial data, and anomaly detection. Estimation of the precision matrix of the GGM is the first step in these inference problems.

### 3.2. Latent Variable Gaussian Graphical Models

Unfortunately, due to the presence of global factors that destroy sparsity, real-world observations often do not conform exactly to a sparse GGM (Choi et al., 2010; 2011). By introducing latent variables (denoted as an $r$-dimensional random vector $x_L$) to capture the global factors, we can generalize the GGM. Specifically, we construct a model that is conditionally a GGM, i.e., one that has a sparse precision matrix given knowledge of the latent variables $x_L$. Denoting the $p$ observed variables by $x_O$, we assume that the joint distribution of the $(p+r)$-dimensional concatenated random vector $x = (x_O, x_L)$ is Gaussian with covariance matrix $\Omega$ and precision matrix $J = \Omega^{-1}$. An example of this structure can be seen in Figures 1(c) and 1(d).

Marginalizing over the latent variables $x_L$, the distribution of the observed variables $x_O$ remains Gaussian, with observed covariance matrix $\Sigma = \Omega_{O,O}$. The observed precision matrix $\Theta \in \mathbb{R}^{p \times p}$ satisfies

$$\Theta = \Sigma^{-1} = \underbrace{J_{O,O}}_{=:\,S} \ \underbrace{-\ J_{O,L}\, J_{L,L}^{-1} J_{L,O}}_{=:\,L}, \qquad (1)$$

where we have defined $S := J_{O,O}$ and $L := -J_{O,L} J_{L,L}^{-1} J_{L,O}$. Thus, the marginal precision matrix can be written as $\Theta = S + L$, the sum of a sparse and a low-rank matrix. Similar to the standard GGM, we parameterize the marginal distribution through the precision matrix. We refer to this model as the latent variable GGM, or LVGGM. The LVGGM is a hierarchical model that generalizes the (sparse) GGM.

Note that $S^{-1} = J_{O,O}^{-1} = \Omega_{O,O} - \Omega_{O,L}\Omega_{L,L}^{-1}\Omega_{L,O}$ is the covariance matrix of the conditional distribution of the observed variables. This matrix is not generally sparse, even though $S$ is assumed to be sparse. We will also assume that the number of latent variables is much smaller than the number of observed variables, i.e., $r \ll p$. We place no sparsity restrictions on the dependencies between the observed and latent variables: the submatrices $J_{O,L}$ and $J_{L,O}$ could be dense. As a result, the $p \times p$ matrix $L = -J_{O,L} J_{L,L}^{-1} J_{L,O}$ is low-rank (with rank at most $r$) and potentially dense. This sparse plus low-rank structure of the marginal precision matrix $\Theta$ is the key property that will be exploited for model estimation.

The structural assumptions on the precision matrix of the LVGGM can be further motivated and validated on real-world recommender system data and stock return data. Due to space limitations, we defer these two motivating examples to Section A in the Appendix.

### 3.3. Effective Rank of Covariance Matrix

We introduce the effective rank of a matrix, which will be useful for deriving high-dimensional error bounds. The effective rank of a matrix $\Sigma$ is defined as (Vershynin, 2010)

$$r_{\mathrm{eff}}(\Sigma) := \operatorname{tr}(\Sigma)/\|\Sigma\|_2. \qquad (2)$$

The effective rank can be considered a measure of the concentration level of the spectrum of $\Sigma$. As we will show in Section 5.1, in many situations the effective rank of the covariance matrix corresponding to a LVGGM is much smaller than $p$. Under this condition, our theoretical results in the sequel provide a tight Frobenius norm estimation error bound, which significantly improves upon the error bound derived without the effective rank assumption.
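As a small numerical companion to (1) and (2), the sketch below (plain NumPy; the chain precision, loading scale, and sizes are arbitrary illustrative choices, not the paper's settings) builds a joint precision matrix $J$, forms $S$ and $L$ via the Schur complement, and computes the effective rank of the resulting marginal covariance.

```python
# Check the sparse-plus-low-rank structure (1) and compute the effective rank (2).
import numpy as np

rng = np.random.default_rng(0)
p, r = 30, 2

# Joint precision J over (x_O, x_L): sparse chain J_OO, dense J_OL, identity J_LL.
J_OO = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
J_OL = 0.03 * rng.standard_normal((p, r))
J_LL = np.eye(r)
J = np.block([[J_OO, J_OL], [J_OL.T, J_LL]])

S = J_OO                                    # conditionally sparse component
L = -J_OL @ np.linalg.inv(J_LL) @ J_OL.T    # low-rank component, rank <= r
Theta = S + L                               # marginal precision, eq. (1)
Sigma = np.linalg.inv(J)[:p, :p]            # marginal covariance of x_O

# Theta equals the inverse marginal covariance (Schur complement identity).
assert np.allclose(Theta, np.linalg.inv(Sigma))

# Effective rank (2): tr(Sigma) / ||Sigma||_2.
r_eff = np.trace(Sigma) / np.linalg.norm(Sigma, 2)
print(f"rank(L) = {np.linalg.matrix_rank(L)}, effective rank = {r_eff:.2f}")
```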
### 3.4. Regularized ML Estimation of LVGGM

Available are $n$ samples $x_1, x_2, \ldots, x_n$ of the observed variables of a LVGGM, concatenated into a data matrix $X \in \mathbb{R}^{p \times n}$. The negative log-likelihood function is

$$\mathcal{L}(\Theta; X) = \langle \widehat{\Sigma}, \Theta \rangle - \log\det(\Theta), \qquad (3)$$

where $\widehat{\Sigma} := \frac{1}{n} X X^{T}$ is the sample covariance matrix. The regularized ML estimate minimizes the objective function $\mathcal{L}(\Theta; X) + \lambda R(\Theta)$, where the regularization parameter $\lambda > 0$ and the regularization function $R(\Theta)$ is designed to enforce the sparse plus low-rank structure of $\Theta$. Similar to Chandrasekaran et al. (2012), we consider the following regularized ML estimation problem:

$$\min_{S,\, L}\ \mathcal{L}(S + L; X) + \lambda \|S\|_1 + \mu \|L\|_* \quad \text{s.t.}\quad L \preceq 0,\ \ S + L \succ 0, \qquad (4)$$

where the corresponding regularization function is the sum of two regularizers, $R(\Theta) = \|S\|_1 + \frac{\mu}{\lambda}\|L\|_*$, each of which has been shown to promote sparse (respectively, low-rank) structure in $S$ (respectively, $L$) (Negahban et al., 2012). The constants $\lambda, \mu > 0$ are the regularization parameters corresponding to the two regularizers. The LVGGM estimator is defined as a solution to the convex optimization problem (4). Efficient convex solvers, such as the alternating direction method of Ma et al. (2013), can be used to solve (4).
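For concreteness, the following is a minimal sketch of the estimator (4) written with a generic convex modeling tool (CVXPY). The function name `fit_lvggm`, the choice of CVXPY, and the sign constraint on $L$ (matching the convention of (1)) are illustrative assumptions rather than the solver used in the paper; a dedicated method such as that of Ma et al. (2013) scales far better.

```python
# A minimal CVXPY sketch of the regularized ML problem (4); illustrative only.
import numpy as np
import cvxpy as cp

def fit_lvggm(X, lam, mu):
    """X: p-by-n data matrix. Returns (S_hat, L_hat) with Theta_hat = S_hat + L_hat."""
    p, n = X.shape
    Sigma_hat = (X @ X.T) / n                   # sample covariance, eq. (3)
    S = cp.Variable((p, p), symmetric=True)     # conditionally sparse component
    L = cp.Variable((p, p), symmetric=True)     # low-rank component (L <= 0 here)
    Theta = S + L
    # <Sigma_hat, Theta> - log det(Theta) plus the l1 and nuclear-norm penalties of (4).
    objective = (cp.trace(Sigma_hat @ Theta) - cp.log_det(Theta)
                 + lam * cp.sum(cp.abs(S)) + mu * cp.normNuc(L))
    # Positive definiteness of Theta is enforced by the domain of log_det.
    problem = cp.Problem(cp.Minimize(objective), [L << 0])
    problem.solve()                             # CVXPY picks a conic solver (e.g., SCS)
    return S.value, L.value
```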
## 4. Error Bounds on ML LVGGM Estimation

We analyze the regularized ML estimation problem (4) and provide Frobenius norm error bounds for estimating the precision matrix in the high-dimensional setting. We adopt the decomposable regularization framework of Negahban et al. (2012); Agarwal et al. (2012); Yang & Ravikumar (2013) to derive these bounds. In contrast to this prior work, here we focus on multiple decomposable regularizers interacting with the non-quadratic log-likelihood loss function encountered in the LVGGM. Two important ingredients in the derivations are the restricted strong convexity of the loss function, and an incoherence condition between the two structured subspaces containing the sparse and low-rank components ($S$ and $L$). We show that these two conditions are verified under assumptions on the Fisher information. In the following subsections, we first define the necessary notation, then introduce the assumptions and place them in the context of the prior literature, and finally state the main results in Theorem 1 and Theorem 2.

### 4.1. Decomposable Regularizers and Subspace Pairs

In this subsection we introduce the notion of decomposable regularizers and the corresponding subspace pairs. We refer the reader to Negahban et al. (2012) for more details. Consider a pair of subspaces $(\mathcal{M}, \overline{\mathcal{M}}^{\perp})$, where $\mathcal{M} \subseteq \overline{\mathcal{M}} \subseteq \mathbb{R}^{p \times p}$. $R(\cdot)$ is called a decomposable regularization function with respect to this subspace pair if, for any $u \in \mathcal{M}$ and $v \in \overline{\mathcal{M}}^{\perp}$, we have $R(u + v) = R(u) + R(v)$. For the sparse and low-rank matrix-valued parameters, the following two subspace pairs and their corresponding decomposable regularizers are considered.

Sparse matrices. Let $E \subseteq \{1, \ldots, p\} \times \{1, \ldots, p\}$ be a subset of index pairs (edges). Define $\mathcal{M}(E) = \overline{\mathcal{M}}(E)$ as the subspace of all sparse matrices in $\mathbb{R}^{p \times p}$ that are supported on subsets of $E$, i.e., $P_{\mathcal{M}(E)}(A) = A_E$. A decomposable regularizer is the $\ell_1$ norm, since $\|A\|_1 = \|A_E\|_1 + \|A_{E^c}\|_1$.

Low-rank PSD matrices. Consider a class of low-rank, positive semi-definite matrices $A \in \mathbb{S}^{p \times p}_{+}$ which have rank $r \ll p$. For any given matrix $A$ in this class, let $\mathrm{col}(A)$ denote its column space. Let $U \subseteq \mathbb{R}^{p}$ be an $r$-dimensional subspace and define the model subspace $\mathcal{M}(U)$ and the perturbation subspace $\overline{\mathcal{M}}(U)^{\perp}$ as
$$\mathcal{M}(U) := \{A \in \mathbb{R}^{p \times p} \mid \mathrm{col}(A) \subseteq U\}, \qquad \overline{\mathcal{M}}(U)^{\perp} := \{A \in \mathbb{R}^{p \times p} \mid \mathrm{col}(A) \subseteq U^{\perp}\}.$$
Then the nuclear norm $R_L(\cdot) = \|\cdot\|_*$ is a decomposable regularization function with respect to the subspace pair $(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp})$.

For the true model parameter $\Theta^\star$, we define its associated structural error set with respect to a subspace pair $(\mathcal{M}, \overline{\mathcal{M}}^{\perp})$ as (Negahban et al., 2012)

$$\mathcal{C}(\mathcal{M}, \overline{\mathcal{M}}^{\perp}; \Theta^\star) := \big\{\Delta \in \mathbb{R}^{p \times p} \ \big|\ R(\Delta_{\overline{\mathcal{M}}^{\perp}}) \le 3\, R(\Delta_{\overline{\mathcal{M}}}) + 4\, R(\Theta^\star_{\mathcal{M}^{\perp}})\big\}. \qquad (5)$$

By construction, if the norm of the projection of the true parameter $\Theta^\star$ onto $\mathcal{M}^{\perp}$ is small, then elements of this structural error set also have a limited projection onto the perturbation subspace $\overline{\mathcal{M}}^{\perp}$.

Now let $\Theta^\star$ be the true (marginal) precision matrix of the LVGGM, and let its sparse and low-rank components be $S^\star$ and $L^\star$, respectively. For the subspace pairs $(\mathcal{M}(E), \overline{\mathcal{M}}(E)^{\perp})$ and $(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp})$ defined above, we use $\mathcal{C}(E)$ and $\mathcal{C}(U)$ as shorthand for the corresponding structural error sets centered at $S^\star$ and $L^\star$, i.e., $\mathcal{C}(\mathcal{M}(E), \overline{\mathcal{M}}(E)^{\perp}; S^\star)$ and $\mathcal{C}(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp}; L^\star)$, respectively. Later, we will consider perturbations of $\Theta^\star$ along restricted directions in these two sets.

### 4.2. Assumptions on Fisher Information

We characterize the interaction between elements of the two subspaces through their inner products weighted by the Hessian of the loss function, also known as the Fisher information of the distribution. Denoting the Fisher information matrix of the Gaussian distribution (evaluated at $\Theta^\star$) by $F^\star$, we have $F^\star = \Theta^{\star -1} \otimes \Theta^{\star -1}$, where $\otimes$ is the Kronecker product. We define the Fisher inner product between two matrices $A$ and $B$ as

$$\langle A, B \rangle_{F^\star} := \mathrm{vec}(A)^{T} F^\star\, \mathrm{vec}(B) = \operatorname{Tr}\!\big(\Theta^{\star -1} A\, \Theta^{\star -1} B\big), \qquad (6),(7)$$

where $\mathrm{vec}(\cdot)$ denotes the vectorization of a matrix. Similar to the prior work of Kakade et al. (2010), we define the induced Fisher norm of a matrix $\Delta$ through

$$\|\Delta\|_{F^\star}^2 := \mathrm{vec}(\Delta)^{T} F^\star\, \mathrm{vec}(\Delta) = \operatorname{Tr}\!\big(\Theta^{\star -1} \Delta\, \Theta^{\star -1} \Delta\big). \qquad (8),(9)$$

The first assumption we make is the following Restricted Fisher Eigenvalue (RFE) condition on the true model with respect to the sparse and low-rank structural error sets.

**Assumption 1** (Restricted Fisher Eigenvalue). There exists a constant $\kappa_{\min} > 0$ such that, for all $\Delta \in \mathcal{C}(E) \cup \mathcal{C}(U)$,
$$\|\Delta\|_{F^\star}^2 \ \ge\ \kappa_{\min}\, \|\Delta\|_F^2. \qquad (10)$$

This RFE condition generalizes the restricted eigenvalue (RE) condition for sparsity-promoting linear regression problems (Bickel et al., 2009). It assumes that the minimum eigenvalue of the Fisher information is bounded away from zero along the directions in $\mathcal{C}(E)$ and $\mathcal{C}(U)$. Due to the identity (8) and properties of the Kronecker product, a trivial lower bound for $\kappa_{\min}$ is $\lambda_{\min}^2(\Theta^{\star -1})$, where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue. In the high-dimensional setting, the RFE parameter $\kappa_{\min}$, which is defined only with respect to the above restricted set of directions, can be substantially larger than $\lambda_{\min}^2(\Theta^{\star -1})$. As a result, the derived error bounds, which depend on $\kappa_{\min}$, are generally tighter than bounds depending on $\lambda_{\min}^2(\Theta^{\star -1})$ (cf. Theorem 1).
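The identities (6)-(9) and the trivial lower bound on $\kappa_{\min}$ are easy to verify numerically; the short sketch below (plain NumPy; the random test matrices and sizes are arbitrary choices, not from the paper) checks the Kronecker and trace forms against each other.

```python
# Numerical check of the Fisher inner product (6)-(7), the Fisher norm (8)-(9),
# and the bound ||A||_{F*}^2 >= lambda_min(Theta^{-1})^2 * ||A||_F^2.
import numpy as np

rng = np.random.default_rng(1)
p = 5
M = rng.standard_normal((p, p))
Theta = M @ M.T + p * np.eye(p)          # a positive definite "true precision matrix"
Theta_inv = np.linalg.inv(Theta)

A = rng.standard_normal((p, p)); A = (A + A.T) / 2   # symmetric test matrices
B = rng.standard_normal((p, p)); B = (B + B.T) / 2

F = np.kron(Theta_inv, Theta_inv)                     # Fisher information at Theta
vec = lambda X: X.flatten(order="F")                  # column-stacking vectorization
kron_form = vec(A) @ F @ vec(B)                       # vec(A)^T F vec(B), eq. (6)
trace_form = np.trace(Theta_inv @ A @ Theta_inv @ B)  # Tr(Theta^{-1} A Theta^{-1} B), eq. (7)
assert np.allclose(kron_form, trace_form)

fisher_norm_sq = np.trace(Theta_inv @ A @ Theta_inv @ A)       # eq. (9)
lam_min = np.linalg.eigvalsh(Theta_inv).min()
assert fisher_norm_sq >= lam_min**2 * np.linalg.norm(A, "fro")**2 - 1e-9
```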
Due to the sparse plus low-rank superposition structure, we impose a type of incoherence between the two structural error sets to ensure consistent estimation of the combined model. The incoherence condition limits the interaction between elements of the two sets. For our problem, this interaction occurs through their inner products weighted by the Fisher information, which motivates the following Structural Fisher Incoherence (SFI) assumption (a generalization of the C-linear assumption proposed in Yang & Ravikumar (2013)). Let $P_E := P_{\mathcal{M}(E)}$ denote the projection operator onto the subspace $\mathcal{M}(E)$. Similarly define $P_U := P_{\mathcal{M}(U)}$, $P_{E^{\perp}} := P_{\overline{\mathcal{M}}(E)^{\perp}}$, and $P_{U^{\perp}} := P_{\overline{\mathcal{M}}(U)^{\perp}}$. We assume the following condition on the Fisher information.

**Assumption 2** (Structural Fisher Incoherence). Given a constant $M > 6$, a set of regularization parameters $(\lambda, \mu)$, and the subspace pairs $(\mathcal{M}(E), \overline{\mathcal{M}}(E)^{\perp})$ and $(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp})$ defined above, let
$$\Lambda := 2 + 3 \max\left\{\frac{\lambda\sqrt{s}}{\mu\sqrt{r}},\ \frac{\mu\sqrt{r}}{\lambda\sqrt{s}}\right\},$$
where $s = |E|$ and $r = \operatorname{rank}(U)$. Then the Fisher information $F^\star$ satisfies
$$\max\big\{\sigma(P_E F^\star P_U),\ \sigma(P_{E^{\perp}} F^\star P_U),\ \sigma(P_E F^\star P_{U^{\perp}}),\ \sigma(P_{E^{\perp}} F^\star P_{U^{\perp}})\big\} \ \le\ \frac{\kappa_{\min}}{c_1 \Lambda^2},$$
where $\sigma(\cdot)$ denotes the maximum singular value and the constant $c_1$ is defined as $c_1 := 16M$.

The constant $M$ is related to a burn-in period after which the likelihood loss function has desirable properties in a small neighborhood of the true parameter. In particular, when $M = 7$, the constant $c_1 = 112$ suffices for our theory to hold. See the main theorem and its proof for more discussion of this quantity.

It is interesting to compare our SFI assumption with similar assumptions in the literature on GGM estimation. In Ravikumar et al. (2011), a form of irrepresentability condition is assumed, which limits the induced $\ell_1$ norm of a matrix similar to the projection of the Fisher information onto the sparse matrix subspace pair. In Chandrasekaran et al. (2012), the notion of irrepresentability is extended to two subspace pairs (i.e., sparse and low-rank), but more detailed behavior of the projected Fisher information is controlled (see the main assumption on page 1949 of Chandrasekaran et al. (2012)). A more general form of irrepresentability has been shown to be necessary for model selection consistency; see Lee et al. (2013) for a recent discussion. In contrast to the above line of work, the SFI assumption we make only controls the maximum singular values of the projected Fisher information. This is because we are interested in bounding a weaker quantity, the Frobenius norm of the parameter estimation error, instead of establishing the stronger model selection consistency of Ravikumar et al. (2011) or the algebraic consistency of Chandrasekaran et al. (2012).

### 4.3. Error Bounds for LVGGM Estimation

We have the following bound on the parameter error of the estimated precision matrix of the LVGGM, $\widehat{\Theta} = \widehat{S} + \widehat{L}$, obtained by solving the regularized ML problem (4).

**Theorem 1.** Suppose Assumptions 1 and 2 hold for the true marginal precision matrix $\Theta^\star$, and the regularization parameters are chosen such that
$$\lambda \ \ge\ 2\,\|\Sigma^\star - \widehat{\Sigma}\|_{\infty} \quad\text{and}\quad \mu \ \ge\ 2\,\|\Sigma^\star - \widehat{\Sigma}\|_{2}. \qquad (11)$$
Given a constant $M > 6$, if an optimal solution pair $(\widehat{S}, \widehat{L})$ of the convex program (4) satisfies
$$\max\big\{\|\widehat{S} - S^\star\|_F,\ \|\widehat{L} - L^\star\|_F\big\} \ \le\ \frac{1}{6M^2}, \qquad (12)$$
then we have the following error bound for the estimated precision matrix $\widehat{\Theta} = \widehat{S} + \widehat{L}$:
$$\|\widehat{\Theta} - \Theta^\star\|_F \ \le\ \frac{c}{\kappa_L}\big(\lambda\sqrt{s} + \mu\sqrt{r}\big) \ +\ \frac{c'}{\kappa_L}\, e_{\mathrm{approx}}, \qquad (13)$$
where $s = |E|$, $r = \operatorname{rank}(U)$, $c, c' > 0$ are constants depending only on $M$,
$$\kappa_L := \frac{M-2}{2(M-1)}\,\kappa_{\min}, \qquad (14)$$
and the approximation error is
$$e_{\mathrm{approx}} := \lambda\,\big\|P_{\overline{\mathcal{M}}(E)^{\perp}}(S^\star)\big\|_1 + \mu \sum_{j > r} \sigma_j(L^\star). \qquad (15)$$

Proof sketch. The proof is inspired by Yang & Ravikumar (2013), in which a parameter estimation error bound is proven for a class of superposition-structured parameters, such as sparse plus low-rank, estimated through M-estimation with decomposable regularizers. Critical to specializing this framework to our LVGGM estimation problem is verifying two conditions on the log-likelihood loss function (3): restricted strong convexity (RSC) and structural incoherence (SI). The RSC condition (originally proposed in Negahban et al. (2012)) requires the loss function to be sufficiently curved (i.e., lower bounded by a quadratic function) along a restricted set of directions (defined by $\mathcal{C}(E)$ and $\mathcal{C}(U)$). The SI condition, on the other hand, limits certain interactions between elements from the above two structural error sets.
In Yang & Ravikumar (2013), under certain C-linear assumptions, the RSC and SI conditions are verified for several problems with quadratic loss functions. For the LVGGM estimation problem, however, the technical difficulty lies in the non-quadratic log-likelihood loss (3), for which the previously established RSC and SI conditions do not hold. To deal with this difficulty, we leverage the almost strong convexity property (Kakade et al., 2010) to characterize the behavior of the sum of the higher-order terms in the Taylor series of the log-likelihood loss function. We show that, in the regime specified by condition (12), the loss function is well-approximated by the sum of a quadratic function and a residual term. Under this condition, the RFE assumption (Assumption 1) guarantees the RSC condition (cf. Lemma 2), and the SFI assumption (Assumption 2) implies the SI condition (cf. Lemma 4). Theorem 1 can then be proven using the general theorem of Yang & Ravikumar (2013). A detailed proof of Theorem 1 can be found in Appendix B.

We make the following remarks:

- The error bound (13) is a family of upper bounds indexed by the choice of the subspace pairs $(\mathcal{M}(E), \overline{\mathcal{M}}(E)^{\perp})$ and $(\mathcal{M}(U), \overline{\mathcal{M}}(U)^{\perp})$. The tightest bound is achieved by appropriately choosing $E$ and $U$. The first additive term in (13) captures the estimation error, while the second term captures the approximation error. In many cases it is reasonable to assume the approximation error is zero, in which case the error bound reduces to the first additive term.

- Similar derivations also apply to $\ell_1$-regularized estimation of a sparse GGM. For the sparse GGM, only Assumption 1 is required, and the derivations simplify considerably. The final error bound again contains estimation and approximation errors, now depending only on the sparse matrix subspace pair. However, when the true precision matrix $\Theta^\star$ cannot be well-approximated by a sparse matrix (as in the LVGGM case), the approximation error would be much larger, leading to an inefficient learning rate.

- Finally, the SFI assumption can be relaxed to an even milder incoherence condition, $\|L^\star\|_{\infty} \le \alpha$, as considered in Agarwal et al. (2012). Following derivations similar to the proof of Theorem 1, a corresponding error bound can be obtained. However, as a result of this incoherence assumption, the error bound would contain an additional incoherence term which does not vanish even with infinitely many samples. This disadvantage is overcome under the structural incoherence condition.

The statement of Theorem 1 is deterministic in nature and applies to any optimum of the convex program. However, the condition on the regularization parameters (11) and the error bound depend on the sampled data (in particular the sample covariance matrix $\widehat{\Sigma}$), which is random. Therefore the key to specifying the regularization parameters, and hence obtaining error bounds independent of the data, is to derive tight deviation bounds on the sample covariance matrix in the elementwise $\ell_\infty$ and spectral norms, such that condition (11) holds with high probability.
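These two deviation norms are easy to examine empirically; the sketch below (plain NumPy, an identity true covariance, and illustrative sample sizes, not the paper's experiments) compares them to the familiar $\sqrt{\log p / n}$ and $\sqrt{p / n}$ scalings that drive the choices of $\lambda$ and $\mu$ in the corollary that follows.

```python
# Monte Carlo look at ||Sigma_hat - Sigma*||_inf and ||Sigma_hat - Sigma*||_2
# for N(0, I_p) data: the elementwise error tracks sqrt(log p / n), while the
# spectral-norm error tracks sqrt(p / n).
import numpy as np

rng = np.random.default_rng(2)
p = 100
for n in (200, 800, 3200):
    X = rng.standard_normal((n, p))          # n i.i.d. samples from N(0, I_p)
    D = X.T @ X / n - np.eye(p)              # sample-covariance deviation
    print(f"n={n:5d}  ||D||_inf={np.abs(D).max():.3f} "
          f"(sqrt(log p/n)={np.sqrt(np.log(p)/n):.3f})   "
          f"||D||_2={np.linalg.norm(D, 2):.3f} (sqrt(p/n)={np.sqrt(p/n):.3f})")
```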
These bounds can be obtained by using concentration inequalities for Gaussian distributions, which leads to the following corollary.

**Corollary 1.** Let the same assumptions as in Theorem 1 hold. Given constants $C_1 > 1$ and $C_2 \ge 1$, assume that the number of samples satisfies $n \ge \max\{4C_1 \log p,\ C_2\, p\}$, and that the regularization parameters are chosen as
$$\lambda \asymp \sigma^\star\sqrt{\frac{\log p}{n}} \quad\text{and}\quad \mu = 16\, C_2\, \rho^\star\sqrt{\frac{p}{n}}, \qquad (16)$$
where $\sigma^\star = \max_i \Sigma^\star_{i,i}$ and $\rho^\star = \|\Sigma^\star\|_2$. Then with probability at least $1 - 4p^{-2(C_1 - 1)} - 2\exp(-C_2^2\, p/2)$, we have
$$\|\widehat{\Theta} - \Theta^\star\|_F \ \le\ c_1\sqrt{\frac{s\log p}{n}} + c_2\sqrt{\frac{p\, r}{n}}, \qquad (17)$$
where $c_1 = 960\,\sigma^\star/\kappa_L$ and $c_2$ is an analogous constant proportional to $\rho^\star/\kappa_L$.

Remark: The estimation error (17) consists of two terms corresponding to the sparse and low-rank components, respectively. Note its resemblance to the error bounds for robust PCA (e.g., Agarwal et al. (2012); Yang & Ravikumar (2013)) and to the bound derived in Chandrasekaran et al. (2012). In particular, the first term in (17) is of the same order as the estimation error of a sparse GGM (Ravikumar et al., 2011). However, due to the presence of latent variables, both the sample size requirement (i.e., $n \gtrsim p$) and the combined error bound are worse than those for learning the sparse conditional GGM. Next we consider a scenario under which this additional disadvantage is largely removed.

Assume that the true marginal covariance matrix $\Sigma^\star$ has an effective rank $r_{\mathrm{eff}} := r_{\mathrm{eff}}(\Sigma^\star)$ (recall $r_{\mathrm{eff}}(\Sigma^\star) := \operatorname{tr}(\Sigma^\star)/\|\Sigma^\star\|_2$) that is much smaller than $p$. Then, by using recent advances on the non-asymptotic behavior of the sample covariance matrix (Lounici, 2012), we can obtain a much tighter bound which depends on $p$ only logarithmically, as stated in the following theorem.

**Theorem 2.** Let the same assumptions as in Theorem 1 hold. Given a constant $C_1 > 1$, assume that the number of observations satisfies $n \ge \max\{4C_1\log p,\ C_3\, r_{\mathrm{eff}}\log^2(2p)\}$, and that the regularization parameters are chosen as
$$\lambda \asymp \sigma^\star\sqrt{\frac{\log p}{n}} \quad\text{and}\quad \mu = C_4\, \rho^\star\sqrt{\frac{r_{\mathrm{eff}}\log(2p)}{n}}, \qquad (18)$$
where $\sigma^\star = \max_i \Sigma^\star_{i,i}$, $\rho^\star = \|\Sigma^\star\|_2$, and $C_3, C_4 > 0$ are sufficiently large constants. Then with probability at least $1 - 2p^{-2(C_1 - 1)} - (2p)^{-1}$, we have
$$\|\widehat{\Theta} - \Theta^\star\|_F \ \le\ c_1\sqrt{\frac{s\log p}{n}} + c_2\sqrt{\frac{r_{\mathrm{eff}}\, r\log(2p)}{n}}, \qquad (19)$$
where $c_1 = 960\,\sigma^\star/\kappa_L$ and $c_2 = 8C_4\,\rho^\star/\kappa_L$.

Proof sketch. As in Corollary 1, we need to verify that the choices of regularization parameters in (18) satisfy condition (11) with high probability. Since the choice of $\lambda$ has already been verified in Corollary 1, it only remains to verify the condition on $\mu$. To this end, we make use of the following sharp bound on the spectral-norm deviation of the sample covariance matrix.

**Lemma 1** (Lounici (2012)). Let $\widehat{\Sigma}$ be a sample covariance matrix constructed from $n$ i.i.d. samples of a $p$-dimensional Gaussian distribution $N(0, \Sigma^\star)$. Then with probability at least $1 - (2p)^{-1}$,
$$\|\widehat{\Sigma} - \Sigma^\star\|_2 \ \le\ C\,\|\Sigma^\star\|_2 \max\left\{\sqrt{\frac{2\, r_{\mathrm{eff}}\log(2p)}{n}},\ \frac{2\, r_{\mathrm{eff}}\log(2p)\,\big(3/8 + \log(2pn)\big)}{n}\right\},$$
where $C > 0$ is an absolute constant.

Then, as noted in Lounici (2012) (Prop. 3), when the sample size is sufficiently large, namely $n \ge C_3\, r_{\mathrm{eff}}\log^2\max\{2p, n\}$ for a large constant $C_3 > 0$, the choice of the regularization parameter $\mu$ in (18) suffices for condition (11) to hold with high probability.

Notice that when $r_{\mathrm{eff}} \ll p$, the error bound (19) is significantly tighter than the bound (17). The sample size requirement $n \gtrsim r_{\mathrm{eff}}\log p$ is also much milder. This result implies the efficiency of LVGGM learning when the true covariance model has a low effective rank.

## 5. Experiments

We use a set of simulations on synthetic data to verify our reduced effective rank assumption on the covariance matrix of the LVGGM, and the error bounds derived in Theorem 2.

### 5.1. Effective Rank of Covariance of LVGGM

To better understand the effective rank of the covariance matrix of a LVGGM, it is convenient to consider a hierarchical generating process for the observed variables: $x_O = A x_L + z$, where $x_L \sim N(0, \Omega_{L,L})$ are the latent variables, $A := -J_{O,O}^{-1} J_{O,L} \in \mathbb{R}^{p \times r}$, and $z \sim N(0, S^{-1})$ captures the conditional (local) effects. The marginal covariance matrix of the observed variables can be represented as
$$\Sigma = \underbrace{A\,\Omega_{L,L} A^{T}}_{=:\,G} + S^{-1},$$
where $G$ is a low-rank covariance matrix (global effects) and $S^{-1}$ is a non-sparse covariance matrix (conditionally local effects) whose inverse is sparse.
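The hierarchical construction above is straightforward to simulate. The sketch below (plain NumPy; a chain $S$, an identity $\Omega_{L,L}$, and a loading-scale sweep are illustrative stand-ins for the paper's random 5%-sparse models) shows how the effective rank of $\Sigma = G + S^{-1}$ shrinks as the global/local energy ratio grows.

```python
# Effective rank of a synthetic LVGGM covariance versus the global/local energy ratio.
import numpy as np

def lvggm_covariance_parts(p, r, scale, rng):
    """Return (G, S_inv) with Sigma = G + S_inv = A Omega_LL A^T + S^{-1}."""
    S = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))   # sparse chain precision
    Omega_LL = np.eye(r)                                        # independent latent variables
    A = scale * rng.standard_normal((p, r))                     # dense observed-latent loadings
    return A @ Omega_LL @ A.T, np.linalg.inv(S)

rng = np.random.default_rng(3)
p, r = 200, 10
for scale in (0.05, 0.2, 1.0):                                  # sweeps the energy ratio
    G, S_inv = lvggm_covariance_parts(p, r, scale, rng)
    Sigma = G + S_inv
    ratio = np.trace(G) / np.trace(S_inv)                       # Tr(G) / Tr(S^{-1})
    r_eff = np.trace(Sigma) / np.linalg.norm(Sigma, 2)          # effective rank (2)
    print(f"energy ratio = {ratio:6.2f}   effective rank = {r_eff:6.1f}   (p = {p})")
```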
While the low-rank global effects naturally result in a concentrated spectrum, the sparse-inverse local effects generally contribute a diffuse spectrum. The effective rank, which is the sum of all eigenvalues divided by the magnitude of the largest one, therefore depends on the relative energy ratio between $G$ and $S^{-1}$. Since an exact characterization of the effective rank in terms of $A$, $\Omega_{L,L}$, and $S$ tends to be difficult, we use Monte Carlo simulations to investigate synthetic LVGGM that conform to our assumptions. We generate LVGGM with independent latent variables (i.e., diagonal $J_{L,L}$), a dense latent-observed submatrix $J_{L,O}$, and a sparse conditional GGM $J_{O,O}$ for the observed variables with a random sparsity pattern (sparsity level 5%). We fix the number of latent variables at 10 and vary the number of observed variables $p \in \{80, 120, 200, 500\}$. By scaling the magnitudes of the elements in the latent-variable submatrix, we sweep the relative energy ratio between the global and local factors, i.e., $\operatorname{tr}(G)/\operatorname{tr}(S^{-1})$, from 0.1 to 10. After 550 realizations for each value of $p$, we plot the empirical effective ranks of the observed covariance matrices in Figure 2.

Figure 2. Effective ranks of covariance matrices of LVGGM with various global/local energy ratios ($r = 10$; $p = 80, 120, 200, 500$).

As seen in the figure, when the global factor dominates (i.e., the ratio is large), the effective rank of the covariance matrix is very small, as expected. On the other hand, when the local effects become stronger (e.g., when the number of observed variables $p$ increases), the effective rank increases, but at a very mild rate. In particular, when $p$ increases from 80 to 500, the maximum empirical effective rank in our simulations only increases from 4 to 26. For all of our simulated LVGGM, the empirical effective ranks are at least an order of magnitude smaller than $p$. This mild growth rate of the effective rank (compared to $p$) is what allows the improved error bound of Theorem 2 to hold.

### 5.2. Frobenius Norm Error of LVGGM Estimation

We simulate LVGGM data with the number of observed variables $p \in \{160, 200, 320, 400\}$ and the number of latent variables $r \in \{0.1, 0.15, 0.2, 0.3\}\cdot p$. The sparse conditional GGM is a chain graph whose associated precision matrix is tridiagonal, with off-diagonal elements $S_{i,i-1} = S_{i,i+1} = 0.4\, S_{i,i}$ for $i \in \{2, \ldots, p-1\}$. For each configuration of $p$ and $r$, we draw $n$ samples from the LVGGM, where $n$ ranges from 200 to 1000. Using these samples, the precision matrix $\widehat{\Theta}$ is learned by solving the regularized ML estimation problem (4).
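A single configuration of this simulation can be sketched as follows, assuming the hypothetical `fit_lvggm()` helper sketched after Section 3.4; the sizes and regularization values are deliberately small, illustrative choices rather than the paper's cross-validated settings.

```python
# One small chain-graph LVGGM configuration: generate data, fit (4), report the error.
import numpy as np

rng = np.random.default_rng(4)
p, r, n = 60, 6, 400                                    # far smaller than the paper's settings

# Tridiagonal chain: S[i, i-1] = S[i, i+1] = 0.4 * S[i, i].
S_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
J_OL = 0.02 * rng.standard_normal((p, r))               # dense observed-latent block
L_true = -J_OL @ J_OL.T                                 # low-rank component (J_LL = I)
Theta_true = S_true + L_true                            # marginal precision, eq. (1)

# Draw n samples from N(0, Theta_true^{-1}) and estimate Theta by solving (4).
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta_true), size=n).T   # p x n
S_hat, L_hat = fit_lvggm(X, lam=0.05, mu=0.5)           # illustrative (lambda, mu)
err = np.linalg.norm(S_hat + L_hat - Theta_true, "fro")

# Rescaled sample size used on the x-axis of Figure 3; s ~ 3p nonzeros for the chain.
rescaled_n = n / (3 * p * np.log(p) + r * np.log(2 * p))
print(f"Frobenius error = {err:.3f} at rescaled sample size {rescaled_n:.3f}")
```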
As shown in Section 5.1, the effective rank of the covariance matrix grows only mildly with $p$. Theorem 2 then predicts that the Frobenius error of the estimated precision matrix of the LVGGM should scale as $\|\widehat{\Theta} - \Theta^\star\|_F \lesssim \sqrt{(s\log p + r\log(2p))/n}$, when the regularization parameters are chosen such that $\lambda \propto \sqrt{\log p / n}$ and $\mu \propto \rho\sqrt{r_{\mathrm{eff}}\log p / n}$. Guided by this theoretical result, we set the regularization parameters as $\lambda = C_a\sqrt{\log p / n}$ and $\mu = C_b\,\rho\sqrt{r_{\mathrm{eff}}\log p / n}$, where the constants $C_a$ and $C_b$ are cross-validated and then fixed for all test data sets with different configurations. We plot the Frobenius estimation errors against the rescaled sample size $n/(s\log p + r\log(2p))$ in Figure 3. Over a wide range of configurations, almost all of the empirical error curves align and take the form $f(t) \propto t^{-1/2}$ when the sample size is rescaled, as predicted by Theorem 2. In practice, when the true model is unknown, one can set the regularization parameters according to the sample versions of the quantities $\sigma^\star$ and $\rho^\star$, as discussed in Lounici (2012).

Figure 3. Simulations for chain graphical models with latent variables: plots of the Frobenius norm error $\|\widehat{\Theta} - \Theta^\star\|_F$ versus the rescaled sample size $n/(s\log(p) + r\log(2p))$, for $p \in \{160, 200, 320, 400\}$ and $r \in \{0.1, 0.15, 0.2, 0.3\}\cdot p$.

## 6. Conclusions

We consider a family of latent variable Gaussian graphical models whose precision matrix has a sparse plus low-rank structure. We derive parameter error bounds for regularized maximum likelihood estimation. Future work includes extending the framework to other distributions and applying it to tasks such as prediction and detection.

## Acknowledgement

This work is in part supported by ARO grant W911NF-11-1-0391. The authors thank the anonymous reviewers for their valuable comments, along with Jason Lee and Yuekai Sun for helpful discussions.

## References

Agarwal, Alekh, Negahban, Sahand, and Wainwright, Martin J. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.

Bickel, Peter J, Ritov, Yaacov, and Tsybakov, Alexandre B. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

Candès, Emmanuel J, Li, Xiaodong, Ma, Yi, and Wright, John. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

Chandrasekaran, Venkat, Parrilo, Pablo A, and Willsky, Alan S. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1935–1967, 2012.

Choi, Myung Jin, Chandrasekaran, Venkat, and Willsky, Alan S. Gaussian multiresolution models: Exploiting sparse Markov and covariance structure. IEEE Transactions on Signal Processing, 58(3):1012–1024, 2010.

Choi, Myung Jin, Tan, Vincent Y. F., Anandkumar, Animashree, and Willsky, Alan S. Learning latent tree graphical models. Journal of Machine Learning Research, 12:1729–1770, 2011.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

Hsieh, Cho-Jui, Sustik, Mátyás A, Dhillon, Inderjit, Ravikumar, Pradeep, and Poldrack, Russell. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems, pp. 3165–3173, 2013.

Kakade, Sham, Shamir, Ohad, Sridharan, Karthik, and Tewari, Ambuj. Learning exponential families in high-dimensions: Strong convexity and sparsity. In International Conference on Artificial Intelligence and Statistics, pp. 381–388, 2010.

Kalaitzis, Alfredo and Lawrence, Neil D. Residual component analysis: Generalising PCA for more flexible inference in linear-Gaussian models. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 209–216, 2012.

Lauritzen, S. L. Graphical Models, volume 17. Oxford University Press, USA, 1996.

Lee, Jason, Sun, Yuekai, and Taylor, Jonathan E. On model selection consistency of penalized M-estimators: a geometric theory. In Advances in Neural Information Processing Systems, pp. 342–350, 2013.
Liu, Ying and Willsky, Alan. Learning Gaussian graphical models with observed or latent FVSs. In Advances in Neural Information Processing Systems, pp. 1833–1841, 2013.

Lounici, Karim. High-dimensional covariance matrix estimation with missing observations. arXiv preprint arXiv:1201.2577, 2012.

Ma, Shiqian, Xue, Lingzhou, and Zou, Hui. Alternating direction methods for latent variable Gaussian graphical model selection. Neural Computation, 25(8):2172–2198, 2013.

Negahban, Sahand N, Ravikumar, Pradeep, Wainwright, Martin J, and Yu, Bin. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

Ravikumar, Pradeep, Wainwright, Martin J, Raskutti, Garvesh, and Yu, Bin. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

Rothman, Adam J, Bickel, Peter J, Levina, Elizaveta, and Zhu, Ji. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.

Vershynin, Roman. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Xu, Huan, Caramanis, Constantine, and Sanghavi, Sujay. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.

Yang, Eunho and Ravikumar, Pradeep. Dirty statistical models. In Advances in Neural Information Processing Systems 26, pp. 611–619, 2013.