# Elementary Estimators for Graphical Models

Eunho Yang, IBM T.J. Watson Research Center, eunhyang@us.ibm.com
Aurélie C. Lozano, IBM T.J. Watson Research Center, aclozano@us.ibm.com
Pradeep Ravikumar, University of Texas at Austin, pradeepr@cs.utexas.edu

Abstract

We propose a class of closed-form estimators for sparsity-structured graphical models, expressed as exponential family distributions, under high-dimensional settings. Our approach builds on observing the precise manner in which the classical graphical model MLE breaks down under high-dimensional settings. Our estimator uses a carefully constructed, well-defined and closed-form backward map, and then performs thresholding operations to ensure the desired sparsity structure. We provide a rigorous statistical analysis showing that, surprisingly, our simple class of estimators recovers the same asymptotic convergence rates as those of the $\ell_1$-regularized MLEs, which are much more difficult to compute. We corroborate this statistical performance, as well as significant computational advantages, via simulations of both discrete and Gaussian graphical models.

1 Introduction

Undirected graphical models, also known as Markov random fields (MRFs), are a powerful class of statistical models that represent distributions over a large number of variables using graphs, where the structure of the graph encodes Markov conditional independence assumptions among the variables. MRFs are widely used in a variety of domains, including natural language processing [1], image processing [2, 3, 4], statistical physics [5], and spatial statistics [6]. Popular instances of this class of models include Gaussian graphical models (GMRFs) [7, 8, 9, 10], used for modeling continuous real-valued data, and discrete graphical models, including the Ising model, where each variable takes values in a discrete set [10, 11, 12].

In this paper, we consider the problem of high-dimensional estimation, where the number of variables p may exceed the number of observations n. In such high-dimensional settings, it is still possible to perform consistent estimation by leveraging low-dimensional structure. Sparse and group-sparse structural constraints, where few parameters (or parameter groups) are non-zero, are particularly pertinent in the context of MRFs, as they translate into graphs with few edges. A key class of estimators for learning graphical models has thus been based on maximum likelihood estimators (MLE) with sparsity-encouraging regularization. For the task of sparse GMRF estimation, the state-of-the-art estimator minimizes the Gaussian negative log-likelihood regularized by the $\ell_1$ norm of the entries (or the off-diagonal entries) of the concentration matrix (see [8, 9, 10]). Strong statistical guarantees for this estimator have been established (see [13] and references therein). The resulting optimization problem is a log-determinant program, which can be solved in polynomial time with interior point methods [14], or by coordinate descent algorithms [9, 10]. In a computationally simpler approach for sparse GMRFs, [7] proposed the use of neighborhood selection, which consists of estimating conditional independence relationships separately for each node in the graph via $\ell_1$-regularized linear regression, or LASSO [15]. They showed that this procedure can consistently recover the sparse GMRF structure even under high-dimensional settings.
The neighborhood selection approach has also been successfully applied to discrete Markov random fields. In particular, for binary graphical models, [11] showed that consistent neighborhood selection can be performed via $\ell_1$-regularized logistic regression. These results were generalized to general discrete graphical models (where each variable can take $m \geq 2$ values) by [12], through node-wise multi-class logistic regression with group sparsity. A related regularized convex program for sparse GMRFs is the CLIME estimator [16], which reduces the estimation problem to solving linear programs. Overall, while state-of-the-art optimization methods have been developed to solve all of these regularized (and consequently non-smooth) convex programs, their iterative approach is very expensive for large-scale problems. Indeed, scaling such regularized convex programs to very large-scale settings has attracted considerable recent research attention.

In this paper, we investigate the following leading question: can one devise simple estimators with closed-form solutions that are nonetheless consistent and achieve the sharp convergence rates of the aforementioned regularized convex programs? This question was originally considered in the context of linear regression by [17], who gave a positive answer. It is thus natural to wonder whether an affirmative response can be provided for the more complicated statistical modeling setting of MRFs as well.

Our key idea is to revisit the vanilla MLE for estimating a graphical model, and to consider where it breaks down in the case of high dimensions. The vanilla graphical model MLE can be expressed as a backward mapping [18] in an exponential family distribution that computes the model parameters corresponding to some given (sample) moments. There are, however, two caveats with this backward mapping: it is not available in closed form for many classes of models, and even if it were available in closed form, it need not be well-defined in high-dimensional settings (i.e., it could lead to unbounded model parameter estimates). Accordingly, we consider the use of carefully constructed proxy backward maps that are both available in closed form and well-defined in high-dimensional settings. We then perform simple thresholding operations on these proxy backward maps to obtain our final estimators. Our class of algorithms is thus both computationally practical and highly scalable. We provide a unified statistical analysis of our class of algorithms for graphical models arising from general exponential families. We then instantiate our analysis for the specific cases of GMRFs and DMRFs, and show that the resulting algorithms come with strong statistical guarantees achieving near-optimal convergence rates, while being computationally much faster than the regularized convex programs. These surprising results are confirmed via simulation for both GMRFs and DMRFs.

There has been considerable recent interest in large-scale statistical model estimation, and in particular in scaling these methods to very large-scale settings. We believe our much simpler class of closed-form graphical model estimators has the potential to become the estimator of choice in such large-scale settings, particularly if it attracts research on optimizing and scaling its closed-form operations.
2 Background and Problem Setup

Since most popular graphical model families can be expressed as exponential families (see [18]), we consider general exponential family distributions for a random variable $X \in \mathbb{R}^p$:

$$\mathbb{P}(X; \theta) = \exp\big\{ \langle \theta, \phi(X) \rangle - A(\theta) \big\}, \qquad (1)$$

where $\theta \in \Omega \subseteq \mathbb{R}^d$ is the canonical parameter to be estimated, $\phi(X)$ denotes the sufficient statistics with feature function $\phi : \mathbb{R}^p \mapsto \mathbb{R}^d$, and $A(\theta)$ is the log-partition function.

An alternative parameterization of the exponential family, to the canonical parameterization above, is via the vector of mean parameters $\mu(\theta) := \mathbb{E}_\theta[\phi(X)]$, which are the moments of the sufficient statistics $\phi(X)$ under the exponential family distribution. We denote the set of all possible moments by the moment polytope $\mathcal{M} = \{\mu : \exists \text{ distribution } p \text{ s.t. } \mathbb{E}_p[\phi(X)] = \mu\}$, which consists of the moments of the sufficient statistics under all possible distributions. The problem of computing the mean parameters $\mu(\theta) \in \mathcal{M}$ given the canonical parameters $\theta \in \Omega$ constitutes the key machine learning problem of inference in graphical models (expressed in the exponential family form (1)). Let us denote this computation via a so-called forward mapping $\mathcal{A} : \Omega \mapsto \mathcal{M}$. By properties of exponential family distributions, the forward mapping $\mathcal{A}$ can be expressed in terms of the first derivative of the log-partition function $A(\theta)$:

$$\mathcal{A} : \theta \mapsto \mu = \nabla A(\theta).$$

It can be shown that this map is injective (one-to-one with its range) if the exponential family is minimal. Moreover, it is onto the interior of $\mathcal{M}$, denoted by $\mathcal{M}^\circ$. Thus, for any mean parameter $\mu \in \mathcal{M}^\circ$, there exists a canonical parameter $\theta(\mu) \in \Omega$ such that $\mathbb{E}_{\theta(\mu)}[\phi(X)] = \mu$. Unless the exponential family is minimal, however, the corresponding canonical parameter $\theta(\mu)$ need not be unique. Thus, while there will always exist a so-called backward mapping $\mathcal{A}^* : \mathcal{M}^\circ \mapsto \Omega$ that computes the canonical parameters corresponding to given moments, it need not be unique. A candidate backward map can be constructed via the conjugate of the log-partition function, $A^*(\mu) = \sup_{\theta \in \Omega} \langle \theta, \mu \rangle - A(\theta)$:

$$\mathcal{A}^* : \mu \mapsto \theta = \nabla A^*(\mu).$$

2.1 High-dimensional Graphical Model Selection

We focus on the high-dimensional setting, where the number of variables $p$ may greatly exceed the sample size $n$. Under such high-dimensional settings, it is now well understood that consistent estimation is possible if structural constraints are imposed on the model parameters $\theta$. In this paper, we focus on the structural constraint of sparsity, for which the $\ell_1$ norm is known to be well-suited. Given $n$ samples $\{X^{(i)}\}_{i=1}^n$ from $\mathbb{P}(X; \theta^*)$ belonging to an exponential family (1), a popular class of M-estimators for recovering the sparse model parameter $\theta^*$ is the $\ell_1$-regularized maximum likelihood estimator:

$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ -\langle \theta, \widehat{\phi} \rangle + A(\theta) + \lambda_n \|\theta\|_1 \Big\}, \qquad (2)$$

where $\widehat{\phi} := \frac{1}{n}\sum_{i=1}^n \phi(X^{(i)})$ is the empirical mean of the sufficient statistics. Since the log-partition function $A(\theta)$ in (1) is convex, the problem (2) is convex as well.
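To make the forward and backward maps concrete before turning to graphical model families, the following minimal sketch (ours, not from the paper) instantiates them for the simplest exponential family, a single Bernoulli variable, where both maps are available in closed form and the boundary issue that motivates our proxy maps is easy to see; the function names are ours.

```python
# Illustrative sketch (not the paper's code): forward and backward maps for a
# single Bernoulli variable with P(x; theta) = exp(theta * x - A(theta)),
# where A(theta) = log(1 + exp(theta)).
import numpy as np

def forward_map(theta):
    """mu = grad A(theta): the mean parameter, here P(X = 1)."""
    return 1.0 / (1.0 + np.exp(-theta))

def backward_map(mu):
    """theta = grad A*(mu): recovers the canonical parameter from the moment."""
    return np.log(mu / (1.0 - mu))

theta_true = 0.8
mu = forward_map(theta_true)
assert np.isclose(backward_map(mu), theta_true)  # A* inverts A on the interior of M = (0, 1)

# If every sample takes the value 1, the empirical moment mu_hat = 1 lies on the
# boundary of the moment polytope M = [0, 1] and backward_map(1.0) diverges --
# the same failure mode the MLE backward map exhibits for graphical models in
# high dimensions.
```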
We now briefly review the two most popular examples of exponential families in the context of graphical models.

Gaussian Graphical Models. Consider a random vector $X = (X_1, \ldots, X_p)$ with associated $p$-variate Gaussian distribution $N(X \mid \mu, \Sigma)$, mean vector $\mu$ and covariance matrix $\Sigma$. The probability density function of $X$ can be formulated as an instance of (1):

$$\mathbb{P}(X \mid \theta, \Theta) = \exp\Big\{ -\tfrac{1}{2}\langle\langle \Theta, XX^\top \rangle\rangle + \langle \theta, X \rangle - A(\theta, \Theta) \Big\}, \qquad (3)$$

where $\langle\langle A, B \rangle\rangle$ denotes the trace inner product $\mathrm{tr}(A B^\top)$. Here, the canonical parameters are the precision matrix $\Theta$ and a vector $\theta$, with domain $\Omega := \{(\theta, \Theta) \in \mathbb{R}^p \times \mathbb{R}^{p \times p} : \Theta \succ 0, \ \Theta = \Theta^\top\}$. The corresponding moment parameters of the graphical model distribution are given by the mean $\mu = \mathbb{E}_\theta[X]$ and the covariance matrix $\Sigma = \mathbb{E}_\theta[XX^\top]$ of the Gaussian. The forward map $\mathcal{A} : (\theta, \Theta) \mapsto (\mu, \Sigma)$ computing these from the canonical parameters can be written as $\Sigma = \Theta^{-1}$ and $\mu = \Theta^{-1}\theta$. The moment polytope can be seen to be given by $\mathcal{M} = \{(\mu, \Sigma) \in \mathbb{R}^p \times \mathbb{R}^{p \times p} : \Sigma - \mu\mu^\top \succeq 0\}$, with interior $\mathcal{M}^\circ = \{(\mu, \Sigma) \in \mathbb{R}^p \times \mathbb{R}^{p \times p} : \Sigma - \mu\mu^\top \succ 0\}$. The corresponding backward map $\mathcal{A}^* : (\mu, \Sigma) \mapsto (\theta, \Theta)$ for $(\mu, \Sigma) \in \mathcal{M}^\circ$ can be computed as $\Theta = \Sigma^{-1}$ and $\theta = \Sigma^{-1}\mu$.

Without loss of generality, assume that $\mu = 0$ (and hence $\theta = 0$). Then the set of non-zero entries in the precision matrix $\Theta$ corresponds to the set of edges in an associated Gaussian Markov random field (GMRF). In cases where the graph underlying the GMRF has relatively few edges, it thus makes sense to impose $\ell_1$ regularization on the off-diagonal entries of $\Theta$. Given $n$ i.i.d. random vectors $X^{(i)} \in \mathbb{R}^p$ sampled from $N(0, \Sigma)$, the corresponding $\ell_1$-regularized maximum likelihood estimator (MLE) is given by:

$$\widehat{\Theta} \in \arg\min_{\Theta \succ 0} \Big\{ \langle\langle \Theta, S \rangle\rangle - \log\det \Theta + \lambda_n \|\Theta\|_{1,\mathrm{off}} \Big\}, \qquad (4)$$

where $S$ is the sample covariance matrix, defined as $S := \frac{1}{n}\sum_{i=1}^n (X^{(i)} - \bar{X})(X^{(i)} - \bar{X})^\top$ with $\bar{X} := \frac{1}{n}\sum_{i=1}^n X^{(i)}$, and $\|\cdot\|_{1,\mathrm{off}}$ is the off-diagonal element-wise $\ell_1$ norm.

Discrete Graphical Models. Let $X = (X_1, \ldots, X_p)$ be a random vector where each variable $X_i$ takes values in a discrete set $\mathcal{X}$ of cardinality $m$. Given a graph $G = (V, E)$, a pairwise Markov random field over $X$ is specified via nodewise functions $\theta_s : \mathcal{X} \mapsto \mathbb{R}$ for all $s \in V$, and pairwise functions $\theta_{st} : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ for all $(s,t) \in E$, as

$$\mathbb{P}(X; \theta) = \exp\Big\{ \sum_{s \in V} \theta_s(X_s) + \sum_{(s,t) \in E} \theta_{st}(X_s, X_t) - A(\theta) \Big\}. \qquad (5)$$

This family of probability distributions can be represented using the so-called overcomplete representation [18] as follows. For each random variable $X_s$ and $j \in \{1, \ldots, m\}$, define the nodewise indicator $\mathbb{I}[X_s = j]$, equal to 1 if $X_s = j$ and 0 otherwise. Then the pairwise MRF in (5) can be rewritten as

$$\mathbb{P}(X; \theta) = \exp\Big\{ \sum_{s \in V;\, j \in [m]} \theta_{s;j}\, \mathbb{I}[X_s = j] + \sum_{(s,t) \in E;\, j,k \in [m]} \theta_{st;jk}\, \mathbb{I}[X_s = j, X_t = k] - A(\theta) \Big\} \qquad (6)$$

for a set of parameters $\theta := \{\theta_{s;j}, \theta_{st;jk} : s, t \in V; (s,t) \in E; j, k \in [m]\}$. Given these sufficient statistics, the mean/moment parameters are given by the moments $\mu_{s;j} := \mathbb{E}_\theta\big[\mathbb{I}[X_s = j]\big] = \mathbb{P}(X_s = j; \theta)$ and $\mu_{st;jk} := \mathbb{E}_\theta\big[\mathbb{I}[X_s = j, X_t = k]\big] = \mathbb{P}(X_s = j, X_t = k; \theta)$, which precisely correspond to the nodewise and pairwise marginals of the discrete graphical model. Thus, the forward mapping $\mathcal{A} : \theta \mapsto \mu$ corresponds to the inference task of computing nodewise and pairwise marginals of the discrete graphical model given the canonical parameters. A backward mapping $\mathcal{A}^* : \mu \mapsto \theta$ corresponds to computing a set of canonical parameters such that the corresponding graphical model distribution would yield the given set of nodewise and pairwise marginals. The moment polytope in this case consists of the set of all nodewise and pairwise marginals of any distribution over the random vector $X$, and hence is termed the marginal polytope; it is typically intractable to characterize in high dimensions [18].

Given $n$ i.i.d. samples from an unknown distribution (6) with parameter $\theta^*$, one could consider estimating the graphical model structure with an $\ell_1$-regularized MLE:

$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ -\langle \theta, \widehat{\phi} \rangle + A(\theta) + \lambda \|\theta\|_{1,E} \Big\},$$

where $\|\theta\|_{1,E}$ is the $\ell_1$ norm of the edge parameters, $\|\theta\|_{1,E} = \sum_{s \neq t} \|\theta_{st}\|_1$, and where we have collated the edgewise parameters $\{\theta_{st;jk}\}_{j,k=1}^m$ for an edge $(s,t) \in E$ into the vector $\theta_{st}$. However, there is a critical caveat to actually computing this regularized MLE: the computation of the log-partition function $A(\theta)$ is intractable (see [18] for details).
To overcome this issue, one might instead consider the following class of M-estimators, discussed in [19]:

$$\widehat{\theta} \in \arg\min_{\theta} \Big\{ -\langle \theta, \widehat{\phi} \rangle + B(\theta) + \lambda \|\theta\|_{1,E} \Big\}. \qquad (7)$$

Here $B(\theta)$ is a variational approximation to the log-partition function $A(\theta)$ of the form $B(\theta) = \sup_{\mu \in \mathcal{L}} \langle \theta, \mu \rangle - B^*(\mu)$, where $\mathcal{L}$ is a tractable bound on the marginal polytope $\mathcal{M}$, and $B^*(\mu)$ is a tractable approximation to the graphical model entropy $A^*(\mu)$. An example of such an approximation, which we shall later leverage in this paper, is the tree-reweighted entropy [20], given by $B^*(\mu) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t)} \rho_{st} I_{st}(\mu_{st})$, where $H_s(\mu_s)$ is the entropy for node $s$, $I_{st}(\mu_{st})$ is the mutual information for an edge $(s,t)$, and the $\rho_{st}$ denote edge weights lying in the so-called spanning tree polytope. If all $\rho_{st}$ are set to 1, this boils down to the Bethe approximation [21].

3 Closed-form Estimators for Graphical Models

The state-of-the-art $\ell_1$-regularized MLE estimators discussed in the previous section enjoy strong statistical guarantees but involve solving difficult non-smooth programs. Scaling them to very large-scale problems is thus an important and challenging ongoing research area. In this paper we tackle the scalability issue at the source, by departing from regularized MLE approaches and proposing instead a family of closed-form estimators for graphical models:

Elem-GM Estimation:
$$\mathop{\mathrm{minimize}}_{\theta} \ \|\theta\|_1 \quad \text{s.t.} \quad \big\|\theta - B^*(\widehat{\phi})\big\|_\infty \leq \lambda_n, \qquad (8)$$
where $B^*(\cdot)$ is a proxy for the backward mapping $\mathcal{A}^*$, and $\lambda_n$ is a regularization parameter as in (2).

One of the most important properties of (8) is that the estimator is available in closed form: $\widehat{\theta} = S_{\lambda_n}\big(B^*(\widehat{\phi})\big)$, where $[S_\lambda(u)]_i = \mathrm{sign}(u_i)\max(|u_i| - \lambda, 0)$ is the element-wise soft-thresholding function. This follows from the fact that the optimization problem (8) decomposes into independent element-wise sub-problems, each of which corresponds to soft-thresholding.

To get some intuition for our approach, let us first revisit the classical MLE estimator for graphical models as in (1), and see where it breaks down in a high-dimensional setting:

$$\mathop{\mathrm{minimize}}_{\theta} \ -\langle \theta, \widehat{\phi} \rangle + A(\theta).$$

By the stationarity condition of this optimization problem, the MLE estimator can be simply expressed as a backward mapping $\mathcal{A}^*(\widehat{\phi})$. There are two caveats here in high-dimensional settings. The first is that this backward mapping need not have a simple closed form, and is typically intractable to compute for a large number of variables $p$. The second is that the backward mapping is well-defined only for mean parameters that lie in the interior $\mathcal{M}^\circ$ of the marginal polytope, whereas the sample moments $\widehat{\phi}$ might well lie on the boundary of the marginal polytope. We will illustrate these two caveats in the next two examples. Our key idea is to use instead a well-defined proxy function $B^*(\cdot)$ in lieu of the MLE backward map $\mathcal{A}^*(\cdot)$, so that $B^*(\widehat{\phi})$ is both well-defined under high-dimensional settings and available in simple closed form. The optimization problem (8) then seeks an estimator with minimum complexity in terms of the regularizer $\|\theta\|_1$ while being close enough to the initial estimator $B^*(\widehat{\phi})$ in terms of the element-wise $\ell_\infty$ norm, ensuring that the final estimator has the desired sparse structure.
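As a sanity check on the closed form above, the following short sketch (ours, not the authors' code) spells out the soft-thresholding solution of (8) for a generic proxy backward map supplied by the caller; the names soft_threshold and elem_gm are ours.

```python
# A minimal sketch (ours) of the closed-form Elem-GM estimator (8): element-wise
# soft-thresholding of the proxy backward map B*(phi_hat) at level lambda_n.
import numpy as np

def soft_threshold(u, lam):
    """[S_lambda(u)]_i = sign(u_i) * max(|u_i| - lambda, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def elem_gm(backward_proxy, phi_hat, lam):
    """Since (8) decomposes into independent scalar sub-problems, the solution is
    simply the soft-thresholded proxy backward map evaluated at phi_hat."""
    return soft_threshold(backward_proxy(phi_hat), lam)
```

The specific choices of the proxy backward map for Gaussian and discrete MRFs are spelled out in Sections 4 and 5.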
3.1 Strong Statistical Guarantees of Closed-form Estimators

We now provide a statistical analysis of estimators in (8) under the following structural constraint:

(C-Sparsity) The true canonical exponential family parameter $\theta^*$ is exactly sparse, with $k$ non-zero elements indexed by the support set $S$. All other elements, in $S^c$, are zero.

Theorem 1. Consider any graphical model in (1) with sparse canonical parameter $\theta^*$ as stated in (C-Sparsity). Suppose we solve (8) setting the constraint bound $\lambda_n$ such that $\lambda_n \geq \big\|\theta^* - B^*(\widehat{\phi})\big\|_\infty$.
(A) Then any optimal solution $\widehat{\theta}$ satisfies the following error bounds:
$$\|\widehat{\theta} - \theta^*\|_\infty \leq 2\lambda_n, \qquad \|\widehat{\theta} - \theta^*\|_2 \leq 4\sqrt{k}\,\lambda_n, \qquad \|\widehat{\theta} - \theta^*\|_1 \leq 8k\,\lambda_n.$$
(B) The support set of the estimate $\widehat{\theta}$ correctly excludes all true zero coordinates of $\theta^*$. Moreover, under the additional assumption that $\min_{s \in S} |\theta^*_s| > 2\lambda_n$, it correctly includes all non-zero coordinates of $\theta^*$.

Remarks. Theorem 1 is a non-probabilistic result, and holds deterministically for any selection of $\lambda_n$ and any selection of $B^*(\cdot)$. We then use a probabilistic analysis when applying the theorem to specific distributional settings and choices of the backward map $B^*(\cdot)$. We note that while the theorem analyzes the case of sparsity-structured parameters, our class of estimators, as well as our analyses, can be seamlessly extended to more general structures (such as group sparsity and low rank) by substituting appropriate regularization functions in (8). A key ingredient in our class of closed-form estimators is the proxy backward map $B^*(\widehat{\phi})$. The conditions of the theorem require that this backward map be carefully constructed in order for the error bounds and sparsistency guarantees to hold. In the following sections, we show how to precisely construct such backward maps $B^*(\cdot)$ for specific problem instances, and then derive the corresponding consequences of our abstract theorem as corollaries.

4 Closed-form Estimators for Inverse Covariance Estimation in Gaussian Graphical Models

In this section, we derive a class of closed-form estimators for the multivariate Gaussian setting of Section 2.1. From our discussion of Gaussian graphical models in Section 2.1, the backward mapping from moments to canonical parameters can be simply computed as $\mathcal{A}^*(\Sigma) = \Sigma^{-1}$, but only provided $\Sigma \in \mathcal{M}^\circ := \{\Sigma \in \mathbb{R}^{p \times p} : \Sigma \succ 0\}$. However, given the sample covariance, we cannot simply compute the MLE as $\mathcal{A}^*(S) = S^{-1}$, since the sample covariance matrix is rank-deficient, and hence does not belong to $\mathcal{M}^\circ$, under high-dimensional settings where $p > n$.

In our estimation framework (8), we thus use an alternative backward mapping $B^*(\cdot)$ obtained via a thresholding operator. Specifically, for any matrix $M \in \mathbb{R}^{p \times p}$, we consider the family of thresholding operators $T_\nu(M) : \mathbb{R}^{p \times p} \to \mathbb{R}^{p \times p}$ with thresholding parameter $\nu$, defined as $[T_\nu(M)]_{ij} := \rho_\nu(M_{ij})$, where $\rho_\nu(\cdot)$ is an element-wise thresholding operator. Soft-thresholding is a natural option; however, along the lines of [22], we can use arbitrary sparse thresholding operators satisfying the conditions:

(C-Thresh) For any input $a \in \mathbb{R}$, (i) $|\rho_\nu(a)| \leq |a|$, (ii) $\rho_\nu(a) = 0$ for $|a| \leq \nu$, and finally (iii) $|\rho_\nu(a) - a| \leq \nu$.

As long as $T_\nu(S)$ is invertible (which we shall examine in Section 4.1), we can define $B^*(S) := [T_\nu(S)]^{-1}$ and obtain the following class of estimators:

Elem-GGM Estimation:
$$\mathop{\mathrm{minimize}}_{\Theta} \ \|\Theta\|_{1,\mathrm{off}} \quad \text{s.t.} \quad \big\|\Theta - [T_\nu(S)]^{-1}\big\|_{\infty,\mathrm{off}} \leq \lambda_n, \qquad (9)$$
where $\|\cdot\|_{\infty,\mathrm{off}}$ is the off-diagonal element-wise $\ell_\infty$ norm, the dual of $\|\cdot\|_{1,\mathrm{off}}$.

Comparison with related work. Note that [16] suggest a Dantzig-like estimator:
$$\mathop{\mathrm{minimize}}_{\Theta} \ \|\Theta\|_1 \quad \text{s.t.} \quad \|S\Theta - I\|_\infty \leq \lambda_n,$$
where $\|\cdot\|_1$ and $\|\cdot\|_\infty$ are the entry-wise $\ell_1$ and $\ell_\infty$ norms of a matrix, respectively. This estimator applies the penalty even to the diagonal elements, so that the problem can be decoupled into multiple, but much simpler, optimization problems. It still requires solving $p$ linear programs with $2p$ linear constraints each. On the other hand, the estimator (9) has a closed-form solution as long as $T_\nu(S)$ is invertible.
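The following sketch (ours, not the authors' implementation) spells out Elem-GGM with soft-thresholding as one particular $\rho_\nu$ satisfying (C-Thresh). The helper names and the centering assumption are ours; invertibility of $T_\nu(S)$ is simply assumed here, and only the off-diagonal entries of the inverse are thresholded in the final step, matching the $\|\cdot\|_{1,\mathrm{off}}$ regularizer.

```python
# A minimal sketch (ours) of Elem-GGM (9): threshold the sample covariance with a
# generalized thresholding operator T_nu, invert it, and soft-threshold the
# off-diagonal entries of the inverse at level lambda_n.
import numpy as np

def soft_threshold(M, t):
    # element-wise soft-thresholding; one valid choice of rho_nu under (C-Thresh)
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def elem_ggm(X, nu, lam):
    """X: (n, p) data matrix with rows X^(i), assumed centered (mu = 0)."""
    n, _ = X.shape
    S = X.T @ X / n                      # sample covariance
    T = soft_threshold(S, nu)            # T_nu(S); assumed invertible (Section 4.1)
    B = np.linalg.inv(T)                 # proxy backward map B*(S) = [T_nu(S)]^{-1}
    Theta = soft_threshold(B, lam)       # soft-threshold off-diagonal entries only,
    np.fill_diagonal(Theta, np.diag(B))  # since the regularizer is ||Theta||_{1,off}
    return Theta
```

In the experiments of Section 6, both $\nu$ and $\lambda_n$ are set proportional to $\sqrt{\log p / n}$, matching the scalings of Corollary 1 below.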
4.1 Convergence Rates for Elem-GGM

In this section we derive a corollary of Theorem 1 for Elem-GGM. A prerequisite is to show that $B^*(S) := [T_\nu(S)]^{-1}$ is well-defined and well-behaved. The following conditions define a broad class of Gaussian graphical models that satisfy this requirement.

(C-MinInf$\Sigma$) The true canonical parameter $\Theta^*$ of (3) has bounded induced operator norm: $|||\Theta^*|||_\infty := \sup_{w \neq 0 \in \mathbb{R}^p} \frac{\|\Theta^* w\|_\infty}{\|w\|_\infty} \leq \kappa_1$.

(C-Sparse$\Sigma$) The true covariance matrix $\Sigma^* := (\Theta^*)^{-1}$ is approximately sparse along the lines of Bickel and Levina [23]: for some positive constant $D$, $\Sigma^*_{ii} \leq D$ for all diagonal entries, and moreover, for some $0 \leq q < 1$ and $c_0(p)$, $\max_i \sum_{j=1}^p |\Sigma^*_{ij}|^q \leq c_0(p)$. If $q = 0$, this condition boils down to $\Sigma^*$ being sparse. We additionally require $\inf_{w \neq 0 \in \mathbb{R}^p} \frac{\|\Sigma^* w\|_\infty}{\|w\|_\infty} \geq \kappa_2$.

Now we are ready to utilize Theorem 1 and derive the convergence rates for our Elem-GGM (9).

Corollary 1. Consider Gaussian graphical models (3) where the true parameter $\Theta^*$ has $k$ non-zero off-diagonal elements, and the conditions (C-MinInf$\Sigma$) and (C-Sparse$\Sigma$) hold. Suppose that we solve the optimization problem in (9) with a generalized thresholding operator satisfying (C-Thresh), setting $\nu := a\sqrt{\frac{\log p'}{n}}$ with $a := 16(\max_i \Sigma^*_{ii})$ and $p' := \max\{n, p\}$. Furthermore, suppose also that we select $\lambda_n := 4\kappa_1 a \sqrt{\frac{\log p'}{n}}$. Then, as long as $n > c_3 \log p'$, any optimal solution $\widehat{\Theta}$ of (9) satisfies
$$\big\|\widehat{\Theta} - \Theta^*\big\|_{\infty,\mathrm{off}} \leq 8\kappa_1 a \sqrt{\tfrac{\log p'}{n}}, \qquad \big|\big|\big|\widehat{\Theta} - \Theta^*\big|\big|\big|_{F,\mathrm{off}} \leq 16\kappa_1 a \sqrt{\tfrac{k \log p'}{n}}, \qquad \big\|\widehat{\Theta} - \Theta^*\big\|_{1,\mathrm{off}} \leq 32\kappa_1 a\, k \sqrt{\tfrac{\log p'}{n}}$$
with probability at least $1 - c_1 \exp(-c_2 \log p')$.

We remark that the rates in Corollary 1 are asymptotically the same as those for the standard $\ell_1$-regularized MLE estimator in (4); for instance, [13] show that $|||\widehat{\Theta}_{\mathrm{MLE}} - \Theta^*|||_F = O\big(\sqrt{(p + k)\log p / n}\big)$. This is remarkable given the simplicity of Elem-GGM.

5 Closed-form Estimators for Discrete Markov Random Fields

We now specialize our class of closed-form estimators (8) to the setting of discrete Markov random fields described in Section 2.1. In this case, computing the backward mapping $\mathcal{A}^*$ is non-trivial and typically intractable if the graphical structure has loops [18]. We therefore need an approximation of the backward map $\mathcal{A}^*$, for which we leverage the tree-reweighted variational approximation discussed in Section 2.1. Consider the following map $\widetilde{\theta} := B^*_{\mathrm{trw}}(\widehat{\phi})$, where

$$\widetilde{\theta}_{s;j} = \log \widehat{\phi}_{s;j}, \qquad \widetilde{\theta}_{st;jk} = \rho_{st} \log \frac{\widehat{\phi}_{st;jk}}{\widehat{\phi}_{s;j}\,\widehat{\phi}_{t;k}}, \qquad (10)$$

where $\widehat{\phi}_{s;j} = \frac{1}{n}\sum_{i=1}^n \mathbb{I}[X_{s,i} = j]$ and $\widehat{\phi}_{st;jk} = \frac{1}{n}\sum_{i=1}^n \mathbb{I}[X_{s,i} = j]\,\mathbb{I}[X_{t,i} = k]$ are the empirical moments of the sufficient statistics in (6) (we define $0/0 := 1$). It was shown in [20] that $B^*_{\mathrm{trw}}$ satisfies the following property: the (pseudo)marginals computed by performing tree-reweighted variational inference with the parameters $\widetilde{\theta} := B^*_{\mathrm{trw}}(\widehat{\phi})$ yield the sufficient statistics $\widehat{\phi}$. In other words, the approximate backward map $B^*_{\mathrm{trw}}$ computes an element in the pre-image of the approximate forward map given by tree-reweighted variational inference. Since tree-reweighted variational inference approximates the true marginals well in practice, the map $B^*_{\mathrm{trw}}(\cdot)$ is thus a natural candidate for an approximate backward map.

As an alternative to the $\ell_1$-regularized approximate MLE estimators (7), we thus obtain the following class of estimators, using $B^*_{\mathrm{trw}}(\cdot)$ as an instance of (8):

Elem-DMRF Estimation:
$$\mathop{\mathrm{minimize}}_{\theta} \ \|\theta\|_{1,E} \quad \text{s.t.} \quad \big\|\theta - B^*_{\mathrm{trw}}(\widehat{\phi})\big\|_{\infty,E} \leq \lambda_n, \qquad (11)$$
where $\|\cdot\|_{\infty,E}$ is the maximum absolute value of the edge parameters, the dual of $\|\cdot\|_{1,E}$.

Note that, given the empirical means of the sufficient statistics, $B^*_{\mathrm{trw}}(\widehat{\phi})$ can usually be obtained easily, without the need to explicitly specify the log-partition function approximation $B(\theta)$ in (7).
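A minimal sketch (ours, not the authors' code) of the Elem-DMRF pipeline follows: it forms the empirical node and pairwise marginals, applies the tree-reweighted backward map (10), and soft-thresholds the edge parameters. The helper names are ours; a small eps replaces the paper's $0/0 := 1$ convention purely for numerical stability, and a single scalar rho stands in for the edge weights $\rho_{st}$.

```python
# A minimal sketch (ours) of Elem-DMRF (11) for a pairwise discrete MRF over
# {0, ..., m-1}: empirical marginals -> tree-reweighted backward map (10) ->
# soft-thresholding of the edge parameters.
import numpy as np

def empirical_marginals(X, m):
    """X: (n, p) integer array. Returns node marginals (p, m) and
    pairwise marginals (p, p, m, m), i.e. the empirical moments phi_hat."""
    n, p = X.shape
    onehot = np.eye(m)[X]                                   # (n, p, m) indicators I[X_s = j]
    mu_s = onehot.mean(axis=0)                              # phi_hat_{s;j}
    mu_st = np.einsum('isj,itk->stjk', onehot, onehot) / n  # phi_hat_{st;jk}
    return mu_s, mu_st

def elem_dmrf(X, m, lam, rho=1.0, eps=1e-12):
    """Closed-form Elem-DMRF; rho = 1 corresponds to the Bethe weights."""
    mu_s, mu_st = empirical_marginals(X, m)
    theta_s = np.log(mu_s + eps)                            # theta_{s;j} = log phi_hat_{s;j}
    denom = mu_s[:, None, :, None] * mu_s[None, :, None, :] + eps
    theta_st = rho * np.log(mu_st / denom + eps)            # theta_{st;jk} from (10)
    # soft-threshold only the edge parameters (the node parameters are unpenalized)
    theta_st = np.sign(theta_st) * np.maximum(np.abs(theta_st) - lam, 0.0)
    return theta_s, theta_st
```

The group-sparse variant used in the experiments of Section 6 replaces the last step with group-wise thresholding of each edge block $\theta_{st}$.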
5.1 Convergence Rates for Elem-DMRF

We now derive the convergence rates of Elem-DMRF for the case where $B^*(\cdot)$ is chosen as in (10), following the tree-reweighted approximation [20]. Let $\mu^*$ be the true marginals (or mean parameters) obtained from the true log-partition function and the true canonical parameter $\theta^*$: $\mu^* = \nabla A(\theta^*)$. We shall require that the approximation $B^*_{\mathrm{trw}}(\cdot)$ be close enough to the true $\mathcal{A}^*(\cdot)$ in terms of the backward mapping. In addition, we assume that the true marginal distributions are strictly positive.

(C-LogPartition) The tree-reweighted backward map is uniformly close to the true backward map at the true mean parameters: $\big\|B^*_{\mathrm{trw}}(\mu^*) - \theta^*\big\|_{\infty,E} \leq \epsilon$ for some $\epsilon \geq 0$.

(C-Marginal) For all $s \in V$ and $j \in [m]$, the true singleton marginal $\mu^*_{s;j} = \mathbb{P}(X_s = j; \theta^*)$ satisfies $\mu^*_{s;j} > \mu_{\min}$ for some strictly positive constant $\mu_{\min} \in (0,1)$. Similarly, for all $s, t \in V$ and all $j, k \in [m]$, $\mu^*_{st;jk}$ satisfies $\mu^*_{st;jk} > \mu_{\min}$.

Now we are ready to utilize Theorem 1 to derive the convergence rates for our closed-form estimator (11) when $\theta^*$ has $k$ non-zero pairwise parameter blocks $\theta^*_{st}$, where we recall the notation that $\theta_{st} := \{\theta_{st;jk}\}_{j,k=1}^m$ collates the edgewise parameters for edge $(s,t)$. We also define $\|\theta\|_{q,E} := \big(\sum_{s \neq t} \|\theta_{st}\|_q^q\big)^{1/q}$ for $q \in \{1, 2, \infty\}$, with the usual maximum convention for $q = \infty$.

Corollary 2. Consider discrete Markov random fields (6) where the true parameter $\theta^*$ has $k$ non-zero pairwise parameter blocks, and where conditions (C-LogPartition) and (C-Marginal) hold. Suppose that we solve the optimization problem in (11) with $B^*_{\mathrm{trw}}(\cdot)$ set as in (10), using the tree-reweighted approximation, and suppose also that we select $\lambda_n := \epsilon + c_1\sqrt{\frac{\log p}{n}}$ for some positive constant $c_1$ depending only on $\mu_{\min}$. Then there are universal positive constants $(c_2, c_3)$ such that, as long as $n > 4c_2 \log p$, any optimal solution $\widehat{\theta}$ of (11) satisfies
$$\|\widehat{\theta} - \theta^*\|_{\infty,E} \leq 2\epsilon + 2c_1\sqrt{\tfrac{\log p}{n}}, \qquad \|\widehat{\theta} - \theta^*\|_{2,E} \leq 4\sqrt{k}\,\epsilon + 4c_1\sqrt{\tfrac{k\log p}{n}}, \qquad \|\widehat{\theta} - \theta^*\|_{1,E} \leq 8k\epsilon + 8c_1 k\sqrt{\tfrac{\log p}{n}}$$
with probability at least $1 - c_2 \exp(-c_3 \log p)$.

6 Experiments

In this section, we report a set of synthetic experiments corroborating our theoretical results on both Gaussian and discrete graphical models.

Gaussian Graphical Models. We now corroborate Corollary 1 and, furthermore, compare our estimator with the $\ell_1$-regularized MLE in terms of statistical performance, with respect to the parameter error $\|\widehat{\Theta} - \Theta^*\|_q$ for $q \in \{2, \infty\}$, as well as in terms of computational performance. To generate true inverse covariance matrices $\Theta^*$ with a random sparsity structure, we follow the procedure described in [25, 24]. We first generate a sparse matrix $U$ whose non-zero entries are set to $\pm 1$ with equal probability. $\Theta^*$ is then set to $U^\top U$, and a diagonal term is added to ensure that $\Theta^*$ is positive definite. Finally, we normalize $\Theta^*$ by $\max_i \Theta^*_{ii}$ so that its maximum diagonal entry is equal to 1. We control the number of non-zeros in $U$ so that the number of non-zeros in the final $\Theta^*$ is approximately $10p$. We additionally set the number of samples $n$ to half the number of variables $p$. Note that though the number of variables is $p$, the total number of entries in the canonical parameter, the inverse covariance matrix, is $O(p^2)$.
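For concreteness, here is a small sketch (ours) of the synthetic-model generator just described; the sparsity fraction, the 0.1 diagonal boost, the problem sizes, and the seeds are illustrative choices, not values from the paper.

```python
# A minimal sketch (ours) of the synthetic GMRF model of Section 6, following the
# procedure of [25, 24]: sparse +/-1 matrix U, Theta* = U^T U plus a diagonal term
# for positive definiteness, rescaled so the maximum diagonal entry equals 1.
import numpy as np

def random_sparse_precision(p, nnz_frac, seed=0):
    rng = np.random.default_rng(seed)
    U = np.zeros((p, p))
    mask = rng.random((p, p)) < nnz_frac                # sparsity pattern of U (tune so Theta* has ~10p non-zeros)
    U[mask] = rng.choice([-1.0, 1.0], size=mask.sum())  # non-zero entries +/-1 with equal probability
    Theta = U.T @ U
    min_eig = np.linalg.eigvalsh(Theta).min()
    Theta += (abs(min_eig) + 0.1) * np.eye(p)           # diagonal term ensures positive definiteness
    return Theta / Theta.diagonal().max()               # normalize: max_i Theta*_ii = 1

p, n = 200, 100                                         # n = p / 2, as in the experiments (sizes illustrative)
Theta_star = random_sparse_precision(p, nnz_frac=0.02)
X = np.random.default_rng(1).multivariate_normal(
    np.zeros(p), np.linalg.inv(Theta_star), size=n)     # n i.i.d. samples from N(0, (Theta*)^{-1})
```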
Table 1 summarizes the performance of our closed-form estimators in terms of computation time, $\|\widehat{\Theta} - \Theta^*\|_{\infty,\mathrm{off}}$ and $|||\widehat{\Theta} - \Theta^*|||_{F,\mathrm{off}}$. We fix the thresholding parameter $\nu = 2.5\sqrt{\log p / n}$ for all settings, and vary the regularization parameter $\lambda_n = K\sqrt{\log p / n}$ to investigate how this regularizer affects the final estimators. The baseline is the $\ell_1$-regularized MLE estimator in (4); we use the QUIC algorithm [24], which is one of the fastest solvers for (4). In the table, we show the results of the QUIC algorithm run with a tolerance of $10^{-4}$; * indicates that the algorithm did not terminate within 15 hours. In the Appendix, we provide more extensive comparisons, including receiver operating characteristic (ROC) curves for these methods under the settings of Table 1. As can be seen from the table (and the ROC curves in the Appendix), the Elem-GM estimator is statistically competitive in terms of all types of errors and support-set recovery, while being computationally far cheaper than the classical methods based on the $\ell_1$-regularized MLE.

Table 1: Performance of our Elem-GM vs. the state-of-the-art QUIC algorithm [24] solving (4), under two different regimes: (top) (n, p) = (800, 1600), (bottom) (n, p) = (5000, 10000). K is the constant in $\lambda_n = K\sqrt{\log p / n}$.

(n, p) = (800, 1600):

| Method  | K    | Time (sec) | $|||\cdot|||_{F,\mathrm{off}}$ | $\|\cdot\|_{\infty,\mathrm{off}}$ | FPR  | TPR  |
|---------|------|------------|-------------------------------|-----------------------------------|------|------|
| Elem-GM | 0.01 | < 1        | 6.36                          | 0.1616                            | 0.48 | 0.99 |
| Elem-GM | 0.02 | < 1        | 6.19                          | 0.1880                            | 0.24 | 0.99 |
| Elem-GM | 0.05 | < 1        | 5.91                          | 0.1655                            | 0.06 | 0.99 |
| Elem-GM | 0.1  | < 1        | 6                             | 0.1703                            | 0.01 | 0.97 |
| QUIC    | 0.5  | 2575.5     | 12.74                         | 0.11                              | 0.52 | 1.00 |
| QUIC    | 1    | 1009       | 7.30                          | 0.13                              | 0.35 | 0.99 |
| QUIC    | 2    | 272.1      | 6.33                          | 0.18                              | 0.16 | 0.99 |
| QUIC    | 3    | 78.1       | 6.97                          | 0.21                              | 0.07 | 0.94 |
| QUIC    | 4    | 28.7       | 7.68                          | 0.23                              | 0.02 | 0.86 |

(n, p) = (5000, 10000):

| Method  | K    | Time (sec)        | $|||\cdot|||_{F,\mathrm{off}}$ | $\|\cdot\|_{\infty,\mathrm{off}}$ | FPR  | TPR  |
|---------|------|-------------------|-------------------------------|-----------------------------------|------|------|
| Elem-GM | 0.05 | 47.3              | 11.73                         | 0.1501                            | 0.13 | 1.00 |
| Elem-GM | 0.1  | 46.3              | 8.91                          | 0.1479                            | 0.03 | 1.00 |
| Elem-GM | 0.5  | 45.8              | 5.66                          | 0.1308                            | 0.0  | 1.00 |
| Elem-GM | 1    | 46.2              | 8.63                          | 0.1111                            | 0.0  | 0.99 |
| QUIC    | 2    | *                 | *                             | *                                 | *    | *    |
| QUIC    | 2.5  | *                 | *                             | *                                 | *    | *    |
| QUIC    | 3    | $4.8 \times 10^4$ | 9.85                          | 0.1083                            | 0.06 | 1.00 |
| QUIC    | 3.5  | $2.7 \times 10^4$ | 10.51                         | 0.1111                            | 0.04 | 0.99 |

Discrete Graphical Models. We consider two different classes of pairwise graphical models: chain graphs and grids. For each case, the size of the alphabet is set to $m = 3$; the true parameter vector $\theta^*$ is generated by sampling each non-zero entry from $N(0, 1)$. We compare Elem-DMRF with the group-sparse regularized MLE-based approach of Jalali et al. [12], which uses group $\ell_1/\ell_2$ regularization, where all the parameters of an edge form a group so as to encourage sparsity in terms of the edges, and which we solved using proximal gradient descent. While our estimator in (11) uses vanilla sparsity, here we used a simple extension to the group-sparse structured setting; please see Appendix E for more details. For both methods, the tuning parameter is set to $\lambda_n = c\sqrt{\log p / n}$, where $c$ is selected using cross-validation. We use 20 simulation runs, where for each run $n = p/2$ samples are drawn from the distribution specified by $\theta^*$. We report true positive rates, false positive rates, and timing for each method. We note that the timing is for running each method without counting the time spent in cross-validation. (Had we taken the cross-validation into account, the advantage of our method would be even more pronounced, since the entire path of solutions can be computed via simple group-wise thresholding operations.) The results in Table 2 show that Elem-DMRF is much faster than its MLE-based counterpart, and yields competitive results in terms of structure recovery.

Table 2: Performance of Elem-DMRF vs. the regularized MLE-based approach of [12] for structure recovery of DMRFs.

| Graph Type  | # Parameters | Method          | Time (sec) | TPR  | FNR  |
|-------------|--------------|-----------------|------------|------|------|
| Chain Graph | 128          | Elem-DMRF       | 0.17       | 0.87 | 0.01 |
|             |              | Regularized MLE | 7.30       | 0.81 | 0.01 |
|             | 2000         | Elem-DMRF       | 21.67      | 0.79 | 0.12 |
|             |              | Regularized MLE | 4315.10    | 0.75 | 0.21 |
| Grid Graph  | 128          | Elem-DMRF       | 0.17       | 0.97 | 0.01 |
|             |              | Regularized MLE | 7.99       | 0.84 | 0.02 |
|             | 2000         | Elem-DMRF       | 21.68      | 0.80 | 0.12 |
|             |              | Regularized MLE | 4454.44    | 0.77 | 0.18 |

Acknowledgments

E.Y. and P.R. acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894, IIS-1447574, and DMS-1264033.

References

[1] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[2] J. W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23:846–850, October 1978.
[3] M. Hassner and J. Sklansky. Markov random field models of digitized image texture. In ICPR78, pages 538–540, 1978.
[4] G. Cross and A. Jain. Markov random field texture models. IEEE Trans. PAMI, 5:25–39, 1983.
[5] E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31:253–258, 1925.
[6] B. D. Ripley. Spatial Statistics. Wiley, New York, 1981.
[7] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[8] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
[9] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 2007.
[10] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, March 2008.
[11] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Annals of Statistics, 38(3):1287–1319, 2010.
[12] A. Jalali, P. Ravikumar, V. Vasuki, and S. Sanghavi. On learning discrete graphical models using group-sparse regularization. In International Conference on AI and Statistics (AISTATS), 14, 2011.
[13] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
[14] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[15] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[16] T. Cai, W. Liu, and X. Luo. A constrained $\ell_1$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
[17] E. Yang, A. Lozano, and P. Ravikumar. Elementary estimators for high-dimensional linear regression. In International Conference on Machine Learning (ICML), 31, 2014.
[18] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, December 2008.
[19] E. Yang and P. Ravikumar. On the use of variational inference for learning discrete graphical models. In International Conference on Machine Learning (ICML), 28, 2011.
[20] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudomoment matching. In International Conference on AI and Statistics (AISTATS), 2003.
[21] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS 13, pages 689–695. MIT Press, 2001.
[22] A. J. Rothman, E. Levina, and J. Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association (Theory and Methods), 104:177–186, 2009.
[23] P. J. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36(6):2577–2604, 2008.
[24] C. J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In Neural Information Processing Systems (NIPS), 24, 2011.
[25] L. Li and K. C. Toh. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation, 2:291–315, 2010.