# highdimensional_inference_for_clusterbased_graphical_models__9d7ceec4.pdf Journal of Machine Learning Research 21 (2020) 1-55 Submitted 6/18; Revised 12/19; Published 3/20 High-Dimensional Inference for Cluster-Based Graphical Models Carson Eisenach eisenach@princeton.edu Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA Florentina Bunea fb238@cornell.edu Yang Ning yn265@cornell.edu Claudiu Dinicu cd535@cornell.edu Department of Statistics and Data Science Cornell University Ithaca, NY 14850, USA Editor: Nicolas Vayatis Motivated by modern applications in which one constructs graphical models based on a very large number of features, this paper introduces a new class of cluster-based graphical models, in which variable clustering is applied as an initial step for reducing the dimension of the feature space. We employ model assisted clustering, in which the clusters contain features that are similar to the same unobserved latent variable. Two different cluster-based Gaussian graphical models are considered: the latent variable graph, corresponding to the graphical model associated with the unobserved latent variables, and the cluster-average graph, corresponding to the vector of features averaged over clusters. Our study reveals that likelihood based inference for the latent graph, not analyzed previously, is analytically intractable. Our main contribution is the development and analysis of alternative estimation and inference strategies, for the precision matrix of an unobservable latent vector Z. We replace the likelihood of the data by an appropriate class of empirical risk functions, that can be specialized to the latent graphical model and to the simpler, but under-analyzed, cluster-average graphical model. The estimators thus derived can be used for inference on the graph structure, for instance on edge strength or pattern recovery. Inference is based on the asymptotic limits of the entry-wise estimates of the precision matrices associated with the conditional independence graphs under consideration. While taking the uncertainty induced by the clustering step into account, we establish Berry-Esseen central limit theorems for the proposed estimators. It is noteworthy that, although the clusters are estimated adaptively from the data, the central limit theorems regarding the entries of the estimated graphs are proved under the same conditions one would use if the clusters were known in advance. As an illustration of the usage of these newly developed inferential tools, we show that they can be reliably used for recovery of the sparsity pattern of the graphs we study, under FDR control, which is verified via simulation studies and an f MRI data analysis. These experimental results confirm the theoretically established difference between the two graph structures. Furthermore, the data analysis suggests that the latent variable graph, corresponding to the unobserved cluster centers, can help provide more insight into the c 2020 Carson Eisenach, Florentina Bunea, Yang Ning and Claudiu Dinicu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-357.html. Eisenach, Bunea, Ning and Dinicu understanding of the brain connectivity networks relative to the simpler, average-based, graph. Keywords: Berry-Esseen bound, Graphical model, Latent variables, High-dimensional inference, Clustering, False discovery rate 1. Introduction Over the last several decades, graphical models have become an increasingly popular method for understanding independence and conditional independence relationships between components of random vectors. More recently, the challenges posed by the estimation and statistical analysis of graphical models with many more nodes than the number of observations has led to renewed interest in these models, such as Meinshausen and B uhlmann (2006); Yuan and Lin (2007); Friedman et al. (2008); Verzelen (2008); Lam and Fan (2009); Rothman et al. (2008); Peng et al. (2009); Ravikumar et al. (2011); Yuan (2010); Cai et al. (2011); Sun and Zhang (2012); Liu et al. (2012); Xue and Zou (2012); Ning and Liu (2013); Cai et al. (2016); Tan et al. (2016); Fan et al. (2017); Yang et al. (2018); Feng and Ning (2019), to give only an incomplete list. Nonetheless, when the dimension (number of nodes) grows very large and the sample size is small, the dependency among the components of a random vector may become weak, if it exists at all, and difficult to detect without additional information. If the dimension of the random vector is in the thousands, even if the dependency structure can be detected by an estimated graphical model, it can be difficult to interpret the results and extract meaningful scientific insights. One solution to both of the aforementioned issues is to employ an initial dimension reduction procedure on the high dimensional vector. For example, in neuroscience applications, a typical functional magnetic resonance image (f MRI) consists of blood-oxygen-leveldependent (BOLD) activities measured at 200,000+ voxels of the brain, over a period of time. Instead of analyzing voxel-level data directly, scientists routinely cluster voxels into several regions of interest (ROI) with homogeneous functions using domain knowledge, and then carry out the analysis at the ROI-level. In this example, using the language of graphical models, the group structure of variables may boost the dependency signals. Similar pre-processing steps are used in other application domains, such as genomics, finance and economics. Motivated by a rich set of applications, we consider variable clustering as the initial dimension reduction step applied to the observed vector X =: (X1, . . . , Xd) Rd. To the best of our knowledge, very little is known about the effect of clustering on downstream analysis and, consequently, on the induced graphical models. Our contribution is the provision of a framework that allows for such an analysis. We introduce cluster-based graphical models, show how they can be estimated and, furthermore, provide the asymptotic distribution of the edge strength estimates. These models are built on the assumption that the observed variables X = (X1, . . . , Xd) Rd can be partitioned into K unknown clusters G = {G 1, . . . , G K} such that variables in the same cluster share the same behavior. Following the intuition behind widely-used Kmeans type procedures, we define a population-level cluster as a group of variables that High-Dimensional Inference for Cluster-Based Graphical Models are noise corrupted versions of a hub-like variable. This hub-like variable is not directly observable, and is treated as a latent factor. Formally, we assume there exists a latent random vector Z RK, with mean zero and covariance matrix Cov(Z) = C , such that X = AZ + E, (1) for a zero mean error vector E with independent entries. The entries of the d K matrix A are given by Ajk = I{j G k}. A cluster of variables consist in those components of X with indices in the same G k. We denote Cov(E) = Γ , a diagonal matrix with entries Γ jj = γ j for any 1 j d. We also assume that the mean-zero noise E is independent of Z. Bunea et al. (2018) show that the clusters are uniquely defined by the model in (1), provided that the smallest cluster contains at least two variables and C is strictly positive definite; this result holds irrespective of distributional assumptions. To keep the presentation focused, in this work we assume that Z N(0, C ) and E N(0, Γ ), which implies X N(0, Σ ) with Σ = AC AT + Γ . In this context we consider two related, but different, graphical models: (i) The latent variable graph, associated with the sparsity pattern of the precision matrix Θ := C 1 (2) of the Gaussian vector Z RK. The latent variable graph encodes conditional independencies (CI s) among the unobserved, latent variables Z. (ii) The cluster-average graph, associated with the sparsity pattern of the precision matrix Ω := S 1, (3) where S is the covariance matrix of X RK, and X =: ( X1, . . . , XK) is the within cluster average given by Xk =: 1 |G k| P i G k Xi. The cluster-average graph encodes CI s among averages of observable random variables. In particular, we have S = C + Γ , where Γ = diag( γ 1, ..., γ K) with γ k = 1 |G k|2 P j G k γ j . Although both these graphs correspond to vectors of dimension K and are constructed based on the partition G , they are in general different as the sparsity patterns of Θ and Ω will typically differ, and have different interpretations. Therefore it would be misleading to use one as a proxy for the other when drawing conclusions. For instance, in the neuroscience example, if we interpret each latent variable as the function of a ROI, then the latent variable graph encodes the CI relationships between functions, which is one question of scientific interest. The difference between Θ and Ω shows that this question will not be typically answered by studying the cluster-average graph, although that may be tempting to do by practitioners. Eisenach, Bunea, Ning and Dinicu 1.1. Our Contributions Since the two cluster-based graphical models introduced above can both be of interest in a large array of applications, we provide inferential tools for both of them in this work. We assume that we observe n i.i.d. copies X1, . . . , Xn of X. The focus of our work is on post-clustering and post-regularization inference for these two sparse precision matrices. To this end, we derive the asymptotic distribution of novel estimators of their entries. These estimators can be used to answer any inferential questions of interest at the edge strength level, or can be combined to provide sparsity pattern recovery under FDR control. In Section 4.3 we provide an instance of the latter. Inference for the entries of a Gaussian precision matrix has received a large amount of attention in the past few years, most notably post-regularization inference, for instance Ren et al. (2013); Zhang and Zhang (2014); Jankova and van de Geer (2014); Gu et al. (2015); Barber and Kolar (2015); Jankov a and van de Geer (2017); Javanmard and Montanari (2013); van de Geer et al. (2013); Ning and Liu (2017); Cai and Guo (2017); Neykov et al. (2018); Ning et al. (2017); Fang et al. (2017). These works generalize the classical ideas of one-step estimation (Bickel, 1975) to the high-dimensional setup by first constructing a sparse estimator of the precision matrix via regularization, and then building de-sparsified updates that are asymptotically normal. The effect of the initial regularization step is controlled in the second step, and inference after regularization becomes valid. In this paper we consider a similar estimation strategy, but differs from the existing literature in several important ways. In our work, we add another layer of data-dependent dimension reduction, via clustering, and provide a framework within which the variability induced by clustering can be controlled. Even after controlling for the clustering variability, we note that the existing procedures for estimation and, especially, post-regularization inference in Gaussian graphical models are not immediately applicable to our problem for the following reasons: 1. They are developed for variables that can be observed directly. From this perspective, they could, in principle, be applied to the cluster-average graph, but are not directly extendable to the latent graph; 2. To the best of our knowledge, all existing methods for precision matrix inference require the largest eigenvalue of the corresponding covariance matrix to be upper bounded by a constant. Such an assumption implies, in turn, that the Euclidean norm of each row of the covariance matrix is bounded, which reduces significantly the parameter space for which inference is valid. The assumption holds, for instance, when the number of variables is bounded, or when the entries of each row are appropriately small. To overcome these limitations, we take a different approach in this work, that allows us to lift unpleasant technical conditions associated with other procedures, while maintaining the validity of inference for both the latent and the average graph. We summarize our main contributions below. 1. Methods for estimation tailored to high dimensional inference in latent, cluster-based, graphical models. We develop a new estimation strategy tailored to High-Dimensional Inference for Cluster-Based Graphical Models our final goal, that of constructing approximately Gaussian estimators for the entries of the precision matrices Θ and Ω given in (2) and (3) above. Although we work under the assumption that the data is Gaussian, likelihood based estimators may be unsatisfactory, because their analysis can require stringent assumptions, as explained above (see also Jankova and van de Geer (2014)), or may become analytically intractable, as argued in Section 3.3, for the latent graph. We propose a method that mimics very closely the principles underlying the construction of an efficient score function for estimation in the presence of high dimensional nuisance parameters (see for instance Bickel et al. (1993)), but we do not base it on the corresponding likelihood-derived quantities. The underlying principles are explained in Section 3.1. To the best of our knowledge, this is the first estimation method of the latent precision matrix, that can be analyzed theoretically for inferential purposes. As an added benefit of our estimation framework, the same principles can be applied for the estimation, and analysis, of the cluster-average graph. 2. The analysis of estimators of the precision matrix of unobserved cluster centers: Berry-Esseen-type bounds for Gaussian approximations. We verify, theoretically, that the estimators constructed with an inferential goal in mind do indeed have the desired properties. To this end, we derive the asymptotic distribution of the proposed entry-wise estimates of the latent graph, and also of the cluster-average graph. Moreover, we quantify the speed with which this limit is obtained, which we show to be proportional to 1/ n in both cases. We do so by establishing Berry-Esseen type bounds on the difference between the cumulative distribution function of our estimators and that of a standard Gaussian random variable that are valid for each K, d and n, and are presented in Theorems 3 and 4, respectively. As immediate applications, we can construct approximate confidence intervals for one or a finite number of entries of the latent or average graph, or known linear functionals of such entries. While the answers we thus provide at the average graph level are similar to existing results, established for the full CI graph based on all d nodes, for instance by (Ren et al., 2013; Zhang and Zhang, 2014; Jankova and van de Geer, 2014; Gu et al., 2015; Barber and Kolar, 2015; Jankov a and van de Geer, 2017; Javanmard and Montanari, 2013; van de Geer et al., 2013; Ning and Liu, 2017; Cai and Guo, 2017; Neykov et al., 2018), the results for the latent graph are, to the best of our knowledge, the first such results in the literature. We note, furthermore, that the average cluster graph can be viewed as a graph with observable nodes, the cluster averages, only after the clusters are estimated from the data. Our theoretical analysis takes this step into account. We discuss, in Section 2.2, clustering methods tailored to model (1), where the number of clusters K is unknown and is allowed to grow with n. Using the results of Bunea et al. (2018), these methods yield a partition b G = G , with high probability, provided that λmin(C ) > c, for a small positive quantity c made precise in Section 2.2. A lower bound on the smallest eigenvalue of the covariance matrix is the minimal condition under which inference in any graphical model can be performed. Therefore, consistent clustering via the model (1) does not require a further reduction of the parameter space for which the more standard post-regularized inference can be developed. Moreover, as Section 4 shows, asymptotic inference based on the estimated clusters reduces to asymptotic inference relative to the true clusters, G , without any need for data splitting. This fact holds true for both the average and the latent variable graph, and is in sharp contrast with a phenomenon often encountered in post-model selection inference, such as Eisenach, Bunea, Ning and Dinicu in variable selection in linear regression (Lockhart et al., 2014; Lee et al., 2013; Tibshirani et al., 2014). In that case, reducing inference to the consistently selected set of variables can only be justified over a reduced part of the parameter space (Bunea, 2004), and is therefore not a popular practice. Another technical contribution is that the asymptotic normality of the estimators is established under relaxed conditions. Unlike the existing literature on de-biased inference on graphical models (Jankova and van de Geer, 2014; Jankov a and van de Geer, 2017), we do not require the bounded operator norm condition for the covariance matrix such as λmax(S ) C for cluster-average graph. As shown by Jankova and van de Geer (2014) and explained above, the analysis of the multivariate Gaussian likelihood may require stringent assumptions for cluster-average graph and becomes intractable for the latent graph. By using the proposed pseudo-likelihood function which has a much simpler form, we can remove the unpleasant assumption on the bounded operator norm. In addition, we reanalyze the CLIME estimator (Cai et al., 2011) bΩ k (the kth column of bΩ) under our Assumption 4.1 and 4.2, which is used as the initial estimator for inference. As explained in Section 3.2, we can show that the CLIME estimator satisfies bΩ k Ω k 1 s1 q n , where s1 is the sparsity of Ω k. This result does not require the bounded operator norm or matrix L1 norm condition and can be of independent interest. To illustrate how one can use these newly developed inferential tools, we focus on the estimation of the sparsity pattern of the graphs, which can be equivalently viewed as a multiple-testing problem. It is well known that the exact sparsity pattern can be recovered, with high probability, only if the entries of each precision matrix are above the minimax optimal noise level O( p log d/n) (Ravikumar et al., 2011; Meinshausen and B uhlmann, 2006). Since our aim is inference on the sparsity pattern without further restrictions on the parameter space, the next best type of error that we can control is the False Discovery Rate (FDR) (Benjamini and Hochberg, 1995). In Section 4.3 we use these results for pattern recovery under FDR control, and explain the effect of the asymptotic approximations on this quantity. This paper is organized as follows. Section 2 below contains a brief summary of existing results on model-assisted clustering, via model (1). Section 3 describes the estimation procedures for the latent variable graph and the cluster-average graph, respectively. In Section 4 we establish Berry-Esseen type central limit theorems for the estimators derived in Section 3, and provide bounds on the FDR associated with each graphical model under study, respectively. Section 5 gives numerical results using both simulated and real data sets. 2. Background 2.1. Notation The following notation is adopted throughout this paper. Let d denote the ambient dimension, n the sample size, K the number of clusters and m the minimum cluster size. The matrix C denotes the population covariance of the latent vector Z. Likewise, the matrices Γ , Σ , Θ , S and Ω denote population-level quantities. High-Dimensional Inference for Cluster-Based Graphical Models For v = (v1, ..., vd)T Rd, and 1 q , we define v q = (Pd i=1 |vi|q)1/q, v 0 = |supp(v)|, where supp(v) = {j : vj = 0} and |A| is the cardinality of a set A. Denote v = max1 i d |vi| and v 2 = vv T . Assume that v can be partitioned as v = (v1, v2). Let f(v) denote the gradient of the function f(v), and 1f(v) = f(v)/ v1. Similarly, let 2f(v) denote the Hessian of the function f(v) and 2 12f(v) = 2f(v)/ v1 v2. For a d d matrix M = [Mjk], let M max = maxjk |Mjk|, M 1 = P jk |Mjk|, and M = maxk P j |Mjk|. If the matrix M is symmetric, then λmin(M) and λmax(M) are the minimal and maximal eigenvalues of M. Let [d] = {1, 2, ...., d}. For any j [d], we denote the jth row and jth column of M as Mj and M j, respectively. Similarly, let M j, k be the sub-matrix of M with the jth row and kth column removed. The notation Sd d refers to the set of all real, symmetric d d matrices. Likewise, Sd d + Sd d is the positive semi-definite cone. We use and to denote the Kronecker and Hadamard product of two matrices, respectively; we also may write M 2 = M M. Let ej denote the vector of all zeros except for a one in the jth position. The vector 1 is the vector of all ones. a b = max(a, b). 2.2. Model Assisted Variable Clustering In this section we review existing results on variable clustering. Bunea et al. (2018) showed that if we use model (1) to define clusters of variables, these clusters are uniquely defined, up to label switching, so long as m =: min1 k K |G k| 2 and the components of the latent vector Z are different almost surely, or equivalently (C ) =: min j 0. Since (C ) = min j 0, which is the minimal condition under which one can study properties of the corresponding precision matrix. In addition, Bunea et al. (2018) developed two algorithms, PECOK and COD, that are shown to recover the clusters exactly, from n i.i.d. copies X1, . . . , Xn of X, as soon as λmin (C ) c, for a positive quantity c that approaches 0 as n grows. For the COD procedure, c = O Σ max p log(d n)/n . On the other hand, for the PECOK procedure c = O Γ max p K log(d n)/mn , which can be much smaller when one has a few, balanced, clusters. These values of c are shown to be minimax or near-minimax optimal for cluster recovery. We refer to Theorems 3 and 4 in Bunea et al. (2018) for the precise expressions and details. Under these minimal conditions on λmin (C ), the exact recovery of the clusters holds with Eisenach, Bunea, Ning and Dinicu probability larger than 1 1/(d n). Because these conditions are sufficiently weak, we are able to show, in Section 4, that inference in cluster-based graphical models is not hampered by the clustering step. For completeness, we outline the PECOK algorithm below, which consists in a convex relaxation of the K-means algorithm, further tailored to estimation of clusters G = {G 1, . . . , G K} defined via the interpretable model (1). The PECOK algorithm consists in the following three steps: 1. Compute an estimator eΓ of the matrix Γ . 2. Solve the semi-definite program (SDP) b B = argmax B D bΣ eΓ, B , (4) where bΣ is the sample covariance matrix and B 0 (symmetric and positive semidefinite) P a Bab = 1, b Bab 0, a, b tr(B) = K 3. Compute b G by applying a clustering algorithm on the rows (or equivalently columns) of b B. The construction of an accurate estimator eΓ of Γ , before the cluster structure is known, is a crucial step for guaranteeing the statistical optimality of the PECOK estimator. Its construction is given in Bunea et al. (2018), and included in Appendix F, for the convenience of the reader. We will employ an efficient algorithm for solving (4). Standard black-box SDP solvers, for a fixed precision, exhibit O(d7) running time on (4), which is prohibitively expensive. Eisenach and Liu (2019) recently introduced the FORCE algorithm, which requires worst case O(d6K 2) time to solve the SDP, and in practice often performs the clustering rapidly. The key idea behind the FORCE algorithm is that an optimal solution to (4) can be attained by first transforming (4) into an eigenvalue problem, and then using a first-order method. Iterations of the first-order method are interleaved with a dual step to round the current iterate to an integer solution of the clustering problem, and then searches for an optimality certificate. By using knowledge of both the primal and the dual SDPs, FORCE is able to find the solution much faster than a standard SDP solver. We refer to Eisenach and Liu (2019) for the detailed algorithm. 3. Estimation of Cluster-based Graphical Models In this section, we propose a unified estimation approach, that utilizes similar loss functions for estimation and inference in the cluster-average and the latent variable graphs. We first describe our general principle, and then demonstrate its application to the two graphical models. High-Dimensional Inference for Cluster-Based Graphical Models 3.1. One-step Estimators for High-Dimensional Inference Assume that we observe n i.i.d. realizations X1, ..., Xn of X Rd. Let Q(β, X) denote a known mapping of β and X to R, where β is a q-dimensional unknown parameter of the distribution of X. Often this Q is referred to as the loss function. We define the target parameter β as β = argmin E(Q(β, X)). Next, let us partition β as β = (θ, γ), where θ R is the univariate parameter of interest, and γ Rq 1 is a nuisance parameter. Our goal is to construct a n1/2-consistent and asymptotically normal estimator for θ in high-dimensional models with q = dim(β) n. In this case, the dimension of the nuisance parameter γ is large, which makes the inference on θ challenging. We start from the empirical risk function over n observations defined as i=1 Q(β, Xi). (6) One standard choice for Qn is the negative log-likelihood function of the data. In this work, we conduct inference based on an alternative loss function, as the analysis of the log-likelihood can require unpleasant technical conditions that we would like to avoid, as discussed in Sections 4.1. That said, we mimic likelihood principles as much as possible, in order to make intuitive the construction below. For these reasons we will refer to Qn(β) as the negative pseudo-likelihood function. For now we leave Qn(β) unspecified a detailed discussion of its selection for inference in the latent variable graph and the cluster-average graph will be given in the following two subsections. In terms of Qn, we define the pseudo-information matrix for one observation as I(β) = E( 2Q(β , Xi)). We can partition this matrix as I(β) = I11 I12 I21 I22 with the partitions corresponding to those of β = (θ, γ). When Qn is the negative log-likelihood function, and the dimension of the parameter is independent of n, then h(θ; γ) given by (8) is called the efficient score function for θ, and classical theory shows that it admits solutions that are consistent, asymptotically normal and attain the information bound given by the reciprocal of (9) (Van der Vaart, 1998; Bickel et al., 1993). With these goals in mind, we similarly define the corresponding pseudo-score function for estimating θ in the presence of the nuisance parameter γ as h(θ; γ) = 1Qn(β) I12I 1 22 2Qn(β) 1Q(β, Xi) I12I 1 22 2Q(β, Xi) (8) and define the pseudo information of θ, in the presence of the nuisance parameter γ, as I1|2 = I11 I12I 1 22 I21. (9) Eisenach, Bunea, Ning and Dinicu When the dimension of γ is fixed, one can easily estimate I12 and I22 in (8) by their sample versions b I12 and b I22. However, such simple procedure fails when the dimension of γ is greater than the sample size, as b I22 is rank deficient. To overcome this difficulty, rather than estimating I12 and I 1 22 separately, we directly estimate w T = I12I 1 22 (10) by bw = argmin w 1, s.t. 2 12Qn( bβ) w T 2 22Qn( bβ) λ , (11) where λ is a non-negative tuning parameter, and bβ = (bθ, bγ) is an initial estimator, which is usually defined case by case, for a given model. Then, we can plug bw and bγ into the pseudo-score function, which gives bh(θ, bγ) = 1Qn(θ, bγ) bw T 2Qn(θ, bγ). (12) Following the Z-estimation principle (Van der Vaart, 1998; Bickel et al., 1993), one could define the final estimator of θ as the solution of the pseudo-score function bh(θ, bγ). However, in many examples, the pseudo-score function bh(θ, bγ) may have multiple solutions and it becomes unclear which root serves as a consistent estimator; see Small et al. (2000) for further discussion of the general estimating function context. To bypass this issue, we consider the following simple one-step estimation approach. Given the initial estimator bθ from the partition of bβ, we perform a Newton-Raphson update based on the pseudo-score function bh(θ, bγ), to obtain eθ, which is traditionally referred to as a one-step estimator by Bickel (1975). Specifically, we construct eθ = bθ b I 1 1|2bh(bθ, bγ), (13) where b I1|2 is an estimator of the partial information matrix I1|2. In Sections 3.2 and 3.3 below we show that, under appropriate conditions, the one-step estimator eθ constructed using the empirical risk functions Qn defined in (16) and (31), respectively satisfies n1/2(eθ θ ) = I 1 1|2n1/2h(β ) + op(1). (14) By applying the central limit theorem to h(β ), we establish the asymptotic normality of eθ in Theorems 3 and 4. When Qn(β) is the negative log-likelihood of the data, Ning and Liu (2017) successfully used this approach and the resulting estimator eθ is asymptotically equivalent to the debiased estimator in Zhang and Zhang (2014); van de Geer et al. (2013). As we explain in the following subsections, analysis based on the log-likelihood becomes intractable for the latent graphical model and requires stringent technical conditions for the cluster-average graphical model. To overcome this difficulty, we employ pseudo-score functions derived from (16) and (31). The resulting one-step estimator still attains the information bound established in the literature, and more importantly requires weaker technical assumptions than the existing methods. In addition to (14), we derive explicitly the speed at which the normal approximation is attained. High-Dimensional Inference for Cluster-Based Graphical Models 3.2. Estimation of the Cluster-Average Graph Recall that we assume Z N(0, C ) and E N(0, Γ ), which implies X N(0, Σ ) with Σ = AC AT +Γ . The within-cluster average X =: ( X1, . . . , XK) RK is given by Xk =: 1 |G k| P i G k Xi, corresponding to the population level clusters. Because X N(0, Σ ), we can verify that X N(0, S ), where S = C + Γ , (15) and Γ = diag( γ 1, ..., γ K) with γ k = 1 |G k|2 P j G k γ j . Recall that the precision matrix of X is Ω = S 1 = (C + Γ ) 1. In this section we give the construction of the estimators of the cluster-average graph corresponding to X. Specifically, we use the generic strategy outlined in the previous section in order to construct n1/2-consistent and asymptotically normal estimators for each component Ω t,k of the precision matrix Ω , for 1 t < k K. For the estimation of each entry, the remaining K(K + 1)/2 1 parameters in Ω are treated as nuisance parameters. Since we observe n i.i.d. samples of X Rp, if the clusters and their number were known, then we would implicitly observe n i.i.d. samples of X RK. A priori, the clusters are not known else this problem would simply reduce to the standard setting. However to explain our method, we first assume that clustering is given, and then show how to lift this assumption. Following our general principle, we would naturally tend to choose the negative loglikelihood function of the cluster-averages ( X1, ..., Xn) as the empirical risk function Qn(β) in (6). Along this line, Jankova and van de Geer (2014) proposed the de-biased estimator for Gaussian graphical models. However, the inference requires the irrepresentable condition (Ravikumar et al., 2011) on S , which can be restrictive. The alternative methods proposed by Ren et al. (2013); Jankov a and van de Geer (2017) imposed the condition that the largest eigenvalue of S is bounded. These technical conditions on S are difficult to justify and can be avoided by using our approach. We propose to estimate each sparse row of Ω as explained below. Let S = n 1 Pn i=1 Xi XT i denote the sample covariance matrix of Xi. When K is small, the maximum likelihood estimator of Ω is S 1, which can be viewed as the solution of the following equation SΩ IK = 0. Thus, in the low dimensional setting, this equation defines the maximum likelihood estimator. Since we are only interested in Ω t,k, we can extract the kth column from the left hand side of the above equation, and use it as the pseudo-score function Un(Ω k) = SΩ k ek. To apply the inference strategy in Section 3.1, we need to construct a valid empirical risk function Qn(Ω k) such that Qn(Ω k) = Un(Ω k). Simple algebra shows that a possible choice is Qn(Ω k) = 1 2ΩT k SΩ k e T k Ω k = 1 2ΩT k Xi XT i Ω k e T k Ω k), (16) which we view in the sequel as the empirical risk corresponding to the population level risk EQ(Ω k, X) = 1 2ΩT k S Ω k e T k Ω k, (17) Eisenach, Bunea, Ning and Dinicu based on the loss function Q(Ω k, X) =: 1 2ΩT k X XT Ω k e T k Ω k. (18) Since EQ(Ω k, X) = S Ω k ek = 0 (19) and 2EQ(Ω k, X) = S , (20) then the population risk EQ(Ω k, X) has the rows Ω k of the target precision matrix Ω as the unique minimizers, as desired, provided that S is positive definite, an assumption we make in Section 4.1. We note that the choice of the empirical risk Qn( ) and that of the corresponding pseudoscore Un( ) is not unique. We chose the particular form (16) because it is quadratic in Ω k, which greatly simplifies the theoretical analysis and leads to weaker technical assumptions. Moreover, the property (19) is the same as that of the score function corresponding to the negative log-likelihood function, supporting our terminology. We use the general strategy presented in Section 3.1 to construct estimators that employ the empirical risk Qn( ) defined by (16) above. We first recall that Qn( ) depends on the unknown cluster structure G via Xi. We note that in general the estimated group b Gk may differ from G k by a label permutation. For notational simplicity, we ignore this label permutation issue and treat b Gk as an estimate of G k (rather than G j for some j = k). To define our estimator of Ω t,k, we first replace Xi by c Xi and denote b S = n 1 Pn i=1 c Xi c XT i , where b Xik = 1 | b Gk| P j b Gk Xij. Let (t, k) be arbitrary, fixed. Replacing S by b S in Qn( ), we follow Section 3.1 to define the pseudo-score function h(Ω k) = v T t ( b SΩ k ek), (21) where v t is a K-dimensional vector with (v t )t = 1 and (v t ) t = w t with w t = (S t, t) 1S t,t, which is consistent with the definition in (10) above. To make inferences based on h(Ω k), we further need to estimate w t and Ω k. Following (11), an estimate of w t is given by bwt = argmin w 1, s.t b St, t w T b S t, t λ , (22) where λ is a tuning parameter. Then we can define bvt accordingly, and bh(Ω k) = bv T t ( b SΩ k ek). (23) Recall that the construction of the one-step estimator (13) requires an initial estimator of Ω k. To be concrete, we consider the following initial estimator of Ω k, bΩ k = argmin β 1, s.t b Sβ ek max λ, (24) where λ is a tuning parameter. This estimator has the same form as the CLIME estimator for the k-th column of Ω(Cai et al., 2011). However, unlike the CLIME estimator which High-Dimensional Inference for Cluster-Based Graphical Models requires λ Ω k 1 p log K/n, in our Theorem 3 we assume λ = C p log(K n)/n, where C only depends on the minimum eigenvalue of C and the largest diagonal entries of C and Γ which are assumed bounded by constants in Assumptions 1 and 2. With this choice of λ, we show in Lemma 15 in Appendix A that bΩ k Ω k 1 s1 with high probability, where s1 is the sparsity level of Ω k. The sharp concentration of the gradient and Hessian of the empirical risk function provided in Lemma 14 is the key for this result. As a comparison, Theorem 6 in Cai et al. (2011) only implies bΩ k Ω k 1 s1 Ω k 2 1 q n . For many sparse matrices, the ℓ1 norm of a column, Ω k 1, can grow to infinity with K or s1, and thus (61) gives a faster rate. In this case, and when λmin(C ) > c, Lemma 14 is instrumental in showing that the extra Ω k 2 1 factor in the rate of the original CLIME estimator can be avoided, whereas if only the marginal components of X are assumed to be sub-Gaussian, as in Cai et al. (2011), it may be unavoidable, without further conditions on Ω . We direct the reader to Appendix G for a more detailed discussion of the distinction between our results and those in Cai et al. (2011, 2016). Leveraging the block matrix inverse formula, we can show that the partial pseudo information matrix reduces to I1|2 = 1/Ω t,t. Finally, the one-step estimator is defined as eΩt,k = bΩt,k bh(bΩ k)bΩt,t, (26) in accordance with (13). In Section 4, we show that under mild regularity conditions n1/2(eΩt,k Ω t,k) N(0, s2 tk), where s2 tk = Ω 2 t,k + Ω t,tΩ k,k. If bs2 tk = bΩ2 t,k + bΩt,tbΩk,k is a consistent estimator of the asymptotic variance, then a (1 α) 100% confidence interval for Ωt,k is [eΩt,k z1 α/2bstk/n1/2, eΩt,k + z1 α/2bstk/n1/2], where zα is the α-quantile of a standard normal distribution. Equivalently, we can use the scaled test statistics eΩt,k to construct a test for H0 : Ω t,k = 0 versus H1 : Ω t,k = 0 with α significance level. Namely, the null hypothesis is rejected if and only if the above (1 α) 100% confidence interval does not contain 0. We will employ such tests in Section 4. 3.3. Latent Variable Graph Recall that the structure of the latent variable graph is encoded by the sparsity pattern of Θ = C 1, which is generally different from the cluster-average group as C 1 and Ω = (C + Γ ) 1 may have different sparsity patterns. In this section, we focus on the inference on the component Θ t,k, for some 1 t < k K. Similar to the cluster-average graph, we first discuss the likelihood approach. The negative log-likelihood corresponding to model (1) indexed by the parameter (Θ, Γ) is, up to some additive and multiplicative constants, ℓ(Θ, Γ) = log |(AΘ 1AT + Γ)| + tr(bΣ(AΘ 1AT + Γ) 1), Eisenach, Bunea, Ning and Dinicu where bΣ = n 1 Pn i=1 Xi XT i . It is straightforward to show that the Fisher information matrix for (Θ, Γ) is given by I(Θ, Γ) = (M AT Γ 1Σ Γ 1AM ) 2 (M AT Γ 1Σ 1F T ) 2Dd DT d (F Σ 1Γ 1AM ) 2 DT d (F Σ 1F T ) 2Dd where Dd = (Id 1T d ) (1T d Id), M = (Θ + AT Γ 1A) 1 and F = Id AM AT Γ 1. As seen in Section 3.1, the inference based on the likelihood or equivalently efficient score function (8) requires the estimation of I12I 1 22 which, given the complicated structure of the information matrix (27), becomes analytically intractable. A solution to this problem is inference based on an empirical risk function similar to (16), but tailored to the latent variable graph. With a slight abuse of notation, and reasoning as in (19) and (20), we notice that, for each k, EQ(Θ k, X) = 1 2ΘT k C Θ k e T k Θ k, (28) has the target Θ k as a unique minimizer, where the loss function Q(Θ k, X) is defined as Q(Θ k, X) = 1 2ΘT k CΘ k e T k Θ k, (29) and the matrix C := ( Cjk)j,k has entries Cjk = 1 |G j||G k| a G j ,b G k (Xa Xb Γab), (30) and Γab = 0 if a = b and Γaa = Xa Xa 1 |G k| 1 P a G k,a =j Xa Xj. Since E( C) = C , the risk relative to the loss function in (29) is indeed (28), and the empirical risk is Qn(Θ k) = 1 2ΘT k C(i)Θ k e T k Θ k), (31) where C(i) is obtained by replacing X in Cjk by Xi. Similar to the cluster-average graph, Qn(Θ k) also depends on the unknown cluster structure. We estimate G k by b Gk, and define bΓ = (bΓab), where bΓab = 0 if a = b and bΓaa = 1 n Pn i=1(Xia Xia 1 | b Gk| 1 P a b Gk,a =j Xia Xij), and b C = ( b Cjk) where b Cjk = 1 n Pn i=1( 1 | b Gj|| b Gk| P a b Gj,b b Gk(Xia Xib bΓab)). We replace C by b C in (31) above and follow exactly the strategy of Section 3.2, with b S replaced by b C, to construct the corresponding pseudo-score function bh(Θ k), similarly to (23), and the initial estimator bΘ k, similarly to (24). We combine these quantities, following the general strategy (13), as above, to obtain the final one-step estimator of Θ t,k, defined as eΘt,k = bΘt,k bh( bΘ k)bΘt,t, (32) after observing that, in this case, I1|2 = 1/Θ t,t. Although the form of this estimator is similar to (26), derived for the cluster-average graph, the study of the asymptotic normality of eΘt,k reveals that its asymptotic variance is much more involved, as will be discussed in detail in Section 4.2.2. High-Dimensional Inference for Cluster-Based Graphical Models 4. Main Theoretical Results 4.1. Assumptions In this section we state the two assumptions under which all our results are proved. Assumption 1 The covariance matrix C of Z satisfies: c1 λmin (C ) and maxt C t,t c2, for some absolute constants c1, c2 > 0. Assumption 2 The matrix Γ satisfies: max1 i d γ i c3 for some absolute constant c3 > 0, where γ i are the entries of the diagonal matrix Γ . Assumptions 1 and 2 are minimal conditions for inference on precision matrices. Furthermore, they imply the conditions needed for clustering consistency derived in Bunea et al. (2018) and discussed in Section 2.2, for n sufficiently large. Their work only requires that λmin (C ) is bounded from below by a sequence that converges to zero, as soon as Σ max and Γ max are bounded. This is strengthened by our assumptions. In general, a constant lower bound on C is standard in any inference on graphical models and is needed to show the asymptotic normality of the estimator introduced above (Ren et al., 2013; Jankova and van de Geer, 2014; Jankov a and van de Geer, 2017). 4.2. Asymptotic Normality via Berry-Esseen-type Bounds 4.2.1. Results for the Cluster-Average Graph In the section, we show that the estimators eΩt,k given by (26) are asymptotically normal, for all t < k . We define the sparsity of the cluster-average graph as s1 N such that k=1 I(Ω j,k = 0) s1. Recall that the estimators (22) and (24) depend on the tuning parameters λ and λ . In the following theorem, we choose λ λ q n . For notational simplicity, we use C to denote a generic constant, the value of which may change from line to line. Theorem 3 If Assumptions 1 and 2 hold, we have max 1 t 0, and a change in this power also changes the associated constant C in the term Cs1 log(K n) n1/2 . As shown in Theorem 5, to obtain valid FDR control, we need K2/(K n)2+δ = o(1), which holds for any δ > 0. For simplicity, we choose δ = 1 which gives the power 3. 4.2.2. Results for the Latent Variable Graph In this section we show that the estimators eΘt,k given by (32) are asymptotically normal, for all t < k. We define the sparsity of the latent graph as s0 N such that k=1 I(Θ j,k = 0) s0. Inference for the estimator eΘtk follows the general approach outlined in Section 3.1. We prove in Proposition 17 in Appendix B that n1/2(eΘt,k Θ t,k) = 1 n1/2 i=1 Θ t,tv T t ( C(i)Θ k ek) + op(1), (35) where v t is a K-dimensional vector with (v t )t = 1 and (v t ) t = w t with w t = (C t, t) 1C t,t. and C(i) is defined in (30). The terms of the sum in display (35) are mean zero random variables, and their variance is σ2 tk = E(Θ t,tv T t ( C(i)Θ k ek))2, High-Dimensional Inference for Cluster-Based Graphical Models which does not have an explicit closed form, unlike the asymptotic variance of the estimates of the entries of Ω . However, we show in Proposition 23 in Appendix B that σ2 tk admits an approximation that is easy to estimate: σ2 tk [(Θ t,k)2 + Θ k,kΘ t,t] s0 where m = min1 k K |G k|. Guided by this approximation, we estimate σ2 tk by bσ2 tk = bΘ2 t,k + bΘk,k bΘt,t. When all clusters have the equal size, we obtain K = d/m. Thus the O(s0 m) terms can be ignored asymptotically in the sense that s0 d = o(1), when the clusters are approximately balanced, and their number satisfies K2 = o(d). This is a reasonable assumption in most applied clustering problems. We note that the estimator bσ2 tk may be inconsistent when the size of some clusters is too small. However, we recall that our ultimate goal is to use these estimators for recovering the sparsity pattern of Θ under FDR control. To evaluate the sensitivity of our overall procedure to the size of the smallest cluster, we conduct simulation studies in Section 5. The results shows that the proposed method works well as soon as m > 4. The following theorem gives the Berry Esseen normal approximation bound for the estimators of the entries of the precision matrix corresponding to the latent variable graph. Theorem 4 If Assumptions 1 and 2 hold, then max 1 t 0, the total number of discoveries is 1 t τ]. Similarly, the number of false positives or false discoveries is given by (t,k) H0 I[|f Wt,k| > τ]. The FDR is formally defined as the expected ratio of Vτ over Rτ, FDR(τ) := E Vτ Rτ I[Rτ > 0] , where the indicator function is included to remove the trivial case Rτ = 0. Our goal is to find a data-dependent cutoffτ such that FDR(τ) α + o(1) for any given 0 < α < 1. This is the best one can hope for when, as in our case, the distribution of the test statistics f Wt,k is only available asymptotically. The Berry-Esseen type bounds, derived in Theorems 3 and 4, allow us to precisely quantify the price we must pay for the asymptotic approximation and are instrumental to understanding asymptotic FDR control. In addition, the test statistics f Wt,k for different hypotheses are dependent. To allow for the dependence, instead of the standard B-H procedure (Benjamini and Hochberg, 1995), we consider the more flexible B-Y procedure by Benjamini and Yekutieli (2001). The resulting FDR procedure is as follows: reject all hypotheses such that |f Wt,k| bτ, where bτ := min τ > 0 : τ Φ 1 1 αRτ 2NBY |H| where |H| = K(K 1)/2 is the total number of hypotheses. Our next result shows when the FDR based on our test statistics is guaranteed to be no greater than α, asymptotically. The proofs can be found in Appendix A. 1. Assume that the conditions in Theorem 3 hold. For any 0 < α < 1, we have FDR(bτ) α + 2|H0|bn, (40) High-Dimensional Inference for Cluster-Based Graphical Models where bn = C (d n)3 + C (K n)3 + Cs1 log(K n) n1/2 , and |H0| is the number of true null hypotheses in (37). 2. Assume that the conditions in Theorem 4 hold. If we define the test statistic as f Wt,k = n1/2 eΘt,k/bσtk, and bτ as in (39), we have FDR(bτ) α + 2|H 0|cn, (41) where cn = C (d n)3 + C (K n)3 + Cs0 log(K n) m , and |H 0| is the number of true null hypotheses in (38). This theorem implies that our method can control the FDR asymptotically, in the sense that FDR(bτ) α +op(1), provided that s1|H0| log(K n) = o(n1/2), for the average graph, and s0|H 0| log(K n) = o(n1/2) for the latent graph. Gaussian graphical model estimation under FDR control was recently studied by Liu (2013). They showed that the B-H procedure can control FDR asymptotically under certain conditions. Their approach is based on the following Cramer-type moderate deviation result using our terminology, max (t,k) H0 sup 0 t 2 log K P( b Tt,k t) 2 2Φ(t) 1 = o(1), (42) where b Tt,k is test statistic they proposed for estimation of the Gaussian graphical model structure. The result (42) controls the relative error of the Gaussian approximation within the moderate deviation regime [0, 2 log K], whereas our result is based on the control of the absolute error via the Berry-Esseen-type Gaussian approximation. One of the main advantages of their result is that the number of clusters is allowed to be K = o(nr), where r is a constant that can be greater than 1. However, to prove (42), they required that the number of strong signals tends to infinity, that is |{(t, k) : Ω t,k/ q Ω k,kΩ t,t C p log K/n}| , which reduces significantly the parameter space for which inference is valid. In contrast, the aim of this work is the study of pattern recovery without conditions on the signal strength of the entries of the target precision matrices, as in practice it is difficult to assess whether these conditions are met. For completeness, we provide the detailed analysis of the B-H procedure including technical conditions, theoretical results and further discussion of the B-H procedure in Appendix H. The overall message conveyed by Theorem 5 is that, in the absence of any signal strength assumptions, cluster-based graphical models can still be recovered, under FDR control, provided that the number of clusters K is not very high relative to n, and provided that the clusters are not very small. This further stresses the importance of an initial dimension reduction step in high-dimensional graphical model estimation. For instance, results similar to those of Theorem 3 can be derived along the same lines for the estimation of the sparsity pattern of Σ 1, for a generic, unstructured, covariance matrix of X, where one replaces K by d throughout, and s1 is replaced by s, the number of non-zero entries in the d d matrix Σ 1. Then, the analogue of (41) of Theorem 5 shows that FDR control in generic graphical models, based on asymptotic approximations of p-values, cannot generally be guaranteed if d > n. Our work shows that extra structural assumptions, for instance those motivated by clustering, do alleviate this problem. The simulation study presented in the next section provides further support to our findings. Eisenach, Bunea, Ning and Dinicu 5. Numerical Results This section contains simulations and a real data analysis that illustrate the finite sample performance of the inferential procedures developed in the previous sections for the latent variable graph and cluster-average graph, respectively. 5.1. Synthetic Datasets In this subsection, we demonstrate the effectiveness of the FDR control procedures on synthetic datasets. We consider two settings (n, d) = (800, 400) and (500, 1000), and in each setting we vary the value of K and m. The error variable E is sampled from the multivariate normal distribution with covariance Γ whose entries γ i are generated from U[0.25, 0.5]. Recall that the latent variable Z follows from Z N(0, C ). We consider three different models to generate the graph structure of Z. Once the graph structure is determined, the corresponding adjacency matrix W is found, and the precision matrix Θ = C 1 is taken as Θ = c W + (|λmin (W) | + 0.2)I, where c = 0.3 when d = 400 and c = 0.5 when d = 1000. Finally, we assign the cluster labels for all variables so that all clusters have approximately equal size, which gives us the matrix A. Given A, Z and E, we can generate X according to the model (1). We consider the following three generating models for the graph structure of Z: Scale-Free Graph The Scale-Free model is a generative model for network data, whose degree distribution follows a power law. To be concrete, we generate the graph one node at a time, starting with a 2 node chain. For nodes 3, . . . , K, node t is added and one edge is added between t and one of the t 1 previous nodes. Denoting by ki the current degree of node i in the graph, the probability that node t and node i are connected is pi = ki/(P i ki). The number of edges in the resulting graph is always K. Hub Graph The K nodes of the graph are partitioned evenly into groups of size N. Within each group, one node is selected to be the group hub, and an edge is added between it and the remainder of its group. N is either 5 or 6 depending upon the choice of K. The number of edges in the graph is K(N 1)/N, so for K = 100 with N = 5, the number of edges in the resulting graph is 80. Band3 Graph This model generates a graph with a Toeplitz adjacency matrix. There is an edge between node i and node j if and only if |i j| B, where we set B = 3 in this scenario. In general, the number of edges in a band graph with K nodes is given by BK 3 2B. So, for K = 100 and B = 3, the number of edges in the graph is 294. Recall that X N(0, S ), where S is defined in (15). To determine the structure of the average graph, we numerically compute S 1 and threshold the matrix at 10 8. We examine the empirical FDR of our procedures on some synthetic datasets. The following protocol is followed in all the experiments: 1. Generate the graph structure of Z as specified above. 2. Simulate n observations from our model (1). High-Dimensional Inference for Cluster-Based Graphical Models 3. Estimate the cluster partition b G. For computational convenience, we apply the FORCE algorithm (Eisenach and Liu, 2019) with known K. 4. Construct the test statistic f Wt,k defined in Section 4.3. The regularization parameters λ and λ are chosen by 5-fold cross validation. 5. Find the FDR cutoff(39) at level α; we consider three cases α = 0.05, 0.1, 0.2. The simulation is repeated 100 times. To compare with our Benjamini-Yekutieli based FDR procedure, we also report the empirical FDR based on the more classical Benjamini Hochberg procedure. That is, we apply the same procedures 1-4, but in step 5 we replace the FDR cutoffin (39) with the Benjamini-Hochberg (B-H) cutoff, i.e., bτBH := min τ > 0 : τ Φ 1 1 αRτ Table 1 compares the empirical FDR based on our method with the B-H procedure under different m, K settings when d = 400. When m is relatively large (e.g., m = 20), both methods can control FDR on average, although our method is relatively more conservative. As expected, the FDR control problem becomes more challenging for large K and small m. In this case the graph contains more nodes and each cluster contains fewer variables. We observe that when m = 5 our method can still control FDR reasonably well but the B-H method produces empirical FDR far beyond the nominal level, especially for hub graphs. The inferior performance of the B-H procedure is due to the fact that the dependence among the test statistics is not accounted for, demonstrating that the B-Y procedure is indeed necessary at least in the current simulation settings. Finally, we examine the empirical power of the FDR procedure under each scenario, which is defined as I[f Wt,k bτ] where H1 is the set of alternative hypothesis. Table 2 gives the empirical power of our FDR procedure, and the B-H procedure when d = 400. It shows that our procedure and the B-H procedure have very high power in all scenarios. The same phenomenon is observed when d is large, i.e., d = 1000; see Tables 3 and 4. In summary, our proposed procedure can identify most of the signals in the graph while keeping FDR well controlled. Remark 6 (Inexact Recovery) While our FORCE algorithm guarantees the estimated number of clusters is always K since we use the true value K as the input, the recovered partition b G may still differ from the true partition G . In order to compare our results to the ground-truth, we need to find an aligned version of Θ (or Ω ). Specifically, we calculate the mapping f : [K] [K] defined by f(k) = argmax l [K] | b Gk G l |, and then construct the matrix Θ A, defined entry-wise by (Θ A)s,t := (Θ )f(s),f(t). Eisenach, Bunea, Ning and Dinicu Once we have obtained the aligned ground-truth matrix Θ A, we can proceed with computing the metric of interest (FDR and power). Although the preceding discussion is in terms of Θ , the same procedure applies to Ω . Remark 7 (Discussion of the B-H procedure) Based on the results in Tables 3 and 4, it is clear that the power of both our procedure and the B-H procedure are satisfactory. However, for both setups (d = 400, 1000), the B-H procedure leads to the inflated FDR relative to the nominal level. While Appendix H shows that the B-H procedure can control FDR asymptotically under certain conditions, it seems in numerical examples the dependence among the test statistics leads to substantial errors in FDR control that are indeed not ignorable. α = 0.05 α = 0.1 α = 0.2 K m Scalefree Band3 Hub Scalefree Band3 Hub Scalefree Band3 Hub B-Y Based Procedure 80 5 1.16% 1.42% 5.99% 2.01% 2.64% 7.60% 4.01% 4.73% 10.33% 66 6 0.93% 1.03% 1.08% 1.99% 1.98% 1.96% 3.73% 3.68% 4.00% 50 8 1.16% 0.99% 1.09% 2.20% 1.77% 1.98% 3.90% 3.49% 3.73% 20 20 0.99% 0.88% 1.30% 1.81% 1.66% 2.26% 3.75% 3.38% 4.71% 80 5 1.15% 1.40% 6.29% 2.14% 2.67% 7.89% 4.00% 4.70% 10.43% 66 6 0.91% 1.04% 1.04% 1.85% 2.00% 2.14% 3.70% 3.71% 3.83% 50 8 1.16% 1.00% 1.11% 2.22% 1.80% 1.94% 3.98% 3.49% 3.61% 20 20 0.94% 0.88% 1.23% 1.81% 1.68% 2.26% 3.65% 3.38% 4.71% B-H Based Procedure 80 5 8.23% 9.24% 15.60% 15.01% 16.39% 24.00% 28.16% 28.97% 38.58% 66 6 7.18% 7.38% 7.31% 14.12% 14.01% 13.78% 26.56% 25.84% 27.28% 50 8 6.94% 6.75% 6.69% 13.23% 12.73% 12.74% 25.20% 23.62% 26.05% 20 20 5.43% 4.54% 6.38% 11.09% 8.47% 10.71% 21.46% 17.15% 20.75% 80 5 8.42% 9.26% 15.77% 15.25% 16.43% 24.15% 28.03% 29.02% 38.57% 66 6 7.21% 7.37% 7.45% 13.99% 13.92% 13.81% 26.19% 25.77% 27.17% 50 8 6.82% 6.78% 6.67% 13.23% 12.67% 12.89% 25.10% 23.61% 25.91% 20 20 5.38% 4.51% 6.49% 11.01% 8.52% 10.66% 21.58% 17.09% 20.71% Table 1: Averaged empirical FDR for synthetic data experiments with d = 400 and n = 800. Group Average Procedures Do Not Recover Latent Graphs In this section we demonstrate through simulation studies that procedures specifically tailored to recovering the latent variable graph are necessary. We do this by using both the methodology for the latent variable graph and the group-average graph to recover the latent graph structure. Because the differences between the latent and group-average graph can be small, we use larger sample sizes than in the previous studies. The experimental procedure is almost exactly the same as before, but now we hold most of the parameters of the generating distribution fixed and examine the effects of the error variance Γ and sample size n on the efficacy of our methodologies for recovering the latent variable graph. In all the experiments, we impose a Band3 structure on the latent graph, set Γ = γI, and use (d, K, m) = (400, 20, 20). High-Dimensional Inference for Cluster-Based Graphical Models α = 0.05 α = 0.1 α = 0.2 K m Scalefree Band3 Hub Scalefree Band3 Hub Scalefree Band3 Hub B-Y Based Procedure 80 5 88.22% 99.99% 99.00% 91.10% 100.00% 99.04% 93.38% 100.00% 99.11% 66 6 93.62% 100% 100% 95.42% 100% 100% 96.86% 100% 100% 50 8 97.18% 100% 100% 97.90% 100% 100% 98.47% 100% 100% 20 20 100% 100% 100% 100% 100% 100% 100% 100% 100% 80 5 88.27% 99.99% 98.93% 91.06% 100.00% 98.98% 93.37% 100.00% 99.07% 66 6 93.62% 100% 100% 95.40% 100% 100% 96.91% 100% 100% 50 8 97.18% 100% 100% 97.86% 100% 100% 98.47% 100% 100% 20 20 100% 100% 100% 100% 100% 100% 100% 100% 100% B-H Based Procedure 80 5 95.75% 100.00% 99.18% 97.15% 100.00% 99.23% 98.48% 100.00% 99.26% 66 6 97.85% 100% 100% 98.78% 100% 100% 99.34% 100% 100% 50 8 99.27% 100% 100% 99.57% 100% 100% 99.80% 100% 100% 20 20 100% 100% 100% 100% 100% 100% 100% 100% 100% 80 5 95.82% 100.00% 99.14% 97.14% 100.00% 99.18% 98.53% 100.00% 99.23% 66 6 97.88% 100% 100% 98.78% 100% 100% 99.34% 100% 100% 50 8 99.24% 100% 100% 99.59% 100% 100% 99.78% 100% 100% 20 20 100% 100% 100% 100% 100% 100% 100% 100% 100% Table 2: Averaged FDR power for synthetic data experiments with d = 400 and n = 800. α = 0.05 α = 0.1 α = 0.2 K m Scalefree Band3 Hub Scalefree Band3 Hub Scalefree Band3 Hub B-Y Based Procedure 80 12 1.40% 1.52% 1.78% 2.78% 2.81% 3.29% 5.71% 5.09% 6.30% 66 15 1.23% 1.29% 1.31% 2.62% 2.39% 2.69% 5.08% 4.33% 4.84% 50 20 1.05% 0.94% 1.26% 2.27% 1.85% 2.79% 4.28% 3.65% 4.58% 20 50 1.04% 1.03% 1.05% 1.81% 1.96% 2.20% 3.46% 3.26% 4.19% 80 12 1.55% 1.57% 1.78% 2.78% 2.80% 3.26% 5.57% 5.13% 6.13% 66 15 1.30% 1.30% 1.24% 2.64% 2.35% 2.64% 5.07% 4.34% 4.89% 50 20 1.09% 0.93% 1.26% 2.32% 1.89% 2.75% 4.29% 3.64% 4.51% 20 50 1.09% 1.01% 1.05% 1.91% 1.96% 2.14% 3.50% 3.28% 4.25% B-H Based Procedure 80 12 10.99% 10.04% 12.00% 20.04% 18.18% 21.47% 35.50% 31.46% 37.24% 66 15 9.28% 8.20% 9.14% 16.88% 14.94% 16.82% 31.53% 27.75% 31.81% 50 20 7.73% 6.91% 7.96% 14.74% 13.13% 15.33% 27.03% 23.92% 28.72% 20 50 5.00% 4.85% 6.38% 11.01% 9.38% 11.60% 21.49% 17.00% 23.15% 80 12 10.99% 10.00% 12.00% 20.29% 18.20% 21.38% 35.56% 31.48% 37.10% 66 15 9.28% 8.24% 8.99% 17.03% 14.87% 16.87% 31.49% 27.82% 31.63% 50 20 7.60% 6.93% 7.94% 14.61% 13.13% 15.40% 27.12% 23.94% 28.71% 20 50 4.95% 4.91% 6.38% 10.97% 9.35% 11.60% 21.36% 16.94% 23.41% Table 3: Averaged empirical FDR for synthetic data experiments with d = 1000 and n = 500. First we note that if Γ = 0, the two graph structures are actually the same it is only as we introduce error into the observed variables that the latent variable and group-average graphs begin to differ. In Tables 5 and 6 we see that as we increase the error variance, the the group averages methodology applied to the latent graph is unable to control the Type I error rate and the FDR at the desired level. Further, by increasing the sample size n, Eisenach, Bunea, Ning and Dinicu α = 0.05 α = 0.1 α = 0.2 K m Scalefree Band3 Hub Scalefree Band3 Hub Scalefree Band3 Hub B-Y Based Procedure 80 12 71.42% 99.95% 100.00% 77.14% 99.97% 100.00% 82.29% 99.98% 100.00% 66 15 82.98% 99.98% 100.00% 86.32% 99.99% 100.00% 90.02% 100.00% 100.00% 50 20 90.45% 99.99% 100.00% 93.04% 99.99% 100.00% 94.86% 100.00% 100.00% 20 50 99.95% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 80 12 71.54% 99.95% 100.00% 77.13% 99.97% 100.00% 82.25% 99.98% 100.00% 66 15 82.74% 99.98% 100.00% 86.26% 99.99% 100.00% 90.08% 100.00% 100.00% 50 20 90.45% 99.99% 100.00% 93.00% 99.99% 100.00% 94.76% 100.00% 100.00% 20 50 99.95% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% B-H Based Procedure 80 12 87.15% 99.99% 100.00% 91.33% 100.00% 100.00% 94.66% 100.00% 100.00% 66 15 93.57% 100.00% 100.00% 95.63% 100.00% 100.00% 97.38% 100.00% 100.00% 50 20 96.27% 100.00% 100.00% 97.78% 100.00% 100.00% 98.94% 100.00% 100.00% 20 50 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 80 12 87.22% 99.99% 100.00% 91.33% 100.00% 100.00% 94.58% 100.00% 100.00% 66 15 93.52% 100.00% 100.00% 95.65% 100.00% 100.00% 97.35% 100.00% 100.00% 50 20 96.33% 100.00% 100.00% 97.78% 100.00% 100.00% 98.94% 100.00% 100.00% 20 50 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% Table 4: Averaged FDR power for synthetic data experiments with d = 1000 and n = 500. we observe that the performance of the group average procedures applied to recovering the latent graph gets worse this is as expected because with increased n, the tests become accurate, but are by construction estimating the wrong graph! L.V. Test Procedure G.A. Test Procedure n γ α = 0.01 α = 0.05 α = 0.1 α = 0.2 α = 0.01 α = 0.05 α = 0.1 α = 0.2 1600 0.5 0.56% 2.67% 5.83% 10.22% 0.61% 3.00% 6.39% 10.39% 1.0 0.78% 3.11% 5.22% 10.61% 0.78% 3.83% 6.89% 13.39% 1.5 0.61% 2.89% 5.11% 8.72% 0.94% 3.33% 6.22% 10.50% 2.0 1.00% 3.39% 6.39% 11.06% 1.39% 4.78% 7.61% 12.44% 2.5 0.56% 3.50% 6.17% 11.22% 1.89% 5.33% 9.17% 15.28% 3.0 1.72% 5.11% 8.61% 14.28% 3.00% 8.56% 11.89% 17.61% 3.5 1.06% 3.83% 6.78% 11.72% 3.39% 8.44% 12.94% 18.44% 4.0 1.50% 4.33% 8.33% 12.83% 5.22% 10.94% 14.22% 20.78% 3200 0.5 0.28% 1.89% 3.22% 7.50% 0.44% 2.17% 4.00% 8.11% 1.0 0.44% 3.00% 5.11% 9.44% 1.11% 4.22% 7.83% 14.17% 1.5 0.44% 2.72% 4.50% 9.56% 1.78% 5.61% 9.00% 13.61% 2.0 0.89% 3.56% 6.78% 11.61% 2.94% 7.50% 11.33% 16.78% 2.5 1.06% 3.61% 6.06% 10.67% 2.78% 9.67% 13.44% 19.00% 3.0 1.50% 5.44% 8.39% 14.33% 7.72% 13.78% 18.00% 22.78% 3.5 1.67% 5.44% 9.72% 15.72% 7.61% 13.83% 18.22% 22.89% 4.0 2.78% 5.83% 10.06% 16.28% 11.28% 17.11% 22.22% 27.06% Table 5: Averaged Type I for the latent variable graph using both the latent variable and group averages methodology. High-Dimensional Inference for Cluster-Based Graphical Models L.V. Test Procedure G.A. Test Procedure n γ α = 0.01 α = 0.05 α = 0.1 α = 0.2 α = 0.01 α = 0.05 α = 0.1 α = 0.2 1600 0.5 0.83% 3.81% 4.95% 11.44% 0.41% 3.23% 5.51% 12.57% 1.0 0.62% 3.23% 6.07% 10.11% 1.03% 4.00% 7.51% 14.13% 1.5 0.21% 3.61% 5.88% 10.61% 1.23% 4.76% 6.25% 12.25% 2.0 1.23% 3.81% 6.43% 13.51% 1.64% 5.88% 9.26% 14.74% 2.5 0.83% 4.00% 7.16% 12.25% 3.23% 5.88% 10.95% 16.52% 3.0 2.24% 6.61% 10.11% 16.81% 3.23% 9.94% 15.19% 20.66% 3.5 1.23% 4.95% 6.98% 13.36% 3.81% 10.78% 15.79% 22.08% 4.0 1.64% 4.95% 8.40% 15.19% 6.25% 14.29% 17.95% 24.88% 3200 0.5 0.41% 2.44% 4.00% 7.51% 0.41% 2.64% 4.19% 8.75% 1.0 0.83% 3.42% 6.80% 10.61% 1.03% 4.57% 8.05% 16.38% 1.5 0.41% 3.23% 5.70% 9.09% 2.64% 7.34% 10.45% 16.81% 2.0 0.62% 4.38% 6.98% 12.25% 4.00% 9.77% 13.36% 20.27% 2.5 1.03% 4.57% 7.51% 11.60% 4.00% 12.09% 16.81% 23.44% 3.0 1.84% 7.51% 10.45% 16.38% 9.94% 17.81% 22.95% 28.14% 3.5 1.64% 6.98% 11.28% 18.92% 10.95% 18.23% 23.08% 27.93% 4.0 4.38% 8.57% 11.60% 18.92% 15.94% 22.95% 27.16% 31.82% Table 6: Averaged empirical FDR error rates for the latent variable graph using both the latent variable and group averages methodology. 5.2. f MRI Dataset The study of brain relationships in humans via modern neuroimaging techniques has attracted an enormous amount of scientific interest in recent years. One fundamental goal in these studies is to understand the functional communication between brain regions, which plays a key role in complex cognitive processes. To study the functional connectivity network, Power et al. (2011) partitioned the human brain into regions of interest (ROIs) and represented each ROI by a node in a graph. They identified several subgraphs of highly related nodes (i.e, clusters), which can represent the major functional systems of the brain. One of the main goals in their work is to understand the relationship between different subgraphs or clusters. To this end, they estimated the relationship between clusters by thresholding the correlation matrix of all nodes in the same clusters in an adhoc way. The analysis of network of clusters can be conducted under a rigorous statistical framework by using the proposed cluster-based graphical models. In the section, we will apply the proposed method to study the dependence structure among functional systems of the brain. As an illustration, we focus on the publicly available resting-state f MRI data from the Neuro-bureau pre-processed repository (Bellec et al., 2015). Specifically, we use the data from patient 1018959, session 1 in the KKI dataset. This f MRI dataset was pre-processed using the Athena pipeline and mapped to T1 MNI152 coordinate space. We choose this dataset to make our experiments easily reproducible, as the data are available pre-processed using standard alignment and denoising procedures. In a recent work, Luo (2014) estimated the f MRI network in a similar clustering model by an iterative procedure that maximizes the likelihood function with respect to the unknown cluster assignment and the precision Eisenach, Bunea, Ning and Dinicu matrix. However, their optimization problem is non-convex due to estimating the unknown cluster assignment and the algorithm is computationally intensive in high dimension. Following Power et al. (2011), we directly extract the 264 ROIs, which gives us d = 264 mean activities across n = 148 time periods. Since the FORCE algorithm requires to know the number of clusters in advance, we first apply the CORD algorithm (Bunea et al., 2018) to estimate the number of clusters, which gives us K = 53. Then we reapply the FORCE algorithm with K = 53 to obtain the corresponding clusters. Recall that the goal is to analyze the dependence structure among functional systems of the brain. Under the latent variable model (1), we treat the measurement of ROIs as the observed variable X and the underlying brain function as the latent variable Z. Thus, the relationship between brain functional systems can be modeled by the latent variable graph. Using the FDR control procedures with α = 0.01, the estimated latent variable graph is shown in Figure 1. As a comparison, we also show the cluster average graph in Figure 2. However, in this example the biological meaning of the cluster-average variables X is unclear so that the interpretation of the cluster average graph seems difficult. For purposes of clarity, we only display the nodes and connections corresponding to the 10 largest groups in these two figures. The groups are colored according to the functional network the majority of their nodes belong to as given in Power et al. (2011). It is known that the default mode may comprise multiple interacting subsystems (Andrews-Hanna et al., 2010). Thus our graph contains multiple nodes for the default mode, where each node may represent different subsystems. To demonstrate the difference between the latent variable graph and the cluster average graph, we focus on three regions, default mode (pink), dorsal attention (brown) and salience (yellow). In the latent variable graph, dorsal attention and salience are connected and both of them are strongly connected with many subsystems of the default mode. This dependence structure is supported by the neuroscience theory that the default mode, engaged during rest and internally directed tasks, should exhibit anticorrelation with networks engaged during externally directed tasks, such as the dorsal attention network and the salience network (Zhou et al., 2017). Moreover, this finding is also consistent with many existing empirical results such as Fox et al. (2005); Fransson (2005); Smith et al. (2009); Raichle (2015). However, in the cluster-averages graph, dorsal attention is conditionally independent of any subsystems of the default mode and salience is only loosely connected with the default mode. In summary, the latent variable graph seems to be more biologically meaningful than the cluster-average graph in order to interpret the dependence structure of the functional systems of the brain. Acknowledgments We would like to thank the reviewers and editor for their suggestions to improve this paper. Florentina Bunea was partially supported by NSF-DMS 712709. Yang Ning was partially supported by NSF-DMS 1854637. We are grateful to Xi Luo for help with the interpretation of our data analysis results. High-Dimensional Inference for Cluster-Based Graphical Models FMRI - Latent Graph ( b K = 53) Uncertain Auditory Default mode Salience Cingulo-opercular Task Control Dorsal Attention Sensory/Somatomotor Hand Figure 1: Recovered latent graph structure between 10 largest clusters in f MRI data with FDR level α = 0.01 colored according to their functions. FMRI - Averages Graph ( b K = 53) Uncertain Auditory Default mode Salience Cingulo-opercular Task Control Dorsal Attention Sensory/Somatomotor Hand Figure 2: Recovered cluster-averages graph structure between 10 largest clusters in f MRI data with FDR level α = 0.01. Eisenach, Bunea, Ning and Dinicu Appendix A. Estimation in the Cluster-Average Graph In this section we provide the proofs of the results needed for establishing the asymptotic normality of the estimators of the entries of Ω . These results make use of the fact that consistent clustering is possible, under our assumptions, as stated below. Lemma 8 Let E =: { b G = G }, for b G estimated by either the COD or the PECOK algorithm of Bunea et al. (2018). Then, under Assumptions 1 and 2, we have P(E) 1 C (d n)3 . The conclusion of this Lemma is proved in Theorem 3, for the COD algorithm, and Theorem 4, for the PECOK algorithm, of Bunea et al. (2018). Lemma 8 allows us to replace b G by G in all the results below, while incurring a small error, of order O C (d n)3 , which will be shown to be dominated by other error bounds. Remark 9 While Assumptions 1 and 2 are made for C , they do imply that c1 λmin (S ) and maxt S t,t c2 + c3 holds for S . Furthermore Lemma 29 implies the same restricted eigenvalue condition on λmin (S ) as on C . A.1. Main Proofs for the Cluster-Average Graph Estimators Before proving Theorem 3 and Claim 1 of Theorem 5, we state two propositions one regarding the asymptotic normality of eΩt,k and the other regarding the convergence rate of the variance estimator. The proofs are deferred until after that of the main result Proposition 10 (Asymptotic Normality of eΩt,k) Under the same conditions as in Theorem 3, we have max 1 t bτ]I[Rbτ > 0] (t,k) H0 E h I[|f Wt,k| > bτ]I[Rbτ > 0] To handle the Rbτ in the denominator, we use the identity 1 = P i=Rbτ Rbτ i(i+1). This implies 1 i(i + 1) = Plugging this into (45) and bringing the expectation inside the summation gives that FDR(bτ) = X 1 i(i + 1)E h I[|f Wt,k| > bτ]I[Rbτ > 0]I[i Rbτ] i 1 i(i + 1)E h I[|f Wt,k| > Φ 1 1 αRbτ 2NBY |H| ]I[Rbτ > 0]I[i Rbτ] i 1 i(i + 1)E h I[|f Wt,k| > Φ 1 1 α(i |H|) where the second line follows from the definition of the FDR cutoffand the last inequality holds since Rbτ (i |H|). The Berry-Esseen bound in Theorem 3 implies that P(|f Wt,k| > Φ 1 1 α(i |H|) NBY |H| + 2bn. Thus, it follows that NBY |H| + 2bn i i(i + 1)NBY + |H| i(i + 1)NBY 1 i + 1 1 NBY + |H| |H| + 1 1 NBY α + 2|H0|bn, where the last step follows from |H0| |H| 1 and the definition of NBY . This completes the proof of the Theorem 5, Claim 1. High-Dimensional Inference for Cluster-Based Graphical Models Proof of Proposition 10 Proof The proof is done in two steps. In Step 1, we show that intersected with the event E, n1/2|(eΩt,k Ω t,k)/st,k + Ω t,th(Ω k)/st,k| s1 log(K n) n1/2 , (47) with probability at least 1 C/(K n)3 and then use Lemma 16 to obtain the result. To prove (47), we decompose it as n1/2|(eΩt,k Ω t,k) + Ω t,th(Ω k)| = n1/2|(bΩt,k Ω t,k) bΩt,tbh(bΩ k) + Ω t,th(Ω k)| n1/2|(bΩt,k Ω t,k) Ω t,t(h(bΩ k) h(Ω k))| | {z } I.1 + n1/2|Ω t,t(bh(bΩ k) h(bΩ k))| | {z } I.2 + n1/2|(bΩt,t Ω t,t)bh(bΩ k)| | {z } I.3 In the following, we study these three terms separately. Recall that h(Ω k) = v T t ( b SΩ k ek), and bh(Ω k) = bv T t ( b SΩ k ek), where v t is a K-dimensional vector with (v t )t = 1 and (v t ) t = w t with w t = (S t, t) 1S t,t. Term I.1 reduces to |I.1| = n1/2|(bΩt,k Ω t,k) Ω t,tv T t b S(bΩ k Ω k)| n1/2|(bΩt,k Ω t,k)(1 Ω t,t( b St,t w T t b S t,t))| (48) + n1/2Ω t,t|( b St, t w T t b S t, t)(bΩ t,k Ω t,k)|. (49) Note that 1/Ω t,t = S t,t w T t S t,t. The term in (48) can be bounded by n1/2|(bΩt,k Ω t,k)Ω t,t(b St,t S t,t)| + n1/2|(bΩt,k Ω t,k)Ω t,tw T t ( b S t,t S t,t)| n1/2|Ω t,t| bΩ k Ω k 1 max( b S S max, w T t ( b S t S t ) ) Cs1 log(K n) with probability at least 1 (K n) 3, by λmax(Ω ) C and the concentration and error bound results in Lemmas 13, 14, 15. The term in (49) can be bounded by n1/2Ω t,t b St, t w T t b S t, t bΩ t,k Ω t,k 1 Cs1 log(K n) with probability at least 1 (K n) 3 again by Lemmas 14, 15. Thus, |I.1| s1 log(K n) n1/2 with probability at least 1 (K n) 3. For term I.2, we have |I.2| = n1/2Ω t,t|(bvt v )T ( b S bΩ k ek)| n1/2Ω t,t bvt v 1 b S bΩ k ek Cs1 log(K n) Eisenach, Bunea, Ning and Dinicu with probability at least 1 (K n) 3 by Lemma 15 and the constraint of the CLIME-type estimator. To control term I.3, first we observe that |bh(bΩ k)| = |bv T t ( b S bΩ k ek)| |v T t ( b SΩ k ek)| + |v T t b S(bΩ k Ω k)| + |(bvt v t )T ( b S bΩ k ek)| v t 1 b SΩ k ek + b St, t w T t b S t, t bΩ k Ω k 1 + bvt v t 1 b S bΩ k ek . As shown in the proof of Lemma 16, v t 1 s1/2 1 v t 2 Cs1/2 1 . The rest of the bounds on the above terms follows easily from Lemmas 14, 15. Thus, we have |bh(bΩ k)| C(s1 log(K n)/n)1/2 with high probability. Since |bΩt,t Ω t,t| C(s1 log(K n)/n)1/2, by Lemma 15, we obtain that |I.3| s1 log(K n) n1/2 with probability at least 1 (K n) 3. It is easily seen that Ω t,t 1 S t,t C > 0, see Remark 9. This implies that s2 t,k = Ω t,tΩ k,k + Ω 2 t,k is lower bounded by a positive constant. The proof of (47) is complete. In step 2, we need to verify that max 1 t 0 since Ω max λmax(Ω ) C. It is easily verified that Ω t,t 1 S t,t C > 0 (see Remark 9). This implies that s2 t,k = Ω t,tΩ k,k + Ω 2 t,k is lower bounded by a positive constant. Thus, max 1 t 2 maxt(C t,t)2 for arbitrary D1 > 0. Observe that for n sufficiently large, D1 p log(K n)/n 2 maxt(C t,t)2, and thus we need only consider this case. Choose D1 4 3 maxt(C t,t)2. Then it is clear that i=1 Zi ZT i C Eisenach, Bunea, Ning and Dinicu By applying the union bound across all entries in the matrix, we get the desired result that i=1 Zi ZT i C max D1 concluding the proof of parts (a) and (b). The proof of (c) and (d) are very similar and omitted. Lemma 25 Let C = 2 3, σ2 s = 1 |G s|2 P i G s γi, and Mi = Zi ET i A B 1. Then, (a) (Mi)s,t is sub-exponential with parameters α = 2 max σ2 s, C t,t and ν = 2 max σ2 s, C t,t , n Pn i=1 Mi max C maxt(σ2 t C t,t) q (c) (Miu)s is sub-exponential with parameters α = 2 max σ2 s, C t,t and ν = 2||u||2 max σ2 s, C t,t , and n Pn i=1 Miu C maxt(σ2 t C t,t)||u||2 q where u RK. Proof Let M = Pn i=1 Mi. From Lemma 27, Y1 = B 1A T E1 is a K-dimensional vector where the kth entry is given by (Y1)k = 1 |G k| i G k (E1)i. Because the errors are all independent mean zero Gaussian random variables, (Y1)s N(0, σ2 s). Therefore, as Y1 is independent of Z1 by definition, E[(Y1)s(Z1)t] = E[(Y1)s]E[(Z1)t] = 0. Further, Lemma 31 gives that (Y1)s(Z1)t is sub-exponential with parameters α = ν = 2 max σ2 s, C t,t . Using the independence of the samples, Corollary 33 gives that Ms,t is sub-exponential with parameters α = 2 max σ2 s, C t,t and ν = 2n max σ2 s, C t,t . Then, Corollary 35 gives that for arbitrary choice of D1 > 0, 1 n Ms,t D1 exp (log(K n))D2 1 4 max(σ2s,C t,t) 2 2 max σ2 s, C t,t 2 max(σ2s,C t,t) 2 max σ2 s, C t,t . High-Dimensional Inference for Cluster-Based Graphical Models Observe that for n sufficiently large, D1 p 2 max σ2 s, C t,t . If we choose D1 2 3 max σ2 s, C t,t , then we obtain that for n sufficiently large, 1 n Ms,t D1 Then by the union bound we can obtain i=1 Zi ET i A B 1 max D1 3 max maxs σ2 s, maxt C t,t , concluding the proof of parts (a) and (b). The proof of (c) and (d) are very similar and omitted. Lemma 26 Recall that m = mink |G k|, and let Mi = B 1A T Ei ET i A B 1 and µ = B 1A T Γ A B 1. Then, (a) (Mi µ)t,k is sub-exponential with parameters αt,k = 2 |G k||G t | maxi G t G k γ i and 2 |G k||G t | maxi G t G k γ i , n Pn i=1 Mi µ max C maxk γk q (c) ((Mi µ) u)t is sub-exponential with parameters αt = 2 m2 maxk γ k and νt = q 2 m2 ||u||2 maxk γ k, and n Pn i=1 (Mi µ) u max C||u||2 maxk γk q where C = 2 3 and and u RK. Proof To obtain the results, we first bound the sum M := Pn i=1 Mi entrywise. From Lemma 28, Corollary 32 and Corollary 33, we have that (Mi)t,k is sub-exponential with parameters 2 |G k||G t | max i G t G k γ i and νt,k = 2 |G k||G t | max i G t G k γ i . Therefore 1 n Mt,k is sub-exponential with parameters 2 n|G k||G t | max i G t G k γ i and ν = 2 n|G k||G t | max i G t G k γ i . Eisenach, Bunea, Ning and Dinicu Next, observe that µt,k = E[Mt,k] and denote N = maxi G t G k γ i . Then, from Lemma 34, we obtain 1 n Mt,k µt,k D1 n|G k||G t | exp (log(K n))D2 1 4N2 if 0 D1 q log(K n) n|G k||G t | log(K n) n|G k||G t | > Observe that for n sufficiently large, D1 q log(K n) n|G k||G t | 2N. If we choose D1 2 3N, then we obtain that for n sufficiently large, 1 n Mt,k µt,k D1 n|G k||G t | Therefore by taking the union bound, lower bounding n|G k||G t | by nm2 and choosing D1 2 3 maxk γ k, i=1 B 1A T (Ei ET i Γ )A B 1 max D1 concluding the proof of parts (a) and (b). The proof of (c) and (d) are very similar and omitted. Appendix D. Auxiliary Technical lemmas Lemma 27 For 1 k K, denote mk = |G k|. Then the matrix (A T A ) 1A T is a K d dimensional matrix given as [(A T A ) 1A T ]k,i = ( 1 mk if i G k 0 otherwise. Proof First, we must calculate B 1A T . For 1 k K, denote mk = |G k| and let ek be a unit vector in RK with 1 on the k position and 0 otherwise. Without loss of generality, we permute the rows of A such that for any 1 k K A j = ek, for Pk 1 i=1 mi + 1 j < Pk i=1 mi + 1 that is rows are ordered according to ascending group index. Here, for notational simplicity, we let m0 = 0. Thus, A T A = diag(m1, ..., m K) and the result follows immediately. Lemma 28 The matrices B 1A T (Ei ET i )A B 1 and B 1A T (Γ )A B 1 are given by (B 1A T (Ei ET i )A B 1)t,k = 1 |G t G k| q G k Ei,p Ei,q High-Dimensional Inference for Cluster-Based Graphical Models (B 1A T (Γ )A B 1)t,k = ( 1 |G t |2 P p G t γ p if t = k 0 otherwise. Proof The result can be obtained by a straightforward computation. Lemma 29 (Restricted Eigenvalue Condition for b C) If Assumptions 1 and 2 hold, then the matrix b C satisfies with probability at least 1 C (K n)3 , ||v||2 2 : v RK \ {0}, ||v S||1 3||v S||1 ( v T b C t, tv ||v||2 2 : v RK \ {0}, ||v S ||1 3||v S ||1 where κ 3 4c1 > 0. Proof We begin by proving the first claim. By Lemma 19, we have that b C C max n with high probability. Therefore, for K sufficiently large and for any v RK \ {0}, v T b Cv The proof is then done for κ = 3 4c1 as we assume the minimum eigenvalue of C is bounded below by c 1 0 . The proof of the second claim is identical because C is positive semidefinite, and it is well known that the minimum eigenvalue of any principal submatrix C t, t is bounded below by λmin(C ) c 1 0 . Lemma 30 Let M be a n n positive definite matrix and denote its inverse by L. Then, for all i = 1, . . . , n Mi,i Li,i 1. Proof By the block matrix inverse formula, it follows that M 1 i,i = Li,i LT i,i L 1 i, i L i,i. (58) Because M is positive definite, so is L. Recall that a matrix is positive definite if and only if all its principal minors are also positive definite. Therefore, L i, i is positive definite, as is L 1 i, i. Therefore, LT i,i L 1 i, i L i,i 0 and (58) becomes M 1 i,i Li,i. Lastly, if a matrix is positive definite, all its diagonal elements must be nonnegative, giving that Mi,i Li,i 1 as desired. Eisenach, Bunea, Ning and Dinicu Appendix E. Basic Tail Bounds for Random Variables This section collects some basic tail probability results for random variables. The proof is standard and omitted. Lemma 31 Let Y = (Y1, Y2) be a jointly Gaussian random vector with covariance matrix C. Then Y1Y2 is sub-exponential with parameters α = 4λmax (CY ) and ν = 2 2λmax (CY ). Corollary 32 Let Y1 N(0, σ2 1) and Y2 N(0, σ2 2) where σ2 1 σ2 2.Then Y1Y2 is subexponential with parameters α = 2σ2 1 and ν = Corollary 33 Consider Pn i=1 Xi where Xi are centered, independent sub-exponential random variables. Then Y = Pn i=1 Xi is sub-exponential with parameters α = maxi αi and ν = q Pn i=1 ν2 i . Lemma 34 (Tail Bound for Sub-Exponential Random Variables) Let X be a subexponential random variable with mean µ and parameters α and ν. Then 2ν2 ) for 0 t ν2 2α) for t > ν2 Corollary 35 Consider Y = Pn i=1 Xi, where Xi are centered, independent sub-exponential random variables. Let α = maxi αi and ν = q Pn i=1 ν2 i . Then, 2ν2/n) for 0 t ν2 2α) for t > ν2 Appendix F. Construction of a Pre-Clustering Variance Estimator We include in this section the construction of the pre-clustering estimator of Γ needed as an input of the PECOK algorithm of Section 2.2 above. For any a, b [d], define V (a, b) := max c,d [p]\{a,b} (bΣac bΣad) (bΣbc bΣbd) q bΣcc + bΣdd 2bΣcd , (59) with the convention 0/0 = 0. Guided by the block structure of Σ, we define b1(a) := argmin b [p]\{a} V (a, b) and b2(a) := argmin b [p]\{a,b1(a)} V (a, b), to be two elements close to a, that is two indices b1 = b1(a) and b2 = b2(a) such that the empirical covariance difference bΣbic bΣbid, i = 1, 2, is most similar to bΣac bΣad, for all variables c and d not equal to a or bi, i = 1, 2. It is expected that b1(a) and b2(a) either belong to the same group as a, or belong to some close groups. Then, our estimator eΓ is a diagonal matrix, defined by eΓaa = bΣaa + bΣb1(a)b2(a) bΣab1(a) bΣab2(a), for a = 1, . . . , d. (60) High-Dimensional Inference for Cluster-Based Graphical Models Intuitively, eΓaa should be close to Σaa + Σb1(a)b2(a) Σab1(a) Σab2(a), which is equal to Γaa in the favorable event where both b1(a) and b2(a) belong to the same group as a. In general, b1(a) and b2(a) cannot be guaranteed to belong to the same group as a. Nevertheless, these two surrogates b1(a) and b2(a) are close enough to a so that eΓ Γ max |Γ|max p log d/n. This last fact and the above construction are shown in Bunea et al. (2018). Appendix G. Comparison with Cai et al (2016) In our work, we bound λmin(S ) c1 and maxt S t,t c2 and the sparsity of Ω k by s1. We show that the CLIME estimator satisfies bΩ k Ω k 1 s1 Let us denote our parameter space by G1 = {Ω: max k Ω k 0 s1, max t {(Ω 1)tt} c2, λmax(Ω) 1/c1}. By contrast, Cai et al. (2016) considered the following parameter space for the precision matrix Ω= S 1 (in the exact sparse case) G = {Ω: max k Ω k 0 s1, Ω 1 Mn, κ(Ω) = λmax(Ω)/λmin(Ω) M1}, where Ω 1 is the matrix ℓ1-norm, Mn is allowed to increase with n and M1 is a constant bound for the condition number. Their minimax lower bound over G depends on Mn, that is for any bΩ k, sup Ω G bΩ k Ω k 1 Mns1 It seems that the upper bound (61) and the lower bound (62) contradict with each other. However, this is not the case because the parameter space G1 under which the estimator is developed is different from the parameter space G in the lower bound. This therefore opens up the possibility of obtaining different rates over different parameter spaces, for instance one which does not depend on ||Ω ||1, like in our case. However, neither parameter space can be viewed as a full relaxation of the other. Below we give explicit examples that show how the parameter spaces differ. Consider the sequence of precision matrices, indexed by K ΩK = |S(K)|I + 1S(K)1 S(K) RK, where S(K) is an arbitrary subset of [K] = {1, 2, ..., K} with |S(K)| = K1/4 and 1S(K) is the vector with 1 s in the indices indicated by S(K) and 0 s elsewhere. By using the Sherman-Morrison formula, we can verify that the corresponding covariance matrix is SK = 1 |S(K)|I 1 2|S(K)|2 1S(K)1 S(K). Eisenach, Bunea, Ning and Dinicu It is easy to check that λmin(ΩK) = |S(K)| = K1/4, λmax(ΩK) = 2|S(K)| = 2K1/4, λmin(SK) = 1 2|S(K)| = K 1/4/2. Thus κ(ΩK) = 2, and therefore Theorem 4.2 from Cai et al. (2016) can be applied to this sequence of parameters ΩK with s1 = K1/4 + 1 and Mn = 2K1/4. However, ΩK does not belong to our parameter space G1 because for any choice of c1, once K is sufficiently large, λmin(SK) < c1. Similarly, we can modify the above example such that ΩK satisfies our condition but does not belong to G. For instance, consider ΩK = I 1 1 + |S(K)|1S(K)1 S(K) RK. We can show that SK = I + 1S(K)1 S(K), λmin(ΩK) = 1 K1/4 + 1, λmax(ΩK) = 1, λmin(SK) = 1. Our conditions λmin(SK) c1 and maxt(SK)t,t c2 hold, and thus we obtain the rate (61) for the CLIME estimator. However, κ(ΩK) = K1/4 + 1 , as K . Thus, ΩK does not belong to the parameter space G, and Theorem 4.2 in Cai et al. (2016) is not applicable. Since our goal is to make inference on each entry of the precision matrix say Ωtk, a lower bound on the minimum eigenvalue on S is a mild and standard assumption for any high-dimensional inference on graphical models. After some careful analysis as explained in the main text, we show that under Assumption 4.1 and 4.2 the upper bound for the CLIME estimator does not depend on ||Ω ||1 or λmax(S ). Appendix H. B-H procedures for FDR control In this section, we analyze the theoretical properties of the B-H procedure for the clusteraverage graph. The procedure for latent variable graph is identical. Recall that the test statistic for H0,tk : Ω t,k = 0 is f Wt,k. The B-H procedure is defined as follows. Given the desired FDR level α, define bρ = inf n 0 ρ 2 p log K : 2(1 Φ(ρ))|H| max{P 1 t k K I(|f Wt,k| ρ), 1} α o , where |H| = K(K 1)/2 is the total number of hypotheses. If bρ does not exist, then we set bρ = 2 log K. In the above definition, we restrict ρ 2 log K in order to apply the High-Dimensional Inference for Cluster-Based Graphical Models Cramer-type moderate deviation result (Liu, 2013). The B-H procedure says that we reject H0,tk if |f Wt,k| bρ. The FDR and FDP are defined as P (t,k) H0 I(|f Wt,k| bρ) max{P (t,k) H I(|f Wt,k| bρ), 1} , where H0 := {(t, k) : 1 t < k K, such that Ω t,k = 0}, and FDR = E(FDP). Let A = n (t, k) : Ω t,k q Ω k,kΩ t,t 4 p denote the set of strong signals. The following theorem on the B-H procedure holds. Proposition 36 Assume that the conditions in Theorem 3 hold. Let K nr for some r > 0. In addition, assume that |A| log log K, s1 log3/2(K n)/n1/2 = o(1) and s1 = O(Kc) for some c < 1/2. Then, as n, K , we have FDP |H0|/|H| = α + op(1) and FDR |H0|/|H| = α + o(1). The proof of this proposition follows from Theorem 3.1 in Liu (2013). The conditions |A| log log K and s1 = O(Kc) are equation (12) in Liu (2013). The condition s1 log3/2(K n)/n1/2 = o(1) together with Theorem 3 guarantees that max 1 t