# highdimensional_inference_for_clusterbased_graphical_models__9d7ceec4.pdf

Journal of Machine Learning Research 21 (2020) 1-55 Submitted 6/18; Revised 12/19; Published 3/20

High-Dimensional Inference for Cluster-Based Graphical Models

Carson Eisenach eisenach@princeton.edu Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA

Florentina Bunea fb238@cornell.edu Yang Ning yn265@cornell.edu Claudiu Dinicu cd535@cornell.edu Department of Statistics and Data Science Cornell University Ithaca, NY 14850, USA

Editor: Nicolas Vayatis

Motivated by modern applications in which one constructs graphical models based on a very large number of features, this paper introduces a new class of cluster-based graphical models, in which variable clustering is applied as an initial step for reducing the dimension of the feature space. We employ model assisted clustering, in which the clusters contain features that are similar to the same unobserved latent variable. Two diﬀerent cluster-based Gaussian graphical models are considered: the latent variable graph, corresponding to the graphical model associated with the unobserved latent variables, and the cluster-average graph, corresponding to the vector of features averaged over clusters. Our study reveals that likelihood based inference for the latent graph, not analyzed previously, is analytically intractable. Our main contribution is the development and analysis of alternative estimation and inference strategies, for the precision matrix of an unobservable latent vector Z. We replace the likelihood of the data by an appropriate class of empirical risk functions, that can be specialized to the latent graphical model and to the simpler, but under-analyzed, cluster-average graphical model. The estimators thus derived can be used for inference on the graph structure, for instance on edge strength or pattern recovery. Inference is based on the asymptotic limits of the entry-wise estimates of the precision matrices associated with the conditional independence graphs under consideration. While taking the uncertainty induced by the clustering step into account, we establish Berry-Esseen central limit theorems for the proposed estimators. It is noteworthy that, although the clusters are estimated adaptively from the data, the central limit theorems regarding the entries of the estimated graphs are proved under the same conditions one would use if the clusters were known in advance. As an illustration of the usage of these newly developed inferential tools, we show that they can be reliably used for recovery of the sparsity pattern of the graphs we study, under FDR control, which is veriﬁed via simulation studies and an f MRI data analysis. These experimental results conﬁrm the theoretically established diﬀerence between the two graph structures. Furthermore, the data analysis suggests that the latent variable graph, corresponding to the unobserved cluster centers, can help provide more insight into the

c 2020 Carson Eisenach, Florentina Bunea, Yang Ning and Claudiu Dinicu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-357.html.

Eisenach, Bunea, Ning and Dinicu

understanding of the brain connectivity networks relative to the simpler, average-based, graph.

Keywords: Berry-Esseen bound, Graphical model, Latent variables, High-dimensional inference, Clustering, False discovery rate

1. Introduction

Over the last several decades, graphical models have become an increasingly popular method for understanding independence and conditional independence relationships between components of random vectors. More recently, the challenges posed by the estimation and statistical analysis of graphical models with many more nodes than the number of observations has led to renewed interest in these models, such as Meinshausen and B uhlmann (2006); Yuan and Lin (2007); Friedman et al. (2008); Verzelen (2008); Lam and Fan (2009); Rothman et al. (2008); Peng et al. (2009); Ravikumar et al. (2011); Yuan (2010); Cai et al. (2011); Sun and Zhang (2012); Liu et al. (2012); Xue and Zou (2012); Ning and Liu (2013); Cai et al. (2016); Tan et al. (2016); Fan et al. (2017); Yang et al. (2018); Feng and Ning (2019), to give only an incomplete list.

Nonetheless, when the dimension (number of nodes) grows very large and the sample size is small, the dependency among the components of a random vector may become weak, if it exists at all, and diﬃcult to detect without additional information. If the dimension of the random vector is in the thousands, even if the dependency structure can be detected by an estimated graphical model, it can be diﬃcult to interpret the results and extract meaningful scientiﬁc insights.

One solution to both of the aforementioned issues is to employ an initial dimension reduction procedure on the high dimensional vector. For example, in neuroscience applications, a typical functional magnetic resonance image (f MRI) consists of blood-oxygen-leveldependent (BOLD) activities measured at 200,000+ voxels of the brain, over a period of time. Instead of analyzing voxel-level data directly, scientists routinely cluster voxels into several regions of interest (ROI) with homogeneous functions using domain knowledge, and then carry out the analysis at the ROI-level. In this example, using the language of graphical models, the group structure of variables may boost the dependency signals. Similar pre-processing steps are used in other application domains, such as genomics, ﬁnance and economics.

Motivated by a rich set of applications, we consider variable clustering as the initial dimension reduction step applied to the observed vector X =: (X1, . . . , Xd) Rd. To the best of our knowledge, very little is known about the eﬀect of clustering on downstream analysis and, consequently, on the induced graphical models. Our contribution is the provision of a framework that allows for such an analysis. We introduce cluster-based graphical models, show how they can be estimated and, furthermore, provide the asymptotic distribution of the edge strength estimates.

These models are built on the assumption that the observed variables X = (X1, . . . , Xd) Rd can be partitioned into K unknown clusters G = {G 1, . . . , G K} such that variables in the same cluster share the same behavior. Following the intuition behind widely-used Kmeans type procedures, we deﬁne a population-level cluster as a group of variables that

High-Dimensional Inference for Cluster-Based Graphical Models

are noise corrupted versions of a hub-like variable. This hub-like variable is not directly observable, and is treated as a latent factor. Formally, we assume there exists a latent random vector Z RK, with mean zero and covariance matrix Cov(Z) = C , such that

X = AZ + E, (1)

for a zero mean error vector E with independent entries. The entries of the d K matrix A are given by Ajk = I{j G k}. A cluster of variables consist in those components of X with indices in the same G k. We denote Cov(E) = Γ , a diagonal matrix with entries Γ jj = γ j for any 1 j d. We also assume that the mean-zero noise E is independent of Z. Bunea et al. (2018) show that the clusters are uniquely deﬁned by the model in (1), provided that the smallest cluster contains at least two variables and C is strictly positive deﬁnite; this result holds irrespective of distributional assumptions. To keep the presentation focused, in this work we assume that Z N(0, C ) and E N(0, Γ ), which implies X N(0, Σ ) with Σ = AC AT + Γ . In this context we consider two related, but diﬀerent, graphical models:

(i) The latent variable graph, associated with the sparsity pattern of the precision matrix

Θ := C 1 (2)

of the Gaussian vector Z RK. The latent variable graph encodes conditional independencies (CI s) among the unobserved, latent variables Z.

(ii) The cluster-average graph, associated with the sparsity pattern of the precision matrix

Ω := S 1, (3)

where S is the covariance matrix of X RK, and X =: ( X1, . . . , XK) is the within cluster average given by Xk =: 1 |G k| P i G k Xi. The cluster-average graph encodes CI s among averages of observable random variables. In particular, we have

S = C + Γ ,

where Γ = diag( γ 1, ..., γ K) with γ k = 1 |G k|2 P j G k γ j .

Although both these graphs correspond to vectors of dimension K and are constructed based on the partition G , they are in general diﬀerent as the sparsity patterns of Θ and Ω

will typically diﬀer, and have diﬀerent interpretations. Therefore it would be misleading to use one as a proxy for the other when drawing conclusions. For instance, in the neuroscience example, if we interpret each latent variable as the function of a ROI, then the latent variable graph encodes the CI relationships between functions, which is one question of scientiﬁc interest. The diﬀerence between Θ and Ω shows that this question will not be typically answered by studying the cluster-average graph, although that may be tempting to do by practitioners.

Eisenach, Bunea, Ning and Dinicu

1.1. Our Contributions

Since the two cluster-based graphical models introduced above can both be of interest in a large array of applications, we provide inferential tools for both of them in this work. We assume that we observe n i.i.d. copies X1, . . . , Xn of X. The focus of our work is on post-clustering and post-regularization inference for these two sparse precision matrices. To this end, we derive the asymptotic distribution of novel estimators of their entries. These estimators can be used to answer any inferential questions of interest at the edge strength level, or can be combined to provide sparsity pattern recovery under FDR control. In Section 4.3 we provide an instance of the latter. Inference for the entries of a Gaussian precision matrix has received a large amount of attention in the past few years, most notably post-regularization inference, for instance Ren et al. (2013); Zhang and Zhang (2014); Jankova and van de Geer (2014); Gu et al. (2015); Barber and Kolar (2015); Jankov a and van de Geer (2017); Javanmard and Montanari (2013); van de Geer et al. (2013); Ning and Liu (2017); Cai and Guo (2017); Neykov et al. (2018); Ning et al. (2017); Fang et al. (2017). These works generalize the classical ideas of one-step estimation (Bickel, 1975) to the high-dimensional setup by ﬁrst constructing a sparse estimator of the precision matrix via regularization, and then building de-sparsiﬁed updates that are asymptotically normal. The eﬀect of the initial regularization step is controlled in the second step, and inference after regularization becomes valid. In this paper we consider a similar estimation strategy, but diﬀers from the existing literature in several important ways. In our work, we add another layer of data-dependent dimension reduction, via clustering, and provide a framework within which the variability induced by clustering can be controlled. Even after controlling for the clustering variability, we note that the existing procedures for estimation and, especially, post-regularization inference in Gaussian graphical models are not immediately applicable to our problem for the following reasons:

1. They are developed for variables that can be observed directly. From this perspective, they could, in principle, be applied to the cluster-average graph, but are not directly extendable to the latent graph;

2. To the best of our knowledge, all existing methods for precision matrix inference require the largest eigenvalue of the corresponding covariance matrix to be upper bounded by a constant. Such an assumption implies, in turn, that the Euclidean norm of each row of the covariance matrix is bounded, which reduces signiﬁcantly the parameter space for which inference is valid. The assumption holds, for instance, when the number of variables is bounded, or when the entries of each row are appropriately small.

To overcome these limitations, we take a diﬀerent approach in this work, that allows us to lift unpleasant technical conditions associated with other procedures, while maintaining the validity of inference for both the latent and the average graph. We summarize our main contributions below.

1. Methods for estimation tailored to high dimensional inference in latent, cluster-based, graphical models. We develop a new estimation strategy tailored to

High-Dimensional Inference for Cluster-Based Graphical Models

our ﬁnal goal, that of constructing approximately Gaussian estimators for the entries of the precision matrices Θ and Ω given in (2) and (3) above. Although we work under the assumption that the data is Gaussian, likelihood based estimators may be unsatisfactory, because their analysis can require stringent assumptions, as explained above (see also Jankova and van de Geer (2014)), or may become analytically intractable, as argued in Section 3.3, for the latent graph. We propose a method that mimics very closely the principles underlying the construction of an eﬃcient score function for estimation in the presence of high dimensional nuisance parameters (see for instance Bickel et al. (1993)), but we do not base it on the corresponding likelihood-derived quantities. The underlying principles are explained in Section 3.1. To the best of our knowledge, this is the ﬁrst estimation method of the latent precision matrix, that can be analyzed theoretically for inferential purposes. As an added beneﬁt of our estimation framework, the same principles can be applied for the estimation, and analysis, of the cluster-average graph. 2. The analysis of estimators of the precision matrix of unobserved cluster centers: Berry-Esseen-type bounds for Gaussian approximations. We verify, theoretically, that the estimators constructed with an inferential goal in mind do indeed have the desired properties. To this end, we derive the asymptotic distribution of the proposed entry-wise estimates of the latent graph, and also of the cluster-average graph. Moreover, we quantify the speed with which this limit is obtained, which we show to be proportional to 1/ n in both cases. We do so by establishing Berry-Esseen type bounds on the diﬀerence between the cumulative distribution function of our estimators and that of a standard Gaussian random variable that are valid for each K, d and n, and are presented in Theorems 3 and 4, respectively. As immediate applications, we can construct approximate conﬁdence intervals for one or a ﬁnite number of entries of the latent or average graph, or known linear functionals of such entries. While the answers we thus provide at the average graph level are similar to existing results, established for the full CI graph based on all d nodes, for instance by (Ren et al., 2013; Zhang and Zhang, 2014; Jankova and van de Geer, 2014; Gu et al., 2015; Barber and Kolar, 2015; Jankov a and van de Geer, 2017; Javanmard and Montanari, 2013; van de Geer et al., 2013; Ning and Liu, 2017; Cai and Guo, 2017; Neykov et al., 2018), the results for the latent graph are, to the best of our knowledge, the ﬁrst such results in the literature. We note, furthermore, that the average cluster graph can be viewed as a graph with observable nodes, the cluster averages, only after the clusters are estimated from the data. Our theoretical analysis takes this step into account. We discuss, in Section 2.2, clustering methods tailored to model (1), where the number of clusters K is unknown and is allowed to grow with n. Using the results of Bunea et al. (2018), these methods yield a partition b G = G , with high probability, provided that λmin(C ) > c, for a small positive quantity c made precise in Section 2.2. A lower bound on the smallest eigenvalue of the covariance matrix is the minimal condition under which inference in any graphical model can be performed. Therefore, consistent clustering via the model (1) does not require a further reduction of the parameter space for which the more standard post-regularized inference can be developed. Moreover, as Section 4 shows, asymptotic inference based on the estimated clusters reduces to asymptotic inference relative to the true clusters, G , without any need for data splitting. This fact holds true for both the average and the latent variable graph, and is in sharp contrast with a phenomenon often encountered in post-model selection inference, such as

Eisenach, Bunea, Ning and Dinicu

in variable selection in linear regression (Lockhart et al., 2014; Lee et al., 2013; Tibshirani et al., 2014). In that case, reducing inference to the consistently selected set of variables can only be justiﬁed over a reduced part of the parameter space (Bunea, 2004), and is therefore not a popular practice.

Another technical contribution is that the asymptotic normality of the estimators is established under relaxed conditions. Unlike the existing literature on de-biased inference on graphical models (Jankova and van de Geer, 2014; Jankov a and van de Geer, 2017), we do not require the bounded operator norm condition for the covariance matrix such as λmax(S ) C for cluster-average graph. As shown by Jankova and van de Geer (2014) and explained above, the analysis of the multivariate Gaussian likelihood may require stringent assumptions for cluster-average graph and becomes intractable for the latent graph. By using the proposed pseudo-likelihood function which has a much simpler form, we can remove the unpleasant assumption on the bounded operator norm. In addition, we reanalyze the CLIME estimator (Cai et al., 2011) bΩ k (the kth column of bΩ) under our Assumption 4.1 and 4.2, which is used as the initial estimator for inference. As explained in

Section 3.2, we can show that the CLIME estimator satisﬁes bΩ k Ω k 1 s1 q

n , where s1 is the sparsity of Ω k. This result does not require the bounded operator norm or matrix L1 norm condition and can be of independent interest.

To illustrate how one can use these newly developed inferential tools, we focus on the estimation of the sparsity pattern of the graphs, which can be equivalently viewed as a multiple-testing problem. It is well known that the exact sparsity pattern can be recovered, with high probability, only if the entries of each precision matrix are above the minimax optimal noise level O( p

log d/n) (Ravikumar et al., 2011; Meinshausen and B uhlmann, 2006). Since our aim is inference on the sparsity pattern without further restrictions on the parameter space, the next best type of error that we can control is the False Discovery Rate (FDR) (Benjamini and Hochberg, 1995). In Section 4.3 we use these results for pattern recovery under FDR control, and explain the eﬀect of the asymptotic approximations on this quantity.

This paper is organized as follows. Section 2 below contains a brief summary of existing results on model-assisted clustering, via model (1). Section 3 describes the estimation procedures for the latent variable graph and the cluster-average graph, respectively. In Section 4 we establish Berry-Esseen type central limit theorems for the estimators derived in Section 3, and provide bounds on the FDR associated with each graphical model under study, respectively. Section 5 gives numerical results using both simulated and real data sets.

2. Background

2.1. Notation

The following notation is adopted throughout this paper. Let d denote the ambient dimension, n the sample size, K the number of clusters and m the minimum cluster size. The matrix C denotes the population covariance of the latent vector Z. Likewise, the matrices Γ , Σ , Θ , S and Ω denote population-level quantities.

High-Dimensional Inference for Cluster-Based Graphical Models

For v = (v1, ..., vd)T Rd, and 1 q , we deﬁne v q = (Pd i=1 |vi|q)1/q, v 0 = |supp(v)|, where supp(v) = {j : vj = 0} and |A| is the cardinality of a set A. Denote v = max1 i d |vi| and v 2 = vv T . Assume that v can be partitioned as v = (v1, v2). Let f(v) denote the gradient of the function f(v), and 1f(v) = f(v)/ v1. Similarly, let 2f(v) denote the Hessian of the function f(v) and 2 12f(v) = 2f(v)/ v1 v2. For a d d matrix M = [Mjk], let M max = maxjk |Mjk|, M 1 = P jk |Mjk|, and M = maxk P j |Mjk|. If the matrix M is symmetric, then λmin(M) and λmax(M) are the minimal and maximal eigenvalues of M. Let [d] = {1, 2, ...., d}. For any j [d], we denote the jth row and jth column of M as Mj and M j, respectively. Similarly, let M j, k be the sub-matrix of M with the jth row and kth column removed. The notation Sd d refers to the set of all real, symmetric d d matrices. Likewise, Sd d + Sd d is the positive semi-deﬁnite cone. We use and to denote the Kronecker and Hadamard product of two matrices, respectively; we also may write M 2 = M M. Let ej denote the vector of all zeros except for a one in the jth position. The vector 1 is the vector of all ones. a b = max(a, b).

2.2. Model Assisted Variable Clustering

In this section we review existing results on variable clustering. Bunea et al. (2018) showed that if we use model (1) to deﬁne clusters of variables, these clusters are uniquely deﬁned, up to label switching, so long as m =: min1 k K |G k| 2 and the components of the latent vector Z are diﬀerent almost surely, or equivalently

(C ) =: min j<k E(Zj Zk)2 > 0.

Since (C ) = min j<k (ej ek)T C (ej ek) 2λmin (C ) ,

the clusters are uniquely deﬁned as soon as λmin (C ) > 0, which is the minimal condition under which one can study properties of the corresponding precision matrix. In addition, Bunea et al. (2018) developed two algorithms, PECOK and COD, that are shown to recover the clusters exactly, from n i.i.d. copies X1, . . . , Xn of X, as soon as

λmin (C ) c,

for a positive quantity c that approaches 0 as n grows. For the COD procedure,

c = O Σ max p

log(d n)/n .

On the other hand, for the PECOK procedure

c = O Γ max p

K log(d n)/mn ,

which can be much smaller when one has a few, balanced, clusters. These values of c are shown to be minimax or near-minimax optimal for cluster recovery. We refer to Theorems 3 and 4 in Bunea et al. (2018) for the precise expressions and details. Under these minimal conditions on λmin (C ), the exact recovery of the clusters holds with

Eisenach, Bunea, Ning and Dinicu

probability larger than 1 1/(d n). Because these conditions are suﬃciently weak, we are able to show, in Section 4, that inference in cluster-based graphical models is not hampered by the clustering step. For completeness, we outline the PECOK algorithm below, which consists in a convex relaxation of the K-means algorithm, further tailored to estimation of clusters G = {G 1, . . . , G K} deﬁned via the interpretable model (1). The PECOK algorithm consists in the following three steps:

1. Compute an estimator eΓ of the matrix Γ .

2. Solve the semi-deﬁnite program (SDP)

b B = argmax B D bΣ eΓ, B , (4)

where bΣ is the sample covariance matrix and

B 0 (symmetric and positive semideﬁnite) P a Bab = 1, b Bab 0, a, b tr(B) = K

3. Compute b G by applying a clustering algorithm on the rows (or equivalently columns) of b B.

The construction of an accurate estimator eΓ of Γ , before the cluster structure is known, is a crucial step for guaranteeing the statistical optimality of the PECOK estimator. Its construction is given in Bunea et al. (2018), and included in Appendix F, for the convenience of the reader. We will employ an eﬃcient algorithm for solving (4). Standard black-box SDP solvers, for a ﬁxed precision, exhibit O(d7) running time on (4), which is prohibitively expensive. Eisenach and Liu (2019) recently introduced the FORCE algorithm, which requires worst case O(d6K 2) time to solve the SDP, and in practice often performs the clustering rapidly. The key idea behind the FORCE algorithm is that an optimal solution to (4) can be attained by ﬁrst transforming (4) into an eigenvalue problem, and then using a ﬁrst-order method. Iterations of the ﬁrst-order method are interleaved with a dual step to round the current iterate to an integer solution of the clustering problem, and then searches for an optimality certiﬁcate. By using knowledge of both the primal and the dual SDPs, FORCE is able to ﬁnd the solution much faster than a standard SDP solver. We refer to Eisenach and Liu (2019) for the detailed algorithm.

3. Estimation of Cluster-based Graphical Models

In this section, we propose a uniﬁed estimation approach, that utilizes similar loss functions for estimation and inference in the cluster-average and the latent variable graphs. We ﬁrst describe our general principle, and then demonstrate its application to the two graphical models.

High-Dimensional Inference for Cluster-Based Graphical Models

3.1. One-step Estimators for High-Dimensional Inference

Assume that we observe n i.i.d. realizations X1, ..., Xn of X Rd. Let Q(β, X) denote a known mapping of β and X to R, where β is a q-dimensional unknown parameter of the distribution of X. Often this Q is referred to as the loss function. We deﬁne the target parameter β as β = argmin E(Q(β, X)).

Next, let us partition β as β = (θ, γ), where θ R is the univariate parameter of interest, and γ Rq 1 is a nuisance parameter. Our goal is to construct a n1/2-consistent and asymptotically normal estimator for θ in high-dimensional models with q = dim(β) n. In this case, the dimension of the nuisance parameter γ is large, which makes the inference on θ challenging. We start from the empirical risk function over n observations deﬁned as

i=1 Q(β, Xi). (6)

One standard choice for Qn is the negative log-likelihood function of the data. In this work, we conduct inference based on an alternative loss function, as the analysis of the log-likelihood can require unpleasant technical conditions that we would like to avoid, as discussed in Sections 4.1. That said, we mimic likelihood principles as much as possible, in order to make intuitive the construction below. For these reasons we will refer to Qn(β) as the negative pseudo-likelihood function. For now we leave Qn(β) unspeciﬁed a detailed discussion of its selection for inference in the latent variable graph and the cluster-average graph will be given in the following two subsections. In terms of Qn, we deﬁne the pseudo-information matrix for one observation as I(β) = E( 2Q(β , Xi)). We can partition this matrix as

I(β) = I11 I12 I21 I22

with the partitions corresponding to those of β = (θ, γ). When Qn is the negative log-likelihood function, and the dimension of the parameter is independent of n, then h(θ; γ) given by (8) is called the eﬃcient score function for θ, and classical theory shows that it admits solutions that are consistent, asymptotically normal and attain the information bound given by the reciprocal of (9) (Van der Vaart, 1998; Bickel et al., 1993). With these goals in mind, we similarly deﬁne the corresponding pseudo-score function for estimating θ in the presence of the nuisance parameter γ as

h(θ; γ) = 1Qn(β) I12I 1 22 2Qn(β)

1Q(β, Xi) I12I 1 22 2Q(β, Xi) (8)

and deﬁne the pseudo information of θ, in the presence of the nuisance parameter γ, as

I1|2 = I11 I12I 1 22 I21. (9)

Eisenach, Bunea, Ning and Dinicu

When the dimension of γ is ﬁxed, one can easily estimate I12 and I22 in (8) by their sample versions b I12 and b I22. However, such simple procedure fails when the dimension of γ is greater than the sample size, as b I22 is rank deﬁcient. To overcome this diﬃculty, rather than estimating I12 and I 1 22 separately, we directly estimate

w T = I12I 1 22 (10)

by bw = argmin w 1, s.t. 2 12Qn( bβ) w T 2 22Qn( bβ) λ , (11)

where λ is a non-negative tuning parameter, and bβ = (bθ, bγ) is an initial estimator, which is usually deﬁned case by case, for a given model. Then, we can plug bw and bγ into the pseudo-score function, which gives

bh(θ, bγ) = 1Qn(θ, bγ) bw T 2Qn(θ, bγ). (12)

Following the Z-estimation principle (Van der Vaart, 1998; Bickel et al., 1993), one could deﬁne the ﬁnal estimator of θ as the solution of the pseudo-score function bh(θ, bγ). However, in many examples, the pseudo-score function bh(θ, bγ) may have multiple solutions and it becomes unclear which root serves as a consistent estimator; see Small et al. (2000) for further discussion of the general estimating function context. To bypass this issue, we consider the following simple one-step estimation approach. Given the initial estimator bθ from the partition of bβ, we perform a Newton-Raphson update based on the pseudo-score function bh(θ, bγ), to obtain eθ, which is traditionally referred to as a one-step estimator by Bickel (1975). Speciﬁcally, we construct

eθ = bθ b I 1 1|2bh(bθ, bγ), (13)

where b I1|2 is an estimator of the partial information matrix I1|2. In Sections 3.2 and 3.3 below we show that, under appropriate conditions, the one-step estimator eθ constructed using the empirical risk functions Qn deﬁned in (16) and (31), respectively satisﬁes

n1/2(eθ θ ) = I 1 1|2n1/2h(β ) + op(1). (14)

By applying the central limit theorem to h(β ), we establish the asymptotic normality of eθ in Theorems 3 and 4. When Qn(β) is the negative log-likelihood of the data, Ning and Liu (2017) successfully used this approach and the resulting estimator eθ is asymptotically equivalent to the debiased estimator in Zhang and Zhang (2014); van de Geer et al. (2013). As we explain in the following subsections, analysis based on the log-likelihood becomes intractable for the latent graphical model and requires stringent technical conditions for the cluster-average graphical model. To overcome this diﬃculty, we employ pseudo-score functions derived from (16) and (31). The resulting one-step estimator still attains the information bound established in the literature, and more importantly requires weaker technical assumptions than the existing methods. In addition to (14), we derive explicitly the speed at which the normal approximation is attained.

High-Dimensional Inference for Cluster-Based Graphical Models

3.2. Estimation of the Cluster-Average Graph

Recall that we assume Z N(0, C ) and E N(0, Γ ), which implies X N(0, Σ ) with Σ = AC AT +Γ . The within-cluster average X =: ( X1, . . . , XK) RK is given by Xk =: 1 |G k| P i G k Xi, corresponding to the population level clusters. Because X N(0, Σ ), we

can verify that X N(0, S ), where

S = C + Γ , (15)

and Γ = diag( γ 1, ..., γ K) with γ k = 1 |G k|2 P j G k γ j . Recall that the precision matrix of X is Ω = S 1 = (C + Γ ) 1.

In this section we give the construction of the estimators of the cluster-average graph corresponding to X. Speciﬁcally, we use the generic strategy outlined in the previous section in order to construct n1/2-consistent and asymptotically normal estimators for each component Ω t,k of the precision matrix Ω , for 1 t < k K. For the estimation of each entry, the remaining K(K + 1)/2 1 parameters in Ω are treated as nuisance parameters. Since we observe n i.i.d. samples of X Rp, if the clusters and their number were known, then we would implicitly observe n i.i.d. samples of X RK. A priori, the clusters are not known else this problem would simply reduce to the standard setting. However to explain our method, we ﬁrst assume that clustering is given, and then show how to lift this assumption. Following our general principle, we would naturally tend to choose the negative loglikelihood function of the cluster-averages ( X1, ..., Xn) as the empirical risk function Qn(β) in (6). Along this line, Jankova and van de Geer (2014) proposed the de-biased estimator for Gaussian graphical models. However, the inference requires the irrepresentable condition (Ravikumar et al., 2011) on S , which can be restrictive. The alternative methods proposed by Ren et al. (2013); Jankov a and van de Geer (2017) imposed the condition that the largest eigenvalue of S is bounded. These technical conditions on S are diﬃcult to justify and can be avoided by using our approach. We propose to estimate each sparse row of Ω as explained below. Let S = n 1 Pn i=1 Xi XT i denote the sample covariance matrix of Xi. When K is small, the maximum likelihood estimator of Ω is S 1, which can be viewed as the solution of the following equation SΩ IK = 0. Thus, in the low dimensional setting, this equation deﬁnes the maximum likelihood estimator. Since we are only interested in Ω t,k, we can extract the kth column from the left hand side of the above equation, and use it as the pseudo-score function Un(Ω k) = SΩ k ek. To apply the inference strategy in Section 3.1, we need to construct a valid empirical risk function Qn(Ω k) such that Qn(Ω k) = Un(Ω k). Simple algebra shows that a possible choice is

Qn(Ω k) = 1

2ΩT k SΩ k e T k Ω k = 1

2ΩT k Xi XT i Ω k e T k Ω k), (16)

which we view in the sequel as the empirical risk corresponding to the population level risk

EQ(Ω k, X) = 1

2ΩT k S Ω k e T k Ω k, (17)

Eisenach, Bunea, Ning and Dinicu

based on the loss function

Q(Ω k, X) =: 1

2ΩT k X XT Ω k e T k Ω k. (18)

Since EQ(Ω k, X) = S Ω k ek = 0 (19)

and 2EQ(Ω k, X) = S , (20)

then the population risk EQ(Ω k, X) has the rows Ω k of the target precision matrix Ω as the unique minimizers, as desired, provided that S is positive deﬁnite, an assumption we make in Section 4.1. We note that the choice of the empirical risk Qn( ) and that of the corresponding pseudoscore Un( ) is not unique. We chose the particular form (16) because it is quadratic in Ω k, which greatly simpliﬁes the theoretical analysis and leads to weaker technical assumptions. Moreover, the property (19) is the same as that of the score function corresponding to the negative log-likelihood function, supporting our terminology. We use the general strategy presented in Section 3.1 to construct estimators that employ the empirical risk Qn( ) deﬁned by (16) above. We ﬁrst recall that Qn( ) depends on the unknown cluster structure G via Xi. We note that in general the estimated group b Gk may diﬀer from G k by a label permutation. For notational simplicity, we ignore this label permutation issue and treat b Gk as an estimate of G k (rather than G j for some j = k). To

deﬁne our estimator of Ω t,k, we ﬁrst replace Xi by c Xi and denote b S = n 1 Pn i=1 c Xi c XT i ,

where b Xik = 1 | b Gk| P

j b Gk Xij.

Let (t, k) be arbitrary, ﬁxed. Replacing S by b S in Qn( ), we follow Section 3.1 to deﬁne the pseudo-score function h(Ω k) = v T t ( b SΩ k ek), (21)

where v t is a K-dimensional vector with (v t )t = 1 and (v t ) t = w t with w t = (S t, t) 1S t,t, which is consistent with the deﬁnition in (10) above. To make inferences based on h(Ω k), we further need to estimate w t and Ω k. Following (11), an estimate of w t is given by

bwt = argmin w 1, s.t b St, t w T b S t, t λ , (22)

where λ is a tuning parameter. Then we can deﬁne bvt accordingly, and

bh(Ω k) = bv T t ( b SΩ k ek). (23)

Recall that the construction of the one-step estimator (13) requires an initial estimator of Ω k. To be concrete, we consider the following initial estimator of Ω k,

bΩ k = argmin β 1, s.t b Sβ ek max λ, (24)

where λ is a tuning parameter. This estimator has the same form as the CLIME estimator for the k-th column of Ω(Cai et al., 2011). However, unlike the CLIME estimator which

High-Dimensional Inference for Cluster-Based Graphical Models

requires λ Ω k 1 p

log K/n, in our Theorem 3 we assume λ = C p

log(K n)/n, where C only depends on the minimum eigenvalue of C and the largest diagonal entries of C

and Γ which are assumed bounded by constants in Assumptions 1 and 2. With this choice of λ, we show in Lemma 15 in Appendix A that

bΩ k Ω k 1 s1

with high probability, where s1 is the sparsity level of Ω k. The sharp concentration of the gradient and Hessian of the empirical risk function provided in Lemma 14 is the key for this result. As a comparison, Theorem 6 in Cai et al. (2011) only implies bΩ k Ω k 1

s1 Ω k 2 1 q

n . For many sparse matrices, the ℓ1 norm of a column, Ω k 1, can grow to inﬁnity with K or s1, and thus (61) gives a faster rate. In this case, and when λmin(C ) > c, Lemma 14 is instrumental in showing that the extra Ω k 2 1 factor in the rate of the original CLIME estimator can be avoided, whereas if only the marginal components of X are assumed to be sub-Gaussian, as in Cai et al. (2011), it may be unavoidable, without further conditions on Ω . We direct the reader to Appendix G for a more detailed discussion of the distinction between our results and those in Cai et al. (2011, 2016). Leveraging the block matrix inverse formula, we can show that the partial pseudo information matrix reduces to I1|2 = 1/Ω t,t. Finally, the one-step estimator is deﬁned as

eΩt,k = bΩt,k bh(bΩ k)bΩt,t, (26)

in accordance with (13). In Section 4, we show that under mild regularity conditions n1/2(eΩt,k Ω t,k) N(0, s2 tk), where s2 tk = Ω 2 t,k + Ω t,tΩ k,k. If bs2 tk = bΩ2 t,k + bΩt,tbΩk,k is a consistent estimator of the asymptotic variance, then a (1 α) 100% conﬁdence interval for Ωt,k is [eΩt,k z1 α/2bstk/n1/2, eΩt,k + z1 α/2bstk/n1/2],

where zα is the α-quantile of a standard normal distribution. Equivalently, we can use the scaled test statistics eΩt,k to construct a test for H0 : Ω t,k = 0 versus H1 : Ω t,k = 0 with α signiﬁcance level. Namely, the null hypothesis is rejected if and only if the above (1 α) 100% conﬁdence interval does not contain 0. We will employ such tests in Section 4.

3.3. Latent Variable Graph

Recall that the structure of the latent variable graph is encoded by the sparsity pattern of Θ = C 1, which is generally diﬀerent from the cluster-average group as C 1 and Ω = (C + Γ ) 1 may have diﬀerent sparsity patterns. In this section, we focus on the inference on the component Θ t,k, for some 1 t < k K. Similar to the cluster-average graph, we ﬁrst discuss the likelihood approach. The negative log-likelihood corresponding to model (1) indexed by the parameter (Θ, Γ) is, up to some additive and multiplicative constants, ℓ(Θ, Γ) = log |(AΘ 1AT + Γ)| + tr(bΣ(AΘ 1AT + Γ) 1),

Eisenach, Bunea, Ning and Dinicu

where bΣ = n 1 Pn i=1 Xi XT i . It is straightforward to show that the Fisher information matrix for (Θ, Γ) is given by

I(Θ, Γ) = (M AT Γ 1Σ Γ 1AM ) 2 (M AT Γ 1Σ 1F T ) 2Dd DT d (F Σ 1Γ 1AM ) 2 DT d (F Σ 1F T ) 2Dd

where Dd = (Id 1T d ) (1T d Id), M = (Θ + AT Γ 1A) 1 and F = Id AM AT Γ 1. As seen in Section 3.1, the inference based on the likelihood or equivalently eﬃcient score function (8) requires the estimation of I12I 1 22 which, given the complicated structure of the information matrix (27), becomes analytically intractable. A solution to this problem is inference based on an empirical risk function similar to (16), but tailored to the latent variable graph. With a slight abuse of notation, and reasoning as in (19) and (20), we notice that, for each k,

EQ(Θ k, X) = 1

2ΘT k C Θ k e T k Θ k, (28)

has the target Θ k as a unique minimizer, where the loss function Q(Θ k, X) is deﬁned as

Q(Θ k, X) = 1

2ΘT k CΘ k e T k Θ k, (29)

and the matrix C := ( Cjk)j,k has entries

Cjk = 1 |G j||G k|

a G j ,b G k (Xa Xb Γab), (30)

and Γab = 0 if a = b and Γaa = Xa Xa 1 |G k| 1 P a G k,a =j Xa Xj. Since E( C) = C , the risk relative to the loss function in (29) is indeed (28), and the empirical risk is

Qn(Θ k) = 1

2ΘT k C(i)Θ k e T k Θ k), (31)

where C(i) is obtained by replacing X in Cjk by Xi. Similar to the cluster-average graph, Qn(Θ k) also depends on the unknown cluster structure. We estimate G k by b Gk, and deﬁne bΓ = (bΓab), where bΓab = 0 if a = b and bΓaa = 1

n Pn i=1(Xia Xia 1 | b Gk| 1 P

a b Gk,a =j Xia Xij),

and b C = ( b Cjk) where b Cjk = 1

n Pn i=1( 1 | b Gj|| b Gk| P

a b Gj,b b Gk(Xia Xib bΓab)). We replace C by

b C in (31) above and follow exactly the strategy of Section 3.2, with b S replaced by b C, to construct the corresponding pseudo-score function bh(Θ k), similarly to (23), and the initial estimator bΘ k, similarly to (24). We combine these quantities, following the general strategy (13), as above, to obtain the ﬁnal one-step estimator of Θ t,k, deﬁned as

eΘt,k = bΘt,k bh( bΘ k)bΘt,t, (32)

after observing that, in this case, I1|2 = 1/Θ t,t. Although the form of this estimator is similar to (26), derived for the cluster-average graph, the study of the asymptotic normality of eΘt,k reveals that its asymptotic variance is much more involved, as will be discussed in detail in Section 4.2.2.

High-Dimensional Inference for Cluster-Based Graphical Models

4. Main Theoretical Results

4.1. Assumptions

In this section we state the two assumptions under which all our results are proved.

Assumption 1 The covariance matrix C of Z satisﬁes: c1 λmin (C ) and maxt C t,t c2, for some absolute constants c1, c2 > 0.

Assumption 2 The matrix Γ satisﬁes: max1 i d γ i c3 for some absolute constant c3 > 0, where γ i are the entries of the diagonal matrix Γ .

Assumptions 1 and 2 are minimal conditions for inference on precision matrices. Furthermore, they imply the conditions needed for clustering consistency derived in Bunea et al. (2018) and discussed in Section 2.2, for n suﬃciently large. Their work only requires that λmin (C ) is bounded from below by a sequence that converges to zero, as soon as Σ max and Γ max are bounded. This is strengthened by our assumptions. In general, a constant lower bound on C is standard in any inference on graphical models and is needed to show the asymptotic normality of the estimator introduced above (Ren et al., 2013; Jankova and van de Geer, 2014; Jankov a and van de Geer, 2017).

4.2. Asymptotic Normality via Berry-Esseen-type Bounds

4.2.1. Results for the Cluster-Average Graph

In the section, we show that the estimators eΩt,k given by (26) are asymptotically normal, for all t < k . We deﬁne the sparsity of the cluster-average graph as s1 N such that

k=1 I(Ω j,k = 0) s1.

Recall that the estimators (22) and (24) depend on the tuning parameters λ and λ . In the

following theorem, we choose λ λ q

n . For notational simplicity, we use C to denote a generic constant, the value of which may change from line to line.

Theorem 3 If Assumptions 1 and 2 hold, we have

max 1 t<k K sup x R

P( b Tt,k x) Φ(x) C (d n)3 + Cs1 log(K n)

n1/2 + C (K n)3 (33)

where b Tt,k = n1/2(eΩt,k Ω t,k) bstk , bs2 tk = bΩ2 t,k + bΩt,tbΩk,k and C is a positive constant.

Theorem 3, proved in Appendix A, gives the rate of the normal approximation of the distribution of the scaled and centered entries eΩt,k. The right hand side in (33) is nonasymptotic and is valid for each K, n and d. Its ﬁrst, small, term is the price to pay for having ﬁrst used the data for clustering, and it is dominated by the other two terms. From this perspective, the clustering step is the least taxing, as long as we can ensure

Eisenach, Bunea, Ning and Dinicu

its consistency, which in turn can be guaranteed under the minimal assumptions 1 and 2 already needed for the remaining steps. The second, and dominant, term regards the normal approximation of the distribution of n1/2(eΩt,k Ω t,k). (34)

Speciﬁcally, as an intermediate step, Proposition 10 in Appendix A shows that the diﬀerence between the c.d.f. of (34), scaled by stk = q

Ω 2 t,k + Ω t,tΩ k,k, and that of a standard Gaussian

random variable is bounded by s1 log(K n)

n1/2 . Therefore, asymptotic normality holds as soon as this quantity converges to zero, which agrees with the weakest sparsity conditions for Gaussian graphical model inference in the literature (Ren et al., 2013; Jankov a and van de Geer, 2017). In addition, the asymptotic variance s2 tk agrees with the minimum variance bound in Gaussian graphical models (Jankov a and van de Geer, 2017). Thus, inference based on the empirical risk function (16) does not lead to any asymptotic eﬃciency loss. Unlike the previous works, we do not require the bounded operator norm condition, λmax(S ) C. This condition is avoided in our analysis by using a more convenient empirical risk function (16), as opposed to the log-likelihood in Jankova and van de Geer (2014), and a CLIMEtype initial estimator (24) satisfying (61), as opposed to the node-wise Lasso estimator in Jankov a and van de Geer (2017). The last term in the normal approximation is O((K n) 3) which is dominated by the second one, and is associated with the replacement of the true variance s2 tk by the estimate bs2 tk. Finally, we note that the powers of the ﬁrst and the third term in the right hand side of (33) can be replaced by 2 + δ, for any δ > 0, and a change in this power also changes the associated constant C in the term Cs1 log(K n)

n1/2 . As shown in Theorem 5, to obtain valid FDR control, we need K2/(K n)2+δ = o(1), which holds for any δ > 0. For simplicity, we choose δ = 1 which gives the power 3.

4.2.2. Results for the Latent Variable Graph

In this section we show that the estimators eΘt,k given by (32) are asymptotically normal, for all t < k. We deﬁne the sparsity of the latent graph as s0 N such that

k=1 I(Θ j,k = 0) s0.

Inference for the estimator eΘtk follows the general approach outlined in Section 3.1. We prove in Proposition 17 in Appendix B that

n1/2(eΘt,k Θ t,k) = 1 n1/2

i=1 Θ t,tv T t ( C(i)Θ k ek) + op(1), (35)

where v t is a K-dimensional vector with (v t )t = 1 and (v t ) t = w t with w t = (C t, t) 1C t,t. and C(i) is deﬁned in (30). The terms of the sum in display (35) are mean zero random variables, and their variance is

σ2 tk = E(Θ t,tv T t ( C(i)Θ k ek))2,

High-Dimensional Inference for Cluster-Based Graphical Models

which does not have an explicit closed form, unlike the asymptotic variance of the estimates of the entries of Ω . However, we show in Proposition 23 in Appendix B that σ2 tk admits an approximation that is easy to estimate: σ2 tk [(Θ t,k)2 + Θ k,kΘ t,t] s0

where m = min1 k K |G k|. Guided by this approximation, we estimate σ2 tk by

bσ2 tk = bΘ2 t,k + bΘk,k bΘt,t.

When all clusters have the equal size, we obtain K = d/m. Thus the O(s0

m) terms can be ignored asymptotically in the sense that s0

d = o(1), when the clusters are approximately balanced, and their number satisﬁes K2 = o(d). This is a reasonable assumption in most applied clustering problems. We note that the estimator bσ2 tk may be inconsistent when the size of some clusters is too small. However, we recall that our ultimate goal is to use these estimators for recovering the sparsity pattern of Θ under FDR control. To evaluate the sensitivity of our overall procedure to the size of the smallest cluster, we conduct simulation studies in Section 5. The results shows that the proposed method works well as soon as m > 4. The following theorem gives the Berry Esseen normal approximation bound for the estimators of the entries of the precision matrix corresponding to the latent variable graph.

Theorem 4 If Assumptions 1 and 2 hold, then

max 1 t<k K sup x R

P( b Tt,k x) Φ(x) C (d n)3 + C (K n)3 + Cs0 log(K n)

where b Tt,k = n1/2(eΘt,k Θ t,k) bσtk and C is a positive constant.

Compared to the average graph, the Berry-Esseen bound in (36) contains an additional O(s0

m) term, stemming from the approximation of the analytically intractable asymptotic variance by an estimable quantity. The proof is deferred to Appendix B.

4.3. Application to Post-clustering FDR Control

Given the edge-wise inferential results for the cluster-average and latent variable graphs established above, we explain in this section how to combine them to control graph-wise inferential uncertainty. Speciﬁcally, we view the task of recovering the sparsity pattern as a multiple testing problem by selecting:

H0;tk : Ω t,k = 0 vs. H1;tk : Ω t,k = 0 for all 1 t < k K, (37)

for the cluster-average graph, and

H 0;tk : Θ t,k = 0 vs. H 1;tk : Θ t,k = 0 for all 1 t < k K. (38)

for the latent variable graph. In the following, we apply the B-Y procedure by Benjamini and Yekutieli (2001) for FDR control in the cluster-average graph. The procedure for latent

Eisenach, Bunea, Ning and Dinicu

variable graph is identical. In this section, we are not claiming to develop a diﬀerent FDR procedure; rather we make the simple point that if you can obtain asymptotic p-values of d dependent statistics we can combine them in standard ways to control the FDR, which is a direct consequence of the Berry-Esseen bounds derived in Section 4.2. Deﬁne the set of true null hypotheses, H0 := {(t, k) : 1 t < k K, such that Ω t,k = 0}, as the set of indices (t, k) for which there is no edge between the nodes t and k. To control the error incurred by multiple testing, we focus on the false discovery rate (FDR), which is the average number of Type I errors relative to the total number of discoveries (Benjamini and Hochberg, 1995). Recall that eΩt,k is a consistent and asymptotically normal estimator of Ωt,k. We consider the natural test statistic f Wt,k = n1/2eΩt,k/bstk for H0;tk, where bs2 tk = bΩ2 t,k + bΩt,tbΩk,k. Given a cutoﬀτ > 0, the total number of discoveries is

1 t<k K I[|f Wt,k| > τ].

Similarly, the number of false positives or false discoveries is given by

(t,k) H0 I[|f Wt,k| > τ].

The FDR is formally deﬁned as the expected ratio of Vτ over Rτ,

FDR(τ) := E Vτ

Rτ I[Rτ > 0] ,

where the indicator function is included to remove the trivial case Rτ = 0. Our goal is to ﬁnd a data-dependent cutoﬀτ such that FDR(τ) α + o(1) for any given 0 < α < 1. This is the best one can hope for when, as in our case, the distribution of the test statistics f Wt,k is only available asymptotically. The Berry-Esseen type bounds, derived in Theorems 3 and 4, allow us to precisely quantify the price we must pay for the asymptotic approximation and are instrumental to understanding asymptotic FDR control. In addition, the test statistics f Wt,k for diﬀerent hypotheses are dependent. To allow for the dependence, instead of the standard B-H procedure (Benjamini and Hochberg, 1995), we consider the more ﬂexible B-Y procedure by Benjamini and Yekutieli (2001). The resulting FDR procedure is as follows: reject all hypotheses such that |f Wt,k| bτ, where

bτ := min τ > 0 : τ Φ 1 1 αRτ 2NBY |H|

where |H| = K(K 1)/2 is the total number of hypotheses. Our next result shows when the FDR based on our test statistics is guaranteed to be no greater than α, asymptotically. The proofs can be found in Appendix A.

1. Assume that the conditions in Theorem 3 hold. For any 0 < α < 1, we have

FDR(bτ) α + 2|H0|bn, (40)

High-Dimensional Inference for Cluster-Based Graphical Models

where bn = C (d n)3 + C (K n)3 + Cs1 log(K n)

n1/2 , and |H0| is the number of true null hypotheses in (37).

2. Assume that the conditions in Theorem 4 hold. If we deﬁne the test statistic as f Wt,k = n1/2 eΘt,k/bσtk, and bτ as in (39), we have

FDR(bτ) α + 2|H 0|cn, (41)

where cn = C (d n)3 + C (K n)3 + Cs0 log(K n)

m , and |H 0| is the number of true null hypotheses in (38).

This theorem implies that our method can control the FDR asymptotically, in the sense that FDR(bτ) α +op(1), provided that s1|H0| log(K n) = o(n1/2), for the average graph, and s0|H 0| log(K n) = o(n1/2) for the latent graph. Gaussian graphical model estimation under FDR control was recently studied by Liu (2013). They showed that the B-H procedure can control FDR asymptotically under certain conditions. Their approach is based on the following Cramer-type moderate deviation result using our terminology,

max (t,k) H0 sup 0 t 2 log K

P( b Tt,k t)

2 2Φ(t) 1 = o(1), (42)

where b Tt,k is test statistic they proposed for estimation of the Gaussian graphical model structure. The result (42) controls the relative error of the Gaussian approximation within the moderate deviation regime [0, 2 log K], whereas our result is based on the control of the absolute error via the Berry-Esseen-type Gaussian approximation. One of the main advantages of their result is that the number of clusters is allowed to be K = o(nr), where r is a constant that can be greater than 1. However, to prove (42), they required that the number of strong signals tends to inﬁnity, that is |{(t, k) : Ω t,k/ q

Ω k,kΩ t,t C p

log K/n}| ,

which reduces signiﬁcantly the parameter space for which inference is valid. In contrast, the aim of this work is the study of pattern recovery without conditions on the signal strength of the entries of the target precision matrices, as in practice it is diﬃcult to assess whether these conditions are met. For completeness, we provide the detailed analysis of the B-H procedure including technical conditions, theoretical results and further discussion of the B-H procedure in Appendix H. The overall message conveyed by Theorem 5 is that, in the absence of any signal strength assumptions, cluster-based graphical models can still be recovered, under FDR control, provided that the number of clusters K is not very high relative to n, and provided that the clusters are not very small. This further stresses the importance of an initial dimension reduction step in high-dimensional graphical model estimation. For instance, results similar to those of Theorem 3 can be derived along the same lines for the estimation of the sparsity pattern of Σ 1, for a generic, unstructured, covariance matrix of X, where one replaces K by d throughout, and s1 is replaced by s, the number of non-zero entries in the d d matrix Σ 1. Then, the analogue of (41) of Theorem 5 shows that FDR control in generic graphical models, based on asymptotic approximations of p-values, cannot generally be guaranteed if d > n. Our work shows that extra structural assumptions, for instance those motivated by clustering, do alleviate this problem. The simulation study presented in the next section provides further support to our ﬁndings.

Eisenach, Bunea, Ning and Dinicu

5. Numerical Results

This section contains simulations and a real data analysis that illustrate the ﬁnite sample performance of the inferential procedures developed in the previous sections for the latent variable graph and cluster-average graph, respectively.

5.1. Synthetic Datasets

In this subsection, we demonstrate the eﬀectiveness of the FDR control procedures on synthetic datasets. We consider two settings (n, d) = (800, 400) and (500, 1000), and in each setting we vary the value of K and m. The error variable E is sampled from the multivariate normal distribution with covariance Γ whose entries γ i are generated from U[0.25, 0.5]. Recall that the latent variable Z follows from Z N(0, C ). We consider three diﬀerent models to generate the graph structure of Z. Once the graph structure is determined, the corresponding adjacency matrix W is found, and the precision matrix Θ = C 1 is taken as Θ = c W + (|λmin (W) | + 0.2)I, where c = 0.3 when d = 400 and c = 0.5 when d = 1000. Finally, we assign the cluster labels for all variables so that all clusters have approximately equal size, which gives us the matrix A. Given A, Z and E, we can generate X according to the model (1). We consider the following three generating models for the graph structure of Z:

Scale-Free Graph The Scale-Free model is a generative model for network data, whose degree distribution follows a power law. To be concrete, we generate the graph one node at a time, starting with a 2 node chain. For nodes 3, . . . , K, node t is added and one edge is added between t and one of the t 1 previous nodes. Denoting by ki the current degree of node i in the graph, the probability that node t and node i are connected is pi = ki/(P i ki). The number of edges in the resulting graph is always K.

Hub Graph The K nodes of the graph are partitioned evenly into groups of size N. Within each group, one node is selected to be the group hub, and an edge is added between it and the remainder of its group. N is either 5 or 6 depending upon the choice of K. The number of edges in the graph is K(N 1)/N, so for K = 100 with N = 5, the number of edges in the resulting graph is 80.

Band3 Graph This model generates a graph with a Toeplitz adjacency matrix. There is an edge between node i and node j if and only if |i j| B, where we set B = 3 in this scenario. In general, the number of edges in a band graph with K nodes is given by BK 3

2B. So, for K = 100 and B = 3, the number of edges in the graph is 294.

Recall that X N(0, S ), where S is deﬁned in (15). To determine the structure of the average graph, we numerically compute S 1 and threshold the matrix at 10 8. We examine the empirical FDR of our procedures on some synthetic datasets. The following protocol is followed in all the experiments:

1. Generate the graph structure of Z as speciﬁed above.

2. Simulate n observations from our model (1).

High-Dimensional Inference for Cluster-Based Graphical Models

3. Estimate the cluster partition b G. For computational convenience, we apply the FORCE algorithm (Eisenach and Liu, 2019) with known K.

4. Construct the test statistic f Wt,k deﬁned in Section 4.3. The regularization parameters λ and λ are chosen by 5-fold cross validation.

5. Find the FDR cutoﬀ(39) at level α; we consider three cases α = 0.05, 0.1, 0.2.

The simulation is repeated 100 times. To compare with our Benjamini-Yekutieli based FDR procedure, we also report the empirical FDR based on the more classical Benjamini Hochberg procedure. That is, we apply the same procedures 1-4, but in step 5 we replace the FDR cutoﬀin (39) with the Benjamini-Hochberg (B-H) cutoﬀ, i.e.,

bτBH := min τ > 0 : τ Φ 1 1 αRτ

Table 1 compares the empirical FDR based on our method with the B-H procedure under diﬀerent m, K settings when d = 400. When m is relatively large (e.g., m = 20), both methods can control FDR on average, although our method is relatively more conservative. As expected, the FDR control problem becomes more challenging for large K and small m. In this case the graph contains more nodes and each cluster contains fewer variables. We observe that when m = 5 our method can still control FDR reasonably well but the B-H method produces empirical FDR far beyond the nominal level, especially for hub graphs. The inferior performance of the B-H procedure is due to the fact that the dependence among the test statistics is not accounted for, demonstrating that the B-Y procedure is indeed necessary at least in the current simulation settings. Finally, we examine the empirical power of the FDR procedure under each scenario, which is deﬁned as

I[f Wt,k bτ]

where H1 is the set of alternative hypothesis. Table 2 gives the empirical power of our FDR procedure, and the B-H procedure when d = 400. It shows that our procedure and the B-H procedure have very high power in all scenarios. The same phenomenon is observed when d is large, i.e., d = 1000; see Tables 3 and 4. In summary, our proposed procedure can identify most of the signals in the graph while keeping FDR well controlled.

Remark 6 (Inexact Recovery) While our FORCE algorithm guarantees the estimated number of clusters is always K since we use the true value K as the input, the recovered partition b G may still diﬀer from the true partition G . In order to compare our results to the ground-truth, we need to ﬁnd an aligned version of Θ (or Ω ). Speciﬁcally, we calculate the mapping f : [K] [K] deﬁned by

f(k) = argmax l [K] | b Gk G l |,

and then construct the matrix Θ A, deﬁned entry-wise by

(Θ A)s,t := (Θ )f(s),f(t).

Eisenach, Bunea, Ning and Dinicu

Once we have obtained the aligned ground-truth matrix Θ A, we can proceed with computing the metric of interest (FDR and power). Although the preceding discussion is in terms of Θ , the same procedure applies to Ω .

Remark 7 (Discussion of the B-H procedure) Based on the results in Tables 3 and 4, it is clear that the power of both our procedure and the B-H procedure are satisfactory. However, for both setups (d = 400, 1000), the B-H procedure leads to the inﬂated FDR relative to the nominal level. While Appendix H shows that the B-H procedure can control FDR asymptotically under certain conditions, it seems in numerical examples the dependence among the test statistics leads to substantial errors in FDR control that are indeed not ignorable.

α = 0.05 α = 0.1 α = 0.2 K m Scalefree Band3 Hub Scalefree Band3 Hub Scalefree Band3 Hub

B-Y Based Procedure

80 5 1.16% 1.42% 5.99% 2.01% 2.64% 7.60% 4.01% 4.73% 10.33% 66 6 0.93% 1.03% 1.08% 1.99% 1.98% 1.96% 3.73% 3.68% 4.00% 50 8 1.16% 0.99% 1.09% 2.20% 1.77% 1.98% 3.90% 3.49% 3.73% 20 20 0.99% 0.88% 1.30% 1.81% 1.66% 2.26% 3.75% 3.38% 4.71%

80 5 1.15% 1.40% 6.29% 2.14% 2.67% 7.89% 4.00% 4.70% 10.43% 66 6 0.91% 1.04% 1.04% 1.85% 2.00% 2.14% 3.70% 3.71% 3.83% 50 8 1.16% 1.00% 1.11% 2.22% 1.80% 1.94% 3.98% 3.49% 3.61% 20 20 0.94% 0.88% 1.23% 1.81% 1.68% 2.26% 3.65% 3.38% 4.71%

B-H Based Procedure

80 5 8.23% 9.24% 15.60% 15.01% 16.39% 24.00% 28.16% 28.97% 38.58% 66 6 7.18% 7.38% 7.31% 14.12% 14.01% 13.78% 26.56% 25.84% 27.28% 50 8 6.94% 6.75% 6.69% 13.23% 12.73% 12.74% 25.20% 23.62% 26.05% 20 20 5.43% 4.54% 6.38% 11.09% 8.47% 10.71% 21.46% 17.15% 20.75%

80 5 8.42% 9.26% 15.77% 15.25% 16.43% 24.15% 28.03% 29.02% 38.57% 66 6 7.21% 7.37% 7.45% 13.99% 13.92% 13.81% 26.19% 25.77% 27.17% 50 8 6.82% 6.78% 6.67% 13.23% 12.67% 12.89% 25.10% 23.61% 25.91% 20 20 5.38% 4.51% 6.49% 11.01% 8.52% 10.66% 21.58% 17.09% 20.71%

Table 1: Averaged empirical FDR for synthetic data experiments with d = 400 and n = 800.

Group Average Procedures Do Not Recover Latent Graphs

In this section we demonstrate through simulation studies that procedures speciﬁcally tailored to recovering the latent variable graph are necessary. We do this by using both the methodology for the latent variable graph and the group-average graph to recover the latent graph structure. Because the diﬀerences between the latent and group-average graph can be small, we use larger sample sizes than in the previous studies. The experimental procedure is almost exactly the same as before, but now we hold most of the parameters of the generating distribution ﬁxed and examine the eﬀects of the error variance Γ and sample size n on the eﬃcacy of our methodologies for recovering the latent variable graph. In all the experiments, we impose a Band3 structure on the latent graph, set Γ = γI, and use (d, K, m) = (400, 20, 20).

High-Dimensional Inference for Cluster-Based Graphical Models

α = 0.05 α = 0.1 α = 0.2 K m Scalefree Band3 Hub Scalefree Band3 Hub Scalefree Band3 Hub

B-Y Based Procedure

80 5 88.22% 99.99% 99.00% 91.10% 100.00% 99.04% 93.38% 100.00% 99.11% 66 6 93.62% 100% 100% 95.42% 100% 100% 96.86% 100% 100% 50 8 97.18% 100% 100% 97.90% 100% 100% 98.47% 100% 100% 20 20 100% 100% 100% 100% 100% 100% 100% 100% 100%

80 5 88.27% 99.99% 98.93% 91.06% 100.00% 98.98% 93.37% 100.00% 99.07% 66 6 93.62% 100% 100% 95.40% 100% 100% 96.91% 100% 100% 50 8 97.18% 100% 100% 97.86% 100% 100% 98.47% 100% 100% 20 20 100% 100% 100% 100% 100% 100% 100% 100% 100%

B-H Based Procedure

80 5 95.75% 100.00% 99.18% 97.15% 100.00% 99.23% 98.48% 100.00% 99.26% 66 6 97.85% 100% 100% 98.78% 100% 100% 99.34% 100% 100% 50 8 99.27% 100% 100% 99.57% 100% 100% 99.80% 100% 100% 20 20 100% 100% 100% 100% 100% 100% 100% 100% 100%

80 5 95.82% 100.00% 99.14% 97.14% 100.00% 99.18% 98.53% 100.00% 99.23% 66 6 97.88% 100% 100% 98.78% 100% 100% 99.34% 100% 100% 50 8 99.24% 100% 100% 99.59% 100% 100% 99.78% 100% 100% 20 20 100% 100% 100% 100% 100% 100% 100% 100% 100%

Table 2: Averaged FDR power for synthetic data experiments with d = 400 and n = 800.

α = 0.05 α = 0.1 α = 0.2

K m Scalefree Band3 Hub Scalefree Band3 Hub Scalefree Band3 Hub

B-Y Based Procedure

80 12 1.40% 1.52% 1.78% 2.78% 2.81% 3.29% 5.71% 5.09% 6.30% 66 15 1.23% 1.29% 1.31% 2.62% 2.39% 2.69% 5.08% 4.33% 4.84% 50 20 1.05% 0.94% 1.26% 2.27% 1.85% 2.79% 4.28% 3.65% 4.58% 20 50 1.04% 1.03% 1.05% 1.81% 1.96% 2.20% 3.46% 3.26% 4.19%

80 12 1.55% 1.57% 1.78% 2.78% 2.80% 3.26% 5.57% 5.13% 6.13% 66 15 1.30% 1.30% 1.24% 2.64% 2.35% 2.64% 5.07% 4.34% 4.89% 50 20 1.09% 0.93% 1.26% 2.32% 1.89% 2.75% 4.29% 3.64% 4.51% 20 50 1.09% 1.01% 1.05% 1.91% 1.96% 2.14% 3.50% 3.28% 4.25%

B-H Based Procedure

80 12 10.99% 10.04% 12.00% 20.04% 18.18% 21.47% 35.50% 31.46% 37.24% 66 15 9.28% 8.20% 9.14% 16.88% 14.94% 16.82% 31.53% 27.75% 31.81% 50 20 7.73% 6.91% 7.96% 14.74% 13.13% 15.33% 27.03% 23.92% 28.72% 20 50 5.00% 4.85% 6.38% 11.01% 9.38% 11.60% 21.49% 17.00% 23.15%

80 12 10.99% 10.00% 12.00% 20.29% 18.20% 21.38% 35.56% 31.48% 37.10% 66 15 9.28% 8.24% 8.99% 17.03% 14.87% 16.87% 31.49% 27.82% 31.63% 50 20 7.60% 6.93% 7.94% 14.61% 13.13% 15.40% 27.12% 23.94% 28.71% 20 50 4.95% 4.91% 6.38% 10.97% 9.35% 11.60% 21.36% 16.94% 23.41%

Table 3: Averaged empirical FDR for synthetic data experiments with d = 1000 and n = 500.

First we note that if Γ = 0, the two graph structures are actually the same it is only as we introduce error into the observed variables that the latent variable and group-average graphs begin to diﬀer. In Tables 5 and 6 we see that as we increase the error variance, the the group averages methodology applied to the latent graph is unable to control the Type I error rate and the FDR at the desired level. Further, by increasing the sample size n,

Eisenach, Bunea, Ning and Dinicu

α = 0.05 α = 0.1 α = 0.2

K m Scalefree Band3 Hub Scalefree Band3 Hub Scalefree Band3 Hub

B-Y Based Procedure

80 12 71.42% 99.95% 100.00% 77.14% 99.97% 100.00% 82.29% 99.98% 100.00% 66 15 82.98% 99.98% 100.00% 86.32% 99.99% 100.00% 90.02% 100.00% 100.00% 50 20 90.45% 99.99% 100.00% 93.04% 99.99% 100.00% 94.86% 100.00% 100.00% 20 50 99.95% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%

80 12 71.54% 99.95% 100.00% 77.13% 99.97% 100.00% 82.25% 99.98% 100.00% 66 15 82.74% 99.98% 100.00% 86.26% 99.99% 100.00% 90.08% 100.00% 100.00% 50 20 90.45% 99.99% 100.00% 93.00% 99.99% 100.00% 94.76% 100.00% 100.00% 20 50 99.95% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%

B-H Based Procedure

80 12 87.15% 99.99% 100.00% 91.33% 100.00% 100.00% 94.66% 100.00% 100.00% 66 15 93.57% 100.00% 100.00% 95.63% 100.00% 100.00% 97.38% 100.00% 100.00% 50 20 96.27% 100.00% 100.00% 97.78% 100.00% 100.00% 98.94% 100.00% 100.00% 20 50 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%

80 12 87.22% 99.99% 100.00% 91.33% 100.00% 100.00% 94.58% 100.00% 100.00% 66 15 93.52% 100.00% 100.00% 95.65% 100.00% 100.00% 97.35% 100.00% 100.00% 50 20 96.33% 100.00% 100.00% 97.78% 100.00% 100.00% 98.94% 100.00% 100.00% 20 50 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%

Table 4: Averaged FDR power for synthetic data experiments with d = 1000 and n = 500.

we observe that the performance of the group average procedures applied to recovering the latent graph gets worse this is as expected because with increased n, the tests become accurate, but are by construction estimating the wrong graph!

L.V. Test Procedure G.A. Test Procedure n γ α = 0.01 α = 0.05 α = 0.1 α = 0.2 α = 0.01 α = 0.05 α = 0.1 α = 0.2

1600 0.5 0.56% 2.67% 5.83% 10.22% 0.61% 3.00% 6.39% 10.39% 1.0 0.78% 3.11% 5.22% 10.61% 0.78% 3.83% 6.89% 13.39% 1.5 0.61% 2.89% 5.11% 8.72% 0.94% 3.33% 6.22% 10.50% 2.0 1.00% 3.39% 6.39% 11.06% 1.39% 4.78% 7.61% 12.44% 2.5 0.56% 3.50% 6.17% 11.22% 1.89% 5.33% 9.17% 15.28% 3.0 1.72% 5.11% 8.61% 14.28% 3.00% 8.56% 11.89% 17.61% 3.5 1.06% 3.83% 6.78% 11.72% 3.39% 8.44% 12.94% 18.44% 4.0 1.50% 4.33% 8.33% 12.83% 5.22% 10.94% 14.22% 20.78%

3200 0.5 0.28% 1.89% 3.22% 7.50% 0.44% 2.17% 4.00% 8.11% 1.0 0.44% 3.00% 5.11% 9.44% 1.11% 4.22% 7.83% 14.17% 1.5 0.44% 2.72% 4.50% 9.56% 1.78% 5.61% 9.00% 13.61% 2.0 0.89% 3.56% 6.78% 11.61% 2.94% 7.50% 11.33% 16.78% 2.5 1.06% 3.61% 6.06% 10.67% 2.78% 9.67% 13.44% 19.00% 3.0 1.50% 5.44% 8.39% 14.33% 7.72% 13.78% 18.00% 22.78% 3.5 1.67% 5.44% 9.72% 15.72% 7.61% 13.83% 18.22% 22.89% 4.0 2.78% 5.83% 10.06% 16.28% 11.28% 17.11% 22.22% 27.06%

Table 5: Averaged Type I for the latent variable graph using both the latent variable and group averages methodology.

High-Dimensional Inference for Cluster-Based Graphical Models

L.V. Test Procedure G.A. Test Procedure n γ α = 0.01 α = 0.05 α = 0.1 α = 0.2 α = 0.01 α = 0.05 α = 0.1 α = 0.2

1600 0.5 0.83% 3.81% 4.95% 11.44% 0.41% 3.23% 5.51% 12.57% 1.0 0.62% 3.23% 6.07% 10.11% 1.03% 4.00% 7.51% 14.13% 1.5 0.21% 3.61% 5.88% 10.61% 1.23% 4.76% 6.25% 12.25% 2.0 1.23% 3.81% 6.43% 13.51% 1.64% 5.88% 9.26% 14.74% 2.5 0.83% 4.00% 7.16% 12.25% 3.23% 5.88% 10.95% 16.52% 3.0 2.24% 6.61% 10.11% 16.81% 3.23% 9.94% 15.19% 20.66% 3.5 1.23% 4.95% 6.98% 13.36% 3.81% 10.78% 15.79% 22.08% 4.0 1.64% 4.95% 8.40% 15.19% 6.25% 14.29% 17.95% 24.88%

3200 0.5 0.41% 2.44% 4.00% 7.51% 0.41% 2.64% 4.19% 8.75% 1.0 0.83% 3.42% 6.80% 10.61% 1.03% 4.57% 8.05% 16.38% 1.5 0.41% 3.23% 5.70% 9.09% 2.64% 7.34% 10.45% 16.81% 2.0 0.62% 4.38% 6.98% 12.25% 4.00% 9.77% 13.36% 20.27% 2.5 1.03% 4.57% 7.51% 11.60% 4.00% 12.09% 16.81% 23.44% 3.0 1.84% 7.51% 10.45% 16.38% 9.94% 17.81% 22.95% 28.14% 3.5 1.64% 6.98% 11.28% 18.92% 10.95% 18.23% 23.08% 27.93% 4.0 4.38% 8.57% 11.60% 18.92% 15.94% 22.95% 27.16% 31.82%

Table 6: Averaged empirical FDR error rates for the latent variable graph using both the latent variable and group averages methodology.

5.2. f MRI Dataset

The study of brain relationships in humans via modern neuroimaging techniques has attracted an enormous amount of scientiﬁc interest in recent years. One fundamental goal in these studies is to understand the functional communication between brain regions, which plays a key role in complex cognitive processes. To study the functional connectivity network, Power et al. (2011) partitioned the human brain into regions of interest (ROIs) and represented each ROI by a node in a graph. They identiﬁed several subgraphs of highly related nodes (i.e, clusters), which can represent the major functional systems of the brain. One of the main goals in their work is to understand the relationship between diﬀerent subgraphs or clusters. To this end, they estimated the relationship between clusters by thresholding the correlation matrix of all nodes in the same clusters in an adhoc way. The analysis of network of clusters can be conducted under a rigorous statistical framework by using the proposed cluster-based graphical models. In the section, we will apply the proposed method to study the dependence structure among functional systems of the brain.

As an illustration, we focus on the publicly available resting-state f MRI data from the Neuro-bureau pre-processed repository (Bellec et al., 2015). Speciﬁcally, we use the data from patient 1018959, session 1 in the KKI dataset. This f MRI dataset was pre-processed using the Athena pipeline and mapped to T1 MNI152 coordinate space. We choose this dataset to make our experiments easily reproducible, as the data are available pre-processed using standard alignment and denoising procedures. In a recent work, Luo (2014) estimated the f MRI network in a similar clustering model by an iterative procedure that maximizes the likelihood function with respect to the unknown cluster assignment and the precision

Eisenach, Bunea, Ning and Dinicu

matrix. However, their optimization problem is non-convex due to estimating the unknown cluster assignment and the algorithm is computationally intensive in high dimension. Following Power et al. (2011), we directly extract the 264 ROIs, which gives us d = 264 mean activities across n = 148 time periods. Since the FORCE algorithm requires to know the number of clusters in advance, we ﬁrst apply the CORD algorithm (Bunea et al., 2018) to estimate the number of clusters, which gives us K = 53. Then we reapply the FORCE algorithm with K = 53 to obtain the corresponding clusters. Recall that the goal is to analyze the dependence structure among functional systems of the brain. Under the latent variable model (1), we treat the measurement of ROIs as the observed variable X and the underlying brain function as the latent variable Z. Thus, the relationship between brain functional systems can be modeled by the latent variable graph. Using the FDR control procedures with α = 0.01, the estimated latent variable graph is shown in Figure 1. As a comparison, we also show the cluster average graph in Figure 2. However, in this example the biological meaning of the cluster-average variables X is unclear so that the interpretation of the cluster average graph seems diﬃcult. For purposes of clarity, we only display the nodes and connections corresponding to the 10 largest groups in these two ﬁgures. The groups are colored according to the functional network the majority of their nodes belong to as given in Power et al. (2011). It is known that the default mode may comprise multiple interacting subsystems (Andrews-Hanna et al., 2010). Thus our graph contains multiple nodes for the default mode, where each node may represent diﬀerent subsystems. To demonstrate the diﬀerence between the latent variable graph and the cluster average graph, we focus on three regions, default mode (pink), dorsal attention (brown) and salience (yellow). In the latent variable graph, dorsal attention and salience are connected and both of them are strongly connected with many subsystems of the default mode. This dependence structure is supported by the neuroscience theory that the default mode, engaged during rest and internally directed tasks, should exhibit anticorrelation with networks engaged during externally directed tasks, such as the dorsal attention network and the salience network (Zhou et al., 2017). Moreover, this ﬁnding is also consistent with many existing empirical results such as Fox et al. (2005); Fransson (2005); Smith et al. (2009); Raichle (2015). However, in the cluster-averages graph, dorsal attention is conditionally independent of any subsystems of the default mode and salience is only loosely connected with the default mode. In summary, the latent variable graph seems to be more biologically meaningful than the cluster-average graph in order to interpret the dependence structure of the functional systems of the brain.

Acknowledgments

We would like to thank the reviewers and editor for their suggestions to improve this paper. Florentina Bunea was partially supported by NSF-DMS 712709. Yang Ning was partially supported by NSF-DMS 1854637. We are grateful to Xi Luo for help with the interpretation of our data analysis results.

High-Dimensional Inference for Cluster-Based Graphical Models

FMRI - Latent Graph ( b K = 53)

Uncertain Auditory Default mode Salience Cingulo-opercular Task Control Dorsal Attention Sensory/Somatomotor Hand

Figure 1: Recovered latent graph structure between 10 largest clusters in f MRI data with FDR level α = 0.01 colored according to their functions.

FMRI - Averages Graph ( b K = 53)

Uncertain Auditory Default mode Salience Cingulo-opercular Task Control Dorsal Attention Sensory/Somatomotor Hand

Figure 2: Recovered cluster-averages graph structure between 10 largest clusters in f MRI data with FDR level α = 0.01.

Eisenach, Bunea, Ning and Dinicu

Appendix A. Estimation in the Cluster-Average Graph

In this section we provide the proofs of the results needed for establishing the asymptotic normality of the estimators of the entries of Ω . These results make use of the fact that consistent clustering is possible, under our assumptions, as stated below.

Lemma 8 Let E =: { b G = G }, for b G estimated by either the COD or the PECOK algorithm of Bunea et al. (2018). Then, under Assumptions 1 and 2, we have

P(E) 1 C (d n)3 .

The conclusion of this Lemma is proved in Theorem 3, for the COD algorithm, and Theorem 4, for the PECOK algorithm, of Bunea et al. (2018). Lemma 8 allows us to replace b G by G in all the results below, while incurring a small error, of order O C (d n)3 , which will be shown to be dominated by other error bounds.

Remark 9 While Assumptions 1 and 2 are made for C , they do imply that c1 λmin (S ) and maxt S t,t c2 + c3 holds for S . Furthermore Lemma 29 implies the same restricted eigenvalue condition on λmin (S ) as on C .

A.1. Main Proofs for the Cluster-Average Graph Estimators

Before proving Theorem 3 and Claim 1 of Theorem 5, we state two propositions one regarding the asymptotic normality of eΩt,k and the other regarding the convergence rate of the variance estimator. The proofs are deferred until after that of the main result

Proposition 10 (Asymptotic Normality of eΩt,k) Under the same conditions as in Theorem 3, we have

max 1 t<k K sup x R

P n1/2(eΩt,k Ω t,k)

stk x, E Φ(x) C (K n)3 + Cs1 log(K n)

Proposition 11 (Convergence Rate of the Variance Estimator) Under the same conditions as in Theorem 3, we have

max 1 t<k K |bs2 t,k s2 t,k| C

s1 log(K n)

n , max 1 t<k K

s1 log(K n)

with probability at least 1 (K n) 3.

Proof of Theorem 3

The proof relies crucially on Proposition 10. Once that result is established, the proof of Theorem 3 follows standard steps as explained below.

Denote E = {max1 t<k K |bstk/stk 1| r}, where r = C q

s1 log(K n)

n , and let E signify

the complement of the event E . Let Tt,k and b Tt,k denote the statistics n1/2(eΩt,k Ω t,k) stk and

High-Dimensional Inference for Cluster-Based Graphical Models

n1/2(eΩt,k Ω t,k) bstk , respectively. We ﬁrst consider the bound

P b Tt,k x Φ(x) P b Tt,k x, E , E Φ(x) + P( E ) + P( E)

= P Tt,k xbstk

stk , E , E Φ(x) + P( E ) + P( E).

Proposition 11 implies P( E ) C(K n) 3 and Lemma 8 above implies P( E) C(d n) 3

for some constant C. In addition, for x 0,

P Tt,k xbstk

stk , E , E Φ(x) P Tt,k x(1 + r), E Φ(x)

= n P Tt,k x(1 + r), E Φ(x(1 + r)) o + n Φ(x(1 + r)) Φ(x) o . (43)

For the ﬁrst term, Proposition 10 implies

max 1 t<k K sup x R

P Tt,k x(1 + r), E Φ(x(1 + r)) s1 log(K n)

By the mean value theorem, the second term Φ(x(1+r)) Φ(x) = φ(x(1+tr))xr, for some t [0, 1]. It is easily seen that supx R supt [0,1] |φ(x(1 + tr))x| C for some constant C. Plugging it into (43), we obtain

max 1 t<k K sup x R

n P Tt,k xbstk

stk , E , E Φ(x) o s1 log(K n)

n1/2 + r s1 log(K n)

n1/2 . (44)

When x < 0, similar to (43), the bound is

P Tt,k xbstk

stk , E , E Φ(x)

n P Tt,k x(1 r), E Φ(x(1 r)) o + n Φ(x(1 r)) Φ(x) o .

Thus (44) holds for x < 0 as well. Combining these results, we obtain

max 1 t<k K sup x R

n P b Tt,k x Φ(x) o s1 log(K n)

n1/2 + 1 (K n)3 + 1 (d n)3 .

Following the similar argument, we can also derive

max 1 t<k K sup x R

n Φ(x) P b Tt,k x o s1 log(K n)

n1/2 + 1 (K n)3 + 1 (d n)3 .

This completes the proof.

Proof of Theorem 5, Claim 1

The proof follows verbatim that of Theorem 8.5 in Giraud (2014), with the exception of the fact that exact p-values are replaced by approximate p-values, including the rate

Eisenach, Bunea, Ning and Dinicu

of approximation. We include the full proof for the convenience of the reader. By the deﬁnition of the FDR,

FDR(bτ) = E h P (t,k) H0 I[|f Wt,k| > bτ]I[Rbτ > 0]

(t,k) H0 E h I[|f Wt,k| > bτ]I[Rbτ > 0]

To handle the Rbτ in the denominator, we use the identity 1 = P i=Rbτ Rbτ i(i+1). This implies

1 i(i + 1) =

Plugging this into (45) and bringing the expectation inside the summation gives that

FDR(bτ) = X

1 i(i + 1)E h I[|f Wt,k| > bτ]I[Rbτ > 0]I[i Rbτ] i

1 i(i + 1)E h I[|f Wt,k| > Φ 1 1 αRbτ 2NBY |H|

]I[Rbτ > 0]I[i Rbτ] i

1 i(i + 1)E h I[|f Wt,k| > Φ 1 1 α(i |H|)

where the second line follows from the deﬁnition of the FDR cutoﬀand the last inequality holds since Rbτ (i |H|). The Berry-Esseen bound in Theorem 3 implies that

P(|f Wt,k| > Φ 1 1 α(i |H|)

NBY |H| + 2bn.

Thus, it follows that

NBY |H| + 2bn

i i(i + 1)NBY +

|H| i(i + 1)NBY

1 i + 1 1 NBY + |H| |H| + 1 1 NBY

α + 2|H0|bn,

where the last step follows from |H0|

|H| 1 and the deﬁnition of NBY . This completes the proof of the Theorem 5, Claim 1.

High-Dimensional Inference for Cluster-Based Graphical Models

Proof of Proposition 10

Proof The proof is done in two steps. In Step 1, we show that intersected with the event E,

n1/2|(eΩt,k Ω t,k)/st,k + Ω t,th(Ω k)/st,k| s1 log(K n)

n1/2 , (47)

with probability at least 1 C/(K n)3 and then use Lemma 16 to obtain the result. To prove (47), we decompose it as

n1/2|(eΩt,k Ω t,k) + Ω t,th(Ω k)|

= n1/2|(bΩt,k Ω t,k) bΩt,tbh(bΩ k) + Ω t,th(Ω k)|

n1/2|(bΩt,k Ω t,k) Ω t,t(h(bΩ k) h(Ω k))| | {z } I.1 + n1/2|Ω t,t(bh(bΩ k) h(bΩ k))| | {z } I.2

+ n1/2|(bΩt,t Ω t,t)bh(bΩ k)| | {z } I.3

In the following, we study these three terms separately. Recall that h(Ω k) = v T t ( b SΩ k ek), and bh(Ω k) = bv T t ( b SΩ k ek), where v t is a K-dimensional vector with (v t )t = 1 and (v t ) t = w t with w t = (S t, t) 1S t,t. Term I.1 reduces to

|I.1| = n1/2|(bΩt,k Ω t,k) Ω t,tv T t b S(bΩ k Ω k)|

n1/2|(bΩt,k Ω t,k)(1 Ω t,t( b St,t w T t b S t,t))| (48)

+ n1/2Ω t,t|( b St, t w T t b S t, t)(bΩ t,k Ω t,k)|. (49)

Note that 1/Ω t,t = S t,t w T t S t,t. The term in (48) can be bounded by

n1/2|(bΩt,k Ω t,k)Ω t,t(b St,t S t,t)| + n1/2|(bΩt,k Ω t,k)Ω t,tw T t ( b S t,t S t,t)|

n1/2|Ω t,t| bΩ k Ω k 1 max( b S S max, w T t ( b S t S t ) )

Cs1 log(K n)

with probability at least 1 (K n) 3, by λmax(Ω ) C and the concentration and error bound results in Lemmas 13, 14, 15. The term in (49) can be bounded by

n1/2Ω t,t b St, t w T t b S t, t bΩ t,k Ω t,k 1 Cs1 log(K n)

with probability at least 1 (K n) 3 again by Lemmas 14, 15. Thus, |I.1| s1 log(K n)

n1/2 with probability at least 1 (K n) 3. For term I.2, we have

|I.2| = n1/2Ω t,t|(bvt v )T ( b S bΩ k ek)|

n1/2Ω t,t bvt v 1 b S bΩ k ek Cs1 log(K n)

Eisenach, Bunea, Ning and Dinicu

with probability at least 1 (K n) 3 by Lemma 15 and the constraint of the CLIME-type estimator. To control term I.3, ﬁrst we observe that

|bh(bΩ k)| = |bv T t ( b S bΩ k ek)|

|v T t ( b SΩ k ek)| + |v T t b S(bΩ k Ω k)| + |(bvt v t )T ( b S bΩ k ek)|

v t 1 b SΩ k ek + b St, t w T t b S t, t bΩ k Ω k 1

+ bvt v t 1 b S bΩ k ek .

As shown in the proof of Lemma 16, v t 1 s1/2 1 v t 2 Cs1/2 1 . The rest of the bounds on the above terms follows easily from Lemmas 14, 15. Thus, we have |bh(bΩ k)| C(s1 log(K n)/n)1/2 with high probability. Since

|bΩt,t Ω t,t| C(s1 log(K n)/n)1/2,

by Lemma 15, we obtain that |I.3| s1 log(K n)

n1/2 with probability at least 1 (K n) 3. It is easily seen that Ω t,t 1 S t,t C > 0, see Remark 9. This implies that s2 t,k = Ω t,tΩ k,k + Ω 2 t,k is lower bounded by a positive constant. The proof of (47) is complete. In step 2, we need to verify that

max 1 t<k K sup x R

P n1/2Ω t,th(Ω k) stk x Φ(x) C (K n)3 + 1 n1/2 ,

which has been done in Lemma 16. Thus, combining with result (47), we can use the same simple union bound as in the proof of Theorem 3 to obtain the desired result.

Proof of Proposition 11

Proof By Lemma 15, under the event b G = G , we have

max 1 t,k K |bΩt,k Ω t,k| max 1 k K bΩ k Ω k 2 C1

s1 log(K n)

with probability at least 1 C4 (K n)3 . Under this event,

max 1 t<k K |bs2 t,k s2 t,k| = max 1 t<k K |bΩ2 t,k + bΩt,tbΩk,k (Ω 2 t,k + Ω t,tΩ k,k)|

max 1 t<k K |(bΩt,k Ω t,k)(bΩt,k + Ω t,k)|

+ bΩt,t|bΩk,k Ω k,k| + Ω k,k|bΩt,t Ω t,t|

s1 log(K n)

n (4 Ω max + δ)

s1 log(K n)

High-Dimensional Inference for Cluster-Based Graphical Models

for some constant δ > 0 since Ω max λmax(Ω ) C. It is easily veriﬁed that Ω t,t 1 S t,t C > 0 (see Remark 9). This implies that s2 t,k = Ω t,tΩ k,k + Ω 2 t,k is lower bounded by a positive constant. Thus,

max 1 t<k K

st,k 1 max 1 t<k K

bs2 t,k s2 t,k st,k(st,k + bst,k)

s1 log(K n)

A.2. Key Lemmas for Estimators of the Cluster-Average Graph

Remark 12 In the following proofs, we always assume the event E = { b G = G } holds. Using a similar argument to that used in the proof of Theorem 3, the following bounds will hold with probability at least 1 C (K n)3 .

Lemma 13 (Consistency of b S) If Assumptions 1 and 2 hold, then with probability greater than 1 C (K n)3 ,

b S S max C

for some constant C dependent only on c1, c2, and c3 from Assumptions 1 and 2.

Proof The proof follows from the proof of Theorem 1 in Cai et al. (2011).

Lemma 14 (Concentration of Gradient and Hessian) If Assumptions 1 and 2 hold, then with probability greater than 1 C (K n)3 , we have that

(a) max1 k K b SΩ k ek C1 q

(b) max1 t K b St, t w T t b S t, t C2 q

(c) max1 t K w T t ( b S t, t S t, t) C2 q

for some constants C1 and C2 dependent only on c1, c2, and c3 from Assumptions 1 and 2.

Proof We start from the decomposition

b S S = (A T A ) 1A T (bΣ Σ )A (A T A ) 1.

Eisenach, Bunea, Ning and Dinicu

Denoting by B = A T A , we can write

(A T A ) 1A T (bΣ Σ )A (A T A ) 1

i=1 B 1A T n (A Zi + Ei)(A Zi + Ei)T A C A T Γ o A B 1

i=1 Zi ZT i C + 1

i=1 Zi ET i A B 1 + 1

i=1 B 1A T Ei ZT i

i=1 B 1A T (Ei ET i Γ )A B 1. (50)

First, note that b SΩ k ek = ( b S S )Ω k, and ||Ω k||2 λmax (Ω ) λmin (S ) 1. As seen in Remark 9, we can bound the smallest eigenvalue of S from below by λmin (C ). We therefore obtain that ||Ω k||2 c1 + c3. Using this bound, we apply Lemma 24, part (d) to control the term 1 n

Zi ZT i C Ω k.

The remaining terms in (50) can similarly be controlled using Lemma 25 and Lemma 26. Applying the triangle inequality therefore gives

max 1 k K b SΩ k ek C1

with probability at least 1 C (K n)3 , concluding the proof of claim (a). For the remaining two claims, we can rewrite

w t = S t,t S T t,t(S t, t) 1S t,t Ω t,t = 1 Ω t,t Ω t,t

by the block matrix inverse formula. Using Lemma 30, it follows that ||w t ||2 λmin (Ω ) maxt S t,t. Then we can see that

max 1 t K || b St, t w T t b S t, t|| = max 1 t K ||( b St, t S t, t) w T t ( b S t, t S t, t)||

max 1 t K ||( b St, t S t, t)|| | {z } (i)

+ max 1 t K ||w T t ( b S t, t S t, t)|| | {z } (ii)

By using Lemma 13, (i) is bounded with high probability. Following the same step as in the

proof of (a), we have that (ii) C2 q

n , with probability at least 1 C (K n)3 . Part (c) is the same as the term (ii), concluding the proof.

Lemma 15 (Consistency of Initial Estimator) If Assumptions 1 and 2 hold, then,

(a) max1 k K bΩ k Ω k 1 C1s1 q

High-Dimensional Inference for Cluster-Based Graphical Models

(b) max1 k K bΩ k Ω k 2 C1 q

s1 log(K n)

(c) max1 t K bvt v t 1 C2s1 q

(d) max1 k t K |(bvt v t )T b S(bΩ k Ω k)| C3 s1 log(K n)

with probability at least 1 C4 (K n)3 . C1, C2, C3, C4 and C5 are constants, dependent only upon c0, c1, c2, and c3 from Assumptions 1 and 2.

Proof The proof follows from the same argument as in the proof of Lemma 21 with Lemma 20 replaced by Lemma 14.

Lemma 16 (CLT for the Pseudo-Score Function) Recall that s2 tk = Ω 2 t,k + Ω t,tΩ k,k.

Let Fn denote the CDF of n1/2v T t ( b SΩ k ek)/(stk/Ω t,t). If Assumptions 1 and 2 hold, then max 1 t<k K sup x R |Fn(x) Φ(x)| C(n 1/2 + (d n) 3),

where C is a constant dependent only upon c0, c1, and c2.

Proof The proof is very similar to that of Lemma 22, so we only summarize the key steps here. For the cluster averages score function, we have the similar bound that

Fn(x) Φ(x) e Fn(x) Φ(x) + P( E),

where E is the event b G = G , and e Fn(x) is the CDF of n 1/2 Pn i=1 v T t ( Xi XT i Θ k ek)/(stk/Θ tt). Lemma 8 implies that P( E) (d n) 3. As in Lemma 22, we can verify the Lyapunov condition to control e Fn(x) Φ(x). Lastly, from the Isserlis theorem, we get that

Var(v T 1 X XT v2) = (v T 1 S v1)(v T 2 S v2) + (v T 1 S v2)2,

for any vector v1 and v2. Using this, it is straightforward to shown

E[v T t ( Xi XT i Θ k ek)2] = (stk/Θ tt)2,

concluding the proof.

Appendix B. Estimation in the Latent Variable Graph

This section contains the proofs which establish the asymptotic normality of the estimator of Θ .

Eisenach, Bunea, Ning and Dinicu

B.1. Main Proofs for the Latent Variable Graph Estimators

As in Section B we develop the proofs of Theorem 4 and Claim 2 of Theorem 5 by ﬁrst establishing two supporting propositions.

Proposition 17 (Asymptotic Normality of eΘt,k) Under the same conditions as in Theorem 4, we get

max 1 t<k K sup x R

n1/2(eΘt,k Θ t,k)

σt,k < x, E

Φ(x) C (K n)3 + Cs0 log(K n)

Proof The proof follows all the steps of Proposition 10, with Lemma 13 replaced by Lemma 19, 14 by 20, 15 by 21 and 16 by 22.

Proposition 18 (Convergence Rate of the Variance Estimator) Under the same conditions as in Theorem 4, we get

max 1 t<k K |bσ2 t,k σ2 t,k| C

s0 log(K n)

max 1 t<k K

s0 log(K n)

with probability at least 1 (K n) 3.

Proof Similar to the proof of Proposition 11, we can prove that

max 1 t<k K |bσ2 t,k (Θ 2 t,k + Θ t,tΘ k,k)| C

s0 log(K n)

with probability at least 1 (K n) 3. Then, by Lemma 23, we obtain

max 1 t<k K |bσ2 t,k σ2 t,k| C

s0 log(K n)

The second statement can be similarly derived.

Having established the asymptotic normality of eΘt,k and the convergence rate of bσ2 t,k, we proceed with the proofs of the main results.

Proof of Theorem 4

Proof The proof follows in exactly the same manner as that of Theorem 3, but invokes Proposition 18, instead of Proposition 11, and Proposition 17, instead of Proposition 10, as one needs to establish diﬀerent intermediate results, speciﬁcally tailored to estimation of the latent graph.

High-Dimensional Inference for Cluster-Based Graphical Models

Proof of Theorem 5, Claim 2

Proof The proof follows in exactly the same manner as that of Theorem 5, Claim 1.

B.2. Key Lemmas for Estimators of the Latent Graph

As before, we assume in the following proofs that the event E = { b G = G } holds. Using a similar argument to that used in the proof of Theorem 4, the following bounds will hold with probability at least 1 C (K n)3 .

Lemma 19 (Consistency of b C) If Assumptions 1 and 2 hold, then with probability greater than 1 C (K n)3 ,

b C C max C

for some constant C dependent only on c1, c2, and c3 from Assumptions 1 and 2.

Proof Denoting B = A T A , we begin by using the decomposition

b C C = B 1A T (bΣ Σ )A B 1 B 1A T (bΓ Γ )A B 1. (51)

Next, we can write for i G k,

(bΓ Γ )i,i = bγi γ i

= bΣi,i 1 |G k| 1

j G k,j =i bΣi,j γ i

= bΣi,i 1 |G k| 1

j G k,j =i bΣi,j Σ i,i + 1 |G k| 1

j G k,j =i Σ i,j

= bΣi,i Σ i,i 1 |G k| 1

h bΣi,j Σ i,j i ,

which implies that ||bΓ Γ ||max 2||bΣ Σ ||max. Therefore from Lemma 27 we see that

||B 1A T (bΓ Γ )A B 1||max 2

m||B 1A T (bΣ Σ )B 1||max,

demonstrating that it will suﬃce to bound the ﬁrst term in (51), which we now do. For the ﬁrst term in (51), we have

B 1A T (bΣ Σ )A B 1

i=1 B 1A T n (A Zi + Ei)(A Zi + Ei)T A C A T Γ o A B 1

i=1 Zi ZT i C + 1

i=1 Zi ET i A B 1 + 1

i=1 B 1A T Ei ZT i

i=1 B 1A T (Ei ET i Γ )A B 1. (52)

Eisenach, Bunea, Ning and Dinicu

Using the triangle inequality, we can apply Lemma 24, Lemma 25 and Lemma 26 to bound 52. Combining these results with that P(E) 1 c0/(d n)3, we obtain

b C C max C

with probability at least 1 C (K n) for some constant C dependent only on c1, c2, and c3 from Assumptions 1 and 2.

Lemma 20 (Concentration of Gradient and Hessian of the Loss Function) If Assumptions 1 and 2 hold, then with probability greater than 1 C (K n)3 , we have that

(a) max1 k K b CΘ k ek C1 q

(b) max1 t K b Ct, t w T t b C t, t C2 q

(c) max1 t K w T t (b C t, t C t, t) C2 q

for absolute constants C1 and C2 dependent only on c1, c2, and c3 from Assumptions 1 and 2.

Proof Let B = A T A and observe that b CΘ k ek = (b C C )Θ k. Following the decomposition (52), we can similarly show that

B 1A T (bΣ Σ )A B 1Θ k

i=1 Zi ZT i Θ k C Θ k + 1

i=1 Zi ET i A B 1Θ k + 1

i=1 B 1A T Ei ZT i Θ k

i=1 B 1A T (Ei ET i Γ )A B 1Θ k. (53)

As in the proof of Lemma 19, we have that

||B 1A T (bΓ Γ )A B 1Θ k|| 2

m||B 1A T (bΣ Σ )A B 1Θ k|| ,

demonstrating that again it will suﬃce to bound the ﬁrst term in 53. Note that ||Θ k||2 λmax (Θ ) c 1 1 . Therefore, by using the triangle inequality, we can apply Lemma 24, Lemma 25 and Lemma 26 to bound the ﬁrst term in 53. Combining these results and P(E) 1 C/(d n)3, we obtain

max 1 k K b CΘ k ek C1

with probability at least 1 C (K n)3 for some constant C1 dependent only on c1, c2, and c3 from Assumptions 1 and 2.

High-Dimensional Inference for Cluster-Based Graphical Models

For the remaining two claims, we can rewrite

w t = C t,t C T t,t(C t, t) 1C t,t Θ t,t = 1 Θ t,t Θ t,t

by the block matrix inverse formula. Using Lemma 30, it follows that ||w t ||2 λmax (Θ ) maxt C t,t. Then we see that

max 1 t K ||b Ct, t w T t b C t, t|| = max 1 t K ||(b Ct, t C t, t) w T t (b C t, t C t, t)||

max 1 t K ||(b Ct, t C t, t)|| | {z } (i)

+ max 1 t K ||w T t (b C t, t C t, t)|| | {z } (ii)

Clearly, using Lemma 19, (i) is bounded with high probability. Likewise, Lemma 19 demonstrates that b C t,i C t,i is a sub-exponential random vector with parameters dependent only on c1, c2, and c3 from Assumptions 1 and 2. Thus w T t (b C t,i C t,i) is sub-exponential and because ||w t ||2 λmax (Θ ) maxt C t,t, we obtain that

max 1 t K b Ct, t w T t b C t, t C2

with probability at least 1 C (K n)3 for some constant C2. C2 is dependent only on c1, c2, and c3 from Assumptions 1 and 2. The ﬁnal result is bounded by the previous one, concluding the proof.

Lemma 21 (Consistency of Initial Estimators) If Assumptions 1 and 2 hold, then

(a) max1 k K bΘ k Θ k 1 C1s0 q

(b) max1 k K bΘ k Θ k 2 C1 q

s0 log(K n)

(c) max1 t K bvt v t 1 C2s0 q

(d) max1 k t K |(bvt v t )T b C( bΘ k Θ k)| C3 s0 log(K n)

with probability at least 1 C4 (K n)3 . C1, C2, C3, C4 are constants, dependent only upon c0, c1, c2, and c3 from Assumptions 1 and 2.

Proof Below, the constants Ca, C a, Cb, C b, C b , Cc and C c will depend only upon c0, c1, c2, and c3 from Assumptions 1 and 2. We ﬁrst prove part (b). The proof of part (a) is similar. Let b = bwt w t , noting that we can consider wt instead of vt as the tth entries in both the estimated and true value are 1. By S denote the support of w t . w t is s0-sparse because w t is a multiple of Θ t,k, which we know to be s0-sparse.

Eisenach, Bunea, Ning and Dinicu

By Lemma 20, there exists Cb such that for λ Cb q

n , w t is feasible for (22)

with probability at least 1 C b (K n)3 . Assuming w t is feasible, then it follows by deﬁnition that ||(w t )S||1 ||(bwt)S||1 + ||(bwt) S||1. This in turn implies by the triangle inequality that

|| b S||1 || b S||1. Letting λ = Cb q

n , it follows from the triangle inequality that

||b C t, t b || ||bw T t b C t, t Ct, t | + ||w T t b C t, t Ct, t |

In addition, note that || b ||1 2|| b S||1 2 s0|| b S||2 2 s0|| b ||2. Therefore combining with the above, this gives

b T b C t, t || b ||1||b C t, t b ||

s0 log(K n)

n || b ||2.

From Lemma 29, b T b C t, t b 4c1

3 || b ||2 2 with probability at least 1 C b (K n)3 . Therefore

|| b ||2 16Cc1

s0 log(K n)

|| b ||1 32Cbs0c1

with probability at least 1 max{C b,C b } (K n)3 .

To obtain part (c), ﬁrst let δ = max1 k t K |(bvt v t )T b C( bΘ k Θ k)| and then by applying Holder s inequality and the triangle inequality, we obtain

δ max 1 k t K ||bvt v t ||1||b C( bΘ k Θ k)||

max 1 k t K ||bvt v t ||1 ||b C bΘ k ek|| + ||b CΘ k ek|| . (54)

With choice of λ as above, the KKT conditions give that ||b C bΘ k ek|| Cb q

From Lemma 20, we have that ||b CΘ k ek|| C q

n with probability at least 1

C (K n)3 . Using part (b), we get that max1 t K bvt v t 1 Cs0 q

n with probability

at least 1 C (K n)3 . The desired result now follows from (54).

Lemma 22 (CLT for the Pseudo-Score Function) Recall that

σ2 tk = E(Θ ttv T t ( C(i)Θ k ek))2,

High-Dimensional Inference for Cluster-Based Graphical Models

with C(i) deﬁned in (30). Let Fn denote the CDF of n 1/2v T t (b CΘ k ek)/(σtk/Θ tt). If Assumptions 1 and 2 hold, then we have

max 1 t<k K sup x R |Fn(x) Φ(x)| C(n 1/2 + (d n) 3).

where C is a constant dependent only upon c0, c1, and c2.

Proof Denote by E the event that b G = G . We have

Fn(x) Φ(x) e Fn(x) Φ(x) + P( E),

where e Fn(x) is the CDF of n 1/2 Pn i=1 v T t ( C(i)Θ k ek)/(σtk/Θ tt). To control e Fn(x) Φ(x), we now verify the Lyapunov condition. As in the proof of Lemma 20, we can write

v T t ( C(i)Θ k ek) = v T t ( C(i) C )Θ k.

From Lemmas 24 - 26 we see that the entries in Qi = ( C(i) C )Θ k are sub-exponential with parameters α = C1 and ν = C2 which depend only upon λmax (Θ ), maxk γ k, and maxt C t,t. Recall the deﬁnition of v t : (v t )t = 1 and (v t ) t = w t = (C t, t) 1C t,t. By the block matrix inverse formula, we can rewrite

w t = C t,t C T t,t(C t, t) 1C t,t Θ t,t = 1 Θ t,t Θ t,t.

Using Lemma 30, it follows that ||w t ||2 λmax (Θ ) maxt C t,t and ||v t ||2 λmax (Θ ) maxt C t,t+ 1. From Lemma 33 and the above, v T t Qi is sub-exponential with parameters α = C1 and ν = ||v t ||2C2 λmax (Θ ) maxt C t,t + 1 C2. Therefore, v T t ( C(i)Θ k ek) has third moments bounded above by some constant ρ that depends only upon λmax (Θ ), maxk γ k, and maxt C t,t. All three quantities are bounded above by constants per Assumptions 1 and 2. Thus, max1 t<k K supx( e Fn(x) Φ(x)) Cn 1/2 by the classical Berry-Esseen Theorem, and therefore max 1 t<k K sup x R (Fn(x) Φ(x)) C(n 1/2 + (d n) 3).

Similarly, it can be shown that supx R(Φ(x) Fn(x)) C(n 1/2+(d n) 3). This completes the proof.

Lemma 23 (Approximation for Asymptotic Variance) Under Assumptions 1 and 2, we have that σ2 tk = Θ 2 t,k + Θ t,tΘ k,k + , (55)

where | | Cs0

m and C is a constant dependent only upon c1, c2 and c3.

Proof Recall that σ2 tk = E(Θ ttv T t ( C(i)Θ k ek))2. Using the identity vec(M1M2M3) = (MT 3 M1)T vec(M2)(MT 3 M1), we have

σ2 tk = (Θ t,t)2(Θ T k v t )T E h vec( C(i)) vec( C(i))T i (Θ T k v t ). (56)

Eisenach, Bunea, Ning and Dinicu

Computing the expectation: After some straightforward, albeit lengthy, algebra we can show that E h vec( C(i)) vec( C(i))T i = M1 + M2 + M3,

where M1 := C C and M2 := [C j C T i ]ij. The matrices M1 and M2 contribute to the ﬁrst two terms in (55). The term M3 := E vec( C(i)) vec( C(i))T M1 M2, however, is unique to the latent graph and contributes the higher order term in (55). Evaluating the ﬁrst order terms: By (56), we have

σ2 tk = (Θ t,t)2(Θ k v t )T (M1 + M2)(Θ k v t ) +

with deﬁned as := (Θ t,t)2(Θ k v t )T M3(Θ k v t ). (57)

Next, we observe that

(Θ k v t )T M1(Θ k v t ) = Θ T k C Θ k v T t C v t

= Θ k,k Θ t,t ,

where we used that v T t C v t = (Θ t,t) 1. Similarly, we can ﬁnd that

(Θ k v t )T M2(Θ k v t ) = Θ 2 tk Θ 2 tt .

Bounding the higher order terms: What remains is to bound the magnitude of the term in (57). Lengthy algebra yields:

|(Θ k v t )T M3(Θ k v t )| C

m (Θ k v t )T (M4 + M5 + M6 + M7) (Θ k v t ),

where C depends only on c1, c2 and c3. Here, M4 = I (11T ). For l = 5, 6, 7 the matrices Ml are deﬁned block-wise by

( 1e T i if i = j 0 o/w and M6;ij :=

( ej1T if i = j 0 o/w and M7;ij :=

( I if i = j 0 o/w .

Further lengthy algebra gives:

|(Θ k v t )T M4(Θ k v t )| s0 2 c2 1 (1 + c2 2 c2 1 ),

|(Θ k v t )T M5(Θ k v t )| s0 2

2 c2 1 (1 + c2 2 c2 1 ),

|(Θ k v t )T M6(Θ k v t )| s0 2

2(c2 1 + c2 2) c4 1

1 + c2 2 c2 1 ,

High-Dimensional Inference for Cluster-Based Graphical Models

|(Θ k v t )T M7(Θ k v t )| s0 2(c2 1 + c2 2) c4 1 .

Plugging these bounds into the expression for in (57), we obtain

concluding the proof.

Appendix C. Concentration Results

The lemmas below provide important results regarding the concentration properties of some of the estimators b C and variables Z.

(a) Zi ZT i consists of entries which are sub-exponential with parameters α = 4 maxt(C t,t)2

2 maxt(C t,t)2,

n Pn i=1 Zi ZT i C max C maxt(C t,t)2 q

(c) Zi ZT i u consists of entries which are sub-exponential with parameters α = 4 maxt(C t,t)2

2||u||2 maxt(C t,t)2, and

n Pn i=1 Zi ZT i C u max C||u||2 maxt(C t,t)2 q

where C = 4

3 and u RK.

Proof From Lemma 31, each element in the matrices Zi ZT i are sub-exponential with parameters α = 4 maxt(C t,t)2 and ν = 2

2 maxt(C t,t)2. Therefore by Corollary 33, the entries in 1

n Pn i=1 Zi ZT i are sub-exponential with parameters α = 4

n maxt(C t,t)2 and ν =

2 n maxt(C t,t)2. Therefore by the tail bound for sub-exponential random variables, we see that

i=1 Zi ZT i C

exp (log(K n))D2 1 16 maxt(C t,t)4 if 0 D1 q

n 2 maxt(C t,t)2

n log(K n) 8 maxt(C t,t)2

n > 2 maxt(C t,t)2

for arbitrary D1 > 0. Observe that for n suﬃciently large, D1 p

log(K n)/n 2 maxt(C t,t)2, and thus we need only consider this case. Choose D1 4

3 maxt(C t,t)2. Then it is clear that

i=1 Zi ZT i C

Eisenach, Bunea, Ning and Dinicu

By applying the union bound across all entries in the matrix, we get the desired result that

i=1 Zi ZT i C max D1

concluding the proof of parts (a) and (b). The proof of (c) and (d) are very similar and omitted.

Lemma 25 Let C = 2

3, σ2 s = 1 |G s|2 P i G s γi, and Mi = Zi ET i A B 1. Then,

(a) (Mi)s,t is sub-exponential with parameters α =

2 max σ2 s, C t,t and ν =

2 max σ2 s, C t,t ,

n Pn i=1 Mi max C maxt(σ2 t C t,t) q

(c) (Miu)s is sub-exponential with parameters α =

2 max σ2 s, C t,t and ν =

2||u||2 max σ2 s, C t,t , and

n Pn i=1 Miu C maxt(σ2 t C t,t)||u||2 q

where u RK.

Proof Let M = Pn i=1 Mi. From Lemma 27, Y1 = B 1A T E1 is a K-dimensional vector where the kth entry is given by

(Y1)k = 1 |G k|

i G k (E1)i.

Because the errors are all independent mean zero Gaussian random variables, (Y1)s N(0, σ2 s). Therefore, as Y1 is independent of Z1 by deﬁnition, E[(Y1)s(Z1)t] = E[(Y1)s]E[(Z1)t] = 0. Further, Lemma 31 gives that (Y1)s(Z1)t is sub-exponential with parameters α = ν =

2 max σ2 s, C t,t . Using the independence of the samples, Corollary 33 gives that Ms,t is sub-exponential with parameters α =

2 max σ2 s, C t,t and ν =

2n max σ2 s, C t,t . Then, Corollary 35 gives that for arbitrary choice of D1 > 0,

1 n Ms,t D1

exp (log(K n))D2 1 4 max(σ2s,C t,t) 2

2 max σ2 s, C t,t

2 max(σ2s,C t,t)

2 max σ2 s, C t,t .

High-Dimensional Inference for Cluster-Based Graphical Models

Observe that for n suﬃciently large, D1 p

2 max σ2 s, C t,t . If we choose D1 2

3 max σ2 s, C t,t , then we obtain that for n suﬃciently large,

1 n Ms,t D1

Then by the union bound we can obtain

i=1 Zi ET i A B 1 max D1

3 max maxs σ2 s, maxt C t,t , concluding the proof of parts (a) and (b). The proof of (c) and (d) are very similar and omitted.

Lemma 26 Recall that m = mink |G k|, and let Mi = B 1A T Ei ET i A B 1 and µ = B 1A T Γ A B 1. Then,

(a) (Mi µ)t,k is sub-exponential with parameters αt,k =

2 |G k||G t | maxi G t G k γ i and

2 |G k||G t | maxi G t G k γ i ,

n Pn i=1 Mi µ max C maxk γk q

(c) ((Mi µ) u)t is sub-exponential with parameters αt =

2 m2 maxk γ k and νt = q

2 m2 ||u||2 maxk γ k, and

n Pn i=1 (Mi µ) u max C||u||2 maxk γk q

where C = 2

3 and and u RK.

Proof To obtain the results, we ﬁrst bound the sum M := Pn i=1 Mi entrywise. From Lemma 28, Corollary 32 and Corollary 33, we have that (Mi)t,k is sub-exponential with parameters

2 |G k||G t | max i G t G k γ i and νt,k =

2 |G k||G t | max i G t G k γ i .

Therefore 1

n Mt,k is sub-exponential with parameters

2 n|G k||G t | max i G t G k γ i and ν =

2 n|G k||G t | max i G t G k γ i .

Eisenach, Bunea, Ning and Dinicu

Next, observe that µt,k = E[Mt,k] and denote N = maxi G t G k γ i . Then, from Lemma 34, we obtain

1 n Mt,k µt,k D1

n|G k||G t |

exp (log(K n))D2 1 4N2 if 0 D1 q

log(K n) n|G k||G t |

log(K n) n|G k||G t | >

Observe that for n suﬃciently large, D1 q

log(K n) n|G k||G t |

2N. If we choose D1 2

3N, then we obtain that for n suﬃciently large,

1 n Mt,k µt,k D1

n|G k||G t |

Therefore by taking the union bound, lower bounding n|G k||G t | by nm2 and choosing D1 2

3 maxk γ k,

i=1 B 1A T (Ei ET i Γ )A B 1 max D1

concluding the proof of parts (a) and (b). The proof of (c) and (d) are very similar and omitted.

Appendix D. Auxiliary Technical lemmas

Lemma 27 For 1 k K, denote mk = |G k|. Then the matrix (A T A ) 1A T is a K d dimensional matrix given as

[(A T A ) 1A T ]k,i =

( 1 mk if i G k 0 otherwise.

Proof First, we must calculate B 1A T . For 1 k K, denote mk = |G k| and let ek be a unit vector in RK with 1 on the k position and 0 otherwise. Without loss of generality, we permute the rows of A such that for any 1 k K A j = ek, for Pk 1 i=1 mi + 1 j < Pk i=1 mi + 1 that is rows are ordered according to ascending group index. Here, for notational simplicity, we let m0 = 0. Thus, A T A = diag(m1, ..., m K) and the result follows immediately.

Lemma 28 The matrices B 1A T (Ei ET i )A B 1 and B 1A T (Γ )A B 1 are given by

(B 1A T (Ei ET i )A B 1)t,k = 1 |G t G k|

q G k Ei,p Ei,q

High-Dimensional Inference for Cluster-Based Graphical Models

(B 1A T (Γ )A B 1)t,k =

( 1 |G t |2 P p G t γ p if t = k

0 otherwise.

Proof The result can be obtained by a straightforward computation.

Lemma 29 (Restricted Eigenvalue Condition for b C) If Assumptions 1 and 2 hold, then the matrix b C satisﬁes with probability at least 1 C (K n)3 ,

||v||2 2 : v RK \ {0}, ||v S||1 3||v S||1

( v T b C t, tv

||v||2 2 : v RK \ {0}, ||v S ||1 3||v S ||1

where κ 3 4c1 > 0.

Proof We begin by proving the ﬁrst claim. By Lemma 19, we have that b C C max

n with high probability. Therefore, for K suﬃciently large and for any v RK \ {0}, v T b Cv

The proof is then done for κ = 3 4c1 as we assume the minimum eigenvalue of C is bounded below by c 1 0 . The proof of the second claim is identical because C is positive semideﬁnite, and it is well known that the minimum eigenvalue of any principal submatrix C t, t is bounded below by λmin(C ) c 1 0 .

Lemma 30 Let M be a n n positive deﬁnite matrix and denote its inverse by L. Then, for all i = 1, . . . , n Mi,i Li,i 1.

Proof By the block matrix inverse formula, it follows that

M 1 i,i = Li,i LT i,i L 1 i, i L i,i. (58)

Because M is positive deﬁnite, so is L. Recall that a matrix is positive deﬁnite if and only if all its principal minors are also positive deﬁnite. Therefore, L i, i is positive deﬁnite, as is L 1 i, i. Therefore, LT i,i L 1 i, i L i,i 0 and (58) becomes M 1 i,i Li,i. Lastly, if a matrix is positive deﬁnite, all its diagonal elements must be nonnegative, giving that Mi,i Li,i 1 as desired.

Eisenach, Bunea, Ning and Dinicu

Appendix E. Basic Tail Bounds for Random Variables

This section collects some basic tail probability results for random variables. The proof is standard and omitted.

Lemma 31 Let Y = (Y1, Y2) be a jointly Gaussian random vector with covariance matrix C. Then Y1Y2 is sub-exponential with parameters α = 4λmax (CY ) and ν = 2

2λmax (CY ).

Corollary 32 Let Y1 N(0, σ2 1) and Y2 N(0, σ2 2) where σ2 1 σ2 2.Then Y1Y2 is subexponential with parameters α =

2σ2 1 and ν =

Corollary 33 Consider Pn i=1 Xi where Xi are centered, independent sub-exponential random variables. Then Y = Pn i=1 Xi is sub-exponential with parameters α = maxi αi and

ν = q Pn i=1 ν2 i .

Lemma 34 (Tail Bound for Sub-Exponential Random Variables) Let X be a subexponential random variable with mean µ and parameters α and ν. Then

2ν2 ) for 0 t ν2

2α) for t > ν2

Corollary 35 Consider Y = Pn i=1 Xi, where Xi are centered, independent sub-exponential

random variables. Let α = maxi αi and ν = q Pn i=1 ν2 i . Then,

2ν2/n) for 0 t ν2

2α) for t > ν2

Appendix F. Construction of a Pre-Clustering Variance Estimator

We include in this section the construction of the pre-clustering estimator of Γ needed as an input of the PECOK algorithm of Section 2.2 above. For any a, b [d], deﬁne

V (a, b) := max c,d [p]\{a,b}

(bΣac bΣad) (bΣbc bΣbd) q

bΣcc + bΣdd 2bΣcd , (59)

with the convention 0/0 = 0. Guided by the block structure of Σ, we deﬁne

b1(a) := argmin b [p]\{a} V (a, b) and b2(a) := argmin b [p]\{a,b1(a)} V (a, b),

to be two elements close to a, that is two indices b1 = b1(a) and b2 = b2(a) such that the empirical covariance diﬀerence bΣbic bΣbid, i = 1, 2, is most similar to bΣac bΣad, for all variables c and d not equal to a or bi, i = 1, 2. It is expected that b1(a) and b2(a) either belong to the same group as a, or belong to some close groups. Then, our estimator eΓ is a diagonal matrix, deﬁned by

eΓaa = bΣaa + bΣb1(a)b2(a) bΣab1(a) bΣab2(a), for a = 1, . . . , d. (60)

High-Dimensional Inference for Cluster-Based Graphical Models

Intuitively, eΓaa should be close to Σaa + Σb1(a)b2(a) Σab1(a) Σab2(a), which is equal to Γaa in the favorable event where both b1(a) and b2(a) belong to the same group as a. In general, b1(a) and b2(a) cannot be guaranteed to belong to the same group as a. Nevertheless, these two surrogates b1(a) and b2(a) are close enough to a so that eΓ Γ max |Γ|max p

log d/n. This last fact and the above construction are shown in Bunea et al. (2018).

Appendix G. Comparison with Cai et al (2016)

In our work, we bound λmin(S ) c1 and maxt S t,t c2 and the sparsity of Ω k by s1. We show that the CLIME estimator satisﬁes

bΩ k Ω k 1 s1

Let us denote our parameter space by

G1 = {Ω: max k Ω k 0 s1, max t {(Ω 1)tt} c2, λmax(Ω) 1/c1}.

By contrast, Cai et al. (2016) considered the following parameter space for the precision matrix Ω= S 1 (in the exact sparse case)

G = {Ω: max k Ω k 0 s1, Ω 1 Mn, κ(Ω) = λmax(Ω)/λmin(Ω) M1},

where Ω 1 is the matrix ℓ1-norm, Mn is allowed to increase with n and M1 is a constant bound for the condition number. Their minimax lower bound over G depends on Mn, that is for any bΩ k,

sup Ω G bΩ k Ω k 1 Mns1

It seems that the upper bound (61) and the lower bound (62) contradict with each other. However, this is not the case because the parameter space G1 under which the estimator is developed is diﬀerent from the parameter space G in the lower bound. This therefore opens up the possibility of obtaining diﬀerent rates over diﬀerent parameter spaces, for instance one which does not depend on ||Ω ||1, like in our case. However, neither parameter space can be viewed as a full relaxation of the other. Below we give explicit examples that show how the parameter spaces diﬀer. Consider the sequence of precision matrices, indexed by K

ΩK = |S(K)|I + 1S(K)1 S(K) RK,

where S(K) is an arbitrary subset of [K] = {1, 2, ..., K} with |S(K)| = K1/4 and 1S(K) is the vector with 1 s in the indices indicated by S(K) and 0 s elsewhere. By using the Sherman-Morrison formula, we can verify that the corresponding covariance matrix is

SK = 1 |S(K)|I 1 2|S(K)|2 1S(K)1 S(K).

Eisenach, Bunea, Ning and Dinicu

It is easy to check that

λmin(ΩK) = |S(K)| = K1/4,

λmax(ΩK) = 2|S(K)| = 2K1/4,

λmin(SK) = 1 2|S(K)| = K 1/4/2.

Thus κ(ΩK) = 2, and therefore Theorem 4.2 from Cai et al. (2016) can be applied to this sequence of parameters ΩK with s1 = K1/4 + 1 and Mn = 2K1/4. However, ΩK does not belong to our parameter space G1 because for any choice of c1, once K is suﬃciently large, λmin(SK) < c1. Similarly, we can modify the above example such that ΩK satisﬁes our condition but does not belong to G. For instance, consider

ΩK = I 1 1 + |S(K)|1S(K)1 S(K) RK.

We can show that SK = I + 1S(K)1 S(K),

λmin(ΩK) = 1 K1/4 + 1,

λmax(ΩK) = 1,

λmin(SK) = 1.

Our conditions λmin(SK) c1 and maxt(SK)t,t c2 hold, and thus we obtain the rate (61) for the CLIME estimator. However, κ(ΩK) = K1/4 + 1 , as K . Thus, ΩK does not belong to the parameter space G, and Theorem 4.2 in Cai et al. (2016) is not applicable. Since our goal is to make inference on each entry of the precision matrix say Ωtk, a lower bound on the minimum eigenvalue on S is a mild and standard assumption for any high-dimensional inference on graphical models. After some careful analysis as explained in the main text, we show that under Assumption 4.1 and 4.2 the upper bound for the CLIME estimator does not depend on ||Ω ||1 or λmax(S ).

Appendix H. B-H procedures for FDR control

In this section, we analyze the theoretical properties of the B-H procedure for the clusteraverage graph. The procedure for latent variable graph is identical. Recall that the test statistic for H0,tk : Ω t,k = 0 is f Wt,k. The B-H procedure is deﬁned as follows. Given the desired FDR level α, deﬁne

bρ = inf n 0 ρ 2 p

log K : 2(1 Φ(ρ))|H|

max{P 1 t k K I(|f Wt,k| ρ), 1} α o ,

where |H| = K(K 1)/2 is the total number of hypotheses. If bρ does not exist, then we set bρ = 2 log K. In the above deﬁnition, we restrict ρ 2 log K in order to apply the

High-Dimensional Inference for Cluster-Based Graphical Models

Cramer-type moderate deviation result (Liu, 2013). The B-H procedure says that we reject H0,tk if |f Wt,k| bρ. The FDR and FDP are deﬁned as

P (t,k) H0 I(|f Wt,k| bρ)

max{P (t,k) H I(|f Wt,k| bρ), 1} ,

where H0 := {(t, k) : 1 t < k K, such that Ω t,k = 0}, and FDR = E(FDP). Let

A = n (t, k) : Ω t,k q

Ω k,kΩ t,t 4 p

denote the set of strong signals. The following theorem on the B-H procedure holds.

Proposition 36 Assume that the conditions in Theorem 3 hold. Let K nr for some r > 0. In addition, assume that |A| log log K, s1 log3/2(K n)/n1/2 = o(1) and s1 = O(Kc) for some c < 1/2. Then, as n, K , we have

FDP |H0|/|H| = α + op(1) and FDR |H0|/|H| = α + o(1).

The proof of this proposition follows from Theorem 3.1 in Liu (2013). The conditions |A| log log K and s1 = O(Kc) are equation (12) in Liu (2013). The condition s1 log3/2(K n)/n1/2 = o(1) together with Theorem 3 guarantees that

max 1 t<k K sup x R

P( b Tt,k x) Φ(x) = o(1/ p

log(K n)), (63)

which is equivalent to condition (13) in Liu (2013). We refer to Liu (2013) for the detailed proof. Proposition 36 implies that the B-H procedure can control FDR asymptotically under certain assumption. However, our numerical results show that this procedure may fail to control FDR in our model when K is relatively large. One possible reason is that the op(1) or o(1) terms in the approximation of FDP or FDR in the above proposition are not suﬃciently small in simulations due to the dependence of the test statistics. Thus, if the goal of the data analysis is to ﬁnd statistically reliable scientiﬁc discoveries, the B-Y method, while quite conservative, may be the one that s more suitable for this purpose.

Jessica R Andrews-Hanna, Jay S Reidler, Jorge Sepulcre, Renee Poulin, and Randy L Buckner. Functional-anatomic fractionation of the brain s default network. Neuron, 65 (4):550 562, 2010.

Rina Foygel Barber and Mladen Kolar. Rocket: Robust conﬁdence intervals via kendall s tau for transelliptical graphical models, 2015.

Pierre Bellec, Carlton Chu, Francois Chouinard-Decorte, Yassine Benhajali, Daniel S. Margulies, and R. Cameron Craddock. The Neuro Bureau ADHD-200 Preprocessed repository. Neuro Image, 2015. ISSN 10959572. doi: 10.1016/j.neuroimage.2016.06.034.

Eisenach, Bunea, Ning and Dinicu

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289 300, 1995.

Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165 1188, 2001.

Peter J Bickel. One-step huber estimates in the linear model. Journal of the American Statistical Association, 70(350):428 434, 1975.

Peter J Bickel, Chris AJ Klaassen, Peter J Bickel, Y Ritov, J Klaassen, Jon A Wellner, and YA Acov Ritov. Eﬃcient and adaptive estimation for semiparametric models. Johns Hopkins University Press Baltimore, 1993.

Florentina Bunea. Consistent covariate selection and post model selection inference in semiparametric regression. Annals of Statistics, pages 898 927, 2004.

Florentina Bunea, Christophe Giraud, Xi Luo, Martin Royer, and Nicolas Verzelen. Model assisted variable clustering: minimax-optimal recovery and algorithms, 2018.

T. Cai, W. Liu, and X. Luo. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594 607, 2011.

T Tony Cai and Zijian Guo. Conﬁdence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics, 45(2):615 646, 2017.

T Tony Cai, Weidong Liu, and Harrison Zhou. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. The Annals of Statistics, 44(2):455 488, 2016.

Carson Eisenach and Han Liu. Eﬃcient, certiﬁably optimal clustering with applications to latent variable graphical models. Mathematical Programming, 2019. doi: 10.1007/ s10107-019-01375-2.

Jianqing Fan, Han Liu, Yang Ning, and Hui Zou. High dimensional semiparametric latent graphical model for mixed data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):405 421, 2017.

Ethan X Fang, Yang Ning, and Han Liu. Testing and conﬁdence intervals for high dimensional proportional hazards models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1415 1437, 2017.

Huijie Feng and Yang Ning. High-dimensional mixed graphical model with ordinal data: Parameter estimation and statistical inference. In The 22nd International Conference on Artiﬁcial Intelligence and Statistics, pages 654 663, 2019.

Michael D Fox, Abraham Z Snyder, Justin L Vincent, Maurizio Corbetta, David C Van Essen, and Marcus E Raichle. The human brain is intrinsically organized into dynamic, anticorrelated functional networks. Proceedings of the National Academy of Sciences, 102 (27):9673 9678, 2005.

High-Dimensional Inference for Cluster-Based Graphical Models

Peter Fransson. Spontaneous low-frequency bold signal ﬂuctuations: An fmri investigation of the resting-state default mode of brain function hypothesis. Human brain mapping, 26 (1):15 29, 2005.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432 441, 2008.

Christophe Giraud. Introduction to high-dimensional statistics, volume 138. CRC Press, 2014.

Quanquan Gu, Yuan Cao, Yang Ning, and Han Liu. Local and global inference for high dimensional gaussian copula graphical models, 2015.

Jana Jankova and Sara van de Geer. Conﬁdence intervals for high-dimensional inverse covariance estimation, 2014.

Jana Jankov a and Sara van de Geer. Honest conﬁdence regions and optimality in highdimensional precision matrix estimation. Test, 26(1):143 162, 2017.

Adel Javanmard and Andrea Montanari. Conﬁdence intervals and hypothesis testing for high-dimensional regression, 2013.

C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37:42 54, 2009.

Jason D Lee, Dennis L Sun, Yuekai Sun, and Jonathan E Taylor. Exact inference after model selection via the lasso, 2013.

Han Liu, Fang Han, Ming Yuan, John Laﬀerty, and Larry Wasserman. High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics, 40(4):2293 2326, 2012.

Weidong Liu. Gaussian graphical model estimation with false discovery rate control. The Annals of Statistics, pages 2948 2978, 2013.

Richard Lockhart, Jonathan Taylor, Ryan J Tibshirani, and Robert Tibshirani. A signiﬁcance test for the lasso. The Annals of Statistics, 42(2):413 468, 2014.

Xi Luo. A hierarchical graphical model for big inverse covariance estimation with an application to fmri. ar Xiv preprint ar Xiv:1403.4698, 2014.

N. Meinshausen and P. B uhlmann. High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436 1462, 2006.

Matey Neykov, Yang Ning, Jun S Liu, and Han Liu. A uniﬁed theory of conﬁdence regions and testing for high dimensional estimating equations. Statistical Science, 2018.

Yang Ning and Han Liu. High-dimensional semiparametric bigraphical models. Biometrika, 100(3):655 670, 2013.

Eisenach, Bunea, Ning and Dinicu

Yang Ning and Han Liu. A general theory of hypothesis tests and conﬁdence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158 195, 2017.

Yang Ning, Tianqi Zhao, Han Liu, et al. A likelihood ratio framework for high-dimensional semiparametric regression. The Annals of Statistics, 45(6):2299 2327, 2017.

Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735 746, 2009.

Jonathan D Power, Alexander L Cohen, Steven M Nelson, Gagan S Wig, Kelly Anne Barnes, Jessica A Church, Alecia C Vogel, Timothy O Laumann, Fran M Miezin, Bradley L Schlaggar, and Steven E Petersen. Functional network organization of the human brain. Neuron, 17(724):665 678, 2011. doi: 10.1016/j.neuron.2011.09.006.

Marcus E Raichle. The brain s default mode network. Annual review of neuroscience, 38: 433 447, 2015.

Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, and Bin Yu. Highdimensional covariance estimation by minimizing l 1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935 980, 2011.

Zhao Ren, Tingni Sun, Cun-Hui Zhang, and Harrison H. Zhou. Asymptotic normality and optimalities in estimation of large gaussian graphical model, 2013.

Adam J Rothman, Peter J Bickel, Elizaveta Levina, and Ji Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494 515, 2008.

Christopher G Small, Jinfang Wang, and Zejiang Yang. Eliminating multiple root problems in estimation. Statistical Science, 15(4):313 341, 2000.

Stephen M Smith, Peter T Fox, Karla L Miller, David C Glahn, P Mickle Fox, Clare E Mackay, Nicola Filippini, Kate E Watkins, Roberto Toro, Angela R Laird, et al. Correspondence of the brain s functional architecture during activation and rest. Proceedings of the National Academy of Sciences, 106(31):13040 13045, 2009.

Tingni Sun and Cun-Hui Zhang. Sparse matrix inversion with scaled lasso. Technical report, 2012.

Kean Ming Tan, Yang Ning, Daniela M Witten, and Han Liu. Replicates in high dimensions, with applications to latent variable graphical models. Biometrika, 103(4):761 777, 2016.

Ryan J Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani. Exact post-selection inference for sequential regression procedures, 2014.

Sara van de Geer, Peter B uhlmann, and Ya acov Ritov. On asymptotically optimal conﬁdence regions and tests for high-dimensional models, 2013.

A. V. Van der Vaart. Asymptotic statistics. Cambridge University Press, Cambridge, UK, 1998.

High-Dimensional Inference for Cluster-Based Graphical Models

Nicolas Verzelen. Gaussian graphical models and Model selection. Ph D thesis, Universit e Paris Sud-Paris XI, 2008.

L. Xue and H. Zou. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. The Annals of Statistics, 40(5):2541 2571, 2012.

Zhuoran Yang, Yang Ning, and Han Liu. On semiparametric exponential family graphical models. The Journal of Machine Learning Research, 19(1):2314 2372, 2018.

M. Yuan. High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11(8):2261 2286, 2010.

M. Yuan and Y. Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19 35, 2007.

Cun-Hui Zhang and Stephanie S. Zhang. Conﬁdence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217 242, 2014. ISSN 1467-9868. doi: 10.1111/rssb.12026.

Yuan Zhou, Karl J Friston, Peter Zeidman, Jie Chen, Shu Li, and Adeel Razi. The hierarchical organization of the default, dorsal attention and salience networks in adolescents and young adults. Cerebral cortex, 28(2):726 737, 2017.