Journal of Machine Learning Research 23 (2022) 1-36. Submitted 1/21; Revised 5/22; Published 9/22

Tree-based Node Aggregation in Sparse Graphical Models

Ines Wilms (i.wilms@maastrichtuniversity.nl)
Department of Quantitative Economics, Maastricht University, Maastricht, The Netherlands

Jacob Bien (jbien@usc.edu)
Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, California, USA

Editor: David Wipf

Abstract

High-dimensional graphical models are often estimated using regularization that is aimed at reducing the number of edges in a network. In this work, we show how even simpler networks can be produced by aggregating the nodes of the graphical model. We develop a new convex regularized method, called the tree-aggregated graphical lasso or tag-lasso, that estimates graphical models that are both edge-sparse and node-aggregated. The aggregation is performed in a data-driven fashion by leveraging side information in the form of a tree that encodes node similarity and facilitates the interpretation of the resulting aggregated nodes. We provide an efficient implementation of the tag-lasso by using the locally adaptive alternating direction method of multipliers and illustrate our proposal's practical advantages in simulation and in applications in finance and biology.

Keywords: aggregation, graphical model, high-dimensionality, regularization, sparsity

1. Introduction

Graphical models are greatly useful for understanding the relationships among large numbers of variables. Yet, estimating graphical models with many more parameters than observations is challenging, which has led to an active area of research on high-dimensional inverse covariance estimation.
Numerous methods attempt to curb the curse of dimensionality through regularized estimation procedures (e.g., Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008; Rothman et al., 2008; Peng et al., 2009; Yuan, 2010; Cai et al., 2011, 2016). Such methods aim for sparsity in the inverse covariance matrix, which corresponds to graphical models with only a small number of edges. A common method for estimating sparse graphical models is the graphical lasso (glasso) (Yuan and Lin, 2007; Banerjee et al., 2008; Rothman et al., 2008; Friedman et al., 2008), which adds an ℓ1-penalty to the negative log-likelihood of a sample of multivariate normal random variables. While this and many other methods focus on the edges for dimension reduction, far fewer contributions (e.g., Tan et al., 2015; Eisenach et al., 2020; Pircalabelu and Claeskens, 2020) focus on the nodes as a guiding principle for dimension reduction.

©2022 Ines Wilms and Jacob Bien. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0105.html.

Nonetheless, node dimension reduction is becoming increasingly relevant in many areas where data are being measured at finer levels of granularity. For instance, in biology, modern high-throughput sequencing technologies provide low-cost microbiome data at high resolution; in neuroscience, brain activity in hundreds of regions of interest can be measured; in finance, data at the individual company level at short time scales are routinely analyzed; and in marketing, joint purchasing data on every stock-keeping unit (product) are recorded. The fine-grained nature of these data brings new challenges. The sheer number of fine-grained, often noisy, variables makes it difficult to detect dependencies.
Moreover, there can be a mismatch between the resolution of the measurement and the resolution at which natural, meaningful interpretations can be made. The purpose of an analysis may be to draw conclusions about entities at a coarser level of resolution than happened to be measured. Because of this mismatch, practitioners are sometimes forced to devise ad hoc post-processing steps involving, for example, coloring the nodes based on some classification of them into groups in an attempt to make the structure of an estimated graphical model more interpretable and the domain-specific takeaways more apparent (e.g., Millington and Niranjan, 2019).

Our solution to this problem is to incorporate side information about the relationship between nodes directly into the estimation procedure. In our framework, this side information is encoded as a tree whose leaves correspond to the measured variables. Such tree structures are readily available in many domains (e.g., taxonomies in biology and hierarchical classifications of jobs, companies, and products in business) and are well suited to expressing the multi-resolution structure that is present in many problems.

We propose a new convex regularization procedure, called tag-lasso, which stands for tree-aggregated graphical lasso. This procedure combines node (or variable) aggregation with edge-sparsity. The tree-based aggregation serves both to amplify the signal of similar, low-level variables and to render a graphical model involving nodes at an appropriate level of scale to be relevant and interpretable. The edge-sparsity encourages the graphical model involving the aggregated nodes to have a sparse network structure. Our procedure is based on a tree-based parameterization strategy that translates the node aggregation problem into a sparse modeling problem, following an approach previously introduced in the regression setting (Yan and Bien, 2021).
In Figure 1 (to be discussed more thoroughly in Section 4), we see that tag-lasso is able to recover the aggregated, sparse graph structure. By doing so, it yields a more accurate estimate of the true graph, and its output is easier to interpret than the full, noisy graph obtained by the glasso.

The rest of the paper is organized as follows. Section 2 introduces the tree-based parameterization structure for nodewise aggregation in graphical models. Section 3 introduces the tag-lasso estimator, formulated as a solution to a convex optimization problem, for which we derive an efficient algorithm. Section 4 presents the results of a simulation study. Section 5 illustrates the practical advantages of the tag-lasso on financial and microbiome data sets. Section 6 concludes.

2. Node Aggregation in Penalized Graphical Models

Let S be the empirical covariance matrix based on n multivariate normal observations of dimension p, with mean vector µ and covariance matrix Σ.

Figure 1: Top: True graph and precision matrix Ω with corresponding aggregated graph and precision matrix. Middle: Estimation output of the tag-lasso. Bottom: Estimation output of the glasso.

The target of estimation is the precision matrix Ω = Σ^{−1}, whose sparsity pattern provides the graph structure of the Gaussian graphical model, since Ω_{jk} = 0 is equivalent to variables j and k being conditionally independent given all other variables. To estimate the precision matrix, it is common to use a convex penalization method of the form

    Ω̂ = argmin_Ω { −log det(Ω) + tr(SΩ) + λP(Ω)  s.t.  Ω = Ω⊤, Ω ≻ 0 },   (1)

where tr(·) denotes the trace, Ω ≻ 0 denotes that Ω is positive definite, P(·) is a convex penalty function, and λ > 0 is a tuning parameter controlling the degree of penalization.
Choosing the ℓ1-norm

    P(Ω) = ‖Ω_{−diag}‖₁,   (2)

where Ω_{−diag} contains the unique off-diagonal elements of Ω, yields the graphical lasso (glasso) (Friedman et al., 2008; Yuan and Lin, 2007; Banerjee et al., 2008; Rothman et al., 2008). It encourages Ω̂ to be sparse, corresponding to a graphical model with few edges.

However, when Ω is not sparse, demanding sparsity in Ω̂ may not be helpful, as we will show in Section 2.1. Such settings can arise when data are measured and analyzed at ever higher resolutions (a growing trend in many areas, see e.g. Callahan et al. 2017). A tree is a natural way to represent the different scales of data resolution, and we introduce a new choice for P that uses this tree to guide node aggregation, thereby allowing for a data-adaptive choice of data scale for capturing dependencies. Such tree-based structures are available in many domains. For instance, companies can be aggregated according to hierarchical industry classification codes; products can be aggregated from brands towards product categories; brain voxels can be aggregated according to brain regions; microbiome data can be aggregated according to taxonomy. The resulting penalty function then encourages a more general and yet still highly interpretable structure for Ω̂. In the following subsection, we use a toy example to illustrate the power of such an approach.

2.1 Node Aggregation

Consider a toy example with p variables:

    X₁ = ∑_{j=3}^{p} X_j + ε₁,  X₂ = ∑_{j=3}^{p} X_j + ε₂,  X_j = ε_j for 3 ≤ j ≤ p,

where ε₁, ..., ε_p are independent standard normal random variables. By construction, it is clear that there is a very simple relationship between the variables: The first two variables both depend on the sum of the other p − 2 variables. However, a standard graphical model on the p variables does not naturally express this simplicity. The first row of Table 1 shows the covariance and precision matrices for the full set of variables X₁, ..., X_p.
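As a numerical check of this toy example (a sketch outside the paper; numpy assumed), one can build the covariance matrix directly from the generating equations, invert it, and also form the aggregated three-variable precision matrix:

```python
import numpy as np

p = 10
# Write X = A_gen @ eps with eps ~ N(0, I_p): rows 1 and 2 load on eps_3..eps_p.
A_gen = np.eye(p)
A_gen[0, 2:] = 1.0
A_gen[1, 2:] = 1.0
Sigma = A_gen @ A_gen.T              # covariance of (X_1, ..., X_p)
Omega = np.linalg.inv(Sigma)         # full p x p precision matrix

# Aggregate to (X_1, X_2, X_tilde) with X_tilde = X_3 + ... + X_p.
M = np.zeros((p, 3))
M[0, 0] = M[1, 1] = 1.0
M[2:, 2] = 1.0
Omega_agg = np.linalg.inv(M.T @ Sigma @ M)   # 3 x 3 aggregated precision
```

One finds Omega[0, 1] = 0 (conditional independence of X₁ and X₂), the lower-right block of Omega equal to I + 2·11⊤, and the (3, 3) entry of Omega_agg equal to 2 + 1/(p − 2), matching Table 1.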
The graph in the last column then visually represents the same information as the precision matrix. While this graph does convey the message that variables 1 and 2 are conditionally independent, it is extremely dense with O(p²) edges. As such, the precise structure among the remaining variables is hard to infer from the graph, an issue that only becomes worse when the number of variables p increases.

Imagine if instead we could form a graphical model with only three variables: X₁, X₂, X̃, where the last variable X̃ = ∑_{j=3}^{p} X_j aggregates all but the first two variables. The bottom row of Table 1 results in a graphical model that matches the simplicity of the situation. The graph with aggregated nodes maintains its simplicity even when p increases.

Nodes X₁, ..., X_p:

    Σ = [ p−1       p−2       1⊤_{p−2}
          p−2       p−1       1⊤_{p−2}
          1_{p−2}   1_{p−2}   I_{p−2}  ]

    Ω = [ 1          0          −1⊤_{p−2}
          0          1          −1⊤_{p−2}
          −1_{p−2}   −1_{p−2}   L        ],  with L = I_{p−2} + 2·1_{p−2}1⊤_{p−2}

Nodes X₁, X₂, X̃:

    Σ = [ p−1   p−2   p−2
          p−2   p−1   p−2
          p−2   p−2   p−2 ]

    Ω = [ 1    0    −1
          0    1    −1
          −1   −1   2 + 1/(p−2) ]

Note: Let 1_d denote a d-dimensional column vector of ones, and I_d the d × d identity matrix.

Table 1: Toy example: Covariance and precision matrices with corresponding graphical model (drawn for p = 50) for the full (top) and aggregated (bottom) set of nodes.

The lack of sparsity in the p-node graphical model means that the graphical lasso will not do well; its estimation accuracy will suffer unless the sample size is extremely large. Nonetheless, a method that could perform node aggregation would be able to yield a highly interpretable aggregated sparse graphical model, since X₁ and X₂ are conditionally independent given the aggregated variable X̃.

It is useful to map from the small aggregated graphical model to the original p-node graphical model. One does so by writing the precision matrix in G-block format (Bunea et al., 2020, although they introduce this terminology in the context of the covariance
matrix, not its inverse) for a given partition G = {G₁, ..., G_K} of the nodes {1, ..., p} and corresponding p × K membership matrix M, with entries M_{jk} = 1 if j ∈ G_k, and M_{jk} = 0 otherwise. In particular, there exists a K × K symmetric matrix C and a p × p diagonal matrix D such that the precision matrix can be written as Ω = MCM⊤ + D. The block structure of Ω is captured by the first part of the decomposition; the aggregated K × K precision matrix on the set of aggregated nodes can then be written as Ω_agg = C + D_agg, where D_agg = (M⊤D^{−1}M)^{−1} is diagonal. In the above example, K = 3, G₁ = {1}, G₂ = {2}, G₃ = {3, ..., p}, and MCM⊤ has only three distinct rows/columns since the aggregated variables j = 3, ..., p share all their entries. In the presence of node aggregation and edge sparsity, the graphical model corresponding to the aggregated precision matrix is far more parsimonious than the graphical model on the full precision matrix (see Table 1).

As motivated by this example, our main goal is to estimate the precision matrix in such a way that we can navigate from a p-dimensional problem to a K-dimensional problem whose corresponding graphical model provides a simple description of the conditional dependency structure among K aggregates of the original variables. In the following proposition, we show that this can be accomplished by looking for a precision matrix that has a G-block structure. The proof of the proposition is included in Appendix A.

Proposition 1 Suppose X ∼ N_p(0, Ω^{−1}) with Ω = MCM⊤ + D, where M ∈ {0, 1}^{p×K} is the membership matrix and D ≻ 0, and let X̃ = M⊤X ∈ ℝ^K be the vector of aggregated variables. Then X̃ has precision matrix Ω_agg = C + D_agg, where D_agg is a diagonal matrix, and therefore c_{ij} = 0 is equivalent to the aggregates X̃_i and X̃_j being conditionally independent given all other aggregated variables.

Figure 2: An example of a tree T encoding similarity among p = 5 variables (leaves labeled Ω₁, ..., Ω₅).
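Proposition 1 can also be checked numerically. The sketch below (illustrative only; numpy assumed, with an arbitrary choice of M, C, and D) confirms that the precision matrix of X̃ = M⊤X equals C + (M⊤D^{−1}M)^{−1}:

```python
import numpy as np

p, K = 5, 2
# Membership matrix M for the partition G1 = {1, 2}, G2 = {3, 4, 5}.
M = np.zeros((p, K))
M[:2, 0] = 1.0
M[2:, 1] = 1.0

C = np.array([[1.0, 0.3],
              [0.3, 1.0]])                 # symmetric K x K component
D = np.diag([1.0, 2.0, 1.0, 0.5, 1.0])     # positive diagonal component

Omega = M @ C @ M.T + D                    # G-block precision matrix
Sigma = np.linalg.inv(Omega)

# Precision matrix of the aggregated vector X_tilde = M^T X:
Omega_agg = np.linalg.inv(M.T @ Sigma @ M)
D_agg = np.linalg.inv(M.T @ np.linalg.inv(D) @ M)
# Omega_agg coincides with C + D_agg, as Proposition 1 asserts.
```

The identity follows from the Woodbury formula applied to (MCM⊤ + D)^{−1} and holds for any valid M, C, and D, not just this choice.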
The same matrix C thus enters the formula for Ω as well as Ω_agg, thereby ensuring that both reflect the same conditional independence structure. While Proposition 1 thereby gives us the desired interpretation in the graphical model with K aggregated nodes, in practice the partition G, its size K, and the corresponding membership matrix M are unknown. Rather than considering arbitrary partitions of the variables, we constrain ourselves specifically to partitions guided by a known tree. In so doing, we allow ourselves to exploit side information and help ensure that the aggregated nodes will be easily interpretable. To this end, we introduce a tree-based parameterization strategy that allows us to embed the node dimension reduction into a convex optimization framework.

2.2 Tree-Based Parameterization

Our aggregation procedure assumes that we have, as side information, a tree that represents the closeness (or similarity) of variables. We introduce here a matrix-valued extension of the tree-based parameterization developed in Yan and Bien (2021) for the regression setting. We consider a tree T with p leaves Ω₁, ..., Ω_p, where Ω_j denotes column j of Ω (1 ≤ j ≤ p). We restrict ourselves to partitions that can be expressed as a collection of branches of T. Newly aggregated nodes are then formed by summing variables within branches. To this end, we assign a p-dimensional parameter vector γ_u to each node u in the tree T (see Figure 2 for an example). Writing the set of nodes in the path from the root to the jth leaf (variable) as ancestor(j) ∪ {j}, we express each column/row of the precision matrix as

    Ω_j = ∑_{u ∈ ancestor(j) ∪ {j}} γ_u + d_j e_j,   (3)

where we sum over all the γ_u's along this path, and e_j denotes the p-dimensional vector with all zeros except for its jth element, which is equal to one.
In the remainder, we will make extensive use of the more compact notation Ω = AΓ + D, where A ∈ {0, 1}^{p×|T|} is a binary matrix with A_{jk} = 1{u_k ∈ ancestor(j) ∪ {j}} = 1{j ∈ descendant(u_k) ∪ {u_k}}, with 1{·} denoting the indicator function; Γ is a |T| × p parameter matrix collecting the γ_u's in its rows, with |T| denoting the number of nodes in the tree; and D is a diagonal parameter matrix with elements d₁, ..., d_p.

By zeroing out γ_u's, certain nodes will be aggregated, as can be seen from the illustrative example in Figure 3. More precisely, let V = {u : γ_u ≠ 0} denote the set of non-zero rows in Γ, and let A_V be the sub-matrix of A in which only the columns corresponding to the non-zero rows of Γ are kept. The number of blocks K in the aggregated network is then given by the number of unique rows in A_V. The membership matrix M (Section 2.1), and hence the set of aggregated nodes, can then be derived from the variables (rows) in the matrix A_V that share all their row-entries. In the next section, we introduce the tag-lasso, which is based on this parameterization.

Figure 3: Left: An example of a 5 × 5-dimensional Ω and a tree T that relates the corresponding p = 5 variables. We have Ω_i = γ_i + γ_{1:3} + γ_{1:5} for i = 1, 2, 3 and Ω_j = γ_j + γ_{4:5} + γ_{1:5} for j = 4, 5, by equation (3), ignoring the diagonal elements. Middle: By zeroing out the γ_i's in the gray nodes, we aggregate the rows/columns of Ω into two groups indicated by the two colors: Ω₁ = Ω₂ = Ω₃ = γ_{1:3} + γ_{1:5} (blue) and Ω₄ = Ω₅ = γ_{1:5} (red). Right: The precision matrix Ω thus has a block structure.

We prefer to perform aggregation according to tree-based structures, as opposed to more general directed acyclic graphs, because of the interpretability trees naturally offer. For trees, each node has only one parent and, hence, there exists a single path from the leaves (original variables) towards the root node (complete aggregation).
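To make the parameterization concrete, the sketch below (a hypothetical helper, not the paper's Algorithm 1; numpy assumed) builds A for the tree of Figures 2 and 3 and recovers group labels from the support of Γ by finding the unique rows of A_V:

```python
import numpy as np

# Nodes of the tree in Figure 2: five leaves, two internal nodes, one root.
nodes = ["1", "2", "3", "4", "5", "u13", "u45", "root"]
parent = {"1": "u13", "2": "u13", "3": "u13",
          "4": "u45", "5": "u45", "u13": "root", "u45": "root"}

p = 5
A = np.zeros((p, len(nodes)), dtype=int)
for j in range(p):
    u = nodes[j]
    while True:                       # walk from leaf j up to the root
        A[j, nodes.index(u)] = 1
        if u not in parent:
            break
        u = parent[u]

def membership_from_support(A, nonzero_cols):
    """Group variables whose rows of A, restricted to the support of
    Gamma, coincide; returns integer group labels 0..K-1."""
    AV = A[:, nonzero_cols]
    _, labels = np.unique(AV, axis=0, return_inverse=True)
    return labels.reshape(-1)

# Zeroing out all gamma_u except those of u13 and the root (as in
# Figure 3) merges {1, 2, 3} and {4, 5} into K = 2 aggregated nodes:
labels = membership_from_support(A, [nodes.index("u13"), nodes.index("root")])
```

Each row of A has a one for the leaf itself and each of its ancestors, so here every row sums to three; keeping only the u13 and root columns makes leaves 1–3 share one row pattern and leaves 4–5 another.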
In such cases, the intermediate levels in the tree provide a direct, natural labelling of the corresponding aggregated variables. We discuss the option to perform aggregation guided by more general graph-based structures in Section 6.

3. Tree-Aggregated Graphical Lasso

To achieve dimension reduction via node aggregation and edge sparsity simultaneously, we extend optimization problem (1) by incorporating the parameterization introduced above. Our estimator, called the tag-lasso, is defined as

    (Ω̂, Γ̂, D̂) = argmin_{Ω,Γ,D} { −log det(Ω) + tr(SΩ) + λ₁‖Γ‖^{−r}_{2,1} + λ₂‖Ω_{−diag}‖₁
        s.t.  Ω = Ω⊤, Ω ≻ 0, γ_r = γ1_p, Ω = AΓ + D, D diagonal, D_{jj} ≥ 0 for j = 1, ..., p },   (4)

with ‖Γ‖^{−r}_{2,1} = ∑_{u ∈ T_{−r}} ‖γ_u‖₂ and T_{−r} being the set of all nodes in T other than the root r. This norm induces row-wise sparsity on all non-root rows of Γ. This row-wise sparsity, in turn, induces node aggregation as explained in Section 2.2. The root is excluded from this penalty term so that in the extreme case of a large λ₁ one gets complete aggregation but not necessarily sparsity (in this extreme, all off-diagonal elements of Ω̂ are equal to the scalar γ that appears in the equality constraint involving γ_r). While λ₁ controls the degree of node aggregation, λ₂ controls the degree of edge sparsity. When λ₁ = 0, the optimization problem in (4) reduces to the glasso.

The tag-lasso estimator imposes a block structure on the rows of Ω̂ − D̂ via the constraint Ω = AΓ + D. From the sparsity pattern of Γ̂, we obtain V̂ = {u ∈ T : γ̂_u ≠ 0}, the set of non-zero rows in Γ̂, such that we can write Ω̂ − D̂ = AΓ̂ = A_V̂ Γ̂_V̂. Algorithm 1 in Appendix A details how the membership matrix M̂ can be obtained from the original binary matrix A and the set V̂. Due to the enforced symmetry of the precision matrix, the block structure imposed on the rows of Ω̂ − D̂ holds likewise for its columns. The next proposition then shows how the solution provided by the tag-lasso can be re-written in G-block format.
The proof of the proposition is included in Appendix A.

Proposition 2 Given a solution (Ω̂, Γ̂, D̂) to the tag-lasso problem, there exists a p × K partition matrix M̂ and a symmetric K × K matrix Ĉ such that Ω̂ = M̂ĈM̂⊤ + D̂.

The G-block structure imposed by the tag-lasso (in case K < p) implies that the tag-lasso is especially useful to apply when one believes that dimension reduction can be leveraged in terms of node aggregation in addition to edge sparsity.

Finally, note that optimization problem (4) fits into the general formulation of penalized graphical models given in (1), since it can be equivalently expressed as

    Ω̂ = argmin_Ω { −log det(Ω) + tr(SΩ) + λ₁P_aggregate(Ω) + λ₂P_sparse(Ω)  s.t.  Ω = Ω⊤, Ω ≻ 0 },

where

    P_aggregate(Ω) = min_{Γ,D} { ‖Γ‖^{−r}_{2,1}  s.t.  γ_r = γ1_p, Ω = AΓ + D, D diagonal, D_{jj} ≥ 0 for j = 1, ..., p }

and P_sparse(Ω) is the ℓ₁-norm defined in (2). The proposed tag-lasso estimator thus determines the node aggregation and edge sparsity while also producing an estimate of the precision matrix by solving one convex optimization problem. One might wonder how such a one-stage procedure compares to a two-stage procedure in which the level of node aggregation is determined first and, secondly, the glasso is applied using the aggregated nodes determined in the first stage. In Appendix B we detail this two-stage procedure and compare its performance to the tag-lasso estimator through a simulation study. Across various simulation designs, we find that the proposed (one-stage) tag-lasso estimator provides important improvements in terms of estimation accuracy over such a two-stage benchmark, as the latter struggles to retrieve the correct aggregation level.

3.1 Locally Adaptive Alternating Direction Method of Multipliers

We develop an alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011), specifically tailored to solving (4).
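To give a flavor of the updates such an algorithm cycles through, the sketch below shows two generic ADMM building blocks for problems of this form (standard proximal operators, not the authors' exact Appendix C updates; numpy assumed): row-wise group soft-thresholding for the ℓ2,1 penalty on Γ, and the eigendecomposition-based update for the −log det(Ω) + tr(SΩ) term (Boyd et al., 2011):

```python
import numpy as np

def prox_group_rows(V, thresh):
    """Row-wise group soft-thresholding: prox of thresh * sum_u ||v_u||_2.
    Rows with norm below thresh are zeroed out (driving node aggregation)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresh / np.maximum(norms, 1e-12))
    return scale * V

def prox_logdet(Z, S, rho):
    """Minimize -log det(W) + tr(S W) + (rho/2) ||W - Z||_F^2 over W > 0.
    Closed form via the eigendecomposition of rho * Z - S."""
    evals, Q = np.linalg.eigh(rho * Z - S)
    w = (evals + np.sqrt(evals ** 2 + 4.0 * rho)) / (2.0 * rho)
    return Q @ (w[:, None] * Q.T)   # Q diag(w) Q^T, always positive definite
```

The optimality condition of the second prox is W^{−1} = S + ρ(W − Z), which makes a convenient unit test; the rescaled eigenvalues w are strictly positive, so the update automatically stays in the positive definite cone.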
Our ADMM algorithm is based on solving this equivalent formulation of (4):

    min_{Ω^{(1)},Ω^{(2)},Ω^{(3)},Γ^{(1)},Γ^{(2)},Ω,Γ,D} { −log det(Ω^{(1)}) + tr(SΩ^{(1)}) + λ₁‖Γ^{(1)}‖^{−r}_{2,1} + λ₂‖Ω^{(3)}_{−diag}‖₁
        s.t.  Ω^{(1)} = Ω^{(1)⊤}, Ω^{(1)} ≻ 0, γ^{(1)}_r = γ^{(1)}1_p, Ω^{(2)} = AΓ^{(2)} + D, D diagonal, D_{jj} ≥ 0, j = 1, ..., p,
              Ω = Ω^{(1)} = Ω^{(2)} = Ω^{(3)} and Γ = Γ^{(1)} = Γ^{(2)} }.   (5)

Additional copies of Ω and Γ are introduced to efficiently decouple the optimization problem. Furthermore, we use an extension called locally adaptive ADMM (LA-ADMM; Xu et al., 2017) with adaptive penalization to improve performance. The full details of the algorithm are provided in Appendix C. The computational complexity of the algorithm in terms of the number of variables p and the size of the tree |T| is O(p|T|²) × (number of iterations + 1), since the initialization step as well as each per-iteration update is at most O(p|T|²). Bounds on the number of iterations for ADMM and LA-ADMM can be found in Xu et al. (2017).

3.2 Selection of the Tuning Parameters

To select the tuning parameters λ₁ and λ₂, we form a 10 × 10 grid of (λ₁, λ₂) values and find the pair that minimizes the 5-fold cross-validated likelihood-based score

    ∑_{k=1}^{5} { −log det(Ω̂^{−F_k}) + tr(S^{F_k} Ω̂^{−F_k}) },   (6)

where Ω̂^{−F_k} is an estimate of the precision matrix trained while withholding the samples in the kth fold F_k, and S^{F_k} is the sample covariance matrix computed on the kth fold. In particular, we take Ω̂^{−F_k} to be a re-fitted version of our estimator (e.g., Belloni and Chernozhukov, 2013). After fitting the tag-lasso, recall that we obtain V̂ = {u ∈ T : γ̂_u ≠ 0}, the set of non-zero rows in Γ̂, which suggests a particular node aggregation, and Ê = {(i, j) : Ω̂_{ij} ≠ 0}, the set of non-zero elements in Ω̂, which suggests a particular edge sparsity structure. We then re-estimate Ω by maximizing the likelihood subject to these aggregation and sparsity constraints:

    min_{Ω,Γ_V̂,D} −log det(Ω) + tr(SΩ)
        subject to  Ω = Ω⊤, Ω ≻ 0, γ_{V̂,r} = γ1_p, Ω = A_V̂ Γ_V̂ + D, D diagonal, D_{jj} ≥ 0 for j = 1, ..., p,
                    Ω_{ij} = 0 for (i, j) ∉ Ê.   (7)
We solve this with an LA-ADMM algorithm similar to the one described in Section 3.1 and Appendix C.

3.3 Connections to Related Work

Combined forms of dimension reduction in graphical models can be found in, amongst others, Chandrasekaran et al. (2012); Tan et al. (2015); Eisenach et al. (2020); Brownlees et al. (2020); Pircalabelu and Claeskens (2020). Chandrasekaran et al. (2012) consider a blend of principal component analysis with graphical modeling by combining sparsity with a low-rank structure. Tan et al. (2015) and Eisenach et al. (2020) both propose two-step procedures that first cluster variables in an initial dimension reduction step and subsequently estimate a cluster-based graphical model. Brownlees et al. (2020) introduce partial correlation network models with community structures but rely on the sample covariance matrix of the observations to perform spectral clustering. Our procedure differs from these works by introducing a single convex optimization problem that simultaneously induces aggregation and edge sparsity for the precision matrix. Our work is most closely related to Pircalabelu and Claeskens (2020), who estimate a penalized graphical model and simultaneously classify nodes into communities. However, Pircalabelu and Claeskens (2020) do not use tree-based node aggregation. Our approach, in contrast, considers the tree T an important part of the problem: it helps determine the extent of node aggregation, and as a consequence the number of aggregated nodes (i.e., clusters, communities, or blocks) K, in a data-driven way through the guidance of the tree-based structure on the nodes.

4. Simulations

We investigate the advantages of jointly exploiting node aggregation and edge sparsity in graphical models. To this end, we compare the performance of the tag-lasso to two benchmarks:

(i) oracle: The aggregated, sparse graphical model in (7) is estimated subject to the true aggregation and sparsity constraints.
The oracle is only available for simulated data and serves as a best-case benchmark.

(ii) glasso: This does not perform any aggregation (corresponding to the tag-lasso with λ₁ = 0). A sparse graph on the full set of variables is estimated. The glasso is computed using the same LA-ADMM algorithm as detailed in Appendix C. The tuning parameter is selected from a 10-dimensional grid as the value that minimizes the 5-fold cross-validation likelihood-based score in equation (6), with Ω̂^{−F_k} taken to be the glasso estimate.

All simulations were performed using the simulator package (Bien, 2016) in R (R Core Team, 2017). We evaluate the estimators in terms of three performance metrics: estimation accuracy, aggregation performance, and sparsity recovery. We evaluate estimation accuracy by averaging over many simulation runs the Kullback-Leibler (KL) distance

    KL = −log det(ΣΩ̂) + tr(ΣΩ̂) − p,

where Σ = Ω^{−1} is the true covariance matrix. Note that the KL distance is zero if the estimated precision matrix equals the true precision matrix.

Figure 4: Four aggregation designs: chain, random, unbalanced, and unstructured graphs with corresponding precision matrix (top) and graph on the set of aggregated nodes (bottom).

To evaluate aggregation performance, we use two measures: the Rand index (Rand, 1971) and the adjusted Rand index (Hubert and Arabie, 1985). Both indices measure the degree of similarity between the true partition of the set of nodes 1, ..., p and the estimated partition. The Rand index ranges from zero to one, where one means that both partitions are identical. The adjusted Rand index performs a re-scaling to account for the fact that random chance will cause some variables to occupy the same group.
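In code, the KL metric is a short function (a sketch outside the paper; numpy assumed, with `slogdet` used for numerical stability):

```python
import numpy as np

def kl_distance(Sigma, Omega_hat):
    """KL-based distance -log det(Sigma @ Omega_hat) + tr(Sigma @ Omega_hat) - p;
    equals zero exactly when Omega_hat = inv(Sigma)."""
    p = Sigma.shape[0]
    P = Sigma @ Omega_hat
    _, logdet = np.linalg.slogdet(P)
    return -logdet + np.trace(P) - p
```

By convexity of −log det, this quantity is nonnegative for any positive definite estimate and vanishes only at the truth.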
Finally, to evaluate sparsity recovery, we use the false positive and false negative rates

    FPR = #{(i, j) : Ω̂_{ij} ≠ 0 and Ω_{ij} = 0} / #{(i, j) : Ω_{ij} = 0}   and
    FNR = #{(i, j) : Ω̂_{ij} = 0 and Ω_{ij} ≠ 0} / #{(i, j) : Ω_{ij} ≠ 0}.

The FPR reports the fraction of truly zero components of the precision matrix that are estimated as nonzero. The FNR gives the fraction of truly nonzero components of the precision matrix that are estimated as zero.

4.1 Simulation Designs

Data are drawn from a multivariate normal distribution with mean zero and covariance matrix Σ = Ω^{−1}. We take p = 15 variables and investigate the effect of increasing the number of variables in Section 4.3. We consider four different simulation designs, shown in Figure 4, each having a different combination of aggregation and sparsity structures for the precision matrix Ω. Aggregation is present in the first three structures, where the precision matrix has a G-block structure with K = 3 blocks. In Section 4.4, we investigate the effect of varying the number of blocks. In the chain graph, adjacent aggregated groups are connected through an edge.

Figure 5: A simple tree used for the tag-lasso ideal (left) and a more realistic tree used for the tag-lasso realistic (right).

This structure corresponds to the motivating example of Section 1. In the random graph, one non-zero edge in the aggregated network is chosen at random. In the unbalanced graph, the clusters are of unequal size. In the unstructured graph, no aggregation is present. Across all designs, we take the diagonal elements of Ω to be 1, the elements within a block of aggregated variables to be 0.5, and the non-zero elements across blocks to be 0.25. We generate 100 different data sets for every simulation design and use a sample size of n = 120. The number of parameters (p + p(p − 1)/2 = 120) thus equals the sample size.
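These rates translate directly into code. The sketch below (outside the paper; numpy assumed) restricts the counts to off-diagonal entries, since the diagonal of a precision matrix is always nonzero — an implementation choice on our part:

```python
import numpy as np

def sparsity_rates(Omega_hat, Omega, tol=1e-10):
    """False positive rate and false negative rate of an estimated support."""
    off = ~np.eye(Omega.shape[0], dtype=bool)     # off-diagonal mask
    est = np.abs(Omega_hat[off]) > tol            # estimated nonzeros
    true = np.abs(Omega[off]) > tol               # true nonzeros
    fpr = (est & ~true).sum() / max((~true).sum(), 1)
    fnr = (~est & true).sum() / max(true.sum(), 1)
    return fpr, fnr
```

The `tol` threshold guards against floating-point noise when deciding whether an estimated entry counts as zero.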
The tag-lasso estimator relies on the existence of a tree to perform node dimension reduction. We consider two different tree structures throughout the simulation study. First, we use an ideal tree, which contains the true aggregation structure as the sole aggregation level between the leaves and the root of the tree. As an example, the true aggregation structure for the chain graph is shown in the left panel of Figure 5. We form A corresponding to this oracle tree to obtain the tag-lasso ideal estimator. We also consider a more realistic tree, shown in the right panel of Figure 5, following a construction similar to that of Yan and Bien (2021). The tree is formed by performing hierarchical clustering of p latent points chosen to ensure that the tree contains the true aggregation structure and that these true clusters occur across a variety of depths. In particular, we generate K cluster means µ₁, ..., µ_K with µ_i = 1/i. We set the number of latent points associated with each of the K means equal to the cluster sizes from Figure 4. These latent points are then drawn independently from N(µ_i, [0.05 min_j(µ_i − µ_j)]²). Finally, we form A corresponding to this tree to obtain the tag-lasso realistic estimator.

4.2 Results

We subsequently discuss the results on estimation accuracy, aggregation performance, and sparsity recovery.

Estimation Accuracy. Boxplots of the KL distances for the three estimators (tag-lasso ideal, tag-lasso realistic, and glasso) relative to the oracle are given in Figure 6. The first three panels correspond to simulation designs with aggregation structures. In these settings, the tag-lasso estimators considerably outperform the glasso, on average by a factor of five. The tag-lasso ideal method performs nearly as well as the oracle. Comparing the tag-lasso realistic method to the tag-lasso ideal method suggests a minimal price paid for using a more realistic tree.
Figure 6: Estimation accuracy of the three estimators relative to the oracle.

Figure 7: Estimation accuracy of the tag-lasso estimators for the chain design with fixed p = 15 and increasing sample size n.

The unstructured panel of Figure 6 shows a case in which there is sparsity but no aggregation in the true data-generating model. As expected, the glasso performs best in this case: its average KL distance is around 0.41 versus 0.51 for the tag-lasso estimators. We thus observe a minimal cost to applying the tag-lasso, which encompasses the glasso as a special case when λ₁ = 0. The tag-lasso estimators can indeed reject the additional prior information provided by the tree, thereby selecting a small λ₁ value in the grid and hence a dense Γ̂ and no node aggregation. In fact, the tag-lasso estimators include on average around 11 nodes in the aggregated graph, which is relatively close to the p = 15 original variables it should include, thereby explaining the small loss in estimation accuracy compared to the glasso.

While the paper does not contain theoretical results on the consistency of the tag-lasso estimator, we do investigate this numerically. We consider the chain design with p = 15 and increase the sample size from n = 120 to 240, 480, and 960. Figure 7 contains the results for both tag-lasso estimators.
As expected, we see that the estimation accuracy gradually increases (equivalently, the KL distance decreases) for both tag-lasso estimators as the sample size increases relative to the fixed number of variables.

Table 2: Aggregation performance of the three estimators, as measured by the Rand index (RI) and adjusted Rand index (ARI), for the four simulation designs. Standard errors are in parentheses.

                      chain                  random                 unbalanced             unstructured
Estimator             RI         ARI        RI         ARI        RI         ARI        RI         ARI
tag-lasso ideal       1.00 (.00) 1.00 (.01) 1.00 (.00) 1.00 (.00) 1.00 (.00) 0.99 (.01) 0.84 (.02) NA
tag-lasso realistic   0.95 (.01) 0.88 (.01) 0.97 (.01) 0.93 (.01) 0.94 (.01) 0.85 (.02) 0.81 (.02) NA
glasso                0.71 (.00) 0.00 (.00) 0.71 (.00) 0.00 (.00) 0.67 (.00) 0.00 (.00) 1.00 (.00) NA

Table 3: Sparsity recovery of the three estimators, as measured by the false positive rate (FPR) and false negative rate (FNR), for the four simulation designs. Standard errors are in parentheses.

                      chain                  random                 unbalanced             unstructured
Estimator             FPR        FNR        FPR        FNR        FPR        FNR        FPR        FNR
tag-lasso ideal       0.22 (.04) 0.00 (.00) 0.19 (.04) 0.00 (.01) 0.46 (.05) 0.00 (.00) 0.06 (.01) 0.15 (.01)
tag-lasso realistic   0.30 (.04) 0.02 (.01) 0.13 (.02) 0.09 (.01) 0.44 (.04) 0.05 (.01) 0.05 (.01) 0.14 (.01)
glasso                0.80 (.02) 0.08 (.01) 0.73 (.01) 0.09 (.01) 0.82 (.02) 0.07 (.01) 0.16 (.01) 0.04 (.01)

Aggregation Performance. Table 2 summarizes the aggregation performance of the three estimators in terms of the Rand index (RI) and adjusted Rand index (ARI). No ARI results are reported for the unstructured simulation design since the ARI cannot be computed for a partition consisting entirely of singletons. The tag-lasso estimators perform very well. If one can rely on an oracle tree, the tag-lasso perfectly recovers the aggregation structure, as reflected in the perfect (A)RI values of the tag-lasso ideal method.
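Both indices compare two partitions through pair counting and can be computed from a contingency table. A self-contained sketch (note the final division: the ARI is undefined when the chance-corrected denominator vanishes, as for two all-singleton partitions, which is why the unstructured column reports NA):

```python
from math import comb
from collections import Counter

def rand_indices(labels_a, labels_b):
    """Plain and adjusted Rand index between two partitions of the same items."""
    n = len(labels_a)
    pairs = comb(n, 2)
    # Contingency counts between the two labelings.
    cont = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in cont.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Rand index: fraction of item pairs on which the partitions agree.
    ri = (pairs + 2 * sum_ij - sum_a - sum_b) / pairs
    # Adjusted Rand index: RI corrected for chance agreement; undefined
    # (division by zero) when max_index == expected.
    expected = sum_a * sum_b / pairs
    max_index = (sum_a + sum_b) / 2
    ari = (sum_ij - expected) / (max_index - expected)
    return ri, ari
```

Relabeling the clusters leaves both indices unchanged, so the estimated blocks need not carry the same labels as the true ones.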
Even when the tag-lasso uses a more complex tree structure, it recovers the correct aggregation structure in the vast majority of cases. The glasso returns a partition of singletons, as it is unable to perform dimension reduction through aggregation; this can be seen from its zero ARI values.

Sparsity Recovery. Table 3 summarizes the results on sparsity recovery (FPR and FNR). The tag-lasso estimators enjoy favorable FPRs and FNRs, mostly excluding the irrelevant conditional dependencies (as reflected by their low FPRs) and including the relevant conditional dependencies (as reflected by their low FNRs). In the simulation designs with aggregation, the glasso pays a large price for not being able to reduce dimensionality through aggregation, leading it to include too many irrelevant conditional dependencies, as reflected in its large FPRs. In the unstructured design, the rates of all estimators are, overall, low.

4.3 Increasing the Number of Nodes

We investigate the sensitivity of our results to an increasing number of variables p. We focus on the chain simulation design from Section 4.1 and subsequently double p from 15 to 30, 60, and 120, while keeping the number of blocks K fixed at three. The sample size n is set proportional to the complexity of the model, as measured by Kp + p. Hence, the sample sizes corresponding to the increasing values of p are, respectively, n = 120, 240, 480, 960, thereby keeping the ratio of the sample size to the model complexity fixed at two.

Figure 8: Estimation accuracy of the four estimators (on a log scale) for an increasing number of variables p (with fixed K = 3, left panel) and an increasing number of blocks K (with fixed p = 30, right panel).
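The FPR and FNR above are computed on the off-diagonal support of the precision matrix, i.e., on the edge set of the graphical model. A minimal sketch (treating entries below a small tolerance as zero is our convention, not stated in the excerpt):

```python
import numpy as np

def sparsity_rates(omega_true, omega_hat, tol=1e-8):
    """False positive / false negative rates on the off-diagonal support."""
    p = omega_true.shape[0]
    # Only the strictly upper triangle: each edge of the graph once.
    iu = np.triu_indices(p, k=1)
    true_edge = np.abs(omega_true[iu]) > tol
    est_edge = np.abs(omega_hat[iu]) > tol
    # FPR: fraction of true non-edges that are estimated as edges.
    fpr = np.sum(est_edge & ~true_edge) / max(np.sum(~true_edge), 1)
    # FNR: fraction of true edges that are missed.
    fnr = np.sum(~est_edge & true_edge) / max(np.sum(true_edge), 1)
    return fpr, fnr
```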
In each setting, the number of parameters to be estimated is large (120, 465, 1830, and 7260, respectively) and thus increasing relative to the sample size. The left panel of Figure 8 shows the mean KL distance (on a log scale) of the four estimators as a function of p. As the number of nodes increases, the estimation accuracy of the tag-lasso estimators and the oracle increases slightly. For fixed K and increasing p, the aggregated nodes, which can be thought of as averages of p/K random variables, may become more stable, which explains why the problem does not get harder with increasing p for the methods with node aggregation. By contrast, the glasso, which is unable to exploit the aggregation structure, performs worse as p increases. For p = 120, for instance, the tag-lasso estimators outperform the glasso by a factor of 50. Results on aggregation performance and sparsity recovery are presented in Figure 15 of Appendix D. The tag-lasso ideal method perfectly recovers the aggregation structure for all values of p. The tag-lasso realistic method's aggregation performance is close to perfect and remains relatively stable as p increases. The glasso is unable to detect the aggregation structure, as expected and as reflected in its zero ARIs. The tag-lasso estimators also maintain a better balance between FPR and FNR than the glasso. While their FPRs increase as p increases, their FNRs remain close to zero, hence almost all relevant conditional dependencies are recovered. The glasso, in contrast, fails to recover the majority of relevant conditional dependencies when p = 60 and 120, which explains its considerable drop in estimation accuracy.

Estimator             RI     ARI    FPR    FNR
tag-lasso ideal       1.00   1.00   0.02   0.55
tag-lasso realistic   0.89   0.84   0.58   0.28
glasso                0.67   0.00   0.05   0.93

Figure 9: High-dimensional design with p = 150, n = 120. Left: Estimation accuracy of the three estimators relative to the oracle (KL distance relative to oracle).
Right: Aggregation performance and sparsity recovery of the three estimators. Standard errors around the reported results are all smaller than 0.05 and are thus not reported.

4.4 Increasing the Number of Blocks

Next, we investigate the effect of increasing the number of blocks K. We take the chain simulation design from Section 4.1 and increase the number of blocks from K = 3 to K = 5, 6, 10, while keeping the number of variables fixed at p = 30. The right panel of Figure 8 shows the mean KL distance (on a log scale) of the four estimators as a function of K. As one would expect, the difference between the aggregation methods and the glasso decreases as K increases. However, for all K considered, the glasso performs far worse than the aggregation-based methods. Similar conclusions hold in terms of aggregation and sparsity recovery performance. Detailed results are presented in Figure 16 of Appendix D. The tag-lasso ideal method performs as well as the oracle in terms of capturing the aggregation structure; the tag-lasso realistic method performs close to perfectly, and its aggregation performance improves with increasing K. In terms of sparsity recovery, the tag-lasso estimators hardly miss relevant conditional dependencies and include only a small number of irrelevant ones. The glasso's sparsity recovery performance is worse overall but does improve with increasing K.

4.5 High-dimensional p > n Design

Finally, we investigate the performance of the tag-lasso in a high-dimensional design where the number of variables p exceeds the sample size n. To this end, we consider the chain design with p = 150 and n = 120. The left panel of Figure 9 presents boxplots of the KL distances of the tag-lasso estimators and the glasso relative to the oracle. In this high-dimensional design, the tag-lasso estimators pay a larger price in terms of estimation accuracy compared to the oracle. The same holds for the tag-lasso realistic method compared to the tag-lasso ideal method.
Still, both tag-lasso estimators considerably outperform the glasso. The right panel of Figure 9 summarizes the aggregation performance and sparsity recovery of the tag-lasso estimators and the glasso. The aggregation performance of the tag-lasso estimators remains high. Moreover, they balance false positives and false negatives better than the glasso. While the tag-lasso ideal method mainly displays a high FNR, the tag-lasso realistic method also suffers from returning an overly dense graph, as can be seen from its higher FPR. The glasso, in contrast, cannot handle the many non-zero elements in the true precision matrix in combination with the small sample size. To tackle the dimensionality, it returns an overly sparse solution, as can be seen from its high FNR.

5. Applications

We consider two applications: a financial application (Section 5.1) and a microbiome application (Section 5.2).

5.1 Financial Application

We demonstrate our method on a financial data set containing daily realized variances of p = 31 stock market indices from across the world in 2019 (n = 254). Daily realized variances based on five-minute returns are taken from the Oxford-Man Institute of Quantitative Finance (publicly available at http://realized.oxford-man.ox.ac.uk/data/download). Following standard practice, all realized variances are log-transformed. An overview of the stock market indices is provided in Appendix E. We encode similarity between the 31 stock market indices according to geographical region and use the tree shown in Figure 10 to apply the tag-lasso estimator. Since observations on consecutive days are time-dependent, we first fit the popular and simple heterogeneous autoregressive (HAR) model of Corsi (2009) to each of the individual log-transformed realized variance series. Graphical displays of the residual series of these 31 HAR models suggest that almost all autocorrelation in the series is captured.
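The HAR prewhitening step can be sketched in a few lines: the model regresses today's log realized variance on the previous day's value and on rolling weekly (5-day) and monthly (22-day) averages. This is a minimal OLS sketch of Corsi's specification, not the paper's implementation; the 5- and 22-day windows are the conventional choices.

```python
import numpy as np

def har_residuals(log_rv):
    """Fit the HAR model by OLS and return in-sample residuals.

    Regress log RV_t on an intercept, the previous day's value, and
    5- and 22-day rolling means (the "weekly" and "monthly" components).
    """
    y, X = [], []
    for t in range(22, len(log_rv)):
        X.append([1.0,
                  log_rv[t - 1],                 # daily lag
                  np.mean(log_rv[t - 5:t]),      # weekly average
                  np.mean(log_rv[t - 22:t])])    # monthly average
        y.append(log_rv[t])
    X, y = np.array(X), np.array(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta
```

Applying this per index and passing the 31 residual series to the graphical-model estimators removes most of the serial dependence that would otherwise violate the i.i.d. sampling assumption.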
We then apply the tag-lasso to the residual series to learn the conditional dependency structure among the stock market indices.

Estimated Graphical Model. We fit the tag-lasso estimator, with 5-fold cross-validation to select the tuning parameters, to the full data set, with the matrix A encoding the tree structure in Figure 10. The tag-lasso returns a solution with K = 6 aggregated blocks; the sparsity pattern of the full estimated precision matrix is shown in the top left panel of Figure 11. The coloring of the row labels and the numbering of the columns convey the membership of each variable in the aggregated blocks (to avoid clutter, only the first column of each block is labeled). Dimension reduction mainly occurs through node aggregation, as can be seen from the aggregated precision matrix in the bottom left panel of Figure 11. The resulting aggregated graphical model is rather dense, with about half of the off-diagonal entries of the estimated aggregated precision matrix being non-zero, thereby suggesting strong volatility connectedness. The solution returned by the tag-lasso estimator consists of one single-market block (block 5: Canada) and five multi-market blocks, which vary in size. The Australian, South American, and all Asian stock markets form one aggregated block (block 6).

Figure 10: Geography-based tree for the stock market data, which aggregates the p = 31 stock market indices (leaves) over several sub-continents (North-America, South-America, North-Europe, Central-Europe, South-Europe, West-Europe) towards a single root. Leaves, which represent individual stock markets, are displayed horizontally.

Note that the tag-lasso has aggregated these markets merely because they share the same non-dependence structure (i.e., all of these markets are estimated to be conditionally independent of each other and of all other markets).
The remaining aggregated nodes concern the US market (block 4) and three European blocks: North-Europe (block 1), Central- and South-Europe & STOXX50E (block 2), and West-Europe (block 3). In the aggregated network, the latter two European blocks and the US play a central role, as they are the most strongly connected nodes: these three nodes are connected to each other, the US node is additionally connected to Canada, and the two European nodes are additionally connected with North-Europe.

Out-of-sample Performance. We conduct an out-of-sample exercise to compare the tag-lasso estimator to the glasso estimator. We take a random sample of n = 203 observations (80% of the full data set) to form a training sample covariance matrix, use the remaining data to form a test sample covariance matrix Stest, and repeat this procedure ten times. We fit both the tag-lasso and glasso estimators to the training covariance matrix, with 5-fold cross-validation on the training data to select the tuning parameters. Next, we compute their corresponding out-of-sample errors on the test data, as in (6). The top right panel of Figure 11 shows each of these ten test errors for the tag-lasso (x-axis) and the glasso (y-axis). In all ten replicates the points lie well above the 45-degree line: the tag-lasso attains a substantially lower test error than the glasso. This indicates that jointly exploiting edge and node dimension reduction is useful for precision matrix estimation in this context.

5.2 Microbiome Application

We next turn to a data set of gut microbial amplicon data in HIV patients (Rivera-Pinto et al., 2018), where our goal is to estimate an interpretable graphical model capturing the interplay between different taxonomic groups of the microbiome. Bien et al.
(2020) recently showed that tree-based aggregation in a supervised setting leads to parsimonious predictive models.

Figure 11: Stock market indices data. Top left: Sparsity pattern (non-zeros in black) of the full Ω̂, with the aggregation structure conveyed through row-label coloring and column numbering. Top right: Test errors across the ten replications (dots) for the tag-lasso versus the glasso. Bottom: Aggregated graph for the K = 6 nodes obtained with the tag-lasso, as an adjacency matrix (bottom left) and as a network (bottom right) with the size of each node proportional to the number of original variables it aggregates.

The data set has n = 152 HIV patients, and we apply the tag-lasso estimator to all p = 104 bacterial operational taxonomic units (OTUs) that have non-zero counts in over half of the samples. We use the taxonomic tree that arranges the OTUs into natural hierarchical groupings of taxa: 17 genera, 11 families, five orders, five classes, three phyla, and one kingdom (the root node). We employ a standard data transformation from the field of compositional data analysis (see e.g., Aitchison, 1982) called the centered log-ratio (clr) transformation, which is commonly used in microbiome graphical modeling (Kurtz et al., 2015; Lo and Marculescu, 2018; Kurtz et al., 2019). After transformation, Kurtz et al. (2015) apply the glasso, Lo and Marculescu (2018) incorporate phylogenetic information into the glasso's optimization problem through weights within the ℓ1-penalty, and Kurtz et al. (2019) estimate a latent graphical model which combines sparsity with a low-rank structure.
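The clr transform maps each sample of relative abundances to log-ratios against the sample's geometric mean. A minimal sketch; the pseudo-count of 0.5 used to guard against log(0) is an arbitrary choice of ours, not taken from the paper.

```python
import numpy as np

def clr(counts, pseudo=0.5):
    """Centered log-ratio transform of a compositional count matrix.

    Rows are samples, columns are OTUs; a pseudo-count avoids log(0).
    """
    x = np.asarray(counts, dtype=float) + pseudo
    logx = np.log(x)
    # Subtract each sample's mean log abundance (log of geometric mean).
    return logx - logx.mean(axis=1, keepdims=True)
```

By construction each transformed row sums to zero, which is the source of the induced dependence between otherwise independent components discussed below.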
Figure 12: Microbiome data. Full precision matrix (left) and aggregated precision matrix (right) estimated by the tag-lasso with an unconstrained five-fold cross-validation (top) and with a cross-validation subject to the constraint that there are at most ten blocks (bottom).

We instead use the tag-lasso to learn a sparse aggregated network from the clr-transformed microbiome compositions. While the clr-transform induces dependence between otherwise independent components, Proposition 1 in Cao et al. (2019) provides intuition that, as long as the underlying graphical model is sparse and p is large, these induced dependencies may have minimal effect on the covariance matrix. Future work could more carefully account for the induced dependence, incorporating ideas from Cao et al. (2019) or Kurtz et al. (2019).

Estimated Graphical Model. We fit the tag-lasso to the full data set and use 5-fold cross-validation to select the tuning parameters. The tag-lasso estimator provides a sparse aggregated graphical model with K = 28 aggregated blocks (a substantial reduction in nodes from the original p = 104 OTUs). The top panel of Figure 12 shows the sparsity pattern of the p × p estimated precision matrix (top left) and of the K × K estimated aggregated precision matrix (top right). A notable feature of the tag-lasso solution is that it returns a wide range of aggregation levels: the aggregated network consists of 17 OTUs, 7 nodes aggregated to the genus level (these nodes start with "g"), 3 aggregated to the family level (these nodes start with "f"), and 1 node aggregated to the kingdom level (this node starts with "k"). Some aggregated nodes, such as the g_Blautia node (block 19), contain all OTUs within their taxa; other aggregated nodes, indicated with an asterisk like the k_Bacteria* node (block 28), have some of their OTUs missing. This latter block consists of 18 OTUs from across the phylogenetic tree that are estimated to be conditionally independent of all other OTUs in the data set.

Figure 13: Microbiome data. Left: Aggregated network estimated by the constrained CV version of the tag-lasso. The colour of the nodes is based on their level of aggregation (OTU: pink, genus: orange, family: blue); their width is proportional to the number of OTUs they aggregate. Middle: Network estimated by the glasso. Right: Test errors across the ten replications for the unconstrained (solid black) and constrained (unfilled blue) CV versions of the tag-lasso versus the glasso.

While the tag-lasso determines the aggregation level in a data-driven way through cross-validation, practitioners or researchers may also sometimes wish to restrict the number of blocks K to a pre-determined level, when such prior knowledge is available or when this is desirable for interpretability. As an illustration, we consider a constrained cross-validation scheme in which we restrict the number of blocks K to at most ten and select the sparsity parameters with the best cross-validated error among those solutions with K ≤ 10. The bottom panel of Figure 12 shows the sparsity pattern of the full and aggregated precision matrices estimated by this constrained version of the tag-lasso.
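The constrained selection rule amounts to filtering the cross-validation grid before taking the minimum. A minimal sketch; `fits` is a hypothetical list of per-candidate summaries, not an object from the taglasso package.

```python
# Each candidate fit is summarized by its cross-validated error and the
# number of aggregated blocks K its solution produces.
def select_constrained(fits, max_blocks=10):
    """Best cross-validated fit among those with at most max_blocks blocks."""
    admissible = [f for f in fits if f["K"] <= max_blocks]
    return min(admissible, key=lambda f: f["cv_error"])
```

With `max_blocks` set above the largest K on the grid, the rule reduces to ordinary (unconstrained) cross-validation.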
The resulting network consists of K = 8 aggregated nodes. The k_Bacteria* node now aggregates 78 OTUs that are estimated to be conditionally independent of each other and of all others. The interactions among the remaining nodes are shown in the left panel of Figure 13; this network consists of three OTUs (OTU134, OTU156, and OTU161, in pink), three genera (Prevotella, Bacteroides, and Alistipes, in orange), and one family (Porphyromonadaceae, in blue). The resulting network is much simpler than the one estimated by the glasso, shown in the middle panel of Figure 13. The glasso finds 58 OTUs to be conditionally independent of all others, but the interactions among the remaining 46 OTUs are much more difficult to interpret. The glasso is limited to working at the OTU level, which prevents it from providing insights about interactions that span different levels of the taxonomy.

Out-of-sample Performance. We conduct the same out-of-sample exercise as described in Section 5.1. The right panel of Figure 13 presents the ten test errors (black dots) for the unconstrained CV tag-lasso and the glasso. In all but one case, the tag-lasso leads to a better fit than the glasso, suggesting that it is better suited for modeling the conditional dependencies among the OTUs. The unfilled blue dots show the same comparison for the constrained CV tag-lasso. In all ten cases, it underperforms the unconstrained CV tag-lasso (see the shift to the right on the horizontal axis); however, its performance is on a par with the glasso, with test errors close to the 45-degree line. Thus, there does not appear to be a cost in out-of-sample performance to the interpretability gains of the constrained tag-lasso over the glasso.

6. Conclusion

Detecting conditional dependencies between variables, as represented in a graphical model, forms a cornerstone of multivariate data analysis.
However, graphical models, characterized by a set of nodes and edges, can quickly explode in dimensionality due to the ever-increasing fine-grained levels of resolution at which data are measured. In many applications, a tree is available that organizes the measured variables into various meaningful levels of resolution. In this work, we introduce the tag-lasso, a novel estimation procedure for graphical models that curbs this curse of dimensionality through joint node and edge dimension reduction by leveraging this tree as side information. Node dimension reduction is achieved by a penalty that allows nodes to be aggregated according to the tree structure; edge dimension reduction is achieved through a standard sparsity-inducing penalty. As such, the tag-lasso generalizes the popular glasso approach to sparse graphical modelling. An R package called taglasso implements the proposed method and is available on the GitHub page (https://github.com/ineswilms/taglasso) of the first author.

The tree is a crucial ingredient for performing node aggregation with the tag-lasso and opens up several interesting avenues for future research. On the one hand, multiple trees could be available. For instance, we could think of aggregating stock data by sector, transaction volume, or capitalization. In such cases, it would be interesting to let the trees compete in, for instance, a cross-validation exercise from which the tree that best fits the data is selected to guide the node aggregation. On the other hand, some applications lack a tree but have more general graph-based structures available that could guide the node aggregation. To this end, it would be interesting to further investigate how the machinery of graph fusion penalties (see e.g., Wang et al., 2016) could be leveraged. Finally, we do not provide theory in the form of estimation error bounds for the tag-lasso estimator. Possible starting directions to this end can be found in Rothman et al.
(2008) or Ravikumar et al. (2011). It would be interesting to investigate the effect of different tree structures on the bounds. In Ravikumar et al. (2011), for instance, the maximal node degree is a relevant quantity. For our penalty on Ω, which involves an internal optimization problem over Γ and D, it would be interesting to understand the relevant quantity that captures the interplay between the tree structure and the true node aggregation structure.

Acknowledgments

We thank the referees for their constructive comments, which substantially improved the quality of the manuscript. We thank Christian Müller for useful discussions. Jacob Bien was supported in part by NSF CAREER Award DMS-1653017 and NIH Grant R01GM123993.

Appendix A. Proof of Propositions

A.1 Proof of Proposition 1

Proof. First, note that $\tilde X$ follows a K-dimensional multivariate normal distribution with mean zero and covariance matrix $M^\top(D + MCM^\top)^{-1}M$. Next, we re-write this covariance matrix by two successive applications of equation (23) in Henderson and Searle (1981):

$$M^\top(D + MCM^\top)^{-1}M = M^\top D^{-1}M - M^\top D^{-1}M\big(I + CM^\top D^{-1}M\big)^{-1}CM^\top D^{-1}M = \Big[\big(M^\top D^{-1}M\big)^{-1} + C\Big]^{-1}.$$

Hence, the precision matrix of $\tilde X$ is given by $(M^\top D^{-1}M)^{-1} + C$. Now since $(M^\top D^{-1}M)^{-1}$ is diagonal, $c_{ij} = 0 \iff \tilde X_i \perp \tilde X_j \mid \tilde X_{-\{i,j\}}$, for any $i, j = 1, \ldots, K$, with $\tilde X_{-\{i,j\}}$ containing all aggregated variables except for aggregates $i$ and $j$.

A.2 Proof of Proposition 2

Proof. Denote the tag-lasso solution by $(\hat\Omega, \hat\Gamma, \hat D)$ and let $\hat V = \{u : \hat\gamma_u \neq 0\}$ be the set of non-zero rows in $\hat\Gamma$. Define a partial ordering over the nodes of the tree so that $u \preceq v$ implies $u \notin \mathrm{ancestor}(v)$, and label the nodes of $\hat V$ so that $u_1 \preceq u_2 \preceq \cdots \preceq u_{|\hat V|}$. Consider the following algorithm to obtain the partition matrix $\hat M$. We work our way through the nodes in $\hat V$ in a bottom-up fashion (according to the tree). In a Gram-Schmidt-like procedure, for each node, we subtract out the contributions of its descendants.
This approach achieves orthogonality while preserving the binary matrix structure. Note that the approach may result in one or more zero columns in the partition matrix, which are removed at the end before returning the final partition matrix $\hat M$.

Algorithm 1 Compute partition matrix from tag-lasso solution
Input: $A$, $\hat V$
for $\ell = 1, \ldots, |\hat V|$ do
    $M_{u_\ell} \leftarrow A_{u_\ell} - \sum_{u \in \mathrm{descendant}(u_\ell) \cap \hat V} M_u$   (8)
Form the $p \times K$ matrix $\hat M$ (with $K \leq |\hat V|$) by taking as columns all $M_{u_\ell}$ such that $M_{u_\ell} \neq 0$
Output: $\hat M$

To prove that $\hat M$ is a partition matrix, we need to show that it is a binary matrix with $\hat M^\top \hat M$ a diagonal matrix (i.e., orthogonal columns). Let $T_u$ denote the subtree rooted at $u \in T$. For any $u \in \hat V$, equation (8) implies that

$$A_u = M_u + \sum_{u' \in \mathrm{descendant}(u) \cap \hat V} M_{u'} = \sum_{u' \in T_u \cap \hat V} M_{u'}. \quad (9)$$

By equation (9), we can then re-write equation (8) as

$$M_a = A_a - \sum_{u \in \mathrm{children}(a)} \sum_{u' \in T_u \cap \hat V} M_{u'} = A_a - \sum_{u \in \mathrm{children}(a)} A_u. \quad (10)$$

Now $\mathrm{supp}(A_u) \cap \mathrm{supp}(A_v) = \emptyset$ for $u, v \in \mathrm{children}(a)$ with $u \neq v$, where $\mathrm{supp}(\cdot)$ denotes the support. Using that $\mathrm{supp}(A_u) \subseteq \mathrm{supp}(A_a)$ for $u \in \mathrm{children}(a)$, it follows that $M_a \in \{0, 1\}^p$ and $\mathrm{supp}(M_a) \subseteq \mathrm{supp}(A_a)$. Now, consider $a, b \in \hat V$, $a \neq b$, with the following three cases:

1. If $b \in \mathrm{descendant}(a)$, then

$$M_b^\top M_a = M_b^\top \Big(A_a - \sum_{u \in \mathrm{descendant}(a) \cap \hat V} M_u\Big) = M_b^\top A_b - M_b^\top \sum_{u \in \mathrm{descendant}(a) \cap \hat V} M_u = M_b^\top A_b - M_b^\top \sum_{u \in T_b \cap \hat V} M_u = M_b^\top A_b - M_b^\top A_b = 0. \quad (11)$$

The second equality follows from $\mathrm{supp}(M_b) \subseteq \mathrm{supp}(A_b) \subseteq \mathrm{supp}(A_a)$; the third follows since $\mathrm{supp}(M_b) \subseteq \mathrm{supp}(A_b)$ and $\mathrm{descendant}(a) \cap T_b = T_b$, recalling that $b \in \mathrm{descendant}(a)$; and the fourth follows from equation (9).

2. If $a \in \mathrm{descendant}(b)$, then $M_b^\top M_a = 0$ follows from case 1.

3. If $a$ and $b$ are in disjoint branches, then $\mathrm{supp}(A_a) \cap \mathrm{supp}(A_b) = \emptyset$, so $M_a^\top M_b = 0$.

We have thus constructed a partition matrix $\hat M$, namely $\hat M \in \{0, 1\}^{p \times K}$ with $\hat M^\top \hat M$ diagonal. By construction, $\hat M_u$ is in the column space of $A_{\hat V}$, denoted $\hat M_u \in \mathcal{L}_{\mathrm{col}}(A_{\hat V})$, for each $u \in \hat V$; conversely, equation (9) expresses each $A_u$, $u \in \hat V$, as a sum of columns of $\hat M$. Hence, $\mathcal{L}_{\mathrm{col}}(\hat M) = \mathcal{L}_{\mathrm{col}}(A_{\hat V})$. It then follows that $\hat\Omega - \hat D = A\hat\Gamma = A_{\hat V}\hat\Gamma_{\hat V} \in \mathcal{L}_{\mathrm{col}}(\hat M)$.
By symmetry of $\hat\Omega - \hat D$, $\hat\Omega - \hat D$ is also in the row space of $\hat M$. We can therefore write $\hat\Omega - \hat D = \hat M\hat M^+(\hat\Omega - \hat D)(\hat M^+)^\top \hat M^\top$, where $\hat M^+ = (\hat M^\top \hat M)^{-1}\hat M^\top$, so that $\hat M\hat M^+$ is the projection matrix onto the column space of $\hat M$. Taking $\hat C = \hat M^+(\hat\Omega - \hat D)(\hat M^+)^\top$ then establishes the result.

Appendix B. Tag-lasso Estimator Compared to a Two-stage Benchmark

We compare the performance of the tag-lasso estimator to a two-stage benchmark. For the benchmark, we first apply the tag-lasso estimator with λ2 = 0 to solely determine the level of node aggregation. Second, we apply the graphical lasso with a hard aggregation constraint (similar to the one used in equation (7)), as provided by the outcome of the first stage. We use the simulation designs detailed in Section 4.1 to compare the estimation accuracy of this two-stage estimator to the tag-lasso and the regular glasso. For the tag-lasso and two-stage estimators, both the ideal and realistic tree structures are used. Figure 14 presents the results on KL distance for the five estimators across the four simulation designs. The tag-lasso estimators provide, overall, a considerable improvement in terms of estimation accuracy over their two-stage benchmarks. For the simulation designs where node aggregation is present (chain, random, unbalanced), the two-stage estimators do still outperform the glasso on average.

Figure 14: Estimation accuracy of the tag-lasso estimators compared to the two-stage estimators and the glasso (KL distance; designs: chain, random, unbalanced, unstructured).

Appendix C. Details of the LA-ADMM Algorithm

The augmented Lagrangian of (5) is given by

$$\begin{aligned}
& -\log\det(\Omega^{(1)}) + \mathrm{tr}(S\Omega^{(1)}) + 1\{\Omega^{(1)} = \Omega^{(1)\top},\ \Omega^{(1)} \succ 0\} + \langle U^{(1)}, \Omega^{(1)} - \Omega\rangle + \tfrac{\rho}{2}\|\Omega^{(1)} - \Omega\|_F^2 \\
&\quad + \lambda_1\|\Gamma^{(1)}_{-r}\|_{2,1} + 1\{\gamma^{(1)}_r = \gamma^{(1)}1_p\} + \langle U^{(4)}, \Gamma^{(1)} - \Gamma\rangle + \tfrac{\rho}{2}\|\Gamma^{(1)} - \Gamma\|_F^2 \\
&\quad + 1\{\Omega^{(2)} = A\Gamma^{(2)} + D,\ D \text{ diagonal},\ D_{jj} \geq 0\} + \langle U^{(2)}, \Omega^{(2)} - \Omega\rangle + \tfrac{\rho}{2}\|\Omega^{(2)} - \Omega\|_F^2 + \langle U^{(5)}, \Gamma^{(2)} - \Gamma\rangle + \tfrac{\rho}{2}\|\Gamma^{(2)} - \Gamma\|_F^2 \\
&\quad + \lambda_2\|\Omega^{(3)}_{-\mathrm{diag}}\|_1 + \langle U^{(3)}, \Omega^{(3)} - \Omega\rangle + \tfrac{\rho}{2}\|\Omega^{(3)} - \Omega\|_F^2, \quad (12)
\end{aligned}$$
where $U^{(i)}$ (for $i = 1, \ldots, 5$) are the dual variables and $\rho$ is a penalty parameter. Note that equation (12) is of the same form as Equation (3.1) in Boyd et al. (2011) and thus involves iterating three basic steps: (i) minimization with respect to $(\Omega^{(1)}, \Omega^{(2)}, \Omega^{(3)}, \Gamma^{(1)}, \Gamma^{(2)}, D)$, (ii) minimization with respect to $(\Omega, \Gamma)$, and (iii) an update of $(U^{(1)}, \ldots, U^{(5)})$. Step (i) decouples into four independent problems, whose solutions are worked out in Sections C.1-C.4. Step (ii) involves the minimization of a differentiable function of $\Omega$ and $\Gamma$ and boils down to the calculation of simple averages, as shown in Section C.5. Step (iii)'s update of the dual variables is provided in Section C.6. Algorithms 2-3 then provide an overview of the LA-ADMM algorithm used to solve problem (5). We use the LA-ADMM algorithm with $\rho_1 = 0.01$, $T_{\mathrm{stages}} = 10$, and maxit = 100.

Algorithm 2 LA-ADMM
Input: $S$, $A$, $p$, $|T|$, $\lambda_1$, $\lambda_2$, $\rho_1$, maxit, $T_{\mathrm{stages}}$
Initialization: Set $\hat\Omega_0 \leftarrow 0$; $\hat\Gamma_0 \leftarrow 0$
for $t = 1, \ldots, T_{\mathrm{stages}}$ do
    $(\hat\Omega_t, \hat\Gamma_t, \hat D_t) \leftarrow \mathrm{ADMM}(S, A, p, |T|, \lambda_1, \lambda_2, \rho_t, \mathrm{maxit}, \hat\Omega_{t-1}, \hat\Gamma_{t-1})$
Output: $\hat\Omega_{T_{\mathrm{stages}}}$, $\hat\Gamma_{T_{\mathrm{stages}}}$, $\hat D_{T_{\mathrm{stages}}}$

C.1 Solving for $\Omega^{(1)}$

Minimizing the augmented Lagrangian with respect to $\Omega^{(1)}$ gives $\hat\Omega^{(1)}_{k+1}$, the solution to the optimization problem

$$\min_{\Omega^{(1)}} \Big\{ -\log\det(\Omega^{(1)}) + \mathrm{tr}(S\Omega^{(1)}) + \langle U^{(1)}, \Omega^{(1)} - \Omega\rangle + \tfrac{\rho}{2}\|\Omega^{(1)} - \Omega\|_F^2 \ \text{ s.t. } \Omega^{(1)} = \Omega^{(1)\top},\ \Omega^{(1)} \succ 0 \Big\}$$
$$= \min_{\Omega^{(1)}} \Big\{ -\log\det(\Omega^{(1)}) + \mathrm{tr}(S\Omega^{(1)}) + \tfrac{\rho}{2}\big\|\Omega^{(1)} - (\hat\Omega_k - \hat U^{(1)}_k/\rho)\big\|_F^2 \ \text{ s.t. } \Omega^{(1)} = \Omega^{(1)\top},\ \Omega^{(1)} \succ 0 \Big\}.$$

The solution should satisfy the first-order optimality condition

$$\rho\hat\Omega^{(1)}_{k+1} - \big(\hat\Omega^{(1)}_{k+1}\big)^{-1} = \rho\hat\Omega_k - \hat U^{(1)}_k - S. \quad (13)$$

This means that the eigenvectors of $\hat\Omega^{(1)}_{k+1}$ are the same as the eigenvectors of $\rho\hat\Omega_k - \hat U^{(1)}_k - S$ and that the eigenvalues of $\hat\Omega^{(1)}_{k+1}$ are a simple function of the eigenvalues of $\rho\hat\Omega_k - \hat U^{(1)}_k - S$. Consider the orthogonal eigenvalue decomposition of the right-hand side: $\rho\hat\Omega_k - \hat U^{(1)}_k - S = Q\Lambda Q^\top$, where $\Lambda = \mathrm{diag}(\delta_1, \ldots, \delta_p)$ and $QQ^\top = Q^\top Q = I$.
Multiplying (13) by $Q^\top$ on the left and $Q$ on the right gives

$$\rho\tilde\Omega^{(1)}_{k+1} - \big(\tilde\Omega^{(1)}_{k+1}\big)^{-1} = \Lambda, \quad \text{with } \tilde\Omega^{(1)}_{k+1} = Q^\top\hat\Omega^{(1)}_{k+1}Q.$$

The solution is then

$$\hat\Omega^{(1)}_{k+1} = Q\tilde\Omega^{(1)}_{k+1}Q^\top, \quad \text{with } \tilde\Omega^{(1)}_{k+1,jj} = \frac{\delta_j + \sqrt{\delta_j^2 + 4\rho}}{2\rho}. \quad (14)$$

C.2 Solving for $\Gamma^{(1)}$

Minimizing the augmented Lagrangian with respect to $\Gamma^{(1)}$ gives

$$\hat\Gamma^{(1)}_{k+1} := \arg\min_{\Gamma^{(1)}} \Big\{ \tfrac{\rho}{2}\big\|\Gamma^{(1)} - (\hat\Gamma_k - \hat U^{(4)}_k/\rho)\big\|_F^2 + \lambda_1\|\Gamma^{(1)}_{-r}\|_{2,1} \ \text{ s.t. } \gamma^{(1)}_r = \gamma^{(1)}1_p \Big\}.$$

The solution is groupwise soft-thresholding:

$$\hat\Gamma^{(1)}_{k+1,j} = \begin{cases} S_G\big(\hat\Gamma_{k,j} - \hat U^{(4)}_{k,j}/\rho,\ \lambda_1/\rho\big), & \text{if } j \in \{1, \ldots, |T|\}\setminus\{r\} \\ \hat\gamma_k 1_p, & \text{if } j = r, \end{cases} \quad (15)$$

with the groupwise soft-thresholding operator $S_G(\gamma, \lambda) = \max(1 - \lambda/\|\gamma\|_2, 0)\gamma$ applied to $\gamma \in \mathbb{R}^p$, and $\hat\gamma_k$ equal to the average of the $p$-dimensional vector $\hat\Gamma_{k,r} - \hat U^{(4)}_{k,r}/\rho$.

Algorithm 3 ADMM
Input: $S$, $A$, $p$, $|T|$, $\lambda_1$, $\lambda_2$, $\rho$, maxit, $\Omega_0$, $\Gamma_0$.
Initialization:
    Set $\hat\Omega^{(i)}_0 \leftarrow \hat U^{(i)}_0 \leftarrow \Omega_0$ for $i = 1, \ldots, 3$, and $\hat\Gamma^{(j)}_0 \leftarrow \hat U^{(j+3)}_0 \leftarrow \Gamma_0$ for $j = 1, 2$
    $\tilde A \leftarrow (A^\top : I_{|T|})^\top$; $\tilde A^+ \leftarrow (\tilde A^\top\tilde A)^{-1}\tilde A^\top$; $C \leftarrow (I_p : 0_{p\times|T|})^\top - \tilde A(\tilde A^\top\tilde A)^{-1}A^\top$
for $k = 1, \ldots,$ maxit do
    $\hat\Omega^{(1)}_k \leftarrow Q\tilde\Omega^{(1)}_k Q^\top$, see equation (14)
    $\hat\Omega^{(3)}_{k,ij} \leftarrow S\big(\hat\Omega_{k-1,ij} - \hat U^{(3)}_{k-1,ij}/\rho,\ \lambda_2/\rho\big)$, $i \neq j$, see equation (19)
    $\hat\Gamma^{(1)}_{k,j} \leftarrow S_G\big(\hat\Gamma_{k-1,j} - \hat U^{(4)}_{k-1,j}/\rho,\ \lambda_1/\rho\big)$, $j \in \{1, \ldots, |T|\}\setminus\{r\}$, and $\hat\Gamma^{(1)}_{k,r} \leftarrow \hat\gamma_{k-1}1_p$, see equation (15)
    $\mathrm{diag}(\hat D_k) \leftarrow \mathrm{diag}(C^\top C)^{-1}\mathrm{diag}(B^\top C)_+$, see equation (18)
    $\hat\Gamma^{(2)}_k \leftarrow \tilde A^+(\tilde M - \tilde D_k)$, see equation (17)
    $\hat\Omega^{(2)}_k \leftarrow A\hat\Gamma^{(2)}_k + \hat D_k$, see equation (16)
    $\hat\Omega_k \leftarrow (\hat\Omega^{(1)}_k + \hat\Omega^{(2)}_k + \hat\Omega^{(3)}_k)/3$; $\hat\Gamma_k \leftarrow (\hat\Gamma^{(1)}_k + \hat\Gamma^{(2)}_k)/2$
    $\hat U^{(i)}_k \leftarrow \hat U^{(i)}_{k-1} + \rho(\hat\Omega^{(i)}_k - \hat\Omega_k)$ for $i = 1, \ldots, 3$; $\hat U^{(j+3)}_k \leftarrow \hat U^{(j+3)}_{k-1} + \rho(\hat\Gamma^{(j)}_k - \hat\Gamma_k)$ for $j = 1, 2$
Output: $\hat\Omega_{\mathrm{maxit}}$, $\hat\Gamma_{\mathrm{maxit}}$, $\hat D_{\mathrm{maxit}}$

Note that in this appendix we use the capitalized $\Gamma_j$ notation to index the $j$th row of the matrix $\Gamma$, whereas we use lowercase $\gamma_u$ when indexing a node $u$ based on the tree structure in Section 2 of the main paper.
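The eigenvalue-based update in equation (14) can be sketched numerically as follows. This is a minimal illustration of the update rule rather than the paper's implementation: each eigenvalue of the solution solves the scalar equation ρω − 1/ω = δ_j with ω > 0.

```python
import numpy as np

def update_omega1(omega, u1, S, rho):
    """ADMM update for Omega^(1): solve rho*X - X^{-1} = rho*omega - u1 - S
    over symmetric positive definite X via an eigendecomposition."""
    R = rho * omega - u1 - S
    R = (R + R.T) / 2                      # symmetrize for stability
    delta, Q = np.linalg.eigh(R)
    # Positive root of rho*w^2 - delta*w - 1 = 0, per eigenvalue.
    w = (delta + np.sqrt(delta ** 2 + 4 * rho)) / (2 * rho)
    return (Q * w) @ Q.T
```

Because the positive root exists for any δ_j, the update always returns a positive definite matrix, which is what keeps the log-determinant term finite across iterations.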
C.3 Solving for $\Omega^{(2)}$, $\Gamma^{(2)}$, $D$

Minimizing the augmented Lagrangian with respect to $(\Omega^{(2)}, \Gamma^{(2)}, D)$ gives
\[
(\hat\Omega^{(2)}_{k+1}, \hat\Gamma^{(2)}_{k+1}, \hat D_{k+1}) := \mathop{\mathrm{argmin}}_{\Omega^{(2)}, \Gamma^{(2)}, D} \Big\{ \tfrac{\rho}{2}\big\|\Omega^{(2)} - (\hat\Omega_k - \hat U^{(2)}_k/\rho)\big\|_F^2 + \tfrac{\rho}{2}\big\|\Gamma^{(2)} - (\hat\Gamma_k - \hat U^{(5)}_k/\rho)\big\|_F^2
\]
\[
\qquad \text{ s.t. } \Omega^{(2)} = A\Gamma^{(2)} + D,\ D \text{ diagonal},\ D_{jj} \geq 0 \text{ for } j = 1, \ldots, p \Big\}.
\]
The solution
\[
\hat\Omega^{(2)}_{k+1} = A\hat\Gamma^{(2)}_{k+1} + \hat D_{k+1} \tag{16}
\]
is immediate, and we are left with
\[
(\hat\Gamma^{(2)}_{k+1}, \hat D_{k+1}) := \mathop{\mathrm{argmin}}_{\Gamma^{(2)}, D} \Big\{ \tfrac{1}{2}\big\|\tilde A \Gamma^{(2)} + \tilde D - \tilde M\big\|_F^2 \ \text{ s.t. } D \text{ diagonal},\ D_{jj} \geq 0 \text{ for } j = 1, \ldots, p \Big\},
\]
where we have substituted $\Omega^{(2)} = A\Gamma^{(2)} + D$ and we denote
\[
\tilde A = \begin{pmatrix} A \\ I_{|T|} \end{pmatrix} \in \mathbb{R}^{(p+|T|)\times|T|}, \quad \tilde D = \begin{pmatrix} D \\ 0_{|T|\times p} \end{pmatrix} \in \mathbb{R}^{(p+|T|)\times p}, \quad \tilde M = \begin{pmatrix} \hat\Omega_k - \hat U^{(2)}_k/\rho \\ \hat\Gamma_k - \hat U^{(5)}_k/\rho \end{pmatrix} \in \mathbb{R}^{(p+|T|)\times p}.
\]
The solution
\[
\hat\Gamma^{(2)}_{k+1} = (\tilde A^\top \tilde A)^{-1}\tilde A^\top(\tilde M - \tilde D_{k+1}) = (A^\top A + I_{|T|})^{-1}(A^\top : I_{|T|})(\tilde M - \tilde D_{k+1}) \tag{17}
\]
is immediate, and we are left with
\[
\hat D_{k+1} := \mathop{\mathrm{argmin}}_{D} \Big\{ \tfrac{1}{2}\big\|(\tilde M - \tilde D) - \tilde A(\tilde A^\top \tilde A)^{-1}\tilde A^\top(\tilde M - \tilde D)\big\|_F^2 \ \text{ s.t. } D \text{ diagonal},\ D_{jj} \geq 0,\ j = 1, \ldots, p \Big\}
\]
\[
= \mathop{\mathrm{argmin}}_{D} \Big\{ \tfrac{1}{2}\big\|\big(I_{p+|T|} - \tilde A(\tilde A^\top \tilde A)^{-1}\tilde A^\top\big)(\tilde M - \tilde D)\big\|_F^2 \ \text{ s.t. } D \text{ diagonal},\ D_{jj} \geq 0,\ j = 1, \ldots, p \Big\}
\]
\[
= \mathop{\mathrm{argmin}}_{D} \Big\{ \tfrac{1}{2}\|B - CD\|_F^2 \ \text{ s.t. } D \text{ diagonal},\ D_{jj} \geq 0,\ j = 1, \ldots, p \Big\},
\]
with
\[
B = \big(I_{p+|T|} - \tilde A(\tilde A^\top \tilde A)^{-1}\tilde A^\top\big)\tilde M \in \mathbb{R}^{(p+|T|)\times p}, \quad C = (I_p : 0_{p\times|T|})^\top - \tilde A(\tilde A^\top \tilde A)^{-1}A^\top \in \mathbb{R}^{(p+|T|)\times p}.
\]
The solution is
\[
\mathrm{diag}(\hat D_{k+1}) = \mathrm{diag}(C^\top C)^{-1}\,\mathrm{diag}(B^\top C)_+. \tag{18}
\]

C.4 Solving for $\Omega^{(3)}$

Minimizing the augmented Lagrangian with respect to $\Omega^{(3)}$ gives
\[
\hat\Omega^{(3)}_{k+1} := \mathop{\mathrm{argmin}}_{\Omega^{(3)}} \Big\{ \langle U^{(3)}, \Omega^{(3)} - \Omega \rangle + \tfrac{\rho}{2}\|\Omega^{(3)} - \Omega\|_F^2 + \lambda_2 \|\Omega^{(3)}_{-\mathrm{diag}}\|_1 \Big\}
\]
\[
= \mathop{\mathrm{argmin}}_{\Omega^{(3)}} \Big\{ \tfrac{\rho}{2}\big\|\Omega^{(3)} - (\hat\Omega_k - \hat U^{(3)}_k/\rho)\big\|_F^2 + \lambda_2 \|\Omega^{(3)}_{-\mathrm{diag}}\|_1 \Big\},
\]
where the second line follows by completing the square and dropping the term $\tfrac{1}{2\rho}\|U^{(3)}\|_F^2$, which does not depend on $\Omega^{(3)}$. The solution is simply elementwise soft-thresholding:
\[
\hat\Omega^{(3)}_{k+1,ij} = \begin{cases} S\big(\hat\Omega_{k,ij} - \hat U^{(3)}_{k,ij}/\rho,\ \lambda_2/\rho\big), & \text{if } i \neq j \\ \hat\Omega_{k,ij} - \hat U^{(3)}_{k,ij}/\rho, & \text{if } i = j, \end{cases} \tag{19}
\]
with the soft-thresholding operator $S(\omega, \lambda) = \mathrm{sign}(\omega)\max(|\omega| - \lambda, 0)$ applied to $\omega \in \mathbb{R}$.
C.5 Update of Variables $\Omega$ and $\Gamma$

Minimizing the augmented Lagrangian with respect to the variables $\Omega$ and $\Gamma$ gives
\[
\hat\Omega_{k+1} := \mathop{\mathrm{argmin}}_{\Omega} \sum_{i=1}^{3} \big\|\hat\Omega^{(i)}_{k+1} - (\Omega - \hat U^{(i)}_k/\rho)\big\|_F^2 = \bar\Omega_{k+1} + \bar U^\Omega_k/\rho, \tag{20}
\]
\[
\hat\Gamma_{k+1} := \mathop{\mathrm{argmin}}_{\Gamma} \sum_{i=1}^{2} \big\|\hat\Gamma^{(i)}_{k+1} - (\Gamma - \hat U^{(i+3)}_k/\rho)\big\|_F^2 = \bar\Gamma_{k+1} + \bar U^\Gamma_k/\rho, \tag{21}
\]
where
\[
\bar\Omega_k := \frac{\hat\Omega^{(1)}_k + \hat\Omega^{(2)}_k + \hat\Omega^{(3)}_k}{3}, \quad \bar U^\Omega_k := \frac{\hat U^{(1)}_k + \hat U^{(2)}_k + \hat U^{(3)}_k}{3}, \quad \bar\Gamma_k := \frac{\hat\Gamma^{(1)}_k + \hat\Gamma^{(2)}_k}{2}, \quad \bar U^\Gamma_k := \frac{\hat U^{(4)}_k + \hat U^{(5)}_k}{2}.
\]

C.6 Update of the Dual Variables

The updates of the dual variables are given by
\[
\hat U^{(i)}_{k+1} := \hat U^{(i)}_k + \rho\big(\hat\Omega^{(i)}_{k+1} - \hat\Omega_{k+1}\big), \quad \text{for } i = 1, \ldots, 3,
\]
\[
\hat U^{(j+3)}_{k+1} := \hat U^{(j+3)}_k + \rho\big(\hat\Gamma^{(j)}_{k+1} - \hat\Gamma_{k+1}\big), \quad \text{for } j = 1, 2.
\]
Similarly, averaging the first three updates and the latter two gives
\[
\bar U^\Omega_{k+1} := \bar U^\Omega_k + \rho\big(\bar\Omega_{k+1} - \hat\Omega_{k+1}\big), \tag{22}
\]
\[
\bar U^\Gamma_{k+1} := \bar U^\Gamma_k + \rho\big(\bar\Gamma_{k+1} - \hat\Gamma_{k+1}\big). \tag{23}
\]
Substituting (20) and (21) into (22) and (23) yields $\bar U^\Omega_{k+1} = \bar U^\Gamma_{k+1} = 0$ after the first iteration.

Appendix D. Additional Simulation Results

[Figure 15 appears here.] Figure 15: Simulation results for an increasing number of nodes $p$ ($p = 15, 30, 60, 120$). Top: aggregation performance (Rand Index: left; Adjusted Rand Index: right); bottom: sparsity recovery (false positive rate: left; false negative rate: right) of the four estimators (tag-lasso ideal, tag-lasso realistic, glasso, oracle).

[Figure 16 appears here.] Figure 16: Simulation results for an increasing number of blocks $K$ ($K = 3, 5, 6, 10$).
Top: aggregation performance (Rand Index: left; Adjusted Rand Index: right); bottom: sparsity recovery (false positive rate: left; false negative rate: right) of the four estimators.

Appendix E. Financial Application: Data Description

Abbreviation | Description | Location
DJI | Dow Jones Industrial Average | US
IXIC | Nasdaq 100 | US
SPX | S&P 500 Index | US
RUT | Russell 2000 | US
GSPTSE | S&P/TSX Composite Index | Canada
BVSP | BVSP BOVESPA Index | Brazil
MXX | IPC Mexico | Mexico
OMXC20 | OMX Copenhagen 20 Index | Denmark
OMXHPI | OMX Helsinki All Share Index | Finland
OMXSPI | OMX Stockholm All Share Index | Sweden
OSEAX | Oslo Exchange All-share Index | Norway
GDAXI | Deutscher Aktienindex | Germany
SSMI | Swiss Stock Market Index | Switzerland
BVLG | Portuguese Stock Index | Portugal
FTMIB | Financial Times Stock Exchange Milano Indice di Borsa | Italy
IBEX | Iberia Index 35 | Spain
SMSI | General Madrid Index | Spain
AEX | Amsterdam Exchange Index | Netherlands
BFX | Bell 20 Index | Belgium
FCHI | Cotation Assistée en Continu 40 | France
FTSE | Financial Times Stock Exchange 100 | UK
STOXX50E | EURO STOXX 50 | Europe
HSI | HANG SENG Index | Hong Kong
KS11 | Korea Composite Stock Price Index (KOSPI) | South Korea
N225 | Nikkei 225 | Japan
SSEC | Shanghai Composite Index | China
STI | Straits Times Index | Singapore
KSE | Karachi SE 100 Index | Pakistan
BSESN | S&P Bombay Stock Exchange Sensitive Index | India
NSEI | NIFTY 50 | India
AORD | All Ordinaries Index | Australia

Table 4: Financial Application: Data Description, as taken from https://realized.oxford-man.ox.ac.uk/data/assets.

References

J. Aitchison. The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2):139-160, 1982.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485-516, 2008.

A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521-547, 2013.

J. Bien.
The simulator: an engine to streamline simulations. arXiv preprint arXiv:1607.00021, 2016.

J. Bien, X. Yan, L. Simpson, and C. L. Müller. Tree-aggregated predictive modeling of microbiome data. bioRxiv, 2020.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

C. Brownlees, G. S. Gudmundsson, and G. Lugosi. Community detection in partial correlation network models. Journal of Business & Economic Statistics, (just-accepted):1-33, 2020.

F. Bunea, C. Giraud, X. Luo, M. Royer, and N. Verzelen. Model assisted variable clustering: minimax-optimal recovery and algorithms. The Annals of Statistics, 48(1):111-137, 2020.

T. Cai, W. Liu, and X. Luo. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594-607, 2011.

T. Cai, W. Liu, and H. Zhou. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. The Annals of Statistics, 44(2):455-488, 2016.

B. J. Callahan, P. J. McMurdie, and S. P. Holmes. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The ISME Journal, 11(12):2639-2643, 2017.

Y. Cao, W. Lin, and H. Li. Large covariance estimation for compositional data via composition-adjusted thresholding. Journal of the American Statistical Association, 114(526):759-772, 2019.

V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1935-1967, 2012.

F. Corsi. A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics, 7(2):174-196, 2009.

C. Eisenach, F. Bunea, Y. Ning, and C. Dinicu. High-dimensional inference for cluster-based graphical models. Journal of Machine Learning Research, 21(53):1-55, 2020.

J. Friedman, T. Hastie, and R.
Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

H. V. Henderson and S. R. Searle. On deriving the inverse of a sum of matrices. SIAM Review, 23(1):53-60, 1981.

L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193-218, 1985.

Z. D. Kurtz, C. L. Müller, E. R. Miraldi, D. R. Littman, M. J. Blaser, and R. A. Bonneau. Sparse and compositionally robust inference of microbial ecological networks. PLoS Computational Biology, 11(5):e1004226, 2015.

Z. D. Kurtz, R. Bonneau, and C. L. Müller. Disentangling microbial associations from hidden environmental and technical factors via latent graphical models. bioRxiv, 2019.

C. Lo and R. Marculescu. PGLasso: Microbial community detection through phylogenetic graphical lasso. arXiv preprint arXiv:1807.08039, 2018.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.

T. Millington and M. Niranjan. Quantifying influence in financial markets via partial correlation network inference. In 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA), pages 306-311. IEEE, 2019.

J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735-746, 2009.

E. Pircalabelu and G. Claeskens. Community-based group graphical lasso. Journal of Machine Learning Research, 21(64):1-32, 2020.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017. URL https://www.R-project.org/.

W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850, 1971.

P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu.
High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935-980, 2011.

J. Rivera-Pinto, J. J. Egozcue, V. Pawlowsky-Glahn, R. Paredes, M. Noguera-Julian, and M. L. Calle. Balances: a new perspective for microbiome analysis. mSystems, 3(4):1-12, 2018. doi: 10.1128/mSystems.00053-18. URL https://msystems.asm.org/content/3/4/e00053-18.

A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494-515, 2008.

K. M. Tan, D. Witten, and A. Shojaie. The cluster graphical lasso for improved estimation of Gaussian graphical models. Computational Statistics & Data Analysis, 85:23-36, 2015.

Y.-X. Wang, J. Sharpnack, A. J. Smola, and R. J. Tibshirani. Trend filtering on graphs. Journal of Machine Learning Research, 17(105):1-41, 2016.

Y. Xu, M. Liu, Q. Lin, and T. Yang. ADMM without a fixed penalty parameter: Faster convergence with new adaptive penalization. In Advances in Neural Information Processing Systems, pages 1267-1277, 2017.

X. Yan and J. Bien. Rare feature selection in high dimensions. Journal of the American Statistical Association, 116(534):887-900, 2021.

M. Yuan. High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11:2261-2286, 2010.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.