# SVD-Based Screening for the Graphical Lasso

Yasuhiro Fujiwara, Naoki Marumo, Mathieu Blondel, Koh Takeuchi, Hideaki Kim, Tomoharu Iwata, Naonori Ueda
NTT Communication Science Laboratories, NTT Software Innovation Center, Osaka University
{fujiwara.yasuhiro, marumo.naoki, mathieu.blondel, takeuchi.koh, kin.hideaki, iwata.tomoharu, ueda.naonori}@lab.ntt.co.jp

## Abstract

The graphical lasso is the most popular approach to estimating the inverse covariance matrix of high-dimensional data. It iteratively estimates each row and column of the matrix in a round-robin style until convergence. However, the graphical lasso is infeasible for large datasets due to its high computation cost. This paper proposes Sting, a fast approach to the graphical lasso. In order to reduce the computation cost, it efficiently identifies blocks in the estimated matrix that have nonzero elements before entering the iterations by exploiting the singular value decomposition (SVD) of the data matrix. In addition, it selectively updates elements of the estimated matrix expected to have nonzero values. Theoretically, it is guaranteed to converge to the same result as the original algorithm of the graphical lasso. Experiments show that our approach is faster than existing approaches.

## 1 Introduction

As a result of the development of Internet technologies, large datasets are common in many fields [Shiokawa et al., 2015; Mishima and Fujiwara, 2015; Nakatsuji et al., 2014]. Since real-world data typically has high dimensionality, many researchers are interested in estimating the graph structure of a Gaussian Markov random field in the many-variable setting. The graph structure is a natural representation of the conditional dependence among many variables: each node is a single variable, and an edge between two variables implies that they are conditionally dependent [Fujiwara et al., 2017; 2014]. For real-world data, the estimated graph structure typically results in a complete graph where every pair of nodes is connected by an edge. This is because real-world data inevitably includes noise; some of the dependencies indicated by the structure might be due to noise. In order to increase the robustness of the graph structure against data noise, it is important to estimate a sparse structure from high-dimensional data so that the graph structure can effectively represent the essential relationships among variables.

AI researchers have proposed approaches to estimating sparse undirected graphical models that exploit L1 regularization [Grechkin et al., 2015; Chen et al., 2010]. Many studies assume that the observations have a multivariate Gaussian distribution with mean µ and covariance matrix Σ. The conditional independence structure can then be read off the inverse covariance matrix $\Theta\,(=\Sigma^{-1})$: if the $(i,j)$-th element of matrix $\Theta$ is zero (i.e., $\Theta[i,j]=0$), then the $i$-th and $j$-th variables are conditionally independent. The approaches for estimating sparse graph structures impose an L1 penalty when estimating matrix $\Theta$ so as to increase its sparsity. The graphical lasso by Friedman et al. is the most popular approach [Friedman et al., 2008]. It is based on penalized maximum-likelihood estimation and has been widely adopted by statisticians and computer scientists [Zhao et al., 2015].
If $S$ is the empirical covariance matrix such that $S = X^\top X$, where $X$ is an $N \times P$ data matrix of $N$ observations and $P$ variables, the graphical lasso estimates the sparse graph by maximizing the following objective function:

$$\log\det\Theta - \mathrm{Tr}(S\Theta) - \rho\,\|\Theta\|_1 \tag{1}$$

where $\det$ denotes the determinant, $\mathrm{Tr}$ is the trace, and $\|\cdot\|_1$ is the L1 norm. In Equation (1), $\rho$ is a positive L1 regularization parameter that determines the sparseness of the estimated graph; if $\rho$ is large, the graph has greater sparsity. Note that the problem of Equation (1) is convex [Friedman et al., 2008]. The graphical lasso exploits a coordinate descent algorithm that updates the coefficients of the lasso to solve the problem [Fujiwara et al., 2016a; 2016b]. Specifically, it iteratively computes each row and column of the estimated matrix as a solution of the lasso problem in a round-robin manner until convergence. Since the graphical lasso obtains the exact maximum of the above L1-penalized log-likelihood, it can effectively estimate the graph structure from high-dimensional data.

Due to its high effectiveness, the graphical lasso is actively used on the front lines of various fields. One example is market risk management, where it is used to compute the portfolio risk of financial assets such as credit default swaps (CDS) [Neuberg and Glasserman, 2016]. In market risk management, it is crucial to accurately estimate a correlation matrix so as to minimize the risk of underestimating the true variance of the portfolio. By effectively reducing unimportant pairwise dependencies among assets, the graphical lasso can estimate the correlation matrix in an economically meaningful way. In fact, it has revealed high correlations such as New York Times–Gannett and Ford–General Motors. In addition, the graphical lasso is being used in detecting psychiatric disorders [Rosa et al., 2015]. Many psychiatric disorders, such as depression and schizophrenia, result from brain connectivity disorders. Thus, they can be effectively detected from the fMRI signals of brains. Beyond those above, the graphical lasso is used in many other applications such as generating novel hypotheses in strategic management [Li et al., 2016] and inferring biological networks [Vinciotti et al., 2016].

Although the graphical lasso is widely used, the original algorithm by Friedman et al. incurs high computation cost. In order to improve the efficiency of the graphical lasso, a screening approach was proposed to locate blocks in the estimated inverse covariance matrix that can have nonzero elements [Witten et al., 2011]. The approaches by Friedman et al. are a standard in estimating the inverse covariance matrix; the well-known implementation, glasso, is based on the coordinate descent algorithm and the screening approach [Friedman et al., 2015]. However, since the screening approach needs to compute all elements of the empirical covariance matrix before entering the iterations, it can incur high computation cost.

This paper proposes Sting as a novel and efficient approach for the graphical lasso. To improve the efficiency, we efficiently identify blocks in the estimated matrix that have nonzero elements by computing the singular value decomposition (SVD) of the data matrix. Since the use of SVD avoids computing all elements of the covariance matrix, we can improve the efficiency of the graphical lasso.
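For concreteness, the objective of Equation (1) can be evaluated directly for any candidate estimate of $\Theta$. The following is a minimal NumPy sketch of that evaluation (the function name `glasso_objective` and the toy data are ours, not part of the original method):

```python
import numpy as np

def glasso_objective(Theta, S, rho):
    """Penalized log-likelihood of Eq. (1): log det(Theta) - Tr(S Theta) - rho * ||Theta||_1."""
    sign, logdet = np.linalg.slogdet(Theta)   # numerically stable log-determinant
    if sign <= 0:
        return -np.inf                        # Theta must be positive definite
    return logdet - np.trace(S @ Theta) - rho * np.abs(Theta).sum()

# Toy example: S = X^T X for an N x P data matrix X, as defined in the text.
X = np.random.randn(50, 10)
S = X.T @ X
print(glasso_objective(np.eye(10), S, rho=0.5))
```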
In addition, we selectively update the coefficients of the lasso; we iteratively update only those coefficients expected to have nonzero values until convergence so as to increase efficiency. After convergence, we check that the unselected coefficients indeed do not have nonzero values. Since our approach safely prunes unnecessary update computations, it is provably guaranteed to converge to the same result as the original algorithm of the graphical lasso. With the enhanced efficiency, the proposed approach can improve the effectiveness of AI-based decision-making, risk management, and data science. In the case of psychiatric disorder detection, disease diagnosis is an important application of AI [Xu et al., 2016]; by supporting adequate institutional and home healthcare, the proposed approach can help delay the onset and progression of depression and schizophrenia and thereby reduce the burden of these psychiatric disorders.

The remainder of this paper is organized as follows: Section 2 describes related work. Section 3 gives an overview of our research background. Section 4 introduces our approach. Section 5 reviews our experiments and their results. Section 6 provides our conclusions.

## 2 Related Work

The screening approach proposed by Friedman et al. is the most famous way of improving the efficiency of the graphical lasso [Witten et al., 2011]. This approach computes small blocks of nonzero elements in the inverse covariance matrix. Theoretically, it is derived from the KKT conditions of the optimization problem of Equation (1). Therefore, it does not sacrifice the exact maximization property of the graphical lasso. Hsieh et al. proposed approximation approaches based on Newton's method [Hsieh et al., 2013]. They improve the efficiency of Newton's method in estimating the inverse covariance matrix, since a straightforward implementation of the method suffers high computation cost. However, unlike our proposal, this approach does not guarantee the same results as the graphical lasso. Liu et al. proposed StARS, which can effectively choose the regularization parameter for high-dimensional inference of undirected graphs based on edge stability [Liu et al., 2010]. It uses the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. Liu et al. also proposed a nonparanormal approach that uses a semiparametric Gaussian copula for high-dimensional inference [Liu et al., 2012]. It efficiently computes the inverse covariance matrix by using nonparametric rank-based statistics to estimate the correlation matrix. In this paper, we propose an approach that is more efficient than the above approaches.

## 3 Preliminaries

Given an $N \times P$ data matrix $X$ of $N$ observations and $P$ variables, the graphical lasso estimates the inverse matrix $\Theta$ of covariance matrix $\Sigma$ from matrix $X$ by maximizing the penalized log-likelihood of Equation (1) [Friedman et al., 2008]. Let $S$ be the empirical covariance matrix; the algorithm optimizes each row and corresponding column of $W = S + \rho I$ in a coordinate descent fashion, where $I$ is the identity matrix. Specifically, it first partitions matrices $W$ and $S$ as follows:

$$W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^\top & w_{22} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^\top & s_{22} \end{pmatrix}$$

where $W_{11}$ and $S_{11}$ are $(P-1)\times(P-1)$ matrices, $w_{12}$ and $s_{12}$ are column vectors of length $P-1$, and $w_{22}$ and $s_{22}$ are scalars.
The graphical lasso iteratively permutes the rows/columns of matrices $W$ and $S$ in the coordinate descent so that the target column is always the last one when computing the solution of the lasso. Let $\hat\beta[i]$ be the $i$-th coefficient of the lasso, corresponding to the $i$-th column of matrix $W_{11}$; the graphical lasso updates coefficient $\hat\beta[i]$ as follows:

$$\hat\beta[i] \leftarrow S\Bigl(s_{12}[i] - \sum_{j\ne i} W_{11}[j,i]\,\hat\beta[j],\ \rho\Bigr)\Big/\,W_{11}[i,i] \tag{3}$$

where $S(\cdot,\cdot)$ is the soft-threshold operator (i.e., $S(z,\rho)=\mathrm{sign}(z)(|z|-\rho)_+$), $s_{12}[i]$ is the $i$-th element of $s_{12}$, and $W_{11}[j,i]$ is the $(j,i)$-th element of $W_{11}$. Note that the diagonal of matrix $W$ remains unchanged during the iterations. In addition, the coefficients obtained by the lasso are typically sparse. If $\hat\beta$ is the column vector of length $P-1$ whose $i$-th element is $\hat\beta[i]$, the graphical lasso fills in the corresponding column and row of matrix $W$ by computing $w_{12} = W_{11}\hat\beta$. These procedures are repeated until matrix $W$ reaches convergence. The graphical lasso estimates the elements of inverse matrix $\Theta$ after the iterations. If $\hat\theta_{12}$ is the column vector of length $P-1$ in matrix $\Theta$ that corresponds to vector $w_{12}$, it computes the estimated matrix as $\hat\theta_{12} = -\hat\beta/(w_{22} - w_{12}^\top\hat\beta)$. This equation indicates that $\hat\theta_{12}$ is a rescaling of vector $\hat\beta$.

In order to reduce the high computation cost, Friedman et al. proposed the screening approach described in Section 2 [Witten et al., 2011]. It is based on the observation that the $i$-th variable is an isolated node in the estimated graph if $|S[i,j]| \le \rho$ holds for every $j \ne i$ in the empirical covariance matrix $S$. Specifically, it computes an adjacency matrix in which a pair of nodes is connected if $|S[i,j]| > \rho$ holds, and then computes the blocks of connected components. The screening approach limits the update computations to those coefficients that are contained in the connected components; it can effectively reduce the number of coefficients of the lasso by exploiting the block structures. The screening approach is the most popular approach for implementing the graphical lasso. In fact, the standard implementation of the graphical lasso, glasso, improves the efficiency by exploiting the screening approach [Friedman et al., 2015]. However, it still needs to compute all elements of the empirical covariance matrix to obtain the adjacency matrix. Therefore, the graphical lasso incurs high computation cost. If $T$ is the number of update computations for each row/column and $B$ is the number of nonzero coefficients of the lasso for each row/column of the estimated matrix, the graphical lasso requires $O(NP^2 + PBT)$ time.

## 4 Proposed Method

The proposed approach, Sting, can efficiently estimate the inverse covariance matrix and thereby reduce the high computation cost of the graphical lasso. First, we overview the idea underlying Sting. That is followed by a full description, including its theoretical properties.

### 4.1 Ideas

So as to reduce the number of coefficients updated in each iteration, the existing algorithms of the graphical lasso compute every covariance of matrix $S$ and add an edge if $|S[i,j]| > \rho$ holds to obtain the block structures. This implies that they must compute all the covariances of matrix $S$. In addition, they update all the coefficients obtained from the block structures in round-robin style even though almost all the coefficients are expected to be zero due to the sparseness property of the lasso.
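As a reference point for the round-robin behaviour just described, the coordinate update of Equation (3) can be sketched as follows. This is a minimal NumPy illustration of one sweep over all coefficients (variable and function names are ours; the outer convergence loop and the row/column permutation are omitted):

```python
import numpy as np

def soft_threshold(z, rho):
    """Soft-threshold operator S(z, rho) = sign(z) * (|z| - rho)_+ ."""
    return np.sign(z) * max(abs(z) - rho, 0.0)

def round_robin_sweep(W11, s12, beta, rho):
    """One round-robin sweep of the update in Eq. (3) over all P-1 coefficients."""
    for i in range(len(beta)):
        # residual of Eq. (3): exclude the i-th coefficient's own contribution
        r = s12[i] - (W11[:, i] @ beta - W11[i, i] * beta[i])
        beta[i] = soft_threshold(r, rho) / W11[i, i]
    return beta
```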
To reduce the computation cost, our approach efficiently computes upper and lower bounds of each covariance $S[i,j]$ by exploiting SVD. Only if the bounds indicate that the pair may have an edge do we compute $S[i,j]$ exactly; this avoids computing all the covariances of matrix $S$. In addition, Sting iteratively updates selected coefficients expected to have nonzero values until convergence. Since our approach updates only a few of the coefficients of the block structures, it can solve the lasso efficiently. Since our approach also checks whether the unselected coefficients have nonzero values, it is provably guaranteed to converge to the same results as the graphical lasso.

### 4.2 Upper and Lower Bounds

Before the iterations, we compute the upper and lower bounds of each covariance $S[i,j]$. This approach approximates the $N \times P$ data matrix $X$ by computing its SVD with target rank $n$ ($n \ll N$). SVD is an orthogonal transformation and gives high approximation quality in terms of squared loss [Press et al., 2007]. Let $U$ be the $N \times n$ matrix of left singular vectors and $\tilde X$ be the transformed matrix of size $n \times P$; matrix $X$ is then represented as $X \approx U\tilde X$. Therefore, if $x_i$ and $\tilde x_i$ denote the $i$-th column vectors of $X$ and $\tilde X$, respectively, we have $x_i \approx U\tilde x_i$. The upper bound is defined as follows:

**Definition 1** Let $\overline S[i,j]$ be the upper bound of covariance $S[i,j]$; $\overline S[i,j]$ is given as follows:

$$\overline S[i,j] = \tfrac{1}{2}\Bigl\{\|x_i\|_2^2 + \|x_j\|_2^2 - \textstyle\sum_{k=1}^{n}\bigl(\tilde x_i[k]-\tilde x_j[k]\bigr)^2\Bigr\} \tag{4}$$

where $\|\cdot\|_2$ is the L2-norm of a column vector. Note that we can efficiently compute the L2-norms of the column vectors in $O(NP)$ time before the iterations. Similarly, we can efficiently compute the SVD of matrix $X$ by using the existing approach of [Halko et al., 2011]; this takes $O(NP\log n)$ time. The lower bound is defined as follows:

**Definition 2** $\underline S[i,j]$ is the lower bound of covariance $S[i,j]$ and is given by the following equation:

$$\underline S[i,j] = \tfrac{1}{2}\Bigl\{-\|x_i\|_2^2 - \|x_j\|_2^2 + \textstyle\sum_{k=1}^{n}\bigl(\tilde x_i[k]+\tilde x_j[k]\bigr)^2\Bigr\} \tag{5}$$

We introduce the following lemma showing that $\overline S[i,j]$ and $\underline S[i,j]$ are indeed upper and lower bounds, respectively:

**Lemma 1** For each covariance $S[i,j]$ of matrix $S$, we have $\underline S[i,j] \le S[i,j] \le \overline S[i,j]$.

**Proof** We first prove that $S[i,j] \le \overline S[i,j]$ holds. For each element of column vectors $x_i$ and $x_j$, we have $x_i[k]x_j[k] = \tfrac{1}{2}\{(x_i[k])^2 + (x_j[k])^2 - (x_i[k]-x_j[k])^2\}$ [Blitzstein and Hwang, 2014]. Since $S = X^\top X$, we have

$$S[i,j] = x_i^\top x_j = \sum_{k=1}^{N} x_i[k]x_j[k] = \tfrac{1}{2}\sum_{k=1}^{N}\bigl\{(x_i[k])^2 + (x_j[k])^2 - (x_i[k]-x_j[k])^2\bigr\} = \tfrac{1}{2}\Bigl\{\|x_i\|_2^2 + \|x_j\|_2^2 - \sum_{k=1}^{N}(x_i[k]-x_j[k])^2\Bigr\}.$$

Since SVD is an orthogonal transformation, $\sum_{k=1}^{N}(x_i[k]-x_j[k])^2 = \sum_{k}(\tilde x_i[k]-\tilde x_j[k])^2$ holds when the transformation is taken at full rank. In addition, $(\tilde x_i[k]-\tilde x_j[k])^2 \ge 0$ holds for every $k$, so keeping only the first $n$ terms cannot increase the sum. Therefore, we have $S[i,j] \le \tfrac{1}{2}\{\|x_i\|_2^2 + \|x_j\|_2^2 - \sum_{k=1}^{n}(\tilde x_i[k]-\tilde x_j[k])^2\} = \overline S[i,j]$. Similarly, we have the property $\underline S[i,j] \le S[i,j]$, since $x_i[k]x_j[k] = \tfrac{1}{2}\{-(x_i[k])^2 - (x_j[k])^2 + (x_i[k]+x_j[k])^2\}$ holds for each element of column vectors $x_i$ and $x_j$. □

In terms of computation cost, it takes $O(n)$ time to compute the bounds of Definitions 1 and 2 once we have the L2-norm of each column vector of matrix $X$ and the SVD of the matrix. This indicates that we can compute the upper and lower bounds, $\overline S[i,j]$ and $\underline S[i,j]$, much faster than $S[i,j]$, since it needs $O(N)$ time to compute $S[i,j] = \sum_{k=1}^{N} x_i[k]x_j[k]$. As described in Section 3, the screening approach by Friedman et al. obtains the block structures by computing the adjacency matrix from matrix $S$. We can effectively identify edges in the adjacency matrix by exploiting the bounds.
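The bound computation of Definitions 1 and 2 can be sketched as follows. This is a minimal NumPy illustration (the function name `svd_bounds` is ours); for simplicity it uses an exact truncated SVD rather than the randomized SVD of [Halko et al., 2011], and it forms the two $P \times P$ bound matrices at once rather than one entry at a time:

```python
import numpy as np

def svd_bounds(X, n):
    """Upper and lower bounds on S[i, j] = x_i^T x_j (Definitions 1 and 2).

    Uses the exact column norms of X together with the rank-n transformed
    matrix X_tilde, whose i-th column plays the role of the vector x~_i.
    """
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    X_tilde = sigma[:n, None] * Vt[:n, :]          # n x P transformed matrix
    sq_norms = (X ** 2).sum(axis=0)                # ||x_i||_2^2, computed once in O(NP)

    t = (X_tilde ** 2).sum(axis=0)                 # ||x~_i||_2^2
    G = X_tilde.T @ X_tilde                        # x~_i^T x~_j for all pairs
    diff_sq = t[:, None] + t[None, :] - 2.0 * G    # sum_k (x~_i[k] - x~_j[k])^2
    sum_sq  = t[:, None] + t[None, :] + 2.0 * G    # sum_k (x~_i[k] + x~_j[k])^2

    S_upper = 0.5 * (sq_norms[:, None] + sq_norms[None, :] - diff_sq)   # Eq. (4)
    S_lower = 0.5 * (-sq_norms[:, None] - sq_norms[None, :] + sum_sq)   # Eq. (5)
    return S_upper, S_lower
```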
Theoretically, our approach is based on the following lemma:

**Lemma 2** The node pair of the $i$-th and $j$-th variables must satisfy $\overline S[i,j] > \rho$ or $\underline S[i,j] < -\rho$ if the node pair has an edge in the adjacency matrix.

**Proof** In order to prove Lemma 2, we show that the node pair of the $i$-th and $j$-th variables does not have an edge in the adjacency matrix if $\overline S[i,j] \le \rho$ and $\underline S[i,j] \ge -\rho$. As described in Section 3, the adjacency matrix has an edge between the $i$-th and $j$-th variables if and only if $|S[i,j]| > \rho$ holds. If $\overline S[i,j] \le \rho$ and $\underline S[i,j] \ge -\rho$, we have $-\rho \le \underline S[i,j] \le S[i,j] \le \overline S[i,j] \le \rho$ from Lemma 1. Therefore, the node pair does not have an edge in the adjacency matrix. As a result, if the node pair of the $i$-th and $j$-th variables has an edge, we have $\overline S[i,j] > \rho$ or $\underline S[i,j] < -\rho$. □

This lemma indicates that we can avoid computing $S[i,j]$ when obtaining the edges of the adjacency matrix if $\overline S[i,j] \le \rho$ and $\underline S[i,j] \ge -\rho$; we compute $S[i,j]$ only if $\overline S[i,j] > \rho$ or $\underline S[i,j] < -\rho$. Thus, we can avoid computing all the covariances of matrix $S$ when exploiting the screening approach.

### 4.3 Selective Update

As described in Section 3, the screening approach updates the coefficients of the block structures in round-robin style even though almost all the coefficients are zero due to the sparseness property of the lasso. In order to further improve efficiency, we update only those coefficients expected to have nonzero values until convergence and then check the unselected coefficients after the iterations. Our approach is based on the observation that, if $|S[i,j]| > \rho$, the $(i,j)$-th element of inverse matrix $\Theta$ is likely to have a nonzero value, and thus the corresponding coefficient of the lasso is expected to be nonzero [Langfelder and Horvath, 2008]. Note that vector $\hat\theta_{12}$ is a rescaling of vector $\hat\beta$, as described in Section 3.

Let $B_i$ be the block, obtained by the screening approach, that contains the $i$-th variable, and let $\mathcal B_i$ be the set of variables in that block, i.e., $\mathcal B_i = \{j \mid B_j = B_i\}$; the screening approach updates the $i$-th coefficient by exploiting $\mathcal B_i \setminus \{i\}$. Instead, our approach updates, via Equation (3), only the coefficients indexed by $U_i$ until convergence, where $U_i$ is initialized for the $i$-th variable as $U_i = \{j \mid |S[i,j]| > \rho,\ j \in \mathcal B_i,\ j \ne i\}$. Note that $U_i \subseteq \mathcal B_i$ clearly holds and that we can efficiently obtain $U_i$ by exploiting the upper and lower bounds. In addition, if $\overline U_i$ is the set $\overline U_i = \{j \mid j \notin U_i,\ j \in \mathcal B_i,\ j \ne i\}$, we check after the iterations whether the coefficients in $\overline U_i$ have nonzero values. Our approach thus exploits the set $U_i \cup \overline U_i$ in computing the $i$-th coefficient. Since $U_i \cup \overline U_i = \mathcal B_i \setminus \{i\}$ holds, our approach can obtain the same coefficient for the $i$-th variable as the screening approach that exploits $\mathcal B_i \setminus \{i\}$. This approach is theoretically based on the following property:

**Lemma 3** Let $\hat\beta'$ be a converged coefficient vector for the set $U_i \cup \overline U_i$. If (1) coefficient vector $\hat\beta$ reaches convergence after the iterations for set $U_i$ and (2) $\hat\beta[i] = 0$ and $\hat\beta'[i] = 0$ hold for all $i$ such that $i \in \overline U_i$, we have $\hat\beta' = \hat\beta$ after computing vector $\hat\beta'$ from vector $\hat\beta$ by using coordinate descent.

**Proof** Since $\hat\beta[i] = 0$ and $\hat\beta'[i] = 0$ hold for all $i \in \overline U_i$, we have $\hat\beta[i] = \hat\beta'[i]$ for all $i \in \overline U_i$. Moreover, since $\hat\beta'[i] = 0$ holds for all $i \in \overline U_i$, it is clear that such coefficients do not have any impact on the update results of Equation (3). In addition, coefficient vector $\hat\beta$ has already reached convergence for $U_i$. Therefore, $\hat\beta[i] = \hat\beta'[i]$ holds for all $i \in U_i$ after the update computations that yield coefficient vector $\hat\beta'$ if we use coefficient vector $\hat\beta$ in the update computation. As a result, $\hat\beta' = \hat\beta$ holds. □
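To make the roles of $U_i$ and $\overline U_i$ concrete, the selective update for a single row/column can be sketched as follows. This is an illustrative NumPy sketch under our own naming (`selective_lasso`, `expected_nonzero`); the full block-wise procedure, including the row/column permutations, is given as Algorithm 1 in the next section:

```python
import numpy as np

def soft_threshold(z, rho):
    """S(z, rho) = sign(z) * (|z| - rho)_+ ."""
    return np.sign(z) * max(abs(z) - rho, 0.0)

def selective_lasso(W11, s12, expected_nonzero, rho, tol=1e-4, max_sweeps=1000):
    """Selective coordinate descent in the spirit of Section 4.3.

    expected_nonzero is a boolean mask marking the coefficients with
    |S[i, j]| > rho (the initial set U); the remaining coefficients (U-bar)
    are checked only after convergence and promoted to U if they become
    nonzero, as justified by Lemma 3.
    """
    beta = np.zeros(len(s12))
    U = set(np.flatnonzero(expected_nonzero))

    def update(k):
        # Eq. (3): the residual excludes the k-th coefficient's own contribution
        r = s12[k] - (W11[:, k] @ beta - W11[k, k] * beta[k])
        return soft_threshold(r, rho) / W11[k, k]

    while True:
        # iterate only over the selected coefficients until convergence
        for _ in range(max_sweeps):
            delta = 0.0
            for k in U:
                new = update(k)
                delta = max(delta, abs(new - beta[k]))
                beta[k] = new
            if delta < tol:
                break
        # check the unselected coefficients; stop once they all stay zero
        promoted = False
        for k in (set(range(len(beta))) - U):
            val = update(k)
            if val != 0.0:
                beta[k] = val
                U.add(k)
                promoted = True
        if not promoted:
            return beta
```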
Lemma 3 indicates that, while the screening approach updates the coefficients in $\mathcal B_i$ in round-robin style, our approach can selectively update the coefficients in $U_i$ ($U_i \subseteq \mathcal B_i$) without sacrificing accuracy by checking the coefficients in $\overline U_i$. While the coefficients in $U_i$ are updated iteratively, we do not perform the iterative computation for the coefficients in $\overline U_i$; this improves efficiency. In the proposed approach, if a coefficient in $\overline U_i$ has a nonzero value, we add it to the set $U_i$ and recursively repeat these procedures until $\hat\beta'[i] = 0$ holds for all $i \in \overline U_i$. We show a detailed algorithm of this approach in the next section.

### 4.4 Algorithm

Algorithm 1 gives a full description of our approach.

Algorithm 1: Sting
Input: matrix $X$, parameter $\rho$, target rank of SVD $n$
Output: matrix $\Theta$
1: compute the rank-$n$ SVD of matrix $X$;
2: $A = 0$;
3: for $i = 1$ to $P$ do
4:   for $j = 1$ to $P$ do
5:     compute the upper and lower bounds of $S[i,j]$;
6:     if $\overline S[i,j] > \rho$ or $\underline S[i,j] < -\rho$ then
7:       compute $S[i,j]$;
8:       if $S[i,j] > \rho$ or $S[i,j] < -\rho$ then
9:         add edge $e[i,j]$ to adjacency matrix $A$;
10: compute the blocks of connected components from adjacency matrix $A$;
11: for $i = 1$ to $m$ do
12:   compute $S^{(i)}$;
13:   $W^{(i)} = S^{(i)} + \rho I^{(i)}$;
14:   repeat
15:     for each variable $j$ in the $i$-th block do
16:       compute $U_j$;
17:       repeat
18:         compute $\hat\beta$ until convergence by using $U_j$;
19:         compute $\overline U_j$;
20:         for each $k \in \overline U_j$ do
21:           compute $\hat\beta[k]$ from Equation (3);
22:           if $\hat\beta[k] \ne 0$ then
23:             add $k$ to $U_j$;
24:       until $\hat\beta[k] = 0$ for all $k \in \overline U_j$
25:       $w^{(i)}_{12} = W^{(i)}_{11}\hat\beta$;
26:   until $W^{(i)}$ reaches convergence
27:   compute $\Theta^{(i)}$ from $W^{(i)}$;
return $\Theta$;

In Algorithm 1, $A$ is the adjacency matrix of size $P \times P$ needed by the screening approach and $m$ is the number of blocks obtained by the screening approach. Following the paper on the screening approach [Witten et al., 2011], Sting processes the blocks one by one. In addition, let $S^{(i)}$, $W^{(i)}$, $I^{(i)}$, $w_{12}^{(i)}$, $W_{11}^{(i)}$, and $\Theta^{(i)}$ be the matrices and vectors of the $i$-th block that correspond to $S$, $W$, $I$, $w_{12}$, $W_{11}$, and $\Theta$, respectively. Our approach starts by computing the SVD of data matrix $X$ and initializing adjacency matrix $A$ (lines 1-2). In order to obtain the edges of the adjacency matrix efficiently, it computes the upper and lower bounds of each covariance of matrix $S$ and, only if the bounds meet the condition for having an edge (Lemma 2), exactly computes the corresponding element of the empirical covariance matrix (lines 3-9). Then, it computes the blocks from matrix $A$ (line 10). Following the screening approach, it computes the elements of matrix $\Theta$ by exploiting the block structures. For each block, it first computes matrices $S^{(i)}$ and $W^{(i)}$ (lines 12-13). Then, it iteratively computes the coefficients by exploiting $U_j$ until convergence (lines 15-18). It terminates the iterations when $\hat\beta[k] = 0$ holds for all $k \in \overline U_j$, based on Lemma 3 (lines 19-24). After the iterations, it updates matrix $W^{(i)}$ (line 25). It performs these procedures until matrix $W^{(i)}$ reaches convergence (line 26).
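As a concrete illustration of the screening phase of Algorithm 1 (lines 2-10), the sketch below computes an exact covariance only where the bounds of Lemma 2 cannot rule out an edge, and then extracts the connected components. It assumes NumPy/SciPy and the bound matrices produced by the `svd_bounds` sketch of Section 4.2; it is an illustration, not the authors' implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def screening_blocks(X, S_upper, S_lower, rho):
    """Screening phase of Algorithm 1 (lines 2-10), as a simplified sketch.

    S_upper, S_lower : P x P bound matrices, e.g. from the svd_bounds sketch.
    Returns the adjacency matrix A, the number of blocks, and the block label
    of each variable.
    """
    P = X.shape[1]
    A = np.zeros((P, P), dtype=bool)                         # line 2: A = 0
    for i in range(P):                                       # lines 3-4
        for j in range(i + 1, P):
            if S_upper[i, j] > rho or S_lower[i, j] < -rho:  # line 6: bound test (Lemma 2)
                s_ij = X[:, i] @ X[:, j]                     # line 7: exact covariance, O(N)
                if abs(s_ij) > rho:                          # line 8
                    A[i, j] = A[j, i] = True                 # line 9: add edge
    # line 10: blocks of connected components, as in [Witten et al., 2011]
    n_blocks, labels = connected_components(csr_matrix(A), directed=False)
    return A, n_blocks, labels
```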
Sting has the following property regarding its computation cost:

**Lemma 4** Let $t$ be the number of update computations for each row/column in the proposed approach; Sting requires $O(NP\log n + nP^2 + PBt + NP^2/m)$ time to estimate the elements of the inverse covariance matrix.

**Proof** As shown in Algorithm 1, Sting first computes the SVD of data matrix $X$, which needs $O(NP\log n)$ time [Halko et al., 2011]. To obtain the block structures, it computes the upper and lower bounds for all covariances of matrix $S$ in $O(nP^2)$ time. In addition, it computes the covariances of matrix $S$ corresponding to the blocks in $O(NP^2/m)$ time. To reduce the number of update computations, it first updates the coefficients expected to have nonzero values until convergence and then checks whether the other coefficients have nonzero values, in $O(PBt + P^2/m)$ time. Therefore, the computation cost of Sting is $O(NP\log n + nP^2 + PBt + NP^2/m)$. □

Sting has the following property for the estimation results:

**Lemma 5** Our approach converges to the same results as the graphical lasso in estimating matrix $\Theta$.

**Proof** As shown in Algorithm 1, Sting computes the $(i,j)$-th covariance of matrix $S$ only if $\overline S[i,j] > \rho$ or $\underline S[i,j] < -\rho$ holds; it does not compute the covariance if $\overline S[i,j] \le \rho$ and $\underline S[i,j] \ge -\rho$ hold. From Lemma 2, it is clear that the corresponding node pair does not have an edge in the adjacency matrix in that case. Therefore, our approach obtains exactly the same adjacency matrix as the graphical lasso. In addition, although it selectively updates coefficients when solving the lasso, it obtains the same converged results by checking the unselected coefficients, based on Lemma 3. As a result, since the optimization problem of Equation (1) is convex [Friedman et al., 2008], the proposed approach converges to the same results as the graphical lasso. □

As shown in Lemmas 4 and 5, the proposed approach theoretically converges to the same results as the graphical lasso with improved efficiency.

## 5 Experimental Evaluation

In this section, we experimentally evaluate the efficiency of our approach. We perform experiments on the Madelon, ISOLET, Gisette, and Arcene datasets. These datasets have 600, 6238, 1000, and 700 observations, respectively, and 500, 618, 5000, and 10000 variables, respectively. Details of the datasets are given in the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/index.html). In the datasets, each column vector is standardized to have mean zero and variance one [Tan et al., 2015]. In the experiments, we compare the proposed approach to glasso, QUIC, and huge. As described in Section 3, glasso is the most popular implementation of the graphical lasso, based on the coordinate descent algorithm and the screening approach. QUIC is a recent method based on Newton's method [Hsieh et al., 2013]. Note that QUIC does not guarantee the same results as the graphical lasso, as described in Section 2. In addition, huge is a state-of-the-art method for the graphical lasso that is based on the coordinate descent algorithm [Friedman et al., 2008] and the recent approaches [Liu et al., 2012; 2010] detailed in Section 2. This method can yield the same results as the graphical lasso. In the experiments, we used the latest versions: glasso 1.8 [Friedman et al., 2015], QUIC 1.1 [Hsieh et al., 2015], and huge 1.2.7 [Zhao et al., 2015]. In this section, "glasso", "QUIC", and "huge" represent the results of glasso, QUIC, and huge, respectively.
In addition, Sting($n$) denotes the results of our approach where the target rank of the SVD is set to $n$. We set the convergence threshold to $10^{-4}$. We conducted all experiments on a Linux 2.70 GHz Intel Xeon server. While glasso is implemented in Fortran, Sting, QUIC, and huge are implemented in C/C++.

### 5.1 Efficiency

We evaluated the estimation time of each approach. Figure 1 shows the results, where we set the L1 regularization parameter $\rho$ to 0.5, 0.7, and 0.9.

[Figure 1: Estimation time of each approach. Wall clock time [s] on Madelon, ISOLET, Gisette, and Arcene; panels (1) ρ = 0.5, (2) ρ = 0.7, (3) ρ = 0.9; compared methods include huge, Sting(50), Sting(100), and Sting(150).]

Note that the results of the proposed approach include the time taken to compute the SVD of the data matrix before entering the iterations. As shown in Figure 1, our approach is much faster than the previous approaches. Specifically, our approach is up to 300 times faster than glasso, up to 100 times faster than QUIC, and up to 45 times faster than huge. As described in Section 4.2, Sting does not compute all the elements of the empirical covariance matrix, thanks to the upper and lower bounds. In addition, it selectively updates the coefficients that are expected to have nonzero values, as described in Section 4.3. As a result, our approach is superior to the existing approaches in terms of computation time, as shown in Figure 1.

This figure also indicates that the estimation time of our approach varies only slightly with the target rank $n$ of the SVD. As described in Section 4.2, it takes $O(n)$ time to compute the upper and lower bounds from the SVD. Therefore, we can compute the bounds more quickly if rank $n$ is small. On the other hand, we can reduce the number of exact covariance computations if rank $n$ is large. In other words, we can compute the bounds more efficiently by reducing the target rank at the cost of an increased number of covariance computations. Due to this trade-off in rank $n$, our approach is not very sensitive to the choice of the target rank. In addition, Figure 1 indicates that the efficiency of Sting rises with $\rho$. This is because the proposed approach effectively exploits the sparseness property of the graphical lasso. Note that the proposed approach is theoretically guaranteed to converge to the same estimation results as the graphical lasso, as shown in Lemma 5. Therefore, Figure 1 and Lemma 5 indicate that our approach successfully reduces the computation time while providing the same guaranteed solution as the graphical lasso in estimating the inverse covariance matrix. Note that huge yields the same results as the graphical lasso, while QUIC is an approximate approach.

### 5.2 Effectiveness

In the following experiments, we examine the effectiveness of the two core techniques: (1) computing the upper and lower bounds and (2) the selective update in the coordinate descent.

**Covariance computations.** As described in Section 4.2, we compute the upper and lower bounds of each covariance in matrix $S$ to efficiently obtain the block structures used in the screening approach. Here we report the number of covariance computations to evaluate the effectiveness of this technique. Figure 2 shows the results, where we set $\rho = 0.7$ and $n = 100$.

[Figure 2: Number of covariance computations on Madelon, ISOLET, Gisette, and Arcene, with and without the bounds.]
In this figure, "without bounds" represents the number of covariance computations when the upper and lower bounds are not used in obtaining the block structures, i.e., when all the covariances in matrix $S$ are computed. In addition, "with bounds" indicates the results when we use the upper and lower bounds in obtaining the block structures. As shown in Figure 2, our approach effectively reduces the number of covariance computations by exploiting the upper and lower bounds. For computing the block structures, the screening approach by Friedman et al. needs to compute all the covariances in matrix $S$ to find the covariances that have high absolute values. Therefore, it incurs a high computation cost of $O(NP^2)$, where $N$ and $P$ are the numbers of observations and variables, respectively. Even though our approach needs to compute the bounds for all covariances in matrix $S$, this cost is small since we can compute the bounds efficiently using the SVD of rank $n$; the computation cost of this procedure is $O(nP^2)$, where $n \ll N$. In addition, since the SVD effectively approximates the data matrix, our approach successfully prunes the unlikely covariances that have low absolute values. As a result, by exploiting the upper and lower bounds, we can effectively reduce the number of covariance computations needed to obtain the block structures in the screening approach.

**Update computations.** In order to improve efficiency, we selectively update the coefficients expected to have nonzero values, as described in Section 4.3. We plot the number of update computations needed to compute the coefficients in Figure 3 to show the effectiveness of this technique. In this figure, "round-robin" is the number of update computations when the coefficients are updated in round-robin style. In addition, "selective" indicates the results when the coefficients are selectively updated by our approach. We set $\rho = 0.7$ and $n = 100$.

[Figure 3: Number of update computations (log scale) on Madelon, ISOLET, Gisette, and Arcene for the round-robin and selective updates.]

Figure 3 indicates that our approach effectively reduces the number of update computations. As described in Section 3, the graphical lasso computes the blocks of connected components from the adjacency matrix. While an edge between a pair of variables in the adjacency matrix indicates that the pair has a covariance of high absolute value, variable pairs can be included in the same block of connected components even if their covariance does not have a high absolute value. In addition, if a covariance has a low absolute value, the corresponding coefficient is expected to be zero due to the L1 penalty of the objective function [Langfelder and Horvath, 2008]. As a result, the coefficients in the graphical lasso have the property of sparseness. This indicates that the graphical lasso performs unnecessary update computations, since it updates the coefficients of each block in round-robin style even though many coefficients result in zero. On the other hand, our approach selectively updates only those coefficients that correspond to covariances with high absolute values. Since we do not iteratively compute the unlikely coefficients, we can effectively reduce the number of update computations in obtaining the inverse covariance matrix.

## 6 Conclusions

We proposed Sting, an efficient algorithm that improves the efficiency of the graphical lasso. Our approach computes upper and lower bounds of the covariances and selectively updates the coefficients of the lasso expected to have nonzero values. Experiments showed that it offers improved efficiency over existing approaches.
Sting will allow many graphical-lasso-based applications to be processed more efficiently, and will improve the effectiveness of AI-based applications.

## References

[Blitzstein and Hwang, 2014] Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability. Chapman and Hall/CRC, 2014.
[Chen et al., 2010] Xi Chen, Yan Liu, Han Liu, and Jaime G. Carbonell. Learning Spatial-temporal Varying Graphs with Applications to Climate Data Analysis. In AAAI, 2010.
[Friedman et al., 2008] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 9(3):432–441, July 2008.
[Friedman et al., 2015] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Package 'glasso'. ftp://cran.r-project.org/pub/R/web/packages/glasso/glasso.pdf, 2015.
[Fujiwara et al., 2014] Yasuhiro Fujiwara, Go Irie, Shari Kuroyama, and Makoto Onizuka. Scaling Manifold Ranking Based Image Retrieval. PVLDB, 8(4):341–352, 2014.
[Fujiwara et al., 2016a] Yasuhiro Fujiwara, Yasutoshi Ida, Junya Arai, Mai Nishimura, and Sotetsu Iwamura. Fast Algorithm for the Lasso Based L1-graph Construction. PVLDB, 10(3):229–240, 2016.
[Fujiwara et al., 2016b] Yasuhiro Fujiwara, Yasutoshi Ida, Hiroaki Shiokawa, and Sotetsu Iwamura. Fast Lasso Algorithm via Selective Coordinate Descent. In AAAI, pages 1561–1567, 2016.
[Fujiwara et al., 2017] Yasuhiro Fujiwara, Naoki Marumo, Mathieu Blondel, Koh Takeuchi, Hideaki Kim, Tomoharu Iwata, and Naonori Ueda. Scaling Locally Linear Embedding. In ACM SIGMOD, pages 1479–1492, 2017.
[Grechkin et al., 2015] Maxim Grechkin, Maryam Fazel, Daniela M. Witten, and Su-In Lee. Pathway Graphical Lasso. In AAAI, pages 2617–2623, 2015.
[Halko et al., 2011] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2):217–288, 2011.
[Hsieh et al., 2013] Cho-Jui Hsieh, Mátyás A. Sustik, Inderjit S. Dhillon, Pradeep Ravikumar, and Russell A. Poldrack. BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables. In NIPS, pages 3165–3173, 2013.
[Hsieh et al., 2015] Cho-Jui Hsieh, Mátyás A. Sustik, Inderjit S. Dhillon, and Pradeep Ravikumar. Package 'QUIC'. https://cran.r-project.org/web/packages/QUIC/index.html, 2015.
[Langfelder and Horvath, 2008] Peter Langfelder and Steve Horvath. WGCNA: an R Package for Weighted Correlation Network Analysis. BMC Bioinformatics, 9, 2008.
[Li et al., 2016] Mei Li, Ying Lin, Shuai Huang, and Craig Crossland. The Use of Sparse Inverse Covariance Estimation for Relationship Detection and Hypothesis Generation in Strategic Management. Strategic Management Journal, 37(1):86–97, 2016.
[Liu et al., 2010] Han Liu, Kathryn Roeder, and Larry A. Wasserman. Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models. In NIPS, pages 1432–1440, 2010.
[Liu et al., 2012] Han Liu, Fang Han, Ming Yuan, John D. Lafferty, and Larry A. Wasserman. High Dimensional Semiparametric Gaussian Copula Graphical Models. In ICML, 2012.
[Mishima and Fujiwara, 2015] Takeshi Mishima and Yasuhiro Fujiwara. Madeus: Database Live Migration Middleware under Heavy Workloads for Cloud Environment. In ACM SIGMOD, pages 315–329, 2015.
[Nakatsuji et al., 2014] Makoto Nakatsuji, Yasuhiro Fujiwara, Hiroyuki Toda, Hiroshi Sawada, Jinguang Zheng, and James Alexander Hendler. Semantic Data Representation for Improving Tensor Factorization. In AAAI, pages 2004–2012, 2014.
[Neuberg and Glasserman, 2016] Richard Neuberg and Paul Glasserman. The Correlation Structure of Credit Default Swaps. Columbia Business School Research Paper, 2016.
[Press et al., 2007] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes, 3rd Edition. Cambridge University Press, 2007.
[Rosa et al., 2015] Maria J. Rosa, Liana Portugal, Tim Hahn, Andreas J. Fallgatter, Marta I. Garrido, John Shawe-Taylor, and Janaina Mourão-Miranda. Sparse Network-based Models for Patient Classification Using fMRI. NeuroImage, 105:493–506, 2015.
[Shiokawa et al., 2015] Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka. SCAN++: Efficient Algorithm for Finding Clusters, Hubs and Outliers on Large-scale Graphs. PVLDB, 8(11):1178–1189, 2015.
[Tan et al., 2015] Kean Ming Tan, Daniela M. Witten, and Ali Shojaie. The Cluster Graphical Lasso for Improved Estimation of Gaussian Graphical Models. Computational Statistics & Data Analysis, 85:23–36, 2015.
[Vinciotti et al., 2016] Veronica Vinciotti, Ernst C. Wit, Rick Jansen, Eco J. C. N. de Geus, Brenda W. J. H. Penninx, Dorret I. Boomsma, and Peter A. C. 't Hoen. Consistency of Biological Networks Inferred from Microarray and Sequencing Data. BMC Bioinformatics, 17(1):254, 2016.
[Witten et al., 2011] Daniela M. Witten, Jerome H. Friedman, and Noah Simon. New Insights and Faster Computations for the Graphical Lasso. Journal of Computational and Graphical Statistics, 20(4):892–900, 2011.
[Xu et al., 2016] Zenglin Xu, Shandian Zhe, Yuan Qi, and Peng Yu. Association Discovery and Diagnosis of Alzheimer's Disease with Bayesian Multiview Learning. JAIR, 56:247–268, 2016.
[Zhao et al., 2015] Tuo Zhao, Xingguo Li, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. Package 'huge'. https://cran.r-project.org/web/packages/huge/huge.pdf, 2015.