# Evaluating the statistical significance of biclusters

Jason D. Lee, Yuekai Sun, and Jonathan Taylor
Institute of Computational and Mathematical Engineering
Stanford University, Stanford, CA 94305
{jdl17,yuekai,jonathan.taylor}@stanford.edu

Biclustering (also known as submatrix localization) is a problem of high practical relevance in exploratory analysis of high-dimensional data. We develop a framework for performing statistical inference on biclusters found by score-based algorithms. Since the bicluster is selected in a data-dependent manner by a biclustering or localization algorithm, this is a form of selective inference. Our framework gives exact (non-asymptotic) confidence intervals and p-values for the significance of the selected biclusters.

## 1 Introduction

Given a matrix $X \in \mathbb{R}^{m \times n}$, biclustering or submatrix localization is the problem of identifying a subset of the rows and columns of $X$ such that the bicluster or submatrix consisting of the selected rows and columns is significant compared to the rest of $X$.

An important application of biclustering is the identification of significant genotype-phenotype associations in the (unsupervised) analysis of gene expression data. The data is usually represented by an expression matrix $X$ whose rows correspond to genes and whose columns correspond to samples. Genotype-phenotype associations thus correspond to salient submatrices of $X$. The location and significance of such biclusters, in conjunction with relevant clinical information, give preliminary results on the genetic underpinnings of the phenotypes being studied. More generally, given a matrix $X \in \mathbb{R}^{m \times n}$ whose rows correspond to variables and columns correspond to samples, biclustering seeks sample-variable associations in the form of salient submatrices.

Without loss of generality, we consider square matrices $X \in \mathbb{R}^{n \times n}$ of the form

$$X = M + Z, \quad Z_{ij} \sim \mathcal{N}(0, \sigma^2), \quad M = \mu\, e_{I_0} e_{J_0}^T, \quad \mu \geq 0, \quad I_0, J_0 \subset [n]. \tag{1.1}$$

The components of $e_I$, $I \subset [n]$, are given by

$$(e_I)_i = \begin{cases} 1 & i \in I \\ 0 & \text{otherwise.} \end{cases}$$

For our theoretical results, we assume the size of the embedded submatrix $|I_0| = |J_0| = k$ and the noise variance $\sigma^2$ are known.

The biclustering problem, due to its practical relevance, has attracted considerable attention. Most previous work focuses on finding significant submatrices. A large class of algorithms for biclustering are score-based, i.e. they search for submatrices that maximize some score function measuring the significance of a submatrix. In this paper, we focus on evaluating the significance of submatrices found by score-based biclustering algorithms. More precisely, let $I(X), J(X) \subset [n]$ be a (random) pair output by a biclustering algorithm. We seek to test whether the localized submatrix $X_{I(X), J(X)}$ contains any signal, i.e. to test the hypothesis

$$\sum_{i \in I(X)} \sum_{j \in J(X)} M_{ij} = 0. \tag{1.2}$$

Since the hypothesis depends on the (random) output of the biclustering algorithm, this is a form of selective inference. The distribution of the test statistic $\sum_{i \in I(X)} \sum_{j \in J(X)} X_{ij}$ depends on the specific algorithm, and is extremely difficult to derive for many heuristic biclustering algorithms. Our main contribution is a test of whether a biclustering algorithm has found a statistically significant bicluster. The tests and confidence intervals we construct are exact, meaning that in finite samples the type 1 error is exactly $\alpha$.
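For concreteness, here is a minimal NumPy sketch of the generative model (1.1); the function name and the random choice of the planted index sets are ours, for illustration only.

```python
import numpy as np

def simulate_biclustered_matrix(n, k, mu, sigma=1.0, seed=None):
    """Draw X = M + Z from model (1.1): a k-by-k block with mean mu
    planted in an n-by-n matrix of N(0, sigma^2) noise."""
    rng = np.random.default_rng(seed)
    I0 = rng.choice(n, size=k, replace=False)    # hidden row set I_0
    J0 = rng.choice(n, size=k, replace=False)    # hidden column set J_0
    M = np.zeros((n, n))
    M[np.ix_(I0, J0)] = mu                       # M = mu * e_{I0} e_{J0}^T
    Z = sigma * rng.standard_normal((n, n))      # Z_ij ~ N(0, sigma^2)
    return M + Z, np.sort(I0), np.sort(J0)
```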
This paper is organized as follows. First, we review recent work on biclustering and related problems. Then, in section 2, we describe our framework for performing inference in the context of a simple biclustering algorithm based on a scan statistic. We show that

1. the framework gives exact (non-asymptotic) Unif(0, 1) p-values under $H_0$, and the p-values can be inverted to form confidence intervals for the amount of signal in $X_{I(X), J(X)}$;
2. under the minimax signal-to-noise ratio (SNR) regime $\mu \asymp \sigma\sqrt{\log(n-k)/k}$, the test has full asymptotic power.

In section 4, we show that the framework handles more computationally tractable biclustering algorithms, including a greedy algorithm originally proposed by Shabalin et al. [12]. In the supplementary materials, we discuss the problem in the more general setting where there are multiple embedded submatrices. Finally, we present experimental validation of the various tests and biclustering algorithms.

### 1.1 Related work

A slightly easier problem is submatrix detection: test whether a matrix has an embedded submatrix with nonzero mean [1, 4]. This problem was recently studied by Ma and Wu [11], who characterized the minimum signal strength $\mu$ for any test, and for any computationally tractable test, to reliably detect an embedded submatrix. We emphasize that the problem we consider is not the submatrix detection problem, but a complementary problem. Submatrix detection asks whether there are any hidden row-column associations in a matrix. We ask whether a submatrix selected by a biclustering algorithm captures the hidden association(s). In practice, given a matrix, a practitioner might perform (in order)

1. submatrix detection: check for a hidden submatrix with elevated mean;
2. submatrix localization: attempt to find the hidden submatrix;
3. selective inference: check whether the selected submatrix captures any signal.

We focus on the third step in the pipeline. Results on evaluating the significance of selected submatrices are scarce. The only result we know of is by Bhamidi, Dey and Nobel, who characterized the asymptotic distribution of the largest $k \times k$ average submatrix in Gaussian random matrices [6]. Their result may be used to form an asymptotic test of (1.2).

The submatrix localization problem, due to its practical relevance, has attracted considerable attention [5, 2, 3]. Most prior work focuses on finding significant submatrices. Broadly speaking, submatrix localization procedures fall into one of two types: score-based search procedures and spectral algorithms. The main idea behind the score-based approach to submatrix localization is that significant submatrices should maximize some score that measures the significance of a submatrix, e.g. the average of its entries [12] or the goodness-of-fit of a two-way ANOVA model [8, 9]. Since there are exponentially many submatrices, many score-based search procedures use heuristics to reduce the search space. Such heuristics are not guaranteed to succeed, but often perform well in practice. One of the purposes of our work is to test whether a heuristic algorithm has identified a significant submatrix.

The submatrix localization problem exhibits a statistical and computational trade-off that was first studied by Balakrishnan et al. [5]. They compare the SNR required by several computationally efficient algorithms to the minimax SNR. Recently, Chen and Xu [7] studied the trade-off when there are several embedded submatrices. In this more general setting, they show the SNR required by convex relaxation is smaller than the SNR required by entry-wise thresholding.
Thus the power of convex relaxation is in separating clusters/submatrices, not in identifying a single cluster/submatrix.

## 2 A framework for evaluating the significance of a submatrix

Our main contribution is a framework for evaluating the significance of a submatrix selected by a biclustering algorithm. The framework allows us to perform exact (non-asymptotic) inference on the selected submatrix. In this section, we develop the framework on a (very) simple score-based algorithm that simply outputs the largest average submatrix. At a high level, our framework consists of characterizing the selection event $\{(I(X), J(X)) = (I, J)\}$ and applying the key distributional result in [10] to obtain a pivotal quantity.

### 2.1 The significance of the largest average submatrix

To begin, we consider performing inference on the output of the simple algorithm that returns the $k \times k$ submatrix with the largest sum. Let $\mathcal{S}$ be the set of indices of all $k \times k$ submatrices of $X$, i.e.

$$\mathcal{S} = \{(I, J) \mid I, J \subset [n],\ |I| = |J| = k\}.$$

The Largest Average Submatrix (LAS) algorithm returns a pair $(I_{\mathrm{LAS}}(X), J_{\mathrm{LAS}}(X))$:

$$(I_{\mathrm{LAS}}(X), J_{\mathrm{LAS}}(X)) = \arg\max_{(I,J) \in \mathcal{S}} e_I^T X e_J.$$

The optimal value $S_{(1)} = \operatorname{tr}\bigl(e_{J_{\mathrm{LAS}}(X)} e_{I_{\mathrm{LAS}}(X)}^T X\bigr)$ is distributed like the maximum of $\binom{n}{k}^2$ (correlated) normal random variables. Although results on the asymptotic distribution ($k$ fixed, $n$ growing) of $S_{(1)}$ (under $H_0: \mu = 0$) are known (e.g. Theorem 2.1 in [6]), we are not aware of any results that characterize the finite-sample distribution of the optimal value. To sidestep this difficulty, we condition on the selection event

$$E_{\mathrm{LAS}}(I, J) = \{(I_{\mathrm{LAS}}(X), J_{\mathrm{LAS}}(X)) = (I, J)\} \tag{2.1}$$

and work with the distribution of $X \mid \{(I_{\mathrm{LAS}}(X), J_{\mathrm{LAS}}(X)) = (I, J)\}$.

We begin by making a key observation. The selection event given by (2.1) is equivalent to $X$ satisfying a set of linear inequalities:

$$\operatorname{tr}\bigl(e_J e_I^T X\bigr) \geq \operatorname{tr}\bigl(e_{J'} e_{I'}^T X\bigr) \quad \text{for any } (I', J') \in \mathcal{S} \setminus \{(I, J)\}. \tag{2.2}$$

Thus the selection event is equivalent to $X$ falling in the polyhedral set

$$C_{\mathrm{LAS}}(I, J) = \bigl\{X \in \mathbb{R}^{n \times n} \mid \operatorname{tr}\bigl(e_J e_I^T X\bigr) \geq \operatorname{tr}\bigl(e_{J'} e_{I'}^T X\bigr) \text{ for any } (I', J') \in \mathcal{S} \setminus \{(I, J)\}\bigr\}. \tag{2.3}$$

Thus $X \mid \{(I_{\mathrm{LAS}}(X), J_{\mathrm{LAS}}(X)) = (I, J)\} = X \mid \{X \in C_{\mathrm{LAS}}(I, J)\}$ is a constrained Gaussian random variable. Recall that our goal is to perform inference on the amount of signal in the selected submatrix $X_{I_{\mathrm{LAS}}(X), J_{\mathrm{LAS}}(X)}$. This task is akin to performing inference on the mean parameter (the mean of the Gaussian prior to truncation) of a constrained Gaussian random variable, namely $X \mid \{X \in C_{\mathrm{LAS}}(I, J)\}$. We apply the selective inference framework of Lee et al. [10] to accomplish the task. Before we delve into the details of how we perform inference on the mean parameter of a constrained Gaussian random variable, we review the key distributional result in [10] concerning constrained Gaussian random variables.

Theorem 2.1. Consider a Gaussian random variable $y \in \mathbb{R}^n$ with mean $\nu \in \mathbb{R}^n$ and covariance $\Sigma \in \mathbb{S}^{n \times n}_{++}$, constrained to a polyhedral set $C = \{x \in \mathbb{R}^n \mid Ax \leq b\}$ for some $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$. Let $\eta \in \mathbb{R}^n$ define a linear function $\eta^T y$ of $y$. Define

$$\alpha = \frac{A \Sigma \eta}{\eta^T \Sigma \eta}$$

and

$$\mathcal{V}^+(y) = \inf_{j:\, \alpha_j > 0} \frac{1}{\alpha_j}\bigl(b_j - (Ay)_j + \alpha_j \eta^T y\bigr) \tag{2.4}$$

$$\mathcal{V}^-(y) = \sup_{j:\, \alpha_j < 0} \frac{1}{\alpha_j}\bigl(b_j - (Ay)_j + \alpha_j \eta^T y\bigr) \tag{2.5}$$

$$\mathcal{V}^0(y) = \inf_{j:\, \alpha_j = 0} \bigl(b_j - (Ay)_j\bigr) \tag{2.6}$$

$$F(x, \nu, \sigma^2, a, b) = \frac{\Phi\bigl(\frac{x - \nu}{\sigma}\bigr) - \Phi\bigl(\frac{a - \nu}{\sigma}\bigr)}{\Phi\bigl(\frac{b - \nu}{\sigma}\bigr) - \Phi\bigl(\frac{a - \nu}{\sigma}\bigr)}. \tag{2.7}$$

The expression $F\bigl(\eta^T y, \eta^T \nu, \eta^T \Sigma \eta, \mathcal{V}^-(y), \mathcal{V}^+(y)\bigr)$ is a pivotal quantity with a Unif(0, 1) distribution, i.e.

$$F\bigl(\eta^T y, \eta^T \nu, \eta^T \Sigma \eta, \mathcal{V}^-(y), \mathcal{V}^+(y)\bigr) \mid \{Ay \leq b\} \sim \mathrm{Unif}(0, 1). \tag{2.8}$$

Remark 2.2. The truncation limits $\mathcal{V}^+(y)$ and $\mathcal{V}^-(y)$ (and $\mathcal{V}^0(y)$) depend on $\eta$ and the polyhedral set $C$. We omit the dependence to keep the notation manageable.
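Theorem 2.1 translates directly into code. The sketch below computes the truncation limits and the truncated-Gaussian CDF $F$ for a generic polyhedron $\{x : Ax \le b\}$; the function names are ours, and numerical stabilization of the normal CDF ratios is omitted.

```python
import numpy as np
from scipy.stats import norm

def truncation_limits(A, b, Sigma, eta, y):
    """V^-(y) and V^+(y) of Theorem 2.1 for the polyhedron {x : A x <= b}."""
    c = Sigma @ eta / (eta @ Sigma @ eta)
    alpha = A @ c                       # alpha_j = (A Sigma eta)_j / (eta' Sigma eta)
    resid = b - A @ y + alpha * (eta @ y)  # b_j - (Ay)_j + alpha_j * eta'y
    lower = np.max(resid[alpha < 0] / alpha[alpha < 0]) if np.any(alpha < 0) else -np.inf
    upper = np.min(resid[alpha > 0] / alpha[alpha > 0]) if np.any(alpha > 0) else np.inf
    return lower, upper

def truncated_gaussian_cdf(x, nu, var, a, b):
    """F(x, nu, sigma^2, a, b) of (2.7): CDF of N(nu, var) truncated to [a, b]."""
    sd = np.sqrt(var)
    num = norm.cdf((x - nu) / sd) - norm.cdf((a - nu) / sd)
    den = norm.cdf((b - nu) / sd) - norm.cdf((a - nu) / sd)
    return num / den
```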
Recall $X \mid \{E_{\mathrm{LAS}}(I, J)\}$ is a constrained Gaussian random variable (constrained to the polyhedral set $C_{\mathrm{LAS}}(I, J)$ given by (2.3)). By Theorem 2.1 and the characterization of the selection event $E_{\mathrm{LAS}}(I, J)$, the random variable

$$F\bigl(S_{(1)}, \operatorname{tr}(e_J e_I^T M), \sigma^2 k^2, \mathcal{V}^-(X), \mathcal{V}^+(X)\bigr) \mid \{E_{\mathrm{LAS}}(I, J)\},$$

where $\mathcal{V}^+(X)$ and $\mathcal{V}^-(X)$ (and $\mathcal{V}^0(X)$) are evaluated on the polyhedral set $C_{\mathrm{LAS}}(I, J)$, is uniformly distributed on the unit interval. The mean parameter $\operatorname{tr}(e_J e_I^T M)$ is the amount of signal captured by $X_{I,J}$:

$$\operatorname{tr}(e_J e_I^T M) = |I \cap I_0|\, |J \cap J_0|\, \mu.$$

What are $\mathcal{V}^+(X)$ and $\mathcal{V}^-(X)$? Let $E_{I', J'} = e_{I'} e_{J'}^T$ for any $I', J' \subset [n]$. For convenience, we index the constraints (2.2) by the pairs $(I', J')$. The term $\alpha_{I', J'}$ is given by

$$\alpha_{I', J'} = \frac{|I \cap I'|\, |J \cap J'| - k^2}{k^2}.$$

Since $|I \cap I'|\, |J \cap J'| < k^2$, $\alpha_{I', J'}$ is negative for any $(I', J') \in \mathcal{S} \setminus \{(I, J)\}$, and the upper truncation limit $\mathcal{V}^+(X)$ is $\infty$. The lower truncation limit $\mathcal{V}^-(X)$ simplifies to

$$\mathcal{V}^-(X) = \max_{(I', J'):\, \alpha_{I', J'} < 0}\ \operatorname{tr}\bigl(E_{I,J}^T X\bigr) - \frac{k^2\, \operatorname{tr}\bigl((E_{I,J} - E_{I', J'})^T X\bigr)}{k^2 - |I \cap I'|\, |J \cap J'|}. \tag{2.9}$$

We summarize the developments thus far in a corollary.

Corollary 2.3. We have

$$F\bigl(S_{(1)}, \operatorname{tr}(e_J e_I^T M), k^2 \sigma^2, \mathcal{V}^-(X), \infty\bigr) \mid \{E_{\mathrm{LAS}}(I, J)\} \sim \mathrm{Unif}(0, 1), \tag{2.10}$$

$$\mathcal{V}^-(X) = \max_{(I', J'):\, \alpha_{I', J'} < 0}\ \operatorname{tr}\bigl(E_{I,J}^T X\bigr) - \frac{k^2\, \operatorname{tr}\bigl((E_{I,J} - E_{I', J'})^T X\bigr)}{k^2 - |I \cap I'|\, |J \cap J'|}. \tag{2.11}$$

Under the hypothesis

$$H_0: \operatorname{tr}\bigl(e_{J_{\mathrm{LAS}}(X)} e_{I_{\mathrm{LAS}}(X)}^T M\bigr) = 0, \tag{2.12}$$

we expect

$$F\bigl(S_{(1)}, 0, k^2 \sigma^2, \mathcal{V}^-(X), \infty\bigr) \mid \{E_{\mathrm{LAS}}(I, J)\} \sim \mathrm{Unif}(0, 1).$$

Thus $1 - F\bigl(S_{(1)}, 0, k^2 \sigma^2, \mathcal{V}^-(X), \infty\bigr)$ is a p-value for the hypothesis (2.12). Under the alternative, we expect the selected submatrix to be (stochastically) larger than under the null. Thus rejecting $H_0$ when the p-value is smaller than $\alpha$ is an exact $\alpha$-level test for $H_0$; i.e.

$$\Pr_0\bigl(\text{reject } H_0 \mid \{E_{\mathrm{LAS}}(I, J)\}\bigr) = \alpha.$$

Since the test controls type 1 error at $\alpha$ for all possible selection events (i.e. all possible outcomes of the LAS algorithm), the test also controls type 1 error unconditionally:

$$\Pr_0(\text{reject } H_0) = \sum_{I, J \subset [n]} \Pr_0\bigl(\text{reject } H_0 \mid \{E_{\mathrm{LAS}}(I, J)\}\bigr) \Pr_0\bigl(\{E_{\mathrm{LAS}}(I, J)\}\bigr) = \alpha \sum_{I, J \subset [n]} \Pr_0\bigl(\{E_{\mathrm{LAS}}(I, J)\}\bigr) = \alpha.$$

Thus the test is an exact $\alpha$-level test of $H_0$. We summarize the result in a theorem.

Theorem 2.4. The test that rejects when

$$F\bigl(S_{(1)}, 0, k^2 \sigma^2, \mathcal{V}^-(X), \infty\bigr) \geq 1 - \alpha,$$

$$\mathcal{V}^-(X) = \max_{(I', J'):\, \alpha_{I', J'} < 0}\ \operatorname{tr}\bigl(E_{I_{\mathrm{LAS}}(X), J_{\mathrm{LAS}}(X)}^T X\bigr) - \frac{k^2\, \operatorname{tr}\bigl((E_{I_{\mathrm{LAS}}(X), J_{\mathrm{LAS}}(X)} - E_{I', J'})^T X\bigr)}{k^2 - |I_{\mathrm{LAS}}(X) \cap I'|\, |J_{\mathrm{LAS}}(X) \cap J'|},$$

is a valid $\alpha$-level test for $H_0: \sum_{i \in I(X)} \sum_{j \in J(X)} M_{ij} = 0$.

To obtain confidence intervals for the amount of signal in the selected submatrix, we invert the pivotal quantity given by (2.10). By Corollary 2.3, the interval

$$\Bigl\{\nu \in \mathbb{R} : \tfrac{\alpha}{2} \leq F\bigl(S_{(1)}, \nu, k^2 \sigma^2, \mathcal{V}^-(X), \infty\bigr) \leq 1 - \tfrac{\alpha}{2}\Bigr\} \tag{2.13}$$

is an exact $1 - \alpha$ confidence interval for $\sum_{i \in I(X)} \sum_{j \in J(X)} M_{ij}$. When $(I_{\mathrm{LAS}}(X), J_{\mathrm{LAS}}(X)) = (I_0, J_0)$, (2.13) is (up to rescaling by $k^2$) a confidence interval for $\mu$. Like the test given by Theorem 2.4, the confidence intervals given by (2.13) are also valid unconditionally.
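As an illustration of the test in Theorem 2.4, the following sketch (our own, feasible only for tiny $n$ and $k$) runs the brute-force LAS search, evaluates $\mathcal{V}^-(X)$ via (2.9), and returns the selective p-value $1 - F\bigl(S_{(1)}, 0, k^2\sigma^2, \mathcal{V}^-(X), \infty\bigr)$ in survival-function form.

```python
import itertools
import numpy as np
from scipy.stats import norm

def las_selective_pvalue(X, k, sigma):
    """Brute-force LAS plus the selective p-value of Theorem 2.4.
    Feasible only for tiny n, k: all C(n, k)^2 submatrices are scanned."""
    n = X.shape[0]
    pairs = [(I, J) for I in itertools.combinations(range(n), k)
                    for J in itertools.combinations(range(n), k)]
    sums = np.array([X[np.ix_(I, J)].sum() for I, J in pairs])
    best = int(np.argmax(sums))
    I, J = pairs[best]
    S1 = sums[best]                                  # S_(1) = tr(e_J e_I^T X)
    # lower truncation limit V^-(X) from (2.9); V^+(X) = +infinity here
    Vminus = -np.inf
    for (Ip, Jp), s in zip(pairs, sums):
        overlap = len(set(I) & set(Ip)) * len(set(J) & set(Jp))
        if overlap == k * k:                         # this is (I, J) itself
            continue
        Vminus = max(Vminus, S1 - k**2 * (S1 - s) / (k**2 - overlap))
    # p-value: 1 - F(S_(1), 0, k^2 sigma^2, V^-(X), +infinity)
    sd = k * sigma
    return (I, J), norm.sf(S1 / sd) / norm.sf(Vminus / sd)
```

Under $\mu = 0$ the returned p-value should be uniformly distributed, per Corollary 2.3; under a strong planted signal it concentrates near zero.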
### 2.2 Power under minimax signal-to-noise ratio

In section 2, we derived an exact (non-asymptotically valid) test for the hypothesis (2.12). In this section, we study the power of the test. Before we delve into the details, we review some relevant results to place our result in the correct context. Balakrishnan et al. [5] show $\mu$ must be at least $\Theta\bigl(\sigma\sqrt{\log(n-k)/k}\bigr)$ for any algorithm to succeed (find the embedded submatrix) with high probability. They also show the LAS algorithm is minimax rate optimal; i.e. in this regime the LAS algorithm finds the embedded submatrix with probability at least $1 - \frac{4}{n-k}$. We show that the test given by Theorem 2.4 has full asymptotic power under the same signal strength. The proof is given in the appendix.

Theorem 2.5. Let $\mu = C\sigma\sqrt{\log(n-k)/k}$. When $C > \max\bigl\{\frac{1}{\alpha \log(n-k)}\, k^{-5/4},\ 4 + 4\sqrt{2}\bigr\}$, the $\alpha$-level test given by Corollary 2.3 has power at least $1 - \frac{5}{n-k}$; i.e.

$$\Pr(\text{reject } H_0) \geq 1 - \frac{5}{n-k}.$$

Further, for any sequence $(n, k)$ such that $n \to \infty$, when $C > 4$ and $k \leq \frac{n}{2}$, $\Pr(\text{reject } H_0) \to 1$.

## 3 General scan statistics

Although we have elected to present our framework in the context of biclustering, the framework readily extends to scan statistics. Let $z \sim \mathcal{N}(\mu, \Sigma)$, where the mean has the form

$$\mathbb{E}[z_i] = \begin{cases} \mu & i \in S \\ 0 & \text{otherwise} \end{cases}$$

for some $\mu > 0$ and $S \subset [n]$. The set $S$ belongs to a collection $\mathcal{C} = \{S_1, \ldots, S_N\}$. We decide which index set in $\mathcal{C}$ generated the data by

$$\hat{S} = \arg\max_{S \in \mathcal{C}} \sum_{i \in S} z_i. \tag{3.1}$$

Given $\hat{S}$, we are interested in testing the null hypothesis

$$H_0: \mathbb{E}\bigl[e_{\hat{S}}^T z\bigr] = 0. \tag{3.2}$$

To perform exact inference for the selected effect $e_{\hat{S}}^T\,\mathbb{E}[z]$, we must first characterize the selection event. We observe that the selection event $\{\hat{S} = S\}$ is equivalent to $z$ satisfying a set of linear inequalities:

$$e_S^T z \geq e_{S'}^T z \quad \text{for any } S' \in \mathcal{C} \setminus \{S\}. \tag{3.3}$$

Given the form of the constraints (3.3),

$$a_{S'} = \frac{(e_{S'} - e_S)^T e_S}{e_S^T e_S} = \frac{1}{|S|}\bigl(|S \cap S'| - |S|\bigr) \quad \text{for any } S' \in \mathcal{C} \setminus \{S\}.$$

Since $|S \cap S'| \leq |S|$, we have $a_{S'} \in [-1, 0]$, which implies $\mathcal{V}^+(z) = \infty$. The term $\mathcal{V}^-(z)$ also simplifies:

$$\mathcal{V}^-(z) = \sup_{S'} \frac{1}{a_{S'}}\bigl((e_S - e_{S'})^T z + a_{S'}\, e_S^T z\bigr) = e_S^T z + \sup_{S'} \frac{1}{a_{S'}} (e_S - e_{S'})^T z.$$

Let $z_{(1)}, z_{(2)}$ be the largest and second largest scan statistics. We have (using that each $(e_{S'} - e_S)^T z \leq 0$ on the selection event and $|a_{S'}| \leq 1$)

$$\mathcal{V}^-(z) \leq z_{(1)} + \sup_{S'} (e_{S'} - e_S)^T z = z_{(1)} + z_{(2)} - z_{(1)} = z_{(2)}.$$

Intuitively, the pivot will be large (the p-value will be small) when $e_S^T z$ exceeds the lower truncation limit $\mathcal{V}^-$ by a large margin. Since the second largest scan statistic is an upper bound for the lower truncation limit, the test will reject when $z_{(1)}$ exceeds $z_{(2)}$ by a large margin.

Theorem 3.1. The test that rejects when

$$F\bigl(z_{(1)}, 0, e_{\hat{S}}^T \Sigma e_{\hat{S}}, \mathcal{V}^-(z), \infty\bigr) \geq 1 - \alpha, \quad \text{where } \mathcal{V}^-(z) = e_{\hat{S}}^T z + \sup_{S'} \frac{1}{a_{S'}} (e_{\hat{S}} - e_{S'})^T z,$$

is a valid $\alpha$-level test for $H_0: e_{\hat{S}}^T\,\mathbb{E}[z] = 0$.

To our knowledge, most procedures for obtaining valid inference on scan statistics require careful characterization of the asymptotic distribution of $e_{\hat{S}}^T z$. Such results are usually only valid when the components of $z$ are independent with identical variances (e.g. see [6]), and can only be used to test the global null $H_0: \mathbb{E}[z] = 0$. Our framework not only relaxes the independence and homoscedasticity assumptions, but also allows us to form confidence intervals for the selected effect size.

## 4 Extensions to other score-based approaches

Returning to the submatrix localization problem, we note that the framework described in section 2 also readily handles other score-based approaches, as long as the scores are affine functions of the entries. The main idea is to partition $\mathbb{R}^{n \times n}$ into non-overlapping regions corresponding to the possible outcomes of the algorithm; i.e. the event that the algorithm outputs a particular submatrix is equivalent to $X$ falling in the corresponding region of $\mathbb{R}^{n \times n}$. In this section, we show how to perform exact inference on biclusters found by more computationally tractable algorithms.

### 4.1 Greedy search

Searching over all $\binom{n}{k}^2$ submatrices to find the largest average submatrix is computationally intractable for all but the smallest matrices. Here we consider a family of heuristics based on a greedy search algorithm proposed by Shabalin et al. [12] that looks for locally largest average submatrices. Their approach is widely used to discover genotype-phenotype associations in high-dimensional gene expression data. Here the score is simply the sum of the entries in a submatrix.

Algorithm 1 Greedy search algorithm
1: Initialize: select $J_0 \subset [n]$.
2: repeat
3:   $I_{l+1} \leftarrow$ the indices of the $k$ rows with the largest sums over the columns in $J_l$
4:   $J_{l+1} \leftarrow$ the indices of the $k$ columns with the largest sums over the rows in $I_{l+1}$
5: until convergence
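A minimal NumPy sketch of Algorithm 1 follows; the function name, the iteration cap, and tie-breaking via `argsort` are our choices.

```python
import numpy as np

def greedy_bicluster(X, k, J_init, max_iter=100):
    """Algorithm 1: alternately pick the k rows and then the k columns with
    the largest sums restricted to the current column/row set, until the
    selected index sets stop changing."""
    J = np.asarray(sorted(J_init))
    I = None
    for _ in range(max_iter):
        I_new = np.sort(np.argsort(X[:, J].sum(axis=1))[-k:])      # step 3
        J_new = np.sort(np.argsort(X[I_new, :].sum(axis=0))[-k:])  # step 4
        if I is not None and np.array_equal(I_new, I) and np.array_equal(J_new, J):
            break                                                  # step 5: converged
        I, J = I_new, J_new
    return I, J
```

With random restarts, one keeps the run whose selected block has the largest sum, as in Shabalin et al. [12].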
To adapt the framework laid out in section 2 to the greedy search algorithm, we must characterize the selection event. Here the selection event is the path of the greedy search:

$$E^{\mathrm{GrS}} = E^{\mathrm{GrS}}\bigl((I_1, J_1), (I_2, J_2), \ldots\bigr)$$

is the event that the greedy search selected $(I_1, J_1)$ at the first step, $(I_2, J_2)$ at the second step, etc.

In practice, to ensure stable performance of the greedy algorithm, Shabalin et al. propose to run the greedy search with random initialization 1000 times and select the largest local maximum. Suppose the $m^*$-th greedy search outputs the largest local maximum. The selection event is

$$E^{\mathrm{GrS}}_1 \cap \cdots \cap E^{\mathrm{GrS}}_{1000} \cap \Bigl\{m^* = \arg\max_{m = 1, \ldots, 1000} e_{I^{\mathrm{GrS}}_m(X)}^T X\, e_{J^{\mathrm{GrS}}_m(X)}\Bigr\},$$

where $E^{\mathrm{GrS}}_m = E^{\mathrm{GrS}}\bigl((I_m^1, J_m^1), (I_m^2, J_m^2), \ldots\bigr)$, $m = 1, \ldots, 1000$, is the event that the $m$-th greedy search selected $(I_m^1, J_m^1)$ at the first step, $(I_m^2, J_m^2)$ at the second step, etc.

An alternative to running the greedy search with random initialization many times and picking the largest local maximum is to initialize the greedy search intelligently. Let $J_{\mathrm{greedy}}(X)$ be the output of the intelligent initialization. The selection event is given by

$$E^{\mathrm{GrS}} \cap \bigl\{J_{\mathrm{greedy}}(X) = J_0\bigr\}, \tag{4.1}$$

where $E^{\mathrm{GrS}}$ is the event that the greedy search selected $(I_1, J_1)$ at the first step, $(I_2, J_2)$ at the second step, etc. The intelligent initialization selects $J_0$ when

$$e_{[n]}^T X e_j \geq e_{[n]}^T X e_{j'} \quad \text{for any } j \in J_0,\ j' \in [n] \setminus J_0, \tag{4.2}$$

which corresponds to selecting the $k$ columns with the largest sums. Thus the selection event is equivalent to $X$ falling in the polyhedral set

$$C^{\mathrm{GrS}} \cap \bigl\{X \in \mathbb{R}^{n \times n} \mid \operatorname{tr}\bigl(e_j e_{[n]}^T X\bigr) \geq \operatorname{tr}\bigl(e_{j'} e_{[n]}^T X\bigr) \text{ for any } j \in J_0,\ j' \in [n] \setminus J_0\bigr\},$$

where $C^{\mathrm{GrS}}$ is the constraint set corresponding to the selection event $E^{\mathrm{GrS}}$ (see the Appendix for an explicit characterization).

### 4.2 Largest row/column sum test

An alternative to running the greedy search is to use a test statistic based on choosing the $k$ rows and $k$ columns with the largest sums. The largest row/column sum test selects a subset of columns $J_0$ when

$$e_{[n]}^T X e_j \geq e_{[n]}^T X e_{j'} \quad \text{for any } j \in J_0,\ j' \in [n] \setminus J_0, \tag{4.3}$$

which corresponds to selecting the $k$ columns with the largest sums. Similarly, it selects the $k$ rows $I_0$ with the largest sums. Thus the selection event for initialization at $(I_0, J_0)$ is equivalent to $X$ falling in the polyhedral set

$$\bigl\{X \in \mathbb{R}^{n \times n} \mid \operatorname{tr}\bigl(e_j e_{[n]}^T X\bigr) \geq \operatorname{tr}\bigl(e_{j'} e_{[n]}^T X\bigr) \text{ for any } j \in J_0,\ j' \in [n] \setminus J_0\bigr\} \cap \bigl\{X \in \mathbb{R}^{n \times n} \mid \operatorname{tr}\bigl(e_{[n]} e_i^T X\bigr) \geq \operatorname{tr}\bigl(e_{[n]} e_{i'}^T X\bigr) \text{ for any } i \in I_0,\ i' \in [n] \setminus I_0\bigr\}. \tag{4.4}$$

The procedure of selecting the $k$ rows and columns with the largest sums was analyzed in [5]. They proved that when $\mu \geq \frac{4}{k}\sqrt{n \log(n-k)}$ the procedure recovers the planted submatrix. We show a similar result for the test statistic based on the intelligent initialization,

$$F\Bigl(\operatorname{tr}\bigl(e_{J_0(X)} e_{I_0(X)}^T X\bigr), 0, \sigma^2 k^2, \mathcal{V}^-(X), \mathcal{V}^+(X)\Bigr). \tag{4.5}$$

Under the null $\mu = 0$, the statistic (4.5) is uniformly distributed, so type 1 error is controlled at level $\alpha$. The theorem below shows that this computationally tractable test has power tending to 1 for $\mu > \frac{4}{k}\sqrt{n \log(n-k)}$.

Theorem 4.1. Let $\mu = \frac{C}{k}\sqrt{n \log(n-k)}$. Assume that $n \geq 2\exp(1)$ and $n - k \geq 2$. When $C > 1 + \frac{1}{4n^2} + \max\bigl\{\frac{2}{n},\ \sqrt{2\log(2/\alpha)}\bigr\}$, the $\alpha$-level test based on (4.5) has power at least $1 - \frac{9}{n-k}$; i.e. $\Pr(\text{reject } H_0) \geq 1 - \frac{9}{n-k}$. Further, for any sequence $(n, k)$ such that $n \to \infty$, when $C > 4$ and $k \leq \frac{n}{2}$, $\Pr(\text{reject } H_0) \to 1$.
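To make the construction concrete, the sketch below (our own helper, dense and therefore meant for small $n$ only) performs the largest row/column sum selection and encodes the constraints (4.4) as $\{A\,\mathrm{vec}(X) \le 0\}$. Feeding $A$, $b = 0$, $\Sigma = \sigma^2 I_{n^2}$, $\eta$ equal to the vectorized indicator of the selected $k \times k$ block, and $y = \mathrm{vec}(X)$ into the generic helpers sketched after Theorem 2.1 then yields the statistic (4.5).

```python
import numpy as np

def rowcol_selection_constraints(X, k):
    """Largest row/column sum selection (4.3) and the constraint matrix A
    encoding the selection event (4.4) as {A vec(X) <= 0}.  Builds a dense
    A with 2*k*(n-k) rows of length n^2, so it is meant for small n only."""
    n = X.shape[0]
    rows = np.sort(np.argsort(X.sum(axis=1))[-k:])   # k rows with largest sums
    cols = np.sort(np.argsort(X.sum(axis=0))[-k:])   # k columns with largest sums
    A = []
    for j in cols.tolist():                          # column-sum constraints
        for jp in sorted(set(range(n)) - set(cols.tolist())):
            E = np.zeros((n, n))
            E[:, jp] += 1.0                          # (column j' sum)
            E[:, j] -= 1.0                           #   - (column j sum) <= 0
            A.append(E.ravel())
    for i in rows.tolist():                          # row-sum constraints
        for ip in sorted(set(range(n)) - set(rows.tolist())):
            E = np.zeros((n, n))
            E[ip, :] += 1.0                          # (row i' sum)
            E[i, :] -= 1.0                           #   - (row i sum) <= 0
            A.append(E.ravel())
    return rows, cols, np.vstack(A)
```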
Figure 1: Random initialization with 10 restarts. (Panels: n = 50, 100, 500, 1000; x-axis: signal strength; y-axis: power.)

Figure 2: Intelligent initialization. (Panels: n = 50, 100, 500, 1000; x-axis: signal strength; y-axis: power.)

In practice, we have found that initializing the greedy algorithm with the rows and columns identified by the largest row/column sum test stabilizes the performance of the greedy algorithm and preserves power. By intersecting the selection events from the largest row/column sum test and the greedy algorithm, the test also controls type 1 error. Let $(I_{\mathrm{loc}}(X), J_{\mathrm{loc}}(X))$ be the pair of indices returned by the greedy algorithm initialized with $(I_0, J_0)$ from the largest row/column sum test. The test statistic is given by

$$F\Bigl(\operatorname{tr}\bigl(e_{J_{\mathrm{loc}}(X)} e_{I_{\mathrm{loc}}(X)}^T X\bigr), 0, \sigma^2 k^2, \mathcal{V}^-(X), \mathcal{V}^+(X)\Bigr), \tag{4.6}$$

where $\mathcal{V}^+(X)$, $\mathcal{V}^-(X)$ are now computed using the intersection of the greedy and largest row/column sum selection events. This statistic is also uniformly distributed under the null.

We test the performance of three biclustering algorithms: Algorithm 1 with a single random initialization, Algorithm 1 with 10 random restarts, and Algorithm 1 with the intelligent initialization in (4.4). We generate data from the model (1.1) for various values of $n$ and $k$. We only test the power of each procedure, since all of the algorithms discussed provably control type 1 error. The results are in Figures 1 and 2. The y-axis shows power (the probability of rejecting) and the x-axis is the rescaled signal strength $\mu\big/\sqrt{\log(n-k)/k}$. The tests were calibrated to control type 1 error at $\alpha = .1$, so any power over .1 is nontrivial. From the $k = \log n$ plot, we see that the intelligently initialized greedy procedure outperforms the greedy algorithm with a single random initialization and the greedy algorithm with 10 random initializations.

## 5 Conclusion

In this paper, we considered the problem of evaluating the statistical significance of the output of several biclustering algorithms. By considering the problem as a selective inference problem, we are able to devise exact significance tests and confidence intervals for the selected bicluster. We also show how the framework generalizes to the more practical problem of evaluating the significance of multiple biclusters. In this setting, our approach gives sequential tests that control the family-wise error rate in the strong sense.

## References

[1] Louigi Addario-Berry, Nicolas Broutin, Luc Devroye, Gábor Lugosi, et al. On combinatorial testing problems. The Annals of Statistics, 38(5):3063–3092, 2010.
[2] Brendan P.W. Ames. Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming, pages 1–37, 2012.
[3] Brendan P.W. Ames and Stephen A. Vavasis. Convex optimization for the planted k-disjoint-clique problem. Mathematical Programming, 143(1-2):299–337, 2014.
[4] Ery Arias-Castro, Emmanuel J. Candès, Arnaud Durand, et al. Detection of an anomalous cluster in a network. The Annals of Statistics, 39(1):278–304, 2011.
[5] Sivaraman Balakrishnan, Mladen Kolar, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Statistical and computational tradeoffs in biclustering. In NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning, 2011.
[6] Shankar Bhamidi, Partha S. Dey, and Andrew B. Nobel. Energy landscape for large average submatrix detection problems in Gaussian random matrices. arXiv preprint arXiv:1211.2284, 2012.
[7] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv preprint arXiv:1402.1267, 2014.
[8] Yizong Cheng and George M. Church. Biclustering of expression data. In ISMB, volume 8, pages 93–103, 2000.
[9] Laura Lazzeroni and Art Owen. Plaid models for gene expression data. Statistica Sinica, 12(1):61–86, 2002.
[10] Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. Exact post-selection inference with the lasso. arXiv preprint arXiv:1311.6238, 2013.
[11] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. arXiv preprint arXiv:1309.5914, 2013.
[12] Andrey A. Shabalin, Victor J. Weigman, Charles M. Perou, and Andrew B. Nobel. Finding large average submatrices in high dimensional data. The Annals of Applied Statistics, pages 985–1012, 2009.