# submatrix_localization_via_message_passing__372ea6d1.pdf

Journal of Machine Learning Research 18 (2018) 1-52 Submitted 6/17; Revised 2/18; Published 4/18

Submatrix localization via message passing

Bruce Hajek b-hajek@illinois.edu Department of ECE and Coordinated Science Lab University of Illinois at Urbana-Champaign Urbana, IL 61801, USA

Yihong Wu yihong.wu@yale.edu Department of Statistics and Data Science Yale University New Haven, CT 06520, USA

Jiaming Xu xu972@purdue.edu Krannert School of Management Purdue University West Lafayette, IN 47907, USA

Editor: Qiang Liu

The principal submatrix localization problem deals with recovering a K K principal submatrix of elevated mean µ in a large n n symmetric matrix subject to additive standard Gaussian noise, or more generally, mean zero, variance one, subgaussian noise. This problem serves as a prototypical example for community detection, in which the community corresponds to the support of the submatrix. The main result of this paper is that in the regime Ω( n) K o(n), the support of the submatrix can be weakly recovered (with o(K) misclassiﬁcation errors on average) by an optimized message passing algorithm if λ = µ2K2/n, the signal-to-noise ratio, exceeds 1/e. This extends a result by Deshpande and Montanari previously obtained for K = Θ( n) and µ = Θ(1). In addition, the algorithm can be combined with a voting procedure to achieve the information-theoretic limit of exact recovery with sharp constants for all K n log n( 1

8e + o(1)). The total running time of the algorithm is O(n2 log n). Another version of the submatrix localization problem, known as noisy biclustering, aims to recover a K1 K2 submatrix of elevated mean µ in a large n1 n2 Gaussian matrix. The optimized message passing algorithm and its analysis are adapted to the bicluster problem assuming Ω( ni) Ki o(ni) and K1 K2. A sharp informationtheoretic condition for the weak recovery of both clusters is also identiﬁed.

Keywords: Submatrix localization, biclustering, message passing, spectral algorithms computational complexity, high-dimensional statistics

1. Introduction

The problem of submatrix detection and localization, also known as noisy biclustering (Hartigan, 1972; Shabalin et al., 2009; Kolar et al., 2011; Butucea and Ingster, 2013; Butucea et al., 2015; Ma and Wu, 2015; Chen and Xu, 2014; Cai et al., 2017), deals with ﬁnding a submatrix with an elevated mean in a large noisy matrix, which arises in many applica-

c 2018 Bruce Hajek, Yihong Wu, and Jiaming Xu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/17-297.html.

Hajek, Wu and Xu

tions such as social network analysis and gene expression data analysis. A widely studied statistical model is the following:

W = µ1C 1 1 C 2 + Z, (1)

where µ > 0, 1C 1 and 1C 2 are indicator vectors of the row and column support sets C 1 [n1] and C 2 [n2] of cardinality K1 and K2, respectively, and Z is an n1 n2 matrix consisting of independent standard normal entries. The objective is to accurately locate the submatrix by estimating the row and column support based on the large matrix W. For simplicity we start by considering the symmetric version of this problem, namely, locating a principal submatrix, and later extend our theoretic and algorithmic ﬁndings to the asymmetric case. To this end, consider

W = µ1C 1 C + Z, (2)

where C [n] has cardinality K and Z is an n n symmetric matrix with {Zij}1 i j n being mutually independent standard normal. Given the data matrix W, the problem of interest is to recover C . This problem has been investigated in (Deshpande and Montanari, 2015; Montanari et al., 2015; Hajek et al., 2017) as a prototypical example of the hidden community problem,1 because the distribution of the entries exhibits a community structure, namely, Wi,j N(µ, 1) if both i and j belong to C and Wi,j N(0, 1) if otherwise. Assuming that C is drawn from all subsets of [n] of cardinality K uniformly at random, we focus on the following two types of recovery guarantees.2 Let ξ = 1C {0, 1}n denote the indicator of the community. Let bξ = bξ(A) {0, 1}n be an estimator.

We say that bξ exactly recovers ξ if, as n , P[ξ = bξ] 0.

We say that bξ weakly recovers ξ if, as n , d(ξ, bξ)/K 0 in probability, where d denotes the Hamming distance.

The weak recovery guarantee is phrased in terms of convergence in probability, which turns out to be equivalent to convergence in mean. Indeed, the existence of an estimator satisfying d(ξ, bξ)/K 0 is equivalent to the existence of an estimator such that E[d(ξ, bξ)] = o(K) (see (Hajek et al., 2017, Appendix A) for a proof). Clearly, any estimator achieving exact recovery also achieves weak recovery; for bounded K, these two criteria are equivalent. Intuitively, for a ﬁxed matrix size n, as either the submatrix size K or the signal strength µ decreases, it becomes more diﬃcult to locate the submatrix. A key role is played by the parameter

which is the signal-to-noise ratio for classifying an index i according to the statistic P j Wi,j, which is distributed according to N(µK, n) if i C and N(0, n) if i C . As shown in

1. A slight variation of the model in (Deshpande and Montanari, 2015; Hajek et al., 2017) is that the data matrix therein is assumed to have zero diagonal. As shown in (Hajek et al., 2017), the absence of the diagonal has no impact on the statistical limit of the problem as long as K , which is the case considered in the present paper. 2. Exact and weak recovery are called strong consistency and weak consistency in (Amini et al., 2013; Mossel et al., 2015), respectively.

Submatrix localization via message passing

Appendix A, it turns out that if the submatrix size K grows linearly with n, the informationtheoretic limits3 of both weak and exact recovery are easily attainable via thresholding. To see this, note that in the case of K n simply thresholding the row sums can provide weak recovery in O(n2) time provided that λ , which coincides with the information-theoretic conditions of weak recovery as proved in (Hajek et al., 2017). Moreover, in this case, one can show that this thresholding algorithm followed by a linear-time voting procedure achieves exact recovery whenever information-theoretically possible. Thus, this paper concentrates on weak and exact recovery in the sublinear regime of

Ω( n) K o(n). (3)

We show that an optimized message passing algorithm provides weak recovery in nearly linear O(n2 log n) time if λ > 1/e. This extends the suﬃcient conditions obtained in (Deshpande and Montanari, 2015) for the regime K = Θ( n) and µ = Θ(1).4 Our algorithm is the same as the message passing algorithm proposed in (Deshpande and Montanari, 2015), except that we ﬁnd the polynomial that maximizes the signal-to-noise ratio via Hermite polynomials instead of using the truncated Taylor series as in (Deshpande and Montanari, 2015). The proofs follow closely those in (Deshpande and Montanari, 2015), with the most essential diﬀerences described at the end of Section 2. We observe that λ > 1/e is much more stringent than λ > 4K

K , the informationtheoretic weak recovery threshold established in (Hajek et al., 2017). It is an open problem whether any polynomial-time algorithm can provide weak recovery for λ 1/e. In addition, we show that if λ > 1/e, the message passing algorithm followed by a linear-time voting procedure can provide exact recovery whenever information-theoretically possible. This procedure achieves the optimal exact recovery threshold determined in (Hajek et al., 2017) with sharp constants if K ( 1

8e + o(1)) n log n. See Section 3.1 for a detailed comparison with information-theoretic limits. The message passing algorithm is simpler to formulate and analyze for the principal submatrix recovery problem; nevertheless, we show in Section 5 how to adapt the message passing algorithm and its analysis to the biclustering problem. Sharp conditions for exact recovery for the biclustering problem was obtained in (Butucea et al., 2015). We show that calculations in (Butucea et al., 2015) with minor adjustments provide information-theoretic conditions for weak recovery as well. The connection between weak and exact recovery via the voting procedure described in (Hajek et al., 2017) carries over to the biclustering problem. The analysis of the message passing algorithm is based on the moment method adopted in (Deshpande and Montanari, 2015). When the noise matrix Z is Gaussian, an alternative technique to analyze message passing algorithms is introduced in (Bayati and Montanari, 2011) and generalized by (Javanmard and Montanari, 2013). A distinct advantage of the

3. In this paper, by information-theoretic limits, we mean the suﬃcient and necessary conditions for attaining weak or exact recovery by any estimator, regardless of its computational cost. 4. The main results (Theorems 1 and 3) of (Deshpande and Montanari, 2015) assume µ = Θ(1) but not K = Ω( n)). This is because, as pointed out at the end of the proof of (Deshpande and Montanari, 2015, Theorem 3), if K = ω(

N), then the spectral method and its proof in (Alon et al., 1998) already work. However, the state evolution analysis of message passing algorithm still assumes K = Θ( n) as stated in (Deshpande and Montanari, 2015, Lemma 2.2).

Hajek, Wu and Xu

moment method in our context is that the Gaussian assumption can be relaxed to a subgaussian assumption. Accordingly, we introduce the following assumption.

Assumption 1 Given C [n] and µ > 0, the following holds. W is an n n symmetric matrix with {Wij}1 i j n being mutually independent random variables. Let Zij = Wij µI{i,j C }. Then E [Zij] = 0 for all i, j, and var(Zij) = 1 for (i, j) C C . Finally, there is a constant γ > 0 that does not depend on n such that E es Zij eγs2/2 for s R, i.e. the W s and Z s are subgaussian with proxy variance γ.

The variance of a subgaussian random variable is less than or equal to its proxy variance, so Assumption 1 implies γ 1, and var(Zij) γ for all i, j [n] and all n 1. Of course, Assumption 1 holds in the Gaussian case such that the Zij are all N(0, 1) random variables.

Notation For any positive integer n, let [n] = {1, . . . , n}. For any set T [n], let |T| denote its cardinality and T c denote its complement. For two sets S, T, let S T = (S \ T) (T \ S) denote the set diﬀerence. For an m n matrix M, let M and M F denote its spectral and Frobenius norm, respectively. Let σi(M) denote its singular values ordered decreasingly. For any S [m], T [n], let MST R|S| |T| denote (Mij)i S,j T and for m = n abbreviate MS = MSS. For a vector x, let x denote its Euclidean norm. We use standard big O notations, e.g., for any sequences {an} and {bn}, an = Θ(bn) or an bn if there is an absolute constant c > 0 such that 1/c an/bn c. All logarithms are natural and we use the convention 0 log 0 = 0. Let Φ and Q denote the cumulative distribution function (CDF) and complementary CDF of the standard normal distribution, respectively. For ϵ [0, 1], deﬁne the binary entropy function h(ϵ) ϵ log 1

ϵ + (1 ϵ) log 1 1 ϵ. We say a sequence of events En holds with high probability, if P {En} 1 as n . Denote the Kolmogorov-Smirnov (KS) distance between distributions µ and ν by d KS(µ, ν) supx R |µ(( , x]) ν(( , x])|.

2. Algorithms and main results

To avoid a plethora of factors 1 n in the notation, we describe the message-passing algorithm using the scaled version

A = 1 n W. (4)

Under Assumption 1, the entries Aij are subgaussian with proxy variance γ

n, mean 0 or µ n,

and variance 1

n for (i, j) C C . This section presents algorithms and theoretical guarantees for the symmetric model (2). Extensions to the asymmetric case for the biclustering problem (1) are given in Section 5.2. Let f( , t): R R be a scalar function for each iteration t. Let θt+1 i j denote the message transmitted from index i to index j at iteration t + 1, which is given by

θt+1 i j = X

ℓ [n]\ {i,j} Aℓif(θt ℓ i, t), j = i [n]. (5)

Submatrix localization via message passing

with the initial conditions θ0 i j 0. Moreover, let θt+1 i denote index i s belief at iteration t + 1, which is given by

ℓ [n]\{i} Aℓif(θt ℓ i, t). (6)

The form of (5) is inspired by belief propagation algorithms, which have the natural nonbacktracking property: the message sent from i to j at time t + 1 does not depend on the message sent from j to i at time t, thereby reducing the eﬀect of echoes of messages sent by j. To present an informal derivation of the state evolution equations, which track the asymptotic distributions of the messages, let us postulate the following assumptions: Suppose that for each ﬁxed t, as n : (a) the empirical distribution of (θt i : i C ) converges to N(µt, τ 2 t ) and the empirical distribution of (θt i : i [n]\C ) converges to N(0, τ 2 t ); (b) {θt i j} are independent of A; (c) θt i j θt i. Then it follows from (6) and K = o(n) that for any i C ,

E θt+1 i | {θt ℓ i : ℓ = i} (b) = X

ℓ [n]\{i} E [Aℓi] f(θt ℓ i, t)

ℓ C \{i} f(θt ℓ i, t)

λ E [f(µt + τt Z, t)] ,

and for any i [n],

var θt+1 i | {θt ℓ i : ℓ = i} (b) = X

ℓ [n]\{i} var (Aℓi) f(θt ℓ i, t)2

ℓ [n]\(C {i}) f(θt ℓ i, t)2 + o(1)

(a),(c) n E f(τt Z, t)2 ,

where Z represents a generic standard normal random variable. Since the conditional means and variances have deterministic limits, those are also the limits of the unconditional means and variances. Therefore, we get the following recursive equations for t 0:

λE [f(µt + τt Z, t)] , (7)

τt+1 = E f(τt Z, t)2 , (8)

where the initial conditions are µ0 = τ0 = 0. Following (Deshpande and Montanari, 2015), we call (7) and (8) the state evolution equations. The heuristic derivation of state evolution equations given above is certainly not rigorous mainly due to the dependency between θt i j s and A. In Section 6, we present a rigorous justiﬁcation of state evolution equations via the moment method following (Deshpande and Montanari, 2015). A crucial fact that we exploit

Hajek, Wu and Xu

is the non-backtracking property of the message passing rule (5), which has the eﬀect of reducing the dependency between θt i j s and A. Suppose, for the time being, that message distributions are Gaussian with parameters accurately tracked by the state evolution equations. Then it is reasonable to estimate C

by selecting those indices i such that θt+1 i exceeds a given threshold. More speciﬁcally, classifying an index i based on θt+1 i boils down to testing two Gaussian hypotheses with signal-to-noise ratio µt+1

τt+1 . This gives guidance for selecting the functions f( , t) based on µt and τt to maximize µt+1

τt+1 . For t = 0 any choice of f is equivalent, so long as f(0, 0) > 0. Without loss of generality, for t 1, we can assume that the variances are normalized, namely, τt = 1 (e.g., we take f(0, 0) = 1 to make τ1 = 1) and choose f( , t) to be the maximizer of max g {E[g(µt + Z)]: E[g(Z)2] = 1} (9)

where Z N(0, 1). By change of measure, E[g(µt + Z)] = E[g(Z)ρ(Z)], where

ρ(x) = d N(µt, 1)

d N(0, 1) (x) = exµt µ2 t /2. (10)

Clearly, the best g aligns with ρ and we obtain

f(x, t) = ρ(x) p

E[ρ2(Z)] = exµt µ2 t . (11)

With this optimized f, we have τt 1 and the state evolution (7) reduces to

λE [f(µt + Z, t)] =

or, equivalently, µ2 t+1 = λeµ2 t . (12)

Therefore if λ > 1/e, then (12) has no ﬁxed point and hence µt as t . Directly carrying out the above heuristic program, however, seems challenging. To rigorously justify the state evolution equations in Section 6, we rely on the the method of moments, requiring f to be a polynomial, which prompts us to look for the best polynomial of a given degree that maximizes the signal-to-noise ratio. Denoting the corresponding state evolution by (bµt, bτt), we aim to solve the following ﬁnite-degree version of (9):

max{E[g(bµt + Z)]: E[g(Z)2] = 1, deg(g) d}. (13)

As shown in Lemma 7, this problem can be easily solved via Hermite polynomials, which form an orthogonal basis with respect to the Gaussian measure, and the optimal choice, denoted by fd( , t), is the best degree-d L2-approximation of the the likelihood ratio (10), which can be obtained by normalizing the ﬁrst d + 1 terms in the orthogonal expansion of (10). Compared to (Deshpande and Montanari, 2015, Lemma 2.3) which shows the existence of a good choice of polynomial that approximates the ideal state evolution (12) based on Taylor expansions, our approach is to ﬁnd the best message-passing rule of a given

Submatrix localization via message passing

degree which results in the following state evolution that is optimal among all polynomial f of degree d:

bµ2 t+1 = λ

bµ2k t k! . (14)

For any λ > 1/e, there is an explicit choice of the degree d depending only on λ denoted by d (λ),5 so that bµt as t and the state evolution (14) for ﬁxed t correctly predicts the asymptotic behavior of the messages when n . Therefore, as discussed above, e C produced by thresholding messages θt i, is likely to contain a large portion of C , but since K = o(n), it may (and most likely will) also contain a large number of indices not in C . Following (Deshpande and Montanari, 2015, Lemma 2.4), we show that the power iteration6 (a standard spectral method) in Algorithm 1 can remove a large portion of the outlier vertices in e C. Combining message passing plus spectral cleanup yields Algorithm 1 for estimating C

based on the messages θt i, with theoretical guarantees given in Theorem 1.

Algorithm 1 Message passing

1: Input: n, K N, µ > 0, A Rn n, d , t N, and s > 0.

2: Initialize: θ0 i j = 0 for all i, j [n] with i = j and θ0 i = 0. For t 0, deﬁne the sequence of degree-d polynomials fd ( , t) as per Lemma 7 and bµt in (14).

3: Run t 1 iterations of message passing as in (5) with f = fd and compute θt i for all i [n] as per (6).

4: Find the set e C = {i [n] : θt i bµt /2}.

5: (Cleanup via power method) Recall that A e C denotes the restriction of A to the rows

and columns with index in e C. Sample u0 uniformly from the unit sphere in R| e C| and compute ut+1 = A e Cut/ A e Cut for 0 t s log n 1. Let bu = u s log n . Return b C, the set of K indices i in e C with the largest values of |bui|.

Theorem 1 Fix λ > 1/e. Let K and µ depend on n in such a way that µ2K2/n λ and Ω( n) K o(n) as n . Suppose either C is deterministic with |C | K, or C is random such that |C |/K 1 in probability as n . Suppose Assumption 1 holds for some γ > 0. Let d = d (λ) as in (28). For every η (0, 1), there exist explicit positive constants t , s , c depending on λ, η and γ such that Algorithm 1 returns b C satisfying | b C C | ηK, with probability converging to one as n , and the total time complexity is bounded by c(η, λ, γ)n2 log n, where c(η, λ, γ) as either η 0 or λ 1/e.

Remark 2 Algorithm 1 requires the knowledge of the parameter λ to deﬁne the sequence of polynomials fd ( , t) and bµt, and the knowledge of the parameter K in the spectral cleanup

5. See (28) and Remark 9 for the expression. 6. As far as statistical utility is concerned, we could replace bu produced by the power iteration by the leading singular vector of A e C, but that would incur a higher time complexity because singular value decomposition in general takes O(n3) time to compute.

Hajek, Wu and Xu

step. To avoid the need to know K, we can simply replace the last step of the spectral cleanup (involving choosing the K coordinates of the largest magnitude of bu) by applying k-means with k = 2 on the set {|bui| : i e C}. See Appendix C for details. With this modiﬁcation, Theorem 1 continues to hold as long as λ (or a lower bound thereof) is known in order to set the degree d and the iteration number t .

Remark 3 As pointed out in (Deshpande and Montanari, 2015, Remeark 2.5), the eﬀective signal-to-noise ratio λ can be potentially improved by a suitable entrywise pre-processing of the observed matrix W. In particular, in (4) we let Aij = g(Wij) for some transformation function g : R R. The optimal transformation g in the Gaussian case for which γ = 1 is given by the maximizer of

E [g(µ + Z)] E [g(Z)] : E g(Z)2 = 1

In view of (9) and (10), the optimal transformation is the scaled likelihood ratio:

d N(0, 1) (Wij) 1 = 1 p

e Wijµ µ2/2 1

and the signal-to-noise ratio λ is increased to

If the resulting Aij is subgaussian with scale O(1/n), then Theorem 1 still applies. However, even if the results extend, in the regime of K n which we are mostly interested in, we have µ 0 and eλ = λ(1 + o(1)), and thus pre-processing cannot boost the signal-to-noise ratio asymptotically.

After the message passing algorithm and spectral cleanup are applied in Algorithm 1, a ﬁnal linear-time voting procedure is deployed to obtain weak or exact recovery, leading to Algorithm 2 next. As in (Deshpande and Montanari, 2015), we consider a threshold estimator for each vertex i based on a sum over b C given by ri = P

j b C Aij. Intuitively, ri can be viewed as the aggregated votes received by the index i in b C, and the algorithm picks the set of K indices with the most signiﬁcant votes . To show that this voting procedure succeeds in weak recovery, a key step is to prove that ri is close to P j C Aij.

If µ = Θ(1) as in (Deshpande and Montanari, 2015), given that | b C C | = o(K), the error incurred by summing over b C instead of over C could be bounded by truncating Aij to a large magnitude. However, for µ 0 that approach fails. Our approach is to introduce the clean-up procedure in Algorithm 2 based on the successive withholding method described in (Hajek et al., 2017) (see also (Condon and Karp, 2001; Mossel et al., 2014) for variants of this method). In particular, we randomly partition the set of vertices into 1/δ subsets. One at a time, one subset, say S, is withheld to produce a reduced set of vertices Sc, on which we apply Algorithm 1. The estimate obtained from Sc is then used by the voting procedure to classify the vertices in S. The analysis of the two stages is decoupled because conditioned on C , the outcome of Algorithm 1 depends only on ASc, which is independent of ASSc used in the voting.

Submatrix localization via message passing

Algorithm 2 Message passing plus voting

1: Input: n, K N, µ > 0, A Rn n, δ (0, 1) with 1/δ, nδ N, d , t N, and s > 0.

2: Partition [n] into 1/δ subsets Sk of size nδ randomly.

3: (Approximate recovery) For each k = 1, . . . , 1/δ, run Algorithm 1 (message passing for approximate recovery) with input n(1 δ), K(1 δ) , µ, ASc k, d , t , s which outputs b Ck.

4: (Clean up) For each k = 1, . . . , 1/δ compute ri = P

j b Ck Aij for all i Sk and return C , the set of K indices in [n] with the largest values of ri.

The following theorem provides a suﬃcient condition for the message passing plus voting cleanup procedure (Algorithm 2) to achieve weak recovery, and, if an additional suﬃcient condition is also satisﬁed, exact recovery.

Theorem 4 Suppose K and µ depend on n in such a way that µ2K2

n λ for some ﬁxed λ > 1/e, and Ω( n) K o(n) as n . Suppose Assumption 1 holds for some γ > 0 and |C | = K. Let δ > 0 be such that λe(1 δ) > 1. Deﬁne d = d (λ(1 δ)) as per (28). Then there exist positive constants t , s , c determined explicitly by δ, λ and γ, such that

1. (Weak recovery) Algorithm 2 returns C with |C C |/K 0 in probability as n .

2. (Exact recovery) Furthermore, assume that

Kµ 2 log K + 2 log n > γ. (15)

Let δ > 0 be chosen such that for all suﬃciently large n,

λe(1 δ), Kµ(1 2δ) 2Kγ log K + p

2Kγ log(n K) + δ

Then Algorithm 2 returns C with P{C = C } 0 as n .

The total time complexity is bounded by c(δ, λ, γ)n2 log n, where c(δ, λ, γ) as δ 0 or λ 1/e.

Remark 5 Theorem 4 ensures Algorithm 2 achieves exact recovery if both (15) and λ > 1/e hold; it is of interest to compare these two conditions. Note that

Kµ 2 log K + 2 log n =

λe r n 8e K log n 2

log K/ log n) .

Hence, if lim infn K log n/n 1/(8eγ), then lim infn log K/ log n = 1; thus (15) implies λ > 1/e and hence (15) alone is suﬃcient for Algorithm 2 to succeed. If instead lim supn K log n/n 1/(8eγ), then λ > 1/e implies (15) and thus λ > 1/e alone is suﬃcient for Algorithm 2 to succeed. The asymptotic regime considered in (Deshpande and

Hajek, Wu and Xu

Montanari, 2015) entails K = Θ( n), in which case the condition λ > 1/e is suﬃcient for exact recovery, as shown in (Deshpande and Montanari, 2015). The idea of upgrading weak recovery to exact recovery via a local voting procedure has also appeared in (Abbe et al., 2016; Mossel et al., 2015; Abbe and Sandon, 2015; Yun and Proutiere, 2015) under the context of stochastic block models with community sizes scaling linearly in n. As shown in (Hajek et al., 2017, Corollary 4) in the Gaussian case for which γ = 1, the condition (15), with strict inequality replaced by greater than or equal, is necessary for exact recovery.

We ﬁnish this section by discussing the connections and distinctions to the previous work (Deshpande and Montanari, 2015). Versions of Theorems 1 and 4 are given in (Deshpande and Montanari, 2015) for the case K = Θ( n) and µ = Θ(1). We extend the range of K to Ω( n) K o(n), showing that the message passing plus a cleanup procedure achieves the optimal exact recovery threshold in the Gaussian case with sharp constants if K ( 1

8e + o(1)) n log n. The algorithms and proofs are nearly the same; we comment here on the main diﬃculties we encountered when allowing K/ n and µ 0.

First, a key ingredient in the proof of Theorem 1 is Lemma 6. Its proof is based on the moment method and a larger K requires modiﬁcation of bounds from (Deshpande and Montanari, 2015) used in calculating the moments of messages, i.e., E h (θt i j)mi for ﬁxed m N, by a more careful counting argument. We refer the interested readers to Remark 31 right after the proof of Lemma 6 for more details.

Secondly, after the message passing algorithm and spectral cleanup are applied in Algorithm 1, a ﬁnal cleanup procedure is applied to obtain weak recovery or exact recovery (when possible). As in (Deshpande and Montanari, 2015), we consider a threshold estimator for each vertex i based on a sum over b C. If K = Θ( n) as assumed in (Deshpande and Montanari, 2015), then λ being a constant implies that the mean µ is bounded away from zero. In this case if | b C C | = o(K), the error incurred by summing over b C instead of over C could be bounded by truncating Aij to a large magnitude ρ and bounding the diﬀerence of sums by ρ C b C = o(K) µK. However, for K n with vanishing µ this approach fails. Instead, we rely on the cleanup procedure in Algorithm 2 which entails running Algorithm 1 for 1/δ times on subsampled vertices. A related diﬀerence we encounter is that if K is large enough then the condition λ > 1/e alone is not suﬃcient for exact recovery, but adding the information-theoretic condition (15) suﬃces.

Lastly, the method of moments requires f( , t) to be a polynomial so that the exponential function (11), which results in the ideal state evolution (12), cannot be directly applied. It is shown in (Deshpande and Montanari, 2015, Lemma 2.3) that for any λ > 1/e and any threshold M there exists d = d (λ, M) so that taking f to be the truncated Taylor series of (11) up to degree d results in the state evolution bµt which exceeds M after some ﬁnite time t (λ, M); however, no explicit formula of d , which is needed to instantiate Algorithm 1, is provided. Although in principle this does not pose any algorithmic problem as d can be found by an exhaustive search in O(1) time independent of n, it is more satisfactory to ﬁnd the best polynomial message passing rule explicitly which maximizes the signal-to-noise ratio for a given degree (Lemma 7) and provides an explicit formula of d as a function of λ only (Remark 9).

Submatrix localization via message passing

3. Statistical optimality and computational considerations

3.1 Comparison with information-theoretic limits in the Gaussian case

As noted in the introduction, in the regime K = Θ(n), a thresholding algorithm based on row sums provides weak and, if a voting procedure is also used, exact recovery whenever it is informationally possible in the Gaussian case. In this subsection, we compare the performance of the message passing algorithms to the information-theoretic limits on the recovery problem in the regime (3). Throughout this section we restrict attention to the Gaussian case, such that Zij N(0, 1) and γ = 1. Also, for converse results, we assume the true community C is a subset of [n] of cardinality K selected uniformly at random. Notice that the comparison here takes into account the sharp constant factors. Informationtheoretic limits for the biclustering problem are discussed in Section 5.1.

Weak recovery The information-theoretic threshold for weak recovery has been determined in (Hajek et al., 2017, Theorem 2), which, in the regime of (3), boils down to the following: If

lim inf n Kµ2

K > 1, (16)

then weak recovery is possible; conversely, if weak recovery is possible, then

lim inf n Kµ2

This implies that the minimal signal-to-noise ratio for weak recovery is

for any ϵ > 0, which vanishes in the sublinear regime of K = o(n). In contrast, in the regime of (3), message passing (Algorithm 1) demands a non-vanishing signal-to-noise ratio, namely, λ > 1/e, to achieve weak recovery. No polynomial-time algorithm is known to succeed if λ 1/e, suggesting that computational complexity might incur a severe penalty on the statistical optimality when K = o(n).

Exact recovery When the submatrix size satisﬁes (3), if (15) with γ = 1 holds, then exact recovery is possible; conversely, if exact recovery is possible, then

Kµ 2 log K + 2 log n 1. (18)

See (Hajek et al., 2017, Corollary 4). This implies that the minimal signal-to-noise ratio for exact recovery is

log K 2 (19)

for any ϵ > 0. Consequently, we ﬁnd that the critical submatrix size for which message passing (plus cleanup) can achieve optimal exact recovery is n log n. Speciﬁcally,

Hajek, Wu and Xu

K = ω n log n . In this regime, the right hand side of (19) goes to and hence the minimal signal-to-noise ratio for exact recovery is much higher than that of weak recovery via message passing, namely, 1/e. Thus, exact recovery can be attainable in polynomial-time by message-passing plus voting clean-up (Algorithm 2).

K = Θ( n log n). In this regime, if we let K = ρn log n, the right hand side of (19) is at least 1/e if ρ 1

8e and strictly less than 1/e otherwise. In view of Theorem 4, we conclude that message passing plus voting cleanup (Algorithm 2) achieves the sharp threshold of exact recovery if

8e + o(1) n log n. (20)

K = o n log n . In this regime, the right hand side of (19) is o(1). No polynomialtime algorithm (including semideﬁnite programming relaxation (Hajek et al., 2016)) is known to achieve weak, let alone exact, recovery, when λ = o(1).

A counterpart of this conclusion for the biclustering problem is obtained in Remark 20 in terms of the submatrix sizes.

3.2 Comparison with the spectral limit

It is reasonable to conjecture that λ > 1 is the spectral limit for recoverability by spectral estimation methods. This conjecture is rather vague, because it is diﬃcult to deﬁne what constitutes spectral methods. Nevertheless, some evidence for this conjecture is provided by (Deshpande and Montanari, 2015, Proposition 1.1), which, in turn, is based on results on the spectrum of a random matrix perturbed by adding a rank-one deterministic matrix (Knowles and Yin, 2013, Theorem 2.7). The message passing framework used in this paper itself provides some evidence for the conjecture. Indeed, if f(x, 0) 1 and f(x, t) = x for all t 1, the iterates θt are close to what is obtained by iterated multiplication by the matrix A, beginning with the all one vector, which is the power method for computation of the eigenvector corresponding to the principal eigenvalue of A.7 To be more precise, with this linear f the message passing equation (5) can be expressed in terms of powers of the non-backtracking matrix B R(n 2) (n 2) associated with the matrix A, deﬁned by Bef = Ae1,e21{e2=f1}1{e1 =f2}, where e = (e1, e2) and f = (f1, f2) are directed pairs of indices. Let Θt Rn(n 1) denote the messages on directed edges with Θt e = θt e1 e2. Then, (5) simply becomes Θt = Bt1. To evaluate the performance of this method, we turn to the state evolution equations (7) and (8), which yield µt = λt/2 and τt = 1 for all t 1. Therefore, by a simple variation of Algorithm 1 and Theorem 1, if λ > 1, the linear message passing algorithm can provide weak recovery. For the submatrix detection problem, namely, testing µ = 0 (pure noise) versus µ > 0, as opposed to support recovery, if λ is ﬁxed with λ > 1, a simple thresholding test based

7. If we included i, j in the summation in (5) and (6), then we would have θt = At1 exactly. Since the entries of A are OP (1/ n), we expect this only incurs a small diﬀerence to the sum for ﬁnite number of iterations.

Submatrix localization via message passing

on the largest eigenvalue of the matrix A provides detection error probability converging to zero (F eral and P ech e, 2007), while if λ < 1 no test based solely on the eigenvalues of A can achieve vanishing probability of error (Montanari et al., 2015). It remains, however, to establish a solid connection between the detection and estimation problem for submatrix localization for spectral methods.

3.3 Computational barriers in the Gaussian case

A recent line of work (Kolar et al., 2011; Ma and Wu, 2015; Chen and Xu, 2014; Cai et al., 2017) has uncovered a fascinating interplay between statistical optimality and computational eﬃciency for the recovery problem and the related detection and estimation problem.8 Assuming the hardness of the planted clique problem, rigorous computational lower bounds have been obtained in (Ma and Wu, 2015; Cai et al., 2017) through reduction arguments in the Gaussian case. In particular, it is shown in (Ma and Wu, 2015) that when K = nα for 0 < α < 2/3, merely achieving the information-theoretic limits of detection within any constant factor (let alone sharp constants) is as hard as detecting the planted clique; the same hardness also carries over to exact recovery in the same regime. Furthermore, it is shown that the hardness of estimating this type of matrix, which is both low-rank and sparse, highly depends on the loss function (Ma and Wu, 2015, Section 5.2). For example, for K = Θ( n), entry-wise thresholding attains an O(log n) factor of the minimax mean-square error; however, if the error is gauged in squared operator norm instead of Frobenius norm, attaining an O( n/ log n) factor of the minimax risk is as hard as solving planted clique. Similar reductions have been shown in (Cai et al., 2017) for exact recovering of the submatrix of size K = nα and the planted clique recovery problem for any 0 < α < 1. The results in (Ma and Wu, 2015; Cai et al., 2017) revealed that the diﬃculty of submatrix localization crucially depends on the size and planted clique hardness kicks in if K = n1 Θ(1). In search of the exact phase transition point where statistical and computational limits depart, we further zoom into the regime of K = n1 o(1). We showed in (Hajek et al., 2016) no computational gap exists in the regime K = ω(n/ log n), since a semideﬁnite programming relaxation of the maximum likelihood estimator can achieve the information limit for exact recovery with sharp constants. The current paper further pushes the boundary to K n log n( 1

8e + o(1)), in which case the sharp information limits can be attained in nearly linear-time via message passing plus clean-up. However, as soon as K n log n( 1

8e ϵ) for any ϵ > 0, a gap emerges between the statistical limits and the suﬃcient condition of message passing plus clean-up, given by λ > 1/e.

4. Proofs of algorithm correctness

We ﬁrst justify the state evolution equations via the following key lemma, which establishes the asymptotic normality of the empirical distribution of messages with mean and variance given by (7) and (8). A version of this lemma is proved in (Deshpande and Montanari,

8. The papers (Kolar et al., 2011; Ma and Wu, 2015; Chen and Xu, 2014; Cai et al., 2017) considered the biclustering version of the submatrix localization problem (1).

Hajek, Wu and Xu

2015) by assuming µ = Θ(1) and K = Θ( n). The proof is given in Section 6 using the method of moments, closely following (Deshpande and Montanari, 2015).

Lemma 6 Let f( , t) be a ﬁnite-degree polynomial for each t 0. Let K and µ depend on n such that K2µ2

n λ for some λ > 0 and Ω( n) K o(n). Suppose Assumption 1 holds for some γ > 0, and suppose either C is deterministic with |C | K, or C is random such that |C |/K 1 in probability as n . Let A = W/ n and set θ0 i j = 0. Consider the message passing algorithm deﬁned by (5) and (6). Then for each ﬁxed t, as n ,

i C δθt i, N(µt, τ 2 t )

i/ C δθt i, N(0, τ 2 t )

where µt and τt are deﬁned in (7) and (8), respectively; 1 |C | P i C δθt i and 1 n |C | P i/ C δθt i are the empirical distributions of θt i for i C and i / C , respectively.

Next we prove Theorems 1-4. Lemma 6 implies that if i C , then θt i N(µt, τ 2 t ); if i / C , then θt i N(0, τ 2 t ). Ideally, one would pick the optimal f(x, t) = eµt(x µt) which result in the optimal state evolution µt+1 =

λeµ2 t /2 and τt = 1 for all t 1. Furthermore, if λ > 1/e, then µt as t , and thus we can hope to estimate C by selecting the indices i such that θt i exceeds a certain threshold. The caveat is that Lemma 6 needs f to be a polynomial of ﬁnite degree. Next we proceed to ﬁnd the best degree-d polynomial for iteration t, denoted by fd( , t), which maximizes the signal to noise ratio. Recall that the Hermite polynomials {Hk : k 0} are the orthogonal polynomials with respect to the standard normal distribution (cf. (Szeg o, 1975, Section 5.5)), given by

Hk(x) = ( 1)k ϕ(k)(x)

i=0 ( 1)i(2i 1)!! k 2i

xk 2i, (21)

where ϕ denotes the standard normal density and ϕ(k)(x) is its k-th derivative; in particular, H0(x) = 1, H1(x) = x, H2(x) = x2 1. Furthermore, deg(Hk) = k and {H0, . . . , Hd} span all polynomials of degree at most d. For Z N(0, 1), E[Hm(Z)Hn(Z)] = m!δm=n and E[Hk(µ + Z)] = µk for all µ R; hence the relative density d N(µ,1)

d N(0,1)(x) = eµx µ2/2 admits the following expansion:

k=0 Hk(x)µk

Truncating and normalizing the series at the ﬁrst d+1 terms immediately yields the solution to (13) as the best degree-d L2-approximation to the relative density, described as follows:

Lemma 7 Fix d N and deﬁne bµt according to the iteration (14) with bµ0 = 0, namely,

bµ2 t+1 = λGd(bµ2 t ), Gd(µ) =

Submatrix localization via message passing

k=0 ak Hk(x), (24)

where ak bµk t k! (Pd k=0 bµ2k t k! ) 1/2. Then fd( , t) is the unique maximizer of (13) and the state evolution (7) and (8) with f = fd coincides with τt = 1 and µt = bµt. Furthermore, for any d 2 the equation Gd(a) = a Gd 1(a) (25)

has a unique positive solution, denoted by a d. Let

λ d = 1 Gd 1(a d) (26)

and deﬁne λ 1 = 1. Then

1. for any d N and any λ > λ d, bµt as t and hence for any M > 0,

t (λ, M) = inf{t : bµt > M} (27)

2. λ d 1/e monotonically as d according to λ d = 1/e + 1/e2+o(1)

Remark 8 The best aﬃne update gives λ 1 = 1; for the best quadratic update, a 2 =

2 and hence λ 2 = 1 1+

2 0.414. More values of the threshold are given below, which converges to 1/e 0.368 rapidly.

d 1 2 3 4 5 λ d 1 0.414 0.376 0.369 0.368

Remark 9 Let d (λ) = inf{d N : λ d < λ}, (28)

which is ﬁnite for any λ > 1/e. Then for any d d , bµt as t . As λ approaches the critical value 1/e, the degree d (λ) blows up according to d (λ) = Θ(log 1 λe 1/ log log 1 λe 1), as a consequence of the last part of Lemma 7.

Remark 10 (Best aﬃne message passing) For d = 1, the best state evolution is given by bµ2 t+1 = λ(1 + bµ2 t )

and the corresponding optimal update rule is

f1(x, t) = 1 + bµtx p

1 + bµ2 t .

This is strictly better than f(x, t) = x described in Section 3.2 which gives bµ2 t+1 = λbµ2 t ; nevertheless, in order to have bµt we still need to assume the spectral limit λ 1.

Hajek, Wu and Xu

Proof [Proof of Lemma 7] Note that any degree-d polynomial g can be written in terms of the linear combination:

k=0 ck Hk(x),

where the coeﬃcients {ck} satisfy E[g2(Z)] = Pd k=0 k!c2 k = 1. By a change of measure, E[g(bµt + Z)] = E[g(Z)ebµt Z bµ2 t /2] = Pd k=0 ckbµk t , in view of the orthogonal expansion (22). Thus, to solve the maximization problem (13), it is equivalent to solving

k=0 ckbµk t :

k=0 k!c2 k = 1

By Cauchy-Schwarz inequality, the optimal coeﬃcients and the optimal polynomial fd( , t) are given by (24), resulting in the following state evolution

λ max{E[g(bµt + Z)]: E[g(Z)2] = 1, deg(g) d} =

which is equivalent to (23). Next we analyze the behavior of the iteration (23). The case of d = 1 follows from the obvious fact that bµ2 t+1 = λ(bµ2 t + 1) diverges if and only if λ 1. For d 2, note that Gd is a strictly convex function with Gd(0) = 1 and G d = Gd 1. Also, (Gd(a) a Gd 1(a)) = a G d(a) < 0. Thus, Gd(a) a Gd 1(a) is strictly decreasing on a > 0 with value 1

d! at a = 1 and limit as a , so (25) has a unique positive solution a d and it satisﬁes a d > 1. Furthermore, (Gd(a) a Gd 1(a)) a=1 = Pd 2 k=0 1 k!, so by Taylor s theorem,

Gd(a) a Gd 1(a) = 1

1 k! + O((a 1)2),

a d = 1 + 1

d! Pd 2 k=0 1 k! + O(1/(d!)2).

Consider next the values of λ such that bµt diverges. For very large λ, Gd(a) dominates a/λ pointwise and bµt diverges. The critical value of λ is when Gd(a) and a/λ meet tangentially, namely, λGd 1(a) = 1, λGd(a) = a,

whose solution is given by a = a d and λ = λ d, where

λ d 1 Gd 1(a d) = 1 Gd 1(1) + G d 1(1)(a d 1) + O((a d 1)2)

= 1 Pd k=0 1 k! + O(1/(d!)2) = 1/e +

P k=d+1 1/k! + O(1/(d!)2)

e Pd k=0 1/k!

= 1/e + 1/e2 + o(1)

Submatrix localization via message passing

Thus, λ d is the minimum value such that for all λ > λ d, λGd(a) > a for all a > 0, so that starting from any bµt 0 we have bµt monotonically. The fact λ d is decreasing in d follows from the fact Gd is pointwise increasing in d.

Lemmas 6 and 7 immediately imply the following partial recovery results.

Lemma 11 Assume that λ > 1/e and Ω( n) K o(n). Fix any ϵ (0, 1). Let M = p

8 log(1/ϵ) and run the message passing algorithm for t iterations with f = fd , d = d (λ) as in (28), and t = t (λ, M) as in (27). Let e C = {i : θt i bµt /2}. Then with probability converging to one as n ,

1 K | e C C | 1 ϵ (29)

K(1 ϵ) | e C| nϵ. (30)

Proof Notice that

| e C C | = X

i C 1{θt i bµt /2}.

By the choice of f = fd in (24), we have τt = 1 for all t 1. It follows from Lemma 6 that

lim n 1 K | e C C | = P {bµt + Z bµt /2} , (31)

where the convergence is in probability. Notice that we have used d = d (λ) and t = t (λ, M) deﬁned by (28) and (27) in Lemma 7. Thus bµt M = p

8 log(1/ϵ) and

P {bµt + Z bµt /2} = Q(bµt /2) e bµ2 t /8 ϵ,

which, in view of (31), implies (29) with probability converging to one as n . Similarly, Lemma 6 implies that in probability

lim n 1 n| e C\C | = P {Z bµt /2} = Q(bµt /2) ϵ. (32)

Since K = o(n), we have P{K(1 ϵ) | e C| nϵ} 1.

Although e C contains a large portion of C , since | e C| is linear in n with high probability, i.e., | e C|/n Q(bµt /2) by Lemma 6, it is bound to contain a large number of outlier indices. The next lemma, closely following (Deshpande and Montanari, 2015, Lemma 2.4), shows that given the conclusion of Lemma 11, the power iteration in Algorithm 1 can remove most of the outlier indices in e C. Its proof is presented in Appendix B.

Lemma 12 Suppose λ = µ2K2

n 1/e, K , |C |/K 1 in probability, and e C is a set (possibly depending on A) such that (29) (30) hold for some 0 < ϵ < ϵ0, where 0 < ϵ0 1/2 is determined by 1 ϵ0 = 8e p

4γh(ϵ0) + 10γϵ0. Let

λ(1 ϵ)/(8 p

4eγh(ϵ) + 10eγϵ)) , (33)

Hajek, Wu and Xu

where h(ϵ) ϵ log 1

ϵ + (1 ϵ) log 1 1 ϵ is the binary entropy function. Then b C with | b C| K produced by Algorithm 1 returns | b C C | η(ϵ, λ, γ)K, with probability converging to one as n , where

η(ϵ, λ, γ) = 2ϵ + eγ 4608h(ϵ) + 11520ϵ

λ(1 ϵ)2 . (34)

Proof [Proof of Theorem 1] Given η (0, 1), choose an arbitrary ϵ (0, ϵ0) such that η(ϵ, λ, γ) deﬁned in (34) is at most η. With t speciﬁed in Lemma 11 and s speciﬁed in Lemma 12, the probabilistic performance guarantee in Theorem 1 readily follows by combining Lemmas 11 and 12. The time complexity of Algorithm 1 follows from the fact that for both the BP algorithm and the power method each iteration has complexity O(n2) and Algorithm 1 entails running BP and the power method for t and s log n iterations respectively; both t and s are constants depending only on η, λ, and γ.

Proof [Proof of Theorem 4] (Weak recovery) Fix k [1/δ] and let C k = C Sc k. Deﬁne the n(1 δ) n(1 δ) matrix Ak ASc k, which corresponds to the submatrix localization problem for a planted community C k whose size has a hypergeometric distribution, resulting from sampling without replacement, with parameters (n, K, (1 δ)n) and mean (1 δ)K. By a result of (Hoeﬀding, 1963), the distribution of |C k| is convex order dominated by the distribution that would result from sampling with replacement, namely, the Binom n(1 δ), K

distribution. In particular, Chernoﬀbounds for Binom(n(1 δ), K

n ) also hold for |C k|, so

|C k|/((1 δ)K) 1 in probability as n . Note that ((1 δ)K)2µ2

n(1 δ) λ(1 δ) and λ(1 δ)e > 1 by the choice of δ. Let d (λ(1 δ)) be given in (28), i.e.,

d (λ(1 δ)) = inf{d N : λ d < λ(1 δ)}.

Choose an arbitrary ϵ (0, ϵ0) to satisfy η(ϵ, λ(1 δ), γ) δ, i.e.,

2ϵ + eγ 4608h(ϵ) + 11520ϵ

λ(1 δ)(1 ϵ)2 δ.

Deﬁne bµt recursively according to (14) with λ replaced by λ(1 δ) and bµ0 = 0, i.e.,

bµ2 t+1 = λ(1 δ)

bµ2k t k! .

Deﬁne t (δ, λ, γ) according to (27) with M = 8 log(1/ϵ), and s (δ, λ, γ) according to (33) with λ replaced by λ(1 δ). Then Theorem 1 with n and K replaced by n(1 δ) and K(1 δ) implies that as n ,

P n | b Ck C k| δK for 1 k 1/δ o 1.

Given (C k, b Ck), each of the random variables ri n for i Sk is conditionally subgaussian with proxy variance at most Kγ. Furthermore, on the event, Ek = {| b Ck C k| δK},

| b Ck C k| | b Ck| | b Ck C k| = K(1 δ) | b Ck C k| K(1 2δ).

Submatrix localization via message passing

Therefore, on the event Ek, for i Sk C , ri n has mean greater than or equal to K(1 2δ)µ, and for i Sk\C , ri has mean zero. Deﬁne the following set by thresholding

C o = {i [n] : ri (1 2δ)

The number of indices in Sk incorrectly classiﬁed by C o Sk satisﬁes (use |Sk| = δn):

E (C o Sk) (C Sk) δne Ω(n/K),

where the last inequality follows because ri is subgaussian with proxy variance at most γK/n. Summing over k [1/δ] yields E [|C o C |] ne Ω(n/K). By Markov s inequality,

P |C o C | K2/n n2

K2 e Ω(n/K) K=o(n) = o(1).

Instead of C o, Algorithm 2 outputs C which selects the K indices in [n] with the largest values of ri. Applying the same argument as that at the end of the proof of Lemma 12, we get |C C | 2|C C o| + ||C | K|, and hence |C C |/K 0 in probability. (Exact recovery) By the union bound and Chernoﬀ s bound for subgaussian random variables, the maximum of m subgaussian random variables Xi with zero mean and proxy variance at most γ satisﬁes that

P max 1 i m Xi γ p

2 log m + t m exp p

2 log m + t 2 /2

2 log m t2/2 .

It follows that max1 i m Xi is at most 2γ log m + o P (1) as m . Also, for k [1/δ], |Sk C | |C | = K and |Sk\C | |[n]\C | = n K. Therefore,

min i Sk C ri n K(1 2δ)µ p

2Kγ log K + o P (

max j Sk\C ri n p

2Kγ log(n K) + o P (

Since k ranges over a ﬁnite number of values, namely, [1/δ], (35) and (36) continue to hold with left-hand sides replaced by mini C ri n and maxj / C ri n, respectively. Therefore, by the choice of δ, mini C ri n > maxj [n]\C ri n with probability converging to one as n and so C = C with probability converging to one as well. (Time complexity) The running time of Algorithm 2 is dominated by invoking Algorithm 1 for a constant number, 1/δ, of times, and the number of iterations within Algorithm 1 is (t + s log n)n2, with both t and s as either δ 0 or λ 1/e. In particular, the threshold comparisons require O(n2) computations. Thus, the total complexity of Algorithm 2 is as stated in the theorem.

Hajek, Wu and Xu

5. The Gaussian biclustering problem

We return to the biclustering problem where the goal is to locate a submatrix whose row and column support need not coincide. Consider the model (1) parameterized by (n1, n2, K1, K2, µ) indexed by a common n with n . In Section 5.1 we present the information limits for weak and exact recovery for the Gaussian bicluster model. The sharp conditions given for exact recovery are from (Butucea et al., 2015), and calculations from (Butucea et al., 2015) with minor adjustment provide conditions for weak recovery as well. Section 5.2 shows how the optimized message passing algorithm and its analysis can be extended from the symmetric case to the asymmetric case for biclustering and compares its performance to the fundamental limits. As originally observed in (Hajek et al., 2017) for recovering the principal submatrix, the connection between weak and exact recovery via the voting procedure extends to the biclustering problem as well. Note that for the sake of simplicity, we focus on Gaussian biclustering where the noise matrix Z is Gaussian; the results can be readily extended to the case where Z has subguassian entries as we did in the symmetric case.

5.1 Information-theoretic limits for Gaussian biclustering

Information-theoretic conditions ensuring exact recovery of both C 1 and C 2 by the maximum likelihood estimator (MLE), i.e.,

( b CMLE 1 , b CMLE 2 ) = arg max |C1|=K1 |C2|=K2

are obtained in (Butucea et al., 2015). While (Butucea et al., 2015) does not focus on conditions for weak recovery, the calculations therein combined with the voting procedure for exact recovery described in (Hajek et al., 2017) in fact resolve the information limits for both weak and exact recovery in the bicluster Gaussian model. Throughout this section we assume that Ki = o(ni) for i = 1, 2. For the converse results we assume C i is a subset of [ni] of cardinality Ki selected uniformly at random for i = 1, 2, with C 1 independent of C 2. Let

λi = K2 i µ2

ni , for i = 1, 2.

Theorem 13 (Weak recovery thresholds for Gaussian biclustering) If

lim inf n µ K1K2 p

2(K1 log(n1/K1) + K2 log(n2/K2)) > 1, (37)

then both C 1 and C 2 can be weakly recovered by the MLE. Conversely, if both C 1 and C 2 can be weakly recovered by some estimator, then

lim inf n µ K1K2 p

2(K1 log(n1/K1) + K2 log(n2/K2) 1. (38)

Submatrix localization via message passing

If C 2 (C 1) can be weakly recovered, one can further obtain exact recovery of C 1 (C 2) via a voting cleanup procedure similar to Algorithm 2; it uses the method of successive withholding. We give the voting procedure in Algorithm 3 for exact recovery of C 1 based on weak recovery of C 2; exact recovery of C 2 based on weak recovery of C 1 is analogous.

Algorithm 3 Weak recovery of C 2 plus cleanup for exact recovery of C 1

1: Input: n1, n2, K1, K2 N, µ > 0, A Rn1 n2, δ (0, 1) with 1/δ, n1δ N.

2: (Partition) Partition [n1] into 1/δ subsets Sk of size n1δ randomly.

3: (Approximate recovery) For each k = 1, . . . , 1/δ, let Ak denote the restriction of A to the rows with index in Sc k, run an estimator capable of weak recovery of C 2 with input (n1(1 δ), n2, K1(1 δ) , K2, µ, Ak) which outputs b C2k.

4: (Clean up) For each k = 1, . . . , 1/δ compute ri = P

j b C2k Aij for all i Sk and return C 1, the set of K indices in [n1] with the largest values of ri.

Theorem 14 (Exact recovery thresholds for Gaussian biclustering) If for some small δ > 0, C 2 can be weakly recovered even if a fraction δ of the rows of the matrix are hidden, and if

K2µ 2 log K1 + 2 log n1 > 1, (39)

then C 1 can be exactly recovered by the voting procedure. Conversely, if C 1 can be exactly recovered by some estimator, then

K2µ 2 log K1 + 2 log n1 1. (40)

Similarly, if for some small δ > 0, C 1 can be weakly recovered even if a fraction δ of the columns of the matrix are hidden, and if

K1µ 2 log K2 + 2 log n2 > 1, (41)

then C 2 can be exactly recovered by the voting procedure. Conversely, if C 2 can be exactly recovered by some estimator, then

K1µ 2 log K2 + 2 log n2 1. (42)

The proofs of Theorems 13 and 14 are given in Appendix D. The suﬃcient conditions involving δ in Theorem 14 require a certain robustness of the estimator for weak recovery. If the rows indexed by a set S, with S [n1] and |S| = δn1, are hidden, then the observed matrix has dimensions n1(1 δ) n2 and the planted submatrix has K1 |S C 1| K1(1 δ) rows and K2 columns. It is shown in (Hajek et al., 2017, Section IV.B) that the MLE has this robustness property for weak recovery of a principal submatrix, and a similar extension can be established for weak recovery for biclustering. The estimator used is the MLE based on the assumption that the submatrix to be found has shape K1(1 δ) K2. With that extension in hand, the following corollary is a consequence of the two theorems, and it recovers the main result of (Butucea et al., 2015).

Hajek, Wu and Xu

Corollary 15 If (37), (39), and (41) hold, then C 1 and C 2 can both be exactly recovered by the MLE. Conversely, if exact recovery is possible, then (38), (40), and (42) hold.

We conclude this subsection with a few remarks on Theorems 13 and 14:

1. If n1 = n2 and K1 = K2, the suﬃcient conditions and the necessary conditions for weak and for exact recovery, respectively, are identical to those in (Hajek et al., 2017) for the recovery of a K K principal submatrix with elevated mean, in a symmetric n n Gaussian matrix. Basically, in the bicluster problem the data matrix provides roughly twice the information (because the matrix is not symmetric) and there is twice the information to be learned, namely C 1 and C 2 instead of only C , and the factors of two cancel to yield the same conditions. It therefore follows from (Hajek et al., 2017, Remark 7), that if n1 = n2 and K1 = K2 n1/9 1 , then (37) implies (39) and (41); in this regime, (37) alone is the sharp condition for both weak and exact recovery.

2. If λi = K2 i µ2

ni are two ﬁxed positive constants and if K1 K2, then (37) holds for all suﬃciently large n, so weak recovery is information theoretically possible. In contrast, our proof that the optimized message passing algorithm provides weak recovery in this regime requires (λ1, λ2) G, where G is deﬁned in (52) in the next subsection.

5.2 Message passing algorithm for the Gaussian biclustering model

Suppose ni and Ω( ni) Ki o(ni) for i {0, 1}, as n . The belief propagation algorithm and our analysis of it for recovery of a single set of indices can be naturally adapted to the biclustering model. Let f( , t) : R R be a scalar function for each iteration t. To be deﬁnite, we shall describe the algorithm such that at each iteration, the messages are passed either from the row indices to the column indices, or vice-versa, but not both. The messages are deﬁned as follows for t 0 :

(t even) θt+1 i j = 1 n2

ℓ [n2]\{j} Wiℓf(θt ℓ i, t), i [n1], j [n2] (43)

(t odd) θt+1 j i = 1 n1

ℓ [n1]\{i} Wℓjf(θt ℓ j, t), j [n2], i [n1], (44)

with the initial condition θ0 ℓ i = 0 for (ℓ, i) [n2] [n1]. Moreover, let the aggregated beliefs be given by

(t even) θt+1 i = 1 n2

ℓ [n2] Wiℓf(θt ℓ i, t), i [n1] (45)

(t odd) θt+1 j = 1 n1

ℓ [n1] Wℓjf(θt ℓ j, t), j [n2]. (46)

Recall λi = K2 i µ2

ni for i = 1, 2. Suppose as n , for t even (odd), θt i is approximately N(µt, τt) for i C 1 (i C 2) and N(0, τt) for i [n1]\C 1 (i [n2]\C 2). Then similar to

Submatrix localization via message passing

the symmetric case, the update equations of message passing and the fact that θt i j θt i for all i, j suggest the following state evolution equations for t 0:

( λ2E [f(µt + τt Z, t)] t even λ1E [f(µt + τt Z, t)] t odd (47)

τt+1 = E f(τt Z, t)2 . (48)

The optimal choice of f for maximizing the signal-to-noise ratio µt+1

τt+1 is again f(x, t) =

exµt µ2 t . With this optimized f, we have τt+1 = 1 and the state evolution equations reduce to

( λ2eµ2 t t even λ1eµ2 t t odd (49)

with µ0 = 0. To justify the state evolution equations, we rely on the method of moments, requiring f to be polynomial. Thus, we choose f = fd( , t) as per Lemma 7, which maximizes the signal-to-noise ratio among all polynomials with degree up to d. With f = fd, we have τt+1 = 1 and the state evolution equations reduce to

( λ2Gd(bµ2 t ) t even λ1Gd(bµ2 t ) t odd (50)

where Gd(µ) = Pd k=0 µk

k! . Combining message passing with spectral cleanup, we obtain the following algorithm for estimating C 1 and C 2.

Algorithm 4 Message passing for biclustering

1: Input: n1, n2, K1, K2 N, µ > 0, W Rn1 n2, d N, t 2N, and s > 0.

2: Initialize: θ0 ℓ i = 0 for (ℓ, i) [n2] [n1]. For t 0, deﬁne the sequence of degree-d

polynomials fd ( , t) as per Lemma 7 and bµt according to (50).

3: Run t iterations of message passing as in (43) and (44) with f = fd and compute θt i for all i [n1] as per (45) and θt +1 j for all j [n2] as per (46).

4: Find the sets e C1 = {i [n1] : θt i bµt /2} and e C2 = {j [n2] : θt +1 j bµt +1/2}.

5: (Cleanup via power method) Denote the restricted matrix W e C1 e C2 by f W. Sample u0

uniformly from the unit sphere in R| e C1| and compute ut+2 = f W f W ut/ f W f W ut , for t even and 0 t 2 s log(n1n2) 2. Let bu = u2 s log(n1n2) . Return b C1, the set of K1 indices i in e C1 with the largest values of |bui|. Compute the power iteration with f W f W for odd values of t and return b C2 similarly.

We now turn to the performance of Algorithm 4. Let

G = {(λ1, λ2) : µt }, (51)

Gd = {(λ1, λ2) : bµt }. (52)

Hajek, Wu and Xu

As d , Gd(µ) eµ uniformly over bounded intervals. It suggests that if (λ1, λ2) G, then there exists a d (λ1, λ2) such that (λ1, λ2) Gd and hence bµt as t . The following lemma conﬁrms this intuition.

Lemma 16 For d 1, Gd Gd+1 with G1 = {(λ1, λ2) : λ1λ2 1}, and d=1Gd = G.

Proof By deﬁnition, G1(x) = 1 + x and thus for t even, bµ2 t+2 = λ1(1 + λ2(1 + bµ2 t )). As a consequence, bµt if and only if λ1λ2 1, proving the claim for G1. Let φd(x) λ1Gd(λ2Gd(x)) so that bµ2 t+2 = φd(bµ2 t ) for t even . The fact Gd Gd+1 G follows from the fact φd(x) is increasing in d and φd(x) < φ(x), where φ is deﬁned in Remark 17. To prove d=1Gd = G, ﬁx (λ1, λ2) G. It suﬃces to show that (λ1, λ2) Gd for d suﬃciently large. Since φ2(x)/x2 as x , there exists an absolute constant x0 > 1 such that φd(x) x2 whenever x x0 and d 2. Let t0 be an even number such that µ2 t0 > x0. Since φd(x) converges to φ(x) uniformly on bounded intervals, it follows that the ﬁrst t0/2 iterates using φd converge to the corresponding iterates using φ. So, for d large enough, bµ2 t0 > x0, and hence, for such d, bµ2 t as t , so (λ1, λ2) Gd.

0.2 0.4 0.6 0.8 1

Figure 1: Required signal-to-noise ratios by Algorithm 4 for biclustering.

Remark 17 Clearly G is an open subset of R2 + and G is an upper closed set. Let G denote its boundary and let φ(x) λ1eλ2ex, so that µ2 t+2 = φ(µ2 t ) for t even. Note that (λ1, λ2) G if and only if the function is such that for some x > 0, φ(x) = x and φ (x) = 1. Since φ (x) = φ(x)y, where y = λ2ex, it follows that xy = 1 where x = λ1ey. Therefore, it is convenient to express the boundary of G in the parametric form

G = {(xe 1/x, x 1e x) : x > 0}.

It follows that (1/e, 1/e) G and {(λ1, λ2) R2 + : λ1λ2 e 2}\{(1/e, 1/e)} G (see Fig. 1 for an illustration). Boundaries of Gd can be determined similar to (25) (see Fig. 2 for plots).

Submatrix localization via message passing

0.0 0.5 1.0 1.5 2.0 0.0

Figure 2: Boundaries of the regions Gd for d = 1, 2, 3; as d increases, Gd converges to G in Fig. 1.

The correctness proof for the spectral clean-up procedure in Algorithm 4 is given by Lemma 18 below with s deﬁned by (56); it is similar to Lemma 12 used in Theorem 1 but applies to rectangular matrices and uses singular value decomposition.

Lemma 18 Suppose µ K1K2 n1 + n2 1

for some c0 > 0. For i = 1, 2, suppose that |C i | Ki 1 in probability and e Ci is a set (possibly depending on W) such that

1 Ki | e Ci C i | 1 ϵ (54)

Ki(1 ϵ) | e Ci| niϵ (55)

hold for some 0 < ϵ < ϵ0, where ϵ0 depends only on c0. Let

log 1 ϵ 3c0 p

where h(ϵ) ϵ log 1

ϵ + (1 ϵ) log 1 1 ϵ is the binary entropy function. Then b Ci returned by Algorithm 4 satisﬁes | b Ci C i | η(ϵ)Ki for i = 1, 2, with probability converging to one as n , where

η(ϵ) = 2ϵ + 650c2 0 h(ϵ) + ϵ (1 ϵ)2 . (57)

With Lemma 18, we are ready to show that the bicluster message passing algorithm (Algorithm 4) approximately recovers C 1 and C 2, provided that (λ1, λ2) G.

Hajek, Wu and Xu

Theorem 19 Fix λ1, λ2 > 0. Suppose K2 i µ2

ni λi, K1 K2, and Ω( ni) Ki o(ni) as n , for i = 1, 2. Consider the model (1) with |C i |/Ki 1 in probability as n . Suppose (λ1, λ2) G and deﬁne d (λ1, λ2) as in (59). For every η (0, 1), there exist explicit positive constants t , s depending on (λ1, λ2, η) such that Algorithm 4 returns | b Ci C i | (1 η)Ki for i = 1, 2 with probability converging to 1 as n , and the total running time is bounded by c(η, λ1, λ2)n1n2 log(n1n2), where c(η, λ1, λ2) as either η 0 or (λ1, λ2) approaches G.

Remark 20 (Exact biclustering via message passing) If the assumptions of Theorem 19 hold and the voting condition (39) (respectively, (41)) holds, then C 1 (respectively, C 2) can be exactly recovered by message passing plus a voting procedure as described in Algorithm 3. Similar to the analysis in the symmetric case, whenever information-theoretic suﬃcient conditions for exact recovery (39) (41) imply the suﬃcient condition of message passing for weak recovery, i.e., (λ1, λ2) G deﬁned in (52), there is no computational gap for exact recovery. To be more precise, consider Ki = ρin log n for i = 1, 2. Then (39) and (41) are equivalent to λi > 8ρi. Thus, whenever K1 and K2 are large enough so that (8ρ1, 8ρ2) lies in the closure cl(G), or more generally, lim inf n K1 log n1

n1 , lim inf n K2 log n2

8cl(G), (58)

then Algorithm 4 plus voting achieves the information-theoretic exact recovery threshold with optimal constants. This result can be viewed as a two-dimensional counterpart of (20) obtained for the symmetric case.

Proof [Proof of Theorem 19] The proof follows step-by-step that of Theorem 1; we shall point out the minor diﬀerences. Given λ1 and λ2, deﬁne

d (λ1, λ2) = inf{d N : (λ1, λ2) Gd}. (59)

By the assumptions of Theorem 19, there exists c0 > 0 so that (53) holds. Given any η (0, 1), choose an arbitrary ϵ (0, ϵ0) such that η(ϵ) deﬁned in (57) is at most η. Notice that ϵ0 is determined by c0. Let M = 8 log(1/ϵ) and choose

t (λ1, λ2, M) = inf {t : min{bµt, bµt+1} > M} . (60)

In view of Lemma 16 and the assumption that (λ1, λ2) G, d is ﬁnite. Since (λ1, λ2) Gd , it follows that bµt and thus t (λ1, λ2, M) is ﬁnite. The assumptions of Theorem 19 imply that n1 n2. Lemmas 27 - 29 therefore go through as before, with n in the upper bounds taken to be min{n1, n2}, so that 1 ni 1 n. This modiﬁcation then implies that Lemma 6, justifying the state evolution equations, goes through as before. See Section 6.1 for more details. Finally, the proof is complete by invoking Lemma 18.

Submatrix localization via message passing

6. Justiﬁcation of state evolution equations

In this section we prove Lemma 6. Let f(x, t) = Pd i=0 qt ixi with |qt i| C for a constant C. Let {At, t 1} be i.i.d. matrices distributed as A conditional on C and let A0 = A. We now deﬁne a sequence of vectors {ξt, t 1} with ξt Rn given by

ξt+1 i j = X

ℓ [n]\ {i,j} At ℓif(ξt ℓ i, t), j = i [n] (61)

ℓ [n]\{i} At ℓif(ξt ℓ i, t)

ξ0 i j = 0. (62)

In the deﬁnition of ξt, fresh samples, At, of A are used at each iteration, and thus the moments of ξt in the asymptotic limit are easier to compute than those of θt. Use of the fresh samples At does not make the messages (ξt i ℓ: i [n]\ℓ) independent for ﬁxed ℓ [n] and ﬁxed t 2, because at t = 1 the messages sent by any one vertex to all other vertices are statistically dependent, so at t = 2 the messages sent by all vertices are statistically dependent. However, we can take advantage of the fact that the contribution of each individual message is small in the limit as n . Hence, we ﬁrst prove that ξt

and θt have the same moments of all orders as n , and then prove the lemma using the method of moments. The ﬁrst step is to represent (θt i j, θt i) and (ξt i j, ξt i) as sums over a family of ﬁnite rooted labeled trees as shown by (Deshpande and Montanari, 2015, Lemma 3.3). We next introduce this family in detail. We shall consider rooted trees T of the following form. All edges are directed towards the root. The set of vertices and the set of (directed) edges in a tree T are denoted by V (T) and E(T), respectively. Each vertex has at most d children. The set of leaf vertices of T, denoted by L(T), is the set of vertices with no children. Every vertex in the tree has a label which includes the type of the vertex, where the types are selected from [n]. The label of the root vertex consists of the type of the root vertex, and for every non-root vertex the label has two arguments, where the ﬁrst argument in the label is the type of the vertex (in [n]), and the second one is the mark (in {0, . . . , d}). For a vertex v in T, let ℓ(v) denote its type, r(v) its mark (if v is not the root), and |v| its distance from the root in T. For clarity, we restate the deﬁnition of family of rooted labeled trees introduced in (Deshpande and Montanari, 2015, Deﬁnition 3.2).

Deﬁnition 21 Let T t denote the family of labeled trees T with exactly t generations satisfying the conditions:

1. The root of T has degree 1.

2. Any path (v1, v2, . . . , vk) in the tree is non-backtracking, i.e., the types ℓ(vi), ℓ(vi+1), ℓ(vi+2) are distinct for all i, k.

3. For a vertex u that is not the root or a leaf, the mark r(u) is set to the number of children of v.

4. Note that t = maxv L(T) |v|. All leaves u with |u| t 1 have mark 0.

Hajek, Wu and Xu

Let T t i j T t be the subfamily satisfying the following additional conditions:

1. The type of the root is i.

2. The root has a single child with type distinct from i and j.

Similarly, let T t i T t be the subfamily satisfying the following:

1. The type of the root is i.

2. The root has a single child with type distinct from i.

We point out that under the above deﬁnition, a vertex of a tree in T t can have siblings of the same type and mark. Also two trees in T t are considered to be the same if and only if the labels of all vertices are the same, with the understanding that the order of the children of any given vertex matters. In addition, the mark of a leaf u with |u| = t is not speciﬁed and can possibly take any value in {0, . . . , d}. The following lemma is proved by induction on t and the proof can be found in (Deshpande and Montanari, 2015, Lemma 3.3).

A(T)Γ(T, q, t)θ(T),

A(T)Γ(T, q, t)θ(T),

u v E(T) Aℓ(u),ℓ(v),

Γ(T, q, t) Y

u v E(T) qt |u| r(u) ,

u v E(T):u L(T) (θ0 ℓ(u) ℓ(v))r(u).

A(T)Γ(T, q, t)θ(T),

A(T)Γ(T, q, t)θ(T),

u v E(T) At |u| ℓ(u),ℓ(v).

9. Often the initial messages for message passing are taken, with some abuse of notation, to have the form θ0 i j = θ0 i for all j, and then only the n variables θ0 i need to be speciﬁed. In that case, the expression for θ(T) simpliﬁes to θ(T) Q

u L(T )(θ0 ℓ(u))r(u).

Submatrix localization via message passing

Since the initial messages are zero, f(θ0 i j, 0) = q0 0. Thus, for notational convenience in what follows, we can assume without loss of generality that f(x, 0) q0 0, i.e., f(x, 0) is a degree zero polynomial. With this assumption, it follows that for a labeled tree T T t, Γ(T, q, t) = 0 unless the mark of every leaf of T is zero. If the mark of every leaf is zero, then θ(T) = 1, because in this case θ(T) is a product of terms of the form 00, which are all one, by convention. Therefore, Γ(T, q, t)θ(T) = Γ(T, q, t) for all T Tt. Consequently, the factor θ(T) can be dropped from the representations of θt i j, θt i, ξt i j, and ξt i given in Lemma 22. Applying Lemma 22, we can prove that all ﬁnite moments of θt i and ξt i are asymptotically the same. Before that, we need two key auxiliary lemmas.

Let φ(T)rs denote the number of occurrences of edges (u v) in the tree T with types ℓ(u), ℓ(v) = {r, s}.

Deﬁnition 23 For m 1 and given an m-tuple of trees T1, . . . , Tm, let G denote the undirected graph obtained by identifying the vertices of the same type in the trees and removing the edge directions. Let E(G) denote the edge set of G. Then an edge (r, s) is in E(G) if and only if Pm ℓ=1 φ(Tℓ)rs 1, i.e., the number of times covered is at least one. Let G1 denote the restriction of G to the vertices in C and G2 the restriction of G to the vertices in [n]\C . Let E(G1) and E(G2) denote the edge set of G1 and G2, respectively. Let EJ denote the set of edges in G with one endpoint in G1 and the other endpoint in G2.

Lemma 24 Suppose an m-tuple of trees T1, . . . , Tm T t has α edges in total, and there are k diﬀerent edges (r, s) in E(G1) which are covered exactly once, i.e., Pm ℓ=1 φ(Tℓ)rs = 1. Then E

for a consant c independent of n. The same conclusion also holds when replacing A(Tℓ) by A(Tℓ).

Proof By the deﬁnition of A(T),

j<j (Ajj ) Pm ℓ=1 φ(Tℓ)jj

E h (Ajj ) Pm ℓ=1 φ(Tℓ)jj i

j<j :Pm ℓ=1 φ(Tℓ)jj 2

E h (Ajj ) Pm ℓ=1 φ(Tℓ)jj i

j<j :Pm ℓ=1 φ(Tℓ)jj 2 E h |Ajj | Pm ℓ=1 φ(Tℓ)jj i

j<j :Pm ℓ=1 φ(Tℓ)jj 2

Pm ℓ=1 φ(Tℓ)jj

= cµkn α/2,

Hajek, Wu and Xu

where the last inequality follows because Zij are zero-mean subgaussian random vari-

ables with proxy variance γ and consequently for 1 p α, E Zij n

p cn p/2 and

p cn p/2 where c is a ﬁnite constant depending on γ, α, and µmax, where µmax

is an upper bound on µ for all n, which is ﬁnite10 by the assumptions that K = Ω( n) and λ = Θ(1).

We next deﬁne an equivalence relation on the family of m-tuples of trees in T t, which is useful for enumerating such m-tuples. Fix a set D [n] of distinguished types; the notion of equivalence depends on D. In this paper when we focus on the messages formed by one vertex {i} we take D = {i} and when we focus on the covariance of messages formed by two vertices, i and j, we take D = {i, j}.

Deﬁnition 25 For D [n], two m-tuples of trees in T t are equivalent (relative to D) if there is a permutation of the set of types [n] such that i maps to i for each i D, C maps to C , so also [n]\C maps to [n]\C , such that the following is true: the second m-tuple of trees is obtained by applying the permutation to the types of the vertices of the ﬁrst m-tuple of trees.

To clarify, if two m-tuples of trees are equivalent, in particular, the marks of the two mtuples must be the same, and the set of vertices with type i D is the same. Recall that for trees to be considered equal, the order of children matters. The same is true when considering trees to be equivalent relative to D. Thus, for example, if (T1, T2) is equivalent to (T 1, T 2) and if (T1, T2) has the following property: the ﬁrst child of the ﬁrst child of the root in T1 has the same type as the third child of the second child of the root in T2, then (T 1, T 2) must have the same property. Furthermore, if the common type for those two vertices in (T1, T2) is some i D, then those two vertices in (T 1, T 2) must also have type i. If the common type for those two vertices in (T1, T2) is some k C \D, then those two vertices in (T 1, T 2) must also have some common type k C \D.

Lemma 26 For a given set of distinguished types D, let S denote the set of equivalence classes on the family of m-tuples of trees in T t. Then |S| c for a constant c dependent on only m, t, d, |D| (not on n). Moreover, ﬁx any equivalence class S S and a representative m-tuple of trees (T1, . . . , Tm) which has t1 and t2 distinct types in C \D and [n]\(C D), respectively. Then |S| Kt1nt2.

Proof The total number of vertices of an m-tuple (T1, . . . , Tm) is bounded by a function of m, t, d alone and thus independently of n, therefore so are the number of ways to partition these vertices into subsets of vertices with the same type. For each such subset, we need to designate whether the type of the vertices in this subset is one of the distinguished types i D, or is in C \D, or is in [n]\(C D). The total number of distinct such designations is bounded by a function of m, t, d, |D| independently of n. Finally, we need to assign marks to the vertices of the trees, and the number of distinct assignments is bounded by a function of m, t, d alone. Hence, |S| is bounded by a function of m, t, d alone.

10. This is where the assumption K = Ω( n) is used because K2µ2

n is assumed to be a constant λ.

Submatrix localization via message passing

Fix a given equivalence class S S and a representative m-tuple of trees (T1, . . . , Tm) in S. Consider all the types from [n]\D that appear at least once for some vertex of some tree in the m-tuple. Then t1 is the number of such types in C \D and t2 is the number of such types in [n]\(C D). The cardinality of S is at most the product of the number of partial permutations of length t1 of elements chosen from C \D, times the number of partial permutations of length t2 of elements chosen from [n]\(C D). The conclusion follows.

Combining Lemmas 22, 24, and 26, we have the following lemma, showing that all ﬁnite moments of θt i and ξt i are asymptotically close.

Lemma 27 For any t 1, there exists a constant c independent of n and dependent on m, t, d, C such that for any i [n]: E (θt i)m E (ξt i)m cn 1/2.

Proof As explained right after Lemma 22, the assumption that f(x, 0) q0 0 implies that the factor θ(T) can be dropped in the representations given in Lemma 22. Therefore, it follows from Lemma 22 that for t 1,

E (θt i)m = X

T1,...,Tm T t i

ℓ=1 Γ(Tℓ, q, t)E

E (ξt i)m = X

T1,...,Tm T t i

ℓ=1 Γ(Tℓ, q, t)E

Because the coeﬃcients in the polynomial are bounded by C and there are m trees with each tree containing at most 1+d+ +dt 1 (d+1)t edges, | Qm ℓ=1 Γ(Tℓ, q, t)| Cm(d+1)t. Therefore, it suﬃces to show

T1,...,Tm T t i

In the following, let c denote a constant only depending on m, t, d and its value may change line by line. Recall (G, G1, G2) obtained from a given m-tuple of trees T1, . . . , Tm as deﬁned in Deﬁnition 23. We partition set {(T1, . . . , Tm) : Tℓ T t i } as a union of four disjoint sets Q R1 R2 R3, where

1. Q consists of m-tuples of trees (T1, . . . , Tm) such that there exists an edge (r, s) in E(G2) EJ which is covered exactly once.

2. R1 consists of m-tuples of trees (T1, . . . , Tm) such that all edges in E(G2) EJ are covered at least twice and at least one of them is covered at least 3 times.

3. R2 consists of m-tuples of trees (T1, . . . , Tm) such that each edge in E(G2) EJ is covered exactly twice and the graph G contains a cycle.

Hajek, Wu and Xu

4. R3 consists of m-tuples of trees (T1, . . . , Tm) such that each edge in E(G2) EJ is covered exactly twice and the graph G is a tree.

Fix any (T1, . . . , Tm) Q and let (r, s) be an edge in E(G2) E(J) which is covered exactly once. Since E [Ars] = 0 and Ars appears in the product Qm ℓ=1 A(Tℓ) once, it follows that E [Qm ℓ=1 A(Tℓ)] = 0. Similarly, E Qm ℓ=1 A(Tℓ) = 0. Therefore, it is suﬃcient to show that for j = 1, 2, 3,

(T1,...,Tm) Rj

First consider R1. Further, divide R1 according to the total number of edges in T1, . . . , Tm and the number of edges in E(G1) which are covered exactly once. In particular, for α = 1, . . . , m(d + 1)t and k = 0, 1, . . . , α, let R1,α,k denote the subset of R1 consisting of m-tuples of trees T1, . . . , Tm such that there are α edges in T1, . . . , Tm and there are k edges in E(G1) which are covered exactly once. It suﬃces to show that

(T1,...,Tm) R1,α,k

# cn 1/2. (63)

Fix α, k and an m-tuple of trees (T1, . . . , Tm) R1,α,k. It follows from Lemma 24 that E

# cµkn α/2. (64)

Throughout the proof of this lemma, let D = {i} be the set of distinguished types for deﬁning equivalence classes as speciﬁed in Deﬁnition 25. We consider breaking R1,α,k down into a large number of smaller sets, where each set is an equivalence class. While large, it follows from Lemma 26 that the number of these smaller sets depends on m, t, d, but not on n. Hence, it suﬃces to upper bound |S| for any given equivalence class S R1,α,k. It follows from Lemma 26 that |S| Kn1nn2, where n1 is the number of vertices in G1 with types in C \{i} and n2 is the number of vertices in G2 with types in [n]\(C {i}). We further upper bound n1 and n2. The graph G is connected (because all the trees have a root of type i), so n1 + n2 (the number of vertices of G minus one) is less than or equal to the number of edges in G. The number of edges in G is at most k + α k 1

2 because there are k edges in G covered once, and the rest are covered at least twice, with one edge covered at least three times. So n1 + n2 k + α k 1

2 . Also, since k of the edges in G have both endpoints in C , and the vertices of G2 have types in [n] C , there are at most α k 1

2 edges in G with at least one endpoint in G2. Moreover, since G is connected, each connected component in G2 is connected by at least one edge to a vertex in G1. Therefore, n2 |V (G2)| α k 1

2 . The bound Kn1nn2 is maximized subject to n1 + n2 k + α k 1

2 and n2 α k 1

2 by letting equality hold in both constraints, yielding |S| Kkn α k 1

2 . Combining with (64) shows that

(T1,...,Tm) S

# cµkn α/2Kkn α k 1

k n 1/2 cn 1/2, (65)

Submatrix localization via message passing

where we ve used the fact that µK n is bounded independently of n. In a similar way, it can be shown that

(T1,...,Tm) S

(T1,...,Tm) S

# cn 1/2. (66)

Since the number of equivalence classes S does not depend on n, (63) follows. Next consider R2. The previous argument carries over with a minor adjustment on the step of upper bounding n1 and n2. In particular, deﬁne R2,α,k accordingly as R1,α,k and then consider an equivalence class S R2,α,k corresponding to some representative m-tuple in R2,α,k. The number of edges in G is at most k + α k

2 because there are k edges in G covered once, and the rest are covered at least twice. Since G has n1 + n2 + 1 vertices, is connected, and has a cycle, n1 + n2 is less than or equal to the number of edges of G minus one, so n1 + n2 k + α k 2

2 . Also, since k of the edges in G have both endpoints with types in C , and V2 has types in [n] C , there are at most α k

2 edges in G with at least one endpoint in V2. Moreover, since G is connected, each connected component in G2 is connected by at least one edge to a vertex in G1. Therefore, n2 |V (G2)| α k

2 . The bound Kn1nn2 is maximized subject to these constraints by letting equality hold in both

constraints, yielding |S| Kk 1n α k

2 . So |S|µkn α/2 µK n

k /K c/K cn 1/2, and the reminder of the proof for bounding the contribution of R2 is the same as for R1 above. Finally, consider R3. It suﬃces to establish the following claim. The claim is that for any m-tuple such that G has no cycles, if two directed edges (a b) and (c d) map to the same edge in G, then they are at the same level in their respective trees (their trees might be the same). Indeed, if the claim is true, then for any m-tuple (T1, . . . , Tm) in R3 and any pair {r, s} [n], At rs appears in Qm ℓ=1 A(Tℓ) for at most one value of t, so that E [Qm ℓ=1 A(Tℓ)] = E Qm ℓ=1 A(Tℓ) . We now prove the claim. Let {r, s} denote the edge in G covered by both (a b) and (c d), i.e. {ℓ(a), ℓ(b)} = {ℓ(c), ℓ(d)} = {r, s}. First consider the case that ℓ(b) = ℓ(d). Let u1, . . . , uk denote the directed path in the tree containing b that goes from b to the root of that tree, so b = u1 and uk is the root of the tree. Since there are no cycles in G, and hence no cycles in the set of edges {{ℓ(u1), ℓ(u2)}, . . . , {ℓ(uk 1), ℓ(uk)}}, (viewed as a simple set, i.e. with duplications removed) it follows from the non-backtracking property that ℓ(u1), . . . , ℓ(uk) are distinct vertices in G. That is, (ℓ(u1), . . . , ℓ(uk)) is a simple path in G. Similarly, let v1, . . . , vk denote the path in the tree containing d that goes from d to the root of that tree, so d = v1 and vk is the root of that tree. As for the ﬁrst path, (ℓ(v1), . . . , ℓ(vk )) is also a simple path in G. Since the roots of all m trees have the same type, ℓ(uk) and ℓ(vk ) are the same vertex in G. Therefore, (ℓ(u1), . . . , ℓ(uk), ℓ(vk 1), . . . , ℓ(v1)) is a closed walk in G that is the concatenation of two simple paths. Since G has no cycles those two paths must be reverses of each other. That is, k = k and ℓ(uj) = ℓ(vj) for all j, and hence (a b) and (c d) are at the same level in their trees.

Hajek, Wu and Xu

Consider the remaining case, namely, that ℓ(b) = ℓ(c). Let u1, . . . , uk be deﬁned as before, and let v1, . . . , vk denote the path in the tree containing c that goes from c to the root of that tree, so c = v1, d = v2, and vk is the root of that tree. Arguing as before yields that k = k and ℓ(uj) = ℓ(vj) for 1 j k. Note that k 2 and so k 2 and ℓ(u2) = ℓ(v2) = ℓ(d) = ℓ(a). Thus, the types along the directed path a u1 u2 within one of the trees violates the non-backtracking property, so the case ℓ(b) = ℓ(c) cannot occur. The claim is proved. This completes the proof of Lemma 27.

The second step is to compute the moments of ξt in the asymptotic limit n . We need the following lemma to ensure that all moments of ξt are bounded by a constant independent of n.

Lemma 28 For any t 1, there exists a constant c independent of n and dependent on m, t, d, C, γ such that for any i, j [n]

|E (ξt i j)m | c, |E (ξt i)m | c.

Proof We prove the claim for ξt i; the claim for ξt i j follows by the similar argument. Since ξ0 i j = θ0 i j = 0 for all i [n], it follows from Lemma 22 that

E (ξt i)m = X

T1,...,Tm T t i

ℓ=1 Γ(Tℓ, q, t)E

Recalling (G, G1, G2) deﬁned as Deﬁnition 23 and following the same argument as used for proving Lemma 27, we can partition set {(T1, . . . , Tm) : Tℓ T t i } as a union of four disjoint sets Q R1 R2 R3, and show that

T1,...,Tm Q

ℓ=1 Γ(Tℓ, q, t)E

T1,...,Tm R1 R2

ℓ=1 Γ(Tℓ, q, t)

Hence, we only need to check R3. Again divide R3 according to the total number of edges in T1, . . . , Tm and the number of edges in E(G1) which are covered exactly once. In particular, R3 = 1 α m(d+1)t,0 k αR3,α,k, where R3,α,k is deﬁned in the similar way as R1,α,k. Furthermore, similar to R1,α,k, consider dividing R3,α,k into a number of equivalence classes (deﬁned relative to D = {i}), the number of which depends only on m, t, d, as shown in Lemma 26. To prove the lemma, it suﬃces to show that for any such equivalence class S,

(T1,...,Tm) S

Submatrix localization via message passing

Invoking Lemma 24, we have that E

# cµkn α/2,

(T1,...,Tm) S

# cµkn α/2|S|.

Fix a representative m-tuple (T1, . . . , Tm) for S. It follows from Lemma 26 that |S| Kn1nn2, where n1 is the number of vertices in G1 with types in C \{i} and n2 is the number of vertices in G2 with types in [n]\(C {i}). We can further bound n1 and n2 in the similar way as we did for |R1,α,k|, with the only adjustment being we cannot use the assumption that there exists at least one edge which is covered at least three times. There are n1 + n2 + 1 vertices in the connected graph G and, since the m-tuple is in R3,α,k, there are at most k + α k

2 edges in G, so n1 + n2 k + α k

2 . Also, at most α k

2 edges of G have at least one endpoint in V2 so n2 α k

2 . Therefore,

|S| Kn1nn2 Kkn α k

2 . It follows that

(T1,...,Tm) S

# cµkn α/2Kkn α k

and the proof is complete.

We also need the following lemma to show the convergence of 1 |C | P i C (ξt i)m in probability using the Chebyshev inequality.

Lemma 29 For any t 1, m 1 and i [n],

i C (ξt i)m !

ℓ C \{i} (ξt ℓ i)m

i [n]\C (ξt i)m

ℓ [n]\(C {i}) (ξt ℓ i)m

where the same also holds when replacing ξt by θt.

Hajek, Wu and Xu

Proof We prove the ﬁrst claim; the other claim follows by a similar argument. Notice that

i C (ξt i)m !

E (ξt i)m(ξt j)m E (ξt i)m E (ξt j)m .

There are K diagonal terms with i = j in the last displayed equation and each diagonal term is bounded by a constant independent of n in view of Lemma 28. Hence, to prove the claim, it suﬃces to consider the cross terms. Since there are K 2 cross terms, we only need

to show that for each cross term with i = j, E h (ξt i)m(ξt j)mi E (ξt i)m E h (ξt j)mi converges to 0 as n . Using the tree representation as shown by Lemma 22 yields E (ξt i)m(ξt j)m E (ξt i)m E (ξt j)m

T1,...,Tm T t i ,T 1,...,T m T t j

ℓ=1 A(Tℓ) A(T ℓ)

where c is a constant independent of n and dependent of m, t, d, γ. Let (G, G1, G2) denote the undirected simple graphs obtained from 2m-tuple of trees (T1, . . . , Tm, T 1, . . . , T m) as deﬁned in Deﬁnition 23. Notice that roots of T1, . . . , Tm have type i and roots of T 1, . . . , T m have type j, so either G is disconnected with one component containing i and the other component containing j, or G is connected. In the former case, there is no edge (r, s) E(G) which is covered by T1, . . . , Tm and T 1, . . . , T m simultaneously and thus E Qm ℓ=1 A(Tℓ) A(T ℓ) = E Qm ℓ=1 A(Tℓ) E Qm ℓ=1 A(T ℓ) . In the latter case, i.e., G is connected. We partition set {(T1, . . . , Tm, T 1, . . . , T m) : Tℓ T t i , T ℓ T t j } as a union of two disjoint sets Q R, where

1. Q consists of 2m-tuples of trees such that G is connected and there exists an edge (r, s) in E(G2) EJ which is covered exactly once.

2. R consists of 2m-tuples of trees such that G is connected and all edges in E(G2) EJ are covered at least twice.

If (T1, . . . , Tm, T 1, . . . , T m) Q, then

ℓ=1 A(Tℓ) A(T ℓ)

We are left to check R. Following the argument used in Lemma 27, further divide R according to the total number of edges in trees and the number of edges in E(G1) which is covered exactly once. In particular, deﬁne Rα,k in the similar manner as R1,α,k. Invoking Lemma 24, it can be shown that for any 2m-tuple in Rα,k E

ℓ=1 A(Tℓ) A(T ℓ)

# cµkn α/2,

Submatrix localization via message passing

Furthermore, let D = {i, j} to be the set of distinguished vertices in deﬁning equivalence classes as speciﬁced in Deﬁnition 25, so that Rα,k is divided into a number of equivalence classes, the number of which depends only on m, t, d, by Lemma 26. For any such equivalence class S Rα,k, it follows from the last displayed equation that

T1,...,Tm,T 1,...,T m S

ℓ=1 A(Tℓ) A(T ℓ)

cµkn α/2|S|.

There are two distinguished vertices, i and j, in the graph G, corresponding to the type of the root vertices of the ﬁrst m trees and the second m trees, respectively. It follows from Lemma 26 that |S| Kn1nn2, where n1 is the number of vertices in G1 with types in C \{i, j} and n2 is the number of vertices in G2 with types in [n]\(C {i, j}). There are n1 + n2 + 2 vertices in the connected graph G and at most k + α k

2 edges, so n1 + n2 k 1 + α k

2 . At most α k

2 edges have at least one endpoint in V2 and G is

connected, so n2 α k

2 . Thus, |S| Kn1nn2 Kk 1n α k

(T1,...,Tm,T 1,...,T m) S

ℓ=1 A(Tℓ) A(T ℓ)

cµkn α/2Kk 1n α k

In conclusion, var 1

K P i C (ξt i)m c/K and the ﬁrst claim follows.

With Lemma 28 and Lemma 29 in hand, we are ready to compute the moments of ξt

in the asymptotic limit n .

Lemma 30 For any t 0, m 1:

lim n E (ξt i j)m = E [(µt + τt Zt)m] , i C , j [n], j = i (67)

lim n E (ξt i j)m = E [(τt Zt)m] , i / C , j [n], j = i. (68)

lim n E (ξt i)m = E [(µt + τt Zt)m] , i C

lim n E (ξt i)m = E [(τt Zt)m] , i / C .

Proof Below we shall use the following version of the Berry-Esseen central limit theorem. There is an absolute constant C0 such that if X1, X2, . . . , Xn are independent mean zero random variables and Sn = X1 + + Xn then

(a) C0 P i [n] E |Xi|3

(var(Sn))3/2

(b) C0n1/4 P i [n] E X4 i 3/4

(var(Sn))3/2 ,

where Φ is the standard normal CDF; (a) is the original result of Esseen (Esseen, 1942); (b) follows by Jensen s inequality.

Hajek, Wu and Xu

We prove the ﬁrst two claims, (67) and (68). The other two follow similarly. The proof is by induction, so suppose (67) and (68) hold for some t and all n 1. We aim to show they also hold for t + 1. The above identities hold for the base case t = 0, because ξ0 i j = 0 for all i = j and µ0 = τ0 = 0. By the induction hypothesis, Lemma 29, and Chebyshev s inequality,

ℓ C \{i} (ξt ℓ i)m p= E [(µt + τt Zt)m] , i [n], (69)

ℓ [n]\(C {i}) (ξt ℓ i)m p= E [(τt Zt)m] , i [n], (70)

where Zt N(0, 1). Fix an i C . Then

lim n E h ξt+1 i j|Ft i = lim n E

ℓ C \{i,j} At ℓif(ξt ℓ i) + X

ℓ [n]\(C {j}) At ℓif(ξt ℓ i)|Ft

λ lim n 1 K

ℓ C \{i,j} f(ξt ℓ i)

λ lim n 1 K

ℓ C \{i} f(ξt ℓ i)

λE [f(µt + τt Zt)] = µt+1, (71)

where the ﬁrst equality follows from the deﬁnition of ξt+1 given by (61); the second equality holds because E At ℓi = µ/ n if ℓ C and E At ℓi = 0 otherwise; the third equality holds in view of Lemma 28, the fourth equality holds due to (69) and the fact f is a ﬁnite-degree polynomial; the last equality holds due to the deﬁnition of µt+1. Similarly,

lim n var ξt+1 i j|Ft = lim n

ℓ [n]\{i,j} var At ℓif(ξt ℓ i)|Ft

= lim n 1 n

ℓ [n]\{i,j} f(ξt ℓ i)2 (72)

= lim n 1 n

ℓ [n]\C {j} f(ξt ℓ i)2 + X

ℓ C \{i,j} f(ξt ℓ i)2

p= lim n 1 n

ℓ [n]\C f(ξt ℓ i)2 (74)

p= E f(τt Zt)2 = τ 2 t+1, (75)

where the ﬁrst equality follows from the conditional independence of At ℓif(ξt ℓ i) for ℓ [n]; the second equality holds because var(Aℓi) = 1/n for all ℓ; the third equality is the result of breaking a sum into two parts, the fourth equality holds in view of Lemma 28 and the

Submatrix localization via message passing

assumption that K = o(n); the ﬁfth equality holds in view of (70) and the fact f is a ﬁnite-degree polynomial; the last equality holds due to the deﬁnition of τt+1. Conditional on Ft, ξt+1 i j E h ξt+1 i j i is a sum of independent random variables. Therefore, by the form of the Berry-Esseen central limit theorem noted above,

ξt+1 i j E h ξt+1 i j|Ft i

var(ξt+1 i j|Ft) x Ft

C0n1/4 var(ξt+1 i j|Ft) 3/2

ℓ [n]\{j} f(ξt ℓ i)4E (At ℓi E At ℓi )4

C0c3/4n 1/2 var(ξt+1 i j|Ft) 3/2

ℓ [n]\{j} f(ξt ℓ i)4

where we used the fact E (At ℓi E At ℓi )4 cn 2, for a constant c depending only on the constant γ appearing in the subgaussian assumption. Taking the limit n and noticing that var(ξt+1 i j|Ft) p τ 2 t+1 and 1

n P ℓ [n]\{j} f(ξt ℓ i)4 p E f(τt Zt)4 (using the same steps as in (72)-(75)), we ﬁnd the righthand side of (76) converges to zero in probability. Thus, in view of (71), (75), and (76), for any x R,

lim n P n ξt+1 i j x|Ft o p= P {µt+1 + τt+1Zt+1 x} .

It follows by the dominated convergence theorem that

lim n P n ξt+1 i j x o = P {µt+1 + τt+1Zt+1 x} .

and, since convergence in distribution is preserved under continuous transformations:

lim n P n ξt+1 i j m x o = P {(µt+1 + τt+1Zt+1)m x} .

Since E h ξt+1 i j m+1i c for some c independent of n, the family of random variables (ξt+1 i j)m

is uniformly integrable, so that convergence in distribution implies convergence of means to the mean of the limiting distribution. Therefore (67) holds with t replaced by t + 1. To complete the proof by induction it remains to show (68) holds with t replaced by t + 1, so ﬁx i / C . Following the previous argument, one can easily check that

E h ξt+1 i j|Ft i = 0

lim n var ξt+1 i j|Ft p= τ 2 t+1,

and that, by the central limit theorem, uniformly bounded m + 1th moments, and uniform integrability, (68) holds with t replaced by t + 1.

Hajek, Wu and Xu

Proof [Proof of Lemma 6] We show the ﬁrst claim; the second one follows analogously. Fix t 1. The convergence property to be proved depends only on the sequence of random empirical distributions of (θt i : t C ) indexed by n. We may therefore assume without loss of generality that all the random variables (θt i : t C ) for diﬀerent n are deﬁned on a single underlying probability space; the joint distribution for diﬀerent values of n can be arbitrary. To show the convergence in probability, it suﬃces to show that for any subsequence {nk} there exists a sub-subsequence {nkℓ} such that for j = i,

i C δθt i, N(µt, τ 2 t )

= 0, a.s. (77)

Fix a subsequence nk. In view of Lemmas 27 and 30, for any ﬁxed integer m,

lim k E (θt i)m = E [(µt + τt Zt)m] .

Combining Lemma 29 with Chebyshev s inequality,

θt i m p= E [(µt + τt Zt)m] , (78)

which further implies, by a well-known property of convergence in probability, that there exists a sub-subsequence such that (78) holds almost surely. Using a standard diagonal argument, one can construct a sub-subsequence {nkℓ} such that for all m 1,

lim ℓ 1 Kkℓ

θt i m = E [(µt + τt Zt)m] a.s.

Since a Gaussian distribution is determined by its moments, by the method of moments (see, for example, (Chung, 2001, Theorem 4.5.5)), applied for each outcome ω in the underlying probability space (excluding some subset of probability zero), it follows that the sequence of empirical distribution of θt i for i C weakly converges to N(µt, τ 2 t ), which, since Gaussian density is bounded, is equivalent to convergence in the KS distance,11 proving the desired (77).

Remark 31 We discuss the diﬀerence between the proof of Lemma 6 and that of (Deshpande and Montanari, 2015, Lemma 2.2). First, a larger K requires modiﬁcation of bounds from (Deshpande and Montanari, 2015) to calculate the moments of messages in Lemmas 27 - 29. In particular, (θt i j)m can be expanded as a sum of monomials in terms of Akℓ for k, ℓ [n]. A larger K implies that in the expansion, there are more monomials containing Akℓfor k, ℓ C . That eﬀect is oﬀset by µ being smaller. Our approach is to balances these two eﬀects by accounting separately the contributions of those Akℓ s which appear only once in a monomial. Such Akℓ s correspond to singly covered edges with both endpoints in C . See Lemma 24, as well as R1,α,k in Lemma 27, R3,α,k in Lemma 28, and Rα,k in Lemma 29. Finally, Lemma 6 is phrased in terms of KS distance while (Deshpande and Montanari, 2015, Lemma 2.2) is in terms of convergence of expected values of bounded Lipschitz functions.

11. This follows from the fact that when one of the distributions has bounded density the L evy distance, which metrizes weak convergence, is equivalent to the KS distance (see, e.g. (Petrov, 1995, 1.8.32)).

Submatrix localization via message passing

6.1 Justiﬁcation of state evolution equations for biclustering

In this subsection, we brieﬂy describe how to generalize the method of moments from symmetric case to asymmetric biclustering case. Recall f(x, t) = Pd r=0 qt rxr with |qt r| C for some constant C. Let {W t, t 1} be i.i.d. matrices distributed as W conditional on (C 1, C 2) and let W 0 = W. Similar to (61), we deﬁne a sequence of vectors {ξt, t 1} given by

(t even) ξt+1 i j = 1 n2

ℓ [n2]\{j} W t iℓf(ξt ℓ i, t), i [n1], j [n2] (79)

(t odd) ξt+1 j i = 1 n1

ℓ [n1]\{i} W t ℓjf(ξt ℓ j, t), j [n2], i [n1], (80)

with the initial condition ξ0 ℓ i = 0 for (ℓ, i) [n2] [n1]. Moreover, let

(t even) ξt+1 i = 1 n2

ℓ [n2] W t iℓf(ξt ℓ i, t), i [n1] (81)

(t odd) ξt+1 j = 1 n1

ℓ [n1] W t ℓjf(ξt ℓ j, t), j [n2]. (82)

The ﬁrst step is to represent (θt i j, θt i) and (ξt i j, ξt i) as sums over a family of ﬁnite rooted labeled trees. Abusing notation slightly, we treat elements from [n1] and [n2] as distinct elements. We deﬁne T t as Deﬁnition 21 with an additional constraint: a vertex u must have type from [n1] if t |u| is odd and from [n2] if t |u| is even. In particular, all leaves u with |u| = t must have type from [n2], and the root has type from [n1] if t is odd and from [n2] if t is even. The following lemma similar to Lemma 22 can be proved by induction on t similar to (Deshpande and Montanari, 2015, Lemma 3.3).

A(T)Γ(T, q, t),

A(T)Γ(T, q, t),

u v E(T) ℓ(u) [n1]

1 n1 Wℓ(u),ℓ(v) Y

u v E(T) ℓ(u) [n2]

1 n2 Wℓ(v),ℓ(u)

and Γ(T, q, t) is deﬁned as before:

Γ(T, q, t) = Y

u v E(T) qt |u| r(u) .

Hajek, Wu and Xu

A(T)Γ(T, q, t),

A(T)Γ(T, q, t),

u v E(T) ℓ(u) [n1]

1 n1 W t |u| ℓ(u),ℓ(v) Y

u v E(T) ℓ(u) [n2]

1 n2 W t |u| ℓ(v),ℓ(u) .

Similar to Deﬁnition 23, let G denote the undirected bipartite graph obtained by identifying the vertices of the same type in the tuple of trees T1, . . . , Tm and removing the edge directions. Note that the vertices of type from [n1] ([n2]) in trees constitute the left (right) part of G. Let E(G) denote the edge set of G. Let G1 denote the restriction of G to the vertices in (C 1, C 2) and G2 the restriction of G to the vertices in ([n1]\C 1, [n2]\C 2). Let E(G1) and E(G2) denote the edge set of G1 and G2, respectively. Let EJ denote the set of edges in G with one endpoint in G1 and the other endpoint in G2. Then Lemma 24 goes through as before, with n in the upper bound taken to be min{n1, n2}. Note that in the deﬁnition of A(T), either Wij is divided by n1 or Wji is divided by n2 depending on whether i [n1] or i [n2]. However, it always holds that 1 n1 1 n and 1 n2 1 n. For a given set of distinguished types D, we can deﬁne the equivalence classes on the family of m-tuples of trees in T t similar to Deﬁnition 25 except that we only allow either permutations of types in [n1] or permutations of types in [n2]. Then Lemma 26 goes throught as before, with the upper bound in Lemma 26 changes to

|S| (max{K1, K2})t1 (max{n1, n2})t2 ,

where we assume a representative m-tuple of trees (T1, . . . , Tm) has t1 and t2 distinct types in C 1 C 2\D and [n1] [n2]\(C 1 C 2 D), respectively. By assumptions that λ1, λ2 = Θ(1) and K1 K2, it follows that n1 n2 n. Furthermore, let K = min{K1, K2}. Then K1 K2 K. Hence, |S| c Kt1nt2 for an absolute constant c > 0. With Lemma 32, modiﬁed Lemma 24, and modiﬁed Lemma 26, Lemmas 27 - 29 go through as before. In particular, we still partition set {(T1, . . . , Tm) : Tℓ T t i } as a union of four disjoint sets Q R1 R2 R3, and introduce R1,α,k, R2,α,k, R3,α,k and Rα,k as before.

Acknowledgments

We would like to acknowledge support for this project from the National Science Foundation (NSF grants IIS-9988642, CCF 14-09106, CCF-1527105, CCF-1755960, OIS 18-23145, and a CAREER award CCF-1651588) and the Multidisciplinary Research Program of the Department of Defense (MURI N00014-00-1-0637).

Submatrix localization via message passing

A. Row-wise thresholding

We describe a simple thresholding procedure for recovering C . For simplicity and information theoretic comparisons, we consider the Gaussian case. Let Ri = P j Wi,j for i [n].

Then Ri N(Kµ, n) if i C and Ri N(0, n) if i / C . Let b C = n i [n] : Ri Kµ

Then E h | b C C | i = n Q Kµ 2 n

. Recall that λ = K2µ2

n . Hence, if

λ = ω log n

then we have E[| b C C |] = o(K) and hence achieved weak recovery. In the regime K n (n K), λ = ω(log n

K ) is equivalent to λ , which is also equivalent to Kµ2 and coincides with the necessary and suﬃcient condition for the information-theoretic possibility of weak recovery in this regime (Hajek et al., 2017, Corollary 2). (If instead n K = o(n), weak recovery is trivially provided by b C = [n].) Thus, row-wise thresholding provides weak recovery in the regime K n (n K) whenever information theoretically possible. Under the information-theoretic condition (15), an algorithm attaining exact recovery can be built using row-wise thresholding for weak recovery followed by voting, as in Algorithm 2 (see (Hajek et al., 2017, Theorem 3) and its proof). In the regime n K log n

K = o(log n), or equivalently K = ω(n log log n/ log n), condition (15) implies that λ = ω(log n

K ), and hence in this regime exact recovery can be attained in linear time O(n2) whenever information theoretically possible.

B. Proof of Lemma 12

Proof We remind the reader that in this paper we let A = W/ n so that var(Aij) = 1/n for i, j [n] and E [Aij] = µ/ n for i, j C . Fix a e C that satisﬁes (29) (30), i.e., | e C C | K(1 ϵ) and K(1 ϵ) | e C| nϵ. Let m = | e C| and abbreviate the restricted matrix A e C R| e C| | e C| by e A. Let 1 e C C R| e C|

denote the indicator vector of e C C . Then the mean of e A is the rank-one matrix E[ e A] = µ n1 e C C 1 e C C , whose largest eigenvalue is µ| e C C | n with the corresponding eigenvector

| e C C |1 e C C . Let Z = e A E[ e A], and let u denote the principal eigenvector of e A such

that u, v 0. Note that u v = p

2(1 u, v ) p

2(1 u, v 2) =

2 sin θ, where θ is the angle between u and v. Combining this observation with the sin theorem of (Davis and Kahan, 1970) yields

µ| e C C|/ n Z

µ| e C C|/ n 2

λ(1 ϵ) , (84)

where the last inequality follows from the assumption (29). Observe that Z is a symmetric matrix such that {Zij}i j are independent subgaussian random variables with zero mean

Hajek, Wu and Xu

and proxy variance γ/n. To bound Z , we use the following standard concentration inequality, see e.g.. (Deshpande and Montanari, 2015, Lemma A.3): For any t > 0,

P { Z t} 2 exp m t2n 16eγm log 5t2n

Note that if t p

64eγh(ϵ) + 160eγm/n,

m t2n 16eγm log 5t2n

32eγ 2nh(ϵ).

By assumption we have K(1 ϵ) m ϵn. Therefore, by setting β = p

64eγh(ϵ) + 160eγϵ, we have for any ﬁxed e C, P { Z β} 2e 2nh(ϵ). (85)

The number of possible choices of e C that fulﬁlls (30) so that | e C| ϵn is at most P k nϵ n k

which is further upper bounded by enh(ϵ) (see, e.g., (Ash, 1965, Lemma 4.7.2)). In view of (85), the union bound yields Z β with high probability as n . Throughout the reminder of this proof we assume A and e C are ﬁxed with Z β. Combining with (84), we have,

λ(1 ϵ) . (86)

Next, we argue that either bu or bu is close to u, and hence, close to v by the triangle inequality. By the choice of the initial vector u0, we can write u0 = z/ z for a standard normal vector z Rm. By the tail bounds for Chi-squared distributions, it follows that z 2 m with high probability. For any ﬁxed u, the random variable u, z N(0, 1) and thus with high probability, | u, z |2 1/ log n, and hence

| u, u0 | = | u, z |/ z (2 p

n log n) 1. (87)

Replacing u0 by u0 would result in replacing ut by ut for each t, and since b C returned by Algorithm 1 only depends on |ut i|, replacing u by u would have no eﬀect on the ouput b C of the algorithm. Thus, we can assume without loss of generality that u, u0 0, and (87) becomes u, u0 (2 p

n log n) 1. (88)

By Weyl s inequality, the maximal singular value of e A satisﬁes σ1( e A) µK(1 ϵ) n β

and the other singular values are at most β. Let r = σ2 σ1 ( e A). By the assumption that ϵ < ϵ0 and λ 1/e, we have

λ(1 ϵ) > 2β. As a consequence, r 2β

λ(1 ϵ) < 1. Since

ut = e Atu0/ e Atu0 , it follows that

ut = u, u0 u + y u, u0 u + y

Submatrix localization via message passing

for some y Rm, depending on t, such that y rt. Hence,

ut, u = u, u0 + y, u

u, u0 u + y

u, u0 + rt ,

or, equivalently,

ut u 2 = 2(1 ut, u ) 4rt

u, u0 + rt . (89)

Recall that bu = u s log n . Thus, choosing s = 2 log(

λ(1 ϵ)/(2β)) as in (33), we obtain

r s log n n 2 and consequently in view of (88) and (89),

bu u 2 4n 2

(2 n log n) 1 + n 2 1

for suﬃciently large n. Therefore, by the triangle inequality,

bu v bu u + u v n 1/2 + 2

λ(1 ϵ) βo, (90)

where (a) holds for suﬃciently large n. Let b Co be deﬁned by using a threshold test to estimate C based on bu: b Co = {i e C : |bui| τ},

where τ = 1 2 q

| e C C | . Note that vi = 2τ1{i e C C }. For any i b Co\( e C C ), we

have |bui| τ and vi = 0; for any i ( e C C )\ b Co, we have |bui| < τ and vi = 2τ. Therefore |bui vi| ||bui| |vi|| τ for all i b Co ( e C C ) and thus

bu v 2 | b Co ( e C C )|τ 2.

In view of (90), the number of indices in e C incorrectly classiﬁed by b Co satisﬁes

| b Co ( e C C )| 4β2 o| e C C | 4β2 o|C |.

Since |C \ e C| ϵK, we conclude that |C b Co| ϵK +4β2 o|C |. Thus, if the algorithm were to output b Co (instead of b C) the lemma would be proved. Rather than using a threshold test in the cleanup step, Algorithm 1 selects the K indices in e C with the largest values of |bui|. Consequently, with probability one, either b Co b C or b C b Co. Therefore, it follows that

|C b C| 2|C b Co| + |C | K .

Hajek, Wu and Xu

By assumption, |C |/K converges to one in probability, so that, in probability,

lim sup n |C b C|

K 2ϵ + 8β2 o η(ϵ, λ), (91)

where η is deﬁned in (34), completing the proof.

C. An adaptive variant of Algorithm 1

Recall that the last step of the spectral clean-up of Algorithm 1 involves choosing the K coordinates of the largest magnitude of bu. In order to be adaptive to the cluster size K, in this section we show that this step can be simply replaced by applying k-means clustering with k = 2 to {|bui|}i e C so that Theorem 1 continues to hold. Let w denote an optimal

solution, i.e., a minimizer of x |bu| e C over all x in R| e C| whose coordinates take at most two distinct values. Since |bui| is a scalar, w can be easily found by sorting {|bui|}i e C in descending order, and checking all vectors of the form (a, . . . , a, b, . . . , b), where a and b with a b 0 are given by the average of the respective set of |bui| s. Thus w can be found in time O(n log n). Deﬁne b C = {i e C : wi = a}. To show this b C fulﬁlls the same performance guarantee as in Theorem 1, it suﬃces to modify the proof of Lemma 12 to show that, for any ϵ suﬃciently small, if e C [n] satisﬁes (29) (30), then P n |C b C|

K η o 1 as n , where η is a function of ϵ and λ such that η 0 as ϵ 0 for λ ﬁxed. Without loss of generality we may also assume that | e C\C | = Ω(K). (92)

This extra condition is fulﬁlled by the output of the message-passing algorithm with high probability, because, in view of (32), | e C\C | = Θ(n) with probability tending to one. Recall that we have shown in (90) that min{ bu v , bu + v } βo. By the deﬁnition of w, since v 0 is binary-valued componentwise, we have

|bu| w |bu| v min{ bu v , bu + v } βo,

and thus w v w |bu| + |bu| v| 2βo.

Deﬁne S = {i e C : wi vi τ}, τ 1

| e C C | .

Then |S|τ 2 w v 2 4β2 0,

and consequently, |S| 16β2 0| e C C |. Since β0 can be made to be suﬃciently small by choosing ϵ to be small, we have |S| < | e C C |. Furthermore, by the assumption that |C |/K 1 in probability and (92), we have |S| < | e C\C |. Deﬁne T1 = ( e C C )\S and T0 = ( e C\C )\S, both of which are non-empty. For each i T1 and j T0, we have

wi wj vi vj |wi vi| |wj vj| > 2τ τ τ = 0,

Submatrix localization via message passing

that is, wi = a > b = wj. Hence, b C ( e C C ) S and thus

| b C ( e C C )| |S| 16β2 0| e C C | 16β2 0|C |.

Since |C \ e C| ϵK, we have that | b C C | ϵK + 16β2 0|C |. Therefore,

lim sup n |C b C|

K ϵ + 16β2 0.

Since β0 0 as ϵ 0, Theorem 1 holds for the adaptive variant of Algorithm 1.

D. Proofs of Theorems 13 and 14

In the proofs below we use the following notation. We write pe(π1, s2) to denote the minimal average error probability for testing N(µ1, σ2) versus N(µ0, σ2) with priors π1 and 1 π1, where µ1 µ0 and s2 = (µ0 µ1)2

σ2 . That is,

pe(π1, s2) min x {π1Q(s x) + (1 π1)Q(x)}.

Proof [Proof of Theorem 13] The proof of suﬃciency for weak recovery is closely based on the proof of suﬃciency for exact recovery by the MLE given in (Butucea et al., 2015); the reader is referred to (Butucea et al., 2015) for the notation used in this paragraph. The proof in (Butucea et al., 2015) is divided into two sections. In our terminology, (Butucea et al., 2015, Section 3.1) establishes the weak recovery of C 1 and C 2 by the MLE under the assumptions (37), (39), and (41). However, the assumption (39) (and similarly, (41)) is used in only one place in the proof, namely for bounding the terms T1,km deﬁned therein. We explain here why (37) alone is suﬃcient for the proof of weak recovery. Condition (37), in the notation12 of (Butucea et al., 2015), implies that there exists some suﬃciently small α > 0 such that a2m 2 log(N/n) 1 + α.

So (Butucea et al., 2015, (3.4)) can be replaced as: there exist some suﬃciently small δ1 > 0 and α1 > 0 such that

2 a2m (1 + α1) log(N/n) (1 + α1) log δ(N n)

where we use the assumption 0 k < (1 δ)n, or n k > δn. Thus, for large enough n,

T1,km exp δnα1

12. The notation of (Butucea et al., 2015) is mapped to ours as N n1, M n2, n K1, m K2, and a µ.

Hajek, Wu and Xu

from which the desired conclusion, P k:(n k)>δn T1,km = o(1), follows. This completes the proof of suﬃciency of (37) for weak recovery of both C 1 and C 2, and marks the end of our use of notation from (Butucea et al., 2015). The rate distortion argument used in the proof of (Hajek et al., 2017, Theorem 1) shows that (38) must hold if C 1 and C 2 are both weakly recoverable.

Proof [Proof of Theorem 14] We give the proof for exact recovery of C 1; the proof for exact recovery of C 2 is analogous. For the suﬃciency part, Recall that in Algorithm 3, the set [n1] is partitioned into sets, S1, . . . , S1/δ of size n1δ. There are 1/δ rounds of the algorithm, and indices in Sk are classiﬁed in the kth round. For the kth round, by assumption, given ϵ > 0, there exists an estimator b C2k based on observation of W with the rows indexed by Sk hidden such that | b C2k C 2| ϵK2 with high probability. Then the voting procedure estimates whether i C 1 for each i Sk by comparing P

j b C2k Wi,j to a threshold. This sum has approximately the N(K2µ, K2) distribution if i C 1 and N(0, K2) distribution otherwise ; the discrepancy can be made suﬃciently small by choosing ϵ to be small (See (Hajek et al., 2017, Theorem 3) for a proof). Thus, the mean number of classiﬁcation errors is well approximated by n1pe(K1/n1, K2µ2), which converges to zero under (39), completing the suﬃciency proof for exact recovery of C 1. The necessity part is proved in (Butucea et al., 2015, Section 4.2).

E. Proof of Lemma 18

Proof (Similar to proof of Lemma 12.) We prove the lemma for b C1; the proof for b C2 is identical. For the ﬁrst part of the proof we assume that for i = 1, 2, e Ci is ﬁxed, and later use a union bound over all possible choices of e Ci. Recall that W e C1 e C2, which we abbreviate

henceforth as f W, is the matrix W restricted to entries in e C1 e C2. Let Z = f W E[f W] and

E[f W] = µ q

| e C1 C 1|| e C2 C 2|v1v 2 (93)

is a rank-one matrix, where vi is the unit vector in R| e Ci| obtained by normalizing the indicator vector of e Ci C i . Thus, thanks to (54), the leading singular value of E[f W] is at least µ K1K2(1 ϵ) with left singular vector v1 and right singular vector v2. It is well-known (see, e.g., (Vershynin, 2010, Corollary 5.35)) that if M is an m1 m2 matrix with i.i.d. standard normal entries, then P M m1 + m2 + t 2e t2/2. Applying this result for mi = | e Ci|, which satisﬁes mi ϵni by (55), and t = 2 p

h(ϵ)(n1 + n2), we have for ﬁxed ( e C1, e C2),

P { Z ( n1 + n2)β} 2e 2(n1+n2)h(ϵ),

where β 3 p

ϵ + h(ϵ)). Similar to the proof of Lemma 12, the number of ( e C1, e C2) that satisﬁes (55) is at most e(n1+n2)h(ϵ). By union bound, if we drop the assumption that e Ci is ﬁxed for i = 1, 2, we still have that with high probability, Z ( n1 + n2)β.

Submatrix localization via message passing

Denote by u the leading left singular vector of W e C1 e C2 such that u, v1 0. Then, letting θ denote the angle between u and v1,

2 sin(θ) (a)

Z σ1 f W σ2 E h f W i

σ1(E[f W]) ,

where (a) follows from Wedin s sin-θ theorem for SVD (Wedin, 1972), and (b) follows from σ2 E h f W i = 0 and Weyl s inequality σ1(f W) σ1(E[f W]) Z . In view of (93), conditioning on the high-probability event that Z ( n1 + n2)β, we have

2β( n1 + n2) µ(1 ϵ) K1K2 2

2c0β 1 ϵ , (94)

where the last inequality follows from the standing assumption (53). Next, we argue that bu or bu is close to u, and hence, close to v1 by the triangle inequality. By (88), the initial value u0 R| e C1| satisﬁes | u, u0 | (2 n1 log n1) 1 with high probability, and without loss of generality we can assume as in the proof of Lemma 12 that u, u0 (2 n1 log n1) 1. By Weyl s inequality, the largest singular value of f W is at least µ K1K2(1 ϵ) ( n1+ n2)β, and the other singular values are at most ( n1+ n2)β. In view of (53), 1 ϵ

c0β 1 > 1 for all ϵ < ϵ0, where ϵ0 > 0 depends only on c0. Let λ1 and

λ2 denote the ﬁrst and second eigenvalue of f W f W in absolute value, respectively. Let r = λ2/λ1. Then r ( c0β 1 ϵ c0β)2. Since for even t, ut = (f W f W )t/2u0/ (f W f W )t/2u0 , the same analysis of power iteration that leads to (89) yields

ut u 2 = 2(1 ut, u ) 4rt/2

u, u0 + rt/2 .

Since bu = u2 s log n and s = (log 1 ϵ c0β

c0β ) 1, we have r s log n1 n 2 1 and thus | bu, u 2|

1 n 1 1 and consequently, uu bu(bu) 2 F = 2 2 u, bu 2 n 1 1 . Similar to (90), applying (94) and the triangle inequality, we obtain

bu v bu u + u v n 1/2 + 2

λ(1 ϵ) < 3coβ

λ(1 ϵ) βo, (95)

By the same argument that proves (91), we have lim supn |C 1 b C1|/K1 2ϵ+8β2 0 η(ϵ) with η deﬁned in (57), completing the proof.

E. Abbe and C. Sandon. Community detection in general stochastic block models: fundamental limits and eﬃcient recovery algorithms. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 670 688, 2015. ar Xiv 1503.00609.

Hajek, Wu and Xu

E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471 487, 2016.

N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. Random Structures and Algorithms, 13(3-4):457 466, 1998.

A. Amini, A. Chen, P. J. Bickel, and E. Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097 2122, 2013.

R. B. Ash. Information Theory. Dover Publications Inc., New York, NY, 1965.

M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2): 764 785, 2011.

C. Butucea and Y. I. Ingster. Detection of a sparse submatrix of a high-dimensional noisy matrix. Bernoulli, 19(5B):2652 2688, 11 2013. doi: 10.3150/12-BEJ470.

C. Butucea, Y. Ingster, and I. Suslina. Sharp variable selection of a sparse submatrix in a high-dimensional noisy matrix. ESAIM: Probability and Statistics, 19:115 134, June 2015.

T. T. Cai, T. Liang, A. Rakhlin, et al. Computational and statistical boundaries for submatrix localization in a large noisy matrix. The Annals of Statistics, 45(4):1403 1430, 2017.

Y. Chen and J. Xu. Statistical-computational tradeoﬀs in planted problems and submatrix localization with a growing number of clusters and submatrices. In Proceedings of ICML 2014 (Also ar Xiv:1402.1267), Feb 2014.

K. Chung. A course in probability theory. Academic press, 2nd edition, 2001.

A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. Random Struct. Algorithms, 18(2):116 140, Mar. 2001.

C. Davis and W. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1 46, 1970.

Y. Deshpande and A. Montanari. Finding hidden cliques of size p

N/e in nearly linear time. Foundations of Computational Mathematics, 15(4):1069 1128, August 2015.

C.-G. Esseen. on the liapunoﬀlimit of error in the theory of probability . Arkiv f or matematik, astronomi och fysik, A28:1 19, 1942.

D. F eral and S. P ech e. The largest eigenvalue of rank one deformation of large Wigner matrices. Communications in mathematical physics, 272(1):185 228, 2007.

B. Hajek, Y. Wu, and J. Xu. Semideﬁnite programs for exact recovery of a hidden community. In Proceedings of Conference on Learning Theory (COLT), pages 1051 1095, New York, NY, Jun 2016. ar Xiv:1602.06410.

Submatrix localization via message passing

B. Hajek, Y. Wu, and J. Xu. Information limits for recovering a hidden community. IEEE Trans. on Information Theory, 63(8):4729 4745, 2017.

J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123 129, 1972.

W. Hoeﬀding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963.

A. Javanmard and A. Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115 144, 2013.

A. Knowles and J. Yin. The isotropic semicircle law and deformation of Wigner matrices. Communications on Pure and Applied Mathematics, 66(11):1663 1749, 2013.

M. Kolar, S. Balakrishnan, A. Rinaldo, and A. Singh. Minimax localization of structural information in large noisy matrices. In Advances in Neural Information Processing Systems, 2011.

Z. Ma and Y. Wu. Computational barriers in minimax submatrix detection. The Annals of Statistics, 43(3):1089 1116, 2015.

A. Montanari, D. Reichman, and O. Zeitouni. On the limitation of spectral methods: From the Gaussian hidden clique problem to rank one perturbations of Gaussian tensors. In Advances in Neural Information Processing Systems, pages 217 225, 2015. ar Xiv 1411.6149.

E. Mossel, J. Neeman, and S. Sly. Belief propagation, robust reconstruction, and optimal recovery of block models (extended abstract). In JMLR Workshop and Conference Proceedings (COLT proceedings), volume 35, pages 1 35, 2014.

E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for the planted bisection model. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 15, pages 69 75, New York, NY, USA, 2015. ACM.

V. V. Petrov. Limit theorems of probability theory: Sequences of independent random variables. Oxford Science Publications, Clarendon Press, Oxford, United Kingdom, 1995.

A. A. Shabalin, V. J. Weigman, C. M. Perou, and A. B. Nobel. Finding large average submatrices in high dimensional data. The Annals of Applied Statistics, 3(3):985 1012, 2009.

G. Szeg o. Orthogonal polynomials. American Mathematical Society, Providence, RI, 4th edition, 1975.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. Arxiv preprint arxiv:1011.3027, 2010.

P. Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99 111, 1972.

Hajek, Wu and Xu

S. Yun and A. Proutiere. Optimal cluster recovery in the labeled stochastic block model. ar Xiv 1510.05956, Oct. 2015.