# On the Statistical Limits of Convex Relaxations: A Case Study

Zhaoran Wang (ZHAORAN@PRINCETON.EDU), Princeton University, NJ 08544 USA
Quanquan Gu (QG5W@VIRGINIA.EDU), University of Virginia, VA 22904 USA
Han Liu (HANLIU@PRINCETON.EDU), Princeton University, NJ 08544 USA

Many high dimensional sparse learning problems are formulated as nonconvex optimization. A popular approach to solve these nonconvex optimization problems is through convex relaxations such as linear and semidefinite programming. In this paper, we study the statistical limits of convex relaxations. In particular, we consider two problems: mean estimation for a sparse principal submatrix and edge probability estimation for a stochastic block model. We exploit the sum-of-squares relaxation hierarchy to sharply characterize the limits of a broad class of convex relaxations. Our result shows that statistical optimality needs to be compromised for achieving computational tractability using convex relaxations. Compared with existing results on computational lower bounds for statistical problems, which consider general polynomial-time algorithms and rely on computational hardness hypotheses on problems like planted clique detection, our theory focuses on a broad class of convex relaxations and does not rely on unproven hypotheses.

## 1. Introduction

A broad variety of high dimensional statistical problems are formulated as nonconvex optimization. For example, sparse estimation can be formulated as optimization under $\ell_0$-norm constraints, where the $\ell_0$-norm is a pseudo-norm defined as the number of nonzero elements in a vector. To solve these nonconvex optimization problems, a popular approach is to resort to convex relaxations. In particular, for sparse estimation, significant progress has been made by using the $\ell_1$-norm as a convex relaxation of the nonconvex $\ell_0$-norm (see, e.g., (Bühlmann & van de Geer, 2011; Chandrasekaran et al., 2012) and the references therein).

In this paper, we study the statistical limits of convex relaxations. In particular, we focus on the sum-of-squares (SoS) hierarchy of convex relaxations (Lasserre, 2001; Parrilo, 2000; 2003), which consists of a sequence of increasingly tighter convex relaxations based on semidefinite programming. We study the SoS hierarchy because it attains tighter approximations than other hierarchies, such as those proposed by (Sherali & Adams, 1990) and (Lovász & Schrijver, 1991) as well as their extensions (see (Laurent, 2003) for a comparison). Hence, the estimators in the SoS hierarchy achieve superior statistical performance to the estimators within other, weaker hierarchies, which implies that the statistical limits of the SoS hierarchy are also the limits of the weaker hierarchies.

To demonstrate the statistical limits of convex relaxations, we focus on the examples of sparse principal submatrix estimation and stochastic block model estimation. In detail, for sparse principal submatrix estimation, we assume there is an $s^* \times s^*$ submatrix with elevated mean $\beta^*$ on the diagonal of a $d \times d$ noisy symmetric matrix. For stochastic block model estimation, we assume there exists a dense subgraph with $s^*$ nodes planted in an Erdős-Rényi graph with $d$ nodes. We denote by $\beta^*$ the edge probability of the subgraph.
For both examples, our goal is to estimate $\beta^*$ under a challenging regime where $s^* = o\big((d/\sqrt{\log d})^{2/3}\big)$ and $\log d = o(s^*)$. We prove the following information-theoretic lower bound:

$$\sup_{P \in \mathcal{P}(s^*, d)} \mathbb{E}\big|\hat{\beta} - \beta^*\big| \geq C \sqrt{1/s^* \cdot \log(d/s^*)}, \tag{1.1}$$

where $\hat{\beta}$ denotes any estimator, $\mathcal{P}(s^*, d)$ is the distribution family to be specified later, and $C$ is an absolute constant. We prove that a computationally intractable estimator $\hat{\beta}_{\mathrm{scan}}$ (to be specified later) attains the lower bound in (1.1). In order to achieve computational tractability, we consider convex relaxations of $\hat{\beta}_{\mathrm{scan}}$ that fall within the SoS and weaker hierarchies, which we denote by $\mathcal{H}$. Let $C_0$ be a positive absolute constant. We prove that, under certain conditions, for every $\hat{\beta} \in \mathcal{H}$,

$$\sup_{P \in \mathcal{P}(s^*, d)} \mathbb{E}\big|\hat{\beta} - \beta^*\big| \geq C_0. \tag{1.2}$$

Together with (1.1), (1.2) illustrates the statistical limitations of a broad class of convex relaxations. Ignoring logarithmic factors, (1.1) and (1.2) suggest there exists a gap of $\sqrt{s^*}$ between the limits for arbitrary estimators and the limits for estimators within the hierarchies of convex relaxations. Hence, this result shows that statistical optimality must be sacrificed for gaining computational tractability with convex relaxations. For sparse principal submatrix estimation, we prove that a linear-time estimator within $\mathcal{H}$ attains the lower bound in (1.2) up to a logarithmic factor, and is therefore nearly optimal within a general family of convex relaxations.

Our work is closely related to a recent line of research on computational barriers for statistical problems (Arias-Castro & Verzelen, 2014; Berthet & Rigollet, 2013a;b; Cai et al., 2015; Chen & Xu, 2014; Gao et al., 2014; Hajek et al., 2014; Krauthgamer et al., 2013; Ma & Wu, 2013; Wang et al., 2014; Zhang et al., 2014). Under various computational hardness hypotheses on problems like planted clique detection, these works quantify the gap between the information-theoretic limits and the statistical accuracy achievable by polynomial-time algorithms. For this purpose, their proofs are based on polynomial-time reductions from hard computational problems to statistical problems. In contrast with these works, we focus on the statistical limits of a broad class of convex relaxations rather than of all polynomial-time algorithms. Correspondingly, our theory does not hinge on unproven computational hardness hypotheses, and our proof is based on constructions rather than reductions.

From another perspective, (Chandrasekaran & Jordan, 2013) study the tradeoffs between computational complexity and statistical performance for normal mean estimation via hierarchies of convex relaxations. Their results are based on hierarchies of convex constraints, which are obtained by successively weakening the cone representation of the original constraint set. In comparison, our results are based on hierarchies of convex relaxations of the optimization problem itself rather than of its constraints; these relaxations are obtained by successively tightening a basic semidefinite relaxation using variable augmentation techniques.

In addition, our work is connected to previous works on the SoS and other convex relaxation hierarchies (see, e.g., (Barak & Moitra, 2015; Barak & Steurer, 2014; Chlamtac & Tulsiani, 2012; Ma & Wigderson, 2015; Meka et al., 2015) and the references therein). In particular, the key construction of feasible solutions in our proof is based on the dual certificates designed for the maximum clique problem by (Meka et al., 2015). The rest of this paper is organized as follows.
In §2 we introduce the statistical models. In §3 we present the SoS hierarchy of convex relaxations and apply it to estimate the models in §2. In §4 we establish the main results, and we lay out the proofs in §5. In §6 we conclude the paper.

## 2. Statistical Model

In the sequel, we briefly introduce the statistical models considered in this paper. Then we present several common estimators for them.

### 2.1. Sparse Principal Submatrix Estimation

Let $X \in \mathbb{R}^{d \times d}$ be a random matrix from distribution $P$ with $\mathbb{E}(X) = \Theta$. We assume there exists an index set $S^* \subseteq [d]$ with $|S^*| = s^*$ that satisfies $\Theta_{i,j} = \beta^*$ for $i \neq j$ and $(i, j) \in S^* \times S^*$, while $\Theta_{i,j} = 0$ for $i \neq j$ and $(i, j) \notin S^* \times S^*$. Here $\beta^* \geq 0$ is the signal strength. For all $i < j$, we assume the $X_{i,j}$'s are independent sub-Gaussian random variables with $\mathbb{E}(X_{i,j}) = \Theta_{i,j}$ and $\|X_{i,j} - \Theta_{i,j}\|_{\psi_2} \leq 1$. In addition, we assume $X_{i,i} = 0$ and $X_{i,j} = X_{j,i}$. We aim to estimate the signal strength $\beta^*$. For simplicity, hereafter we assume $s^*$ is known. We denote by $\mathcal{P}(s^*, d)$ the family of distributions $P$ satisfying the above constraints.

This estimation problem is closely related to the problems considered by (Butucea & Ingster, 2013; Butucea et al., 2013; Cai et al., 2015; Kolar et al., 2011; Ma & Wu, 2013; Shabalin et al., 2009; Sun & Nobel, 2013). These works consider the detection problem and the recovery of $S^*$, while we consider the estimation of the signal strength. Besides, we focus on symmetric $X$ for simplicity.

We consider the following estimator of $\beta^*$, proposed by (Butucea & Ingster, 2013):

$$\hat{\beta}_{\mathrm{scan}} = \frac{1}{s^*(s^*-1)} \sup_{S \subseteq [d],\, |S| = s^*} \; \sum_{i,j \in S} X_{i,j}, \tag{2.1}$$

where $|S|$ is the cardinality of the set $S$. The intuition behind $\hat{\beta}_{\mathrm{scan}}$ is to exhaustively search all principal submatrices of size $s^* \times s^*$ and compute the average of all entries within each principal submatrix. In §4 we will prove that $\hat{\beta}_{\mathrm{scan}}$ attains the information-theoretic lower bound for estimating $\beta^*$ within $\mathcal{P}(s^*, d)$ under a challenging regime where $s^* = o\big((d/\sqrt{\log d})^{2/3}\big)$. Nevertheless, it is computationally intractable to compute $\hat{\beta}_{\mathrm{scan}}$. In §3 we will introduce convex relaxations of $\hat{\beta}_{\mathrm{scan}}$. We also consider the following computationally tractable estimators for further discussion in §4:

$$\hat{\beta}_{\mathrm{avg}} = \frac{1}{s^*(s^*-1)} \sum_{i \neq j} X_{i,j}, \qquad \hat{\beta}_{\max} = \max_{i,j \in [d]} X_{i,j}. \tag{2.2}$$

### 2.2. Stochastic Block Model

We consider the estimation of the edge probability in a dense subgraph with $s^*$ nodes planted within an Erdős-Rényi graph with $d$ nodes. If a pair of nodes are both within the subgraph, they are independently connected with edge probability $\beta^* \in [0, 1]$. Otherwise, they are independently connected with edge probability $\tilde{\beta} \in [0, \beta^*]$. We denote by $\mathcal{P}(s^*, d)$ the distribution family of graphs satisfying the above constraints and by $A \in \mathbb{R}^{d \times d}$ the adjacency matrix. We assume $A_{i,i} = 0$ for all $i \in [d]$ and that $s^*$ is known. Similar to principal submatrix estimation, we focus on the challenging regime with $s^* = o\big((d/\sqrt{\log d})^{2/3}\big)$. Additionally, we assume $\log(d/s^*) = o(s^*)$, so that $s^*$ is not too small.

This estimation problem is connected to the ones studied by (Arias-Castro & Verzelen, 2014; Bhaskara et al., 2010; Chen & Xu, 2014; Coja-Oghlan, 2010; Decelle et al., 2011; Fortunato, 2010; Hajek et al., 2014; Kučera, 1995; Massoulié, 2014; Meka et al., 2015; Mossel et al., 2012; 2013; Verzelen & Arias-Castro, 2013). However, we mainly focus on estimating the edge probability of the dense subgraph rather than on detection or recovery of the subgraph. Also, we assume that the dense subgraph and its size are fixed, rather than random as in some of the existing works.
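To make the estimators concrete, the following is a brute-force sketch in Python, assuming a symmetric numpy matrix `X` with zero diagonal; this is our own illustration, not code from the paper:

```python
import itertools
import numpy as np

def beta_scan(X, s):
    """Scan estimator (2.1): exhaustively search all principal submatrices
    of size s and average the off-diagonal entries of the best one.
    Examines all C(d, s) subsets, which is what makes it intractable."""
    d = X.shape[0]
    best = -np.inf
    for S in itertools.combinations(range(d), s):
        sub = X[np.ix_(S, S)]
        best = max(best, (sub.sum() - np.trace(sub)) / (s * (s - 1)))
    return best

def beta_avg(X, s):
    """Average estimator in (2.2): sum of all off-diagonal entries,
    normalized so that it is unbiased for beta*."""
    return (X.sum() - np.trace(X)) / (s * (s - 1))

def beta_max(X):
    """Max estimator in (2.2): the largest entry; linear time in the input."""
    return X.max()
```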
To estimate $\beta^*$ in the stochastic block model, we use $\hat{\beta}_{\mathrm{scan}}$ and $\hat{\beta}_{\max}$ as defined in (2.1) and (2.2), with $X_{i,j}$ replaced by $A_{i,j}$. Though stochastic block model estimation is closely related to sparse principal submatrix estimation, in §4 we will illustrate that the respective upper and lower bounds have subtle differences, owing to the different deviation behaviors of Bernoulli random variables and of general sub-Gaussian random variables, which may have unbounded support.

## 3. Convex Relaxation Hierarchy

In this section, we first introduce some specific notation that greatly simplifies our presentation. Then we introduce the SoS hierarchy for $\hat{\beta}_{\mathrm{scan}}$ defined in (2.1).

**Notation:** We define a collection $C$ to be an unordered array of elements, where each element can appear more than once. For instance, $\{1\}$, $\{1, 2\}$, and $\{1, 1\}$ are all collections. The summation of two collections is the combination of all elements in them, e.g., for $C_1 = \{1, 2\}$ and $C_2 = \{1, 3\}$ we have $C_1 + C_2 = \{1, 1, 2, 3\}$. Note that a collection is different from a set, because a set has distinct elements. The merge operation $M(\cdot)$ on a collection eliminates the duplicate elements and outputs a set, e.g., for $C = \{1, 1, 2, 2, 3\}$ we have $M(C) = \{1, 2, 3\}$, which is a set. We use $|C|$ and $|S|$ to denote the cardinality of a collection and a set, respectively. Also, we write $C_1 = C_2$ if the two collections contain the same elements. For an integer $\ell \geq 0$, we define $d^{(\ell)} = \sum_{i=0}^{\ell} d^i$ for notational simplicity.

Note that $\hat{\beta}_{\mathrm{scan}}$ in (2.1) can be reformulated as

$$\hat{\beta}_{\mathrm{scan}} = \max_{v \in \mathcal{V}_{s^*}} \; \frac{v^\top X v}{s^*(s^*-1)}, \qquad \mathcal{V}_{s^*} = \Big\{ v : v \in \{0, 1\}^d, \; \textstyle\sum_{i=1}^d v_i = s^* \Big\}. \tag{3.1}$$

Because (3.1) involves maximizing a convex function subject to nonconvex constraints, it is computationally intractable to solve. Note that $v^\top X v = \mathrm{tr}(X v v^\top)$ in (3.1). We can therefore reparametrize $vv^\top$ as a $d \times d$ positive semidefinite matrix with rank one. For notational simplicity, we define

$$Y = \begin{pmatrix} 0 & \mathbf{0}_{1 \times d} \\ \mathbf{0}_{d \times 1} & X \end{pmatrix}, \qquad v_0 = 1, \qquad \Pi = \begin{pmatrix} \Pi_{0,0} & \Pi_{0,1} & \cdots & \Pi_{0,d} \\ \Pi_{1,0} & \Pi_{1,1} & \cdots & \Pi_{1,d} \\ \vdots & \vdots & \ddots & \vdots \\ \Pi_{d,0} & \Pi_{d,1} & \cdots & \Pi_{d,d} \end{pmatrix}. \tag{3.2}$$

Here $Y, \Pi \in \mathbb{R}^{(d+1) \times (d+1)}$ and $\mathbf{0}_{d_1 \times d_2}$ denotes a $d_1 \times d_2$ matrix whose entries are all zero. Meanwhile, $\mathcal{V}_{s^*}$ defined in (3.1) can be reformulated as

$$\mathcal{V}_{s^*} = \Big\{ v : \textstyle\sum_{i=1}^d v_i = s^*, \;\; v_i^2 - v_i = 0 \text{ for all } i \in [d] \Big\}. \tag{3.3}$$

According to the reparametrization in (3.2), it holds that $\Pi_{i,j} = v_i v_j$ for all $i, j \in \{0, \ldots, d\}$. Hence, from (3.1) we obtain the following semidefinite program:

$$\max_{\Pi} \; \frac{\mathrm{tr}(Y\Pi)}{s^*(s^*-1)}, \quad \text{subject to} \quad \textstyle\sum_{i=1}^d \Pi_{i,0} = s^*, \;\; \Pi_{0,0} = 1, \;\; \Pi \succeq 0, \;\; \Pi_{i,j} = \Pi_{j,i}, \;\; \Pi_{i,i} = \Pi_{i,0}, \tag{3.4}$$

for all $i, j \in \{0, 1, \ldots, d\}$, in which $\sum_{i=1}^d \Pi_{i,0} = s^*$ corresponds to $\sum_{i=1}^d v_i = s^*$, $\Pi_{i,j} = \Pi_{j,i}$ corresponds to $v_i v_j = v_j v_i$, and $\Pi_{i,i} = \Pi_{i,0}$ corresponds to $v_i^2 - v_i = 0$. Note that if $\mathrm{rank}(\Pi) = 1$, then by our reparametrization in (3.2) the maximum of (3.4) equals the maximum of (3.1). However, we drop this rank constraint since it is nonconvex, and hence (3.4) is a convex relaxation of (3.1).
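For illustration, the basic relaxation (3.4) is a standard semidefinite program and can be prototyped with an off-the-shelf modeling tool. The sketch below assumes cvxpy (with its bundled SDP solver); it is our own minimal rendering of (3.4), not the authors' code:

```python
import cvxpy as cp
import numpy as np

def beta_sdp_level1(X, s):
    """Minimal sketch of the basic semidefinite relaxation (3.4).
    Pi is the (d+1) x (d+1) surrogate for (1, v)(1, v)^T; PSD=True
    already enforces symmetry, i.e., Pi_{i,j} = Pi_{j,i}."""
    d = X.shape[0]
    Y = np.zeros((d + 1, d + 1))
    Y[1:, 1:] = X                      # Y from (3.2)
    Pi = cp.Variable((d + 1, d + 1), PSD=True)
    constraints = [
        Pi[0, 0] == 1,                 # Pi_{0,0} = v_0^2 = 1
        cp.sum(Pi[1:, 0]) == s,        # sum_i Pi_{i,0} = s*
        cp.diag(Pi)[1:] == Pi[1:, 0],  # Pi_{i,i} = Pi_{i,0}, i.e., v_i^2 = v_i
    ]
    prob = cp.Problem(cp.Maximize(cp.trace(Y @ Pi) / (s * (s - 1))), constraints)
    prob.solve()
    return prob.value
```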
The SoS hierarchy is obtained by increasingly tightening the basic semidefinite program in (3.4) using variable augmentation techniques. In particular, the reparametrization in (3.4) only involves the second-order interactions between $v_i$ and $v_j$. For an integer $\ell \geq 1$, we consider a $d^{(\ell)} \times d^{(\ell)}$ matrix $\Pi^{(\ell)}$, where $d^{(\ell)} = \sum_{i=0}^{\ell} d^i$ in our notation. For notational simplicity, we index the entries of $\Pi^{(\ell)}$ by collections $C_1$ and $C_2$ with $|C_1|, |C_2| \leq \ell$, whose elements are indices in $\{1, \ldots, d\}$. Our reparametrization takes the form

$$\Pi^{(\ell)}_{C_1, C_2} = \prod_{i \in C_1} v_i \cdot \prod_{j \in C_2} v_j. \tag{3.5}$$

In particular, for $C = \emptyset$ we define $\prod_{i \in C} v_i = 1$.

The $\ell$-th level SoS relaxation of (3.1) takes the form

$$\max_{\Pi^{(\ell)}} \; \frac{\mathrm{tr}\big(Y^{(\ell)} \Pi^{(\ell)}\big)}{s^*(s^*-1)}, \quad \text{subject to} \tag{3.6}$$

$$\sum_{i=1}^d \Pi^{(\ell)}_{\{i\}+C_1,\, C_2} = s^* \, \Pi^{(\ell)}_{C_1, C_2} \quad \text{for all } |C_1| \leq \ell - 1,\; |C_2| \leq \ell,$$

$$\Pi^{(\ell)}_{\{i,i\}+C_1,\, C_2} = \Pi^{(\ell)}_{\{i\}+C_1,\, C_2} \quad \text{for all } i \in [d],\; |C_1| \leq \ell - 2,\; |C_2| \leq \ell,$$

$$\Pi^{(\ell)}_{C_1, C_2} = \Pi^{(\ell)}_{C_1', C_2'} \quad \text{for all } C_1 + C_2 = C_1' + C_2',\; |C_1|, |C_2|, |C_1'|, |C_2'| \leq \ell,$$

$$\Pi^{(\ell)}_{\emptyset, \emptyset} = 1, \qquad \Pi^{(\ell)} \succeq 0,$$

where $Y^{(\ell)} \in \mathbb{R}^{d^{(\ell)} \times d^{(\ell)}}$ is defined as

$$Y^{(\ell)} = \begin{pmatrix} 0 & \mathbf{0}_{1 \times d} & \cdots & \mathbf{0} \\ \mathbf{0}_{d \times 1} & X & \cdots & \mathbf{0}_{d \times d} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0}_{d \times d} & \cdots & \mathbf{0} \end{pmatrix},$$

i.e., $X$ sits in the block indexed by the singleton collections and all other blocks are zero. The first constraint in (3.6) corresponds to the reparametrization in (3.5) together with $\sum_{i=1}^d v_i \prod_{j \in C} v_j = s^* \prod_{j \in C} v_j$ for all $|C| \leq 2\ell - 1$, which is equivalent to $\sum_{i=1}^d v_i = s^*$ in (3.3). The second constraint corresponds to (3.5) together with $v_i^2 \prod_{j \in C} v_j = v_i \prod_{j \in C} v_j$ for all $|C| \leq 2\ell - 2$, which is equivalent to $v_i^2 - v_i = 0$ in (3.3). The third constraint corresponds to (3.5) and the fact that, by (3.5), $\Pi^{(\ell)}_{C_1,C_2}$ depends on $C_1$ and $C_2$ only through $C_1 + C_2$. The last constraint, $\Pi^{(\ell)}_{\emptyset,\emptyset} = 1$, follows from (3.5) and our definition that $\prod_{i \in C} v_i = 1$ for $C = \emptyset$. For $\ell = 1$, (3.6) reduces to the basic semidefinite relaxation in (3.4).

We denote by $\hat{\beta}^{(\ell)}_{\mathrm{SoS}}$ the maximum of (3.6). We have $\hat{\beta}_{\mathrm{scan}} \leq \hat{\beta}^{(\ell+1)}_{\mathrm{SoS}} \leq \hat{\beta}^{(\ell)}_{\mathrm{SoS}}$, since we have more constraints in (3.6) for a larger $\ell$. Thus, for a larger $\ell$, (3.6) gives a tighter convex relaxation of (3.1). Meanwhile, note that the semidefinite program in (3.6) can be solved in $d^{O(\ell)}$ operations. Hereafter we focus on the settings where $\ell$ does not increase with $d$.

(Laurent, 2003) proves that other existing convex relaxation hierarchies, such as the Sherali-Adams and Lovász-Schrijver hierarchies as well as their extensions, are weaker than the SoS hierarchy, in the sense that $\hat{\beta}_{\mathrm{scan}} \leq \hat{\beta}^{(\ell)}_{\mathrm{SoS}} \leq \hat{\beta}^{(\ell)}_{\mathrm{other}}$, where $\hat{\beta}^{(\ell)}_{\mathrm{other}}$ denotes the $\ell$-th level of the other, weaker hierarchies. Note that relaxing constraints and objectives in the convex relaxations also leads to looser approximations of $\hat{\beta}_{\mathrm{scan}}$. Hence, we denote by $\mathcal{H}^{(\ell)}$ the class of estimators $\hat{\beta}$ that fall in the $\ell$-th level of the SoS and weaker hierarchies, as well as their weakened versions obtained by relaxing constraints and objectives. By this definition, we have $\mathcal{H}^{(1)} \subseteq \mathcal{H}^{(2)} \subseteq \cdots$; for example, for $\ell > 1$ we can drop constraints in (3.6) to obtain (3.4), which corresponds to $\ell = 1$. In particular, from (3.1) we have

$$\hat{\beta}_{\mathrm{scan}} = \max_{v \in \mathcal{V}_{s^*}} \frac{v^\top X v}{s^*(s^*-1)} \leq \max_{u, v \in \mathcal{V}_{s^*}} \frac{u^\top X v}{s^*(s^*-1)} \leq \max_{u, v \in \widetilde{\mathcal{V}}_{s^*}} \frac{u^\top X v}{s^*(s^*-1)} \leq \max_{\mu \in \mathcal{M}_{s^*}} \frac{\mathrm{tr}(X\mu)}{s^*(s^*-1)}, \tag{3.7}$$

where

$$\widetilde{\mathcal{V}}_{s^*} = \Big\{ v : \textstyle\sum_{i=1}^d v_i = s^*, \; v_i \geq 0 \text{ for all } i \in [d] \Big\}, \qquad \mathcal{M}_{s^*} = \Big\{ \mu : \textstyle\sum_{i,j} \mu_{i,j} = (s^*)^2, \; \mu_{i,j} \geq 0 \text{ for all } i, j \in [d] \Big\}.$$

Here $\widetilde{\mathcal{V}}_{s^*}$ is a linear programming relaxation of $\mathcal{V}_{s^*}$. Note that the right-hand side of (3.7) equals $s^*/(s^*-1) \cdot \hat{\beta}_{\max}$, where $\hat{\beta}_{\max}$ is defined in (2.2), since the linear objective $\mathrm{tr}(X\mu) = \sum_{i,j} X_{i,j}\mu_{i,j}$ is maximized over $\mathcal{M}_{s^*}$ by placing the entire mass $(s^*)^2$ on a largest entry of $X$. Therefore, $s^*/(s^*-1) \cdot \hat{\beta}_{\max}$ can be viewed as a linear programming relaxation of $\hat{\beta}_{\mathrm{scan}}$, which falls in $\mathcal{H}^{(1)}$ (see, e.g., §2 of (Chlamtac & Tulsiani, 2012) for details). It is also worth noting that the SoS hierarchy has several equivalent formulations; see, e.g., Theorem 2.7 of (Barak & Steurer, 2014) for a proof of such an equivalence.
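The collection notation above, which also drives the proof of Theorem 3, is easy to mechanize. A minimal sketch, representing collections as sorted tuples (purely illustrative):

```python
def merge(C):
    """M(C): eliminate duplicates in a collection, yielding a set."""
    return frozenset(C)

def add(C1, C2):
    """C1 + C2: combine all elements of two collections, keeping duplicates."""
    return tuple(sorted(C1 + C2))

# Example from the text: C1 = {1, 2}, C2 = {1, 3}.
C1, C2 = (1, 2), (1, 3)
assert add(C1, C2) == (1, 1, 2, 3)                  # C1 + C2 = {1, 1, 2, 3}
assert merge(add(C1, C2)) == frozenset({1, 2, 3})   # M(C1 + C2) = {1, 2, 3}
```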
## 4. Main Result

As defined in §3, $\mathcal{H}^{(\ell)}$ denotes the $\ell$-th level of the convex relaxation hierarchy for $\hat{\beta}_{\mathrm{scan}}$ defined in (2.1). For the stochastic block model, we replace $X$ in (2.1) with the adjacency matrix $A$.

### 4.1. Sparse Principal Submatrix Estimation

In the following, we present the main theoretical results for estimating the signal strength of a sparse principal submatrix. We first establish the information-theoretic lower bound for estimating $\beta^*$ within the distribution family $\mathcal{P}(s^*, d)$ defined in §2.1.

**Theorem 1.** For all estimators $\hat{\beta}$ constructed from $X \sim P \in \mathcal{P}(s^*, d)$ with $s^* = o\big((d/\sqrt{\log d})^{2/3}\big)$, there exists an absolute constant $C > 0$ such that

$$\sup_{P \in \mathcal{P}(s^*, d)} \mathbb{E}\big|\hat{\beta} - \beta^*\big| \geq C \sqrt{1/s^* \cdot \log(d/s^*)}.$$

*Proof.* See §A for a detailed proof.

In Theorem 1 we consider a challenging regime. More specifically, a straightforward calculation shows that $\hat{\beta}_{\mathrm{avg}}$ defined in (2.2) achieves the $d/(s^*)^2$ rate of convergence. For $s^* = o\big((d/\sqrt{\log d})^{2/3}\big)$, we have $\sqrt{1/s^* \cdot \log(d/s^*)} = o\big(d/(s^*)^2\big)$. Thus, there exists a gap between the rate attained by $\hat{\beta}_{\mathrm{avg}}$ and the information-theoretic lower bound. We will show that there is also such a gap for $\hat{\beta}_{\max}$. The next proposition shows that $\hat{\beta}_{\mathrm{scan}}$ in (2.1) attains the information-theoretic lower bound in Theorem 1.

**Proposition 2.** For $\hat{\beta}_{\mathrm{scan}}$ defined in (2.1) with $X_{i,j}$ being the $(i,j)$-th entry of $X \sim P \in \mathcal{P}(s^*, d)$, we have that

$$\big|\hat{\beta}_{\mathrm{scan}} - \beta^*\big| \leq C \sqrt{1/s^* \cdot \log(d/s^*)}$$

holds with probability at least $1 - 1/d$ for some absolute constant $C > 0$.

*Proof.* See §A for a detailed proof.

Theorem 1 and Proposition 2 show that $\hat{\beta}_{\mathrm{scan}}$ is statistically optimal under the regime where $s^* = o\big((d/\sqrt{\log d})^{2/3}\big)$. However, it is computationally intractable to compute $\hat{\beta}_{\mathrm{scan}}$. Thus, we consider the family of convex relaxations of $\hat{\beta}_{\mathrm{scan}}$ within the $\ell$-th level SoS and weaker hierarchies, as well as their further relaxations, which is denoted by $\mathcal{H}^{(\ell)}$. In the sequel, we establish a minimax lower bound for the statistical performance of all estimators within $\mathcal{H}^{(\ell)}$. Recall that $\mathcal{P}(s^*, d)$ is the distribution family defined in §2.1.

**Theorem 3.** We assume that $s^* = o\big((d/(\log d)^2)^{1/(2\ell)}\big)$. There is an absolute constant $C > 0$ such that

$$\inf_{\hat{\beta} \in \mathcal{H}^{(\ell)}} \sup_{P \in \mathcal{P}(s^*, d)} \mathbb{E}\big|\hat{\beta} - \beta^*\big| \geq C.$$

*Proof.* See §A for a detailed proof.

Note that the regime considered in Theorem 3 is within the challenging regime considered in Theorem 1. Under this regime, Theorem 3 proves that any estimator within the convex relaxation hierarchy fails to attain a statistical rate that decreases as $s^*$ increases. A comparison between Theorems 1 and 3 illustrates that there exists a gap of $\sqrt{s^*}$ (ignoring the $\log d$ factor) between the information-theoretic lower bound and the statistical rate achievable by a broad class of convex relaxations. In other words, to achieve computational tractability via convex relaxations, we have to compromise statistical optimality.

It is worth noting that this gap between computational tractability and statistical optimality is effective under the regime $s^* = o\big((d/(\log d)^2)^{1/(2\ell)}\big)$, which shrinks as $\ell$ increases. However, $\ell$ cannot increase with $d$ and $s^*$, because otherwise the computational complexity required to solve the convex relaxations increases exponentially, according to our discussion in §3. For any constant $\ell$, the regime in Theorem 3 is a nontrivial subset of the regime in Theorem 1. As will be shown in our proof, $s^* = o\big((d/(\log d)^2)^{1/(2\ell)}\big)$ is a sufficient condition for establishing the feasibility of the constructed solution. In fact, for $\ell = 2$, we can further relax this condition to $s^* = o\big(d^{1/3}/\log d\big)$ using the results of (Deshpande & Montanari, 2015).

Under the regime in Theorem 3, the next proposition shows that $\hat{\beta}_{\max}$ defined in (2.2) is nearly optimal under computational tractability constraints.

**Proposition 4.** For $\hat{\beta}_{\max}$ in (2.2), where $X_{i,j}$ is the $(i,j)$-th entry of $X \sim P \in \mathcal{P}(s^*, d)$, we have that $\big|\hat{\beta}_{\max} - \beta^*\big| \leq C\sqrt{\log d}$ holds with probability at least $1 - 1/d$ for some absolute constant $C > 0$.

*Proof.* See §A for a detailed proof.

According to (3.7) and the discussion in §3, we have $\hat{\beta}_{\max} \in \mathcal{H}^{(1)} \subseteq \mathcal{H}^{(2)} \subseteq \cdots$. Thus $\hat{\beta}_{\max}$ attains the minimax lower bound under computational constraints in Theorem 3 for every $\ell$, up to a logarithmic factor, which also suggests that the lower bound in Theorem 3 is tight.
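To make the gap between Theorems 1 and 3 concrete, the following snippet compares the two lower bounds numerically, with all absolute constants dropped; the parameter values are arbitrary and the computation is purely illustrative:

```python
import numpy as np

d, ell = 10**12, 1
# Regime of Theorem 3: s* = o((d / (log d)^2)^(1 / (2*ell))), here roughly 3.6e4.
for s in [10**3, 10**4, 3 * 10**4]:
    info_rate = np.sqrt(np.log(d / s) / s)  # Theorem 1 rate, attained by beta_scan
    print(f"s* = {s:>6}: info-theoretic rate ~ {info_rate:.4f}; "
          f"any estimator in H(ell) is stuck at Omega(1) by Theorem 3")
```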
Meanwhile, note that the calculation of $\hat{\beta}_{\max}$ in (2.2) requires $O(d^2)$ operations, which is linear in the size of the input. In contrast, tighter approximations in the $\ell$-th level SoS hierarchy require $d^{O(\ell)}$ operations. In practice, such a computational complexity is in general much higher than the complexity of calculating $\hat{\beta}_{\max}$. Theorem 3 indicates that this extra computational cost can only yield a limited improvement in the statistical rate of convergence, i.e., a logarithmic factor.

It is worth noting that the gap between the lower bounds in Theorems 1 and 3 vanishes when $s^*$ is a constant that does not increase with $d$. In this case, $\hat{\beta}_{\max}$ achieves the information-theoretic lower bound in Theorem 1. On the other hand, $\hat{\beta}_{\mathrm{scan}}$ is computationally tractable to obtain in this case.

### 4.2. Stochastic Block Model

In this section, we present the main theory for edge probability estimation in the stochastic block model. Recall that $\mathcal{P}(s^*, d)$ is the distribution family defined in §2.2, and that $\tilde{\beta}$ denotes the edge probability of the large Erdős-Rényi graph with $d$ nodes. The following theorem establishes the information-theoretic lower bound for estimating $\beta^*$.

**Theorem 5.** We assume that $s^* = o\big((d/\sqrt{\log d})^{2/3}\big)$ and $\log(d/s^*) = o(s^*)$. There is an absolute constant $C > 0$ such that

$$\sup_{P \in \mathcal{P}(s^*, d)} \mathbb{E}\big|\hat{\beta} - \beta^*\big| \geq C \sqrt{1/s^* \cdot \log(d/s^*)}.$$

*Proof.* See §B for a detailed proof.

Theorem 5 is similar to Theorem 1 but needs the extra condition $\log(d/s^*) = o(s^*)$, which ensures $s^*$ is not too small. Recall that each entry of the adjacency matrix $A$ is Bernoulli. (Arias-Castro & Verzelen, 2014) show that a larger $s^*$ guarantees that the moderate deviations of the Bernoulli distribution take effect in the lower bound. Next, we prove that $\hat{\beta}_{\mathrm{scan}}$ achieves the information-theoretic lower bound in Theorem 5 and hence is optimal.

**Proposition 6.** For $\hat{\beta}_{\mathrm{scan}}$ defined in (2.1), we have that, with probability at least $1 - 1/d$,

$$\big|\hat{\beta}_{\mathrm{scan}} - \beta^*\big| \leq C \sqrt{1/s^* \cdot \log(d/s^*)}.$$

*Proof.* See §B for a detailed proof.

The next theorem establishes the minimax lower bound on the statistical performance of the convex relaxations within $\mathcal{H}^{(\ell)}$ defined in §3.

**Theorem 7.** For $s^*$ and $d$ sufficiently large and $s^* = o\big((d/(\log d)^2)^{1/(2\ell)}\big)$,

$$\inf_{\hat{\beta} \in \mathcal{H}^{(\ell)}} \sup_{P \in \mathcal{P}(s^*, d)} \mathbb{E}\big|\hat{\beta} - \beta^*\big| \geq 1/4.$$

*Proof.* See §B for a detailed proof.

Similar to Theorem 3, Theorem 7 exhibits the gap between statistical optimality and computational tractability. Note that $\beta^* \in [0, 1]$. Meanwhile, it is easy to show that $P(\hat{\beta}_{\max} = 1) \geq 1 - (1 - \tilde{\beta})^{d(d-1)/2}$. Therefore, $\hat{\beta}_{\max}$ attains this minimax lower bound under computational constraints up to constants. From another point of view, for $s^* = o\big((d/(\log d)^2)^{1/(2\ell)}\big)$, every estimator within $\mathcal{H}^{(\ell)}$ is at most as accurate as the trivial estimator $\hat{\beta} = 1$.

Theorems 3 and 7 are similar. Note that for sparse principal submatrix estimation we consider sub-Gaussian entries, while each entry of the adjacency matrix in the stochastic block model is Bernoulli. A direct way to establish Theorem 3 would be to adapt the construction of $P$ in the proof of Theorem 7, since a Bernoulli random variable is sub-Gaussian. However, as illustrated in §A, the information-theoretic lower bound in Theorem 1 is established using a construction of $P$ with unbounded support. Correspondingly, we use a construction of $P$ with unbounded support to establish the lower bound with computational constraints in Theorem 3. By matching the constructions of $P \in \mathcal{P}(s^*, d)$ in the proofs of Theorems 1 and 3, we can sharply characterize the existence of the $\sqrt{s^*}$ gap specifically for sub-Gaussian distributions with unbounded support.
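As a sanity check on the model in §2.2, the following sketch simulates a planted dense subgraph and evaluates $\hat{\beta}_{\max}$, assuming numpy (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, beta, beta_bg = 400, 25, 0.9, 0.3       # beta_bg plays the role of beta-tilde

S = rng.choice(d, size=s, replace=False)      # planted dense subgraph
prob = np.full((d, d), beta_bg)
prob[np.ix_(S, S)] = beta                     # elevated edge probability on S x S

upper = np.triu(rng.random((d, d)) < prob, 1) # sample each edge once
A = (upper | upper.T).astype(float)           # symmetric adjacency, zero diagonal

print(A.max())                                # beta_max: 1 once any edge exists
print(A[np.ix_(S, S)].sum() / (s * (s - 1)))  # oracle average over the planted block
```

Since the background graph almost surely contains at least one edge, $\hat{\beta}_{\max} = 1$, which is exactly the behavior used in the discussion after Theorem 7.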
## 5. Proof of Main Results

In the sequel, we present the proofs of the main results in §4. Due to space limits, we lay out the proof of Theorem 3 for sparse principal submatrix estimation, which is one of the major results, and defer the remaining proofs to the appendix.

### 5.1. Proof of Theorem 3

*Proof of Theorem 3.* In this proof, we focus on specific distributions in $\mathcal{P}(s^*, d)$ with $\beta^* = 0$. We take the $X_{i,j}$'s ($i < j$) to be sub-Gaussian random variables satisfying the constraints in §2.1. In addition, we assume that $|X_{i,j}| = \nu$ almost surely and $P(X_{i,j} > 0) = P(X_{i,j} < 0) = 1/2$ for all $i < j$, where $\nu > 0$ is a constant. Under such a distribution, we construct a matrix $\Pi^{(\ell)} \in \mathbb{R}^{d^{(\ell)} \times d^{(\ell)}}$ that is a feasible solution to the $\ell$-th level SoS program in (3.6) with high probability. We further prove that the objective value corresponding to $\Pi^{(\ell)}$ equals $\nu$, which implies that the maximum of the corresponding SoS program is at least $\nu$ with high probability. In the rest of this proof, we write $X$ for $X + I_d$. Hereafter, we denote by $X_{S,S'}$ the submatrix of $X$ whose row indices are in $S$ and column indices are in $S'$.

For notational simplicity, we define the expansivity $\omega(S, X)$ of a set $S \subseteq [d]$ to be the number of sets $S' \subseteq [d]$ that satisfy $|S'| = 2\ell$, $S \subseteq S'$, and $\mathrm{sign}(X_{S',S'}) = \mathbf{1}_{2\ell \times 2\ell}$. Here $\mathrm{sign}(X)$ is the matrix that satisfies $[\mathrm{sign}(X)]_{i,j} = 1$ if $X_{i,j} > 0$ and $[\mathrm{sign}(X)]_{i,j} = 0$ if $X_{i,j} \leq 0$. Note that $\omega(S, X)$ is nonzero only if $X_{i,j} > 0$ for all $i, j \in S$. Hence, $\omega(S, X)$ gives the number of submatrices of $X$ that are extended from $X_{S,S}$ and have size $2\ell \times 2\ell$ with all entries positive. It is worth noting that, by definition, $\omega(S, X)$ is a random quantity that depends on the random matrix $X$.

Recall that each entry $\Pi^{(\ell)}_{C_1, C_2}$ of $\Pi^{(\ell)}$ is indexed by collections $C_1$ and $C_2$, and that $M(C_1)$ and $M(C_2)$ are the corresponding sets, which have distinct elements. Based on the construction of dual certificates in (Meka et al., 2015), we construct $\Pi^{(\ell)}$ as

$$\Pi^{(\ell)}_{C_1, C_2} = \frac{\omega\big(M(C_1) \cup M(C_2), X\big)}{\omega(\emptyset, X)} \cdot \frac{s^*! \,/\, [s^* - |M(C_1) \cup M(C_2)|]!}{(2\ell)! \,/\, [2\ell - |M(C_1) \cup M(C_2)|]!}. \tag{5.1}$$
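The two ingredients of (5.1), the expansivity count and the falling-factorial ratio, can be stated directly in code. Below is a brute-force sketch of the construction as rendered above, feasible only for tiny $d$ and $\ell$; this is our own illustration, and `expansivity` and `cert_entry` are hypothetical helper names:

```python
import itertools
from math import factorial
import numpy as np

def expansivity(S, X, ell):
    """omega(S, X): number of sets S' of size 2*ell with S contained in S'
    whose principal submatrix X_{S',S'} has all entries positive."""
    d = X.shape[0]
    rest = [i for i in range(d) if i not in S]
    count = 0
    for extra in itertools.combinations(rest, 2 * ell - len(S)):
        Sp = list(S) + list(extra)
        if np.all(X[np.ix_(Sp, Sp)] > 0):
            count += 1
    return count

def cert_entry(C1, C2, X, ell, s_star):
    """Entry (5.1) of the candidate solution, indexed by collections
    (tuples) C1, C2; normalized so the (empty, empty) entry equals one
    (assumes at least one all-positive 2l x 2l block exists)."""
    U = sorted(set(C1) | set(C2))          # M(C1) union M(C2)
    m = len(U)
    ratio = (factorial(s_star) // factorial(s_star - m)) / \
            (factorial(2 * ell) // factorial(2 * ell - m))
    return expansivity(U, X, ell) * ratio / expansivity([], X, ell)
```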
Now we verify that $\Pi^{(\ell)}$ defined in (5.1) satisfies all the constraints of the $\ell$-th level SoS program in (3.6). First, we have $\Pi^{(\ell)}_{\emptyset,\emptyset} = 1$ from (5.1). Also, $\Pi^{(\ell)}$ satisfies $\Pi^{(\ell)}_{C_1,C_2} = \Pi^{(\ell)}_{C_1',C_2'}$ for $C_1 + C_2 = C_1' + C_2'$, since $M(C_1) \cup M(C_2) = M(C_1 + C_2)$ by the definition of the merge operation $M(\cdot)$. Meanwhile, it holds that $\Pi^{(\ell)}_{C_1+\{i,i\},C_2} = \Pi^{(\ell)}_{C_1+\{i\},C_2}$ for all $C_1$ and $C_2$ with $|C_1| \leq \ell - 2$ and $|C_2| \leq \ell$, since in (5.1) we have $M(C_1 + \{i,i\}) \cup M(C_2) = M(C_1 + \{i\}) \cup M(C_2)$.

Now we prove that $\sum_{i=1}^d \Pi^{(\ell)}_{C_1+\{i\},C_2} = s^* \Pi^{(\ell)}_{C_1,C_2}$ holds for all $|C_1| \leq \ell - 1$ and $|C_2| \leq \ell$. Let $C = C_1 + C_2$, which satisfies $|M(C)| \leq |C| \leq 2\ell - 1$. By (5.1) we have

$$\Pi^{(\ell)}_{C_1+\{i\},C_2} = \frac{\omega\big(M(C + \{i\}), X\big)}{\omega(\emptyset, X)} \cdot \frac{s^*! \,/\, [s^* - |M(C + \{i\})|]!}{(2\ell)! \,/\, [2\ell - |M(C + \{i\})|]!}, \tag{5.2}$$

where we use the fact that $M(C_1 + \{i\}) \cup M(C_2) = M(C_1 + C_2 + \{i\}) = M(C + \{i\})$. Also, note that $M(C + \{i\}) = M(C)$ for $i \in M(C)$, while $M(C + \{i\}) = M(C) \cup \{i\}$ and $|M(C + \{i\})| = |M(C)| + 1$ for $i \notin M(C)$. From (5.2) we have

$$\sum_{i=1}^d \Pi^{(\ell)}_{C_1+\{i\},C_2} = \underbrace{\sum_{i \in M(C)} \frac{\omega\big(M(C), X\big)}{\omega(\emptyset, X)} \cdot \frac{s^*!/[s^* - |M(C)|]!}{(2\ell)!/[2\ell - |M(C)|]!}}_{\text{(i)}} + \underbrace{\sum_{i \notin M(C)} \frac{\omega\big(M(C) \cup \{i\}, X\big)}{\omega(\emptyset, X)} \cdot \frac{s^*!/[s^* - |M(C)| - 1]!}{(2\ell)!/[2\ell - |M(C)| - 1]!}}_{\text{(ii)}}. \tag{5.3}$$

In the following, we characterize $\omega\big(M(C) \cup \{i\}, X\big)$ for $i \notin M(C)$. We define $S_1, S_2, \ldots, S_{\omega(M(C),X)} \subseteq [d]$ to be the distinct sets that satisfy $|S_j| = 2\ell - |M(C)|$, $M(C) \cap S_j = \emptyset$, and $\mathrm{sign}\big(X_{S_j \cup M(C),\, S_j \cup M(C)}\big) = \mathbf{1}_{2\ell \times 2\ell}$ for every $j$. Setting $S^\sharp = \bigcup_{j=1}^{\omega(M(C),X)} S_j$, we have

$$\sum_{i \notin M(C)} \omega\big(M(C) \cup \{i\}, X\big) = \sum_{i \in S^\sharp} \omega\big(M(C) \cup \{i\}, X\big) = \sum_{i \in S^\sharp} \sum_{j=1}^{\omega(M(C),X)} \mathbb{1}(i \in S_j) = \omega\big(M(C), X\big) \cdot [2\ell - |M(C)|].$$

Here the first equality follows from $\omega\big(M(C) \cup \{i\}, X\big) = 0$ for $i \notin S^\sharp$, since in this case $X_{M(C)\cup\{i\}, M(C)\cup\{i\}}$ admits no all-positive $2\ell \times 2\ell$ extension. The second equality holds because, to calculate $\omega\big(M(C) \cup \{i\}, X\big)$, we only need to count the number of $S_j$'s that include $i$. The last equality follows from $|S_j| = 2\ell - |M(C)|$. Therefore, for term (ii) in (5.3) we have

$$\text{(ii)} = \frac{\omega\big(M(C), X\big)}{\omega(\emptyset, X)} \cdot [2\ell - |M(C)|] \cdot \frac{s^*!/[s^* - |M(C)| - 1]!}{(2\ell)!/[2\ell - |M(C)| - 1]!} = (s^* - |M(C)|) \cdot \frac{\omega\big(M(C), X\big)}{\omega(\emptyset, X)} \cdot \frac{s^*!/[s^* - |M(C)|]!}{(2\ell)!/[2\ell - |M(C)|]!}. \tag{5.4}$$

Meanwhile, for term (i) in (5.3) we have

$$\text{(i)} = |M(C)| \cdot \frac{\omega\big(M(C), X\big)}{\omega(\emptyset, X)} \cdot \frac{s^*!/[s^* - |M(C)|]!}{(2\ell)!/[2\ell - |M(C)|]!}. \tag{5.5}$$

Plugging (5.4) and (5.5) into (5.3), we obtain

$$\sum_{i=1}^d \Pi^{(\ell)}_{C_1+\{i\},C_2} = s^* \cdot \frac{\omega\big(M(C), X\big)}{\omega(\emptyset, X)} \cdot \frac{s^*!/[s^* - |M(C)|]!}{(2\ell)!/[2\ell - |M(C)|]!} = s^* \, \Pi^{(\ell)}_{C_1,C_2}.$$

Thus, we conclude that $\Pi^{(\ell)}$ satisfies all the constraints of the $\ell$-th level SoS program in (3.6) except $\Pi^{(\ell)} \succeq 0$. We defer the verification of this constraint to the end of the proof.

Next we calculate the value of the objective function at $\Pi^{(\ell)}$. Note that

$$\mathrm{tr}\big(Y^{(\ell)}\Pi^{(\ell)}\big) = \sum_{i \neq j} X_{i,j} \, \Pi^{(\ell)}_{\{i\},\{j\}} = \nu \sum_{i \neq j} \Pi^{(\ell)}_{\{i\},\{j\}},$$

where the second equality holds because, by the definition of the expansivity, $\omega(\{i,j\}, X) = 0$ for $X_{i,j} \leq 0$, which implies $\Pi^{(\ell)}_{\{i\},\{j\}} = 0$ correspondingly. Moreover, we have

$$\sum_{i,j} \Pi^{(\ell)}_{\{i\},\{j\}} = \sum_j s^* \, \Pi^{(\ell)}_{\{j\},\emptyset} = s^* \cdot s^* \, \Pi^{(\ell)}_{\emptyset,\emptyset} = (s^*)^2,$$

where the first two equalities follow from the constraint $\sum_{i=1}^d \Pi^{(\ell)}_{C_1+\{i\},C_2} = s^* \Pi^{(\ell)}_{C_1,C_2}$, while the last follows from $\Pi^{(\ell)}_{\emptyset,\emptyset} = 1$. Similarly, we have

$$\sum_i \Pi^{(\ell)}_{\{i\},\{i\}} = \sum_i \Pi^{(\ell)}_{\{i\},\emptyset} = s^* \, \Pi^{(\ell)}_{\emptyset,\emptyset} = s^*,$$

where the first equality follows from the constraints $\Pi^{(\ell)}_{C_1+\{i,i\},C_2} = \Pi^{(\ell)}_{C_1+\{i\},C_2}$ and $\Pi^{(\ell)}_{C_1,C_2} = \Pi^{(\ell)}_{C_1',C_2'}$ for $C_1 + C_2 = C_1' + C_2'$, and the second from $\sum_{i=1}^d \Pi^{(\ell)}_{C_1+\{i\},C_2} = s^* \Pi^{(\ell)}_{C_1,C_2}$. Recall that $|X_{i,j}| = \nu$ almost surely; hence the objective value at $\Pi^{(\ell)}$ is

$$\frac{\mathrm{tr}\big(Y^{(\ell)}\Pi^{(\ell)}\big)}{s^*(s^*-1)} = \frac{\nu \big[\sum_{i,j}\Pi^{(\ell)}_{\{i\},\{j\}} - \sum_i \Pi^{(\ell)}_{\{i\},\{i\}}\big]}{s^*(s^*-1)} = \frac{\nu\,[(s^*)^2 - s^*]}{s^*(s^*-1)} = \nu.$$

Because any $\hat{\beta} \in \mathcal{H}^{(\ell)}$ is the maximum of the $\ell$-th level SoS program or one of its relaxed versions, so far we have obtained

$$P\big(\hat{\beta} \geq \nu \,\big|\, \Pi^{(\ell)} \succeq 0\big) = 1. \tag{5.6}$$

In the sequel, we verify that $\Pi^{(\ell)} \succeq 0$ holds with high probability. We invoke Theorem 2.5 of (Meka et al., 2015). They consider a matrix $M^{(\ell)} \in \mathbb{R}^{\sum_{j=0}^{\ell}\binom{d}{j} \times \sum_{j=0}^{\ell}\binom{d}{j}}$ indexed by sets $S_1, S_2 \subseteq [d]$, which satisfies $M^{(\ell)}_{S_1,S_2} = \Pi^{(\ell)}_{C_1,C_2}$ for $S_1 = M(C_1)$ and $S_2 = M(C_2)$. Their result implies that, under the distribution within $\mathcal{P}(s^*, d)$ specified at the beginning of our proof, $M^{(\ell)} \succeq 0$ holds with probability at least $1/2$ for sufficiently large $s^*$ and $d$ with $s^* = o\big((d/(\log d)^2)^{1/(2\ell)}\big)$.

Note that $M^{(\ell)}$ is a submatrix of $\Pi^{(\ell)}$, namely $M^{(\ell)} = \Pi^{(\ell)}_{\{C : |C| = |M(C)|\},\, \{C : |C| = |M(C)|\}}$. In other words, we can simultaneously permute the rows and columns of $\Pi^{(\ell)}$ that are indexed by the collections $C$ satisfying $|C| = |M(C)|$ to the upper-left corner of $\Pi^{(\ell)}$; then $M^{(\ell)}$ is identical to this upper-left submatrix of $\Pi^{(\ell)}$. Meanwhile, note that by (5.1) every row and column of $\Pi^{(\ell)}$ indexed by a collection $C$ is a copy of the row or column indexed by the set $M(C)$. Thus, for any vector $u \in \mathbb{R}^{d^{(\ell)}}$, we have

$$u^\top \Pi^{(\ell)} u = \bar{u}^\top M^{(\ell)} \bar{u}, \tag{5.7}$$

where $\bar{u} \in \mathbb{R}^{\sum_{j=0}^{\ell}\binom{d}{j}}$ is indexed by sets and $\bar{u}_S = \sum_{C : M(C) = S} u_C$. Therefore, using (5.7) and the fact that $M^{(\ell)} \succeq 0$ with probability at least $1/2$, we have that $\Pi^{(\ell)} \succeq 0$ holds with the same probability. Moreover, according to (5.6) and our setting that $\beta^* = 0$, by Markov's inequality we have

$$\mathbb{E}\big|\hat{\beta} - \beta^*\big| \geq \nu \cdot P\big(\hat{\beta} \geq \nu\big) \geq \nu \cdot P\big(\Pi^{(\ell)} \succeq 0\big) \geq \nu/2$$

for all $\hat{\beta} \in \mathcal{H}^{(\ell)}$ and $s^* = o\big((d/(\log d)^2)^{1/(2\ell)}\big)$. Recall that $\nu$ is a positive constant and our constructed distributions are within $\mathcal{P}(s^*, d)$. Hence, we conclude the proof. □
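The linear-constraint identity $\sum_i \Pi^{(\ell)}_{C_1+\{i\},C_2} = s^*\,\Pi^{(\ell)}_{C_1,C_2}$ derived above holds deterministically for any sign pattern, so it can be sanity-checked numerically on a tiny instance, reusing the `expansivity` and `cert_entry` sketches from earlier (illustrative only; the biased edge sign just makes an all-positive block likely):

```python
import numpy as np

rng = np.random.default_rng(1)
d, ell, s_star = 8, 2, 4
while True:
    # Signed off-diagonal entries (nu = 1), diagonal set to 1 as in X + I_d.
    U = np.triu(np.where(rng.random((d, d)) < 0.7, 1.0, -1.0), 1)
    X = U + U.T + np.eye(d)
    if expansivity([], X, ell) > 0:    # ensure the normalization is well defined
        break

C1, C2 = (2,), (5,)                    # |C1| <= ell - 1, |C2| <= ell
lhs = sum(cert_entry(C1 + (i,), C2, X, ell, s_star) for i in range(d))
rhs = s_star * cert_entry(C1, C2, X, ell, s_star)
assert abs(lhs - rhs) < 1e-9           # sum_i Pi_{C1+{i},C2} = s* Pi_{C1,C2}
```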
## 6. Conclusions

In this paper, we investigate the statistical limits of convex relaxations for two statistical problems: mean estimation for a sparse principal submatrix and edge probability estimation for a stochastic block model. Different from existing works, which consider the statistical limits of general polynomial-time algorithms, we instead characterize the loss in statistical rates incurred by a broad family of convex relaxations. At the core of our main theoretical results is a construction-based proof, which does not hinge on any unproven hardness hypotheses. Our conclusion is that, in order to attain computational tractability with convex relaxations, under particular regimes we have to compromise statistical optimality.

## References

Arias-Castro, Ery and Verzelen, Nicolas. Community detection in dense random networks. The Annals of Statistics, 42(3):940–969, 2014.

Barak, Boaz and Moitra, Ankur. Tensor prediction, Rademacher complexity and random 3-XOR. arXiv preprint arXiv:1501.06521, 2015.

Barak, Boaz and Steurer, David. Sum-of-squares proofs and the quest toward optimal algorithms. arXiv preprint arXiv:1404.5236, 2014.

Berthet, Quentin and Rigollet, Philippe. Computational lower bounds for sparse PCA. arXiv preprint arXiv:1304.0828, 2013a.

Berthet, Quentin and Rigollet, Philippe. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013b.

Bhaskara, Aditya, Charikar, Moses, Chlamtac, Eden, Feige, Uriel, and Vijayaraghavan, Aravindan. Detecting high log-densities: An O(n^{1/4}) approximation for densest k-subgraph. In ACM Symposium on Theory of Computing, pp. 201–210, 2010.

Bühlmann, Peter and van de Geer, Sara. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.

Butucea, Cristina and Ingster, Yuri I. Detection of a sparse submatrix of a high-dimensional noisy matrix. Bernoulli, 19(5B):2652–2688, 2013.

Butucea, Cristina, Ingster, Yuri I, and Suslina, Irina. Sharp variable selection of a sparse submatrix in a high-dimensional noisy matrix. arXiv preprint arXiv:1303.5647, 2013.

Cai, T. Tony, Liang, Tengyuan, and Rakhlin, Alexander. Computational and statistical boundaries for submatrix localization in a large noisy matrix. arXiv preprint arXiv:1502.01988, 2015.

Chandrasekaran, Venkat and Jordan, Michael I. Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 110(13):1181–1190, 2013.

Chandrasekaran, Venkat, Recht, Benjamin, Parrilo, Pablo A, and Willsky, Alan S. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

Chen, Yudong and Xu, Jiaming. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv preprint arXiv:1402.1267, 2014.

Chlamtac, Eden and Tulsiani, Madhur. Convex relaxations and integrality gaps. In Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 139–169. Springer, 2012.

Coja-Oghlan, Amin. Graph partitioning via adaptive spectral techniques. Combinatorics, Probability and Computing, 19(2):227–284, 2010.

Decelle, Aurelien, Krzakala, Florent, Moore, Cristopher, and Zdeborová, Lenka. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.

Deshpande, Yash and Montanari, Andrea. Improved sum-of-squares lower bounds for hidden clique and hidden submatrix problems. arXiv preprint arXiv:1502.06590, 2015.

Fortunato, Santo. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.

Gao, Chao, Ma, Zongming, and Zhou, Harrison H. Sparse CCA: Adaptive estimation and computational barriers. arXiv preprint arXiv:1409.8565, 2014.
Hajek, Bruce, Wu, Yihong, and Xu, Jiaming. Computational lower bounds for community detection on random graphs. arXiv preprint arXiv:1406.6625, 2014.

Kolar, Mladen, Balakrishnan, Sivaraman, Rinaldo, Alessandro, and Singh, Aarti. Minimax localization of structural information in large noisy matrices. In Advances in Neural Information Processing Systems, pp. 909–917, 2011.

Krauthgamer, Robert, Nadler, Boaz, and Vilenchik, Dan. Do semidefinite relaxations really solve sparse PCA? arXiv preprint arXiv:1306.3690, 2013.

Kučera, Luděk. Expected complexity of graph partitioning problems. Discrete Applied Mathematics, 57(2):193–212, 1995.

Lasserre, Jean B. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.

Laurent, Monique. A comparison of the Sherali-Adams, Lovász-Schrijver, and Lasserre relaxations for 0-1 programming. Mathematics of Operations Research, 28(3):470–496, 2003.

Lovász, László and Schrijver, Alexander. Cones of matrices and set-functions and 0-1 optimization. SIAM Journal on Optimization, 1(2):166–190, 1991.

Ma, Tengyu and Wigderson, Avi. Sum-of-squares lower bounds for sparse PCA. In Advances in Neural Information Processing Systems, pp. 1603–1611, 2015.

Ma, Zongming and Wu, Yihong. Computational barriers in minimax submatrix detection. arXiv preprint arXiv:1309.5914, 2013.

Massoulié, Laurent. Community detection thresholds and the weak Ramanujan property. In ACM Symposium on Theory of Computing, pp. 694–703, 2014.

Meka, Raghu, Potechin, Aaron, and Wigderson, Avi. Sum-of-squares lower bounds for the planted clique problem. In ACM Symposium on Theory of Computing, 2015.

Mossel, Elchanan, Neeman, Joe, and Sly, Allan. Stochastic block models and reconstruction. arXiv preprint arXiv:1202.1499, 2012.

Mossel, Elchanan, Neeman, Joe, and Sly, Allan. A proof of the block model threshold conjecture. arXiv preprint arXiv:1311.4115, 2013.

Parrilo, Pablo A. Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. PhD thesis, California Institute of Technology, 2000.

Parrilo, Pablo A. Semidefinite programming relaxations for semialgebraic problems. Mathematical Programming, 96(2):293–320, 2003.

Shabalin, Andrey A, Weigman, Victor J, Perou, Charles M, and Nobel, Andrew B. Finding large average submatrices in high dimensional data. The Annals of Applied Statistics, pp. 985–1012, 2009.

Sherali, Hanif D and Adams, Warren P. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990.

Sun, Xing and Nobel, Andrew B. On the maximal size of large-average and ANOVA-fit submatrices in a Gaussian random matrix. Bernoulli, 19(1):275–294, 2013.

Verzelen, Nicolas and Arias-Castro, Ery. Community detection in sparse random networks. arXiv preprint arXiv:1308.2955, 2013.

Wang, Tengyao, Berthet, Quentin, and Samworth, Richard J. Statistical and computational trade-offs in estimation of sparse principal components. arXiv preprint arXiv:1408.5369, 2014.

Zhang, Yuchen, Wainwright, Martin J, and Jordan, Michael I. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. arXiv preprint arXiv:1402.1918, 2014.