# random_subgraph_detection_using_queries__65c0c014.pdf

Journal of Machine Learning Research 25 (2024) 1-25 Submitted 4/22; Revised 3/24; Published 5/24

Random Subgraph Detection Using Queries

Wasim Huleihel wasimh@tauex.tau.ac.il Department of Electrical Engineering-Systems Tel Aviv university Tel Aviv 6997801, Israel

Arya Mazumdar arya@ucsd.edu Halıcıo glu Data Science Institute University of California San Diego La Jolla, CA 92093, USA

Soumyabrata Pal soumyabratapal13@gmail.com Adobe Research, India Bangalore, INDIA

Editor: Andrea Montanari

The planted densest subgraph detection problem refers to the task of testing whether in a given (random) graph there is a subgraph that is unusually dense. Speciﬁcally, we observe an undirected and unweighted graph on n vertices. Under the null hypothesis, the graph is a realization of an Erd os-R enyi graph with edge probability (or, density) q. Under the alternative, there is a subgraph on k vertices with edge probability p > q. The statistical as well as the computational barriers of this problem are well-understood for a wide range of the edge parameters p and q. In this paper, we consider a natural variant of the above problem, where one can only observe a relatively small part of the graph using adaptive edge queries. For this model, we determine the number of queries necessary and suﬃcient (accompanied with a quasi-polynomial optimal algorithm) for detecting the presence of the planted subgraph. We also propose a polynomial-time algorithm which is able to detect the planted subgraph, albeit with more queries compared to the above lower bound. We conjecture that in the leftover regime, no polynomial-time algorithms exist. Our results resolve two open questions posed in the past literature.

Keywords: Random graphs, statistical inference, planted dense subgraph, adaptive probing, queries

1. Introduction

In the planted densest subgraph (PDS) formulation of community detection, the task is to detect the presence of a small subgraph of size k planted in an Erd os-R enyi random graph. This problem has been studied extensively both from the algorithmic and the informationtheoretic perspectives Arias-Castro et al. (2014); Butucea and Ingster (2013); Verzelen et al. (2015); Chen and Xu (2016); Montanari (2015); Candogan and Chandrasekaran (2018); Hajek et al. (2016). Nonetheless, the best known algorithms exhibit a peculiar phenomenon: there appears to be a statistical-computational gap between the minimum k at which this task can be solved and the minimum k at which it can be solved in polynomial-time. Tight

c 2024 Wasim Huleihel, Arya Mazumdar, Soumyabrata Pal.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-0395.html.

Huleihel, Mazumdar and Pal

statistical-computational bounds for several parameter regimes of the PDS were recently established through average-case reductions from the planted clique conjecture Ma and Wu (2015); Hajek et al. (2015); Brennan et al. (2018). The regimes in which these problems are information-theoretically impossible, statistically possible but computational hard, and admit polynomial-time algorithms appear to have a common structure. Recently, models of clustering and community detection that allow active querying for pairwise similarities have become quite popular. This includes active learning, as well as data labeling by amateurs via crowdsourcing. Clever implementation of an interactive querying framework can improve the accuracy of clustering and help in inferring labels of large amount of data by issuing only a small number of queries. Queries can be easily implemented, e.g., via captcha. Non-expert workers in crowdsourcing platforms are often not able to label the items directly, however, it is reasonable to assume that they can compare items and judge whether they are similar or not. Understanding the query complexity to recover hidden structures is a fundamental theoretical question with several applications, from community detection to entity resolution Mazumdar and Saha (2017a,b). For example, analyzing query complexity and designing query-based algorithms is relevant to community recovery in social networks, where access to the connections (edges) between individuals may be limited due to privacy concerns; or the network can be very large so only part of the graph can be sampled. In this paper, we investigate a natural variant of the PDS problem above, where one can only observe a small part of the graph using non-adaptive edge queries. A precise description of our model as well as our main goals are described next.

1.1 Model Formulation and Goal

To present our model, we start by reminding the reader the basic mathematical formulation of the PDS detection problem. Speciﬁcally, let G(n, q) denote the Erd os-R enyi random graph with n vertices, where each pair of vertices is connected independently with probability q. Also, let G(n, k, p, q) with p > q denote the ensemble of random graphs generated as follows:

Pick k vertices uniformly at random from [n] {1, 2, . . . , n}, and denote the obtained set by K.

Any two vertices in K are connected with probability p; all other pairs of vertices are connected with probability q.

In summary, G(n, k, p, q) is the ensemble of graphs of size n, where a random subgraph G(k, p) is planted in an Erd os-R enyi random graph G(n, q); this ensemble is known as the PDS model. The vertices in K form a community with higher density than elsewhere. In this paper, we focus on the regime where both edge probabilities p and q are ﬁxed and independent of n. The PDS detection problem is deﬁned as follows.

Deﬁnition 1 (PDS detection problem) The PDS detection problem with parameters (n, k, p, q), henceforth denoted by PDS(n, k, p, q), refers to the problem of distinguishing hypotheses:

H0 : Gn G(n, q) vs. H1 : Gn G(n, k, p, q). (1)

Random Subgraph Detection Using Queries

The statistical and computational barriers of the problem in Deﬁnition 1 depend on the parameters (n, k, p, q). Roughly speaking, if the planted subgraph size k decreases, or if the distance between the densities p and q decrease, the distributions under the null and alternative hypotheses become less distinguishable. The statistical limits (i.e., necessary and suﬃcient conditions) for detecting planted dense subgraphs, without any constraints on the computational complexity, were established in Arias-Castro et al. (2014); Verzelen et al. (2015). Interestingly, in the same papers it was observed that state-of-the-art lowcomplexity algorithms are highly suboptimal. This raised the intriguing question of whether those gaps between the amount of data needed by all computationally eﬃcient algorithms and what is needed for statistically optimal algorithms is inherent. According, quite recently Hajek et al. (2015); Brennan et al. (2018, 2019), tight statistical-computational gaps for several parameter regimes of PDS were established through average-case reductions from the planted clique conjecture (see, Conjecture 4 below). In this work, we consider a variant of the PDS detection problem where one can only inspect a small part of the graph by non-adaptive edge queries, deﬁned as follows.

Deﬁnition 2 (Oracle/Edge queries) Consider a graph Gn = ([n], E) with n vertices, where E denotes the set of edges. An oracle O : [n] [n] {0, 1}, takes as input a pair of vertices i, j [n] [n], and if (i, j) E, namely, there exists an edges between the chosen vertices, then O(i, j) = 1, otherwise, O(i, j) = 0.

We consider query mechanisms that evolve dynamically over Q steps/queries in the following form: in step number ℓ [Q], the mechanism chooses a pair of vertices eℓ (iℓ, jℓ) and asks the oracle whether these vertices are connected by an edge or not. Generally speaking, either adaptive or non-adaptive query mechanisms can be considered. In the former, the chosen ℓth pair may depend on the previously chosen pairs {ei}i<ℓ, as well as on past responses {O(ei)}i<ℓ. In non-adaptive mechanisms, on the other hand, all queries must be made upfront. In this paper we focus mainly on non-adaptive mechanisms. The query-PDS detection problem is deﬁned as follows.

Problem 1 (Query-PDS detection problem) Consider the PDS detection problem in Deﬁnition 1. There is an oracle O as deﬁned in Deﬁnition 2. Find a set of queries Q [n] [n] such that Q = |Q|, and from the oracle answers it is possible to solve (as deﬁned below) the detection problem in (1). Henceforth, we denote this detection problem by QPDS(n, k, p, q, Q).

A detection algorithm An for the problem in Deﬁnition 1, makes up to Q non-adaptive edge queries, and based on the query responses is tasked with outputting a decision in {0, 1}. We deﬁne the risk of a detection algorithm An as the sum of its Type-I and Type-II errors probabilities, namely,

R(An) = PH0(An(Gn) = 1) + PH1(An(Gn) = 0), (2)

where PH0 and PH1 denote the probability distributions under the null and the alternative hypothesis, respectively. If R(An) 0 as n , then we say that An solves the detection problem. Our primary goals in this paper are:

Huleihel, Mazumdar and Pal

To characterize the statistical limits of QPDS(n, k, p, q, Q), namely, to derive necessary and suﬃcient conditions for when its statistically impossible and statistically possible to solve the detection problem, ignoring algorithmic computational constraints.

To devise eﬃcient polynomial-time algorithms for QPDS(n, k, p, q, Q).

1.2 Related Work and Main Contributions

The problem of ﬁnding cliques in an Erd os-R enyi random graph under the same edge query model was considered in Feige et al. (2020). It was shown that under certain limitations on the adaptivity of the considered class of algorithms, any algorithm that makes Q = O(nα) adaptive edge queries, with α < 2, in ℓrounds ﬁnds cliques of size at most (2 ϵ) log2 n where ϵ = ϵ(α, ℓ) > 0. This lower bound should be contrasted with the fact that current state-ofthe-art algorithms that make Q = O(nα) queries ﬁnd a clique of size approximately (1 + α/2) log2 n. This result was later improved in Alweiss et al. (2020), where the dependency of ϵ on ℓwas removed. Closing the gap between those bounds seems to be a challenging open problem. We also mention Ferber et al. (2015, 2017); Conlon et al. (2018), which study the problems of ﬁnding a Hamilton cycle, long paths, and a copy of a ﬁxed target graph, in sparse random graphs under the adaptive edge query model. Another recent line of active research is the analysis of the query complexity in certain clustering tasks, such as, the stochastic block model and community detection Mazumdar and Saha (2017a,b); Vinayak and Hassibi (2016); Hartmann et al. (2016); Anagnostopoulos et al. (2016). Most closely related to our paper are R acz and Schiﬀer (2020); Mardia et al. (2020), where the special case of the planted clique model, where p = 1 and q = 1/2, under the adaptive edge query model, was considered. For this model, assuming unbounded computational resources, upper and lower bounds on the query complexity for both detecting and recovery were established. Speciﬁcally, it was shown in R acz and Schiﬀer (2020) that no algorithm that makes at most Q = o(n2/k2) adaptive queries to the adjacency matrix of Gn is likely to solve the detection problem. On the other hand, when k (2 + ϵ) log2 n, for any ϵ > 0, it was shown in R acz and Schiﬀer (2020) that there exists an algorithm (not polynomial time) that solves the detection problem by making at least Q = (2 + ϵ)n2

k2 log2 2 n adaptive queries. For the recovery task, it was shown in R acz and Schiﬀer (2020) that no algorithm that makes at most Q = o(n2/k2 + n) adaptive queries exists, while recovery is possible using Q = o(n2/k2 log2 2 n + n log2 n) adaptive queries. Note that when the whole graph is shown to the algorithm, namely, Q = n 2 , then the above detection upper bound boils down to k > (2 + ϵ) log2 n, which is folklore and well-known to be tight. On the other hand, the above detection lower bound gives k = O(1), which is loose. Sub-linear time algorithms that ﬁnd the planted clique in the regime k = ω( n log log n) were proposed in Mardia et al. (2020). Speciﬁcally, among other things, it was shown that a simple and eﬃcient algorithm can detect the planted clique using Q = O(n3

k3 log3 n) non-adaptive queries; conversely using the planted clique conjecture, it was shown that a certain class of non-adaptive algorithms cannot detect the planted clique if Q = o(n3

k3 ), suggesting that in the regime where n2

k3 polynomial-time algorithms do not exist. In this paper, we generalize and strengthen the results of R acz and Schiﬀer (2020), resolving two open questions raised in the same paper. First, we consider the more general PDS model which allows for arbitrary edge probabilities. While this might seem as a rather

Random Subgraph Detection Using Queries

incremental contribution, it turns out that the lower bounding techniques used in R acz and Schiﬀer (2020) are quite weak and result in loose bounds on the query complexity for the PDS model. In a nutshell, the main observation in the proof of the lower bound in R acz and Schiﬀer (2020) is that if Q n2/k2, then with high probability all edge queries will fall outside the planted clique, no matter how strong/sophisticated the query mechanism is. Therefore, detection would be impossible. While the same bound holds for the PDS model as well, it does not capture the intrinsic dependency of the query complexity on the edge densities p and q. More importantly, as was mentioned above, even for the planted clique problem, the results of R acz and Schiﬀer (2020) exhibit a polylog gap between the upper and lower bounds on the query complexity. We close this gap by providing asymptotically tight bounds. Speciﬁcally, we show that an algorithm that must makes Q = Ω( n2 k2χ4(p||q) log2 n) non-adaptive queries to the adjacency matrix of the graph to be able to detect the planted subgraph, where χ2(p||q) is the chi-square distance. On the other hand, we devise a quasi-polynomial-time combinatorial algorithm that detects the planted subgraph with high probability by making Q = O( n2 k2χ4(p||q) log2 n) non-adaptive queries. For the lower bound, we derive ﬁrst high probability lower and upper bounds on the number of edge queries C that fall inside the planted subgraph, associated with the optimal query mechanism. Then, we develop a general information-theoretic lower bound on the risk of any algorithm that is given those Q queries, and essentially observes a subgraph of the PDS model, with a planted signal that is an arbitrary sub-structure of the original planted subgraph on C edges. We also propose a polynomial-time algorithm which is able to detect the planted subgraph using Q = Ω( n3 k3χ2(p||q) log3 n) queries. In the leftover regime, where

k3 and k = ω( n), we conjecture that no polynomial-time algorithms exist for detection. Finally, as we discuss later in the paper, the generality of our techniques allows for arbitrary planting and noise distribution P and Q, respectively, where the PDS model boils down to P = Bern(p) and Q = Bern(q).

1.3 Main Results

The following theorem determines (up to a constant factor) the number of queries necessary to solve the query-PDS detection problem. Recall that χ2(p||q) and d KL(p||q) denote the chi-square distance and the Kullback-Leibler (KL) divergence between two Bern(p) and Bern(q) random variables, respectively. Note that χ2(p||q) = (p q)2

Theorem 3 (Detecting a planted densest subgraph) Consider the QPDS(n, k, p, q) detection problem, and let ϵ > 0 be arbitrary. The following statements hold.

1. (Non-adaptive lower-bound) The risk of any algorithm An that makes at most Q nonadaptive edge queries, is R(An) 1 o(1), if

Q < (2 ϵ) n2

k2χ4(p||q) log2 n

2. (Statistical suﬃciency) Suppose that k (2 + ϵ0) log n d KL(p||q), for some ϵ0 > 0. It is possible to detect the presence of a planted densest subgraph, i.e, R(An) o(1), by

Huleihel, Mazumdar and Pal

Q (2 + ϵ) n2

k2d2 KL(p||q) log2 n

queries. Moreover, the queries can be non-adaptive.

3. (Computational suﬃciency) Suppose that k = Ω( p

n log n/χ2(p||q)). There exists a polynomial-time algorithm An that can detect the presence of a planted densest subgraph, i.e, R(An) o(1), by making

k3χ2(p||q) log3 n (5)

queries. Moreover, the queries can be non-adaptive.

The proof of Theorem 3 is given in Sections 2 and 3. Let us describe brieﬂy the algorithms achieving the query complexities in the second and third items of Theorem 3. The test achieving the query complexity in the second item of Theorem 3 is the scan test. In the ﬁrst step of this algorithm we subsample a set S [n] of M N elements, drawn uniformly at random, and then take all pairwise queries among those elements, resulting in Q = M 2

non-adaptive queries. Observe that at the end of this ﬁrst step we, in fact, observe the subgraph GS of Gn induced by S. Now, given the responses to those queries, in order to distinguish between hypotheses H0 and H1, we search for the densest subgraph (in the sense of the number of active edges) of a certain size in GS, and then compare the result to a carefully chosen threshold. The test achieving the query complexity in the third item of Theorem 3, on the other hand, is the degree test which, roughly speaking, counts the number of high degree vertices (i.e., vertices with degrees exceeding a certain threshold) in a randomly chosen induced subgraph of Gn, and then compare the result to a certain threshold. A very similar variant of this degree test was proposed and analyzed in Mardia et al. (2020), for the planted clique problem. It is clear that the combinatorial scan test is computationally expensive (super-polynomial), while the degree test has a polynomialtime complexity. Interestingly, adaptivity is not needed in order to achieve the statistical barrier. It should be emphasized that the condition k (2 + ϵ0) log n d KL(p||q) in the second part

of Theorem 3 is essential because, otherwise, if k < (2 ϵ0) log n d KL(p||q) detection is known to be statistically impossible even if we observe/query the whole graph. Note that the lower and upper bound for the non-adaptive case in (3) and (4) are tight up to a constant factor; the former depend on p and q through the chi-square distance, while the latter through the KL divergence. As we show in the proof, an alternative condition for the scan test to succeed in detection is

Q (2 + ϵ) Cn2

k2χ4(p||q) log2 n

for some constant C 8; the above condition meets the lower bound in (3) up to the constant factor C. We strongly believe that the source of this negligible gap is due to the lower bound, namely, the χ4(p||q) factor in (3) can be in fact replaced with d2 KL(p||q). It

Random Subgraph Detection Using Queries

should be emphasized, however, that in the special case of planted clique where p = 1 and q = 1/2, we have χ4(p||q) = d2 KL(p||q) = 1, and thus our bounds in (3) and (4) are tight and fully characterize the statistical limit of detection. This closes the gap in R acz and Schiﬀer (2020). As can be noticed from the second and third items of Theorem 3, there is a signiﬁcant gap between the query complexity of the optimal algorithm and that of the computationally eﬃcient one. This observation raises the following intriguing question: what is the sharp condition on (n, k, p, q, Q) under which the problem admits a computationally eﬃcient test with vanishing risk, and conversely, without which no algorithm can detect the planted dense subgraph reliably in polynomial-time? The gap observed in our problem is common to many contemporary problems in high-dimensional statistics studied over the last few years. Indeed, recently, there has been a success in developing a rigorous notion of what can and cannot be achieved by eﬃcient algorithms. Speciﬁcally, a line of work initiated in Berthet and Rigollet (2013) has aimed to explain these statistical-computational gaps by reducing from conjecturally hard average-case problems in computer science, most notably, the planted clique problem, conjectured to be computationally hard in the regime k = o( n). Accordingly, such reductions from planted clique were established to prove tight statistical-computational gaps for a wide verity of detection and recovery problems, e.g., Berthet and Rigollet (2013); Ma and Wu (2015); Cai et al. (2017); Hajek et al. (2015); Wang et al. (2016a,b); Gao et al. (2017); Brennan et al. (2018, 2019); Wu and Xu (2020); Brennan and Bresler (2020). As mentioned above, it is widely believed that the planted clique detection problem cannot be solved in randomized polynomial time when k = o( n), which we shall refer to as the planted clique conjecture, stated as follows. Below, we let (HPC 0 , HPC 1 ) denote the planted clique detection problem, and we recall that it is a special case of the PDS detection problem with edge probabilities p = 1 and q = 1/2; for simplicity of notation we designate PC(n, k) = PDS(n, k, 1, 1/2).

Conjecture 4 (Planted clique conjecture) Suppose that {An} is a sequence of randomized polynomial-time algorithms An and {kn} is a sequence of positive integers satisfying that lim supn log kn

log n < 1/2. Then, if Gn is an instance of PC(n, k), it holds that

h PHPC 0 (An(Gn) = 1) + PHPC 1 (An(Gn) = 0) i 1. (7)

Going back to our problem, it is clear that for k = o( n), and any 1 Q n 2 solving QPDS(n, k, Q, p, q) is computationally hard, namely, there exists no randomized polynomialtime adaptive algorithm that makes up to Q edge queries and is able to solve the detection problem. Accordingly, Theorem 3 and the planted clique conjecture give a partial phase diagram for when detection is statistically impossible, computational hard, and computational easy. The union of the last two regime is the statistically possible regime. Treating k and Q as polynomials in n, i.e., Q = Θ(nα) and k = Θ(nβ), for some α (0, 2) and β (0, 1), we obtain the phase diagram in Fig. 1. Speciﬁcally,

1. Computationally easy regime (blue region): there is a polynomial-time algorithm for the detection task when α > 3 3β and β > 1/2.

Huleihel, Mazumdar and Pal

Statistically

Computationally Easy

Figure 1: Phase diagram for detecting the presence of a planted dense subgraph, as a function of the dense subgraph size k = Θ(nβ) and the number of non-adaptive edge queries Q = Θ(nα).

2. Computationally hard regime (red region): there is an ineﬃcient algorithm for detection when β < 1/2 and 2 2β < α, but the problem is computationally hard (no polynomial-time algorithm exists) in the sense that it is at least as hard as solving the planted clique problem.

3. Conjecturally hard regime (black region): there is an ineﬃcient algorithm for detection when 2 2β < α < 3 3β, but we conjecture that there is no polynomial-time algorithm. This was also conjectured in Mardia et al. (2020).

4. Statistically impossible regime: the task is statistically/information-theoretically impossible when α < 2 2β.

It turns out that our techniques allows for a more general submatrix detection problem with arbitrary planting and noise distribution P and Q, respectively, deﬁned as follows.

Deﬁnition 5 (General submatrix detection) Given a pair of distributions (P, Q) over a measurable space (X, B), let SD(n, k, P, Q) denote the hypothesis testing problem with observation X X n n and hypotheses

H0 : X Q n n vs. H1 : X D(n, k, P, Q), (8)

where D(n, k, P, Q) is the distribution of symmetric matrices X with entries Xij P if i, j K and Xij Q otherwise that are conditionally independent given K, which is chosen uniformly at random over all k-subsets of [n].

Random Subgraph Detection Using Queries

Now, consider the following natural generalization of the edge query oracle in Deﬁnition 6 to the above submatrix problem.

Deﬁnition 6 (Oracle/Entries queries) Consider an n n symmetric matrix X Rn n. An oracle Osub : [n] [n] R, takes as input a pair of vertices i, j [n] [n], and outputs Osub(i, j) = Xij in response.

Then, we deﬁne the query-submatrix detection problem as follows.

Problem 2 (Query-submatrix detection problem) Consider the submatrix detection problem in Deﬁnition 5. There is an oracle Osub as deﬁned in Deﬁnition 6. Find a set of queries Q [n] [n] such that Q = |Q|, and from the oracle answers it is possible to solve the detection problem in (8). Henceforth, we denote this detection problem by QSD(n, k, P, Q, Q).

It is clear that the PDS model correspond to P = Bern(p) and Q = Bern(q), while X is the graph adjacency matrix. Then, it can be shown that under mild conditions on the tails of the distribution P and Q, Theorem 3 holds for the setting in Deﬁnition 5 with χ2(p||q) and d KL(p||q) replaced by χ2(P||Q) and d KL(P||Q), respectively. Speciﬁcally, the lower bound in Theorem 3 holds whenever 0 < χ2(P||Q) < , and the upper bounds hold if, for example, the log-likelihood ratio L log d P

d Q is sub-Gaussian, or, even slightly weaker; if there is a constant C 1 such that

ψP(λ) d KL(P Q) λ C d KL(P Q) λ2 λ [ 1, 0], (9a)

ψQ(λ) + d KL(Q P) λ C d KL(Q P) λ2 λ [ 1, 1], (9b)

where ψQ log EQ[exp(λL)] and ψP log EP[exp(λL)]. Note that the above assumption was considered also in Hajek et al. (2017); Brennan et al. (2019), and arise naturally from classical binary hypothesis testing, in order to control the tails of the error probabilities associated with the simple tests we propose.

2. Algorithms and Upper Bounds

In this section we prove items 2 and 3 in Theorem 3. To that end, we propose two algorithms whose performance match (4) (5). Below, we denote the adjacency matrix of the underlying graph Gn by A {0, 1}n n, with its (i, j) entry denoted by Aij, for any 1 i, j n.

2.1 Scan Test

In this subsection we analyzed the scan test in Algorithm 1. The parameters M, N0, and τscan in Algorithm 1 will be speciﬁed below. In the ﬁrst step of the scan test we subsample a set S [n] of M N elements, drawn uniformly at random, and take all pairwise queries among these elements. Therefore, the number of queries is Q = M 2 . Given these queries we, in fact, learn the induced subgraph, denoted by GS, on the set of vertices S. Then, using this subgraph, we wish to distinguish between H0 and H1. Recall that K denotes the set of vertices over which the densest subgraph was planted under H1, and let N denote the number of planted dense subgraph vertices in S, i.e., N |K S|. Since K and S are drawn

Huleihel, Mazumdar and Pal

Algorithm 1 Scan Test

Require: Gn, Q = M 2 , N0 = (1 ϵ) k M

n , and τscan = N0 2 γ, for ϵ (0, 1) and γ [q, p]. 1: Subsample a set S of M elements drawn uniformly at random from [n]. 2: Take all pairwise queries among the elements in S, and obtain Aij, for all i, j S. 3: Compute Sscan max L S:|L|=N0

4: If Sscan > τscan decide H1; otherwise, decide H0.

uniformly at random from all sets of size k and M from [n], respectively, we observe that N Hypergeometric(n, k, M), namely, N has a Hypergeometric distribution with parameters n, k, and M. Accordingly, we have that E(N) = k M

n , and var(N) k M

n (1 k/n). Therefore, Chebyshev s inequality implies that

P [N (1 ϵ)E(N)] 1 ϵ2E(N). (10)

Thus, provided that ϵ2E(N) , we see that with probability tending to unity N (1 ϵ)k M

n N0. The implication of this is the following: as mentioned above we are tasked with the following detection problem:

H 0 : GS G(M, q) vs. H 1 : GS G(M, N, p, q). (11)

Therefore, (11) represents a PDS(M, N, p, q) detection problem, but the size of the planted densest subgraph is random. Nonetheless, due to (10), it should be clear that by replacing N with N0 in (11), the detection problem becomes algorithmically harder; thus upper bounds on PDS(M, N0, p, q) imply corresponding upper bounds on PDS(M, N, p, q). Below, we prove this rigorously.

Recall that A is the adjacency matrix of Gn. The scan test is deﬁned as follows

Ascan(AS) 1

max L S:|L|=N0

i<j L Aij τscan

where τscan N0 2 γ, for some γ [q, p]. Under the null hypothesis, for any ﬁxed subset

L of size N0, it is clear that P i<j L Aij Binomial N0 2 , q . By the union bound and

Random Subgraph Detection Using Queries

Chernoﬀ s inequality,

PH 0 (Ascan(AS) = 1) = PH 0

max L S:|L|=N0

i<j L Aij τscan

L S:|L|=N0 PH 0

i<j L Aij τscan

P Binomial N0 2

N0 exp N0 2

d KL(γ||q) (16)

= exp N0 log e M

d KL(γ||q) . (17)

Under the alternative hypothesis, conditioned on N = N , for some N N0, the scan test statistics max L S:|L|=N0 P i<j L Aij stochastically dominates Binomial N0 2 , p . By the multiplicative Chernoﬀ s bound,

PH 1 (Ascan(AS) = 0) = PH 1

max L S:|L|=N0

i<j L Aij τscan

P (N N0) + P Binomial N0 2

1 ϵ2EN0 + exp

which clearly goes to 0 since N0 . Therefore, combining (17) and (20) we see that R(Ascan) δ, if

d KL(γ||q) > 2

2n Qk log n

By taking γ arbitrary close to p, we get the query complexity bound in (4), for arbitrary ϵ > 0. Also, note that the condition k (2 + ϵ0) log n d KL(p||q), for some ϵ0 > 0, in the second part of Theorem 3 follows from the constraint that M n. As we mentioned right after Theorem 3, a weaker bound exhibiting a dependency on the chi-square distance can be

Huleihel, Mazumdar and Pal

derived. Speciﬁcally, let τscan = N0 2 p+q

2 . By the union bound and Bernstein s inequality,

PH 0 (Ascan(AS) = 1) = PH 0

max L S:|L|=N0

i<j L Aij τscan

L S:|L|=N0 PH 0

i<j L Aij τscan

P Binomial N0 2

N0 2 2(p q)2/4

2 N0 2 q + N0 2 (p q)/3

= exp N0 log e M

N0 CN2 0 (p q)2

for some constant C > 8, and note that (p q)2

q(1 q) = χ2(p||q). Similarly to (20), under the alternative hypothesis, conditioned on N = N , for some N N0, the scan test statistics max L S:|L|=N0 P i<j L Aij stochastically dominates Binomial N0 2 , p . By the multiplicative Chernoﬀ s bound,

PH 1 (Ascan(AS) = 0) = PH 1

max L S:|L|=N0

i<j L Aij τscan

P (N N0) + P Binomial N0 2

1 ϵ2EN0 + exp

2 N0 2 N0 2 2 (p q)2

= 1 ϵ2EN0 + exp CN2 0q , (30)

which converges to 0. Therefore, combining (26) and (30) we see that R(Ascan) δ, if

χ2(p||q) Ω 1

= Ω n Qk log n

2.2 Degree Test

In this subsection we analyze the degree test in Algorithm 2. The parameters n , M, and N0 in Algorithm 2 will be speciﬁed below. Speciﬁcally, let S denote a set of n N vertices drawn uniformly at random from [n], and GS be the subgraph of G induced by this set. Then, we subsample M N elements from S, and denote those set of elements by M. Our

Random Subgraph Detection Using Queries

Algorithm 2 Degree Test

Require: Gn, n = Ω n2 log n k2χ2(p||q) , M = Ω n

k log2 n , and N0 = (1 ϵ) kn

n , for ϵ (0, 1).

1: Let S denote a set of n vertices drawn uniformly at random from [n], and GS be the subgraph of G induced by S. 2: Subsample M elements uniformly at random from S, and denote those set of elements by M. 3: Compute

j S Aij > n q + N0(p q)

4: If Cdeg > 2 log n decide H1; otherwise, decide H0.

test is deﬁned as follows,

j S Aij > τdeg

where τdeg n q + N0(p q)

2 , and N0 N. As in the previous subsection, under H1, let N denote the number of planted dense subgraph vertices in S, i.e., N |K S|, and note that N Hypergeometric(n, k, n ). We have shown that with probability tending to unity N (1 ϵ)kn

n N0. Let this event be denoted by C1, and so P(C1) 1. Next, we analyze the Type-I+II error probability associated with the above test. Let us start with the null hypothesis. For any i M, we let Xi 1 h P j S Aij > n q + N0(p q)

Bern P h P j S Aij > n q + N0(p q)

2 i . Under H0, it is clear that P j S Aij Binom (n , q), and thus by Bernstein s inequality

j S Aij > n q + N0(p q)

exp Ω N2 0 n χ2(p||q) . (33)

Accordingly, if n < N2 0χ2(p||q)

C log n , which implies that n > Ω n2 log n k2χ2(p||q) we will have

PH0 h P j S Aij > n q + N0(p q)

2 i n 2, for some C > 0. This implies that under H0,

each Xi is stochastically dominated by Bern(n 2). Consider now the alternative hypothesis. Recall that conditioned on C1, over the induced subgraph GS with n vertices there are at least N0 vertices from the planted set. Next, we show that by subsampling M vertices M from GS (as in the second step of Algorithm 2), with high probability among those M samples at least 3 log n vertices fall inside the planted set. To that end, let us divide the subsampled planted vertices K S into 3 log n disjoint sets {Si}3 log n i=1 of equal size N0 3 log n . Let E denote the event that there exist a set Si which do not intersect M, namely,

Huleihel, Mazumdar and Pal

E S3 log n i=1 {Si M = }. Then, note that

P [E] (3 log n ) P [S1 M = ] (34)

= (3 log n ) 1 N0 3n log n

(3 log n ) exp N0M 3n log n

namely, the probability that there exists a set Si which do not intersect M is at most (36). Thus, taking M 6n log2 n

N0 , i.e., M = Ω n

k log2 n we obtain that P [E] 3 log n /n 2, and accordingly, with probability tending to 1, we sample at least one elements from each Si. Therefore, among the M samples in M at least 3 log n vertices fall inside the planted set. Now, for any i M (K S), we note that P j S Aij stochastically dominates Binom (N0, p) + Binom (n N0, q), and thus conditioned on Ec and C1, by the multiplicative Chernoﬀ s bound,

j S Aij n q + N0(p q)

exp Ω N2 0 n χ2(p||q) . (37)

Accordingly, if n < N2 0χ2(p||q)

C log n , we will have PH1 h P j S Aij n q + N0(p q)

2 i n 2, for large enough C > 0. This implies that under H1, conditioned on Ec and C each Xi stochastically dominates Bern(1 n 2), for i M (K S). Establishing the above results, we are in a position to upper bound the Type-I and Type-II error probabilities. Using Markov s inequality, we have

PH0 (Adeg(AS) = 1) = PH0

i M Xi > 2 log n #

P Binom M, n 2 > 2 log n (39)

M 2n2 log n , (40)

which clearly goes to 0 as n . On the other hand,

PH1 (Adeg(AS) = 0) = PH1

i M Xi < 2 log n #

P Binom 3 log n , 1 n 2 < 2 log n + P (E) + P (Cc) (42)

e C log n + o(1) 0, (43)

as n . Finally, note that the number of queries made by Algorithm 2 is Q = M n > Ω n3 k3χ2(p||q) log3 n and the constraint that n < n implies that k Ω( p

n log n/χ2(p||q)). This concludes the proof.

Random Subgraph Detection Using Queries

3. Statistical Lower Bound

In this section, we prove the lower bounds in Theorem 3, for non-adaptive mechanisms. Our proof consists of two main steps. In the ﬁrst step, we bound the total number of planted edges any non-adaptive mechanism Qn can query. We denote this number of planted queries by C (Qn). Given Qn, in the second step of the proof, we analyze the resulting detection problem a test An is faced with, and derive a lower bound on its risk. Note that in order to prove that detection is statistically impossible whenever (3) holds, it is suﬃce to prove impossibility on the boundary, i.e., when Q = Q (2 ϵ) n2 k2χ4(p||q) log2 n, for non-adaptive mechanisms. Indeed, if detection is impossible when Q = Q , then with less queries Q < Q

detection remains impossible.

3.1 Upper Bound on the Query-Number of Planted Edges

In this subsection, we bound the total number of planted edges any mechanism can query. As mentioned above, this number is denoted by C (Qn) for a given query mechanism Qn. Letting Q denote the query trajectory (i.e., the set of Q queried edges), we note that C (Qn) = PQ ℓ=1 1 {Qℓ K K}. After making Q queries, a testing algorithm produces an decision. Speciﬁcally, given a trajectory Q (or, a mechanism Qn), we denote by (H 0, H 1) the new hypothesis testing problem faced by a testing procedure: under H 0 we observe a subgraph over Q with Bern(q) independent edges, while under H 1 we observe a subgraph over Q edges, where there exists a set of C edges that belong to the planted densest subgraph over K (i.e., Bern(p) random edges), and the remaining edges are independent Bern(q) random variables. We denote the respective null and alternative distributions by PH 0 and PH 1, respectively. We have the following result.

Lemma 7 (Total number of planted edge queries) Assume that Q satisﬁes the condition in the ﬁrst item of Theorem 3, and ﬁx δ (0, 1). Let Qn be any algorithm that makes at most Q non-adaptive edge queries. Then, with probability at least 1 δ,

Remark 8 Note that on the boundary Q = Q , we have n k Q = o(1), as n . Therefore,

(44) reduce to C (Qn) Q k2

n2 (1 + o(1)).

Proof [Proof of Lemma 7] We will start by proving Lemma 7 for deterministic query mechanisms, and then, we address the more general case of randomized algorithms. Since queries are made upfront, they are statistically independent of Gn. Accordingly, let Xe, for e Q, denote an indicator random variable such that Xe = 1 if e K K, and zero otherwise. Then,

e Q Xe. (45)

Huleihel, Mazumdar and Pal

It is clear that P(Qℓ K K) = (k 2) (n 2) = k(k 1)

n2 , and thus,

k 2 n 2 Q = k(k 1)

n(n 1)Q L, (46)

and furthermore,

E[C (Qn)]2 = E X

e,e Q Xe Xe (47)

e Q Xe + E X

e =e Q Xe Xe (48)

n(n 1)Q + k(k 1)

e =e Q P e K K|e K K , (49)

where the second equality follows the fact that X2 e = Xe, and in the last equality we note

that EXe = P(e K K) = (k 2) (n 2). Fix a pair e = (ie 1, je 1), and consider a pair e = (ie 1 , je 1 ). We

decompose the summation in (49) into two parts. In the ﬁrst part, the edges e and e are completely disjoint, namely, the do not share a common vertex. We denote this by e e .

In this case, P [e K K|e K K] = (k 2 2 ) (n 2 2 ) = (k 2)(k 3)

(n 2)(n 3). On the other hand, if e and e

share exactly one common vertex, then, P [e K K|e K K] = (k 1 2 ) (n 1 2 ) = (k 1)(k 2)

(n 1)(n 2). It

is not diﬃcult to show that (k 2)(k 3)

(n 2)(n 3) k(k 1)

n(n 1) and (k 1)(k 2)

(n 1)(n 2) k(k 1)

n(n 1), for k < n, and n = ω(1). Therefore,

e =e Q P e K K|e K K k2(k 1)2

n2(n 1)2 Q2, (50)

E[C (Qn)]2 k(k 1)

n(n 1)Q + k2(k 1)2

n2(n 1)2 Q2. (51)

Using the above we ﬁnally get that,

Var (C (Qn) k(k 1)

n(n 1)Q + k2(k 1)2

n2(n 1)2 Q2 k2(k 1)2

n2(n 1)2 Q2 (52)

n(n 1)Q = L. (53)

Chebyshev s inequality implies that, for any ϵ > 0,

P[C (Qn) (1 + ϵ) L] 1 ϵ2 L. (54)

Random Subgraph Detection Using Queries

Thus, with probability at least 1 δ,

Finally, for randomized query mechanisms, we can condition ﬁrst on the additional source of randomness R, and apply our arguments above for deterministic query mechanisms, to prove (44), independently of the realization of R. Then, by taking an expectation over R the proof is concluded.

3.2 Hypothesis Test Over the Queried Subgraph

Provided with the Q edge queries probed by an arbitrary non-adaptive query mechanism Qn, we now describe the hypothesis testing problem the detection algorithm An is faced with, and derive a statistical lower bound on its performance. Recall that the overall distinguishing algorithm is a decomposition An Qn. While we do not specify the distribution of C , we derived in the previous subsection a high probability upper bound on it. Below, we denote the data that is being supplied to the detection algorithm by a vector X {0, 1}Q

of length Q, with each entry representing a queried edge. While it makes more sense to represent this data using a matrix (which in turn is a sub-matrix of the original adjacency matrix), we ﬁnd it more convenient to work with the above vector notation. With some abuse of notation, let (H 0, H 1) denote the hypothesis testing problem faced by An, followed by the best edge query mechanism. Speciﬁcally, let P Bern(p) and Q Bern(q). Under H 0 it is clear that X PH 0 where PH 0 Q Q is the distribution of a product of Q Bernoulli random variables Bern(q). Under H 1, the situation is a bit more complicated; we let X PH 1, and the distribution PH 1 is deﬁned as follows. The query trajectory Q completely determines how the query mechanism operates for any realization of K. Accordingly, conditioned on K, we let WK K denote the set of edge queries that fall inside the planted set K K, associated with the best query mechanism. Then, the alternative distribution is constructed as follows:

1. We pick k vertices uniformly at random from [n], and denote the obtained set by K (this is the original set of vertices over which the subgraph was planted).

2. Any two vertices in K are connected with probability p.

3. Let WK K K K be the set of queried edges that fall inside K K, and denote its size by |WK K| = C .

4. The elements of X are the edges in WK K, and outside this set the edges/elements are drawn i.i.d. Bern(q).

Note that due to the result in the previous section, such a set WK K of random size C exists, and that the above hypothesis testing problem is simple. Below, we let Unifn,k be the uniform measure over sets of size k drawn u.a.r. from [n]. We would like to derive a lower bound on the above testing problem. Speciﬁcally, let R(An) denote the risk of An, i.e., the sum of the Type-I and Type-II error probabilities associated with the detection

Huleihel, Mazumdar and Pal

algorithm An. We start by using the following standard lower bound on the risk R(An) of any algorithm (Tsybakov, 2008, Theorem 2.2),

R(An) 1 TV(PH 0, PH 1). (56)

Next, recall that in the previous subsection we have proved that P(C > L ) δ, where L Q k2

n2 (1 + o(1)). Throughout the rest of the proof, we assume that we are in the regime where L 1 (as it is on the boundary Q = Q ), otherwise, detection is clearly impossible. Indeed, if L < 1, then this implies that the query mechanism do not probe any planted edge. Thus,

TV(PH 0, PH 1) = TV(PH 0, EC PH 1|C ) (57)

EC h TV(PH 0, PH 1|C ) i (58)

ℓ L TV(PH 0, PH 1|C =ℓ)P(C = ℓ), (59)

where the ﬁrst inequality follows from the convexity of (P, Q) 7 TV(P, Q). The above total variation distance can be upper bounded as follows (Tsybakov, 2008, Lemma 2.7)

2TV2(PH 0, PH 1|C =ℓ) χ2(PH 1|C =ℓ, PH 0), (60)

where χ2(PH 1|C =ℓ, PH 0) is the chi-square distance between PH 1|C =ℓand PH 0. It should be clear that this total variation is maximized at the boundary, namely, when ℓ= L . Accordingly, without loss of generality, below we focus on this case and ignore the conditioning on C , namely, we treat PH 1|C as PH 1 with C = L .

Let us evaluate the likelihood function PH 1 PH 0 . Since,

PH 1 = EK Unifn,k h PH 1|K,WK K i , (61)

it is clear that,

PH 1 PH 0 (X) = EK Unifn,k

PH 1|K,WK K

Q Q (X) (62)

= EK Unifn,k

i WK K f(Xi)

χ2(PH 1, PH 0) + 1 = EPH 0

" PH 1 PH 0 (X)

= EK1,K2 Unifn,k EPH 0

Random Subgraph Detection Using Queries

= EK1,K2 Unifn,k EPH 0

i WK1 K1 WK2 K2

i WK1 K1 WK2 K2

= EK1,K2 Unifn,k

i WK1 K1 WK2 K2

EPH 0f2(Xi) Y

i WK1 K1 WK2 K2

= EK1,K2 Unifn,k

i WK1 K1 WK2 K2

EPH 0f2(Xi)

= EK1,K2 Unifn,k h (1 + χ2(P, Q))|WK1 K1 WK2 K2|i , (69)

where the last equality holds since EPH 0f(Xi) = 1 and 1 + χ2(P, Q) = EPH 0f2(Xi). To conclude

χ2(PH 1, PH 0) + 1 = EK1,K2 Unifn,k h (1 + χ2(P, Q))|WK1 K1 WK2 K2|i . (70)

Now, we note that if K1 K2 = then, obviously, |WK1 K1 WK2 K2| = 0, and thus we may focus on sets K1 and K2 such that |K1 K2| > 0. Now, given K1 and K2, the random variable HK1 K2 |WK1 K1 WK2 K2| is clearly upper bounded by L |K1 K2| 2 . Therefore,

χ2(PH 1, PH 0) + 1 EK1,K2 Unifn,k h (1 + χ2(P, Q))L (|K1 K2| 2 )i . (71)

Using (60) we have

2 TV(PH 0, PH 1)2 χ2(PH 1, PH 0) (72)

EK1,K2 Unifn,k h (1 + χ2(P, Q))L (|K1 K2| 2 )i 1 (73)

EK1,K2 Unifn,k

exp L |K1 K2| 2

χ2(P, Q) 1. (74)

Next, notice that H |K1 K2| Hypergeometric(n, k, k), and as so, with this notation, we get

χ2(PH 1, PH 0) EK1,K2 Unifn,k

χ2(P, Q) 1. (75)

We analyze the two possible cases: L > H 2 and L H 2 . Furthermore, in the sequel we assume that k <

2n, and then discuss the complementary region. Starting with the case where L > H 2 , we have

E exp L H 2

χ2(P, Q) 1 L > H 2

Huleihel, Mazumdar and Pal

= E exp H 2

χ2(P, Q) 1 L > H 2

To analyze this, we use the following tail bound on Hypergeometric random variable (see, e.g., Arias-Castro et al. (2014); Wu and Xu (2023)),

P (H h) exp ( k d KL (h/k||ρ)) , (77)

for any h/k ρ, where ρ k/n. Using the deﬁnition of the KL divergence and the identity (1 x) log(1 x) x, we get

d KL (a||b) a log a

Using (78), we note that,

k d KL (h/k||ρ) h log h

Accordingly, we have

χ2(P, Q) 1 L > H 2

h=2 e h(h 1)

2 χ2(P,Q) k d KL(h/k||ρ) (81)

2 χ2(P,Q) log nh

k2 +1 . (82)

For a > 0 ﬁxed, the function x ax log x is decreasing on (0, 1/a) and increasing on (1/a, ). Therefore,

2 χ2(P, Q) log nh

2 χ2(P, Q), log n

Substituting L = k2Q/n2, while noticing that log n

2L k2 = log n

k + O(log log n

k ) = [1 o(1)] log n

k , the second term in the minimum tends to if

Q < (2 ϵ) n2

k2χ4(P, Q) log2 n

Random Subgraph Detection Using Queries

for any ϵ > 0. This is also the case of the ﬁrst term, since

2 χ2(P, Q) = log n

2 χ2(P, Q) +

2 χ2(P, Q) log

with the second diﬀerence bounded from below if χ2(P, Q) is ﬁnite. Hence the sum in (82) converges to zero, and we get

χ2(P, Q) 1 L > H 2

1 + o(1). (87)

Next, we turn to the case where L H 2 . We have

E exp L H 2

χ2(P, Q) 1 L H 2

= exp L χ2(P, Q) P L H 2

where the inequality follows from (77) and (78). Substituting L = k2Q/n2, and noticing that log n

2L k2 = log n

k + O(log log n

k ) = [1 o(1)] log n

k , it is clear that the right hand side of (90) converges to zero as long as

Q < (2 ϵ) n2

k2χ4(P, Q) log2 n

for any ϵ > 0. Accordingly, under this condition (90) converges to zero. Combining (75), (87), and (90), we get that for k <

χ2(PH 1, PH 0) 1 + o(1) 1 = o(1), (92)

provided that (85) and (91) hold (which coincide), as required (see, (3) in Theorem 3). Finally, we mention that the complementary case, where k >

2n follow from almost the same arguments as above, with the following small modiﬁcations; speciﬁcally, in this regime, we use the following tail bound on Hypergeometric random variable

P (H < h) exp ( k d KL (1 h/k||1 ρ)) , (93)

for any h/k < ρ, where ρ = k/n, and we lower bound the KL divergence using,

d KL (a||b) a + (1 a) log 1 a

Huleihel, Mazumdar and Pal

Then, repeating the same steps above it can be shown that now the lower tail term is asymptotically small, namely,

χ2(P, Q) 1 L > H 2

while the upper tail term is,

E exp L H 2

χ2(P, Q) 1 L H 2

1 + o(1), (96)

and thus χ2(PH 1, PH 0) 1 + o(1) 1 = o(1), provided that the same conditions as in (85) and (91) hold. This concludes the proof (see, (3) in Theorem 3).

4. Conclusion and Outlook

In this paper, we formulated and analyzed a variant of the classical PDS problem, where one can only observe a small part of the graph using non-adaptive edge queries. This problem is relevant, for example, when access to the edges (connections) between vertices (individuals) may be scarce due to privacy concerns. For this model, we derived the number of queries necessary and suﬃcient for detecting the presence of the planted subgraph, up to a constant factor. For the special case of planted cliques, our results are completely tight. This work also has left number of speciﬁc problems open, including the following:

It would be quite interesting to provide any concrete evidence for our conjectured statistical-computational gap either by means of an average-case reduction from the planted clique problem, or failure of classes of powerful algorithms (such as, sum-ofsquares hierarchy, low-degree polynomials, etc.), below the computational barrier.

Our bounds are almost tight in the sense that there is a multiplicative constant gap between our lower and upper bounds. As was mentioned in the paper, we believe that the source for this gap is our lower bound; the χ4(p||q) factor should be in fact d2 KL(p||q). We suspect that one way to prove this is by applying a truncation procedure on the likelihood analysis when deriving the lower bound on the risk.

In this paper, we analyzed the regime where the edge probabilities p and q are ﬁxed and independent of n. The regime where p and q depend on n, e.g., both decay polynomially in n, is quite important and challenging. It would be interesting to ﬁnd the phase diagram for this case. Note that here χ2(p||q) is not a constant anymore, but decays with n polynomially fast.

While in this paper we have focused on the detection problem, it is interesting to consider the recovery and partial recovery variants of our setting as well.

Acknowledgments

The authors would like to thank the editor, Andrea Montanari, and the anonymous referee for their suggestions which helped improving the content of this paper. The work of W.

Random Subgraph Detection Using Queries

Huleihel was supported by the ISRAEL SCIENCE FOUNDATION (grant No. 1734/21). The work of A. Mazumdar was supported by NSF under Award 2217058, and Award 2133484.

Ryan Alweiss, Chady Ben Hamida, Xiaoyu He, and Alexander Moreira. On the subgraph query problem. Combinatorics, Probability and Computing, page 1 16, Jul 2020.

A. Anagnostopoulos, Jakub Lacki, Silvio Lattanzi, S. Leonardi, and Mohammad Mahdian. Community detection on evolving graphs. In NIPS, 2016.

Ery Arias-Castro, Nicolas Verzelen, et al. Community detection in dense random networks. The Annals of Statistics, 42(3):940 969, 2014.

Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30, pages 1046 1066, 12 14 Jun 2013.

Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Proceedings of Thirty Third Conference on Learning Theory, volume 125, pages 648 847, 09 12 Jul 2020.

Matthew Brennan, Guy Bresler, and Wasim Huleihel. Reducibility and computational lower bounds for problems with planted sparse structure. In COLT, pages 48 166, 2018.

Matthew Brennan, Guy Bresler, and Wasim Huleihel. Universality of computational lower bounds for submatrix detection. In COLT, 2019.

Cristina Butucea and Yuri I Ingster. Detection of a sparse submatrix of a high-dimensional noisy matrix. Bernoulli, 19(5B):2652 2688, 2013.

Tony Cai, Tengyuan Liang, and Alexander Rakhlin. Computational and statistical boundaries for submatrix localization in a large noisy matrix. Annals of Statistics, 45(4): 1403 1430, 08 2017.

Utkan Onur Candogan and Venkat Chandrasekaran. Finding planted subgraphs with few eigenvalues using the schur horn relaxation. SIAM Journal on Optimization, 28(1):735 759, 2018.

Yudong Chen and Jiaming Xu. Statistical-computational tradeoﬀs in planted problems and submatrix localization with a growing number of clusters and submatrices. Journal of Machine Learning Research, 17(27):1 57, 2016.

D. Conlon, J. Fox, A. Grinshpun, and X. He. Online ramsey numbers and the subgraph query problem. ar Xiv: Combinatorics, pages 159 194, 2018.

Uriel Feige, David Gamarnik, Joe Neeman, Mikl os Z. R acz, and Prasad Tetali. Finding cliques using few probes. Random Structures & Algorithms, 56(1):142 153, 2020.

Huleihel, Mazumdar and Pal

Asaf Ferber, Michael Krivelevich, Benny Sudakov, and Pedro Vieira. Finding hamilton cycles in random graphs with few queries. Random Struct. Algorithms, 49, 05 2015. doi: 10.1002/rsa.20679.

Asaf Ferber, M. Krivelevich, B. Sudakov, and Pedro Vieira. Finding paths in sparse random graphs requires many queries. Random Struct. Algorithms, 50:71 85, 2017.

Chao Gao, Zongming Ma, and Harrison H Zhou. Sparse CCA: Adaptive estimation and computational barriers. The Annals of Statistics, 45(5):2074 2101, 2017.

Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semideﬁnite programming. IEEE Transactions on Information Theory, 62(5):2788 2797, 2016.

Bruce Hajek, Yihong Wu, and Jiaming Xu. Information limits for recovering a hidden community. IEEE Transactions on Information Theory, 63(8):4729 4745, 2017.

Bruce E Hajek, Yihong Wu, and Jiaming Xu. Computational lower bounds for community detection on random graphs. In COLT, pages 899 928, 2015.

Tanja Hartmann, A. Kappes, and D. Wagner. Clustering evolving networks. In Algorithm Engineering, 2016.

Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. The Annals of Statistics, 43(3):1089 1116, 2015.

Jay Mardia, Hilal Asi, and Kabir Aladin Chandrasekher. Finding planted cliques in sublinear time. Ar Xiv, abs/2004.12002, 2020.

A. Mazumdar and B. Saha. Clustering with noisy queries. In NIPS, 2017a.

A. Mazumdar and B. Saha. Query complexity of clustering with side information. In NIPS, 2017b.

Andrea Montanari. Finding one community in a sparse graph. Journal of Statistical Physics, 161(2):273 299, 2015.

Mikl os Z. R acz and Benjamin Schiﬀer. Finding a planted clique by adaptive probing. ALEA Latin American Journal of Probability and Mathematical Statistics, 17:775 790, 2020.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519.

Nicolas Verzelen, Ery Arias-Castro, et al. Community detection in sparse random networks. The Annals of Applied Probability, 25(6):3465 3510, 2015.

Ramya Korlakai Vinayak and B. Hassibi. Crowdsourced clustering: Querying edges vs triangles. In NIPS, 2016.

Tengyao Wang, Quentin Berthet, and Yaniv Plan. Average-case hardness of rip certiﬁcation. In Advances in Neural Information Processing Systems, pages 3819 3827, 2016a.

Random Subgraph Detection Using Queries

Tengyao Wang, Quentin Berthet, and Richard J Samworth. Statistical and computational trade-oﬀs in estimation of sparse principal components. The Annals of Statistics, 44(5): 1896 1930, 2016b.

Yihong Wu and Jiaming Xu. Statistical problems with planted structures: Informationtheoretical and computational limits. In Miguel R. D. Rodrigues and Yonina C. Eldar, editors, Information-Theoretic Methods in Data Science. Cambridge University Press, Cambridge, 2020.

Yihong Wu and Jiaming Xu. Statistical inference on graphs: Selected Topics. Lecture notes, 2023.