# community_recovery_in_graphs_with_locality__6a127168.pdf

Community Recovery in Graphs with Locality

Yuxin Chen YXCHEN@STANFORD.EDU Govinda M. Kamath GKAMATH@STANFORD.EDU Changho Suh CHSUH@KAIST.AC.KR David Tse + DNTSE@STANFORD.EDU Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA Department of Electrical Engineering, KAIST, Daejeon 305-701, Korea + Department of EECS, University of California, Berkeley CA 94720, USA

Motivated by applications in domains such as social networks and computational biology, we study the problem of community recovery in graphs with locality. In this problem, pairwise noisy measurements of whether two nodes are in the same community or different communities come mainly or exclusively from nearby nodes rather than uniformly sampled between all node pairs, as in most existing models. We present two algorithms that run nearly linearly in the number of measurements and which achieve the information limits for exact recovery.

1. Introduction

Clustering of data is a central problem that is prevalent across all of science and engineering. One formulation that has received signiﬁcant attention in recent years is community recovery (Girvan & Newman, 2002; Fortunato, 2010; Porter et al., 2009), also referred to as correlation clustering (Bansal et al., 2004) or graph clustering (Jalali et al., 2011). In this formulation, the objective is to cluster individuals into different communities based on pairwise measurements of their relationships, each of which gives some noisy information about whether two individuals belong to the same community or different communities. While this formulation applies naturally in social networks, it has a broad range of applications in other domains including protein complex detection (Chen & Yuan, 2006), image segmentation (Shi & Malik, 2000; Globerson et al., 2015), shape matching (Chen et al., 2014a), etc. See (Abbe & Wainwright, 2015) for an introduction.

In recent years, there has been a ﬂurry of works on designing community recovery algorithms based on idealised generative models of the measurement process. A partic-

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

ular popular model is the Stochastic Block Model (SBM) (Holland et al., 1983; Condon & Karp, 2001), where the n individuals to be clustered are modeled as nodes on a random graph with statistically more edges between nodes within the same community than between nodes across two different communities. A closely related model is the Censored Block Model (CBM) (Abbe et al., 2015), where one obtains noisy parity measurements on the edges of an Erd os-R enyi graph (Durrett, 2007). Both the SBM and the CBM can be uniﬁed into one model with noisy measurements which are randomly sampled on the edges of a complete graph, with the two models differing only in the measurement noise model. Thus, a central assumption underlying both models is that it is equally likely to obtain measurements between any pair of nodes. This is a very unrealistic assumption in many applications: nodes often have locality and it is more likely to obtain data on relationships between nearby nodes than far away nodes. For example, in friendship graphs, individuals that live close by are more likely to interact than nodes that are far away.

This paper focuses on the community recovery problem when the measurements are randomly sampled from graphs with locality structure rather than complete graphs. Our theory covers a broad range of graphs including rings, lines, 2-D grids, and small-world graphs (Fig. 1). Each of these graphs is parametrized by a locality radius r such that nodes within r hops are connected by an edge. We characterize the information limits for community recovery on these networks, i.e. the minimum number of measurements needed to exactly recover the communities as the number of nodes n scales. We propose two algorithms whose complexities are nearly linear in the number of measurements and which can achieve the information limits of all these networks for a very wide range of the radius r. In the special case when the radius r is so large that measurements at all locations are possible, we recover the exact recovery limit identiﬁed by (Hajek et al., 2015a) when measurements are randomly sampled from complete graphs.

It is worth emphasizing that various computationally feasible algorithms (Coja-Oghlan, 2010; Chaudhuri et al., 2012; Chen et al., 2014b; Abbe & Sandon, 2015) have been pro-

Community Recovery in Graphs with Locality

posed for more general models beyond the SBM and the CBM, which accommodate multi-community models, the presence of outlier samples, the case where different edges are sampled at different rates, and so on. Most of these models, however, fall short of accounting for any sort of locality constraints. In fact, the results developed in prior literature often lead to unsatisfactory guarantees when applied to graphs with locality, as will be detailed in Section 3. Another recent work (Chen et al., 2015) has determined the order of the information limits in geometric graphs, with no tractable algorithms provided therein. In contrast, our ﬁndings uncover a curious phenomenon: the presence of locality does not lead to additional computational barriers: solutions that are information theoretically optimal can often be achieved computational efﬁciently and, perhaps more surprisingly, within nearly linear time.

2. Problem Formulation and An Application

2.1. Sampling Model

Measurement Graph. Consider a collection of n vertices V = {1, , n}, each represented by a binary-valued vertex variable Xi {0, 1}, 1 i n. Suppose it is only feasible to take pairwise samples over a restricted set of locations, as represented by a graph G = (V, E) that comprises an edge set E. Speciﬁcally, for each edge (i, j) E one acquires Ni,j samples1 Y (l) i,j (1 l Ni,j), where each sample measures the parity of Xi and Xj. We will use G to encode the locality constraint of the sampling scheme, and shall pay particular attention to the following families of measurement graphs.

Complete graph: G is called a complete graph if every pair of vertices is connected by an edge; see Fig. 1(a).

Line: G is said to be a line Lr if, for some locality radius r, (i, j) E iff |i j| r; see Fig. 1(b).

Ring: G = (V, E) is said to be a ring Rr if, for some locality radius r, (i, j) E iff i j [ r, r] (mod n); see Fig. 1(c).

Grid: G is called a grid if (1) all vertices reside within a n n square with integer coordinates, and (2) two vertices are connected by an edge if they are at distance not exceeding some radius r; see Fig. 1(d).

Small-world graphs: G is said to be a small-world graph if it is a superposition of a complete graph G0 = (V, E0) and another graph G1 = (V, E1) with locality. See Fig. 1(e).

Random Sampling. This paper focuses on a random sampling model, where the number of samples Ni,j taken over (i, j) E is independently drawn and obeys2 Ni,j Poisson (λ) for some average sampling rate λ. This gives

1We adopt the convention that Ni,j 0 for any (i, j) / E. 2All results presented in this paper hold under a related model where Ni,j Bernoulli (λ), as long as |E| n log n and λ 1 (which is the regime accommodated in all theorems).

rise to an average total sample size

(i,j) E E [Ni,j] = λ |E| . (1)

When m is large, the actual sample size sharply concentrates around m with high probability.

Measurement Noise Model. The acquired parity measurements are assumed to be independent given Ni,j; more precisely, conditional on Ni,j,

Y (l) i,j = Y (l) j,i ind. =

( Xi Xj, with probability 1 θ Xi Xj 1, else (2) for some ﬁxed error rate 0 < θ < 1, where denotes modulo-2 addition. This is the same as the noise model in CBM (Abbe et al., 2015). The SBM corresponds to an asymmetric erasure model for the measurement noise, and we expect our results extend to that model as well.

2.2. Goal: Optimal Algorithm for Exact Recovery

This paper centers on exact recovery, that is, to reconstruct all input variables X = [Xi]1 i n precisely up to global offset. This is all one can hope for since there is absolutely no basis to distinguish X from X 1 := [Xi 1]1 i n given only parity samples. More precisely, for any recovery procedure ψ the probability of error is deﬁned as

Pe (ψ) := max X {0,1}n P {ψ(Y ) = X and ψ(Y ) = X 1} ,

where Y := {Y (l) i,j }. The goal is to develop an algorithm whose sample complexity approaches the information limit m (as a function of (n, θ)), that is, the minimum sample size m under which infψ Pe (ψ) vanishes as n scales.

2.3. Haplotype Phasing: A Motivating Application

Humans have 23 pairs of homologous chromosomes, one maternal and one paternal. Each pair are identical sequences of nucleotides A,G,C,T s except on certain documented positions called single nucleotide polymorphisms (SNPs), or genetic variants. The problem of haplotype phasing is that of determining which variants are on the same chromosome in each pair, and has important applications such as in personalized medicine and human genetics. The advent of next generation sequencing technologies allows haplotype phasing by providing linking reads between multiple SNP locations (Browning & Browning, 2011; Donmez & Brudno, 2011; Das & Vikalo, 2015).

One can formulate the problem of haplotype phasing as recovery of two communities of SNP locations, those with the variant on the maternal chromosome and those with the variant on the paternal chromosome (Si et al., 2014; Kamath et al., 2015). Each pair of linking reads gives a noisy measurement of whether two SNPs have the variant on the same chromosome or different chromosomes. While there

Community Recovery in Graphs with Locality

Figure 1. Examples of (a) complete graph, (b) line, (c) ring, (d) grid, and (e) small-world graph.

are of the order of n = 105 SNPs on each chromosome, the linking reads are typically only several SNPs or at most 100 SNPs apart, depending on the speciﬁc sequencing technology. Thus, the measurements are sampled from a line graph like in Fig. 1(b) with locality radius r n.

2.4. Other Useful Metrics and Notation

One key metric that captures the distinguishability between two probability measures P0 and P1 is the Chernoff information (Cover & Thomas, 2006), deﬁned as

D (P0, P1) := inf 0 τ 1 log n X

y P τ 0 (y) P 1 τ 1 (y) o . (3)

For instance, when P0 Bernoulli (θ) and P1 Bernoulli (1 θ), D simpliﬁes to

D = KL (0.5 θ) = 0.5 log 0.5

θ + 0.5 log 0.5

where KL (0.5 θ) is the Kullback-Leibler (KL) divergence between Bernoulli(0.5) and Bernoulli(θ). Here and below, we shall use log ( ) to indicate the natural logarithm.

We denote by dv and davg the vertex degree of v and the average vertex degree of G, respectively.

3. Main Results

This section describes two nearly linear-time algorithms and presents our main results. The proofs of all theorems can be found in (Chen et al., 2016).

3.1. Algorithms

3.1.1. SPECTRAL-EXPANDING

The ﬁrst algorithm, called Spectral-Expanding, consists of three stages. For concreteness, we start by describing the procedure when the measurement graphs are lines / rings; see Algorithm 1 for a precise description of the algorithm and Fig. 2 for a graphical illustration.

Stage 1: spectral metrod on a core subgraph. Consider a subgraph Gc induced by Vc := {1, , r}, and it is selfevident that Gc is a complete subgraph. We run a spectral method (e.g. (Chin et al., 2015)) on Gc using samples taken over Gc, in the hope of obtaining approximate recovery of {Xi | i Vc}. Note that the spectral method can be replaced by other efﬁcient algorithms, including semideﬁnite programming (SDP) (Javanmard et al., 2015) and a variant of belief propagation (BP) (Mossel et al., 2013).

Stage 2: progressive estimation of remaining vertices. For each vertex i > |Vc|, compute an estimate of Xi by majority vote using backward samples those samples linking i and some j < i. The objective is to ensure that a large fraction of estimates obtained in this stage are accurate. As will be discussed later, the sample complexity required for approximate recovery is much lower than that required for exact recovery, and hence the task is feasible even though we do not use any forward samples to estimate Xi.

Stage 3: successive local reﬁnement. Finally, we clean up all estimates using both backward and forward samples in order to maximize recovery accuracy. This is achieved by running local majority voting from the neighbors of each vertex until convergence. In contrast to many prior work, we reuse all samples in all iterations. As we shall see, this stage is the bottleneck for exact information recovery. Remark 1. The proposed algorithm falls under the category of a general paradigm, which starts with an approximate estimate (often via spectral methods) followed by iterative reﬁnement. This paradigm has been successfully applied to a wide spectrum of applications ranging from matrix completion (Keshavan et al., 2010a; Jain et al., 2013) to phase retrieval (Chen & Candes, 2015) to community recovery (Chaudhuri et al., 2012; Abbe et al., 2016).

An important feature of this algorithm is its low computational complexity. First of all, the spectral method can be performed within O (mc log n) time by means of the power method, where mc indicates the number of samples falling on Gc. Stage 2 entails one round of majority voting, whereas the ﬁnal stage as we will demonstrate converges within at most O (log n) rounds of majority voting. Note that each round of majority voting can be completed in linear time, i.e. in time proportional to reading all samples. Taken collectively, we see that Spectral Expanding can be accomplished within O (m log n) ﬂops, which is nearly linear time.

Careful readers will recognize that Stages 2-3 bear similarities with BP, and might wonder whether Stage 1 can also be replaced with standard BP. Unfortunately, we are not aware of any approach to analyze the performance of vanilla BP without a decent initial guess. Note, however, that the spectral method is already nearly linear-time, and is hence at least as fast as any feasible procedure.

While the preceding paradigm is presented for lines / rings, it easily extends to a much broader family of graphs with locality (see (Chen et al., 2016)). The only places that need

Community Recovery in Graphs with Locality

Algorithm 1 : Spectral-Expanding

1. Run spectral method (see (Chen et al., 2016)) on a core subgraph induced by Vc, which yields estimates X(0) j , 1 j |Vc|.

2. Progressive estimation: for i = |Vc| + 1, , n,

X(0) i majority n Y (l) i,j X(0) j | j : j < i, (i, j) E, 1 l Ni,j o .

3. Successive local reﬁnement: for t = 0, , T 1, X(t+1) i majority n Y (l) i,j X(t) j | j : j = i, (i, j) E, 1 l Ni,j o , 1 i n.

4. Output X(T ) i , 1 i n.

Here, majority { } represents the majority voting rule: for any sequence s1, , sk {0, 1}, majority {s1, , sk} is equal to 1 if Pk i=1 si > k/2; and 0 otherwise.

Algorithm 2 : Spectral-Stitching

1. Split all vertices into several (non-disjoint) vertex subsets each of size W as follows

Vl := {i | (i 1) W/2 + 1 l (i 1) W/2 + W } , l = 1, 2, ,

and run spectral method on each subgraph induced by Vl, which yields estimates {XVl j | j Vl} for each l 1.

2. Stitching: set X(0) j XV1 j for all j V1; for l = 2, 3, ,

X(0) j XVl j ( j Vl) if X

j Vl Vl 1 XVl j XVl 1 j 0.5 |Vl Vl 1| ;

and X(0) j XVl j 1 ( j Vl) otherwise.

3. Successive local reﬁnement and output X(T ) i , 1 i n (see Steps 3-4 of Algorithm 1).

to be adjusted are:

(1) The core subgraph Vc. One would like to ensure that |Vc| davg and that the subgraph Gc induced by Vc forms a (nearly) complete subgraph, in order to guarantee decent recovery in Stage 1.

(2) The ordering of the vertices. Let Vc form the ﬁrst |Vc| vertices of V, and make sure that each i > |Vc| is connected to at least an order of davg vertices in {1, , i 1}. This is important because each vertex needs to be incident to sufﬁciently many backward samples in order for Stage 2 to be successful.

3.1.2. SPECTRAL-STITCHING

We now turn to the 2nd algorithm called Spectral-Stitching, which shares similar spirit as Spectral-Expanding and, in fact, differs from Spectral-Expanding only in Stages 1-2.

Stage 1: node splitting and spectral estimation. Split V into several overlapping subsets Vl (l 1) of size W, such that any two adjacent subsets share W/2 common ver-

tices. We choose the size W of each Vl to be r for rings / lines, and on the order of davg for other graphs. We then run spectral methods separately on each subgraph Gl induced by Vl, in the hope of achieving approximate estimates {XVl i | i Vl} up to global phase for each subgraph.

Stage 2: stiching the estimates. The aim of this stage is to stitch together the outputs of Stage 1 computed in isolation for the collection of overlapping subgraphs. If approximate recovery (up to some global phase) has been achieved in Stage 1 for each Vl, then the outputs for any two adjacent subsets are positively correlated only when they have matching global phases. This simple observation allows us to calibrate the global phases for all preceding estimates, thus yielding a vector {X(0) i }1 i n that is approximately faithful to the truth modulo some global phase.

The remaining steps of Spectral-Stitching follow the same local reﬁnement procedure as in Spectral-Expanding; see Algorithm 2. As can be seen, the ﬁrst 2 stages of Spectral Stitching which can also be completed in nearly lin-

Community Recovery in Graphs with Locality

core subgraph Vc Stage 1:

Stages 1-2:

(a) Spectral-Expanding (b) Spectral-Stitching

Figure 2. Illustration of the information ﬂow in Spectral-Expanding and Spectral-Stitching.

ear time are more symmetric than those of Spectral Expanding. More precisrely, Spectral-Expanding emphasizes a single core subgraph Gc and computes all other estimates based on Gc, while Spectral-Stitching treats each subgraph Gl almost equivalently. This symmetry nature might be practically beneﬁcial when the acquired data deviate from our assumed random sampling model.

3.2. Theoretical Guarantees: Rings

We start with the performance of our algorithms for rings. This class of graphs which is spatially invariant is arguably the simplest model exhibiting locality structure.

3.2.1. MINIMUM SAMPLE COMPLEXITY

Encouragingly, the proposed algorithms succeed in achieving the minimum sample complexity, as stated below. Theorem 1. Fix θ > 0 and any small ϵ > 0. Let G be a ring Rr with locality radius r, and suppose

m (1 + ϵ) m , (5)

where m = n log n 2 1 e KL(0.5 θ) . (6)

Then with probability approaching one3, Spectral Expanding (resp. Spectral-Stitching) converges to the ground truth within T = O (log n) iterations, provided that r log3 n (resp. r nδ for an arbitrary constant δ > 0).

Conversely, if m < (1 ϵ) m , then the probability of error Pe(ψ) is approaching one for any algorithm ψ. Remark 2. When r = n 1, a ring reduces to a complete graph (or an equivalent Erd os-R enyi model). For this case, computationally feasible algorithms have been extensively studied (Swamy, 2004; Jalali et al., 2011; Chen et al., 2014c;b;a), most of which focus only on the scaling results. Recent work (Hajek et al., 2015a; Jog & Loh, 2015) succeeded in characterizing the sharp threshold for this case, and it is immediate to check that the sample complexity we derive in (6) matches the one presented in (Hajek et al., 2015a; Jog & Loh, 2015).

3More precisely, the proposed algorithms succeed with probability exceeding 1 c1r 9 C2 exp{ c2 m

n (1 e D )} for some constants c1, c2, C2 > 0.

Remark 3. Theorem 1 requires r poly log(n) because each node needs to be connected to sufﬁciently many neighbors in order to preclude bursty errors. The condition r log3 n might be improved to a lower-order poly log (n) term using more reﬁned analyses. When r log n, one can compute the maximum likelihood (ML) estimate via dynamic programming (Kamath et al., 2015) within polynomial time.

Theorem 1 uncovers a surprising insensitivity phenomenon for rings: as long as the measurement graph is sufﬁciently connected, the locality constraint does not alter the sample complexity limit and the computational limit at all. This subsumes as special cases two regimes that exhibit dramatically different graph structures: (1) complete graphs, where the samples are taken in a global manner, and (2) rings with r = O(poly log (n)), where the samples are constrained within highly local neighborhood.

Notably, both (Abbe et al., 2015) and (Hajek et al., 2015b) have derived general sufﬁcient recovery conditions of SDP which, however, depend on the second-order graphical metrics of G (Durrett, 2007) (e.g. the spectral gap or Cheeger constant). When applied to rings (or other graphs with locality), the sufﬁcient sample complexity given therein is signiﬁcantly larger than the information limit4. This is in contrast to our ﬁnding, which reveals that for many graphs with locality, both the information and computation limits often depend only upon the vertex degrees independent of these second-order graphical metrics.

3.2.2. BOTTLENECKS FOR EXACT RECOVERY

Before explaining the rationale of the proposed algorithms, we provide here some heuristic argument as to why n log n samples are necessary for exact recovery and where the recovery bottleneck lies.

Without loss of generality, assume X = [0, , 0] . Suppose the genie tells us the correct labels of all nodes except v. Then all samples useful for recovering Xv reside on the edges connecting v and its neighbors, and there are Poisson(λdv) such samples. Thus, this comes down to

4For instance, the sufﬁcient sample complexity given in (Abbe et al., 2015) scales as n log n

h GD with h G denoting the Cheeger constant. Since h G = O(1/n) for rings / lines, this results in a sample size that is about n times larger than the information limit.

Community Recovery in Graphs with Locality

testing between two conditionally i.i.d. distributions with a Poisson sample size of mean λdv. From the large deviation theory, the ML rule fails in recovering Xv with probability

Pe,v exp n λdv(1 e D ) o , (7)

where D is the large deviation exponent. The above argument concerns a typical error event for recovering a single node v, and it remains to accommodate all vertices. Since the local neighborhoods of two vertices v and u are nearly non-overlapping, the resulting typical error events for recovering Xv and Xu become almost independent and disjoint. As a result, the probability of error of the ML rule ψml is approximately lower bounded by

v=1 Pe,v n exp n λdavg(1 e D ) o , (8) where one uses the fact that dv davg. Apparently, the right-hand side of (8) would vanish only if

λdavg(1 e D ) > log n. (9)

Since the total sample size is m = λ 1

2ndavg, this together with (9) conﬁrms the sample complexity lower bound

2λndavg > n log n 2 (1 e D ) = m .

As we shall see, the above error events in which only a single variable is uncertain dictate the hardness of exact recovery.

3.2.3. INTERPRETATION OF OUR ALGORITHMS

The preceding argument suggests that the recovery bottleneck of an optimal algorithm should also be determined by the aforementioned typical error events. This is the case for both Spectral-Expanding and Spectral-Stitching, as revealed by the intuitive arguments below. While the intuition is provided for rings, it contains all important ingredients that apply to many other graphs.

To begin with, we provide an heuristic argument for Spectral-Expanding.

(i) Stage 1 focuses on a core complete subgraph Gc. In the regime where m n log n, the total number of samples falling within Gc is on the order of |Vc|

n m |Vc| log n, which sufﬁces in guaranteeing partial recovery using spectral methods (Chin et al., 2015). In fact, the sample size we have available over Gc is way above the degrees of freedom of the variables in Gc (which is r).

(ii) With decent initial estimates for Gc in place, one can infer the remaining pool of vertices one by one using existing estimates together with backward samples. One important observation is that each vertex is incident to many i.e. about the order of log n backward samples. That said, we are effectively operating in a high signal-to-noise ratio (SNR) regime. While existing estimates are imperfect,

0.2 0.4 0.6 0.8 1

m = c n log n 2(1 e D )

error rate: p

Lm = c n log n

Figure 3. (Left) Minimum sample complexity m vs. locality radius r; (Right) Minimum number Lm of vertices being measured (including repetition) vs. single-vertex error rate p.

the errors occur only to a small fraction of vertices. Moreover, these errors are in some sense randomly distributed and hence fairly spread out, thus precluding the possibility of bursty errors. Consequently, one can obtain correct estimate for each of these vertices with high probability, leading to a vanishing fraction of errors in total.

(iii) Now that we have achieved approximate recovery, all remaining errors can be cleaned up via local reﬁnement using all backward and forward samples. For each vertex, since only a vanishingly small fraction of its neighbors contain errors, the performance of local reﬁnement is almost the same as in the case where all neighbors have been perfectly recovered.

The above intuition extends to Spectral-Stitching. Following the argument in (i), we see that the spectral method returns nearly accurate estimates for each of the subgraph Gl induced by Vl, except for the global phases. Since any two adjacent Gl and Gl+1 have sufﬁcient overlaps, this allows us to calibrate the global phases for {XVl i : i Vl} and {XVl+1 i : i Vl+1}. Once we obtain approximate recovery for all variables simultaneously, the remaining errors can then be cleaned up by Stage 3 as in Spectral-Expanding.

We emphasize that the ﬁrst two stages of both algorithms which aim at approximate recovery require only O (n) samples (as long as the pre-constant is sufﬁciently large). In contrast, the ﬁnal stage is the bottleneck: it succeeds as long as local reﬁnement for each vertex is successful. The error events for this stage are almost equivalent to the typical events singled out in Section 3.2.2, justifying the information-theoretic optimality of both algorithms.

3.3. Theoretical Guarantees: Inhomogeneous Graphs

The proposed algorithms are guaranteed to succeed for a much broader class of graphs with locality beyond rings, including those that exhibit inhomogeneous vertex degrees. The following theorem formalizes this claim for two of the most important instances: lines and grids.

Theorem 2. Theorem 1 continues to hold for the following families of measurement graphs:

(1) Lines with r = nβ for some constant 0 < β < 1, where

Community Recovery in Graphs with Locality

m = max {1/2, β} n log n

1 e KL(0.5 θ) ; (10)

(2) Grids with r = nβ for some constant 0 < β < 0.5, where

m = max {1/2, 4β} n log n

1 e KL(0.5 θ) . (11)

Remark 4. Careful readers will note that for lines (resp. grids), m does not converge to n log n 2(1 e KL(0.5 θ)) as β 1 (resp. β 0.5), which is the case of complete graphs. This arises because m experiences a more rapid drop in the regime where β = 1 (resp. β = 0.5). For instance, for a line with r = γn for some constant γ > 0, one has m = (1 γ/2)n log n

1 e KL(0.5 θ) . In addition, the result extends to small-world graphs. See (Chen et al., 2016) for details.

Theorem 2 characterizes the effect of locality radius upon the sample complexity limit; see Fig. 3 for a comparison of three classes of graphs. In contrast to rings, lines and grids are spatially varying models due to the presence of boundary vertices, and the degree of graph inhomogeneity increases in the locality radius r. To be more concrete, consider, for example, the ﬁrst davg/ log n vertices of a line, which have degrees around davg/2. In comparison, the set of vertices lying away from the boundary have degrees as large as davg. This tells us that the ﬁrst few vertices form a weakly connected component, thus presenting an additional bottleneck for exact recovery. This issue is negligible unless the size of the weakly connected component is exceedingly large. As asserted by Theorem 2, the minimum sample complexity for lines (resp. grids) is identical to that for rings unless r n (resp. r n1/8). Note that the curves for lines and grids (Fig. 3) have distinct hinge points primarily because the vertex degrees of the corresponding weakly connected components differ.

More precisely, the insights developed in Section 3.2.2 readily carry over here. Since the error probability of the ML rule is lower bounded by (8), everything boils down to determining the smallest λ (called λ ) satisfying

v=1 exp n λ dv 1 e D o 0,

which in turn yields m = 1

2λ davgn. The two cases accommodated by Theorem 2 can all be derived in this way.

3.4. Connection to Low-Rank Matrix Completion

One can aggregate all correct parities into a matrix Z = [Zi,j]1 i,j n such that Zi,j = 1 if Xi = Xj and Zi,j = 1 otherwise. It is straightforward to verify that rank (Z) = 1, with each Y (l) i,j being a noisy measurement of Zi,j. Thus, our problem falls under the category of low-rank matrix completion, a topic that has inspired a ﬂurry of research (e.g. (Candes & Recht, 2009; Keshavan et al., 2010b; Cand es et al., 2011; Chandrasekaran et al., 2011; Chen

et al., 2013)). Most prior works, however, concentrated on samples taken over an Erd os Rnyi model, without investigating sampling schemes with locality constraints. One exception is (Bhojanapalli & Jain, 2014), which explored the effectiveness of SDP under general sampling schemes. However, the sample complexity required therein increases signiﬁcantly as the spectral gap of the measurement graph drops, which does not deliver optimal guarantees. We believe that the approach developed herein will shed light on solving general matrix completion problems from samples with locality.

4. Extension: Beyond Pairwise Measurements

The proposed algorithms are applicable to numerous scenarios beyond the basic setup in Section 2.1. This section presents one important extension.

In some applications, each measurement may cover more than two nodes in the graph. In the haplotype phasing application, for example, a new sequencing technology called 10X (10x, 2016) generates barcodes to mark reads from the same chromosome (maternal or paternal), and more than two reads can have the same barcode. For concreteness, we suppose the locality constraint is captured by rings, and consider the type of multiple linked samples as follows.

Measurement (hyper)-graphs. Let G0 = (V, E0) be a ring Rr, and let G = (V, E) be a hypergraph such that (i) every hyperedge is incident to L vertices in V, and (ii) all these L vertices are mutually connected in G0.

Noise model. On each hyperedge e = (i1, , i L) G, we obtain Ne ind. Poisson (λ) multi-linked samples {Y (l) e | 1 l Ne}. Conditional on Ne, each sample Y (l) e is an independent copy of

( (Zi1, , Zi L) , with prob. 0.5, (Zi1 1, , Zi L 1) , else, (12)

where Zi is a noisy measurement of Xi such that Zi = Xi with probability 1 p and Zi = Xi 1 otherwise. Here, p represents the error rate for measuring a single vertex. For the pairwise samples considered before, one can think of the parity error rate θ as P {Zi Zj = Xi Xj} or, equivalently, θ = 2p(1 p).

We emphasize that a random global phase is incorporated into each sample (12). That being said, each sample reveals only the relative similarity information among these L vertices, without providing further information about the absolute cluster membership. Interestingly, the proposed algorithms with slight modiﬁcation (see (Chen et al., 2016)) are still information-theoretically optimal. Theorem 3. Fix L 2. Theorem 1 continues to hold under the above L-wise sampling model, with m replaced by

m := n log n L 1 e D(P0,P1) .

Community Recovery in Graphs with Locality

0 1 2 3 4 0

Number of measurements normalized by information theoretic limit

Success rate

r = 4 r = 18 r = 317 r = 5624

0 1 2 3 4 0

Number of measurements normalized by Information Theoretic limit

Success rate

θ = 0.01 θ = 0.10 θ = 0.15

0 10 20 30 0

Number of mate pair measurements per SNP

Success rate

information theoretic limit

0 10 20 30 0

Number of measurements per SNP

Success rate

information theoretic limit

(a) (b) (c) (d) (e)

Figure 4. Empirical success rate of Spectral-Expanding for: (a) Rings Rr; (b) Rings R18 with varied measurement error rate θ; (c) Insert size distribution (Illumina, 2012); (d) Performance on a simulation of haplotype; (e) Performance on a simulation of haplotype. For (a) and (b), the x-axis is the sample size m normalized by the information limit m .

r = n0.25 r = n0.5 r = n0.75

Time (seconds/run) 3.55 6.45 58.4

Table 1. The time taken to run Spectral-Expanding on a Mac Book Pro equipped with a 2.9 GHz Intel Core i5 and 8GB of memory over rings Rr, where n = 100, 000, θ = 10% and m = 1.5m .

P0 = (1 p)Binomial (L 1, p) + p Binomial (L 1, 1 p) ; P1 = p Binomial (L 1, p) + (1 p) Binomial (L 1, 1 p) .

Remark 5. A closed-form expression of D(P0, P1) can be found in (Chen et al., 2016).

With Theorem 3 in place, we can determine the beneﬁts of multi-linked sampling. To enable a fair comparison, we evaluate the sampling efﬁciency in terms of Lm rather than m , since Lm captures the total number of vertices (including repetition) one needs to measure. As illustrated in Fig. 3, the sampling efﬁciency improves as L increases, but there exists a fundamental lower barrier given by n log n 1 e KL(0.5 p) . This lower barrier, as plotted in the black curve of Fig. 3, corresponds to the case where L is approaching inﬁnity.

5. Numerical Experiments

To verify the practical applicability of the proposed algorithms, we have conducted simulations in various settings. All these experiments focused on graphs with n = 100, 000 vertices, and used an error rate of θ = 10% unless otherwise noted. For each point, the empirical success rates averaged over 10 Monte Carlo runs are reported.

(a) Regular rings. We ran Algorithm 1 on rings Rr for various values of locality radius r (Fig. 4(a)), with the runtime reported in Table 1;

(b) Rings with different error rates. We varied the error rate θ for rings with r = 18 = n0.25, and plotted the empirical success rate (Fig. 4(d)).

We have also simulated a model of the haplotype phasing problem by assuming that the genome has a SNP periodically every 1000 base pairs. The insert length distribution, i.e. the distribution of the genomic distance between

Figure 5. The switch error rates of Spectral-Stitching when run on the NA12878 data-set from 10x-genomics (10x Genomics, 2015).

the linking reads, is given in Fig. 4(c) for Illumina reads, and a draw from Poisson(3.5) truncated within the interval 1, , 9 is a reasonable approximation for the number of SNPs between two measured SNPs. We then ran the simulation on the rings R9, with non-uniform sampling weight. Using the nominal error rate of p = 1% for the short reads, the error rates of the measurements is 2p(1 p) 2%. The empirical performance is shown in Fig. 4(d).

Additionally, we have simulated reads generated by 10x Genomics (10x, 2016) , which corresponds to the model in Section 4. Each measurement consists of multiple linked reads, which is generated by ﬁrst randomly picking a segment of length 100 SNPs (called a fragment) on the line graph and then generating Poisson(9) number of linked reads uniformly located in this segment. The noise rate per read is p = 0.01. The empirical result is shown in Fig. 4(e). The information theoretic limit is calculated using Theorem 3, with L set to inﬁnity (since the number of vertices involved in a measurement is quite large here).

To evaluate the performance of our algorithm on real data, we ran Spectral-Stitching for Chromosomes 1-22 on the NA12878 data-set made available by 10x-Genomics (10x Genomics, 2015). The nominal error rate per read is p = 1%, and the average number of SNPs touched by each sample is L [6, 7]. The number of SNPs n ranges from 34240 to 191829, with the sample size m from 102633 to 574189. Here, we split all vertices into overlapping subsets of size W = 100. The performance is measured in terms of the switch error rate, that is, the fraction of positions where we need to switch the estimate to match the ground truth. The performance on Chromosomes 1-22 is reported in Fig. 5.

Community Recovery in Graphs with Locality

10x Genomics, 2016. URL http://www. 10xgenomics.com. [Online; accessed 5-February2016].

10x Genomics. NA12878 Loupe data-set, 2015. URL http://software.10xgenomics.com/ files/samples/genome/NA12878_WGS.

Abbe, E. and Sandon, C. Community detection in general stochastic block models: fundamental limits and efﬁcient recovery algorithms. ar Xiv preprint ar Xiv:1503.00609, 2015.

Abbe, E., Bandeira, A., Bracher, A., and Singer, A. Decoding binary node labels from censored measurements: Phase transition and efﬁcient recovery. IEEE Trans on Network Science and Engineering, 1(1):10 22, 2015.

Abbe, Emmanuel and Wainwright, Martin. ISIT 2015 Tutorial: Information Theory and Machine Learning. 2015.

Abbe, Emmanuel, Bandeira, Afonso S, and Hall, Georgina. Exact recovery in the stochastic block model. IEEE Trans. on Information Theory, 62(1):471 487, 2016.

Bansal, N., Blum, A., and Chawla, S. Correlation clustering. Machine Learning, 56(1-3):89 113, 2004.

Bhojanapalli, Srinadh and Jain, Prateek. Universal matrix completion. International Conference on Machine Learning (ICML), pp. 1881 1889, 2014.

Browning, S. and Browning, B. Haplotype phasing: existing methods and new developments. Nature Reviews Genetics, 12(10):703 714, 2011.

Candes, E. J. and Recht, B. Exact Matrix Completion via Convex Optimization. Foundations of Comp. Math., (6): 717 772, 2009.

Cand es, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of ACM, 58(3):11:1 11:37, Jun 2011.

Chandrasekaran, Venkat, Sanghavi, Sujay, Parrilo, Pablo A, and Willsky, Alan S. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572 596, 2011.

Chaudhuri, K., Chung, F., and Tsiatas, A. Spectral clustering of graphs with general degrees in the extended planted partition model. Journal of Machine Learning Research, 2012:1 23, 2012.

Chen, Jingchun and Yuan, Bo. Detecting functional modules in the yeast protein protein interaction network. Bioinformatics, 22(18):2283 2290, 2006.

Chen, Y. and Candes, E. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems (NIPS), pp. 739 747, 2015.

Chen, Y., Guibas, L., and Huang, Q. Near-Optimal Joint Object Matching via Convex Relaxation. International Conference on Machine Learning (ICML), pp. 100 108, June 2014a.

Chen, Y., Suh, C., and Goldsmith, A. J. Information recovery from pairwise measurements: A Shannon-theoretic approach. IEEE International Symposium on Information Theory, ar Xiv preprint ar Xiv:1504.01369, 2015.

Chen, Yudong, Jalali, A., Sanghavi, S., and Caramanis, C. Low-Rank Matrix Recovery From Errors and Erasures. IEEE Trans on Info Theory, 59(7):4324 4337, 2013.

Chen, Yudong, Lim, Shiau H, and Xu, Huan. Weighted graph clustering with non-uniform uncertainties. In International Conference on Machine Learning (ICML), pp. 1566 1574, 2014b.

Chen, Yudong, Sanghavi, Sujay, and Xu, Huan. Improved graph clustering. IEEE Transactions on Information Theory, 60(10):6440 6455, 2014c.

Chen, Yuxin, Kamath, Govinda, Suh, Changho, and Tse, David. Community recovery in graphs with locality (full version). 2016. URL http://arxiv.org/abs/ 1602.03828.

Chin, Peter, Rao, Anup, and Vu, Van. Stochastic Block Model and Community Detection in the Sparse Graphs: A spectral algorithm with optimal rate of recovery. ar Xiv preprint ar Xiv:1501.05021, 2015.

Coja-Oghlan, Amin. Graph partitioning via adaptive spectral techniques. Combinatorics, Probability and Computing, 19(02):227 284, 2010.

Condon, Anne and Karp, Richard M. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2):116 140, 2001.

Cover, Thomas M and Thomas, Joy A. Elements of information theory. Wiley-interscience, 2006.

Das, Shreepriya and Vikalo, Haris. SDha P: haplotype assembly for diploids and polyploids via semi-deﬁnite programming. BMC genomics, 16(1):1, 2015.

Donmez, N. and Brudno, M. Hapsembler: an assembler for highly polymorphic genomes. In Research in Computational Molecular Biology, pp. 38 52. Springer, 2011.

Durrett, R. Random graph dynamics, volume 200. Cambridge university press Cambridge, 2007.

Fortunato, S. Community detection in graphs. Physics Reports, 486(3):75 174, 2010.

Community Recovery in Graphs with Locality

Girvan, M. and Newman, M. Community structure in social and biological networks. Proceedings of the national academy of sciences, 99(12):7821 7826, 2002.

Globerson, Amir, Roughgarden, Tim, Sontag, David, and Yildirim, Cafer. How Hard is Inference for Structured Prediction? In ICML, pp. 2181 2190, 2015.

Hajek, Bruce, Wu, Yihong, and Xu, Jiaming. Achieving Exact Cluster Recovery Threshold via Semidefinite Programming: Extensions. ar Xiv preprint ar Xiv:1502.07738, 2015a.

Hajek, Bruce, Wu, Yihong, and Xu, Jiaming. Exact recovery threshold in the binary censored block model. In Information Theory Workshop, pp. 99 103, 2015b.

Holland, Paul W, Laskey, Kathryn Blackmond, and Leinhardt, Samuel. Stochastic blockmodels: First steps. Social networks, 5(2):109 137, 1983.

Illumina. Data processing of Nextera mate pair reads on Illumina sequencing platform. Technical Note: Sequencing, 2012. URL http://www.illumina.com/documents/ products/technotes/technote_nextera_ matepair_data_processing.pdf.

Jain, Prateek, Netrapalli, Praneeth, and Sanghavi, Sujay. Low-rank matrix completion using alternating minimization. In Symposium on Theory of Computing (STOC), pp. 665 674, 2013.

Jalali, Ali, Chen, Yudong, Sanghavi, Sujay, and Xu, Huan. Clustering Partially Observed Graphs via Convex Optimization. In International Conference on Machine Learning (ICML), pp. 1001 1008, June 2011.

Javanmard, Adel, Montanari, Andrea, and Ricci-Tersenghi, Federico. Phase Transitions in Semideﬁnite Relaxations. ar Xiv preprint ar Xiv:1511.08769, 2015.

Jog, Varun and Loh, Po-Ling. Information-theoretic bounds for exact recovery in weighted stochastic block models using the Renyi divergence. ar Xiv preprint ar Xiv:1509.06418, 2015.

Kamath, G., Sasoglu, E., and Tse, D. Optimal Haplotype Assembly from High-Throughput Mate-Pair Reads. IEEE International Symposium on Information Theory, pp. 914 918, June 2015.

Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from noisy entries. The Journal of Machine Learning Research, 99:2057 2078, 2010a.

Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from a few entries. IEEE Trans on Info Theory, (6):2980 2998, 2010b.

Mossel, Elchanan, Neeman, Joe, and Sly, Allan. Belief propagation, robust reconstruction, and optimal recovery of block models. ar Xiv preprint ar Xiv:1309.1380, 2013.

Porter, Mason A, Onnela, Jukka-Pekka, and Mucha, Peter J. Communities in networks. Notices of the AMS, 56 (9):1082 1097, 2009.

Shi, Jianbo and Malik, Jitendra. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888 905, 2000.

Si, H., Vikalo, H., and Vishwanath, S. Haplotype assembly: An information theoretic view. In IEEE Information Theory Workshop, pp. 182 186, 2014.

Swamy, C. Correlation clustering: maximizing agreements via semideﬁnite programming. In Symposium on Discrete Algorithms (SODA), pp. 526 527, 2004.