# Multi-View Stochastic Block Models

Vincent Cohen-Addad*1, Tommaso d'Orsi*1,2, Silvio Lattanzi*1, Rajai Nasser*1

*Equal contribution. 1Google Research. 2BIDSA, Bocconi. Correspondence to: Tommaso d'Orsi. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Graph clustering is a central topic in unsupervised learning with a multitude of practical applications. In recent years, multi-view graph clustering has gained a lot of attention for its applicability to real-world instances where one has access to multiple data sources. In this paper we formalize a new family of models, called multi-view stochastic block models, that captures this setting. For this model, we first study efficient algorithms that naively work on the union of multiple graphs. Then, we introduce a new efficient algorithm that provably outperforms previous approaches by analyzing the structure of each graph separately. Furthermore, we complement our results with an information-theoretic lower bound studying the limits of what can be done in this model. Finally, we corroborate our results with experimental evaluations.

1. Introduction

Clustering graphs is a fundamental topic in unsupervised learning. It is used in a variety of fields, including data mining, social sciences, statistics, and more. The goal of graph clustering is to partition the vertices of the graph into disjoint sets so that similar vertices are grouped together and dissimilar vertices lie in different clusters. In this context, several notions of similarity between vertices have been studied throughout the years, resulting in different clustering objectives and clustering algorithms (Von Luxburg, 2007; Ng et al., 2001; Bansal et al., 2004; Goldberg, 1984; Dasgupta, 2016). Despite the rich literature, most of the algorithmic results in graph clustering focus only on the setting where a single graph is presented in input. This is in contrast with the increasing practical importance of multimodality and with the growing attention in applied fields to multi-view or multi-layer clustering (Paul & Chen, 2016; Corneli et al., 2016; Han et al., 2015; De Bacco et al., 2017; Khan & Maji, 2019; Zhong & Pun, 2021; Hu et al., 2019; Abavisani & Patel, 2018; Kim et al., 2016; Gujral et al., 2020; Ni et al., 2016; De Santiago et al., 2023; Papalexakis et al., 2013; Gujral & Papalexakis, 2018; Gorovits et al., 2018). In practice it is in fact observed that, while a single data source only offers a specific characterization of the underlying objects, leading to a coarse partition of the data, a careful combination of multiple views often allows a semantically richer network structure to emerge (Fu et al., 2020; Fang et al., 2023). For a practical example, consider the task of clustering the users of a social network platform like Facebook, Instagram or X. To solve such a task one could simply cluster the friendship graph, or one could instead look jointly at the friendship graph, the co-like graph (a graph where two users are connected if they like the same picture/video), the co-comment graph (a graph where two users are connected if they comment on the same post), the co-repost graph (a graph where two users are connected if they repost the same post), and so on.
In practice, one would expect the second approach to work better in many settings because it provides a more fine-grained description of the behaviors of the users. Despite the large number of applications, very little is known about the theoretical aspects of the problem. Several works (Paul & Chen, 2016; Corneli et al., 2016; Han et al., 2015; De Bacco et al., 2017) consider multi-layer stochastic block models, where the goal is to identify k communities given several instances (i.e., layers or views) of the stochastic block model, each with k communities. In this paper, we would like to work in a more general and more realistic setup, where there are k communities but each instance only provides partial information about these k communities. Very recently and concurrently to us, (De Santiago et al., 2023) introduced the multi-view stochastic block model. In this model, one is given in input multiple graphs, each coming from a stochastic block model, and the goal is to leverage the information contained in the graphs to recover the underlying clustering structure. More precisely, given a vector of labels z (we write random variables in boldface), where the labels capture the clustering assignment and are in [k], and t graphs G_1, . . . , G_t, where each graph G_ℓ is drawn independently from a stochastic block model with 2 labels and possibly distinct parameters, we are interested in designing an algorithm to weakly recover the underlying vector z. One important aspect of the model is that none of the input graphs G_1, . . . , G_t may contain enough information to recover the full clustering structure (for example because 2 < k); nevertheless, one can show that if enough graphs are observed it is possible to recover the clustering structure of the underlying instance.

Armed with this new model we study different approaches to cluster the input graphs G_1, . . . , G_t. First, we note that the natural approach (sometimes used in practice) of merging the graphs and then clustering the union of the graphs, called early fusion, leads to suboptimal results. Then we design a more careful late fusion clustering algorithm that first clusters all the graphs separately and then carefully merges their results. This shows the superiority of late over early fusion. Finally, we complement our results with an information-theoretic lower bound studying the limits of what can be done in this model. The bounds obtained are a drastic improvement over the ones obtained by (De Santiago et al., 2023).

Model Before formally introducing our model, we recall the classic definition of the stochastic block model. The k-community symmetric stochastic block model (see (Abbe, 2017) for a survey) denotes the following joint distribution (x, G) ∼ SBM_{n,k,d,ε} over a vector of n labels in [k] and a graph on n vertices: draw x from [k]^n uniformly at random; for each distinct i, j ∈ [n], independently create an edge ij in G with probability (1 + (1 − 1/k)ε) · d/n if x_i = x_j, and with probability (1 − ε/k) · d/n otherwise. We denote the conditional distribution of G given x = x as SBM_{n,k,d,ε}(x). Given a graph G sampled according to this model, the goal is to recover the (unknown) underlying vector of labels as well as possible. Most of the statistical and computational phenomena at play can already be observed in the simplest settings with two communities, so we will often focus on those. For k = 2, we denote the distribution by SBM_{n,2,d,ε}, i.e., we explicitly replace the subscript k.
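To make the definition concrete, the following minimal Python sketch (not code from the paper; function name and interface are illustrative) samples a pair (x, G) from SBM_{n,k,d,ε} under the symmetric parametrization just described.

```python
import numpy as np

def sample_sbm(n, k, d, eps, rng=None):
    """Draw (x, A) from the k-community symmetric SBM described above.

    x is a uniformly random label vector in [k]^n (encoded as {0, ..., k-1});
    an edge {i, j} is present with probability (1 + (1 - 1/k) * eps) * d / n
    if x_i == x_j and with probability (1 - eps / k) * d / n otherwise.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.integers(0, k, size=n)
    p_in = (1 + (1 - 1 / k) * eps) * d / n     # within-community edge probability
    p_out = (1 - eps / k) * d / n              # across-community edge probability
    same = x[:, None] == x[None, :]
    upper = np.triu(rng.random((n, n)) < np.where(same, p_in, p_out), k=1)
    A = (upper | upper.T).astype(np.int8)      # symmetric 0/1 adjacency matrix
    return x, A

# Example: a two-community instance above the Kesten-Stigum threshold d * eps^2 / k^2 > 1.
x, A = sample_sbm(n=2000, k=2, d=20, eps=0.8)
```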
It will also be convenient to use {+1, −1} for the community labels instead of [2], so we will sometimes do this. The labeling convention should be clear from the context. One of the most widely studied natural objectives in the context of stochastic block models is that of weak recovery, which asks to approximately recover the communities. Specifically, we say that an algorithm achieves weak recovery for {SBM_{n,k,d,ε}}_{n ∈ N} if the correlation of the algorithm's output x̂(G) ∈ [k]^n with the underlying vector x of labels is better than random as n grows (we use o(1) to denote a function f such that lim_{n→∞} f(n) = 0):

P( R(x̂(G), x) ≥ 1/k + Ω_{d,ε,k}(1) ) ≥ 1 − o(1) ,   (1)

where R(x̂, x) is the agreement between x̂ and x, defined (using Iverson's brackets ⟦·⟧ to denote the indicator function) as

R(x̂, x) = max_{π ∈ P_k} (1/n) Σ_{i ∈ [n]} ⟦x̂_i = π(x_i)⟧ ,   (2)

and P_k is the permutation group of [k]. A sequence of works (Decelle et al., 2011; Massoulié, 2014; Mossel et al., 2014; 2015b; Abbe & Sandon, 2016a; Mossel et al., 2018; Montanari & Sen, 2016) has studied the statistical and computational landscapes of this objective, with great success. The emerging picture (Bordenave et al., 2015; Abbe & Sandon, 2016a; Montanari & Sen, 2016) shows that it is possible to achieve weak recovery in polynomial time whenever dε²/k² > 1; this value is called the Kesten-Stigum threshold. Further evidence (Hopkins & Steurer, 2017) suggests that this threshold is optimal for polynomial-time algorithms. In particular, for the special case of weak recovery with 2 communities, (Mossel et al., 2015b) showed that the problem is solvable (also computationally efficiently) if and only if dε²/4 > 1. For larger values of k a gap between information-theoretic results and efficient algorithms exists (Abbe & Sandon, 2016b; Banks et al., 2016).

In the context of multimodality, we define the following multi-view model.

Model 1.1 (Multi-View stochastic block model). Let k ≥ 1 and let T be a sequence of t tuples (d_ℓ, ε_ℓ) where d_ℓ ≥ 0 and ε_ℓ ∈ (0, 1). We refer to the following joint distribution (z, (f_1, G_1), . . . , (f_t, G_t)) ∼ (T,k,t)-MV-SBM_n as the (T, k, t)-multi-view stochastic block model:

1. for each ℓ ∈ [t], independently draw a mapping f_ℓ : [k] → {±1} uniformly at random;
2. independently draw a vector z from [k]^n uniformly at random;
3. for each ℓ ∈ [t], independently draw a graph G_ℓ ∼ SBM_{n,2,d_ℓ,ε_ℓ}(f_ℓ(z)), where f_ℓ(z) is the n-dimensional vector with entries f_ℓ(z_1), . . . , f_ℓ(z_n).

Given G_1, . . . , G_t, the goal is to approximately recover the unknown vector z of labels.

When T = {(d_ℓ, ε_ℓ)}_{ℓ ∈ [t]} is such that (d_ℓ, ε_ℓ) = (d, ε) for some d, ε, we denote the model simply by (d,ε,k,t)-MV-SBM_n. Although Model 1.1 captures the algorithmic phenomena of multi-view models used in practice, more general versions of Model 1.1 could be defined; we discuss them in Section 6. Similarly to the vanilla stochastic block model, weak recovery can also be defined for Model 1.1. We say that an algorithm achieves weak recovery for (T,k,t)-MV-SBM_n with t observations if it outputs a vector ẑ(G_1, . . . , G_t) satisfying:

P( R(ẑ(G_1, . . . , G_t), z) ≥ 1/k + Ω(1) ) ≥ 1 − o_t(1) .   (3)

Differently from the vanilla stochastic block model, the complexity of Model 1.1 is governed both by the SBM parameters in T and by the number of observations t. A good algorithm should then achieve weak recovery with the best possible multi-way tradeoff between the edge densities of the graphs, the biases, and the number of observations at hand.
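The generative process of Model 1.1 is short enough to state in code. The sketch below is illustrative only (helper names are hypothetical); each view is the two-community specialization of the SBM sampler shown earlier, applied to the ±1 labels f_ℓ(z).

```python
import numpy as np

def sample_mv_sbm(n, k, T, rng=None):
    """Draw (z, (f_1, G_1), ..., (f_t, G_t)) following the three steps of Model 1.1.

    T is a list of t pairs (d_l, eps_l). Each view is the two-community SBM on
    the +/-1 labels f_l(z), i.e. edge probability (1 + eps_l/2) * d_l / n inside
    a part and (1 - eps_l/2) * d_l / n across (the k = 2 case of the model above).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.integers(0, k, size=n)              # step 2: hidden community vector in [k]^n
    views = []
    for d, eps in T:
        f = rng.choice([-1, 1], size=k)         # step 1: random mapping f_l : [k] -> {+1, -1}
        labels = f[z]                           # the +/-1 vector f_l(z)
        p_in, p_out = (1 + eps / 2) * d / n, (1 - eps / 2) * d / n
        same = labels[:, None] == labels[None, :]
        upper = np.triu(rng.random((n, n)) < np.where(same, p_in, p_out), k=1)
        A = (upper | upper.T).astype(np.int8)   # step 3: adjacency matrix of G_l
        views.append((f, A))
    return z, views

# Example: twelve identical views, each above its own Kesten-Stigum threshold (d * eps^2 / 4 > 1).
z, views = sample_mv_sbm(n=1000, k=8, T=[(20, 0.8)] * 12)
```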
That is, extract as much information as possible so to require as few observations as possible. This novel interplay of parameters immediately raises two natural questions, which are the main focus of this work. How many observations are needed? The problem gets easier the larger the number of observations one has access to (see Appendix A for a formal proof). It is also easy to see that for t = o(log k), it is information theoretically impossible to approximately recover the communities (since log2 k bits are needed to encode k labels). Furthermore, as we will see, stronger lower bounds can also be obtained. How many observations suffice? To understand how many observations suffice to recover the communities, it is instructive to consider the union graph G = S ℓ [t] Gℓof an instance from (d,ε,k,t)-MV-SBMn, which turns out to follow a k-communities stochastic block model with parameters d = Θ(dt) , ε = Θ(ε) (see Appendix A). Building on the aforementioned results, this implies that at least t Ω(k2/dε2) observations are needed for efficient weak recovery of the communities from G ! However, as we show later, exponentially better algorithms can bridge this gap. 1.1. Results Weak recovery Our main algorithmic result shows that weak recovery for (T ,k,t)-MV-SBMn can be achieved in polynomial time with only O(log k) many observations. Theorem 1.2 (Weak recovery for multi-view models). Let n, k > 0 . Let (z, (f1, G1), . . . (ft, Gt)) (T ,k,t)-MV-SBMn for a sequence of tuples T = {(dℓ, εℓ)}t ℓ=1 , each satisfying dℓ ε2 ℓ/4 > 1. Then there exists a constant CT > 0 depending only on T , such that if t Ω log k , weak recovery of z in the sense of (3) is possible. Moreover, the underlying algorithm runs in polynomial time. Theorem 1.2 implies that whenever the algorithm has access to Θ(log k) observations, each above the relative Kesten Stigum threshold, the guarantees of the underlying algorithm match the aforementioned trivial lower bound, up to constant factors. Moreover, as we will see in Section 4, the underlying algorithm turns out to be surprisingly simple and efficient. The algorithm in Theorem 1.2 applies a specialized weakrecovery algorithm on each view Gℓto obtain a matrix ˆXℓ estimating fℓ(z)fℓ(z)T and achieving the correlation Cℓ E h ˆX(G)ij fℓ(z)i = fℓ(z)j i E h ˆX(G)ij fℓ(z) = fℓ(z)j i , (4) where Cℓdepends only on dℓ, εℓ. The algorithm then proceeds into processing the outputs ˆXℓin a blackbox fashion to produce an estimate ˆz of z. The constant CT in Theorem 1.2 is the average of the correlations (Cℓ)ℓ [t]. It is natural to wonder whether the dependency of the number of observations on CT is needed. Moreover, as for canonical stochastic block models, it is natural to ask what the exact phase transition of Model 1.1 is. While we leave this latter fascinating question open, our next result shows that if we want an algorithm that processes estimates in a blackbox fashion, then some dependency on CT is indeed needed. Theorem 1.3 (Lower bound for multi-view models - Informal). Let n, k > 0 . Let (z, (f1, G1), . . . (ft, Gt)) (T ,k,t)-MV-SBMn for a sequence of tuples T = {(dℓ, εℓ)}t ℓ=1 , each satisfying dℓ ε2 ℓ/4 > 1. Assume that for every ℓ [t] we have an estimate4 ˆXℓof fℓ(z)fℓ(z)T satisfying a pair-wise correlation (as in (4)) of at least Cℓ> 0 , and let CT be the average correlation. If t = o log k , then by only using the estimates ˆX1, . . . , ˆXt, it is information-theoretically impossible to return a vector ˆz satisfying P R(ˆz, z) 1 k + Ω(1) 0.99 . 
(5) Exact recovery Another widely studied objective for stochastic block models is that of exact recovery, where the goal is to correctly classify all vertices in the graph 4Such estimates might be obtained by applying a blackbox weak-recovery algorithm for SBMn,2,d,ε on each of G1, . . . , Gt, and which has the mentioned correlation guarantee. Multi-View Stochastic Block Models (see (Abbe et al., 2015; Mossel et al., 2015a; Abbe, 2017) and references therein). In the context of Model 1.1 this objective becomes P (R(ˆz, z) = 1) 1 o(1) . (6) As a corollary we show that, when given access to more views, the algorithm behind Theorem 1.2 can achieve exact recovery. Corollary 1.4 (Exact recovery for multi-view models). Consider the settings of Theorem 1.2, if t Ω log n then exact recovery of z in the sense of Equation (6) is possible. Moreover, the underlying algorithm runs in polynomial time. Experiments Theorem 1.2 show hows, for Model 1.1, late fusion algorithms can provide better guarantees by requiring an exponentially smaller number of observations to achieve the same error in a large parameters regime than early fusion algorithms. We further corroborated these findings with experiments on synthetic data in Section 5. Organization The rest of the paper is organized as follows. We introduce the main ideas in Section 2. In Section 3 we introduce our specialized weak recovery algorithm for the standard stochastic block model. This is then used in the design of the algorithm behind Theorem 1.2 in Section 4. Experiments are presented in Section 5. Future directions and conclusions are discussed in Section 6. In Appendix A we show the limits of algorithms using the union graph. Appendix B contains a proof of (the formal version of) Theorem 1.3. Deferred proofs are presented in Appendix C. We denote random variables in boldface. We hide constant factors using the notation O( ), Ω( ), Θ( ) . We write Oδ( ) , Ωδ( ) to specify that the hidden constant may depend on the parameter δ. Similarly, we sometimes write Cδ to denote a constant depending only on δ. We further denote the indicator function with Iverson s brackets J K . Given functions f, g : R R, we say f on(g) if limn f(n)/g(n) = 0. Similarly we write f ωn(g) if g on(f) . With a slight abuse of notation we often write on(g) to denote a function in on(g). When the context is clear we drop the subscript. In particular, we often write o(1) to denote functions that tends to zero as n grows. We say that an event happens with high probability if this probability is at least 1 o(1) . For a set S [n], we write i u.a.r. S to denote an element drawn uniformly at random. For a given probability distribution and a measurable event E , we denote the probability that the event occurs by P(E) . We denote the complement event by Ec . For a vector v Rn, we write v for its Euclidean norm. For a matrix M Rn n, we denote by M its spectral norm and by M F its Frobenius norm. We also let M 1 := P ij |Mij| . We denote the i-th row of M by Mi. For a graph G with n vertices, we denote by A(G) its adjacency matrix. When the context is clear we simply write A. If the graph is directed, row Ai contains the outgoing edges of vertex i . For a given vector of labels z [k]n, we denote by c1(z) . . . , ck(z) the n-dimensional indicator vectors of the k communities defined by z. In the interest of simplicity, we often denote instances (z, (f1, G1), . . . (ft, Gt)) drawn from (T ,k,t)-MV-SBMn simply by I . 
For z [k]n, we also denote by (T ,k,t)-MV-SBMn(z) the distribution of (z, (f1, G1), . . . (ft, Gt)) (T ,k,t)-MV-SBMn conditioned on the event z = z . We often call z [k]n a community vector . We say that an algorithm runs in time T, if in the worst case it performs at most T elementary operations. 2. Techniques We outline here the main ideas behind Theorem 1.2 and Theorem 1.3. In the interest of clarity, we limit our discussion to (d,ε,k,t)-MV-SBMn. Behavior of the union graph The algorithm behind Theorem 1.2 is remarkably simple and leverages known algorithms for weak recovery of stochastic block models (particularly related to the robust algorithms of (Ding et al., 2022; 2023)). As a first step, to gain intuition, it is instructive to understand why the union graph instead requires t Ω(k2) observations (see Theorem A.1). An instance I of (d,ε,k,t)-MV-SBMn is given by a vector z [k]n and a collection of t independent pairs (f1, G1) , . . . , (ft, Gt), where each Gℓis sampled from SBM2,d,ε(fℓ(z)) . Since, by definition, each edge {i, j} appears in Gℓwith probability 1 + ε 2 fℓ(z)i fℓ(z)j d n , in the union graph S ℓ [t] Gℓ the same edge appears roughly with probability 2 fℓ(z)i fℓ(z)j . The intuition here is that if zi = zj the second term in the sum would be close to 0, while for zi = zj it would be εt 2 . Hence, we will see this edge in the union graph roughly with probability n (1 + O(ε) Jzi = zj K) . That is, the union graph behaves similarly to a vanilla stochastic block model with k communities, bias O(ε) and Multi-View Stochastic Block Models expected degree dt . As stated in the introduction, existing efficient algorithms can achieve weak recovery for that distribution whenever (dt)ε2/k2 > Ω(1) , implying t Ω(k2) in the regime 4 < dε2 O(1) where weak recovery for each observation is possible. Amplifying the signal-to-noise ratio via black-box estimators The above approach of taking the union graph and then running community detection on it yields sub-optimal guarantees because the graph does not keep all information regarding the instance I . Our strategy to overcome this issue is to proceed in the reverse order: first extract as much information as possible from each graph, and then combine the data. Concretely, in the context of (d,ε,k,t)-MV-SBMn, our plan is to accurately estimate the matrix fℓ(z)fℓ(z)T for each graph Gℓ. Indeed, the polynomial P ℓ [t] fℓ(z)ifℓ(z)j strongly correlates with Jzi = zj K in the sense that ℓ [t] fℓ(z)ifℓ(z)j ℓ [t] fℓ(z)ifℓ(z)j In other words, we can accurately estimate whether zi = zj or not by accurately estimating the products P ℓ [t] fℓ(z)ifℓ(z)j . Now, we do not have access to the functions fℓ(z) but we can hope (see the subsequent paragraphs) that existing weak recovery algorithms can provide a close enough estimate in the sense ℓ [t] ˆx(Gℓ)iˆx(Gℓ)j ℓ [t] ˆx(Gℓ)iˆx(Gℓ)j If so, we may simply decide whether i, j should be clustered together based on how large P ℓ [t] ˆx(Gℓ)iˆx(Gℓ)j is. By independence of the observations, standard concentration of measure results tell us that Ω(log(n)/C2 d,ε) observations5 would suffice to exactly predict all the n2 pairs (and hence achieve exact recovery with this number of observations). Improvements via neighborhoods intersection Continuing with the above line of thinking, one can further improve the dependency on t to t = Θ(log k) as promised in Theorem 1.2. For a label p [k] , let cp(z) {0, 1}n be the indicator vector of the corresponding community. 
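Before turning to the refinement based on these indicator vectors, a minimal sketch of the amplification step just described may be useful. Here `view_estimates` stands for the output of any per-view weak-recovery routine with a correlation guarantee as in Equation (4); the names are illustrative, not the paper's code.

```python
import numpy as np

def pairwise_scores(view_estimates):
    """Aggregate per-view +/-1 label estimates into pairwise same-community scores.

    view_estimates is a list of t vectors in {+1, -1}^n, one per view. The entry
    B[i, j] = sum_l xhat_l[i] * xhat_l[j] is, in expectation, larger by about
    C * t when z_i == z_j than when z_i != z_j, mirroring the separation of
    sum_l f_l(z)_i * f_l(z)_j discussed above.
    """
    X = np.stack(view_estimates)    # shape (t, n)
    return X.T @ X

def same_community_guess(B, threshold):
    """Predict [[z_i == z_j]] by thresholding the aggregated scores."""
    return B >= threshold
```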
The improvement comes from observing that for ℓ = ℓ and for any typical labelling z (i.e. a labelling that is approximately 5This is better than Ω(k2) as long as k Ω( log n). balanced), we have large separation between the community indicator vectors cp(z) cp (z) 2 Ω(n/k) . The crucial consequence is that for a fixed index i [n], we do not need to guess correctly Jzi = zj K for all j and we may misclassify some pairs. Indeed if Ai {0, 1}n is a vector with entries (Ai)j that accurately predicts Jzi = zj K up to a ρ < n/k misclassification error, then we can deduce whether i, j come from the same community by verifying if Ai and Aj agree on the majority of their entries. Concretely, by the reverse triangle inequality it holds that | Ai Aj cp(z) cp (z) | cp(z) Ai + cp (z) Aj O(n/k) . That is, we are still able to exactly deduce whether i, j sit in the same community! The improvement over t then comes as O(log(k)/C2 δ ) observations suffice to bound the misclassification error by n/k times a tiny constant. Finally, we remark that we are bound to misclassify some vertices as for some vertices i the estimator vector Ai will not accurately represent its community when t O(log(k)/C2 δ ) (this is due to the well-known gap between weak recovery and exact recovery in the vanilla stochastic block model (Abbe, 2017)). The pair-wise weak recovery estimator So far, we glossed over the fact that we do not have an algorithm returning an accurate estimate ˆx(Gℓ)ˆx(Gℓ)T of fℓ(z)fℓ(z)T given the graph Gℓ. Notice that for a pair (x, G) SBMn,2,d,ε, an algorithm achieving weak recovery returns a vector ˆx(G) { 1}n such that Ω(n2) E ˆx(G), x 2 E ˆx(G)ˆx(G)T, xx T which is enough to obtain the separation required in Equation (7). Indeed, the above implies that on average E [ˆx(G)iˆx(G)j | xi = xj] E [ˆx(G)iˆx(G)j | xi = xj] which is enough to carry out the strategy outlined in the previous paragraphs. Notice that, a priori, it is not clear whether these estimators should work for (d,ε,k,t)-MV-SBMn as the model introduces some subtle difficulties compared to SBMn,2,d,ε . Most importantly, the labels in fℓ(z) and hence the edges in Gℓ are not pair-wise independent. We bypass this obstacle carrying out the analysis after conditioning on the choice of fℓ, so that, even though the communities may be unbalanced, the edges are again independent. Now, for highly unbalanced communities the expected degree of each vertex is an accurate predictor for its community, hence weak recovery is easy to achieve. On the other hand, one can treat slightly unbalanced communities as perturbed balanced communities and apply the node-robust algorithm of (Ding et al., 2023). Multi-View Stochastic Block Models Remark 2.1 (Connection with (Liu et al., 2022)). Liu, Moitra and Raghavendra studied a joint distribution model over hypergraphs with independence edges. In the special case of graphs, their model can be seen as a simpler version of Model 1.1 in which every fℓ( ) is known, and each dℓis a constant. For this model, (Liu et al., 2022) beats random guessing: They produce a unit vector that correlates with the community vector better than a random vector guess would. However, they provide no rounding strategy. The algorithmic techniques in (Liu et al., 2022) are very different from ours and do not imply a result of the form of Theorem 1.2 for Model 1.1. Information theoretic lower bounds for black-box algorithms The proof of Theorem 1.3 consists of two main steps. 
In the first step, we bound how much information an estimate ˆXℓwith pair-wise correlation Cℓcan reveal about z. We show that this can be bounded (in terms of mutual information) as I(z; ˆXℓ) O(Cℓ n). In the second step we use an adapted version of Fano s inequality to show that if ˆz is an estimate of z achieving (5), then ˆz must have at least Ω(n log k) bits of information about z, i.e., I(z; ˆz) Ω(n log k). Now if ˆz is obtained by only processing ( ˆXℓ)ℓ [t] , then by using the data processing inequality, we can show that Ω(n log k) X t [t] I(z; ˆXℓ) t [t] O(Cℓ n) O(CT t n) , where CT is the average of the correlations (Cℓ)ℓ [t] . From this we deduce that we must have t Ω log k 3. Specialized weak recovery for vanilla SBMs General weak recovery results (Abbe, 2017) for stochastic block models turn out to be too weak for our objective. We rely instead on the following stronger statement, which we obtain by exploiting the robust algorithms of (Ding et al., 2022), (Ding et al., 2023), and which also works down to the Kesten-Stigum threshold. This specialized estimator will be used in the main algorithm behind Theorem 1.2. Theorem 3.1 (Pair-wise weak recovery for unbalanced 2 communities stochastic block model). Let n, d, ε > 0 be satisfying dε2/4 1 > 0 . Let x = (xi)i [n] { 1}n be a sequence of i.i.d. binary random variables with P(xi = +1) = p . There exists a polynomial time algorithm such that, on input G SBMn,2,d,ε(x), returns with probability 1 o(1) a matrix ˆX(G) [ 1 , +1]n n satisfying i, j [n] Cd,ε E h ˆX(G)ij xi = xj i E h ˆX(G)ij xi = xj i for some constant Cd,ε > 0 . Notice that the algorithm only receives G, d, ε and hence it does not know p . Note also that in the above, if one of the events xi = xj or xi = xj has probability 0 (i.e., if p = 0 or p = 1), then we adopt the convention that the corresponding conditional expectation is 0. We defer the proof of Theorem 3.1 to Appendix C. 4. The algorithm In this section we prove Theorem 1.2. We start by describing the underlying algorithm. Throughout the section, for a given t , we assume T = {(dℓ, εℓ)}t ℓ=1 to be a sequence of t tuples (dℓ, εℓ) each satisfying dℓ ε2 ℓ/4 1 > 0 . For each ℓ [t], let Cℓbe the weak-recovery constant that is achievable for SBMn,dℓ,εℓin the sense of Theorem 3.1 (recall this constant depends only on dℓ, εℓ) and let ℓ [t] Cℓ> 0 . We also assume k n1 Ω(1) . Algorithm 1 Community detection for multi-view stochastic block models Input: k, G1, . . . , Gt . Output: Community vector ˆz in [k]n . For each Gℓwith ℓ [t], run the community detection algorithm of Theorem 3.1. Construct the directed graph F on the vertex set [n] as follows. for i = 1 to n do Add an outgoing edge to the n/k vertices j [n] with largest P ℓ [t] ˆX(Gℓ)ij . end for Run Algorithm 2 on the adjacency matrix A of F and return the resulting vector. Remark 4.1 (Running time). By Theorem 3.1, the first step of the algorithm takes time O t n O(1) . Computing the values of P ℓ [t] ˆX(Gℓ)ij for all pairs takes time O(t n2) . Drawing the edges takes time O(n2 log n). Hence the algorithm runs in time O (tn)O(1) + T where T is the running time of Algorithm 2. Graph structure on balanced instances Algorithm 1 will work on typical instances from (T ,k,t)-MV-SBMn. In particular, it will work on instances that are approximately balanced, as described below. Definition 4.2 (Balanced vector). Let z [k]n . 
We say that z is balanced if for all p [k] , it holds 1 n Ω(1) n k cp(z) 2 1 + n Ω(1) n Multi-View Stochastic Block Models If z is balanced we also say that I (T ,k,t)-MV-SBMn(z) is balanced. It is immediate to see that a random instance (z, (f1, G1), . . . (ft, Gt)) (T ,k,t)-MV-SBMn is balanced with high probability. Fact 4.3 (Probability of balanced instance). Let I := (z, (f1, G1), . . . (ft, Gt)) (T ,k,t)-MV-SBMn, . Then, with probability 1 n 10 , I is balanced. The proof of Fact 4.3 is straightforward and we defer it to Appendix C. On balanced instances with sufficiently many observations, the adjacency matrix A of F becomes a good approximation of the true community matrix P i [k] ci(z)ci(z)T. Lemma 4.4 (Graphs structure from good instances). Let n, k, t > 0. Let I := (z, (f1, G1), . . . (ft, Gt)). Then Algorithm 1 constructs an adjacency matrix A such that p [k] , i [n] E h Ai cp(z) 2 zi = p i O(n) n Ω(1) + e Ω( C2 t) . To prove Lemma 4.4 we require an intermediate step. Fact 4.5. Consider the setting of Lemma 4.4. There exists a constant C [ t, t] such that, for i, j [n], ℓ [t] ˆX(Gℓ)ij < C C t e Ω(C 2 t) , ℓ [t] ˆX(Gℓ)ij C C t e Ω(C 2 t) . We defer the proof of Fact 4.5 to Appendix C. We are now ready to prove Lemma 4.4. Proof of Lemma 4.4. Fix i [n] . We can limit our analysis after conditioning on the event E(z) that z is balanced. Indeed, we have E h Ai cp(z) 2 E(z)c , zi = p i P (E(z)) O(n) P (E(z)) n Ω(1) . Let C be the constant of Fact 4.5. Define ℓ [t] ˆX(Gℓ)ij , and let A i = (A ij)j [n] be the binary vector defined as A ij = s Bij C C t Notice that A i 1 is the number of indices j with Bij C C t 3 . Now from the definition of the binary vector Ai , we know that Aij = 1 if and only if Bij is among the top n/k values in {Bij : j [n]} . From this it is not hard to see that that Ai A i 1 can be bounded from above by how much A i 1 deviates from n/k , that is Ai A i 1 A i 1 n Now assuming that E(z) holds, we have cp(z) 1 = cp(z) 2 = n k n1 Ω(1) , Ai A i 1 A i 1 cp(z) 1 + n1 Ω(1) A i cp(z) 1 + n1 Ω(1) . Since cp(z) and Ai are binary vectors, we have Ai cp(z) 2 = Ai cp(z) 1 Ai A i 1 + A i cp(z) 1 , and hence, given E(z) , we have Ai cp(z) 2 2 A i cp(z) 1 + n1 Ω(1) . (8) Now notice that |(A i)j cp(z)j| = J(A i)j = 1KJcp(z)j = 0K + J(A i)j = 0KJcp(z)j = 1K = J(A i)j = 1KJzj = p K + J(A i)j = 0KJzj = p K . Now using Fact 4.5 and leveraging the fact that P (E(z)c | zi = p) O n 10 we get E [|(A i)j cp(z)j| | E(z), zi = p] e Ω(C 2 t) + O n 10 . Combining this with (8) we conclude that E h Ai cp(z) 2 E(z), zi = p i O(n) n Ω(1) + e Ω( C2 t) . Rounding Thanks to Lemma 4.4, an application of the following rounding scheme sometimes called second moment rounding, as one may see A as an estimate of P p cp(z)cp(z)T suffices to compute the true communities. Remark 4.6 (Running time). Each step in the outer loop takes time O(n2). Overall Algorithm 2 runs in O(k n2) . Multi-View Stochastic Block Models Algorithm 2 Second moment rounding Input: A matrix A {0, 1}n n . Output: Community vector ˆz in [n]k . Let S = . for p = 1 to k (stop earlier if S = [n]) do Pick uniformly at random i [n] \ S . Set Sp = {i} . for j [n] \ S , j = i do Set j Sp if Ai Aj 2 n k . end for Add Sp to S . end for Assign each i [n] \ S to a set Sp, where p is chosen uniformly at random. Return: the vector ˆz with ˆzi = p if and only if i Sp . As first step of the proof, we introduce a new definition. Definition 4.7 (Representative row). Let z [k]n, let c1(z), . . . 
, ck(z) be the indicator vectors of the labels in z and let A (z) = P p cp(z)cp(z)T . For a matrix A {0, 1}n n , we say Ai is q-representative if Ai A (z)i 2 n e q C 2 t . (9) We denote by Rq the set of q representatives of A . The use of representative rows is convenient because, for sufficiently many observations, it is easy to see that rows which are representative of the same community must be close together, while rows that are representative of different communities must be far from each other. Lemma 4.8. Let n, k, t > 0 , and let q > 0 be the hidden constant in Lemma 4.4. Let t C log k C 2 for a large enough universal constant C > 0. Let z [k]n be balanced. Suppose that at each iteration of the outer loop, Algorithm 2 picks some Ai that is a q-representative. Then maxπ Pk P i Rq(z)J ˆzi = π(zi)K = |Rq| , where Pk is the permutation group over [k] . We defer the proof of Lemma 4.8 to Appendix C. Now, to prove Theorem 1.2, it remains to argue that, with high probability, first there are few non-representatives in A , and second Algorithm 2 picks q-representatives at each iteration of the outer loop. Lemma 4.9. Let n, k, t > 0 , and let q > 0 be the hidden constant in Lemma 4.4. Let t = C log k C2 for a large enough universal constant C > 0. Let I := (z, (f1, G1), . . . (ft, Gt)) (T ,k,t)-MV-SBMn . Let A be the matrix constructed by Algorithm 1. On input A, with probability at least 1 k Ω(1), the following holds: there are at least n (1 k Ω(1)) rows in A which are Ω(q)-representatives, step 1.(a) of Algorithm 2 only picks Ω(q)- representative vectors. Proof. By Fact 4.3 we may assume z is balanced. By Lemma 4.4 and Markov s inequality, with probability 1 e Ω(q C 2 t) there are at most O(n) e Ω(q C 2 t) indices i [n] such that Ai is not a Ω(q)-representative. Note that if C is large enough, then e Ω(q C 2 t) = k Ω(1). At every iteration of the loop, if we pick a representative vector we remove at most (1 + o(1)) n k indices, since z is balanced. Hence at each iteration there are at least n k (1 o(1)) indices that are Ω(q)-representative and which have not yet been picked. It follows that the probability we never pick an index that is not Ω(q)-representative is at least 1 n k Ω(1) n/k k 1 k Ω(1) as desired (by choosing C to be large enough). Theorem 1.2 now immediately follows combining Fact 4.3, Lemma 4.4, Lemma 4.8 and Lemma 4.9. Exact recovery Lemma 4.9 also implicitly yield exact recovery for sufficiently many observations. Corollary 4.10. Let n, k, t > 0 , and let q > 0 be the hidden constant in Lemma 4.4. Let t C log n C2 for a large enough universal constant C > 0. Let I := (z, (f1, G1), . . . (ft, Gt)) (T ,k,t)-MV-SBMn . Let A be the matrix constructed by Algorithm 1. On input A, with probability at least 1 n Ω(1), all rows in A are Ω(q)- representatives. Proof. By Lemma 4.4 and Markov s inequality, with probability 1 e Ω(q C 2 t) there are at most O(n) e Ω(q C 2 t) indices i [n] such that Ai is not a Ω(q)-representative. Therefore, for t C C2 log n no such index exists. Since by Lemma 4.8 only indices corresponding to rows that are not Ω(q)-representative can be misclassified, by Fact 4.3, Lemma 4.4, Lemma 4.9 and Corollary 4.10 we obtain Corollary 1.4. 5. Experiments We show here experiments on synthetic data sampled from (d,ε,k,t)-MV-SBMn . The estimator in Theorem 3.1 is complex and relies on a high order sum-of-squares program, making it hard to implement in practice. 
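Although the sum-of-squares estimator of Theorem 3.1 is impractical, the surrounding pipeline is simple. The following sketch combines the graph construction of Algorithm 1 with the second-moment rounding of Algorithm 2, taking as input per-view ±1 estimates from an arbitrary weak-recovery routine (the rank-one special case of the matrices it consumes). It is an illustration under that assumption, not the exact code used in the experiments of Section 5.

```python
import numpy as np

def multi_view_cluster(view_estimates, k, rng=None):
    """Sketch of Algorithm 1 followed by Algorithm 2 (second-moment rounding).

    view_estimates is a list of t vectors in {+1, -1}^n produced by any per-view
    estimator. Thresholds follow the pseudocode above; constants are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.stack(view_estimates)                     # (t, n)
    n = X.shape[1]
    B = X.T @ X                                      # aggregated pairwise scores
    m = max(1, n // k)
    # Algorithm 1: directed graph F; row i points to the n/k indices j with largest B[i, j].
    A = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        A[i, np.argpartition(-B[i], m - 1)[:m]] = 1
    # Algorithm 2: pick a row, group all unassigned rows within Hamming distance n/k of it
    # (for 0/1 rows the Hamming distance equals the squared Euclidean distance).
    labels = np.full(n, -1)
    unassigned = list(range(n))
    for p in range(k):
        if not unassigned:
            break
        i = int(rng.choice(unassigned))
        cluster = [j for j in unassigned if np.abs(A[i] - A[j]).sum() <= n // k]
        labels[cluster] = p
        unassigned = [j for j in unassigned if labels[j] == -1]
    leftovers = labels == -1                         # remaining vertices get a random label
    labels[leftovers] = rng.integers(0, k, size=int(leftovers.sum()))
    return labels
```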
Nevertheless, it is reasonable to believe that: (i) the guarantees of Theorem 3.1 are only sufficient, but not necessary, to obtain Theorem 1.2; (ii) other estimators provide the guarantees of Theorem 3.1. For this reason, it makes sense to test Algorithm 1 with other community detection algorithms. Multi-View Stochastic Block Models The next figures compares the results on (z, (f1, G1), . . . , (ft, Gt)) (d,ε,k,t)-MV-SBMn (for a wide range of parameters) of the following algorithms: A.1 Louvain s algorithm (Blondel et al., 2008) on the union graph S A.2 Algorithm 1 with Louvain s algorithm applied in place of the estimator of Theorem 3.1 . The y-axis measures agreement as defined in Equation (2). Results are averaged over 20 simulations. Figure 1. Fixing t = 10, n = 1000, k = 10, d = 50 and varying ε in [0.5, 1.5]. Figure 2. Fixing t = 10, n = 1000, k = 10, ε = 0.5 and varying d in [50, 150]. 6. Conclusions and future directions The introduction of Model 1.1 raises several natural questions, for which we only provide initial answers. On the phase transition threshold One of the most interesting question concerns the phase transition of the model. Concretely, one may expect a rich interplay between the signal-to-noise ratio of each of the observed graphs (possibly below the relative KS threshold) and the number of observations required to weakly recover the hidden vector z. We leave the characterization of this trade-off beyond Theorem 1.2 and Theorem 1.3 as a fascinating open question. From 2 communities to k communities in the multi-view model Another natural question concern the generalization to a model in which each view may have 2 kℓ k communities. The ideas outlined above translate in principle to these settings but the correctness appears difficult to prove. Concretely, any estimator achieving guarantees comparable to Theorem 3.1 but for more than 2 communities can immediately be plugged-in Algorithm 1 to achieve weak recovery in these more general settings.6 However, to the best of our knowledge existing weak-recovery algorithms for SBMn,k,d,ε do not lead to estimators of this form. A first obstacle appears to be the fact that for each distribution the labels fℓ(z)1 . . . , fℓ(z)n are not independent. Existing estimators for stochastic block models instead crucially relies on the independence of the labels. Notice that independence holds when conditioning on fℓbut due to the high variance of random [k] [k ] mappings, an argument along these lines would require an analysis for communities whose distribution is not known a priori. This makes each observation Gℓmore akin to a stochastic block model with unknown parameters. These models have been studied in (Abbe & Sandon, 2015) in the context of partial recovery and exact recovery but are not known to achieve the notion of weak recovery we require. A second obstacle arising from the current proof structure is that existing robust algorithm for weak recovery which are used in Theorem 3.1 are only known to work for k = 2 . But a generalization of these algorithms to k > 2 appears to be difficult to analyze (the current proofs is already more than 200 pages long! (Ding et al., 2022)). Hence, it remains open how to provide guarantees for more general models. Acknowledgments We thank David Steurer for insightful discussions in the early stages of this work. Tommaso d Orsi is partially supported by the project MUR FARE2020 PARe Co Di. 
Impact Statement This paper presents work whose goal is to provide a theoretical foundation to machine learning heuristics often used in practice. We do not feel that any particular societal consequences of our work should be highlighted here. 6Without changing the proof of correctness of the algorithm! Multi-View Stochastic Block Models Abavisani, M. and Patel, V. M. Deep multimodal subspace clustering networks. IEEE Journal of Selected Topics in Signal Processing, 12(6):1601 1614, 2018. Abbe, E. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446 6531, 2017. Abbe, E. and Sandon, C. Recovering communities in the general stochastic block model without knowing the parameters. Advances in neural information processing systems, 28, 2015. Abbe, E. and Sandon, C. Achieving the ks threshold in the general stochastic block model with linearized acyclic belief propagation. Advances in Neural Information Processing Systems, 29, 2016a. Abbe, E. and Sandon, C. Crossing the ks threshold in the stochastic block model with information theory. In 2016 IEEE International Symposium on Information Theory (ISIT), pp. 840 844. IEEE, 2016b. Abbe, E., Bandeira, A. S., and Hall, G. Exact recovery in the stochastic block model. IEEE Transactions on information theory, 62(1):471 487, 2015. Banks, J., Moore, C., Neeman, J., and Netrapalli, P. Information-theoretic thresholds for community detection in sparse networks. In Conference on Learning Theory, pp. 383 416. PMLR, 2016. Bansal, N., Blum, A., and Chawla, S. Correlation clustering. Machine learning, 56:89 113, 2004. Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008. Bordenave, C., Lelarge, M., and Massouli e, L. Nonbacktracking spectrum of random graphs: community detection and non-regular ramanujan graphs. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 1347 1357. IEEE, 2015. Corneli, M., Latouche, P., and Rossi, F. Exact icl maximization in a non-stationary temporal extension of the stochastic block model for dynamic networks. Neurocomputing, 192:81 91, 2016. Dasgupta, S. A cost function for similarity-based hierarchical clustering. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 118 127, 2016. De Bacco, C., Power, E. A., Larremore, D. B., and Moore, C. Community detection, link prediction, and layer interdependence in multilayer networks. Phys. Rev. E, 95:042317, Apr 2017. doi: 10.1103/Phys Rev E.95. 042317. URL https://link.aps.org/doi/10. 1103/Phys Rev E.95.042317. De Santiago, K., Szafranski, M., and Ambroise, C. Mixture of stochastic block models for multiview clustering. In ESANN 2023-European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 151 156, 2023. Decelle, A., Krzakala, F., Moore, C., and Zdeborov a, L. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011. Ding, J., d Orsi, T., Nasser, R., and Steurer, D. Robust recovery for stochastic block models. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 387 394. IEEE, 2022. Ding, J., d Orsi, T., Hua, Y., and Steurer, D. Reaching kesten-stigum threshold in the stochastic block model under node corruptions. 
In The Thirty Sixth Annual Conference on Learning Theory, pp. 4044 4071. PMLR, 2023. Fang, U., Li, M., Li, J., Gao, L., Jia, T., and Zhang, Y. A comprehensive survey on multi-view clustering. IEEE Transactions on Knowledge and Data Engineering, 2023. Fu, L., Lin, P., Vasilakos, A. V., and Wang, S. An overview of recent multi-view clustering. Neurocomputing, 402: 148 161, 2020. Goldberg, A. V. Finding a maximum density subgraph. 1984. Gorovits, A., Gujral, E., Papalexakis, E. E., and Bogdanov, P. Larc: Learning activity-regularized overlapping communities across time. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1465 1474, 2018. Gujral, E. and Papalexakis, E. E. Smacd: Semi-supervised multi-aspect community detection. In Proceedings of the 2018 SIAM International Conference on Data Mining, pp. 702 710. SIAM, 2018. Gujral, E., Pasricha, R., and Papalexakis, E. Beyond rank1: Discovering rich community structure in multi-aspect graphs. In Proceedings of The Web Conference 2020, pp. 452 462, 2020. Han, Q., Xu, K., and Airoldi, E. Consistent estimation of dynamic and multi-layer block models. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Multi-View Stochastic Block Models Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1511 1520, Lille, France, 07 09 Jul 2015. PMLR. URL https:// proceedings.mlr.press/v37/hanb15.html. Hopkins, S. B. and Steurer, D. Efficient bayesian estimation from few samples: community detection and related problems. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 379 390. IEEE, 2017. Hu, D., Nie, F., and Li, X. Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9248 9257, 2019. Khan, A. and Maji, P. Approximate graph laplacians for multimodal data clustering. IEEE transactions on pattern analysis and machine intelligence, 43(3):798 813, 2019. Kim, M., Han, D. K., and Ko, H. Joint patch clusteringbased dictionary learning for multimodal image fusion. Information Fusion, 27:198 214, 2016. Liu, S., Mohanty, S., and Raghavendra, P. On statistical inference when fixed points of belief propagation are unstable. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 395 405. IEEE, 2022. Massouli e, L. Community detection thresholds and the weak ramanujan property. In Proceedings of the fortysixth annual ACM symposium on Theory of computing, pp. 694 703, 2014. Montanari, A. and Sen, S. Semidefinite programs on sparse random graphs and their application to community detection. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 814 827, 2016. Mossel, E., Neeman, J., and Sly, A. Belief propagation, robust reconstruction and optimal recovery of block models. In Conference on Learning Theory, pp. 356 370. PMLR, 2014. Mossel, E., Neeman, J., and Sly, A. Consistency thresholds for the planted bisection model. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 69 75, 2015a. Mossel, E., Neeman, J., and Sly, A. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162:431 461, 2015b. Mossel, E., Neeman, J., and Sly, A. A proof of the block model threshold conjecture. Combinatorica, 38(3):665 708, 2018. Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Analysis and an algorithm. 
Advances in neural information processing systems, 14, 2001. Ni, J., Cheng, W., Fan, W., and Zhang, X. Self-grouping multi-network clustering. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1119 1124. IEEE, 2016. Papalexakis, E. E., Akoglu, L., and Ience, D. Do more views of a graph help? community detection and clustering in multi-graphs. In Proceedings of the 16th International Conference on Information Fusion, pp. 899 905. IEEE, 2013. Paul, S. and Chen, Y. Consistent community detection in multi-relational data through restricted multi-layer stochastic blockmodel. 2016. Von Luxburg, U. A tutorial on spectral clustering. Statistics and computing, 17:395 416, 2007. Zhong, G. and Pun, C.-M. Latent low-rank graph learning for multimodal clustering. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 492 503. IEEE, 2021. Multi-View Stochastic Block Models A. Failure of community detection on the union graph In this section we provide rigorous evidence that efficient algorithm cannot achieve comparable guarantees to Theorem 1.2 by only considering the union graph S i [t] Gi for (z, (f1, G1), . . . (ft, Gt)) (d,ε,k,t)-MV-SBMn. Concretely, we prove the following theorem. Theorem A.1 (Limits of weak recovery from the union graph). Let n, k, d, ε > 0, assume that t 100 (log k)2, and let I := (z, (f1, G1), . . . (ft, Gt)) (d,ε,k,t)-MV-SBMn. With probability at least 1 k Ω(1) over the draws of f1, . . . , ft, the conditional distribution over (z , G1, . . . , Gt) satisfies: z is drawn uniformly at random from [k]n; each edge ij appears (independently) in S i [t] Gi with probability at most d k)ε ) if zi = zj and probability k) otherwise, for some d , ε such that k ε d dt 1 + ε 1 ε /k 1 + 1 1 In words, Theorem A.1 shows that the union graph S i [t] Gi is essentially a k-community stochastic block model with parameters d = Θ(dt) , ε = Θ(ε) . As discussed in Section 1, it is conjecturally hard to achieve weak recovery in polynomial time for d (ε /k)2 1 . In the context of Theorem A.1, this implies that the parameters of the distribution of S i [t] Gi are above the Kesten-Stigum threshold only for d (ε /k)2 Ω(1). That is, at least t Ω(k2) observations are required! Next we prove the theorem. Proof of Theorem A.1. Let q, q [k] be distinct. By Chernoff s bound and choice of t we have7 2(1 o(1)) X ℓ [t] Jfℓ(q) = fℓ(q )K t 2(1 + o(1)) 1 k 5 . (10) Hence we may take a union over all such pairs q, q [k] as the corresponding event E will hold with probability at least 1 k O(1) . So let f1 . . . , ft : [k] 1 be fixed functions verifying the event E of (10). We condition the rest of the analysis on f1 = f1, . . . , ft = ft . In these settings each edge appears in G := S i [t] Gi independently of others. Moreover, by union bound P (ij G | zi = zj , f1 = f1, . . . , ft = ft) 1 + ε P (ij G | zi = zj , f1 = f1, . . . , ft = ft) 2 (1 o(1)) 1 1 ε 2n(1 o(1)) 1 1 ε 2n(1 + o(1)) where we used the inequality (1 + s)r 1 + rs , for r > 1 , s > 1 . It remains to compute d and ε so that 7We can take o(1) to be t 1/4 and by Hoeffding s inequality we can bound the probability of the event E in (10) not happening by 2 exp( t) 2 exp( 10 log k) k 5. Multi-View Stochastic Block Models Rearranging the inequalities, k o(1) = dt 1 + ε k k ε 1 + 1 1 as desired. Remark A.2 (On the weighted union graph). A natural question to ask is whether the weighted union graph the graph over [n], in which edge ij has weight P ℓ [t]Jij GℓK could provide better guarantees. 
In the sparse settings d no(1) , t no(1) only a no(1) 1 fraction of the edges have weight larger than 1 and thus one may expect that this additional information does not simplify the problem. B. Information theoretic lower bound for blackbox algorithms In this section we would like to study, in the context of (T ,k,t)-MV-SBMn, the information-theoretic limitations for having an algorithm that (i) runs the procedure in Theorem 3.1 on each observation and (ii) uses the resulting matrices ( ˆXℓ)ℓ [t] to reconstruct the original k communities. Essentially, we would like to prove a formal version of Theorem 1.3. In order to do this, we first need to introduce some useful notation and terminology. Throughout the section let z [k]n be the vector of communities, and let (fℓ)ℓ [t] be t independent and uniformly distributed random mappings [k] {+1, 1}. For every ℓ [t] let xℓ= fℓ(z) and let Gℓ SBMn,2,dℓ,εℓ(fℓ(z)). We introduce a quantitative version of weak-recovery. Definition B.1 (αℓ-weak-recovery algorithm). We say that an algorithm8 ˆXℓtaking Gℓas input and producing an estimate ˆXℓ(Gℓ) [+1, 1]n n of xℓxℓT is an αℓ-weak-recovery algorithm if we have E h ˆXℓ(G)ij (xℓ)i = (xℓ)j i E h ˆX(G)ij (xℓ)i = (xℓ)j i αℓ, i, j [n] . (11) Clearly, the algorithm mentioned in Theorem 3.1 is a Cd,ε-weak-recovery algorithm. We are interested in determining the information-theoretic limits for estimating z based only on the outputs of an αℓ-weakrecovery algorithm when applied on the observations (Gℓ)ℓ [t]. To this end, let us introduce blackbox estimators: Definition B.2 (Blackbox estimator). A blackbox estimator for z is a mapping ˆz : ([+1, 1]n n)t [k]n . The blackbox estimator is applied as follows: For every ℓ [t] , we first compute ˆXℓ= ˆXℓ(Gℓ) , for some α-weak-recovery algorithm ˆXℓfor xℓfor which we do not know anything about except that it is an α-weak-recovery algorithm, and then compute ˆz = ˆz( ˆX1, . . . , ˆXt). We would like to guarantee that the blackbox estimator yields a successful weak-recovery of z using only the fact that ˆXℓ= ˆXℓ(Gℓ) satisfies (11) for every ℓ [t]. In order to formalize this, we will use the notion of α-estimates: Definition B.3 (α-estimates). Let α = (αℓ)ℓ [t] (0, 2]t be a sequence of t positive numbers. Let ˆX1, . . . , ˆXt [+1, 1]n n be t random matrices. We say that ( ˆXℓ)ℓ [t] are α-estimates of (xℓxℓT)ℓ [t] if they satisfy the following three conditions: (a) Given (xℓ)ℓ [t], the random matrices ( ˆXℓ)ℓ [t] are conditionally independent from z. (b) For every ℓ [t], given xℓ, the random matrix ˆXℓ is conditionally independent from (x1, . . . , xℓ 1, xℓ+1, . . . , xt, ˆX1, . . . , ˆXℓ 1, ˆXℓ+1, . . . , ˆXt). 8Note that ˆ Xℓmay be a randomized algorithm. Multi-View Stochastic Block Models (c) For every ℓ [t] and every i, j [n] with i = j we have E h ( ˆXℓ)ij (xℓ)i = (xℓ)j i E h ( ˆXℓ)ij (xℓ)i = (xℓ)j i αℓ, i, j [n] . (12) It is not hard to see that if ˆXℓis an αℓ-weak-recovery algorithms for every ℓ [t], then ( ˆX(Gℓ))ℓ [t] are α-estimates of (xℓxℓT)ℓ [t]. Now we are ready to formally define what we mean by guaranteeing that the blackbox estimator yields a successful weak-recovery of z using only the fact that ˆXℓ= ˆX(Gℓ) satisfies (11) for every ℓ [t] : Definition B.4 ((ρ, τ, α, t)-weak-recovery blackbox estimator). Let α = (αℓ)ℓ [t] (0, 2]t be a sequence of t positive numbers, and let ρ > 0 and τ > 0 . 
ˆz : ([ 1, +1]n n)t [k]n is said to be a (ρ, τ, α, t)-weak-recovery blackbox estimator for z from α-estimates of (xℓxℓT)ℓ [t] if for every t random matrices ( ˆXℓ)ℓ [t] which are α-estimates of (xℓxℓT)ℓ [t], if z = ˆz( ˆX1, . . . , ˆXt) , then with probability at least τ we have 1 n Jzi = π(ˆzi)K 1 k + ρ , (13) where Pk is the set of permutations [k] [k] . We are now ready to state the main theorem of the section, which implies Theorem 1.3. Theorem B.5 (Formal statement of Theorem 1.3). Let α = (αℓ)ℓ [t] (0, 2]t be a sequence of t positive numbers and denote Let ρ > 0 and τ > 0 . Let z [k]n be the uniformly random vector of communities, and let (fℓ)ℓ [t] be t independent and uniformly distributed random mappings [k] {+1, 1}. For every ℓ [t] let xℓ= fℓ(z). If there exists a (ρ, τ, α, t)-weak-recovery blackbox estimator for z from α-estimates of (xℓxℓT)ℓ [t] , and if n is large enoug, then we must have We prove Theorem B.5 in Appendix B.1, Appendix B.2, and Appendix B.3. We conclude this section showing how Algorithm 1 is indeed a blackbox estimator. Proof that Algorithm 1 is a blackbox estimator. We can split Algorithm 1 into two steps: (1) Computing ˆX1 = ˆX1(G1), . . . , ˆXt = ˆXt(Gt) by applying the algorithm in Theorem 3.1 to G1, . . . , Gt , respectively. (2) Computing an estimate ˆz of z based only on ( ˆXℓ)ℓ [t] . In step (1), since each of ˆXℓis applied to Gℓindependently of all other graphs and since Gℓdepends on z only through xℓ, it is not hard to see that ( ˆXℓ)ℓ [t] satisfy conditions (a) and (b) of Definition B.3. Now if αℓ= Cℓis the correlation guaranteed by Theorem 3.1 for ˆ Xℓ, it follows that ( ˆXℓ)ℓ [t] are α-estimates of (xℓxℓT)ℓ [t] , where α = (αℓ)ℓ [t] . Now since step (2) of Algorithm 1 only processes ( ˆXℓ)ℓ [t] , and since the guarantee on the agreement of ˆz with z is proved based only on the fact that ( ˆXℓ)ℓ [t] are α-estimates, we can see that, assuming that t is a large enough multiple of (log k)/α2 , step (2) is a (1 k Ω(1), 1 k Ω(1), α, t)-weak-recovery blackbox estimator. 9Note that ˆz may be a randomized function. Multi-View Stochastic Block Models B.1. Upper bound on the information revealed by α-estimates The first step in our proof is to determine how much information about z the α-estimates can reveal. Let ( ˆXℓ)ℓ [t] be α-estimates of (xℓxℓT)ℓ [t]. The mutual information (measured in bits) between z and ˆX1, . . . , ˆXt can be upper bounded as follows: I(z; ˆX1, . . . , ˆXt) ( ) I(x1, . . . , xt; ˆX1, . . . , ˆXt) = H( ˆX1, . . . , ˆXt) H( ˆX1, . . . , ˆXt|x1, . . . , xt) , (14) where ( ) follows from the data-processing inequality10. The entropy H( ˆX1, . . . , ˆXt) can be upper bounded as follows: H( ˆX1, . . . , ˆXt) X ℓ [t] H( ˆXℓ) . (15) Now using the chain rule, the conditional entropy H( ˆX1, . . . , ˆXt|x1, . . . , xt) can be rewritten as follows: H( ˆX1, . . . , ˆXt|x1, . . . , xt) = X ℓ [t] H( ˆXℓ|x1, . . . , xt, ˆX1, . . . , ˆXℓ 1) ℓ [t] H( ˆXℓ|xℓ) , (16) where the last equality follows from Property (b) of α-estimates. Combining (14), (15) and (16) we get I(z; ˆX1, . . . , ˆXt) X H( ˆXℓ) H( ˆXℓ|xℓ) l [t] I(xℓ; ˆXℓ) (17) Now for each ℓ [t], we will derive an upper bound on I(xℓ; ˆXℓ) (which would then induce an upper bound on I(z; ˆX1, . . . , ˆXt)). Note that we cannot obtain a non-trivial upper bound on I(xℓ; ˆXℓ) for arbitrary α-estimates because setting ˆXℓ= xℓxℓT would satisfy the definition of α-estimates, and for ˆXℓ= xℓxℓT we have I(xℓ; ˆXℓ) = n , which is too large for our purposes. 
What we will do instead is to show that there exist α-estimates for which we can get the desired upper bound on I(xℓ; ˆXℓ). The α-estimates that we will consider are of the form ˆXℓ= ˆxℓˆxℓT where ˆxℓ {+1, 1}n is defined as follows: P (ˆxℓ= ˆxℓ|xℓ= xℓ) = Y i [n] P ((ˆxℓ)i = (ˆxℓ)i|(xℓ)i = (xℓ)i) , P ((ˆxℓ)i = (ˆxℓ)i | (xℓ)i = (xℓ)i) = ( 1 2 + p αℓ 8 if (ˆxℓ)i = (xℓ)i , 1 2 p αℓ 8 if (ˆxℓ)i = (xℓ)i . In other words, we obtain ˆxℓby sending the entries of xℓthrough a binary symmetric channel with flipping probability 1 2 p αℓ 10Notice that due to Property (a) of α-estimates, z (xℓ)ℓ [t] (ˆxℓ)ℓ [t] is a Markov chain. Multi-View Stochastic Block Models Now notice that E [(ˆxℓ)i | (xℓ)i] = (xℓ)i P ((ˆxℓ)i = (xℓ)i | (xℓ)i) (xℓ)i P ((ˆxℓ)i = (xℓ)i | (xℓ)i) 8 (xℓ)i = rαℓ Hence, for i = j E h ( ˆXℓ)i,j (xℓ)i, (xℓ)j i = E [(ˆxℓ)i (ˆxℓ)j | (xℓ)i, (xℓ)j] = E [(ˆxℓ)i | (xℓ)i] E [(ˆxℓ)j | (xℓ)j] 2 (xℓ)i rαℓ 2 (xℓ)j = αℓ 2 (xℓ)i (xℓ)j , from which it is not hard to see that E h ( ˆXℓ)i,j (xℓ)i = (xℓ)j i E h ( ˆXℓ)i,j (xℓ)i = (xℓ)j i = αℓ This proves that our choice of ( ˆXℓ)ℓ [t] indeed yields α-estimates. In the remainder of this subsection we will show that this particular choice of α-estimates is noisy enough to yield a useful upper bound on the mutual information I(xℓ; ˆXℓ) . For every ℓ [t] we have I(xℓ; ˆXℓ) = I(xℓ; ˆxℓ) = H(ˆxℓ) H(ˆxℓ|xℓ) n H(ˆxℓ|xℓ) , (18) where the first equality follows from the fact that there is a one-to-one mapping between ˆxℓand ˆXℓ= ˆxℓˆxℓT , and the last inequality follows from the fact that ˆxℓ {+1, 1}n is a binary vector of length n . Now notice that the conditional distribution of ˆxℓgiven xℓcan be seen as a sequence of n independent Bernoulli random variables with parameter 1 8 . Therefore11, H(ˆxℓ|xℓ) = n h2 where h2(p) = p log2 p (1 p) log2(1 p) is the binary entropy function. Combining (18) and (19), we get I(xℓ; ˆXℓ) n 1 h2 Now note that the function p 7 h2(p) is a strictly concave function achieving its maximum at p = 1 2 for which we have h2(p) = 1. Therefore, for small αℓ, we have We conclude that there exists an absolute constant C > 0 such that I(xℓ; ˆXℓ) C αℓ n . Combining this with (17), we conclude that for some α-estimates ( ˆXℓ)ℓ [t] of (xx T ℓ)ℓ [t], we have I(z; ˆX1, . . . , ˆXt) X ℓ [t] C αℓ n = C α t n . (20) 11Notice that h2(p) = h2(1 p) and hence h2 1 Multi-View Stochastic Block Models B.2. Weakly recovering z reduces its entropy Now let ˆz [k]n be an estimate of z which satisfies (13) with probability at least τ. We will apply a modified version of the standard Fano inequality in order to upper bound H(z|ˆz). Define the random variable ( 1 if z and ˆz satisfy (13) , 0 otherwise . H(z|ˆz) H(A, z|ˆz) = H(A|ˆz) + H(z|ˆz, A) H(A) + H(z|ˆz, A = 0)P[A = 0] + H(z|ˆz, A = 1)P[A = 1] 1 + (n log2 k) P[A = 0] + H(z|ˆz, A = 1) P[A = 1] , where the last inequality follows from the fact that A is a binary random variable (and hence its entropy is at most one bit), and the fact that z [k]n, which implies that H(z|ˆz, A = 0) log2 (kn) = n log2 k. We conclude that H(z|ˆz) 1 + n log2 k (n log2 k H(z|ˆz, A = 1)) P[A = 1] 1 + n log2 k τ (n log2 k H(z|ˆz, A = 1)) , where the last inequality follows from the fact that P[A = 1] τ and the fact that n log2 k H(z|ˆz, A = 1) 0 (because z [k]n). Now since z is a uniform random variable in [k]n, we have H(z) = log2 (kn) = n log2 k , and hence I(z; ˆz) = H(z) H(z|ˆz) τ (n log2 k H(z|ˆz, A = 1)) 1 . (21) Now we will focus on upper bounding H(z|ˆz, A = 1). 
For every $\hat{z}\in[k]^n$, we have
$$H(z\mid\hat{z}=\hat{z},A=1) \le \log_2|Z(\hat{z},\rho)|\,, \qquad (22)$$
where
$$Z(\hat{z},\rho) = \left\{ z\in[k]^n : \max_{\pi\in P_k}\frac1n\sum_{i\in[n]}[\![\, z_i = \pi(\hat{z}_i)\,]\!] \ge \frac1k+\rho \right\} = \bigcup_{\pi\in P_k} Z(\hat{z},\rho,\pi)\,, \qquad Z(\hat{z},\rho,\pi) = \left\{ z\in[k]^n : \frac1n\sum_{i\in[n]}[\![\, z_i = \pi(\hat{z}_i)\,]\!] \ge \frac1k+\rho \right\}.$$
Hence, $|Z(\hat{z},\rho)| \le \sum_{\pi\in P_k}|Z(\hat{z},\rho,\pi)|$. We will further divide $Z(\hat{z},\rho,\pi)$ as follows:
$$Z(\hat{z},\rho,\pi) = \bigcup_{\beta n\le m\le n} Z(\hat{z},\rho,\pi,m)\,, \quad \text{where } \beta = \frac1k+\rho \text{ and } Z(\hat{z},\rho,\pi,m) = \left\{ z\in[k]^n : \sum_{i\in[n]}[\![\, z_i = \pi(\hat{z}_i)\,]\!] = m \right\}.$$
It is not hard to see that $|Z(\hat{z},\rho,\pi,m)| = \binom{n}{m}(k-1)^{n-m}$. By defining $\beta_m = \frac{m}{n}$ and using Stirling's formula (for large $n$, $\log_2(n!) = n\log_2 n - n\log_2 e + O(\log_2 n)$), we get:
$$\log_2|Z(\hat{z},\rho,\pi,m)| = n\log_2 n - n\log_2 e - m\log_2 m + m\log_2 e - (n-m)\log_2(n-m) + (n-m)\log_2 e + (n-m)\log_2(k-1) \pm O(\log n)$$
$$= n\log_2 n - \beta_m n\log_2(\beta_m n) - (1-\beta_m)n\log_2\big((1-\beta_m)n\big) + (1-\beta_m)n\log_2(k-1) \pm O(\log n) = \big(h_2(\beta_m) + (1-\beta_m)\log_2(k-1)\big)\,n \pm O(\log n)\,.$$
By taking derivatives and analyzing the function $g(\beta_m) = h_2(\beta_m) + (1-\beta_m)\log_2(k-1)$, we can show that $g$ is decreasing for $\beta_m\ge\frac1k$, and hence for $\beta_m\ge\beta\ge\frac1k$ we have $g(\beta_m)\le g(\beta)$. In particular, for every $m$ satisfying $\beta n\le m\le n$, we have
$$\log_2|Z(\hat{z},\rho,\pi,m)| \le \big(h_2(\beta) + (1-\beta)\log_2(k-1)\big)\,n + O(\log n)\,.$$
Therefore,
$$|Z(\hat{z},\rho)| \le \sum_{\pi\in P_k}\sum_{\beta n\le m\le n}|Z(\hat{z},\rho,\pi,m)| \le \sum_{\pi\in P_k}\sum_{\beta n\le m\le n} 2^{(h_2(\beta)+(1-\beta)\log_2(k-1))n + O(\log n)} = k!\cdot n\cdot 2^{(h_2(\beta)+(1-\beta)\log_2(k-1))n + O(\log n)}\,,$$
and hence
$$\log_2|Z(\hat{z},\rho)| \le \big(h_2(\beta) + (1-\beta)\log_2(k-1)\big)\,n + O(k\log k + \log n)\,.$$
Since this is true for every $\hat{z}\in[k]^n$, we get from (22) that
$$H(z\mid\hat{z},A=1) \le \big(h_2(\beta) + (1-\beta)\log_2(k-1)\big)\,n + O(k\log k + \log n)\,.$$
Combining this with (21), we get
$$I(z;\hat{z}) \ge \tau\Big(n\log_2 k - \big(h_2(\beta)+(1-\beta)\log_2(k-1)\big)n - O(\log n + k\log k)\Big) - 1 \ge \frac{\tau n}{2}\big(\log_2 k - h_2(\beta) - (1-\beta)\log_2(k-1)\big)\,, \qquad (23)$$
where the last inequality assumes that $n$ is large enough (and in particular $n\gg k\log k$). It is worth noting that if $\beta > 1/k$, then $\log_2 k - h_2(\beta) - (1-\beta)\log_2(k-1) > 0$, as we show in the next subsection.

B.3. Putting everything together

Proof of Theorem B.5. Assume that there is a $(\rho,\tau,\alpha,t)$-weak-recovery blackbox estimator $\hat{z}$ for $z$ and assume that $n$ is large enough. Let $(\hat{X}_\ell)_{\ell\in[t]}$ be $\alpha$-estimates of $(x_\ell x_\ell^T)_{\ell\in[t]}$ satisfying (20), i.e., $I(z;\hat{X}_1,\ldots,\hat{X}_t) \le C\bar{\alpha} t n$. Let $\hat{z} = \hat{z}(\hat{X}_1,\ldots,\hat{X}_t)$. From the data-processing inequality, we have
$$I(z;\hat{z}) \le I(z;\hat{X}_1,\ldots,\hat{X}_t) \le C\,\bar{\alpha}\, t\, n\,.$$
On the other hand, since $\hat{z}$ is a $(\rho,\tau,\alpha,t)$-weak-recovery blackbox estimator and since $(\hat{X}_\ell)_{\ell\in[t]}$ are $\alpha$-estimates of $(x_\ell x_\ell^T)_{\ell\in[t]}$, it follows that $\hat{z}$ satisfies (13) with probability at least $\tau$. It follows from (23) that for $n$ large enough, we have
$$I(z;\hat{z}) \ge \frac{\tau n}{2}\big(\log_2 k - h_2(\beta) - (1-\beta)\log_2(k-1)\big)\,.$$
We conclude that
$$\frac{\tau n}{2}\big(\log_2 k - h_2(\beta) - (1-\beta)\log_2(k-1)\big) \le C\,\bar{\alpha}\, t\, n\,.$$
Therefore, writing
$$l(\beta) = \log_2 k - h_2(\beta) - (1-\beta)\log_2(k-1) = \log_2 k + \beta\log_2\beta + (1-\beta)\log_2(1-\beta) - (1-\beta)\log_2(k-1)\,,$$
we must have
$$t \ge \frac{\tau\, l(\beta)}{2C\bar{\alpha}} = \frac{\tau\, l(1/k+\rho)}{2C\bar{\alpha}}\,.$$
Let us analyze the function $l(\beta)$. A quick calculation shows that $l(1/k) = 0$. The derivative of $l$ is
$$l'(\beta) = \log_2\beta + \frac{1}{\ln 2} - \log_2(1-\beta) - \frac{1}{\ln 2} + \log_2(k-1) = \log_2\beta - \log_2(1-\beta) + \log_2(k-1)\,,$$
and hence $l'(1/k) = 0$. The second derivative of $l$ is
$$l''(\beta) = \frac{1}{\beta\ln 2} + \frac{1}{(1-\beta)\ln 2} > 0\,, \qquad \forall\beta\in(0,1)\,.$$
Hence, $l'(\beta) > 0$ for $\beta\in(1/k,1)$, and since $l(1/k) = 0$ we can see that $l(1/k+\rho) > 0$ whenever $\rho > 0$. Furthermore, a quick calculation reveals that for fixed $\rho$ we have
$$\lim_{k\to\infty}\frac{l(1/k+\rho)}{\log_2 k} = \rho > 0$$
(note that $-\beta\log_2\beta - (1-\beta)\log_2(1-\beta) = h_2(\beta)$ and since $0\le h_2(\beta)\le 1$ we have $\lim_{k\to\infty}\frac{h_2(\beta)}{\log_2 k} = 0$; hence $\lim_{k\to\infty}\frac{l(1/k+\rho)}{\log_2 k}$ simplifies to $\lim_{k\to\infty}\big(1 - (1-\rho-\frac1k)\frac{\log_2(k-1)}{\log_2 k}\big) = \rho$), which means that
$$\min_{k\ge 2}\frac{l(1/k+\rho)}{\log_2 k} > 0\,,$$
and hence $l(1/k+\rho) = \Omega_\rho(\log_2 k)$. We conclude that
$$t \ge \frac{\tau\, l(1/k+\rho)}{2C\bar{\alpha}} = \Omega_{\rho,\tau}\!\left(\frac{\log_2 k}{\bar{\alpha}}\right),$$
which completes the proof of Theorem B.5.
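As a minimal numerical companion to the analysis of $l(\beta)$ above (all constants below, in particular $\tau$, $\rho$, $C$, and $\bar{\alpha}$, are illustrative placeholders rather than values from the paper):

```python
import math

def h2(p):
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def l(beta, k):
    """l(beta) = log2(k) - h2(beta) - (1 - beta) * log2(k - 1)."""
    return math.log2(k) - h2(beta) - (1 - beta) * math.log2(k - 1)

rho = 0.1
for k in [2, 4, 16, 256]:
    beta = 1.0 / k + rho
    print(f"k={k:4d}: l(1/k)={l(1.0 / k, k):+.4f}  "
          f"l(1/k+rho)={l(beta, k):.4f}  "
          f"l(1/k+rho)/log2(k)={l(beta, k) / math.log2(k):.3f}")

# Illustrative lower bound on the number of views t from the derivation above.
tau, C, alpha_bar, k = 0.9, 1.0, 0.005, 16
print(f"t >= {tau * l(1.0 / k + rho, k) / (2 * C * alpha_bar):.1f}  (illustrative)")
```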
C. Deferred proofs

We present here proofs deferred from the main body of the paper.

Deferred proofs of Section 3. To obtain Theorem 3.1 we need to introduce results about robust weak recovery.

Definition C.1 (µ-node corrupted, balanced 2-community SBM). Let $\mu\in[0,1]$. Let $x\in\{\pm1\}^n$ be a vector satisfying $\sum_i x_i = 0$ and let $G_0\sim\mathrm{SBM}_{2,d,\varepsilon}(x)$. An adversary may choose up to $\mu n$ vertices in $G_0$ and arbitrarily modify edges (and non-edges) incident to at least one of them to produce the corrupted graph $G$. We write $G\sim_\mu G_0$ to denote that $G$ is a $\mu$-node corrupted version of $G_0$.

In the context of node-corrupted graphs, the definition of weak recovery is still with respect to the original vector $x$ as defined in Equation (1). It is known that node-robust weak recovery is achievable.

Theorem C.2 (Implicit in (Ding et al., 2023)). Let $n, d, \varepsilon > 0$ satisfy $\frac{\varepsilon^2 d}{4} - 1 =: \delta > 0$. There exist constants $0 < \mu_\delta < 1$ and $0 < C_\delta < 1$, and a (randomized) polynomial-time algorithm $\hat{X}_r$ (the subscript $r$ stands for "robust") taking a graph $G$ on $n$ vertices as input and producing a matrix $\hat{X}_r(G)\in[-1,+1]^{n\times n}$ as output, such that $\hat{X}_r$ is a successful weak-recovery algorithm robust against any $\mu$-node corruption for all $\mu\le\mu_\delta$. More formally, for every $x\in\{\pm1\}^n$ satisfying $\sum_i x_i = 0$ and every $\mu\le\mu_\delta$, we have
$$\mathop{E}_{G_0\sim\mathrm{SBM}_{2,d,\varepsilon}(x)}\left[\min_{G:\,G\sim_\mu G_0}\big\langle\hat{X}_r(G), xx^T\big\rangle\right] \ge C_\delta\, n^2\,.$$

We can use Theorem C.2 to obtain Theorem 3.1. Let $\gamma = |p - \frac12|$ be the unbalancedness of the vector of labels $x$, where $p = P(x_i = +1)$. The main idea behind the algorithm in Theorem 3.1 is to first distinguish whether the unbalancedness $\gamma$ is sufficiently small or not. If it is sufficiently small, then we apply the robust algorithm of Theorem C.2. Otherwise, we can achieve weak recovery by relying on the degree of a vertex to estimate its community label. In the following two lemmas, we treat the case where the unbalancedness $\gamma$ is sufficiently small.

Lemma C.3. Let $n, d, \varepsilon, p, x = (x_i)_{i\in[n]}$ and $G\sim\mathrm{SBM}_{n,2,d,\varepsilon}(x)$ be as in Theorem 3.1 and let $\delta = \frac{\varepsilon^2 d}{4} - 1 > 0$. Let $\mu_\delta, C_\delta$ and $\hat{X}_r$ be as in Theorem C.2 and define $\mu'_\delta = \frac{1}{100}\min\{\mu_\delta, C_\delta\}$. If the unbalancedness $\gamma = |\frac12 - p|$ of $x$ satisfies $\gamma\le\mu'_\delta$, then for $n$ large enough, we have
$$E\big[\langle\hat{X}_r(G), xx^T\rangle\big] \ge \frac{3C_\delta}{4}\, n^2\,.$$

(It is worth noting that since Theorem C.2 assumes that $\sum_i x_i = 0$, $n$ must be even in Theorem C.2. However, in Lemma C.3 we would like $n$ to be general. Hence, if $n$ is odd, we apply the following procedure: (1) we add a fictitious vertex $n+1$ which is not incident to any vertex in $[n]$ and call the resulting graph, having $[n+1]$ as its set of vertices, $\tilde{G}$; (2) we apply $\hat{X}_r$ on $\tilde{G}$; and (3) we take the submatrix of $\hat{X}_r(\tilde{G})$ induced by the vertices in $[n]$. We still denote the overall algorithm by $\hat{X}_r$.)

Proof. For the sake of simplicity, we will only treat the case where $n$ is even; since $\hat{X}_r$ is robust against node corruptions, it is not hard to see that the proof can be adapted to the case where $n$ is odd. For every $x\in\{\pm1\}^n$, let $n_+(x) = |\{i\in[n] : x_i = +1\}|$ and $n_-(x) = |\{i\in[n] : x_i = -1\}|$. Since $P(x_i = +1) = p$, by the law of large numbers we know that $n_+(x)$ and $n_-(x)$ concentrate around $pn$ and $(1-p)n$, respectively.
Furthermore, since $\gamma = |\frac12 - p|\le\mu'_\delta$, we can use standard concentration inequalities to show that with probability at least $1 - O(n^{-10})$, the random vector $x$ satisfies the event
$$\mathcal{E} = \left\{ x\in\{\pm1\}^n : \Big|n_+(x) - \frac{n}{2}\Big| \le 2\mu'_\delta n \ \text{ and } \ \Big|n_-(x) - \frac{n}{2}\Big| \le 2\mu'_\delta n \right\}.$$
Now since $P(x\in\mathcal{E}) = 1 - O(n^{-10})$ and since $|\langle\hat{X}_r(G), xx^T\rangle|\le n^2$, it is not hard to see that
$$E\big[\langle\hat{X}_r(G), xx^T\rangle\big] \ge E\big[\langle\hat{X}_r(G), xx^T\rangle \mid x\in\mathcal{E}\big] - o(1)\,, \qquad (24)$$
so we can focus on studying $E[\langle\hat{X}_r(G), xx^T\rangle \mid x\in\mathcal{E}]$. Now fix $x\in\mathcal{E}$ and condition on the event that $x = x$. From the definition of $\mathcal{E}$ it is not hard to see that there is $x'\in\{\pm1\}^n$ satisfying $\sum_{i\in[n]}x'_i = 0$ and $|D_{x,x'}|\le 4\mu'_\delta n$, where $D_{x,x'} = \{i\in[n] : x_i\ne x'_i\}$. Now construct a random graph $G'$ as follows:
- If $i,j\in[n]\setminus D_{x,x'}$, i.e., if $x_i = x'_i$ and $x_j = x'_j$, then we let $\{i,j\}\in G'$ if and only if $\{i,j\}\in G$.
- If either $i\in D_{x,x'}$ or $j\in D_{x,x'}$, then we put the edge $\{i,j\}$ in $G'$ with probability $(1+\frac{\varepsilon}{2})\frac{d}{n}$ if $x'_i = x'_j$ and with probability $(1-\frac{\varepsilon}{2})\frac{d}{n}$ otherwise.
- The events $(\{i,j\}\in G')_{i,j\in[n]}$ are mutually independent.

It is not hard to see that $G'\sim\mathrm{SBM}_{n,2,d,\varepsilon}(x')$ and $G\sim_{4\mu'_\delta}G'$, i.e., $G$ can be obtained from $G'$ by adding or removing edges incident to at most $4\mu'_\delta n$ vertices. Since $4\mu'_\delta\le\frac{4}{100}\mu_\delta\le\mu_\delta$, it follows from Theorem C.2 that
$$E\big[\langle\hat{X}_r(G), x'x'^T\rangle \mid x = x\big] \ge C_\delta\, n^2\,. \qquad (25)$$
Now notice that
$$\langle\hat{X}_r(G), xx^T\rangle = \langle\hat{X}_r(G), x'x'^T\rangle + \langle\hat{X}_r(G), xx^T - x'x'^T\rangle\,,$$
and
$$\big|\langle\hat{X}_r(G), xx^T - x'x'^T\rangle\big| \le \|\hat{X}_r(G)\|_\infty\,\|xx^T - x'x'^T\|_1 \le \|xx^T - x'x'^T\|_1 \le 2\,\|x - x'\|_1\, n = 4\,|D_{x,x'}|\, n \le 16\mu'_\delta\, n^2\,.$$
Combining this with (25), we get
$$E\big[\langle\hat{X}_r(G), xx^T\rangle \mid x = x\big] \ge (C_\delta - 16\mu'_\delta)\,n^2 \ge \frac{84}{100}\,C_\delta\, n^2\,.$$
Now since this is true for all $x\in\mathcal{E}$, we conclude that
$$E\big[\langle\hat{X}_r(G), xx^T\rangle \mid x\in\mathcal{E}\big] \ge (C_\delta - 16\mu'_\delta)\,n^2 \ge \frac{84}{100}\,C_\delta\, n^2\,,$$
where the last inequality is true because $\mu'_\delta = \frac{1}{100}\min\{\mu_\delta, C_\delta\}$. Combining the above with (24), we can deduce that for $n$ large enough, we have
$$E\big[\langle\hat{X}_r(G), xx^T\rangle\big] \ge \frac{3C_\delta}{4}\, n^2\,.$$

The following lemma takes the algorithm of Lemma C.3 and applies a symmetrization argument in order to get a positive correlation at the "edge level".

Lemma C.4 (Pair-wise weak recovery for the sufficiently balanced 2-community stochastic block model). Let $n, d, \varepsilon, p, x = (x_i)_{i\in[n]}$ and $G\sim\mathrm{SBM}_{n,2,d,\varepsilon}(x)$ be as in Theorem 3.1. Let $\delta = \frac{\varepsilon^2 d}{4} - 1 > 0$ and let $\gamma = |\frac12 - p|$ be the unbalancedness of $x$. There exist constants $\mu'_\delta > 0$ and $C'_\delta > 0$ and a randomized polynomial-time algorithm $\hat{X}_{sb}$ (the subscript $sb$ stands for "sufficiently balanced") taking $G$ as input and producing a matrix $\hat{X}_{sb}(G)\in[-1,+1]^{n\times n}$ such that if $\gamma\le\mu'_\delta$, then for every $i,j\in[n]$ with $i\ne j$, we have
$$E\big[\hat{X}_{sb}(G)_{ij} \mid x_i = x_j\big] - E\big[\hat{X}_{sb}(G)_{ij} \mid x_i \ne x_j\big] \ge C'_\delta\,.$$

Proof. Let $\mu_\delta, C_\delta$ and $\hat{X}_r$ be as in Theorem C.2, and let $\mu'_\delta = \frac{1}{100}\min\{\mu_\delta, C_\delta\}$. Lemma C.3 shows that $E[\langle\hat{X}_r(G), xx^T\rangle]\ge\frac{3C_\delta}{4}n^2$. The algorithm $\hat{X}_{sb}$ is obtained by symmetrizing the algorithm $\hat{X}_r$ as follows: Let $\sigma$ be a (uniformly) random permutation $[n]\to[n]$ and let $G^\sigma$ be the graph obtained from $G$ by $\sigma$-permuting its vertices, i.e., we let the edge $\{\sigma_i,\sigma_j\}$ belong to $G^\sigma$ if and only if $\{i,j\}\in G$ (for simplicity, we denote $\sigma(i)$ by $\sigma_i$). We define the matrix $\hat{X}_{sb}(G)\in[-1,+1]^{n\times n}$ as follows:
$$\hat{X}_{sb}(G)_{ij} = \hat{X}_r(G^\sigma)_{\sigma_i\sigma_j}\,.$$
In other words, we apply the random permutation $\sigma$ to the graph $G$, we apply the algorithm $\hat{X}_r$, and then we apply the inverse of the permutation on the resulting matrix. Due to the symmetry of the SBM distribution, it is not hard to see that for every $i,j\in[n]$ with $i\ne j$, we have
$$E\big[\hat{X}_{sb}(G)_{ij}\, x_i x_j\big] = \sum_{\substack{i',j'\in[n]:\\ i'\ne j'}} E\big[\hat{X}_{sb}(G)_{ij}\, x_i x_j \mid \sigma_i = i', \sigma_j = j'\big]\, P(\sigma_i = i', \sigma_j = j') = \frac{1}{n(n-1)}\sum_{\substack{i',j'\in[n]:\\ i'\ne j'}} E\big[\hat{X}_r(G^\sigma)_{\sigma_i\sigma_j}\, x_i x_j \mid \sigma_i = i', \sigma_j = j'\big] \overset{(*)}{=} \frac{1}{n(n-1)}\sum_{\substack{i',j'\in[n]:\\ i'\ne j'}} E\big[\hat{X}_r(G)_{\sigma_i\sigma_j}\, x_{\sigma_i} x_{\sigma_j} \mid \sigma_i = i', \sigma_j = j'\big]$$
$$= \frac{1}{n(n-1)}\sum_{\substack{i',j'\in[n]:\\ i'\ne j'}} E\big[\hat{X}_r(G)_{i'j'}\, x_{i'} x_{j'}\big] = \frac{1}{n(n-1)}\,E\big[\langle\hat{X}_r(G), xx^T\rangle\big] - \frac{1}{n(n-1)}\sum_{i\in[n]} E\big[\hat{X}_r(G)_{ii}\, x_i^2\big] \ge \frac{3C_\delta}{4}\cdot\frac{n^2}{n(n-1)} - \frac{1}{n-1} \ge \frac{3C_\delta}{4} - \frac{1}{n-1}\,,$$
where $(*)$ follows from the symmetry of the SBM distribution under the simultaneous permutation of vertices and labels. Now notice that
$$E\big[\hat{X}_{sb}(G)_{ij}\, x_i x_j\big] = E\big[\hat{X}_{sb}(G)_{ij} \mid x_i = x_j\big]\,P(x_i = x_j) - E\big[\hat{X}_{sb}(G)_{ij} \mid x_i \ne x_j\big]\,P(x_i \ne x_j)$$
$$= E\big[\hat{X}_{sb}(G)_{ij} \mid x_i = x_j\big]\big(p^2 + (1-p)^2\big) - E\big[\hat{X}_{sb}(G)_{ij} \mid x_i \ne x_j\big]\big(2p(1-p)\big)$$
$$= E\big[\hat{X}_{sb}(G)_{ij} \mid x_i = x_j\big]\Big(\frac12 + 2\gamma^2\Big) - E\big[\hat{X}_{sb}(G)_{ij} \mid x_i \ne x_j\big]\Big(\frac12 - 2\gamma^2\Big)$$
$$\le \frac12\,E\big[\hat{X}_{sb}(G)_{ij} \mid x_i = x_j\big] + 2\gamma^2 - \frac12\,E\big[\hat{X}_{sb}(G)_{ij} \mid x_i \ne x_j\big] + 2\gamma^2\,,$$
where in the last inequality we used the fact that $|\hat{X}_{sb}(G)_{ij}|\le 1$. We can deduce that for $n$ large enough, we have
$$E\big[\hat{X}_{sb}(G)_{ij} \mid x_i = x_j\big] - E\big[\hat{X}_{sb}(G)_{ij} \mid x_i \ne x_j\big] \ge 2\,E\big[\hat{X}_{sb}(G)_{ij}\, x_i x_j\big] - 8\gamma^2 \ge \frac{3}{2}C_\delta - o(1) - 8\gamma^2 \ge C_\delta\,,$$
where the last inequality follows from the fact that $8\gamma^2\le 8\mu_\delta'^2\le\frac{8}{10^4}C_\delta^2\le\frac{C_\delta}{4}$ and that $n$ is large enough. By picking $C'_\delta = C_\delta$, the lemma follows.

Now we turn to show that if $x$ is sufficiently unbalanced, then there exists an efficient algorithm that achieves pair-wise weak recovery.

Lemma C.5 (Pair-wise weak recovery for the sufficiently unbalanced 2-community stochastic block model). Let $n, d, \varepsilon, p, x = (x_i)_{i\in[n]}$ and $G\sim\mathrm{SBM}_{n,2,d,\varepsilon}(x)$ be as in Theorem 3.1. Further assume that $p\in[0,1]$. Let $\delta = \frac{\varepsilon^2 d}{4} - 1 > 0$, let $\gamma = |\frac12 - p|$ be the unbalancedness of $x$, and let $\mu'_\delta$ be as in Lemma C.4. There exists a constant $C''_\delta > 0$ and a randomized polynomial-time algorithm $\hat{X}_{su}$ (the subscript $su$ stands for "sufficiently unbalanced") taking $G$ as input and producing a matrix $\hat{X}_{su}(G)\in[-1,+1]^{n\times n}$ such that if $\gamma\ge\frac{\mu'_\delta}{2}$, then for every $i,j\in[n]$ with $i\ne j$, we have
$$E\big[\hat{X}_{su}(G)_{ij} \mid x_i = x_j\big] - E\big[\hat{X}_{su}(G)_{ij} \mid x_i \ne x_j\big] \ge C''_\delta\,.$$

Proof. For the sake of simplicity we may assume without loss of generality that $p > \frac12$, i.e., $p = \frac12+\gamma$, and hence $+1$ is the larger community in expectation. For every $i,j\in[n]$, let $\deg_{\ne j}(i) = |\{v\in[n]\setminus\{i,j\} : \{i,v\}\in G\}|$ be the number of vertices in $[n]\setminus\{i,j\}$ which are adjacent to $i$ in $G$. Let $C > 0$ be a large enough constant (to be chosen later) and let
$$\hat{x}^{(j)}_i(G) = \frac1C\Big(\deg_{\ne j}(i) - d(1-2/n)\Big)\,\Big[\!\!\Big[\,\big|\deg_{\ne j}(i) - d(1-2/n)\big| \le C\,\Big]\!\!\Big] \;\in\; [-1,1]\,,$$
and define the matrix $\hat{X}_{su}(G)\in[-1,+1]^{n\times n}$ as
$$\hat{X}_{su}(G)_{ij} = \hat{x}^{(j)}_i(G)\cdot\hat{x}^{(i)}_j(G)\,.$$
It is not hard to see that given $(x_i,x_j)$, the random variables $\deg_{\ne j}(i)$ and $\deg_{\ne i}(j)$ are conditionally independent. Therefore, $\hat{x}^{(j)}_i(G)$ and $\hat{x}^{(i)}_j(G)$ are conditionally independent given $(x_i,x_j)$, hence
$$E\big[\hat{X}_{su}(G)_{ij} \mid x_i, x_j\big] = E\big[\hat{x}^{(j)}_i(G) \mid x_i, x_j\big]\cdot E\big[\hat{x}^{(i)}_j(G) \mid x_i, x_j\big] = E\big[\hat{x}^{(j)}_i(G) \mid x_i\big]\cdot E\big[\hat{x}^{(i)}_j(G) \mid x_j\big]\,.$$
Now, for every $v\in[n]\setminus\{i,j\}$, we have
$$P\big(\{i,v\}\in G \mid x_i\big) = E\big[P(\{i,v\}\in G \mid x_i, x_v) \mid x_i\big] = \frac{d}{n}\Big(1 + \frac{\varepsilon}{2}\,x_i\,E[x_v]\Big) = \frac{d}{n}\Big(1 + \frac{\varepsilon}{2}\big(p - (1-p)\big)\,x_i\Big) = \frac{d}{n}\big(1 + \varepsilon\gamma\, x_i\big)\,.$$
Therefore, the conditional distribution of $\deg_{\ne j}(i)$ given $x_i$ is $\mathrm{Binomial}\big(n-2, \frac{d}{n}(1+\varepsilon\gamma x_i)\big)$, hence
$$E\big[\deg_{\ne j}(i) \mid x_i\big] = (n-2)\,\frac{d}{n}\,(1+\varepsilon\gamma x_i) = d(1-2/n)(1+\varepsilon\gamma x_i)\,,$$
and
$$E\Big[\frac1C\big(\deg_{\ne j}(i) - d(1-2/n)\big) \,\Big|\, x_i\Big] = \frac{d(1-2/n)\,\varepsilon\gamma\, x_i}{C}\,. \qquad (26)$$
On the other hand, by the Cauchy–Schwarz inequality, we have
$$\Big|E\Big[\hat{x}^{(j)}_i(G) - \frac1C\big(\deg_{\ne j}(i) - d(1-2/n)\big) \,\Big|\, x_i\Big]\Big| = \frac1C\,\Big|E\Big[\big(\deg_{\ne j}(i) - d(1-2/n)\big)\,\big[\!\big[\,|\deg_{\ne j}(i) - d(1-2/n)| > C\,\big]\!\big] \,\Big|\, x_i\Big]\Big| \le \frac1C\, E\Big[\big(\deg_{\ne j}(i) - d(1-2/n)\big)^2 \,\Big|\, x_i\Big]^{1/2}\, P\Big(\big|\deg_{\ne j}(i) - d(1-2/n)\big| > C \,\Big|\, x_i\Big)^{1/2}\,,$$
and by Chebyshev's inequality, we have
$$P\Big(\big|\deg_{\ne j}(i) - d(1-2/n)\big| > C \,\Big|\, x_i\Big) \le \frac{1}{C^2}\,E\Big[\big(\deg_{\ne j}(i) - d(1-2/n)\big)^2 \,\Big|\, x_i\Big]\,.$$
Therefore,
$$\Big|E\Big[\hat{x}^{(j)}_i(G) - \frac1C\big(\deg_{\ne j}(i) - d(1-2/n)\big) \,\Big|\, x_i\Big]\Big| \le \frac{1}{C^2}\,E\Big[\big(\deg_{\ne j}(i) - d(1-2/n)\big)^2 \,\Big|\, x_i\Big]\,. \qquad (27)$$
Now notice that
$$E\Big[\big(\deg_{\ne j}(i) - d(1-2/n)\big)^2 \,\Big|\, x_i\Big] = E\Big[\big(\deg_{\ne j}(i) - d(1-2/n)(1+\varepsilon\gamma x_i) + \varepsilon\gamma x_i\, d(1-2/n)\big)^2 \,\Big|\, x_i\Big] \overset{(*)}{\le} 2\,E\Big[\big(\deg_{\ne j}(i) - d(1-2/n)(1+\varepsilon\gamma x_i)\big)^2 \,\Big|\, x_i\Big] + 2d^2\varepsilon^2\gamma^2\,,$$
where $(*)$ is true because $(a+b)^2\le 2a^2 + 2b^2$ for all $a,b\in\mathbb{R}$. Now since the conditional distribution of $\deg_{\ne j}(i)$ given $x_i$ is $\mathrm{Binomial}\big(n-2, \frac{d}{n}(1+\varepsilon\gamma x_i)\big)$, its conditional expectation is equal to $d(1-2/n)(1+\varepsilon\gamma x_i)$ and its conditional variance is equal to
$$E\Big[\big(\deg_{\ne j}(i) - d(1-2/n)(1+\varepsilon\gamma x_i)\big)^2 \,\Big|\, x_i\Big] = (n-2)\,\frac{d}{n}\,(1+\varepsilon\gamma x_i)\Big(1 - \frac{d}{n}(1+\varepsilon\gamma x_i)\Big) \le d(1+\varepsilon\gamma) \le 2d\,.$$
Therefore,
$$E\Big[\big(\deg_{\ne j}(i) - d(1-2/n)\big)^2 \,\Big|\, x_i\Big] \le 4d + 2d^2\varepsilon^2\gamma^2\,,$$
and hence from (27) we get
$$\Big|E\Big[\hat{x}^{(j)}_i(G) - \frac1C\big(\deg_{\ne j}(i) - d(1-2/n)\big) \,\Big|\, x_i\Big]\Big| \le \frac{4d + 2d^2\varepsilon^2\gamma^2}{C^2}\,.$$
Combining this with (26) we get
$$E\big[\hat{x}^{(j)}_i(G) \mid x_i\big] = \frac{d(1-2/n)\,\varepsilon\gamma\, x_i}{C} \pm O\!\left(\frac{d + d^2\varepsilon^2\gamma^2}{C^2}\right) = \frac{d(1-2/n)\,\varepsilon\gamma\, x_i}{C}\left(1 \pm O\!\left(\frac{1 + d\varepsilon^2\gamma^2}{C\,\varepsilon\gamma}\right)\right).$$
We now choose $C = \bar{C}(1-2/n)\big(d\varepsilon + \frac{1}{\varepsilon\mu'_\delta}\big)$ for some large enough absolute constant $\bar{C}\ge 1$ to be chosen later. We have
$$\frac{1}{C\,\varepsilon\gamma} \le \frac{2}{C\,\varepsilon\mu'_\delta} \le \frac{2}{\bar{C}(1-2/n)}\,,$$
where the first inequality follows from $\gamma\ge\frac{\mu'_\delta}{2}$. Furthermore,
$$\frac{d\varepsilon^2\gamma^2}{C\,\varepsilon\gamma} = \frac{d\varepsilon\gamma}{C} \le \frac{d\varepsilon}{C} \le \frac{1}{\bar{C}(1-2/n)}\,.$$
We conclude that
$$E\big[\hat{x}^{(j)}_i(G) \mid x_i\big] = \frac{d(1-2/n)\,\varepsilon\gamma\, x_i}{C}\left(1 \pm O\!\left(\frac{1}{\bar{C}}\right)\right) = \frac{d\varepsilon\gamma\, x_i}{\bar{C}\big(d\varepsilon + \frac{1}{\varepsilon\mu'_\delta}\big)}\left(1 \pm O\!\left(\frac{1}{\bar{C}}\right)\right) = c_\delta\,\gamma\, x_i\left(1 \pm O\!\left(\frac{1}{\bar{C}}\right)\right),$$
where
$$c_\delta = \frac{d\varepsilon}{\bar{C}\big(d\varepsilon + \frac{1}{\varepsilon\mu'_\delta}\big)} = \frac{d\varepsilon^2}{\bar{C}\big(d\varepsilon^2 + \frac{1}{\mu'_\delta}\big)} = \frac{4(1+\delta)}{\bar{C}\big(4(1+\delta) + \frac{1}{\mu'_\delta}\big)} = \frac{4(1+\delta)\,\mu'_\delta}{\bar{C}\,\big(4(1+\delta)\mu'_\delta + 1\big)}\,.$$
Notice how $c_\delta$ depends only on $\delta$ (and the absolute constant $\bar{C}$). Therefore,
$$E\big[\hat{X}_{su}(G)_{ij} \mid x_i, x_j\big] = E\big[\hat{x}^{(j)}_i(G) \mid x_i\big]\cdot E\big[\hat{x}^{(i)}_j(G) \mid x_j\big] = (c_\delta\,\gamma)^2\, x_i x_j\left(1 \pm O\!\left(\frac{1}{\bar{C}}\right)\right),$$
and hence
$$E\big[\hat{X}_{su}(G)_{ij} \mid x_i = x_j\big] - E\big[\hat{X}_{su}(G)_{ij} \mid x_i \ne x_j\big] = (c_\delta\,\gamma)^2\left(2 \pm O\!\left(\frac{1}{\bar{C}}\right)\right).$$
By choosing $\bar{C}$ to be an absolute constant which is large enough, we get
$$E\big[\hat{X}_{su}(G)_{ij} \mid x_i = x_j\big] - E\big[\hat{X}_{su}(G)_{ij} \mid x_i \ne x_j\big] \ge (c_\delta\,\gamma)^2 \ge \left(\frac{c_\delta\,\mu'_\delta}{2}\right)^2\,,$$
where the last inequality is true because we assume that $\gamma\ge\frac{\mu'_\delta}{2}$. The lemma follows by picking $C''_\delta = \big(\frac{c_\delta\mu'_\delta}{2}\big)^2$.

Now we can leverage Lemma C.4 and Lemma C.5 in order to prove Theorem 3.1.

Proof of Theorem 3.1. Let $\gamma, \mu'_\delta, C'_\delta, C''_\delta$ be as in Lemma C.4 and Lemma C.5. We first distinguish between the sufficiently balanced and the sufficiently unbalanced case by counting the number of edges incident to a random sublinear (but sufficiently large) set of vertices. Let $m = n^{3/4}$ and let $I$ be a random subset of $[n]$ of size $m$. Let $\deg(I)$ be the number of edges in $G$ from $I$ to $[n]\setminus I$. We have
$$E\big[\deg(I) \mid I\big] = \sum_{i\in I,\, j\in[n]\setminus I} E\big[[\![\{i,j\}\in G]\!]\big] = \sum_{i\in I,\, j\in[n]\setminus I} E\big[P(\{i,j\}\in G \mid x_i, x_j)\big] = \sum_{i\in I,\, j\in[n]\setminus I} \frac{d}{n}\Big(1 + \frac{\varepsilon}{2}\,E[x_i x_j]\Big)$$
$$= \sum_{i\in I,\, j\in[n]\setminus I} \frac{d}{n}\Big(1 + \frac{\varepsilon}{2}\big(P(x_i = x_j) - P(x_i \ne x_j)\big)\Big) = \sum_{i\in I,\, j\in[n]\setminus I} \frac{d}{n}\Big(1 + \frac{\varepsilon}{2}\big(p^2 + (1-p)^2 - 2p(1-p)\big)\Big) = \sum_{i\in I,\, j\in[n]\setminus I} \frac{d}{n}\big(1 + 2\varepsilon\gamma^2\big) = (1-o(1))\,d\,n^{3/4}\big(1 + 2\varepsilon\gamma^2\big)\,.$$
So the algorithm $\hat{X}$ is defined as follows (an illustrative code sketch of this dispatch step is included right after the case definition):
- If $\deg(I) \ge d\,n^{3/4}\big(1 + 2\varepsilon\big(\tfrac{3\mu'_\delta}{4}\big)^2\big)$, we apply the algorithm $\hat{X}_{su}$ on the subgraph $G([n]\setminus I)$ of $G$ induced on $[n]\setminus I$:
$$\hat{X}_{ij} = \begin{cases} \hat{X}_{su}\big(G([n]\setminus I)\big)_{ij} & \text{if } i,j\in[n]\setminus I\,, \\ 0 & \text{otherwise.} \end{cases}$$
- If $\deg(I) < d\,n^{3/4}\big(1 + 2\varepsilon\big(\tfrac{3\mu'_\delta}{4}\big)^2\big)$, we apply the algorithm $\hat{X}_{sb}$ on the subgraph $G([n]\setminus I)$ of $G$ induced on $[n]\setminus I$:
$$\hat{X}_{ij} = \begin{cases} \hat{X}_{sb}\big(G([n]\setminus I)\big)_{ij} & \text{if } i,j\in[n]\setminus I\,, \\ 0 & \text{otherwise.} \end{cases}$$
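The following is a minimal sketch, under simplifying assumptions, of the dispatch step just defined. The routine `X_su` spells out the degree-based estimator of Lemma C.5 (with clipping in place of the hard truncation, and without excluding the single index $j$ from each degree count), while `X_sb` is left as a stand-in for the robust, symmetrized estimator of Lemma C.4; the constants `mu_prime` and `C` are treated as given, as in the proof.

```python
import numpy as np

def X_su(A, d, C):
    """Degree-based pairwise estimator (sketch of Lemma C.5)."""
    n = A.shape[0]
    deg = A.sum(axis=1)                                  # degree of every vertex
    x_hat = np.clip((deg - d * (1 - 2 / n)) / C, -1, 1)  # truncated, centered degrees
    return np.outer(x_hat, x_hat)

def X_sb(A, d, eps):
    """Stand-in for the robust + symmetrized estimator of Lemma C.4."""
    raise NotImplementedError

def pairwise_estimate(A, d, eps, mu_prime, C, seed=0):
    """Dispatch between the balanced and unbalanced regimes using deg(I)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    in_I = np.zeros(n, dtype=bool)
    in_I[rng.choice(n, size=int(n ** 0.75), replace=False)] = True
    deg_I = A[np.ix_(in_I, ~in_I)].sum()                 # edges from I to [n] \ I
    threshold = d * n ** 0.75 * (1 + 2 * eps * (3 * mu_prime / 4) ** 2)
    rest = A[np.ix_(~in_I, ~in_I)]                       # induced subgraph on [n] \ I
    X_rest = X_su(rest, d, C) if deg_I >= threshold else X_sb(rest, d, eps)
    X = np.zeros((n, n))
    X[np.ix_(~in_I, ~in_I)] = X_rest                     # entries touching I stay 0
    return X
```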
By standard concentration inequalities, we can show that:
- If $\gamma < \frac{\mu'_\delta}{2}$, then with probability $1-o(1)$ we have $\deg(I) < d\,n^{3/4}\big(1 + 2\varepsilon(\tfrac{3\mu'_\delta}{4})^2\big)$, and so we apply the algorithm $\hat{X}_{sb}$, which succeeds in achieving pair-wise weak recovery according to Lemma C.4, for all $i,j\in[n]\setminus I$.
- If $\gamma \ge \mu'_\delta$, then with probability $1-o(1)$ we have $\deg(I) \ge d\,n^{3/4}\big(1 + 2\varepsilon(\tfrac{3\mu'_\delta}{4})^2\big)$, and so we apply the algorithm $\hat{X}_{su}$, which succeeds in achieving pair-wise weak recovery according to Lemma C.5, for all $i,j\in[n]\setminus I$.
- If $\frac{\mu'_\delta}{2} \le \gamma < \mu'_\delta$, then it follows from Lemma C.4 and Lemma C.5 that it does not matter which algorithm we apply, because both of them achieve pair-wise weak recovery for all $i,j\in[n]\setminus I$.

Now for any $i,j\in[n]$, the probability of picking either of them in $I$ is vanishingly small. We conclude that the algorithm $\hat{X}$ satisfies the guarantees sought in Theorem 3.1.

Deferred proofs of Section 4. First we prove Fact 4.3.

Proof of Fact 4.3. Let $p\in[k]$ be fixed. By Chernoff's bound, with probability at least $1 - n^{-20}$,
$$(1-q)\,\frac{n}{k} \le \|c_p(z)\|^2 \le (1+q)\,\frac{n}{k}\,,$$
hence the property follows by a union bound over $p\in[k]$.

Next we prove Fact 4.5.

Proof. For each $\ell\in[t]$, define $C'_\ell := E\big[\hat{X}(G_\ell)_{ij} \mid f_\ell(z)_i = f_\ell(z)_j\big]$ and let $\bar{C}' = \frac1t\sum_{\ell\in[t]} C'_\ell$. By independence of the observations, the inequality of Fact 4.5 for the case $z_i = z_j$ follows from an application of Hoeffding's inequality (when $z_i = z_j$ we have $f_\ell(z)_i = f_\ell(z)_j$ for every $\ell$, so $E\big[\frac1t\sum_{\ell\in[t]}\hat{X}(G_\ell)_{ij} \mid z_i = z_j\big] = \bar{C}'$). The case $z_i \ne z_j$ needs a bit more work. By Theorem 3.1, for each $\ell\in[t]$, we have
$$E\big[\hat{X}(G_\ell)_{ij} \mid f_\ell(z)_i = f_\ell(z)_j\big] = C'_\ell \qquad \text{and} \qquad E\big[\hat{X}(G_\ell)_{ij} \mid f_\ell(z)_i \ne f_\ell(z)_j\big] \le C'_\ell - C_\ell\,.$$
Moreover, conditioned on $z_i \ne z_j$, the signs $f_\ell(z)_i$ and $f_\ell(z)_j$ agree with probability $\frac12$, hence
$$E\big[\hat{X}(G_\ell)_{ij} \mid z_i \ne z_j\big] = E\big[\hat{X}(G_\ell)_{ij} \mid f_\ell(z)_i = f_\ell(z)_j\big]\,P\big(f_\ell(z)_i = f_\ell(z)_j \mid z_i \ne z_j\big) + E\big[\hat{X}(G_\ell)_{ij} \mid f_\ell(z)_i \ne f_\ell(z)_j\big]\,P\big(f_\ell(z)_i \ne f_\ell(z)_j \mid z_i \ne z_j\big) \le C'_\ell - \frac{C_\ell}{2}\,,$$
and therefore
$$E\Big[\frac1t\sum_{\ell\in[t]}\hat{X}(G_\ell)_{ij} \,\Big|\, z_i \ne z_j\Big] \le \frac1t\sum_{\ell\in[t]}\Big(C'_\ell - \frac{C_\ell}{2}\Big)\,.$$
The inequality of Fact 4.5 for the case $z_i \ne z_j$ follows with another application of Hoeffding's inequality.

Now we prove Lemma 4.8.

Proof of Lemma 4.8. Let $A_i$ be a $(p,q)$-representative and $A_j$ be a $(p',q)$-representative, for $p,p'\in[k]$. It suffices to show that if $p = p'$ then $\|A_i - A_j\|^2 \le n/k$, and otherwise $\|A_i - A_j\|^2 > n/k$; then no $q$-representative index remains unassigned at the end of step 1. By the reverse triangle inequality,
$$\Big|\,\|A_i - A_j\| - \|c_p(z) - c_{p'}(z)\|\,\Big| \le \big\|A_i - A_j - \big(c_p(z) - c_{p'}(z)\big)\big\| \le \|A_i - c_p(z)\| + \|A_j - c_{p'}(z)\| \le 2\,n\,e^{-q\,\bar{C}^2 t}\,.$$
For $p = p'$ we have $\|c_p(z) - c_{p'}(z)\| = 0$ and, by the choice of $t$, the first inequality follows. For $p \ne p'$ we have $\|c_p(z) - c_{p'}(z)\|^2 \ge \frac{n}{k}\,(2 - o(1))$ since $z$ is balanced, and the second inequality follows again by the choice of $t$.
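To make the role of these facts concrete, here is a minimal sketch of the kind of late-fusion aggregation they support: per-view pairwise estimates are averaged (the statistic of Fact 4.5) and vertices are then greedily grouped by row distances, in the spirit of the representative-based step analyzed in Lemma 4.8. The greedy rule and the threshold are illustrative simplifications, not the exact steps of Algorithm 1.

```python
import numpy as np

def late_fusion_cluster(X_hats, threshold):
    """X_hats: array of shape (t, n, n) with per-view pairwise estimates in [-1, 1].
    Greedily groups vertices whose averaged rows are close."""
    A = X_hats.mean(axis=0)            # average over the t views (Fact 4.5 statistic)
    n = A.shape[0]
    labels = np.full(n, -1)
    reps = []                          # indices of rows acting as cluster representatives
    for i in range(n):
        for c, r in enumerate(reps):
            if np.linalg.norm(A[i] - A[r]) <= threshold:
                labels[i] = c
                break
        else:                          # no nearby representative: open a new cluster
            reps.append(i)
            labels[i] = len(reps) - 1
    return labels

# Example usage with a threshold of sqrt(n / k), mirroring the distance scale in Lemma 4.8:
# labels = late_fusion_cluster(X_hats, threshold=np.sqrt(n / k))
```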