Journal of Machine Learning Research 21 (2020) 1-68. Submitted 4/19; Revised 12/19; Published 1/20.

Optimal Bipartite Network Clustering

Zhixin Zhou (zhixin@ucla.edu), Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA

Arash A. Amini (aaamini@ucla.edu), Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA

Editor: Matthias Hein

Abstract

We study bipartite community detection in networks, or more generally the network biclustering problem. We present a fast two-stage procedure based on spectral initialization followed by the application of a pseudo-likelihood classifier twice. Under mild regularity conditions, we establish the weak consistency of the procedure (i.e., the convergence of the misclassification rate to zero) under a general bipartite stochastic block model. We show that the procedure is optimal in the sense that it achieves the optimal convergence rate that is achievable by a biclustering oracle, adaptively over the whole class, up to constants. This is further formalized by deriving a minimax lower bound over a class of biclustering problems. The optimal rate we obtain sharpens some of the existing results and generalizes others to a wide regime of average degree growth, from sparse networks with average degrees growing arbitrarily slowly to fairly dense networks with average degrees of order $\sqrt{n}$. As a special case, we recover the known exact recovery threshold in the $\log n$ regime of sparsity. To obtain the consistency result, as part of the provable version of the algorithm, we introduce a sub-block partitioning scheme that is also computationally attractive, allowing for distributed implementation of the algorithm without sacrificing optimality. The provable algorithm is derived from a general class of pseudo-likelihood biclustering algorithms that employ simple EM type updates. We show the effectiveness of this general class by numerical simulations.

Keywords: bipartite networks; stochastic block model; community detection; biclustering; network analysis; pseudo-likelihood; spectral clustering.

1. Introduction

Network analysis has become an active area of research over the past few years, with applications and contributions from many disciplines including statistics, computer science, physics, biology and social sciences. A fundamental problem in network analysis is detecting and identifying communities, also known as clusters, to help better understand the underlying structure of the network. The problem has seen rapid advances in recent years with numerous breakthroughs in modeling, theoretical understanding, and practical applications (Fortunato and Hric, 2016). In particular, there has been much excitement and progress in understanding and analyzing the stochastic block model (SBM) and its variants. We refer to Abbe (2017) for a recent survey of the field. Much of this effort, especially on the theoretical side, has been focused on the unipartite (or symmetric) case, while the bipartite counterpart, despite numerous practical applications, has received comparatively less attention.
Of course, there has been lots of activity in terms of modeling and algorithm development for bipartite clustering, both in the context of networks (Zhou et al., 2007; Larremore et al., 2014; Wyse et al., 2017; Rohe et al., 2012; Razaee et al., 2019) as well as in other domains such as topic modeling and text mining (Dhillon, 2001, 2003) and biological applications (Cheng and Church, 2000; Madeira et al., 2010). But much of this work either lacks theoretical investigation or has not considered the issue of statistical optimality. In this paper, we consider community detection, or clustering, in the bipartite setting with a focus on deriving fundamental theoretical limits of the problem. The main goal is to propose computationally feasible algorithms for bipartite network clustering that exhibit provable statistical optimality. We will focus on the bipartite version of the SBM, which is a natural model for bipartite networks with clusters. The SBM is a stochastic network model where the probability of edge formation depends on the latent (unobserved) community assignments of the nodes, often referred to as node labels. The goal of the community detection problem is to recover these labels given an instance of the network. This is a non-trivial task since, for example, maximum likelihood estimation involves a search over exponentially many label assignments. Community detection in the bipartite SBM is closely related to the biclustering problem, for which many algorithms have been developed over the years (Hartigan, 1972; Cheng and Church, 2000; Tanay et al., 2002; Gao et al., 2016). On the other hand, in recent years, various algorithms have been proposed for clustering in unipartite SBMs, including global approaches such as spectral clustering (Rohe et al., 2011; Krzakala et al., 2013; Lei et al., 2015; Fishkind et al., 2013; Vu, 2018; Massoulié, 2014; Yun and Proutiere, 2014; Chin et al., 2015; Bordenave et al., 2015; Gulikers et al., 2017; Pensky et al., 2019; Zhou and Amini, 2019) and convex relaxations via semidefinite programs (SDPs) (Amini et al., 2018; Hajek et al., 2016a; Bandeira, 2015; Guédon and Vershynin, 2016; Montanari and Sen, 2016; Ricci-Tersenghi et al., 2016; Agarwal et al., 2017; Perry and Wein, 2017), as well as local methods such as belief propagation (Decelle et al., 2011), Bayesian MCMC (Nowicki and Snijders, 2001) and variational Bayes (Celisse et al., 2012; Bickel et al., 2013), greedy profile likelihood (Bickel and Chen, 2009; Zhao et al., 2012) and pseudo-likelihood maximization (Amini et al., 2013), among others. A limitation of spectral clustering approaches is that they are often not optimal on their own, and the SDPs have the drawback of not being able to fit the full generality of SBMs. Various algorithms can further improve the clustering accuracy and adapt to the generality of the SBM. Profile likelihood maximization was proposed and analyzed in Bickel and Chen (2009), but the underlying optimization problem is computationally infeasible and the approach is only applicable to networks of limited size. Pseudo-likelihood ideas were used in Amini et al. (2013) to derive EM type updates to maximize a surrogate to the likelihood of the SBM. We extend the ideas of Amini et al. (2013) to the bipartite setting and greatly improve their analysis by showing that these pseudo-likelihood approaches can achieve minimax optimal rates in a wide variety of settings.
In the unipartite setting, there have been interesting recent advances in understanding optimal recovery in the semi-sparse regime where the (expected) average network degree, $d_{\mathrm{av}}$, is allowed to grow slowly to infinity as the number of nodes, $n$, grows. In a series of papers (Mossel et al., 2015; Abbe et al., 2016; Hajek et al., 2016a,b), the optimal threshold for exact recovery, also known as strong consistency, was established for simple planted partition models. In Abbe and Sandon (2015), the problem of strong consistency was considered for a general SBM and the optimal threshold for strong consistency was established. In subsequent work (Zhang et al., 2016; Gao et al., 2017, 2018), the results were extended to include weak consistency, i.e., requiring the fraction of misclassified nodes to go to zero, rather than drop to exactly zero (as in strong consistency), and rates of optimal convergence were established, up to a slack in the exponent. To achieve the more relaxed consistency results, Gao et al. (2017) limited the model to what we refer to as strongly assortative SBM; see Amini et al. (2018) for a definition. Our work is inspired by the insightful analyses of Abbe and Sandon (2015) and Gao et al. (2017). We extend these ideas by presenting results that are strictly sharper and more general than what has been obtained so far. In short, we only assume that the clusters are distinguishable (in the sense of Chernoff divergence) and that the network is not very dense, i.e., $d_{\mathrm{av}} = O(\sqrt{n})$, where $d_{\mathrm{av}}$ denotes the expected average degree and $n$ is the number of nodes. Our results establish minimax optimal rates below this $\sqrt{n}$ regime and above the sparse regime $d_{\mathrm{av}} = O(1)$. In particular, we obtain precise rates of weak consistency when $d_{\mathrm{av}}$ grows arbitrarily slowly. We require $d_{\mathrm{av}} = O(\sqrt{n})$ to allow Poisson approximations on the degrees of nodes restricted to large subsets. In regimes denser than $\sqrt{n}$, existing work easily establishes exact recovery under suitable distinguishability conditions. We make more detailed comparisons with existing work in Section 3.

Contributions. Establishing these results requires a fair amount of technical and algorithmic novelty. Here, we highlight some of the features of our approach:

1. Existing minimax rates of convergence for the misclassification error are known for what we refer to as the nearly assortative model, where the probability of connection is $a/n$ within clusters and $b/n$ outside clusters. The existing results establish an error rate that belongs to an interval:
$$\mathrm{Error} \in \big[e^{-(1+o(1))I},\ e^{-(1-o(1))I}\big], \quad \text{as } I \to \infty,$$
for some $o(1)$ terms that are positive. Here, $I$ is related to the Bhattacharyya distance (also known as the Hellinger affinity) of Bernoulli variables with probabilities $a/n$ and $b/n$. This type of result originally appeared in Zhang et al. (2016) and propagated to many subsequent works (Gao et al., 2018, 2017; Zhang and Zhou, 2017; Xu et al., 2017; Chien et al., 2018). This rate, however, is not sharp since the slack term $e^{o(1)I}$ could be unbounded (because the $o(1)$ terms are positive and $I \to \infty$). Another shortcoming of these results is their limitation to the simple nearly assortative setting. Our result sharpens and generalizes this known minimax rate to
$$\mathrm{Error} = e^{-I-R}, \quad \text{for some } 0 < R \lesssim \log I,$$
for the general class of all SBMs, under a mild distinguishability assumption on the rows and columns of the edge probability matrix.
Furthermore, the $I$ in our result takes the form of a Chernoff exponent among Poisson vectors, which is the form necessary for the general SBM.

2. In order to achieve these sharp rates, we introduce an efficient sub-block (or sub-graph) partitioning scheme that generalizes the partitioning idea of Chin et al. (2015). Our scheme allows one to break down the costly spectral initialization, by applying it to smaller sub-blocks, without losing optimality. If done in parallel, spectral clustering on sub-blocks is computationally cheaper than performing a spectral decomposition of the entire matrix. The resulting algorithm is naturally parallelizable, hence can be deployed in a distributed fashion and scaled to very large networks.

3. Our algorithms are modifications of a natural EM algorithm on mixtures of Poisson vectors, hence very familiar from a statistical perspective. Although other (optimal) algorithms in the literature are performing more or less similar operations, the link to EM algorithms and mixture modeling is quite clear in our work. In Section 4.2, we provide the general blueprint of the algorithms based on the pseudo-likelihood and block compression ideas (Algorithm 1). We then construct a provable version by combining with the sub-block partitioning idea (Section 5).

4. To get the sharper rate, analyzing a single step of an EM algorithm is not enough; we thus analyze the second step too. We will show that the first step gets us from a good (but crude) initial rate $\gamma_1$ to the fast rate $\exp(-I/Q)$, where $Q$ is the number of sub-blocks. This rate is in the vicinity of the optimal rate, and repeating the iteration once more, with the more accurate labels, gets us to the minimax error rate $\exp(-I-R)$.

5. Among the technical contributions are the uniform consistency of the likelihood ratio classifier (LRC) over a subset of the parameters close to the truth (Lemma 2), sharp approximations for the Poisson-binomial distributions (Appendix F.4), and the extension (and elucidation) of a novel technique of Abbe and Sandon (2015) for deriving error exponents (Appendix F.3). The uniform consistency result for LRCs allows us to tolerate some dependence among the estimates from iteration to iteration (Section 5 and Appendix B.1).

6. In contrast to the unipartite case, the bipartite setting allows us to introduce an oracle version of the problem that reveals the nature of the optimal rates in community detection and how they relate to classical hypothesis testing and mixture modeling. In particular, we answer the curious question of why the Chernoff exponent of a (binary) hypothesis testing problem controls the misclassification rate in community detection and network clustering. The oracle also provides an upper bound on the performance (i.e., a lower bound on the misclassification rate) of any algorithm. See Section 2.2 and Proposition 1 for details.

The rest of the paper is organized as follows. We introduce the model and the biclustering oracle in Section 2. We present our main results in Section 3, including an upper bound on the error rate of the algorithm and a matching minimax lower bound. The general pseudo-likelihood algorithms are presented in Section 4 and a provable version in Section 5. In Section 6, we demonstrate the numerical performance of the methods. The proofs appear in Appendices B through F. Extra comments on the results and proof techniques can be found in Appendix A.
2. Network biclustering

We start by introducing the network biclustering problem based on stochastic block modeling, and set up some notation. We then discuss how a biclustering oracle with side information can optimally recover the labels. These ideas will be the basis for our algorithms.

2.1. Bipartite block model

We work with a bipartite network that can be represented by a biadjacency matrix $A \in \{0,1\}^{n \times m}$, where for simplicity, the nodes on the two sides are indexed by the sets $[n]$ and $[m]$. We assume that there are $K$ and $L$ communities for the two sides, respectively, and that the memberships of the nodes to these communities are given by two vectors $y = (y_i) \in [K]^n$ and $z = (z_j) \in [L]^m$. Thus, $y_i = k$ if node $i$ on side 1 belongs to community $k \in [K]$. We call $y_i$ and $z_j$ the labels of nodes $i$ and $j$, respectively. We often treat these labels as binary vectors as well, using the identification $[K] \cong \{0,1\}^K$ via the one-hot encoding, that is, $y_i = k \iff y_{ik} = 1$ and $y_{ik'} = 0$ for $k' \ne k$. Given the labels $y$ and $z$, and a connectivity matrix $P \in [0,1]^{K \times L}$ (also known as the edge probability matrix), the general bipartite stochastic block model (bi-SBM) assumes that the $A_{ij}$ are Bernoulli variables, independent over $(i,j) \in [n] \times [m]$, with mean parameters
$$\mathbb{E}[A_{ij}] = y_i^T P z_j = P_{k\ell}, \quad \text{if } y_i = k,\ z_j = \ell. \tag{1}$$
We denote this model compactly as $A \sim \mathrm{SBM}(y, z, P)$. It is helpful to consider the Poisson version of the model too, which is denoted as $A \sim \mathrm{pSBM}(y, z, P)$. This is the same model as the Bernoulli SBM, with the exception that each entry $A_{ij}$ is drawn (independently) from a Poisson variate with the mean given in (1). These two models behave very closely when the entries of $P$ are small enough. Throughout, we treat $y$, $z$ and $P$ as unknown deterministic parameters. The goal of network biclustering is to recover these three parameters given an instance of $A$. In fact, as we will see, the parameters $P$ themselves are not that important. What matters is the set of (Poisson) mean parameters which are derived from $P$ and the sizes of the communities. In order to define these parameters, let $n(z) = (n_1(z), \dots, n_L(z)) \in \mathbb{N}^L$ be the number of nodes in each of the communities of side 2. That is, $n_\ell(z) = \sum_{j=1}^m \mathbf{1}\{z_j = \ell\} = \sum_{j=1}^m z_{j\ell}$. We also let $\pi_\ell(z) = n_\ell(z)/m$ be the proportion of nodes in the $\ell$th community of side 2. Similar notations, namely $n(y) \in \mathbb{N}^K$ and $\pi(y) \in [0,1]^K$, denote the community sizes and proportions of side 1. The row mean parameters are defined as
$$\Lambda = (\lambda_{k\ell}) = (P_{k\ell}\, n_\ell(z)) = P\,\mathrm{diag}(n(z)) \in \mathbb{R}^{K \times L}, \tag{2}$$
where $\mathrm{diag}(v)$ for a vector $v = (v_k)$ is a diagonal matrix with diagonal entries $v_k$. The column mean parameters can be defined similarly,
$$\Gamma^T = \big(n_k(y) P_{k\ell}\big) = \mathrm{diag}(n(y))\, P \in \mathbb{R}^{K \times L}. \tag{3}$$
Note the transpose in the above definition, i.e., $\Gamma \in \mathbb{R}^{L \times K}$. This convention allows us to define information measures based on rows of the matrices $\Lambda$ and $\Gamma$ in a similar fashion, as will be discussed in Section 3. Although the rates we derive are controlled by the Poisson parameters defined above, we always assume that the true distribution is the Bernoulli SBM, and any Poisson approximation will be carefully derived.

2.2. Biclustering oracle with side information

The key idea behind the algorithms, as well as our consistency arguments, is the following observation: Assume that we have prior knowledge of $P$ and the column labels $z$, but not the row labels $y$. For each row, we can sum the columns of $A$ according to their column memberships, i.e., we can perform the (ideal) block compression $b_{i\ell} := \sum_j A_{ij} z_{j\ell}$.
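To make the model and the block compression concrete, the following is a minimal NumPy sketch (not from the paper; the dimensions and the connectivity matrix are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bisbm(y, z, P, rng):
    """Sample a biadjacency matrix A ~ SBM(y, z, P) with independent Bernoulli entries."""
    mean = P[np.ix_(y, z)]                    # E[A], with entries P[y_i, z_j]; cf. (1)
    return (rng.random(mean.shape) < mean).astype(int)

def block_compress(A, z, L):
    """Ideal column compression b_{il} = sum_j A_ij 1{z_j = l} (Section 2.2)."""
    Z = np.eye(L, dtype=int)[z]               # m x L one-hot encoding of column labels
    return A @ Z                              # n x L within-block row sums

# Toy example: K = 2 row clusters, L = 3 column clusters.
K, L, n, m = 2, 3, 600, 900
y = rng.integers(0, K, size=n)
z = rng.integers(0, L, size=m)
P = np.array([[0.05, 0.01, 0.03],
              [0.02, 0.06, 0.01]])
A = sample_bisbm(y, z, P, rng)
Lambda = P * np.bincount(z, minlength=L)      # row mean parameters, equation (2)
b = block_compress(A, z, L)                   # row i is approx. Poisson with mean Lambda[y_i]
```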
The vector $b_i = (b_{i1}, \dots, b_{iL})$ contains the same information for recovering the community of $i$ as the original matrix $A$, i.e., it is a sufficient statistic. Assume that we are under the pSBM model (i.e., the Poisson SBM). Then, $b_i$ has the distribution of a vector of independent Poisson variables. More precisely,
$$b_i \sim Q_k := \bigotimes_{\ell=1}^{L} \mathrm{Poi}(\lambda_{k\ell}), \quad \text{if } y_i = k, \tag{4}$$
where $\lambda_{k\ell}$ are the row mean parameters defined in (2). Note that the distributions $Q_k$, $k = 1, \dots, K$ are known under our simplifying assumptions. The problem of determining the row labels thus reduces to deciding which of these $K$ known distributions generated $b_i$. Whether node $i$ belongs to a particular community $k$ can be decided optimally by performing a likelihood ratio (LR) test of $Q_k$ against each of $Q_r$, $r \ne k$. The above LR test is at the heart of the algorithms discussed in Sections 4 and 5. The difficulty of the biclustering problem (relative to a simple mixture modeling) is that in practice, we do not know in advance either $y$ or $\Lambda$; hence neither the exact test statistics $(b_i)$ nor the distributions $\{Q_k\}$ are known. We thus proceed by a natural iterative procedure: Based on the initial estimates of $y$ and $z$, we obtain estimates of $(b_i)$ and $\{Q_k\}$, perform the approximate LR test to obtain better estimates of $z$, and then repeat the procedure over the columns to obtain better estimates of $y$. These new label estimates lead to better estimates of $(b_i)$ and $\{Q_k\}$, and we can repeat the process. We refer to the algorithm that has access to the true column labels $z$ and parameters $\Lambda$, and performs the optimal LR tests, as the oracle classifier. The misclassification rate of this oracle gives a lower bound on the misclassification rate of any biclustering algorithm. The performance of the oracle, in turn, is controlled by the error exponent of the simple hypothesis testing problems $Q_k$ versus $Q_r$, $r \ne k$, as detailed in Proposition 1. This line of reasoning reveals the origin of the information quantities $I_{kr}$ and $I^{\mathrm{col}}_{\ell\ell'}$ defined in (8) and (9) that control the optimal rate of the biclustering problem. Note that the bipartite setup has the advantage of disentangling the row and column labels, so that a non-trivial oracle exists. It does not make much sense to assume known column labels in the unipartite SBM, since by symmetry we then know the row labels too; hence nothing more is left to estimate. On the other hand, due to the close relation between the bipartite and unipartite problems, the above argument also sheds light on why the error exponent of a hypothesis test is the key factor controlling optimal misclassification rates of community detection in unipartite SBMs.

2.3. Notation on misclassification rates

Let $\Pi_K$ be the set of permutations on $[K]$. The (average) misclassification rate between two sets of (row) labels $\hat y$ and $y$ is given by
$$\mathrm{Mis}(\hat y, y) := \min_{\sigma \in \Pi_K} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\sigma(\hat y_i) \ne y_i\}. \tag{5}$$
Letting $\sigma^*$ be a minimizer in (5), the misclassification rate over cluster $k$ is
$$\mathrm{Mis}_k(\hat y, y) := \frac{1}{n_k(y)} \sum_{i: y_i = k} \mathbf{1}\{\sigma^*(\hat y_i) \ne y_i\} = \frac{|i : \sigma^*(\hat y_i) \ne k,\ y_i = k|}{n_k(y)}, \tag{6}$$
using the cardinality notation to be discussed shortly. Note that (6) is not symmetric in its arguments. We will also use the notation $\sigma^*(\hat y \to y)$ to denote an optimal permutation in (5). When $\mathrm{Mis}(\hat y, y)$ is sufficiently small, this optimal permutation will be unique. It is also useful to define the direct misclassification rate between $\hat y$ and $y$, denoted as $\mathrm{dMis}(\hat y, y)$, which is obtained by setting the permutation in (5) to the identity. In other words, $\mathrm{dMis}(\hat y, y)$ is the normalized Hamming distance between $\hat y$ and $y$. With $\sigma^* = \sigma^*(\hat y \to y)$, we have $\mathrm{Mis}(\hat y, y) = \mathrm{dMis}(\sigma^*(\hat y), y)$.
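Computing (5) does not require enumerating all $K!$ permutations: the optimal $\sigma$ is a linear assignment on the $K \times K$ confusion matrix. A small sketch (illustrative, assuming SciPy's Hungarian solver):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mis_rate(y_hat, y, K):
    """Average misclassification rate Mis(y_hat, y) of equation (5)."""
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    C = np.zeros((K, K), dtype=int)
    np.add.at(C, (y_hat, y), 1)               # confusion counts C[a, b] = |{i: y_hat_i=a, y_i=b}|
    row, col = linear_sum_assignment(-C)      # permutation maximizing total agreement
    sigma = np.empty(K, dtype=int)
    sigma[row] = col
    return np.mean(sigma[y_hat] != y)
```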
We note that
$$\mathrm{Mis}(\hat y, y) = \sum_{k \in [K]} \pi_k(y)\, \mathrm{Mis}_k(\hat y, y) \le \max_{k \in [K]} \mathrm{Mis}_k(\hat y, y), \tag{7}$$
as well as $\max_{k \in [K]} \mathrm{Mis}_k(\hat y, y) \le \mathrm{Mis}(\hat y, y) / \min_k \pi_k(y)$. We can similarly define the misclassification rate of an estimate $\hat z$ relative to $z$. Our goal is to derive efficient algorithms for obtaining $\hat y$ and $\hat z$ that have minimal misclassification rates asymptotically, as the number of nodes grows.

Other notation. We write w.h.p. as an abbreviation for "with high probability", meaning that the event holds with probability $1 - o(1)$. To avoid ambiguity, we assume that all parameters, including $m$, are functions of $n$. All limits and little-$o$ notations are under $n \to \infty$. For example, $f(n) = o(g(n))$ denotes $\lim_{n\to\infty} f(n)/g(n) = 0$. We write $\mathbb{Z}_Q = \mathbb{Z}/Q\mathbb{Z}$ to denote a cyclic group of order $Q$. Our convention regarding solutions of optimization problems, whenever more than one exists, is to choose one uniformly at random. We use the shorthand notation $|i : y_i = k| := |\{i : y_i = k\}|$ for cardinality of sets, where $i \in [n]$ is implicit, assuming that $y$ is a vector of length $n$. For example, if $\hat y, y \in [K]^n$, we have the identity $|i : \hat y_i \ne y_i| = \sum_{k \in [K]} |i : y_i = k,\ \hat y_i \ne k|$. It is worth noting that we use "community" and "cluster" interchangeably in this paper, although some authors prefer to use community for assortative clusters, and use cluster to refer to any general group of nodes. We will not follow this convention and no assortativity will be implicitly assumed.

3. Main results

Let us start with some assumptions on the mean parameters. Recall the row and column mean parameter matrices $\Lambda$ and $\Gamma$ defined in (2) and (3). Let $\Lambda_{\min}$ and $\Lambda_{\max}$ be the minimum and maximum value of the entries of $\Lambda$, respectively, and similarly for $\Gamma$. We assume
$$\frac{\Lambda_{\max}}{\Lambda_{\min}} \vee \frac{\Gamma_{\max}}{\Gamma_{\min}} \le \omega, \tag{A1}$$
for some $\omega > 0$. That is, $\omega$ measures the deviation of the entries of the mean matrices from uniform. We assume that the sizes of the clusters are bounded as
$$\frac{1}{\beta K} \le \pi_k(y) \le \frac{\beta}{K} \quad \text{and} \quad \frac{1}{\beta L} \le \pi_\ell(z) \le \frac{\beta}{L} \tag{A2}$$
for all $k \in [K]$ and $\ell \in [L]$. We will assume (A1) and (A2) throughout the paper. The following key quantity controls the misclassification rate:
$$I_{kr} := I_{kr}(\Lambda) := \sup_{s \in (0,1)} \sum_{\ell=1}^{L} \Big[ (1-s)\lambda_{k\ell} + s\lambda_{r\ell} - \lambda_{k\ell}^{1-s}\lambda_{r\ell}^{s} \Big], \tag{8}$$
for $k, r \in [K]$. We can think of $I(\Lambda) := (I_{kr}(\Lambda)) \in \mathbb{R}_+^{K \times K}$ as an operator acting on pairs of rows of a matrix $\Lambda \in \mathbb{R}_+^{K \times L}$, say $\lambda_{k\cdot}$ and $\lambda_{r\cdot}$, producing a $K \times K$ pairwise information matrix. We often refer to the function of $s$ being maximized in (8) as $s \mapsto I_s$, with some abuse of notation, dropping the dependence on $k$ and $r$ and assuming that they are fixed. This function is strictly concave over $\mathbb{R}$ whenever $\lambda_{k\cdot} \ne \lambda_{r\cdot}$, and we have $I_0 = I_1 = 0$. Recalling the product Poisson distributions $\{Q_k\}$, the quantity $I_{kr}$ given in (8) is the Chernoff exponent in testing the two hypotheses $Q_k$ and $Q_r$ (Chernoff, 1952). The difference with the classical setting in which the Chernoff exponent appears is the regime we work in, where we are effectively testing based on a sample of size 1 and, instead, let $I_{kr} \to \infty$. The column information matrix is defined similarly:
$$I^{\mathrm{col}}_{\ell\ell'} := I_{\ell\ell'}(\Gamma) = \sup_{s \in (0,1)} \sum_{k=1}^{K} \Big[ (1-s)\Gamma_{\ell k} + s\Gamma_{\ell' k} - \Gamma_{\ell k}^{1-s}\Gamma_{\ell' k}^{s} \Big], \tag{9}$$
for all $\ell, \ell' \in [L]$. We let $I_{\min} := \min_{k \ne r} I_{kr}$ and $I^{\mathrm{col}}_{\min} := \min_{\ell \ne \ell'} I^{\mathrm{col}}_{\ell\ell'}$. Another set of key quantities in our analysis are:
$$\varepsilon_{kr} := \max_{\ell \in [L]} \Big| \frac{\lambda_{k\ell}}{\lambda_{r\ell}} - 1 \Big|, \qquad \varepsilon_k := \min_{r \ne k} \varepsilon_{kr}, \qquad \varepsilon := \min_{k \in [K]} \varepsilon_k. \tag{10}$$
The relation with hypothesis testing is formalized in Proposition 1 below.
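Since $s \mapsto I_s$ is strictly concave, the supremum in (8) is a one-dimensional maximization that is easy to carry out numerically. A sketch (illustrative, not the paper's code):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_info(lam_k, lam_r):
    """I_kr of equation (8) for two rows of Poisson means lam_k, lam_r."""
    lam_k, lam_r = np.asarray(lam_k, float), np.asarray(lam_r, float)

    def I_s(s):  # the strictly concave function s -> I_s
        return np.sum((1 - s) * lam_k + s * lam_r - lam_k**(1 - s) * lam_r**s)

    res = minimize_scalar(lambda s: -I_s(s), bounds=(1e-9, 1 - 1e-9), method="bounded")
    return -res.fun

def I_min(Lambda):
    """Minimum pairwise information I_min = min_{k != r} I_kr(Lambda)."""
    K = Lambda.shape[0]
    return min(chernoff_info(Lambda[k], Lambda[r])
               for k in range(K) for r in range(K) if k != r)
```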
Proposition 1. Consider the likelihood ratio (LR) testing of the null hypothesis $Q_k$ against $Q_r$, based on a sample of size 1. Let $\Lambda = [\lambda_{k\cdot}; \lambda_{r\cdot}] \in \mathbb{R}_+^{2 \times L}$. Assume that, as $\Lambda_{\min} \to \infty$, (a) $\liminf \varepsilon_{kr} > 0$, and (b) $\omega = O(1)$. Then, there exist constants $C$ and $C'$ such that
$$\mathbb{P}(\text{Type I error}) + \mathbb{P}(\text{Type II error}) \in \Big[ \exp\Big( -I_{kr} - \frac{L}{2}\big(\log \Lambda_{\min} + C'\big) \Big),\ C \exp\Big( -I_{kr} - \frac{1}{2}\log \Lambda_{\min} \Big) \Big]. \tag{11}$$
See Corollary 6 and Appendix G.6 for the proof.

Any hypothesis testing procedure can be turned into a classifier, and a bound on the error of the hypothesis test (for a sample of size 1) translates into a bound on the misclassification rate for the associated classifier. This might not be immediately obvious, and we provide a formal statement in Lemma 3. Proposition 1 thus provides a precise bound on the misclassification rate of the LR classifier for deciding between $Q_k$ and $Q_r$. The dependence on $L$ in the lower bound was later removed in Zhou and Li (2018), subsequent to our current work. The significance of the Chernoff exponent of the hypothesis test in controlling the rates is thus natural, given the full information about $\{Q_k\}$ and the test statistics. What is somewhat surprising is that almost the same bound holds when no such information is available a priori. Our main result below is a formalization of this claim. In our assumptions, we include a parameter $Q \in \mathbb{N}$ that controls the number of sub-blocks when partitioning, the details of which are discussed in Section 5. Under the following two assumptions:
$$(Q^2 \log Q)\,\beta^2\omega^3\, KL(K \vee L)\log(K \vee L)\,(\Lambda_{\max} \vee \Gamma_{\max})^2 = O(n \wedge m), \tag{A3}$$
$$(Q \log Q)^2\,\beta^3\omega^2\,(K \vee L)^3\,(\alpha \vee \alpha^{-1})\,(\Lambda_{\max} \vee \Gamma_{\max}) = o\big( (I_{\min} \wedge I^{\mathrm{col}}_{\min})^2 \big), \tag{A4}$$
where $\alpha := m/n$, there is an algorithm that achieves almost the same rate as the oracle:

Theorem 1 (Main result). Consider a bipartite SBM (Section 2.1) satisfying (A1)-(A4). Then, as $I_{\min} \wedge I^{\mathrm{col}}_{\min} \to \infty$ and $\Lambda_{\min} \to \infty$, the row labels $\hat y$ output by Algorithm 3 in Section 5 satisfy, for some $\zeta = o(1)$,
$$\mathrm{Mis}_k(\hat y, y) = O\Big( \omega \sum_{r \ne k} \exp\Big( -I_{kr} - \Big(\frac{1}{2} - \zeta\Big) \log \Lambda_{\min} \Big) \Big) \tag{12}$$
for every $k \in [K]$, with high probability. Similar bounds hold for $\hat z$ w.r.t. $z$.

By modifying the sequence $\zeta = o(1)$, one can replace the big $O$ with the small $o$ in (12) to obtain an equivalent statement (e.g., factor out a term $e^{-\sqrt{\log \Lambda_{\min}}} = o(1)$ and change $\zeta$ to $\zeta' := \zeta + (\log \Lambda_{\min})^{-1/2} = o(1)$). Let us discuss the assumptions of Theorem 1. The only real assumptions are (A3) and (A4). The other two, namely (A1) and (A2), are more or less the definitions of $\omega$ and $\beta$. The main constraints on $\omega$ and $\beta$ are encoded in (A3) and (A4) in tandem with the other parameters.

Remark 1. In a first reading, one can take $\beta, \omega, Q = O(1)$, $n \asymp m$ and $\Lambda_{\max} \asymp \Gamma_{\max}$. In this setting, (A3) is a very mild sparsity condition, implying that the degrees should not grow faster than $\sqrt{n}$. Condition (A4) guarantees that the information quantities grow fast enough so that the clusters are distinguishable. We only need (A4) for Algorithm 3, which uses a spectral initialization. In Section 5.2.1, we present Theorem 3 for the likelihood-updating portion of the algorithm, assuming that a good initialization is provided, irrespective of the algorithm used. Theorem 3 only requires a weakened version of assumption (A4); see (A4') in Section 5.2.1.

Depending on the behavior of $\varepsilon_{kr}$, the rate obtained in Theorem 1 can exhibit different regimes, which are summarized in Corollary 1 below. Consider the additional assumption:
$$\max_{k, r \in [K]} \omega \Big( \frac{1}{\varepsilon_{kr}} + 1 \Big) = O(1). \tag{A5}$$

Corollary 1. Under the same assumptions as Theorem 1, w.h.p., for all $k \in [K]$,
$$\mathrm{Mis}_k(\hat y, y) = o\Big( \sum_{r \ne k} \exp(-I_{kr}) \Big). \tag{13}$$
If in addition we assume (A5), then for some $\zeta = o(1)$, w.h.p., for all $k \in [K]$,
$$\mathrm{Mis}_k(\hat y, y) = O\Big( \sum_{r \ne k} \exp\Big( -I_{kr} - \Big(\frac{1}{2} - \zeta\Big)\log \Lambda_{\min} \Big) \Big). \tag{14}$$

Remark 2. Consider the oracle version of the biclustering problem where the connectivity matrix $P$ and the true column labels $z$ are given. Then, the optimal row clustering reduces to the likelihood ratio tests in Proposition 1. That is, given the row sums within blocks as sufficient statistics, we compare the likelihoods at two different mean parameters. By Proposition 1, the optimal misclassification rate for the oracle problem is
$$\mathbb{E}\,\mathrm{Mis}_k(\hat y, y) = O\Big( \sum_{r \ne k} \exp\Big( -I_{kr} - \frac{1}{2}\log \Lambda_{\min} \Big) \Big), \tag{15}$$
where the sum over $r$ is due to the need to compare against all other clusters. The gap between $1/2$ and $1/2 - \zeta$ is not avoidable when stating high-probability results, due to the Markov inequality; see Lemma 3 for the details. This error rate coincides with (14), which merely loses a constant due to the unknown mean parameters and column labels. The rate is sharp up to a factor of $\exp(-\frac{1}{2}(L-1)\log \Lambda_{\min})$ according to the lower bound in Proposition 1.

The argument in Remark 2 can be formalized as the following minimax lower bound:

Theorem 2 (Minimax lower bound). Consider the parameter space
$$\mathcal{S} := \mathcal{S}(n, m, K, L, I^*) := \Big\{ (y, z, P) \ \Big|\ y \in \{0,1\}^{n \times K},\ z \in \{0,1\}^{m \times L},\ P \in [0,1]^{K \times L},\ \Lambda \text{ is defined based on } z \text{ and } P \text{ according to (2)},\ I_{\min}(\Lambda) \ge I^*,\ \varepsilon \ge \varepsilon^* \text{ where } \varepsilon \text{ is defined in (10)},\ y, z \text{ and } \Lambda \text{ satisfy (A1) to (A4)} \Big\},$$
and assume that there exists $(y, z, P) \in \mathcal{S}$ such that $I_{\min}(\Lambda) = I^*$. Further assume that $\beta, \omega > 1$ and $\varepsilon^* > 0$ are constants and $K \le \exp(c^* L)$ for some constant $c^* > 0$. Then, for large $n$, the minimax risk over $\mathcal{S}$ satisfies
$$\inf_{\hat y} \sup_{(y, z, P) \in \mathcal{S}} \mathbb{E}\big[ \mathrm{Mis}(\hat y, y) \big] \ge \exp\big( -I^* - L(\log I^* + C) \big). \tag{16}$$
Theorem 2 is a non-asymptotic result, i.e., we fix $n$ (and hence $m$). In this case, assumption (A4) in the definition of the parameter space $\mathcal{S}$ should be interpreted by fixing a vanishing sequence in advance. Note that in defining $\mathcal{S}$, only $\Lambda$, and not $\Gamma$, is required to satisfy (A3)-(A4). In order to better understand the rates in Corollary 1, let us consider some examples that clarify our results relative to the previous literature.

Example 1. Consider a simple planted partition model where $n = m$, $K = L$, $P_{kk} = a/n$ and $P_{k\ell} = b/n$ for $k \ne \ell$. Then, $\lambda_{kk} \in [\frac{a}{\beta K}, \frac{\beta a}{K}]$ and $\lambda_{k\ell} \in [\frac{b}{\beta K}, \frac{\beta b}{K}]$ when $k \ne \ell$. Applying (8) with $s = 1/2$ gives $I_{kr} \ge (\sqrt{a} - \sqrt{b})^2/(\beta K)$. Assume that (A3) and (A4) hold, that is (using $\Lambda_{\max} \le \beta a/K$),
$$\beta^4\omega^3 (K \log K)\, a^2 = O(n) \quad \text{and} \quad \beta^6\omega^2 K^4 a = o\big( (\sqrt{a} - \sqrt{b})^4 \big).$$
Further assume that $\beta\omega^2 K^3 = o(a \wedge b)$. Then w.h.p., we have
$$\mathrm{Mis}_k(\hat y, y) = o\Big( \exp\Big( -\frac{(\sqrt{a} - \sqrt{b})^2}{\beta K} \Big) \Big). \tag{17}$$
For the details of (17), see Appendix D.3. In particular, if
$$\liminf_{n} \frac{(\sqrt{a} - \sqrt{b})^2}{\beta K \log n} \ge 1, \tag{18}$$
we obtain $\mathrm{Mis}_k(\hat y, y) = o(1/n)$ w.h.p., that is, we have exact recovery of the labels by Algorithm 3. (Whenever the misclassification rate drops below $1/n$, it should be exactly zero.) This result holds without any assumption of assortativity, i.e., it holds whether $a > b$ or $b > a$.

Example 2. Suppose that $P := \tilde P (\log n)/n$ where $\tilde P$ is a symmetric constant matrix, $n = m$, $K = L$, and $y = z$. Then, $K$, $\omega$ and $\varepsilon_{kr}$ are constants, and $\lambda_{k\ell} = \tilde\lambda_{k\ell} \log n$, where $\tilde\lambda_{k\ell} := \tilde P_{k\ell}\, \pi_\ell(y)$, and $I_{kr} = \tilde I_{kr} \log n$ where $\tilde I_{kr}$ is defined based on $\tilde\lambda_{k\ell}$ and $\tilde\lambda_{r\ell}$ as in (8). Assuming in addition that $\pi(y)$ is constant, both $\tilde\lambda_{k\ell}$ and $\tilde I_{kr}$ are constants. In this regime, our key assumptions (A3) and (A4) are satisfied. By Corollary 1, w.h.p., we have
$$\mathrm{Mis}_k(\hat y, y) = o\Big( \exp\Big( -\min_{r \ne k} \tilde I_{kr} \log n \Big) \Big) = o\big( n^{-\min_{r \ne k} \tilde I_{kr}} \big). \tag{19}$$
As a consequence, if $\min_{k \ne r} \tilde I_{kr} \ge 1$, then $\mathrm{Mis}_k(\hat y, y) = o(1/n)$ w.h.p., that is, we have exact recovery by Algorithm 3.
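In the scaling of Example 2, checking the exact-recovery condition $\min_{k \ne r}\tilde I_{kr} \ge 1$ is a finite computation on the normalized parameters. A short sketch reusing `chernoff_info` and `I_min` from the earlier sketch (the matrix $\tilde P$ and the priors below are illustrative, not from the paper):

```python
import numpy as np

# Normalized connectivity and balanced priors (illustrative values).
P_tilde = np.array([[6.0, 1.0],
                    [1.0, 6.0]])
pi = np.array([0.5, 0.5])

Lambda_tilde = P_tilde * pi           # normalized row means: lambda~_{kl} = P~_{kl} pi_l
threshold = I_min(Lambda_tilde)       # min_{k != r} I~_{kr}

# Exact recovery by Algorithm 3 is predicted iff the threshold is at least 1.
print("min I~_kr =", threshold, "-> exact recovery:", threshold >= 1)
```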
In addition to the above more or less familiar setups (cf. Section 3.1), our results determine the optimal rate for a much wider range of parameter settings. As an example, consider the following setting of very slow degree growth:

Example 3. Consider the setup of Example 2 but with $\log n$ replaced with $\log\log n$ in the definition of $P$. In this case, the expected average degree grows very slowly, as $\log\log n$, and it is known that exact recovery is not possible in this regime. However, our results establish the following minimax optimal rate:
$$\mathrm{Mis}(\hat y, y) \asymp \frac{1}{(\log n)^{\alpha}\sqrt{\log\log n}}, \quad \text{where } \alpha = \min_{k \ne r} \tilde I_{kr},$$
and this rate is achieved by Algorithm 3.

3.1. Comparison with existing results

Let us now compare with the work of Gao et al. (2017) and Abbe and Sandon (2015), whose results are closest to ours. Both papers consider the symmetric (unipartite) SBM, but the results can be argued to hold in the bipartite setting as well. The setup of Example 1 is more or less what is considered in Gao et al. (2017). They have shown that, when $a > b$, there is an algorithm with misclassification error bounded by
$$\exp\Big( -(1 - o(1)) \frac{(\sqrt{a} - \sqrt{b})^2}{\beta K} \Big). \tag{20}$$
We have sharpened this rate to (17) under assumption (A3) (i.e., assuming the average degree grows slower than $O(\sqrt{n})$). Bound (20) implies that when $\liminf_n (\sqrt{a} - \sqrt{b})^2/(\beta K \log n) > 1$, one has exact recovery. Our bound, on the other hand, imposes the relaxed condition (18). We note that the results in Gao et al. (2017) are derived for a more general class of (assortative) models than that of Example 1, namely, the class with connectivity matrix satisfying $P_{kk} \ge a/n$ and $P_{k\ell} \le b/n$ for $k \ne \ell$. The rate obtained in Gao et al. (2017) uniformly over this class is dominated by that of the hardest model within the class, which is the model of Example 1. For other members of this class, neither their rate (20) nor the one we gave in (17) is optimal. The optimal rate in those cases is given by the general form of Theorem 1 and is controlled by the general form of $I_{kr}$ in (8). In other words, Algorithm 3 that we present is rate adaptive over the class considered in Gao et al. (2017), achieving the optimal rate simultaneously for each member of the class. A key in our approach is to apply the likelihood-type algorithm twice, in contrast to the single application in Gao et al. (2017). After the second stage we obtain much better estimates of the labels and parameters relative to the initial values, allowing us to establish the sharper form of the bound. Another key difference is the result in Lemma 2(b), which provides a better error rate than the classical Chernoff bound, using a very innovative technique introduced in Abbe and Sandon (2015). Moreover, we keep track of the balance parameter $\beta$ in (A2), allowing it to go to infinity slowly. Last but not least, assortativity is a key assumption in Gao et al. (2017), while our result does not rely on it. Besides consistency, our provable algorithm is computationally more efficient. To obtain initial labels, we apply the spectral clustering on very few subgraphs (8 to be exact). However, the provable version of the algorithm in Gao et al. (2017) applies the spectral clustering for each single node on the rest of the graph excluding that node. If the cost of running the spectral clustering on a network of $n$ nodes is $C_n$, then our approach costs $8 C_{n/8}$ while that of Gao et al. (2017) costs roughly $n\, C_{n-1}$. Our algorithm thus has a significant advantage in computational complexity when $n \to \infty$. To be fair, the algorithm in Gao et al. (2017) was for the symmetric SBM, which has the extra complication of dependency in $A$ due to symmetry.
Our comparison here is mostly with Corollary 3.1 in Gao et al. (2017). In addition, Gao et al. (2017) have a result (their Theorem 5) for when $\omega$ grows arbitrarily fast, which is not covered by our result. See Appendix A.2 for comments on the symmetric case and the dependence on $\omega$. The problem of exact recovery for a general SBM has been considered in Abbe and Sandon (2015), again for the case of a symmetric SBM, though the results are applicable to the bipartite setting (with $y = z$). The model and scaling considered in Abbe and Sandon (2015) is the same as that of Example 2, and they show that exact recovery of all labels is possible if (and only if) $\min_{k \ne r} \tilde I_{kr} \ge 1$, which is the same result we obtain in Example 2 for Algorithm 3. Thus, our result contains that of Abbe and Sandon (2015) as a special case, namely in the $\log n$ regime of degree growth, with other parameters (such as $K$ and the normalized connectivity matrix) kept constant. The results and algorithms of Abbe and Sandon (2015) do not apply to the general model in our paper. First, they only consider the regime $P \asymp \log n / n$, i.e., the degree grows as fast as $\log n$, while we allow the degree to grow in the range from arbitrarily slowly up to as fast as $O(\sqrt{n})$. Second, as discussed in Appendix A.1, their edge splitting idea cannot be used to derive the results in this paper, and we introduce the block partitioning to supply the independent copies necessary for the analysis. Finally, we note that Example 3 with a general non-assortative matrix $\tilde P$ has no counterpart in the literature. Existing results are not capable of providing any guarantees for such setups.

4. Pseudo-likelihood approach

In this section, after introducing the local and global mean parameters which will be used throughout the paper, we present our general pseudo-likelihood approach to biclustering.

4.1. Local and global mean parameters

Let us define the following operator that takes an adjacency matrix $A$, and row and column labels $\tilde y$ and $\tilde z$, and outputs the corresponding (unbiased) estimate of its mean parameters:
$$[\mathscr{L}(A, \tilde y, \tilde z)]_{k\ell} = \frac{1}{n_k(\tilde y)} \sum_{i=1}^{n} \sum_{j=1}^{m} A_{ij} \mathbf{1}\{\tilde y_i = k,\ \tilde z_j = \ell\}, \quad k \in [K],\ \ell \in [L]. \tag{21}$$
Note that $\mathscr{L}(A, \tilde y, \tilde z)$ is a $K \times L$ matrix with nonnegative entries. In general, we let
$$\hat\Lambda = (\hat\lambda_{k\ell}) := \mathscr{L}(A, \tilde y, \tilde z), \tag{22}$$
$$\Lambda(\tilde y, \tilde z) = (\lambda_{k\ell}(\tilde y, \tilde z)) := \mathscr{L}(\mathbb{E}[A], \tilde y, \tilde z), \tag{23}$$
for any row and column labels $\tilde y$ and $\tilde z$. Here, $\hat\Lambda$ is the estimate of the true row mean matrix. Its expectation is $\mathbb{E}[\hat\Lambda] = \Lambda(\tilde y, \tilde z)$ due to the linearity of $\mathscr{L}$. We call $\Lambda(\tilde y, \tilde z)$ the (global) row mean parameters associated with labels $\tilde y$ and $\tilde z$. (We do not explicitly show the dependence of $\hat\Lambda$ on the labels, in contrast to the mean parameters.) We have the following key identity:
$$\Lambda(\tilde y, \tilde z)\big|_{\tilde y = y,\ \tilde z = z} = \Lambda, \tag{24}$$
where $\Lambda$ is the true (global) row mean parameter matrix defined in (2). In words, (24) states that the global row mean parameters associated with the true labels $y$ and $z$ are the true such parameters. We will also use parameters such as $\Lambda(y, \tilde z)$, which are obtained based on the true row labels $y$ and generic column labels $\tilde z$. We also need local versions of all these definitions, which are obtained based on submatrices of $A$. More precisely, let $A^{(q',q)}$ be a submatrix of $A$, and let $\tilde y^{(q')}$ and $\tilde z^{(q)}$ be the corresponding subvectors of $\tilde y$ and $\tilde z$ (i.e., corresponding to the same row and column index sets used to extract the submatrix). Here, $q'$ and $q$ range over $[Q] = \{1, \dots, Q\}$, creating a partition of $A$ into $Q^2$ sub-blocks.
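With one-hot label matrices, the operator (21) reduces to a single matrix product; a minimal sketch (illustrative):

```python
import numpy as np

def mean_param_estimate(A, y_til, z_til, K, L):
    """The operator L(A, y~, z~) of equation (21): returns the K x L matrix hat-Lambda."""
    Y = np.eye(K, dtype=int)[y_til]           # n x K one-hot rows
    Z = np.eye(L, dtype=int)[z_til]           # m x L one-hot columns
    counts = Y.T @ A @ Z                      # sums of A_ij over each (k, l) block
    n_k = Y.sum(axis=0)                       # cluster sizes n_k(y~)
    return counts / n_k[:, None]              # divide row k by n_k(y~)
```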
We call
$$\Lambda^{(q',q)}(\tilde y, \tilde z) := \big(\lambda^{(q',q)}_{k\ell}(\tilde y, \tilde z)\big) := \mathscr{L}\big(\mathbb{E}[A^{(q',q)}],\ \tilde y^{(q')},\ \tilde z^{(q)}\big) \tag{25}$$
the local row mean parameters associated with submatrix $A^{(q',q)}$ and sublabels $\tilde y^{(q')}$ and $\tilde z^{(q)}$. The corresponding estimates are defined similarly, by replacing $\mathbb{E}[A^{(q',q)}]$ with $A^{(q',q)}$. We will mostly work with submatrices obtained from a partition $\{A^{(q',q)},\ q', q \in [Q]\}$ of $A$ into (nearly) equal-sized blocks, the details of which are described in Section 5. In such cases,
$$\Lambda^{(q',q)}(\tilde y, \tilde z) \approx \frac{1}{Q} \Lambda(\tilde y, \tilde z), \quad q', q \in [Q],$$
assuming that each sub-block in the partition has nearly similar cluster proportions: $n(\tilde z^{(q)}) \approx \frac{1}{Q} n(\tilde z)$. This is the case, for example, for a random partition, as we show in Appendix B.2. Of special interest is when we replace both $\tilde y$ and $\tilde z$ with the true labels $y$ and $z$. In such cases, $\Lambda^{(q',q)}$ does not depend on $q'$. More precisely, we have for any $q \in [Q]$,
$$\lambda^{(q',q)}_{k\ell}(y, z) = P_{k\ell}\, n_\ell(z^{(q)}), \quad q' \in [Q], \tag{26}$$
where $n_\ell(z^{(q)})$ is the number of labels in class $\ell$ in $z^{(q)}$, consistent with our notation for the full label vectors. We often write $\Lambda^{(q)} = (\lambda^{(q)}_{k\ell})$ as a shorthand for $\Lambda^{(q',q)}(y, z)$, which is justified by the above discussion. These will be called the true local row mean parameters (associated with the column-$q$ sub-block in the partition).

4.2. General pseudo-likelihood algorithm

Let us now describe our main algorithm based on the pseudo-likelihood (PL) idea, which is a generalization of the approach in Amini et al. (2013) to the bipartite setup. The pseudo-likelihood algorithm (PLA) is effectively an EM algorithm applied to the approximate mixture of Poissons obtained from the block compression of the adjacency matrix. It relies on some initial estimates of the row and column labels to perform the first block compressions (for both rows and columns). The initialization is often done by spectral clustering and will be discussed in Section 5.2.2. Subsequent block compressions are performed based on the label updates at previous steps of PLA. Let us assume that we have obtained labels $\tilde y$ and $\tilde z$ as estimates of the true labels $y$ and $z$. We focus on the steps of PLA for recovering the row labels. Let us define an operator $\mathscr{B}(A; \tilde z)$ that takes approximate column labels and produces the corresponding column compression of $A$:
$$\mathscr{B}(A; \tilde z) := b(\tilde z) := \big( b_{i\ell}(\tilde z) \big) \in \mathbb{Z}_+^{n \times L}, \quad b_{i\ell}(\tilde z) := \sum_{j=1}^{m} A_{ij} \mathbf{1}\{\tilde z_j = \ell\}. \tag{27}$$
The distribution of $b_{i\ell}(\tilde z)$ is determined by the row class of $i$. It is not hard to see that
$$\mathbb{E}[b_{i\ell}(\tilde z)] = \lambda_{k\ell}(y, \tilde z) = \lambda_{k\ell}(\tilde y, \tilde z)\big|_{\tilde y = y}, \quad \text{if } y_i = k, \tag{28}$$
where $\lambda_{k\ell}(\tilde y, \tilde z)$ are the (global) row mean parameters defined in (23). Now consider an operator $\mathscr{L}(b; \tilde y)$ that, given the column compression $b$ and the initial estimate of the row labels $\tilde y$, produces estimates of the (row) mean parameters $\lambda_{k\ell}(y, \tilde z)$:
$$\mathscr{L}(b; \tilde y) := \hat\Lambda := [\hat\lambda_{k\ell}] \in \mathbb{R}_+^{K \times L}, \quad \hat\lambda_{k\ell} := \frac{1}{n_k(\tilde y)} \sum_{i=1}^{n} b_{i\ell} \mathbf{1}\{\tilde y_i = k\}. \tag{29}$$
Note that if $\tilde y = y$, we have $\mathbb{E}[\hat\lambda_{k\ell}] = \lambda_{k\ell}(y, \tilde z)$. The definition of the estimates in (29) is consistent with that of (22) due to the identity
$$\mathscr{L}\big(\mathscr{B}(A; \tilde z);\ \tilde y\big) = \mathscr{L}(A, \tilde y, \tilde z),$$
which holds for any row labels $\tilde y$ and column labels $\tilde z$. Let us write
$$\pi(\tilde y) := (\pi_k(\tilde y)), \quad \pi_k(\tilde y) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\tilde y_i = k\} \tag{30}$$
for the estimate of the (row) class priors based on $\tilde y$. We note that the operations $\mathscr{B}$ and $\mathscr{L}$ remain valid, with a minor modification, even if $\tilde y$ and $\tilde z$ are soft labels. By a soft column label $\tilde z_j \in [0,1]^L$ we mean a probability vector on $[L]$: $\tilde z_{j\ell} \ge 0$ and $\sum_{\ell=1}^{L} \tilde z_{j\ell} = 1$, which denotes a soft assignment to each column cluster. To extend (27) to soft column labels, it is enough to replace $\mathbf{1}\{\tilde z_j = \ell\}$ with $\tilde z_{j\ell}$. Extending (29) to soft row labels $\tilde y$ is done similarly.
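Both operators, together with their soft-label extensions, become matrix products once the labels are stored as row-stochastic membership matrices. A sketch (illustrative):

```python
import numpy as np

def compress(A, Z_soft):
    """B(A; z~) of equation (27); Z_soft is m x L, and one-hot rows give the hard case."""
    return A @ Z_soft                          # b_{il} = sum_j A_ij z~_{jl}

def update_means(b, Y_soft):
    """L(b; y~) of equation (29) with (possibly soft) row labels; Y_soft is n x K."""
    n_k = Y_soft.sum(axis=0)                   # (soft) cluster sizes n_k(y~)
    return (Y_soft.T @ b) / n_k[:, None]       # hat-lambda_{kl}
```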
Given any block compression $b = (b_{i\ell})$, any estimate $\hat\Lambda$ of the (row) mean parameters and any estimate $\tilde\pi \in [0,1]^K$ of the (row) class prior, consider the operator that outputs the (row) class posterior, assuming that the rows $b_i$ approximately follow the mixture $\sum_k \tilde\pi_k \prod_\ell \mathrm{Poi}(\hat\lambda_{k\ell})$:
$$F(b, \hat\Lambda, \tilde\pi) := (\hat\pi_{ik}) \in [0,1]^{n \times K}, \quad \hat\pi_{ik} := \frac{\tilde\pi_k \prod_{\ell=1}^{L} \varphi(b_{i\ell}, \hat\lambda_{k\ell})}{\sum_{k'=1}^{K} \tilde\pi_{k'} \prod_{\ell=1}^{L} \varphi(b_{i\ell}, \hat\lambda_{k'\ell})}, \tag{31}$$
where $\varphi(x, \lambda) = \exp(x \log\lambda - \lambda)$ is the Poisson likelihood (up to constants).

Algorithm 1 Pseudo-likelihood biclustering, meta algorithm
1: Initialize row and column labels $\tilde y$ and $\tilde z$.
2: while $\tilde y$ and $\tilde z$ have not converged do
3:   $b \leftarrow \mathscr{B}(A; \tilde z)$
4:   while $\hat\Lambda$ and $\tilde y$ not converged (optional) do
5:     $\hat\Lambda \leftarrow \mathscr{L}(b; \tilde y)$
6:     Option 1: $\tilde\pi \leftarrow \mathbf{1}$, or option 2: $\tilde\pi \leftarrow \pi(\tilde y)$
7:     $\tilde y \leftarrow F(b, \hat\Lambda, \tilde\pi)$
8:     (Optional) Convert $\tilde y$ to hard labels.
9:   end while
10:  Repeat lines 3-9 with appropriate modifications to update $\tilde z$ and the column parameters (by changing $A$ to $A^T$ and swapping $\tilde z$ and $\tilde y$).
11:  (Optional) Convert $\tilde y$ and $\tilde z$ to hard labels if they are not.
12: end while

Algorithm 2 Simplified pseudo-likelihood clustering
1: Input: Initial column labels $\tilde z$, and $\tilde\Lambda$ that estimates $\Lambda$.
2: Output: Estimate of row labels $\hat y$.
3: $b \leftarrow \mathscr{B}(A; \tilde z)$
4: $\hat y \leftarrow F(b, \tilde\Lambda, \mathbf{1})$
5: Convert $\hat y$ to hard labels, by computing MAP estimates.

In practice, we only use $\pi(\tilde y)$ or a flat prior $\mathbf{1}$ as the estimated prior $\tilde\pi$ in this step; similarly, we only use a block compression that is based on estimates of the column labels, i.e., $b_{i\ell} = b_{i\ell}(\tilde z)$ for some $\tilde z \in [L]^m$. Note that $F$ outputs soft labels as new estimates of $y$. We can convert $(\hat\pi_{ik})$ to hard labels if needed. Algorithm 1 summarizes the general blueprint of PLA, which proceeds by iterating the three operators (27), (29) and (31). An optional conversion from soft to hard labels is performed by MAP assignment per row. With option 2 in step 6, the inner loop on lines 4-8 is the EM algorithm for a mixture of Poisson vectors. We can also remove the inner loop and perform iterations 5-8 only once. In total, Algorithm 1 has (at least) 6 possible versions, depending on whether we include each of the steps 8 or 11 (for the soft to hard label conversion) and whether to implement the inner loop till convergence or only for one step. We provide empirical results for two of these versions in Section 6. In practice, we recommend keeping the labels soft throughout, and only running the inner loop for a few iterations; maybe even once, if the computational cost is of significance.

4.3. Likelihood ratio classifier

A basic simplified building block of the PLA is given in Algorithm 2. This operation, which will play a key role in the development of the provable version of the algorithm in Section 5, can be equivalently described as a likelihood ratio classifier (LRC). Let us write the joint Poisson likelihood (up to a constant) as
$$\Phi(x, \lambda) = \prod_{\ell=1}^{L} \varphi(x_\ell, \lambda_\ell) = \prod_{\ell=1}^{L} \exp(x_\ell \log\lambda_\ell - \lambda_\ell), \quad x \in \mathbb{R}^L,\ \lambda \in \mathbb{R}_+^L, \tag{32}$$
and the corresponding likelihood ratio as
$$\Psi(x; \lambda \,|\, \lambda') = \log \frac{\Phi(x, \lambda)}{\Phi(x, \lambda')} = \sum_{\ell=1}^{L} \Big[ x_\ell \log\frac{\lambda_\ell}{\lambda'_\ell} + \lambda'_\ell - \lambda_\ell \Big], \quad x \in \mathbb{R}^L,\ \lambda, \lambda' \in \mathbb{R}_+^L. \tag{33}$$
Recalling the column compression (27), the likelihood ratio classifier, based on initial column labels $\tilde z$ and an estimate $\tilde\Lambda$ of the row mean parameter matrix, is
$$[\mathrm{LR}(A, \tilde\Lambda, \tilde z)]_i \in \operatorname*{argmax}_{r \in [K]} \log \Phi\big( b_i(\tilde z),\ \tilde\lambda_{r\cdot} \big), \quad i \in [n]. \tag{34}$$
This operation gives us a refined estimate of the row labels (i.e., $y$). It is not hard to see that the output of Algorithm 2 is $\hat y = \mathrm{LR}(A, \tilde\Lambda, \tilde z)$.
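The posterior step (31) and the LR classifier (34) are numerically safer in log space, since the products over $\ell$ underflow quickly. A sketch of one pseudo-likelihood pass (illustrative; `compress` is from the sketch above, and with `log_prior = np.zeros(K)` this is the flat-prior option):

```python
import numpy as np
from scipy.special import logsumexp

def posterior(b, Lam, log_prior):
    """F(b, Lambda, pi~) of equation (31), computed in log space."""
    # log-likelihood of row i under class k: sum_l [ b_il * log(lam_kl) - lam_kl ]
    loglik = b @ np.log(Lam).T - Lam.sum(axis=1)       # n x K
    logpost = log_prior + loglik                       # add log prior per class
    return np.exp(logpost - logsumexp(logpost, axis=1, keepdims=True))

def lr_classify(b, Lam):
    """Hard LR labels, equation (34): argmax_r log Phi(b_i, lam_r)."""
    return np.argmax(b @ np.log(Lam).T - Lam.sum(axis=1), axis=1)
```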
5. Provable version

When analyzing Algorithm 2, we need the initial labels to be independent of the adjacency matrix. Hence, we cannot apply the initialization method (e.g., the spectral clustering) and the likelihood ratio classifier (Algorithm 2) on the same adjacency matrix. In this section, we introduce Algorithm 3, which partitions $A$ into sub-blocks and operates iteratively on collections of these blocks to maintain the desired independence. Algorithm 3 is the version of the pseudo-likelihood algorithm for which our main result (Theorem 1) holds. Let us assume that $n$ and $m$ are divisible by $2Q = 8$. This assumption is not necessary but helps simplify the notation. Let us write $\hat y = \mathrm{rowSC}(A)$ and $\hat z = \mathrm{colSC}(A)$ to denote labels obtained by applying spectral clustering on the rows and columns of the adjacency matrix $A$, respectively; see Section 5.2.2 for details. We have $\mathrm{colSC}(A) = \mathrm{rowSC}(A^T)$. We also recall the LR classifier defined in (34). For matrices (or vectors) $A$ and $B$, we use $[A; B]$ to denote column concatenation and $[A\ B]$ to denote row concatenation. The general idea behind the partitioning scheme used in Algorithm 3, which is done by sequential sampling without replacement, is to ensure that in each step where the LR classifier is applied, the initial labels used are independent of the sub-block of the adjacency matrix under consideration. We do not require, however, that the initial labels be independent of the estimates of the mean parameters $\hat\Lambda$, since, as will be seen in Appendix B.1, we have uniform consistency of the LR classifier over all $\hat\Lambda$ close to the truth. For example, in step 7, that is, in the assignment $\tilde y^{(q)} \leftarrow \mathrm{LR}(A^{(q,q+2)}, \hat\Lambda^{(q+2)}, \tilde z^{(q+2)})$, the claim is that $\tilde z^{(q+2)}$ at that stage in the algorithm is independent of $A^{(q,q+2)}$ but not necessarily of $\hat\Lambda^{(q+2)}$. This will become clear in the following discussion, where we keep track of the dependence of various estimates through the algorithm. Note that in the description of Algorithm 3, we are using the computer coding convention for in-place assignments; e.g., $\tilde z^{(q)}$ gets updated in place and refers to different objects at different points in the algorithm.

Algorithm 3 Provable (parallelizable) version
1: Randomly partition the rows into 2 groups of equal size ($n/2$), so that $A = [A^{\mathrm{top}}; A^{\mathrm{bottom}}]$.
2: Randomly partition the rows and columns of $A^{\mathrm{bottom}}$ into 4 groups of equal size, so that we have 16 sub-adjacency matrices of dimension $(n/8) \times (m/4)$, i.e.,
$$A^{\mathrm{bottom}} = \begin{pmatrix} A^{(1,1)} & A^{(1,2)} & A^{(1,3)} & A^{(1,4)} \\ A^{(2,1)} & A^{(2,2)} & A^{(2,3)} & A^{(2,4)} \\ A^{(3,1)} & A^{(3,2)} & A^{(3,3)} & A^{(3,4)} \\ A^{(4,1)} & A^{(4,2)} & A^{(4,3)} & A^{(4,4)} \end{pmatrix}.$$
In each of the following steps, perform the stated operation for every $q \in \mathbb{Z}_4$:
3: Obtain initial row labels: $[\tilde y^{(q-1)}; \tilde y'^{(q)}] \leftarrow \mathrm{rowSC}\big([A^{(q-1,q)}; A^{(q,q)}]\big)$, $\forall q$.
4: Obtain initial column labels: $[\tilde z^{(q)}\ \tilde z'^{(q+1)}] \leftarrow \mathrm{colSC}\big([A^{(q,q)}\ A^{(q,q+1)}]\big)$, $\forall q$.
5: Get consistent (global) labels: $\tilde y \leftarrow \mathrm{Match}(\tilde y, \tilde y')$ and $\tilde z \leftarrow \mathrm{Match}(\tilde z, \tilde z')$.
6: Update (local) row mean parameters: $\hat\Lambda^{(q+2)} \leftarrow \mathscr{L}(A^{(q,q+2)}, \tilde y^{(q)}, \tilde z^{(q+2)})$, $\forall q$.
7: Update row labels: $\tilde y^{(q)} \leftarrow \mathrm{LR}(A^{(q,q+2)}, \hat\Lambda^{(q+2)}, \tilde z^{(q+2)})$, $\forall q$.
8: Similarly update column labels $\tilde z$ as in steps 6 and 7.
9: Update (local) row mean parameters: $\hat\Lambda^{(q+3)} \leftarrow \mathscr{L}(A^{(q,q+3)}, \tilde y^{(q)}, \tilde z^{(q+3)})$, $\forall q$.
10: Obtain (global) row mean parameters: $\hat\Lambda \leftarrow \sum_q \hat\Lambda^{(q)}$.
11: $\hat y^{\mathrm{top}} \leftarrow \mathrm{LR}(A^{\mathrm{top}}, \hat\Lambda, \tilde z)$.
12: Swap $A^{\mathrm{top}}$ and $A^{\mathrm{bottom}}$, then repeat steps 2-11 to obtain $\hat y^{\mathrm{bottom}}$, except for step 5 where $\tilde y$ is matched to $\hat y^{\mathrm{top}}$ instead, i.e., $\tilde y \leftarrow \mathrm{Match}(\tilde y, \hat y^{\mathrm{top}})$.
13: $\hat y \leftarrow [\hat y^{\mathrm{top}}; \hat y^{\mathrm{bottom}}]$.
14: Apply steps 1 to 10 on $A^T$ to obtain $\hat z$.

Figure 1 illustrates the partitions used in steps 2-9 of the algorithm. The collection of the submatrices in the partition is given a name in each case.
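The random sub-block partition of steps 1-2, and the cyclic index arithmetic used in steps 6-9, can be realized along the following lines (an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def partition_blocks(A, rng):
    """Random partition of A into a top half and a 4 x 4 grid on the bottom half
    (steps 1-2 of Algorithm 3)."""
    n, m = A.shape
    rows = rng.permutation(n)
    top, bottom = rows[: n // 2], rows[n // 2:]
    row_grp = np.array_split(bottom, 4)               # 4 row groups of size n/8
    col_grp = np.array_split(rng.permutation(m), 4)   # 4 column groups of size m/4
    blocks = {(qp, q): A[np.ix_(row_grp[qp], col_grp[q])]
              for qp in range(4) for q in range(4)}
    return top, blocks, row_grp, col_grp

# Cyclic indices in Z_4, e.g., the q+2 and q+3 shifts of steps 6-9:
shift = lambda q, s: (q + s) % 4
```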
For example, $\mathcal{G}^{\mathrm{col}}_1$ consists of the four submatrices in Figure 1(a). Note that $\{\mathcal{G}^{\mathrm{col}}_1, \mathcal{G}_2, \mathcal{G}_3\}$ form a complete partition of the matrix $A^{\mathrm{bottom}}$ into disjoint blocks. Also, $\mathcal{G}^{\mathrm{col}}_1$ and $\mathcal{G}^{\mathrm{row}}_1$ involve the same elements of the matrix, i.e., they cover the same portion of $A^{\mathrm{bottom}}$. Thus, $\{\mathcal{G}^{\mathrm{row}}_1, \mathcal{G}_2, \mathcal{G}_3\}$ is also a complete cover of $A^{\mathrm{bottom}}$ with disjoint blocks. Let us write $\mathcal{G}_1$ for the common portion of $A^{\mathrm{bottom}}$ covered by $\mathcal{G}^{\mathrm{col}}_1$ and $\mathcal{G}^{\mathrm{row}}_1$.

[Figure 1: The four stages of partitioning in Algorithm 3: (a) $\mathcal{G}^{\mathrm{col}}_1$ (Step 3), (b) $\mathcal{G}^{\mathrm{row}}_1$ (Step 4), (c) $\mathcal{G}_2$ (Steps 6, 7), (d) $\mathcal{G}_3$ (Step 9). In each case, the collection of submatrices in the partition is given a name which is used in the text. We have used the shorthand $A_{qq'} = A^{(q,q')}$ for simplicity. Blocks used in obtaining initial labels (a, b), in obtaining the first local parameter estimates (c), and in the first application of the LR classifier (d).]

Steps 3 and 4 operate on the blocks in $\mathcal{G}^{\mathrm{col}}_1$ and $\mathcal{G}^{\mathrm{row}}_1$, respectively, producing initial row and column labels. For example, in step 3, we apply rowSC on each submatrix specified in Figure 1(a) and obtain the label vectors (from the leftmost submatrix to the rightmost one):
$$[\tilde y'^{(1)}; \tilde y^{(4)}],\quad [\tilde y^{(1)}; \tilde y'^{(2)}],\quad [\tilde y^{(2)}; \tilde y'^{(3)}],\quad [\tilde y^{(3)}; \tilde y'^{(4)}]. \tag{35}$$
As a result of these steps, we obtain two sets of row labels $\tilde y = (\tilde y^{(q)} : q \in \mathbb{Z}_4)$ and $\tilde y' = (\tilde y'^{(q)} : q \in \mathbb{Z}_4)$, and similarly for the column labels. Neither $\tilde y$ nor $\tilde y'$ is necessarily a consistent set of labels for the whole matrix, since the cluster labels for the individual pieces $\tilde y^{(q)}$ and $\tilde y'^{(q)}$ need not match (e.g., cluster 1 in one piece could be labeled cluster 2 in another piece). However, if the sub-block labels (35) are sufficiently close to the truth, we can use the overlap among them to find a global set of labels that is consistent with each block of $\tilde y$ and $\tilde y'$. This is what the Match operator in step 5 does, as will be detailed in Section 5.1. The resulting updated global row and column labels only depend on the $\mathcal{G}_1$ portion of $A^{\mathrm{bottom}}$. Steps 6-13 go through the following phases:

First local parameter estimates (step 6): Having obtained good initial (global) row and column labels, in step 6 we obtain estimates of the local mean parameters $\hat\Lambda^{(q+2)}$ for the submatrices in $\mathcal{G}_2$, as in Figure 1(c). Note, for example, that $\hat\Lambda^{(q+2)}$ computed in this step depends on the block $A^{(q,q+2)}$ and on $\mathcal{G}_1$ through $\tilde z^{(q+2)}$. Collectively, the estimates $\{\hat\Lambda^{(q+2)} : q \in \mathbb{Z}_4\}$ in step 6 depend on the $\mathcal{G}_1 \cup \mathcal{G}_2$ portion of $A^{\mathrm{bottom}}$.

First LR classifier (steps 7-8): Using the estimates of the (local) row mean parameters, in step 7 we apply the LR classifier, $\tilde y^{(q)} \leftarrow \mathrm{LR}(A^{(q,q+2)}, \hat\Lambda^{(q+2)}, \tilde z^{(q+2)})$, to each of the submatrices in $\mathcal{G}_2$ (in Figure 1(c)). Here, $\hat\Lambda^{(q+2)}$ depends on the same block $A^{(q,q+2)}$ on which we apply the LR classifier, but the dependence is not problematic due to the uniform consistency of the LR classifier in the parameters (Lemma 2). However, we note that $\tilde z^{(q+2)}$ is a function of the $\mathcal{G}_1$ blocks of $A^{\mathrm{bottom}}$, hence independent of $A^{(q,q+2)}$, which is key in our arguments. We will similarly apply the LR classifier on the columns of $\mathcal{G}_2$, and obtain $\tilde z^{(q)}$. By the end of step 8, the updated labels $\tilde y$ and $\tilde z$ will depend on the blocks in $\mathcal{G}_1 \cup \mathcal{G}_2$; these labels will be much more accurate ($\mathrm{Mis} \lesssim \exp(-I/Q)$) than the initial labels obtained by spectral clustering.
Second parameter estimates (steps 9-10): Using the more accurate labels of step 8, we obtain the local mean parameters $\hat\Lambda^{(q+3)}$ in step 9 for the submatrices in $\mathcal{G}_3$ (Figure 1(d)). This step is similar to step 6, but because of the much more accurate labels, the parameter estimates are much more accurate too. Since the global mean parameter matrix is the sum of the local ones, i.e., $\Lambda = \sum_{q \in [Q]} \Lambda^{(q)}$, we use $\hat\Lambda := \sum_q \hat\Lambda^{(q)}$ to estimate $\Lambda$ in step 10. We recall that the true local mean parameters do not depend on the block row index; see (26).

Second LR classifier (step 11): Using the more accurate estimate of the (global) row mean parameters $\hat\Lambda$ from step 10 and the more accurate labels $\tilde z$ from step 8, in step 11 we apply the LR classifier $\hat y^{\mathrm{top}} \leftarrow \mathrm{LR}(A^{\mathrm{top}}, \hat\Lambda, \tilde z)$ on $A^{\mathrm{top}}$. We note that $A^{\mathrm{top}}$ in this step is independent of $\tilde z$ (as well as $\hat\Lambda$). This second LRC application is what brings us from very accurate labels ($\mathrm{Mis} \lesssim \exp(-I/Q)$) to almost optimal ones ($\mathrm{Mis} \lesssim \exp(-I)$); see Appendix C.

Bottom half (steps 12-13): The same process is repeated in step 12, after swapping the top and bottom halves of $A$, to get the bottom portion of the row labels. Matching the labels $\tilde y$ from the spectral clustering in step 5 with $\hat y^{\mathrm{top}}$ guarantees that $\hat y^{\mathrm{top}}$ and $\hat y^{\mathrm{bottom}}$ have consistent labels. Thus, no extra matching is required when concatenating the two in step 13.

5.1. Matching step

Let us describe the details of the matching step in Algorithm 3. Although the idea is intuitively clear, formally describing the procedure is fairly technical. In order to understand the idea, consider the two-block labels $\tilde y^{(q-1,q)} := [\tilde y^{(q-1)}; \tilde y'^{(q)}]$, for $q = 2, 3$, that is,
$$\tilde y^{(1,2)} := [\tilde y^{(1)}; \tilde y'^{(2)}], \qquad \tilde y^{(2,3)} := [\tilde y^{(2)}; \tilde y'^{(3)}].$$
We will detail how these two sets of labels can be fused together to generate a set of consistent labels for the three-block true label vector $y^{(1,2,3)} := [y^{(1)}; y^{(2)}; y^{(3)}]$. The two (overlapping) two-blocks of the true label vector are denoted as
$$y^{(1,2)} := [y^{(1)}; y^{(2)}], \qquad y^{(2,3)} := [y^{(2)}; y^{(3)}].$$
More generally, we let $y^{(q-1,q)} = [y^{(q-1)}; y^{(q)}]$, similar to the notation for the estimated blocks. Recall our notation $\sigma^*(\cdot)$ for (an) optimal permutation between two sets of labels (cf. Section 2.3). Finding $\sigma^*$ is a linear assignment problem, with computational complexity $O\big((K \vee L)^3\big)$; see Burkard and Cela (1999). Let us define
$$\sigma_{q-1,q} := \sigma^*\big( \tilde y^{(q-1,q)} \to y^{(q-1,q)} \big), \quad \sigma_q := \sigma^*\big( \tilde y^{(q)} \to y^{(q)} \big), \quad \sigma'_q := \sigma^*\big( \tilde y'^{(q)} \to y^{(q)} \big). \tag{36}$$
Thus, for example, we have $\sigma_{1,2} = \sigma^*(\tilde y^{(1,2)} \to y^{(1,2)})$, $\sigma_2 = \sigma^*(\tilde y^{(2)} \to y^{(2)})$, $\sigma'_3 = \sigma^*(\tilde y'^{(3)} \to y^{(3)})$, and so on, as depicted in Figure 2(a). In other words, each of these permutations is the optimal permutation from the corresponding block of the underlying estimated label to that of the truth. Let us write $\tilde y^{(1,2)} \approx y^{(1,2)}$ to mean that the two sets of labels are sufficiently close (to be made precise later). The first claim is that $\tilde y^{(1,2)} \approx y^{(1,2)}$ implies that the underlying sub-blocks have the same optimal permutation to the truth as the original two-block label, i.e.,
$$\tilde y^{(1,2)} \approx y^{(1,2)} \implies \sigma_1 = \sigma'_2 = \sigma_{1,2}, \quad \text{and similarly} \quad \tilde y^{(2,3)} \approx y^{(2,3)} \implies \sigma_2 = \sigma'_3 = \sigma_{2,3}.$$
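In practice, the Match operator only needs the optimal permutation between two observed labelings on a shared block, which is again a linear assignment on a confusion matrix. A sketch (illustrative; cf. equation (39) below):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels(labels_a, labels_b, K):
    """Permutation sigma on [K] maximizing agreement of sigma(labels_a) with labels_b."""
    C = np.zeros((K, K), dtype=int)
    np.add.at(C, (np.asarray(labels_a), np.asarray(labels_b)), 1)
    row, col = linear_sum_assignment(-C)
    sigma = np.empty(K, dtype=int)
    sigma[row] = col
    return sigma                      # relabel block a via sigma[labels_a]

# E.g., matching y~'(2) to y~(2) on their shared block recovers the missing
# relabeling sigma_2^{-1} o sigma'_2, which is then applied to the adjoining block.
```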
The second claim is that each sub-block has almost the same misclassification error as the bigger two-block. To see this, recall the direct misclassification rate introduced in Section 2.3, i.e., the misclassification rate without applying any permutation (or equivalently, with the identity permutation). We have
$$\mathrm{dMis}\big( \sigma_{2,3}(\tilde y^{(2,3)}),\ y^{(2,3)} \big) = \mathrm{Mis}\big( \tilde y^{(2,3)},\ y^{(2,3)} \big) \le \varepsilon, \tag{37}$$
where the inequality is by assumption ($\varepsilon$ being the rate achieved by the spectral clustering algorithm). A similar expression holds with $(2,3)$ replaced with $(1,2)$. Now (37) implies
$$\mathrm{dMis}\big( \sigma_2(\tilde y^{(2)}),\ y^{(2)} \big) = \mathrm{dMis}\big( \sigma_{2,3}(\tilde y^{(2)}),\ y^{(2)} \big) \le 2\varepsilon =: \varepsilon', \tag{38}$$
where the equality uses $\sigma_2 = \sigma_{2,3}$. To see the inequality, let $n_{2,3}$, $n_2$ and $n_3$ be the lengths of $\tilde y^{(2,3)}$, $\tilde y^{(2)}$ and $\tilde y'^{(3)}$. Then,
$$\mathrm{dMis}\big( \sigma_{2,3}(\tilde y^{(2,3)}),\ y^{(2,3)} \big) = \frac{n_2}{n_{2,3}}\, \mathrm{dMis}\big( \sigma_{2,3}(\tilde y^{(2)}),\ y^{(2)} \big) + \frac{n_3}{n_{2,3}}\, \mathrm{dMis}\big( \sigma_{2,3}(\tilde y'^{(3)}),\ y^{(3)} \big),$$
and the result follows since we have $n_2 = n_3 = n_{2,3}/2$ by construction. Note that dMis has the property of being easily distributed over sub-blocks, as opposed to Mis. Similarly, we obtain $\mathrm{dMis}(\sigma'_3(\tilde y'^{(3)}), y^{(3)}) \le \varepsilon'$ considering the second components of $\tilde y^{(2,3)}$ and $y^{(2,3)}$. Applying the same argument to indices $(1,2)$, we conclude similarly that $\mathrm{dMis}(\sigma_1(\tilde y^{(1)}), y^{(1)}) \le \varepsilon'$ and $\mathrm{dMis}(\sigma'_2(\tilde y'^{(2)}), y^{(2)}) \le \varepsilon'$. Now consider the following three-block vector, undergoing the transformations
$$\begin{bmatrix} \sigma_1(\tilde y^{(1)}) \\ \sigma_2(\tilde y^{(2)}) \\ \sigma'_3(\tilde y'^{(3)}) \end{bmatrix} \longrightarrow \begin{bmatrix} \sigma_2^{-1}\sigma_1(\tilde y^{(1)}) \\ \sigma_2^{-1}\sigma_2(\tilde y^{(2)}) \\ \sigma_2^{-1}\sigma'_3(\tilde y'^{(3)}) \end{bmatrix} = \begin{bmatrix} \sigma_2^{-1}\sigma_1(\tilde y^{(1)}) \\ \tilde y^{(2)} \\ \tilde y'^{(3)} \end{bmatrix} = \begin{bmatrix} \sigma_2^{-1}\sigma'_2(\tilde y^{(1)}) \\ \tilde y^{(2)} \\ \tilde y'^{(3)} \end{bmatrix}.$$
The leftmost vector has dMis of at most $\varepsilon'$ relative to $y^{(1,2,3)}$ by the previous arguments, and since $\mathrm{Mis} \le \mathrm{dMis}$, we have the same bound on the Mis rate for the leftmost vector. The first transformation keeps the same Mis rate, since we are applying a single permutation $\sigma_2^{-1}$ to all elements. The second transformation is in fact an equality, using $\sigma'_3 = \sigma_2$ established earlier. The third transformation/equality follows similarly from $\sigma_1 = \sigma'_2$. Thus, if we can recover $\sigma_2^{-1} \circ \sigma'_2$ from the data, we can construct a consistent three-block label having $\mathrm{Mis} \le \varepsilon'$. The third and final claim is that this is possible, and in fact we have
$$\sigma_2^{-1} \circ \sigma'_2 = \sigma^*\big( \tilde y'^{(2)} \to \tilde y^{(2)} \big), \tag{39}$$
that is, $\sigma_2^{-1} \circ \sigma'_2$ can be obtained (assuming $\varepsilon'$ is sufficiently small) by optimally matching $\tilde y'^{(2)}$ to $\tilde y^{(2)}$, both of which we observe in practice. See the commutative diagram in Figure 2(b).

[Figure 2: Pictorial depiction of the matching step. (a) Two-block and sub-block optimal permutations to the truth. When $\tilde y^{(1,2)} \approx y^{(1,2)}$, we have $\sigma_1 = \sigma'_2 = \sigma_{1,2}$, and similarly $\tilde y^{(2,3)} \approx y^{(2,3)}$ implies $\sigma_2 = \sigma'_3 = \sigma_{2,3}$. (b) Commutative diagram depicting how the missing permutation $\sigma_2^{-1} \circ \sigma'_2$ can be obtained by matching the observed labels $\tilde y'^{(2)}$ and $\tilde y^{(2)}$. See Section 5.1 for details.]

In order to make the above argument precise, we need to justify the first and third claims. We will discuss the details in Appendix B.4. The above matching process can be repeated over all the two-blocks $\tilde y^{(q-1,q)}$ to get a consistent set of global labels whose overall misclassification rate is no more than twice that of the original two-blocks (cf. $\varepsilon$ versus $\varepsilon'$).

5.2. Results for Algorithm 3

5.2.1. General initialization

Before studying the spectral initialization, let us give a general bound on the misclassification rate of Algorithm 3, assuming sufficiently good initial labels.
5.2. Results for Algorithm 3

5.2.1. General initialization

Before studying the spectral initialization, let us give a general bound on the misclassification rate of Algorithm 3, assuming sufficiently good initial labels. Assume that the initial labels obtained in steps 3 and 4 of the algorithm are γ1-good in the sense that

Mis(ỹ, y) ≤ γ1/(βK), Mis(z̃, z) ≤ γ1/(βL), (40)

with γ1 satisfying

γ1 ≤ (1/(384β²ω)) min{ I_min/(8L‖Λ‖∞), I^col_min/(16K‖Γ‖∞) }.

Any initialization algorithm other than spectral clustering can be used, as long as the above guarantee on its output holds. We also need the following weaker version of (A4):

βω(K∨L)(‖Λ‖∞ ∨ ‖Γ‖∞) = o( [ (I_min ∧ I^col_min) / (Q log Q) ]^a ), for some a > 0. (A4′)

Theorem 3. Assume that the model parameters satisfy I_min ∧ I^col_min → ∞, Λ_min → ∞, (A1), (A2), (A3) and (A4′), and that the initial labels satisfy (40). Then, for some ζ = o(1), the labels ŷ output by Algorithm 3 satisfy

Mis_k(ŷ, y) = O( ω ∑_{r≠k} (1 + 1/ε_kr) exp[ −I_kr − (1/2 − ζ) log Λ_min ] ) (41)

for every k ∈ [K], with probability 1 − o(1). We refer to Section 3 for the definitions of the parameters involved in the rate (41).

5.2.2. Spectral initialization

Theorem 3 requires initial labels that satisfy (40). A spectral algorithm, namely SC-RRE given in Algorithm 4, can provide such an initialization. (The acronym stands for reduced-rank efficient spectral clustering.) The algorithm is presented and analyzed in Zhou and Amini (2019).

Algorithm 4 SC-RRE
1: Apply degree-reduction regularization to A to obtain A_re, as discussed in Zhou and Amini (2019).
2: Obtain A_re^{(K∧L)} = Ẑ1 Σ̂ Ẑ2^T, the (K∧L)-truncated SVD of A_re.
3: Output K(Ẑ1 Σ̂), where K is an isometry-invariant κ-approximate k-means algorithm.

One performs a truncated SVD of rank r := K∧L on a regularized version of the adjacency matrix, A_re, to obtain Ẑ1 Σ̂ Ẑ2^T, where Σ̂ is the diagonal matrix retaining the r largest singular values of A_re, and Ẑ1 ∈ R^{n×r} and Ẑ2 ∈ R^{m×r} are the matrices of the corresponding singular vectors. One then runs a k-means type algorithm on Ẑ1 Σ̂, rather than on Ẑ1, which is the more common approach in spectral clustering. This allows one to state consistency results without reference to the gap in the spectrum of A_re, while retaining the attractive feature of the latter approach, namely the computational efficiency of running k-means on a matrix of reduced dimension. The isometry-invariant qualification used in Algorithm 4 means that the k-means algorithm should be sensitive only to the pairwise distances of the data points. We refer to Zhou and Amini (2019) for a detailed discussion. In particular, one has the following bound on the misclassification rate of SC-RRE:

Theorem 4 (Zhou and Amini (2019)). Let α = m/n and Λ̄² := min_{t≠s} ‖Λ_{s·} − Λ_{t·}‖². Consider the spectral algorithm given in Algorithm 4, and assume (A2) and that, for a sufficiently small C1 > 0,

β²KL(K∧L) √α ‖Λ‖∞ / Λ̄² ≤ C1(1 + κ)^{−2}. (42)

Then the algorithm outputs estimated row labels ỹ satisfying

Mis(ỹ, y) ≤ C1^{−1}(1 + κ)² βL(K∧L) √α ‖Λ‖∞ / Λ̄².

Here, Λ_{s·} refers to the sth row of the mean parameter matrix Λ (cf. Section 2.1). Combining Theorems 3 and 4, one obtains Theorem 1. Some work is required to translate the bound of Theorem 4 to be applicable to sub-blocks; see Section D.1 for details.

6. Simulations

We provide some simulation results to corroborate the theory. We generate from the SBM of Section 2.1 with connectivity matrix

P = C · (log(mn))^α / √(mn) · B, B = [ 1 2 3 4 5 6 ; 2 3 4 5 6 1 ; 3 4 5 6 1 2 ; 4 5 6 1 2 3 ], (43)

so that the rows of B are successive cyclic shifts of (1, 2, 3, 4, 5, 6). Note that B does not have any clear assortative or dissortative structure. We let n = Kn0 and m = Ln0, and we vary n0; all clusters (both row and column) have the same number of nodes n0. By changing α, we can study different regimes of sparsity. In particular, when α ∈ (0, 1), we are in the regime where weak recovery is possible but not exact (or strong) recovery. We consider both the misclassification rate and the normalized mutual information (NMI) as measures of performance.
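For reference, here is a minimal Python sketch of the simulation setup, under our reading of (43) (in particular, the scaling prefactor C(log(mn))^α/√(mn) is our reconstruction from the surrounding discussion of sparsity regimes):

```python
import numpy as np

rng = np.random.default_rng(0)

K, L, n0 = 4, 6, 200
C, alpha = 1.0, 0.75
n, m = K * n0, L * n0

# B[k, l] = ((k + l) mod 6) + 1: the rows are cyclic shifts of (1, ..., 6),
# so B has no clear assortative or dissortative structure.
B = np.array([[(k + l) % L + 1 for l in range(L)] for k in range(K)], float)
P = C * np.log(n * m) ** alpha / np.sqrt(n * m) * B   # connectivity matrix

y = np.repeat(np.arange(K), n0)          # balanced row communities
z = np.repeat(np.arange(L), n0)          # balanced column communities
A = (rng.random((n, m)) < P[np.ix_(y, z)]).astype(np.int8)   # bipartite SBM
```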
NMI is a measure of accuracy taking values between 0 and 1 (1 = a perfect match). The NMI is quite sensitive to mismatch and tends to reveal discrepancies between methods more clearly. Figure 3(a) shows the overall NMI versus n0, and Figure 3(b) the corresponding log misclassification rates.

Figure 3: Plots of (a) the overall NMI and (b) the corresponding log misclassification rate, versus the number of nodes per cluster n0 (ranging from 200 to 2000), for the SBM with connectivity matrix (43) and C = 1.00, α = 0.75, K = 4, L = 6. The four algorithms considered are the spectral clustering of Algorithm 4, the soft and hard versions of Algorithm 1, and the oracle algorithm of Section 2.2.

We have considered four algorithms: (1) Spectral: the spectral clustering of Algorithm 4. (2) Soft: Algorithm 1 with flat prior, no inner loop, and no conversion to hard labels. (3) Hard: Algorithm 1 with flat prior, no inner loop, and conversion to hard labels after each label computation. (4) Oracle: the oracle classifier discussed in Section 2.2 and Remark 2; assuming the knowledge of z and Λ, we obtain ŷ by the likelihood ratio classifier, and similarly obtain ẑ assuming the knowledge of y and Γ.

Figure 3 shows the results for α = 0.75 (a regime where no exact recovery is possible) and C = 1. Both the soft and hard versions of Algorithm 1 are initialized with the spectral clustering, and both significantly improve over it. The soft version of Algorithm 1 also outperforms the hard version, as one would expect: soft labels carry more information between iterations. It is interesting to note that, as predicted by the theory, the slope of the log-misclassification rate of Algorithm 1 approaches that of the oracle; this is especially clear for the soft version in Figure 3(b). Simulation results for various other settings, showing qualitatively similar behavior, can be found in Appendix H.
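Algorithm 1 itself is stated earlier in the paper; purely to illustrate the soft variant compared above, the following Python sketch performs one soft pseudo-likelihood round for the row labels under the Poisson block-compression model with a flat prior (a schematic under our assumptions, not the paper's exact update):

```python
import numpy as np
from scipy.special import logsumexp

def soft_row_update(b, Lam):
    """One soft pseudo-likelihood round for the row labels.
    b   : n x L block-compressed counts b_i(z)
    Lam : K x L current row mean parameters (assumed entrywise positive)
    Returns the posterior row memberships (n x K) and the updated Lam."""
    ll = b @ np.log(Lam).T - Lam.sum(axis=1)                  # Poisson log-lik.
    post = np.exp(ll - logsumexp(ll, axis=1, keepdims=True))  # flat prior
    Lam_new = (post.T @ b) / post.sum(axis=0)[:, None]        # weighted means
    return post, Lam_new
```

The hard variant replaces post with the one-hot encoding of its row-wise argmax after each label computation; the column labels are updated symmetrically, with Γ in place of Λ.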
Acknowledgement

We thank Yunfeng Zhang for helpful discussions.

Appendix A. Additional comments

A.1. Comments on edge splitting

One needs independent versions of the adjacency matrix in different stages of the algorithm. To achieve this goal, edge splitting was introduced in Abbe and Sandon (2015). The idea is that one can regard the two (or more) graphs obtained from edge splitting as nearly independent. To be specific, let P1 be the joint probability measure corresponding to a pair of graphs G1 and G2 generated independently with connectivity matrices qP̃ and (1−q)P̃. Let P2 be the joint probability measure on (G1, G2) obtained by edge splitting from a single SBM with connectivity matrix P̃, assigning every edge independently to either G1 or G2 with probabilities q and 1−q. Then, P1 and P2 have the same marginal distributions. Having a vanishing total variation distance between P1 and P2 is necessary for further analysis, which, as was pointed out in (Abbe and Sandon, 2015, pp. 46–47), is equivalent to showing that, under P1, G1 and G2 share no edge with high probability. Letting P̃_min = min_{k,ℓ} P̃_{kℓ},

P1(G1 and G2 do not share edges) ≤ 1 − (1−q)q P̃²_min (log n)²,

which is strictly bounded away from 1 unless (1−q)q P̃²_min (log n)² = o(1); that is, the connectivity matrix of either G1 or G2 would have to vanish faster than 1/n. Our consistency result will not hold in this regime. Thus, edge splitting cannot be used to derive the results in this paper, and we instead introduce the block partitioning idea to supply the independent copies necessary for the analysis. Another technical issue with edge splitting is discussed in Remark 4.

A.2. Discussion

Our results do not directly apply to the symmetric case, due to the dependence between the upper and lower triangular parts of the adjacency matrix A. However, a more sophisticated two-stage block partitioning scheme can be used to derive similar bounds under mild extra assumptions. One starts with an asymmetric partition into blocks of sizes {qn, (1−q)n} × {qn, (1−q)n}, for q = 1/Q → 0 very slowly. In the first stage, one applies a procedure similar to Algorithm 3 to the upper triangular portion of the large (1−q)n × (1−q)n sub-block, followed by the application of the LR classifier on the fat qn × (1−q)n block, to obtain very accurate row labels for the small qn × qn block. One then repeats the process using the leave-one-out idea of Gao et al. (2017), applied to small qn × qn blocks rather than individual nodes. We leave the details for future work.

It was also shown in (Gao et al., 2017, Theorem 5) that their equivalent of condition (A1) can be removed by modifying the algorithm. In their setting, without assuming a ≍ b, a misclassification rate of exp(−(1−ε)I) is achievable, where ε ∈ (0, 1) is a variable in the new version of their algorithm. If those arguments can be extended to the general block model, it will be possible to relax the requirements on ω in (A3) and (A4). When K, L = O(1), one can completely remove sparsity condition (A3) by using a much sharper Poisson-binomial approximation than the one used in this paper. Finally, we suspect that our results could be generalized beyond SBMs to biclustering arrays whose row and column sums over clusters follow Poissonian central limit theorems. We will explore these ideas in the future.

A.3. PL naming

We have borrowed the name pseudo-likelihood (PL) from Amini et al. (2013), based on which the algorithms in this paper are derived. In Amini et al. (2013), the setup is that of the symmetric SBM, and in order to treat the full likelihood as a product, over nodes i = 1, ..., n, of mixtures of independent Poisson vectors, one has to ignore the dependence between the upper and lower triangular parts of the adjacency matrix, making the PL naming more in line with the traditional use of the term. In our bipartite setup, there is no such dependence to ignore, but we have kept the name PL for consistency with Amini et al. (2013) and ease of use. We interpret the pseudo nature of the likelihood as the approximation incurred in the block compression stage (with imperfect labels) and in replacing the Poisson-binomial distribution with the Poisson.

Appendix B. Preliminary analysis

We start by analyzing the properties of the operators introduced in Sections 4.1 and 4.2, for some fixed (deterministic) initial labels ỹ and z̃. We assume that these labels satisfy

Mis(ỹ, y) ≤ γ/(βK), Mis(z̃, z) ≤ γ/(βL). (B3)

We call such labels γ-good, as in Section 5.2.1. Throughout, Λ̃ will be used to denote a generic deterministic approximation of the true row mean parameter Λ. The relative ℓ∞ ball of radius δ centered at Λ, that is,

BΛ(δ) := { Λ̃ : ‖Λ̃ − Λ‖∞ ≤ δ‖Λ‖∞ }, (44)

will play a key role in our arguments. For sufficiently small δ and true Λ, BΛ(δ) will be the set of δ-good row mean parameters.
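Membership in BΛ(δ) is a one-line check; the following Python helper (our naming) implements the relative ℓ∞ criterion (44):

```python
import numpy as np

def in_relative_linf_ball(Lam_tilde, Lam, delta):
    """Lam_tilde is in B_Lambda(delta) iff every entry of Lam_tilde is within
    delta * ||Lam||_inf (the maximum absolute entry of Lam) of Lam."""
    return np.max(np.abs(Lam_tilde - Lam)) <= delta * np.max(np.abs(Lam))
```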
B.1. Fixed label analysis

We first present the analysis assuming that all the operations are performed on the entire adjacency matrix A. In Appendix B.2, these results are extended to be applicable to sub-blocks of A. Recall the definitions of the mean parameters and their estimates from Section 2.1. In particular, we recall that λ_{kℓ}(y, z̃) is the mean of b_{iℓ}(z̃) for any node i with y_i = k; these mean parameters form the kth row of Λ(y, z̃). Our first main lemma shows that whenever the initial labels z̃ and ỹ are γ-good, the parameters Λ(y, z̃), as well as the corresponding estimate Λ̂ defined in (22), are close to the truth, that is, to Λ.

Lemma 1 (Parameter consistency). Let Cγ = Cγ,β = β²γ/(1 − γ), assume that 6Cγω ≤ 1, and let h_c(τ) := (3τ/(4c)) log(1 + 2cτ/3). Then, under assumptions (A1), (A2) and (B3), we have:
(a) ‖Λ(y, z̃) − Λ‖∞ ≤ Cγ‖Λ‖∞ and ‖Λ(y, z̃)‖∞ ≤ 2‖Λ‖∞.
(b) ‖Λ(ỹ, z̃) − Λ(y, z̃)‖∞ ≤ 2γ‖Λ‖∞ and ‖Λ(ỹ, z̃)‖∞ ≤ 4‖Λ‖∞.
(c) ‖Λ̂ − Λ(ỹ, z̃)‖∞ ≤ 4τ‖Λ‖∞ with probability at least 1 − 2p1, where

p1 = p1(τ; n, Λ_min, β) := KL exp(−nΛ_min h1(τ)/(4βK)), τ > 0, (45)

and Λ̂ is as defined in (22).
(d) ‖Λ(ỹ, z) − Λ‖∞ ≤ Cγ‖Λ‖∞.
In particular, all the estimates Λ(y, z̃), Λ(ỹ, z̃) and Λ̂ are within relative ℓ∞ distance at most 4(Cγ + τ) of Λ.

The lemma is proved in Appendix F.1. Note that the lemma implies that Λ̂ ∈ BΛ(4(Cγ + τ)) with the stated probability. Our second key lemma shows that the LR classifiers in (34) are uniformly dominated, over Λ̃ ∈ BΛ(δ), by a single (perturbed) classifier. To state this result, recall the block compression b(z̃) := B(A; z̃) given in (27), and define

Y_{ikr}(b_i, Λ̃) := Ψ(b_i; λ̃_r | λ̃_k) = ∑_{ℓ=1}^L [ b_{iℓ} log(λ̃_{rℓ}/λ̃_{kℓ}) + λ̃_{kℓ} − λ̃_{rℓ} ], (46)
Z_{ik}(b_i, Λ̃) := 1{ Y_{ikr}(b_i, Λ̃) ≥ 0 for some r ≠ k }, (47)
S_k(b, Λ̃) := (1/n_k(y)) ∑_{i: y_i = k} Z_{ik}(b_i, Λ̃), (48)

where Ψ is the Poisson log-likelihood ratio defined in (33). Thus, Y_{ikr} is the (pseudo) log-likelihood ratio, for k, r ∈ [K], measuring the likelihood of row i having label r relative to label k. We note that Y_{ikr}(b_i, Λ̃) < 0 for all r ≠ k implies ŷ_i := (LR(A, Λ̃, z̃))_i = k. Thus, S_k(b, Λ̃) controls the misclassification rate of the LR classifier over the kth row class, i.e., Mis_k(ŷ, y). Let

J_{kr} := L‖Λ‖∞ / I_{kr}. (49)

Recalling the definitions of ε_{kr}, ω and β from Section 3, set

η∗ := η∗(δ; Λ) := 8ωδL‖Λ‖∞ = 8ωδ J_{kr} I_{kr}, (50)
η_{kr} := η_{kr}(δ; ω, β, m, Λ) := 21δωL‖Λ‖∞ + 5βL²‖Λ‖²∞/m + log[ 11ω( (ε_{kr} − 2ω(1 + ε_{kr})δ)^{−1} + 1 ) ] − (1/2) log Λ_min. (51)

We have the following key lemma:

Lemma 2 (Uniformity of LRC in mean parameters). Fix any column label z̃ and let b = b(z̃) be the corresponding block compression. Let Λ* = Λ(y, z̃) be the row mean parameter associated with b. Assume (A1), (A2), and Λ* ∈ BΛ(δ) with 3ωδ < 1. Then, for all k, r ∈ [K] with k ≠ r, and all i such that y_i = k, we have the following bounds:
(a) With η∗ defined as in (50),

P( ∃ Λ̃ ∈ BΛ(δ) : Y_{ikr}(b_i, Λ̃) ≥ 0 ) ≤ exp(−I_{kr} + η∗). (52)

(b) If, in addition, ε_{kr} − 2ωδ > 0, then with η_{kr} defined as in (51),

P( ∃ Λ̃ ∈ BΛ(δ) : Y_{ikr}(b_i, Λ̃) ≥ 0 ) ≤ exp(−I_{kr} + η_{kr}). (53)

The proof of Lemma 2(b) appears in Appendix F.5, and that of part (a) in Appendix G.5.
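In code, the LR classifier whose errors the quantities (46)–(48) count is simply a maximum Poisson pseudo-likelihood assignment; a minimal Python sketch (our naming, assuming the parameter matrix has positive entries):

```python
import numpy as np

def lr_classify_rows(A, Lam, z, K, L):
    """LR(A, Lam, z): assign each row to the class maximizing the Poisson
    pseudo log-likelihood of its block-compressed counts (cf. (33))."""
    n, m = A.shape
    Zmat = np.zeros((m, L))
    Zmat[np.arange(m), z] = 1.0
    b = A @ Zmat                              # block compression b_i(z)
    ll = b @ np.log(Lam).T - Lam.sum(axis=1)  # n x K log-likelihood matrix
    return ll.argmax(axis=1)
```

Note that Y_{ikr}(b_i, Λ̃) is precisely the difference between the rth and kth entries of row i of the log-likelihood matrix above, so (up to ties) Z_{ik} = 1 exactly when the argmax differs from k.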
In this setting, Jkr = O(1) and we have ηkr = C(δ + m 1I)I 1 2 log Λmin for some constant C. Keeping these typical orders in mind will be helpful in understanding the statements of the subsequent results. It is also worth noting that we always have Jkr 1 2. which follows from the general bound Ikr 2L Λ . Another important quantity is Cγ in Lemma 1, which in the typical setting behaves as Cγ γ when γ 0. Combining Lemma 2 with the Markov inequality, we can get uniform control on the misclassification rate of the LR classifier in its parameter argument (i.e., ˆΛ): Lemma 3. Fix k [K] and z [L]m. Let ˆΛ RK L + be any random matrix and set by( z) := LR(A, ˆΛ, z). Assume that (52) holds. Then, for any u R, we have Misk by( z), y X r =k exp Ikr + η + u , with probability at least 1 e u P ˆΛ / BΛ(δ) . The result is also true if we replace η by ηrk when (53) holds. Remark 4. Edge splitting (ES) was proposed in Abbe and Sandon (2015) to generate nearly independent copies from a single network. One might ask whether combining the edge splitting idea with Lemma 3 is enough to give us a result similar to Theorem 1. In ES, edges are randomly assigned to two graphs G1 and G2, with probabilities q and 1 q. The new graphs G1 and G2 will follow a SBM with a reduced connectivity matrix (by a factor of q and 1 q respectively). Hence, the corresponding parameters Λ and I are reduced by the same factor; for example I will be scaled to q I for G1. Let us consider the typical setting where β, K, L, ω, εkr = O(1) and Ikr I for all k, r and some I; assume the connectivity matrix is symmetric, i.e., Λ = Γ and I = Icol. Let z and y be the labels obtained by performing biclustering on G1. Lemma 3 in the best case scenario, with the most favorable version of ηkr i.e., ignoring the first three positive terms in (51) gives a misclassification rate max{Mis y, y , Mis z, z } γ2 := X r =k exp q Ikr 1 2 log(qΛmin) + v for some v , w.h.p.. In the second stage, given the labels z and y, we obtain an estimate of the (row) mean parameters based on G2, using the natural estimator ˆΛ2 = L (G2, y, z). We then obtain the second stage labels y( z) := LR(G2, ˆΛ, z). Let Λ2 = (1 q)Λ be the row mean parameter of G2. By Lemma 1, ˆΛ2 BΛ2(δ) w.h.p for some δ γ2. By Lemma 3, and the perturbation of information (Lemma 7) we have Mis by( z), y γ3 := X r =k exp (1 q)Ikr + C(1 q)δ Λ 1 2 log Λmin + u for some u w.h.p.. To obtain result (14) in Corollary 1, we at least hope to have q Ikr + C(1 q)γ2 Λ = o(log Λmin). Zhou and Amini So we need q Ikr = o(log Λmin) and (1 q)γ2 Λ = o(log Λmin). Assume that we have q Ikr = o(log Λmin). Then, r =k exp q Ikr 1 2 log(qΛmin) + v = O(Λ 1/2 o(1) min / q). However, this is not sufficient to show (1 q)γ2 Λ = o(log Λmin). Therefore, applying edge splitting and Lemma 3 does not lead to the main result of this paper. B.2. Analysis on subblocks We now extend the analysis of Appendix B.1 to be applicable to the sub-blocks obtained by random partitioning. Some care needs to be taken since the true (row and column) mean parameters of the sub-blocks are changed by partitioning, due to the change in the distributions of the labels within each sub-block among the K L classes. The deviations of the sub-block class proportions from the global version will be controlled by a slack parameter ξ which will be set at the final stage of the proof (see Appendix C.2). Throughout this section, assumptions (A1) and (A2) will be implicit in all the stated lemmas. 
We will also state the results for a general 2Q × Q partitioning scheme, although Q = 4 is enough for the analysis of Algorithm 3. Recall that the class priors πℓ(z) for the full labels are defined in (30). We will use the same notation for sub-labels z^{(q)}; that is, πℓ(z^{(q)}) is the proportion of labels in z^{(q)} that lie in class ℓ. Note that we have

πℓ(z) = nℓ(z)/m, πℓ(z^{(q)}) = nℓ(z^{(q)})/(m/Q), hence πℓ(z^{(q)})/πℓ(z) = Q nℓ(z^{(q)})/nℓ(z), (54)

since z^{(q)} has length m/Q. Similarly, we have πk(y^{(q)}) = nk(y^{(q)})/(n/(2Q)). We will work under the assumption that the partitioning scheme satisfies

max_{k,q} |πk(y^{(q)}) − πk(y)| ≤ ξ and max_{ℓ,q} |πℓ(z^{(q)}) − πℓ(z)| ≤ ξ, (B4a)
ξ ≤ min{ 1/(2βK), 1/(2βL) }. (B4b)

When these conditions hold, we call the scheme a good partition. We note that these conditions combined with (A2) give

πℓ(z^{(q)}) ≥ πℓ(z)(1 − ξLβ) ≥ (1/2) πℓ(z) ≥ 1/(2βL) and πℓ(z^{(q)}) ≤ (3/2) πℓ(z), (55)

and similarly for y^{(q)}. It follows that both z^{(q)} and y^{(q)} satisfy (A2) with β replaced by 2β. Each count nk(y^{(q)}) follows a hypergeometric distribution with parameters (n, nk(y), n/(2Q)); that is, it is the number of nodes labeled k in a sample of size n/(2Q), drawn without replacement from a population of size n containing a total of nk(y) nodes labeled k. The concentration of the hypergeometric distribution gives the following:

Lemma 4. (B4a) holds for random partitioning with probability at least 1 − p2, where

p2 = 2Q(K + L) exp(−min(n, m) ξ²/Q). (56)

The proof of this lemma, and of the others in this section, appears in Appendix G.1.

Lemma 5. Under (B4a) and (B4b), the true local mean parameters Λ^{(q)} = (λ^{(q)}_{kℓ}) satisfy

|λ^{(q)}_{kℓ} − λ_{kℓ}/Q| ≤ ξLβ · λ_{kℓ}/Q, for all q, k, ℓ. (57)

In particular, Λ^{(q)}_min ≥ Λ_min/(2Q), ‖Λ^{(q)}‖∞ ≤ (3/(2Q))‖Λ‖∞ and Λ^{(q)} ∈ B_{Λ/Q}(ξLβ) for all q ∈ [Q]. Our main lemma for the sub-blocks establishes the consistency of the local mean parameter estimates Λ̂^{(q′,q)} for a good partitioning scheme. This lemma is an extension of Lemma 1. We recall the operator L from (21):

Lemma 6 (Local parameter consistency). Let Cγ = β²γ/(1 − γ) and h_c(τ) := (3τ/(4c)) log(1 + 2cτ/3) as in Lemma 1, and assume that 72Cγω ≤ 1. Fix the underlying partition, fix q, q′ ∈ [Q], and labels z̃ and ỹ. Let Λ̂^{(q′,q)} = L(A^{(q′,q)}, ỹ^{(q′)}, z̃^{(q)}). Assume that the partition satisfies (B4a) and (B4b), and that the pairs (z̃^{(q)}, z^{(q)}) and (ỹ^{(q′)}, y^{(q′)}) satisfy the misclassification bound (B3). Then,

‖Λ̂^{(q′,q)} − Λ^{(q)}‖∞ ≤ (24Cγ + 6τ)‖Λ/Q‖∞, and ‖Λ̂^{(q′,q)} − Λ/Q‖∞ ≤ (24Cγ + 6τ + ξLβ)‖Λ/Q‖∞,

with probability at least 1 − 2p3, where

p3 = p3(τ; n, K, Λ_min, Q) := KL exp(−nΛ_min h1(τ)/(32βKQ²)). (58)

We also have:
(a) ‖Λ^{(q′,q)}(y, z̃) − Λ^{(q)}‖∞ ≤ 4Cγ‖Λ^{(q)}‖∞.
(b) ‖Λ^{(q′,q)}(ỹ, z̃) − Λ^{(q′,q)}(y, z̃)‖∞ ≤ 2γ‖Λ^{(q)}‖∞.
(c) ‖Λ̂^{(q′,q)} − Λ^{(q′,q)}(ỹ, z̃)‖∞ ≤ 4τ‖Λ^{(q)}‖∞, with probability at least 1 − 2p3.

Remark 5. Results similar to those obtained above hold for the column parameters. Recall that the dual to the row mean parameters Λ are the column mean parameters Γ. The result of Lemma 5 can be translated to the column version by making the substitutions Λ → Γ, Q → 2Q and L → K. For Lemma 6, in addition, we need to make n → 4m. (The reason is that in (117), in the proof, we need to replace n/(2Q) with m/Q, and Λ_min/(2Q) with Γ_min/(4Q), and the combination of the aforementioned substitutions achieves this. We also note, for future reference, that the corresponding inflation of ω by a factor of 3 remains true for the column parameters.) After these substitutions, we obtain the same constant in (58); that is, p3 has to be replaced with

p′3 := p3(τ; 4m, L, Γ_min, 2Q) = p3(τ; m, L, Γ_min, Q). (59)

B.3.
Perturbation of information Recall the definition of Chernoffinformation from (8), and let us write Ikr = Ikr(Λ) to explicitly show its dependence on the mean parameter matrix Λ. The following lemma, proved in Appendix G.1, bounds the perturbations of Ikr(Λ) in Λ: Lemma 7. Under (A1), for any Λ BΛ(δ), we have |Ikr( Λ) Ikr(Λ)| 2ωδL Λ . B.4. Analysis of the matching step In this section, we fill in the details of the argument sketched in Section 5.1. Specifically, we need to give sufficient conditions so that the first and the third claims of Section 5.1 hold. We will use the following two lemmas. Recall the notation σ ( y y) introduced in Section 2.3 to denote the optimal permutation from the set of labels y to another set y. Lemma 8. Let y, y [K]n, and assume that d Mis( y, y) < 1 2 mink πk(y). Then, (a) σ ( y y) = id, the identity permutation, and this optimal permutation is unique, and (b) πk( y) > 1 2πk(y) for all k. Note that Lemma 8 implies that if d Mis(σ( y), y) < 1 2 mink πk(y) for some permutation σ, then σ ( y y) = σ. Lemma 9. Consider three sets of labels y, y, y [K]n, and assume that max{ Mis( y, y), Mis( y , y) } < 1 4 min k πk( y). Let σ = σ ( y y) and σ = σ ( y y). Then, σ 1 σ = σ ( y y). The first claim of Section 5.1 follows from Lemma 8, under the further assumption: Mis( y(q 1,q), y(q 1,q)) < 1 32βK , q [Q]. (60) Using the permutation notations (36) of Section 5.1, we have: Corollary 2. Under assumptions (A2), (B4a), (B4b) and (60), σq 1,q = σq 1 for all q [Q]. The third and final claim of Section 5.1 follows from Lemmas 8 and 9, by applying them to the sub-block labels y(2), y(2), y (2): Corollary 3. Under assumptions (A2), (B4a), (B4b) and (60), σ 1 q σ q = σ ( y (q) y(q)) for all q [Q]. The proofs of the results of this section are deferred to Appendix G.1. Appendix C. Proof of Theorem 3 We start with the high-level analysis of Algorithm 3 in Appendix C.1. This analysis is parametrized by many parameters such as ξ , τ1, τ col 1 , τ2, etc. This allows us to give the high-level idea of the mechanics of the proof without making the arguments obscured by the expressions ultimately chosen for these parameters. In Appendix C.2, we make specific choices about these parameters and finish the proof of Theorem 3. Optimal Bipartite Network Clustering C.1. Parametrized analysis of Algorithm 3 We now have all the pieces for analyzing Algorithm 3. Let ystep 5 and zstep 5 be the labels from step 5 of of Algorithm 3. As before, in all the lemmas stated, (A1) and (A2) will be implicitly assumed. Consider the following event: Aγ := n y(q) step 5 and z(q) step 5 satisfy (B3) with parameter γ, for all q [Q] o . We implicitly assume that clusters in zstep 5 and ystep 5 are relabeled according to optimal permutation relative to the truth. In other words, zstep 5 and ystep 5 in the above event are not the raw output of the algorithm, but the relabeled versions (which we do not have access to in practice, but are well-defined and can be used in the proof.) When γ is sufficiently small, this implies that community k in zstep 5 is the same as community k in z, for all k [Q]. Let Π be the random partition used in Algorithm 3, and let P be the event that Π satisfies condition (B4a). By Lemma 4, we have P(P) 1 p2 where p2 is given in (56). For the most part, we will work on events of the form Aγ1 P. Let us also establish some terminology. By the probability on an event P , we mean the probability under the restricted measure PP := P( P). 
For example, if D = {property X holds}, we will say that property X fails on P with probability at most q if P(Dc P) q. In this case, if P holds with high probability, say 1 p2, and q is small, then D holds with high probability as well: P(D) 1 q p2. Let ˆΛ(q) step 6 = L (A(q 2,q), y(q 2), z(q)), q Z/QZ, be the first local parameter estimates obtained in step 6 of Algorithm 3 (it is easier to work with the shifted index), and let δ1 := 24 Cγ1 + 6τ1 + ξLβ. (61) A better name for δ1, and τ1 would be δrow 1 , and similarly τ row 1 contrasting with δcol 1 and τ col 1 defined later in (64). However, for simplicity, we drop the row qualifier here. Recall that ξ is a parameter controlling the tail probability related to the random partition, while τ1 will be controlling the tail probability p3(τ1) related to the local parameter estimates in Lemma 6. These parameters will be optimized at the end of the argument (see Appendix C.2). Lemma 10 (First local parameters). Assume (B4b) and 72 Cγ1ω 1, and let δ1 be as defined in (61). Then, on event Aγ1 P, ˆΛ(q) step 6 BΛ(q)(δ1), q ZQ, fails with probability at most 2Q p3, where p3 = p3(τ1) as given in (58). Proof. Conditioning on blocks G1 (cf. Section 5) of the (bottom) adjacency matrix Abottom denoted as A(G1) bottom the distribution of blocks A(q 2,q), q ZQ used in defining ˆΛ(q) step 6 is not changed. Under this conditioning, both initial labels ystep 5 and zstep 5 are deterministic, hence the results of Appendix B.2 apply. We will apply Lemma 6 to ˆΛ(q) step 6. Let us verify the conditions of the lemma. On Aγ1, for all q [Q], the sublabel pairs ( z(q) step 5, z(q)) and ( y(q) step 5, y(q)) satisfy (B3). On P, condition (B4a) holds for the random partition and (B4b) holds by assumption. Recall that the random partition is independent of all else, hence conditioning on it does not change the distribution of blocks A(q 2,q), q ZQ either. We may then apply Zhou and Amini Lemma 6 to conclude that for every q ZQ, conditioned on the partition Π and A(G1) bottom, the event {ˆΛ(q) step 6 / BΛ(q)(δ1)} Aγ1 P holds with probability 2p3. Let us write D = ˆΛ(q) step 6 BΛ(q)(δ1), q ZQ which is the desired event in this lemma. Using the union bound, and removing the conditioning, we have P(Dc Aγ1 P) 2Qp3, unconditionally. The proof is complete. Next we consider the first LR classifier application. Let ystep 7 be the row label estimates in step 7. That is, we have y(q 2) step 7 = LR A(q 2,q), ˆΛ(q) step 6, z(q) step 5 for which we have the following bound on misclassification rate: Lemma 11 (First LR classifier). Under the assumptions of Lemma 10, further assume that 9ωδ1 < 1. Let ηstep 7 := 2η (δ1; Λ/Q) where η ( ) is defined in (51). Then, on event Aγ1 P, Misk y(q) step 7, y(q) X r =k exp Ikr Q + ηstep 7 + u =: γrow 2k , q ZQ, (62) fails with probability at most Q(e u + 2Q p3) where p3 = p3(τ1) as given in (58). Proof. Fix q ZQ and consider y(q 2). As in the proof Lemma 10, we condition on blocks in G1 so that z(q) step 5 can be assumed deterministic. We will apply Lemma 2(a) to the subblock A(q 2,q). As discussed earlier, the corresponding ω is inflated to 3ω, hence we need 3(3ω)δ1 < 1 which we have assumed. We also note that Λ(q 2,q)(y, z) and Λ(q) play the role of Λ(y, z) and Λ in Lemma 2(b), and we have the needed condition Λ(q 2,q)(y, z) BΛ(q)(δ1) from Lemma 6. Let b(q 2,q) i be the row block compression of A(q 2,q) based on z(q) step 5. 
Then, Lemma 2(a) gives P n Λ BΛ(q)(δ1), Yikr b(q 2,q) i , Λ 0 o Aγ1 P A(G1) bottom, Π exp I(q) kr + η(q) for all rows i (in row block q 2) with yi = k. Here Λ(q) min is the minimum element of Λ(q), and I(q) kr := Ikr(Λ(q)) Ikr Q 2ωδ1L Λ/Q η(q) := η (δ1; Λ(q)) (1 + δ1) η (δ1; Λ/Q) where η (δ1; Λ(q)) = 8ωδ1L Λ(q) as defined in (51). The first inequality uses Lemma 7 and the second is obtained using the definition of η ( ) combined with Λ(q) BΛ/Q(δ1) (Lemma 5) which implies Λ(q) (1 + δ1) Λ/Q . By taking expectation in (63), the same bound holds unconditionally. By Lemma 10, on event Aγ1 P, we have ˆΛ(q) step 6 / BΛ(q)(δ1) with probability at most 2Q p3. Then, applying Lemma 3, we conclude that Misk y(q 2) step 7, y(q 2) X r =k exp I(q) kr + η(q) + u Optimal Bipartite Network Clustering fails on Aγ1 P with probability e u + 2Q p3, for each q [Q]. Note that I(q) kr + η(q) Ikr Q + (1 + δ1 + 9 1) η (δ1; Λ/Q). Since 9ωδ1 < 1 implies δ1 < 9 1 (recall ω 1), we have 1 + δ1 + 9 1 < 2. Combining with the previous bound and applying the union bound over q gives the result. Note that we have called the rate in (62) γrow 2 for the (column) misclassification rate based on the row information. This rate is faster than initial rate γ1. Repeating the procedure in steps 6 and 7 for the column labels as prescribed in step 8 in Algorithm 3 we obtain a similar rate for the misclassification rate of z(q) step 8 relative to z(q) which we call γcol 2 . In deriving γcol 2 , we have to make the substitutions in Remark 5, and particular, Λ Γ where Γ is the column mean parameters defined in Section 2.1. (A minor exception is when counting the number of blocks which will still be Q rather than 2Q.) Recall the definition of the column information matrix (Icol ℓr ) from (9). Letting δcol 1 := 24 Cγ1 + 6τ col 1 + ξKβ, (64) we obtain the following counterpart of Lemma 11: Corollary 4 (First LR classifier, column version). Under the assumptions of Lemma 10, further assume that 9ωδcol 1 < 1. Let ηstep 8 := 2η (δcol 1 ; Γ/(2Q)) where η ( ) is defined in (51). Then, on event Aγ1 P, Misℓ z(q) step 8, z(q) X r =ℓ exp Icol ℓr 2Q + ηstep 8 + ucol =: γcol 2ℓ, q ZQ, (65) fails with probability at most Q(e ucol + 2Q p 3), where p 3 = p 3(τ col 1 ) as given in (59). Let γcol 2 := maxk [K] γcol 2k , γrow 2 := maxℓ [L] γrow 2ℓ and γ2 := max{βKγrow 2 , βLγcol 2 }. (66) By (7), we have that (62) and (65) imply Mis y(q) step 7, y(q) γcol 2 γ2 βK , Mis z(q) step 8, z(q) γrow 2 γ2 Thus, if we consider the following event: Bγ := n y(q) step 7 and z(q) step 8 satisfy (B3) with parameter γ, for all q [Q] o , after Step 8, we can work on Bγ2 P which holds with high probability: Combining Lemma 11 and Corollary 4, by union bound, P(Bc γ2 Aγ1 P) Q(2e u + 2Q(p3 + p 3)), hence P(Bγ2 P) P Bγ2 Aγ1 P = P Aγ1 P P Bc γ2 Aγ1 P 1 P(Ac γ1) P(Pc) Q(2e u + Q(p3 + p 3)). Let ˆΛ(q) step 9 = L (A(q 3,q), y(q 3), z(q)), q ZQ, be the second local parameter estimates obtained in step 9 of Algorithm 3. Let δ2 := 24Cγ2 + 6τ2. (69) Zhou and Amini Lemma 12 (Second local parameters). Assume (B4b) and 72 Cγ2ω 1, and let δ2 be as defined in (69). Then, on event Bγ2 P, ˆΛ(q) step 9 BΛ(q)(δ2), q ZQ fails with probability at most 2Q p3, where p3 is given in (58). Proof. Conditioning on blocks G1 G2 (cf. Section 5) of the adjacency matrix A, the distribution of blocks A(q 3,q) used in defining ˆΛ(q) step 9 is not changed. Under this conditioning, both initial labels ystep 7 and zstep 8 are deterministic, hence the results of Appendix B.2 apply. 
On Bγ2, for all q ZQ, the sublabel pairs ( z(q) step 8, z(q)) and ( y(q) step 7, y(q)) satisfy (B3). The rest of the proof follows that of Lemma 10. The key is that δ2 is much smaller than δ1, due to γ2 γ1 (typically), i.e., the second parameter estimates are much more accurate. Let ˆΛstep 10 = P q ˆΛ(q) step 9 be the estimate of the global mean parameters obtained in step 10 of Algorithm 3. According to Lemma 12, on Bγ2 P, ˆΛ(q) step 9 Λ/Q δ2 Λ/Q , q hence, ˆΛstep 10 Λ δ2 Λ (70) fails with probability 2Q p3, where we have used triangle inequality. That is, on Bγ2 P, we have ˆΛstep 10 BΛ(δ2) with high probability. Remark 6. Note that we could have used QˆΛ(q) step 9 (for any q ZQ) as our estimate ˆΛstep 10, leading to the same bound as in (70). The results would be the same, though in practice, we expect the version given in the Algorithm 3 to perform better. We also note that on Bγ2, the sublabels ( z(q) step 8, q ZQ) automatically define a consistent global label vector zstep 8, and similarly for row labels ystep 7. Lemma 13 (Second LR classifier). Under the assumptions of Lemma 12, further assume that δ2 defined in (69) satisfies 3ωδ2 < 1 and 6Cγ2ω 1. Let ηstep 11 kr := ηkr δ2; ω, β, m, Λ . Then, on event Bγ2 P, Misk bytop, ytop X r =k exp Ikr + ηstep 11 rk + v =: γ3, (71) fails with probability at most e v + 2Q p3 where p3 = p3(τ2) as given in (58). The same result holds for ηstep 11 kr = η (δ2; Λ). Proof. As in the proof Lemma 12, we condition on blocks in G1 G2 so that zstep 8 can be assumed deterministic. We will apply Lemma 2(b) to Atop. Let Top [n] denote the row indices of Atop. Since all the columns are present in Atop, we can directly apply Lemma 2(b) (in contrast to the argument in Lemma 11), that is, the relevant row mean parameters are Λ(y, z) and Λ the same as those for the whole matrix A. The needed condition Λ(y, z) BΛ(δ2) is supplied by Lemma 1. Let bstep 11 i = bi ( zstep 8) be the block compression in step 11 of the algorithm. Then, Lemma 2(b) gives (after conditioning on A(G1 G2) bottom and then removing the conditioning as in (63)) P n Λ BΛ(δ1), Yikr bstep 11 i , Λ 0 o Bγ2 exp Ikr + ηstep 11 kr . (72) Optimal Bipartite Network Clustering for any i Top with yi = k. By (70), on Bγ2 P, we have ˆΛstep 10 / BΛ(δ2) with probability at most 2Q p3. Then, applying Lemma 3, we conclude (71) as desired. The last statement of the theorem follows if we apply Lemma 2(a) in place of Lemma 2(b) throughout. The same exact bound holds for bybottom in step 12, with the same probability. Hence, by union bound, the same bound on misclassification rate holds for the final row labels by in step 13, with probability inflated by a factor of 2; that is, Misk(by, y) γ3 fails on Bγ2 P, with probability at most 2(e v + 2Q p3). To summarize, under the conditions of the lemmas, we have P Misk(by, y) > γ3 P Misk(by, y) > γ3 Bγ2 P + P (Bγ2 P)c 2 e v + 2Q p3(τ2) + P (Bγ2 P)c 2 e v + 2Q p3(τ2) + P(Ac γ1) + P(Pc) + Q 2e u ucol + Q(p3(τ1) + p 3(τ col 1 ) where γ3 is the rate given in (71) and the second inequality uses (68). C.2. Choosing the parameters It remains to choose the parameters, τ1, τ2, ξ, etc. to simultaneously achieve the desired rate for γ3 and ensure that the probability in (73) is o(1). Proof of Theorem 3. First row LR classifier. Let us write τ row 1 = τ1 for clarity. Under our assumptions, we will have γ2 γ1 1/2 so that Cγi 2β2γi for i = 1, 2, recalling the definition of Cγ = β2γ/(1 γ). In Lemma 11, we defined (recall (61)) ηstep 7 = 8δ1ωL Λ/Q 384β2γ1 + 48τ row 1 + 8βLξ ωL Λ/Q . 
(74) By (40), 384β2γ1ωL Λ/Q Imin/(8Q). Take τ row 1 = Imin 384ωL Λ , ξ = Imin Icol min 64βω(K L)2( Λ Γ ), u = Imin where u is the parameter in (62). Then from (74) we have ηstep 7 Imin 8Q , ηstep 7 + u Imin Hence Lemma 11 implies that on event P, Misk y(q) step 7, y(q) γrow 2k := X r =k exp Ikr , q ZQ (76) fails with probability at most Q(e u + 2Q p3(τ row 1 )). By (A4 ), Q log Q = o(Imin), hence Qe u = o(1). By (A3), Q2p3(τ row 1 ) = Q2KL exp nΛmin h1(τ row 1 ) 32Q2βK Q2KL exp n I2 min 256(3842)Q2βKL2ω3 Λ = o(1) (77) Zhou and Amini where we have used the definition (58) of p3, h1(τ) τ 2/8 for τ 1 and Λ /Λmin ω. Moreover, (A4 ) implies (Imin Icol min)/(K L) , hence eventually (Imin Icol min)/(K L) 1 which gives P(Pc) = p2(ξ) = 2Q(K + L) exp (n m)ξ2/Q 2Q(K + L) exp n m 642Qβ2ω2(K L)2( Λ Γ )2 = o(1) (78) where the last implication follows from (A3). First column LR classifier. We can apply a similar argument to zstep 8. Let τ col 1 = Icol min 768ωK Γ , ucol = Icol min 16Q, with ξ defined as in (75). By (40), and a similar argument, we obtain ηstep 8 +ucol Icol min/(4Q). By Corollary 4, on event P, Misℓ z(q) step 8, z(q) γcol 2ℓ X r =ℓ exp Icol ℓr 2Q + Icol min 4Q L exp Icol min 4Q , q ZQ, (79) fails with probability at most Q(e ucol + 2Q p 3), where Q2p 3 = Q2p 3(τ col 1 ) = o(1) by (A3) and Qe ucol = o(1) by (A4 ), similar to how we argued for the row labels. Second row LR classifier. Recalling γ2 from (66) and combining with (76) and (79), γ2 max βK2 exp Imin , βL2 exp Icol min 4Q = o β(K L)2 (Imin Icol min)b for any b > 0, as Imin Icol min . By Lemma 13 and (51), ηstep 11 kr := ηkr δ2; ω, β, m, Λ = 21δ2 ωL Λ + 5βL2 Λ 2 m + log 11ω 1 εkr 2ω(1 + εkr) δ2 + 1 1 =: T1 + T2 + T3 + T4, (81) where we have called the four summands above T1, . . . , T4 in the order they appear. We have δ2 = 24Cγ2 + 6τ2 48β2γ2 + 6τ2 by (69) and the assumption γ2 1 T1 21(48)β2ωL Λ γ2 + 21(6)ωL Λ τ2 =: T11 + T12. For any b > 0, by (80) T11 = O β2ωL Λ γ2 = o β3ω(K L)3 Λ [(Imin Icol min)/Q]b Recall that we have βω(K L) Λ = o([(Imin Icol min)/Q]a) for some a > 0 by (A4 ). Taking b = 3a in (82), we obtain T11 = o(1). Letting τ2 = (ωL Λ ) 1, we have T12 = O(1), hence, T1 = O(1). Recalling the probability bound in Lemma 13, we have by (A3) Qp3(τ2) = QKL exp nΛmin h1(τ2) QKL exp n 256Q2βKL2ω3 Λ = o(1) (83) Optimal Bipartite Network Clustering where we have used h1(τ) τ 2/8 for τ 1 and Λ /Λmin ω. Using (A3) again, T2 = 5βL2 Λ 2 /m = O(1). Now let us consider the third piece T3 in (81). Recall that Jkr = L Λ /Ikr. By Lemma 20 in Appendix F.3.1, εkr 2 J 1 kr 1 . In bounding T1, we have shown δ2ωL Λ = O(1), hence 2ωδ2 = O((L Λ ) 1). Since Ikr and Jkr 1/2 (see Remark 3), (L Λ ) 1 = o J 1 kr 2 . Therefore, 2ωδ2 = o(εkr 1). As a result, 2ω(1 + εkr)δ2 = o(εkr), hence e T3 := 11ω 1 + 1 εkr 2ω(1 + εkr)δ2 = O ω 1 + 1 Finally, we let v = log Λmin. Since Λmin , e v = o(1). Applying Lemma 13, combined with T1 + T2 = O(1), and (84), then for ζ = 1/ log Λmin = o(1), Misk bytop, ytop = O ω X 2 log Λmin + p 2 ζ log Λmin fails w.p. 2(e v + 2Q p3(τ2)) + P(Pc) + Q(2e u ucol + Q(p3(τ row 1 ) + p 3(τ col 1 ))) = o(1). When we swap Atop and Abottom and repeat the algorithm, the same misclassification rate holds. The proof of Theorem 3 is complete. Appendix D. Proof of Other Main Results D.1. Proof of Theorem 1 We proceed by stating a few lemmas. The proofs are deferred to Appendix G.2. Lemma 14. P ℓ [L](λrℓ λkℓ)2 2Λmin Ikr. As a consequence, Λ2 2Λmin Imin. 
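Lemma 14 can be checked numerically. The following Python sketch (our naming) computes the Chernoff information I_kr between two Poisson row vectors by maximizing I_s, cf. (8) and (115), over s ∈ (0, 1), and verifies the claimed inequality on an example:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_info(lam_k, lam_r):
    """I_kr = sup_{s in (0,1)} sum_l [(1-s) lam_k,l + s lam_r,l
    - lam_k,l^{1-s} lam_r,l^s], the Chernoff information between
    the product Poisson models Poi(lam_k) and Poi(lam_r)."""
    lam_k = np.asarray(lam_k, float)
    lam_r = np.asarray(lam_r, float)
    def neg_I(s):
        return -np.sum((1 - s) * lam_k + s * lam_r
                       - lam_k ** (1 - s) * lam_r ** s)
    res = minimize_scalar(neg_I, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return -res.fun

# Check sum_l (lam_r,l - lam_k,l)^2 >= 2 * Lam_min * I_kr on an example.
lam_k = np.array([1.0, 2.0, 3.0])
lam_r = np.array([2.5, 1.0, 4.0])
Lam_min = min(lam_k.min(), lam_r.min())
assert np.sum((lam_r - lam_k) ** 2) >= 2 * Lam_min * chernoff_info(lam_k, lam_r)
```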
Combining Lemma 14 with Theorem 4, and noting that Λ /Λ2 ω/(2Imin) as a consequence of the lemma, we obtain the following guarantee for spectral clustering in terms of the information matrix (Ikr): Corollary 5. Consider the spectral algorithm given in Algorithm 4, assume that for a sufficiently small C1 > 0, β2ωKL(K L) α 2Imin C1(1 + κ) 2. (85) Then the algorithm outputs estimated row labels y satisfying w.h.p. Mis( y, y) (1 + κ)2ωβL(K L)α We next modify Corollary 5 to be applicable on sub-blocks: Lemma 15 (Spectral clustering on subblocks). Suppose (A3) holds, and we assume for a sufficiently small C1 > 0, 6Qβ2ω2KL(K L) α Imin C1(1 + κ) 2. (86) Zhou and Amini Using Algorithm 4 in Step 3 of Algorithm 3, w.h.p., the misclassification rate of y(q) satisfies Mis( y(q), y(q)) 3Q(1 + κ)2ω2βL(K L)α C1Imin q [Q]. A similar result holds for misclassification rate of the spectral clustering for column labels, with appropriate modifications. of Theorem 1. Assumption (A4) implies (86), eventually as Imin . Letting γrow 1 and γcol 1 be bounds on the misclassification rates of the spectral clustering algorithms in steps 3 and 4, we can take, by Lemma 15 (and its column counterpart), w.h.p. γrow 1 = O QωβL(K L)α , γcol 1 = O QωβK(K L)α 1 That is, by the end of step 4, w.h.p., Mis( y(q), y(q)) γrow 1 and Mis( z(q), z(q)) γcol 1 for all q [Q]. Since the matching step increases the misclassification rate by at most a factor of 2, the same bounds hold for the overall initial labels at step 6. Taking γ1 = γrow 1 γcol 1 , we observe that in order to satisfy condition (40) of Theorem 3, it is enough to have Qωβ(K L)2(α α 1) Imin Icol min = o 1 β2ω Imin L Λ Icol min K Γ which holds if we require the stronger condition Qωβ(K L)2(α α 1) Imin Icol min = o 1 β2ω Imin Icol min (K L)( Λ Γ ) But this latter condition is satisfied by assumption (A4). Thus, the assumptions of Theorem 3 hold with high probability, and so is its result. The proof is complete. D.2. Proof of Corollary 1 Proof of Corollary 1. From the proof of Theorem 1, we have that Misk by, y = O X r =k ω 1 + 1 2 log Λmin + v (87) fails with probability at most 2e v + o(1). First, we show that χr := ω 1 + 1 Λ 1/2 min = o(1), uniformly in r. (88) By Lemma 20 in Appendix F.3.1, εkr 2 J 1 kr 1 . Hence, 2(Jkr 1) 3Jkr using 2Jkr 1 (see Remark 3). Thus, to show (88), it is enough to show ωJkr/ Λmin = o(1). Using ω 1 Λ Λmin, we have ωJkr Λ 1/2 min = Λ Ikr L ωΛ 1/2 min Lω3/2 Λ 1/2 Ikr Lω3/2 Λ 1/2 Imin = o(1) Optimal Bipartite Network Clustering where the last equality is by ω3L2 Λ = o(I2 min) which is implied by (A4). Thus, we have (88), i.e., χ := maxr χr = o(1), as desired. Now, let 2v = log χ. It follows that e v = χ = o(1), and we have Misk by, y = O χ X r =k exp Ikr + v = O χ X r =k exp Ikr = o X r =k exp Ikr completing the proof. D.3. Proof of Example 1 Without loss of generality assume a > b so that εkr = a/b 1. Also, Λmin b/(βK). By (87), which holds in the general case, we have that Misk by, y = O X βωK3/2b 1/2 exp ( a fails with probability at most 2e v + o(1). Assumption βω2K3 = o(b) implies βωK3/2b 1/2 = o(1). Letting 2v = log χ, the rest of the proof follows similar to that of Corollary 1. Appendix E. Proof of Theorem 2 The distribution of A depends on (y, z, P), so that the expectation in (16) is, in fact, E = E(y,z,P). Throughout the proof, we restrict the parameter space to a fixed z and P such that the corresponding row mean matrix, Λ , satisfies Imin(Λ ) = I . 
From now on, instead of writing P(y,z ,P ) we simply write Py for the distribution on A, and similarly for the expectations. Let us write the optimal sum of errors in the test of Poi(Λ k ) against Poi(Λ r ) as follows: P Err,+ := inf P(Type I error) + P(Type II error) 2 (log Λ min + C ) , where the inequality is by combining Proposition 1 and Lemma 26 (Appendix G.7), and noting that the assumption of that lemma is satisfied due to (A3). Note that log Λ min is not determined in S; however, we can always replace it by 2 log I by (A4). The reduced parameter space is the set of all (y, z , P ) such that y belongs to T := y {0, 1}n K : y satisfy (A2). Let α := 8βK β β 1 and choose S [n] and y T such that i Sc : yi = k = n0 := n Zhou and Amini We then define the further restricted parameter space T := y = (y S, y Sc) : y S {k, r}|S| . β, each label in T has at least n/βK labels in each community, hence T T . The direct misclassification rate, d Mis (cf. Section 2.3), between any two labels in T is at most using x x x + 1, and K/n 1/(2α), which holds for large n, for the lower bound. In particular, the d Mis between any two elements in T is at most 1/(8βK) (i.e., at most n/(8βK) labels are different). It follows from Lemma 8 that the optimal matching permutation between any two label vectors in T is identity, hence their misclassification rate is the same as their d Mis. We next argue that by can be restricted to T as well: First suppose that under the optimal permutation π, by has at most εn different labels on indices Sc compared to y, that is, |{i Sc : π(byi) = yi}| εn. It follows that the misclassification rate between by and any y T is at most 2ε 1/(4βK). Thus, by Lemma 8, π is the optimal permutation for matching by to any y T . Redefining byi := π 1( yi) for i Sc then gives a uniformly better strategy over T . The new by equals y on Sc up to a permutation, so we can restrict by to T . On the other hand, if |{i Sc : π(byi) = yi}| > εn, then setting by to be any fixed vector from T (or randomly choosing from T ) gives a better strategy. The minimax risk is lower-bounded by inf by sup y T Ey[ Mis(by, y) ] = inf by T sup y T Ey[ d Mis(by, y) ] = inf by T sup y T 1 n i S Py(byi = yi) ε inf by T avg y T 1 |S| i S Py byi = yi = ε inf by T avg i S avg y T Py byi = yi where we have used n = |S|/ε and max avg. Let us now focus on a single term in the sum over S, say i = 1. For simplicity, let S \ 1 = S \ {1}. Let T u = {(u, y S\1, y Sc) : y S\1 {k, r}|S| 1}. Then, T is the disjoint union of T k and T r and we have avg y T Py(by1 = y1) = 1 |T | y T Py(by1 = y1) h P(k, y S\1, y Sc) by1 = k + P(r, y S\1, y Sc) by1 = r i y S\1 P Err,+ = 1 where the second equality is by decomposing the sum as P y S\1 P y1, and the last equality by noting that the sum over y S\1 is over {k, r}|S| 1 whose cardinality is |T |/2. The same lower Optimal Bipartite Network Clustering bound holds for all other i = 1 in (89) by symmetry. Hence, we conclude inf by sup y T Ey[Mis(by, y)] ε Recalling the definition of α and using the assumptions that β > 1 is constant and K exp(c L) gives the desired result. Appendix F. Proofs of the main lemmas In this section, we give the proof of the three main lemmas of Appendix B.1. We we first give the proofs of Lemma 1 and 3 in Appendix F.1 and F.2. The proof of Lemma 2(b) is more technical and occupies the remainder of this section, including auxiliary results on the error exponents and Poisson-binomial approximations, in Appendix F.3 and F.4. 
Throughout, we will use the following concentration inequality (Gin e and Nickl, 2015, p. 118): Proposition 2 (Prokhorov). Let S = P i Xi for independent centered variables {Xi}, each bounded by c < in absolute value a.s. and let v P i EX2 i , then P S vt exp[ vhc(t)], t 0, where hc(t) := 3 4ct log 1 + 2c Same bound holds for P(S < vt). Note that hc(t) t2 as t 0 and hc(t) t log t as t . F.1. Proof of Lemma 1 Let us define the confusion matrix as R( z, z) [0, 1]L L with entries Rkℓ( z, z) = 1 j=1 1{ zj = k, zj = ℓ} = |j : zj = k, zj = ℓ| We can similarly define Rkℓ(z, z). It is easy verify that R( z, z) = R(z, z)T . By definition (23) of the (global) row mean parameters, λkℓ (y, z) = ℓ=1 Pkℓ1{zj = ℓ, zj = ℓ } = m Pk R ℓ (z, z). (92) To see (92), note that since we are using true labels y in the first argument of λkℓ (y, z), the averaging 1 nk(y) P i 1{yi = k} over i, in the definition, is vacuous. That is, for any i with yi = k, we have λkℓ (y, z) = P j E[Aij]1{ zj = ℓ }. We then further break this sum according to column labels zj = ℓto get (92). Recall that n(z) is the vector of sizes of clusters in z and π(z) = n(z)/m is the corresponding proportions. To simplify, let N(z) := diag(n(z)), Π(z) := diag(π(z)). Zhou and Amini We have m IL = N(z)Π(z) 1 where IL is the L L identity matrix, hence m Pk R ℓ (z, z) = Pk N(z) Π(z) 1R ℓ (z, z) = λk (y, z) Π(z) 1R ℓ (z, z) using (2). Let use define U(z, z) := Π(z) 1R(z, z). Since π(z) contains the row sums of R ℓ (z, z), U(z, z) is the row-normalized confusion matrix, i.e. U = (Rkℓ/Rk+). We have λkℓ (y, z) = λk (y, z) U ℓ (z, z), (93) and its matrix version Λ(y, z) = Λ(y, z) U(z, z). We can similarly define U( y, y) = Π( y) 1R( y, y). Recalling definition (23), and some algebra gives λk ℓ ( y, z) = 1 nk ( y) k [K] λkℓ (y, z)1{yi = k, yi = k } = 1 nk ( y) k [K] λkℓ (y, z) |i : yi = k, yi = k |, where to get the first equality one further breaks the sums over P k 1{yi = k} and use the expression for λkℓ (y, z) in the comments after (92). Using the definition of the confusion matrix in (91), adapted to row labels, and the definition of U, we have λk ℓ ( y, z) = 1 πk ( y)Rk ( y, y) λ ℓ (y, z) = Uk ( y, y) λ ℓ (y, z), (94) or compactly Λ( y, z) = U( y, y)Λ(y, z). We also define a column-normalized confusion matrix, V (z, z) := R(z, z)Π( z) 1. Lemma 16. (A2) and (B3) imply max k [1 Ukk( y, y)] γ, (B3.1) max ℓ [1 Uℓℓ(z, z)] γ, and (B3.2) max ℓ [1 Vℓℓ(z, z)] γ. (B3.3) Proof. Without loss of generality, assume that the optimal permutation matching y to y is identity, and similarly for z to z. By definition, 1 Ukk( y, y) is the misclassification rate withing the kth community of y, hence 1 Ukk( y, y) = |i : yi = k, yi = k|/n |i : yi = k|/n = |i : yi = k, yi = k|/n |i : yi = k, yi = k|/n + |i : yi = yi = k|/n. Recall that we can write (see Section 2.3) Mis( y, y) = 1 n|i : yi = yi| = 1 k |i : yi = k, yi = k| = 1 k |i : yi = k, yi = k|. (95) Optimal Bipartite Network Clustering Then, (B3) and the second equality in (95) implies |i : yi = k, yi = k|/n γ/(βK), while the third equality in (95) gives |i : yi = yi = k|/n πk(y) γ/(βK). Letting f(x) = x/(x + 1), 1 Ukk( y, y) = f |i : yi = k, yi = k|/n |i : yi = yi = k|/n γ/(βK) γ/(βK) + πk(y) γ/(βK) = γ πk(y)βK γ where the first inequality is by monotonicity of f, and the last by (A2). This proves (B3.1). Similarly, 1 Uℓℓ(z, z) is the misclassification rate within the ℓth community of z, i.e., Misℓ( z, z), hence 1 Uℓℓ(z, z) = |j : zj = ℓ, zj = ℓ|/m πℓ(z) = γ πℓ(z)βL γ, proving (B3.2). 
The same bound holds for 1 Uℓℓ( z, z) by an argument similar to that used for Ukk( y, y). To prove (B3.3), we observe U( z, z) = Π( z) 1R( z, z) = Π( z) 1R(z, z)T = [R(z, z)Π( z) 1]T = V (z, z)T hence 1 Vℓℓ(z, z) = 1 Uℓℓ( z, z) γ. All statements are true for any k [K] and ℓ [L]. F.1.1. Proof of Lemma 1(a) For the lower bound, by (93) and (B3.2), λkℓ (y, z) = λk (y, z)U ℓ (z, z) λkℓ (y, z)Uℓ ℓ (z, z) (1 γ)λkℓ (y, z) λkℓ (y, z) Cγ Λ where the last inequality is by γ Cγ and λkℓ (y, z) Λ . For the upper bound, we write λk (y, z)U ℓ (z, z) = λkℓ (y, z)Uℓ ℓ (z, z) + X ℓ =ℓ λkℓ(y, z)Uℓℓ (z, z). The first term obviously satisfies λkℓ (y, z) Uℓ ℓ (z, z) λkℓ (y, z), hence λkℓ (y, z) λkℓ (y, z) X ℓ =ℓ λkℓ(y, z)Uℓℓ (z, z). (96) By (B3.3), for every ℓ [L], m|j : zj = zj = ℓ | = πℓ ( z)Vℓ ℓ (z, z) (1 γ)πℓ ( z). (97) By (A2), for every ℓ and ℓ, we have πℓ (z) β2πℓ(z), hence Uℓℓ (z, z) = 1 πℓ(z)Rℓℓ (z, z) = πℓ ( z) πℓ(z) Vℓℓ (z, z) β2 1 γ Vℓℓ (z, z). Combining with (96) λkℓ (y, z) λkℓ (y, z) β2 ℓ =ℓ λkℓ(y, z)Vℓℓ (z, z) β2γ 1 γ λk (y, z) (98) where the last inequality is by (B3.3) and that V is column normalized. This proves the upper bound, and completes the proof of Λ(y, z) Λ Cγ Λ . Since we assume Cγ 1, it follows that Λ(y, z) 2 Λ . Zhou and Amini F.1.2. Proof of Lemma 1(b) Recalling (94), we have λk ℓ ( y, z) = Uk ( y, y)λ ℓ (y, z) = Uk k ( y, y)λk ℓ (y, z) + X k =k Uk k( y, y)λkℓ (y, z). By (B3.1), the first term is bounded as (1 γ)λk ℓ (y, z) Uk k ( y, y) λk ℓ (y, z) λk ℓ (y, z) and the second term as k =k Uk k( y, y)λkℓ (y, z) γ λ ℓ (y, z) recalling that U is row normalized hence P k =k Uk k = 1 Uk k γ, by (B3.1). Combining the two bounds, we have λk ℓ ( y, z) λk ℓ (y, z) γλk ℓ (y, z), 0 + 0, γ λ ℓ (y, z) λ ℓ (y, z) γ, γ showing that Λ( y, z) Λ(y, z) γ Λ(y, z) . Combining with Λ(y, z) 2 Λ from part (a) of the lemma, we have the first assertion of part (b). The second assertion follows from γ 1/2 and part (a) by triangle inequality. (Note that assumption 6Cγω 1 in fact implies γ 1/6 since β, ω 1 and γ Cγ.) F.1.3. Proof of Lemma 1(c) Recalling definitions of ˆλkℓand λkℓ( y, z) from (22) and (23), we have nk( y) ˆλkℓ λkℓ( y, z) = Aij E[Aij] 1{ yi = k, zj = ℓ} which is of the form S = P ij Xij with independent centered terms Xij = Aij E[Aij] with |Xi| 1 and P ij EX2 ij = P ij var(Aij) P ij EAij = nk( y)λkℓ( y, z). Note that the sums in these expressions run over {(i, j) : yi = k, zj = ℓ}. Applying the two-sided version of Proposition 2, with v = nk( y)λkℓ( y, z), t = τ and c = 1, we have P ˆλkℓ λkℓ( y, z) > λkℓ( y, z) τ = P nk( y) ˆλkℓ λkℓ( y, z) > nk( y)λkℓ( y, z) τ 2 exp nk( y)λkℓ( y, z)h1(τ) . Applying union bound over (k, ℓ) [K] [L], and using part (b) of this lemma, we have ˆΛ Λ( y, z) τ Λ( y, z) 4τ Λ with probability at least 1 2KL min k nk( y) min k,ℓλkℓ( y, z) h1(τ) . We have nk( y) nπk(y)(1 γ) n(βK) 1/2 using (B3.1), (A2) and γ 1/2; see (97). Similarly, since Λ( y, z) Λ 3Cγ Λ , we have min k,ℓλkℓ( y, z) Λmin 3Cγ Λ Λmin(1 3Cγω) Λmin/2. Optimal Bipartite Network Clustering F.2. Proof of Lemma 3 Let bi = bi ( z). Recall (46), (47) and (48), and let b Sk = Sk(b, ˆΛ), b Zik = Zik(bi , ˆΛ), b Yikr = Zik(bi , ˆΛ). For any event A and random variable X, let us write E[X; A] := E[X1A]. Consider the following event: A := {ˆΛ BΛ(δ)}. Pick some i [N] with yi = k. Then, E b Sk; A = E b Zik; A = P [ r =k P Yikr(bi ( z), ˆΛ) 0, ˆΛ BΛ(δ) r =k P Λ BΛ(δ), Yikr(bi ( z), Λ) 0 r =k exp Ikr + η =: pk where the last inequality follows from Lemma 2 with ηkr defined there. 
Using Markov inequality P b Sk tpk P {b Sk tpk} A + P(Ac) tpk + P(Ac) 1 for any t > 0. The version of Markov inequality used follows from (pointwise) inequality: 1{X u}1A (X1A)/u. Taking t = eu complete the proof. F.3. Error exponents We start by obtaining a bound on the error exponent (i.e., the negative logarithm of the probability of error) for binary hypothesis testing in an exponential family. This result is a generalization of the result that appears in Abbe and Sandon (2015), and is proved by the same technique. The result (and the technique inspired by Abbe and Sandon (2015)) is interesting since it provides a bound different than the classical Chernoffbound on the error exponent (Chernoff, 1952); see also Verd u (1986) and Theorem 11.9.1 of Cover and Thomas (2006). This leads for example to a sharper control for the case of Poisson hypothesis testing. We start with the result for a general exponential family and then in Appendix F.3.1 specialize to the case of interest in this paper, the Poisson family. General exponential family. Let π(t; γ) denote the density of a 1-dimensional standard exponential family w.r.t. to some measure ν on R: π(t; γ) = h(t) exp γt A(γ) . (99) We consider distributions on RL that are products of these distributions, having density: ℓ=1 π(xℓ, θℓ), x = (xℓ) RL, θ = (θℓ) RL (100) with respect to µ = ν L (L-fold product measure whose coordinate measures are all ν). Zhou and Amini Proposition 3. Let pr(x) := p(x; θr), r = 0, 1 be two exponential family densities on RL (relative to µ = ν L) as defined in (99) and (100). Assume that ν is either the Lebesgue measure on R or the counting measure on Z, and that θ0 = θ1. For s (0, 1), let θsℓ= (1 s)θ0ℓ+ sθ1ℓ, and Isℓ= (1 s)A(θ0ℓ) + s A(θ1ℓ) A(θsℓ), (101) as well as Is := PL ℓ=1 Isℓ, T := {ℓ: θ0ℓ = θ1ℓ} and C(α) := Z e α|t|dν(t) = ( 2 α ν is Lebesgue, 1+e α 1 e α 2 1 e α ν is counting. (102) Consider testing p0 against p1 using the likelihood ratio test based on a single observation. Let pr be the probability of error under pr for r = 0, 1. Then, the sum of the error probabilities is bounded as Pe,0 + Pe,1 inf ℓ T inf s (0,1) h e Is π( ; θsℓ) C min(s, 1 s)|θ0ℓ θ1ℓ| i . (103) Remark 7. The proof goes through for any translation invariant measure ν (e.g., a Haar measure) with an appropriate constant C(α). It also goes through if we replace t in (99) with a general sufficient statistic φ(t), as long as (1) φ is surjective from the support of the exponential family to R and (2) C(α) = R e α|φ(t)|dν(t) < for all α > 0 and (3) φ has a measurable inverse. Remark 8. Let s be the maximizer of s 7 Is. Then, noting that α 7 C(α) is decreasing, Proposition 3 implies Pe,0 + Pe,1 exp Is + log π( ; θs ℓ) + log C(α ) , where, (104) α = min(s , 1 s ) max ℓ [L] |θ0ℓ θ1ℓ|. (105) The bound is an improvement over the Chernoffbound if log π( ; θs ℓ) is negative and log C(α ) is controlled. This is the case for the Poisson distribution as we show in the sequel. F.3.1. Poisson case The Poisson case corresponds to (99) with γ = log λ, h(t) = (1/t!)1{t 0}, ν = the counting measure and A(log λ) = λ. Letting θsℓ= log λsℓfor all s [0, 1], we have from (101) λsℓ= λ1 s 0ℓ λs 1ℓ, Isℓ= (1 s)λ0ℓ+ sλ1ℓ λsℓ. We also note that |θ0ℓ θ1ℓ| = | log(λ0ℓ/λ1ℓ)|. Let us define s = argmax s (0,1) Is, and, I = max s (0,1) Is, where Is = We will assume λ0ℓ/λ1ℓ [1/ω, ω], ℓ [L], for some ω > 1. (106) The following lemma shows that s stays away from the boundary: Optimal Bipartite Network Clustering Lemma 17. Assuming (106), we have s [ 1 2ω, 1 1 2ω]. 
Proof of this lemma and subsequent results appear in Appendix G.4. From (105), we have α = min(s , 1 s ) maxℓ| log(λ0ℓ/λ1ℓ)| in the Poisson case. Defining ε01 := ε01(Λ) := max ℓ [L] 1, α01 := 1 2ω log(1 + ε01) (107) we note that α = min(s , 1 s ) log(1 + ε01), hence Lemma 17 implies α α01, that is, C(α ) C(α01) in (104), where C( ) has the form given in (102) for the counting measure, i.e., C(α01) = 1 + e α01 1 e α01 2 1 e α01 ε01 + 3 ω (108) where the last inequality is by the following lemma: Lemma 18. Inequality (a) in (108) holds. Next we bound the maximum of the density: Lemma 19 (Hodges and Le Cam (1960)). Let π(t; log λ) = e t(λt/t!)1{t 0} be the desnity of the Poisson family. Then, for all λ > 0, π( ; log λ) 1 + 1 12λ In particular π( ; log λ) exp 1 2 log λ for λ 0.056. Combining Lemmas 17, 18 and 19, we have the following corollary which gives the following overall bound on the error exponent: Corollary 6. Consider testing two Poisson vector models with mean vectors given by the rows of Λ = [λ0 ; λ1 ] R2 L + , satisfying (106). Let Λmin = minrℓλrℓ. Then, the sum of the error probabilities for the likelihood ratio test is bounded as Pe,0 + Pe,1 ω 4 ε01 + 3 exp I 1 2 log Λmin . (109) We also have the following general lower bound on ε01 in terms of the information I : Lemma 20. Let Λ = [λ0 ; λ1 ] R2 L + . There exists ℓ [L] such that 2 log 1 + 8I which implies ε01 min 2I L Λ , 2 . Although the bound in Lemma 20 holds without any further assumption, it is not always tight. The difference in our two sets of results, namely (13) and (14) is due to using the sharper bound (109) versus replacing ε01 with its universal lower bound. Lemma 21 (Perturbation of ε01). Suppose Λ BΛ(δ), and let ε 01 = ε01(Λ ) and ε01 = ε01(Λ) as in (107), and assume (106). Then ε 01 ε01 2ω(1 + ε01)δ. Zhou and Amini F.4. Approximation results for Lemma 2(b) Let us collect some approximation lemmas that will be used in the proof of Lemma 2(b). The proofs can be found in Appendix G.4. We write pmf for the probability mass functions. We recall that a Poisson-binomial variable with parameter (p1, . . . , pn) is one that can be written as Pn i=1 Xi where Xi Ber(pi), independent over i = 1, . . . , n. We write pmf for the probability mass function. Lemma 22 (Poisson-binomial approximation). Let ϕ(x; λ) be the pmf of a Poisson variable with mean λ, and let eϕ(x, p) be the pmf of a Poisson-binomial variable with parameters p = (p1, . . . , pn) where Pn j=1 pj = λ. Let p := maxj [n] pj. Then, eϕ(x; p) ϕ(x; λ) exp , x Z+. This result immediately extends to the comparison between vector versions of the two distributions: Corollary 7 (Poisson-binomial approximation). Let p(ℓ) = (p(ℓ) 1 , . . . , p(ℓ) nℓ) [0, 1]nℓbe a vector of probabilities for each ℓ [L] and let λ(ℓ) = Pnℓ i=1 p(ℓ) i R+. Let eΦ(x, (p(1), . . . , p(L))) := ℓ=1 eϕ(xℓ; p(ℓ)), for each x = (x1, . . . , x L) ZL + (110) be the pmf of a vector Poisson-binomial variable, and Φ(x, (λ(1), . . . , λ(L))) = QL ℓ=1 ϕ(xℓ; λ(ℓ)) be the corresponding vector Poisson pmf. Then, we have eΦ(x, (p(1), . . . , p(L))) Φ(x, (λ(1), . . . , λ(L))) exp p L X ℓ=1 xℓ , x ZL +, where p = max{p(ℓ) i : i [nℓ], ℓ [L]}. Lemma 23 (Poisson likelihood approximation). Suppose max(|λ1 λ|, |λ2 λ|) ρ 1 3λ, then for any x Z+, we have φ(x; λ1) φ(x; λ2) exp 3ρx Lemma 24 (Degree Truncation). Let bi+ = P ℓ [L] biℓ= Pm j=1 Aij be the degree of (row) node i. Then, P(bi+ > 5L Λ ) exp( 3L Λ ). of Lemma 24. 
Lemma 23 (Poisson likelihood approximation). Suppose $\max(|\lambda_1 - \lambda|,\, |\lambda_2 - \lambda|) \le \rho \le \frac13\lambda$. Then, for any $x \in \mathbb{Z}_+$, we have
$$\frac{\varphi(x;\lambda_1)}{\varphi(x;\lambda_2)} \le \exp\Big( \frac{3\rho x}{\lambda} + 2\rho \Big).$$

Lemma 24 (Degree truncation). Let $b_{i+} = \sum_{\ell\in[L]} b_{i\ell} = \sum_{j=1}^m A_{ij}$ be the degree of (row) node $i$. Then, $\mathbb{P}(b_{i+} > 5L\|\Lambda\|_\infty) \le \exp(-3L\|\Lambda\|_\infty)$.

F.5. Proof of Lemma 2(b)

Fix $i\in[n]$ such that $y_i = k$ and $\tilde z \in [L]^m$, and let $b_{i\cdot} = b_{i\cdot}(\tilde z)$. Throughout, let $\tilde\Lambda = (\tilde\lambda_{k\ell}) := \Lambda(y,\tilde z)$, which belongs to $B_\Lambda(\delta)$ by assumption. Denoting the $k$th row of $\tilde\Lambda$ as $\tilde\lambda_{k\cdot}$, we have $\mathbb{E}[b_{i\cdot}] = \tilde\lambda_{k\cdot}$. For $r \ne k \in [K]$, $i$ such that $y_i = k$, and $\bar\Lambda \in B_\Lambda(\delta)$,
$$Y_{ikr}(b_{i\cdot}, \bar\Lambda) = \sum_{\ell=1}^L \Big[ b_{i\ell}\log\frac{\bar\lambda_{r\ell}}{\bar\lambda_{k\ell}} + \bar\lambda_{k\ell} - \bar\lambda_{r\ell} \Big] \le \sum_{\ell=1}^L \Big[ b_{i\ell}\log\frac{\lambda_{r\ell}+\rho}{\lambda_{k\ell}-\rho} + \lambda_{k\ell} - \lambda_{r\ell} + 2\rho \Big] =: Y,$$
where $\rho := \delta\|\Lambda\|_\infty$ is the radius of $B_\Lambda(\delta)$. Hence,
$$\mathbb{P}\big( \exists \bar\Lambda \in B_\Lambda,\; Y_{ikr}(\bar\Lambda) \ge 0 \big) \le \mathbb{P}(Y \ge 0) = \mathbb{P}(b_{i\cdot} \in F),$$
where we have defined (recalling the definition of $\Psi$ from (33)):
$$F := \Big\{ x \in \mathbb{Z}_+^L :\; \Psi(x;\, \lambda_{r\cdot}+\rho \mid \lambda_{k\cdot}-\rho) \ge -2L\rho \Big\}.$$

Degree truncation. Let $b_{i+} = \sum_{\ell\in[L]} b_{i\ell} = \sum_{j=1}^m A_{ij}$ be the degree of (row) node $i$, and
$$E = \Big\{ x \in \mathbb{Z}_+^L :\; \sum_{\ell=1}^L x_\ell \le 5L\|\Lambda\|_\infty \Big\}. \tag{111}$$
Using Lemma 24, we have $\mathbb{P}(b_{i\cdot} \notin E) \le \exp(-3L\|\Lambda\|_\infty)$, which decays faster than the rate we want to establish. Hence, for the rest of the proof it is enough to work on $\{b_{i\cdot} \in E\}$. We have the following two approximations on this event.

Poisson-binomial approximation. Recall that $P$ is the connectivity matrix and we have
$$\|P\|_\infty \le \frac{\|\Lambda\|_\infty}{\min_\ell n_\ell(\tilde z)} \le \frac{\beta L \|\Lambda\|_\infty}{m}, \tag{112}$$
where the first inequality follows from the definition of $\Lambda$ in (2), and the second from assumption (A2). We note that $b_{i\ell} = b_{i\ell}(\tilde z) = \sum_{j=1}^m A_{ij}1\{\tilde z_j = \ell\}$, as defined in (27), follows a Poisson-binomial distribution. In order to describe the parameters of this distribution, let us introduce the following notation: $\mathrm{lab}_\ell(\tilde z) := (z_j : j\in[m] \text{ such that } \tilde z_j = \ell)$, that is, the vector of true labels associated with the nodes in the $\ell$th cluster of $\tilde z$. Then, $P_{k,\mathrm{lab}_\ell(\tilde z)} = (P_{k,z_j} : j\in[m] \text{ s.t. } \tilde z_j = \ell) \in \mathbb{R}^{n_\ell(\tilde z)}$ is the probability vector associated with the Poisson-binomial distribution of $b_{i\ell}$. Also, let $\mathrm{lab}(\tilde z) := (\mathrm{lab}_1(\tilde z),\ldots,\mathrm{lab}_L(\tilde z))$ and $P_{k,\mathrm{lab}(\tilde z)} := (P_{k,\mathrm{lab}_1(\tilde z)},\ldots,P_{k,\mathrm{lab}_L(\tilde z)})$. Then, we can say that $b_{i\cdot} = b_{i\cdot}(\tilde z)$ has a product Poisson-binomial distribution with parameter $P_{k,\mathrm{lab}(\tilde z)}$. In particular, $b_{i\cdot}$ has pmf $\tilde\Phi(x; P_{k,\mathrm{lab}(\tilde z)})$ as defined in (110). We also note that $\mathbb{E}[b_{i\cdot}(\tilde z)] = \lambda_{k\cdot}(y,\tilde z) =: \tilde\lambda_{k\cdot}$. It follows from Corollary 7, noting that $\|P_{k,\mathrm{lab}(\tilde z)}\|_\infty \le \|P\|_\infty$ combined with (112), that
$$\frac{\tilde\Phi(x; P_{k,\mathrm{lab}(\tilde z)})}{\Phi(x; \tilde\lambda_{k\cdot})} \le \exp\Big( \|P_{k,\mathrm{lab}(\tilde z)}\|_\infty \sum_{\ell=1}^L x_\ell \Big) \le \exp\Big( \frac{5\beta L^2\|\Lambda\|_\infty^2}{m} \Big) =: \zeta_1, \quad x \in E.$$

Poisson likelihood approximation. Recall that $\rho = \delta\|\Lambda\|_\infty$. Since by assumption $\omega\delta \le \frac13$, we have $\rho \le \|\Lambda\|_\infty/(3\omega) \le \frac13\Lambda_{\min}$. Recall that by assumption $\tilde\Lambda = (\tilde\lambda_{k\ell}) \in B_\Lambda(\delta)$. By Lemma 23,
$$\frac{\Phi(x; \tilde\lambda_{k\cdot})}{\Phi(x; \lambda_{k\cdot}-\rho)} \le \prod_{\ell\in[L]} \exp\Big( \frac{3\rho x_\ell}{\lambda_{k\ell}} + 2\rho \Big) \le \exp\big( 2L\rho + 15\omega L\rho \big) \le \exp(17\omega L\rho) =: \zeta_2, \quad x \in E.$$

With some abuse of notation, we treat $\Phi$ and $\tilde\Phi$ as measures as well; thus, for example, $\Phi(E) = \sum_{x\in E}\Phi(x)$. Then, we have
$$\mathbb{P}(b_{i\cdot} \in E\cap F) = \tilde\Phi\big( E\cap F;\, P_{k,\mathrm{lab}(\tilde z)} \big) \le \zeta_1\, \Phi\big( E\cap F;\, \tilde\lambda_{k\cdot} \big) \le \zeta_1\zeta_2\, \Phi\big( E\cap F;\, \lambda_{k\cdot}-\rho \big). \tag{113}$$
Thus, it is enough to bound $\Phi(F; \lambda_{k\cdot}-\rho)$, which gives a further upper bound. This quantity is closely related to testing Poisson vector distributions with means $\lambda_{k\cdot}-\rho$ and $\lambda_{r\cdot}+\rho$ against each other. Let us write $p_0(x) := \Phi(x; \lambda_{k\cdot}-\rho)$ and $p_1(x) := \Phi(x; \lambda_{r\cdot}+\rho)$, and note that $\Psi(\cdot;\, \lambda_{r\cdot}+\rho \mid \lambda_{k\cdot}-\rho) = \log(p_1/p_0)$. We have
$$\Phi(F; \lambda_{k\cdot}-\rho) = \sum_{x\in F} p_0(x)\, 1\Big\{ \log\frac{p_1(x)}{p_0(x)} \ge -2L\rho \Big\} \le \sum_x \min\big( e^{2L\rho} p_1(x),\, p_0(x) \big) \le e^{2L\rho} \sum_x \min\big( p_1(x),\, p_0(x) \big). \tag{114}$$
Let us define
$$I_s(\lambda_0 \mid \lambda_1) = \sum_{\ell=1}^L \big[ (1-s)\lambda_{0\ell} + s\lambda_{1\ell} - \lambda_{0\ell}^{1-s}\lambda_{1\ell}^{s} \big], \qquad \lambda_0, \lambda_1 \in \mathbb{R}_+^L. \tag{115}$$
We can now apply Corollary 6. Since
$$\frac{\|\Lambda\|_\infty + \rho}{\Lambda_{\min} - \rho} \le \frac{\omega\Lambda_{\min} + \frac13\Lambda_{\min}}{\frac23\Lambda_{\min}} \le 2\omega,$$
we need to substitute $\omega$ in Corollary 6 by $2\omega$.
It follows that
$$\sum_x \min\big( p_1(x),\, p_0(x) \big) \le \zeta_3 \exp\Big( -I_s(\lambda_{k\cdot}-\rho \mid \lambda_{r\cdot}+\rho) - \frac12\log(\Lambda_{\min}-\rho) \Big) \le \sqrt{3/2}\;\zeta_3 \exp\Big( -I_s(\lambda_{k\cdot} \mid \lambda_{r\cdot}) + 2\omega L\rho - \frac12\log\Lambda_{\min} \Big), \tag{116}$$
where $\zeta_3 := \big( 4/(\varepsilon_{kr} - 2\omega(1+\varepsilon_{kr})\delta) + 3 \big)\, 2\omega$ is the prefactor of Corollary 6 applied with $2\omega$ in place of $\omega$, with its argument controlled via Lemma 21. The second inequality follows from $\Lambda_{\min}-\rho \ge \frac23\Lambda_{\min}$ and the elementary inequality
$$(a-\rho)^{1-s}(b+\rho)^s \le a^{1-s}b^s + s\rho\,(a/b)^{1-s} \le a^{1-s}b^s + \rho\,\omega,$$
assuming $a/b \le \omega$. Note that $I_{kr} = \sup_{s\in(0,1)} I_s(\lambda_{k\cdot} \mid \lambda_{r\cdot})$. Putting the pieces (113), (114) and (116) together (and taking the supremum over $s$), we have
$$\mathbb{P}(b_{i\cdot} \in E\cap F) \le \sqrt{3/2}\;\zeta_1\zeta_2\zeta_3\, e^{2L\rho}\, \exp\Big( -I_{kr} + 2\omega L\rho - \frac12\log\Lambda_{\min} \Big).$$
We note that
$$\log\big( \zeta_1\zeta_2\, e^{2L\rho + 2\omega L\rho} \big) \le 17\omega L\rho + \frac{5\beta L^2\|\Lambda\|_\infty^2}{m} + 4\omega L\rho \le 21\omega L\rho + \frac{5\beta L^2\|\Lambda\|_\infty^2}{m} =: \log\zeta_4.$$
It follows that
$$\mathbb{P}(b_{i\cdot} \in E\cap F) \le \sqrt{3/2}\;\zeta_3\zeta_4 \exp\Big( -I_{kr} - \frac12\log\Lambda_{\min} \Big).$$
Finally, we have
$$\mathbb{P}(b_{i\cdot} \in F) \le \mathbb{P}(b_{i\cdot} \in E\cap F) + \mathbb{P}(b_{i\cdot} \in E^c) \le \sqrt{3/2}\;\zeta_3\zeta_4 \exp\Big( -I_{kr} - \frac12\log\Lambda_{\min} \Big) + \exp(-3L\|\Lambda\|_\infty) \le 2\zeta_3\zeta_4 \exp\Big( -I_{kr} - \frac12\log\Lambda_{\min} \Big),$$
assuming that $I_{kr}$ and $\Lambda_{\min}$ are sufficiently large. Noting that, by the definition of $\eta_{kr}$ in the statement of the theorem, $\eta_{kr} = \log(2\zeta_3\zeta_4)$, the proof is complete.

Appendix G. Remaining proofs

G.1. Proofs of Appendix B.2, B.3 and B.4

Proof of Lemma 4. We have $n_k(y^{(q)}) \sim \mathrm{Hypergeometric}(n/Q,\, n_k(y),\, n)$. For any fixed $k\in[K]$ and $q\in[Q]$, the concentration of the hypergeometric distribution (Chvátal, 1979) gives $|\pi_k(y^{(q)}) - \pi_k(y)| \le \xi$ with probability at least $1 - 2\exp(-n\xi^2/Q)$. The same probability bound holds for $|\pi_\ell(z^{(q)}) - \pi_\ell(z)| \le \xi$, for any fixed $\ell\in[L]$ and $q\in[Q]$. Taking the union bound over $k$, $\ell$, $q$ and $q'$ gives the desired result.

Proof of Lemma 5. Recall the definition of the true local mean parameters in (26), and the corresponding global parameters in Section 4.1. We have
$$\lambda^{(q)}_{k\ell} - \frac{\lambda_{k\ell}}{Q} = P_{k\ell}\Big( n_\ell(z^{(q)}) - \frac{n_\ell(z)}{Q} \Big) = \frac{\lambda_{k\ell}}{Q}\cdot\frac{\pi_\ell(z^{(q)}) - \pi_\ell(z)}{\pi_\ell(z)}.$$
Since $|\pi_\ell(z^{(q)}) - \pi_\ell(z)| \le \xi$ and $\pi_\ell(z) \ge 1/(\beta L)$ by assumptions (B4a) and (A2), the first inequality in (57) follows, from which we have the second inequality by (B4a).

Proof of Lemma 6. From Lemma 5, we have $\|\Lambda^{(q)} - \Lambda/Q\|_\infty \le (\xi L\beta)\|\Lambda/Q\|_\infty$ and $\|\Lambda^{(q)}\|_\infty \le \frac32\|\Lambda/Q\|_\infty$. Note that $\Lambda^{(q)}$ is the true (local) mean parameter matrix associated with subblock $A^{(q',q)}$, and this subblock has $n/(2Q)$ rows. We will apply Lemma 1 to the submatrix $A^{(q',q)}$ and sublabels $z^{(q')}$ and $y^{(q)}$. In order to do so, we have to verify conditions (A1), (A2) and (B3) for the subblock. (Condition (B3) is satisfied by assumption.) By Lemma 5, we have $\Lambda^{(q)}_{\min} \ge \Lambda_{\min}/(2Q)$, and by (55), condition (A2) holds with $\beta$ replaced with $2\beta$. We also need to replace $\Lambda_{\min}$ in Lemma 5 with $\Lambda^{(q)}_{\min} \ge \Lambda_{\min}/(2Q)$, and $C_\gamma$ with $4C_\gamma$ (more precisely, we are replacing $C_{\gamma,\beta}$ with $C_{\gamma,2\beta}$). Thus, assuming $6(4C_\gamma)(3\omega) \le 1$, we obtain
$$\mathbb{P}\Big( \|\hat\Lambda^{(q',q)} - \Lambda^{(q)}\|_\infty \le 4(4C_\gamma+\tau)\|\Lambda^{(q)}\|_\infty \Big) \ge 1 - 2p_1\Big( \tau;\, \frac{n}{2Q},\, 2\beta \Big), \tag{117}$$
where $p_1(\cdot)$ is as in (45). Since $\|\Lambda^{(q)}\|_\infty \le \frac32\|\Lambda/Q\|_\infty$, we have $4(4C_\gamma+\tau)\|\Lambda^{(q)}\|_\infty \le (24C_\gamma+6\tau)\|\Lambda/Q\|_\infty$. Thus, on the event in (117), we have by the triangle inequality
$$\|\hat\Lambda^{(q',q)} - \Lambda/Q\|_\infty \le 4(4C_\gamma+\tau)\|\Lambda^{(q)}\|_\infty + (\xi L\beta)\|\Lambda/Q\|_\infty \le \Big[ 4(4C_\gamma+\tau)\frac32 + \xi L\beta \Big]\|\Lambda/Q\|_\infty,$$
which is the desired result.

Proof of Lemma 7. For the proof, it is enough to consider $\Lambda = [\lambda_0; \lambda_1] \in \mathbb{R}_+^{2\times L}$, where $\lambda_0, \lambda_1 \in \mathbb{R}_+^L$ are the two rows of $\Lambda$. Similarly, let $\bar\Lambda = [\bar\lambda_0; \bar\lambda_1] \in B_\Lambda(\delta)$. Let us define
$$I_s(\lambda_0 \mid \lambda_1) = \sum_{\ell=1}^L \big[ (1-s)\lambda_{0\ell} + s\lambda_{1\ell} - \lambda_{0\ell}^{1-s}\lambda_{1\ell}^{s} \big], \qquad \lambda_0, \lambda_1 \in \mathbb{R}_+^L, \tag{118}$$
and $\alpha_\ell = \max\{|\lambda_{0\ell} - \bar\lambda_{0\ell}|,\, |\lambda_{1\ell} - \bar\lambda_{1\ell}|\}$. We have
$$\big| I_s(\lambda_0 \mid \lambda_1) - I_s(\bar\lambda_0 \mid \bar\lambda_1) \big| \le \sum_\ell \Big[ \alpha_\ell + \big| \lambda_{0\ell}^{1-s}\lambda_{1\ell}^{s} - \bar\lambda_{0\ell}^{1-s}\bar\lambda_{1\ell}^{s} \big| \Big].$$
Consider the function $f(a,b) = a^{1-s}b^s$ for $a, b > 0$. Assuming $\max\{a/b, b/a\} \le \omega$, we have
$$\|\nabla f(a,b)\|_1 = (1-s)(b/a)^s + s(a/b)^{1-s} \le (1-s)\omega^s + s\,\omega^{1-s} \le \omega,$$
using $\omega \ge 1$. It follows that $|a^{1-s}b^s - u^{1-s}v^s| \le \omega\max\{|a-u|, |b-v|\}$ for $a, b, u, v > 0$. Thus, $|I_s(\lambda_0 \mid \lambda_1) - I_s(\bar\lambda_0 \mid \bar\lambda_1)| \le (1+\omega)\sum_\ell \alpha_\ell \le 2\omega L\delta\|\Lambda\|_\infty$, since $\max_\ell \alpha_\ell \le \delta\|\Lambda\|_\infty$ by assumption. Taking the supremum over $s$ gives part (a).
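As a quick numerical sanity check of the bound just proved (our illustration; the values of $L$, $\delta$ and the rate vectors are arbitrary), perturbing each entry of $\Lambda$ by at most $\rho = \delta\|\Lambda\|_\infty$ should move $I_s$ by at most $2\omega L\delta\|\Lambda\|_\infty$, uniformly in $s$:

```python
import numpy as np

rng = np.random.default_rng(1)

def info(s, lam0, lam1):
    # I_s(lam0 | lam1) as in (118).
    return np.sum((1 - s) * lam0 + s * lam1 - lam0 ** (1 - s) * lam1 ** s)

L, delta = 4, 0.01
lam0 = rng.uniform(1.0, 3.0, L)
lam1 = rng.uniform(1.0, 3.0, L)
rho = delta * max(lam0.max(), lam1.max())  # rho = delta * ||Lambda||_inf

# Entrywise perturbation within B_Lambda(delta).
lam0p = lam0 + rng.uniform(-rho, rho, L)
lam1p = lam1 + rng.uniform(-rho, rho, L)

# omega covering the ratio condition for both the original and perturbed rates.
pairs = [(lam0, lam1), (lam0p, lam1p)]
omega = max(max((a / b).max(), (b / a).max()) for a, b in pairs)

gap = max(abs(info(s, lam0, lam1) - info(s, lam0p, lam1p))
          for s in np.linspace(0.01, 0.99, 99))
assert gap <= 2 * omega * L * rho  # i.e., 2 * omega * L * delta * ||Lambda||_inf
print(f"max |I_s - I_s'| = {gap:.2e} <= bound {2 * omega * L * rho:.2e}")
```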
Proof of Lemma 8. Assume $\widehat{\mathrm{Mis}}(\tilde y, y) \le \alpha$, and let $n_k = |\{i : y_i = k\}|$ and $N_{kk'} = |\{i : y_i = k,\, \tilde y_i = k'\}|$. Then, with $\varepsilon := \alpha/\min_k\pi_k$,
$$n\,\widehat{\mathrm{Mis}}(\tilde y, y) = \sum_k \sum_{k' \ne k} N_{kk'} \le \alpha n, \quad \text{hence} \quad \sum_{k'\ne k} N_{kk'} \le \alpha n \le \frac{\alpha}{\pi_k} n_k \le \varepsilon n_k$$
for every $k$. We also obtain $N_{kk'} \le \varepsilon n_k$ for all $k$ and $k'$ such that $k' \ne k$. Since $\sum_{k'} N_{kk'} = n_k$, we have $N_{kk} \ge (1-\varepsilon)n_k$. Thus, as long as $\varepsilon < 1/2$, we have $N_{kk} > N_{kk'}$ for all $k$ and $k'$ such that $k' \ne k$. That is, the diagonal of the confusion matrix is bigger than every element in the corresponding row. Now take $\sigma \ne \mathrm{id}$. Then, there exists $k$ such that $k' := \sigma^{-1}(k) \ne k$, and
$$N^\sigma_{kk} := |\{i : y_i = k,\, \sigma(\tilde y_i) = k\}| = |\{i : y_i = k,\, \tilde y_i = k'\}| = N_{kk'} < N_{kk}.$$
Then we have
$$n\,\widehat{\mathrm{Mis}}(\sigma(\tilde y), y) = \sum_k \big( n_k - N^\sigma_{kk} \big) > \sum_k \big( n_k - N_{kk} \big) = n\,\widehat{\mathrm{Mis}}(\tilde y, y),$$
showing that $\mathrm{id}$ is the unique optimal permutation and proving part (a). For part (b), we note that $|\{i : \tilde y_i = k\}| \ge N_{kk} \ge (1-\varepsilon)n_k > (1/2)n_k$ whenever $\varepsilon < 1/2$.

Proof of Lemma 9. Assume that $\mathrm{Mis}(\tilde y, y) \le \alpha$ and $\mathrm{Mis}(\tilde y', y) \le \alpha$, where $\alpha < \frac14\min_k \pi_k(y)$. By the definition of the optimal permutation, $\widehat{\mathrm{Mis}}(\sigma(\tilde y), y) \le \alpha$ and $\widehat{\mathrm{Mis}}(\sigma'(\tilde y'), y) \le \alpha$. Since $\widehat{\mathrm{Mis}}$ is a metric (being the sum of discrete metrics over the coordinates), we have
$$\widehat{\mathrm{Mis}}\big( \sigma^{-1}\circ\sigma'(\tilde y'),\, \tilde y \big) = \widehat{\mathrm{Mis}}\big( \sigma'(\tilde y'),\, \sigma(\tilde y) \big) \le 2\alpha < \frac12\min_k\pi_k(\tilde y),$$
where the first inequality is by the triangle inequality for $\widehat{\mathrm{Mis}}$ and the second by assumption. Applying Lemma 8 gives the desired result.

Proof of Corollary 2. Take $q = 2$ for simplicity. Assume that (60) holds with the constant 8 in place of 32, which is all we need for this lemma. We have
$$(n/2)\,\widehat{\mathrm{Mis}}\big( \sigma_{12}(\tilde y^{(1)}),\, y^{(1)} \big) \le n\,\widehat{\mathrm{Mis}}\big( \sigma_{12}(\tilde y^{(1,2)}),\, y^{(1,2)} \big)$$
by the definition of $\widehat{\mathrm{Mis}}$. It then follows that
$$\widehat{\mathrm{Mis}}\big( \sigma_{12}(\tilde y^{(1)}),\, y^{(1)} \big) < 2\cdot\frac{1}{8\beta K} \le \frac12\min_k\pi_k(y^{(1)}),$$
where the second inequality holds by the counterpart of (55) for row labels. Applying Lemma 8, we conclude that $\sigma_{12} = \sigma^*(\tilde y^{(1)},\, y^{(1)}) =: \sigma_1^*$.

Proof of Corollary 3. Take $q = 2$ for simplicity. Let $\varepsilon = 1/(32\beta K)$. By assumption, we have $\mathrm{Mis}(\tilde y^{(1,2)}, y^{(1,2)}) < \varepsilon$ and $\mathrm{Mis}(\tilde y^{(2,3)}, y^{(2,3)}) < \varepsilon$. By Corollary 2, $\sigma_{12} = \sigma_2^*$ and $\sigma_{23} = \sigma_2^{*\prime}$. Then, the argument leading to (38) implies $\mathrm{Mis}(\tilde y^{(2)}, y^{(2)}) < 2\varepsilon$ and $\mathrm{Mis}(\tilde y'^{(2)}, y^{(2)}) < 2\varepsilon$. By assumption,
$$\mathrm{Mis}(\tilde y'^{(2)}, y^{(2)}) < 2\varepsilon = \frac{1}{16\beta K} \le \frac18\min_k\pi_k(y^{(2)}) \le \frac14\min_k\pi_k(\tilde y^{(2)}),$$
where the second inequality holds by the counterpart of (55) for row labels, and the third inequality follows from the second inequality and Lemma 8(b). It thus follows from Lemma 9 that $\sigma_2^{*-1}\circ\sigma_2^{*\prime} = \sigma^*(\tilde y'^{(2)},\, \tilde y^{(2)})$, which is the desired result.

G.2. Proofs of Appendix C.1

Proof of Lemma 14. In this proof, let us define $I := I_{kr}$ and
$$I_s := \sum_{\ell=1}^L \big[ (1-s)\lambda_{k\ell} + s\lambda_{r\ell} - \lambda_{k\ell}^{1-s}\lambda_{r\ell}^{s} \big].$$
For $s \in [0,1]$, $s\mapsto I_s$ is a concave function and $I_0 = I_1 = 0$. We have defined $I := I_{s^*} = \sup_{s\in[0,1]} I_s$. Suppose $s^* \ge \frac12$; since $\frac12 = (1 - \frac{1}{2s^*})\cdot 0 + \frac{1}{2s^*}\cdot s^*$ and $\frac{1}{2s^*} \ge \frac12$, by concavity, $I_{1/2} \ge \frac{1}{2s^*} I_{s^*} \ge I/2$. Similarly, if $s^* \le \frac12$, $I_{1/2} \ge I/2$ still holds, from which it follows that
$$\sum_{\ell\in[L]} (\lambda_{r\ell} - \lambda_{k\ell})^2 = \sum_{\ell\in[L]} \big( \sqrt{\lambda_{r\ell}} - \sqrt{\lambda_{k\ell}} \big)^2 \big( \sqrt{\lambda_{r\ell}} + \sqrt{\lambda_{k\ell}} \big)^2 \ge (I/2)(4\Lambda_{\min}) = 2\Lambda_{\min} I,$$
using $\sum_\ell(\sqrt{\lambda_{r\ell}} - \sqrt{\lambda_{k\ell}})^2 = 2I_{1/2} \ge I$. Taking the minimum over $k$ and $r$ completes the proof.

Proof of Lemma 15. Recall our choice of $\xi$ in (75), which will also be assumed in this proof, giving $\mathbb{P}(\mathcal{P}^c) = o(1)$ as shown in (78). By Lemma 5, we have $\Lambda^{(q)} \in B_{\Lambda/Q}(\xi L\beta)$ for all $q\in[Q]$, which combined with Lemma 7 (applied with $\delta = \xi L\beta$) gives
$$\big| I_{kr}(\Lambda^{(q)}) - I_{kr}(\Lambda/Q) \big| \le 2\omega(\xi L\beta)L\|\Lambda/Q\|_\infty,$$
which is $o(I_{\min}(\Lambda/Q))$ using (75). Thus $I_{\min}(\Lambda^{(q)}) \ge I_{\min}(\Lambda/Q)/2$ as $I_{\min} \to \infty$. We are now ready to apply Corollary 5 to Algorithm 4 operating on the subblocks in $G^{\mathrm{col}}_1$. It remains to verify that assumption (86) translates to condition (85) for the subblocks.
Indeed, we have to replace $I_{\min}$ with $I_{\min}(\Lambda^{(q)})$, $\omega$ with $3\omega$ (by Lemma 5), $\beta$ with $2\beta$ (by (55)), and $\alpha$ with $\alpha/2$, since the subblocks in $G^{\mathrm{col}}_1$ are of size $\frac{n}{4}\times\frac{m}{4}$. Therefore, by assumption (86),
$$\frac{(2\beta)^2(3\omega)\, KL(K\vee L)}{(\alpha/2)\, I_{\min}(\Lambda^{(q)})} \le \frac{6Q\beta^2\omega^2\, KL(K\vee L)}{\alpha\, I_{\min}} \le C_1(1+\kappa)^{-2}, \tag{119}$$
verifying condition (85) on the subblocks. Applying Corollary 5, the misclassification rate of $\tilde y^{(q)}$ satisfies
$$\mathrm{Mis}\big( \tilde y^{(q)},\, y^{(q)} \big) \le (1+\kappa)^2\,\frac{(3\omega)(2\beta)\, L(K\vee L)}{(\alpha/2)}\cdot\frac{2}{C_1\,(I_{\min}/2Q)},$$
which is the desired result.

G.3. Proofs of Appendix F.3

Proof of Proposition 3.

Step 1: Interpolation. Assume without loss of generality that $\theta_{01} \ne \theta_{11}$ and fix some $s\in(0,1)$. It is enough to establish the bound for $\ell = 1$ and this particular $s$. Let $P_{e,+} := P_{e,0} + P_{e,1}$ be the sum of the error probabilities under the two hypotheses. Then,
$$P_{e,+} = \int p_0\, 1\{p_0 \le p_1\}\, d\mu + \int p_1\, 1\{p_1 < p_0\}\, d\mu \tag{120}$$
$$= \int \min(p_0, p_1)\, d\mu = \int p_0^{1-s} p_1^{s}\, \min(l^s, l^{s-1})\, d\mu, \tag{121}$$
where $l = p_0/p_1$ is the likelihood ratio. Let $p_{r\ell} := \pi(\cdot;\theta_{r\ell})$ so that $p_r(x) = \prod_\ell p_{r\ell}(x_\ell)$. Similarly, let
$$p_s := \frac{p_0^{1-s} p_1^{s}}{\int p_0^{1-s} p_1^{s}\, d\mu}, \qquad p_{s\ell} := \frac{p_{0\ell}^{1-s} p_{1\ell}^{s}}{\int p_{0\ell}^{1-s} p_{1\ell}^{s}\, d\nu}. \tag{122}$$
It is easy to see that $p_s(x) = \prod_{\ell=1}^L p_{s\ell}(x_\ell)$ and each $p_{s\ell}$ is a probability density (w.r.t. $\nu$). One can also verify that
$$\int p_{0\ell}^{1-s} p_{1\ell}^{s}\, d\nu = e^{-I_{s\ell}}, \qquad p_{s\ell} = \pi(\cdot;\theta_{s\ell}), \qquad \text{hence} \quad p_s = p(\cdot;\theta_s),$$
using definition (100). That is, $p_s$ defined in (122) belongs to the same exponential family, with parameter $\theta_s$ interpolating $\theta_0$ and $\theta_1$. We also note that $p_{0\ell}^{1-s} p_{1\ell}^{s} = e^{-I_{s\ell}} p_{s\ell}$, hence $p_0^{1-s} p_1^{s} = e^{-I_s} p_s$. Substituting into (121), we obtain
$$P_{e,+} = e^{-I_s} \int p_s\, \min(l^s, l^{s-1})\, d\mu. \tag{123}$$

Step 2: Reduction to the single-component case ($L = 1$). Using $p_{r\ell}(t) = \pi(t;\theta_{r\ell})$, we have $p_{r\ell}(t)/p_{r\ell}(t') = \exp(\theta_{r\ell}(t-t'))$, hence
$$\frac{p_{0\ell}(t)}{p_{0\ell}(t')}\cdot\frac{p_{1\ell}(t')}{p_{1\ell}(t)} = \exp\big( (\theta_{0\ell} - \theta_{1\ell})(t - t') \big).$$
Using $p_r(x) = \prod_\ell p_{r\ell}(x_\ell)$, the likelihood ratio can be written as
$$l(x) = \frac{p_0(x)}{p_1(x)} = \prod_\ell l_\ell(x_\ell), \quad \text{where} \quad l_\ell(x_\ell) = \frac{p_{0\ell}(x_\ell)}{p_{1\ell}(x_\ell)} = \exp\big( (\theta_{0\ell} - \theta_{1\ell}) x_\ell - A(\theta_{0\ell}) + A(\theta_{1\ell}) \big). \tag{124}$$
As long as $\theta_{0\ell} \ne \theta_{1\ell}$, $l_\ell$ is well defined on $\mathbb{R}$ and maps onto $\mathbb{R}_{++}$. For any $(x_2,\ldots,x_L)$, let $x_1^* = x_1^*(x_2,\ldots,x_L)$ be the solution of the equation
$$l_1(x_1^*) \prod_{\ell=2}^L l_\ell(x_\ell) = 1,$$
which always exists in $\mathbb{R}$ (though not necessarily on the support of the exponential family). Then we have, setting $\delta = \theta_{01} - \theta_{11}$,
$$l(x) = \frac{l_1(x_1)}{l_1(x_1^*)} = \frac{p_{01}(x_1)}{p_{01}(x_1^*)}\cdot\frac{p_{11}(x_1^*)}{p_{11}(x_1)} = \exp\big( (\theta_{01} - \theta_{11})(x_1 - x_1^*) \big) = \exp\big[ \delta(x_1 - x_1^*) \big].$$
It follows that
$$\min\big( l(x)^s,\, l(x)^{s-1} \big) \le e^{-\min(s,1-s)|\delta(x_1 - x_1^*)|} = e^{-\alpha|x_1 - x_1^*|},$$
where we have defined $\alpha := |\delta|\min(s, 1-s)$. Recall that $p_s(x) = \prod_{\ell=1}^L p_{s\ell}(x_\ell)$, which we write compactly as $p_s = \prod_{\ell=1}^L p_{s\ell}$. Let us write $\mu = \mu_1\otimes\mu_{2:L}$ as the product of the underlying coordinate measures. By Fubini's theorem, we first integrate over the first coordinate in (123):
$$e^{I_s} P_{e,+} = \int \Big( \prod_{\ell=2}^L p_{s\ell} \Big) \Big[ \int p_{s1}\, \min(l^s, l^{s-1})\, d\mu_1 \Big]\, d\mu_{2:L}. \tag{125}$$
Let $J = J(x_2,\ldots,x_L)$ denote the inner integral in (125) (in brackets). We have the bound
$$J \le \int p_{s1}(x_1)\, e^{-\alpha|x_1 - x_1^*|}\, d\mu_1(x_1) \le \|p_{s1}\|_\infty \int e^{-\alpha|x_1 - x_1^*|}\, d\mu_1(x_1).$$
Note that $x_1^*$ is the only place where dependence on $(x_2,\ldots,x_L)$ appears in the bound. Since $\mu_1$ is either the Lebesgue or the counting measure, and both of these measures are translation invariant, the bound is in fact independent of $x_1^*$. That is, we have $J(x_2,\ldots,x_L) \le C(\alpha)\|p_{s1}\|_\infty$ for all $(x_2,\ldots,x_L)$. It follows that the same bound holds for $e^{I_s} P_{e,+}$ by (125), that is,
$$e^{I_s} P_{e,+} \le C(\alpha)\,\|p_{s1}\|_\infty \int \Big( \prod_{\ell=2}^L p_{s\ell} \Big)\, d\mu_{2:L} = C(\alpha)\,\|p_{s1}\|_\infty,$$
since $\prod_{\ell=2}^L p_{s\ell}$ is a probability density w.r.t. $\mu_{2:L}$. Since the choices of the coordinate $\ell = 1$ and of $s$ were arbitrary, the proof is complete.
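To see the Poisson specialization of this result in action, the following Monte Carlo sketch (our illustration; the mean vectors are arbitrary) estimates $P_{e,0}+P_{e,1}$ for the likelihood ratio test and checks it against the classical Chernoff bound $e^{-I^*}$, which Proposition 3 further sharpens by the factor $\|\pi(\cdot;\theta_{s^*\ell})\|_\infty\, C(\alpha^*)$:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(2)
lam0 = np.array([6.0, 2.0, 4.0])
lam1 = np.array([2.0, 5.0, 3.0])

def info(s):
    return np.sum((1 - s) * lam0 + s * lam1 - lam0 ** (1 - s) * lam1 ** s)

res = minimize_scalar(lambda s: -info(s), bounds=(0, 1), method="bounded")
I_star = -res.fun

def llr(x):
    """Log-likelihood ratio log p0(x) - log p1(x) for product Poisson models."""
    return np.sum(poisson.logpmf(x, lam0) - poisson.logpmf(x, lam1), axis=-1)

n = 200_000
x0 = rng.poisson(lam0, size=(n, 3))  # samples under p0
x1 = rng.poisson(lam1, size=(n, 3))  # samples under p1

# LRT: decide p0 iff llr >= 0; sum of the two error probabilities.
pe = np.mean(llr(x0) < 0) + np.mean(llr(x1) >= 0)
print(f"Monte Carlo P_e0 + P_e1 = {pe:.4f} <= exp(-I*) = {np.exp(-I_star):.4f}")
```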
G.4. Proofs of Appendix F.3.1

Proof of Lemma 17. Let $f(s) = \sum_{\ell=1}^L \big[ (1-s)\lambda_{k\ell} + s\lambda_{r\ell} - \lambda_{k\ell}^{1-s}\lambda_{r\ell}^{s} \big]$; then $f$ is a concave function of $s$ on $\mathbb{R}_+$. Since $f(0) = f(1) = 0$, we have $s^* \in (0,1)$. First, we show the statement is true when $L = 1$. In this case, $s^*$ satisfies
$$\lambda_{k1} - \lambda_{r1} + \lambda_{k1}^{1-s^*}\lambda_{r1}^{s^*}\log\frac{\lambda_{r1}}{\lambda_{k1}} = 0. \tag{126}$$
Let $x = \lambda_{r1}/\lambda_{k1}$. Then (126) is equivalent to $1 - x + x^{s^*}\log x = 0$, that is,
$$s^*(x) = \frac{\log\big( (x-1)/\log x \big)}{\log x}.$$
We extend the domain of $s^*(x)$ to $x = 1$ by defining $s^*(1) = \frac12$; then $s^*(x)$ is a continuous increasing function on $(0,\infty)$. Since $\lambda_{r1}/\lambda_{k1} \in [1/\omega, \omega]$, we have $s^* \in [s^*(1/\omega),\, s^*(\omega)] \subset (0,1)$. One can observe that $s^*(x) = 1 - s^*(1/x)$, so $s^* \in [s^*(1/\omega),\, 1 - s^*(1/\omega)]$. One can also observe that $s^*(x) \ge x/2$ for $x \in [0,1]$, so $s^* \in [\frac{1}{2\omega},\, 1 - \frac{1}{2\omega}]$. Now suppose $L > 1$, and let $s^*_\ell$ be the optimizer of $f_\ell(s) = (1-s)\lambda_{k\ell} + s\lambda_{r\ell} - \lambda_{k\ell}^{1-s}\lambda_{r\ell}^{s}$; by the argument above, $s^*_\ell \in [\frac{1}{2\omega},\, 1 - \frac{1}{2\omega}]$. The optimizer $s^*$ of $f(s) = \sum_{\ell=1}^L f_\ell(s)$ satisfies $s^* \in [\frac{1}{2\omega},\, 1 - \frac{1}{2\omega}]$ as well, because each $f_\ell(s)$ is concave: to the left of $\min_\ell s^*_\ell$ every $f_\ell$ is increasing, and to the right of $\max_\ell s^*_\ell$ every $f_\ell$ is decreasing.
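The closed form $s^*(x)$ derived in the proof above is easy to verify numerically; the sketch below (an illustration we added) checks that it matches the grid maximizer of $I_s$ for $L = 1$ and that $s^*(x) = 1 - s^*(1/x)$:

```python
import numpy as np

def s_star(x):
    """Closed-form maximizer of I_s for L = 1 with ratio x = lam_r / lam_k."""
    return np.log((x - 1) / np.log(x)) / np.log(x)

def info(s, x):
    # I_s for L = 1, normalized so lam_k = 1 and lam_r = x.
    return (1 - s) + s * x - x ** s

for x in [0.2, 0.5, 2.0, 5.0]:
    s_grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
    s_num = s_grid[np.argmax(info(s_grid, x))]
    assert abs(s_num - s_star(x)) < 1e-3        # matches the grid maximizer
    assert abs(s_star(x) - (1 - s_star(1 / x))) < 1e-12  # symmetry s*(x)=1-s*(1/x)
print("closed form for s* verified")
```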
Proof of Lemma 18. We first note the following Laurent expansion:
$$\frac{1}{1-(1+x)^{-r}} = \frac{1}{rx} + \frac{1+r^{-1}}{2} + O(x), \quad \text{as } x \to 0,$$
from which we get the inequality
$$\frac{1}{1-(1+x)^{-r}} \le \frac{1}{rx} + \frac{1+r^{-1}}{2}, \quad \text{for } x > 0,\ r < 1.$$
Let $\varepsilon = \varepsilon_{01}$ and $\alpha = \alpha_{01}$. Applying this inequality with $r = 1/(2\omega)$ and $x = \varepsilon$, and recalling $\alpha = \frac{1}{2\omega}\log(1+\varepsilon)$, we have
$$C(\alpha) \le \frac{2}{1-e^{-\alpha}} = \frac{2}{1-(1+\varepsilon)^{-1/(2\omega)}} \le \frac{4\omega}{\varepsilon} + (1+2\omega).$$
Using $1 \le \omega$ completes the proof.

Proof of Lemma 19. We have
$$e^{\lambda}\,\|\pi(\cdot;\log\lambda)\|_\infty = \sup_{t\in\mathbb{Z}_+} \frac{\lambda^t}{t!} \le \sup_{t\in\mathbb{R}_+} \frac{\lambda^t}{\sqrt{2\pi t}\,(t/e)^t} = \frac{e^{\lambda}}{\sqrt{2\pi\lambda}},$$
where the first inequality is by Stirling's approximation and the last equality is by plugging in the maximizer $t = \lambda$.

Proof of Lemma 20. There exists $\ell\in[L]$ such that
$$(1-s^*)\lambda_{k\ell} + s^*\lambda_{r\ell} - \lambda_{k\ell}^{1-s^*}\lambda_{r\ell}^{s^*} \ge \frac{I_{kr}}{L}.$$
Without loss of generality, we assume $\lambda_{k\ell} < \lambda_{r\ell}$. Let $s^*$ be the optimizer of $I_{kr}$. Dividing both sides by $\lambda_{k\ell}$, we have
$$1 - s^* + s^*\frac{\lambda_{r\ell}}{\lambda_{k\ell}} - \Big(\frac{\lambda_{r\ell}}{\lambda_{k\ell}}\Big)^{s^*} \ge \frac{I_{kr}}{L\lambda_{k\ell}} \ge \frac{I_{kr}}{L\|\Lambda\|_\infty}.$$
Let us define $f(x) := 1 - s^* + s^* x - x^{s^*}$. Then, for $x > 1$,
$$f(x) \le \frac{2(1-s^*)s^*}{2}\,(x-1)^2 \le \frac{(x-1)^2}{8}.$$
Thus $f(x) \ge \frac{I_{kr}}{L\|\Lambda\|_\infty}$ implies $x \ge \sqrt{1 + \frac{8I_{kr}}{L\|\Lambda\|_\infty}}$, or equivalently, $\log x \ge \frac12\log\big( 1 + \frac{8I_{kr}}{L\|\Lambda\|_\infty} \big)$.

Proof of Lemma 21. Without loss of generality, we assume $\lambda_{01}/\lambda_{11} = 1+\varepsilon_{01}$. Letting $\rho = \delta\|\Lambda\|_\infty$, we have $\|\Lambda' - \Lambda\|_\infty \le \rho$ by definition. Let $f(x) = (\lambda_{01}-x)/(\lambda_{11}+x)$. Then $f(x)$ is convex on $(0,\infty)$ with derivative $f'(x) = -(\lambda_{01}+\lambda_{11})/(\lambda_{11}+x)^2$, hence
$$\frac{\lambda'_{01}}{\lambda'_{11}} \ge \frac{\lambda_{01}-\rho}{\lambda_{11}+\rho} = f(\rho) \ge f(0) + \rho f'(0) = 1+\varepsilon_{01} - \frac{\lambda_{01}+\lambda_{11}}{\lambda_{11}^2}\,\rho.$$
Combined with $\rho/\lambda_{11} = \delta\|\Lambda\|_\infty/\lambda_{11} \le \omega\delta$ and
$$\frac{\lambda_{01}+\lambda_{11}}{\lambda_{11}} = 2+\varepsilon_{01} \le 2(1+\varepsilon_{01}), \tag{127}$$
we have $\lambda'_{01}/\lambda'_{11} \ge 1+\varepsilon_{01} - 2\omega(1+\varepsilon_{01})\delta$, which gives the desired result.

Proof of Lemma 22. Let $X_j \sim \mathrm{Poi}(p_j)$, independent over $j = 1,\ldots,n$, so that $\sum_{j=1}^n X_j \sim \mathrm{Poi}(\lambda)$. Fix $x\in\mathbb{Z}_+$ and let $S(x) = \{S\subset[n] : |S| = x\}$. For any subset $S$ of $[n]$ and vectors $\alpha, \beta \in \mathbb{R}_+^n$, let $\psi(\alpha,\beta,S) = \prod_{j\in S}\alpha_j\prod_{j\notin S}\beta_j$. We have
$$\varphi(x;\lambda) \ge \mathbb{P}\Big( \sum_j X_j = x,\; X_j\in\{0,1\},\ \forall j\in[n] \Big) = \sum_{S\in S(x)} \prod_{j\in S}\mathbb{P}(X_j = 1)\prod_{j\notin S}\mathbb{P}(X_j = 0) = \sum_{S\in S(x)} \psi\big( (p_j e^{-p_j}),\, (e^{-p_j}),\, S \big).$$
On the other hand, $\tilde\varphi(x;p) = \sum_{S\in S(x)} \psi\big( (p_j),\, (1-p_j),\, S \big)$. Thus,
$$\frac{\tilde\varphi(x;p)}{\varphi(x;\lambda)} \le \frac{\sum_{S\in S(x)} \psi\big( (p_j),\, (1-p_j),\, S \big)}{\sum_{S\in S(x)} \psi\big( (p_j e^{-p_j}),\, (e^{-p_j}),\, S \big)} \le \max_{S\in S(x)} \frac{\psi\big( (p_j),\, (1-p_j),\, S \big)}{\psi\big( (p_j e^{-p_j}),\, (e^{-p_j}),\, S \big)} = \max_{S\in S(x)} \psi\big( (e^{p_j}),\, ((1-p_j)e^{p_j}),\, S \big),$$
using $(\sum_i a_i)/(\sum_i b_i) \le \max_i(a_i/b_i)$, which holds assuming the sums have an equal number of terms, all positive. Using $(1-x)e^x \le 1$, it follows that
$$\frac{\tilde\varphi(x;p)}{\varphi(x;\lambda)} \le \max_{S\in S(x)} \psi\big( (e^{p_j}),\, (1),\, S \big) = \max_{S\in S(x)} \prod_{j\in S} e^{p_j} \le \exp(p^* x).$$

Proof of Lemma 23. We have
$$\frac{\varphi(x;\lambda_1)}{\varphi(x;\lambda_2)} = \Big(\frac{\lambda_1}{\lambda_2}\Big)^x e^{\lambda_2-\lambda_1} \le \Big(\frac{\lambda+\rho}{\lambda-\rho}\Big)^x e^{2\rho} = \Big( 1+\frac{2\rho}{\lambda-\rho} \Big)^x e^{2\rho} \le \exp\Big( \frac{2\rho x}{\lambda-\rho} + 2\rho \Big).$$
Since $\lambda - \rho \ge \frac23\lambda$ by assumption, the result follows.

Proof of Lemma 24. Let row node $i$ belong to row cluster $k$, and let $b_{i+} = \sum_{\ell\in[L]} b_{i\ell} = \sum_{j=1}^m A_{ij}$ be its degree, with expectation $\lambda_{k+} := \sum_{\ell\in[L]} \lambda_{k\ell}$. By definition, we have $\lambda_{k+} \le L\|\Lambda\|_\infty$. We would like to find an upper bound on the probability
$$\mathbb{P}\big( b_{i+} > 5L\|\Lambda\|_\infty \big) \le \mathbb{P}\big( b_{i+} - \lambda_{k+} > 4L\|\Lambda\|_\infty \big).$$
We let $v = \lambda_{k+}$ and $vt = 4L\|\Lambda\|_\infty$, so that $t \ge 4$. By Proposition 2, we have
$$\mathbb{P}\big( b_{i+} - \lambda_{k+} > 4L\|\Lambda\|_\infty \big) \le \exp\Big[ -\frac34\, vt \log\Big( 1+\frac{2t}{3} \Big) \Big] \le \exp(-3L\|\Lambda\|_\infty),$$
using $vt = 4L\|\Lambda\|_\infty$ and $\log(1+2t/3) \ge 1$ for $t \ge 4$.

G.5. Proof of Lemma 2(a)

Proof of Lemma 2(a). Fix $\tilde z$ and let $b_{i\cdot} = b_{i\cdot}(\tilde z)$. For $r \ne k \in [K]$, $i$ such that $y_i = k$, and $\bar\Lambda \in B_\Lambda(\delta)$,
$$Y_{ikr}(b_{i\cdot}, \bar\Lambda) = \sum_{\ell=1}^L \Big[ b_{i\ell}\log\frac{\bar\lambda_{r\ell}}{\bar\lambda_{k\ell}} + \bar\lambda_{k\ell} - \bar\lambda_{r\ell} \Big] \le \sum_{\ell=1}^L \Big[ b_{i\ell}\log\frac{\lambda_{r\ell}+\rho}{\lambda_{k\ell}-\rho} + \lambda_{k\ell} - \lambda_{r\ell} + 2\rho \Big] =: Y,$$
where $\rho := \delta\|\Lambda\|_\infty$ is the radius of $B_\Lambda(\delta)$. Hence $\mathbb{P}(\exists\bar\Lambda\in B_\Lambda,\, Y_{ikr} \ge 0) \le \mathbb{P}(Y \ge 0)$. By Markov's inequality, we have $\mathbb{P}(Y \ge 0) \le \mathbb{E}[e^{sY}]$ for any $s \ge 0$. To simplify the notation, let us write $v_\ell = s\log[(\lambda_{r\ell}+\rho)/(\lambda_{k\ell}-\rho)]$ and $w_\ell = s(\lambda_{k\ell}-\lambda_{r\ell}+2\rho)$, so that $sY = \sum_{\ell=1}^L b_{i\ell}v_\ell + w_\ell$. By independence, we have
$$\log\mathbb{E}[e^{sY}] = \log\mathbb{E}\Big[ \prod_{\ell=1}^L e^{b_{i\ell}v_\ell + w_\ell} \Big] = \sum_{\ell=1}^L \log\mathbb{E}\big[ e^{b_{i\ell}v_\ell + w_\ell} \big] = \sum_{\ell=1}^L \log\big( e^{w_\ell}\,\mathbb{E} e^{b_{i\ell}v_\ell} \big).$$
Since the mgf of a Poisson-binomial variable is bounded above by that of a Poisson variable with the same mean,
$$\log\mathbb{E} e^{sY} \le \sum_{\ell=1}^L w_\ell + \psi\big( v_\ell,\, \tilde\lambda_{k\ell}(y,\tilde z) \big),$$
where $\psi(t,\mu) = \mu(e^t - 1)$ is the log-mgf of a $\mathrm{Poi}(\mu)$ random variable. Recalling the assumption $\Lambda(y,\tilde z)\in B_\Lambda$, we have
$$\sum_{\ell=1}^L w_\ell + \psi\big( v_\ell,\, \tilde\lambda_{k\ell} \big) = \sum_{\ell=1}^L \Big[ \tilde\lambda_{k\ell}\Big( \Big(\frac{\lambda_{r\ell}+\rho}{\lambda_{k\ell}-\rho}\Big)^{s} - 1 \Big) + s(\lambda_{k\ell}-\lambda_{r\ell}+2\rho) \Big].$$
Since $\Lambda(y,\tilde z)\in B_\Lambda(\delta)$, $\tilde\lambda_{k\ell} \le \lambda_{k\ell} + \delta\|\Lambda\|_\infty = \lambda_{k\ell} + \rho$. Moreover, $\rho = \delta\|\Lambda\|_\infty \le \omega\delta\,\Lambda_{\min} \le \frac13\Lambda_{\min}$, so that $\lambda_{r\ell}+\rho \le \frac43\lambda_{r\ell}$ and $\lambda_{k\ell}-\rho \ge \frac23\lambda_{k\ell}$. Taking the infimum over $s > 0$ (by Lemma 17, the maximizer $s^*$ of $I_{kr}$ is always bounded away from 0 and 1), we have
$$\mathbb{P}\big( \exists\bar\Lambda\in B_\Lambda,\; Y_{ikr}(b_{i\cdot},\bar\Lambda) \ge 0 \big) \le \mathbb{P}(Y \ge 0) \le \exp\big( -I_{kr} + 8L\omega\delta\|\Lambda\|_\infty \big) = \exp\big( -(1-\eta')I_{kr} \big).$$

G.6. Proofs of Section 3

Proof of Proposition 1. The upper bound has been provided by Corollary 6. Here we show the lower bound, using the notation established in the proof of Proposition 3 and Appendix F.3.1. We rename $\lambda_{k\cdot}$ and $\lambda_{r\cdot}$ and work with $\lambda_0$ and $\lambda_1$ instead, and we assume throughout that $\lambda_{0\ell}, \lambda_{1\ell} \ge 1$ for all $\ell\in[L]$. We recall from (123) that
$$P_{e,+} = \int \min(p_0, p_1)\, d\mu = e^{-I_s}\int p_s\,\min(l^s, l^{s-1})\, d\mu, \tag{128}$$
where $p_s$ is defined in (122) and $l$ in (124). Since $\mu$ is the counting measure, we have
$$P_{e,+} \ge \max_{x\in\mathbb{Z}_+^L} \min\big( p_0(x),\, p_1(x) \big) = \max_{x\in\mathbb{Z}_+^L} e^{-I_s} p_s(x)\,\min\big( l^s(x),\, l^{s-1}(x) \big). \tag{129}$$
Finding a near-maximizing $x$ over $\mathbb{Z}_+^L$ gives the lower bound. First, let us extend the Poisson density as $\phi(t;\lambda) = \lambda^t e^{-\lambda}/\Gamma(t+1)$ to any $t\in\mathbb{R}_+$, so that $l$ is well defined on $\mathbb{R}_+^L$, given by
$$l(x) = \exp\Big( \sum_{\ell\in[L]} \Big[ x_\ell\log\frac{\lambda_{0\ell}}{\lambda_{1\ell}} - \lambda_{0\ell} + \lambda_{1\ell} \Big] \Big), \quad x\in\mathbb{R}_+^L.$$
Recall that $\lambda_{s\ell} = \lambda_{0\ell}^{1-s}\lambda_{1\ell}^{s}$, $\lambda_s = (\lambda_{s\ell})$ and $I_s = \sum_\ell[(1-s)\lambda_{0\ell} + s\lambda_{1\ell} - \lambda_{s\ell}]$ (cf. Appendix F.3.1). We note that
$$\frac{dI_s}{ds} = \sum_{\ell\in[L]} \Big[ -\lambda_{0\ell} + \lambda_{1\ell} + \lambda_{s\ell}\log\frac{\lambda_{0\ell}}{\lambda_{1\ell}} \Big] = \log\big( l(\lambda_s) \big).$$
The function $s\mapsto I_s$ is concave, smooth, and nonconstant (by assumption), and we have $I_0 = I_1 = 0$. Hence, the unique maximizer $s^*$ of $s\mapsto I_s$ belongs to $(0,1)$ and satisfies $dI_s/ds|_{s^*} = 0$, that is, $\log(l(\lambda_{s^*})) = 0$, or equivalently $p_0(\lambda_{s^*})/p_1(\lambda_{s^*}) = l(\lambda_{s^*}) = 1$. By the definition of $p_s$, we have
$$e^{-I_{s^*}}\, p_{s^*}(\lambda_{s^*}) = p_0^{1-s^*}(\lambda_{s^*})\, p_1^{s^*}(\lambda_{s^*}) = p_0(\lambda_{s^*}) = p_1(\lambda_{s^*}).$$
We recall that $p_s$ is the product of Poisson densities with parameters $\lambda_{s\ell}$. By a version of Stirling's inequality for the Gamma function (Jameson, 2015),
$$\Gamma(x+1) = x\Gamma(x) \le (2\pi)^{1/2}\, x^{x+1/2} e^{-x} e^{1/(12x)}, \quad x > 0,$$
hence $\Gamma(x+1) \le C_0\, x^{x+1/2} e^{-x}$ for all $x \ge 1$, where $C_0 = (2\pi)^{1/2} e^{1/12}$. Then,
$$\phi(\lambda;\lambda) = \frac{\lambda^\lambda e^{-\lambda}}{\Gamma(\lambda+1)} \ge C_0^{-1}\lambda^{-1/2},$$
from which it follows that
$$p_{s^*}(\lambda_{s^*}) = \prod_{\ell\in[L]} \phi(\lambda_{s^*\ell};\, \lambda_{s^*\ell}) \ge C_0^{-L}\prod_{\ell\in[L]}\lambda_{s^*\ell}^{-1/2}.$$
Thus, $e^{-I_{s^*}}\, C_0^{-L}\prod_\ell\lambda_{s^*\ell}^{-1/2}$ is a lower bound on $P_{e,+}$ whenever $\lambda_{s^*}\in\mathbb{Z}_+^L$. In general, $\lambda_{s^*}$ does not have integer coordinates. Instead, pick any $x\in\mathbb{Z}_+^L$ satisfying $\|x - \lambda_{s^*}\|_\infty \le 1$.
Since $t\mapsto\phi(t;\lambda)$ is a quasi-concave function (i.e., its upper level sets are convex), we have $\phi(t;\lambda) \ge \min\{\phi(a;\lambda),\, \phi(b;\lambda)\}$ for every $t\in[a,b]$. Hence, for every $t\in[a-1, a+1]$, we obtain, using $\Gamma(x+1) = x\Gamma(x)$,
$$\phi(t;\lambda) \ge e^{-\lambda}\min\Big\{ \frac{\lambda^{a-1}}{\Gamma(a)},\; \frac{\lambda^{a+1}}{\Gamma(a+2)} \Big\} = \phi(a;\lambda)\min\Big\{ \frac{a}{\lambda},\; \frac{\lambda}{a+1} \Big\}, \quad t\in[a-1,a+1].$$
Since $|x_\ell - \lambda_{s^*\ell}| \le 1$,
$$p_{0\ell}(x_\ell) \ge p_{0\ell}(\lambda_{s^*\ell})\min\Big\{ \frac{\lambda_{s^*\ell}}{\lambda_{0\ell}},\; \frac{\lambda_{0\ell}}{\lambda_{s^*\ell}+1} \Big\} \ge (2\omega)^{-1} p_{0\ell}(\lambda_{s^*\ell}),$$
where we have used, for any $s\in[0,1]$,
$$\frac{\lambda_{s\ell}}{\lambda_{0\ell}} = \Big( \frac{\lambda_{1\ell}}{\lambda_{0\ell}} \Big)^{s} \ge \frac{1}{\omega}, \qquad \frac{\lambda_{0\ell}}{\lambda_{s\ell}} \ge \frac{1}{\omega},$$
and $\lambda_{s^*\ell}/(\lambda_{s^*\ell}+1) \ge 1/2$ since $\lambda_{s^*\ell} \ge 1$. Similarly, $p_{1\ell}(x_\ell) \ge p_{1\ell}(\lambda_{s^*\ell})/(2\omega)$. Hence,
$$\min\{p_0(x),\, p_1(x)\} \ge \frac{\min\{p_0(\lambda_{s^*}),\, p_1(\lambda_{s^*})\}}{(2\omega)^L} = \frac{e^{-I_{s^*}}\, p_{s^*}(\lambda_{s^*})}{(2\omega)^L} \ge \frac{e^{-I_{s^*}}}{(2C_0\omega)^L}\prod_{\ell\in[L]}\lambda_{s^*\ell}^{-1/2},$$
where we have used $\min\{p_0(x),\, p_1(x)\} = e^{-I_s} p_s(x)\min\{l(x)^s,\, l(x)^{s-1}\}$ and $l(\lambda_{s^*}) = 1$. Thus,
$$P_{e,+} \ge \exp\Big( -I_{s^*} - L\log(2C_0\omega) - \frac{L}{2}\log\|\Lambda\|_\infty \Big) \ge \exp\Big( -I_{s^*} - L\log(2C_0\omega^{3/2}) - \frac{L}{2}\log\Lambda_{\min} \Big) \tag{130}$$
using the assumption $\|\Lambda\|_\infty \le \omega\Lambda_{\min}$. The proof is complete.

G.7. Auxiliary lemmas for Theorem 2

The following lemmas are used in the proof of the minimax Theorem 2.

Lemma 25. For a discrete probability distribution $\mathcal{L}$ on $\mathbb{Z}_+$, let us write $\mathrm{pmf}(x;\mathcal{L})$, $x\in\mathbb{Z}_+$, for the probability mass function of $\mathcal{L}$. There is a universal constant $c > 0$ such that, for $\omega > 1$,
$$\mathrm{pmf}\Big( x;\, \mathrm{Bin}\Big(n, \frac{\lambda}{n}\Big) \Big) \ge c\;\mathrm{pmf}\big( x;\, \mathrm{Poi}(\lambda) \big), \quad x \le 2\omega\lambda \wedge n/3. \tag{131}$$

Proof. Let $\tilde p$ be the pmf of the binomial distribution in (131). By Stirling's approximation,
$$\frac{n!}{(n-x)!} \ge \frac{\sqrt{2\pi n}\,(n/e)^n\, e^{1/(12n+1)}}{\sqrt{2\pi(n-x)}\,[(n-x)/e]^{n-x}\, e^{1/(12(n-x))}} \ge c_1\,\frac{n^n}{(n-x)^{n-x}}\, e^{-x},$$
where we have used $x \le 2n/3$ to bound the ratio of the correction factors by $c_1$ from below. Hence,
$$\tilde p(x) = \frac{n!}{x!(n-x)!}\Big( \frac{\lambda}{n} \Big)^x\Big( 1-\frac{\lambda}{n} \Big)^{n-x} \ge \frac{c_1\lambda^x}{x!}\, e^{-x}\Big( \frac{n-\lambda}{n-x} \Big)^{n-x}.$$
Using the inequality $1+t \ge (1-t^2)e^t$ for $|t| \le 1$, with $t = (x-\lambda)/(n-x)$,
$$\Big( \frac{n-\lambda}{n-x} \Big)^{n-x} = \Big( 1+\frac{x-\lambda}{n-x} \Big)^{n-x} \ge \Big[ 1-\Big( \frac{\lambda-x}{n-x} \Big)^2 \Big]^{n-x} e^{x-\lambda}.$$
Again, by $0 \le x \le 2\omega\lambda \wedge n/3$ and using $(1-u)^n \ge 1-nu$, we have
$$\Big[ 1-\Big( \frac{\lambda-x}{n-x} \Big)^2 \Big]^{n-x} \ge 1-\frac{(\lambda-x)^2}{n-x} \ge 1-\frac{(2\omega\lambda)^2}{n-x}.$$
It follows that $\tilde p(x) \ge c_2\,\lambda^x e^{-\lambda}/x!$, which is the desired result.

Lemma 26. Let $\tilde p_{k\ell}$ be the probability mass function of $\mathrm{Poi}(\lambda_{k\ell})$ and $p_{k\ell}$ that of $\mathrm{Bin}(n_\ell,\, \lambda_{k\ell}/n_\ell)$. Let $\tilde p_k = \bigotimes_{\ell=1}^L\tilde p_{k\ell}$ and $p_k = \bigotimes_{\ell=1}^L p_{k\ell}$, and similarly define $\tilde p_r$ and $p_r$. Assume that
$$\max(\lambda_{k\ell},\, \lambda_{r\ell}) \le \min\big( \omega\lambda_{k\ell},\, \omega\lambda_{r\ell},\, n_\ell/3 \big), \quad \ell\in[L],$$
for some $\omega > 1$. Then, there is a universal constant $C > 0$ such that the sum of the type I and type II errors of the likelihood ratio test for $p_k$ against $p_r$ satisfies
$$P_{e,+} \ge \exp\Big( -I_{s^*} - L\log(C\omega^{3/2}) - \frac{L}{2}\log\Lambda_{\min} \Big),$$
where $I_{s^*}$ is the information between $\lambda_{k\cdot}$ and $\lambda_{r\cdot}$ defined in (8) and $\Lambda_{\min} = \min_\ell\min(\lambda_{k\ell},\, \lambda_{r\ell})$.

Proof. For $x\in\mathbb{Z}_+^L$ satisfying $\|x-\lambda_{s^*}\|_\infty \le 1$, we have $x_\ell \le \lambda_{s^*\ell}+1 \le 2\omega\min(\lambda_{k\ell},\, \lambda_{r\ell})$. Then, by Lemma 25, $p_k(x) \ge c^L\,\tilde p_k(x)$ and $p_r(x) \ge c^L\,\tilde p_r(x)$. Therefore,
$$P_{e,+} \ge \max_{x\in\mathbb{Z}_+^L}\min\big( p_k(x),\, p_r(x) \big) \ge c^L\max_{x:\, \|x-\lambda_{s^*}\|_\infty\le 1}\min\big( \tilde p_k(x),\, \tilde p_r(x) \big) \ge c^L\exp\Big( -I_{s^*} - L\log(2C_0\omega^{3/2}) - \frac{L}{2}\log\Lambda_{\min} \Big) \ge \exp\Big( -I_{s^*} - L\log(C\omega^{3/2}) - \frac{L}{2}\log\Lambda_{\min} \Big),$$
where the third inequality is by (130), and $C$ and $C_0$ are positive universal constants. The proof is complete.

Appendix H. Extra Simulation Results

Here we present extra simulation results under the setup of Section 6. The following figure shows the overall NMI and the log misclassification rate for different values of $C$ and $\alpha$:

[Figure: overall NMI (left) and log misclassification rate (right) versus the number of nodes per cluster; parameters C = 0.50, 0.90, 1.00; K = 4, L = 6.]

The next figure illustrates the results for unbalanced cluster sizes. To be specific, $\pi(y) = (1,4,6,9)/20$ and $\pi(z) = (1,3,4,6,7,9)/30$, which implies $\beta \approx 3$ according to (A2):

[Figure: overall NMI (left) and log misclassification rate (right) versus the average number of nodes per cluster; parameters C = 1.00, 0.75, 3.00; K = 4, L = 6.]

We also consider the setting where the number of clusters on one side is significantly greater than that of the other. We let $K = 4$, $L = 12$, with connectivity structure given (up to the scaling used in Section 6) by
$$\begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 1 & 2 & 3 \\ 7 & 8 & 9 & 10 & 11 & 12 & 1 & 2 & 3 & 4 & 5 & 6 \\ 10 & 11 & 12 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \end{bmatrix}.$$
The simulation results are as follows:

[Figure: overall NMI (left) and log misclassification rate (right) versus the number of nodes per cluster; parameters C = 0.50, 0.75, 1.00; K = 4, L = 12.]
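For readers who want to reproduce a setup of this flavor, here is a minimal sketch (our own illustration; the cluster sizes, connectivity scaling, and use of scikit-learn's NMI are assumptions, not the authors' exact pipeline) that generates a bipartite SBM biadjacency matrix and scores a candidate row labeling:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
K, L, n, m = 4, 6, 400, 600

# Planted row/column labels with balanced clusters.
y = np.repeat(np.arange(K), n // K)
z = np.repeat(np.arange(L), m // L)

# A generic K x L connectivity matrix; entries O(1/m) give bounded mean degree.
P = rng.uniform(0.5, 1.5, size=(K, L)) * 20 / m

# Biadjacency matrix of the bipartite SBM: A_ij ~ Ber(P[y_i, z_j]).
A = rng.binomial(1, P[y][:, z])

# Score a candidate row labeling; here the truth with 5% of labels resampled.
y_hat = y.copy()
flip = rng.choice(n, size=n // 20, replace=False)
y_hat[flip] = rng.integers(0, K, size=flip.size)
print("row NMI:", normalized_mutual_info_score(y, y_hat))
# Direct mismatch rate upper-bounds the permutation-minimized Mis.
print("mismatch rate:", np.mean(y_hat != y))
```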
References

Emmanuel Abbe. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446-6531, 2017.

Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 670-688. IEEE, 2015.

Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471-487, 2016.

Naman Agarwal, Afonso S Bandeira, Konstantinos Koiliaris, and Alexandra Kolla. Multisection in the stochastic block model using semidefinite programming. In Compressed Sensing and its Applications, pages 125-162. Springer, 2017.

Arash A Amini, Aiyou Chen, Peter J Bickel, Elizaveta Levina, et al. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097-2122, 2013.

Arash A Amini, Elizaveta Levina, et al. On semidefinite relaxations for the block model. The Annals of Statistics, 46(1):149-179, 2018.

Afonso S Bandeira. Random Laplacian matrices and convex relaxations. Foundations of Computational Mathematics, pages 1-35, 2015.

Peter Bickel, David Choi, Xiangyu Chang, and Hai Zhang. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, pages 1922-1943, 2013.

Peter J Bickel and Aiyou Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068-21073, 2009.

Charles Bordenave, Marc Lelarge, and Laurent Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 1347-1357. IEEE, 2015.

Rainer E Burkard and Eranda Cela. Linear assignment problems and extensions. In Handbook of Combinatorial Optimization, pages 75-149. Springer, 1999.

Alain Celisse, Jean-Jacques Daudin, and Laurent Pierre. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electronic Journal of Statistics, 6:1847-1899, 2012.

Yizong Cheng and George M Church. Biclustering of expression data. In ISMB, volume 8, pages 93-103, 2000.

Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, pages 493-507, 1952.

I Chien, Chung-Yi Lin, and I-Hsiang Wang. Community detection in hypergraphs: Optimal statistical limit and efficient algorithms. In International Conference on Artificial Intelligence and Statistics, pages 871-879, 2018.

Peter Chin, Anup Rao, and Van Vu. Stochastic block model and community detection in sparse graphs: A spectral algorithm with optimal rate of recovery. In Conference on Learning Theory, pages 391-423, 2015.

Václav Chvátal. The tail of the hypergeometric distribution. Discrete Mathematics, 25(3):285-287, 1979.

Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2006.
Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.

Inderjit S Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269-274, 2001.

Inderjit S Dhillon. Information-theoretic co-clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 89-98, 2003.

Donniell E Fishkind, Daniel L Sussman, Minh Tang, Joshua T Vogelstein, and Carey E Priebe. Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM Journal on Matrix Analysis and Applications, 34(1):23-39, 2013.

Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics Reports, 659:1-44, 2016.

Chao Gao, Yu Lu, Zongming Ma, and Harrison H Zhou. Optimal estimation and completion of matrices with biclustering structures. Journal of Machine Learning Research, 17(161):1-29, 2016.

Chao Gao, Zongming Ma, Anderson Y Zhang, and Harrison H Zhou. Achieving optimal misclassification proportion in stochastic block models. The Journal of Machine Learning Research, 18(1):1980-2024, 2017.

Chao Gao, Zongming Ma, Anderson Y Zhang, Harrison H Zhou, et al. Community detection in degree-corrected block models. The Annals of Statistics, 46(5):2153-2185, 2018.

Evarist Giné and Richard Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models, volume 40. Cambridge University Press, 2015.

Olivier Guédon and Roman Vershynin. Community detection in sparse networks via Grothendieck's inequality. Probability Theory and Related Fields, 165(3-4):1025-1049, 2016.

Lennart Gulikers, Marc Lelarge, and Laurent Massoulié. A spectral method for community detection in moderately sparse degree-corrected stochastic block models. Advances in Applied Probability, 49(3):686-721, 2017.

Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory, 62(5):2788-2797, 2016a.

Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semidefinite programming: Extensions. IEEE Transactions on Information Theory, 62(10):5918-5937, 2016b.

John A Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123-129, 1972.

Joseph L Hodges and Lucien Le Cam. The Poisson approximation to the Poisson binomial distribution. The Annals of Mathematical Statistics, 31(3):737-740, 1960.

G. J. O. Jameson. A simple proof of Stirling's formula for the gamma function. The Mathematical Gazette, 99(544):68, 2015.

Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935-20940, 2013.

Daniel B Larremore, Aaron Clauset, and Abigail Z Jacobs. Efficiently inferring community structure in bipartite networks. Physical Review E, 90(1):012805, 2014.

Jing Lei, Alessandro Rinaldo, et al. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215-237, 2015.

Sara C Madeira, Miguel C Teixeira, Isabel Sa-Correia, and Arlindo L Oliveira. Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(1):153-165, 2010.
Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 694-703. ACM, 2014.

Andrea Montanari and Subhabrata Sen. Semidefinite programs on sparse random graphs and their application to community detection. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 814-827. ACM, 2016.

Elchanan Mossel, Joe Neeman, and Allan Sly. Consistency thresholds for the planted bisection model. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 69-75. ACM, 2015.

Krzysztof Nowicki and Tom A B Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077-1087, 2001.

Marianna Pensky, Teng Zhang, et al. Spectral clustering in the dynamic stochastic block model. Electronic Journal of Statistics, 13(1):678-709, 2019.

Amelia Perry and Alexander S Wein. A semidefinite program for unbalanced multisection in the stochastic block model. In Sampling Theory and Applications (SampTA), 2017 International Conference on, pages 64-67. IEEE, 2017.

Zahra S Razaee, Arash A Amini, and Jingyi Jessica Li. Matched bipartite block model with covariates. Journal of Machine Learning Research, 20(34):1-44, 2019.

Federico Ricci-Tersenghi, Adel Javanmard, and Andrea Montanari. Performance of a community detection algorithm based on semidefinite programming. In Journal of Physics: Conference Series, volume 699, page 012015. IOP Publishing, 2016.

Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, pages 1878-1915, 2011.

Karl Rohe, Tai Qin, and Bin Yu. Co-clustering for directed graphs: the stochastic co-blockmodel and spectral algorithm Di-Sim. arXiv preprint arXiv:1204.2296, 2012.

Amos Tanay, Roded Sharan, and Ron Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18(suppl 1):S136-S144, 2002.

Sergio Verdú. Asymptotic error probability of binary hypothesis testing for Poisson point-process observations (Corresp.). IEEE Transactions on Information Theory, 32(1):113-115, 1986.

Van Vu. A simple SVD algorithm for finding hidden partitions. Combinatorics, Probability and Computing, 27(1):124-140, 2018.

Jason Wyse, Nial Friel, and Pierre Latouche. Inferring structure in bipartite networks using the latent blockmodel and exact ICL. Network Science, 5(1):45-69, 2017.

Min Xu, Varun Jog, and Po-Ling Loh. Optimal rates for community estimation in the weighted stochastic block model. arXiv preprint arXiv:1706.01175, 2017.

Se-Young Yun and Alexandre Proutiere. Accurate community detection in the stochastic block model via spectral algorithms. arXiv preprint arXiv:1412.7335, 2014.

Anderson Y Zhang and Harrison H Zhou. Theoretical and computational guarantees of mean field variational inference for community detection. arXiv preprint arXiv:1710.11268, 2017.

Anderson Y Zhang, Harrison H Zhou, et al. Minimax rates of community detection in stochastic block models. The Annals of Statistics, 44(5):2252-2280, 2016.

Yunpeng Zhao, Elizaveta Levina, Ji Zhu, et al. Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics, 40(4):2266-2292, 2012.
Tao Zhou, Jie Ren, Matúš Medo, and Yi-Cheng Zhang. Bipartite network projection and personal recommendation. Physical Review E, 76(4):046115, 2007.

Zhixin Zhou and Arash A Amini. Analysis of spectral clustering algorithms for community detection: the general bipartite setting. Journal of Machine Learning Research, 20(47):1-47, 2019.

Zhixin Zhou and Ping Li. Non-asymptotic Chernoff lower bound and its application to community detection in the stochastic block model. arXiv preprint arXiv:1812.11269, 2018.