Published as a conference paper at ICLR 2025

FITTING NETWORKS WITH A CANCELLATION TRICK

Jiashun Jin, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, USA. jiashun@andrew.cmu.edu
Jingming Wang, Department of Statistics, University of Virginia, Charlottesville, VA 22903, USA. pdw9qv@virginia.edu

The degree-corrected block model (DCBM), latent space model (LSM), and β-model are all popular network models. We combine their modeling ideas and propose the logit-DCBM as a new model. Similar to the β-model and LSM, the logit-DCBM contains nonlinear factors, and fitting the parameters is a challenging open problem. We resolve this problem by introducing a cancellation trick. We also propose R-SCORE as a recursive community detection algorithm, where in each iteration, we first use the idea above to update our parameter estimates, and then use the results to remove the nonlinear factors in the logit-DCBM, so that the renormalized model approximately satisfies a low-rank model, just like the DCBM. Our numerical study suggests that R-SCORE significantly improves over existing spectral approaches in many cases. Also, theoretically, we show that the Hamming error rate of R-SCORE is faster than that of SCORE in a specific sparse region, and is at least as fast outside this region.

1 INTRODUCTION

Community detection is a problem of major interest in network analysis (e.g., see the survey paper by Goldenberg et al. (2010)). Consider an undirected network with $n$ nodes and $K$ communities $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K$ (a community is a group of nodes with similar behaviors). Let $A \in \mathbb{R}^{n,n}$ be the adjacency matrix, where $A_{ij} = 1$ if and only if there is an edge between nodes $i$ and $j$, $1 \le i \neq j \le n$. Conventionally, we do not count self edges, so $A_{ii} = 0$ for all $i$. As in many works on community detection (e.g., Chen et al. (2018); Zhao et al. (2012); Yuan et al. (2022a)), we assume that each node belongs to exactly one of the $K$ communities. For each $1 \le i \le n$, we encode the community label of node $i$ by a $K$-dimensional vector $\pi_i$ (which is unknown to us) such that

$$\pi_i = e_k \ \text{ if and only if node } i \in \mathcal{C}_k \quad (e_k: \text{ $k$-th standard Euclidean basis vector of } \mathbb{R}^K). \qquad (1)$$

The goal of community detection is to use $(A, K)$ to cluster all $n$ nodes into $K$ communities/groups.

The degree-corrected block model (DCBM) (Karrer & Newman, 2011) is a popular network model. Suppose we use a free parameter $\theta_i > 0$ to model the degree heterogeneity of node $i$, $1 \le i \le n$. For a non-negative matrix $P \in \mathbb{R}^{K,K}$, the DCBM assumes that the upper triangular entries of $A$ are independent Bernoulli variables satisfying

$$\mathbb{P}(A_{ij} = 1) = \theta_i\theta_j\pi_i'P\pi_j \iff \log(\mathbb{P}(A_{ij} = 1)) = \log(\theta_i) + \log(\theta_j) + \pi_i'Q\pi_j, \qquad (2)$$

where $Q$ is a $K \times K$ matrix such that $Q = \log(P)$ entry-wise. When all $\theta_i$ are equal, the DCBM reduces to the well-known Stochastic Block Model (SBM) (Holland et al., 1983). Note that since $0 \le \mathbb{P}(A_{ij}=1) \le 1$, the DCBM has implicitly imposed a set of constraints on its parameters:

$$\log(\theta_i) + \log(\theta_j) + \pi_i'Q\pi_j \le 0, \qquad 1 \le i, j \le n. \qquad (3)$$

These constraints make an already complicated setting even more complicated, so we desire to remove them if possible. Also, if the matrix $Q$ is positive definite, then $Q = U'U$ for a matrix $U \in \mathbb{R}^{K,K}$. In this special case, we can rewrite (2) as

$$\log(\mathbb{P}(A_{ij} = 1)) = \log(\theta_i) + \log(\theta_j) + z_i'z_j, \quad \text{where } z_i = U\pi_i, \ 1 \le i \le n. \qquad (4)$$
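To make (2)-(3) concrete, here is a minimal simulation sketch of the DCBM; the parameter values ($n$, $K$, $P$, and the range of the $\theta_i$) are illustrative choices of ours, picked so that constraint (3) holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 3
labels = rng.integers(0, K, size=n)            # community label of each node
Pi = np.eye(K)[labels]                         # membership matrix; row i is pi_i
theta = rng.uniform(0.05, 0.3, size=n)         # degree heterogeneity parameters
P = 0.3 * np.ones((K, K)) + 0.7 * np.eye(K)    # community connectivity matrix

# DCBM mean matrix (2): P(A_ij = 1) = theta_i * theta_j * pi_i' P pi_j
Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)
assert Omega.max() <= 1, "constraint (3) must be enforced by hand under the log-link"

U = rng.random((n, n)) < Omega                 # independent Bernoulli draws
A = np.triu(U, 1).astype(int)
A = A + A.T                                    # symmetric adjacency, no self edges
```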
The latent space model (LSM) (Hoff, 2005) and the β-model (Chatterjee et al., 2011) are also popular network models. Denote the logit function by $\mathrm{logit}(x) = \log(x/[1-x])$, $0 < x < 1$. In a representative form, for so-called latent positions $z_1, \ldots, z_n \in \mathbb{R}^K$, the LSM assumes

$$\mathrm{logit}(\mathbb{P}(A_{ij} = 1)) = \log(\theta_i) + \log(\theta_j) + z_i'z_j. \qquad (5)$$

Comparing (5) with (4), the only difference is that the log-link is replaced by the logit-link, so at least in the special case where $Q$ is positive definite, the two models are similar. Also, if we drop the $z_i'z_j$ term on the RHS, then (5) reduces to the β-model (where we only have one community, i.e., $K = 1$). However, despite the similarity, to many statisticians, (5) is highly preferred. The main reason is that, for binary data, the model recommended by textbooks is the logistic regression model (e.g., (Hastie et al., 2009, Section 4.4) and Dobson & Barnett (2018)), where the logit-link function is argued to be the most natural. Additionally, some popular Python packages, such as scikit-learn, frequently use the logit-link function. In fact, in the LSM case, since $\mathrm{logit}(\mathbb{P}(A_{ij} = 1))$ can take any value in $(-\infty, \infty)$, we do not have constraints such as (3) in the DCBM case.

To combine the modeling ideas of all three models, we propose the logit-DCBM, where we assume

$$\mathrm{logit}(\mathbb{P}(A_{ij} = 1)) = \log(\theta_i) + \log(\theta_j) + \pi_i'Q\pi_j, \quad \text{with } Q = \log(P) \text{ entry-wise as above.} \qquad (6)$$

Since we use the logit-link function, we do not need the constraints (3) of the DCBM case. Also, we can view (6) as an extension of (2). Moreover, since we do not require $Q$ to be positive definite in (6), (6) also extends (5) to a broader setting. Last, (6) reduces to the β-model if we let $Q = 0$. In summary, we propose the logit-DCBM as a nonlinear variant of the DCBM, in the hope that it is more broadly acceptable, especially for researchers with a strong preference for nonlinear network models (such as the LSM) and for logistic-regression-type models for binary data.

We now rewrite the logit-DCBM in matrix form. Note that under the model, $\mathbb{P}(A_{ij} = 1) = N_{ij}\theta_i\theta_j\pi_i'P\pi_j$, where $N_{ij} = [1 + \theta_i\theta_j\pi_i'P\pi_j]^{-1}$ is a nonlinear term. Let $N = (N_{ij})$, $\Theta = \mathrm{diag}(\theta_1, \ldots, \theta_n) \in \mathbb{R}^{n,n}$, and $\Pi = [\pi_1, \ldots, \pi_n]'$. For any matrix $\Omega \in \mathbb{R}^{n,n}$, let $\mathrm{diag}(\Omega) \in \mathbb{R}^{n,n}$ be the diagonal matrix whose $k$-th diagonal entry is $\Omega_{kk}$. Let $W \in \mathbb{R}^{n,n}$ be the matrix where $W_{ij} = A_{ij} - \mathbb{E}[A_{ij}]$ if $i \neq j$ and $W_{ij} = 0$ otherwise. Let $\circ$ denote the Hadamard (or entry-wise) product (Horn & Johnson, 1985). Under the logit-DCBM model (6),

$$A = \Omega - \mathrm{diag}(\Omega) + W, \quad \text{with } \Omega = N \circ \tilde\Omega \text{ and } \tilde\Omega = \Theta\Pi P\Pi'\Theta. \qquad (7)$$

Note that $\mathrm{rank}(\tilde\Omega) = K$, but due to the matrix of nonlinear factors $N$, $\mathrm{rank}(\Omega)$ may be much larger than $K$. For this reason, (7) is not a low-rank model in general.

Remark 1. Since $N_{ij} \approx 1$ when $\theta_i\theta_j\pi_i'P\pi_j \approx 0$, one may think that the DCBM and logit-DCBM are close to each other. This is not true. First, $\theta_i\theta_j\pi_i'P\pi_j$ is not necessarily small for all $i, j$. Second, even if $\tilde\Omega$ and $N \circ \tilde\Omega$ are close in each entry, their spectra and norms can be very different.
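The following sketch, with illustrative parameters of our choosing, makes the point of Remark 1 and of (7) concrete: $\tilde\Omega$ has rank exactly $K$, while $\Omega = N \circ \tilde\Omega$ has non-negligible singular values beyond the $K$-th.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 500, 3
labels = rng.integers(0, K, size=n)
Pi = np.eye(K)[labels]
theta = rng.uniform(0.2, 1.0, size=n)
P = 0.4 * np.ones((K, K)) + 0.6 * np.eye(K)

Omega_tilde = np.outer(theta, theta) * (Pi @ P @ Pi.T)   # rank-K matrix
N = 1.0 / (1.0 + Omega_tilde)                            # nonlinear factors N_ij
Omega = N * Omega_tilde                                  # mean matrix under (7)

s_tilde = np.linalg.svd(Omega_tilde, compute_uv=False)
s = np.linalg.svd(Omega, compute_uv=False)
print(s_tilde[K - 1: K + 1])   # K-th singular value positive, (K+1)-th ~ 0
print(s[K - 1: K + 1])         # for Omega, the (K+1)-th is no longer ~ 0
```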
Literature review and our contribution. The logit-DCBM (and all the other models mentioned above) are so-called latent variable models, where $\Pi$ is the matrix of latent variables. For latent variable models, the EM algorithm (e.g., Dempster et al. (1977)) is a well-known approach. However, the EM algorithm is computationally expensive, lacks theoretical guarantees in high-dimensional settings such as ours, and does not perform well when the networks are sparse. For network data, penalization approaches are popular, and in the DCBM setting, there are many interesting works (e.g., Chen et al. (2018); Zhao et al. (2012)). However, since the DCBM is a latent variable model with many unknown parameters, these methods usually involve a non-convex optimization, where a good initialization is crucial. Also, penalization approaches are usually computationally slow and hard to analyze. We can extend these approaches to the LSM (Ma et al., 2020) and the logit-DCBM, but due to the nonlinearity in the LSM and logit-DCBM, these issues persist.

For these reasons, spectral approaches for network data are especially appealing. Compared with the EM algorithm and penalization approaches, spectral approaches are conceptually simpler, computationally faster, and also easier (at least for the DCBM) to analyze. In the classical spectral approach, we cluster by applying k-means to the $n$ rows of the matrix $\hat\Xi = [\hat\xi_1, \ldots, \hat\xi_K]$, where $\hat\xi_k$ is the $k$-th eigenvector of $A$. However, due to the frequently observed phenomenon of severe degree heterogeneity in network data, such an approach often performs poorly. To fix the problem, Jin (2015) proposed SCORE as a new spectral approach. In the DCBM setting, SCORE was shown to have fast convergence rates (e.g., Jin (2015); Jin et al. (2021b)). Also, in a survey paper (Ke & Jin, 2023), SCORE was compared with many algorithms on many real networks, where it was shown to be competitive in real-data performance. Motivated by these results, we wish to extend SCORE to our setting. The challenge is that the success of SCORE critically depends on the fact that the DCBM is a low-rank model, but unfortunately, the logit-DCBM is not a low-rank model (see above).

We adapt SCORE to our setting by proposing the Recursive-SCORE (R-SCORE): we initialize with a possibly crude estimate of $\Pi$ (denoted by $\hat\Pi$), and then use $A$ and $\hat\Pi$ to estimate $N$ (denoted by $\hat N$). We then update $A$ by $A \oslash \hat N$ ($\oslash$ denotes entry-wise division) and repeat the above process a number of times. The main idea here is that, if $\hat N \approx N$, then $A \oslash \hat N \approx A \oslash N = \tilde\Omega - \mathrm{diag}(\tilde\Omega) + W \oslash N$, where the RHS is a low-rank model plus noise (recall that $\mathrm{rank}(\tilde\Omega) = K$); see the oracle sketch below.

The challenge is that how to estimate $N$ is a difficult problem, even if $\Pi$ is known. In fact, when $\Pi$ is known, we can restrict the network to each of the $K$ communities, where within each community, the logit-DCBM reduces to the β-model (which is a symmetric version of the $p_1$ model (Holland & Leinhardt, 1981)). How to estimate $N$ in the β-model is a well-known open problem, as explained in the survey paper (Goldenberg et al., 2010) (see also Rinaldo et al. (2010)): "A major problem with the $p_1$ and related models, recognized by Holland and Leinhardt, is the lack of standard asymptotics, ..., we have no consistency results for the maximum likelihood estimates".

We tackle this with a cancellation trick. We construct two types of cycles. For each type, the expected cycle count is a big sum of many terms, where, due to the matrix of nonlinear factors $N$, we cannot derive a simple expression. Fortunately, in the ratio of the two big sums, the nonlinear factors in one big sum cancel with those in the other, and the ratio has a simple, closed-form expression. Therefore, if $\Pi$ is known, this idea gives rise to a simple and convenient way to estimate $N$. Note that this also solves the open problem for the β-model mentioned above. In our case, $\Pi$ is unknown, but we can first obtain a possibly crude estimate $\hat\Pi$, and then use $\hat\Pi$ and the idea above to obtain an estimate $\hat N$ of $N$. We can then repeat the two steps as in R-SCORE.
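The sketch below illustrates the renormalization idea in an oracle setting where $N$ is known (all parameters are illustrative choices of ours): after the entry-wise division $A \oslash N$, the $(K+1)$-th singular value drops sharply relative to the $K$-th, consistent with the RHS above being a rank-$K$ matrix plus noise.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 1000, 3
labels = rng.integers(0, K, size=n)
Pi = np.eye(K)[labels]
theta = rng.uniform(0.3, 1.5, size=n)
P = 0.4 * np.ones((K, K)) + 0.6 * np.eye(K)

Omega_tilde = np.outer(theta, theta) * (Pi @ P @ Pi.T)
N = 1.0 / (1.0 + Omega_tilde)
Omega = N * Omega_tilde                          # logit-DCBM mean matrix

A = (rng.random((n, n)) < Omega).astype(float)
A = np.triu(A, 1); A = A + A.T                   # observed adjacency matrix

for name, M in [("raw A", A), ("A / N (oracle)", A / N)]:
    s = np.linalg.svd(M, compute_uv=False)
    print(name, "sigma_{K+1}/sigma_K =", round(s[K] / s[K - 1], 3))
```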
Remark 2. As many recent procedures rely on a low-rank network model, the above idea is not only useful for adapting SCORE to our setting, but is also helpful in adapting other ideas (e.g., those on global testing (Jin et al., 2021a) and on estimating $K$ (Jin et al., 2023)) to our setting.

It remains to analyze SCORE and R-SCORE for the logit-DCBM model. Note that while the Hamming clustering error of SCORE was analyzed before (e.g., Jin (2015); Jin et al. (2021b)), the focus was on the simpler DCBM model, where the analysis critically depends on the fact that the DCBM is a low-rank model. Unfortunately, the logit-DCBM is not a low-rank model, so it is unclear how to extend the results in Jin (2015); Jin et al. (2021b) to our setting. Note also that R-SCORE is a new algorithm, and has never been analyzed before.

For any community detection procedure, we measure the performance by the Hamming clustering error. In the logit-DCBM model, we can always write $A = \tilde\Omega + (N - 1_n1_n') \circ \tilde\Omega - \mathrm{diag}(\Omega) + W$, where $\tilde\Omega$ is a low-rank matrix, and $(N - 1_n1_n') \circ \tilde\Omega$ can be viewed as a nonlinear perturbation of $\tilde\Omega$. We show that the Hamming error rate of SCORE is upper bounded by $C[\lambda_1(\tilde\Omega) + \|(N - 1_n1_n') \circ \tilde\Omega\|^2]/\lambda_K^2(\tilde\Omega)$. This is the first time that a bound for the Hamming clustering error of SCORE has been derived for a nonlinear network model. Compared with existing works on the DCBM (e.g., Jin et al. (2021a; 2023)), the analysis is quite different.

The Hamming error of R-SCORE is much harder to analyze, for many reasons. First, R-SCORE is a recursive algorithm, where each step depends on the previous one. Second, $\hat N$ (the estimate of $N$) is a complicated nonlinear function of $A$, which depends not only on the clustering errors in the previous step, but also on the cycle-count step above (where the analysis is non-standard). Fortunately, we manage to derive an upper bound for the Hamming error rate of R-SCORE. To save space, we consider a special case here, leaving more general cases to Section 3.1. Consider the special case where for all $1 \le i \le n$, $c_0 n^{-\beta} \le \theta_i \le c_1 n^{-\beta}$, where $\beta \in (0, 1/2)$ and $c_1 > c_0 > 0$ are constants. In this case, the Hamming error rates of SCORE and R-SCORE are upper bounded by $Cn^{-a_0(\beta)}$ and $Cn^{-a_1(\beta)}$, respectively, where

$$a_0(\beta) = \min\{(1 - 2\beta),\ 4\beta\}, \qquad a_1(\beta) = \min\{(1 - 2\beta),\ 6\beta\},$$

and $a_1(\beta) > a_0(\beta)$ when $\beta < 1/6$. Therefore, when $0 < \beta < 1/6$, the rate of R-SCORE is faster than that of SCORE, and the two rates are the same when $1/6 < \beta < 1/2$ (the interesting range of $\beta$ is $0 < \beta < 1/2$; when $\beta > 1/2$, the signal-to-noise ratio is so low that no procedure could succeed); see the short worked check after the contribution list below.

We have the following contributions: (a) we propose the logit-DCBM as an extension of the LSM, β-model, and DCBM, and as a more appealing network model; (b) we introduce a cancellation trick and use it to solve an open problem for the β-model, as well as an open problem for the logit-DCBM in the idealized case where $\Pi$ is known; (c) we propose R-SCORE as a recursive approach to community detection with the logit-DCBM; (d) for the first time, we derive upper bounds for the Hamming error rates of SCORE and R-SCORE for the logit-DCBM (which is a nonlinear network model); and (e) we show that the rate of R-SCORE is faster than that of SCORE in a specific sparse region, and is at least as fast outside the region.
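As a quick worked check of these exponents (a numerical restatement of the display above, not a new result):

```python
import numpy as np

beta = np.linspace(0.01, 0.49, 481)
a0 = np.minimum(1 - 2 * beta, 4 * beta)   # SCORE exponent a0(beta)
a1 = np.minimum(1 - 2 * beta, 6 * beta)   # R-SCORE exponent a1(beta)
gain = beta[a1 > a0]
print(gain.min(), gain.max())             # R-SCORE strictly faster for beta < 1/6
```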
In summary, we propose the logit-DCBM as a nonlinear variant of the DCBM, so that it will be more broadly accepted. The nonlinear factors make the logit-DCBM harder to fit, but with a cancellation trick, we can approximately convert the model back to a DCBM, so we can continue to enjoy all the nice properties the DCBM has. We also propose R-SCORE as a fast spectral approach whose error rate is faster than that of applying SCORE directly to the logit-DCBM.

Content and notation. Section 2 introduces the cancellation trick. Section 3 introduces the R-SCORE algorithm and its theoretical analysis. Section 4 contains a numerical study. Section 5 discusses connections to other problems. In this paper, $\circ$ and $\oslash$ denote entry-wise product and division, respectively. For $1 \le k \le K$, we use $e_k$ to denote the $k$-th standard basis vector of $\mathbb{R}^K$. For any $n \ge 2$, $I_n$ denotes the $n \times n$ identity matrix and $1_n \in \mathbb{R}^n$ denotes the vector of all ones. For any two sequences of non-negative numbers $\{a_n\}$ and $\{b_n\}$, we write $a_n \gg b_n$ if $b_n/a_n = o(1)$ (similarly for $a_n \ll b_n$), and we write $a_n \asymp b_n$ if $c_0 b_n \le a_n \le c_1 b_n$ for some constants $c_1 > c_0 > 0$. We use $C$ to stand for a generic constant, which may vary from one occurrence to another.

2 AN IDEA FOR CANCELLING NONLINEAR TERMS IN BIG SUMS

We introduce the cancellation trick by considering two seemingly new problems. Although this may seem a digression from our original purposes, the two problems are interesting in their own right, and provide the foundation for the refitting step of R-SCORE below.

Consider the first problem. Suppose we have a matrix $A \in \mathbb{R}^{n_1,n_2}$ with independent Bernoulli entries, where $\Omega_{ij} = \mathbb{P}(A_{ij} = 1) = x_0 N_{ij}\theta_i\theta_j$, $x_0 > 0$, $\theta_i > 0$, with $N_{ij} = [1 + x_0\theta_i\theta_j]^{-1}$. Here, the $\theta_i$ are known but $x_0$ is not, and the interest is to estimate $x_0$. We may estimate $x_0$ by the maximum likelihood estimate (MLE), but it does not have a closed form and may be computationally slow, so we desire a new approach.

Lemma 2.1 We have $(I) = x_0 (II)$, where $(I) = \sum_{i,j} \Omega_{ij}$ and $(II) = \sum_{i,j} \theta_i\theta_j(1 - \Omega_{ij})$.

Proof. As $(I) = x_0 \sum_{i,j} N_{ij}\theta_i\theta_j$ and $(II) = \sum_{i,j} N_{ij}\theta_i\theta_j$, the claim follows.

The key is that, due to the nonlinear terms $N_{ij}$, it is hard to derive a closed-form formula for $(I)$ or $(II)$, but by our careful design, the ratio $(I)/(II)$ has a very simple form. Now, to estimate $x_0$, let $\psi_n^{(1)} = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} A_{ij}$ and $\psi_n^{(2)} = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \theta_i\theta_j(1 - A_{ij})$. By Lemma 2.1, $x_0 = \mathbb{E}[\psi_n^{(1)}]/\mathbb{E}[\psi_n^{(2)}]$, so a convenient estimate of $x_0$ is (note: the computational cost is $O(n_1n_2)$):

$$\hat x_0 = \psi_n^{(1)}/\psi_n^{(2)} = \Big[\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} A_{ij}\Big]\Big/\Big[\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \theta_i\theta_j(1 - A_{ij})\Big]. \qquad (8)$$
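A minimal simulation check of (8) is below. For the rectangular $A$, we assume (for illustration) separate known heterogeneity parameters for rows and columns; the specific values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, x0 = 400, 300, 0.7
theta_row = rng.uniform(0.3, 1.0, n1)          # known heterogeneity parameters
theta_col = rng.uniform(0.3, 1.0, n2)
tt = np.outer(theta_row, theta_col)

Omega = x0 * tt / (1.0 + x0 * tt)              # Omega_ij = x0 * N_ij * theta_i * theta_j
A = (rng.random((n1, n2)) < Omega).astype(float)

x0_hat = A.sum() / (tt * (1.0 - A)).sum()      # estimator (8), computed in O(n1 n2) time
print(x0_hat)                                  # should be close to x0 = 0.7
```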
Consider the second problem. Suppose we have a network adjacency matrix $A \in \mathbb{R}^{n_1,n_1}$ satisfying the β-model. That is, the upper triangle of $A$ consists of independent Bernoulli variables satisfying $\Omega_{ij} \equiv \mathbb{P}(A_{ij} = 1) = N_{ij}\theta_i\theta_j$, where $N_{ij} = [1 + \theta_i\theta_j]^{-1}$, $1 \le i \neq j \le n_1$. The parameters $\theta_i > 0$ are unknown and the interest is to estimate them. Due to the nonlinear terms $N_{ij}$, the problem remains a difficult open problem in the literature, where classical approaches such as the MLE face grand challenges (e.g., Goldenberg et al. (2010); Karwa & Slavković (2016); Rinaldo et al. (2010)). We propose a new approach, motivated by the following lemma. For any $1 \le i \le n_1$, let $S_i = \{1, 2, \ldots, n_1\} \setminus \{i\}$.

Lemma 2.2 Fix an odd number $m \ge 3$. We have ("dist" below stands for "distinct")

$$\frac{\sum_{i_2,\ldots,i_m \in S_{i_1}(\mathrm{dist})} \Omega_{i_1i_2}(1 - \Omega_{i_2i_3}) \cdots \Omega_{i_{m-2}i_{m-1}}(1 - \Omega_{i_{m-1}i_m})\Omega_{i_mi_1}}{\sum_{i_2,\ldots,i_m \in S_{i_1}(\mathrm{dist})} (1 - \Omega_{i_1i_2})\Omega_{i_2i_3} \cdots (1 - \Omega_{i_{m-2}i_{m-1}})\Omega_{i_{m-1}i_m}(1 - \Omega_{i_mi_1})} = \theta_{i_1}^2. \qquad (9)$$

Proof. Note that
$$\Omega_{i_1i_2}(1 - \Omega_{i_2i_3}) \cdots \Omega_{i_{m-2}i_{m-1}}(1 - \Omega_{i_{m-1}i_m})\Omega_{i_mi_1} = N_{i_1i_2}N_{i_2i_3} \cdots N_{i_mi_1}\,\theta_{i_1}^2\theta_{i_2} \cdots \theta_{i_m},$$
$$(1 - \Omega_{i_1i_2})\Omega_{i_2i_3} \cdots (1 - \Omega_{i_{m-2}i_{m-1}})\Omega_{i_{m-1}i_m}(1 - \Omega_{i_mi_1}) = N_{i_1i_2}N_{i_2i_3} \cdots N_{i_mi_1}\,\theta_{i_2} \cdots \theta_{i_m}.$$
Comparing the two right-hand sides, the only difference is the factor $\theta_{i_1}^2$. Since in both the numerator and the denominator of (9), the sum is only over $i_2, i_3, \ldots, i_m$ with $i_1$ fixed, the claim follows.

Similarly, due to the nonlinear terms $N_{ij}$, it is hard to derive a closed-form formula for either the numerator or the denominator of (9), but by our design, the ratio in (9) has a very simple form. Let
$$\phi_{n,m}^{(1)}(i_1) = \sum_{i_2,\ldots,i_m \in S_{i_1}(\mathrm{dist})} A_{i_1i_2}(1 - A_{i_2i_3}) \cdots A_{i_{m-2}i_{m-1}}(1 - A_{i_{m-1}i_m})A_{i_mi_1}, \quad \text{and}$$
$$\phi_{n,m}^{(2)}(i_1) = \sum_{i_2,\ldots,i_m \in S_{i_1}(\mathrm{dist})} (1 - A_{i_1i_2})A_{i_2i_3} \cdots (1 - A_{i_{m-2}i_{m-1}})A_{i_{m-1}i_m}(1 - A_{i_mi_1}).$$
By Lemma 2.2,
$$\frac{\mathbb{E}[\phi_{n,m}^{(1)}(i_1)]}{\mathbb{E}[\phi_{n,m}^{(2)}(i_1)]} = \frac{\sum_{i_2,\ldots,i_m \in S_{i_1}} \Omega_{i_1i_2}(1 - \Omega_{i_2i_3}) \cdots (1 - \Omega_{i_{m-1}i_m})\Omega_{i_mi_1}}{\sum_{i_2,\ldots,i_m \in S_{i_1}} (1 - \Omega_{i_1i_2})\Omega_{i_2i_3} \cdots \Omega_{i_{m-1}i_m}(1 - \Omega_{i_mi_1})} = \theta_{i_1}^2.$$
Therefore, a reasonable estimator of $\theta_{i_1}$ is $\hat\theta_{i_1} = \sqrt{\phi_{n,m}^{(1)}(i_1)/\phi_{n,m}^{(2)}(i_1)}$. This solves the open problem mentioned above (see also Section 3). In particular, we may take $m = 3$ and estimate $\theta_i^2$ by

$$\phi_{n,3}^{(1)}(i)/\phi_{n,3}^{(2)}(i) = \frac{\sum_{j,k \in S_i,\, j \neq k} A_{ij}(1 - A_{jk})A_{ki}}{\sum_{j,k \in S_i,\, j \neq k} (1 - A_{ij})A_{jk}(1 - A_{ki})}. \qquad (10)$$

Alternatively, we may use a larger $m$, but the numerical performance is similar, while the analysis is much longer. For each fixed $m$, the computational cost is $O(n^2d)$ (e.g., Jin et al. (2021a)), where $d$ is the maximum node degree.

Remark 3. Lemma 2.2 is readily extendable. For example, if there are positive functions $g$ and $h$ such that $g(\Omega_{ij}) = \theta_i\theta_j\pi_i'P\pi_j\, h(\Omega_{ij})$ for all $i, j$, then similarly

$$\frac{\sum_{i_2,\ldots,i_m \in S_{i_1}(\mathrm{dist})} g(\Omega_{i_1i_2})h(\Omega_{i_2i_3}) \cdots g(\Omega_{i_{m-2}i_{m-1}})h(\Omega_{i_{m-1}i_m})g(\Omega_{i_mi_1})}{\sum_{i_2,\ldots,i_m \in S_{i_1}(\mathrm{dist})} h(\Omega_{i_1i_2})g(\Omega_{i_2i_3}) \cdots h(\Omega_{i_{m-2}i_{m-1}})g(\Omega_{i_{m-1}i_m})h(\Omega_{i_mi_1})} = \theta_{i_1}^2.$$

In summary, the two problems above (especially the second one) are difficult. In these problems, the quantities of interest are hidden in some big sums. Due to the nonlinear factors $N_{ij}$, it is hard to derive a closed-form formula for such big sums. However, if we carefully construct two big sums, then we can cancel the nonlinear terms $N_{ij}$ by considering the ratio of the two big sums, and derive a closed-form formula for the quantity of interest. Such a cancellation trick gives rise to a convenient approach to solving the two problems above, and is readily extendable to many other settings (e.g., analysis of the $p_1$ model for directed networks (Holland & Leinhardt, 1981), analysis of tensors and hypergraphs (Yuan et al., 2022b)). Below in Section 3, we introduce R-SCORE as a recursive algorithm, where the ideas above play a key role in the refitting steps of R-SCORE. For space reasons, we defer the analysis of the above idea (i.e., $\hat x_0$ and $\hat\theta_i$) to the supplement; see Sections C.2-C.3 of the supplement for details.
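Below is a minimal sketch of the $m = 3$ estimator (10) for the β-model. For a binary adjacency matrix with zero diagonal, both 3-cycle sums can be read off from diagonals of matrix products; the diagonal corrections that handle the excluded $j = k$ terms are our own vectorization convenience, not from the paper.

```python
import numpy as np

def theta_hat_cycles(A):
    """Estimate theta in the beta-model via the ratio of 3-cycle counts, per (10)."""
    d = A.sum(axis=1)
    # sum_{j != k} A_ij (1 - A_jk) A_ki: the full product also includes j = k terms,
    # which contribute A_ij^2 = A_ij for binary A with zero diagonal, hence subtract d.
    num = np.diag(A @ (1 - A) @ A) - d
    # sum_{j != k} (1 - A_ij) A_jk (1 - A_ki): the j = i, k = i, and j = k terms all
    # vanish for binary A with zero diagonal, so no correction is needed.
    den = np.diag((1 - A) @ A @ (1 - A))
    return np.sqrt(num / den)                   # theta_hat_i = sqrt(phi1 / phi2)

rng = np.random.default_rng(4)
n = 800
theta = rng.uniform(0.3, 1.2, n)
tt = np.outer(theta, theta)
Omega = tt / (1 + tt)                           # beta-model: Omega_ij = N_ij theta_i theta_j
A = (rng.random((n, n)) < Omega).astype(float)
A = np.triu(A, 1); A = A + A.T

theta_hat = theta_hat_cycles(A)
print(np.corrcoef(theta, theta_hat)[0, 1])      # close to 1 for a network of this size
```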
3 COMMUNITY DETECTION BY R-SCORE FOR THE LOGIT-DCBM

We propose Recursive-SCORE (R-SCORE) for community detection, where the key is to use the ideas above in the refitting step; see Algorithm 1. The number of iterations is not critical, so we set $M = 10$ (R-SCORE typically converges in very few iterations). In each iteration, R-SCORE consists of a community detection step by SCORE (the SCORE step) and a refitting step. We choose SCORE because it is fast, competitive in real data analysis, and equipped with fast error rates (e.g., Jin (2015); Jin et al. (2021b)), but we can also view our algorithm as a generic algorithm, where we can replace SCORE by any other community detection approach that is provably effective for the DCBM. We now discuss the SCORE step and the refitting step of Algorithm 1 in detail.

Consider the SCORE step (Jin, 2015) first. In this step, for an input matrix $A$ or $A \oslash \hat N$, let $\hat\xi_1, \ldots, \hat\xi_K$ be the first $K$ eigenvectors, and let $\hat R = [\hat\xi_2/\hat\xi_1, \ldots, \hat\xi_K/\hat\xi_1]$, where $\xi/\eta$ denotes the vector of entry-wise ratios. We cluster by applying k-means to the $n$ rows of $\hat R$, and let $\hat\pi_i$ be the estimated community label of node $i$. Let $\hat\Pi = [\hat\pi_1, \ldots, \hat\pi_n]'$. Note that $\hat\pi_i$ takes values in $\{e_1, e_2, \ldots, e_K\}$ ($e_k$: $k$-th standard basis vector of $\mathbb{R}^K$).

Algorithm 1 The Recursive SCORE (R-SCORE)
Input: $A$ and $K$. Initialize with an estimate $\hat\Pi$ by SCORE. For $m = 1, 2, \ldots, M$:
  Refitting. Update $\hat N$ using $A$, the most recent $\hat\Pi$, and the refitting step below.
  SCORE. Update $\hat\Pi$ by applying SCORE to $A \oslash \hat N$ with the most recent $\hat N$.
Output: $\hat\Pi = [\hat\pi_1, \ldots, \hat\pi_n]'$.

Consider the refitting step. Let $\hat\Pi = [\hat\pi_1, \ldots, \hat\pi_n]'$ be the estimate of $\Pi$ in the current iteration. Recall that even in the idealized case of $\hat\Pi = \Pi$, refitting (i.e., how to estimate $N$) is a difficult and open problem. We tackle this with the idea in Section 2. In detail, let $\hat{\mathcal{C}}_k = \{1 \le i \le n : \hat\pi_i = e_k\}$ be the $k$-th estimated community, $1 \le k \le K$, and let $\hat S_{k,i} = \hat{\mathcal{C}}_k \setminus \{i\}$. By Lemma 2.2 and especially (10), we propose to estimate $\theta_i^2$ by

$$\hat\theta_i^2 = \Big(\sum_{j \neq \ell \in \hat S_{k,i}} A_{ij}(1 - A_{j\ell})A_{\ell i}\Big)\Big/\Big(\sum_{j \neq \ell \in \hat S_{k,i}} (1 - A_{ij})A_{j\ell}(1 - A_{\ell i})\Big), \quad \text{if } i \in \hat{\mathcal{C}}_k. \qquad (11)$$

This corresponds to the case $m = 3$ of our idea in Section 2, but we can also use a larger $m$. Also, inspired by Lemma 2.1, we can estimate the matrix $P$ by

$$\hat P_{k\ell} = \Big[\sum_{i \in \hat{\mathcal{C}}_k}\sum_{j \in \hat{\mathcal{C}}_\ell} A_{ij}\Big]\Big/\Big[\sum_{i \in \hat{\mathcal{C}}_k}\sum_{j \in \hat{\mathcal{C}}_\ell} \hat\theta_i\hat\theta_j(1 - A_{ij})\Big], \qquad 1 \le k, \ell \le K. \qquad (12)$$

To appreciate the idea, consider the sub-matrix of $A$ obtained by restricting the rows and columns to $\hat{\mathcal{C}}_k$ and $\hat{\mathcal{C}}_\ell$. In the idealized case where $\hat\theta_i = \theta_i$ and $\hat{\mathcal{C}}_k = \mathcal{C}_k$, $1 \le i \le n$, $1 \le k \le K$, the mean of the sub-matrix satisfies the condition of Lemma 2.1 of Section 2. This gives rise to the estimates above. Finally, we update our estimate of $N$ by letting $\hat N_{ij} = (1 + \hat\theta_i\hat\theta_j\hat P_{k\ell})^{-1}$ if $i \in \hat{\mathcal{C}}_k$ and $j \in \hat{\mathcal{C}}_\ell$. Note that by the discussion in Section 2, the computational cost of the refitting step is no more than $O(n^2d)$, where $d$ is the maximum node degree. As a result, the computational cost of R-SCORE is no more than $O(n^2d)$.

Remark 4. One may want to replace the refitting step by a simpler step, but how to do so remains unclear. Recall that even when $\Pi$ is known, how to estimate $(\Theta, P)$ is a challenging problem (e.g., Goldenberg et al. (2010); Karwa & Slavković (2016); Rinaldo et al. (2010)).
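To tie Algorithm 1 together, here is a compact, self-contained sketch of R-SCORE (the SCORE step plus the refitting step via (11)-(12)). It is a minimal illustration of the recipe described above, not the authors' reference implementation; the eigenvector-ratio truncation threshold and the small-denominator guards are pragmatic choices of ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def score_step(M, K):
    """SCORE: k-means on entry-wise ratios of the leading K eigenvectors."""
    vals, vecs = np.linalg.eigh((M + M.T) / 2)
    xi = vecs[:, np.argsort(-np.abs(vals))[:K]]        # K leading eigenvectors
    lead = np.where(np.abs(xi[:, 0]) < 1e-12, 1e-12, xi[:, 0])
    R = xi[:, 1:] / lead[:, None]                      # the ratio matrix R_hat
    T = np.log(M.shape[0])
    R = np.clip(R, -T, T)                              # guard against blow-ups
    return KMeans(n_clusters=K, n_init=10).fit_predict(R)

def refit_N(A, labels, K):
    """Refitting step: theta via (11), P via (12), then N_hat."""
    n = A.shape[0]
    theta2 = np.zeros(n)
    for k in range(K):
        idx = np.where(labels == k)[0]
        Ak = A[np.ix_(idx, idx)]                       # restrict to community k
        num = np.diag(Ak @ (1 - Ak) @ Ak) - Ak.sum(axis=1)
        den = np.diag((1 - Ak) @ Ak @ (1 - Ak))
        theta2[idx] = num / np.maximum(den, 1.0)
    th = np.sqrt(np.clip(theta2, 1e-12, None))
    P_hat = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            blk = A[np.ix_(labels == k, labels == l)]
            tt = np.outer(th[labels == k], th[labels == l])
            P_hat[k, l] = blk.sum() / max((tt * (1 - blk)).sum(), 1e-12)
    return 1.0 / (1.0 + np.outer(th, th) * P_hat[np.ix_(labels, labels)])

def r_score(A, K, M=10):
    labels = score_step(A, K)                          # initialize by SCORE
    for _ in range(M):
        N_hat = refit_N(A, labels, K)                  # refitting step
        labels = score_step(A / N_hat, K)              # SCORE on A (entry-wise /) N_hat
    return labels
```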
3.1 THEORETICAL RESULTS

For any community detection procedure, let $\hat\Pi$ be the resulting estimate of $\Pi$, where each row of $\hat\Pi$ takes values in $\{e_1, e_2, \ldots, e_K\}$. We measure the performance by the Hamming error rate:

$$r_n(\hat\Pi) = \min_{\mathcal{P}}\Big\{n^{-1}\sum_{i=1}^n 1\{\hat\pi_i \neq \mathcal{P}\pi_i\}\Big\}, \quad \text{where } \mathcal{P} \text{ ranges over all permutations of } \{1, 2, \ldots, K\}.$$

Let $\hat\Pi^{score}$ and $\hat\Pi^{rscore}$ be the $\hat\Pi$ obtained by applying SCORE and R-SCORE to the adjacency matrix $A$, respectively. Note that under the logit-DCBM model,

$$A = \Omega - \mathrm{diag}(\Omega) + W, \quad \text{where } \Omega = N \circ \tilde\Omega,\ \tilde\Omega = \Theta\Pi P\Pi'\Theta, \text{ and } P \text{ has unit diagonal entries.} \qquad (13)$$

The last item is a well-known identifiability condition (Jin et al., 2023). Since in most real networks $K$ is relatively small, we suppose that $K$ is fixed (this is only for technical simplicity and can be relaxed). Let $n_k$ be the number of nodes in the $k$-th community, $1 \le k \le K$. We assume

$$\min_k\{n_k\} \ge c_0 n, \quad \text{for some constant } 0 < c_0 \le 1/K. \qquad (14)$$

This is a frequently used and mild balance condition on the $K$ communities (e.g., Jin (2015); Jin et al. (2021a)). Also, we assume that there exist constants $c_2 \ge c_1 > 0$ such that

$$\bar\theta \to 0, \quad \text{and} \quad c_1\bar\theta \le \theta_i \le c_2\bar\theta \ \text{ for all } 1 \le i \le n, \quad \text{where } \bar\theta = \sum_{i=1}^n \theta_i/n. \qquad (15)$$

This condition is also only for technical reasons, and can be largely relaxed. Furthermore, we assume there exists a constant $c_3 > 0$ such that

$$\sqrt{n}\,\bar\theta\,|\lambda_{\min}(P)| \ge c_3\log(n), \qquad \lambda_{\min}(P): \text{ smallest eigenvalue of } P \text{ in magnitude.} \qquad (16)$$

In the special case where $A$ satisfies a DCBM, $\Omega = \tilde\Omega$, and SCORE was analyzed before (e.g., Jin (2015); Jin et al. (2021b)), where it is known that the signal-to-noise ratio (SNR) is given by $|\lambda_K(\tilde\Omega)|/\lambda_1^{1/2}(\tilde\Omega)$ ($|\lambda_K(\tilde\Omega)|$ and $\lambda_1^{1/2}(\tilde\Omega)$ represent the signal and the noise level, respectively). In order for the Hamming error rate of SCORE to tend to $0$, it is necessary that the SNR tends to $\infty$ (note that under (14)-(15), $\lambda_1(\tilde\Omega) \asymp n\bar\theta^2$ and $|\lambda_K(\tilde\Omega)| \asymp n\bar\theta^2|\lambda_{\min}(P)|$, so the SNR is of the order $\sqrt{n}\,\bar\theta\,|\lambda_{\min}(P)|$). Condition (16) is necessary, for otherwise the SNR may tend to $0$. Also, here $\lambda_{\min}(P)$ measures community dissimilarity. In the special case of $P = b1_K1_K' + (1-b)I_K$, $0 < b < 1$, $\lambda_{\min}(P) = 1 - b$. Therefore, if $\lambda_{\min}(P) \to 0$, then $b \to 1$, and all $K$ communities are very similar. Condition (16) defines a class of weak-signal settings where the problem of community detection is challenging.

Lastly, consider $P\Pi'\Theta^2\Pi$. Let $\eta$ be the first right eigenvector of $P\Pi'\Theta^2\Pi$. We assume that $\eta$ is a positive vector and that

$$\lambda_1(P\Pi'\Theta^2\Pi) - |\lambda_2(P\Pi'\Theta^2\Pi)| \ge c_4\lambda_1(P\Pi'\Theta^2\Pi), \qquad \max_i \eta(i)/\min_i \eta(i) \le c, \qquad (17)$$

for constants $c_4 > 0$ and $c > 0$. This condition guarantees that the first eigenvector is well-separated from the others and that the SCORE normalization by the first eigenvector is well-defined, since the reciprocals of the entries of the first eigenvector cannot blow up. It is a mild condition by Perron's theorem on non-negative matrices. A similar condition can be found in Jin et al. (2023).

Note that while SCORE was analyzed before for the DCBM, it was not analyzed for the logit-DCBM, where the analysis is expected to be much harder. In the logit-DCBM, we have $\Omega = \tilde\Omega + (N - 1_n1_n') \circ \tilde\Omega$. To avoid the nonlinearity completely ruining the low-rank structure, we need

$$\|(N - 1_n1_n') \circ \tilde\Omega\|/|\lambda_K(\tilde\Omega)| \to 0. \qquad (18)$$

Recall that $\mathrm{SNR} = |\lambda_K(\tilde\Omega)|/\lambda_1^{1/2}(\tilde\Omega)$. The following theorem is proved in the supplement.

Theorem 3.1 Let $\hat\Pi^{score}$ be the resulting estimate of $\Pi$ when we apply SCORE directly to $A$, and suppose (14)-(17) and (18) hold. With probability $1 - o(n^{-3})$,
$$r_n(\hat\Pi^{score}) \le C[\|(N - 1_n1_n') \circ \tilde\Omega\|^2 + \lambda_1(\tilde\Omega)]/\lambda_K^2(\tilde\Omega).$$
In the special case where $A$ satisfies the DCBM, $N = 1_n1_n'$, and $r_n(\hat\Pi^{score}) \le C\lambda_1(\tilde\Omega)/\lambda_K^2(\tilde\Omega)$.

Next, we consider R-SCORE. Since R-SCORE is a recursive algorithm, it is useful to present a result that is applicable in general cases. Consider an estimate of $\tilde\Omega$ of the form $\hat{\tilde\Omega} = \hat\Theta\hat\Pi\hat P\hat\Pi'\hat\Theta$. By our construction, $\hat N_{ij} = 1/(1 + \hat{\tilde\Omega}_{ij})$. Suppose that with probability $1 - o(n^{-3})$,

$$\|\hat P - P\|_{\max} \ll \min\{1,\ |\lambda_{\min}(P)|\,\bar\theta^{-1}\}, \qquad \|\hat\Pi - \Pi\| \cdot (\sqrt{n}\,|\lambda_{\min}(P)|)^{-1}\,\bar\theta \to 0, \qquad (19)$$

$$\|(N \oslash \hat N - 1_n1_n') \circ \tilde\Omega\| = o(|\lambda_K(\tilde\Omega)|), \qquad \|N \oslash \hat N - 1_n1_n'\|_F = o(\lambda_1(\tilde\Omega)). \qquad (20)$$

The following lemma is proved in the supplement.

Lemma 3.1 Suppose (14)-(17) hold. Let $\hat\Pi$ be the result of applying SCORE to $A \oslash \hat N$, where (19)-(20) and $\hat\theta_i \le C\bar\theta$ hold. With probability $1 - o(n^{-3})$,
$$r_n(\hat\Pi) \le \frac{C[\|(N \oslash \hat N - 1_n1_n') \circ \tilde\Omega\|^2 + \tau_n^2 + \lambda_1(\tilde\Omega)]}{\lambda_K^2(\tilde\Omega)}, \quad \text{where } \tau_n = \sqrt{n}\,\bar\theta^3\big[\sqrt{n}\,\|\hat P - P\|_{\max} + \|\hat\Pi - \Pi\|\big].$$

We now show that (19)-(20) hold in many settings. Our numerical study shows that R-SCORE typically converges in just one iteration, so for convenience in the analysis, we consider R-SCORE with one iteration from now on in this section. Write for short
$$\delta_n = \max\{\|(N - 1_n1_n') \circ \tilde\Omega\|^2,\ \lambda_1(\tilde\Omega)\}/\lambda_K^2(\tilde\Omega).$$
Theorem 3.2 is proved in the supplement.
Theorem 3.2 Suppose (14)-(17) hold and $\delta_n/\min\{\bar\theta^2,\ \bar\theta|\lambda_{\min}(P)|,\ |\lambda_{\min}(P)|^2/\bar\theta^2\} \to 0$. Let $\hat\Pi^{rscore}$ be the estimate of $\Pi$ obtained by applying R-SCORE to $A$, and let $(\hat\Theta, \hat P, \hat\Omega, \hat N)$ be the corresponding estimates of $(\Theta, P, \Omega, N)$ in the refitting step of R-SCORE. We have that (19)-(20) hold and that, with probability $1 - o(n^{-3})$,
$$r_n(\hat\Pi^{rscore}) \le C\big[\lambda_1(\tilde\Omega) + n\bar\theta^4\log(n) + n^2\bar\theta^2\delta_n^2 + n^2\bar\theta^6\delta_n\big]\big/\lambda_K^2(\tilde\Omega).$$

Corollary 3.1 Suppose (14)-(17) hold, $|\lambda_{\min}(P)| \ge C$ for a constant $C > 0$, and $n\bar\theta^4 \to \infty$. Let $\hat\Pi^{score}$ and $\hat\Pi^{rscore}$ be the estimates of $\Pi$ obtained by applying SCORE and R-SCORE to the adjacency matrix $A$, respectively. With probability $1 - o(n^{-3})$,
$$r_n(\hat\Pi^{score}) \le C\Big[\frac{1}{n\bar\theta^2} + \bar\theta^4\Big], \qquad r_n(\hat\Pi^{rscore}) \le C\Big[\frac{1}{n\bar\theta^2} + \bar\theta^6 + \frac{\log(n)}{n}\Big].$$

With a more careful analysis, we conjecture that the condition $n\bar\theta^4 \to \infty$ can be removed, and that the rate of R-SCORE is at least as fast as that of SCORE in the whole range of interest (note that the proof of Theorem 3.2 and Corollary 3.1 is already hard and relatively long). If we calibrate $\bar\theta = n^{-\beta}$ for a constant $\beta > 0$, then in order for the SNR to tend to $\infty$ (e.g., see (18)), we must have $0 < \beta < 1/2$. In this range, $r_n(\hat\Pi^{score}) \le Cn^{-a_0(\beta)}$ and $r_n(\hat\Pi^{rscore}) \le Cn^{-a_1(\beta)}$, where

$$a_0(\beta) = \begin{cases} 4\beta, & 0 < \beta \le 1/6,\\ 1 - 2\beta, & 1/6 < \beta < 1/2, \end{cases} \qquad a_1(\beta) = \begin{cases} 6\beta, & 0 < \beta \le 1/8,\\ 1 - 2\beta, & 1/8 < \beta < 1/2; \end{cases}$$

see Figure 1. Therefore, when $0 < \beta < 1/6$, the Hamming error rate of R-SCORE is faster than that of SCORE. When $\beta > 1/6$, such a conclusion may also be true, as the current bound may be conservative: the Hamming error rate of R-SCORE depends on a complicated data-dependent matrix $\hat N$, and the error bound can hopefully be improved with a more careful analysis.

Remark 5. The proof of Theorem 3.2 and Corollary 3.1 is hard, for $\hat N$ has a complicated form: recall that $\hat N_{ij} = 1/(1 + \hat{\tilde\Omega}_{ij})$ and $\hat{\tilde\Omega} = \hat\Theta\hat\Pi\hat P\hat\Pi'\hat\Theta$, where $\hat\Pi$ is the SCORE estimate of $\Pi$, and $(\hat\Theta, \hat P)$ are constructed using $\hat\Pi$, $A$, and a cancellation trick. Note that even when $\Pi$ is known, how to estimate $(\Theta, P)$ is a nontrivial problem, and we resolve it with a cancellation trick.

Figure 1: Comparison of error rates (x-axis: $\beta$; y-axis: $a_0(\beta)$ (blue) and $a_1(\beta)$ (red)).

4 SIMULATION RESULTS

We compare R-SCORE with SCORE and with a non-convex penalized-MLE approach by Ma et al. (2020), which we refer to as npMLE. We compare with npMLE because Ma et al. (2020) deals with the LSM and is probably the most closely related work to our paper. Our study contains 3 experiments. In Experiment 1, we compare R-SCORE with SCORE (which is viewed as a benchmark). In Experiment 2, we study how the error rates of R-SCORE and npMLE change across iterations (both algorithms are recursive). In Experiment 3, we compare the errors of R-SCORE and npMLE.

Experiment 1. R-SCORE vs. SCORE. The networks are simulated as follows: fixing $(n, K)$, we first simulate an $n \times n$ matrix $\Omega$ as in the logit-DCBM model. Let $P = \beta 1_K1_K' + (1-\beta)I_K$ and $\Pi = [\pi_1, \pi_2, \ldots, \pi_n]'$, where $\pi_i = e_k$ for $n_0 = n/K$ different $i$ (recall that $e_k$ is the $k$-th Euclidean basis vector of $\mathbb{R}^K$). Moreover, fixing a parameter $b_n > 0$, we generate $\theta$ as follows. We first draw $\theta_1^0, \theta_2^0, \ldots, \theta_n^0$ i.i.d. from a fixed distribution $F(\cdot)$, and then renormalize $\theta^0$ to get $\theta_i = b_n\theta_i^0/\|\theta^0\|$ for $1 \le i \le n$. Finally, we let $\tilde\Omega = \Theta\Pi P\Pi'\Theta$ and $\Omega_{ij} = \tilde\Omega_{ij}/(1 + \tilde\Omega_{ij})$, $1 \le i, j \le n$. Once we have such a matrix $\Omega$, we use it to generate a binary adjacency matrix $A$. In such settings, the signal-to-noise ratio (SNR) is approximately $b_n(1-\beta)$ (e.g., see Jin et al. (2021a)). It is desirable to choose settings where the SNR is neither too large nor too small.
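For concreteness, a sketch of this generating recipe is below. We normalize $\theta^0$ by its Euclidean norm, an assumption on our part chosen so that the resulting SNR is approximately $b_n(1-\beta)$; the other ingredients follow the description above.

```python
import numpy as np

def simulate_experiment1(n, K, b_n, beta, draw_theta0, seed=0):
    """Simulate A from the logit-DCBM as in Experiment 1 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    P = beta * np.ones((K, K)) + (1 - beta) * np.eye(K)
    labels = np.repeat(np.arange(K), n // K)              # n0 = n/K nodes per community
    theta0 = draw_theta0(rng, n)
    theta = b_n * theta0 / np.linalg.norm(theta0)         # assumed normalization (see text)
    Omega_tilde = np.outer(theta, theta) * P[np.ix_(labels, labels)]
    Omega = Omega_tilde / (1 + Omega_tilde)               # logit link
    A = (rng.random((n, n)) < Omega).astype(int)
    A = np.triu(A, 1); A = A + A.T
    return A, labels

# e.g., Setting (A) below: n = 2400, K = 3, F = Uniform(0.01, 2), b_n = 60, beta = 23/30
A, labels = simulate_experiment1(2400, 3, 60, 23 / 30,
                                 lambda rng, n: rng.uniform(0.01, 2, n))
```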
Consider four settings (A), (B), (C), and (D). In Setting (A), we fix $(n, K) = (2400, 3)$ and $F = \mathrm{Uniform}(0.01, 2)$. We choose $b_n = 60$ and $\beta = 23/30$ (so that SNR = 14). In Setting (B), we fix $(n, K) = (2500, 5)$ and $F = \mathrm{Uniform}(0.1, 0.8)$. We choose $b_n = 70$ and $\beta = 0.65$ (so SNR = 24.5). In Setting (C), we fix $(n, K) = (2400, 3)$ and $F = \mathrm{Pareto}(10, 1)$. To avoid extremely severe degree heterogeneity, we truncate each $\theta_i^0$ at 200. We choose $b_n = 70$ and $\beta = 0.55$ (so SNR = 31.5). In Setting (D), we fix $(n, K) = (2500, 5)$ and $F = \mathrm{Pareto}(10, 1)$ with truncation at 100. We choose $b_n = 50$ and $\beta = 0.55$ (so SNR = 22.5).

The results are in Figure 2, where the x-axis is the number of iterations $m$, and the y-axis is the corresponding error rate of R-SCORE (green dashed line: error rate of R-SCORE with $m = 0$, the same as that of applying SCORE directly to the adjacency matrix $A$). In all settings above, the performance of directly applying SCORE ($m = 0$) is unsatisfactory. The improvements achieved by R-SCORE are significant, with substantially reduced error rates. This suggests that (a) R-SCORE is successful, as we expected, and (b) the iterative algorithm typically converges in very few iterations. Based on the numerical results, we believe that the refitting steps in R-SCORE are effective: by renormalization, they reduce the logit-DCBM approximately to a low-rank model.

Experiment 2. Error rates of R-SCORE and npMLE across iterations. Fix $(n, K) = (5400, 6)$. Let $\Pi$ be generated as in Experiment 1, except that $\pi_i = e_k$ for $n_k$ different $i$. Let

$$P = \begin{pmatrix} P_1 & P_2 \\ P_2 & P_1 \end{pmatrix}, \quad \text{where } P_1 = 0.5\beta_1 1_{K/2}1_{K/2}' + (1 - 0.5\beta_1)I_{K/2} \ \text{ and } \ P_2 = 0.5(\beta_1 + \beta_2)1_{K/2}1_{K/2}'.$$

We generate the $\theta_i$ in the same manner as in Experiment 1 and similarly construct $\tilde\Omega$ and $\Omega$ to generate a binary adjacency matrix $A$. In this experiment, we choose $(n_1, n_2, \ldots, n_K) = 200 \cdot (5, 1.5, 6, 3, 7.5, 4)$, $F = \mathrm{Uniform}(0.01, 2)$, $b_n = 80$, and $(\beta_1, \beta_2) = (0.9, 0.6)$. Under this setting, the SNR is given by $b_n|\lambda_{\min}(P)| = 28$. For each $m = 1, 2, \ldots, 1000$, we apply R-SCORE and npMLE with $m$ iterations (SCORE is also included for comparison, which is the same as R-SCORE with $m = 0$). The result is in Figure 3 (left). We observe that (a) R-SCORE converges much more rapidly than npMLE, and (b) the error rate of R-SCORE is significantly lower than that of SCORE, and slightly lower than that of npMLE.

Figure 2: The error rates of R-SCORE for different $m$ (number of iterations). See Experiment 1 for setting details. SCORE is also included for comparison as a benchmark, corresponding to R-SCORE with $m = 0$ (x-axis: $m$; y-axis: error rate).

Experiment 3. R-SCORE vs. npMLE. Consider the same setting as in Experiment 2, but let $\beta_2$ vary: we set $b_n = 30$ and let $\beta_2$ range from 0.58 to 0.7 with a step size of 0.02 (other parameters remain the same). The SNR of the simulated network hinges on the smallest eigenvalue of $P$, which in turn hinges on $\beta_2$. The results (based on 20 repetitions) are in Figure 3 (right), which suggest that R-SCORE steadily outperforms npMLE for $\beta_2$ in the entire range. Also, R-SCORE is much faster than npMLE. In each repetition, R-SCORE takes only 6 seconds, whereas npMLE takes more than 300 seconds (more than 50 times longer). This shows that R-SCORE is not only significantly faster than npMLE, but may also outperform npMLE in many network settings.
Figure 3: Left panel: the error rates of R-SCORE and npMLE for different $m$ (number of iterations); see Experiment 2 (x-axis: $\log_{10}(m)$; y-axis: error rate). SCORE is also included for comparison, corresponding to R-SCORE with $m = 0$. Right panel: the error rates of R-SCORE and npMLE for different $\beta_2$; see Experiment 3 for setting details (x-axis: $\beta_2$; y-axis: error rate).

5 DISCUSSION

In this paper, we have made a three-fold contribution to the area of network community detection. First, we propose the logit-DCBM as a new network model. We argue that the logit-DCBM is more reasonable than the popular DCBM, but also poses a challenge. Second, to overcome the challenge, we propose a trick that can effectively cancel the effect of the nonlinear factors of the logit-DCBM in some statistics (especially the ratio of two cycle counts). Last, we propose R-SCORE as a new algorithm for community detection, and show that it can significantly improve over existing spectral approaches, including SCORE.

Our idea is generalizable to many other settings. For example, the $p_1$ model of Holland and Leinhardt (Holland & Leinhardt, 1981) is one of the most popular models for directed networks with one community. Following the idea here, we can generalize it to a model with multiple communities, and extend R-SCORE for community detection with the new model. Also, the cancellation trick can be extended to many other settings (e.g., analysis of the $p_1$ model for directed networks (Holland & Leinhardt, 1981), text analysis (Ke et al., 2023), tensor analysis (Yuan et al., 2022b)) where the data matrix $A$ satisfies $\mathbb{E}[A] = N \circ \tilde\Omega$ for a simple low-rank matrix $\tilde\Omega$ and a matrix $N$ consisting of nonlinear factors. Given that nonlinear models are becoming increasingly important in statistics and machine learning, the trick (and its extended form) may find increasingly many uses in applications in the near future.

The cancellation trick is especially useful. In machine learning, we have many nonlinear latent variable models spread across many areas (e.g., cancer clustering (Jin & Wang, 2016), text analysis (Ke & Wang, 2024), and empirical finance). Due to the nonlinearity, how to analyze such models is a challenging problem. In this paper, we propose an interesting cancellation trick with which we can effectively remove the nonlinear factors in some latent variable models. For space reasons, we only showcase this trick in a network setting, but the idea is extendable to other nonlinear latent space models. For this reason, our work may spark new research in many different directions in machine learning.

REFERENCES

Sourav Chatterjee, Persi Diaconis, and Allan Sly. Random graphs with a given degree sequence. Ann. Appl. Probab., pp. 1400-1435, 2011.

Yudong Chen, Xiaodong Li, and Jiaming Xu. Convexified modularity maximization for degree-corrected stochastic block models. The Annals of Statistics, 46(4):1573-1602, 2018.

Arthur Dempster, Nan Laird, and Donald Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39(1):1-22, 1977.

Annette J. Dobson and Adrian G. Barnett. An Introduction to Generalized Linear Models. Chapman and Hall/CRC, 2018.

Anna Goldenberg, Alice Zheng, Stephen Fienberg, and Edoardo Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129-233, 2010.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2nd edition, 2009.

Peter Hoff. Bilinear mixed-effects models for dyadic data. J. Amer. Statist. Assoc., 100(469):286-295, 2005.

Paul W. Holland and Samuel Leinhardt. An exponential family of probability distributions for directed graphs. J. Amer. Statist. Assoc., 76(373):33-50, 1981. doi: 10.1080/01621459.1981.10477598. URL https://www.tandfonline.com/doi/abs/10.1080/01621459.1981.10477598.

Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109-137, 1983.

Roger Horn and Charles Johnson. Matrix Analysis. Cambridge University Press, 1985.

Jiashun Jin. Fast community detection by SCORE. Ann. Statist., 43(1):57-89, 2015.

Jiashun Jin and Wanjie Wang. Influential features PCA for high dimensional clustering. Ann. Statist., 44(6):2323-2359, 2016.

Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Optimal adaptivity of signed-polygon statistics for network testing. Ann. Statist., 49(6):3408-3433, 2021a. doi: 10.1214/21-AOS2089. URL https://doi.org/10.1214/21-AOS2089.

Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Improvements on SCORE, especially for weak signals. Sankhya A, pp. 1-36, 2021b.

Jiashun Jin, Zheng Tracy Ke, Shengming Luo, and Minzhe Wang. Optimal estimation of the number of network communities. J. Amer. Statist. Assoc., 118(543):2101-2116, 2023. doi: 10.1080/01621459.2022.2035736. URL https://doi.org/10.1080/01621459.2022.2035736.

Brian Karrer and Mark Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83(1):016107, 2011.

Vishesh Karwa and Aleksandra Slavković. Inference using noisy degrees: differentially private β-model and synthetic graphs. The Annals of Statistics, 44:87-112, 2016.

Zheng Tracy Ke and Jiashun Jin. Special invited paper: The SCORE normalization, especially for heterogeneous network and text data. Stat, 12(1):e545, 2023.

Zheng Tracy Ke and Minzhe Wang. Using SVD for topic modeling. Journal of the American Statistical Association, 119(545):434-449, 2024.

Zheng Tracy Ke, Jiashun Jin, Pengsheng Ji, and Wanshan Li. Recent advances in text analysis. Annual Review of Statistics and Its Application, 11, 2023.

Zhuang Ma, Zongming Ma, and Hongsong Yuan. Universal latent space model fitting for large networks with edge covariates. Journal of Machine Learning Research, 21(4):1-67, 2020.

Alessandro Rinaldo, Sonja Petrović, and Stephen Fienberg. On the existence of the MLE for a directed random graph network model with reciprocation. 2010.

Mingao Yuan, Yang Feng, and Zuofeng Shang. A likelihood-ratio type test for stochastic block models with bounded degrees. Journal of Statistical Planning and Inference, 219:98-119, 2022a.

Mingao Yuan, Ruiqi Liu, Yang Feng, and Zuofeng Shang. Testing community structure for hypergraphs. Ann. Statist., 50(1):147-169, 2022b.

Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist., 40:2266-2292, 2012.