Published as a conference paper at ICLR 2023

SEMI-SUPERVISED COMMUNITY DETECTION VIA STRUCTURAL SIMILARITY METRICS

Yicong Jiang, Tracy Ke
Department of Statistics, Harvard University, Cambridge, MA 02138, USA

Motivated by social network analysis and network-based recommendation systems, we study a semi-supervised community detection problem in which the objective is to estimate the community label of a new node using the network topology and partially observed community labels of existing nodes. The network is modeled by a degree-corrected stochastic block model, which allows for severe degree heterogeneity and potentially non-assortative communities. We propose an algorithm that computes a structural similarity metric between the new node and each of the K communities by aggregating labeled and unlabeled data. The estimated label of the new node is the value of k that maximizes this similarity metric. Our method is fast and numerically outperforms existing semi-supervised algorithms. Theoretically, we derive explicit bounds for the misclassification error and show the efficiency of our method by comparing it with an ideal classifier. To the best of our knowledge, ours is the first semi-supervised community detection algorithm that offers theoretical guarantees.

1 INTRODUCTION

Nowadays, large network data are frequently observed in social media (such as Facebook, Twitter, and LinkedIn), science, and social science. Learning the latent community structure of a network is of particular interest. For example, community analysis is useful for designing recommendation systems (Debnath et al., 2008), measuring scholarly impacts (Ji et al., 2022), and reconstructing pseudo-dynamics in single-cell data (Liu et al., 2018). In this paper, we consider a semi-supervised community detection setting: we are given a symmetric network with n nodes, and we denote by A ∈ R^{n×n} the adjacency matrix, where A_ij ∈ {0, 1} indicates whether there is an edge between nodes i and j. Suppose the nodes partition into K non-overlapping communities C_1, C_2, ..., C_K. For a subset L ⊂ {1, 2, ..., n}, we observe the true community label y_i ∈ {1, 2, ..., K} for each i ∈ L. Write m = |L| and Y_L = (y_i)_{i∈L}. In this context, there are two related semi-supervised community detection problems: (i) in-sample classification, where the goal is to classify all the existing unlabeled nodes; and (ii) prediction, where the goal is to classify a new node joining the network. Notably, the in-sample classification problem can easily be reduced to the prediction problem: we can successively single out each existing unlabeled node, regard it as the "new node", and predict its label by applying an algorithm for the prediction problem. Hence, for most of the paper, we focus on the prediction problem and defer the study of in-sample classification to Section 3. In the prediction problem, let X ∈ {0, 1}^n denote the vector of edges between the new node and each of the existing nodes. Given (A, Y_L, X), our goal is to estimate the community label of the new node. This problem has multiple applications. Consider news suggestion or online advertising for a new Facebook user (Shapira et al., 2013).
Given a big Facebook network of existing users, for a small fraction of nodes (e.g., active users), we may have good information about the communities to which they belong, whereas for the majority of users, we only observe whom they link to. We are interested in estimating the community label of the new user in order to personalize news or ad recommendations. For another example, in a co-citation network of researchers (Ji et al., 2022), each community might be interpreted as a group of researchers working on the same research area. We frequently have a clear understanding of the research areas of some authors (e.g., senior authors), and we intend to use this knowledge to determine the community to which a new node (e.g., a junior author) belongs.

The statistical literature on community detection has mainly focused on the unsupervised setting (Bickel & Chen, 2009; Rohe et al., 2011; Jin, 2015; Gao et al., 2018; Li et al., 2021). The semi-supervised setting is less studied. Leng & Ma (2019) offers a comprehensive literature review of semi-supervised community detection algorithms. Liu et al. (2014) and Ji et al. (2016) derive systems of linear equations for the community labels from physical theory, and predict the labels by solving those equations. Zhou et al. (2018) leverages belief functions to propagate labels across the network, so that one can estimate the label of a node through its belief. Betzel et al. (2018) extracts several patterns in the size and structural composition of the known communities and searches for similar patterns in the graph. Yang et al. (2015) unifies a number of community detection algorithms based on non-negative matrix factorization or spectral clustering in the unsupervised setting, and adapts them to the semi-supervised scenario by adding various regularization terms that encourage the estimated labels of nodes in L to match their observed labels.

However, the existing methods still face challenges. First, many of them rely on the heuristic that a node tends to have more edges with nodes in the same community than with nodes in other communities. This is true only when communities are assortative, but non-assortative communities are also seen in real networks (Goldenberg et al., 2010; Betzel et al., 2018); for instance, Facebook users sharing similar restaurant preferences are not necessarily friends of each other. Second, real networks often exhibit severe degree heterogeneity (i.e., the degrees of some nodes can be many times larger than the degrees of other nodes), but most semi-supervised community detection algorithms do not handle degree heterogeneity. Third, the optimization-based algorithms (Yang et al., 2015) solve non-convex problems and face the issue of local minima. Last, to our best knowledge, none of the existing methods has theoretical guarantees.

Attributed network clustering is a problem related to community detection, for which many algorithms have been developed (see Chunaev et al. (2019) for a nice survey). Graph neural networks (GNN) have reported great successes in attributed network clustering. Kipf & Welling (2016) proposes a graph convolutional network (GCN) approach to semi-supervised community detection, and Jin et al. (2019) combines GNN with Markov random fields to predict node labels.
However, GNN is designed for settings where each node has a large number of attributes and these attributes contain rich information about community labels. The key question in GNN research is how to utilize the graph to better propagate messages. In contrast, we are interested in the scenario where it is infeasible or costly to collect node attributes. For instance, it is easy to construct a co-authorship network from BibTeX files, but collecting features of authors is much harder. Additionally, a number of benchmark network datasets do not have attributes (e.g., Caltech (Red et al., 2011; Traud et al., 2012), Simmons (Red et al., 2011; Traud et al., 2012), and Polblogs (Adamic & Glance, 2005)). It is unclear how to implement GNN on these datasets. In Section 4, we briefly study the performance of GNN with self-created node features built from one-hop representations, graph topology, and node embeddings. Our experiments indicate that GNN is often not suitable for the case of no node attributes.

We propose a new algorithm for semi-supervised community detection to address the limitations of existing methods. We adopt the DCBM model (Karrer & Newman, 2011) for networks, which models degree heterogeneity and allows for both assortative and non-assortative communities. Inspired by the viewpoint of Goldenberg et al. (2010) that a community is a group of structurally equivalent nodes, we design a structural similarity metric between the new node and each of the K communities. This metric aggregates information from both labeled and unlabeled nodes. We then estimate the community label of the new node by the k that maximizes this similarity metric. Our method is easy to implement, computationally fast, and compares favorably with other methods in numerical experiments. In theory, we derive explicit bounds for the misclassification probability of our method under the DCBM model. We also study the efficiency of our method by comparing its misclassification probability with that of an ideal classifier that has access to the community labels of all nodes.

2 SEMI-SUPERVISED COMMUNITY DETECTION

Recall that A is the n×n adjacency matrix of the existing nodes and Y_L contains the community labels of the nodes in L. Write [n] = {1, 2, ..., n} and let U = [n] \ L denote the set of unlabeled nodes. We index the new node by n+1 and let X ∈ {0, 1}^n be the binary vector of edges between the new node and the existing nodes. Denote by Ā the adjacency matrix of the network on all (n+1) nodes.

2.1 The DCBM model and structural equivalence of communities

We model Ā with the degree-corrected block model (DCBM) (Karrer & Newman, 2011). For each node i, define a K-dimensional membership vector π_i ∈ {e_1, e_2, ..., e_K}, where the e_k's are the standard basis vectors of R^K. We encode the community labels by π_i, with π_i = e_k if and only if y_i = k. For a symmetric nonnegative matrix P ∈ R^{K×K} and a degree parameter θ_i ∈ (0, 1] for each node i, we assume that the upper triangle of Ā contains independent Bernoulli variables, where

P(Ā_ij = 1) = θ_i θ_j π_i' P π_j,   for all 1 ≤ i ≠ j ≤ n+1.   (1)

When the θ_i are equal, the DCBM reduces to the stochastic block model (SBM). Compared with the SBM, the DCBM is more flexible, as it accommodates degree heterogeneity. For a matrix M or a vector v, let diag(M) and diag(v) denote the diagonal matrices whose diagonal entries are taken from the diagonal of M or from the entries of v, respectively. Write θ = (θ_1, θ_2, ..., θ_{n+1})', Θ = diag(θ), and Π = [π_1, π_2, ..., π_{n+1}]' ∈ R^{(n+1)×K}.
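Model (1) is straightforward to simulate from. Below is a minimal sketch (our own illustration in Python/NumPy, not code from the paper); the function name `sample_dcbm` and all parameter values are assumptions made for the example.

```python
import numpy as np

def sample_dcbm(theta, Pi, P, rng=None):
    """Sample a symmetric DCBM adjacency matrix under model (1).

    theta : (n,) degree parameters in (0, 1].
    Pi    : (n, K) one-hot membership matrix.
    P     : (K, K) symmetric community connectivity matrix.
    """
    rng = np.random.default_rng(rng)
    # Omega[i, j] = theta_i * theta_j * pi_i' P pi_j, the edge probability in (1).
    Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)
    upper = rng.random(Omega.shape) < Omega          # independent Bernoulli draws
    A = np.triu(upper, k=1).astype(int)              # keep the upper triangle only
    return A + A.T                                   # symmetrize; diagonal stays zero

# Small illustrative example: K = 2 communities, mild degree heterogeneity.
rng = np.random.default_rng(0)
n, K = 200, 2
labels = rng.integers(0, K, size=n)
Pi = np.eye(K)[labels]
theta = rng.uniform(0.05, 0.3, size=n)
P = np.array([[1.0, 0.4], [0.4, 1.0]])               # an assortative example
A = sample_dcbm(theta, Pi, P, rng)
```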
Model (1) yields that

Ā = Ω − diag(Ω) + W,   where Ω = ΘΠPΠ'Θ and W = Ā − E[Ā].   (2)

Here, Ω is a low-rank matrix that captures the "signal", W is a generalized Wigner matrix that captures the "noise", and −diag(Ω) introduces a bias to the signal, but its effect is usually negligible.

The DCBM belongs to the family of block models for networks. In block models, the edge density within a community is not necessarily higher than the edge densities between different communities. Communities with higher within-community density are called assortative communities. However, non-assortative communities also appear in many real networks (Goldenberg et al., 2010; Betzel et al., 2018). For instance, in news and ad recommendation, we are interested in identifying a group of users who have similar behaviors, but they may not be densely connected to each other. Goldenberg et al. (2010) introduced an intuitive notion of structural equivalence: two nodes are structurally equivalent if they connect to similar nodes in a similar way. They argued that a community in block models is a group of structurally equivalent nodes. This way of defining communities is more general than assortative communities.

We now introduce a rigorous description of structural equivalence in the DCBM model. For two vectors u and v, define ψ(u, v) = arccos⟨u/‖u‖, v/‖v‖⟩, which is the angle between the two vectors. Let Ā_i be the ith column of Ā. This vector describes the behavior of node i in the network. Recall that Ω is as in (2). When the signal-to-noise ratio is sufficiently large, Ā_i ≈ Ω_i, where Ω_i is the ith column of Ω. We therefore approximate the angle between Ā_i and Ā_j by the angle between Ω_i and Ω_j. Under the DCBM, for a node i in community k, Ω_i = θ_i ΘΠP e_k, where e_k is the kth standard basis vector of R^K. It follows that for i ∈ C_k and j ∈ C_ℓ, the degree parameters θ_i and θ_j cancel out in our structural similarity:

cos ψ(Ω_i, Ω_j) = ⟨θ_i ΘΠP e_k, θ_j ΘΠP e_ℓ⟩ / (‖θ_i ΘΠP e_k‖ · ‖θ_j ΘΠP e_ℓ‖) = M_kℓ / √(M_kk M_ℓℓ),   with M := PΠ'Θ²ΠP.   (3)

It is seen that cos ψ(Ω_i, Ω_j) does not depend on the degree parameters of the nodes and is determined solely by community membership. When k = ℓ (i.e., i and j are in the same community), cos ψ(Ω_i, Ω_j) = 1, which means the angle between these two vectors is zero. When k ≠ ℓ, as long as P is non-singular and Π has full column rank, M is a positive-definite matrix. It follows that cos ψ(Ω_i, Ω_j) < 1 and that the angle between Ω_i and Ω_j is nonzero.

Example 1. Suppose K = 2 and P ∈ R^{2×2} has diagonal entries 1 and off-diagonal entries b, for some b > 0 with b ≠ 1, and max_i{θ_i} < min{1/b, 1} (to guarantee that all entries of Ω are smaller than 1). For simplicity, we assume Σ_{i∈C_1} θ_i² = Σ_{i∈C_2} θ_i². It can be shown that M is proportional to the matrix whose diagonal entries are (1 + b²) and whose off-diagonal entries are 2b. When b < 1, the communities are assortative, and when b > 1, the communities are non-assortative. However, regardless of the value of b, the off-diagonal entries of M are always strictly smaller than the diagonal entries, so that cos ψ(Ω_i, Ω_j) < 1 for nodes in distinct communities.
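The identity in (3) is easy to check numerically. The sketch below (our own illustration with arbitrary assumed parameter values, not code from the paper) verifies that the cosine between two columns of Ω equals M_kℓ/√(M_kk M_ℓℓ) and does not involve the individual θ_i.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 300, 2
labels = rng.integers(0, K, size=n)
Pi = np.eye(K)[labels]                       # one-hot membership matrix
theta = rng.uniform(0.05, 0.5, size=n)       # heterogeneous degree parameters
b = 1.6                                      # b > 1: non-assortative communities
P = np.array([[1.0, b], [b, 1.0]])

Theta = np.diag(theta)
Omega = Theta @ Pi @ P @ Pi.T @ Theta        # signal matrix, as in (2)
M = P @ Pi.T @ Theta**2 @ Pi @ P             # M = P Pi' Theta^2 Pi P, as in (3)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

i = np.flatnonzero(labels == 0)[0]           # a node in community 1
j = np.flatnonzero(labels == 1)[0]           # a node in community 2
lhs = cosine(Omega[:, i], Omega[:, j])       # depends only on the two communities
rhs = M[0, 1] / np.sqrt(M[0, 0] * M[1, 1])
print(np.isclose(lhs, rhs), lhs < 1)         # True True: the angle is nonzero across communities
```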
2.2 Semi-supervised community detection

Inspired by (3), we propose assigning a community label to the new node based on its similarity to the labeled nodes. For each 1 ≤ k ≤ K, assume that L ∩ C_k ≠ ∅ and define a vector A^(k) ∈ R^n by A^(k)_j = Σ_{i∈L∩C_k} A_ij, for 1 ≤ j ≤ n. The vector A^(k) describes the aggregated behavior of all labeled nodes in community k. Recall that X ∈ R^n contains the edges between the new node and all the existing nodes. We can estimate the community label of the new node by

ŷ = argmin_{1≤k≤K} ψ(A^(k), X).   (4)

We call (4) the AngleMin estimate. Note that each A^(k) is an n-dimensional vector whose construction uses both A_LL and A_LU. Therefore, A^(k) aggregates information from both labeled and unlabeled nodes, and AngleMin is indeed a semi-supervised approach.

The estimate in (4) still has room to improve. First, A^(k) and X are high-dimensional random vectors, each entry of which is a sum of independent Bernoulli variables. When the network is very sparse or the communities are heavily imbalanced in size or degree, the large-deviation bound for ψ(A^(k), X) can be unsatisfactory. Second, recall that our observed data include A and X. Denote by A_LL the submatrix of A restricted to L×L and by X_L the subvector of X restricted to L; the other notations are similar. In (4), only (A_LL, A_LU, X) are used, and the information in A_UU is wasted.

We now propose a variant of (4). For any vector x ∈ R^n, let x_L and x_U be the sub-vectors restricted to the indices in L and U, respectively. Let 1^(k) denote the |L|-dimensional vector indicating whether each labeled node is in community k. Given any |U|×K matrix H = [h_1, h_2, ..., h_K], define

f(x; H) = [x_L'1^(1), ..., x_L'1^(K), x_U'h_1, ..., x_U'h_K]' ∈ R^{2K}.   (5)

The mapping f(·; H) creates a low-dimensional projection of x. Suppose we now apply this mapping to A^(k). In the projected vector, each entry is a weighted sum of a large number of entries of A^(k). Since A^(k) contains independent entries, it follows from large-deviation inequalities that each entry of f(A^(k); H) has a nice asymptotic tail behavior. This resolves the first issue above. We then modify the AngleMin estimate in (4) to the following estimate, which we call AngleMin+ (footnote 1):

ŷ(H) = argmin_{1≤k≤K} ψ(f(A^(k); H), f(X; H)).   (6)

AngleMin+ requires an input matrix H. Our theory suggests that H has to satisfy two conditions: (a) The spectral norm of H'H is O(|U|). In fact, given any H, we can always multiply it by a scalar so that ‖H'H‖ is of order |U|. Hence, this condition says that the scaling of H should be set properly to balance the contributions from labeled and unlabeled nodes. (b) The minimum singular value of H'Θ_UU Π_U has to be at least a constant times ‖H'Θ_UU Π_U‖, where Θ_UU is the submatrix of Θ restricted to the (U, U) block and Π_U is the submatrix of Π restricted to the rows in U. This condition prevents the columns of H from being orthogonal to the columns of Θ_UU Π_U, and it guarantees that the last K entries of f(x; H) retain enough information about the unlabeled nodes.

We construct a data-driven H from A_UU by taking advantage of existing unsupervised community detection algorithms such as Gao et al. (2018); Jin et al. (2021). Let Π̂_U = [π̂_i]_{i∈U} store the community labels obtained by applying a community detection algorithm to the sub-network restricted to the unlabeled nodes, where π̂_i = e_k if and only if node i is clustered into community k. We propose using

H = Π̂_U.   (7)

This choice of H always satisfies condition (a) above. Furthermore, under mild regularity conditions, as long as the clustering error fraction is bounded by a constant, this H also satisfies condition (b). We note that the information in A_UU has been absorbed into H, which resolves the second issue above. Combining (7) with (6) gives a two-stage algorithm for estimating y.
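Below is a minimal sketch of AngleMin (4) and AngleMin+ (5)-(7) in Python/NumPy. It is our own paraphrase of the procedure, not the authors' code; the argument `cluster_unlabeled` stands for any unsupervised community detection routine (e.g., SCORE+ from Section 2.3) applied to A_UU, and all variable names are ours.

```python
import numpy as np

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.arccos(np.clip(c, -1.0, 1.0))

def angle_min(A, X, L_idx, labels_L, K):
    """AngleMin, eq. (4): compare X with the aggregated vectors A^(k)."""
    L_idx, labels_L = np.asarray(L_idx), np.asarray(labels_L)
    Ak = [A[L_idx[labels_L == k]].sum(axis=0) for k in range(K)]
    return int(np.argmin([angle(Ak[k], X) for k in range(K)]))

def angle_min_plus(A, X, L_idx, labels_L, U_idx, K, cluster_unlabeled):
    """AngleMin+, eqs. (5)-(7): project to 2K dimensions before comparing angles."""
    L_idx, U_idx = np.asarray(L_idx), np.asarray(U_idx)
    labels_L = np.asarray(labels_L)
    # Step 1: H = Pi_hat_U from any unsupervised method run on A_UU (labels 0..K-1).
    labels_U_hat = cluster_unlabeled(A[np.ix_(U_idx, U_idx)], K)
    H = np.eye(K)[labels_U_hat]                      # |U| x K one-hot matrix
    Pi_L = np.eye(K)[labels_L]                       # |L| x K one-hot matrix

    def f(x):                                        # f(x; H) in eq. (5)
        return np.concatenate([x[L_idx] @ Pi_L, x[U_idx] @ H])

    # Step 2: pick the label minimizing the angle between f(A^(k); H) and f(X; H).
    Ak = [A[L_idx[labels_L == k]].sum(axis=0) for k in range(K)]
    return int(np.argmin([angle(f(Ak[k]), f(X)) for k in range(K)]))
```

For the in-sample problem of Section 3.3, the same routine can be reused by taking X to be the column of A corresponding to the node being classified (with that node removed from the network).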
Remark 1: A nice property of AngleMin+ is that it tolerates an arbitrary permutation of the communities in Π̂_U. In other words, the communities output by the unsupervised community detection algorithm do not need to have a one-to-one correspondence with the communities of the labeled nodes. To see the reason, consider an arbitrary permutation of the columns of Π̂_U. By (5), this permutes the last K entries of f(x; H), simultaneously for all x. However, the angle between f(A^(k); H) and f(X; H) is still the same, and so ŷ(H) is unchanged. This property brings a lot of practical convenience. When K is large or the signals are weak, it is challenging (both computationally and statistically) to match the communities in Π̂_U with those in Π_L. Our method avoids this issue.

Remark 2: AngleMin+ is flexible enough to accommodate other choices of H. Some unsupervised community detection algorithms provide both Π̂_U and Θ̂_UU (Jin et al., 2022). We may use H ∝ Θ̂_UU Π̂_U (subject to a re-scaling to satisfy condition (a) above). This H down-weights the contribution of low-degree unlabeled nodes in the last K entries of (5), which is beneficial when the signals are weak and the degree heterogeneity is severe. Another choice is H ∝ Ξ̂_(U) Λ̂_(U)^{-1}, where Λ̂_(U) is the diagonal matrix containing the K largest eigenvalues (in magnitude) of A_UU and Ξ̂_(U) is the associated matrix of eigenvectors. For this H, we do not even need to run a community detection algorithm on A_UU. We may also use spectral embedding (Rubin-Delanchy et al., 2017).

Footnote 1: In AngleMin+, H serves to reduce noise. For example, let X, Y ∈ R^{2m} be two independent random Bernoulli vectors (with independent entries), where EX = EY = (.1, ..., .1, .4, ..., .4)'. As m → ∞, it can be shown that cos ψ(X, Y) → 0.34 ≠ 1 almost surely. If we project X and Y into R² by summing the first m coordinates and the last m coordinates separately, then as m → ∞ the cosine of the angle between the projected vectors converges to 1 almost surely.

Remark 3: The local refinement algorithm (Gao et al., 2018) may be adapted to the semi-supervised setting, but it requires prior knowledge of assortativity or dis-assortativity and a strong balance condition on the average degrees of the communities. When these conditions are not satisfied, we can construct examples where the error rate of AngleMin+ is o(1) but the error rate of local refinement is at least 0.5. See Section C of the appendix.

2.3 The choice of the unsupervised community detection algorithm

We now discuss how to obtain Π̂_U. In the statistical literature, there are several approaches to unsupervised community detection. The first is modularity maximization (Girvan & Newman, 2002). It exhaustively searches over cluster assignments and selects the one that maximizes an empirical modularity function. The second is spectral clustering (Jin, 2015). It applies k-means clustering to the rows of the matrix of empirical eigenvectors. Other methods include post-processing the output of spectral clustering by majority vote (Gao et al., 2018). Not every method handles degree heterogeneity and non-assortative communities as in the DCBM model. We use a recent spectral algorithm, SCORE+ (Jin et al., 2021), which allows for both severe degree heterogeneity and non-assortative communities.

SCORE+: We tentatively write A_UU = A and |U| = n, and assume the network (on the unlabeled nodes) is connected (otherwise, consider its giant component). SCORE+ first computes L = D_τ^{-1/2} A D_τ^{-1/2}, where D_τ = diag(d_1, ..., d_n) + 0.1·d_max·I_n and d_i is the degree of node i. Let λ̂_k be the kth eigenvalue of L (sorted in magnitude) and let ξ̂_k be the associated eigenvector. Let r = K or r = K+1 (see Jin et al. (2021) for details). Define R̂ ∈ R^{n×(r−1)} by R̂_ik = (λ̂_{k+1}/λ̂_1)·[ξ̂_{k+1}(i)/ξ̂_1(i)]. Run k-means on the rows of R̂.
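The SCORE+ step can be sketched as follows. This is a simplified illustration of the description above, not a faithful reimplementation of Jin et al. (2021): the default choice r = K, the handling of eigenvalue signs, and other tuning details are our simplifications.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def score_plus(A_UU, K, r=None, tau=0.1):
    """Simplified SCORE+ on the unlabeled subnetwork; returns labels in 0..K-1."""
    A_UU = np.asarray(A_UU, dtype=float)
    d = A_UU.sum(axis=1)
    d_tau = d + tau * d.max()                        # D_tau = diag(d) + 0.1 * d_max * I_n
    L = A_UU / np.sqrt(np.outer(d_tau, d_tau))       # D_tau^{-1/2} A D_tau^{-1/2}
    r = K if r is None else r
    vals, vecs = eigsh(L, k=r, which='LM')           # r leading eigenpairs by magnitude
    order = np.argsort(-np.abs(vals))
    vals, vecs = vals[order], vecs[:, order]
    # R_hat[i, k] = (lambda_{k+1} / lambda_1) * xi_{k+1}(i) / xi_1(i)
    R = (vals[1:] / vals[0]) * (vecs[:, 1:] / vecs[:, [0]])
    return KMeans(n_clusters=K, n_init=10).fit_predict(R)
```

This routine can be passed directly as `cluster_unlabeled` in the AngleMin+ sketch of Section 2.2.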
3 THEORETICAL PROPERTIES

We assume that the observed adjacency matrix follows the DCBM model in (1)-(2). From now on, let θ* denote the degree parameter of the new node n+1. Suppose k* ∈ {1, 2, ..., K} is its true community label, and the corresponding K-dimensional membership vector is π* = e_{k*}. In (2), θ and P are not separately identifiable. To ensure identifiability, we assume that all diagonal entries of P are equal to 1 (if this is not the case, we replace P by [diag(P)]^{-1/2} P [diag(P)]^{-1/2} and each θ_i in community k by θ_i·√(P_kk), which keeps Ω = ΘΠPΠ'Θ unchanged). In the asymptotic framework, we fix K and let n → ∞. We need some regularity conditions. For any symmetric matrix B, let ‖B‖_max denote its entry-wise maximum norm and λ_min(B) its smallest eigenvalue in magnitude. We assume that for a constant C_1 > 0 and a positive sequence β_n (which may tend to 0),

‖P‖_max ≤ C_1,   |λ_min(P)| ≥ β_n.   (8)

For 1 ≤ k ≤ K, let θ^(k) ∈ R^n be the vector with θ^(k)_i = θ_i·1{i ∈ C_k}, and let θ^(k)_L and θ^(k)_U be its sub-vectors restricted to the indices in L and U, respectively. We assume that for a constant C_2 > 0 and a properly small constant c_3 > 0,

max_k ‖θ^(k)‖_1 ≤ C_2·min_k ‖θ^(k)‖_1,   ‖θ^(k)_L‖² / (‖θ^(k)_L‖_1·‖θ‖_1) ≤ c_3·β_n,   for all 1 ≤ k ≤ K.   (9)

These conditions are mild. Consider (8). For identifiability, P is already scaled so that P_kk = 1 for all k. It is thus a mild condition to assume ‖P‖_max ≤ C_1. The condition |λ_min(P)| ≥ β_n is also mild, because we allow β_n → 0. Here, β_n captures the dissimilarity of communities. To see this, consider a special P whose diagonal entries are 1 and whose off-diagonal entries are all equal to b; in this example, |1 − b| captures the difference between within-community and between-community connectivity, and it can be shown that |λ_min(P)| = |1 − b|. Consider (9). The first condition requires that the total degrees of different communities are balanced, which is mild. The second condition is about degree heterogeneity. Let θ_max and θ̄ be the maximum and the average of the θ_i, respectively. In the second inequality of (9), the left-hand side is O(n^{-1}·θ_max/θ̄), so this condition is satisfied as long as θ_max/θ̄ = O(n·β_n). This is a very mild requirement.

3.1 The misclassification error of AngleMin+

For any |U|×K matrix H, let ψ̂_k(H) = ψ(f(A^(k); H), f(X; H)) denote the empirical angle in (6). AngleMin+ estimates the community label of the new node by the minimizer of ψ̂_1(H), ..., ψ̂_K(H), with H = Π̂_U. We first introduce a noiseless counterpart of ψ̂_k(H). Recall that Ω, the signal matrix, is as in (2). Let Ω^(k) ∈ R^n be defined by Ω^(k)_j = Σ_{i∈L∩C_k} Ω_ij, for 1 ≤ j ≤ n, and define

ψ_k(H) = ψ(f(Ω^(k); H), f(EX; H)),   for 1 ≤ k ≤ K.   (10)

The next lemma gives an explicit expression for ψ_k(H) for an arbitrary H.

Lemma 1. Consider the DCBM model where (8)-(9) are satisfied. Define three K×K matrices: G_LL = Π_L'Θ_LL Π_L, G_UU = Π_U'Θ_UU Π_U, and Q = G_UU^{-1} Π_U'Θ_UU H. For 1 ≤ k ≤ K,

ψ_k(H) = arccos( M_{kk*} / √(M_kk · M_{k*k*}) ),   where M = P(G_LL² + G_UU Q Q' G_UU)P.

The choice of H is flexible. For convenience, we focus on the class of H that are eligible community membership matrices, i.e., H = Π̂_U. Our theory can easily be extended to more general forms of H.

Definition 1. For any b_0 ∈ (0, 1), we say that Π̂_U is b_0-correct if min_T Σ_{i∈U} θ_i·1{Tπ̂_i ≠ π_i} ≤ b_0·‖θ‖_1, where the minimum is taken over all permutations T of the K columns of Π̂_U.
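Before moving to the main theorems, Lemma 1 is easy to sanity-check numerically. The sketch below (our own illustration with arbitrary assumed parameters) builds Ω^(k) and E[X] directly, applies f(·; H), and compares the cosine of the resulting angle with the closed-form value M_{kk*}/√(M_kk·M_{k*k*}).

```python
import numpy as np

rng = np.random.default_rng(7)
n, K, k_star = 400, 3, 1                       # community of the new node (0-indexed)
labels = rng.integers(0, K, size=n)
Pi = np.eye(K)[labels]
theta = rng.uniform(0.05, 0.4, size=n)
theta_star = 0.2
P = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.6], [0.3, 0.6, 1.0]])
L_idx, U_idx = np.arange(n // 2), np.arange(n // 2, n)   # an arbitrary labeled/unlabeled split

Omega = np.diag(theta) @ Pi @ P @ Pi.T @ np.diag(theta)  # signal among the existing nodes
EX = theta_star * theta * (Pi @ P[:, k_star])            # E[X_j] when the new node is in community k_star
H = np.eye(K)[rng.integers(0, K, size=U_idx.size)]       # any membership-type H

def f(x):                                                # f(x; H), eq. (5)
    return np.concatenate([x[L_idx] @ Pi[L_idx], x[U_idx] @ H])

Theta_L, Theta_U = np.diag(theta[L_idx]), np.diag(theta[U_idx])
G_LL = Pi[L_idx].T @ Theta_L @ Pi[L_idx]
G_UU = Pi[U_idx].T @ Theta_U @ Pi[U_idx]
Q = np.linalg.solve(G_UU, Pi[U_idx].T @ Theta_U @ H)
M = P @ (G_LL @ G_LL + G_UU @ Q @ Q.T @ G_UU) @ P

for k in range(K):
    Omega_k = Omega[L_idx[labels[L_idx] == k]].sum(axis=0)   # Omega^(k)
    u, v = f(Omega_k), f(EX)
    lhs_cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    rhs_cos = M[k, k_star] / np.sqrt(M[k, k] * M[k_star, k_star])
    print(k, np.isclose(lhs_cos, rhs_cos))    # True for every k; the value is 1 when k == k_star
```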
The next two theorems study ψ_k(H) and ψ̂_k(H), respectively, for H = Π̂_U.

Theorem 1. Consider the DCBM model where (8)-(9) hold. Let k* denote the true community label of the new node. Suppose Π̂_U is b_0-correct for a constant b_0 ∈ (0, 1). When b_0 is properly small, there exists a constant c_0 > 0, which does not depend on b_0, such that ψ_{k*}(Π̂_U) = 0 and min_{k≠k*}{ψ_k(Π̂_U)} ≥ c_0·β_n.

Theorem 2. Consider the DCBM model where (8)-(9) hold. There exists a constant C > 0 such that for any δ ∈ (0, 1/2), with probability 1 − δ, simultaneously for 1 ≤ k ≤ K,

|ψ̂_k(Π̂_U) − ψ_k(Π̂_U)| ≤ C·√( log(1/δ) / (‖θ‖_1·min{θ*, ‖θ^(k)_L‖_1}) ) + C·‖θ^(k)_L‖² / (‖θ^(k)_L‖_1·‖θ‖_1).

Write ψ̂_k = ψ̂_k(Π̂_U) and ψ_k = ψ_k(Π̂_U) for short. When max_k{|ψ̂_k − ψ_k|} < (1/2)·min_{k≠k*}{ψ_k}, the community label of the new node is correctly estimated. We can therefore immediately translate the results in Theorems 1-2 into an upper bound on the misclassification probability.

Corollary 1. Consider the DCBM model where (8)-(9) hold. Suppose that for some constants b_0 ∈ (0, 1) and ε ∈ (0, 1/2), Π̂_U is b_0-correct with probability 1 − ε. When b_0 is properly small, there exist constants C_0 > 0 and C > 0, which do not depend on (b_0, ε), such that

P(ŷ ≠ k*) ≤ ε + C·Σ_{k=1}^K exp( −C_0·β_n²·‖θ‖_1·min{θ*, ‖θ^(k)_L‖_1} ).

Remark 4: When min_k ‖θ^(k)_L‖_1 is at least of order θ*, the stochastic noise in X dominates the error, and the misclassification probability in Corollary 1 does not improve with more label information. Typically, the error rate is then the same as in the ideal case where Π_U is known (except that there is no ε in the ideal case). Hence, a little label information already makes AngleMin+ perform almost as well as a fully supervised algorithm that possesses all the label information. We formalize this in Section 3.2.

Remark 5: Notice that min_T Σ_{i∈U} θ_i·1{Tπ̂_i ≠ π_i} ≤ (1/K!)·Σ_T Σ_{i∈U} θ_i·1{Tπ̂_i ≠ π_i} ≤ ((K−1)/K)·‖θ_U‖_1. Therefore, if ‖θ_L‖_1 ≥ (1 − K·b_0/(K−1))·‖θ‖_1, then min_T Σ_{i∈U} θ_i·1{Tπ̂_i ≠ π_i} ≤ b_0·‖θ‖_1 always holds. In other words, as long as the label information is strong enough, AngleMin+ requires no assumption on the unsupervised community detection algorithm.

For AngleMin+ to be consistent, we need the bound in Corollary 1 to be o(1). This requires that, for a small constant b_0, Π̂_U is b_0-correct with probability 1 − o(1). This is a mild requirement and can be achieved by several unsupervised community detection algorithms. The next corollary studies the specific version of AngleMin+ in which Π̂_U is obtained from SCORE+:

Corollary 2. Consider the DCBM model where (8)-(9) hold. We apply SCORE+ to obtain Π̂_U and plug it into AngleMin+. As n → ∞, suppose that for some constant q_0 > 0, min_{i∈U} θ_i ≥ q_0·max_{i∈U} θ_i, β_n·‖θ_U‖ ≥ q_0·√(log n), β_n²·‖θ‖_1·θ* → ∞, and β_n²·‖θ‖_1·min_k{‖θ^(k)_L‖_1} → ∞. Then P(ŷ ≠ k*) → 0, so the AngleMin+ estimate is consistent.

3.2 Comparison with an information-theoretic lower bound

We compare the performance of AngleMin+ with an ideal estimate that has access to all model parameters except the community label k* of the new node. For simplicity, we first consider the case K = 2. For any label predictor ỹ for the new node, define Risk(ỹ) = Σ_{k∈[K]} P(ỹ ≠ k | π* = e_k).

Lemma 2. Consider a DCBM with K = 2 and P = (1−b)·I_2 + b·1_2 1_2'. Suppose θ* = o(1), min_k ‖θ^(k)_L‖_1 = o(1), 1 − b = o(1), and ‖θ^(1)_L‖_1 / ‖θ^(2)_L‖_1 = ‖θ^(1)_U‖_1 / ‖θ^(2)_U‖_1 = 1. There exists a constant c_4 > 0 such that

inf_ỹ {Risk(ỹ)} ≥ c_4·exp{ −2[1 + o(1)]·((1−b)²/8)·θ*·(‖θ_L‖_1 + ‖θ_U‖_1) },

where the infimum is taken over all measurable functions of A, X, and the parameters Π_L, Π_U, Θ, P, θ*.
In AngleMin+, suppose the second part of condition (9) holds with c_3 = o(1) and Π̂_U is b_0-correct with b_0 → 0 almost surely. Then there is a constant C_4 > 0 such that

Risk(ŷ) ≤ C_4·exp{ −[1 − o(1)]·((1−b)²/8)·θ*·(‖θ_L‖_1² + ‖θ_U‖_1²)² / (‖θ_L‖_1³ + ‖θ_U‖_1³) }.

Lemma 2 indicates that the classification error of AngleMin+ is almost the same as the information-theoretic lower bound for an algorithm that knows all the parameters except π*, apart from a mild difference in the exponents. This difference comes from two sources. The first is the extra factor "2" in the exponent of the lower bound on inf_ỹ{Risk(ỹ)}, which is largely an artifact of the proof technique, because we bound the total variation distance by the Hellinger distance (the total variation distance is hard to analyze directly). The second is the difference between ‖θ_L‖_1 + ‖θ_U‖_1 in inf_ỹ{Risk(ỹ)} and (‖θ_L‖_1² + ‖θ_U‖_1²)²/(‖θ_L‖_1³ + ‖θ_U‖_1³) in Risk(ŷ). Note that

(‖θ_L‖_1² + ‖θ_U‖_1²)² / (‖θ_L‖_1³ + ‖θ_U‖_1³) ≤ ‖θ_L‖_1 + ‖θ_U‖_1 ≤ 1.125·(‖θ_L‖_1² + ‖θ_U‖_1²)² / (‖θ_L‖_1³ + ‖θ_U‖_1³),

so this difference is quite mild. It arises because AngleMin+ does not aggregate the information in the labeled and unlabeled data by adding the first and last K coordinates of f(x; H) together. The reason we do not do this is that unsupervised community detection methods provide class labels only up to a permutation; in practice, it is very hard to estimate this permutation, and doing so would make the algorithm extremely unstable. To conclude, the difference between the error rate of our method and the information-theoretic lower bound is mild, demonstrating that our algorithm is nearly optimal. For a general K, we have a similar conclusion:

Theorem 3. Suppose the conditions of Corollary 1 hold, where b_0 is properly small, and suppose that Π̂_U is b_0-correct. Furthermore, assume that for a sufficiently large constant C_3, θ* ≥ 1/C_3, θ* ≤ C_3·min_{k∈[K]} ‖θ^(k)_L‖_1, and that, for a constant r_0 > 0, min_{k≠ℓ}{P_kℓ} ≥ r_0. Then there is a constant c_2 = c_2(K, C_1, C_2, C_3, c_3, r_0) > 0 such that [−log(c_2·Risk(ŷ))] / [−log(inf_ỹ{Risk(ỹ)})] ≥ c_2.

3.3 In-sample classification

In this part, we briefly discuss the in-sample classification problem. Formally, our goal is to estimate π_i for all i ∈ U. As mentioned in Section 1, an in-sample classification algorithm can be directly derived from AngleMin+: for each i ∈ U, predict the label of node i as ŷ_i(H_i) = argmin_{1≤k≤K} ψ(f(A^(k)_{-i}; H_i), f(A_{-i,i}; H_i)), where A^(k)_{-i} is the subvector of A^(k) with the ith entry removed, A_{-i,i} is the ith column of A with the ith entry removed, and H_i is a (|U|−1)×K projection matrix that may differ across i. As discussed in Section 2.2, the choices of H_i are quite flexible. For theoretical convenience, we focus on the case H_i = Π̂_{U\{i}}. For any in-sample classifier ỹ = (ỹ_i)_{i∈U} ∈ [K]^{|U|}, define the in-sample risk Risk_ins(ỹ) = (1/|U|)·Σ_{i∈U} Σ_{k∈[K]} P(ỹ_i ≠ k | π_i = e_k). For the above in-sample classification algorithm, we have theoretical results on consistency and efficiency similar to those in Sections 3.1-3.2, under some very mild conditions:

Theorem 4. Consider the DCBM model where (8)-(9) hold. We apply SCORE+ to obtain Π̂_{U\{i}} and plug it into the above algorithm. As n → ∞, suppose that for some constant q_0 > 0, min_{i∈U} θ_i ≥ q_0·max_{i∈U} θ_i, β_n·‖θ_U‖ ≥ q_0·√(log n), β_n²·‖θ‖_1·min_{i∈U} θ_i → ∞, and β_n²·‖θ‖_1·min_k{‖θ^(k)_L‖_1} → ∞. Then (1/|U|)·Σ_{i∈U} P(ŷ_i ≠ k_i) → 0, where k_i denotes the true community label of node i, so the above in-sample classification algorithm is consistent.

Theorem 5. Suppose the conditions of Corollary 1 hold, where b_0 is properly small, and suppose that Π̂_{U\{i}} is b_0-correct for all i ∈ U.
Furthermore, assume that for a sufficiently large constant C_3, max_{i∈U} θ_i ≥ 1/C_3, max_{i∈U} θ_i ≤ C_3·min_{k∈[K]} ‖θ^(k)_L‖_1, log(|U|) ≤ C_3·β_n²·‖θ‖_1·min_{i∈U} θ_i, and that, for a constant r_0 > 0, min_{k≠ℓ}{P_kℓ} ≥ r_0. Then there is a constant c_21 = c_21(K, C_1, C_2, C_3, c_3, r_0) > 0 such that [−log(c_21·Risk_ins(ŷ))] / [−log(inf_ỹ{Risk_ins(ỹ)})] ≥ c_21, so the above in-sample classification algorithm is efficient.

Figure 1: Simulations (n = 500, K = 3; data are generated from the DCBM). Eight panels: (a) √(log(n)/n)·Gamma, balanced; (b) √(log(n)/n)·Gamma, imbalanced; (c) n^{-0.25}·Gamma, balanced; (d) n^{-0.25}·Gamma, imbalanced; (e) √(log(n)/n)·Pareto, balanced; (f) √(log(n)/n)·Pareto, imbalanced; (g) n^{-0.25}·Pareto, balanced; (h) n^{-0.25}·Pareto, imbalanced. In each plot, the x-axis is the number of labeled nodes N_L (from 100 to 400) and the y-axis is the average misclassification rate over 100 repetitions, for AngleMin+ (subnetwork), AngleMin, AngleMin+, SNMF, and unsupervised SCORE+.

4 EMPIRICAL STUDY

We study the performance of AngleMin+, where Π̂_U is obtained from SCORE+ (Jin et al., 2021). We compare our method with SNMF (Yang et al., 2015), a representative semi-supervised approach, and with SCORE+, a fully unsupervised approach. We also compare our algorithm with typical GNN methods (Kipf & Welling, 2016) on the real data.

Simulations: To illustrate how the information in A_UU improves classification accuracy, we include AngleMin in (4) in the simulations. Also, to show how the information in the unlabeled data improves classification accuracy, we consider a special version of AngleMin+ that is given only A_LL and X_L as input. It ignores the unlabeled data and uses only the subnetwork consisting of labeled nodes. We call it AngleMin+ (subnetwork). This method is practically uninteresting, but it serves as a representative of a fully supervised approach that ignores unlabeled nodes. We simulate data from the DCBM with (n, K) = (500, 3). To generate P, we draw its off-diagonal entries from Uniform(0, 1) and then symmetrize the matrix. We generate the degree heterogeneity parameters θ_i i.i.d. from one of the following four distributions: n^{-0.5}·√(log n)·Gamma(3.5), n^{-0.25}·Gamma(3.5), n^{-0.5}·√(log n)·Pareto(3.5), and n^{-0.25}·Pareto(3.5). They cover most scenarios: Gamma distributions have considerable mass near 0, so the network has severely low-degree nodes; Pareto distributions have heavy tails, so the network has severely high-degree nodes. The scaling n^{-0.5}·√(log n) corresponds to the sparse regime, where the average node degree is of order log²(n), and n^{-0.25} corresponds to the dense regime, with average node degree of order √n. We consider two cases for Π: the balanced case (bal.) and the imbalanced case (imbal.). In the former, the π_i are i.i.d. from Multinomial(1/3, 1/3, 1/3), and in the latter, they are i.i.d. from Multinomial(0.2, 0.2, 0.6). We repeat the simulation 100 times.
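For concreteness, the data-generating process just described can be sketched as follows. This is our own illustrative code, not the authors'; the exact Gamma/Pareto parameterizations, the clipping of θ_i into (0, 1], and setting the diagonal of P to 1 are our assumptions. It reuses `sample_dcbm` from the sketch in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(2023)
n, K = 500, 3

def draw_theta(dist, regime):
    base = rng.gamma(3.5, size=n) if dist == 'gamma' else rng.pareto(3.5, size=n) + 1.0
    scale = np.sqrt(np.log(n) / n) if regime == 'sparse' else n ** (-0.25)
    return np.clip(scale * base, 1e-3, 1.0)            # keep each theta_i inside (0, 1]

def draw_Pi(balanced=True):
    probs = [1/3, 1/3, 1/3] if balanced else [0.2, 0.2, 0.6]
    return np.eye(K)[rng.choice(K, size=n, p=probs)]

P = rng.uniform(0.0, 1.0, size=(K, K))
P = (P + P.T) / 2.0                                    # symmetrize the off-diagonal draws
np.fill_diagonal(P, 1.0)                               # P_kk = 1 (identifiability scaling)

theta = draw_theta('gamma', 'sparse')                  # e.g., sqrt(log(n)/n) * Gamma(3.5)
Pi = draw_Pi(balanced=True)
# A = sample_dcbm(theta, Pi, P, rng)                   # then hide the labels of n - N_L nodes,
#                                                      # run each method, and record its error
```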
Our results are presented in Figure 1, which shows the average classification error of each algorithm as the number of labeled nodes, N_L, increases. The plots indicate that AngleMin+ outperforms the other methods in all cases. Furthermore, though AngleMin is not as good as AngleMin+ when N_L is small, it still surpasses all the other approaches except AngleMin+ in most scenarios. Compared with the supervised and unsupervised methods, which use only part of the data, AngleMin+ gains a great deal of accuracy by leveraging both the labeled and the unlabeled data.

Real data: We consider three benchmark datasets for community detection: Caltech (Traud et al., 2012), Simmons (Traud et al., 2012), and Polblogs (Adamic & Glance, 2005). For each dataset, we separate the nodes into 10 folds and treat each fold as test data in turn, with the other 9 folds as training data. In the training network, we randomly choose n_L nodes as labeled nodes. We then estimate the label of each node in the test data and report the misclassification error rate (averaged over the 10 folds). We consider n_L/n ∈ {0.3, 0.5, 0.7}, where n is the number of nodes in the training data. The results are shown in Table 1. In most cases, AngleMin+ significantly outperforms the other methods (unsupervised or semi-supervised). Additionally, we notice that on the Polblogs data the standard deviation of the error of SCORE+ is quite large, indicating that its performance is unstable. Remarkably, even though AngleMin+ uses SCORE+ for initialization, its performance is nearly unaffected: it still achieves low means and standard deviations of the misclassification error. This is consistent with our theory in Section 3. We also compare the running times of the different methods (see Section B of the appendix) and find that AngleMin+ is much faster than SNMF.

Table 1: Average misclassification error over 10 data splits, with standard deviations in parentheses. SCORE+ is unsupervised, so its error does not depend on n_L/n and is reported once per dataset.

Dataset (n, K) | n_L/n | SCORE+ | AngleMin+ | SNMF | GNN (cons.) | GNN (random) | GNN (adj.) | GNN (LP) | GNN (node2vec) | GNN (AΠ)
Caltech (590, 8) | 0.3 | 0.237 (0.061) | 0.207 (0.059) | 0.312 (0.049) | 0.858 (0.038) | 0.859 (0.035) | 0.875 (0.038) | 0.839 (0.046) | 0.859 (0.055) | 0.880 (0.026)
Caltech (590, 8) | 0.5 | — | 0.151 (0.040) | 0.310 (0.042) | 0.846 (0.054) | 0.895 (0.026) | 0.859 (0.037) | 0.861 (0.043) | 0.859 (0.039) | 0.856 (0.040)
Caltech (590, 8) | 0.7 | — | 0.137 (0.046) | 0.264 (0.051) | 0.849 (0.043) | 0.861 (0.034) | 0.856 (0.031) | 0.859 (0.036) | 0.880 (0.027) | 0.842 (0.027)
Simmons (1137, 4) | 0.3 | 0.234 (0.084) | 0.128 (0.024) | 0.266 (0.041) | 0.691 (0.022) | 0.702 (0.039) | 0.702 (0.036) | 0.698 (0.026) | 0.706 (0.039) | 0.696 (0.028)
Simmons (1137, 4) | 0.5 | — | 0.096 (0.024) | 0.233 (0.033) | 0.691 (0.022) | 0.711 (0.034) | 0.685 (0.025) | 0.691 (0.022) | 0.710 (0.031) | 0.691 (0.022)
Simmons (1137, 4) | 0.7 | — | 0.092 (0.015) | 0.220 (0.037) | 0.691 (0.022) | 0.692 (0.022) | 0.691 (0.022) | 0.691 (0.022) | 0.707 (0.043) | 0.698 (0.026)
Polblogs (1222, 2) | 0.3 | 0.166 (0.165) | 0.074 (0.036) | 0.073 (0.019) | 0.499 (0.044) | 0.502 (0.038) | 0.439 (0.048) | 0.482 (0.037) | 0.502 (0.059) | 0.501 (0.044)
Polblogs (1222, 2) | 0.5 | — | 0.092 (0.041) | 0.068 (0.033) | 0.517 (0.040) | 0.516 (0.038) | 0.453 (0.056) | 0.488 (0.044) | 0.499 (0.061) | 0.484 (0.041)
Polblogs (1222, 2) | 0.7 | — | 0.066 (0.026) | 0.063 (0.028) | 0.485 (0.041) | 0.492 (0.043) | 0.430 (0.062) | 0.493 (0.041) | 0.492 (0.050) | 0.486 (0.039)

GNN is a popular approach for attributed node clustering. Although it is not designed for the case of no node attributes, we are still interested in whether GNN can be easily adapted to our setting via self-created features.
We take the GCN method in Kipf & Welling (2016) and consider six schemes for creating a feature vector for each node: i) a 50-dimensional constant vector of 1's, ii) a 50-dimensional randomly generated feature vector, iii) the n-dimensional adjacency vector, iv) the vector of landing probabilities (LP) (Li et al., 2019), which contains network-topology information, v) the embedding vector from node2vec (Grover & Leskovec, 2016), and vi) a practically infeasible vector e_i'AΠ ∈ R^K (which uses the true Π). The results are in Table 1. GCN performs unsatisfactorily, regardless of how the features are created. For example, propagating messages with all-1 vectors seems to result in over-smoothing, and using adjacency vectors as node features means that the size of the feature-transformation linear layers changes with the number of nodes in the network, which can heavily overfit due to too many parameters. We conclude that it is not easy to adapt GNN to the case of no node attributes.

For a fairer comparison, we also consider a real network, Citeseer (Sen et al., 2008), that contains node features. We consider two state-of-the-art semi-supervised GNN algorithms, GCN (Kipf & Welling, 2016) and MRFasGCN (Jin et al., 2019). Our method can also be generalized to accommodate node features. Using the "fusion" idea surveyed in Chunaev et al. (2019), we "fuse" the adjacency matrix Ā (on n+1 nodes) and the node features into a weighted adjacency matrix Ā_fuse (see the appendix for details). We denote its top-left n×n block by A_fuse ∈ R^{n×n} and the first n entries of its last column by X_fuse ∈ R^n, and apply AngleMin+ with (A, X) replaced by (A_fuse, X_fuse). The misclassification error averaged over 10 data splits is reported in Table 2. The error rates of GCN and MRFasGCN quoted from the original papers are based on one particular data split. We also re-run GCN on our 10 data splits.

Table 2: Error rates on Citeseer, where node attributes are available. Entries marked with * are quoted from the literature and based on one particular data split; the others are averaged over 10 data splits.

Dataset | n | K | n_L/n | GCN | GCN* | MRFasGCN* | AngleMin+
Citeseer | 3312 | 6 | 0.036 | 0.321 | 0.297 | 0.268 | 0.334

Conclusion and discussions: In this paper, we propose AngleMin+, a fast semi-supervised community detection algorithm based on a structural similarity metric under the DCBM. Our method is able to address degree heterogeneity and non-assortative networks, is computationally fast, and possesses favorable theoretical properties on consistency and efficiency. Our algorithm also performs well in both simulations and real data, indicating its practical value. There are possible extensions of our method. It does not directly deal with soft labels (a.k.a. mixed memberships), where the available label information is the probability that a node belongs to each community. We are currently working on this extension by fitting our algorithm into the degree-corrected mixed membership (DCMM) model and developing sharp theory for it.

ACKNOWLEDGMENTS

This work is partially supported by the NSF CAREER grant DMS-1943902.

ETHICS STATEMENT

This paper proposes a novel semi-supervised community detection algorithm, AngleMin+, based on a structural similarity metric under the DCBM. Our method could be maliciously used to identify certain groups of people, such as dissenters. This is a common drawback of all community detection algorithms, and we think it can be mitigated by replacing the network data with a differentially private counterpart.
All the real data we use come from public datasets which we have clearly cited, and we do not think that they will raise any privacy issues or other potential problems. REPRODUCIBILITY STATEMENT We provide detailed theory on our algorithm Angle Min+. we derive explicit bounds for the misclassification probability of our method under DCBM, and show that it is consistent. We also study the efficiency of our method by comparing its misclassification probability with that of an ideal classifier having access to the community labels of all nodes. Additionally, we provide clear explanations and insights of our theory. All the proofs, together with some generalization of our theory, are available in the appendix. Also, we perform empirical study on our proposed algorithms under both simulations and real data settings, and we consider a large number of scenarios in both cases. All the codes are available in the supplementary materials. Lada A Adamic and Natalie Glance. The political blogosphere and the 2004 us election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery, pp. 36 43, 2005. Richard F Betzel, Maxwell A Bertolero, and Danielle S Bassett. Non-assortative community structure in resting and task-evoked functional brain networks. bio Rxiv, pp. 355016, 2018. Peter J Bickel and Aiyou Chen. A nonparametric view of network models and newman girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068 21073, 2009. Petr Chunaev, Ivan Nuzhdenko, and Klavdiya Bochenina. Community detection in attributed social networks: a unified weight-based model and its regimes. In 2019 International Conference on Data Mining Workshops (ICDMW), pp. 455 464. IEEE, 2019. Souvik Debnath, Niloy Ganguly, and Pabitra Mitra. Feature weighting in content based recommendation system using social network analysis. In Proceedings of the 17th international conference on World Wide Web, pp. 1041 1042, 2008. Chao Gao, Zongming Ma, Anderson Y Zhang, and Harrison H Zhou. Community detection in degree-corrected block models. The Annals of Statistics, 46(5):2153 2185, 2018. Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the national academy of sciences, 99(12):7821 7826, 2002. Anna Goldenberg, Alice X Zheng, Stephen E Fienberg, Edoardo M Airoldi, et al. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129 233, 2010. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855 864, 2016. Karl E Gustafson and Duggirala KM Rao. Numerical range. In Numerical range, pp. 1 26. Springer, 1997. Published as a conference paper at ICLR 2023 Min Ji, Dawei Zhang, Fuding Xie, Ying Zhang, Yong Zhang, and Jun Yang. Semisupervised community detection by voltage drops. Mathematical Problems in Engineering, 2016, 2016. Pengsheng Ji, Jiashun Jin, Zheng Tracy Ke, and Wanshan Li. Co-citation and co-authorship networks of statisticians. Journal of Business & Economic Statistics, 40(2):469 485, 2022. Di Jin, Ziyang Liu, Weihao Li, Dongxiao He, and Weixiong Zhang. Graph convolutional networks meet markov random fields: Semi-supervised community detection in attribute networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp. 152 159, 2019. Jiashun Jin. Fast community detection by score. 
The Annals of Statistics, 43(1):57 89, 2015. Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Improvements on SCORE, especially for weak signals. Sankhya A, 84(1):127 162, mar 2021. doi: 10.1007/s13171-020-00240-1. URL https://doi.org/10.1007%2Fs13171-020-00240-1. Jiashun Jin, Zheng Tracy Ke, Shengming Luo, and Minzhe Wang. Optimal estimation of the number of network communities. Journal of the American Statistical Association, pp. 1 16, 2022. Brian Karrer and Mark EJ Newman. Stochastic blockmodels and community structure in networks. Physical review E, 83(1):016107, 2011. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ar Xiv preprint ar Xiv:1609.02907, 2016. Mingwei Leng and Tao Ma. Semi-supervised community detection: A survey. In Proceedings of the 2019 7th International Conference on Information Technology: Io T and Smart City, pp. 137 140, 2019. Pan Li, I Chien, and Olgica Milenkovic. Optimizing generalized pagerank methods for seed-expansion community detection. Advances in Neural Information Processing Systems, 32, 2019. Xiaodong Li, Yudong Chen, and Jiaming Xu. Convex relaxation methods for community detection. Statistical Science, 36(1):2 15, 2021. Dong Liu, Xiao Liu, Wenjun Wang, and Hongyu Bai. Semi-supervised community detection based on discrete potential theory. Physica A: Statistical Mechanics and its Applications, 416:173 182, 2014. Fuchen Liu, David Choi, Lu Xie, and Kathryn Roeder. Global spectral clustering in dynamic networks. Proceedings of the National Academy of Sciences, 115(5):927 932, 2018. Veronica Red, Eric D Kelsic, Peter J Mucha, and Mason A Porter. Comparing community structure to characteristics in online collegiate social networks. SIAM review, 53(3):526 543, 2011. Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878 1915, 2011. Patrick Rubin-Delanchy, Joshua Cape, Minh Tang, and Carey E Priebe. A statistical interpretation of spectral embedding: the generalised random dot product graph. ar Xiv preprint ar Xiv:1709.05506, 2017. Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93 93, 2008. Bracha Shapira, Lior Rokach, and Shirley Freilikhman. Facebook single and cross domain data for recommendation systems. User Modeling and User-Adapted Interaction, 23(2):211 247, 2013. Amanda L Traud, Peter J Mucha, and Mason A Porter. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications, 391(16):4165 4180, 2012. Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer New York : Imprint: Springer, New York, NY, 1st ed. 2009. edition, 2009. ISBN 1-283-07270-X. Published as a conference paper at ICLR 2023 J. V Uspensky. Introduction to mathematical probability. Mc Graw-Hill, New York, 1937. Liang Yang, Xiaochun Cao, Di Jin, Xiao Wang, and Dan Meng. A unified semi-supervised community detection framework using latent space graph regularization. IEEE Transactions on Cybernetics, 45(11):2585 2598, 2015. doi: 10.1109/TCYB.2014.2377154. Kuang Zhou, Arnaud Martin, Quan Pan, and Zhunga Liu. Selp: Semi-supervised evidential label propagation algorithm for graph data clustering. International Journal of Approximate Reasoning, 92:139 154, 2018. 
Published as a conference paper at ICLR 2023 A Pseudo code of the algorithm 13 B Running Time 13 C Comparison with Local Refinement Algorithm 13 D Generalization of Lemma 2 15 E Preliminaries 15 F Proof of Lemma 1 17 G Proof of Theorem 1 19 G.1 Proof of Lemma 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 H Proof of Theorem 2 25 H.1 Proof of Lemma 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 H.2 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 I Proof of Corollary 1, 2 36 I.1 Proof of Corollary 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 I.2 Proof of Corollary 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 J Proof of Lemma 2 38 J.1 Proof of Lower Bound (75) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 J.2 Proof of Upper Bound (76 ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 K Proof of Theorem 3 48 L Proof of Theorem 4, 5 52 L.1 Proof of Theorem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 L.2 Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Published as a conference paper at ICLR 2023 A PSEUDO CODE OF THE ALGORITHM Below are the pseudo code of Angle Min+ which is deferred to the appendix due to the page limit. Algorithm 1: Angle Min+ Input: Number of communities K, adjacency matrix A Rn n, community labels yi for nodes in i L, and the vector of edges between a new node and the existing nodes X Rn . Output: Estimated community label ˆy of the new node. 1. Unsupervised community detection: Apply a community detection algorithm (e.g., SCORE+ in Section 2) on AUU, and let ˆΠU = [ˆπi]i U store the estimated community labels, where ˆπi = ek if and only if node k is clustered to community k, 1 k K. 2. Assigning the community label to a new node: Let ΠL = [πi]i L contain the community memberships of labeled nodes, where πi = ek if and only if yi = k, 1 k K. Let H = ˆΠU. Compute x = X LΠL, X UH , vk = e kΠ LALLΠL, e kΠ LALUH , 1 k K. Suppose k minimizes the angle between vk and x, among 1 k K (if there is a tie, pick the smaller k). Output ˆy = k . B RUNNING TIME Table 3 exhibits the running time of all the algorithms considered in Table 1. It can be seen from the result that our algorithm Angle Min+ is much faster than all the other algorithms. This is one of the merits of our method. Table 3: Running time on Caltech, Simons, and Polblogs networks. The quantities outside and inside the parentheses are the means and standard deviations of the running time, respectively. Dataset n K n L/n SCORE+ Angle Min+ SNMF GNN (cons.) GNN (random) GNN (adj.) 
GNN (LP) GNN (node2vec) GNN (AΠ) Caltech 590 8 0.3 0.083 (0.009) 0.068 (0.064) 0.178 (0.017) 0.277 (0.154) 0.249 (0.049) 0.311 (0.100) 0.296 (0.044) 0.498 (0.053) 0.396 (0.097) 0.5 0.034 (0.003) 0.211 (0.069) 0.575 (0.133) 0.535 (0.061) 0.620 (0.133) 0.609 (0.067) 0.836 (0.080) 0.649 (0.045) 0.7 0.022 (0.003) 0.211 (0.054) 0.861 (0.099) 0.892 (0.116) 1.068 (0.213) 0.949 (0.049) 1.204 (0.186) 0.998 (0.068) Simmons 1137 4 0.3 0.157 (0.008) 0.075 (0.008) 0.515 (0.036) 0.334 (0.086) 0.344 (0.102) 0.564 (0.273) 0.421 (0.094) 1.045 (0.680) 0.455 (0.087) 0.5 0.054 (0.011) 0.577 (0.090) 0.691 (0.199) 0.692 (0.084) 1.245 (0.691) 0.642 (0.032) 1.106 (0.151) 0.685 (0.059) 0.7 0.031 (0.003) 0.541 (0.073) 0.988 (0.139) 0.897 (0.056) 1.208 (0.454) 0.958 (0.057) 1.977 (0.775) 1.046 (0.069) Polblogs 1222 2 0.3 0.093 (0.014) 0.054 (0.006) 0.356 (0.034) 0.402 (0.127) 0.353 (0.093) 0.444 (0.160) 0.311 (0.055) 0.810 (0.261) 0.343 (0.031) 0.5 0.031 (0.004) 0.431 (0.098) 0.780 (0.147) 0.700 (0.181) 0.965 (0.179) 0.649 (0.054) 1.031 (0.190) 0.644 (0.044) 0.7 0.022 (0.004) 0.351 (0.037) 1.135 (0.118) 1.152 (0.314) 1.430 (0.169) 0.986 (0.149) 1.408 (0.210) 0.999 (0.060) C COMPARISON WITH LOCAL REFINEMENT ALGORITHM We would first illustrate why local refinement may not work with an example and then explain our insight behind it. Consider a network with n = 4m nodes and K = 2 communities. Suppose that there are 2m labeled nodes, m of them are in community C1 and have degree heterogeneity θ = 0.8, and the other m of them are in community C2 and have degree heterogeneity θ = 0.5. There are 2m unlabeled nodes, m of them are in community C1 and have degree heterogeneity θ = 0.6, and the other m of them are in community C2 and have degree heterogeneity θ = 0.7. The P matrix is defined as follows: P = 1 0.9 0.9 1 Under this setting, all the assumptions in our paper are satisfied. On the other hand, recall that the prototypical refinement algorithm, Algorithm 2 of Gao et al. (2018) is defined as follows: ˆyi = arg max u [K] 1 |{j : ˆy0(j) = u}| {j:ˆy0(j)=u} Aij Published as a conference paper at ICLR 2023 where ˆy0 is a vector of community label and ˆy is the refined community label. For semi-supervised setting, one may consider the following modification of local refinement algorithm: (i) Apply local refinement algorithm, with known labels to assign nodes in U. (ii) With the labels of all nodes, one updates the labels of every node by applying the same refinement procedure. Under the setting of our toy example, for step (i), all the unlabeled nodes which are actually in community C2 will be assigned to community C1 with probability converging to 1 as n . The reason is that for any unlabeled node i which is actually in community C2, when u = 1, {Aij : j L, yj = u} are iid Bern(θiθj P21) = Bern(θi 0.8 0.9) = Bern(0.72θi); when u = 2, {Aij : j L, yj = u} are iid Bern(θiθj P22) = Bern(θi 0.5 1) = Bern(0.5θi). Hence, by law of large numbers, 1 {Aij : j L, yj = u} {Aij:j L,yj=u} Aij a.s. 0.72θi, u = 1 0.5θi, u = 2 Consequently, the prototypical refinement algorithm will incorrectly assign all the unlabeled nodes which are actually in the community C2 to C1 with probability converging to 1 as n . This will cause a classification error of at least 50%. Based on the huge classification error in step (i), step (ii) will also perform poorly. Similar to the reasoning above, by law of large numbers, it can be shown that after step (ii). 
the algorithm will still assign all the unlabeled nodes which are actually in the community C2 to C1 with probability converging to 1 as n . In other words, even if the local refinement algorithm is applied to the whole network, a classification error of at least 50% will always remain. Even if all the labels of the nodes are known, applying the local refinement algorithm still can cause severe errors. Still consider our toy example. Suppose now that we know the label of all the nodes, and we perform the local refinement algorithm on these known labels in an attempt to purify them. By the law of large numbers, however, it is not hard to show that for any node i which is actually in community C2, 1 {Aij : j L, yj = u} {Aij:yj=u} Aij a.s. 0.63θi, u = 1 0.6θi, u = 2 Consequently, similar to the previous cases, the local refinement algorithm will incorrectly assign all the unlabeled nodes which are actually in the community C2 to C1 with probability converging to 1 as n . This will cause a classification error of at least 50%, even though the input of the algorithm is actually the true label vector. To conclude, in general, the local refinement algorithms may not work under the broad settings of our paper. Intrinsically, label refinement is quite challenging when there is moderate degree heterogeneity, not to mention the scenarios where non-assortative networks occur. Local refinement algorithm works theoretically because strong assumptions on degree heterogeneity are imposed. For instance, it is required that the mean of the degree heterogeneity parameter in each community is 1 + o(1), which means that the network is extremely dense and that the degree heterogeneity parameters across communities are strongly balanced. Both of these two assumptions are hardly true in the real world, where most of the networks are sparse and imbalanced. Gao et al. (2018) is a very good paper, but we think that local refinement algorithm or similar algorithms might not be good choices for our problem. Published as a conference paper at ICLR 2023 D GENERALIZATION OF LEMMA 2 In the main paper, for the smoothness and comprehensibility of the text, we do not present the most general form of Lemma 2. We have a more general version of the lemma by relaxing the condition θ(1) L 1 θ(2) L 1 = θ(1) U 1 θ(2) U 1 = 1. Please see Section J for more details. E PRELIMINARIES For any positive integer N, Define [N] = {1, 2, ..., N}. For a matrix D and two index sets S1, S2, define DS1S2 to be the submatrix (Dij)i S1,j S2, DS1 to be the submatrix (Dij)i S1,j L U, and D S2 to be the submatrix (Dij)i L U,| S . The two main assumptions (8), (9) in the main paper are presented below for convenience. P max C1, |λmin(P)| βn. (8) max1 k K{ θ(k) 1} min1 k K{ θ(k) 1} C2, max 1 k K θ(k) L 1 θ 1 , where constant c3 is properly small. We would specify this precisely in our proofs. A number of lemmas used in our proofs will be presented as follows. The following lemma shows that sin x and x have the same order. Lemma 3. Let x R. When x 0, sin x x; when x [0, π 2 ], sin x 2 Lemma 3 is quite obvious, but for the completeness of our work, we provide a proof for it. Proof. Let g1(x) = sin x x. Then d dxg1(x) = cos x 1 0 Hence g1(x) is monotonously decreasing on R. As a result, when x 0, g1(x) g1(0) = 0. Therefore, when x 0, sin x x. Let g2(x) = sin x 2 πx. Then d dxg2(x) = cos x 2 Since cos x is monotonously decreasing on [0, π 2 ], d dxg2(x) 0 when x [0, arccos 2 π] and d dxg2(x) 0 when x [arccos 2 2 ]. 
Hence, $g_2(x)$ is monotonically increasing on $[0, \arccos\frac{2}{\pi}]$ and monotonically decreasing on $[\arccos\frac{2}{\pi}, \frac{\pi}{2}]$. As a result, when $x\in[0, \frac{\pi}{2}]$,
$$g_2(x) \ge \min\{g_2(0),\, g_2(\tfrac{\pi}{2})\} = 0.$$
Therefore, when $x\in[0, \frac{\pi}{2}]$, $\sin x \ge \frac{2}{\pi}x$.

The following lemma demonstrates that the angle $\psi(u, v)$ in Definition 1 satisfies the triangle inequality, so that it can be regarded as a sort of "metric".

Lemma 4 (Angle Inequality). Let $x, y, z$ be three real vectors. Then
$$\psi(x, z) \le \psi(x, y) + \psi(y, z).$$

The proof of Lemma 4 can be found in Gustafson & Rao (1997), page 56. For the completeness of our work, we also provide a proof of Lemma 4.

Proof. Let $\bar{x} = x/\|x\|$, $\bar{y} = y/\|y\|$, $\bar{z} = z/\|z\|$. Then $\cos\psi(x, y) = \langle\bar{x}, \bar{y}\rangle$, $\cos\psi(y, z) = \langle\bar{y}, \bar{z}\rangle$, $\cos\psi(x, z) = \langle\bar{x}, \bar{z}\rangle$. Consider the matrix
$$G = \begin{pmatrix} 1 & \cos\psi(x, y) & \cos\psi(x, z)\\ \cos\psi(x, y) & 1 & \cos\psi(y, z)\\ \cos\psi(x, z) & \cos\psi(y, z) & 1\end{pmatrix} = \begin{pmatrix} \langle\bar{x}, \bar{x}\rangle & \langle\bar{x}, \bar{y}\rangle & \langle\bar{x}, \bar{z}\rangle\\ \langle\bar{y}, \bar{x}\rangle & \langle\bar{y}, \bar{y}\rangle & \langle\bar{y}, \bar{z}\rangle\\ \langle\bar{z}, \bar{x}\rangle & \langle\bar{z}, \bar{y}\rangle & \langle\bar{z}, \bar{z}\rangle\end{pmatrix}.$$
For any vector $c = (c_1, c_2, c_3)^\top\in\mathbb{R}^3$,
$$c^\top G c = \langle c_1\bar{x} + c_2\bar{y} + c_3\bar{z},\; c_1\bar{x} + c_2\bar{y} + c_3\bar{z}\rangle \ge 0.$$
Also, $G$ is symmetric. Therefore, $G$ is positive semi-definite. As a result, $\det(G)\ge 0$. In other words,
$$1 - \cos^2\psi(x, y) - \cos^2\psi(y, z) - \cos^2\psi(x, z) + 2\cos\psi(x, y)\cos\psi(y, z)\cos\psi(x, z) \ge 0.$$
The above inequality can be rewritten as
$$(1 - \cos^2\psi(x, y))(1 - \cos^2\psi(y, z)) \ge (\cos\psi(x, y)\cos\psi(y, z) - \cos\psi(x, z))^2,$$
that is,
$$(\sin\psi(x, y)\sin\psi(y, z))^2 \ge (\cos\psi(x, y)\cos\psi(y, z) - \cos\psi(x, z))^2.$$
By the definition of $\arccos$, $\psi(x, y), \psi(y, z)\in[0, \pi]$, so $\sin\psi(x, y)\sin\psi(y, z)\ge 0$. Therefore,
$$-\sin\psi(x, y)\sin\psi(y, z) \le \cos\psi(x, y)\cos\psi(y, z) - \cos\psi(x, z) \le \sin\psi(x, y)\sin\psi(y, z),$$
so that
$$\cos\psi(x, z) \ge \cos\psi(x, y)\cos\psi(y, z) - \sin\psi(x, y)\sin\psi(y, z) = \cos(\psi(x, y) + \psi(y, z)).$$
If $\psi(x, y) + \psi(y, z) > \pi$, then because $\psi(x, z)\in[0, \pi]$ by the definition of $\arccos$, it is immediate that $\psi(x, z)\le\psi(x, y) + \psi(y, z)$. If $\psi(x, y) + \psi(y, z)\le\pi$, recall that $\psi(x, y), \psi(y, z)\in[0, \pi]$, hence $\psi(x, y) + \psi(y, z)\in[0, \pi]$; also $\psi(x, z)\in[0, \pi]$. Since $\cos$ is monotonically decreasing on $[0, \pi]$, we obtain $\psi(x, z)\le\psi(x, y) + \psi(y, z)$. In all, $\psi(x, z)\le\psi(x, y) + \psi(y, z)$.

The following lemma relates the angle to the Euclidean distance.

Lemma 5. Suppose that $x, y\in\mathbb{R}^m$ and $\|y\| < \|x\|$. Then
$$\psi(x, x + y) \le \arcsin\frac{\|y\|}{\|x\|},$$
and the equality holds if and only if $\langle y, x + y\rangle = 0$.

Proof. Let $\rho = \|y\|/\|x\|$ and $\psi_0 = \psi(x, y)$. Then $\langle x, y\rangle = \|x\|\|y\|\cos\psi_0$. Notice that
$$\rho^2(\rho + \cos\psi_0)^2 \ge 0.$$
This can be rewritten as
$$(1 + \rho\cos\psi_0)^2 \ge (1 + \rho^2 + 2\rho\cos\psi_0)(1 - \rho^2).$$
Since $\|y\| < \|x\|$, we have $\rho < 1$ and $1 + \rho\cos\psi_0 > 0$. Hence
$$1 + \rho\cos\psi_0 \ge \sqrt{1 + \rho^2 + 2\rho\cos\psi_0}\,\sqrt{1 - \rho^2}.$$
Plugging in $\rho = \|y\|/\|x\|$ and multiplying both sides by $\|x\|^2$, we have
$$\|x\|^2 + \|x\|\|y\|\cos\psi_0 \ge \sqrt{\|x\|^2 + \|y\|^2 + 2\|x\|\|y\|\cos\psi_0}\,\sqrt{\|x\|^2 - \|y\|^2}.$$
Since $\langle x, y\rangle = \|x\|\|y\|\cos\psi_0$ and $\|x\|^2 + \|y\|^2 + 2\langle x, y\rangle = \langle x + y, x + y\rangle$, this gives
$$\cos\psi(x, x + y) = \frac{\langle x, x + y\rangle}{\|x\|\,\|x + y\|} \ge \frac{\sqrt{\|x\|^2 - \|y\|^2}}{\|x\|} = \cos\Big(\arcsin\frac{\|y\|}{\|x\|}\Big).$$
Since $\|y\|/\|x\|\in[0, 1)$, we have $\arcsin\frac{\|y\|}{\|x\|}\in[0, \frac{\pi}{2})$. Therefore, by the monotonicity of $\cos$ on $[0, \frac{\pi}{2}]$,
$$\psi(x, x + y) \le \arcsin\frac{\|y\|}{\|x\|}.$$
The equality holds if and only if $\rho^2(\rho + \cos\psi_0)^2 = 0$, or equivalently, $(\|y\|^2 + \|x\|\|y\|\cos\psi_0)^2 = 0$, which reduces to $\langle y, x + y\rangle = 0$.

F PROOF OF LEMMA 1

Lemma 1. Consider the DCBM model where (8)-(9) are satisfied. We define three $K\times K$ matrices: $G_{LL} = \Pi_L^\top\Theta_{LL}\Pi_L$, $G_{UU} = \Pi_U^\top\Theta_{UU}\Pi_U$, and $Q = G_{UU}^{-1}\Pi_U^\top\Theta_{UU}H$. For $1\le k\le K$,
$$\psi_k(H) = \arccos\frac{M_{kk^*}}{\sqrt{M_{kk}M_{k^*k^*}}}, \qquad \text{where } M = P\big(G_{LL}^2 + G_{UU}QQ^\top G_{UU}\big)P.$$

Proof. Recall that
$$\psi_k(H) = \psi\big(f(\Omega^{(k)}; H),\, f(\mathbb{E}X; H)\big), \qquad \text{for } 1\le k\le K, \qquad\qquad (11)$$
where
$$f(x; H) = \big[x_L^\top 1^{(1)}, \ldots, x_L^\top 1^{(K)},\; x_U^\top h_1, \ldots, x_U^\top h_K\big]^\top = \big[x_L^\top\Pi_L,\; x_U^\top H\big]^\top.$$
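(To make the map $f(\cdot\,; H)$ and the angle-based classification rule built on it concrete, here is a minimal sketch in code. It is our own illustration, not the paper's implementation: the helper names are ours, `L_idx` and `U_idx` are integer index arrays for the labeled and unlabeled nodes, and the estimated membership matrix $\hat\Pi_U$ is assumed to have been obtained already and passed in as `H`.)

```python
import numpy as np

def f_feature(x, L_idx, U_idx, Pi_L, H):
    # f(x; H) = [x_L' Pi_L, x_U' H]': entries of x summed over the K labeled
    # communities and over the K estimated communities of the unlabeled nodes.
    return np.concatenate([x[L_idx] @ Pi_L, x[U_idx] @ H])

def angle(u, v):
    # psi(u, v) = arccos(<u, v> / (||u|| ||v||)).
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def predict_label(A, X, L_idx, U_idx, Pi_L, H):
    # Aggregate the rows of A over the labeled nodes of each community k to form
    # A^(k), then pick the community whose aggregate is closest in angle to X.
    K = Pi_L.shape[1]
    fX = f_feature(X, L_idx, U_idx, Pi_L, H)
    angles = []
    for k in range(K):
        members = L_idx[Pi_L[:, k] == 1]      # labeled nodes in community k
        A_k = A[members].sum(axis=0)          # length-n vector of aggregated edge counts
        angles.append(angle(f_feature(A_k, L_idx, U_idx, Pi_L, H), fX))
    return int(np.argmin(angles))             # smallest angle = largest similarity
```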
(12) i L Ck Ωij = e kΠ LΩL ej which indicates Ω(k) = (e kΠ LΩL ) f(Ω(k); H) = f((e kΠ LΩL ) ; H) = [e kΠ LΩLLΠL, e kΠ LΩLUH] = [e kΠ LΘLLΠLPΠT LΘLLΠL, e kΠ LΘLLΠLPΠT UΘUUH] = [Π LΘLLΠL, H Π UΘUU] PΠ LΘLLΠLek = [GLL, Q GUU] PΠ LΘLLΠLek (13) Notice that (Π LΘLLΠL)kl = 1(k)ΘLL1(l) = ( θ(k) L 1, k = l 0, k = l In other words, Π LΘLLΠL = diag θ(1) L 1, ..., θ(K) L 1 Hence f(Ω(k); H) = θ(k) L 1[GLL, Q GUU] Pek Similarly, f(EX; H) = θ [GLL, Q GUU] Pek f(Ω(k); H), f(EX; H) = ( θ(k) L 1[GLL, Q GUU] Pek) θ [GLL, Q GUU] Pek = θ θ(k) L 1e k P[GLL, GUUQ][GLL, Q GUU] Pek = θ θ(k) L 1e k P G2 LL + GUUQQ GUU Pek = θ θ(k) L 1e k Mek = θ θ(k) L 1Mkk (14) f(Ω(k); H) = q f(Ω(k); H), f(Ω(k); H) = θ(k) L 1 p f(EX; H) = p f(EX; H), f(EX; H) = θ p Published as a conference paper at ICLR 2023 ψk(H) = ψ f(Ω(k); H), f(EX; H) = arccos f(Ω(k); H), f(EX; H) f(Ω(k); H) f(EX; H) θ θ(k) L 1Mkk θ(k) L 1 Mkkθ Mk k = arccos Mkk Mkk Mk k G PROOF OF THEOREM 1 Theorem 1. Consider the DCBM model where (8)-(9) hold. Let k denote the true community label of the new node. Suppose ˆΠU is b0-correct, for a constant b0 (0, 1). When b0 is properly small, there exists a constant c0 > 0, which does not depend on b0, such that ψk (ˆΠU) = 0 and mink =k {ψk(ˆΠU)} c0βn. Proof. Define GLL = Π LΘLLΠL, GUU = Π UΘUUΠU, Q = G 1 UUΠ UΘUU ˆΠU, and M = P G2 LL + GUUQQ GUU P as in Lemma 1. According to Lemma 1, ψk(ˆΠU) = arccos Mkk Mkk Mk k ψk (ˆΠU) = arccos Mk k Mk k Mk k = arccos 1 = 0 When k = k , according to Lemma 3, ψk(ˆΠU) = 2 1 1 cos ψk(ˆΠU) 2 1 cos arccos Mkk Mkk Mk k 2 1 Mkk Mkk Mk k Let DM = diag(M11, ..., MKK), M = D 1 (ek ek ) M(ek ek ) = Mkk + Mk k Mkk Mk k = Mkk Mkk Mkk + Mk k Mk k Mk k Mkk Mkk Mk k Mk k p = 2 1 Mkk Mkk Mk k Published as a conference paper at ICLR 2023 (ek ek ) M(ek ek ) (20) M is affected by ˆΠU and is complicated to evaluate directly. Hence, we would first evaluate its oracle version and then reduce the noisy version to the oracle version. Define the oracle version of M as follow, where ˆΠU is replaced by ΠU M (0) = P G2 LL + G2 UU P Similarly, define the oracle version of DM, DM (0) = diag(M (0) 11 , ..., M (0) KK), and the oracle version of M, M (0) = D 1 2 M (0)M (0)D 1 Oracle Case We first study the oracle case |α M (0)α|. Since GLL = Π LΘLLΠL = diag θ(1) L 1, ..., θ(K) L 1 , GUU = Π UΘUUΠU = diag θ(1) U 1, ..., θ(K) U 1 , which indicates that G2 LL + G2 UU = diag θ(k) L 2 1 + θ(k) U 2 1 K for any vector α Rk, |α M (0)α| = |α D 1 2 M (0)M (0)D 1 2 M (0)P G2 LL + G2 UU PD 1 2 M (0)α 2 min k ( θ(k) L 2 1 + θ(k) U 2 1) (Cauchy-Schwartz Inequality) |α D 1 2 M (0)PP D 1 2 M (0)α| min k 1 2( θ(k) L 1 + θ(k) U 1)2 2λmin(P)2 D 1 2 M (0)α 2 min k ( θ(k) 1)2 (Condition (8)) 1 2β2 n|αD 1 M (0)α| min k ( θ(k) 1)2 2β2 n α 2 min k ( θ(k) L 2 1 + θ(k) U 2 1) 1 min k ( θ(k) 1)2 2β2 n α 2 min k h ( θ(k) L 1 + θ(k) U 1)2i 1 (min k θ(k) 1)2 2β2 n α 2 mink θ(k) 1 maxk θ(k) 1 (Condition (9)) β2 n α 2 It remains to study the noisy case. We reduce the noisy case to the oracle case through the following lemma. Lemma 6. Denote KC2 2b0 θU 1 Suppose that C5 1 4. Then, for any vector α Rk, |α Mα| 1 3C5 1 C5 |α M (0)α| 1 3|α M (0)α| (23) The proof of Lemma 6 is quite tedious and we would defer it to the end of this section. Published as a conference paper at ICLR 2023 Set b0 1 32K2 KC2 2 , then C5 θU 1 4. 
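(As a side remark before continuing, the closed form of Lemma 1, which underlies the quantities $M$ and $M^{(0)}$ manipulated here, can be checked numerically. The sketch below is ours; the parameter values are arbitrary and purely illustrative, and it takes the oracle plug-in $H = \Pi_U$, so that $Q = I_K$ and $M$ coincides with $M^{(0)}$.)

```python
import numpy as np

rng = np.random.default_rng(1)
K, nL, nU = 3, 60, 90
# Hypothetical DCBM parameters, used only for this check.
P = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
theta_L = rng.uniform(0.2, 0.9, nL); theta_U = rng.uniform(0.2, 0.9, nU)
lab_L = rng.integers(0, K, nL);      lab_U = rng.integers(0, K, nU)
Pi_L, Pi_U = np.eye(K)[lab_L], np.eye(K)[lab_U]
theta_new, k_star = 0.5, 1                       # new node's parameter and true label

# Closed form of Lemma 1 with H = Pi_U.
G_LL = Pi_L.T @ np.diag(theta_L) @ Pi_L          # = diag(||theta_L^{(k)}||_1)
G_UU = Pi_U.T @ np.diag(theta_U) @ Pi_U
Q = np.linalg.inv(G_UU) @ Pi_U.T @ np.diag(theta_U) @ Pi_U
M = P @ (G_LL @ G_LL + G_UU @ Q @ Q.T @ G_UU) @ P

# Direct computation of psi_k(H) from (11)-(12).
theta = np.concatenate([theta_L, theta_U]); lab = np.concatenate([lab_L, lab_U])
Omega = np.outer(theta, theta) * P[lab][:, lab]  # Omega_ij = theta_i theta_j P_{y_i y_j}
EX = theta_new * theta * P[k_star][lab]          # E[X_j]

def f(x):                                        # f(x; H) with H = Pi_U
    return np.concatenate([x[:nL] @ Pi_L, x[nL:] @ Pi_U])

def angle(u, v):
    return np.arccos(np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)), -1, 1))

for k in range(K):
    Omega_k = Omega[:nL][lab_L == k].sum(axis=0)     # rows of labeled community k
    direct = angle(f(Omega_k), f(EX))
    closed = np.arccos(np.clip(M[k, k_star] / np.sqrt(M[k, k] * M[k_star, k_star]), -1, 1))
    print(k, f"{direct:.6f}", f"{closed:.6f}")       # the two columns agree
```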
As a result, combining (21) with Lemma 6, we have for any vector α Rk, 3|α M (0)α| β2 n α 2 Hence, take α = ek ek in (24) and combine it with (20), we obtain that for any k [K] (ek ek ) M(ek ek ) 1 6C2 2 β2n ek ek 2 = 1 3C2 2 βn (25) Therefore, set c0 = q 1 3C2 2 , we have min k =k {ψk(ˆΠU)} c0βn (26) In all, when b0 is properly small such that b0 1 32K2 KC2 2 , there exists constant c0 = q 1 3C2 2 > 0 not depending on b0 such that ψk (ˆΠU) = 0 and mink =k {ψk(ˆΠU)} c0βn. G.1 PROOF OF LEMMA 6 Proof. For any vector α Rk, |α Mα α M (0)α| = |α D 1 2 M α α D 1 2 M (0)M (0)D 1 2 M (M M (0))D 1 2 M M (0)D 1 2 M α α D 1 2 M (0)M (0)D 1 2 M (M M (0))D 1 2 M M (0)D 1 2 M (0)M (0)D 1 2 M (0) α| (27) The first part on the RHS of (27), 2 M (M M (0))D 1 2 M α| = |α D 1 2 M PΠ UΘUU(ˆΠU ˆΠ U ΠUΠ U)ΘUUΠUPD 1 2 M α 2 Π UΘUU(ˆΠU ˆΠ U ΠUΠ U)ΘUUΠU 2 2 M α 2 Π UΘUU(ˆΠU ˆΠ U ΠUΠ U)ΘUUΠU (28) Denote G(d) = Π UΘUU(ˆΠU ˆΠ U ΠUΠ U)ΘUUΠU. Define ηl l = P i U,πi=el,ˆπi=e l θi. In other words, ηl l is the sum of the degree heterogeneity parameters of all the nodes in U with true label l and estimated label l. Then, (Π UΘUU ˆΠU)l l = ηl l (Π UΘUUΠU)l l = X i U,πi=el,πi=e l θi = Il= l X where Il= l is the indicator function of event {l = l}. Published as a conference paper at ICLR 2023 G(d) l l = (Π UΘUU(ˆΠU ˆΠ U ΠUΠ U)ΘUUΠU)l l = ((Π UΘUU ˆΠU)(Π UΘUU ˆΠU) )l l ((Π UΘUUΠU)(Π UΘUUΠU) )l l = X s [K] ηlsη ls Il= l( X s [K] ηls)2 (29) Since ˆΠU is b0 correct, there exists permutation T of K columns of ˆΠU such that P i U θi 1{T ˆπi = πi} b0 θ 1. Let r = r(l) satisfies er = T 1el. When l = l, we have |G(d) l l | = | X s [K] η2 ls ( X s [K] ηls)2| s [K] ηls)2 X s [K] η2 ls s [K] ηls)2 η2 lr s =r ηls)(ηlr + X s =r ηls)( X s [K] ηls) (30) When l = l, we have |G(d) l l | = | X s [K] ηlsη ls| s [K] ηlsη ls (31) l, l [K] |G(d) l l | l [K] |G(d) ll | + X l = l |G(d) ll | s =r ηls)( X s [K] ηls) + X s [K] ηlsη ls s [K] ηls X l [K],s =r(l) ηls + X l = l ηlsη ls s [K] ηls X l [K],s =r(l) ηls + X l [k] ηls 2 X l [k] η2 ls s [K] ηls X l [K],s =r(l) ηls + X l [k] ηls 2 X r(l)=s η2 ls s [K] ηls X l [K],s =r(l) ηls + X r(l) =s ηls X l [K] ηls + X Published as a conference paper at ICLR 2023 s [K] ηls X l [K],s =r(l) ηls + 2 X r(l) =s ηls X s [K] ηls X l [K],s =r(l) ηls + 2 max s l [K] ηls X r(l) =s ηls l [K],s =r(l) ηls max l s [K] ηls + max s l [K],s =r(l) ηls X s [K] ηls + X l [K],s =r(l) ηls (32) Recall that T satisfies P i U θi 1{T ˆπi = πi} b0 θ 1, hence l [K],s =r(l) ηls = X l [K],s =r(l), i U,πi=el,ˆπi=es i U θi 1{T ˆπi = πi} b0 θ 1 Therefore, G(d) 4b0 θU 1 θ 1 (33) Plugging (33) into (28), we obtain 2 M (M M (0))D 1 Kb0 θU 1 θ 1 PD 1 2 M α 2 (34) On the other hand, 2 M M (0)D 1 2 M α| = |α D 1 2 M P G2 LL + G2 UU PD 1 Since GLL = Π LΘLLΠL = diag( θ(1) L 1, ..., θ(K) L 1), GUU = Π UΘUUΠU = diag( θ(1) U 1, ..., θ(K) U 1), 2 M M (0)D 1 2 M α| PD 1 2 M α 2 min k ( θ(k) L 2 1 + θ(k) U 2 1) (Cauchy-Schwartz Inequality) PD 1 2 M α 2 min k 1 2( θ(k) L 1 + θ(k) U 1)2 2 M α 2 min k θ(k) 2 1 Recall condition (9) in the main paper, max1 k K{ θ(k) 1} min1 k K{ θ(k) 1} C2 2 M M (0)D 1 C2 max k θ(k) 1)2 2 M α 2( 1 KC2 θ 1)2 = 1 2K2C2 2 PD 1 2 M α 2 θ 2 1 (35) Comparing (35) with (34), we obtain 2 M (M M (0))D 1 Published as a conference paper at ICLR 2023 KC2 2b0 θU 1 2 M M (0)D 1 2 M (0)M (0)D 1 2 M M (0)D 1 2 M (0)M (0)D 1 2 M (0) α| (36) Consequently, we bound the first part of (27) by the second part of (27). It remains to bound the second part of (27). 
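(As a side remark, the conclusion of Lemma 6 — that the quadratic forms of $\bar M$ and $\bar M^{(0)}$ remain comparable when only a small $\theta$-weighted fraction of the unlabeled labels is wrong — can be illustrated numerically. The sketch below is ours: the parameter values are hypothetical, and the uniform label flips are only roughly $b_0$-correct.)

```python
import numpy as np

rng = np.random.default_rng(2)
K, nL, nU, b0 = 2, 200, 400, 0.02
P = np.array([[1.0, 0.6], [0.6, 1.0]])
theta_L = rng.uniform(0.3, 0.8, nL); theta_U = rng.uniform(0.3, 0.8, nU)
lab_L = rng.integers(0, K, nL);      lab_U = rng.integers(0, K, nU)
Pi_L, Pi_U = np.eye(K)[lab_L], np.eye(K)[lab_U]

# A roughly b0-correct estimate: flip a small fraction of the unlabeled labels.
hat_lab = lab_U.copy()
flip = rng.random(nU) < b0
hat_lab[flip] = (hat_lab[flip] + 1) % K
hat_Pi_U = np.eye(K)[hat_lab]

G_LL = Pi_L.T @ np.diag(theta_L) @ Pi_L
G_UU = Pi_U.T @ np.diag(theta_U) @ Pi_U

def M_bar(H):
    # M = P (G_LL^2 + G_UU Q Q' G_UU) P with Q = G_UU^{-1} Pi_U' Theta_UU H,
    # normalised by its diagonal as in the proof of Theorem 1.
    Q = np.linalg.inv(G_UU) @ Pi_U.T @ np.diag(theta_U) @ H
    M = P @ (G_LL @ G_LL + G_UU @ Q @ Q.T @ G_UU) @ P
    D = np.diag(1.0 / np.sqrt(np.diag(M)))
    return D @ M @ D

a = np.array([1.0, -1.0])                        # the direction e_k - e_{k*}
noisy = a @ M_bar(hat_Pi_U) @ a
oracle = a @ M_bar(Pi_U) @ a
print(noisy, oracle)                             # close for small b0, as Lemma 6 predicts
```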
Since DM, DM (0), M (0) are all diagonal matrices, we can rewrite the second part on the LHS of (27) as follows: 2 M M (0)D 1 2 M (0)M (0)D 1 2 M (0)(M (0)) 1 2 (M (0)) 1 1 2 M (0)D 1 2 M M (0)D 1 1 2 M (0)(M (0)) 1 2 1 (M (0)) 1 2 D 1 2 M (0)(M (0)) 1 2 DM (0)D 1 M 1 (M (0)) 1 2 D 1 λmax DM (0)D 1 M 1 (M (0)) 1 2 D 1 = max k [K] M (0) kk Mkk 1 (M (0)) 1 2 D 1 = max k [K] M (0) kk (M (0) M)kk 1 2 M (0)M (0)D 1 2 M (0)α| (37) Notice that for any k [K] M (0) kk (M (0) M)kk e k M (0)ek e k(M (0) M)ek e k P G2 LL + G2 UU Pek e k PG(d)Pek Pek 2 mink( θ(k) L 2 1 + θ(k) U 2 1) Pek 2 G(d) 2 (Cauchy-Schwartz Inequality) mink 1 2( θ(k) L 1 + θ(k) U 1)2 (Plugging in (33)) mink( θ(k) 1)2 Kb0 θU 1 θ 1 (Condition (9)) (maxk 1 C2 θ(k) 1)2 Kb0 θU 1 θ 1 ( 1 C2K θ 1)2 Kb0 θU 1 θ 1 C5 4 > 1 (38) Plugging (38) into (37), we have 2 M M (0)D 1 2 M (0)M (0)D 1 Published as a conference paper at ICLR 2023 max k [K] 1 M (0) kk (M (0) M)kk 2 M (0)M (0)D 1 max k [K] 1 1 C5 1 |α D 1 2 M (0)M (0)D 1 = C5 1 C5 |α D 1 2 M (0)M (0)D 1 2 M (0)α| (39) Combining (27), (36), and (39), we have |α Mα α M (0)α| |α D 1 2 M (M M (0))D 1 2 M M (0)D 1 2 M (0)M (0)D 1 2 M (0)M (0)D 1 2 M M (0)D 1 2 M (0)M (0)D 1 2 M M (0)D 1 2 M (0)M (0)D 1 2 M (0)M (0)D 1 + (C5 + 1)|α D 1 2 M M (0)D 1 2 M (0)M (0)D 1 2 M (0)M (0)D 1 + (C5 + 1) C5 1 C5 |α D 1 2 M (0)M (0)D 1 = 2C5 1 C5 |α M (0)α| (40) Hence for any vector α Rk, |α Mα| |α M (0)α| |α Mα α M (0)α| 1 3C5 1 C5 |α M (0)α| (41) To conclude, in this subsection, we successfully reduce the noisy case |α Mα| to the oracle case |α M (0)α|. Result (41) will also be used in the proof of other claims. H PROOF OF THEOREM 2 Theorem 2. Consider the DCBM model where (8)-(9) hold. There exists constant C > 0, such that for any δ (0, 1/2), with probability 1 δ, simultaneously for 1 k K, | ˆψk(ˆΠU) ψk(ˆΠU)| C θ 1 min{θ , θ(k) L 1} + θ(k) L 2 θ(k) L 1 θ 1 To prove Theorem 2, we need a famous concentration inequality, Bernstein inequality: Lemma 7 (Bernstein inequality). Suppose X1, ..., Xn are independent random variables such that EXi = 0, |Xi| b and V ar(Xi) σ2 i for all i. Let σ2 = n 1 Pn i=1 σ2 i . Then, for any t > 0, i=1 Xi| t 2 exp nt2/2 σ2 + bt/3 Published as a conference paper at ICLR 2023 The proof of Lemma 7, Bernstein inequality, can be seen in most probability textbooks such as Uspensky (1937). Proof. Recall that for k [K], ˆψk(ˆΠU) = ψ(f(A(k); ˆΠU), f(X; H)), ψk(ˆΠU) = ψ f(Ω(k); ˆΠU), f(EX; ˆΠU) . Denote vk = f(A(k); ˆΠU), v = f(X; ˆΠU), vk = f(Ω(k); ˆΠU), v = f(EX; ˆΠU), k [K]. Then, by Lemma 4, ˆψk(ˆΠU) = ψ(vk, v ) ψ(vk, vk) + ψ( vk, v ) ψ(vk, vk) + ψ( vk, v ) + ψ( v , v ) = ψk(ˆΠU) + ψ(vk, vk) + ψ( v , v ) Similarly, ψk(ˆΠU) ˆψk(ˆΠU) + ψ(vk, vk) + ψ( v , v ) Therefore, | ˆψk(ˆΠU) ψk(ˆΠU)| ψ(vk, vk) + ψ( v , v ). (42) For any ϕ1, ..., ϕK 0. P k [K], | ˆψk(ˆΠU) ψk(ˆΠU)| ϕk = 1 P k [K], | ˆψk(ˆΠU) ψk(ˆΠU)| > ϕk k=1 P | ˆψk(ˆΠU) ψk(ˆΠU)| > ϕk (43) By definition of ψ, ˆψk(ˆΠU), ψk(ˆΠU) [0, π]. Hence, | ˆψk(ˆΠU) ψk(ˆΠU)| [0, π]. As a result, when ϕk π, P | ˆψk(ˆΠU) ψk(ˆΠU)| > ϕk = 0 When ϕk < π, by (42), P | ˆψk(ˆΠU) ψk(ˆΠU)| > ϕk P ψ(vk, vk) + ψ( v , v ) > ϕk P ψ(vk, vk) > 1 2ϕk or ψ( v , v ) > 1 P ψ(vk, vk) > 1 2ϕk + P ψ( v , v ) > 1 By lemma 5, when vk vk < vk ψ(vk, vk) arcsin vk vk Hence, for any ϕ [0, π 2 ), vk vk sin(ϕ) vk implies ψ(vk, vk) ϕ. As a result, for any ϕ [0, π 2 ), ψ(vk, vk) > ϕ implies vk vk > sin(ϕ) vk . Similarly, for any ϕ [0, π 2 ), ψ(v , v ) > ϕ implies v v sin(ϕ) v . By definition of ϕk, ϕk 0. Hence, when ϕk < π, 1 2 ). 
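(As a side remark, the Bernstein inequality of Lemma 7, which drives the concentration bounds in Lemmas 8 and 9 below, can be illustrated with a quick simulation. The numbers are arbitrary, and centered Bernoulli variables are used only as a convenient example.)

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, t, reps = 2000, 0.05, 0.02, 20000
# X_i = Bern(p) - p are independent, mean zero, |X_i| <= 1, Var(X_i) = p(1 - p).
X = rng.random((reps, n)) < p
dev = np.abs(X.mean(axis=1) - p)
emp = (dev >= t).mean()

sigma2, b = p * (1 - p), 1.0
bern = 2 * np.exp(-n * t**2 / (2 * (sigma2 + b * t / 3)))   # bound from Lemma 7
print(f"empirical P(|mean - p| >= t) = {emp:.5f}, Bernstein bound = {bern:.5f}")
# The empirical tail probability stays below the (conservative) Bernstein bound.
```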
Plugging the above results into (44), we have Published as a conference paper at ICLR 2023 P | ˆψk(ˆΠU) ψk(ˆΠU)| > ϕk P ψ(vk, vk) > 1 2ϕk + P ψ( v , v ) > 1 P vk vk sin(1 2ϕk) vk + P v v sin(1 P l [2K], |(vk vk)l| 1 + P l [2K], |(v v )l| 1 l=1 P |(vk vk)l| 1 l=1 P |(v v )l| 1 Since when ϕk < π, 1 2 ], by Lemma 3, sin( 1 π 1 2ϕk = 1 πϕk. Plugging back to (45), we have when ϕk < π, P | ˆψk(ˆΠU) ψk(ˆΠU)| > ϕk l=1 P |(vk vk)l| 1 π l=1 P |(v v )l| 1 π K ϕk v (46) It remains to evaluate P |(vk vk)l| 1 π K ϕk vk and P |(v v )l| 1 π K ϕk v , which are illustrated in the following two lemmas. Lemma 8. Define C6 = C2 3 ). When ϕk 2 2πC2K2 θ(k) L 2 θ(k) L 1 θ 1 , P |(vk vk)l| 1 π K ϕk vk 2 exp C6 C2 ϕ2 k θ(k) L 1 θ 1 Lemma 9. Define C7 = C2 P |(v v )l| 1 π K ϕk v 2 exp C7 C2 ϕ2 kθ θ 1 The proof of Lemma 8 and 9 are quite tedious. We would defer their proofs to the end of this section. ϕk = ϕk(C, δ) = C θ 1 min{θ , θ(k) L 1} + θ(k) L 2 θ(k) L 1 θ 1 Then leveraging on Lemma 8 and 9, we have when C 2 P |(vk vk)l| 1 π K ϕk vk 2 exp C6C2 log(1/δ) θ(k) L 1 θ 1 C2 θ 1 min{θ , θ(k) L 1} 2 exp ( C6 log(1/δ)) = 2δC6 (47) Published as a conference paper at ICLR 2023 P |(v v )l| 1 π K ϕk v 2 exp C7C2 log(1/δ)θ θ 1 C2 θ 1 min{θ , θ(k) L 1} 2 exp ( C7 log(1/δ)) = 2δC7 (48) Plugging (47) and (48) back to (46), leveraging on the fact that δ 1 2 < 1, we obtain when ϕk < π, and C 2 P | ˆψk(ˆΠU) ψk(ˆΠU)| > ϕk 4KδC6 + 4KδC7 8KδC6 Recall that when ϕk π, P | ˆψk(ˆΠU) ψk(ˆΠU)| > ϕk = 0 . In all, we have that when C 2 P | ˆψk(ˆΠU) ψk(ˆΠU)| > ϕk 8KδC6 (49) Substituting (49) into (43), we obtain that when C 2 P k [K], | ˆψk(ˆΠU) ψk(ˆΠU)| ϕk 1 8K2δC6 (50) Hence, it suffices to make 8K2δC6 δ. Choose 3)(1 + log(8K2) log 2 )}, (51) Then C6 1 log(8K2) log 2 1. Since δ 1 log 2 = 1 8K2 As a result, 8K2δC6 δ. Hence, choose C as in (51), then C >), and for any δ (0, 1/2), P k [K], | ˆψk(ˆΠU) ψk(ˆΠU)| ϕk 1 δ (52) To conclude, there exists constant C > 0, such that for any δ (0, 1/2), with probability 1 δ, simultaneously for 1 k K, | ˆψk(ˆΠU) ψk(ˆΠU)| C θ 1 min{θ , θ(k) L 1} . H.1 PROOF OF LEMMA 8 Proof. When l [K], ( vk)l = (f(Ω(k); ˆΠU))l = Ω(k)1(l) = X Published as a conference paper at ICLR 2023 When l {K + 1, ..., 2K}, define ˆCl = {i U : ˆπi = el K} , then ( vk)l = (f(Ω(k); ˆΠU))l = Ω(k) ˆΠU l [2K] ( vk)2 l (Cauchy-Schwartz) l [2K] ( vk)l)2 l [2K] ( vk)l| j Cl L Ωij + j L Ωij + X j Ck θiθj Pkk (Identifiability condition) = 1 2K θ(k) L 1 θ(k) 1 2K θ(k) L 1 min l [K] θ(l) 1 (Condition (9)) 1 C2 2K θ(k) L 1 max l [K] θ(l) 1 2K θ(k) L 1 θ 1 (53) When l [K], (vk)l = (f(A(k); ˆΠU))l = A(k)1(l) = X Recall that, ( vk)l = (f(Ω(k); ˆΠU))l = Ω(k)1(l) = X So |(vk vk)l| = X j Cl L (Aij Ωij) Published as a conference paper at ICLR 2023 When l [K]\{k}, since ˆΠU only depends on AUU, it is independent of ALL. Hence, given ˆΠU, {Aij Ωij : i Ck L, j Cl L} are a collection of |Ck L||Cl L| independent random variables. Furthermore, given ˆΠU, for any i Ck L, j Cl L, E h Aij Ωij|ˆΠU i = E h Aij|ˆΠU i Ωij = Ωij Ωij = 0 Also, 1 Ωij Aij Ωij Aij 1 So |Aij Ωij| 1. Additionally, var Aij Ωij = var Aij|ˆΠU = Ωij(1 Ωij) Ωij Therefore, denote nkl = |Ck L||Cl L|, by Lemma 7, P |(vk vk)l| 1 π = E P |(vk vk)l| 1 π K ϕk vk |ˆΠU j Cl L (Aij Ωij)| 1 π Knkl ϕk vk |ˆΠU Knkl ϕk vk 2 j Cl L Ωij + 1 j Cl L Ωij + 1 K|( vk)l| + 1 When l = k, {Aij Ωij : i, j Ck L, i < j} are a collection of 1 2|Ck L|(|Ck L| 1) independent random variables. Furthermore, for any i, j Ck L, i < j, E(Aij Ωij) = EAij Ωij = Ωij Ωij = 0 Also, 1 Ωij Aij Ωij Aij 1 So |Aij Ωij| 1. 
Additionally, var(Aij Ωij) = var(Aij) = Ωij(1 Ωij) Ωij Denote nkk = 1 2|Ck L|(|Ck L| 1), we have P |(vk vk)l| 1 π j Ck L (Aij Ωij)| 1 π Published as a conference paper at ICLR 2023 i 0 and C > 0, which do not depend on (b0, ϵ), such that P(ˆy = k ) ϵ + C PK k=1 exp C0β2 n θ 1 min{θ , θ(k) L 1} . Proof. Let B0 be the event that ˆΠU is b0-correct. Then, = P(ˆy = k , BC 0 ) + P ˆy = k , B0 P(BC 0 ) + P n k = k , ˆψk(ˆΠU) ˆψk (ˆΠU) o , B0 ϵ + P n k = k , ψk(ˆΠU) ˆψk(ˆΠU) + ˆψk (ˆΠU) ψk (ˆΠU) ψk(ˆΠU) ψk (ˆΠU) o , B0 ϵ + P n k = k , ψk(ˆΠU) ˆψk(ˆΠU) + ˆψk (ˆΠU) ψk (ˆΠU) ψk(ˆΠU) ψk (ˆΠU) o , B0 (66) By Theorem 1, when b0 is properly small, B0 implies that there exists a constant c0 > 0, which does not depend on b0, such that ψk (ˆΠU) = 0 and mink =k {ψk(ˆΠU)} c0βn. Substituting this result into (66), we have P(ˆy = k ) ϵ + P n k = k , ψk(ˆΠU) ˆψk(ˆΠU) + ˆψk (ˆΠU) ψk (ˆΠU) c0βn o , B0 ϵ + P k = k , ψk(ˆΠU) ˆψk(ˆΠU) + ˆψk (ˆΠU) ψk (ˆΠU) c0βn ϵ + P k [K], ψk(ˆΠU) ˆψk(ˆΠU) 1 ϵ + P k [K], ψk(ˆΠU) ˆψk(ˆΠU) > 1 According to Theorem 2, there exists a constant C > 0, such that for any δ (0, 1/2), with probability 1 δ, simultaneously for 1 k K, | ˆψk(ˆΠU) ψk(ˆΠU)| C θ 1 min{θ , θ(k) L 1} + θ(k) L 2 θ(k) L 1 θ 1 Published as a conference paper at ICLR 2023 Take C0 = c2 0 36C2 , δ = exp C0β2 n θ 1 min n θ , min k [K] θ(k) L 1 o θ 1 min{θ , θ(k) L 1} = C v u u u tlog 1/ exp C0β2n θ 1 min n θ , mink [K] θ(k) L 1 o θ 1 min{θ , θ(k) L 1} v u u u t C0β2n θ 1 min n θ , mink [K] θ(k) L 1 o θ 1 min{θ , θ(k) L 1} On the other hand, take c3 in condition (9) properly small such that c3 c0 6C , then according to condition (9), θ(k) L 1 θ 1 C c3βn 1 Combining (68) and (69), we have θ 1 min{θ , θ(k) L 1} + θ(k) L 2 θ(k) L 1 θ 1 Therefore, when δ < 1 2, by Theorem 2, with probability 1 δ, simultaneously for 1 k K, | ˆψk(ˆΠU) ψk(ˆΠU)| 1 As a result, when δ < 1 P k [K], ψk(ˆΠU) ˆψk(ˆΠU) > 1 P k [K], ψk(ˆΠU) ˆψk(ˆΠU) > 1 Hence in total, we have P k [K], ψk(ˆΠU) ˆψk(ˆΠU) > 1 3c0βn 2δ (70) Plugging (70) into (67), we obtain P(ˆy = k ) ϵ + 2δ = ϵ + 2 exp C0β2 n θ 1 min n θ , min k [K] θ(k) L 1 o k=1 exp C0β2 n θ 1 min{θ , θ(k) L 1} (71) Choose C = 2, we obtain P(ˆy = k ) ϵ + C k=1 exp C0β2 n θ 1 min{θ , θ(k) L 1} Published as a conference paper at ICLR 2023 To conclude, when b0 is properly small, there exist constants C0 = c2 0 36C2 > 0 and C = 2 > 0, which do not depend on (b0, ϵ), such that P(ˆy = k ) ϵ+ C PK k=1 exp C0β2 n θ 1 min{θ , θ(k) L 1} . I.2 PROOF OF COROLLARY 2 Corollary 2. Consider the DCBM model where (8)-(9) hold. We apply SCORE+ to obtain ˆΠU and plug it into Angle Min+. As n , suppose for some constant q0 > 0, mini U θi q0 maxi U θi, βn θU q0 p log(n), β2 n θ 1θ , and β2 n θ 1 mink{ θ(k) L 1} . Then, P(ˆy = k ) 0, so the Angle Min+ estimate is consistent. Proof. By Corollary 2, let ϵ be the probability that ˆΠU obtained through SCORE+ is not b0-correct, then P(ˆy = k ) ϵ + C k=1 exp C0β2 n θ 1 min{θ , θ(k) L 1} Since β2 n θ 1θ , and β2 n θ 1 mink{ θ(k) L 1} , k=1 exp C0β2 n θ 1 min{θ , θ(k) L 1} 0 By Theorem 2.2 in Jin et al. (2021), when q0 is sufficiently large, mini U θi q0 maxi U θi and βn θU q0 p log(n) imply that ϵ 0. Hence, in all, we have P(ˆy = k ) 0, so the Angle Min+ estimate is consistent. J PROOF OF LEMMA 2 As mentioned in section D, in the main paper, for the smoothness and comprehensibility of the text, we do not present the most general form of Lemma 2. 
Here, we present both the original version, Lemma 2, and the generalized version, Lemma 2 below, where we relax the assumption that θ(1) L 1 θ(2) L 1 = θ(1) U 1 θ(2) U 1 = 1 to the much weaker assumption: the first part of condition (9) in the main text, which only assumes that θ(1) 1 and θ(2) 1 are of the same order. Lemma 2. Consider a DCBM with K = 2 and P = (1 b)I2 + b121 2. Suppose θ = o(1), mink θ(k) L 1 = o(1), 1 b = o(1), θ(1) L 1 θ(2) L 1 = θ(1) U 1 θ(2) U 1 = 1. There exists a constant c4 > 0 such that inf y {Risk( y)} c4 exp n 2[1 + o(1)](1 b)2 8 θ ( θL 1 + θU 1) o , (12) where the infimum is taken over all measurable functions of A, X, and parameters ΠL, ΠU, Θ, P, θ . In Angle Min+, suppose the second part of condition 9 holds with c3 = o(1), ˆΠU is b0-correct with b0 a.s. 0. There is a constant C4 > 0 such that, Risk(ˆy) C4 exp n [1 o(1)](1 b)2 8 θ ( θL 2 1 + θU 2 1)2 θL 3 1 + θU 3 1 Lemma 2 . Consider a DCBM with K = 2 and P = (1 b)I2 + b121 2. Suppose 1 b = o(1). There exists a constant c4 > 0 such that inf y {Risk( y)} c4 exp n 2[1 + o(1)](1 b)2 8 θ ( θL 1 + θU 1) o , (75) where the infimum is taken over all measurable functions of A, X, and parameters ΠL, ΠU, Θ, P, θ . In Angle Min+, suppose condition 9 holds with c3 = o(1), θ = o(1), θ mink θ(k) L 1 = o(1), ˆΠU is Published as a conference paper at ICLR 2023 b0-correct with b0 a.s. 0. There is a constant C4 > 0 such that, Risk(ˆy) C4 exp [1 o(1)](1 b)2 θ(1) L 3 1+ θ(1) U 3 1 ( θ(1) L 2 1+ θ(1) U 2 1)2 + θ(2) L 3 1+ θ(2) U 3 1 ( θ(2) L 2 1+ θ(2) U 2 1)2 When conditions of Lemma 2 hold, conditions of Lemma 2 hold. Also, with θ(1) L 1 θ(2) L 1 = θ(1) U 1 θ(2) U 1 = 1 assumed in Lemma 2, the results of Lemma 2 imply the results of 2. Therefore, it suffices to prove the generalized version, Lemma 2 . We prove the lower bound (75) and upper bound (76 ) separately. J.1 PROOF OF LOWER BOUND (75) Proof. Let P(1) and P(2) be the joint distribution of A and X given π = e1 and π = e2, respectively. For a random variable or vector or matrix Y , let P(1) Y and P(2) Y be the distribution of Y given π = e1 and π = e2, respectively. According to Theorem 2.2 in Section 2.4.2 of Tsybakov (2009), inf y {Risk( y)} 2 1 H2(P(1), P(2))(1 H2(P(1), P(2))/4)) 2H2(P(1), P(2)) 2 2H2(P(1), P(2)) 2 2H2(P(1), P(2)) 2 (76) where H2(P(1), P(2)) = Z p is the Hellinger distance between P(1) and P(2). As in Section 2.4 of Tsybakov (2009), one key property of Hellinger distance is that if Q(1) and Q(2) are product measures, Q(1) = N i=1Q(1) i , Q(2) = N i=1Q(2) i , then H2(Q(1), Q(2)) = 2 1 1 H2(Q(1) i , Q(2) i ) 2 Notice that for k = 1, 2, since according to DCBM, A, X1, ..., Xn are independent, P(k) = P(k) A n i=1P(k) Xi (78) Combining (77) and (78), we obtain H2(P(1), P(2)) = 2 1 1 H2(P(1) A , P(2) A ) 2 1 H2(P(1) Xi , P(2) Xi ) 2 Given π = e1 and π = e2, according to DCBM, the distribution of A remains the same. As a result, H2(P(1) A , P(2) A ) = 0. On the other hand, for k = 1, 2 and i [n], according to DCBM model P(k) Xi Bern(θ θi Pkki) Published as a conference paper at ICLR 2023 where ki is the true label of node i. As a result, 1 H2(P(1) Xi , P(2) Xi ) 2 θ θi P1ki p θ θi P2ki 2 + p 1 θ θi P1ki p 1 θ θi P2ki 2i θ θi P1kiθ θi P2ki + p (1 θ θi P1ki)(1 θ θi P2ki) (80) For x, y R, denote g(x, y) = log(xy + p (1 x2)(1 y2)). 
Then, 1 H2(P(1) Xi ,P(2) Xi ) 2 = exp g( p θ θi P1ki, p Hence, by (79) 2H2(P(1), P(2)) = 1 H2(P(1) A , P(2) A ) 2 1 H2(P(1) Xi , P(2) Xi ) 2 θ θi P1ki, p θ θi P2ki) (81) Substituting (81) into (76), we obtain inf y {Risk( y)} 1 θ θi P1ki, p θ θi P2ki) (82) To reduce the RHS of (82), we need to evaluate g. The following lemma shows that g(x, y) 1 Lemma 10. Suppose that 0 x, y a < 1. Then, 2(1 a2) 3 2 (x y)2 Proof. We first prove a short inequality on log. For z > 0, define g3(z) = log(z) z 1 d dz g3(z) = 1 Therefore, when z 1, d dzg3(z) 0, g3(z) is monotonously deceasing, hence g3(z) g3(1) = 0; when z 1, d dzg3(z) 0, g3(z) is monotonously increasing, hence g3(z) g3(1) = 0. In all, g3(z) 0, so Since x, y [0, 1], we could define ϕx = arcsin x [0, π 2 ], ϕy = arcsin y [0, π g(x, y) = g(sin ϕx, sin ϕy) = log sin ϕx sin ϕy + q (1 sin2 ϕx)(1 sin2 ϕy) = log(sin ϕx sin ϕy + cos ϕx cos ϕy) = log(cos(ϕx ϕy)) Published as a conference paper at ICLR 2023 (By (83)) cos(ϕx ϕy) 1 = 2 sin2 (ϕx ϕy) 2 cos(ϕx ϕy) 2 4 sin2 (ϕx ϕy) 2 (sin ϕx sin ϕy)2 1 cos(ϕx ϕy)(sin ϕx sin ϕy)2 2 4 sin2 (ϕx ϕy) 2 (2 sin (ϕx ϕy) 2 cos (ϕx+ϕy) 2 )2 1 cos(ϕx ϕy)(x y)2 cos(|ϕx ϕy|) cos2 (ϕx+ϕy) Since x, y [0, a], ϕx, ϕy [0, arcsin a].As a result, |ϕx ϕy|, (ϕx+ϕy) 2 [0, arcsin a]. Plugging this result back into (84), we have 2(x y)2 1 cos(arcsin a) cos2 arcsin a = 1 2(1 a2) 3 2 (x y)2 (85) This concludes the proof. Back to the proof of lower bound (75). Define θa = r θ maxi [n] θi (max{1, b}), then for any k {1, 2}, i [n], p θ θi Pkki θa. Therefore, applying Lemma 10 in (82), we have when θa < 1, inf y {Risk( y)} 1 2(1 (θa)2) 3 2 θ θi P1ki p θ θi P2ki 2 (1 (θa)2) 3 2 (1 (θa)2) 3 2 (1 (θa)2) 3 2 θ θ 1 (1 b)2 b)2(1 (θa)2) 3 2 (1 b)2 8 θ ( θL 1 + θU 1) (86) Since θ = o(1), b = 1 o(1), and by DCBM model, maxi [n] θi , we have θa = r θ maxi [n] θi (max{1, b}) = o(1). Since b = 1 o(1), (1 + b)2 4. Therefore, b)2(1 (θa)2) 3 2 = 1 o(1). Substituting these results into (86), we obtain inf y {Risk( y)} 1 2 exp n 2[1 + o(1)](1 b)2 8 θ ( θL 1 + θU 1) o , (87) This concludes our proof of lower bound (75), with c4 = 1 J.2 PROOF OF UPPER BOUND (76 ) Proof. When K = 2, Risk(ˆy) = P(ˆy = 2|π = e1) + P(ˆy = 1|π = e2). The evaluation of P(ˆy = 2|π = e1) and P(ˆy = 1|π = e2) are exactly the same. Without the loss of generosity, we would focus on P(ˆy = 2|π = e1). Published as a conference paper at ICLR 2023 Recall that in the proof of 2, we define vk = f(A(k); ˆΠU), v = f(X; ˆΠU), vk = f(Ω(k); ˆΠU), v = f(EX; ˆΠU), k [K]. P(ˆy = 2|π = e1) = P(ψ(v2, v ) ψ(v1, v )|π = e1) v v2 v , v1 Recall that in proof of Lemma 8 9, we define ˆCk = {i U : ˆπi = ek K}, k = K + 1, ..., 2K. Let w = v2 v2 v1 v1 . Since when l [K], (v )l = (f(X; ˆΠU))l = X1(l) = X ; when l {K + 1, ..., 2K}, (v )l = (f(X; ˆΠU))l = X1(l) = X P(ˆy = 2|π = e1) = P K X i Ck L Xi + i ˆCk Xi π = e1 i Ck L wk(Xi EXi) + i ˆCk wk(Xi EXi) i Ck L wk EXi i ˆCk wk EXi π = e1 i Ck L wk(Xi EXi) + i ˆCk wk(Xi EXi) i Ck L wk EXi + i ˆCk wk EXi π = e1 i Ck L wk(Xi EXi) + i ˆCk wk(Xi EXi) i Ck L wk EXi + i ˆCk wk EXi A, π = e1 i (89) Since A and X are independent and for k [2K], wk is measurable with respect to A, given A, {wk(Xi EXi) : k [K], i Ck L} {wk(Xi EXi) : k [2K]\[K], i ˆCk} are a collection of independent random variables. Also, for any k [2K], i [n], E[wk(Xi EXi)|A] = wk(E[Xi|A] EXi) = wk(EXi EXi) = 0 Furthermore, 1 EXi Xi EXi Xi 1 So |wk(Xi EXi)| maxk [2K] |wk|. 
Published as a conference paper at ICLR 2023 Additionally, var(wk(Xi EXi)|A) = w2 kvar(Xi) = w2 k EXi(1 EXi) w2 k EXi i Ck L wk EXi + i ˆCk wk EXi = 1 i Ck L w2 k EXi + i ˆCk w2 k EXi = 1 where w w is defined as (w2 1, ..., w2 2K). Then by Lemma 7, P(ˆy = 2|π = e1) E h P 1 i Ck L wk(Xi EXi) + i ˆCk wk(Xi EXi) i Ck L wk EXi + i ˆCk wk EXi A, π = e1 i 3 maxk [2K] |wk|t 3 maxk [2K] |wk|| w, v | By Lemma 8, when ϕ 2 2πC2K2 θ(k) L 2 θ(k) L 1 θ 1 , P |(vk vk)l| 1 π K ϕ vk 2 exp C6 C2 ϕ2 θ(k) L 1 θ 1 2πC2K2 θ(k) L 2 θ(k) L 1 θ 1 , |1 b| θ mink [K] θ(k) L 1 Then because θ mink [K] θ(k) L 1 = o(1) and by condition 9, θ(k) L 2 θ(k) L 1 θ 1 c3βn = o(|1 b|), we have ϕ = o(|1 b|). Also, P k [K], l [2K], |(vk vk)l| 1 π l [2K] P |(vk vk)l| 1 π l [2K] 2 exp C6 C2 ϕ2 θ(k) L 1 θ 1 l [2K] 2 exp C2 (1 b)2 θ mink [K] θ(k) L 1 0.5 θ(k) L 1 mink [K] θ(k) L 1 θ θ 1 o(1)(1 b)2θ θ 1 inf y {Risk( y)} (91) Published as a conference paper at ICLR 2023 Hence, we can focus on the case where for all k [K], l [2K], |(vk vk)l| = o(|1 b|) vk . Until the end of the proof, we assume that we are under this case. We first evaluate w, v . Let w = v2 v2 v1 v1 . Then, | w, v w, v | = | ( v2 v2 v1 v1 ) ( v2 v2 v1 v1 ), v | v2 v2 v2 ) ( v1 v1 v1 v1 ), v = v |(cos ψ(v2, v ) cos ψ( v2, v )) (cos ψ(v1, v ) cos ψ( v1, v ))| = v | 2 sin ψ(v2, v ) ψ( v2, v ) 2 sin ψ(v2, v ) + ψ( v2, v ) + 2 sin ψ(v1, v ) ψ( v1, v ) 2 sin ψ(v1, v ) + ψ( v1, v ) 2 v sin |ψ(v2, v ) ψ( v2, v )| 2 sin ψ(v2, v ) + ψ( v2, v ) + 2 v sin |ψ(v1, v ) ψ( v1, v )| 2 sin ψ(v1, v ) + ψ( v1, v ) (By Lemma 4) 2 v sin ψ(v2, v2) 2 sin 2ψ( v2, v ) + ψ(v2, v2) + 2 v sin ψ(v1, v1) 2 sin 2ψ( v1, v ) + ψ(v1, v1) Since π = 1, b0 0, by Theorem 1, ψ( v1, v ) = ψ1 = 0, ψ( v2, v ) = ψ2 c0βn = c0|1 b|. On the other hand, that for all k [K], l [2K], |(vk vk)l| = o(|1 b|) vk indicates that vk vk = o(|1 b|) vk . By lemma 5, this implies that ψ(v1, v1) = o(|1 b|), k = 1, 2. Therefore, | w, v w, v | o(1) 2 v sin ψ( v2, v ) 2 sin ψ( v2, v ) = o(1) 2 v sin ψ( v2, v ) 2 2 sin ψ( v2, v ) 2 cos ψ( v2, v ) = o(1) v sin2 ψ( v2, v ) 2 cos ψ( v2, v ) o(1) v sin2 ψ( v2, v ) 2 o(1) v (1 cos ψ( v2, v )) = o(1) v (cos ψ( v1, v ) cos ψ( v2, v )) = o(1) v v1 v1 v2 v2 , v = o(1) ( w, v ) o(1) | w, v | (93) Therefore, w, v = (1 + o(1)) w, v . Let ηkl = P πi=ek,ˆπi=el θi. Let µ(k) a = θ(k) a 1, a {L, U}, k [K]. 
Then, v1 = µ(1) L (µ(1) L , bµ(2) L , µ(1) U η12 + bη21, bµ(2) U + η12 bη21) v2 = µ(2) L (bµ(1) L , µ(2) L , bµ(1) U bη12 + η21, µ(2) U + bη12 η21) v = θ (µ(1) L , bµ(2) L , µ(1) U η12 + bη21, bµ(2) U + η12 bη21) w, v = v v1 v v1 v v2 v v2 v Published as a conference paper at ICLR 2023 1 (1 b2)2(µ(1) L )2(µ(2) L )2 I1I2 4I2 3 v1 2 v2 2 I1 = (µ(1) L )2 + (µ(1) U η12)2 + η2 21 I2 = (µ(2) L )2 + (µ(2) U η21)2 + η2 12 I3 = (µ(1) U η12)η21 + (µ(2) U η21)η12 Notice that |I1I2 4I2 3 ((µ(1) L )2 + (µ(1) U )2)((µ(2) L )2 + (µ(2) U )2)| ((µ(1) L )2 + (µ(1) U )2 + (µ(2) L )2 + (µ(2) U )2)(η2 12 + η2 21) + 2η12µ(1) U ((µ(2) L )2 + (µ(2) U )2) + 2η21µ(2) U ((µ(1) L )2 + (µ(1) U )2) ((µ(1) L )2 + (µ(1) U )2 + (µ(2) L )2 + (µ(2) U )2)(η12 + η21)2 + 2η12µ(1) U ((µ(2) L )2 + (µ(2) U )2) + 2η21µ(2) U ((µ(1) L )2 + (µ(1) U )2) + 8(µ(1) U )2η2 21 + 8(µ(2) U )2η2 12 θ 2 1(η12 + η21)2 + 2 θ 3 1(η12 + η21) + 8 θ 2 1(η12 + η21)2 (9 b2 0 + 2 b0) θ 4 1 = o(1) θ 4 1 (95) On the other hand, ((µ(1) L )2 + (µ(1) U )2)((µ(2) L )2 + (µ(2) U )2) 1 4(µ(1) L + µ(1) U )2(µ(2) L + µ(2) U )2 4 θ(1) 2 1 θ(2) 2 1 4 min k [K] θ(k) 2 1 max k [K] θ(k) 2 1 (By condition (9)) 1 4C2 max k [K] θ(k) 4 1 1 64C2 θ 4 1 Therefore, I1I2 4I2 3 = (1 + o(1))((µ(1) L )2 + (µ(1) U )2)((µ(2) L )2 + (µ(2) U )2) (µ(1) L )2(µ(2) L )2 = (1 + o(1))((µ(1) L )2 + (µ(1) U )2 + (µ(2) L )2 + (µ(2) U )2)2 v = (1 + o(1))θ q (µ(1) L )2 + (µ(1) U )2 + (µ(2) L )2 + (µ(2) U )2 Published as a conference paper at ICLR 2023 v u u t1 (1 + o(1))(1 b2)2 ((µ(1) L )2 + (µ(1) U )2)((µ(2) L )2 + (µ(2) U )2) ((µ(1) L )2 + (µ(1) U )2 + (µ(2) L )2 + (µ(2) U )2)2 2(1 + o(1))θ (1 b2)2 ((µ(1) L )2 + (µ(1) U )2)((µ(2) L )2 + (µ(2) U )2) ((µ(1) L )2 + (µ(1) U )2 + (µ(2) L )2 + (µ(2) U )2)1.5 = 2(1 + o(1))θ (1 b)2 ((µ(1) L )2 + (µ(1) U )2)((µ(2) L )2 + (µ(2) U )2) ((µ(1) L )2 + (µ(1) U )2 + (µ(2) L )2 + (µ(2) U )2)1.5 (97) We turn to w w, v . Denote that µ(1) L v1 = (µ(1) L , bµ(2) L , µ(1) U η12 + bη21, bµ(2) U + η12 bη21) µ(2) L v2 = (bµ(1) L , µ(2) L , bµ(1) U bη12 + η21, µ(2) U + bη12 η21) One simple but useful fact is that ν1 , ν2 = O( θ 1). The reason is that for k [K] νk = O( νk 1) = O(µ(1) L + bµ(2) L + µ(1) U + bµ(2) U ) = O( θ 1) Notice that w3 = bµ(1) U bη12 + η21 ν2 µ(1) U η12 + bη21 = (bµ(1) U bη12 + η21)( 1 ν2 1 ν1 ) 1 b ν1 (µ(1) U η12 η21) = bµ(1) U bη12 + η21 ν1 µ(1) U + o(1 b) = bµ(1) U bη12 + η21 1 ν2 2 ν1 2 v1 µ(1) U + o(1 b) (98) ν2 2 ν1 2 = (1 b2)((µ(2) L )2 + (µ(2) U η21)2 (µ(1) L )2 (µ(1) U η12)2) = (1 b2)((µ(2) L )2 + (µ(2) U )2 (µ(1) L )2 (µ(1) U )2 + o(1) θ 2 1) (99) w3 = bµ(1) U + o(1) θ 1 ν1 1 2 ν2 2 (1 b2)((µ(2) L )2 + (µ(2) U )2 (µ(1) L )2 (µ(1) U )2 + o(1) θ 2 1) ν1 µ(1) U + o(1 b) 1 + (µ(2) L )2 + (µ(2) U )2 (µ(1) L )2 (µ(1) U )2 + o(1 b) (100) Published as a conference paper at ICLR 2023 Since for all k [K], l [2K], |(vk vk)l| = o(|1 b|) vk , we have |wk wk| = o(|1 b|), k = 1, 2, ..., 2K. Hence, denote γ = (µ(2) L )2+(µ(2) U )2 (µ(1) L )2 (µ(1) U )2 ν2 2 , we obtain ν1 µ(1) U (1 + γ) + o(1 b) (101) Similarly, we can show that ν1 µ(1) L (1 + γ) + o(1 b) (102) ν2 µ(2) L (1 γ) + o(1 b) (103) ν2 µ(2) U (1 γ) + o(1 b) (104) As a result, νk µ(k) a (1 ( 1)kγ) + o(1 b))2θ µ(k) a = θ (1 b)2 X (µ(k) a )3(1 ( 1)kγ)2 νk 2 + 2o(1)(µ(k) a )2(1 ( 1)kγ) νk 2 + o(1)2 µ(k) a νk 2 When θ 1 = o(1), the bounds (75) and (76 ) both become trivial, so we can focus on the case where θ 1 O(1). In this case, (µ(k) a )2(1 ( 1)kγ) νk 2 O(1), µ(k) a νk 2 O(1). 
On the other hand, for k [K], νk 2 = (µ(k) L )3 + (µ(k) U )3 (Holder Inequality) (µ(k) L + µ(k) U )3 = θ(k) 3 1 4 νk 2 mink [K] θ(k) 3 1 4 νk 2 (By (9)) maxk [K] θ(k) 3 1 4C3 2 νk 2 θ 3 1 4K3C3 2 νk 2 Since 1 γ and 1 + γ cannot be both o(1), we have (µ(k) a )3(1 ( 1)kγ)2 Published as a conference paper at ICLR 2023 w w, ν = (1 + o(1))θ (1 b)2 X (µ(k) a )3(1 ( 1)kγ)2 = (1 + o(1))4θ (1 b)2 ((µ(1) L )3 + (µ(1) U )3)((µ(2) L )2 + (µ(2) U )2)2 + ((µ(2) L )3 + (µ(2) U )3)((µ(1) L )2 + (µ(1) U )2)2 ((µ(1) L )2 + (µ(1) U )2 + (µ(2) L )2 + (µ(2) U )2)3 (107) By (101) (102) (103) (104), maxk [2K] |wk| = O(1 b) = o(1). Substituting this result together with (97) and (107) into (90), we obtain P(ˆy = 2|π = e1) (1 o(1))(1 b)2 θ(1) L 3 1+ θ(1) U 3 1 ( θ(1) L 2 1+ θ(1) U 2 1)2 + θ(2) L 3 1+ θ(2) U 3 1 ( θ(2) L 2 1+ θ(2) U 2 1)2 P(ˆy = 1|π = e2) (1 o(1))(1 b)2 θ(1) L 3 1+ θ(1) U 3 1 ( θ(1) L 2 1+ θ(1) U 2 1)2 + θ(2) L 3 1+ θ(2) U 3 1 ( θ(2) L 2 1+ θ(2) U 2 1)2 Risk(ˆy) 4 exp (1 o(1))(1 b)2 θ(1) L 3 1+ θ(1) U 3 1 ( θ(1) L 2 1+ θ(1) U 2 1)2 + θ(2) L 3 1+ θ(2) U 3 1 ( θ(2) L 2 1+ θ(2) U 2 1)2 Taking C4 = 4, we conclude the proof. K PROOF OF THEOREM 3 Theorem 3. Suppose the conditions of Corollary 1 hold, where b0 is properly small , and suppose that ˆΠU is b0-correct. Furthermore, we assume for sufficiently large constant C3, θ 1 C3 , θ mink [K] C3 θ(k) L 1, and for a constant r0 > 0, mink =ℓ{Pkℓ} r0. Then, there is a constant c2 = c2(K, C1, C2, C3, c3, r0) > 0 such that [ log( c2Risk(ˆy))]/[ log(inf y{Risk( y)})] c2. Proof. On one hand, for any k, k [K], k = k , using exactly the same proof as in Section J.1, we can show that when C3 > C1, inf y (P(ˆy = k|π = ek ) + P(ˆy = k |π = k)) 2(1 (θa)2) 3 2 θ θi Pkki p θ θi Pk ki 2 (111) where ki is the true label of node i, θa = maxi [n] maxk [K] p θ θi Pkki. According to DCBM model and condition (8), θa C C3 . Hence take C3 inf y {Risk( y)} max k =k [K] inf y (P(ˆy = k|π = ek ) + P(ˆy = k |π = ek)) Published as a conference paper at ICLR 2023 max k =k [K] 1 2 exp 2 θ θi Pkki p θ θi Pk ki 2 2θ min k =k [K] l=1 θ(l) 1 p Pk l 2 (112) Let B0 be the event that ˆΠU is b0-correct. When inf y{Risk( y)} is replaced by the version conditioning on B0, since X and A are independent, and B0 σ(A), conditioning on B0 or not does not affect the distribution of X. On the other hand, for any k, k , since π does not affect the distribution of A, the distribution of A|B0, π = ek and A|B0, π = ek are the same, so their Hellinger distance is still 0. Hence, all the proofs in Section J and above remain unaffected. In other words, one does not gain a lot of information from B0. On the other hand, notice that proof of Theorem 2 still works conditioning on B0. In other words, there exists constant C > 0, such that given B0, for any δ (0, 1/2), with probability 1 δ, simultaneously for 1 k K, | ˆψk(ˆΠU) ψk(ˆΠU)| C θ 1 min{θ , θ(k) L 1} + θ(k) L 2 θ(k) L 1 θ 1 Define βk = ψk(ˆΠU). Replacing c0βn by βk and replicating the proof of Corollary 1 , we can show that P(ˆy = k |B0, π = ek ) C c2 0 β2 k θ 1 min{θ , θ(k) L 1} Since θ mink [K] C3 θ(k) L 1, min{θ , θ(k) L 1} min{1, 1 C3 }θ , therefore, P(ˆy = k |B0, π = ek ) C c2 0 min{1, 1 C3 } β2 kθ θ 1 (113) Recall that in (20) of Section G, we show that βk = ψk(ˆΠU) q (ek ek ) M(ek ek ) (114) By Lemma 6, denote KC2 2b0 θU 1 Suppose that C5 1 4. 
Then, for any vector α Rk, |α Mα| 1 3C5 1 C5 |α M (0)α| 1 3|α M (0)α| (116) where recall that in Section G, we define M (0) = P G2 LL + G2 UU P DM (0) = diag(M (0) 11 , ..., M (0) KK), and M, M (0) = D 1 2 M (0)M (0)D 1 Hence, when b0 is sufficiently small, β2 k (ek ek ) M(ek ek ) 1 3|(ek ek ) M (0)(ek ek )| 1 M (0) kk q Published as a conference paper at ICLR 2023 Define λl = q θ(l) L 2 1 + θ(l) U 2 1, l [K]. Then, l=1 λ2 l Pkl Plk = l=1 λ2 l Pkl Pk l l=1 λ2 l Pkl Plk = l=1 λ2 l P 2 kl M (0) k k = l=1 λ2 l Pk l Plk = l=1 λ2 l P 2 k l Therefore, plugging the above result into (117), we have 1 M (0) kk q 1 PK l=1 λ2 l Pkl Pk l q PK l=1 λ2 l P 2 kl q PK l=1 λ2 l P 2 k l v u u t1 (PK l=1 λ2 l P 2 kl)(PK l=1 λ2 l P 2 k l) (PK l=1 λ2 l Pkl Pk l)2 (PK l=1 λ2 l P 2 kl)(PK l=1 λ2 l P 2 k l) 3 (PK l=1 λ2 l P 2 kl)(PK l=1 λ2 l P 2 k l) (PK l=1 λ2 l Pkl Pk l)2 (PK l=1 λ2 l P 2 kl)(PK l=1 λ2 l P 2 k l) (118) Notice that l=1 λ2 l P 2 kl)( l=1 λ2 l P 2 k l) ( l=1 λ2 l Pkl Pk l)2 l, l [K] λ2 l λ2 l (Pkl Pk l Pk l Pk l)2 l [K] λ2 l λ2 k(Pkl Pk k Pkk Pk l)2 + X l [K] λ2 l λ2 k (Pkl Pk k Pkk Pk l)2 (Identification Condition) = X l [K] λ2 l λ2 k(Pkk Pkl Pk l)2 + λ2 k (Pkl Pkk Pk l)2 (Cauchy-Schwartz) X l [K] λ2 l 1 1 λ2 k + 1 λ2 k Pkk Pkl Pk l + Pkl Pkk Pk l 2 l [K] λ2 l 1 1 λ2 k + 1 λ2 k (1 + Pkk )2(Pkl Pk l)2 1 1 minl [K] λ2 l + 1 minl [K] λ2 l l [K] λ2 l (Pkl Pk l)2 2 min l [K] λ2 l X l [K] λ2 l (Pkl Pk l)2 (119) Substituting (119) into (118), we have minl [K] λ2 l P l [K] λ2 l (Pkl Pk l)2 (PK l=1 λ2 l P 2 kl)(PK l=1 λ2 l P 2 k l) Published as a conference paper at ICLR 2023 (By Condition (8)) 1 6C4 1 minl [K] λ2 l P l [K] λ2 l (Pkl Pk l)2 (PK l=1 λ2 l )2 (120) On one hand, by Cauchy-Schwartz inequality, λ2 l = θ(l) L 2 1 + θ(l) U 2 1 2( θ(l) L 1 + θ(l) U 1)2 2 θ(l) 2 1 (121) min l [K] λ2 l 1 2( min l [K] θ(l) 1)2 (By Condition (9)) 1 C2 max l [K] θ(l) 1)2 2( 1 KC2 θ 1)2 = 1 2K2C2 2 θ 2 1 (122) On the other hand, l=1 ( θ(l) L 2 1 + θ(l) U 2 1) l=1 θ(l) L 1 + l=1 θ(l) U 1)2 = θ 2 1 (123) Plugging (121), (122), (123) into (120), we obtain β2 k 1 6C4 1 minl [K] λ2 l P l [K] λ2 l (Pkl Pk l)2 (PK l=1 λ2 l )2 1 2K2C2 2 θ 2 1 P l [K] 1 2 θ(l) 2 1(Pkl Pk l)2 = 1 24K2C4 1C2 2 l [K] θ(l) 2 1(Pkl Pk l)2 1 24K2C4 1C2 2 minl [K] θ(l) 1 P l [K] θ(l) 1( Pkl Pk l)2 θ 2 1( Pkl + Pk l) (By (9), min k =ℓ{Pkℓ} r0) 1 24K2C4 1C2 2 1 C2 maxl [K] θ(l) 1 P l [K] θ(l) 1( Pkl Pk l)2 1 24K2C4 1C2 2 1 KC2 θ 1 P l [K] θ(l) 1( Pkl Pk l)2 = 1 48 r0K3C4 1C3 2 l [K] θ(l) 1( Pkl Pk l)2 Substituting (124) into (113), we obtain P(ˆy = k |B0, π = ek ) C c2 0 min{1, 1 C3 } β2 kθ θ 1 Published as a conference paper at ICLR 2023 k=1 exp C8θ K X l=1 θ(l) 1( p Pk l)2 (125) where C8 = C0 c2 0 min{1, 1 C3 } 1 48 r0K3C4 1C3 2 = min{1, 1 C3 } C0 48 r0K3c2 0C4 1C3 2 . As a result, 1 CK2 Risk(ˆy|B0) = 1 CK2 k =1 P(ˆy = k |B0, π = ek ) k=1 exp C8θ K X l=1 θ(l) 1( p k=1 exp C8θ K X l=1 θ(l) 1( p max k =k [K] exp C8θ K X l=1 θ(l) 1( p = exp C8θ min k =k [K] l=1 θ(l) 1( p Pk l)2 (126) Comparing (112) and (126), we have log( 1 2 CK2 Risk(ˆy|B0)) log(inf y{Risk( y)}) log(2) + C8I where I = θ mink =k [K] PK l=1 θ(l) 1( Pkl Pk l)2 denotes the efficient information in the data. Notice that since I 0, when C8 2 2, log(2)+C8I log(2)+2 2I 1; when C8 2 2, log(2)+C8I log(2)+2 2. Therefore, log( 1 2 CK2 Risk(ˆy|B0)) log(inf y{Risk( y)}) log(2) + C8I 2I min{1, C8 Take c2 = min{1, C8 2, 1 2 CK2 }. 
Then c2 only depends on K, C1, C2, C3, c3, r0 (recall that both C0 and c0 only depend on K, C1, C2, C3, c3, r0, and C = 2), and log( c2Risk(ˆy|B0)) log(inf y{Risk( y)}) log( 1 2 CK2 Risk(ˆy|B0)) log(inf y{Risk( y)}) min{1, C8 2} c2 (129) This concludes our proof. L PROOF OF THEOREM 4, 5 L.1 PROOF OF THEOREM 4 Theorem 4. Consider the DCBM model where (8)-(9) hold. We apply SCORE+ to obtain ˆΠU\{i} and plug it into the above algorithm. As n , suppose for some constant q0 > 0 , mini U θi q0 maxi U θi, βn θU q0 p log(n), β2 n θ 1 mini U θi , and β2 n θ 1 mink{ θ(k) L 1} . Then, 1 |U| P i U P(ˆyi = ki) 0, so the in-sample classification algorithm in section 3 is consistent. Published as a conference paper at ICLR 2023 Proof. Let i = arg maxi U P(ˆyi = ki). Then i U P(ˆyi = ki) P(ˆyi = ki ) Notice that the assumptions of theorem 4 directly imply the assumptions of Corollary 2 when taking i as the new node. Hence, regard i as the new node and leveraging on Corollary 2, we have P(ˆyi = ki ) 0 Therefore, 1 |U| i U P(ˆyi = ki) 0 In other words, the in-sample classification algorithm in section 3 is consistent. L.2 PROOF OF THEOREM 5 Theorem 5. Suppose the conditions of Corollary 1 hold, where b0 is properly small , and suppose that ˆΠU\{i} is b0-correct for all i U. Furthermore, we assume for sufficiently large constant C3, maxi U θi 1 C3 , maxi U θi mink [K] C3 θ(k) L 1, log(|U|) C3β2 n θ 1 mini U θi, and for a constant r0 > 0, mink =ℓ{Pkℓ} r0. Then, there is a constant c21 = c21(K, C1, C2, C3, c3, r0) > 0 such that [ log( c21Riskins(ˆy))]/[ log(inf y{Riskins( y)})] c21, so the in-sample classification algorithm in section 3 is efficient. Proof. For i U, define individual risk Risk( yi) = P k [K] P( yi = k |πi = ek ). Then the in-sample risk Riskins( y) = 1 |U| P i U Risk(ˆyi). The minimizer of Riskins( y) may not exist, so we define y(0) to be an approximate minimizer such that Riskins( y(0)) 2 inf y{Riskins( y)}. By the definition of infimum, such y(0) always exists as long as inf y{Riskins( y)} > 0. Notice that for any i U, inf y{Riskins( y)} inf yi 1 |U|{Risk( yi)}. Regarding node i as the new node and leveraging on (112), we know that inf yi{Risk( yi)} > 0. Hence, inf y{Riskins( y)} inf yi 1 |U|{Risk( yi)} > 0 (note that we are not taking n or |U| here), and y(0) is well-defined. Let i = arg maxi U Risk(ˆyi), and let k be the true label of i . Regard [n]\{i } as the existing nodes in the network and i as the new node. By (112) and (126), we have log( 1 2 CK2 Risk(ˆyi )) log(2) + C8Ii (130) log({Risk( y(0) i )}) inf yi log({Risk( yi )}) log(2) + 2 where Ii = θi mink =k [K](PK l=1 θ(l) 1( Pkl Pk l)2 θi (1 Pkk )2) denotes the efficient information in the data for classifying node i . 
As a result, we have log( 1 4 CK2 Riskins(ˆy)) log(inf y{Riskins( y)}) log( 1 4 CK2 Riskins(ˆy)) 2Riskins( y(0))) = log( 1 4 CK2 1 |U| P i U Risk(ˆyi)) log( 1 2|U| P i U Risk( y)(0) i ) Published as a conference paper at ICLR 2023 log( 1 4 CK2 Risk(ˆyi )) log( 1 |2U|Risk( y)(0) i ) (By (130), (131)) log(4) + C8Ii 2Ii + log(|U|) (132) Notice that = θi min k =k [K]( l=1 θ(l) 1( p Pk l)2 θi (1 p θi min k =k [K]( θ(k) 1( p Pk k)2 + θ(k ) 1( p Pk k )2 θi (1 p (By identification condition that Pkk = Pk k = 1) = θi min k =k [K](( θ(k) 1 + θ(k ) 1 θi )(1 p (The true label of node i is k ) θi min k =k [K]( θ(k) 1(1 p Pk k)2) (133) By assumption 8, for any k [K], Pk k)2 = (1 Pk k)2 (1 + Pk k)2 = (2 2Pk k)2 4(1 + Pk k)2 (By identification condition that Pkk = Pk k = 1) = (Pkk + Pk k 2Pk k)2 4(1 + Pk k)2 = ((ek ek ) P(ek ek ))2 4(1 + Pk k)2 (|λmin(P)| ek ek 2)2 β2 n (1 + C1)2 (134) By assumption 9, for any k [K] θ(k) 1 min k =k [K] θ(k) 1 C2 max k =k [K] θ(k) 1 k =k [K] θ(k) 1 = 1 KC2 θ 1 (135) From the assumption, log(|U|) C3β2 n θ 1 mini U θi. Plugging (134) (135) into (133), we obtain Ii θi min k =k [K]( 1 KC2(1 + C1)2 β2 n θ 1) 1 KC2(1 + C1)2 β2 n θ 1 min i U θi Published as a conference paper at ICLR 2023 1 KC2C3(1 + C1)2 log(|U|) (136) In other words, log(|U|) KC2C3(1 + p C1)2Ii (137) Substituting (137) into (132), we have log( 1 4 CK2 Riskins(ˆy)) log(inf y{Riskins( y)}) log(4) + C8Ii log(4) + (2 2 + KC2C3(1 + C1)2)Ii (138) Similar to the proof of Theorem 3, notice that since Ii 0, when C8 2 2+KC2C3(1+ C1)2, log(4) + C8Ii log(4) + (2 2 + KC2C3(1 + C1)2)Ii 1 (139) ; when C8 2 2 + KC2C3(1 + C1)2, log(4) + C8Ii log(4) + (2 2 + KC2C3(1 + C1)2)Ii log(4) C8 2 2+KC2C3(1+ C1)2 + C8Ii log(4) + (2 2 + KC2C3(1 + C1)2)Ii 2 + KC2C3(1 + C1)2 (140) log( 1 4 CK2 Riskins(ˆy)) log(inf y{Riskins( y)}) log(4) + C8Ii log(4) + (2 2 + KC2C3(1 + C1)2)Ii min{1, C8 2 2 + KC2C3(1 + C1)2 } (141) Take c21 = min{1, C8 2 2+KC2C3(1+ C1)2 , 1 4 CK2 }. Then c21 only depends on K, C1, C2, C3, c3, r0 (recall that both C0 and c0 only depend on K, C1, C2, C3, c3, r0, and C = 2), and log( c21Riskins(ˆy)) log(inf y{Riskins( y)}) log( 1 4 CK2 Riskins(ˆy)) log(inf y{Riskins( y)}) min{1, C8 2 2 + KC2C3(1 + C1)2 } This concludes our proof.
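(For completeness, the in-sample classification procedure analysed in Theorems 4 and 5 — single out each unlabeled node, treat it as the new node, and predict its label — can be sketched as follows. This is our own illustrative sketch: `estimate_Pi_U` is a user-supplied clustering routine standing in for SCORE+, which the theorems actually assume, and the helper names are ours.)

```python
import numpy as np

def _angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def _predict(A, X, L_pos, U_pos, Pi_L, H):
    # Angle rule: compare f(A^(k); H) with f(X; H) and take the closest community.
    f = lambda x: np.concatenate([x[L_pos] @ Pi_L, x[U_pos] @ H])
    fX = f(X)
    ang = [_angle(f(A[L_pos[Pi_L[:, k] == 1]].sum(axis=0)), fX)
           for k in range(Pi_L.shape[1])]
    return int(np.argmin(ang))

def in_sample_classify(A, L_idx, U_idx, Pi_L, estimate_Pi_U):
    # Section 3 reduction: for each unlabeled node i, remove it from the network,
    # re-estimate the communities of the remaining unlabeled nodes, and predict
    # the label of i by the angle rule, treating i as the new node.
    y_hat = {}
    for i in U_idx:
        rest_U = np.array([j for j in U_idx if j != i])
        keep = np.concatenate([L_idx, rest_U])
        A_sub = A[np.ix_(keep, keep)]                    # network without node i
        X = A[i, keep]                                   # edges from node i to the rest
        H = estimate_Pi_U(A[np.ix_(rest_U, rest_U)])     # membership estimate for U \ {i}
        L_pos = np.arange(len(L_idx))                    # labeled positions inside A_sub
        U_pos = np.arange(len(L_idx), len(keep))         # unlabeled positions inside A_sub
        y_hat[int(i)] = _predict(A_sub, X, L_pos, U_pos, Pi_L, H)
    return y_hat
```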