Weighted Graph Clustering with Non-Uniform Uncertainties

Yudong Chen  YUDONG.CHEN@EECS.BERKELEY.EDU
University of California, Berkeley, Berkeley, CA 94720, USA

Shiau Hong Lim  MPELSH@NUS.EDU.SG
National University of Singapore, Singapore 117575

Huan Xu  MPEXUH@NUS.EDU.SG
National University of Singapore, Singapore 117575

Abstract

We study the graph clustering problem where each observation (an edge or a no-edge between a pair of nodes) may have a different level of confidence/uncertainty. We propose a clustering algorithm based on optimizing an appropriate weighted objective, where larger weights are given to observations with lower uncertainty. Our approach leads to a convex optimization problem that is efficiently solvable. We analyze our approach under a natural generative model, and establish theoretical guarantees for recovering the underlying clusters. Our main result is a general theorem that applies to any given weights and any distribution for the uncertainty. By optimizing over the weights, we derive a provably optimal weighting scheme, which matches the information-theoretic lower bound up to logarithmic factors and leads to strong performance bounds in several specific settings. By optimizing over the uncertainty distribution, we show that non-uniform uncertainties can actually help. In particular, if the graph is built by spending a limited amount of resource to take measurements on each node pair, then it is beneficial to allocate the resource in a non-uniform fashion so as to obtain accurate measurements on a few pairs of nodes, rather than inaccurate measurements on many pairs. We provide simulation results that validate our theoretical findings.

1. Introduction

Graph clustering concerns finding densely connected groups of nodes in a graph. Here an edge between two nodes usually indicates an underlying similarity or affinity between the nodes, whereas the absence of an edge indicates dissimilarity and distance. Therefore, the goal of graph clustering is to infer groups of closely related nodes given the (often noisy) similarity/dissimilarity observations encoded in the graph. Graph clustering is an important subroutine in many applications, such as community detection, user profiling and VLSI network partitioning (Mishra et al., 2007; Yahoo!-Inc, 2009; Krishnamurthy, 1984). In many of these applications, however, the edge/non-edge observations between node pairs may represent very different levels of confidence in the similarity of the nodes. In some cases, the observation of an edge (resp. the absence of an edge) may be generated by accurate measurements, and is thus a strong indicator that the two nodes should be assigned to the same (resp. different) clusters. In other circumstances, the observations may be very uncertain, and an edge or the absence of it provides little information about the cluster structure. As an extreme case, the observations between some node pairs may carry no information at all, so these pairs are effectively unobserved. An example with non-uniform uncertainties is crowd-clustering (Gomes et al., 2011; Yi et al., 2012), where a number of users are asked whether or not they think a pair of nodes (e.g., movies or images) are similar, and the final graph is obtained by aggregating the users' answers, for instance by taking a majority vote.
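To make the crowd-clustering example concrete, here is a minimal sketch of such an aggregation step. It is not from the paper: the vote counts are hypothetical, and the shrinkage rule used to turn vote margins into per-pair error probabilities is only an illustrative heuristic.

```python
import numpy as np

def aggregate_votes(yes, no):
    """Majority-vote aggregation of per-pair crowd answers.

    yes, no : (n x n) arrays of "similar" / "dissimilar" vote counts.
    Returns the adjacency matrix A and a crude per-pair error
    probability P (an illustrative heuristic, not the paper's
    estimator): many unanimous votes push P_ij toward 0, while a
    pair with no votes is fully uncertain (P_ij = 1/2).
    """
    total = yes + no
    A = (yes > no).astype(int)                       # majority vote
    margin = np.abs(yes - no) / np.maximum(total, 1)  # in [0, 1]
    P = 0.5 - 0.5 * margin * (total / (total + 1.0))  # shrink for few votes
    P[total == 0] = 0.5                               # unvoted = unobserved
    return A, P

rng = np.random.default_rng(0)
yes = rng.integers(0, 10, size=(5, 5))
no = rng.integers(0, 10, size=(5, 5))
A, P = aggregate_votes(yes, no)   # P_ij is small for lopsided, well-voted pairs
```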
The confidence levels are naturally different across pairs: a pair receiving a large number of unanimous votes has a higher confidence level than one receiving few or divergent votes; in particular, pairs receiving no votes are completely uncertain. In such a non-uniform setting, each pair of nodes should be treated differently according to the level of uncertainty between them. In many cases, a priori knowledge of the uncertainty levels is available, since the graph is built from a known or controlled process. Intuitively, taking advantage of such knowledge should improve clustering performance.

Our contributions: In this paper, we exploit the above intuition and propose a new approach for clustering graphs with non-uniform edge uncertainties. Our approach is based on finding the clustering that optimizes an appropriate weighted objective, where larger weights are given to node pairs with lower levels of uncertainty. Doing so leads to an intractable combinatorial optimization problem, and our algorithm is a convex relaxation of it. To study the performance of our approach, we consider a natural probabilistic model for generating a random graph from an unknown set of clusters with non-uniform uncertainties. We provide theoretical guarantees for when the solution to the convex relaxation coincides with that of the combinatorial problem and exactly recovers the underlying clusters. Our main result is a general theorem that applies to any given weights and any uncertainty distribution. The theorem leads to a principled way of choosing the weights optimally based on knowledge of the uncertainties. By specializing this general result to different settings for the uncertainties, we recover the best known theoretical guarantees for clustering partially observed graphs, and obtain new results for more general non-uniform settings. In particular, we show that a weighted approach using knowledge of the non-uniform uncertainties has order-wise better guarantees than an unweighted approach. We call this the power of knowledge.

We use the above results to obtain theoretical insights into a resource allocation problem in graph clustering. As a corollary of our main theorem, it can be shown that the clustering problem becomes easier when the uncertainties are more non-uniform across the node pairs (provided that the knowledge of the uncertainty levels is appropriately exploited). Therefore, if the uncertainty level of the observation on each node pair depends on the resource devoted to it, then it is often more beneficial, sometimes significantly so, to allocate the resource in a non-uniform way, with most of the resource spent on a small number of node pairs so that they have low uncertainty levels. We provide simulation results to corroborate our theoretical findings. The results demonstrate that the weighted approach and the optimal weights outperform other methods, and that non-uniform resource allocation leads to performance gains.

1.1. Related Work

Planted partition model/Stochastic block model: The setup in this paper is closely related to the classical planted partition model (Condon & Karp, 2001), also known as the stochastic block model (Rohe et al., 2011). In these models, n nodes are partitioned into several groups, called the underlying clusters, and a random graph is generated by placing an edge between each pair of nodes independently with probability p or q (with p > q), depending on whether the two nodes belong to the same cluster.
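For reference, a minimal sketch of this generative process follows; the cluster sizes and the values of p and q are illustrative assumptions.

```python
import numpy as np

def planted_partition(sizes, p, q, seed=0):
    """Sample a graph from the planted partition model.

    Within-cluster pairs get an edge with probability p, cross-cluster
    pairs with probability q (p > q). Returns the adjacency matrix A
    and the true cluster matrix Y*.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    Y = (labels[:, None] == labels[None, :]).astype(int)  # true cluster matrix
    prob = np.where(Y == 1, p, q)
    A = (rng.random(prob.shape) < prob).astype(int)
    A = np.triu(A, 1)
    A = A + A.T                                           # symmetric, no self-loops
    return A, Y

A, Y_star = planted_partition([40, 30, 30], p=0.7, q=0.2)
```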
The goal is to recover the underlying clusters given the graph. Various approaches have been proposed for this problem, including spectral clustering algorithms (McSherry, 2001; Giesen & Mitsche, 2005; Chaudhuri et al., 2012; Rohe et al., 2011) and randomized combinatorial methods (Shamir & Tsur, 2007). Most related to ours are the convex optimization approaches in Chen et al. (2012; 2011); Ames & Vavasis (2011); Oymak & Hassibi (2011); Jalali et al. (2011); Mathieu & Schudy (2010), which are similar to an unweighted version of our method. With a few exceptions detailed below, most previous work focuses on the uniform uncertainty case.

Graph clustering with non-uniform uncertainty: Chaudhuri et al. (2012) explicitly consider non-uniform uncertainty under the planted partition model. They study a setup where each node i is associated with a confidence $d_i$, and the probability of placing an edge between nodes i and j is $d_i p d_j$ (resp. $d_i q d_j$) if i and j belong to the same cluster (resp. different clusters). A spectral clustering approach is proposed to tackle the problem. In our model the non-uniformity is pair-wise and need not have a product form as in their work.

As we explained earlier, clustering with partial observations can be considered a special case of non-uniform uncertainties. Here, an edge or the absence of an edge is observed for a subset of the node pairs; for the other node pairs only a "?" is observed, meaning no information is available, which corresponds to maximum uncertainty in our non-uniform setting. A line of work explicitly addresses this setup. One natural approach, taken by Oymak & Hassibi (2011), is to convert the problem into one with uniform uncertainty by imputing the missing observations (either with no-edges or with random choices), and then to apply standard (unweighted) graph clustering methods. A more refined approach deals with the partial observations directly (Shamir & Tishby, 2011; Jalali et al., 2011). The work by Jalali et al. (2011) has the best known guarantees for the planted partition model with randomly missing observations. Their formulation is a special case of the more general method in this paper, and our theoretical results subsume theirs. There exists other work on clustering with partial observations (e.g., Balcan & Gupta, 2010; Voevodski et al., 2010; Krishnamurthy et al., 2012), but under rather different settings; it is also unclear how to generalize these methods to the more general non-uniform uncertainty setting.

Another related line of work is correlation clustering (Bansal et al., 2004). There the goal is to find a set of clusters that minimizes the total number of disagreements between the clusters and the observed graph. One may also consider minimizing a weighted sum of the disagreements (Demaine et al., 2006), leading to a weighted objective similar to ours (sketched below). Work on correlation clustering focuses on establishing NP-hardness of the optimization problem and on developing approximation schemes (Bansal et al., 2004; Giotis & Guruswami, 2006; Charikar et al., 2005). In contrast, we take a statistical approach akin to the planted partition model, and study conditions under which the underlying clusters can be recovered with high probability. Therefore, the theoretical results for correlation clustering are not directly comparable to ours.
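To illustrate the connection, a weighted disagreement count of a candidate clustering against the observed graph can be evaluated as follows. This is a sketch only; the weight matrix W is an assumed input, not a construction prescribed by either line of work.

```python
import numpy as np

def weighted_disagreements(A, Y, W):
    """Weighted correlation-clustering objective.

    A : observed adjacency matrix (0/1)
    Y : candidate cluster matrix (Y_ij = 1 iff i and j share a cluster)
    W : nonnegative per-pair weights
    A disagreement is an observed edge across clusters or a missing
    edge within a cluster; each is counted with its weight W_ij.
    """
    cross_edges = W * A * (1 - Y)      # edges placed across clusters
    missing_within = W * (1 - A) * Y   # non-edges inside a cluster
    return (cross_edges + missing_within).sum()
```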
Recovering sparse signals and low-rank matrices with non-uniform priors: The problem of matrix decomposition (Candès et al., 2011; Chandrasekaran et al., 2011) concerns separating a low-rank matrix from sparse entry-wise errors of arbitrary magnitude. A standard approach is based on convex optimization, using the trace norm (a.k.a. nuclear norm) as a convex surrogate for the rank function. Chen et al. (2013); Li (2013) consider extensions of the problem with unobserved entries. A similar approach has been applied to graph clustering (Jalali et al., 2011; Chen et al., 2011; Oymak & Hassibi, 2011; Mathieu & Schudy, 2010). Our clustering algorithm can be considered as separating a low-rank matrix from sparse errors with a non-uniform prior. While our analysis is restricted to graph clustering, our method and intuition may be relevant to the general matrix decomposition problem. To the best of our knowledge, matrix decomposition with general non-uniform error probabilities has not been studied in the literature.

A related problem is the recovery of a sparse vector signal, for which non-uniform priors have been considered. In a standard setting, each entry of the unknown vector is known to have a different probability of being non-zero. Khajehnejad et al. (2011) propose a weighted $\ell_1$ norm minimization approach to address this problem. The analysis of the approach mainly focuses on the special case where the non-zero probability can take one of two possible values (Khajehnejad et al., 2011; Oymak et al., 2011; Krishnaswamy et al., 2012). In this two-value setting, it is shown that the weighted approach is superior to an unweighted one.

2. Problem Setup and Algorithms

In this section we provide a formal setup of the graph clustering problem with non-uniform uncertainty levels. We consider a probabilistic model for generating the graph and the edge uncertainties based on a set of underlying unknown clusters. We then present our weighted formulation for recovering the underlying clusters, which is efficiently solvable and weighs each node pair differently.

2.1. The Generative Model

Suppose there are n nodes which are partitioned into r unknown clusters, each of size at least K. We observe an unweighted graph on the nodes, given as an adjacency matrix $A \in \{0,1\}^{n \times n}$, which is generated as follows. For two nodes i and j in the same cluster, we observe an edge between them (i.e., $A_{ij} = 1$) with probability $1 - P_{ij}$, and no edge otherwise; for i and j in different clusters, we observe an edge between them with probability $P_{ij}$, and no edge otherwise. Therefore, $P_{ij}$ can be considered the probability of a false observation between i and j, where a false observation is a missing edge between two nodes in the same cluster (a false negative), or an edge between two nodes in different clusters (a false positive). We are interested in the case where the $P_{ij}$'s are potentially different across the pairs $(i,j)$, so the uncertainties associated with the observations $A_{ij}$ are non-uniform. In particular, $P_{ij} = 0$ means that $A_{ij}$ is a noiseless observation of the cluster relation between nodes i and j, whereas $P_{ij} = \frac{1}{2}$ means $A_{ij}$ is purely random and thus the relation between i and j is unobserved. We use $P = (P_{ij})_{i,j=1}^{n} \in \mathbb{R}^{n \times n}$ to denote the matrix of error probabilities.
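The following sketch samples a graph from this model given the true cluster matrix and an error-probability matrix; the inputs are illustrative, and this is a prototype rather than the paper's simulation code.

```python
import numpy as np

def sample_graph(Y_star, P, seed=0):
    """Sample A from the non-uniform uncertainty model of Section 2.1.

    Y_star : true cluster matrix (1 iff same cluster)
    P      : per-pair error probabilities, with values in [0, 1/2]
    A_ij equals Y*_ij with probability 1 - P_ij and is flipped with
    probability P_ij (a false negative or a false positive).
    """
    rng = np.random.default_rng(seed)
    flip = (rng.random(P.shape) < P).astype(int)
    A = np.abs(Y_star - flip)          # flip the true relation w.p. P_ij
    A = np.triu(A, 1)
    A = A + A.T                        # keep A symmetric, zero diagonal
    return A
```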
It is in general an ill-posed problem to recover the clusters for arbitrary error probabilities P. For example, if $P_{ij} = \frac{1}{2}$ for all j for some node i, then the graph contains no information about node i, so exact cluster recovery is impossible. To avoid such pathological situations, we assume that P is randomly generated with i.i.d. entries. In particular, for each (i, j) and independently of all others, $P_{ij}$ is a random variable with some distribution $Q$ supported on $[0, 1/2]$, where $Q$ is the probability mass function if $P_{ij}$ takes discrete values, or the probability density function if $P_{ij}$ is continuous.

2.2. Our Algorithm

Our algorithm is based on finding a clustering of the nodes that optimizes an appropriate objective. To this end we need the notion of a cluster matrix: given a clustering of the n nodes, the associated cluster matrix is an n-by-n 0-1 matrix Y such that $Y_{ij} = 1$ if and only if nodes i and j are assigned to the same cluster. Let $Y^*$ be the true cluster matrix associated with the underlying clusters that generate the graph A. Our goal is therefore to recover $Y^*$ from the graph A. A natural approach, akin to correlation clustering (Bansal et al., 2004), is to find a clustering Y that maximizes the sum of the total number of edges inside the clusters and the total number of missing edges across clusters. This can be written as maximizing the quantity

$$\sum_{i,j} A_{ij} Y_{ij} + \sum_{i,j} (1 - A_{ij})(1 - Y_{ij}) = \sum_{i,j} (2A_{ij} - 1)\, Y_{ij} + C$$

over all cluster matrices Y, where C collects the terms independent of Y. This formulation gives equal weight to each node pair. When the uncertainty levels are non-uniform, we consider instead the following weighted formulation:

$$\max_{Y \in \mathbb{R}^{n \times n}} \; \sum_{i,j} B_{ij}\,(2A_{ij} - 1)\, Y_{ij} \quad \text{s.t. } Y \text{ is a cluster matrix,} \qquad (1)$$

where $B_{ij} \ge 0$ is the weight assigned to the pair (i, j). The formulation (1) is a hard combinatorial optimization problem because there are exponentially many possible cluster matrices. To obtain a tractable formulation, we relax the constraint "Y is a cluster matrix" to a set of convex constraints. Specifically, we consider the following:

$$\max_{Y \in \mathbb{R}^{n \times n}} \; \sum_{i,j} B_{ij}\,(2A_{ij} - 1)\, Y_{ij} \qquad (2)$$
$$\text{s.t. } \; Y \in S, \qquad (3)$$
$$0 \le Y_{ij} \le 1, \quad \forall\, i, j, \qquad (4)$$

where S is a convex set that contains $Y^*$. We may use either one of the following choices:

$$S_{\text{nuclear}} := \left\{ Y \in \mathbb{R}^{n \times n} : \|Y\|_* \le n \right\}, \qquad S_{\text{psd}} := \left\{ Y \in \mathbb{R}^{n \times n} : \operatorname{trace}(Y) \le n,\; Y \succeq 0 \right\}; \qquad (5)$$

here $\|Y\|_*$ is the trace norm (a.k.a. nuclear norm) of Y, defined as the sum of the singular values of Y, and $Y \succeq 0$ is the positive semidefinite constraint. Both $S_{\text{nuclear}}$ and $S_{\text{psd}}$ are standard convex relaxations for positive semidefinite low-rank matrices and cluster matrices (Mathieu & Schudy, 2010; Jalali et al., 2011). For both choices of S, the formulation (2)–(4) is a semidefinite program (SDP) and can be solved in polynomial time. Fast first-order solvers can also be used; we describe one such solver in the simulation section.
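For prototyping, the relaxation (2)–(5) can be handed to an off-the-shelf convex solver. Below is a minimal sketch using cvxpy (a tooling choice made here for illustration; the paper's simulations use a fast first-order solver), shown with the $S_{\text{psd}}$ constraint by default:

```python
import cvxpy as cp
import numpy as np

def weighted_cluster_relaxation(A, B, use_psd=True):
    """Solve the convex relaxation (2)-(5) of the weighted formulation.

    A : 0/1 adjacency matrix, B : nonnegative weight matrix.
    Returns the optimizer Y (equal to the cluster matrix Y* in the
    exact-recovery regime).
    """
    n = A.shape[0]
    Y = cp.Variable((n, n), symmetric=True)
    objective = cp.Maximize(cp.sum(cp.multiply(B * (2 * A - 1), Y)))  # (2)
    constraints = [Y >= 0, Y <= 1]                                    # (4)
    if use_psd:
        constraints += [Y >> 0, cp.trace(Y) <= n]                     # S_psd
    else:
        constraints += [cp.normNuc(Y) <= n]                           # S_nuclear
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return Y.value
```

When the conditions of Theorem 1 below hold, the relaxation is tight: the solver returns the 0-1 matrix $Y^*$ itself, so no rounding step is needed.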
3. Theoretical Guarantees

In this section, we provide theoretical guarantees for the formulation (2)–(5) in recovering the true cluster matrix $Y^*$. We first present a general main theorem that applies to any weights $\{B_{ij}\}$ and any distribution Q for the error probabilities $\{P_{ij}\}$. We next derive a provably optimal way of choosing the weights, and characterize its performance using the general theorem. We then specialize our theorem to different settings of Q and $\{B_{ij}\}$. In the sequel, "with high probability" (w.h.p.) means with probability at least $1 - n^{-10}$.

3.1. Main Theorem

We assume that the weights satisfy $B_{ij} = f(P_{ij})$ for some function f, so $B_{ij}$ depends on the value of $P_{ij}$ but not on the location (i, j). In this case, the $B_{ij}$'s are in general random and identically distributed. A constant function $f(\cdot)$ corresponds to uniform weights that are independent of the error probabilities. We have the following general theorem.

Theorem 1. Suppose there exists $b > 0$ such that $0 \le B_{ij} \le b$ almost surely for all (i, j). Then $Y^*$ is the unique optimal solution to the program (2)–(5) with high probability if condition (6) holds for all (i, j) with a universal constant $c_0$. The theorem applies to both choices in (5), and the expectations and probabilities are with respect to the randomness of $\{P_{ij}\}$, $\{B_{ij}\}$ and A.

Remark 1. The condition (6) is identical for different (i, j) since the $(P_{ij}, B_{ij})$'s are identically distributed. This remark applies to any expression that involves the expectations or distributions of $P_{ij}$, $B_{ij}$ and $A_{ij}$. We note that b in the theorem can be any number and is allowed to scale with n, K, etc.

3.2. Optimal Weighting and Maximum Likelihood

We now turn to the question of how to choose the weights $B_{ij} = f(P_{ij})$, or equivalently the function $f(\cdot)$. Under the generative model in Section 2.1, a natural candidate is the Maximum Likelihood objective, which we now show is a special case of our weighted objective (2). Given the graph A and the error probabilities P, the Maximum Likelihood Estimator (MLE) searches for a cluster matrix Y that maximizes the log-likelihood $\log \mathbb{P}_Y(A)$ of observing A given Y, where

$$\log \mathbb{P}_Y(A) = \sum_{i,j} \log\left[ (1 - P_{ij})^{A_{ij} Y_{ij}}\, P_{ij}^{(1 - A_{ij}) Y_{ij}}\, P_{ij}^{A_{ij} (1 - Y_{ij})}\, (1 - P_{ij})^{(1 - A_{ij})(1 - Y_{ij})} \right] = \sum_{i,j} (2A_{ij} - 1)\, Y_{ij} \log \frac{1 - P_{ij}}{P_{ij}} + C_0,$$

where $C_0$ collects the terms that are independent of Y. Therefore, the MLE objective corresponds to our objective (2) with the weights $B_{ij} = \log \frac{1 - P_{ij}}{P_{ij}}$. In fact, we may use any upper bound $P_{ij} \le \bar{P}_{ij} \le \frac{1}{2}$ on the exact error probability $P_{ij}$, which provides additional flexibility in the sequel. We refer to the resulting weights as the MLE weights, namely

$$B^{\mathrm{MLE}}_{ij} = \log \frac{1 - \bar{P}_{ij}}{\bar{P}_{ij}}, \qquad \text{for each } (i,j). \qquad (7)$$

Remark 2. The MLE weight $B^{\mathrm{MLE}}_{ij}$ has a natural interpretation. When $\bar{P}_{ij} = P_{ij} = \frac{1}{2}$, the observation $A_{ij}$ on the pair (i, j) is purely random. In this case, $B^{\mathrm{MLE}}_{ij} = 0$, so we assign zero weight to the pair (i, j). If $\bar{P}_{ij} = P_{ij} \to 0$, then $A_{ij}$ is noiseless and we have exact knowledge of the cluster relationship between i and j; in particular, $A_{ij} = 1$ if and only if i and j are in the same underlying cluster. In this case, the MLE weight satisfies $B^{\mathrm{MLE}}_{ij} \to +\infty$, so by maximizing $\sum_{i,j} B_{ij}(2A_{ij} - 1)Y_{ij}$, we force $Y_{ij}$ to equal $A_{ij}$, agreeing with the exact knowledge we possess.

Our analysis shows that the weights $B^{\mathrm{MLE}}_{ij}$ are order-wise optimal under certain technical conditions, in the sense that their performance is (order-wise) at least as good as that of any other choice of f for the weights $B_{ij}$. In fact, they are order-wise as good as any algorithm for recovering $Y^*$. To see this, we first derive a guarantee for the MLE weights. The following corollary is proved using the general Theorem 1. Recall that $\bar{P}_{ij} \le \frac{1}{2}$ is any upper bound on $P_{ij}$.

Corollary 1. Suppose $\frac{1}{2} - \bar{P}_{ij} \ge \epsilon$ for all (i, j) almost surely, for some $\frac{1}{4} \ge \epsilon > 0$. The weighted formulation (2)–(5) with the MLE weights $B_{ij} = B^{\mathrm{MLE}}_{ij}$ has a unique optimal solution equal to $Y^*$ with high probability provided condition (8) holds for all (i, j) with a universal constant $c_1$. Moreover, the condition (8) is satisfied if we take $\bar{P}_{ij} = \max\{P_{ij}, \,\cdot\,\}$ suitably truncated away from zero and

$$\mathbb{E}\left[\left(\frac{1}{2} - P_{ij}\right)^{2}\right] \ge c_1\, \frac{n \log^{2} n}{K^{2}}, \qquad \forall (i,j). \qquad (9)$$

Again we note that $\epsilon$ in the corollary may scale with n and K, and need not be bounded away from zero.
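In code, the MLE weighting rule (7) is a one-liner. The sketch below also caps $\bar{P}_{ij}$ away from zero (the specific floor value is an illustrative assumption) so that the weights stay bounded, as required by Theorem 1:

```python
import numpy as np

def mle_weights(P_bar, p_floor=1e-3):
    """MLE weights (7): B_ij = log((1 - Pbar_ij) / Pbar_ij).

    P_bar : upper bounds on the true error probabilities, in (0, 1/2].
    Pairs with Pbar_ij = 1/2 (pure noise) get weight 0; the floor
    p_floor (an illustrative choice) keeps the weights finite for
    near-noiseless pairs, so 0 <= B_ij <= log((1 - p_floor)/p_floor).
    """
    P_clipped = np.clip(P_bar, p_floor, 0.5)
    return np.log((1.0 - P_clipped) / P_clipped)
```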
We next establish a theorem that characterizes the performance limit of all algorithms. The theorem provides a lower bound on the minimax error probability and is proved by an information-theoretic argument. It generalizes similar lower bounds by Chaudhuri et al. (2012); Chen et al. (2011) for the uniform uncertainty setting.

Theorem 2. Suppose there are r = 2 clusters with equal size K = n/2. Let $\mathcal{Y} := \{Y : Y \text{ is a cluster matrix}\}$ be the set of all possible cluster matrices. If

$$\mathbb{E}\left[\left(\frac{1}{2} - P_{ij}\right)^{2}\right] \le \frac{c_0}{n}, \qquad \forall (i,j) \qquad (10)$$

for some universal constant $c_0$, then we have

$$\inf_{\hat{Y}} \; \max_{Y^* \in \mathcal{Y}} \; \mathbb{P}\left[\hat{Y}(A, P) \ne Y^*\right] \ge c_1$$

for a universal constant $c_1 > 0$, where the infimum is over all measurable functions $\hat{Y}$ that map (A, P) to an element of $\mathcal{Y}$, and the probability $\mathbb{P}[\cdot]$ is with respect to the randomness of A and P.

Theorem 2 shows that in the case with r = 2 clusters, any algorithm fails with positive probability if (10) holds. In this setting, Corollary 1 guarantees that the formulation (2)–(4) with the MLE weights succeeds w.h.p. if $\mathbb{E}\big[\big(\frac{1}{2} - P_{ij}\big)^2\big] \ge c \frac{\log^2 n}{n}$ for all (i, j). This matches the condition (10) up to a logarithmic factor, and thus cannot be substantially improved. This shows that the MLE weights are order-wise optimal in this case. We expect this to be true more generally. Indeed, our simulations show that the MLE weights do outperform other weighting schemes in a broad range of settings.

3.3. Consequences

3.3.1. THE POWER OF KNOWLEDGE

The above results characterize the benefit of utilizing the knowledge of P via a weighted approach, as compared to an unweighted approach that ignores P. Suppose $P_{ij} \ge p_0$ for some universal constant $p_0 > 0$. If we use uniform weights $B_{ij} \equiv 1$, corresponding to an unweighted formulation, then Theorem 1 shows that the formulation succeeds w.h.p. if

$$\left(\frac{1}{2} - \mathbb{E}\left[P_{ij}\right]\right)^{2} \ge c\, \frac{n \log^{2} n}{K^{2}}, \qquad \forall (i,j). \qquad (11)$$

On the other hand, if we have access to knowledge of $P_{ij}$, we may use the optimal MLE weights $B_{ij} = B^{\mathrm{MLE}}_{ij}$, and Corollary 1 guarantees that the weighted formulation succeeds w.h.p. as long as the condition (9) is satisfied. The RHS of (9) is always no greater than the RHS of (11) (up to constants). More importantly, the LHS of (9) is strictly larger than the LHS of (11) whenever $P_{ij}$ is not equal to a constant almost surely, because $\mathbb{E}\big[\big(\frac{1}{2} - P_{ij}\big)^2\big] > \big(\mathbb{E}\big[\frac{1}{2} - P_{ij}\big]\big)^2$. The gap is exactly the variance of $P_{ij}$, and is large precisely when the error probability is far from uniform.

3.3.2. TWO-LEVEL UNCERTAINTY AND PARTIAL OBSERVATIONS

We consider a more concrete setting where $P_{ij}$ is non-uniform and can take one of two values. In particular, we assume that $P_{ij} = p_1$ with probability q and $P_{ij} = p_2$ with probability $1 - q$, where $p_1 < p_2$. If we use uniform weights $B_{ij} \equiv 1$, then by applying (11) and computing the expectation, we obtain that the unweighted convex formulation succeeds if

$$\left(\frac{1}{2} - p_2\right)^{2} + 2q\left(\frac{1}{2} - p_2\right)(p_2 - p_1) + q^{2}(p_2 - p_1)^{2} \gtrsim \frac{n \log^{2} n}{K^{2}}.$$

If we use the optimal weights $B^{\mathrm{MLE}}_{ij}$, then (9) in Corollary 1 guarantees that the weighted formulation succeeds if

$$\left(\frac{1}{2} - p_2\right)^{2} + 2q\left(\frac{1}{2} - p_2\right)(p_2 - p_1) + q\,(p_2 - p_1)^{2} \gtrsim \frac{n \log^{2} n}{K^{2}}.$$

Note that the left-hand side becomes strictly larger, since $q(p_2 - p_1)^2 > q^2(p_2 - p_1)^2$, and hence the weighted formulation succeeds for a wider range of the model parameters.

A special case of the above setting is $p_2 = \frac{1}{2}$. This means that with probability $1 - q$, $A_{ij}$ is purely random and effectively unobserved, and with probability q it is observed but contains an error with probability $p_1$. This coincides with the graph clustering with partial observations problem that has been studied before. In this case the unweighted approach and the weighted approach require

$$q^{2}\left(\frac{1}{2} - p_1\right)^{2} \gtrsim \frac{n \log^{2} n}{K^{2}} \qquad \text{and} \qquad q\left(\frac{1}{2} - p_1\right)^{2} \gtrsim \frac{n \log^{2} n}{K^{2}},$$

respectively. Therefore, in the case with $K = \Theta(n)$ and $p_1 = \frac{1}{4}$, the weighted approach can handle as few as $qn^2 = \Theta(n \log^2 n)$ observations, whereas the unweighted approach requires $\Omega(n\sqrt{n} \log n)$ observations, which is order-wise larger. We note that in this case the MLE weights are equivalent to assigning zero weight to unobserved node pairs and a uniform positive weight to observed pairs. Our result matches the best existing bounds for the partial observation setting given in Jalali et al. (2011); Chen et al. (2011).
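The variance gap between the two criteria is easy to see numerically. The sketch below evaluates the left-hand sides of (9) and (11) for the two-level setting with $p_2 = \frac{1}{2}$; the parameter values are illustrative, not from the paper.

```python
# Two-level uncertainty: P_ij = p1 w.p. q, and p2 w.p. 1-q (here p2 = 1/2,
# the partial-observation case). Illustrative parameter values.
p1, p2, q = 0.25, 0.5, 0.05

E_P = q * p1 + (1 - q) * p2
lhs_unweighted = (0.5 - E_P) ** 2                               # LHS of (11)
lhs_weighted = q * (0.5 - p1) ** 2 + (1 - q) * (0.5 - p2) ** 2  # LHS of (9)

print(lhs_unweighted)   # 0.00015625
print(lhs_weighted)     # 0.003125
# The gap equals Var(P_ij) = q * (1 - q) * (p2 - p1)**2 = 0.00296875.
```

With these values the weighted criterion's left-hand side is 20 times larger, reflecting the q versus $q^2$ factor on the $(p_2 - p_1)^2$ term.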
4. Resource Allocation in Graph Clustering

Our results in the previous section show that, given a graph with known uncertainties, one can achieve better performance guarantees by employing appropriate weights in solving the optimization problem. Here, we look at a complementary problem: suppose we can control the uncertainties by spending some available resource (e.g., performing queries or measurements) to build the graph; how should we allocate this limited amount of resource in order to optimize the performance guarantee of the cluster recovery procedure? We show that our theoretical results provide a principled solution to this resource allocation problem.

Suppose that we wish to recover the underlying clusters by first assigning the probability distribution of $P_{ij}$ for each pair of nodes and then using our proposed weighted scheme. The required amount of resource $M_{ij}$ should naturally be higher if the error probability $P_{ij}$ is small, and vice versa. The exact relationship between $M_{ij}$ and $P_{ij}$ depends on the particular setup being considered. By Corollary 1, in order to maximize the probability of successful recovery, we should aim to maximize $\mathbb{E}\big[\big(\frac{1}{2} - P_{ij}\big)^2\big]$ over all possible distributions of $P_{ij}$, subject to our resource constraint. We examine several different scenarios for the relationship between $P_{ij}$ and $M_{ij}$.

Model 1: We first consider a linear model $M_{ij} = \gamma\left(\frac{1}{2} - P_{ij}\right)$, or equivalently $P_{ij} = \frac{1}{2} - \frac{M_{ij}}{\gamma}$, where $\gamma > 0$. Note that $M_{ij} = 0$ implies $P_{ij} = \frac{1}{2}$, which is equivalent to an unobserved pair. Let $M = \mathbb{E}\big[\sum_{i,j} M_{ij}\big]$ …
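Under this linear model, the benefit of concentrating the budget is already visible in a small computation. The sketch below (with illustrative parameters; the fraction of pairs being measured is the design variable) compares spreading a total budget evenly over all pairs against concentrating it on a fraction of pairs, using the figure of merit $\mathbb{E}\big[\big(\frac{1}{2} - P_{ij}\big)^2\big]$ from Corollary 1:

```python
# Linear resource model (Model 1): M_ij = gamma * (1/2 - P_ij).
# A total expected budget M is spread over N node pairs. Illustrative values.
gamma, N, M = 1.0, 10**6, 10**4

def merit(fraction):
    """E[(1/2 - P_ij)^2] when a 'fraction' of pairs share the budget evenly.

    Each measured pair gets m = M / (fraction * N) resource, hence
    1/2 - P_ij = m / gamma (capped at 1/2); unmeasured pairs have
    P_ij = 1/2 and contribute nothing to the expectation.
    """
    m = M / (fraction * N)
    return fraction * min(m / gamma, 0.5) ** 2

for frac in [1.0, 0.1, 0.02]:
    print(frac, merit(frac))
# 1.0   1e-04   (uniform: every pair measured, low accuracy)
# 0.1   1e-03   (10x fewer pairs, 10x the figure of merit)
# 0.02  5e-03   (concentrating keeps helping until m / gamma hits 1/2)
```

Since merit(fraction) equals $M^2 / (\gamma^2 \cdot \text{fraction} \cdot N^2)$ until the accuracy cap binds, the figure of merit grows as the measurements concentrate on fewer pairs, matching the claim that non-uniform allocation helps.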