# Scalable Affinity Propagation for Massive Datasets

Hiroaki Shiokawa, Center for Computational Sciences, University of Tsukuba, Japan (shiokawa@cs.tsukuba.ac.jp)

Abstract: Affinity Propagation (AP) is a fundamental algorithm for identifying clusters in data objects. Given similarities among objects, it iteratively performs message updates between all pairs of data objects until convergence. Although AP yields higher clustering quality than other methods, it is computationally expensive, so it has difficulty handling massive datasets with numerous data objects: the message updates require a cost quadratic in the number of data objects. Here, we propose a novel fast algorithm, Scale AP, which outputs the same clusters as AP within a shorter computation time. Scale AP dynamically excludes unnecessary message updates without sacrificing clustering accuracy. Our extensive evaluations demonstrate that Scale AP outperforms existing AP algorithms in terms of running time by up to two orders of magnitude.

## 1 Introduction

Affinity Propagation (AP) (Frey and Dueck 2007) is one of the most successful clustering methods for summarizing complicated data objects in an unsupervised way. AP detects clusters and their corresponding representative data objects, called exemplars, by message-passing processes between all pairs of data objects. AP can (1) achieve lower clustering errors than other methods and (2) support any similarity, including ones that do not satisfy the triangle inequality (Sun and Guo 2014; Fujiwara et al. 2015). Hence, AP has been employed in many applications (Dueck and Frey 2007; Ambrogi et al. 2008; Kazantseva and Szpakowicz 2011).

Although AP is useful in many applications, it has a quadratic cost to identify clusters, since the message-passing iteratively updates real-valued messages for all object pairs. If $N$ is the number of data objects and $T$ is the number of iterations, AP needs $O(N^2 T)$ time. In the late 2000s, AP was applied only to small datasets composed of a few thousand data objects (e.g., social networks and location datasets). However, recent AI-powered applications handle massive datasets with numerous data objects because data resources are becoming both larger and more prevalent (Shiokawa, Fujiwara, and Onizuka 2013, 2015). Hence, current AP algorithms require several days to obtain clusters from massive datasets.

### 1.1 Existing Approaches and Challenges

Many studies have strived to overcome this expensive cost. One major approach is message sampling (Jia et al. 2008; Sun et al. 2017). AP updates the real-valued messages for all object pairs, but it is more reasonable to update only the essential pairs that are potentially included in the same cluster. To specify such pairs, message sampling drops unpromising message updates before the message-passing. For instance, FSAP (Jia et al. 2008) constructs a k-nearest-neighbor (kNN) graph from the object similarities to prune unnecessary object pairs. Similarly, Fast AP (Sun et al. 2017) extracts important messages by first finding micro-clusters. Although these methods are faster than AP, they have two drawbacks. First, they still incur high costs for sampling messages, e.g., $O(N^2)$ time. Second, they sacrifice clustering accuracy, which makes it difficult to realize truly effective applications.
Instead of sampling methods, Fujiwara et al. proposed message-bounding techniques (Fujiwara, Irie, and Kitahara 2011). They found that the real-valued messages converge as the number of iterations increases, and they theoretically derived upper and lower bounds on the messages. Based on these bounds, Fujiwara et al. designed incremental message-pruning algorithms, including Graph AP (Fujiwara, Irie, and Kitahara 2011) and F-AP (Fujiwara et al. 2015). Graph AP and F-AP guarantee that their pruning does not sacrifice clustering accuracy. Unlike other AP algorithms, Graph AP and F-AP can reduce the running time while maintaining the same clustering results as AP.

Although the bounding methods have improved efficiency, they still suffer from large computational costs on massive datasets (Matsushita, Shiokawa, and Kitagawa 2018). In the bounding methods, estimating the upper and lower bounds requires $O(N^2)$ time; their total time complexity is thus $O(N^2 + MT)$, where $M$ is the number of non-pruned pairs of data objects. Furthermore, the bounding methods fail to prune unnecessary messages if a dataset includes many large clusters, since the bounds become loose in such cases. In the worst case this yields $O(M) = O(N^2)$. Consequently, the bounding methods still require almost the same cost as AP, i.e., $O(N^2 T)$ time. Hence, it remains a challenging task to develop a scalable AP algorithm that guarantees the same results as AP.

### 1.2 Our Approaches and Contributions

Our goal is to speed up AP without sacrificing its clustering accuracy. Here, we present a novel efficient algorithm, Scale AP, which is designed to reduce the number of message updates while assuring clustering accuracy. The basic idea underlying Scale AP is to remove redundant message updates from the message-passing. To identify which updates to exclude, Scale AP focuses on a deterministic property of the message updates. In this paper, we theoretically clarify that most message updates are uniquely determined by the corresponding similarities between data objects (Lemmas 2 and 4); that is, messages need not be computed repeatedly for such object pairs. Based on this property, Scale AP employs two approaches to improve efficiency: (1) it theoretically derives prunable pairs that do not require repeated computations during the message-passing (Section 3.2), and (2) it introduces aggregated message updates to efficiently skip computations for prunable pairs based on the deterministic property (Section 3.3). As a result, Scale AP has the following attractive characteristics:

- Efficiency: Scale AP achieves faster clustering than the state-of-the-art AP algorithms proposed in the last few years (Section 4.1). We prove that Scale AP has a better time complexity than those algorithms (Theorem 1).
- Exactness: We theoretically and experimentally confirm that Scale AP always outputs the same clustering results as AP, even though it dynamically removes message updates (Theorem 2 and Section 4.4).
- Scalability: Scale AP is more scalable than the other AP algorithms (Section 4.3). It shows nearly linear scalability with respect to the number of data objects (Section 3.4).
- Easy to deploy: Unlike existing fast algorithms for AP, Scale AP does not require user-specified parameters (Algorithm 1). Scale AP thus provides a simple solution for applications that use AP.
Scale AP is the first solution to achieve high efficiency while ensuring the same clustering results as AP on massive datasets. Scale AP outperforms state-of-the-art algorithms by up to two orders of magnitude in terms of clustering time. For example, for 120,000 data objects, Scale AP finds clusters within 30 minutes, whereas AP consumes more than 24 hours. AP is a fundamental tool to enhance application quality, but it is difficult to apply to massive datasets. Scale AP, which is well suited to massive datasets, should therefore improve the quality of diverse AI-powered applications.

## 2 Preliminary

We formally define our notation and briefly review AP (Frey and Dueck 2007). Let $X = \{x_1, x_2, \ldots, x_N\}$ be a set of data objects, $s(i,j)$ be a real-valued similarity of a data object pair $(x_i, x_j)$, and $e(i)$ be the exemplar of data object $x_i$. AP extracts clusters and the corresponding exemplars from the pairwise similarities $S = \{s(i,j) \mid x_i, x_j \in X\}$. It places each data object $x_i$ into the same cluster as its exemplar $e(i)$ so as to maximize $\sum_{i=1}^{N} s(i, e(i))$. If $i = j$, $s(i,j)$ is referred to as the preference, a special class of similarity that controls the number of clusters (Frey and Dueck 2007). The preference is typically set to the minimum or the median value of the similarities.

To identify exemplars, AP performs a message-passing process derived from belief propagation on factor graphs (Kschischang, Frey, and Loeliger 2001; Yedidia, Freeman, and Weiss 2005). AP iteratively exchanges two types of real-valued messages among data objects: responsibility and availability. In the $t$-th iteration, AP sends a responsibility $r_t(i,j)$ from $x_i$ to $x_j$, which indicates how strongly $x_i$ wants to choose $x_j$ as its exemplar. It also sends an availability $a_t(i,j)$ from $x_j$ to $x_i$, which reflects the accumulated evidence for how well suited $x_j$ would be as the exemplar of $x_i$. Letting $\lambda \in (0,1)$ be a damping factor, AP iteratively updates all messages until convergence:

$$r_t(i,j) = (1-\lambda)\,\rho_t(i,j) + \lambda\, r_{t-1}(i,j), \qquad (1)$$
$$a_t(i,j) = (1-\lambda)\,\alpha_t(i,j) + \lambda\, a_{t-1}(i,j), \qquad (2)$$

where $\rho_t(i,j)$ and $\alpha_t(i,j)$ are computed as

$$\rho_t(i,j) = \begin{cases} s(i,j) - \max_{k \neq j}\{a_{t-1}(i,k) + s(i,k)\} & (i \neq j) \\ s(i,j) - \max_{k \neq j}\{s(i,k)\} & (i = j) \end{cases}, \qquad (3)$$
$$\alpha_t(i,j) = \begin{cases} \min\bigl\{0,\ r_{t-1}(j,j) + \sum_{k \neq i,j} \max\{0, r_{t-1}(k,j)\}\bigr\} & (i \neq j) \\ \sum_{k \neq i} \max\{0, r_{t-1}(k,j)\} & (i = j) \end{cases}. \qquad (4)$$

If $t = 0$, the messages are initialized as

$$r_0(i,j) = s(i,j) - \max_{k \neq j}\{s(i,k)\}, \quad \text{and} \quad a_0(i,j) = 0. \qquad (5)$$

After the message-passing terminates, AP determines an exemplar $e(i)$ for each data object $x_i$ as

$$e(i) = \arg\max_j \{r_t(i,j) + a_t(i,j)\}. \qquad (6)$$

AP requires $O(N^2 T)$ time, where $N$ and $T$ are the numbers of data objects and iterations, respectively. That is, if a given dataset is too large, AP incurs an excessive running time and fails to extract clusters and the corresponding exemplars from massive datasets.
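To make the update rules concrete, the following is a minimal NumPy sketch of the plain AP iteration in Eqs. (1)-(6). It is our own illustrative rendering, not the authors' code: it uses dense matrices, the common vectorized form in which the same damped update is applied to every pair (including $i = j$), and a fixed iteration count instead of a convergence test; the function names (`plain_ap`, `_rho`, `_alpha`) are assumptions.

```python
import numpy as np

def _rho(S, A):
    """rho_t(i, j) = s(i, j) - max_{k != j} { a_{t-1}(i, k) + s(i, k) }  (Eq. 3)."""
    N = S.shape[0]
    AS = A + S
    top = AS.argmax(axis=1)                       # best k for each row i
    best = AS[np.arange(N), top]
    AS[np.arange(N), top] = -np.inf
    second = AS.max(axis=1)                       # best k excluding the argmax
    rho = S - best[:, None]
    rho[np.arange(N), top] = S[np.arange(N), top] - second
    return rho

def _alpha(R):
    """alpha_t(i, j) of Eq. (4), built from positive responsibilities."""
    Rp = np.maximum(R, 0.0)
    np.fill_diagonal(Rp, R.diagonal())            # keep r(j, j) unclipped
    col = Rp.sum(axis=0)
    alpha = np.minimum(0.0, col[None, :] - Rp)    # i != j case
    np.fill_diagonal(alpha, col - R.diagonal())   # i == j case
    return alpha

def plain_ap(S, lam=0.5, max_iter=1000):
    """Plain AP (Frey and Dueck 2007) following Eqs. (1)-(6).

    S   : (N, N) similarity matrix; the diagonal S[i, i] holds the preference.
    lam : damping factor lambda in (0, 1).
    Returns e, where e[i] is the exemplar chosen by data object x_i (Eq. 6).
    """
    N = S.shape[0]
    A = np.zeros((N, N))                          # a_0(i, j) = 0         (Eq. 5)
    R = _rho(S, A)                                # r_0(i, j)             (Eq. 5)
    for _ in range(max_iter):                     # T iterations, O(N^2) each
        R = (1.0 - lam) * _rho(S, A) + lam * R    # responsibility        (Eq. 1)
        A = (1.0 - lam) * _alpha(R) + lam * A     # availability          (Eq. 2)
    return np.argmax(R + A, axis=1)               # exemplars             (Eq. 6)
```

Every iteration touches all $N^2$ pairs, which is exactly the cost that the proposed method targets.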
## 3 Proposed Method: Scale AP

We present Scale AP, which efficiently detects the same clustering results as AP. We first overview the main ideas and then provide detailed descriptions of Scale AP.

### 3.1 Ideas

Our goal is to efficiently find clusters and the corresponding exemplars without sacrificing clustering accuracy. AP iteratively computes real-valued messages, responsibility and availability, for all pairs of data objects until the messages are no longer updated. By contrast, Scale AP skips unnecessary message updates by introducing two approaches. First, we theoretically identify prunable pairs of data objects for both responsibility and availability. The message updates for prunable pairs are deterministically computed from the corresponding similarities without message-passing processes. Second, we employ an aggregated message update to remove unnecessary message updates for prunable pairs while maintaining the same clustering accuracy as AP. Unlike message sampling and bounding methods, which invoke exhaustive message computations, Scale AP computes only the essential object pairs to efficiently find clusters and the corresponding exemplars.

Our ideas have two main advantages. (1) Scale AP finds all clusters and exemplars on real-world datasets within a short running time. Our ideas successfully handle data manifolds, which are well-known intrinsic structures embedded in real-world data (Roweis and Saul 2000; Belkin and Niyogi 2003). Such manifolds involve many similar object pairs, which are removed by the aggregated message updates; this is why Scale AP can increase its performance on real-world datasets. (2) Scale AP always outputs the same clusters and exemplars as AP (Frey and Dueck 2007). We theoretically demonstrate that Scale AP does not miss opportunities to improve clustering accuracy, even though it dynamically removes prunable pairs from the message-passing processes. Thus, Scale AP does not sacrifice clustering quality compared to AP.

### 3.2 Prunable Pairs

We introduce prunable pairs of data objects, whose messages are uniquely computed from the corresponding similarity.

Prunable responsibility: We first define prunable pairs for responsibility computations.

Definition 1 (Prunable pairs for responsibility). In the $t$-th iteration, let $\hat{R}_t(i)$ be the prunable pairs of responsibility computations for data object $x_i$. Then $\hat{R}_t(i)$ is given as

$$\hat{R}_t(i) = \{(x_i, x_j) \in X^2 \mid i \neq j,\ j \neq \eta_{t-1}(i)\}, \qquad (7)$$

where $\eta_{t-1}(i) = \arg\max_k \{a_{t-1}(i,k) + s(i,k)\}$.

Note that $\hat{R}_t(i)$ includes all object pairs between $x_i$ and each object of $X$ except for $(x_i, x_i)$ and $(x_i, x_{\eta_{t-1}(i)})$. From Definition 1, we can derive the following property:

Lemma 1. Given two object pairs $(x_i, x_j), (x_i, x_k) \in \hat{R}_t(i)$, $\rho_t(i,j) = \rho_t(i,k)$ always holds iff $s(i,j) = s(i,k)$.

Proof. We first prove $\rho_t(i,j) = \rho_t(i,k) \Rightarrow s(i,j) = s(i,k)$. From Definition 1, $x_i \neq x_j$ and $x_i \neq x_k$ since $(x_i, x_j), (x_i, x_k) \in \hat{R}_t(i)$; moreover, $j, k \neq \eta_{t-1}(i)$, so $\max_{l \neq j}\{a_{t-1}(i,l) + s(i,l)\} = \max_{l \neq k}\{a_{t-1}(i,l) + s(i,l)\}$. Hence, as shown in Eq. (3), we have

$$s(i,j) = \rho_t(i,j) + \max_{l \neq j}\{a_{t-1}(i,l) + s(i,l)\} = \rho_t(i,k) + \max_{l \neq k}\{a_{t-1}(i,l) + s(i,l)\} = s(i,k). \qquad (8)$$

Thus, if $\rho_t(i,j) = \rho_t(i,k)$, then $s(i,j) = s(i,k)$ holds. Similarly, we can prove $s(i,j) = s(i,k) \Rightarrow \rho_t(i,j) = \rho_t(i,k)$. Therefore, Lemma 1 holds.

Lemma 1 implies that object pairs $(x_i, x_j)$ and $(x_i, x_k)$ in $\hat{R}_t(i)$ always have $\rho_t(i,j) = \rho_t(i,k)$ if their similarities are equivalent, i.e., $s(i,j) = s(i,k)$, and vice versa. By Lemma 1, we derive the following property, which plays a key role in Section 3.3.

Lemma 2. If $(x_i, x_j) \in \hat{R}_t(i)$, $r_t(i,j)$ is uniquely determined by the corresponding similarity $s(i,j)$.

Proof. We prove this by contradiction. Given $(x_i, x_j), (x_i, x_k) \in \hat{R}_t(i)$, assume $s(i,j) = s(i,k)$ and $r_t(i,j) \neq r_t(i,k)$. From Eq. (1), we have

$$r_t(i,j) = (1-\lambda)\sum_{l=1}^{t} \lambda^{t-l}\rho_l(i,j) + \lambda^t r_0(i,j). \qquad (9)$$

From Lemma 1, $\rho_l(i,j) = \rho_l(i,k)$ for $l = 1, \ldots, t$. Additionally, from Eq. (5), $r_0(i,j) = r_0(i,k)$ since $s(i,j) = s(i,k)$. Thus, we can derive

$$r_t(i,j) = (1-\lambda)\sum_{l=1}^{t} \lambda^{t-l}\rho_l(i,k) + \lambda^t r_0(i,k) = r_t(i,k). \qquad (10)$$

Since Eq. (10) contradicts the assumption, $s(i,j) = s(i,k) \Rightarrow r_t(i,j) = r_t(i,k)$ holds. This completes the proof of Lemma 2.

Prunable availability: Next, we define prunable pairs for availability computations.
Definition 2 (Prunable pairs for availability). In the $t$-th iteration, the prunable pairs of availability computations for data object $x_j$, denoted by $\hat{A}_t(j)$, are defined as

$$\hat{A}_t(j) = \{(x_i, x_j) \in X^2 \mid i \neq j,\ \max_k\{r_{t-1}(k,j)\} \leq 0\}. \qquad (11)$$

The prunable pairs $\hat{A}_t(j)$ contain all object pairs $(x_i, x_j) \in X^2$ except for $(x_j, x_j)$ only if $x_j$ does not receive any positive responsibility in the $(t-1)$-th iteration. Otherwise, $\hat{A}_t(j) = \emptyset$. From Definition 2, we have the following property:

Lemma 3. Given two object pairs $(x_i, x_j), (x_k, x_j) \in \hat{A}_t(j)$, $\alpha_t(i,j) = \alpha_t(k,j)$ always holds.

Proof. As shown in Definition 2, $\max_l\{r_{t-1}(l,j)\} \leq 0$ holds since $(x_i, x_j), (x_k, x_j) \in \hat{A}_t(j)$. Thus, we can derive

$$\textstyle\sum_{l \neq i,j} \max\{0, r_{t-1}(l,j)\} = 0. \qquad (12)$$

From Eqs. (4) and (12), we can compute $\alpha_t(i,j)$ as

$$\alpha_t(i,j) = \min\{0, r_{t-1}(j,j)\} = \alpha_t(k,j), \qquad (13)$$

which completes the proof of Lemma 3.

The object pairs $(x_i, x_j)$ and $(x_k, x_j)$ included in $\hat{A}_t(j)$ always satisfy $\alpha_t(i,j) = \alpha_t(k,j)$, regardless of their similarities. From Lemma 3, we can derive the following key property, which is essential in Section 3.3.

Lemma 4. If $(x_i, x_j) \in \hat{A}_t(j)$, $a_t(i,j)$ is uniquely determined by the corresponding initial responsibility $r_0(j,j)$.

Proof. From Eq. (2) and Lemma 3, we can derive $a_t(i,j)$ as

$$a_t(i,j) = (1-\lambda)\textstyle\sum_{l=1}^{t} \min\{0, r_{l-1}(j,j)\}. \qquad (14)$$

That is, $a_t(i,j)$ is determined by $r_{l-1}(j,j)$ in each iteration. From Eq. (1), we can also derive $r_{t-1}(j,j)$ as

$$r_{t-1}(j,j) = (1-\lambda)\textstyle\sum_{l=1}^{t-1} \lambda^{t-1-l}\rho_l(j,j) + \lambda^{t-1} r_0(j,j). \qquad (15)$$

Recall that, as shown in Eqs. (3) and (5), $\rho_{t-1}(j,j) = \rho_{t-2}(j,j) = \cdots = \rho_1(j,j) = r_0(j,j)$ clearly holds. Thus, from Eq. (15), the following equation holds:

$$r_{t-1}(j,j) = r_0(j,j)\bigl\{(1-\lambda)\textstyle\sum_{l=1}^{t-1}\lambda^{t-1-l} + \lambda^{t-1}\bigr\} = r_0(j,j). \qquad (16)$$

Thus, $a_t(i,j) = t(1-\lambda)\, r_0(j,j)$ if $r_0(j,j) < 0$; otherwise, $a_t(i,j) = 0$. Hence, from Eq. (14), Lemma 4 holds.

Complexity analysis: Finally, we theoretically assess the time complexity of computing the prunable pairs as follows:

Lemma 5. In the $t$-th iteration, if $\hat{A}_t(1), \hat{A}_t(2), \ldots, \hat{A}_t(N)$ are computed, $\hat{R}_{t+1}(i)$ can be obtained in $O(1)$ time.

Proof. As shown in Definition 1, $\hat{R}_{t+1}(i)$ includes all object pairs $(x_i, x_j)$ with $i \neq j$ and $j \neq \arg\max_l\{s(i,l) + a_t(i,l)\}$. Since $a_t(i,1), a_t(i,2), \ldots, a_t(i,N)$ are obtained while computing $\hat{A}_t(1), \hat{A}_t(2), \ldots, \hat{A}_t(N)$, respectively, $\arg\max_l\{s(i,l) + a_t(i,l)\}$ is already known once those sets are computed. Letting $(x_i, x_l)$ be the object pair maximizing $s(i,l) + a_t(i,l)$, we can obtain $\hat{R}_{t+1}(i)$ in $O(1)$ time by removing $(x_i, x_l)$ from $\{(x_i, x_j) \mid i \neq j\}$.

Lemma 6. In the $t$-th iteration, if $\hat{R}_t(1), \hat{R}_t(2), \ldots, \hat{R}_t(N)$ are computed, $\hat{A}_{t+1}(i)$ can be obtained in $O(1)$ time.

Proof. From Definition 2, we need to find $\max_k\{r_t(k,i)\}$ to obtain $\hat{A}_{t+1}(i)$. Since the responsibilities $r_t(\cdot, i)$ are obtained while computing $\hat{R}_t(1), \hat{R}_t(2), \ldots, \hat{R}_t(N)$, we can track $\max_k\{r_t(k,i)\}$ during that computation. Thus, $\hat{A}_{t+1}(i)$ can be obtained in $O(1)$ time by checking whether $\max_k\{r_t(k,i)\} \leq 0$ holds.
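A minimal sketch of how the prunable-pair bookkeeping of Definitions 1 and 2 could be represented is given below; the representation (one index per row, one boolean per column) and the function names are our assumptions, not the paper's data structures. For clarity the sketch recomputes the row-wise argmax and column-wise maximum explicitly; in Scale AP these quantities are tracked while the messages of iteration $t$ are being written, which is what gives the $O(1)$ bounds of Lemmas 5 and 6.

```python
import numpy as np

def responsibility_keep_partner(S, A):
    """Definition 1: for each x_i, every pair except (x_i, x_i) and
    (x_i, x_eta(i)) is prunable, where eta(i) = argmax_k { a(i, k) + s(i, k) }.
    Returning eta is enough to represent the complement of R_hat_t(i)."""
    return np.argmax(A + S, axis=1)            # eta_{t-1}(i), one index per row

def availability_prunable_columns(R):
    """Definition 2: column j is fully prunable (for every i != j) iff x_j
    received no positive responsibility, i.e. max_k r_{t-1}(k, j) <= 0."""
    return R.max(axis=0) <= 0.0                # one boolean flag per column

# Usage (hypothetical): after the t-th message update, one pass over the fresh
# messages yields the prunable pairs for iteration t+1 (Lemmas 5 and 6).
# eta      = responsibility_keep_partner(S, A)   # rows keep only (i, i) and (i, eta[i])
# prunable = availability_prunable_columns(R)    # columns whose alpha needs no update
```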
### 3.3 Aggregated Message Update

We introduce the aggregated message update to remove unnecessary message-passing processes from AP. If object pairs are prunable, their responsibilities and availabilities are uniquely determined by the corresponding similarities, as shown in Lemmas 2 and 4. Thus, once the prunable pairs are obtained, the messages of the prunable pairs can be computed from the corresponding similarities without performing the iterative message-passing.

To leverage the above properties, Scale AP replaces the responsibility and availability computations (Eqs. (1) and (2)) with the aggregated responsibility and the aggregated availability, respectively. These are given as follows.

Definition 3 (Aggregated responsibility). Given $(x_i, x_j) \in X^2$ and $\hat{R}_t(i)$ in the $t$-th iteration, the aggregated responsibility $\bar{r}_t(i,j)$ for $(x_i, x_j)$ is defined as

$$\bar{r}_t(i,j) = \begin{cases} (1-\lambda)\,\rho_t(i,j) + \lambda\,\bar{r}_{t-1}(i,j) & ((x_i, x_j) \in \bar{R}_t(i)) \\ \bar{r}_t(i,k) + \Delta s_i(j,k) & (\text{otherwise}) \end{cases}, \qquad (17)$$

where $\bar{R}_t(i) = \{(x_i, x_j) \mid (x_i, x_j) \notin \hat{R}_t(i)\} \cup \{(x_i, x_k)\}$ such that $k = \arg\max_l\{s(i,l) \mid (x_i, x_l) \in \hat{R}_t(i)\}$, and $\Delta s_i(j,k) = s(i,j) - s(i,k)$.

Definition 4 (Aggregated availability). Given $(x_i, x_j) \in X^2$ and $\hat{A}_t(j)$ in the $t$-th iteration, the aggregated availability $\bar{a}_t(i,j)$ for $(x_i, x_j)$ is defined as

$$\bar{a}_t(i,j) = \begin{cases} (1-\lambda)\,\alpha_t(i,j) + \lambda\,\bar{a}_{t-1}(i,j) & ((x_i, x_j) \notin \hat{A}_t(j)) \\ t(1-\lambda)\min\{0, r_0(j,j)\} & (\text{otherwise}) \end{cases}. \qquad (18)$$

As shown in Definition 3, Scale AP computes the responsibility by following the iterative message-passing of Eq. (1) only if the object pair $(x_i, x_j) \in \bar{R}_t(i)$; otherwise, it skips the message-passing and reconstructs the responsibility from the corresponding similarities. Likewise, by the aggregated availability in Definition 4, Scale AP directly obtains the availability from $r_0(j,j)$ without message-passing if $(x_i, x_j) \in \hat{A}_t(j)$. To discuss the theoretical aspects of Definitions 3 and 4, we derive the following two properties:

Lemma 7. In the $t$-th iteration, $\bar{r}_t(i,j) = r_t(i,j)$ holds.

Proof. If $(x_i, x_j) \in \bar{R}_t(i)$, $\bar{r}_t(i,j) = r_t(i,j)$ clearly holds. Hence, we prove $r_t(i,j) = \bar{r}_t(i,k) + \Delta s_i(j,k)$, where $k = \arg\max_l\{s(i,l) \mid (x_i, x_l) \in \hat{R}_t(i)\}$, by mathematical induction. First, suppose $t = 1$. From Eq. (1), $r_1(i,j) = (1-\lambda)\rho_1(i,j) + \lambda r_0(i,j)$. As shown in Definition 1, $j, k \neq \arg\max_l\{s(i,l)\}$ since $(x_i, x_j), (x_i, x_k) \in \hat{R}_1(i)$. Thus, using $s(i,j) = s(i,k) + \Delta s_i(j,k)$, we have $\rho_1(i,j) = \rho_1(i,k) + \Delta s_i(j,k)$ and $r_0(i,j) = r_0(i,k) + \Delta s_i(j,k)$. From these equations, Lemma 7 clearly holds for $t = 1$. Next, we prove Lemma 7 for $t = n$ under the induction hypothesis $r_{n-1}(i,j) = r_{n-1}(i,k) + \Delta s_i(j,k)$. From Eq. (1), we have

$$r_n(i,j) = (1-\lambda)\,\rho_n(i,j) + \lambda\, r_{n-1}(i,j). \qquad (19)$$

From Lemma 1, $\rho_n(i,j)$ is uniquely determined by the corresponding similarity $s(i,j) = s(i,k) + \Delta s_i(j,k)$. Thus, from Eq. (3), we can derive $\rho_n(i,j)$ as

$$\rho_n(i,j) = s(i,k) + \Delta s_i(j,k) - \max_{l \neq j}\{a_{n-1}(i,l) + s(i,l)\} = \rho_n(i,k) + \Delta s_i(j,k). \qquad (20)$$

Recall that $r_{n-1}(i,j) = r_{n-1}(i,k) + \Delta s_i(j,k)$. Hence, from Eq. (19), Lemma 7 holds for $t = n$, which completes the proof of Lemma 7.

Lemma 8. In the $t$-th iteration, $\bar{a}_t(i,j) = a_t(i,j)$ holds.

Proof. If $(x_i, x_j) \notin \hat{A}_t(j)$, then $\bar{a}_t(i,j) = a_t(i,j)$ holds from Eq. (2) and Definition 4. Here, we prove Lemma 8 for object pairs included in $\hat{A}_t(j)$. From Lemma 4, if $(x_i, x_j) \in \hat{A}_t(j)$, $a_t(i,j)$ is uniquely determined by $r_0(j,j)$. That is,

$$a_t(i,j) = (1-\lambda)\textstyle\sum_{l=1}^{t} \min\{0, r_0(j,j)\} = \bar{a}_t(i,j), \qquad (21)$$

which completes the proof of Lemma 8.

Lemmas 7 and 8 indicate that Scale AP obtains exactly the same responsibilities and availabilities as AP, even though it skips message-passing through the aggregated message update.
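The following is a correctness-oriented sketch of the aggregated updates of Definitions 3 and 4, written under our own assumptions (dense matrices, explicit fill-in of the prunable entries, function names of our choosing). It still materializes all $N^2$ entries and therefore does not reflect the $O((1+\epsilon)NT)$ cost of the real algorithm; its purpose is only to show where the closed-form values replace the Eq. (1)/(2) updates. `rho_fn` and `alpha_fn` stand for the Eq. (3) and Eq. (4) computations (for instance, the `_rho` and `_alpha` helpers of the earlier sketch).

```python
import numpy as np

def aggregated_responsibility(S, A_prev, R_prev, lam, rho_fn):
    """Definition 3 (sketch): apply the damped update only on the pairs in
    R_bar_t(i) and fill every other prunable pair (x_i, x_j) as
    r_t(i, k) + (s(i, j) - s(i, k)), with k = argmax similarity over R_hat_t(i)."""
    N = S.shape[0]
    eta = np.argmax(A_prev + S, axis=1)            # non-prunable partner (Def. 1)
    rho = rho_fn(S, A_prev)                        # Eq. (3); only R_bar_t(i) is needed
    R = np.empty_like(R_prev)
    for i in range(N):
        prunable = np.ones(N, dtype=bool)
        prunable[[i, eta[i]]] = False              # complement of R_hat_t(i)
        k = int(np.argmax(np.where(prunable, S[i], -np.inf)))  # representative pair
        keep = [i, eta[i], k]                      # R_bar_t(i): at most three pairs
        R[i, keep] = (1.0 - lam) * rho[i, keep] + lam * R_prev[i, keep]
        rest = prunable.copy()
        rest[k] = False
        R[i, rest] = R[i, k] + (S[i, rest] - S[i, k])  # fill-in (Lemma 7)
    return R

def aggregated_availability(R_prev, A_prev, R0_diag, lam, t, alpha_fn):
    """Definition 4 (sketch): columns with no positive responsibility take the
    closed form t*(1-lam)*min(0, r_0(j, j)); all other entries follow Eq. (2).
    R_prev are the responsibilities of iteration t-1, as used by Eq. (4)."""
    A = (1.0 - lam) * alpha_fn(R_prev) + lam * A_prev     # Eq. (2)
    for j in np.flatnonzero(R_prev.max(axis=0) <= 0.0):   # prunable columns (Def. 2)
        keep = A[j, j]                                     # (x_j, x_j) is never prunable
        A[:, j] = t * (1.0 - lam) * min(0.0, R0_diag[j])   # Lemma 4 closed form
        A[j, j] = keep
    return A
```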
### 3.4 Algorithm

Algorithm 1 gives a full description of Scale AP.

Algorithm 1 (Proposed method: Scale AP). Input: a set of similarities $S$. Output: a set of exemplars $E$.

1: $E \leftarrow \emptyset$;
2: obtain $\hat{R}_1(i)$ and $\hat{A}_1(i)$ for $i = 1, \ldots, N$;
3: for $t = 1$ to $T$ do
4: &nbsp;&nbsp;for $i = 1$ to $N$ do
5: &nbsp;&nbsp;&nbsp;&nbsp;$k \leftarrow \arg\max_l\{s(i,l) \mid (x_i, x_l) \in \hat{R}_t(i)\}$;
6: &nbsp;&nbsp;&nbsp;&nbsp;$\bar{R}_t(i) \leftarrow \{(x_i, x_j) \mid (x_i, x_j) \notin \hat{R}_t(i)\} \cup \{(x_i, x_k)\}$;
7: &nbsp;&nbsp;&nbsp;&nbsp;for each $(x_i, x_j) \in \bar{R}_t(i)$ do
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;compute $\bar{r}_t(i,j)$ by Definition 3;
9: &nbsp;&nbsp;for $j = 1$ to $N$ do
10: &nbsp;&nbsp;&nbsp;&nbsp;$\bar{A}_t(j) \leftarrow \{(x_i, x_j) \mid (x_i, x_j) \notin \hat{A}_t(j)\}$;
11: &nbsp;&nbsp;&nbsp;&nbsp;for each $(x_i, x_j) \in \bar{A}_t(j)$ do
12: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;compute $\bar{a}_t(i,j)$ by Definition 4;
13: &nbsp;&nbsp;for $i = 1$ to $N$ do
14: &nbsp;&nbsp;&nbsp;&nbsp;obtain $\hat{R}_{t+1}(i)$ and $\hat{A}_{t+1}(i)$ by Definitions 1 and 2;
15: for $i = 1$ to $N$ do
16: &nbsp;&nbsp;$e(i) \leftarrow \arg\max_j\{\bar{r}_t(i,j) + \bar{a}_t(i,j)\}$;
17: &nbsp;&nbsp;$E \leftarrow E \cup \{e(i)\}$;
18: return $E$;

Scale AP iteratively performs message updates using a given set of similarities $S$ until all messages converge (lines 3-14). In each iteration, Scale AP first computes the responsibilities (lines 4-8). At the beginning of the responsibility computation, it constructs $\bar{R}_t(i)$ for each data object $x_i \in X$ (lines 5-6). From Lemma 5, we can obtain $\hat{R}_t(i)$ in $O(1)$, so Scale AP can also construct $\bar{R}_t(i)$ in $O(1)$. Afterward, Scale AP computes the responsibilities only for the object pairs included in $\bar{R}_t(i)$ by following Definition 3 (lines 7-8). As we proved in Lemma 7, exactly the same responsibilities are obtained for $(x_i, x_j) \notin \bar{R}_t(i)$ in $O(1)$ once the object pairs in $\bar{R}_t(i)$ are computed; hence, in Algorithm 1, Scale AP skips the responsibility computations for $(x_i, x_j) \notin \bar{R}_t(i)$. Then it computes the availabilities (lines 9-12). Similar to the responsibility computations, Scale AP constructs $\bar{A}_t(j)$ from $\hat{A}_t(j)$ in $O(1)$ (line 10). Afterward, it computes the availabilities for the object pairs included in $\bar{A}_t(j)$ using Definition 4. After updating the messages, Scale AP constructs $\hat{R}_{t+1}(i)$ and $\hat{A}_{t+1}(i)$ in $O(1)$ for each object $x_i \in X$ by following Lemmas 5 and 6 (lines 13-14). Scale AP iterates the above message updates until the messages are no longer updated. Finally, it outputs the exemplars $E$ using Eq. (6) (lines 15-18).

Algorithm 1 shows that Scale AP does not require any pre-computations or user-specified parameters, in sharp contrast to existing AP algorithms. Consequently, Scale AP provides a simple solution for users.

Finally, we discuss the theoretical aspects of Scale AP. Let $N$ and $T$ be the numbers of data objects and iterations, respectively. Scale AP has the following properties:

Theorem 1. Scale AP incurs $O((1+\epsilon)NT)$ time, where $\epsilon$ is the average number of exemplar candidates.

Proof. From Lemmas 5 and 6, Scale AP can obtain $\hat{R}_t(i)$ and $\hat{A}_t(i)$ in $O(1)$ in each iteration; hence, Scale AP incurs $O(NT)$ time to find the prunable pairs. In each iteration, Scale AP computes responsibilities for the object pairs included in $\bar{R}_t(i)$ (lines 4-8 in Algorithm 1). From Definition 3, $\bar{R}_t(i)$ contains at most three object pairs, i.e., $(x_i, x_i)$, $(x_i, x_j)$, and $(x_i, x_k)$, where $j = \arg\max_l\{a_{t-1}(i,l) + s(i,l)\}$ and $k = \arg\max_l\{s(i,l)\}$. Thus, Scale AP incurs $O(3NT) = O(NT)$ time for the responsibility computations. Additionally, in the availability computations (lines 9-12), Scale AP computes the availability only if a data object $x_j$ has $\max_i\{r_{t-1}(i,j)\} > 0$. Recall from Eqs. (1) and (5) that $r_{t-1}(i,j)$ can be positive only if $s(i,j)$ is the largest similarity among all possible $x_j \in X$; $r_{t-1}(i,j)$ is the message sent from $x_i$ to $x_j$ representing how strongly $x_i$ wants to choose $x_j$ as its exemplar. That is, only if a data object $x_j$ is a candidate exemplar in the $(t-1)$-th iteration does it have $\max_i\{r_{t-1}(i,j)\} > 0$, in which case Scale AP computes the availabilities for $x_j$. Thus, Scale AP incurs $O(\epsilon NT)$ time for the availability computations. Finally, it obtains the exemplars in $O(N)$ time. Therefore, Scale AP requires $O(NT + NT + \epsilon NT + N) = O((1+\epsilon)NT)$ time to obtain the exemplars.

As shown in Section 1, existing AP algorithms require $O(N^2 T)$ time. Consequently, Theorem 1 indicates that Scale AP is dramatically faster than AP and other state-of-the-art AP algorithms.
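As a rough back-of-the-envelope illustration of this gap, counted in message updates under the bounds rather than measured runtime, and using our own assumed parameters taken from the Youtube-scale setting of Section 4 ($N = 1.2 \times 10^{5}$, $T = 10^{3}$, $\epsilon$ on the order of the 31 ground-truth clusters):

$$N^2 T = (1.2\times10^{5})^2 \times 10^{3} \approx 1.4\times10^{13}, \qquad (1+\epsilon)NT \approx 32 \times 1.2\times10^{5} \times 10^{3} \approx 3.8\times10^{9},$$

i.e., roughly three to four orders of magnitude fewer message updates under these assumptions.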
In practice, $\epsilon$ should be a small constant on real-world datasets (i.e., $\epsilon \ll N$), since real-world datasets generally have a small number of clusters (Tibshirani, Walther, and Hastie 2001; Shiokawa, Amagasa, and Kitagawa 2019). Specifically, as shown in Table 1, the real-world datasets examined in the next section have at most 31 clusters, which is significantly smaller than the dataset sizes. Thus, Scale AP has nearly linear scalability with respect to the number of data objects; in other words, $O((1+\epsilon)NT) \approx O(NT)$ on real-world datasets. In Section 4, we experimentally verify the efficiency of Scale AP.

Theorem 2. Scale AP always outputs the same exemplars as AP (Frey and Dueck 2007).

Proof. As we proved in Lemmas 7 and 8, $\bar{r}_t(i,j) = r_t(i,j)$ and $\bar{a}_t(i,j) = a_t(i,j)$ always hold, even if an object pair $(x_i, x_j)$ is not explicitly computed by Algorithm 1. Hence, in each iteration, Scale AP performs the same message updates as AP. From Eq. (6), the exemplars are uniquely determined by the values of the responsibilities and availabilities. Therefore, Scale AP always outputs the same exemplars as AP.

## 4 Experimental Evaluation

We experimentally evaluated the effectiveness of Scale AP by comparing it with the following AP algorithms:

- AP: The original AP algorithm (Frey and Dueck 2007).
- FSAP: The most popular message-sampling approach (Jia et al. 2008). It samples object pairs by constructing a kNN graph and performs message-passing on the graph until convergence. For the graph construction, we set k = 15.
- Fast AP: The state-of-the-art message-sampling approach (Sun et al. 2017). Fast AP constructs a sparse factor graph by finding micro-cluster structures from the given similarities and performs the message-passing on that graph.
- Graph AP: The message-bounding algorithm, which prunes unpromising object pairs while preserving clustering quality (Fujiwara, Irie, and Kitahara 2011).
- F-AP: The state-of-the-art bounding algorithm (Fujiwara et al. 2015). It improves the efficiency of Graph AP by using dynamic message-pruning methods.
- MLAP: The multi-level coarsening algorithm proposed by (Shang et al. 2012). It recursively coarsens object pairs by using eigenvalue decomposition.

All experiments were conducted on a server with an Intel Xeon CPU (2.60 GHz) and 768 GiB RAM.

| Name    | N       | # of attributes | # of clusters | Data source                    |
|---------|---------|-----------------|---------------|--------------------------------|
| Human   | 10,299  | 561             | 6             | (Anguita et al. 2013)          |
| MoCap   | 78,095  | 38              | 5             | (Gardner et al. 2014)          |
| Youtube | 120,000 | 647             | 31            | (Madani, Georg, and Ross 2013) |

Table 1: Statistics of real-world datasets.

[Figure 1: Running time (sec.) of Scale AP, AP, FSAP, MLAP, Fast AP, Graph AP, and F-AP on Human, MoCap, and Youtube; (a) median preference and (b) minimum preference.]

[Figure 2: Number of updated object pairs per iteration (up to 1,000 iterations) on MoCap for Scale AP, AP, FSAP, Fast AP, Graph AP, and F-AP; (a) median preference and (b) minimum preference.]

Datasets: We used three real-world datasets summarized in Table 1. All datasets are published in the UCI Machine Learning Repository (Dua and Graff 2017). Note that for the Youtube dataset, we chose the 647-dimensional HOG features provided by (Madani, Georg, and Ross 2013).

Experimental settings: The experiments used the negative Euclidean distance as the similarity between data objects. In accordance with (Frey and Dueck 2007), we set the preference to both the median and the minimum of the input similarities, $\lambda = 0.5$, and the maximum number of iterations to T = 1,000.
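For concreteness, a minimal sketch of this experimental setup is shown below, assuming the data objects are given as a NumPy feature matrix; the function name and the exclusion of the diagonal when taking the median or minimum are our assumptions.

```python
import numpy as np

def build_similarities(X, preference="median"):
    """Similarity used in the experiments: s(i, j) = -||x_i - x_j|| (negative
    Euclidean distance), with the preference s(i, i) set to the median or the
    minimum of the off-diagonal similarities."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    S = -np.sqrt(d2)                                  # negative Euclidean distance
    off = S[~np.eye(len(X), dtype=bool)]              # exclude the diagonal
    p = np.median(off) if preference == "median" else off.min()
    np.fill_diagonal(S, p)                            # preference controls #clusters
    return S

# Usage (hypothetical): S = build_similarities(features, preference="median")
# could then be fed to an AP implementation such as the plain_ap sketch above.
```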
The results were averaged over 10 independent runs. We also report standard deviations for accuracy, whereas they are omitted from the runtime evaluations since they were smaller than 0.1 seconds.

### 4.1 Efficiency

Figure 1 shows the running time on the real-world datasets, where DNF indicates that the runtime exceeded 24 hours. Overall, Scale AP achieves the highest performance. On average, Scale AP is 58.7 times faster than the other approaches. Although Scale AP guarantees the same results as AP, it is up to 106.6, 83.4, and 75.9 times faster than AP, Graph AP, and F-AP, respectively. As we proved in Theorem 1, Scale AP has a better time complexity than the competing algorithms, which is responsible for its superior running time.

Scale AP remains efficient even if the preference is set to the minimum. This differs from the bounding algorithms, whose running times degrade significantly under this preference setting. The bounding algorithms prune messages by updating the bounds of the messages; however, if the preference is much smaller than the other similarities, the bounds become loose. This is why the bounding algorithms fail to prune message updates for the minimum preference. By contrast, Scale AP finds the prunable pairs (Section 3.2) regardless of the preference value. Thus, Scale AP can handle massive datasets more efficiently than the other AP algorithms.

### 4.2 Effectiveness

To verify how effectively Scale AP reduces message updates during the message-passing, Figure 2 plots the number of computed pairs in each iteration. The results of MLAP are excluded because MLAP does not employ message-passing. Compared to AP, Scale AP shows a 95.4% reduction in updates during the iterative computations. Moreover, the other algorithms computed many more messages than Scale AP. Although FSAP and Fast AP also reduced the number of message updates, they require a longer running time than Scale AP (Figure 1): to preserve clustering quality, these algorithms incur sampling overheads prior to the message-passing. By contrast, as we proved in Lemmas 5 and 6, Scale AP finds each prunable pair in $O(1)$ without pre-computations or overheads. Thus, Scale AP efficiently reduces the number of message updates.

Next, we evaluated the effectiveness of our approaches by comparing the runtime of Scale AP with variants that exclude either the aggregated responsibility or the aggregated availability. Figure 3 shows the runtime of each algorithm, where W/O-AR and W/O-AA represent Scale AP without the aggregated responsibility and without the aggregated availability, respectively. On average, Scale AP is 24.5 and 10.7 times faster than W/O-AR and W/O-AA, respectively. These results indicate that the aggregated responsibility yields the larger efficiency improvement. As discussed in Section 3.4, $\bar{R}_t(i)$ includes at most three object pairs regardless of the similarities; consequently, the cost of the responsibility computations approaches an almost linear running time.

### 4.3 Scalability

We assessed the scalability of Scale AP. Figure 4 shows the running time on Youtube as a function of the number of data objects. We randomly sampled $10^2$, $10^3$, $10^4$, and $10^5$ objects from Youtube and evaluated the runtime on each sampled dataset. If a method did not finish within 24 hours, its results are excluded from Figure 4.

[Figure 3: Effectiveness of the aggregated updates; running time (sec.) of Scale AP, W/O-AR, and W/O-AA on Human, MoCap, and Youtube; (a) median preference and (b) minimum preference.]
[Figure 4: Scalability; running time (sec.) of Scale AP, AP, FSAP, Fast AP, Graph AP, and F-AP on Youtube samples of 10^2 to 10^5 data objects; (a) median preference and (b) minimum preference; both axes on logarithmic scales.]

[Figure 5: F-measure against the exemplars of AP on Human and MoCap; (a) median preference and (b) minimum preference. Error bars show standard deviation.]

[Figure 6: NMI against the ground-truth clusters on Human, MoCap, and Youtube; (a) median preference and (b) minimum preference. Error bars show standard deviation.]

Scale AP is more scalable than the other AP algorithms. It shows nearly linear scalability with respect to the number of data objects. As described in Section 1, AP and its variants require $O(N^2 T)$ time, whereas Scale AP requires $O((1+\epsilon)NT)$ time, as discussed in Theorem 1. Furthermore, the real-world datasets have very small numbers of clusters (Table 1), which implies that the number of exemplar candidates, $\epsilon$, is a small constant on these datasets. Consequently, the practical cost of Scale AP is $O((1+\epsilon)NT) \approx O(NT)$. Thus, Scale AP achieves nearly linear scalability.

### 4.4 Exactness

One advantage of Scale AP is that it outputs the same exemplars as AP while dynamically excluding unnecessary message updates. To verify this advantage, we used the following metrics to evaluate the AP algorithms:

- F-measure: We measured the F-measure (Manning, Raghavan, and Schütze 2008) of the obtained exemplars against those of AP. The F-measure is 1 if the exemplars exactly match those of AP; otherwise, it approaches 0.
- Normalized mutual information (NMI): We measured NMI (Cilibrasi and Vitányi 2005) to evaluate the clustering accuracy of each algorithm against the ground-truth clusters of each dataset. NMI is 1 if the obtained clusters are identical to the ground truth.

To compare exactness, Figure 5 shows the F-measure. Scale AP detects the same exemplars as AP. By contrast, the sampling algorithms (FSAP, MLAP, and Fast AP) fail to reproduce the exemplars of AP. Scale AP theoretically guarantees the same responsibilities and availabilities as AP by Lemmas 7 and 8 while removing message updates during the message-passing; that is, Scale AP performs message-passing equivalent to that of AP. Hence, Scale AP finds the same exemplars as AP, as stated in Theorem 2.

Figure 6 compares NMI scores to analyze the clustering accuracy. Scale AP has higher NMI scores than FSAP, MLAP, and Fast AP, and it yields the same NMI scores as AP. As described in Section 1.1, FSAP, MLAP, and Fast AP sample exemplars or object pairs to reduce the high computational cost; however, these sampling approaches fail to reproduce the same exemplars as AP (Figure 5). By contrast, as we proved in Theorem 2, Scale AP is theoretically designed to find the same exemplars as AP. Therefore, Scale AP successfully inherits the high clustering quality of AP.

## 5 Conclusion

Scale AP is an efficient AP algorithm that produces the same clustering results as AP with a shorter computation time. Scale AP excludes unnecessary message updates by finding prunable pairs during the message-passing. In our experiments, Scale AP offers improved efficiency on massive datasets compared to other AP algorithms without sacrificing clustering quality.
Consequently, employing Scale AP for massive datasets should enhance the effectiveness of AI-powered applications.

## Acknowledgments

This work was partially supported by JST ACT-I and JST PRESTO JPMJPR2033, Japan. I thank Hiroyuki Kitagawa, Ken-ichi Kawarabayashi, and Tomohiro Matsushita for their help and useful discussions.

## References

Ambrogi, F.; Boracchi, P.; Biganzoli, E.; Raimondi, E.; and Soria, D. 2008. Cancer Profiles by Affinity Propagation. In Proceedings of the 7th IEEE International Conference on Machine Learning and Applications (ICMLA 2008), 650-655.

Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; and Reyes-Ortiz, J. L. 2013. A Public Domain Dataset for Human Activity Recognition Using Smartphones. In Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2013), 437-442. URL https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones.

Belkin, M.; and Niyogi, P. 2003. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15(6): 1373-1396. ISSN 0899-7667.

Cilibrasi, R.; and Vitányi, P. M. 2005. Clustering by Compression. IEEE Transactions on Information Theory 51(4): 1523-1545.

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml. Accessed: September 1st, 2020.

Dueck, D.; and Frey, B. J. 2007. Non-metric Affinity Propagation for Unsupervised Image Categorization. In Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV 2007), 1-8.

Frey, B. J.; and Dueck, D. 2007. Clustering by Passing Messages Between Data Points. Science 315(5814): 972-976.

Fujiwara, Y.; Irie, G.; and Kitahara, T. 2011. Fast Algorithm for Affinity Propagation. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), 2238-2243.

Fujiwara, Y.; Nakatsuji, M.; Shiokawa, H.; Ida, Y.; and Toyoda, M. 2015. Adaptive Message Update for Fast Affinity Propagation. In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2015), 309-318.

Gardner, A.; Kanno, J.; Duncan, C. A.; and Selmic, R. 2014. Measuring Distance between Unordered Sets of Different Sizes. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 137-143. URL https://archive.ics.uci.edu/ml/datasets/MoCap+Hand+Postures.

Jia, Y.; Wang, J.; Zhang, C.; and Hua, X.-S. 2008. Finding Image Exemplars Using Fast Sparse Affinity Propagation. In Proceedings of the 16th ACM International Conference on Multimedia (MM 2008), 639-642.

Kazantseva, A.; and Szpakowicz, S. 2011. Linear Text Segmentation Using Affinity Propagation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), 284-293.

Kschischang, F. R.; Frey, B. J.; and Loeliger, H. 2001. Factor Graphs and the Sum-Product Algorithm. IEEE Transactions on Information Theory 47(2): 498-519.

Madani, O.; Georg, M.; and Ross, D. A. 2013. On Using Nearly-Independent Feature Families for High Precision and Confidence. Machine Learning 92: 457-477. URL https://archive.ics.uci.edu/ml/datasets/YouTube+Multiview+Video+Games+Dataset.

Manning, C. D.; Raghavan, P.; and Schütze, H. 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press. ISBN 0521865719, 9780521865715.
Matsushita, T.; Shiokawa, H.; and Kitagawa, H. 2018. CAP: Cell-based Algorithm for Efficient Affinity Propagation. In Proceedings of the 20th International Conference on Information Integration and Web-based Applications & Services (iiWAS 2018), 156-163.

Roweis, S. T.; and Saul, L. K. 2000. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290(5500): 2323-2326.

Shang, F.; Jiao, L.; Shi, J.; Wang, F.; and Gong, M. 2012. Fast Affinity Propagation Clustering: A Multilevel Approach. Pattern Recognition 45(1): 474-486. ISSN 0031-3203.

Shiokawa, H.; Amagasa, T.; and Kitagawa, H. 2019. Scaling Fine-grained Modularity Clustering for Massive Graphs. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019), 4597-4604.

Shiokawa, H.; Fujiwara, Y.; and Onizuka, M. 2013. Fast Algorithm for Modularity-based Graph Clustering. In Proceedings of AAAI 2013, 1170-1176.

Shiokawa, H.; Fujiwara, Y.; and Onizuka, M. 2015. SCAN++: Efficient Algorithm for Finding Clusters, Hubs and Outliers on Large-scale Graphs. Proceedings of the Very Large Data Bases Endowment (PVLDB) 8(11): 1178-1189.

Sun, L.; and Guo, C. 2014. Incremental Affinity Propagation Clustering Based on Message Passing. IEEE Transactions on Knowledge and Data Engineering 26(11): 2731-2744.

Sun, L.; Guo, C.; Liu, C.; and Xiong, H. 2017. Fast Affinity Propagation Clustering Based on Incomplete Similarity Matrix. Knowledge and Information Systems 51: 941-963.

Tibshirani, R.; Walther, G.; and Hastie, T. 2001. Estimating the Number of Clusters in a Data Set via the Gap Statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2): 411-423.

Yedidia, J. S.; Freeman, W. T.; and Weiss, Y. 2005. Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms. IEEE Transactions on Information Theory 51(7): 2282-2312.