# Fast Distributed Submodular Cover: Public-Private Data Summarization

Baharan Mirzasoleiman (ETH Zurich), Morteza Zadimoghaddam (Google Research), Amin Karbasi (Yale University)

Abstract

In this paper, we introduce the public-private framework of data summarization, motivated by privacy concerns in personalized recommender systems and online social services. Such systems usually have access to massive data generated by a large pool of users. A major fraction of the data is public and is visible to (and can be used for) all users. However, each user can also contribute some private data that should not be shared with other users, to ensure her privacy. The goal is to provide a succinct summary of the massive dataset, ideally as small as possible, from which customized summaries can be built for each user, i.e., it can contain elements from the public data (for diversity) and users' private data (for personalization). To formalize the above challenge, we assume that the scoring function according to which a user evaluates the utility of her summary satisfies submodularity, a widely used notion in data summarization applications. Thus, we model the data summarization targeted to each user as an instance of a submodular cover problem. However, when the data is massive it is infeasible to use the centralized greedy algorithm to find a customized summary even for a single user. Moreover, for a large pool of users, it is too time consuming to find such summaries separately. Instead, we develop a fast distributed algorithm for submodular cover, FASTCOVER, that provides a succinct summary in one shot and for all users. We show that the solution provided by FASTCOVER is competitive with that of the centralized algorithm while using a number of rounds that is exponentially smaller than state-of-the-art results. Moreover, we have implemented FASTCOVER with Spark to demonstrate its practical performance on a number of concrete applications, including personalized location recommendation, personalized movie recommendation, and dominating set, on tens of millions of data points and varying numbers of users.

1 Introduction

Data summarization, a central challenge in machine learning, is the task of finding a representative subset of manageable size out of a large dataset. It has found numerous applications, including image summarization [1], recommender systems [2], scene summarization [3], clustering [4, 5], active set selection in non-parametric learning [6], and document and corpus summarization [7, 8], to name a few. A general recipe to obtain a faithful summary is to define a utility/scoring function that measures coverage and diversity of the selected subset [1]. In many applications, the utility functions used for summarization exhibit submodularity, a natural diminishing returns property. In words, submodularity implies that the added value of any given element from the dataset decreases as we include more data points in the summary. Thus, the data summarization problem can be naturally reduced to that of a submodular cover problem, where the objective is to find the smallest subset whose utility achieves a desired fraction of the utility provided by the entire dataset. It is known that the classical greedy algorithm yields a logarithmic factor approximation to the optimum summary [9].
It starts with an empty set, and at each iteration adds the element with the maximum added value to the summary selected so far. It is also known that improving upon the logarithmic approximation ratio is NP-hard [10]. Even though the greedy algorithm produces a near-optimal solution, it is highly impractical for massive datasets, as sequentially selecting elements on a single machine is heavily constrained in terms of speed and memory. Hence, in order to solve the submodular cover problem at scale, we need to make use of MapReduce-style parallel computation models [11, 12]. The greedy algorithm, due to its sequential nature, is poorly suited for parallelization.

In this paper, we propose a fast distributed algorithm, FASTCOVER, that enables us to solve the more general problem of covering multiple submodular functions in one run of the algorithm. It relies on three important ingredients: 1) a reduction from multiple submodular cover problems into a single instance of a submodular cover problem [13, 14], 2) a randomized filtration mechanism to select elements with high utility, and 3) a set of carefully chosen threshold functions used for the filtration mechanism. FASTCOVER also provides a natural trade-off between the number of MapReduce rounds and the size of the returned summary. It effectively lets us choose between compact summaries (i.e., smaller solution size) while running more MapReduce rounds, or larger summaries while running fewer MapReduce rounds.

This setting is motivated by privacy concerns in many modern applications, including personalized recommender systems, online social services, and the data collected by apps on mobile platforms [15, 16]. In such applications, users have some control over their own data and can mark some part of it private (in a slightly more general case, we can assume that users can make part of their data private to specific groups and public to others). As a result, the dataset consists of public data, shared among all users, and disjoint sets of private data accessible to the owners only. We call this more general framework for data summarization public-private data summarization, where the private data of one user should not be included in another user's summary (see also [15]). This model naturally reduces to solving one instance of the submodular cover problem for each user, as each user's view of the dataset and the specific utility function specifying her preferences differ across users. When the number of users is small, one can solve the public-private data summarization separately for each user, using the greedy algorithm (for datasets of small size) or the recently proposed distributed algorithm DISCOVER [12] (for datasets of moderate size). However, when there are many users or the dataset is massive, none of the prior work truly scales.

We report the performance of FASTCOVER using Spark on concrete applications of public-private data summarization, including personalized movie recommendation on a dataset containing 2 million ratings by more than 100K users for 1000 movies, personalized location recommendation based on 20 users and their collected GPS locations, and finding the dominating set on a social network containing more than 65 million nodes and 1.8 billion edges. For small to moderately sized datasets, we compare our results with previous work, namely, the classical greedy algorithm and DISCOVER [12].
For truly large-scale experiments, where the data is big and/or there are many users involved (e.g., movie recommendation), we cannot run DISCOVER, as its number of MapReduce rounds, together with the associated communication cost, is prohibitive. In our experiments, we consistently observe that FASTCOVER provides solutions of size similar to the greedy algorithm (and very often even smaller) with a number of rounds that is orders of magnitude smaller than DISCOVER. This makes FASTCOVER the first distributed algorithm that solves the public-private data summarization problem fast and at scale.

2 Problem Statement: Public-Private Data Summarization

In this section, we formally define the public-private model of data summarization.¹ Here, we consider a potentially large dataset (sometimes called the universe of items) $V$ of size $n$ and a set of users $U$. The dataset consists of public data $V_P$ and disjoint subsets of private data $V_u$ for each user $u \in U$. The public-private aspect of data summarization is realized in two dimensions.

First, each user $u \in U$ has her own utility function $f_u(S)$ according to which she scores the value of a subset $S \subseteq V$. Throughout this paper we assume that $f_u(\cdot)$ is integer-valued,² non-negative, and monotone submodular. More formally, submodularity means that
$$f_u(A \cup \{e\}) - f_u(A) \ge f_u(B \cup \{e\}) - f_u(B) \quad \forall A \subseteq B \subseteq V \text{ and } e \in V \setminus B.$$
Monotonicity implies that for any $A \subseteq V$ and $e \in V$ we have $f_u(e|A) := f_u(A \cup \{e\}) - f_u(A) \ge 0$. The term $f_u(e|A)$ is called the marginal gain (or added value) of $e$ to the set $A$. Whenever it is clear from the context we drop $f_u$ from $f_u(e|A)$. Without loss of generality, we normalize all users' functions so that they achieve the same maximum value, i.e., $f_u(V) = f_v(V)$ for all $u, v \in U$.

Second, and in contrast to public data that is shared among all users, the private data of a user cannot be shared with others. Thus, a user $u \in U$ can only evaluate the public and her own private part of a summary $S$, i.e., $S \cap (V_P \cup V_u)$. In other words, if the summary $S$ contains private data of a user $v \neq u$, the user $u$ cannot access or evaluate $v$'s private part of $S$, i.e., $S \cap V_v$. In public-private data summarization, we would like to find the smallest subset $S \subseteq V$ such that all users reach a desired utility $Q \le f_u(V) = f_u(V_P \cup V_u)$ simultaneously, i.e.,
$$\mathrm{OPT} = \arg\min_{S \subseteq V} |S|, \quad \text{such that } f_u(S \cap (V_P \cup V_u)) \ge Q \quad \forall u \in U. \qquad (1)$$
A naive way to solve the above problem is to find a separate summary for each user and then return the union of all summaries as $S$. A more clever way is to realize that problem (1) is in fact equivalent to the following problem [13, 14]:
$$\mathrm{OPT} = \arg\min_{S \subseteq V} |S|, \quad \text{such that } f(S) := \sum_{u \in U} \min\{f_u(S \cap (V_P \cup V_u)), Q\} \ge Q\,|U|. \qquad (2)$$
Note that the surrogate function $f(\cdot)$ is also monotone submodular, as a thresholded submodular function remains submodular. Thus, finding a set $S$ that provides each user with utility $Q$ is equivalent to finding a set $S$ with $f(S) \ge L := Q\,|U|$. This reduction lets us focus on developing a fast distributed solution for solving a single submodular cover problem.

¹ All the results are applicable to submodular cover as a special case where there is only public data.
² For the submodular cover problem it is a standard assumption that the function is integer-valued for the theoretical results to hold. In applications where this assumption is not satisfied, we can either appropriately discretize and rescale the function, or, instead of achieving the desired utility $Q$, try to reach $(1-\delta)Q$ for some $0 < \delta < 1$. In the latter case, we can simply replace $Q$ with $Q/\delta$ in the theorems to get the correct bounds.
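To make the reduction from problem (1) to problem (2) concrete, the following minimal Python sketch builds the truncated-sum surrogate and runs a plain centralized greedy cover on it. This only illustrates the reduction, not the distributed FASTCOVER algorithm; the toy utilities, the `accessible` map, and the helper names are ours, chosen for illustration.

```python
def surrogate(S, user_funcs, accessible, Q):
    """Truncated-sum surrogate from Eq. (2): f(S) = sum_u min(f_u(S ∩ (V_P ∪ V_u)), Q)."""
    return sum(min(f_u(S & accessible[u]), Q) for u, f_u in user_funcs.items())

def greedy_cover(V, user_funcs, accessible, Q):
    """Centralized greedy: keep adding the element with the largest marginal gain
    of the surrogate until f(S) >= L = Q * |U|, i.e. every user reaches utility Q."""
    L, S = Q * len(user_funcs), set()
    while surrogate(S, user_funcs, accessible, Q) < L:
        base = surrogate(S, user_funcs, accessible, Q)
        gains = {x: surrogate(S | {x}, user_funcs, accessible, Q) - base for x in V - S}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:   # no element helps any more; Q is unreachable
            break
        S.add(best)
    return S

# Toy instance: public items {0, 1, 2}; item 3 is private to user 'a', item 4 to user 'b'.
V, V_P = {0, 1, 2, 3, 4}, {0, 1, 2}
accessible = {'a': V_P | {3}, 'b': V_P | {4}}
user_funcs = {'a': lambda S: len(S),                 # simple coverage-style utilities
              'b': lambda S: 2 * len(S & {1, 4})}
print(greedy_cover(V, user_funcs, accessible, Q=2))  # a small set covering both users
```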
Our method FASTCOVER is explained in detail in Section 4.

Related Work: When the data is small, we can use the centralized greedy algorithm to solve problem (2) (and equivalently problem (1)). The greedy algorithm sequentially picks elements and returns a solution of size at most $(1 + \ln M)|\mathrm{OPT}|$ (or, in terms of the integer target value, roughly $\ln(L)|\mathrm{OPT}|$), where $M = \max_{e \in V} f(e)$. As elaborated earlier, when the data is large, one cannot run this greedy algorithm as it requires centralized access to the full dataset. This is why scalable solutions for the submodular cover problem have recently gained a lot of interest. In particular, for the set cover problem (a special case of the submodular cover problem) there have been efficient MapReduce-based implementations proposed in the literature [17, 18, 19]. There have also been recent studies on the streaming set cover problem [20]. Perhaps the closest work to our efforts is [12], where the authors proposed a distributed algorithm for the submodular cover problem called DISCOVER. Their method relies on the reduction of the submodular cover problem to multiple instances of the distributed constrained submodular maximization problem [6, 21]. For any fixed $0 < \alpha \le 1$, DISCOVER returns a solution of size $2\alpha k + 72\log(L)|\mathrm{OPT}|\sqrt{\min(m, \alpha|\mathrm{OPT}|)}$ in $\log(\alpha|\mathrm{OPT}|) + 36\sqrt{\min(m, \alpha|\mathrm{OPT}|)}\log(L)/\alpha + 1$ rounds, where $m$ denotes the number of machines. Even though DISCOVER scales better than the greedy algorithm, the solution it returns is usually much larger. Moreover, the dependency of the number of MapReduce rounds on $\sqrt{\min(m, \alpha|\mathrm{OPT}|)}$ is far from desirable. Note that as we increase the number of machines, the number of rounds may increase (rather than decrease). Instead, in this paper we propose a fast distributed algorithm, FASTCOVER, that truly scales to massive data and produces a solution that is competitive with that of the greedy algorithm. More specifically, for any $\epsilon > 0$, FASTCOVER returns a solution of size at most $\ln(L)|\mathrm{OPT}|/(1-\epsilon)$ with at most $\log_{3/2}(n/(m|\mathrm{OPT}|))\log(M)/\epsilon + \log(L)$ rounds, where $M = \max_{e \in V} f(e)$. Thus, in terms of speed, FASTCOVER improves exponentially upon DISCOVER while providing a smaller solution. Moreover, in our work, the number of rounds decreases as the number of machines increases, in sharp contrast to [12].

3 Applications of Public-Private Data Summarization

In this section, we discuss three concrete applications where parts of the data are private and the remaining parts are public. All objective functions are non-negative, monotone, and submodular.

3.1 Personalized Movie Recommendation

Consider a movie recommender system that allows users to anonymously and privately rate movies. The system can use this information to recognize users' preferences using existing matrix completion techniques [22]. A good set of recommended movies should meet two criteria: 1) be correlated with the user's preferences, and 2) be diverse and contain globally popular movies. To this end, we define the following sum-coverage function to score the quality of the selected movies $S$ for a user $u$:
$$f_u(S) = \alpha_u \sum_{i \in S,\, j \in V_u} s_{i,j} + (1 - \alpha_u) \sum_{i \in S,\, j \in V_P \setminus S} s_{i,j}, \qquad (3)$$
where $V_u$ is the list of movies highly ranked by user $u$ (i.e., private information), $V_P$ is the set of all movies in the database,³ and $s_{i,j}$ measures the similarity between movies $i$ and $j$. The similarity can be easily calculated using the inner product between the corresponding feature vectors of any two movies $i$ and $j$. The term $\sum_{i \in S,\, j \in V_u} s_{i,j}$ measures the similarity between the recommended set $S$ and the user's preferences.
The second term $\sum_{i \in S,\, j \in V_P \setminus S} s_{i,j}$ encourages diversity. Finally, the parameter $0 \le \alpha_u \le 1$ gives the user the freedom to specify how much she cares about personalization versus diversity: $\alpha_u = 1$ indicates that all the recommended movies should be very similar to the movies she highly ranked, and $\alpha_u = 0$ means that she prefers to receive a set of globally popular movies among all users, irrespective of her own private ratings. Note that in this application, the universe of items (i.e., movies) is public. What is private is the users' ratings, through which we identify the set $V_u$ of movies highly ranked by each user. The effect of private data is expressed in the users' utility functions. The objective is to find the smallest set $S$ of movies from $V$, from which we can build recommendations for all users in a way that all reach a certain utility.

³ Two private lists may point to similar movies, but for now we treat the items on each list as unique entities.

3.2 Personalized Location Recommendation

Nowadays, many mobile apps collect geolocation data of their users. To comply with privacy concerns, some let their customers have control over their data, i.e., users can mark some part of their data private and disallow the app to share it with other users. In personalized location recommendation, a user is interested in identifying a set of locations that are correlated with the places she visited and with popular places everyone else visited. Note that, as nearby locations are likely to be similar, it is very typical to define a kernel matrix $K$ capturing the similarity between data points. A commonly used kernel in practice is the squared exponential kernel $K(e_i, e_j) = \exp(-\|e_i - e_j\|_2^2 / h^2)$. To define the information gain of a set of locations indexed by $S$, it is natural to use $f(S) = \log\det(I + \sigma K_{S,S})$. The information gain objective captures diversity and is used in many ML applications, e.g., active set selection for nonparametric learning [6], sensor placement [13], and determinantal point processes, among many others. Then, personalized location recommendation can be modeled by
$$f_u(S) = \alpha_u f(S \cap V_u) + (1 - \alpha_u) f(S \cap V_P), \qquad (4)$$
where $V_u$ is the set of locations that user $u$ does not want to share with others and $V_P$ is the collection of all publicly disclosed locations. Again, the parameter $\alpha_u$ lets the user indicate to what extent she is willing to receive recommendations based on her private information. The objective is to find the smallest set of locations to recommend to all users such that each reaches a desired threshold. Note that private data is usually small and private functions are fast to compute. Thus, the function evaluation is mainly affected by the amount of public data. Moreover, for many objectives, e.g., information gain, each machine can evaluate $f_u(S)$ by using its own portion of the private data.

3.3 Dominating Set in Social Networks

Probably the easiest way to define the influence of a subset of users on other members of a social network is via the dominating set problem. Here, we assume that there is a graph $G = (V, E)$ where $V$ and $E$ indicate the set of nodes and edges, respectively. Let $N(S)$ denote the neighbors of $S$. Then, we define the coverage size of $S$ by $f(S) = |N(S) \cup S|$. The goal is to find the smallest subset $S$ such that the coverage size is at least some fraction of $|V|$. This is a trivial instance of public-private data summarization, as all the data is public and there is a single utility function. We use the dominating set problem to run a large-scale application for which DISCOVER terminates in a reasonable amount of time, so that its performance can be compared to our algorithm FASTCOVER.
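To make the objectives in this section concrete, here is a short NumPy sketch of the log-det information gain used in Eq. (4) and of the dominating-set coverage function of Section 3.3. The helper names, the toy kernel, and the tiny graph are ours, for illustration only.

```python
import numpy as np

def information_gain(K, S, sigma=1.0):
    """f(S) = log det(I + sigma * K_{S,S}) for a kernel matrix K and an index set S."""
    idx = sorted(S)
    if not idx:
        return 0.0
    K_SS = K[np.ix_(idx, idx)]
    _, logdet = np.linalg.slogdet(np.eye(len(idx)) + sigma * K_SS)
    return logdet

def personalized_location_utility(K, S, V_u, V_P, alpha_u, sigma=1.0):
    """Eq. (4): f_u(S) = alpha_u * f(S ∩ V_u) + (1 - alpha_u) * f(S ∩ V_P)."""
    return (alpha_u * information_gain(K, set(S) & set(V_u), sigma)
            + (1 - alpha_u) * information_gain(K, set(S) & set(V_P), sigma))

def dominating_coverage(adj, S):
    """Section 3.3: f(S) = |N(S) ∪ S| for a graph given as an adjacency list."""
    covered = set(S)
    for v in S:
        covered.update(adj[v])
    return len(covered)

# Toy usage: a squared-exponential kernel over 6 random 2-D points and a 5-node graph.
rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 2))
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 1.5 ** 2)
print(personalized_location_utility(K, S={0, 2, 5}, V_u={5}, V_P={0, 1, 2, 3}, alpha_u=0.5))
print(dominating_coverage({0: [1, 2], 1: [0], 2: [0, 3], 3: [2], 4: []}, S={0, 4}))
```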
4 FASTCOVER for Fast Distributed Submodular Cover

In this section, we explain in detail our fast distributed algorithm FASTCOVER, shown in Alg. 1. It receives a universe of items $V$ and an integer-valued, non-negative, monotone submodular function $f: 2^V \to \mathbb{R}_+$. The objective is to find the smallest set $S$ that achieves a value $L \le f(V)$. FASTCOVER starts with $S = \emptyset$, and keeps adding those items $x \in V$ to $S$ whose marginal values $f(x|S)$ are at least some threshold $\tau$. In the beginning, $\tau$ is set to a conservative initial value $M := \max_{x \in V} f(x)$. When there are no more items with a marginal value of at least $\tau$, FASTCOVER lowers $\tau$ by a factor of $(1 - \epsilon)$ and iterates anew through the elements. Thus, $\tau$ ranges over $\tau_0 = M$, $\tau_1 = (1-\epsilon)M$, $\ldots$, $\tau_\ell = (1-\epsilon)^\ell M$, $\ldots$. FASTCOVER terminates when $f(S) \ge L$. The parameter $\epsilon$ determines the size of the final solution. When $\epsilon$ is small, we expect to find better solutions (i.e., smaller in size) while having to spend more rounds.

One of the key ideas behind FASTCOVER is that finding elements with marginal values at least $\tau = \tau_\ell$ can be done in a distributed manner. Effectively, FASTCOVER partitions $V$ into $m$ sets $T_1, \ldots, T_m$, one for each cluster node/machine. A naive distributed implementation is the following. For a given set $S$ (whose elements are communicated to all machines), each machine $i$ finds all of its items $x \in T_i$ whose marginal values $f(x|S)$ are at least $\tau$ and sends them all to a central machine (note that $S$ is fixed on each machine). Then, this central machine sequentially augments $S$ with elements whose marginal values are at least $\tau$ (here $S$ changes with each insertion). The new elements of $S$ are communicated back to all machines and they run the same procedure, this time with a smaller threshold $\tau(1-\epsilon)$. The main problem with this approach is that there might be many items on each machine that satisfy the chosen threshold $\tau$ at each round (i.e., many more than $|\mathrm{OPT}|$). A flood of such items from $m$ machines overwhelms the central machine. Instead, what FASTCOVER does is to enforce each machine to randomly pick only $k$ items from its potentially big set of candidates (i.e., the THRESHOLDSAMPLE algorithm shown in Alg. 2). The value $k$ is carefully chosen (line 7). This way, the number of items the central machine processes is never more than $O(m|\mathrm{OPT}|)$.

Algorithm 1: FASTCOVER
1   Input: $V$, $\epsilon$, $L$, and $m$
2   Output: $S \subseteq V$ where $f(S) \ge L$
3   Find a balanced partition $\{T_i\}_{i=1}^m$ of $V$;
4   $S \leftarrow \emptyset$;
5   $\tau \leftarrow \max_{x \in V} f(x)$;
6   while $\tau \ge 1$ do
7       $k \leftarrow \lceil (L - f(S))/\tau \rceil$;
8       forall $1 \le i \le m$ do
9           $\langle S_i, \mathrm{Full}_i \rangle \leftarrow$ THRESHOLDSAMPLE$(i, \tau, k, S)$;
10      forall $x \in \cup_{i=1}^m S_i$ do
11          if $f(S \cup \{x\}) - f(S) \ge \tau$ then
12              $S \leftarrow S \cup \{x\}$;
13              if $f(S) \ge L$ then Break;
14      if $\forall i: \mathrm{Full}_i = \mathrm{False}$ then
15          if $\tau > 1$ then $\tau \leftarrow \max\{1, (1-\epsilon)\tau\}$;
16          else Break;
17  Return $S$;

Algorithm 2: THRESHOLDSAMPLE
1   Input: index $i$, $\tau$, $k$, and $S$
2   Output: $S_i \subseteq T_i$ with $|S_i| \le k$
3   $S_i \leftarrow \emptyset$;
4   forall $x \in T_i$ do
5       if $f(S \cup \{x\}) - f(S) \ge \tau$ then
6           $S_i \leftarrow S_i \cup \{x\}$;
7   if $|S_i| \le k$ then
8       Return $\langle S_i, \mathrm{False} \rangle$;
9   else
10      $S_i \leftarrow$ $k$ random items of $S_i$;
11      Return $\langle S_i, \mathrm{True} \rangle$;

Theorem 4.1. FASTCOVER terminates within at most $\log_{3/2}(n/(|\mathrm{OPT}|\,m))\,(1 + \log(M)/\epsilon) + \log_2(L)$ rounds (with high probability) and returns a solution of size at most $|\mathrm{OPT}|\ln(L)/(1-\epsilon)$.

Although FASTCOVER is distributed and, unlike centralized algorithms, does not enjoy the benefit of accessing all items together, its solution size is truly competitive with the greedy algorithm and is only a factor of $1/(1-\epsilon)$ away. Moreover, its number of rounds is logarithmic in $n$ and $L$.
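For intuition, the following is a minimal single-process Python simulation of the FASTCOVER loop of Algorithms 1 and 2. It mirrors the pseudocode above but evaluates the per-machine THRESHOLDSAMPLE steps sequentially instead of on Spark, and the set-coverage toy objective at the end is our own illustrative example, not one of the paper's benchmarks.

```python
import math
import random

def threshold_sample(T_i, S, f, tau, k):
    """One 'machine' (Algorithm 2): collect items of T_i whose marginal gain to S
    is at least tau, and return at most k of them together with a Full flag."""
    S_i = [x for x in T_i if f(S | {x}) - f(S) >= tau]
    if len(S_i) <= k:
        return S_i, False                 # Full_i = False: all candidates were sent
    return random.sample(S_i, k), True    # Full_i = True: only k random candidates sent

def fast_cover(V, f, L, m, eps):
    """Single-process simulation of Algorithm 1; the real FASTCOVER runs the
    threshold_sample calls on m machines in parallel."""
    V = list(V)
    random.shuffle(V)
    parts = [V[i::m] for i in range(m)]            # balanced partition {T_i}
    S = set()
    tau = max(f({x}) for x in V)                   # initial threshold M = max_x f({x})
    while tau >= 1:
        k = max(1, math.ceil((L - f(S)) / tau))
        results = [threshold_sample(T_i, S, f, tau, k) for T_i in parts]
        for S_i, _ in results:                     # central sequential pass (lines 10-13)
            for x in S_i:
                if f(S | {x}) - f(S) >= tau:
                    S.add(x)
                    if f(S) >= L:
                        return S
        if all(not full for _, full in results):   # epoch ends: lower the threshold
            if tau > 1:
                tau = max(1, (1 - eps) * tau)
            else:
                break
    return S

# Toy run: an 80% cover of a simple set-coverage objective over 30 ground elements.
sets = {i: set(range(i, min(i + 5, 30))) for i in range(30)}
f = lambda S: len(set().union(*(sets[i] for i in S))) if S else 0
print(sorted(fast_cover(range(30), f, L=24, m=4, eps=0.2)))
```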
This is in sharp contrast with the previously best known algorithm, DISCOVER [12], where the number of rounds scales with $\sqrt{\min(m, |\mathrm{OPT}|)}$.⁴ Thus, FASTCOVER not only improves exponentially over DISCOVER in terms of speed, but its number of rounds also decreases as the number of available machines $m$ increases. Even though FASTCOVER is a simple distributed algorithm, its performance analysis is technical and is deferred to the supplementary materials. Below, we provide the main ideas behind the proof of Theorem 4.1.

⁴ Note that $\sqrt{\min(m, |\mathrm{OPT}|)}$ can be as large as $n^{1/6}$ when $|\mathrm{OPT}| = n^{1/3}$ and the memory limit of each machine is $n^{2/3}$, which results in $m \ge n^{1/3}$.

Proof sketch. We say that an item has a high value if its marginal value to $S$ is at least $\tau$. We define an epoch to be the rounds during which $\tau$ does not change. In the last round of each epoch, all high value items are sent to the central machine (i.e., the set $\cup_{i=1}^m S_i$), because $\mathrm{Full}_i$ is False for all machines. We also add every high value item to $S$ in lines 11-12. So, at the end of each epoch, the marginal values of all items to $S$ are less than $\tau$. Since we reduce $\tau$ by a factor of $(1-\epsilon)$, we can always say that $\tau \ge (1-\epsilon) \max_{x \in V} f(x|S)$, which means we are only adding items that have almost the highest marginal values. By the classic analysis of the greedy algorithm for submodular maximization, we can conclude that every item we add has an added value of at least $(1-\epsilon)(L - f(S))/|\mathrm{OPT}|$. Therefore, after adding $|\mathrm{OPT}|\ln(L)/(1-\epsilon)$ items, $f(S)$ becomes at least $L$.

To upper bound the number of rounds, we divide the rounds into two groups. In a good round, the algorithm adds at least $k/2$ items to $S$. The rest are bad rounds. In a good round, we add $k/2 \ge (L - f(S))/(2\tau)$ items, and each of them increases the value of $S$ by at least $\tau$. Therefore, in a good round, we see at least an $(L - f(S))/2$ increase in the value of $S$. In other words, the gap $L - f(S)$ is reduced by a factor of at least 2 in each good round. Since $f$ only takes integer values, once $L - f(S)$ becomes less than 1, we know that $f(S) \ge L$. Therefore, there cannot be more than $\log_2 L$ good rounds. Every time we update $\tau$ (at the start of an epoch), we decrease it by a factor of $1-\epsilon$ (except maybe in the last round, for which $\tau = 1$). Therefore, there are at most $1 + \log_{\frac{1}{1-\epsilon}}(M) = 1 + \frac{\log(M)}{\log(1/(1-\epsilon))} \le 1 + \frac{\log(M)}{\epsilon}$ epochs. In a bad round, a machine with more than $k$ high value items sends $k$ of those to the central machine, and fewer than $k/2$ of them are selected. In other words, the addition of these items to $S$ in this bad round caused more than half of the high value items of each such machine to become low value (marginal value less than $\tau$). Since there are $n/m$ items on each machine, and $\mathrm{Full}_i$ becomes False once there are at most $k$ high value items on the machine, we conclude that in expectation there should not be more than $\log_2(n/(km))$ bad rounds in each epoch. Summing the upper bounds yields the bound on the total number of rounds. A finer analysis leads to the high probability claim.

5 Experiments

In this section, we evaluate the performance of FASTCOVER on the three applications described in Section 3: personalized movie recommendation, personalized location recommendation, and dominating set on social networks. To validate our theoretical results and demonstrate the effectiveness of FASTCOVER, we compare the performance of our algorithm against DISCOVER and the centralized greedy algorithm (when possible). Our experimental infrastructure was a cluster of 16 quad-core machines with 20GB of memory each, running Spark.
The cluster was configured with one master node responsible for resource management, and the remaining 15 machines working as executors. We set the number of reducers to $m = 60$. To run FASTCOVER on Spark, we first distributed the data uniformly at random to the machines, and performed a map/reduce task to find the highest marginal gain $\tau = M$. Each machine then carries out a sequence of map/reduce tasks, where each map/reduce stage filters elements of the whole dataset against a specific threshold $\tau$. We then tune the parameter $\tau$, communicate the results back to the machines, and perform another round of map/reduce computation. We continue performing map/reduce tasks until we reach the desired value $L$.

5.1 Personalized Location Recommendation with Spark

Our location recommendation experiment involves applying FASTCOVER to the information gain utility function described in Eq. (4). Our dataset consists of 3,056 GPS measurements from 20 users, in the form of (latitude, longitude, altitude), collected during bike tours around Zurich [23]. The size of each path is between 50 and 500 GPS coordinates. For each pair of points $i$ and $j$, we used the corresponding GPS coordinates to calculate their distance in meters $d(i, j)$ and then formed a squared exponential kernel $K_{i,j} = \exp(-d(i, j)^2 / h^2)$ with $h = 1500$. For each user, we marked 20% of her data as private, chosen as consecutive data points from each path taken by the biker. The parameter $\alpha_u$ is set randomly for each user $u$. Figures 1a, 1b, and 1c compare the performance of FASTCOVER to the benchmarks for building a recommendation set that covers 60%, 80%, and 90% of the maximum utility of each user. We ran DISCOVER with different values of the parameter $\alpha$, which trades off the size of the solution against the number of rounds of the algorithm. It can be seen that by avoiding the doubling steps of DISCOVER, our algorithm FASTCOVER is able to return a significantly smaller solution than that of DISCOVER in considerably fewer rounds. Interestingly, for small values of $\epsilon$, FASTCOVER returns a solution that is even smaller than that of the centralized greedy algorithm.

5.2 Personalized Movie Recommendation with Spark

Our personalized public-private recommendation experiment involves FASTCOVER applied to a set of 1,313 movies and 20,000,263 user ratings from 138,493 users of the MovieLens database [24]. All selected users rated at least 20 movies. Each movie is associated with a 25-dimensional feature vector calculated from users' ratings. We use the inner product of the non-normalized feature vectors to compute the similarity $s_{i,j}$ between movies $i$ and $j$ [25]. Our final objective function consists of 138,493 coverage functions (one per user) and a global sum-coverage function defined on the whole pool of movies (see Eq. (3)). Each function is normalized by its maximum value to make sure that all functions have the same scale. Figures 1d, 1e, and 1f show the ratio of the size of the solutions obtained by FASTCOVER to that of the greedy algorithm. The figures demonstrate the results for 10%, 20%, and 30% covers for all 138,493 users' utility functions. The parameter $\alpha_u$ is set to 0.7 for all users. We scaled down the number of iterations by a factor of 0.01 so that the corresponding bars can be shown in the same figures. Again, FASTCOVER was able to find a considerably smaller solution than the centralized greedy algorithm. Here, we couldn't run DISCOVER because of its prohibitive running time on Spark.
Figure 1g shows the size of the solution set obtained by FASTCOVER for building recommendations from a set of 1000 movies for 1000 users, versus the size of the merged solutions found by computing recommendations separately for each user. It can be seen that FASTCOVER was able to find a much smaller solution by covering all the functions at the same time.

5.3 Large Scale Dominating Set with Spark

In order to compare the performance of our algorithm with DISCOVER more precisely, we applied FASTCOVER to the Friendster network, which consists of 65,608,366 nodes and 1,806,067,135 edges [26]. This dataset was used in [12] to evaluate the performance of DISCOVER. Figures 1j, 1k, and 1l show the performance of FASTCOVER for obtaining covers of 30%, 40%, and 50% of the whole graph, compared to the centralized greedy solution. Again, the size of the solution obtained by FASTCOVER is smaller than that of the greedy algorithm for small values of $\epsilon$. Note that running the centralized greedy algorithm is impractical if the dataset cannot fit into the memory of a single machine. Figure 1h compares the solution set size and the number of rounds for FASTCOVER and DISCOVER with different values of $\epsilon$ and $\alpha$. The points in the bottom left correspond to the solutions obtained by FASTCOVER, which confirms its superior performance. We further measured the actual running time of both algorithms on a smaller instance of the same graph with 14,043,721 nodes. We tuned $\epsilon$ and $\alpha$ to get solutions of approximately equal size for both algorithms. Figure 1i shows the speedup of FASTCOVER over DISCOVER. It can be observed that as the coverage value $L$ increases, FASTCOVER shows an exponential speedup over DISCOVER.

6 Conclusion

In this paper, we introduced the public-private model of data summarization, motivated by privacy concerns of recommender systems. We also developed a fast distributed algorithm, FASTCOVER, that provides a succinct summary for all users without violating their privacy. We showed that FASTCOVER returns a solution that is competitive with that of the best centralized, polynomial-time algorithm (i.e., the greedy solution). We also showed that FASTCOVER runs exponentially faster than previously proposed distributed algorithms. The superior practical performance of FASTCOVER against all the benchmarks was demonstrated through a large set of experiments, including movie recommendation, location recommendation, and dominating set (all implemented with Spark). Our theoretical results, combined with the practical performance of FASTCOVER, make it the only existing distributed algorithm for the submodular cover problem that truly scales to massive data.

Acknowledgment: This research was supported by a Google Faculty Research Award and a DARPA Young Faculty Award (D16AP00046).
Figure 1: Performance of FASTCOVER vs. other baselines. a), b), c) Solution set size vs. number of rounds for personalized location recommendation on a set of 3,056 GPS measurements, for covering 60%, 80%, 90% of the maximum utility of each user. d), e), f) The same measures for personalized movie recommendation on a set of 1000 movies, 138,493 users and 20,000,263 ratings, for covering 10%, 20%, 30% of the maximum utility of each user. g) Solution set size vs. coverage for simultaneously covering all users vs. covering users one by one and taking the union; the recommendation is on a set of 1000 movies for 1000 users. h) Solution set size vs. number of rounds for FASTCOVER and DISCOVER for covering 50% of the Friendster network with 65,608,366 vertices. i) Exponential speedup of FASTCOVER over DISCOVER on a subgraph of 14M nodes. j), k), l) Solution set size vs. number of rounds for covering 30%, 40%, 50% of the Friendster network.

References

[1] Sebastian Tschiatschek, Rishabh Iyer, Haochen Wei, and Jeff Bilmes. Learning mixtures of submodular functions for image collection summarization. In NIPS, 2014.
[2] Khalid El-Arini and Carlos Guestrin. Beyond keyword search: discovering relevant scientific literature. In KDD, 2011.
[3] Ian Simon, Noah Snavely, and Steven M. Seitz. Scene summarization for online image collections. In ICCV, 2007.
[4] Delbert Dueck and Brendan J. Frey. Non-metric affinity propagation for unsupervised image categorization. In ICCV, 2007.
[5] Ryan Gomes and Andreas Krause. Budgeted nonparametric learning from data streams. In ICML, 2010.
[6] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In NIPS, 2013.
[7] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In North American Chapter of the Association for Computational Linguistics/Human Language Technologies, 2011.
[8] Ruben Sipos, Adith Swaminathan, Pannaga Shivaswamy, and Thorsten Joachims. Temporal corpus summarization using submodular word coverage. In CIKM, 2012.
[9] Laurence A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 1982.
[10] Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 1998.
[11] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, 2004.
[12] Baharan Mirzasoleiman, Amin Karbasi, Ashwinkumar Badanidiyuru, and Andreas Krause. Distributed submodular cover: Succinctly summarizing massive data. In NIPS, 2015.
[13] Andreas Krause, Brendan McMahan, Carlos Guestrin, and Anupam Gupta. Robust submodular observation selection. JMLR, 2008.
[14] Rishabh K. Iyer and Jeff A. Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In NIPS, 2013.
[15] Flavio Chierichetti, Alessandro Epasto, Ravi Kumar, Silvio Lattanzi, and Vahab Mirrokni. Efficient algorithms for public-private social networks. In KDD, 2015.
[16] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, and Amin Karbasi. Fast constrained submodular maximization: Personalized data summarization. In ICML, 2016.
[17] Bonnie Berger, John Rompel, and Peter W. Shor. Efficient NC algorithms for set cover with applications to learning and geometry. Journal of Computer and System Sciences, 1994.
[18] Guy E. Blelloch, Richard Peng, and Kanat Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, 2011.
[19] Stergios Stergiou and Kostas Tsioutsiouliklis. Set cover at web scale. In SIGKDD, 2015.
[20] Erik D. Demaine, Piotr Indyk, Sepideh Mahabadi, and Ali Vakilian. On streaming and communication complexity of the set cover problem. In Distributed Computing, 2014.
[21] Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in MapReduce and streaming. TOPC, 2015.
[22] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2009.
[23] https://refind.com/fphilipe/topics/open-data.
[24] GroupLens. MovieLens 20M dataset. http://grouplens.org/datasets/movielens/20m/.
[25] Erik M. Lindgren, Shanshan Wu, and Alexandros G. Dimakis. Sparse and greedy: Sparsifying submodular facility location problems. In NIPS, 2015.
[26] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, 2015.