# Sparsification of Decomposable Submodular Functions

Akbar Rafiey (Simon Fraser University, arafiey@sfu.ca) and Yuichi Yoshida (National Institute of Informatics, yyoshida@nii.ac.jp)

Submodular functions are at the core of many machine learning and data mining tasks. The underlying submodular functions for many of these tasks are decomposable, i.e., they are sums of several simple submodular functions. In many data-intensive applications, however, the number of underlying submodular functions in the original function is so large that processing it requires a prohibitively large amount of time and/or it does not even fit in main memory. To overcome this issue, we introduce the notion of sparsification for decomposable submodular functions, whose objective is to obtain an accurate approximation of the original function that is a (weighted) sum of only a few submodular functions. Our main result is a polynomial-time randomized sparsification algorithm such that the expected number of functions used in the output is independent of the number of underlying submodular functions in the original function. We also study the effectiveness of our algorithm under various constraints such as matroid and cardinality constraints. We complement our theoretical analysis with an empirical study of the performance of our algorithm.

Introduction

Submodularity of a set function is an intuitive diminishing-returns property, stating that adding an element to a smaller set yields a larger gain than adding it to a larger set. This fundamental structure has emerged as a very beneficial property in many combinatorial optimization problems arising in machine learning, graph theory, economics, and game theory, to name a few. Formally, a set function $f : 2^E \to \mathbb{R}$ is submodular if for any $S \subseteq T \subseteq E$ and $e \in E \setminus T$ it holds that $f(S \cup \{e\}) - f(S) \ge f(T \cup \{e\}) - f(T)$. Submodularity allows one to efficiently find provably (near-)optimal solutions. In particular, a wide range of efficient approximation algorithms have been developed for maximizing or minimizing submodular functions subject to different constraints. Unfortunately, these algorithms require a number of function evaluations that is infeasible or extremely inefficient in many data-intensive applications. Fortunately, several submodular optimization problems arising in machine learning have structure that allows solving them more efficiently. An important such class is that of decomposable submodular functions. These are functions that can be written as sums of several simple submodular functions, i.e.,
$$F(S) = \sum_{i=1}^{N} f_i(S) \qquad \forall S \subseteq E,$$
where each $f_i : 2^E \to \mathbb{R}$ is a submodular function on the ground set $E$ with $|E| = n$. Decomposable submodular functions encompass many of the examples of submodular functions studied in the context of machine learning as well as economics. For example, they are extensively used in economics in the problem of welfare maximization in combinatorial auctions (Dobzinski and Schapira 2006; Feige 2006; Feige and Vondrák 2006; Papadimitriou, Schapira, and Singer 2008; Vondrák 2008).

Example 1 (Welfare maximization). Let $E$ be a set of $n$ resources and $a_1, \ldots, a_N$ be $N$ agents. Each agent has an interest over subsets of resources, expressed as a submodular function $f_i : 2^E \to \mathbb{R}$.
The objective is to select a small subset of resources that maximizes the happiness across all the agents, i.e., the social welfare. More formally, the goal is to find a subset $S \subseteq E$ of size at most $k$ that maximizes $F(S) = \sum_{i=1}^{N} f_i(S)$, where $k$ is a positive integer.

Decomposable submodular functions appear in various machine learning tasks such as data summarization, where we seek a representative subset of elements of small size. This has numerous applications, including exemplar-based clustering (Dueck and Frey 2007; Gomes and Krause 2010), image summarization (Tschiatschek et al. 2014), recommender systems (Parambath, Usunier, and Grandvalet 2016), and document and corpus summarization (Lin and Bilmes 2011). The problem of maximizing decomposable submodular functions has been studied under different constraints such as cardinality and matroid constraints in various data summarization settings (Mirzasoleiman, Badanidiyuru, and Karbasi 2016; Mirzasoleiman et al. 2016; Mirzasoleiman, Zadimoghaddam, and Karbasi 2016), and in differential privacy settings (Chaturvedi, Nguyen, and Zakynthinou 2021; Mitrovic et al. 2017; Rafiey and Yoshida 2020).

In many of these applications, the number of underlying submodular functions is too large (i.e., $N$ is too large) for the function to even fit in main memory, and building a compressed representation that preserves its relevant properties is appealing. This motivates us to find a sparse representation for a decomposable submodular function $F$. Sparsification is an algorithmic paradigm in which a dense object is replaced by a sparse one with similar features, which often leads to significant improvements in the efficiency of algorithms, including running time, space complexity, and communication. In this work, we propose a simple and very effective algorithm that yields a sparse and accurate representation of a decomposable submodular function. To the best of our knowledge, this work is the first to study sparsification of decomposable submodular functions.

Our Contributions

General setting. Given a decomposable submodular function $F = \sum_{i=1}^{N} f_i$, we present a randomized algorithm that yields a sparse representation approximating $F$. Our algorithm includes each submodular function $f_i$ in the sparsifier with probability proportional to its importance in the sum $\sum_{i=1}^{N} f_i$. Moreover, each selected submodular function is assigned a weight which is also proportional to its importance. We prove that this simple algorithm yields a sparsifier of small size (independent of $N$) that approximates $F$ very well. Let $|B(f_i)|$ denote the number of extreme points in the base polytope of $f_i$, and let $B = \max_{i \in [N]} |B(f_i)|$.

Theorem 2. Let $F = \sum_{i=1}^{N} f_i$ be a decomposable submodular function. For any $\epsilon > 0$, there exists a vector $w \in \mathbb{R}^N$ with at most $O(B n^2 / \epsilon^2)$ non-zero entries such that for the submodular function $F' = \sum_{i=1}^{N} w_i f_i$ we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \subseteq E.$$
Moreover, if all $f_i$'s are monotone, then there exists a polynomial-time randomized algorithm that outputs a vector $w \in \mathbb{R}^N$ with at most $O(B n^{2.5} \log n / \epsilon^2)$ non-zero entries in expectation such that for the submodular function $F' = \sum_{i=1}^{N} w_i f_i$, with high probability, we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \subseteq E.$$

Remark 3 (Tightness). The existential result is almost tight: in the special case of directed graphs we have $\max_i |B(f_i)| = 2$, and it is known that $\Omega(n^2)$ edges are needed to construct a sparsifier (Cohen et al. 2017).
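To make the guarantee in Theorem 2 concrete, the following is a minimal Python sketch that brute-force checks whether a given weight vector $w$ satisfies the sparsifier inequality on a toy decomposable function. The coverage-style $f_i$'s and the candidate weights are illustrative only; they are not the construction analyzed in the paper.

```python
from itertools import combinations

# Toy decomposable function on the ground set E = {0, 1, 2}:
# f_i(S) = 1 iff S intersects covers[i], and F(S) = sum_i f_i(S).
E = [0, 1, 2]
covers = [{0}, {0, 1}, {1, 2}, {2}]

def f(i, S):
    return 1.0 if covers[i] & S else 0.0

def F(S, w=None):
    w = w or [1.0] * len(covers)
    return sum(w[i] * f(i, S) for i in range(len(covers)))

def is_sparsifier(w, eps):
    """Check (1-eps)*F'(S) <= F(S) <= (1+eps)*F'(S) for every subset S."""
    for r in range(len(E) + 1):
        for S in map(set, combinations(E, r)):
            Fp = F(S, w)   # sparsified value F'(S)
            Fv = F(S)      # original value F(S)
            if not ((1 - eps) * Fp <= Fv <= (1 + eps) * Fp):
                return False
    return True

# A weight vector that drops f_3 and re-weights f_2; it happens to be a 0.5-sparsifier.
print(is_sparsifier([1.0, 1.0, 2.0, 0.0], eps=0.5))   # -> True
```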
Sparsifying under constraints. We also consider the setting where we are only interested in evaluating $F$ on particular sets; for instance, the objective is to optimize $F$ over subsets of size at most $k$, or over the independent sets of a matroid. Optimizing submodular functions under these constraints has been studied extensively and has an extremely rich theoretical landscape. Our algorithm can be tailored to these types of constraints and then constructs an even smaller sparsifier.

Theorem 4. Let $F = \sum_{i=1}^{N} f_i$ be a decomposable submodular function. For any $\epsilon > 0$ and a matroid $\mathcal{M}$ of rank $r$, there exists a vector $w \in \mathbb{R}^N$ with at most $O(B r n / \epsilon^2)$ non-zero entries such that for the submodular function $F' = \sum_{i=1}^{N} w_i f_i$ we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \in \mathcal{M}.$$
Moreover, if all $f_i$'s are monotone, then there exists a polynomial-time randomized algorithm that outputs a vector $w \in \mathbb{R}^N$ with at most $O(B r n^{1.5} \log n / \epsilon^2)$ non-zero entries in expectation such that for the submodular function $F' = \sum_{i=1}^{N} w_i f_i$, with high probability, we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \in \mathcal{M}.$$

Applications: speeding up maximization and minimization. Our sparsifying algorithm can be used as a preprocessing step in many settings in order to speed up algorithms. To elaborate, we consider the classical greedy algorithm of Nemhauser, Wolsey, and Fisher for maximizing monotone submodular functions under a cardinality constraint. We observe that sparsifying the instance reduces the number of function evaluations from $O(knN)$ to $O(B k^2 n^2 / \epsilon^2)$, a significant speed-up when $N \gg n$. Regarding minimization, we prove that our algorithm yields an approximation of the Lovász extension, so it can be used as a preprocessing step for algorithms that work on Lovász extensions, such as those of Axiotis et al. (2021) and Ene, Nguyen, and Végh (2017). One particular regime considered in many results is where each submodular function $f_i$ acts on $O(1)$ elements of the ground set, which implies that $B = \max_i |B(f_i)|$ is $O(1)$. Using our sparsifier algorithm as a preprocessing step is quite beneficial here. For instance, it improves the running time of Axiotis et al. (2021) from $\widetilde{O}(T_{\mathrm{maxflow}}(n, n+N) \log \frac{1}{\epsilon})$ to $\widetilde{O}(T_{\mathrm{maxflow}}(n, n+n^2/\epsilon^2) \log \frac{1}{\epsilon})$. Here, $T_{\mathrm{maxflow}}(n, m)$ denotes the time required to compute a maximum flow in a directed graph with $n$ vertices, $m$ arcs, and polynomially bounded integral capacities.

Well-known examples. In practice, the bounds on the size of sparsifiers are often better than the ones presented in Theorems 2 and 4, e.g., when $B$ is a constant. We consider several examples of decomposable submodular functions that appear in many applications, namely the MAXIMUM COVERAGE, FACILITY LOCATION, and SUBMODULAR HYPERGRAPH MIN CUT problems. For the first two examples, sparsifiers of size $O(n^2/\epsilon^2)$ can be constructed in time linear in $N$. For SUBMODULAR HYPERGRAPH MIN CUT, when each hyperedge is of constant size, sparsifiers of size $O(n^2/\epsilon^2)$ exist, and in several specific cases with various applications efficient algorithms are available to construct them.

Empirical results. Finally, we examine our algorithm empirically and demonstrate that it constructs a concise sparsifier on which algorithms can be run efficiently.

Related Work

To the best of our knowledge there is no prior work on sparsification algorithms for decomposable submodular functions. However, special cases of this problem have attracted much attention, most notably cut sparsifiers for graphs.
The cut function of a graph $G = (V, E)$ can be seen as a decomposable submodular function $F(S) = \sum_{e \in E} f_e(S)$, where $f_e(S) = 1$ if and only if $e \cap S \neq \emptyset$ and $e \cap (V \setminus S) \neq \emptyset$. The problem of sparsifying a graph while approximately preserving its cut structure has been studied extensively (see Ahn, Guha, and McGregor (2012, 2013); Bansal, Svensson, and Trevisan (2019); Benczúr and Karger (2015) and references therein). The pioneering work of Benczúr and Karger (1996) showed that for any graph $G$ with $n$ vertices one can construct, in nearly linear time, a weighted subgraph $G'$ with $O(n \log n / \epsilon^2)$ edges such that the weight of every cut in $G$ is preserved within a multiplicative $(1 \pm \epsilon)$ factor in $G'$. Note that a graph on $n$ vertices can have $N = \Omega(n^2)$ edges. The bound on the number of edges was later improved to $O(n/\epsilon^2)$ (Batson, Spielman, and Srivastava 2012), which is tight (Andoni et al. 2016).

A more general concept for graphs, called spectral sparsification, was introduced by Spielman and Teng (2011). This notion captures the spectral similarity between a graph and its sparsifiers: a spectral sparsifier approximates the quadratic form of the Laplacian of a graph. Note that a spectral sparsifier is also a cut sparsifier. This notion has numerous applications in linear algebra (Mahoney 2011; Li, Miller, and Peng 2013; Cohen et al. 2015; Lee and Sidford 2014), and it has been used to design efficient approximation algorithms related to cuts and flows (Benczúr and Karger 2015; Karger and Levine 2002; Madry 2010). Spielman and Teng's sparsifier has $O(n \log^c n)$ edges for a large constant $c > 0$, which was improved to $O(n/\epsilon^2)$ (Lee and Sun 2018).

In pursuit of a more general setting, the notions of cut sparsifier and spectral sparsifier have been studied for hypergraphs. Observe that a hypergraph on $n$ vertices can have exponentially many hyperedges, i.e., $N = \Omega(2^n)$. For hypergraphs, Kogan and Krauthgamer (2015) provided a polynomial-time algorithm that constructs an $\epsilon$-cut sparsifier with $O(n(r + \log n)/\epsilon^2)$ hyperedges, where $r$ denotes the maximum size of a hyperedge. The current best result is due to Chen, Khanna, and Nagda (2020), whose $\epsilon$-cut sparsifier uses $O(n \log n / \epsilon^2)$ hyperedges and can be constructed in time $O(Nn^2 + n^{10}/\epsilon^2)$, where $N$ is the number of hyperedges. Recently, Soma and Yoshida (2019) initiated the study of spectral sparsifiers for hypergraphs and showed that every hypergraph admits an $\epsilon$-spectral sparsifier with $O(n^3 \log n / \epsilon^2)$ hyperedges. For the case where the maximum size of a hyperedge is $r$, Bansal, Svensson, and Trevisan (2019) showed that every hypergraph has an $\epsilon$-spectral sparsifier of size $O(n r^3 \log n / \epsilon^2)$. Recently, this bound has been improved to $O(nr(\log n/\epsilon)^{O(1)})$ and then to $O(n(\log n/\epsilon)^{O(1)})$ (Kapralov et al. 2021a,b). This leads to the study of sparsification of submodular functions, which is the focus of this paper and provides a unifying framework for these previous works.

Preliminaries

For a positive integer $n$, let $[n] = \{1, 2, \ldots, n\}$. Let $E$ be a set of elements of size $n$, which we call the ground set. For a set $S \subseteq E$, $\mathbf{1}_S \in \mathbb{R}^E$ denotes the characteristic vector of $S$. For a vector $x \in \mathbb{R}^E$ and a set $S \subseteq E$, $x(S) = \sum_{e \in S} x(e)$.

Submodular functions. Let $f : 2^E \to \mathbb{R}_+$ be a set function. We say that $f$ is monotone if $f(S) \le f(T)$ holds for every $S \subseteq T \subseteq E$. We say that $f$ is submodular if $f(S \cup \{e\}) - f(S) \ge f(T \cup \{e\}) - f(T)$ holds for any $S \subseteq T \subseteq E$ and $e \in E \setminus T$. The base polytope of a submodular function $f$ is defined as
$$B(f) = \{ y \in \mathbb{R}^E \mid y(S) \le f(S)\ \forall S \subseteq E,\ y(E) = f(E) \},$$
and $|B(f)|$ denotes the number of extreme points of the base polytope $B(f)$.
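Since the quantity $|B(f_i)|$ (and $B = \max_i |B(f_i)|$) drives all the bounds below, here is a minimal Python sketch that enumerates the extreme points of $B(f)$ for a tiny ground set, using the standard fact that every extreme point of the base polytope of a submodular function with $f(\emptyset) = 0$ is produced by Edmonds' greedy rule applied to some permutation of $E$. The example function is illustrative.

```python
from itertools import permutations

def extreme_points(f, ground):
    """Enumerate the extreme points of the base polytope B(f) of a submodular f
    (with f(empty set) = 0) by Edmonds' greedy rule: for a permutation pi, the point y with
    y[pi(j)] = f({pi(1..j)}) - f({pi(1..j-1)}) is an extreme point of B(f),
    and every extreme point arises this way.  Brute force; only for tiny ground sets."""
    pts = set()
    for pi in permutations(ground):
        prefix, prev, y = set(), 0.0, {}
        for e in pi:
            prefix.add(e)
            val = f(prefix)
            y[e] = val - prev
            prev = val
        pts.add(tuple(y[e] for e in ground))
    return pts

# Example: the cut-type function of a single edge {a, b} (1 iff exactly one endpoint is in S)
# has only 2 extreme points, in line with the small value of B for graph cut functions.
ground = ["a", "b"]
f = lambda S: 1.0 if ("a" in S) != ("b" in S) else 0.0
print(len(extreme_points(f, ground)))   # -> 2
```

For larger ground sets this enumeration is of course infeasible; the point is only to illustrate what $|B(f)|$ counts.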
Definition 5 ($\epsilon$-sparsifier). Let $f_i$ ($i \in D$) be a set of $N$ submodular functions, and let $F(S) = \sum_{i \in D} f_i(S)$ be a decomposable submodular function. A vector $w \in \mathbb{R}^N$ is called an $\epsilon$-sparsifier of $F$ if, for the submodular function $F' := \sum_{i \in D} w_i f_i$, the following holds for every $S \subseteq E$:
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S). \qquad (1)$$
The size of an $\epsilon$-sparsifier $w$, denoted $\mathrm{size}(w)$, is the number of indices $i$ with $w_i \neq 0$.

Matroids and matroid polytopes. A pair $\mathcal{M} = (E, \mathcal{I})$ of a set $E$ and $\mathcal{I} \subseteq 2^E$ is called a matroid if (1) $\emptyset \in \mathcal{I}$, (2) $A \in \mathcal{I}$ for any $A \subseteq B \in \mathcal{I}$, and (3) for any $A, B \in \mathcal{I}$ with $|A| < |B|$, there exists $e \in B \setminus A$ such that $A \cup \{e\} \in \mathcal{I}$. We call a set in $\mathcal{I}$ an independent set. The rank function $r_{\mathcal{M}} : 2^E \to \mathbb{Z}_+$ of $\mathcal{M}$ is $r_{\mathcal{M}}(S) = \max\{|I| : I \subseteq S,\ I \in \mathcal{I}\}$. An independent set $S \in \mathcal{I}$ is called a base if $r_{\mathcal{M}}(S) = r_{\mathcal{M}}(E)$. We denote the rank of $\mathcal{M}$ by $r(\mathcal{M})$. The matroid polytope $P(\mathcal{M}) \subseteq \mathbb{R}^E$ of $\mathcal{M}$ is $P(\mathcal{M}) = \mathrm{conv}\{\mathbf{1}_I : I \in \mathcal{I}\}$, where $\mathrm{conv}$ denotes the convex hull, or equivalently (Edmonds 2001), $P(\mathcal{M}) = \{x \ge 0 : x(S) \le r_{\mathcal{M}}(S)\ \forall S \subseteq E\}$.

Constructing a Sparsifier

In this section, we give a probabilistic argument that proves the existence of an accurate sparsifier, and we turn this argument into a polynomial-time algorithm that finds a sparsifier with high probability. For each submodular function $f_i$, let
$$p_i = \max_{A \subseteq E} \frac{f_i(A)}{F(A)}. \qquad (2)$$
The values $p_i$ guide how much weight should be allocated to each submodular function $f_i$ and with what probability it should be sampled. To construct an $\epsilon$-sparsifier of $F$, for each submodular function $f_i$ we set $w_i = 1/\kappa_i$ with probability $\kappa_i = \min\{1, \kappa p_i\}$ and do nothing with the remaining probability $1 - \kappa_i$ (see Algorithm 1). Here $\kappa$ depends on $n$, $\epsilon$, and $\delta$, where $\delta$ is the failure probability of our algorithm. Observe that, for each $f_i$, the expected weight $w_i$ is exactly one. We show that the expected number of entries of $w$ with $w_i > 0$ is $O(\frac{n^2}{\epsilon^2} \max_{i \in D} |B(f_i)|)$. Let $B = \max_{i \in D} |B(f_i)|$ in the rest of the paper.

Algorithm 1
Require: Submodular functions $f_i$ in dataset $D$ where each $f_i : \{0,1\}^E \to \mathbb{R}$; $\epsilon, \delta \in (0,1)$
1: $w \leftarrow \mathbf{0}$
2: $\kappa \leftarrow 3 \log(2^{n+1}/\delta)/\epsilon^2$
3: for $f_i$ in $D$ do
4:   $p_i \leftarrow \max_{A \subseteq E} f_i(A)/F(A)$
5:   $\kappa_i \leftarrow \min\{1, \kappa p_i\}$
6:   $w_i \leftarrow 1/\kappa_i$ with probability $\kappa_i$; do nothing with probability $1 - \kappa_i$
7: return $w \in \mathbb{R}^D$

Lemma 6. Algorithm 1 returns a vector $w$ which is an $\epsilon$-sparsifier of $F$ with probability at least $1 - \delta$.

Lemma 7. Algorithm 1 outputs an $\epsilon$-sparsifier of expected size $O(\frac{B n^2}{\epsilon^2})$.

Proof. In Algorithm 1, each $w_i$ is greater than zero with probability $\kappa_i$ and is zero with probability $1 - \kappa_i$. Hence,
$$\mathbb{E}[\mathrm{size}(w)] = \sum_{i \in D} \kappa_i \le \kappa \sum_{i \in D} p_i = O(n/\epsilon^2) \sum_{i \in D} p_i. \qquad (3)$$
It suffices to upper bound $\sum_{i \in D} p_i$.

Claim 8. $\sum_{i \in D} p_i \le n \max_{i \in D} |B(f_i)| = n B$.

Claim 8 and inequality (3) yield the desired bound.

Lemmas 6 and 7 prove the existence part of Theorem 2: for every $\epsilon, \delta \in (0,1)$, with probability at least $1 - \delta$ there exists an $\epsilon$-sparsifier of size at most $O(\frac{B n^2}{\epsilon^2})$.

Polynomial-time algorithm. Observe that computing the $p_i$'s in (2) may not be a polynomial-time task in general. However, to guarantee that Algorithm 1 outputs an $\epsilon$-sparsifier with high probability, it suffices to instantiate it with an upper bound on each $p_i$ (see the proof of Lemma 6). Fortunately, the result of Bai et al. (2016) provides an algorithm to approximate the ratio of two monotone submodular functions.

Theorem 9 (Bai et al. (2016)). Let $f$ and $g$ be two monotone submodular functions. Then there exists a polynomial-time algorithm that approximates $\max_{S \subseteq E} \frac{f(S)}{g(S)}$ within a factor of $O(\sqrt{n} \log n)$.
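A minimal Python sketch of the sampling step of Algorithm 1. The exact ratios $p_i$ are assumed to be supplied by the caller; the brute-force helper below is only for tiny ground sets, and in the monotone case the ratios would be replaced by the approximations $\hat{p}_i$ discussed next.

```python
import math
import random
from itertools import combinations

def sparsify(fs, p, n, eps, delta, seed=0):
    """Algorithm 1 (sketch): keep f_i with probability kappa_i = min(1, kappa * p_i)
    and give it weight 1/kappa_i, so that E[w_i] = 1 for every i."""
    rng = random.Random(seed)
    # kappa = 3 * log(2^(n+1) / delta) / eps^2, written so as not to form 2^(n+1) explicitly
    kappa = 3 * ((n + 1) * math.log(2) + math.log(1 / delta)) / eps ** 2
    w = [0.0] * len(fs)
    for i, pi in enumerate(p):
        kappa_i = min(1.0, kappa * pi)
        if rng.random() < kappa_i:
            w[i] = 1.0 / kappa_i
    return w   # the sparsified function is F'(S) = sum_i w[i] * fs[i](S)

def ratios_bruteforce(fs, ground):
    """p_i = max over non-empty A of f_i(A) / F(A); exponential time, tiny ground sets only."""
    subsets = [set(c) for r in range(1, len(ground) + 1)
               for c in combinations(ground, r)]
    F = lambda A: sum(f(A) for f in fs)
    return [max(f(A) / F(A) for A in subsets if F(A) > 0) for f in fs]
```

In practice the brute-force helper would be replaced by the $O(\sqrt{n}\log n)$-approximate ratios $\hat{p}_i$ of Theorem 9, or by closed forms as in the examples later in the paper.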
By Theorem 9, when all $f_i$'s are monotone we can compute estimates $\hat{p}_i$ with $p_i \le \hat{p}_i \le O(\sqrt{n} \log n)\, p_i$ in polynomial time, which leads to a polynomial-time randomized algorithm that constructs an $\epsilon$-sparsifier of expected size at most $O(\frac{B n^{2.5} \log n}{\epsilon^2})$. This proves the second part of Theorem 2. As we will see, in various applications the expected size of the sparsifier is often much better than the bounds presented in this section. We also emphasize that once a sparsifier is constructed it can be reused many times (possibly for maximization/minimization under several different constraints); hence computing or approximating the $p_i$'s should be regarded as a preprocessing step. Finally, it is straightforward to adapt our algorithm to sparsify decomposable submodular functions of the form $\sum_{i \in D} \alpha_i f_i$, known as mixtures of submodular functions (Bairi et al. 2015; Tschiatschek et al. 2014).

Constructing a Sparsifier under Constraints

Here we are interested in constructing a sparsifier for a decomposable submodular function $F$ when the goal is to optimize $F$ subject to constraints. One of the most commonly used and general constraints is the matroid constraint: for a matroid $\mathcal{M} = (E, \mathcal{I})$, the objective is to find $S^* = \mathrm{argmax}_{S \subseteq E,\, S \in \mathcal{I}} F(S)$. In this setting it suffices to construct a sparsifier that approximates $F$ only on independent sets, and it turns out that we can construct a smaller sparsifier than the one needed to approximate $F$ everywhere. For each submodular function $f_i$, let
$$p_i = \max_{A \in \mathcal{I}} \frac{f_i(A)}{F(A)}. \qquad (4)$$
Other than the different definition of the $p_i$'s and a different $\kappa$, Algorithm 2 is the same as Algorithm 1.

Algorithm 2
Require: Submodular functions $f_i : \{0,1\}^E \to \mathbb{R}$ in dataset $D$, matroid $\mathcal{M} = (E, \mathcal{I})$, and $\epsilon, \delta \in (0,1)$
1: $w \leftarrow \mathbf{0}$
2: $\kappa \leftarrow 3 \log(2 n^{r+1}/\delta)/\epsilon^2$, where $r$ is the rank of $\mathcal{M}$
3: for $f_i$ in $D$ do
4:   $p_i \leftarrow \max_{A \in \mathcal{I}} f_i(A)/F(A)$
5:   $\kappa_i \leftarrow \min\{1, \kappa p_i\}$
6:   $w_i \leftarrow 1/\kappa_i$ with probability $\kappa_i$; do nothing with probability $1 - \kappa_i$
7: return $w \in \mathbb{R}^D$

Theorem 10. Algorithm 2 returns a vector $w$ with expected size at most $O(\frac{B r n}{\epsilon^2})$ such that, with probability at least $1 - \delta$, for $F' = \sum_{i \in D} w_i f_i$ we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \in \mathcal{M}.$$
Theorem 10 proves the existence part of Theorem 4. Algorithm 2 can be turned into a polynomial-time algorithm if one can approximate the $p_i$'s in (4). By modifying the proof of Theorem 9 we prove the following.

Theorem 11. Let $f$ and $g$ be two monotone submodular functions and $\mathcal{M} = (E, \mathcal{I})$ be a matroid. Then there exists a polynomial-time algorithm that approximates $\max_{S \subseteq E,\, S \in \mathcal{I}} \frac{f(S)}{g(S)}$ within a factor of $O(\sqrt{n} \log n)$.

By this theorem, when all $f_i$'s are monotone we can compute estimates $\hat{p}_i$ with $p_i \le \hat{p}_i \le O(\sqrt{n} \log n)\, p_i$ in polynomial time, which leads to a polynomial-time randomized algorithm that constructs an $\epsilon$-sparsifier of expected size at most $O(\frac{B r n^{1.5} \log n}{\epsilon^2})$. This proves the second part of Theorem 4.

Applications

Submodular Function Maximization with Cardinality Constraint

Our sparsification algorithm can be used as a preprocessing step, and once a sparsifier is constructed it can be reused many times (possibly for maximization/minimization under several different constraints). To elaborate on this, we consider the problem of maximizing a submodular function subject to a cardinality constraint, that is, finding $S^* = \mathrm{argmax}_{S \subseteq E,\, |S| \le k} F(S)$.

Algorithm 3
Require: Submodular function $F = \sum_{i \in D} f_i$ with each $f_i : \{0,1\}^E \to \mathbb{R}$, constant $k$, and $\epsilon, \delta \in (0,1)$
1: Compute $F' = \sum_{i \in D} w_i f_i$, an $\epsilon$-sparsifier for $F$
2: $A \leftarrow \emptyset$
3: while $|A| < k$ do
4:   $a_i \leftarrow \mathrm{argmax}_{a \in E \setminus A} \big(F'(A \cup \{a\}) - F'(A)\big)$
5:   $A \leftarrow A \cup \{a_i\}$
6: return $A$
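A minimal Python sketch of Algorithm 3: the standard greedy is run on the sparsified function $F' = \sum_i w_i f_i$, so only the functions with non-zero weight are ever evaluated. The argument names are illustrative; `w` would come from Algorithm 1 or 2.

```python
def greedy_on_sparsifier(fs, w, ground, k):
    """Algorithm 3 (sketch): greedily maximize F'(S) = sum_i w_i * f_i(S) subject to |S| <= k.
    Only functions with w_i > 0 are evaluated, which is where the speed-up comes from."""
    active = [(wi, f) for wi, f in zip(w, fs) if wi > 0]
    Fp = lambda S: sum(wi * f(S) for wi, f in active)
    A = set()
    while len(A) < k:
        candidates = [e for e in ground if e not in A]
        if not candidates:
            break
        base = Fp(A)
        best = max(candidates, key=lambda e: Fp(A | {e}) - base)  # largest marginal gain
        A.add(best)
    return A
```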
Cardinality constraints are a special case of matroid constraints in which the independent sets are all subsets of size at most $k$, and the rank of the matroid is $k$. A celebrated result of Nemhauser, Wolsey, and Fisher (1978) states that for non-negative monotone submodular functions, a simple greedy algorithm provides a solution with a $(1 - 1/e)$ approximation guarantee relative to the optimal (intractable) solution. For a ground set $E$ of size $n$ and a monotone submodular function $F = \sum_{i \in D} f_i$, this greedy algorithm needs $O(knN)$ function evaluations to find a set $S$ of size $k$ such that $F(S) \ge (1 - 1/e) F(S^*)$. We refer to this algorithm as GreedyAlg. In many applications where $N \gg n$, having a sparsifier is beneficial: applying GreedyAlg to an $\epsilon$-sparsifier of size $O(Bkn/\epsilon^2)$ reduces the number of function evaluations to $O(Bk^2n^2/\epsilon^2)$ and yields a set $S$ of size $k$ such that $F(S) \ge (1 - 1/e - \epsilon) F(S^*)$ with high probability (see Algorithm 3).

We point out that sampling techniques such as (Mitrovic et al. 2018; Mirzasoleiman et al. 2015) sample elements from the ground set $E$ rather than sampling from the functions $f_1, \ldots, f_N$. Hence their running times depend on $N$, which can be slow when $N$ is large, the regime we care about. Besides, our algorithm can be used as a preprocessing step for these algorithms. For instance, the lazier-than-lazy greedy algorithm (Mirzasoleiman et al. 2015) requires $O(nN \log \frac{1}{\epsilon})$ function evaluations; when $N$ is much larger than $n$, it is clearly beneficial to first use our sparsification algorithm and reduce the number of submodular functions that one has to consider.

Two Well-known Examples

Maximum Coverage Problem. Let $[N]$ be a universe and $E = \{S_1, \ldots, S_n\}$, with each $S_i \subseteq [N]$, be a family of sets. Given a positive integer $k$, the MAX COVERAGE problem asks to select at most $k$ sets from $E$ such that the maximum number of elements is covered, i.e., the union of the selected sets has maximal size. One can formulate this problem as follows. For every $i \in [N]$ and $A \subseteq [n]$, define
$$f_i(A) = \begin{cases} 1 & \text{if there exists } a \in A \text{ such that } i \in S_a, \\ 0 & \text{otherwise.} \end{cases}$$
Note that the $f_i$'s are monotone and submodular. Furthermore, define $F : 2^{[n]} \to \mathbb{R}_+$ by $F(A) = \sum_{i \in [N]} f_i(A)$, which is monotone and submodular as well. Now the MAX COVERAGE problem is equivalent to $\max_{A \subseteq [n],\, |A| \le k} F(A)$. For each submodular function $f_i$, the corresponding $p_i$ is
$$p_i = \max_{A \subseteq [n],\, |A| \le k} \frac{f_i(A)}{F(A)} = \max_{S_a \in E,\, i \in S_a} \frac{f_i(\{a\})}{F(\{a\})} = \max_{S_a \in E,\, i \in S_a} \frac{1}{F(\{a\})} = \max_{S_a \in E,\, i \in S_a} \frac{1}{|S_a|}.$$
We can compute all the $p_i$'s in $O(\sum_a |S_a|)$ time, which is the input size, and then construct a sparsifier in $O(N)$ time. In total, the time required for sparsification is $O(\sum_a |S_a| + N)$. On the other hand, in this case we have
$$\sum_{i=1}^{N} p_i = \sum_{i=1}^{N} \max_{S_a \in E,\, i \in S_a} \frac{1}{|S_a|} \le \sum_{a=1}^{n} \sum_{i \in S_a} \frac{1}{|S_a|} = \sum_{a=1}^{n} \frac{|S_a|}{|S_a|} = n.$$
By Lemma 7, this upper bound implies that our algorithm constructs an $\epsilon$-sparsifier of size at most $O(kn/\epsilon^2)$. Algorithm 3 thus improves the running time of GreedyAlg from $O(knN)$ to $O(k^2n^2/\epsilon^2)$. Furthermore, Algorithm 3 returns a set $A$ of size at most $k$ such that $F(A) \ge (1 - 1/e - \epsilon)\,\mathrm{OPT}$, where $\mathrm{OPT}$ denotes $F(S^*)$ for $S^* = \mathrm{argmax}_{S \subseteq E,\, |S| \le k} F(S)$.
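For MAX COVERAGE the ratios have the closed form $p_i = \max_{S_a \ni i} 1/|S_a|$ derived above, so they can be read off in a single pass over the input; a minimal sketch (the sampling step itself would then proceed exactly as in Algorithm 1):

```python
def max_coverage_ratios(sets):
    """sets[a] is the a-th (non-empty) set S_a over the universe.
    p_i = max over sets S_a containing i of 1/|S_a|, computed in O(sum |S_a|) time."""
    p = {}
    for S in sets:
        inv = 1.0 / len(S)
        for i in S:
            p[i] = max(p.get(i, 0.0), inv)
    return p

sets = [{0, 1, 2}, {2, 3}, {3, 4, 5, 6}, {6}]
print(max_coverage_ratios(sets))
# e.g. element 6 is covered by a size-4 set and a size-1 set, so p_6 = 1.0
```

Summing these ratios over the universe gives at most $n$, which is exactly the bound used above to obtain the $O(kn/\epsilon^2)$ sparsifier size.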
Facility Location Problem. Let $I$ be a set of $N$ clients and $E$ be a set of facilities with $|E| = n$. Let $c : I \times E \to \mathbb{R}$ be the cost of assigning a given client to a given facility. For each client $i$ and each subset of facilities $A \subseteq E$, define $f_i(A) = \max_{j \in A} c(i, j)$. For any non-empty subset $A \subseteq E$, the value of $A$ is given by
$$F(A) = \sum_{i \in I} f_i(A) = \sum_{i \in I} \max_{j \in A} c(i, j).$$
For completeness, we define $F(\emptyset) = 0$. An instance of the MAX FACILITY LOCATION problem is specified by a tuple $(I, E, c)$; the objective is to choose a subset $A \subseteq E$ of size at most $k$ maximizing $F(A)$. For each submodular function $f_i$, the corresponding $p_i$ is
$$p_i = \max_{A \subseteq E,\, |A| \le k} \frac{\max_{j \in A} c(i, j)}{F(A)} = \max_{j \in E} \frac{c(i, j)}{F(\{j\})}.$$
It is clear that the $p_i$'s can be computed in $O(|I| \cdot |E|)$ time, which is the input size. In this case, we have
$$\sum_{i \in I} \max_{j \in E} \frac{c(i, j)}{F(\{j\})} \le \sum_{j \in E} \sum_{i \in I} \frac{c(i, j)}{F(\{j\})} = \sum_{j \in E} \frac{F(\{j\})}{F(\{j\})} = n.$$
Hence, by Lemma 7, our algorithm constructs a sparsifier of size $O(kn/\epsilon^2)$. Algorithm 3 improves the running time of GreedyAlg from $O(knN)$ to $O(k^2n^2/\epsilon^2)$. Furthermore, Algorithm 3 returns a set $A$ of size at most $k$ such that $F(A) \ge (1 - 1/e - \epsilon)\,\mathrm{OPT}$.

Remark 12. Lindgren, Wu, and Dimakis (2016) sparsify an instance of the FACILITY LOCATION problem by zeroing out entries in the cost matrix; this is not applicable to the general setting. The runtime of GreedyAlg applied to their sparsified instance is $O(nN/\epsilon)$, which is huge when $N$ is large, the regime we care about. Moreover, we can first construct our sparsifier and then apply the algorithm of Lindgren, Wu, and Dimakis to it.

Submodular Function Minimization

Besides the applications to submodular maximization, our sparsification algorithm can also be used as a preprocessing step for submodular minimization. In many applications of submodular minimization, such as image segmentation (Shanu, Arora, and Singla 2016), Markov random field inference (Fix et al. 2013; Kohli, Ladicky, and Torr 2009; Vicente, Kolmogorov, and Rother 2009), hypergraph cuts (Veldt, Benson, and Kleinberg 2020b), and covering functions (Stobbe and Krause 2010), the submodular function at hand is decomposable. Many recent advances on decomposable submodular minimization, such as (Ene, Nguyen, and Végh 2017; Axiotis et al. 2021), leverage a mix of ideas from both discrete and continuous optimization. Here we show that our sparsifying algorithm approximates the so-called Lovász extension, a natural extension of a submodular function to the continuous domain $[0,1]^n$.

Lovász extension. Let $x \in [0,1]^n$ be the vector $(x_1, x_2, \ldots, x_n)$. Let $\pi : [n] \to [n]$ be a sorting permutation of $x_1, x_2, \ldots, x_n$, meaning that if $\pi(i) = j$ then $x_j$ is the $i$-th largest element of $x$. Hence $1 \ge x_{\pi(1)} \ge \cdots \ge x_{\pi(n)} \ge 0$. Let $x_{\pi(0)} = 1$ and $x_{\pi(n+1)} = 0$. Define the sets $S^{\pi}_0 = \emptyset$ and $S^{\pi}_i = \{\pi(1), \ldots, \pi(i)\}$. The Lovász extension of $f$ is defined as
$$f^L(x) = \sum_{i=0}^{n} (x_{\pi(i)} - x_{\pi(i+1)}) f(S^{\pi}_i).$$
It is well known that $f^L(x) = \max_{y \in B(f)} \langle y, x \rangle$. For a decomposable submodular function $F = \sum_{i \in D} f_i$, its Lovász extension is
$$F^L(x) = \sum_{i \in D} \sum_{j=0}^{n} (x_{\pi(j)} - x_{\pi(j+1)}) f_i(S^{\pi}_j).$$
Recall the definition of the $p_i$'s in (2); they can be expressed equivalently in terms of permutations as
$$p_i = \max_{A \subseteq E} \frac{f_i(A)}{F(A)} = \max_{\pi} \max_{j \in [n]} \frac{f_i(S^{\pi}_j)}{F(S^{\pi}_j)}. \qquad (5)$$
Furthermore, note that $F^L(x)$ is a linear combination of the values $F(S)$, $S \subseteq E$. Given these facts, we prove that Algorithm 1 outputs a sparsifier that not only approximates the function itself but also approximates its Lovász extension.

Theorem 13. Algorithm 1 returns a vector $w$ with expected size at most $O(\frac{B n^2}{\epsilon^2})$ such that, with probability at least $1 - \delta$, for $F' = \sum_{i \in D} w_i f_i$ it holds that
$$(1-\epsilon) F'^L(x) \le F^L(x) \le (1+\epsilon) F'^L(x) \qquad \forall x \in [0,1]^n.$$
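A minimal Python sketch of the Lovász extension as defined above; for a decomposable $F$ (or a sparsifier $F'$), one simply sums these values over the $f_i$'s, weighted by $w_i$. The sanity check on a modular function, where $f^L(x) = \langle c, x \rangle$, is illustrative.

```python
def lovasz_extension(f, x, ground):
    """f^L(x) = sum_{j=0}^{n} (x_{pi(j)} - x_{pi(j+1)}) * f(S^pi_j), where pi sorts the
    coordinates of x in non-increasing order, x_{pi(0)} = 1, x_{pi(n+1)} = 0,
    and S^pi_j = {pi(1), ..., pi(j)} (with S^pi_0 the empty set)."""
    order = sorted(ground, key=lambda e: -x[e])
    vals = [1.0] + [x[e] for e in order] + [0.0]
    total, prefix = 0.0, set()
    total += (vals[0] - vals[1]) * f(frozenset(prefix))   # j = 0 term, f(empty set)
    for j, e in enumerate(order, start=1):
        prefix.add(e)
        total += (vals[j] - vals[j + 1]) * f(frozenset(prefix))
    return total

# Sanity check on a modular (additive) function, where f^L(x) = <c, x>.
c = {"a": 2.0, "b": 5.0, "c": 1.0}
f = lambda S: sum(c[e] for e in S)
x = {"a": 0.3, "b": 0.9, "c": 0.0}
print(lovasz_extension(f, x, list(c)))   # -> 5.1 = 2*0.3 + 5*0.9 + 1*0.0
```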
Remark 14 (Relation to spectral sparsification of graphs). The cut function of a graph $G = (V, E)$ can be seen as a decomposable submodular function $F(S) = \sum_{e \in E} f_e(S)$, where $f_e(S) = 1$ if and only if $e \cap S \neq \emptyset$ and $e \cap (V \setminus S) \neq \emptyset$. The goal of spectral sparsification of graphs (Spielman and Teng 2011) is to preserve the quadratic form of the Laplacian of $G$, which can be rephrased as $\sum_{e \in E} f^L_e(x)^2$. In contrast, our sparsification preserves $F^L(x) = \sum_{e \in E} f^L_e(x)$. Although we could construct a sparsifier that preserves $\sum_{e \in E} f^L_e(x)^2$ in the general submodular setting, we adopted the present notion because, in many applications involving submodular functions, we are more interested in the value of $\sum_{e \in E} f^L_e(x)$ than in $\sum_{e \in E} f^L_e(x)^2$, and the algorithm for preserving the former is simpler than that for preserving the latter.

Because our algorithm gives an approximation of the Lovász extension, it can be used as a preprocessing step for algorithms that work on Lovász extensions, such as those in (Axiotis et al. 2021; Ene, Nguyen, and Végh 2017). For instance, it improves the running time of Axiotis et al. (2021) from $\widetilde{O}(T_{\mathrm{maxflow}}(n, n+N) \log \frac{1}{\epsilon})$ to $\widetilde{O}(T_{\mathrm{maxflow}}(n, n+n^2/\epsilon^2) \log \frac{1}{\epsilon})$ in cases where each submodular function $f_i$, $i \in D$, acts on $O(1)$ elements of the ground set, which implies that $B = \max_i |B(f_i)|$ is $O(1)$. An example of such a case is hypergraph cut functions with $O(1)$-sized hyperedges.

Next we discuss several examples for which computing the $p_i$'s is computationally efficient, thus giving a polynomial-time algorithm to construct sparsifiers. Recall that the cut function of a graph $G = (V, E)$ can be seen as a decomposable submodular function. In this case, computing $p_e$ for an edge $e = st \in E$ is equivalent to finding the minimum $s$-$t$ cut in the graph, which is a polynomial-time task. A more general framework is the submodular hypergraph minimum $s$-$t$ cut problem discussed next.

Submodular hypergraphs (Li and Milenkovic 2018; Yoshida 2019). Let $H$ be a hypergraph with vertex set $V$ and hyperedge set $E$, where each hyperedge is a subset of the vertices $V$. A submodular function $f_e$ is associated with each hyperedge $e \in E$. In the submodular hypergraph minimum $s$-$t$ cut problem the objective is
$$\min_{S \subseteq V} \sum_{e \in E} f_e(e \cap S) \quad \text{subject to } s \in S,\ t \in V \setminus S. \qquad (6)$$
This problem has been studied by Veldt, Benson, and Kleinberg (2020a), and its special cases, where the submodular functions $f_e$ take particular forms, have been studied with applications in semi-supervised learning, clustering, and rank learning (see Li and Milenkovic (2018); Veldt, Benson, and Kleinberg (2020a) for more details). Examples of such special cases include:
Linear penalty: $f_e(S) = \min\{|S|, |e \setminus S|\}$.
Quadratic penalty: $f_e(S) = |S| \cdot |e \setminus S|$.
We refer to Table 1 in Veldt, Benson, and Kleinberg (2020a) for more examples. These examples are cardinality-based, that is, the value of the submodular function depends only on the cardinality of the input set (see Definition 3.2 of Veldt, Benson, and Kleinberg (2020a)). It is known that if all the submodular functions are cardinality-based, then computing the minimum $s$-$t$ cut in the submodular hypergraph can be reduced to a minimum $s$-$t$ cut in an auxiliary (ordinary) graph (Theorem 4.6 of Veldt, Benson, and Kleinberg (2020a)), which allows us to compute the $p_e$'s in polynomial time.
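Returning to the ordinary graph cut case mentioned above: $f_e(S) = 1$ exactly on the sets separating the endpoints of $e = st$, so $p_e = \max_S f_e(S)/F(S) = 1/(\text{minimum } s\text{-}t \text{ cut value})$. A minimal sketch, assuming NetworkX is available (the library choice is ours, not the paper's):

```python
import networkx as nx

def cut_ratios(G):
    """For the cut function F(S) = sum_e f_e(S) of an undirected graph G,
    p_e = max_S f_e(S)/F(S) = 1 / (minimum s-t cut value) for the edge e = {s, t}."""
    # Model the undirected graph as a digraph with unit capacities in both directions.
    H = nx.DiGraph()
    for u, v in G.edges():
        H.add_edge(u, v, capacity=1)
        H.add_edge(v, u, capacity=1)
    return {(u, v): 1.0 / nx.minimum_cut_value(H, u, v) for u, v in G.edges()}

G = nx.cycle_graph(5)   # every edge lies on a cycle, so every s-t minimum cut has value 2
print(cut_ratios(G))    # all ratios equal 0.5
```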
Remark 15. Our sparsification algorithm can also be used to construct the submodular Laplacian based on the Lovász extension of submodular functions. Submodular Laplacians were introduced by Yoshida (2019) and have numerous applications in machine learning, including learning from ranking data, clustering based on network motifs (Li and Milenkovic 2017), and network analysis (Yoshida 2016).

Experimental Results

In this section, we empirically demonstrate that our algorithm (Algorithm 2) generates a sparse representation of a decomposable submodular function $F : 2^E \to \mathbb{R}_+$ with which we can efficiently obtain a high-quality solution for maximizing $F$. We consider the following two settings.

Uber pickup. We used a dataset of Uber pickups in New York City in May 2014 consisting of a set $R$ of 564,517 records (available at https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city). Each record has a pickup position given by longitude and latitude. Consider selecting $k$ locations as waiting spots for idle Uber drivers. To formalize this problem, we selected a set $L$ of 36 popular pickup locations in the database and constructed a facility location function $F : 2^L \to \mathbb{R}_+$ as $F(S) = \sum_{v \in R} f_v(S)$, where $f_v(S) = \max_{u \in L} d(u, v) - \min_{u \in S} d(u, v)$ and $d(u, v)$ is the Manhattan distance between $u$ and $v$. The goal of the problem is then to maximize $F(S)$ subject to $|S| \le k$.

Discogs (Kunegis 2013). This dataset provides information about audio records as a bipartite graph $G = (L, R; E)$, where each edge $(u, v) \in L \times R$ indicates that a label $v$ was involved in the production of a release of a style $u$. We have $|L| = 383$, $|R| = 243{,}764$, and $|E| = 5{,}255{,}950$. Consider selecting $k$ styles that cover the activity of as many labels as possible. To formalize this problem, we constructed a maximum coverage function $F : 2^L \to \mathbb{R}$ as $F(S) = \sum_{v \in R} f_v(S)$, where $f_v(S)$ is 1 if $v$ has a neighbor in $S$ and 0 otherwise. The goal is then to maximize $F(S)$ subject to $|S| \le k$.

[Figure 1: (a) Uber pickup, (b) Discogs, (c) sparsifier size and runtime. Figures (a) and (b) show the relative performance of the greedy method on sparsifiers for the Uber pickup and Discogs datasets, respectively. Figure (c) shows the relative size of sparsifiers and the relative runtime of the greedy method on sparsifiers.]

Figures 1 (a) and (b) show the objective value of the solution obtained by the greedy method on the sparsifier relative to that on the original input function, with its 25th and 75th percentiles. Although our theoretical results give no guarantee when $\epsilon > 1$, we also constructed our sparsifier with $\epsilon > 1$ to see how it performs. The solution quality of our sparsifier for Uber pickup is more than 99.9% even when $\epsilon = 4$, and that for Discogs is more than 90% when $\epsilon \le 1.0$. The performance on Uber pickup is higher than that on Discogs because the objective function of the former saturates easily. These results suggest that we get a reasonably good solution quality by setting $\epsilon = 1$.

Number of functions and speedups. Figure 1 (c) shows the size of our sparsifier, that is, the number of functions with positive weights, relative to that of the original function, and the runtime of the greedy method on the sparsifier relative to that on the original function, with their 25th and 75th percentiles, for $k = 8$. The size and runtime decrease by a factor of 30 to 50 when $\epsilon = 1$. To summarize, our experimental results suggest that our sparsifier highly compresses the original function without sacrificing solution quality.
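For concreteness, a minimal sketch of the facility-location objective used in the Uber pickup experiment; the coordinates below are placeholders, not the actual dataset or preprocessing.

```python
def manhattan(u, v):
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def make_facility_location(records, locations):
    """F(S) = sum_v f_v(S) with f_v(S) = max_u d(u, v) - min_{u in S} d(u, v),
    as in the Uber pickup experiment; d is the Manhattan distance."""
    worst = {v: max(manhattan(u, v) for u in locations) for v in records}
    def f_v(v, S):
        return worst[v] - min(manhattan(u, v) for u in S) if S else 0.0
    F = lambda S: sum(f_v(v, S) for v in records)
    return F, f_v

# Placeholder data: three pickup records, two candidate waiting spots.
records = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
locations = [(0.0, 1.0), (2.0, 2.0)]
F, _ = make_facility_location(records, locations)
print(F({(0.0, 1.0)}), F(set(locations)))   # opening more spots never decreases F
```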
Conclusion

Decomposable submodular functions appear in many data-intensive tasks in machine learning and data mining; thus, having a sparsifying algorithm for them is of great interest. In this paper, we introduce the notion of a sparsifier for decomposable submodular functions, and we propose the first sparsifying algorithm for these types of functions, which outputs accurate sparsifiers whose expected size is independent of the size of the original function. Our algorithm can be adapted to sparsify mixtures of submodular functions. We also study the effectiveness of our algorithm under various constraints such as matroid and cardinality constraints. Our experimental results on real-world data complement our theoretical results. This work does not present any foreseeable societal consequence.

Acknowledgments

Akbar Rafiey was supported by NSERC. Yuichi Yoshida was supported by JST PRESTO Grant Number JPMJPR192B and JSPS KAKENHI Grant Number 20H05965.

References

Ahn, K. J.; Guha, S.; and McGregor, A. 2012. Graph sketches: sparsification, spanners, and subgraphs. In PODS, 5–14.
Ahn, K. J.; Guha, S.; and McGregor, A. 2013. Spectral Sparsification in Dynamic Graph Streams. In APPROX, 1–10.
Andoni, A.; Chen, J.; Krauthgamer, R.; Qin, B.; Woodruff, D. P.; and Zhang, Q. 2016. On Sketching Quadratic Forms. In ITCS, 311–319.
Axiotis, K.; Karczmarz, A.; Mukherjee, A.; Sankowski, P.; and Vladu, A. 2021. Decomposable Submodular Function Minimization via Maximum Flow. In ICML, 446–456.
Bai, W.; Iyer, R. K.; Wei, K.; and Bilmes, J. A. 2016. Algorithms for Optimizing the Ratio of Submodular Functions. In ICML, 2751–2759.
Bairi, R.; Iyer, R. K.; Ramakrishnan, G.; and Bilmes, J. A. 2015. Summarization of Multi-Document Topic Hierarchies using Submodular Mixtures. In ACL, 553–563.
Bansal, N.; Svensson, O.; and Trevisan, L. 2019. New Notions and Constructions of Sparsification for Graphs and Hypergraphs. In FOCS, 910–928.
Batson, J. D.; Spielman, D. A.; and Srivastava, N. 2012. Twice-Ramanujan Sparsifiers. SIAM J. Comput., 41(6): 1704–1721.
Benczúr, A. A.; and Karger, D. R. 1996. Approximating s-t Minimum Cuts in Õ(n²) Time. In STOC, 47–55.
Benczúr, A. A.; and Karger, D. R. 2015. Randomized Approximation Schemes for Cuts and Flows in Capacitated Graphs. SIAM J. Comput., 44(2): 290–319.
Chaturvedi, A.; Nguyen, H. L.; and Zakynthinou, L. 2021. Differentially Private Decomposable Submodular Maximization. In AAAI, 6984–6992.
Chen, Y.; Khanna, S.; and Nagda, A. 2020. Near-linear Size Hypergraph Cut Sparsifiers. In FOCS, 61–72.
Cohen, M. B.; Kelner, J.; Peebles, J.; Peng, R.; Rao, A. B.; Sidford, A.; and Vladu, A. 2017. Almost-linear-time algorithms for Markov chains and new spectral primitives for directed graphs. In STOC, 410–419.
Cohen, M. B.; Lee, Y. T.; Musco, C.; Musco, C.; Peng, R.; and Sidford, A. 2015. Uniform Sampling for Matrix Approximation. In ITCS, 181–190.
Dobzinski, S.; and Schapira, M. 2006. An improved approximation algorithm for combinatorial auctions with submodular bidders. In SODA, 1064–1073.
Dueck, D.; and Frey, B. J. 2007. Non-metric affinity propagation for unsupervised image categorization. In ICCV, 1–8.
Edmonds, J. 2001. Submodular Functions, Matroids, and Certain Polyhedra. In Combinatorial Optimization - Eureka, You Shrink!, 11–26.
Ene, A.; Nguyen, H. L.; and Végh, L. A. 2017. Decomposable Submodular Function Minimization: Discrete and Continuous. In NeurIPS, 2870–2880.
Feige, U. 2006. On maximizing welfare when utility functions are subadditive. In STOC, 41–50.
Feige, U.; and Vondrák, J. 2006. Approximation algorithms for allocation problems: Improving the factor of 1 - 1/e. In FOCS, 667–676.
Fix, A.; Joachims, T.; Park, S. M.; and Zabih, R. 2013. Structured Learning of Sum-of-Submodular Higher Order Energy Functions. In ICCV, 3104–3111.
Gomes, R.; and Krause, A. 2010. Budgeted Nonparametric Learning from Data Streams. In ICML, 391–398.
Kapralov, M.; Krauthgamer, R.; Tardos, J.; and Yoshida, Y. 2021a. Spectral Hypergraph Sparsifiers of Nearly Linear Size. In FOCS.
Kapralov, M.; Krauthgamer, R.; Tardos, J.; and Yoshida, Y. 2021b. Towards Tight Bounds for Spectral Sparsification of Hypergraphs. In STOC, 598–611.
Karger, D. R.; and Levine, M. S. 2002. Random sampling in residual graphs. In STOC, 63–66.
Kogan, D.; and Krauthgamer, R. 2015. Sketching Cuts in Graphs and Hypergraphs. In ITCS, 367–376.
Kohli, P.; Ladicky, L.; and Torr, P. H. S. 2009. Robust Higher Order Potentials for Enforcing Label Consistency. Int. J. Comput. Vis., 82(3): 302–324.
Kunegis, J. 2013. KONECT: the Koblenz network collection. In WWW, Companion Volume, 1343–1350.
Lee, Y. T.; and Sidford, A. 2014. Path Finding Methods for Linear Programming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for Maximum Flow. In FOCS, 424–433.
Lee, Y. T.; and Sun, H. 2018. Constructing Linear-Sized Spectral Sparsification in Almost-Linear Time. SIAM J. Comput., 47(6): 2315–2336.
Li, M.; Miller, G. L.; and Peng, R. 2013. Iterative Row Sampling. In FOCS, 127–136.
Li, P.; and Milenkovic, O. 2017. Inhomogeneous Hypergraph Clustering with Applications. In NeurIPS, 2308–2318.
Li, P.; and Milenkovic, O. 2018. Submodular Hypergraphs: p-Laplacians, Cheeger Inequalities and Spectral Clustering. In ICML, 3020–3029.
Lin, H.; and Bilmes, J. A. 2011. A Class of Submodular Functions for Document Summarization. In HLT, 510–520.
Lindgren, E. M.; Wu, S.; and Dimakis, A. G. 2016. Leveraging Sparsity for Efficient Submodular Data Summarization. In NeurIPS, 3414–3422.
Madry, A. 2010. Fast Approximation Algorithms for Cut-Based Problems in Undirected Graphs. In FOCS, 245–254.
Mahoney, M. W. 2011. Randomized Algorithms for Matrices and Data. Found. Trends Mach. Learn., 3(2): 123–224.
Mirzasoleiman, B.; Badanidiyuru, A.; and Karbasi, A. 2016. Fast Constrained Submodular Maximization: Personalized Data Summarization. In ICML, volume 48, 1358–1367.
Mirzasoleiman, B.; Badanidiyuru, A.; Karbasi, A.; Vondrák, J.; and Krause, A. 2015. Lazier Than Lazy Greedy. In AAAI, 1812–1818.
Mirzasoleiman, B.; Karbasi, A.; Sarkar, R.; and Krause, A. 2016. Distributed Submodular Maximization. J. Mach. Learn. Res., 17: 238:1–238:44.
Mirzasoleiman, B.; Zadimoghaddam, M.; and Karbasi, A. 2016. Fast Distributed Submodular Cover: Public-Private Data Summarization. In NeurIPS, 3594–3602.
Mitrovic, M.; Bun, M.; Krause, A.; and Karbasi, A. 2017. Differentially Private Submodular Maximization: Data Summarization in Disguise. In ICML, 2478–2487.
Mitrovic, M.; Kazemi, E.; Zadimoghaddam, M.; and Karbasi, A. 2018. Data Summarization at Scale: A Two-Stage Submodular Approach. In ICML, volume 80, 3593–3602.
Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions - I. Math. Program., 14(1): 265–294.
Papadimitriou, C. H.; Schapira, M.; and Singer, Y. 2008. On the Hardness of Being Truthful. In FOCS, 250–259.
Parambath, S. P.; Usunier, N.; and Grandvalet, Y. 2016. A Coverage-Based Approach to Recommendation Diversity On Similarity Graph. In RecSys, 15–22.
Rafiey, A.; and Yoshida, Y. 2020. Fast and Private Submodular and k-Submodular Functions Maximization with Matroid Constraints. In ICML, 7887–7897.
Shanu, I.; Arora, C.; and Singla, P. 2016. Min Norm Point Algorithm for Higher Order MRF-MAP Inference. In CVPR, 5365–5374.
Soma, T.; and Yoshida, Y. 2019. Spectral Sparsification of Hypergraphs. In SODA, 2570–2581.
Spielman, D. A.; and Teng, S. 2011. Spectral Sparsification of Graphs. SIAM J. Comput., 40(4): 981–1025.
Stobbe, P.; and Krause, A. 2010. Efficient Minimization of Decomposable Submodular Functions. In NeurIPS, 2208–2216.
Tschiatschek, S.; Iyer, R. K.; Wei, H.; and Bilmes, J. A. 2014. Learning Mixtures of Submodular Functions for Image Collection Summarization. In NeurIPS, 1413–1421.
Veldt, N.; Benson, A. R.; and Kleinberg, J. M. 2020a. Hypergraph Cuts with General Splitting Functions. CoRR, abs/2001.02817.
Veldt, N.; Benson, A. R.; and Kleinberg, J. M. 2020b. Minimizing Localized Ratio Cut Objectives in Hypergraphs. In KDD, 1708–1718.
Vicente, S.; Kolmogorov, V.; and Rother, C. 2009. Joint optimization of segmentation and appearance models. In ICCV, 755–762.
Vondrák, J. 2008. Optimal approximation for the submodular welfare problem in the value oracle model. In STOC, 67–74.
Yoshida, Y. 2016. Nonlinear Laplacian for Digraphs and its Applications to Network Analysis. In WSDM, 483–492.
Yoshida, Y. 2019. Cheeger Inequalities for Submodular Transformations. In SODA, 2582–2601.