# Sparsification of Decomposable Submodular Functions

Akbar Rafiey (Simon Fraser University, arafiey@sfu.ca) and Yuichi Yoshida (National Institute of Informatics, yyoshida@nii.ac.jp)

Submodular functions are at the core of many machine learning and data mining tasks. The underlying submodular functions for many of these tasks are decomposable, i.e., they are sums of several simple submodular functions. In many data-intensive applications, however, the number of underlying submodular functions in the original function is so large that processing it requires a prohibitively large amount of time and/or it does not even fit in main memory. To overcome this issue, we introduce the notion of sparsification for decomposable submodular functions, whose objective is to obtain an accurate approximation of the original function that is a (weighted) sum of only a few submodular functions. Our main result is a polynomial-time randomized sparsification algorithm such that the expected number of functions used in the output is independent of the number of underlying submodular functions in the original function. We also study the effectiveness of our algorithm under various constraints such as matroid and cardinality constraints. We complement our theoretical analysis with an empirical study of the performance of our algorithm.

Introduction

Submodularity of a set function is an intuitive diminishing-returns property, stating that adding an element to a smaller set yields a larger gain than adding it to a larger set. This fundamental structure has emerged as a very beneficial property in many combinatorial optimization problems arising in machine learning, graph theory, economics, and game theory, to name a few. Formally, a set function $f : 2^E \to \mathbb{R}$ is submodular if for any $S \subseteq T \subseteq E$ and $e \in E \setminus T$ it holds that $f(S \cup \{e\}) - f(S) \ge f(T \cup \{e\}) - f(T)$. Submodularity allows one to efficiently find provably (near-)optimal solutions. In particular, a wide range of efficient approximation algorithms have been developed for maximizing or minimizing submodular functions subject to different constraints. Unfortunately, these algorithms require a number of function evaluations that is infeasible or extremely inefficient in many data-intensive applications. Fortunately, several submodular optimization problems arising in machine learning have structure that allows solving them more efficiently. An important such class is that of decomposable submodular functions. These are functions that can be written as sums of several simple submodular functions, i.e.,
$$F(S) = \sum_{i=1}^{N} f_i(S) \qquad \forall S \subseteq E,$$
where each $f_i : 2^E \to \mathbb{R}$ is a submodular function on the ground set $E$ with $|E| = n$. Decomposable submodular functions encompass many of the examples of submodular functions studied in the context of machine learning as well as economics. For example, they are extensively used in economics in the problem of welfare maximization in combinatorial auctions (Dobzinski and Schapira 2006; Feige 2006; Feige and Vondrák 2006; Papadimitriou, Schapira, and Singer 2008; Vondrák 2008).

Example 1 (Welfare maximization). Let $E$ be a set of $n$ resources and $a_1, \ldots, a_N$ be $N$ agents. Each agent has an interest over subsets of resources, expressed as a submodular function $f_i : 2^E \to \mathbb{R}$.
The objective is to select a small subset of resources that maximizes the happiness across all the agents, i.e., the social welfare. More formally, the goal is to find a subset $S \subseteq E$ of size at most $k$ that maximizes $F(S) = \sum_{i=1}^{N} f_i(S)$, where $k$ is a positive integer.

Decomposable submodular functions appear in various machine learning tasks such as data summarization, where we seek a representative subset of elements of small size. This has numerous applications, including exemplar-based clustering (Dueck and Frey 2007; Gomes and Krause 2010), image summarization (Tschiatschek et al. 2014), recommender systems (Parambath, Usunier, and Grandvalet 2016), and document and corpus summarization (Lin and Bilmes 2011). The problem of maximizing decomposable submodular functions has been studied under different constraints such as cardinality and matroid constraints in various data summarization settings (Mirzasoleiman, Badanidiyuru, and Karbasi 2016; Mirzasoleiman et al. 2016; Mirzasoleiman, Zadimoghaddam, and Karbasi 2016), and in differential privacy settings (Chaturvedi, Nguyen, and Zakynthinou 2021; Mitrovic et al. 2017; Rafiey and Yoshida 2020).

In many of these applications, the number of underlying submodular functions is too large (i.e., $N$ is too large) for the function to even fit in main memory, and building a compressed representation that preserves its relevant properties is appealing. This motivates us to find a sparse representation for a decomposable submodular function $F$. Sparsification is an algorithmic paradigm in which a dense object is replaced by a sparse one with similar features, which often leads to significant improvements in the efficiency of algorithms, including running time, space complexity, and communication. In this work, we propose a simple and very effective algorithm that yields a sparse and accurate representation of a decomposable submodular function. To the best of our knowledge, this work is the first to study sparsification of decomposable submodular functions.

Our Contributions

General setting. Given a decomposable submodular function $F = \sum_{i=1}^{N} f_i$, we present a randomized algorithm that yields a sparse representation approximating $F$. Our algorithm includes each submodular function $f_i$ in the sparsifier with probability proportional to its importance in the sum $\sum_{i=1}^{N} f_i$. Moreover, each selected submodular function is assigned a weight which is also proportional to its importance. We prove that this simple algorithm yields a sparsifier of small size (independent of $N$) that approximates $F$ very well. Let $|B(f_i)|$ denote the number of extreme points in the base polytope of $f_i$, and let $B = \max_{i \in [N]} |B(f_i)|$.

Theorem 2. Let $F = \sum_{i=1}^{N} f_i$ be a decomposable submodular function. For any $\epsilon > 0$, there exists a vector $w \in \mathbb{R}^N$ with at most $O(B n^2 / \epsilon^2)$ non-zero entries such that for the submodular function $F' = \sum_{i=1}^{N} w_i f_i$ we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \subseteq E.$$
Moreover, if all $f_i$'s are monotone, then there exists a polynomial-time randomized algorithm that outputs a vector $w \in \mathbb{R}^N$ with at most $O(B n^{2.5} \log n / \epsilon^2)$ non-zero entries in expectation such that for the submodular function $F' = \sum_{i=1}^{N} w_i f_i$, with high probability, we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \subseteq E.$$

Remark 3 (Tightness). The existential result is almost tight: in the special case of directed graphs we have $\max_i |B(f_i)| = 2$, and it is known that $\Omega(n^2)$ edges are needed to construct a sparsifier (Cohen et al. 2017).
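To make the guarantee in Theorem 2 concrete, the following is a minimal Python sketch that brute-force checks whether a given weight vector $w$ satisfies the sparsifier inequality on a toy decomposable function. The coverage-style $f_i$'s and the candidate weights are illustrative only; they are not the construction analyzed in the paper.

```python
from itertools import combinations

# Toy decomposable function on the ground set E = {0, 1, 2}:
# f_i(S) = 1 iff S intersects covers[i], and F(S) = sum_i f_i(S).
E = [0, 1, 2]
covers = [{0}, {0, 1}, {1, 2}, {2}]

def f(i, S):
    return 1.0 if covers[i] & S else 0.0

def F(S, w=None):
    w = w or [1.0] * len(covers)
    return sum(w[i] * f(i, S) for i in range(len(covers)))

def is_sparsifier(w, eps):
    """Check (1-eps)*F'(S) <= F(S) <= (1+eps)*F'(S) for every subset S."""
    for r in range(len(E) + 1):
        for S in map(set, combinations(E, r)):
            Fp = F(S, w)   # sparsified value F'(S)
            Fv = F(S)      # original value F(S)
            if not ((1 - eps) * Fp <= Fv <= (1 + eps) * Fp):
                return False
    return True

# A weight vector that drops f_3 and re-weights f_2; it happens to be a 0.5-sparsifier.
print(is_sparsifier([1.0, 1.0, 2.0, 0.0], eps=0.5))   # -> True
```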
Sparsifying under constraints. We also consider the setting where we are only interested in evaluating $F$ on particular sets; for instance, the objective is to optimize $F$ over subsets of size at most $k$, or over the independent sets of a matroid. Optimizing submodular functions under these constraints has been studied extensively and has an extremely rich theoretical landscape. Our algorithm can be tailored to these types of constraints and then constructs an even smaller sparsifier.

Theorem 4. Let $F = \sum_{i=1}^{N} f_i$ be a decomposable submodular function. For any $\epsilon > 0$ and a matroid $\mathcal{M}$ of rank $r$, there exists a vector $w \in \mathbb{R}^N$ with at most $O(B r n / \epsilon^2)$ non-zero entries such that for the submodular function $F' = \sum_{i=1}^{N} w_i f_i$ we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \in \mathcal{M}.$$
Moreover, if all $f_i$'s are monotone, then there exists a polynomial-time randomized algorithm that outputs a vector $w \in \mathbb{R}^N$ with at most $O(B r n^{1.5} \log n / \epsilon^2)$ non-zero entries in expectation such that for the submodular function $F' = \sum_{i=1}^{N} w_i f_i$, with high probability, we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \in \mathcal{M}.$$

Applications: speeding up maximization and minimization. Our sparsifying algorithm can be used as a preprocessing step in many settings in order to speed up algorithms. To elaborate, we consider the classical greedy algorithm of Nemhauser, Wolsey, and Fisher for maximizing monotone submodular functions under a cardinality constraint. We observe that sparsifying the instance reduces the number of function evaluations from $O(knN)$ to $O(B k^2 n^2 / \epsilon^2)$, a significant speed-up when $N \gg n$. Regarding minimization, we prove that our algorithm yields an approximation of the Lovász extension, so it can be used as a preprocessing step for algorithms that work on Lovász extensions, such as those of Axiotis et al. (2021) and Ene, Nguyen, and Végh (2017). One particular regime considered in many results is where each submodular function $f_i$ acts on $O(1)$ elements of the ground set, which implies that $B = \max_i |B(f_i)|$ is $O(1)$. Using our sparsifier algorithm as a preprocessing step is quite beneficial here. For instance, it improves the running time of Axiotis et al. (2021) from $\widetilde{O}(T_{\mathrm{maxflow}}(n, n+N) \log \frac{1}{\epsilon})$ to $\widetilde{O}(T_{\mathrm{maxflow}}(n, n+n^2/\epsilon^2) \log \frac{1}{\epsilon})$. Here, $T_{\mathrm{maxflow}}(n, m)$ denotes the time required to compute a maximum flow in a directed graph with $n$ vertices, $m$ arcs, and polynomially bounded integral capacities.

Well-known examples. In practice, the bounds on the size of sparsifiers are often better than the ones presented in Theorems 2 and 4, e.g., when $B$ is a constant. We consider several examples of decomposable submodular functions that appear in many applications, namely the MAXIMUM COVERAGE, FACILITY LOCATION, and SUBMODULAR HYPERGRAPH MIN CUT problems. For the first two examples, sparsifiers of size $O(n^2/\epsilon^2)$ can be constructed in time linear in $N$. For SUBMODULAR HYPERGRAPH MIN CUT, when each hyperedge is of constant size, sparsifiers of size $O(n^2/\epsilon^2)$ exist, and in several specific cases with various applications efficient algorithms are available to construct them.

Empirical results. Finally, we examine our algorithm empirically and demonstrate that it constructs a concise sparsifier on which algorithms can be run efficiently.

Related Work

To the best of our knowledge there is no prior work on sparsification algorithms for decomposable submodular functions. However, special cases of this problem have attracted much attention, most notably cut sparsifiers for graphs.
The cut function of a graph $G = (V, E)$ can be seen as a decomposable submodular function $F(S) = \sum_{e \in E} f_e(S)$, where $f_e(S) = 1$ if and only if $e \cap S \neq \emptyset$ and $e \cap (V \setminus S) \neq \emptyset$. The problem of sparsifying a graph while approximately preserving its cut structure has been studied extensively (see Ahn, Guha, and McGregor (2012, 2013); Bansal, Svensson, and Trevisan (2019); Benczúr and Karger (2015) and references therein). The pioneering work of Benczúr and Karger (1996) showed that for any graph $G$ with $n$ vertices one can construct, in nearly linear time, a weighted subgraph $G'$ with $O(n \log n / \epsilon^2)$ edges such that the weight of every cut in $G$ is preserved within a multiplicative $(1 \pm \epsilon)$ factor in $G'$. Note that a graph on $n$ vertices can have $N = \Omega(n^2)$ edges. The bound on the number of edges was later improved to $O(n/\epsilon^2)$ (Batson, Spielman, and Srivastava 2012), which is tight (Andoni et al. 2016).

A more general concept for graphs, called spectral sparsification, was introduced by Spielman and Teng (2011). This notion captures the spectral similarity between a graph and its sparsifiers: a spectral sparsifier approximates the quadratic form of the Laplacian of a graph. Note that a spectral sparsifier is also a cut sparsifier. This notion has numerous applications in linear algebra (Mahoney 2011; Li, Miller, and Peng 2013; Cohen et al. 2015; Lee and Sidford 2014), and it has been used to design efficient approximation algorithms related to cuts and flows (Benczúr and Karger 2015; Karger and Levine 2002; Madry 2010). Spielman and Teng's sparsifier has $O(n \log^c n)$ edges for a large constant $c > 0$, which was improved to $O(n/\epsilon^2)$ (Lee and Sun 2018).

In pursuit of a more general setting, the notions of cut sparsifier and spectral sparsifier have been studied for hypergraphs. Observe that a hypergraph on $n$ vertices can have exponentially many hyperedges, i.e., $N = \Omega(2^n)$. For hypergraphs, Kogan and Krauthgamer (2015) provided a polynomial-time algorithm that constructs an $\epsilon$-cut sparsifier with $O(n(r + \log n)/\epsilon^2)$ hyperedges, where $r$ denotes the maximum size of a hyperedge. The current best result is due to Chen, Khanna, and Nagda (2020), whose $\epsilon$-cut sparsifier uses $O(n \log n / \epsilon^2)$ hyperedges and can be constructed in time $O(Nn^2 + n^{10}/\epsilon^2)$, where $N$ is the number of hyperedges. Recently, Soma and Yoshida (2019) initiated the study of spectral sparsifiers for hypergraphs and showed that every hypergraph admits an $\epsilon$-spectral sparsifier with $O(n^3 \log n / \epsilon^2)$ hyperedges. For the case where the maximum size of a hyperedge is $r$, Bansal, Svensson, and Trevisan (2019) showed that every hypergraph has an $\epsilon$-spectral sparsifier of size $O(n r^3 \log n / \epsilon^2)$. Recently, this bound has been improved to $O(nr(\log n/\epsilon)^{O(1)})$ and then to $O(n(\log n/\epsilon)^{O(1)})$ (Kapralov et al. 2021a,b). This leads to the study of sparsification of submodular functions, which is the focus of this paper and provides a unifying framework for these previous works.

Preliminaries

For a positive integer $n$, let $[n] = \{1, 2, \ldots, n\}$. Let $E$ be a set of elements of size $n$, which we call the ground set. For a set $S \subseteq E$, $\mathbf{1}_S \in \mathbb{R}^E$ denotes the characteristic vector of $S$. For a vector $x \in \mathbb{R}^E$ and a set $S \subseteq E$, $x(S) = \sum_{e \in S} x(e)$.

Submodular functions. Let $f : 2^E \to \mathbb{R}_+$ be a set function. We say that $f$ is monotone if $f(S) \le f(T)$ holds for every $S \subseteq T \subseteq E$. We say that $f$ is submodular if $f(S \cup \{e\}) - f(S) \ge f(T \cup \{e\}) - f(T)$ holds for any $S \subseteq T \subseteq E$ and $e \in E \setminus T$. The base polytope of a submodular function $f$ is defined as
$$B(f) = \{ y \in \mathbb{R}^E \mid y(S) \le f(S)\ \forall S \subseteq E,\ y(E) = f(E) \},$$
and $|B(f)|$ denotes the number of extreme points of the base polytope $B(f)$.
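Since the quantity $|B(f_i)|$ (and $B = \max_i |B(f_i)|$) drives all the bounds below, here is a minimal Python sketch that enumerates the extreme points of $B(f)$ for a tiny ground set, using the standard fact that every extreme point of the base polytope of a submodular function with $f(\emptyset) = 0$ is produced by Edmonds' greedy rule applied to some permutation of $E$. The example function is illustrative.

```python
from itertools import permutations

def extreme_points(f, ground):
    """Enumerate the extreme points of the base polytope B(f) of a submodular f
    (with f(empty set) = 0) by Edmonds' greedy rule: for a permutation pi, the point y with
    y[pi(j)] = f({pi(1..j)}) - f({pi(1..j-1)}) is an extreme point of B(f),
    and every extreme point arises this way.  Brute force; only for tiny ground sets."""
    pts = set()
    for pi in permutations(ground):
        prefix, prev, y = set(), 0.0, {}
        for e in pi:
            prefix.add(e)
            val = f(prefix)
            y[e] = val - prev
            prev = val
        pts.add(tuple(y[e] for e in ground))
    return pts

# Example: the cut-type function of a single edge {a, b} (1 iff exactly one endpoint is in S)
# has only 2 extreme points, in line with the small value of B for graph cut functions.
ground = ["a", "b"]
f = lambda S: 1.0 if ("a" in S) != ("b" in S) else 0.0
print(len(extreme_points(f, ground)))   # -> 2
```

For larger ground sets this enumeration is of course infeasible; the point is only to illustrate what $|B(f)|$ counts.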
Definition 5 ($\epsilon$-sparsifier). Let $f_i$ ($i \in D$) be a set of $N$ submodular functions, and let $F(S) = \sum_{i \in D} f_i(S)$ be a decomposable submodular function. A vector $w \in \mathbb{R}^N$ is called an $\epsilon$-sparsifier of $F$ if, for the submodular function $F' := \sum_{i \in D} w_i f_i$, the following holds for every $S \subseteq E$:
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S). \qquad (1)$$
The size of an $\epsilon$-sparsifier $w$, denoted $\mathrm{size}(w)$, is the number of indices $i$ with $w_i \neq 0$.

Matroids and matroid polytopes. A pair $\mathcal{M} = (E, \mathcal{I})$ of a set $E$ and $\mathcal{I} \subseteq 2^E$ is called a matroid if (1) $\emptyset \in \mathcal{I}$, (2) $A \in \mathcal{I}$ for any $A \subseteq B \in \mathcal{I}$, and (3) for any $A, B \in \mathcal{I}$ with $|A| < |B|$, there exists $e \in B \setminus A$ such that $A \cup \{e\} \in \mathcal{I}$. We call a set in $\mathcal{I}$ an independent set. The rank function $r_{\mathcal{M}} : 2^E \to \mathbb{Z}_+$ of $\mathcal{M}$ is $r_{\mathcal{M}}(S) = \max\{|I| : I \subseteq S,\ I \in \mathcal{I}\}$. An independent set $S \in \mathcal{I}$ is called a base if $r_{\mathcal{M}}(S) = r_{\mathcal{M}}(E)$. We denote the rank of $\mathcal{M}$ by $r(\mathcal{M})$. The matroid polytope $P(\mathcal{M}) \subseteq \mathbb{R}^E$ of $\mathcal{M}$ is $P(\mathcal{M}) = \mathrm{conv}\{\mathbf{1}_I : I \in \mathcal{I}\}$, where $\mathrm{conv}$ denotes the convex hull, or equivalently (Edmonds 2001), $P(\mathcal{M}) = \{x \ge 0 : x(S) \le r_{\mathcal{M}}(S)\ \forall S \subseteq E\}$.

Constructing a Sparsifier

In this section, we give a probabilistic argument that proves the existence of an accurate sparsifier, and we turn this argument into a polynomial-time algorithm that finds a sparsifier with high probability. For each submodular function $f_i$, let
$$p_i = \max_{A \subseteq E} \frac{f_i(A)}{F(A)}. \qquad (2)$$
The values $p_i$ guide how much weight should be allocated to each submodular function $f_i$ and with what probability it should be sampled. To construct an $\epsilon$-sparsifier of $F$, for each submodular function $f_i$ we set $w_i = 1/\kappa_i$ with probability $\kappa_i = \min\{1, \kappa p_i\}$ and do nothing with the remaining probability $1 - \kappa_i$ (see Algorithm 1). Here $\kappa$ depends on $n$, $\epsilon$, and $\delta$, where $\delta$ is the failure probability of our algorithm. Observe that, for each $f_i$, the expected weight $w_i$ is exactly one. We show that the expected number of entries of $w$ with $w_i > 0$ is $O(\frac{n^2}{\epsilon^2} \max_{i \in D} |B(f_i)|)$. Let $B = \max_{i \in D} |B(f_i)|$ in the rest of the paper.

Algorithm 1
Require: Submodular functions $f_i$ in dataset $D$ where each $f_i : \{0,1\}^E \to \mathbb{R}$; $\epsilon, \delta \in (0,1)$
1: $w \leftarrow \mathbf{0}$
2: $\kappa \leftarrow 3 \log(2^{n+1}/\delta)/\epsilon^2$
3: for $f_i$ in $D$ do
4:   $p_i \leftarrow \max_{A \subseteq E} f_i(A)/F(A)$
5:   $\kappa_i \leftarrow \min\{1, \kappa p_i\}$
6:   $w_i \leftarrow 1/\kappa_i$ with probability $\kappa_i$; do nothing with probability $1 - \kappa_i$
7: return $w \in \mathbb{R}^D$

Lemma 6. Algorithm 1 returns a vector $w$ which is an $\epsilon$-sparsifier of $F$ with probability at least $1 - \delta$.

Lemma 7. Algorithm 1 outputs an $\epsilon$-sparsifier of expected size $O(\frac{B n^2}{\epsilon^2})$.

Proof. In Algorithm 1, each $w_i$ is greater than zero with probability $\kappa_i$ and is zero with probability $1 - \kappa_i$. Hence,
$$\mathbb{E}[\mathrm{size}(w)] = \sum_{i \in D} \kappa_i \le \kappa \sum_{i \in D} p_i = O(n/\epsilon^2) \sum_{i \in D} p_i. \qquad (3)$$
It suffices to upper bound $\sum_{i \in D} p_i$.

Claim 8. $\sum_{i \in D} p_i \le n \max_{i \in D} |B(f_i)| = n B$.

Claim 8 and inequality (3) yield the desired bound.

Lemmas 6 and 7 prove the existence part of Theorem 2: for every $\epsilon, \delta \in (0,1)$, with probability at least $1 - \delta$ there exists an $\epsilon$-sparsifier of size at most $O(\frac{B n^2}{\epsilon^2})$.

Polynomial-time algorithm. Observe that computing the $p_i$'s in (2) may not be a polynomial-time task in general. However, to guarantee that Algorithm 1 outputs an $\epsilon$-sparsifier with high probability, it suffices to instantiate it with an upper bound on each $p_i$ (see the proof of Lemma 6). Fortunately, the result of Bai et al. (2016) provides an algorithm to approximate the ratio of two monotone submodular functions.

Theorem 9 (Bai et al. (2016)). Let $f$ and $g$ be two monotone submodular functions. Then there exists a polynomial-time algorithm that approximates $\max_{S \subseteq E} \frac{f(S)}{g(S)}$ within a factor of $O(\sqrt{n} \log n)$.
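A minimal Python sketch of the sampling step of Algorithm 1. The exact ratios $p_i$ are assumed to be supplied by the caller; the brute-force helper below is only for tiny ground sets, and in the monotone case the ratios would be replaced by the approximations $\hat{p}_i$ discussed next.

```python
import math
import random
from itertools import combinations

def sparsify(fs, p, n, eps, delta, seed=0):
    """Algorithm 1 (sketch): keep f_i with probability kappa_i = min(1, kappa * p_i)
    and give it weight 1/kappa_i, so that E[w_i] = 1 for every i."""
    rng = random.Random(seed)
    # kappa = 3 * log(2^(n+1) / delta) / eps^2, written so as not to form 2^(n+1) explicitly
    kappa = 3 * ((n + 1) * math.log(2) + math.log(1 / delta)) / eps ** 2
    w = [0.0] * len(fs)
    for i, pi in enumerate(p):
        kappa_i = min(1.0, kappa * pi)
        if rng.random() < kappa_i:
            w[i] = 1.0 / kappa_i
    return w   # the sparsified function is F'(S) = sum_i w[i] * fs[i](S)

def ratios_bruteforce(fs, ground):
    """p_i = max over non-empty A of f_i(A) / F(A); exponential time, tiny ground sets only."""
    subsets = [set(c) for r in range(1, len(ground) + 1)
               for c in combinations(ground, r)]
    F = lambda A: sum(f(A) for f in fs)
    return [max(f(A) / F(A) for A in subsets if F(A) > 0) for f in fs]
```

In practice the brute-force helper would be replaced by the $O(\sqrt{n}\log n)$-approximate ratios $\hat{p}_i$ of Theorem 9, or by closed forms as in the examples later in the paper.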
By Theorem 9, when all $f_i$'s are monotone we can compute estimates $\hat{p}_i$ with $p_i \le \hat{p}_i \le O(\sqrt{n} \log n)\, p_i$ in polynomial time, which leads to a polynomial-time randomized algorithm that constructs an $\epsilon$-sparsifier of expected size at most $O(\frac{B n^{2.5} \log n}{\epsilon^2})$. This proves the second part of Theorem 2. As we will see, in various applications the expected size of the sparsifier is often much better than the bounds presented in this section. We also emphasize that once a sparsifier is constructed it can be reused many times (possibly for maximization/minimization under several different constraints); hence computing or approximating the $p_i$'s should be regarded as a preprocessing step. Finally, it is straightforward to adapt our algorithm to sparsify decomposable submodular functions of the form $\sum_{i \in D} \alpha_i f_i$, known as mixtures of submodular functions (Bairi et al. 2015; Tschiatschek et al. 2014).

Constructing a Sparsifier under Constraints

Here we are interested in constructing a sparsifier for a decomposable submodular function $F$ when the goal is to optimize $F$ subject to constraints. One of the most commonly used and general constraints is the matroid constraint: for a matroid $\mathcal{M} = (E, \mathcal{I})$, the objective is to find $S^* = \mathrm{argmax}_{S \subseteq E,\, S \in \mathcal{I}} F(S)$. In this setting it suffices to construct a sparsifier that approximates $F$ only on independent sets, and it turns out that we can construct a smaller sparsifier than the one needed to approximate $F$ everywhere. For each submodular function $f_i$, let
$$p_i = \max_{A \in \mathcal{I}} \frac{f_i(A)}{F(A)}. \qquad (4)$$
Other than the different definition of the $p_i$'s and a different $\kappa$, Algorithm 2 is the same as Algorithm 1.

Algorithm 2
Require: Submodular functions $f_i : \{0,1\}^E \to \mathbb{R}$ in dataset $D$, matroid $\mathcal{M} = (E, \mathcal{I})$, and $\epsilon, \delta \in (0,1)$
1: $w \leftarrow \mathbf{0}$
2: $\kappa \leftarrow 3 \log(2 n^{r+1}/\delta)/\epsilon^2$, where $r$ is the rank of $\mathcal{M}$
3: for $f_i$ in $D$ do
4:   $p_i \leftarrow \max_{A \in \mathcal{I}} f_i(A)/F(A)$
5:   $\kappa_i \leftarrow \min\{1, \kappa p_i\}$
6:   $w_i \leftarrow 1/\kappa_i$ with probability $\kappa_i$; do nothing with probability $1 - \kappa_i$
7: return $w \in \mathbb{R}^D$

Theorem 10. Algorithm 2 returns a vector $w$ with expected size at most $O(\frac{B r n}{\epsilon^2})$ such that, with probability at least $1 - \delta$, for $F' = \sum_{i \in D} w_i f_i$ we have
$$(1-\epsilon) F'(S) \le F(S) \le (1+\epsilon) F'(S) \qquad \forall S \in \mathcal{M}.$$
Theorem 10 proves the existence part of Theorem 4. Algorithm 2 can be turned into a polynomial-time algorithm if one can approximate the $p_i$'s in (4). By modifying the proof of Theorem 9 we prove the following.

Theorem 11. Let $f$ and $g$ be two monotone submodular functions and $\mathcal{M} = (E, \mathcal{I})$ be a matroid. Then there exists a polynomial-time algorithm that approximates $\max_{S \subseteq E,\, S \in \mathcal{I}} \frac{f(S)}{g(S)}$ within a factor of $O(\sqrt{n} \log n)$.

By this theorem, when all $f_i$'s are monotone we can compute estimates $\hat{p}_i$ with $p_i \le \hat{p}_i \le O(\sqrt{n} \log n)\, p_i$ in polynomial time, which leads to a polynomial-time randomized algorithm that constructs an $\epsilon$-sparsifier of expected size at most $O(\frac{B r n^{1.5} \log n}{\epsilon^2})$. This proves the second part of Theorem 4.

Applications

Submodular Function Maximization with Cardinality Constraint

Our sparsification algorithm can be used as a preprocessing step, and once a sparsifier is constructed it can be reused many times (possibly for maximization/minimization under several different constraints). To elaborate on this, we consider the problem of maximizing a submodular function subject to a cardinality constraint, that is, finding $S^* = \mathrm{argmax}_{S \subseteq E,\, |S| \le k} F(S)$.

Algorithm 3
Require: Submodular function $F = \sum_{i \in D} f_i$ with each $f_i : \{0,1\}^E \to \mathbb{R}$, constant $k$, and $\epsilon, \delta \in (0,1)$
1: Compute $F' = \sum_{i \in D} w_i f_i$, an $\epsilon$-sparsifier for $F$
2: $A \leftarrow \emptyset$
3: while $|A| < k$ do
4:   $a_i \leftarrow \mathrm{argmax}_{a \in E \setminus A} \big(F'(A \cup \{a\}) - F'(A)\big)$
5:   $A \leftarrow A \cup \{a_i\}$
6: return $A$
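A minimal Python sketch of Algorithm 3: the standard greedy is run on the sparsified function $F' = \sum_i w_i f_i$, so only the functions with non-zero weight are ever evaluated. The argument names are illustrative; `w` would come from Algorithm 1 or 2.

```python
def greedy_on_sparsifier(fs, w, ground, k):
    """Algorithm 3 (sketch): greedily maximize F'(S) = sum_i w_i * f_i(S) subject to |S| <= k.
    Only functions with w_i > 0 are evaluated, which is where the speed-up comes from."""
    active = [(wi, f) for wi, f in zip(w, fs) if wi > 0]
    Fp = lambda S: sum(wi * f(S) for wi, f in active)
    A = set()
    while len(A) < k:
        candidates = [e for e in ground if e not in A]
        if not candidates:
            break
        base = Fp(A)
        best = max(candidates, key=lambda e: Fp(A | {e}) - base)  # largest marginal gain
        A.add(best)
    return A
```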
Cardinality constraints are a special case of matroid constraints in which the independent sets are all subsets of size at most $k$, and the rank of the matroid is $k$. A celebrated result of Nemhauser, Wolsey, and Fisher (1978) states that for non-negative monotone submodular functions, a simple greedy algorithm provides a solution with a $(1 - 1/e)$ approximation guarantee relative to the optimal (intractable) solution. For a ground set $E$ of size $n$ and a monotone submodular function $F = \sum_{i \in D} f_i$, this greedy algorithm needs $O(knN)$ function evaluations to find a set $S$ of size $k$ such that $F(S) \ge (1 - 1/e) F(S^*)$. We refer to this algorithm as GreedyAlg. In many applications where $N \gg n$, having a sparsifier is beneficial: applying GreedyAlg to an $\epsilon$-sparsifier of size $O(Bkn/\epsilon^2)$ reduces the number of function evaluations to $O(Bk^2n^2/\epsilon^2)$ and yields a set $S$ of size $k$ such that $F(S) \ge (1 - 1/e - \epsilon) F(S^*)$ with high probability (see Algorithm 3).

We point out that sampling techniques such as (Mitrovic et al. 2018; Mirzasoleiman et al. 2015) sample elements from the ground set $E$ rather than sampling from the functions $f_1, \ldots, f_N$. Hence their running times depend on $N$, which can be slow when $N$ is large, the regime we care about. Besides, our algorithm can be used as a preprocessing step for these algorithms. For instance, the lazier-than-lazy greedy algorithm (Mirzasoleiman et al. 2015) requires $O(nN \log \frac{1}{\epsilon})$ function evaluations; when $N$ is much larger than $n$, it is clearly beneficial to first use our sparsification algorithm and reduce the number of submodular functions that one has to consider.

Two Well-known Examples

Maximum Coverage Problem. Let $[N]$ be a universe and $E = \{S_1, \ldots, S_n\}$, with each $S_i \subseteq [N]$, be a family of sets. Given a positive integer $k$, the MAX COVERAGE problem asks to select at most $k$ sets from $E$ such that the maximum number of elements is covered, i.e., the union of the selected sets has maximal size. One can formulate this problem as follows. For every $i \in [N]$ and $A \subseteq [n]$, define
$$f_i(A) = \begin{cases} 1 & \text{if there exists } a \in A \text{ such that } i \in S_a, \\ 0 & \text{otherwise.} \end{cases}$$
Note that the $f_i$'s are monotone and submodular. Furthermore, define $F : 2^{[n]} \to \mathbb{R}_+$ by $F(A) = \sum_{i \in [N]} f_i(A)$, which is monotone and submodular as well. Now the MAX COVERAGE problem is equivalent to $\max_{A \subseteq [n],\, |A| \le k} F(A)$. For each submodular function $f_i$, the corresponding $p_i$ is
$$p_i = \max_{A \subseteq [n],\, |A| \le k} \frac{f_i(A)}{F(A)} = \max_{S_a \in E,\, i \in S_a} \frac{f_i(\{a\})}{F(\{a\})} = \max_{S_a \in E,\, i \in S_a} \frac{1}{F(\{a\})} = \max_{S_a \in E,\, i \in S_a} \frac{1}{|S_a|}.$$
We can compute all the $p_i$'s in $O(\sum_a |S_a|)$ time, which is the input size, and then construct a sparsifier in $O(N)$ time. In total, the time required for sparsification is $O(\sum_a |S_a| + N)$. On the other hand, in this case we have
$$\sum_{i=1}^{N} p_i = \sum_{i=1}^{N} \max_{S_a \in E,\, i \in S_a} \frac{1}{|S_a|} \le \sum_{a=1}^{n} \sum_{i \in S_a} \frac{1}{|S_a|} = \sum_{a=1}^{n} \frac{|S_a|}{|S_a|} = n.$$
By Lemma 7, this upper bound implies that our algorithm constructs an $\epsilon$-sparsifier of size at most $O(kn/\epsilon^2)$. Algorithm 3 thus improves the running time of GreedyAlg from $O(knN)$ to $O(k^2n^2/\epsilon^2)$. Furthermore, Algorithm 3 returns a set $A$ of size at most $k$ such that $F(A) \ge (1 - 1/e - \epsilon)\,\mathrm{OPT}$, where $\mathrm{OPT}$ denotes $F(S^*)$ for $S^* = \mathrm{argmax}_{S \subseteq E,\, |S| \le k} F(S)$.
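For MAX COVERAGE the ratios have the closed form $p_i = \max_{S_a \ni i} 1/|S_a|$ derived above, so they can be read off in a single pass over the input; a minimal sketch (the sampling step itself would then proceed exactly as in Algorithm 1):

```python
def max_coverage_ratios(sets):
    """sets[a] is the a-th (non-empty) set S_a over the universe.
    p_i = max over sets S_a containing i of 1/|S_a|, computed in O(sum |S_a|) time."""
    p = {}
    for S in sets:
        inv = 1.0 / len(S)
        for i in S:
            p[i] = max(p.get(i, 0.0), inv)
    return p

sets = [{0, 1, 2}, {2, 3}, {3, 4, 5, 6}, {6}]
print(max_coverage_ratios(sets))
# e.g. element 6 is covered by a size-4 set and a size-1 set, so p_6 = 1.0
```

Summing these ratios over the universe gives at most $n$, which is exactly the bound used above to obtain the $O(kn/\epsilon^2)$ sparsifier size.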
Facility Location Problem. Let $I$ be a set of $N$ clients and $E$ be a set of facilities with $|E| = n$. Let $c : I \times E \to \mathbb{R}$ be the cost of assigning a given client to a given facility. For each client $i$ and each subset of facilities $A \subseteq E$, define $f_i(A) = \max_{j \in A} c(i, j)$. For any non-empty subset $A \subseteq E$, the value of $A$ is given by
$$F(A) = \sum_{i \in I} f_i(A) = \sum_{i \in I} \max_{j \in A} c(i, j).$$
For completeness, we define $F(\emptyset) = 0$. An instance of the MAX FACILITY LOCATION problem is specified by a tuple $(I, E, c)$; the objective is to choose a subset $A \subseteq E$ of size at most $k$ maximizing $F(A)$. For each submodular function $f_i$, the corresponding $p_i$ is
$$p_i = \max_{A \subseteq E,\, |A| \le k} \frac{\max_{j \in A} c(i, j)}{F(A)} = \max_{j \in E} \frac{c(i, j)}{F(\{j\})}.$$
It is clear that the $p_i$'s can be computed in $O(|I| \cdot |E|)$ time, which is the input size. In this case, we have
$$\sum_{i \in I} \max_{j \in E} \frac{c(i, j)}{F(\{j\})} \le \sum_{j \in E} \sum_{i \in I} \frac{c(i, j)}{F(\{j\})} = \sum_{j \in E} \frac{F(\{j\})}{F(\{j\})} = n.$$
Hence, by Lemma 7, our algorithm constructs a sparsifier of size $O(kn/\epsilon^2)$. Algorithm 3 improves the running time of GreedyAlg from $O(knN)$ to $O(k^2n^2/\epsilon^2)$. Furthermore, Algorithm 3 returns a set $A$ of size at most $k$ such that $F(A) \ge (1 - 1/e - \epsilon)\,\mathrm{OPT}$.

Remark 12. Lindgren, Wu, and Dimakis (2016) sparsify an instance of the FACILITY LOCATION problem by zeroing out entries in the cost matrix; this is not applicable to the general setting. The runtime of GreedyAlg applied to their sparsified instance is $O(nN/\epsilon)$, which is huge when $N$ is large, the regime we care about. Moreover, we can first construct our sparsifier and then apply the algorithm of Lindgren, Wu, and Dimakis to it.

Submodular Function Minimization

Besides the applications to submodular maximization, our sparsification algorithm can also be used as a preprocessing step for submodular minimization. In many applications of submodular minimization, such as image segmentation (Shanu, Arora, and Singla 2016), Markov random field inference (Fix et al. 2013; Kohli, Ladicky, and Torr 2009; Vicente, Kolmogorov, and Rother 2009), hypergraph cuts (Veldt, Benson, and Kleinberg 2020b), and covering functions (Stobbe and Krause 2010), the submodular function at hand is decomposable. Many recent advances on decomposable submodular minimization, such as (Ene, Nguyen, and Végh 2017; Axiotis et al. 2021), leverage a mix of ideas from both discrete and continuous optimization. Here we show that our sparsifying algorithm approximates the so-called Lovász extension, a natural extension of a submodular function to the continuous domain $[0,1]^n$.

Lovász extension. Let $x \in [0,1]^n$ be the vector $(x_1, x_2, \ldots, x_n)$. Let $\pi : [n] \to [n]$ be a sorting permutation of $x_1, x_2, \ldots, x_n$, meaning that if $\pi(i) = j$ then $x_j$ is the $i$-th largest element of $x$. Hence $1 \ge x_{\pi(1)} \ge \cdots \ge x_{\pi(n)} \ge 0$. Let $x_{\pi(0)} = 1$ and $x_{\pi(n+1)} = 0$. Define the sets $S^{\pi}_0 = \emptyset$ and $S^{\pi}_i = \{\pi(1), \ldots, \pi(i)\}$. The Lovász extension of $f$ is defined as
$$f^L(x) = \sum_{i=0}^{n} (x_{\pi(i)} - x_{\pi(i+1)}) f(S^{\pi}_i).$$
It is well known that $f^L(x) = \max_{y \in B(f)} \langle y, x \rangle$. For a decomposable submodular function $F = \sum_{i \in D} f_i$, its Lovász extension is
$$F^L(x) = \sum_{i \in D} \sum_{j=0}^{n} (x_{\pi(j)} - x_{\pi(j+1)}) f_i(S^{\pi}_j).$$
Recall the definition of the $p_i$'s in (2); they can be expressed equivalently in terms of permutations as
$$p_i = \max_{A \subseteq E} \frac{f_i(A)}{F(A)} = \max_{\pi} \max_{j \in [n]} \frac{f_i(S^{\pi}_j)}{F(S^{\pi}_j)}. \qquad (5)$$
Furthermore, note that $F^L(x)$ is a linear combination of the values $F(S)$, $S \subseteq E$. Given these facts, we prove that Algorithm 1 outputs a sparsifier that not only approximates the function itself but also approximates its Lovász extension.

Theorem 13. Algorithm 1 returns a vector $w$ with expected size at most $O(\frac{B n^2}{\epsilon^2})$ such that, with probability at least $1 - \delta$, for $F' = \sum_{i \in D} w_i f_i$ it holds that
$$(1-\epsilon) F'^L(x) \le F^L(x) \le (1+\epsilon) F'^L(x) \qquad \forall x \in [0,1]^n.$$
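A minimal Python sketch of the Lovász extension as defined above; for a decomposable $F$ (or a sparsifier $F'$), one simply sums these values over the $f_i$'s, weighted by $w_i$. The sanity check on a modular function, where $f^L(x) = \langle c, x \rangle$, is illustrative.

```python
def lovasz_extension(f, x, ground):
    """f^L(x) = sum_{j=0}^{n} (x_{pi(j)} - x_{pi(j+1)}) * f(S^pi_j), where pi sorts the
    coordinates of x in non-increasing order, x_{pi(0)} = 1, x_{pi(n+1)} = 0,
    and S^pi_j = {pi(1), ..., pi(j)} (with S^pi_0 the empty set)."""
    order = sorted(ground, key=lambda e: -x[e])
    vals = [1.0] + [x[e] for e in order] + [0.0]
    total, prefix = 0.0, set()
    total += (vals[0] - vals[1]) * f(frozenset(prefix))   # j = 0 term, f(empty set)
    for j, e in enumerate(order, start=1):
        prefix.add(e)
        total += (vals[j] - vals[j + 1]) * f(frozenset(prefix))
    return total

# Sanity check on a modular (additive) function, where f^L(x) = <c, x>.
c = {"a": 2.0, "b": 5.0, "c": 1.0}
f = lambda S: sum(c[e] for e in S)
x = {"a": 0.3, "b": 0.9, "c": 0.0}
print(lovasz_extension(f, x, list(c)))   # -> 5.1 = 2*0.3 + 5*0.9 + 1*0.0
```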
Remark 14 (Relation to spectral sparsification of graphs). The cut function of a graph $G = (V, E)$ can be seen as a decomposable submodular function $F(S) = \sum_{e \in E} f_e(S)$, where $f_e(S) = 1$ if and only if $e \cap S \neq \emptyset$ and $e \cap (V \setminus S) \neq \emptyset$. The goal of spectral sparsification of graphs (Spielman and Teng 2011) is to preserve the quadratic form of the Laplacian of $G$, which can be rephrased as $\sum_{e \in E} f^L_e(x)^2$. In contrast, our sparsification preserves $F^L(x) = \sum_{e \in E} f^L_e(x)$. Although we could construct a sparsifier that preserves $\sum_{e \in E} f^L_e(x)^2$ in the general submodular setting, we adopted the present notion because, in many applications involving submodular functions, we are more interested in the value of $\sum_{e \in E} f^L_e(x)$ than in $\sum_{e \in E} f^L_e(x)^2$, and the algorithm for preserving the former is simpler than that for preserving the latter.

Because our algorithm gives an approximation of the Lovász extension, it can be used as a preprocessing step for algorithms that work on Lovász extensions, such as those in (Axiotis et al. 2021; Ene, Nguyen, and Végh 2017). For instance, it improves the running time of Axiotis et al. (2021) from $\widetilde{O}(T_{\mathrm{maxflow}}(n, n+N) \log \frac{1}{\epsilon})$ to $\widetilde{O}(T_{\mathrm{maxflow}}(n, n+n^2/\epsilon^2) \log \frac{1}{\epsilon})$ in cases where each submodular function $f_i$, $i \in D$, acts on $O(1)$ elements of the ground set, which implies that $B = \max_i |B(f_i)|$ is $O(1)$. An example of such a case is hypergraph cut functions with $O(1)$-sized hyperedges.

Next we discuss several examples for which computing the $p_i$'s is computationally efficient, thus giving a polynomial-time algorithm to construct sparsifiers. Recall that the cut function of a graph $G = (V, E)$ can be seen as a decomposable submodular function. In this case, computing $p_e$ for an edge $e = st \in E$ is equivalent to finding the minimum $s$-$t$ cut in the graph, which is a polynomial-time task. A more general framework is the submodular hypergraph minimum $s$-$t$ cut problem discussed next.

Submodular hypergraphs (Li and Milenkovic 2018; Yoshida 2019). Let $H$ be a hypergraph with vertex set $V$ and hyperedge set $E$, where each hyperedge is a subset of the vertices $V$. A submodular function $f_e$ is associated with each hyperedge $e \in E$. In the submodular hypergraph minimum $s$-$t$ cut problem the objective is
$$\min_{S \subseteq V} \sum_{e \in E} f_e(e \cap S) \quad \text{subject to } s \in S,\ t \in V \setminus S. \qquad (6)$$
This problem has been studied by Veldt, Benson, and Kleinberg (2020a), and its special cases, where the submodular functions $f_e$ take particular forms, have been studied with applications in semi-supervised learning, clustering, and rank learning (see Li and Milenkovic (2018); Veldt, Benson, and Kleinberg (2020a) for more details). Examples of such special cases include:
Linear penalty: $f_e(S) = \min\{|S|, |e \setminus S|\}$.
Quadratic penalty: $f_e(S) = |S| \cdot |e \setminus S|$.
We refer to Table 1 in Veldt, Benson, and Kleinberg (2020a) for more examples. These examples are cardinality-based, that is, the value of the submodular function depends only on the cardinality of the input set (see Definition 3.2 of Veldt, Benson, and Kleinberg (2020a)). It is known that if all the submodular functions are cardinality-based, then computing the minimum $s$-$t$ cut in the submodular hypergraph can be reduced to a minimum $s$-$t$ cut in an auxiliary (ordinary) graph (Theorem 4.6 of Veldt, Benson, and Kleinberg (2020a)), which allows us to compute the $p_e$'s in polynomial time.
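Returning to the ordinary graph cut case mentioned above: $f_e(S) = 1$ exactly on the sets separating the endpoints of $e = st$, so $p_e = \max_S f_e(S)/F(S) = 1/(\text{minimum } s\text{-}t \text{ cut value})$. A minimal sketch, assuming NetworkX is available (the library choice is ours, not the paper's):

```python
import networkx as nx

def cut_ratios(G):
    """For the cut function F(S) = sum_e f_e(S) of an undirected graph G,
    p_e = max_S f_e(S)/F(S) = 1 / (minimum s-t cut value) for the edge e = {s, t}."""
    # Model the undirected graph as a digraph with unit capacities in both directions.
    H = nx.DiGraph()
    for u, v in G.edges():
        H.add_edge(u, v, capacity=1)
        H.add_edge(v, u, capacity=1)
    return {(u, v): 1.0 / nx.minimum_cut_value(H, u, v) for u, v in G.edges()}

G = nx.cycle_graph(5)   # every edge lies on a cycle, so every s-t minimum cut has value 2
print(cut_ratios(G))    # all ratios equal 0.5
```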
Remark 15. Our sparsification algorithm can also be used to construct the submodular Laplacian based on the Lovász extension of submodular functions. Submodular Laplacians were introduced by Yoshida (2019) and have numerous applications in machine learning, including learning from ranking data, clustering based on network motifs (Li and Milenkovic 2017), and network analysis (Yoshida 2016).

Experimental Results

In this section, we empirically demonstrate that our algorithm (Algorithm 2) generates a sparse representation of a decomposable submodular function $F : 2^E \to \mathbb{R}_+$ with which we can efficiently obtain a high-quality solution for maximizing $F$. We consider the following two settings.

Uber pickup. We used a dataset of Uber pickups in New York City in May 2014 consisting of a set $R$ of 564,517 records (available at https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city). Each record has a pickup position given by longitude and latitude. Consider selecting $k$ locations as waiting spots for idle Uber drivers. To formalize this problem, we selected a set $L$ of 36 popular pickup locations in the database and constructed a facility location function $F : 2^L \to \mathbb{R}_+$ as $F(S) = \sum_{v \in R} f_v(S)$, where $f_v(S) = \max_{u \in L} d(u, v) - \min_{u \in S} d(u, v)$ and $d(u, v)$ is the Manhattan distance between $u$ and $v$. The goal of the problem is then to maximize $F(S)$ subject to $|S| \le k$.

Discogs (Kunegis 2013). This dataset provides information about audio records as a bipartite graph $G = (L, R; E)$, where each edge $(u, v) \in L \times R$ indicates that a label $v$ was involved in the production of a release of a style $u$. We have $|L| = 383$, $|R| = 243{,}764$, and $|E| = 5{,}255{,}950$. Consider selecting $k$ styles that cover the activity of as many labels as possible. To formalize this problem, we constructed a maximum coverage function $F : 2^L \to \mathbb{R}$ as $F(S) = \sum_{v \in R} f_v(S)$, where $f_v(S)$ is 1 if $v$ has a neighbor in $S$ and 0 otherwise. The goal is then to maximize $F(S)$ subject to $|S| \le k$.

[Figure 1: (a) Uber pickup, (b) Discogs, (c) sparsifier size and runtime. Figures (a) and (b) show the relative performance of the greedy method on sparsifiers for the Uber pickup and Discogs datasets, respectively. Figure (c) shows the relative size of sparsifiers and the relative runtime of the greedy method on sparsifiers.]

Figures 1 (a) and (b) show the objective value of the solution obtained by the greedy method on the sparsifier relative to that on the original input function, with its 25th and 75th percentiles. Although our theoretical results give no guarantee when $\epsilon > 1$, we also constructed our sparsifier with $\epsilon > 1$ to see how it performs. The solution quality of our sparsifier for Uber pickup is more than 99.9% even when $\epsilon = 4$, and that for Discogs is more than 90% when $\epsilon \le 1.0$. The performance on Uber pickup is higher than that on Discogs because the objective function of the former saturates easily. These results suggest that we get a reasonably good solution quality by setting $\epsilon = 1$.

Number of functions and speedups. Figure 1 (c) shows the size of our sparsifier, that is, the number of functions with positive weights, relative to that of the original function, and the runtime of the greedy method on the sparsifier relative to that on the original function, with their 25th and 75th percentiles, for $k = 8$. The size and runtime decrease by a factor of 30 to 50 when $\epsilon = 1$. To summarize, our experimental results suggest that our sparsifier highly compresses the original function without sacrificing solution quality.
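For concreteness, a minimal sketch of the facility-location objective used in the Uber pickup experiment; the coordinates below are placeholders, not the actual dataset or preprocessing.

```python
def manhattan(u, v):
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def make_facility_location(records, locations):
    """F(S) = sum_v f_v(S) with f_v(S) = max_u d(u, v) - min_{u in S} d(u, v),
    as in the Uber pickup experiment; d is the Manhattan distance."""
    worst = {v: max(manhattan(u, v) for u in locations) for v in records}
    def f_v(v, S):
        return worst[v] - min(manhattan(u, v) for u in S) if S else 0.0
    F = lambda S: sum(f_v(v, S) for v in records)
    return F, f_v

# Placeholder data: three pickup records, two candidate waiting spots.
records = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
locations = [(0.0, 1.0), (2.0, 2.0)]
F, _ = make_facility_location(records, locations)
print(F({(0.0, 1.0)}), F(set(locations)))   # opening more spots never decreases F
```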
Conclusion

Decomposable submodular functions appear in many data-intensive tasks in machine learning and data mining; thus, having a sparsifying algorithm for them is of great interest. In this paper, we introduce the notion of a sparsifier for decomposable submodular functions, and we propose the first sparsifying algorithm for these types of functions, which outputs accurate sparsifiers whose expected size is independent of the size of the original function. Our algorithm can be adapted to sparsify mixtures of submodular functions. We also study the effectiveness of our algorithm under various constraints such as matroid and cardinality constraints. Our experimental results on real-world data complement our theoretical results. This work does not present any foreseeable societal consequence.

Acknowledgments

Akbar Rafiey was supported by NSERC. Yuichi Yoshida was supported by JST PRESTO Grant Number JPMJPR192B and JSPS KAKENHI Grant Number 20H05965.

References

Ahn, K. J.; Guha, S.; and McGregor, A. 2012. Graph sketches: sparsification, spanners, and subgraphs. In PODS, 5–14.
Ahn, K. J.; Guha, S.; and McGregor, A. 2013. Spectral Sparsification in Dynamic Graph Streams. In APPROX, 1–10.
Andoni, A.; Chen, J.; Krauthgamer, R.; Qin, B.; Woodruff, D. P.; and Zhang, Q. 2016. On Sketching Quadratic Forms. In ITCS, 311–319.
Axiotis, K.; Karczmarz, A.; Mukherjee, A.; Sankowski, P.; and Vladu, A. 2021. Decomposable Submodular Function Minimization via Maximum Flow. In ICML, 446–456.
Bai, W.; Iyer, R. K.; Wei, K.; and Bilmes, J. A. 2016. Algorithms for Optimizing the Ratio of Submodular Functions. In ICML, 2751–2759.
Bairi, R.; Iyer, R. K.; Ramakrishnan, G.; and Bilmes, J. A. 2015. Summarization of Multi-Document Topic Hierarchies using Submodular Mixtures. In ACL, 553–563.
Bansal, N.; Svensson, O.; and Trevisan, L. 2019. New Notions and Constructions of Sparsification for Graphs and Hypergraphs. In FOCS, 910–928.
Batson, J. D.; Spielman, D. A.; and Srivastava, N. 2012. Twice-Ramanujan Sparsifiers. SIAM J. Comput., 41(6): 1704–1721.
Benczúr, A. A.; and Karger, D. R. 1996. Approximating s-t Minimum Cuts in Õ(n²) Time. In STOC, 47–55.
Benczúr, A. A.; and Karger, D. R. 2015. Randomized Approximation Schemes for Cuts and Flows in Capacitated Graphs. SIAM J. Comput., 44(2): 290–319.
Chaturvedi, A.; Nguyen, H. L.; and Zakynthinou, L. 2021. Differentially Private Decomposable Submodular Maximization. In AAAI, 6984–6992.
Chen, Y.; Khanna, S.; and Nagda, A. 2020. Near-linear Size Hypergraph Cut Sparsifiers. In FOCS, 61–72.
Cohen, M. B.; Kelner, J.; Peebles, J.; Peng, R.; Rao, A. B.; Sidford, A.; and Vladu, A. 2017. Almost-linear-time algorithms for Markov chains and new spectral primitives for directed graphs. In STOC, 410–419.
Cohen, M. B.; Lee, Y. T.; Musco, C.; Musco, C.; Peng, R.; and Sidford, A. 2015. Uniform Sampling for Matrix Approximation. In ITCS, 181–190.
Dobzinski, S.; and Schapira, M. 2006. An improved approximation algorithm for combinatorial auctions with submodular bidders. In SODA, 1064–1073.
Dueck, D.; and Frey, B. J. 2007. Non-metric affinity propagation for unsupervised image categorization. In ICCV, 1–8.
Edmonds, J. 2001. Submodular Functions, Matroids, and Certain Polyhedra. In Combinatorial Optimization - Eureka, You Shrink!, 11–26.
Ene, A.; Nguyen, H. L.; and Végh, L. A. 2017. Decomposable Submodular Function Minimization: Discrete and Continuous. In NeurIPS, 2870–2880.
Feige, U. 2006. On maximizing welfare when utility functions are subadditive. In STOC, 41–50.
Feige, U.; and Vondrák, J. 2006. Approximation algorithms for allocation problems: Improving the factor of 1 - 1/e. In FOCS, 667–676.
Fix, A.; Joachims, T.; Park, S. M.; and Zabih, R. 2013. Structured Learning of Sum-of-Submodular Higher Order Energy Functions. In ICCV, 3104–3111.
Gomes, R.; and Krause, A. 2010. Budgeted Nonparametric Learning from Data Streams. In ICML, 391–398.
Kapralov, M.; Krauthgamer, R.; Tardos, J.; and Yoshida, Y. 2021a. Spectral Hypergraph Sparsifiers of Nearly Linear Size. In FOCS.
Kapralov, M.; Krauthgamer, R.; Tardos, J.; and Yoshida, Y. 2021b. Towards Tight Bounds for Spectral Sparsification of Hypergraphs. In STOC, 598–611.
Karger, D. R.; and Levine, M. S. 2002. Random sampling in residual graphs. In STOC, 63–66.
Kogan, D.; and Krauthgamer, R. 2015. Sketching Cuts in Graphs and Hypergraphs. In ITCS, 367–376.
Kohli, P.; Ladicky, L.; and Torr, P. H. S. 2009. Robust Higher Order Potentials for Enforcing Label Consistency. Int. J. Comput. Vis., 82(3): 302–324.
Kunegis, J. 2013. KONECT: the Koblenz network collection. In WWW, Companion Volume, 1343–1350.
Lee, Y. T.; and Sidford, A. 2014. Path Finding Methods for Linear Programming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for Maximum Flow. In FOCS, 424–433.
Lee, Y. T.; and Sun, H. 2018. Constructing Linear-Sized Spectral Sparsification in Almost-Linear Time. SIAM J. Comput., 47(6): 2315–2336.
Li, M.; Miller, G. L.; and Peng, R. 2013. Iterative Row Sampling. In FOCS, 127–136.
Li, P.; and Milenkovic, O. 2017. Inhomogeneous Hypergraph Clustering with Applications. In NeurIPS, 2308–2318.
Li, P.; and Milenkovic, O. 2018. Submodular Hypergraphs: p-Laplacians, Cheeger Inequalities and Spectral Clustering. In ICML, 3020–3029.
Lin, H.; and Bilmes, J. A. 2011. A Class of Submodular Functions for Document Summarization. In HLT, 510–520.
Lindgren, E. M.; Wu, S.; and Dimakis, A. G. 2016. Leveraging Sparsity for Efficient Submodular Data Summarization. In NeurIPS, 3414–3422.
Madry, A. 2010. Fast Approximation Algorithms for Cut-Based Problems in Undirected Graphs. In FOCS, 245–254.
Mahoney, M. W. 2011. Randomized Algorithms for Matrices and Data. Found. Trends Mach. Learn., 3(2): 123–224.
Mirzasoleiman, B.; Badanidiyuru, A.; and Karbasi, A. 2016. Fast Constrained Submodular Maximization: Personalized Data Summarization. In ICML, volume 48, 1358–1367.
Mirzasoleiman, B.; Badanidiyuru, A.; Karbasi, A.; Vondrák, J.; and Krause, A. 2015. Lazier Than Lazy Greedy. In AAAI, 1812–1818.
Mirzasoleiman, B.; Karbasi, A.; Sarkar, R.; and Krause, A. 2016. Distributed Submodular Maximization. J. Mach. Learn. Res., 17: 238:1–238:44.
Mirzasoleiman, B.; Zadimoghaddam, M.; and Karbasi, A. 2016. Fast Distributed Submodular Cover: Public-Private Data Summarization. In NeurIPS, 3594–3602.
Mitrovic, M.; Bun, M.; Krause, A.; and Karbasi, A. 2017. Differentially Private Submodular Maximization: Data Summarization in Disguise. In ICML, 2478–2487.
Mitrovic, M.; Kazemi, E.; Zadimoghaddam, M.; and Karbasi, A. 2018. Data Summarization at Scale: A Two-Stage Submodular Approach. In ICML, volume 80, 3593–3602.
Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions - I. Math. Program., 14(1): 265–294.
Papadimitriou, C. H.; Schapira, M.; and Singer, Y. 2008. On the Hardness of Being Truthful. In FOCS, 250–259.
Parambath, S. P.; Usunier, N.; and Grandvalet, Y. 2016. A Coverage-Based Approach to Recommendation Diversity On Similarity Graph. In RecSys, 15–22.
Rafiey, A.; and Yoshida, Y. 2020. Fast and Private Submodular and k-Submodular Functions Maximization with Matroid Constraints. In ICML, 7887–7897.
Shanu, I.; Arora, C.; and Singla, P. 2016. Min Norm Point Algorithm for Higher Order MRF-MAP Inference. In CVPR, 5365–5374.
Soma, T.; and Yoshida, Y. 2019. Spectral Sparsification of Hypergraphs. In SODA, 2570–2581.
Spielman, D. A.; and Teng, S. 2011. Spectral Sparsification of Graphs. SIAM J. Comput., 40(4): 981–1025.
Stobbe, P.; and Krause, A. 2010. Efficient Minimization of Decomposable Submodular Functions. In NeurIPS, 2208–2216.
Tschiatschek, S.; Iyer, R. K.; Wei, H.; and Bilmes, J. A. 2014. Learning Mixtures of Submodular Functions for Image Collection Summarization. In NeurIPS, 1413–1421.
Veldt, N.; Benson, A. R.; and Kleinberg, J. M. 2020a. Hypergraph Cuts with General Splitting Functions. CoRR, abs/2001.02817.
Veldt, N.; Benson, A. R.; and Kleinberg, J. M. 2020b. Minimizing Localized Ratio Cut Objectives in Hypergraphs. In KDD, 1708–1718.
Vicente, S.; Kolmogorov, V.; and Rother, C. 2009. Joint optimization of segmentation and appearance models. In ICCV, 755–762.
Vondrák, J. 2008. Optimal approximation for the submodular welfare problem in the value oracle model. In STOC, 67–74.
Yoshida, Y. 2016. Nonlinear Laplacian for Digraphs and its Applications to Network Analysis. In WSDM, 483–492.
Yoshida, Y. 2019. Cheeger Inequalities for Submodular Transformations. In SODA, 2582–2601.