# Maximization of Approximately Submodular Functions

Thibaut Horel (Harvard University, thorel@seas.harvard.edu) and Yaron Singer (Harvard University, yaron@seas.harvard.edu)

**Abstract.** We study the problem of maximizing a function that is approximately submodular under a cardinality constraint. Approximate submodularity implicitly appears in a wide range of applications, as in many cases errors in the evaluation of a submodular function break submodularity. We say that $F$ is $\varepsilon$-approximately submodular if there exists a submodular function $f$ such that $(1-\varepsilon)f(S) \le F(S) \le (1+\varepsilon)f(S)$ for all subsets $S$. We are interested in characterizing the query complexity of maximizing $F$ subject to a cardinality constraint $k$ as a function of the error level $\varepsilon > 0$. We provide both lower and upper bounds: for $\varepsilon > n^{-1/2}$ we show an exponential query-complexity lower bound. In contrast, when $\varepsilon < 1/k$ or under a stronger bounded-curvature assumption, we give constant-factor approximation algorithms.

1 Introduction

In recent years, there has been a surge of interest in machine learning methods that involve discrete optimization. In this realm, the evolving theory of submodular optimization has been a catalyst for progress in extraordinarily varied application areas. Examples include active learning and experimental design [9, 12, 14, 19, 20], sparse reconstruction [1, 6, 7], graph inference [23, 24, 8], video analysis [29], clustering [10], document summarization [21], object detection [27], information retrieval [28], network inference [23, 24], and information diffusion in networks [17].

The power of submodularity as a modeling tool lies in its ability to capture interesting application domains while maintaining provable guarantees for optimization. The guarantees, however, apply to the case in which one has access to the exact function to optimize. In many applications, one does not have access to the exact version of the function, but rather to some approximate version of it. If the approximate version remains submodular, then the theory of submodular optimization clearly applies and modest errors translate to modest losses in approximation quality. But if the approximate version of the function ceases to be submodular, all bets are off.

**Approximate submodularity.** Recall that a function $f : 2^N \to \mathbb{R}$ is submodular if for all $S, T \subseteq N$, $f(S \cup T) + f(S \cap T) \le f(S) + f(T)$. We say that a function $F : 2^N \to \mathbb{R}$ is $\varepsilon$-approximately submodular if there exists a submodular function $f : 2^N \to \mathbb{R}$ such that for any $S \subseteq N$:

$$(1-\varepsilon)f(S) \le F(S) \le (1+\varepsilon)f(S). \tag{1}$$

Unless otherwise stated, all submodular functions $f$ considered are normalized ($f(\emptyset) = 0$) and monotone ($f(S) \le f(T)$ for $S \subseteq T$).
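For concreteness, the following minimal Python sketch (illustrative only, not from the paper) builds a small coverage function $f$ and a perturbed oracle $F$ satisfying (1); the ground set, universe size, and error level are arbitrary choices.

```python
import itertools
import random

# Ground set: each element is a subset of a small universe (a coverage function).
universe = set(range(10))
ground_set = [frozenset(random.sample(sorted(universe), 4)) for _ in range(6)]

def f(S):
    """Coverage function: number of universe items covered by the sets in S."""
    return len(set().union(*S)) if S else 0

eps = 0.1
# Fix one multiplicative perturbation per set ("consistent" errors), so that
# repeated queries of the same set always return the same value.
noise = {}

def F(S):
    key = frozenset(S)
    if key not in noise:
        noise[key] = 1 + random.uniform(-eps, eps)
    return noise[key] * f(S)

# Sanity check: F is eps-approximately submodular with f as a representative.
for r in range(len(ground_set) + 1):
    for S in itertools.combinations(ground_set, r):
        assert (1 - eps) * f(S) <= F(S) <= (1 + eps) * f(S)
```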
Approximate submodularity appears in various domains.

**Optimization with noisy oracles.** In these scenarios, we wish to solve optimization problems where one does not have access to a submodular function but only to a noisy version of it. An example recently studied in [5] involves maximizing information gain in graphical models; this captures many Bayesian experimental design settings.

**PMAC learning.** In the active area of learning submodular functions initiated by Balcan and Harvey [3], the objective is to approximately learn submodular functions. Roughly speaking, the PMAC-learning framework guarantees that the learned function is a constant-factor approximation of the true submodular function with high probability. Therefore, after learning a submodular function, one obtains an approximately submodular function.

**Sketching.** Since submodular functions have, in general, exponential-size representations, [2] studied the problem of sketching submodular functions: finding a function with a polynomial-size representation approximating a given submodular function. The resulting sketch is an approximately submodular function.

**Optimization of approximate submodularity.** We focus on optimization problems of the form

$$\max_{S : |S| \le k} F(S) \tag{2}$$

where $F$ is an $\varepsilon$-approximately submodular function and $k \in \mathbb{N}$ is the cardinality constraint. We say that a set $S \subseteq N$ is an $\alpha$-approximation to the optimal solution of (2) if $|S| \le k$ and $F(S) \ge \alpha \max_{|T| \le k} F(T)$. As is common in submodular optimization, we assume the value query model: optimization algorithms have access to the objective function $F$ in a black-box manner, i.e. they make queries to an oracle which returns, for a queried set $S$, the value $F(S)$. The query complexity of the algorithm is simply the number of queries made to the oracle. An algorithm is called an $\alpha$-approximation algorithm if, for any approximately submodular input $F$, the solution returned by the algorithm is an $\alpha$-approximately optimal solution. Note that if there exists an $\alpha$-approximation algorithm for the problem of maximizing an $\varepsilon$-approximately submodular function $F$, then this algorithm is an $\frac{\alpha(1-\varepsilon)}{1+\varepsilon}$-approximation algorithm for the original submodular function $f$.¹ Conversely, if no such algorithm exists, this implies an inapproximability result for the original function.

Clearly, if a function is 0-approximately submodular then it retains desirable provable guarantees,² and if it is arbitrarily far from being submodular it can be shown to be trivially inapproximable (e.g. maximize a function which takes value 1 on a single arbitrary set $S \subseteq N$ and 0 elsewhere). The question is therefore:

How close should a function be to submodular to retain provable approximation guarantees?

In recent work, it was shown that for any constant $\varepsilon > 0$ there exists a class of $\varepsilon$-approximately submodular functions for which no algorithm using fewer than exponentially many queries has a constant approximation ratio for the canonical problem of maximizing a monotone submodular function under a cardinality constraint [13]. Such an impossibility result suggests two natural relaxations: the first is to make additional assumptions about the structure of errors, such as a stochastic error model. This is the direction taken in [13], where the main result shows that when errors are drawn i.i.d. from a wide class of distributions, optimal guarantees are obtainable. The second alternative is to assume the error is subconstant, which is the focus of this paper.

1.1 Overview of the results

Our main result is a spoiler: even for $\varepsilon = 1/n^{1/2-\beta}$, for any constant $\beta > 0$ and $n = |N|$, no algorithm can obtain a constant-factor approximation guarantee. More specifically, we show that:

- For the general case of monotone submodular functions, for any $\beta > 0$, given access to a $\frac{1}{n^{1/2-\beta}}$-approximately submodular function, no algorithm can obtain an approximation ratio better than $O(1/n^{\beta})$ using polynomially many queries (Theorem 3);
- For the case of coverage functions, we show that for any fixed $\beta > 0$, given access to a $\frac{1}{n^{1/3-\beta}}$-approximately submodular function, no algorithm can obtain an approximation ratio strictly better than $O(1/n^{\beta})$ using polynomially many queries (Theorem 4).
¹ Observe that for an approximately submodular function $F$, there exist many submodular functions $f$ of which it is an approximation. All such submodular functions $f$ are called representatives of $F$. The conversion between an approximation guarantee for $F$ and an approximation guarantee for a representative $f$ of $F$ holds for any choice of the representative.

² Specifically, [22] shows that it is possible to obtain a $(1-1/e)$ approximation ratio for a cardinality constraint.

The above results imply that even in cases where the objective function is arbitrarily close to being submodular as the number $n$ of elements in $N$ grows, reasonable optimization guarantees are unachievable. The second result shows that this is the case even when we aim to optimize coverage functions. Coverage functions are an important class of submodular functions which are used in numerous applications [11, 21, 18].

**Approximation guarantees.** The inapproximability results follow from two properties of the model: the structure of the function (submodularity), and the size of $\varepsilon$ in the definition of approximate submodularity. A natural question is whether one can relax either condition to obtain positive approximation guarantees. We show that this is indeed the case:

- In the general case of monotone submodular functions, we show that the greedy algorithm achieves a $1 - 1/e - O(\delta)$ approximation ratio when $\varepsilon = \frac{\delta}{k}$ (Theorem 5). Furthermore, this bound is tight: given a $\frac{1}{k^{1-\beta}}$-approximately submodular function, the greedy algorithm no longer provides a constant-factor approximation guarantee (Proposition 6).
- Since our query-complexity lower bound holds for coverage functions, which already contain a great deal of structure, we relax the structural assumption by considering functions with bounded curvature $c$; this is a common assumption in applications of submodularity to machine learning and has been used in prior work to obtain theoretical guarantees [15, 16]. Under this assumption, we give an algorithm which achieves an approximation ratio of $(1-c)\left(\frac{1-\varepsilon}{1+\varepsilon}\right)^2$ (Proposition 8).

We state our positive results for the case of a cardinality constraint of $k$. Similar results hold for matroids of rank $k$; the proofs can be found in the Appendix. Note that cardinality constraints are a special case of matroid constraints, therefore our lower bounds also apply to matroid constraints.

1.2 Discussion and additional related work

Before transitioning to the technical results, we briefly survey error in applications of submodularity and the implications of our results to these applications. First, notice that there is a coupling between approximate submodularity and erroneous evaluations of a submodular function: if one can evaluate a submodular function within a multiplicative accuracy of $1 \pm \varepsilon$, then this is an $\varepsilon$-approximately submodular function.

**Additive vs. multiplicative approximation.** The definition of approximate submodularity in (1) uses relative (multiplicative) approximation. We could instead consider absolute (additive) approximation, i.e. require that $f(S) - \varepsilon \le F(S) \le f(S) + \varepsilon$ for all sets $S$. This definition has been used in the related problem of optimizing approximately convex functions [4, 25], where functions are assumed to have normalized range. For un-normalized functions or functions whose range is unknown, a relative approximation is more informative. When the range is known, specifically if an upper bound $B$ on $f(S)$ is known, an $\varepsilon/B$-approximately submodular function is also an $\varepsilon$-additively approximately submodular function.
This implies that our lower bounds and approximation results could equivalently be expressed for additive approximations of normalized functions.

**Error vs. noise.** If we interpret Equation (1) in terms of error, we see that no assumption is made on the source of the error yielding the approximately submodular function. In particular, there is no stochastic assumption: the error is deterministic and worst-case. Previous work has considered submodular or combinatorial optimization under random noise. Two models naturally arise:

- Consistent noise: the approximate function $F$ is such that $F(S) = \xi_S f(S)$, where $\xi_S$ is drawn independently for each set $S$ from a distribution $\mathcal{D}$. The key aspect of consistent noise is that the random draws occur only once: querying the same set multiple times always returns the same value. This definition is the one adopted in [13]; a similar notion is called persistent noise in [5].
- Inconsistent noise: in this model $F(S)$ is a random variable such that $f(S) = \mathbb{E}[F(S)]$. The noisy oracle can be queried multiple times and each query corresponds to a new independent random draw from the distribution of $F(S)$. This model was considered in [26] in the context of dataset summarization and is also implicitly present in [17], where the objective function is defined as an expectation and has to be estimated via sampling.

Formal guarantees for consistent noise have been obtained in [13]. A standard way to approach optimization with inconsistent noise is to estimate the value of each set used by the algorithm to an accuracy $\varepsilon$ via independent randomized sampling, where $\varepsilon$ is chosen small enough so as to obtain approximation guarantees. Specifically, assuming that the algorithm only makes polynomially many value queries and that the function $f$ is such that $F(S) \in [b, B]$ for any set $S$, a classical application of the Chernoff bound combined with a union bound implies that if the value of each set is estimated by averaging the values of $m$ samples with $m = \Omega\!\left(\frac{B \log n}{b\,\varepsilon^2}\right)$, then with high probability the estimated value $\hat F(S)$ of each set used by the algorithm is such that $(1-\varepsilon)f(S) \le \hat F(S) \le (1+\varepsilon)f(S)$. In other words, randomized sampling is used to construct a function which is $\varepsilon$-approximately submodular with high probability.

**Implications of results in this paper.** Given the above discussion, our results can be interpreted in the context of noise as providing guarantees on what is a tolerable noise level. In particular, Theorem 5 implies that if a submodular function is estimated using $m$ samples, with $m = \Omega\!\left(\frac{B n^2 \log n}{b}\right)$, then the greedy algorithm is a constant-factor approximation algorithm for the problem of maximizing a monotone submodular function under a cardinality constraint. Theorem 3 implies that if $m = O\!\left(\frac{B n \log n}{b}\right)$ then the resulting estimation error is within the range where no algorithm can obtain a constant approximation ratio.
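The sampling argument above can be made concrete with a short sketch (illustrative constants and helper names, not from the paper): averaging $m = O\!\left(\frac{B \log n}{b\varepsilon^2}\right)$ independent queries of an inconsistent-noise oracle per set yields an estimate that concentrates within a $(1 \pm \varepsilon)$ factor of $f$ with high probability.

```python
import math
import random

def make_estimator(noisy_F, n, b, B, eps, c=3.0):
    """Average repeated queries of an inconsistent-noise oracle so that the
    resulting estimate is (with high probability) eps-approximately submodular.
    The constant c in the Chernoff-style sample count is illustrative."""
    m = math.ceil(c * B * math.log(max(n, 2)) / (b * eps ** 2))

    def estimate(S):
        return sum(noisy_F(S) for _ in range(m)) / m

    return estimate

# Toy example: f(S) = |S| observed through multiplicative noise that is
# redrawn at every query (inconsistent noise).
n = 50
f = lambda S: len(S)
noisy_F = lambda S: f(S) * random.uniform(0.5, 1.5)

F_hat = make_estimator(noisy_F, n=n, b=1.0, B=float(n), eps=0.1)
S = set(range(10))
print(f(S), F_hat(S))  # the estimate concentrates around f(S)
```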
2 Query-complexity lower bounds

In this section we give query-complexity lower bounds for the problem of maximizing an $\varepsilon$-approximately submodular function subject to a cardinality constraint. In Section 2.1, we show an exponential query-complexity lower bound for the case of general submodular functions when $\varepsilon \ge n^{-1/2}$ (Theorem 3). The same lower bound is then shown to hold even when we restrict ourselves to the case of coverage functions, for $\varepsilon \ge n^{-1/3}$ (Theorem 4).

**A general overview of query-complexity lower bounds.** At a high level, the lower bounds are constructed as follows. We define a class of monotone submodular functions $\mathcal{F}$, and draw a function $f$ uniformly at random from $\mathcal{F}$. In addition we define a submodular function $g : 2^N \to \mathbb{R}$ such that $\max_{|S| \le k} g(S) \le \rho(n) \max_{|S| \le k} f(S)$, where $\rho(n) = o(1)$, for a particular choice of $k < n$. We then define the approximately submodular function $F$:

$$F(S) = \begin{cases} g(S) & \text{if } (1-\varepsilon)f(S) \le g(S) \le (1+\varepsilon)f(S) \\ f(S) & \text{otherwise} \end{cases}$$

Note that by its definition, this function is an $\varepsilon$-approximately submodular function. To show the lower bound, we reduce the problem of proving inapproximability of optimizing an approximately submodular function to the problem of distinguishing between $f$ and $g$ using $F$. We show that for every algorithm, there exists a function $f \in \mathcal{F}$ such that if $f$ is unknown to the algorithm, it cannot distinguish between the case in which the underlying function is $f$ and the case in which the underlying function is $g$ using polynomially many value queries to $F$, even when $g$ is known to the algorithm. Since $\max_{|S| \le k} g(S) \le \rho(n) \max_{|S| \le k} f(S)$, this implies that no algorithm can obtain an approximation better than $\rho(n)$ using polynomially many queries; otherwise such an algorithm could be used to distinguish between $f$ and $g$.

2.1 Monotone submodular functions

**Constructing a class of hard functions.** A natural candidate for a class of functions $\mathcal{F}$ and a function $g$ satisfying the properties described in the overview is:

$$f_H(S) = |S \cap H| \quad\text{and}\quad g(S) = \frac{|S|h}{n} \qquad \text{for } H \subseteq N \text{ of size } h.$$

The reason why $g$ is hard to distinguish from $f_H$ is that when $H$ is drawn uniformly at random among sets of size $h$, $f_H$ is close to $g$ with high probability. This follows from an application of the Chernoff bound for negatively associated random variables. Formally, this is stated in Lemma 1, whose proof is given in the Appendix.

**Lemma 1.** Let $H \subseteq N$ be a set drawn uniformly at random among sets of size $h$. Then for any $S \subseteq N$, writing $\mu = \frac{|S|h}{n}$, and for any $\varepsilon$ such that $\varepsilon^2\mu > 1$:

$$\mathbb{P}_H\big[(1-\varepsilon)\mu \le |S \cap H| \le (1+\varepsilon)\mu\big] \ge 1 - 2^{-\Omega(\varepsilon^2\mu)}$$

Unfortunately, this construction fails if the algorithm is allowed to evaluate the approximately submodular function at small sets: for those, the concentration of Lemma 1 is not high enough. Our construction instead relies on designing $\mathcal{F}$ and $g$ such that when $S$ is "large" we can make use of the concentration result of Lemma 1, and when $S$ is "small" the functions in $\mathcal{F}$ and $g$ are deterministically close to each other. Specifically, we introduce, for $H \subseteq N$ of size $h$:

$$f_H(S) = |S \cap H| + \min\!\left(|S \cap (N \setminus H)|,\; \alpha\Big(1 - \frac{h}{n}\Big)\right), \qquad g(S) = \min\!\left(|S|,\; \frac{|S|h}{n} + \alpha\Big(1 - \frac{h}{n}\Big)\right) \tag{3}$$

The values of the parameters $\alpha$ and $h$ will be set later in the analysis. Observe that when $S$ is small ($|S \cap (N \setminus H)| \le \alpha(1 - h/n)$ and $|S| \le \alpha$), then $f_H(S) = g(S) = |S|$. When $S$ is large, Lemma 1 implies that $|S \cap H|$ is close to $|S|h/n$ and $|S \cap (N \setminus H)|$ is close to $|S|(1 - h/n)$ with high probability.

First note that $f_H$ and $g$ are monotone submodular functions: $f_H$ is the sum of a monotone additive function and a monotone budget-additive function, and $g$ can be written $g(S) = G(|S|)$ where $G(x) = \min(x,\, xh/n + \alpha(1 - h/n))$; $G$ is a non-decreasing concave function (the minimum of two non-decreasing linear functions), hence $g$ is monotone submodular.

Next, we observe that there is a gap between the maxima of the functions $f_H$ and that of $g$. When $|S| \le k$, $g(S) \le \frac{|S|h}{n} + \alpha\big(1 - \frac{h}{n}\big)$; the maximum is clearly attained when $|S| = k$ and is upper-bounded by $\frac{kh}{n} + \alpha$. For $f_H$, the maximum is equal to $k$ and is attained when $S$ is a subset of $H$ of size $k$. So for $\alpha \le k \le h$, we obtain:

$$\max_{|S| \le k} g(S) \le \left(\frac{h}{n} + \frac{\alpha}{k}\right) \max_{|S| \le k} f_H(S), \qquad \forall H \subseteq N \tag{4}$$
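As an illustration (not part of the paper), the construction (3) and the gap (4) can be instantiated directly; the helper name and the numerical parameter values below are arbitrary.

```python
import random

def make_hard_pair(N, H, alpha):
    """f_H and g from construction (3); N is the ground set, H a subset of size h."""
    n, h = len(N), len(H)
    H = set(H)

    def f_H(S):
        S = set(S)
        return len(S & H) + min(len(S - H), alpha * (1 - h / n))

    def g(S):
        s = len(set(S))
        return min(s, s * h / n + alpha * (1 - h / n))

    return f_H, g

# Illustrative parameters (the proof of Theorem 3 uses k = h = n^(1-beta/2), alpha = n^(1-beta)).
n, beta = 10_000, 0.4
k = h = round(n ** (1 - beta / 2))
alpha = round(n ** (1 - beta))
N = list(range(n))
H = random.sample(N, h)
f_H, g = make_hard_pair(N, H, alpha)

# Gap between the maxima (inequality (4)): g never exceeds (h/n + alpha/k) * max f_H.
S_star = H[:k]                      # a size-k subset of H maximizes f_H
print(f_H(S_star), g(S_star))       # about k versus at most kh/n + alpha
```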
**Indistinguishability.** The main challenge is now to prove that $f_H$ is close to $g$ with high probability. Formally, we have the following lemma.

**Lemma 2.** For $h \le \frac{n}{2}$, let $H$ be drawn uniformly at random among sets of size $h$. Then for any $S$:

$$\mathbb{P}_H\big[(1-\varepsilon)f_H(S) \le g(S) \le (1+\varepsilon)f_H(S)\big] \ge 1 - 2^{-\Omega(\varepsilon^2\alpha h/n)} \tag{5}$$

*Proof.* For concision we define $\bar H := N \setminus H$, the complement of $H$ in $N$. We consider four cases depending on the cardinalities of $S$ and $S \cap \bar H$.

Case 1: $|S| \le \alpha$ and $|S \cap \bar H| \le \alpha\big(1 - \frac{h}{n}\big)$. In this case $f_H(S) = |S \cap H| + |S \cap \bar H| = |S|$ and $g(S) = |S|$. The two functions are equal and the inequality is immediately satisfied.

Case 2: $|S| \le \alpha$ and $|S \cap \bar H| \ge \alpha\big(1 - \frac{h}{n}\big)$. In this case $g(S) = |S| = |S \cap H| + |S \cap \bar H|$ and $f_H(S) = |S \cap H| + \alpha\big(1 - \frac{h}{n}\big)$. By assumption on $|S \cap \bar H|$, we have:

$$g(S) = |S \cap H| + |S \cap \bar H| \ge |S \cap H| + \alpha\Big(1 - \frac{h}{n}\Big) = f_H(S) \ge (1-\varepsilon)f_H(S)$$

For the other side, by assumption on $|S \cap \bar H|$, we have that $|S| \ge \alpha\big(1 - \frac{h}{n}\big) \ge \frac{\alpha}{2}$ (since $h \le \frac{n}{2}$). We can then apply Lemma 1 and obtain:

$$\mathbb{P}_H\left[|S \cap \bar H| \le (1+\varepsilon)\,\alpha\Big(1 - \frac{h}{n}\Big)\right] \ge 1 - 2^{-\Omega(\varepsilon^2\alpha h/n)}$$

which implies $g(S) \le |S \cap H| + (1+\varepsilon)\alpha\big(1 - \frac{h}{n}\big) \le (1+\varepsilon)f_H(S)$ with the same probability.

Case 3: $|S| \ge \alpha$ and $|S \cap \bar H| \ge \alpha\big(1 - \frac{h}{n}\big)$. In this case $f_H(S) = |S \cap H| + \alpha\big(1 - \frac{h}{n}\big)$ and $g(S) = \frac{|S|h}{n} + \alpha\big(1 - \frac{h}{n}\big)$. We need to show that:

$$\mathbb{P}_H\left[(1-\varepsilon)\frac{|S|h}{n} \le |S \cap H| \le (1+\varepsilon)\frac{|S|h}{n}\right] \ge 1 - 2^{-\Omega(\varepsilon^2\alpha h/n)}$$

This is a direct consequence of Lemma 1.

Case 4: $|S| \ge \alpha$ and $|S \cap \bar H| \le \alpha\big(1 - \frac{h}{n}\big)$. In this case $f_H(S) = |S \cap H| + |S \cap \bar H|$ and $g(S) = \frac{|S|h}{n} + \alpha\big(1 - \frac{h}{n}\big)$. As in the previous case, we have:

$$\mathbb{P}_H\left[(1-\varepsilon)\frac{|S|h}{n} \le |S \cap H| \le (1+\varepsilon)\frac{|S|h}{n}\right] \ge 1 - 2^{-\Omega(\varepsilon^2\alpha h/n)}$$

By the assumption on $|S \cap \bar H|$, we also have $|S \cap \bar H| \le \alpha\big(1 - \frac{h}{n}\big) \le (1+\varepsilon)\alpha\big(1 - \frac{h}{n}\big)$. So we need to show that:

$$\mathbb{P}_H\left[|S \cap \bar H| \ge (1-\varepsilon)\,\alpha\Big(1 - \frac{h}{n}\Big)\right] \ge 1 - 2^{-\Omega(\varepsilon^2\alpha h/n)}$$

and then we will be able to conclude by union bound. This is again a consequence of Lemma 1.

**Theorem 3.** For any $0 < \beta < \frac{1}{2}$, $\varepsilon \ge \frac{1}{n^{1/2-\beta}}$, and any (possibly randomized) algorithm with query complexity smaller than $2^{\Omega(n^{\beta/2})}$, there exists an $\varepsilon$-approximately submodular function $F$ such that, for the problem of maximizing $F$ under a cardinality constraint, the algorithm achieves an approximation ratio upper-bounded by $\frac{2}{n^{\beta/2}}$ with probability at least $1 - \frac{1}{2^{\Omega(n^{\beta/2})}}$.

*Proof.* We set $k = h = n^{1-\beta/2}$ and $\alpha = n^{1-\beta}$. Let $H$ be drawn uniformly at random among sets of size $h$ and let $f_H$ and $g$ be as in (3). We first define the $\varepsilon$-approximately submodular function $F_H$:

$$F_H(S) = \begin{cases} g(S) & \text{if } (1-\varepsilon)f_H(S) \le g(S) \le (1+\varepsilon)f_H(S) \\ f_H(S) & \text{otherwise} \end{cases}$$

It is clear from the definition that this is an $\varepsilon$-approximately submodular function. Consider a deterministic algorithm $A$ and let us denote by $S_1, \ldots, S_m$ the queries made by the algorithm when given as input the function $g$ ($g$ is 0-approximately submodular, hence it is a valid input to $A$). Without loss of generality, we can include the set returned by the algorithm in the queries, so $S_m$ denotes the set returned by the algorithm. By (5), for any $i \in [m]$:

$$\mathbb{P}_H\big[(1-\varepsilon)f_H(S_i) \le g(S_i) \le (1+\varepsilon)f_H(S_i)\big] \ge 1 - 2^{-\Omega(n^{\beta/2})}$$

When these events realize, we have $F_H(S_i) = g(S_i)$. By union bound over $i$, when $m < 2^{\Omega(n^{\beta/2})}$:

$$\mathbb{P}_H\big[\forall i,\; F_H(S_i) = g(S_i)\big] \ge 1 - m \cdot 2^{-\Omega(n^{\beta/2})} = 1 - 2^{-\Omega(n^{\beta/2})} > 0$$

This implies the existence of $H$ such that $A$ follows the same query path when given $g$ and $F_H$ as inputs. For this $H$:

$$F_H(S_m) = g(S_m) \le \max_{|S| \le k} g(S) \le \left(\frac{h}{n} + \frac{\alpha}{k}\right) \max_{|S| \le k} f_H(S) = \left(\frac{h}{n} + \frac{\alpha}{k}\right) \max_{|S| \le k} F_H(S)$$

where the second inequality comes from (4). For our choice of parameters, $\frac{h}{n} + \frac{\alpha}{k} = \frac{2}{n^{\beta/2}}$, hence:

$$F_H(S_m) \le \frac{2}{n^{\beta/2}} \max_{|S| \le k} F_H(S)$$

Let us now consider the case where the algorithm $A$ is randomized, and let us denote by $A_{H,R}$ the solution returned by the algorithm when given the function $F_H$ as input and random bits $R$. We have:

$$\mathbb{P}_{H,R}\left[F_H(A_{H,R}) \le \frac{2}{n^{\beta/2}} \max_{|S| \le k} F_H(S)\right] = \sum_r \mathbb{P}[R = r]\,\mathbb{P}_H\left[F_H(A_{H,r}) \le \frac{2}{n^{\beta/2}} \max_{|S| \le k} F_H(S)\right] \ge \left(1 - 2^{-\Omega(n^{\beta/2})}\right) \sum_r \mathbb{P}[R = r] = 1 - 2^{-\Omega(n^{\beta/2})}$$

where the inequality comes from the analysis of the deterministic case (when the random bits are fixed, the algorithm is deterministic). This implies the existence of $H$ such that:

$$\mathbb{P}_R\left[F_H(A_{H,R}) \le \frac{2}{n^{\beta/2}} \max_{|S| \le k} F_H(S)\right] \ge 1 - 2^{-\Omega(n^{\beta/2})}$$

which concludes the proof of the theorem.
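As a quick sanity check of the parameter choice (not spelled out above), plugging $k = h = n^{1-\beta/2}$, $\alpha = n^{1-\beta}$ and $\varepsilon \ge n^{-1/2+\beta}$ into Lemma 2 and inequality (4) gives:

$$\varepsilon^2 \cdot \frac{\alpha h}{n} \;\ge\; n^{-1+2\beta}\cdot n^{1-\beta}\cdot n^{-\beta/2} \;=\; n^{\beta/2}, \qquad \frac{h}{n} + \frac{\alpha}{k} \;=\; n^{-\beta/2} + n^{-\beta/2} \;=\; \frac{2}{n^{\beta/2}},$$

so the failure probability in (5) is indeed $2^{-\Omega(n^{\beta/2})}$ and the gap in (4) is $2/n^{\beta/2}$, as used in the proof.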
2.2 Coverage functions

In this section, we show that an exponential query-complexity lower bound still holds even in the restricted case where the objective function approximates a coverage function. Recall that, by definition of a coverage function, the elements of the ground set $N$ are subsets of a set $U$ called the universe. For a set $S = \{S_1, \ldots, S_m\}$ of subsets of $U$, the value $f(S)$ is given by $f(S) = \left|\bigcup_{i=1}^m S_i\right|$.

**Theorem 4.** For any $0 < \beta < \frac{1}{2}$, $\varepsilon \ge \frac{1}{n^{1/3-\beta}}$, and any (possibly randomized) algorithm with query complexity smaller than $2^{\Omega(n^{\beta/2})}$, there exists a function $F$ which $\varepsilon$-approximates a coverage function, such that for the problem of maximizing $F$ under a cardinality constraint, the algorithm achieves an approximation ratio upper-bounded by $\frac{2}{n^{\beta/2}}$ with probability at least $1 - \frac{1}{2^{\Omega(n^{\beta/2})}}$.

The proof of Theorem 4 has the same structure as the proof of Theorem 3. The main difference is a different choice of the class of functions $\mathcal{F}$ and the function $g$. The details can be found in the Appendix.

3 Approximation algorithms

The results from Section 2 can be seen as a strong impossibility result, since an exponential query-complexity lower bound holds even in the specific case of coverage functions, which exhibit a lot of structure. Faced with such an impossibility, we analyze two ways to relax the assumptions in order to obtain positive results. One relaxation considers $\varepsilon$-approximate submodularity when $\varepsilon \le \frac{1}{k}$; in this case we show that the greedy algorithm achieves a constant approximation ratio (and that $\varepsilon = \frac{1}{k}$ is tight for the greedy algorithm). The other relaxation considers functions with stronger structural properties, namely functions with bounded curvature. In this case, we show that a constant approximation ratio can be obtained for any constant $\varepsilon$.

3.1 Greedy algorithm

For the general class of monotone submodular functions, the result of [22] shows that a simple greedy algorithm achieves an approximation ratio of $1 - \frac{1}{e}$. Running the same algorithm on an $\varepsilon$-approximately submodular function results in a constant approximation ratio when $\varepsilon \le \frac{1}{k}$. The detailed description of the algorithm can be found in the appendix.

**Theorem 5.** Let $F$ be an $\varepsilon$-approximately submodular function. Then the set $S$ returned by the greedy algorithm satisfies:

$$F(S) \ge \frac{1 - 1/e}{1 + \frac{4k\varepsilon}{(1-\varepsilon)^2}} \max_{T : |T| \le k} F(T)$$

In particular, for $k \ge 2$, any constant $0 \le \delta < 1$ and $\varepsilon = \frac{\delta}{k}$, this approximation ratio is constant and lower-bounded by $1 - 1/e - O(\delta)$.

*Proof.* Let us denote by $O$ an optimal solution to $\max_{T : |T| \le k} F(T)$ and by $f$ a submodular representative of $F$. Let us write $S = \{e_1, \ldots, e_\ell\}$ for the set returned by the greedy algorithm and define $S_i = \{e_1, \ldots, e_i\}$. Then:

$$\begin{aligned}
f(O) &\le f(S_i) + \sum_{e \in O} \big[f(S_i \cup \{e\}) - f(S_i)\big] \\
&\le f(S_i) + \sum_{e \in O} \left[\tfrac{1}{1-\varepsilon} F(S_i \cup \{e\}) - f(S_i)\right] \\
&\le f(S_i) + \sum_{e \in O} \left[\tfrac{1}{1-\varepsilon} F(S_{i+1}) - f(S_i)\right] \\
&\le f(S_i) + \sum_{e \in O} \left[\tfrac{1+\varepsilon}{1-\varepsilon} f(S_{i+1}) - f(S_i)\right] \\
&\le f(S_i) + k\left[\tfrac{1+\varepsilon}{1-\varepsilon} f(S_{i+1}) - f(S_i)\right]
\end{aligned}$$

where the first inequality uses submodularity, the second uses the definition of approximate submodularity, the third uses the definition of the algorithm, the fourth uses approximate submodularity again, and the last one uses that $|O| \le k$. Reordering the terms, and expressing the inequality in terms of $F$ (using the definition of approximate submodularity), gives:

$$F(S_{i+1}) \ge \left(\frac{1-\varepsilon}{1+\varepsilon}\right)^2\left[\left(1 - \frac{1}{k}\right) F(S_i) + \frac{1}{k} F(O)\right]$$

This is an inductive inequality of the form $u_{i+1} \ge a u_i + b$, $u_0 = 0$, whose solution is $u_i \ge \frac{b}{1-a}(1 - a^i)$. For our specific $a$ and $b$, and using that $a \le 1 - \frac{1}{k}$ so that $1 - a^k \ge 1 - 1/e$, we obtain:

$$F(S) \ge \frac{1 - 1/e}{1 + \frac{4k\varepsilon}{(1-\varepsilon)^2}} \max_{T : |T| \le k} F(T)$$
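The greedy algorithm analyzed in Theorem 5 is described in detail in the appendix; the sketch below is a standard implementation of greedy selection against a value oracle $F$ (illustrative, assuming a Python set-based oracle such as the one in the first snippet).

```python
def greedy(F, ground_set, k):
    """Greedy selection against a value oracle F: at each of the k steps, add the
    element whose inclusion gives the largest oracle value F(S + e)."""
    S = set()
    for _ in range(k):
        best = max((e for e in ground_set if e not in S),
                   key=lambda e: F(S | {e}),
                   default=None)
        if best is None:
            break
        S.add(best)
    return S

# Usage with the eps-approximately submodular oracle F from the first snippet:
#   S = greedy(F, ground_set, k=3)
# Theorem 5 guarantees F(S) is within a constant factor of the optimum when eps <= delta/k.
```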
The following proposition shows that $\varepsilon = \frac{1}{k}$ is tight for the greedy algorithm, and that this is the case even for additive functions. The proof can be found in the Appendix.

**Proposition 6.** For any $\beta > 0$, there exists an $\varepsilon$-approximately additive function with $\varepsilon = \Omega\!\left(\frac{1}{k^{1-\beta}}\right)$ for which the greedy algorithm has a non-constant approximation ratio.

**Matroid constraint.** Theorem 5 can be generalized to the case of matroid constraints. We are now looking at a problem of the form $\max_{S \in \mathcal{I}} F(S)$, where $\mathcal{I}$ is the set of independent sets of a matroid.

**Theorem 7.** Let $\mathcal{I}$ be the set of independent sets of a matroid of rank $k$, and let $F$ be an $\varepsilon$-approximately submodular function. Then, if $S$ is the set returned by the greedy algorithm:

$$F(S) \ge \frac{1}{2}\cdot\frac{1}{1 + \frac{k\varepsilon}{1-\varepsilon}}\,\max_{S' \in \mathcal{I}} f(S')$$

In particular, for $k \ge 2$, any constant $0 \le \delta < 1$ and $\varepsilon = \frac{\delta}{k}$, this approximation ratio is constant and lower-bounded by $\frac{1}{2(1+2\delta)}$.

3.2 Bounded curvature

With an additional assumption on the curvature of the submodular function $f$, it is possible to obtain a constant approximation ratio for any $\varepsilon$-approximately submodular function with constant $\varepsilon$. Recall that the curvature $c$ of a function $f : 2^N \to \mathbb{R}$ is defined by $c = 1 - \min_{a \in N} \frac{f_{N \setminus \{a\}}(a)}{f(a)}$, where $f_S(a)$ denotes the marginal contribution $f(S \cup \{a\}) - f(S)$. A consequence of this definition, when $f$ is submodular, is that for any $S \subseteq N$ and $a \in N \setminus S$ we have $f_S(a) \ge (1-c)f(a)$.

**Proposition 8.** For the problem $\max_{|S| \le k} F(S)$, where $F$ is an $\varepsilon$-approximately submodular function which approximates a monotone submodular $f$ with curvature $c$, there exists a polynomial-time algorithm which achieves an approximation ratio of $(1-c)\left(\frac{1-\varepsilon}{1+\varepsilon}\right)^2$.
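The curvature definition above can be evaluated directly on small ground sets. The sketch below is illustrative; it also includes the simple rule of keeping the $k$ elements with the largest singleton values, which is one natural way to obtain a $(1-c)$-type guarantee under bounded curvature (the paper's algorithm is referenced here only through its guarantee in Proposition 8).

```python
def curvature(f, N):
    """Curvature c = 1 - min_a [f(N) - f(N - {a})] / f({a}) of a set function f over ground set N."""
    full = set(N)
    ratios = [(f(full) - f(full - {a})) / f({a}) for a in N if f({a}) > 0]
    return 1 - min(ratios)

def top_k_singletons(F, N, k):
    """Keep the k elements with the largest oracle values F({a}); for a monotone
    submodular representative with curvature c, this yields a
    (1-c)((1-eps)/(1+eps))^2 approximation (cf. Proposition 8)."""
    return set(sorted(N, key=lambda a: F({a}), reverse=True)[:k])

# Toy check with f(S) = sqrt(|S|) (curvature strictly below 1) and a noiseless oracle.
f = lambda S: len(S) ** 0.5
N = list(range(8))
print(curvature(f, N))
print(top_k_singletons(f, N, k=3))
```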
References

[1] F. Bach. Structured sparsity-inducing norms through submodular functions. In NIPS, 2010.
[2] A. Badanidiyuru, S. Dobzinski, H. Fu, R. Kleinberg, N. Nisan, and T. Roughgarden. Sketching valuation functions. In SODA, pages 1025–1035. SIAM, 2012.
[3] M.-F. Balcan and N. J. Harvey. Learning submodular functions. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 793–802. ACM, 2011.
[4] A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In COLT, pages 240–265, 2015.
[5] Y. Chen, S. H. Hassani, A. Karbasi, and A. Krause. Sequential information maximization: When is greedy near-optimal? In COLT, pages 338–363, 2015.
[6] A. Das, A. Dasgupta, and R. Kumar. Selecting diverse features via spectral relaxation. In NIPS, 2012.
[7] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In ICML, 2011.
[8] A. Defazio and T. Caetano. A convex formulation for learning scale-free networks via submodular relaxation. In NIPS, 2012.
[9] D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. JAIR, 42:427–486, 2011.
[10] R. Gomes and A. Krause. Budgeted nonparametric learning from data streams. In ICML, 2010.
[11] C. Guestrin, A. Krause, and A. Singh. Near-optimal sensor placements in Gaussian processes. In International Conference on Machine Learning (ICML), August 2005.
[12] A. Guillory and J. Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011.
[13] A. Hassidim and Y. Singer. Submodular optimization under noise. CoRR, abs/1601.03095, 2016.
[14] S. Hoi, R. Jin, J. Zhu, and M. Lyu. Batch mode active learning and its application to medical image classification. In ICML, 2006.
[15] R. K. Iyer and J. A. Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In NIPS, pages 2436–2444, 2013.
[16] R. K. Iyer, S. Jegelka, and J. A. Bilmes. Curvature and optimal algorithms for learning and minimizing submodular functions. In NIPS, pages 2742–2750, 2013.
[17] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003.
[18] A. Krause and C. Guestrin. Near-optimal observation selection using submodular functions. In National Conference on Artificial Intelligence (AAAI), Nectar track, July 2007.
[19] A. Krause and C. Guestrin. Nonmyopic active learning of Gaussian processes: An exploration-exploitation approach. In ICML, 2007.
[20] A. Krause and C. Guestrin. Submodularity and its applications in optimized information gathering. ACM Transactions on Intelligent Systems and Technology, 2(4), 2011.
[21] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In ACL/HLT, 2011.
[22] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265–294, 1978.
[23] M. G. Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. ACM TKDD, 5(4), 2011.
[24] M. G. Rodriguez and B. Schölkopf. Submodular inference of diffusion networks from multiple trees. In ICML, 2012.
[25] Y. Singer and J. Vondrák. Information-theoretic lower bounds for convex optimization with erroneous oracles. In NIPS, pages 3186–3194, 2015.
[26] A. Singla, S. Tschiatschek, and A. Krause. Noisy submodular maximization via adaptive sampling with applications to crowdsourced image collection summarization. arXiv preprint arXiv:1511.07211, 2015.
[27] H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In ICML, 2014.
[28] S. Tschiatschek, R. Iyer, H. Wei, and J. Bilmes. Learning mixtures of submodular functions for image collection summarization. In NIPS, 2014.
[29] J. Zheng, Z. Jiang, R. Chellappa, and J. Phillips. Submodular attribute selection for action recognition in video. In NIPS, 2014.