# Optimal approximation for unconstrained non-submodular minimization

Marwa El Halabi¹, Stefanie Jegelka¹

Submodular function minimization is well studied, and existing algorithms solve it exactly or up to arbitrary accuracy. However, in many applications, such as structured sparse learning or batch Bayesian optimization, the objective function is not exactly submodular, but close. In this case, no theoretical guarantees exist. Indeed, submodular minimization algorithms rely on intricate connections between submodularity and convexity. We show how these relations can be extended to obtain approximation guarantees for minimizing non-submodular functions, characterized by how close the function is to submodular. We also extend this result to noisy function evaluations. Our approximation results are the first for minimizing non-submodular functions, and are optimal, as established by our matching lower bound.

## 1. Introduction

Many machine learning problems can be formulated as minimizing a set function $H$. This problem is in general NP-hard, and can only be solved efficiently with additional structure. One especially popular example of such structure is that $H$ is submodular, i.e., it satisfies the diminishing returns (DR) property:
$$H(A \cup \{i\}) - H(A) \ge H(B \cup \{i\}) - H(B), \quad \text{for all } A \subseteq B,\ i \in V \setminus B.$$
Several existing algorithms minimize a submodular $H$ in polynomial time, exactly or within arbitrary accuracy. Submodularity is a natural model for a variety of applications, such as image segmentation (Boykov & Kolmogorov, 2004), data selection (Lin & Bilmes, 2010), or clustering (Narasimhan et al., 2006). But, in many other settings, such as structured sparse learning, Bayesian optimization, and column subset selection, the objective function is not exactly submodular. Instead, it satisfies a weaker version of the diminishing returns property. An important class of such functions are α-weakly DR-submodular functions, introduced in (Lehmann et al., 2006). The parameter α quantifies how close the function is to being submodular (see Section 2 for a precise definition). Furthermore, in many cases, only noisy evaluations of the objective are available. Hence, we ask: Do submodular minimization algorithms extend to such non-submodular noisy functions?

¹Massachusetts Institute of Technology. Correspondence to: Marwa El Halabi. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Non-submodular maximization, under various notions of approximate submodularity, has recently received a lot of attention (Das & Kempe, 2011; Elenberg et al., 2018; Sakaue, 2019; Bian et al., 2017; Chen et al., 2017; Gatmiry & Gomez-Rodriguez, 2019; Harshaw et al., 2019; Kuhnle et al., 2018; Horel & Singer, 2016; Hassidim & Singer, 2018). In contrast, only few studies consider minimization of non-submodular set functions. Recent works have studied the problem of minimizing the ratio of two set functions, where one (Bai et al., 2016; Qian et al., 2017a) or both (Wang et al., 2019) are non-submodular. The ratio problem is related to constrained minimization, which does not admit a constant-factor approximation even in the submodular case (Svitkina & Fleischer, 2011).
If the objective is approximately modular, i.e., it has bounded curvature, algorithmic techniques related to those for submodular maximization achieve optimal approximations for constrained minimization (Sviridenko et al., 2017; Iyer et al., 2013). Algorithms for minimizing the difference of two submodular functions were proposed in (Iyer & Bilmes, 2012; Kawahara et al., 2015), but no approximation guarantees were provided.

In this paper, we study the unconstrained non-submodular minimization problem
$$\min_{S \subseteq V} H(S) := F(S) - G(S), \qquad (1)$$
where $F$ and $G$ are monotone (i.e., non-decreasing or non-increasing) functions, $F$ is α-weakly DR-submodular, and $G$ is β-weakly DR-supermodular, i.e., $-G$ is $\beta^{-1}$-weakly DR-submodular. The definitions of weak DR-sub-/supermodularity only hold for monotone functions, and thus do not directly apply to $H$. We show that, perhaps surprisingly, any set function $H$ can be decomposed into functions $F$ and $G$ that satisfy these assumptions, albeit with properties leading to weaker approximations when the function is far from being submodular.

A key strategy for minimizing submodular functions exploits a tractable tight convex relaxation that enables the use of convex optimization algorithms. But this relies on the equivalence between the convex closure of a submodular function and the polynomial-time computable Lovász extension. In general, the convex closure of a set function is NP-hard to compute, and the Lovász extension is convex if and only if the set function is submodular. Thus, the optimization delicately relies on submodularity; generally, a tractable tight convex relaxation is impossible. Yet, in this paper, we show that for approximately submodular functions, the Lovász extension can be approximately minimized using a projected subgradient method (PGM). In fact, this strategy is guaranteed to obtain an approximate solution to Problem (1). This insight broadly expands the scope of submodular minimization techniques.

In short, our main contributions are:

- the first approximation guarantee for unconstrained non-submodular minimization, characterized by closeness to submodularity: PGM achieves a tight approximation of $H(\hat S) \le F(S^*)/\alpha - \beta G(S^*) + \epsilon$;
- an extension of this result to the case where only a noisy oracle of $H$ is accessible;
- a hardness result showing that improving on this approximation guarantee would require exponentially many queries in the value oracle model;
- applications to structured sparse learning and variance reduction in batch Bayesian optimization, implying the first approximation guarantees for these problems;
- experiments demonstrating the robustness of classical submodular minimization algorithms against noise and non-submodularity, reflecting our theoretical results.

## 2. Preliminaries

We begin by introducing our notation, the definitions of weak DR-submodularity/supermodularity, and by reviewing some facts about classical submodular minimization.

**Notation.** Let $V = \{1, \dots, d\}$ be the ground set. Given a set function $F : 2^V \to \mathbb{R}$, we denote the marginal gain of adding an element $i$ to a set $A$ by $F(i|A) = F(A \cup \{i\}) - F(A)$. Given a vector $x \in \mathbb{R}^d$, $x_i$ is its $i$-th entry and $\mathrm{supp}(x) = \{i \in V \mid x_i \neq 0\}$ is its support set; $x$ also defines a modular set function as $x(A) = \sum_{i \in A} x_i$.

**Set function classes.** The function $F$ is normalized if $F(\emptyset) = 0$, and non-decreasing (non-increasing) if $F(A) \le F(B)$ ($F(A) \ge F(B)$) for all $A \subseteq B$.
$F$ is submodular if it has diminishing marginal gains: $F(i|A) \ge F(i|B)$ for all $A \subseteq B$, $i \in V \setminus B$; modular if the inequality holds as an equality; and supermodular if $F(i|A) \le F(i|B)$. Relaxing these inequalities leads to the notions of weak DR-submodularity/supermodularity introduced in (Lehmann et al., 2006) and (Bian et al., 2017), respectively.

**Definition 1 (Weak DR-sub/supermodularity).** A set function $F$ is α-weakly DR-submodular, with $\alpha > 0$, if
$$F(i|A) \ge \alpha F(i|B), \quad \text{for all } A \subseteq B,\ i \in V \setminus B.$$
Similarly, $F$ is β-weakly DR-supermodular, with $\beta > 0$, if
$$F(i|B) \ge \beta F(i|A), \quad \text{for all } A \subseteq B,\ i \in V \setminus B.$$
We say that $F$ is (α, β)-weakly DR-modular if it satisfies both properties.

If $F$ is non-decreasing, then $\alpha, \beta \in (0, 1]$, and if it is non-increasing, then $\alpha, \beta \ge 1$. $F$ is submodular (supermodular) iff $\alpha = 1$ ($\beta = 1$), and modular iff both $\alpha = \beta = 1$. The parameters $1 - \alpha$ and $1 - \beta$ are referred to as generalized inverse curvature (Bogunovic et al., 2018) and generalized curvature (Bian et al., 2017), respectively. They extend the notions of inverse curvature and curvature (Conforti & Cornuéjols, 1984) commonly defined for supermodular and submodular functions. These notions are also related to weakly sub-/supermodular functions (Das & Kempe, 2011; Bogunovic et al., 2018). Namely, the classes of weakly DR-sub-/super-/modular functions are respective subsets of the classes of weakly sub-/super-/modular functions (El Halabi et al., 2018, Prop. 8), (Bogunovic et al., 2018, Prop. 1), as illustrated in Figure 1. For a survey of other notions of approximate submodularity, we refer the reader to (Bian et al., 2017, Sect. 6).

*Figure 1. Classes of set functions.*

**Submodular minimization.** Minimizing a submodular set function $F$ is equivalent to minimizing a non-smooth convex function that is given by a continuous extension of $F$, i.e., a continuous interpolation of $F$ on the full hypercube $[0, 1]^d$. This extension, called the Lovász extension (Lovász, 1983), is convex if and only if $F$ is submodular.

**Definition 2 (Lovász extension).** Given any normalized set function $F$, its Lovász extension $f_L : \mathbb{R}^d \to \mathbb{R}$ is defined as
$$f_L(s) = \sum_{k=1}^{d} s_{j_k} F(j_k \mid S_{k-1}),$$
where $s_{j_1} \ge \dots \ge s_{j_d}$ are the sorted entries of $s$ in decreasing order, and $S_k = \{j_1, \dots, j_k\}$, with $S_0 = \emptyset$.

Minimizing $f_L$ is equivalent to minimizing $F$. Moreover, when $F$ is submodular, a subgradient $\kappa$ of $f_L$ at any $s \in \mathbb{R}^d$ can be computed efficiently by sorting the entries of $s$ in decreasing order and taking $\kappa_{j_k} = F(j_k|S_{k-1})$ for all $k \in V$ (Edmonds, 2003). This relation between submodularity and convexity allows generic convex optimization algorithms to be used for minimizing $F$. However, it has been unclear how these relations are affected if the function is only approximately submodular. In this paper, we give an answer to this question.

## 3. Approximately submodular minimization

We consider set functions $H : 2^V \to \mathbb{R}$ of the form $H(S) = F(S) - G(S)$, where $F$ is α-weakly DR-submodular, $G$ is β-weakly DR-supermodular, and both $F$ and $G$ are normalized non-decreasing functions. We later extend our results to non-increasing functions. We assume value oracle access to $H$; i.e., there is an oracle that, given a set $S \subseteq V$, returns the value $H(S)$. Note that $H$ itself is in general not weakly DR-submodular. Interestingly, any set function can be decomposed in this form.

**Proposition 1.**
Given a set function $H$, and $\alpha, \beta \in (0, 1]$ such that $\alpha\beta < 1$, there exists a non-decreasing α-weakly DR-submodular function $F$, and a non-decreasing (α, β)-weakly DR-modular function $G$, such that $H(S) = F(S) - G(S)$ for all $S \subseteq V$.

**Proof sketch.** This decomposition builds on the decomposition of $H$ into the difference of two non-decreasing submodular functions (Iyer & Bilmes, 2012). We start by choosing any function $G_0$ which is non-decreasing (α, β)-weakly DR-modular, and is strictly α-weakly DR-submodular, i.e., $\epsilon_{G_0} = \min_{i \in V,\, A \subsetneq B \subseteq V \setminus \{i\}} G_0(i|A) - \alpha G_0(i|B) > 0$. It is not possible to choose $G_0$ such that $\alpha = \beta = 1$ (this would imply $G_0(i|B) \ge G_0(i|A) > G_0(i|B)$). We then construct $F$ and $G$ based on $G_0$. Let $\epsilon_H = \min_{i \in V,\, A \subseteq B \subseteq V \setminus \{i\}} H(i|A) - \alpha H(i|B) < 0$ be the violation of α-weak DR-submodularity of $H$; we may use a lower bound $\epsilon'_H \le \epsilon_H$. We define $F'(S) = H(S) + \frac{|\epsilon'_H|}{\epsilon_{G_0}} G_0(S)$. $F'$ is not necessarily non-decreasing. To correct for that, let $V^- = \{i : F'(i|V \setminus \{i\}) < 0\}$ and define $F(S) = F'(S) - \sum_{i \in S \cap V^-} F'(i|V \setminus \{i\})$. We can show that $F$ is non-decreasing and α-weakly DR-submodular. We also define $G(S) = \frac{|\epsilon'_H|}{\epsilon_{G_0}} G_0(S) - \sum_{i \in S \cap V^-} F'(i|V \setminus \{i\})$; then $G$ is non-decreasing (α, β)-weakly DR-modular, and $H(S) = F(S) - G(S)$.

Proposition 1 generalizes the result of (Cunningham, 1983, Theorem 18) showing that any submodular function can be decomposed into the difference of a non-decreasing submodular function and a non-decreasing modular function. When $H$ is submodular, the decomposition in Proposition 1 recovers the one from (Cunningham, 1983), by simply choosing $\alpha = \beta = 1$. The resulting violation of submodularity is $\epsilon_H = 0$, and $G_0$ is not needed.

Computing such a decomposition is not required to run PGM for minimization; it is only needed to evaluate the corresponding approximation guarantee. The construction in the above proof uses the maximum violation $\epsilon_H$ of α-weak DR-submodularity of $H$, which is NP-hard to compute in general. However, when $\epsilon_H$ or a lower bound on it is known, $F$ and $G$ can be obtained in polynomial time, for a suitable choice of $G_0$. Proposition 2 provides a valid choice of $G_0$ for $\alpha = 1$. Any modular function can be used for $\alpha < 1$.

**Proposition 2.** Given $\beta \in (0, 1)$, let $G_0(S) = g(|S|)$ where $g(x) = -\tfrac{1}{2} a x^2 + (1 + \tfrac{1}{2} a) x$ with $a = \frac{1 - \beta}{d - 1}$. Then $G_0$ is non-decreasing (1, β)-weakly DR-modular, and is strictly submodular, with $\epsilon_{G_0} = \min_{i \in V,\, A \subsetneq B \subseteq V \setminus \{i\}} G_0(i|A) - G_0(i|B) = a > 0$.

The lower bound on $\epsilon_H$ and the choice of α, β and $G_0$ will affect the approximation guarantee on $H$, as we clarify later. When $H$ is far from being submodular, it may not be possible to choose $G_0$ to obtain a non-trivial guarantee. However, many important non-submodular functions do admit a decomposition which leads to non-trivial bounds. We call such functions approximately submodular, and provide some examples in Section 4.

In what follows, we establish a connection between approximate submodularity and approximate convexity, which allows us to derive a tight approximation guarantee for PGM on Problem (1). All omitted proofs are in the Supplement.

### 3.1. Convex relaxation

When $H$ is not submodular, the connections between its Lovász extension and a tight convex relaxation for exact minimization, outlined in Section 2, break down. However, Problem (1) can still be converted to a non-smooth convex optimization problem, via a different convex extension. Given a set function $H$, its convex closure $h^-$ is the pointwise largest convex function from $[0, 1]^d$ to $\mathbb{R}$ that always lower bounds $H$. Intuitively, $h^-$ is the tightest convex extension of $H$ on $[0, 1]^d$.
The following equivalence holds (Dughmi, 2009, Prop. 3.23):
$$\min_{S \subseteq V} H(S) = \min_{s \in [0,1]^d} h^-(s). \qquad (2)$$
Unfortunately, evaluating and optimizing $h^-$ for a general set function is NP-hard (Vondrák, 2007). The key property that makes Problem (2) efficient to solve when $H$ is submodular is that its convex closure then coincides with its tractable Lovász extension, i.e., $h^- = h_L$. This equivalence no longer holds if $H$ is only approximately submodular. But, in this case, a weaker key property holds: Lemma 1 shows that the Lovász extension approximates the convex closure $h^-$, and that the same vectors that served as its subgradients in the submodular case can serve as approximate subgradients to $h^-$.

**Lemma 1.** Given a vector $s \in [0, 1]^d$ such that $s_{j_1} \ge \dots \ge s_{j_d}$, we define $\kappa$ such that $\kappa_{j_k} = H(j_k|S_{k-1})$, where $S_k = \{j_1, \dots, j_k\}$. Then $h_L(s) = \kappa^\top s \ge h^-(s)$, and
$$\kappa(A) \le \tfrac{1}{\alpha} F(A) - \beta G(A) \quad \text{for all } A \subseteq V,$$
$$\kappa^\top s' \le \tfrac{1}{\alpha} f^-(s') + \beta\, (-g)^-(s') \quad \text{for all } s' \in [0, 1]^d,$$
where $f^-$ and $(-g)^-$ denote the convex closures of $F$ and $-G$, respectively.

To prove Lemma 1, we use a specific formulation of the convex closure $h^-$ (El Halabi, 2018, Def. 20):
$$h^-(s) = \max_{\kappa \in \mathbb{R}^d,\, \rho \in \mathbb{R}} \{\kappa^\top s + \rho : \kappa(A) + \rho \le H(A),\ \forall A \subseteq V\},$$
and build on the proof of Edmonds' greedy algorithm (Edmonds, 2003). We can view the vector $\kappa$ in Lemma 1 as an approximate subgradient of $h^-$ at $s$ in the following sense:
$$\tfrac{1}{\alpha} f^-(s') + \beta\, (-g)^-(s') \ge h^-(s) + \langle \kappa, s' - s \rangle, \quad \forall s' \in [0, 1]^d.$$
Lemma 1 also implies that the Lovász extension $h_L$ approximates the convex closure $h^-$ in the following sense:
$$h^-(s) \le h_L(s) \le \tfrac{1}{\alpha} f^-(s) + \beta\, (-g)^-(s), \quad \forall s \in [0, 1]^d.$$
We can thus say that $h_L$ is approximately convex in this case. This key insight allows us to approximately minimize $h^-$ via convex optimization algorithms.

### 3.2. Algorithm and approximation guarantees

Equipped with the approximate subgradients of $h^-$, we can now apply an approximate projected subgradient method (PGM). Starting from an arbitrary $s^1 \in [0, 1]^d$, PGM iteratively updates $s^{t+1} = \Pi_{[0,1]^d}(s^t - \eta \kappa^t)$, where $\kappa^t$ is the approximate subgradient at $s^t$ from Lemma 1, and $\Pi_{[0,1]^d}$ is the projection onto $[0, 1]^d$. We set the step size to $\eta = \frac{R}{L\sqrt{T}}$, where $L = F(V) + G(V)$ is the Lipschitz constant, i.e., $\|\kappa^t\|_2 \le L$ for all $t$, and $R = 2\sqrt{d}$ is the domain radius, i.e., $\|s^1 - s^*\|_2 \le R$.

**Theorem 1.** After $T$ iterations of PGM, $\hat s \in \arg\min_{t \in \{1,\dots,T\}} h_L(s^t)$ satisfies
$$h^-(\hat s) \le h_L(\hat s) \le \tfrac{1}{\alpha} f^-(s^*) + \beta\, (-g)^-(s^*) + \tfrac{RL}{\sqrt{T}},$$
where $s^*$ is an optimal solution of $\min_{s \in [0,1]^d} h^-(s)$.

Importantly, the algorithm does not need to know the parameters α and β, which can be hard to compute in practice. In fact, its iterates are exactly the same as in the submodular case. Theorem 1 provides an approximate fractional solution $\hat s \in [0, 1]^d$. To round it to a discrete solution, Corollary 1 shows that it is sufficient to pick the superlevel set of $\hat s$ with the smallest $H$ value.

**Corollary 1.** Given the fractional solution $\hat s$ in Theorem 1, let $\hat S_k = \{j_1, \dots, j_k\}$ such that $\hat s_{j_1} \ge \dots \ge \hat s_{j_d}$, and $\hat S_0 = \emptyset$. Then $\hat S \in \arg\min_{k \in \{0,\dots,d\}} H(\hat S_k)$ satisfies
$$H(\hat S) \le \tfrac{1}{\alpha} F(S^*) - \beta G(S^*) + \tfrac{RL}{\sqrt{T}},$$
where $S^*$ is an optimal solution of Problem (1).

To obtain a set that satisfies $H(\hat S) \le F(S^*)/\alpha - \beta G(S^*) + \epsilon$, we thus need at most $O(d L^2/\epsilon^2)$ iterations of PGM, where the time per iteration is $O(d \log d + d\,\mathrm{EO})$, with EO being the time needed to evaluate $H$ on any set. Moreover, the techniques from (Chakrabarty et al., 2017; Axelrod et al., 2019) for accelerating the runtime of stochastic PGM to $O(d\,\mathrm{EO}/\epsilon^2)$ can be extended to our setting.
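To make the procedure concrete, here is a minimal NumPy sketch of this pipeline: the sorting-based vector $\kappa$ from Lemma 1, the projected subgradient updates, and the superlevel-set rounding of Corollary 1. It is an illustration under simplifying assumptions, not the authors' implementation; the generic value oracle `H`, the index convention (subsets of {0, ..., d−1}), and the fallback Lipschitz constant are placeholders.

```python
import numpy as np

def lovasz_subgradient(H, s):
    """Return kappa with kappa[j_k] = H(j_k | S_{k-1}) (Lemma 1) and h_L(s) = kappa^T s."""
    d = len(s)
    order = np.argsort(-s)              # j_1, ..., j_d with s[j_1] >= ... >= s[j_d]
    kappa = np.zeros(d)
    prev, S = H(frozenset()), set()
    for j in order:
        S.add(int(j))
        cur = H(frozenset(S))
        kappa[j] = cur - prev           # marginal gain of j w.r.t. the previous superlevel set
        prev = cur
    return kappa, float(kappa @ s)

def pgm_minimize(H, d, T=1000, R=None, L=None):
    """Approximate PGM for min_S H(S): subgradient steps on [0,1]^d, then round
    by taking the superlevel set of the best iterate with smallest H value."""
    s = np.full(d, 0.5)
    best_hL, best_s = np.inf, s.copy()
    R = 2.0 * np.sqrt(d) if R is None else R                            # domain radius used above
    L = max(1e-12, abs(H(frozenset(range(d))))) if L is None else L     # crude stand-in for F(V)+G(V)
    eta = R / (L * np.sqrt(T))
    for _ in range(T):
        kappa, hL = lovasz_subgradient(H, s)
        if hL < best_hL:
            best_hL, best_s = hL, s.copy()
        s = np.clip(s - eta * kappa, 0.0, 1.0)                          # projection onto the unit cube
    order = np.argsort(-best_s)
    best_set, best_val = frozenset(), H(frozenset())
    for k in range(1, d + 1):
        Sk = frozenset(int(j) for j in order[:k])
        if H(Sk) < best_val:
            best_set, best_val = Sk, H(Sk)
    return best_set, best_val
```

Given only a value oracle for $H = F - G$, these iterates coincide with those of standard submodular PGM, which is why the method needs no knowledge of α and β.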
If $F$ is regarded as a cost and $G$ as a revenue, this guarantee states that the returned solution achieves at least a fraction β of the revenue of the optimal solution, by paying at most a $1/\alpha$-multiple of the cost. The quality of this guarantee depends on $F$, $G$ and their parameters α, β; it becomes vacuous when $F(S^*)/\alpha \ge \beta G(S^*)$. If $H$ is submodular, Problem (1) reduces to submodular minimization and Corollary 1 recovers the guarantee $H(\hat S) \le H(S^*) + RL/\sqrt{T}$.

**Remark 1.** The upper bound in Corollary 1 still holds if the worst-case parameters α, β are instead replaced by
$$\alpha_T = \frac{F(S^*)}{\frac{1}{T}\sum_{t=1}^{T} \kappa^F_t(S^*)} \quad \text{and} \quad \beta_T = \frac{\frac{1}{T}\sum_{t=1}^{T} \kappa^G_t(S^*)}{G(S^*)},$$
where $(\kappa^F_t)_{j_k} = F(j_k|S^t_{k-1})$ and $(\kappa^G_t)_{j_k} = G(j_k|S^t_{k-1})$ are computed from the sorted entries of the iterate $s^t$ as in Lemma 1. This refined upper bound yields improvements if only few of the relevant submodularity inequalities are violated.

All results in this section extend to the case where $F$ and $G$ are non-increasing functions.

**Corollary 2.** Given $H(S) = F(S) - G(S)$, where $F$ and $G$ are non-increasing functions with $F(V) = G(V) = 0$, we run PGM with $H'(S) = H(V \setminus S)$ for $T$ iterations. Let $\bar s \in \arg\min_{t \in \{1,\dots,T\}} h'_L(s^t)$, where $h'_L$ is the Lovász extension of $H'$, and let $\hat S = V \setminus \bar S$, where $\bar S$ is the superlevel set of $\bar s$ with the smallest $H'$ value. Then
$$H(\hat S) \le \alpha F(S^*) - \tfrac{1}{\beta} G(S^*) + \tfrac{RL}{\sqrt{T}},$$
where $S^*$ is an optimal solution of Problem (1).

For a general set function $H$, using $F$ and $G$ from the decomposition in Proposition 1 in Corollary 1 yields
$$H(\hat S) \le \tfrac{1}{\alpha} H(S^*) + \left(\tfrac{1}{\alpha} - \beta\right) \left( \tfrac{|\epsilon'_H|}{\epsilon_{G_0}} G_0(S^*) - \sum_{i \in S^* \cap V^-} F'(i|V \setminus \{i\}) \right) + \epsilon,$$
where $\epsilon'_H$ is a lower bound on the violation of α-weak DR-submodularity of $H$, $F'$ and $G_0$ are the auxiliary functions used to construct $F$ and $G$, and $\epsilon_{G_0}$ is the strict α-weak DR-submodularity constant of $G_0$ (see the proof of Proposition 1 for precise definitions). It is clear that a larger $|\epsilon'_H|$, i.e., a cruder lower bound, worsens the upper bound on $H(\hat S)$. Moreover, the choice of $G_0$ affects the bound: ideally, we want to choose $G_0$ to minimize $G_0(S^*)$, and to maximize the quantities α, $\epsilon_{G_0}$ and β, which characterize how submodular and supermodular $G_0$ is, respectively. However, a larger α leads to a larger $|\epsilon'_H|$ and a smaller $\epsilon_{G_0}$, and a larger $\epsilon_{G_0}$ would result in a smaller β, and vice versa. The best choice of $G_0$ will depend on $H$.

In Appendix B.4, we provide an example showing that the approximation guarantees in Corollary 1 and 3 are tight, i.e., they cannot be improved for PGM, even if $F$ and $G$ are weakly DR-modular. Furthermore, in Section 3.4 we show that these approximation guarantees are optimal in general. Apart from the above results for general unconstrained minimization, our results also imply approximation guarantees for generalizing constrained submodular minimization to weakly DR-submodular functions. We discuss this extension in Appendix A.

### 3.3. Extension to noisy evaluations

In many real-world applications, we do not have access to the objective function itself, but rather to a noisy version of it. Several works have considered maximizing noisy oracles of submodular (Horel & Singer, 2016; Singla et al., 2016; Hassidim & Singer, 2017; 2018) and weakly submodular (Qian et al., 2017b) functions. In contrast, to the best of our knowledge, minimizing noisy oracles of submodular functions was only studied in (Blais et al., 2018). We address a more general setup where the underlying function $H$ is not necessarily submodular. We assume again that $F$ and $G$ are normalized and non-decreasing. The results easily extend to non-increasing functions as in Corollary 3.
We show in Proposition 3 that our approximation guarantee for Problem (1) continues to hold when we only have access to an approximate oracle $\tilde H$. Essentially, $\tilde H$ still allows us to obtain approximate subgradients of $h^-$ in the sense of Lemma 1, but now with an additional additive error.

**Proposition 3.** Assume we have an approximate oracle $\tilde H$ with input parameters $\epsilon, \delta \in (0, 1)$, such that for every $S \subseteq V$, $|\tilde H(S) - H(S)| \le \epsilon$ with probability $1 - \delta$. We run PGM with $\tilde H$ for $T$ iterations. Let $\hat s = \arg\min_{t \in \{1,\dots,T\}} h_L(s^t)$, and $\hat S_k = \{j_1, \dots, j_k\}$ such that $\hat s_{j_1} \ge \dots \ge \hat s_{j_d}$. Then $\hat S \in \arg\min_{k \in \{0,\dots,d\}} \tilde H(\hat S_k)$ satisfies
$$H(\hat S) \le \tfrac{1}{\alpha} F(S^*) - \beta G(S^*) + \epsilon'$$
with probability $1 - \delta'$, by choosing $\epsilon = \frac{\epsilon'}{8d}$, $\delta = \frac{\delta' \epsilon'^2}{32 d^2}$, and using $2Td$ calls to $\tilde H$ with $T = (4\sqrt{d}\, L/\epsilon')^2$.

Blais et al. (2018) consider the same setup for the special case of submodular $H$, and use the cutting plane method of (Lee et al., 2015). Their runtime has a better dependence $O(\log(1/\epsilon'))$ on the error $\epsilon'$, but a worse dependence $O(d^3)$ on the dimension $d = |V|$, and their result needs oracle accuracy $\epsilon = O(\epsilon'^2/d^5)$. Hence, for large ground set sizes $d$, Proposition 3 is preferable. This proposition allows us, in particular, to handle multiplicative and additive noise in $H$.

**Proposition 4.** Let $\tilde H = \xi H$, where the noise $\xi \ge 0$ is bounded by $|\xi| \le \omega$ and is independently drawn from a distribution $\mathcal{D}$ with mean $\mu > 0$. We define the function $\tilde H_m$ as the mean of $m$ queries to $\tilde H(S)$. Then $\tilde H_m$ is an approximate oracle to $\mu H$. In particular, for every $\delta, \epsilon \in (0, 1)$, taking $m = (\omega H_{\max}/\epsilon)^2 \ln(1/\delta)$, where $H_{\max} = \max_{S \subseteq V} H(S)$, we have for every $S \subseteq V$, $|\tilde H_m(S) - \mu H(S)| \le \epsilon$ with probability at least $1 - \delta$.

Propositions 3 and 4 imply that by using PGM with $\tilde H_m$ and picking the superlevel set with the smallest $\tilde H_m$ value, we can find a set $\hat S$ such that $H(\hat S) \le F(S^*)/\alpha - \beta G(S^*) + \epsilon'$ with probability $1 - \delta'$, using $m = O\big( (\omega H_{\max} d / (\mu \epsilon'))^2 \ln( d^2/(\delta' \mu^2 \epsilon'^2) ) \big)$ samples, after $T = O\big( (\sqrt{d}\, \mu H_{\max}/\epsilon')^2 \big)$ iterations, with $O\big( (\omega H_{\max} d / (\mu \epsilon'))^4 \ln( d^2/(\delta' \mu^2 \epsilon'^2) ) \big)$ total calls to $\tilde H$. Note that $H_{\max}$ is upper bounded by $F(V)$. This result provides a theoretical upper bound on the number of samples needed to be robust to bounded multiplicative noise. Much fewer samples are actually needed in practice, as illustrated in our experiments (Section 5.1). Using similar arguments, our results also extend to additive noise oracles $\tilde H = H + \xi$.

### 3.4. Inapproximability Result

By Proposition 1, Problem (1) is equivalent to general set function minimization. Thus, solving it optimally or within any multiplicative approximation factor, i.e., $H(\hat S) \le \gamma(d) H(S^*)$ for some positive polynomial-time computable function $\gamma(d)$ of $d$, is NP-hard (Trevisan, 2004; Iyer & Bilmes, 2012). Moreover, in the value oracle model, it is impossible to obtain any multiplicative constant-factor approximation within a subexponential number of queries (Iyer & Bilmes, 2012). Hence, it is necessary to consider bicriteria-like approximation guarantees as we do. We now show that our approximation results are optimal: in the value oracle model, no algorithm with a subexponential number of queries can improve on the approximation guarantees achieved by PGM, even when $G$ is weakly DR-modular.

**Theorem 2.** For any $\alpha, \beta \in (0, 1]$ such that $\alpha\beta < 1$, $d > 2$ and $\delta > 0$, there are instances of Problem (1) such that no (deterministic or randomized) algorithm, using less than exponentially many queries, can always find a solution $S \subseteq V$ of expected value at most $F(S^*)/\alpha - \beta G(S^*) - \delta$.

**Proof sketch.**
Our proof technique is similar to (Feige et al., 2011): we randomly partition the ground set into $V = C \cup D$, and construct a normalized set function $H$ whose values depend only on $k(S) = |S \cap C|$ and $\ell(S) = |S \cap D|$:
$$H(S) = \begin{cases} 0 & \text{if } |k(S) - \ell(S)| \le \epsilon d, \\ \frac{2\alpha\delta}{2 - d} & \text{otherwise,} \end{cases}$$
for some $\epsilon \in [1/d, 1/2)$. We use Proposition 1 to decompose $H$ into the difference of a non-decreasing α-weakly DR-submodular function $F$, and a non-decreasing (α, β)-weakly DR-modular function $G$. We argue that, with probability $1 - 2\exp(-\epsilon^2 d/4)$, any given query $S$ will be "balanced", i.e., $|k(S) - \ell(S)| \le \epsilon d$. Hence no algorithm can distinguish between $H$ and the constant zero function with subexponentially many queries. On the other hand, we have $H(S^*) = \frac{2\alpha\delta}{2 - d} < 0$, achieved at $S^* = C$ or $D$, and $\frac{1}{\alpha} F(S^*) - \beta G(S^*) - \delta < 0$. Therefore, the algorithm cannot find a set with value $H(S) \le F(S^*)/\alpha - \beta G(S^*) - \delta$.

The approximation guarantees in Corollary 1 and 3 are thus optimal. In the above proof, $G$ belongs to the smaller class of weakly DR-modular functions, but $F$ does not necessarily. Whether the approximation guarantee can be improved when $F$ is also weakly DR-modular is left as an open question. Yet, the tightness result in Appendix B.4 implies that such an improvement cannot be achieved by PGM.

## 4. Applications

Several applications can benefit from the theory in this work. We discuss two examples here, where we show that the objective functions have the form of Problem (1), implying the first approximation guarantees for these problems. Other examples include column subset selection (Sviridenko et al., 2017) and Bayesian A-optimal experimental design (Bian et al., 2017), where $F$ is the cardinality function, and $G$ is weakly DR-supermodular with β depending on the inverse of the condition number of the data matrix.

### 4.1. Structured sparse learning

Structured sparse learning aims to estimate a sparse parameter vector whose support satisfies a particular structure, such as group-sparsity, clustering, tree-structure, or diversity (Obozinski & Bach, 2016; Kyrillidis et al., 2015). Such problems can be formulated as
$$\min_{x \in \mathbb{R}^d} \ell(x) + \lambda F(\mathrm{supp}(x)), \qquad (3)$$
where $\ell$ is a convex loss function and $F$ is a set function favoring the desirable supports. Existing convex methods propose to replace the discrete regularizer $F(\mathrm{supp}(x))$ by its closest convex relaxation (Bach, 2010; El Halabi & Cevher, 2015; Obozinski & Bach, 2016; El Halabi et al., 2018). For example, the cardinality regularizer $|\mathrm{supp}(x)|$ is replaced by the $\ell_1$-norm. This allows the use of standard convex optimization methods, but does not provide any approximation guarantee for the original objective function without statistical modeling assumptions. This approach is computationally feasible only when $F$ is submodular (Bach, 2010) or can be expressed as an integral linear program (El Halabi & Cevher, 2015). Alternatively, one may write Problem (3) as
$$\min_{S \subseteq V} H(S) = \lambda F(S) - G_\ell(S), \qquad (4)$$
where $G_\ell(S) = \ell(0) - \min_{\mathrm{supp}(x) \subseteq S} \ell(x)$ is a normalized non-decreasing set function. Recently, it was shown that if $\ell$ has restricted smoothness and strong convexity, $G_\ell$ is weakly modular (Elenberg et al., 2018; Bogunovic et al., 2018; Sakaue, 2018). This allows approximation guarantees of greedy algorithms to be applied to the constrained variant of Problem (3), but only for the special cases of a sparsity constraint (Das & Kempe, 2011; Elenberg et al., 2018) or some near-modular constraints (Sakaue, 2019).
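To make the reduction from Problem (3) to Problem (4) concrete, the sketch below evaluates $H(S) = \lambda F(S) - G_\ell(S)$ for the least-squares loss, computing $G_\ell(S) = \ell(0) - \min_{\mathrm{supp}(x) \subseteq S} \ell(x)$ by a restricted least-squares solve. The cardinality regularizer, the synthetic data, and all constants are placeholders for illustration; this is not the paper's experimental code.

```python
import numpy as np

def G_ell_least_squares(A, y, S):
    """G_ell(S) = ell(0) - min_{supp(x) in S} ell(x) for ell(x) = 0.5 * ||y - A x||^2."""
    ell0 = 0.5 * float(y @ y)                      # ell(0)
    if not S:
        return 0.0
    cols = sorted(S)
    xS, _, _, _ = np.linalg.lstsq(A[:, cols], y, rcond=None)
    r = y - A[:, cols] @ xS
    return ell0 - 0.5 * float(r @ r)

def H_sparse(A, y, S, lam=0.1, F=len):
    """Objective of Problem (4): H(S) = lam * F(S) - G_ell(S); F defaults to the cardinality |S|."""
    return lam * F(S) - G_ell_least_squares(A, y, S)

# toy usage on synthetic data (placeholder dimensions)
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[5:10] = 1.0                                 # an interval support
y = A @ x_true + 0.01 * rng.standard_normal(50)
print(H_sparse(A, y, frozenset({5, 6, 7, 8, 9})))
```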
In applications, however, the structure of interest is often better modeled by a non-modular regularizer $F$, which may be submodular (Bach, 2010) or non-submodular (El Halabi & Cevher, 2015; El Halabi et al., 2018). Weak modularity of $G_\ell$ is not enough to directly apply the result in Corollary 1, but if the loss function $\ell$ is smooth, strongly convex, and generated from random data, then we show that $G_\ell$ is also weakly DR-modular.

**Proposition 5.** Let $\ell(x) = L(x) - z^\top x$, where $L$ is smooth and strongly convex, and $z \in \mathbb{R}^d$ has a continuous density w.r.t. the Lebesgue measure. Then there exist $\alpha_G, \beta_G > 0$ such that $G_\ell$ is $(\alpha_G, \beta_G)$-weakly DR-modular, almost surely.

We prove Proposition 5 by first utilizing a result from (Elenberg et al., 2018), which relates the marginal gain of $G_\ell$ to the marginal decrease of $\ell$. We then argue that the minimizer of $\ell$, restricted to any given support, has full support with probability one, and thus has non-zero marginal decrease with probability one. The proof is given in Appendix C.1. The actual $\alpha_G, \beta_G$ parameters depend on the conditioning of $\ell$. Their positivity also relies on $z$ being random, typically data drawn from a distribution (Sakaue, 2018, Sect. A.1). In Section 5.2, we evaluate Proposition 5 empirically.

The approximation guarantee in Corollary 1 thus applies directly to Problem (4), whenever $\ell$ has the form in Proposition 5, and $F$ is α-weakly DR-submodular. For example, this holds when $\ell$ is the least squares loss with a nonsingular measurement matrix. Examples of structure-inducing regularizers $F$ include submodular regularizers (Bach, 2010), and non-submodular ones such as the range cost function (Bach, 2010; El Halabi et al., 2018) ($\alpha = \frac{1}{d-1}$), which favors interval supports, with applications in time-series and cancer diagnosis (Rapaport et al., 2008), and the cost function considered in (Sakaue, 2019) ($\alpha = \frac{1+a}{1+(b-a)}$, where $0 < 2a < b$ are cost parameters), which favors the selection of sparse and cheap features, with applications in healthcare.

### 4.2. Batch Bayesian optimization

The goal in batch Bayesian optimization is to optimize an unknown, expensive-to-evaluate noisy function $f$ with as few batches of function evaluations as possible (Desautels et al., 2014; González et al., 2016). For example, evaluations can correspond to performing expensive experiments. The evaluation points are chosen to maximize an acquisition function subject to a cardinality constraint. Several acquisition functions have been proposed for this purpose, amongst others the variance reduction function (Krause et al., 2008; Bogunovic et al., 2016). This function is used to maximally reduce the variance of the posterior distribution over potential maximizers of the unknown function.

Often, the unknown $f$ is modeled by a Gaussian process with zero mean and kernel function $k(x, x')$, and we observe noisy evaluations $y = f(x) + \epsilon$ of the function, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Given a set $X = \{x_1, \dots, x_d\}$ of potential maximizers of $f$, each $x_i \in \mathbb{R}^n$, and a set $S \subseteq V$, let $y_S = [y_i]_{i \in S}$ be the corresponding observations at points $x_i$, $i \in S$. The posterior distribution of $f$ given $y_S$ is again a Gaussian process, with variance
$$\sigma_S^2(x) = k(x, x) - k_S(x)^\top (K_S + \sigma^2 I)^{-1} k_S(x),$$
where $k_S(x) = [k(x_i, x)]_{i \in S}$, and $K_S = [k(x_i, x_j)]_{i,j \in S}$ is the corresponding submatrix of the positive definite kernel matrix $K$. The variance reduction function is defined as
$$G(S) = \sum_{i \in V} \big( \sigma^2(x_i) - \sigma_S^2(x_i) \big), \quad \text{where } \sigma^2(x_i) = k(x_i, x_i).$$
We show that the variance reduction function is weakly DR-modular.
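Before stating the guarantee, here is a small computational sketch of $G$ under the definitions above; the RBF kernel, the noise level, and the dimensions are arbitrary placeholders, not the authors' setup.

```python
import numpy as np

def variance_reduction(K, S, sigma2=0.1):
    """G(S) = sum_i [ k(x_i, x_i) - sigma_S^2(x_i) ] for a kernel matrix K over all d points."""
    if len(S) == 0:
        return 0.0
    idx = sorted(S)
    K_S = K[np.ix_(idx, idx)]                       # K_S
    K_VS = K[:, idx]                                # rows: k_S(x_i) for every i in V
    M = np.linalg.solve(K_S + sigma2 * np.eye(len(idx)), K_VS.T)
    # sigma_S^2(x_i) = K_ii - k_S(x_i)^T (K_S + sigma2 I)^{-1} k_S(x_i)
    post_var = np.diag(K) - np.einsum('ij,ji->i', K_VS, M)
    return float(np.sum(np.diag(K) - post_var))

# toy usage: RBF kernel on random points (placeholders)
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
print(variance_reduction(K, {0, 3, 5}))
```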
**Proposition 6.** The variance reduction function $G$ is non-decreasing $(\beta, \beta)$-weakly DR-modular, with
$$\beta = \left( \frac{\lambda_{\min}(K)}{\sigma^2 + \lambda_{\min}(K)} \right)^2 \frac{\lambda_{\min}(K)}{\lambda_{\max}(K)},$$
where $\lambda_{\max}(K)$ and $\lambda_{\min}(K)$ are the largest and smallest eigenvalues of $K$.

To prove Proposition 6, we show that $G$ can be written as a noisy column subset selection objective, and prove that such an objective function is weakly DR-modular, generalizing the result of (Sviridenko et al., 2017). The proof is given in Appendix C.2. The variance reduction function can thus be maximized with a greedy algorithm to a β-approximation (Sviridenko et al., 2017), which follows from a stronger notion of approximate modularity.

Maximizing the variance reduction may also be phrased as an instance of Problem (1), with $G$ being the variance reduction function, and $F(S) = \lambda|S|$ an item-wise cost. This formulation easily allows including nonlinear costs with a (weak) decrease in marginal costs (economies of scale). For example, in the sensor placement application, the cost of placing a sensor in a hazardous environment may diminish if other sensors are also placed in similar environments. Unlike previous works, the approximation guarantee in Corollary 1 still applies to such cost functions, while maintaining the β-approximation with respect to $G$.

## 5. Experiments

We empirically validate our results on noisy submodular minimization and structured sparse learning. In particular, we address the following questions: (1) How robust are different submodular minimization algorithms, including PGM, to multiplicative noise? (2) How well can PGM minimize a non-submodular objective? Do the parameters (α, β) accurately characterize its performance? All experiments were implemented in Matlab, and conducted on cluster nodes with 16 Intel Xeon E5 CPU cores and 64 GB RAM. Source code is available at https://github.com/marwash25/non-sub-min.

### 5.1. Noisy submodular minimization

First, we consider minimizing a submodular function $H$ given a noisy oracle $\tilde H = \xi H$, where ξ is independently drawn from a Gaussian distribution with mean one and standard deviation 0.1. We evaluate the performance of different submodular minimization algorithms on two example problems, minimum cut and clustering. We use the Matlab code from http://www.di.ens.fr/~fbach/submodular/, and compare seven algorithms: the minimum-norm-point algorithm (MNP) (Fujishige & Isotani, 2011), the conditional gradient method (Jaggi, 2013) with fixed step-size (CG-2/(t+2)) and with line search (CG-LS), PGM with fixed step-size (PGM-1/√t) and with the approximation of Polyak's rule (PGM-Polyak) (Bertsekas, 1995), the analytic center cutting plane method (Goffin & Vial, 1993) (ACCPM), and a variant of it that emulates the simplicial method (ACCPM-Kelley). We replace the true oracle for $H$ by the approximate oracle $\tilde H_m(S) = \frac{1}{m} \sum_{i=1}^{m} \xi_i H(S)$ for all these algorithms, and test them on two datasets: Genrmf-long, a min-cut/max-flow problem with $d = 575$ nodes and 2390 edges, and Two-moons, a synthetic semi-supervised clustering instance with $d = 400$ data points and 16 labeled points. We refer the reader to (Bach, 2013, Sect. 12.1) for more details about the algorithms and datasets. We stopped each algorithm after 1000 iterations for the first dataset and after 400 iterations for the second one, or once the approximate duality gap reached $10^{-8}$. To compute the optimal value $H(S^*)$, we use MNP with the noise-free oracle $H$.
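The multiplicative-noise oracle used in this experiment is a thin wrapper around the exact value oracle; a minimal sketch follows (the Gaussian mean and standard deviation match the setup above, while the toy cut function is a placeholder):

```python
import numpy as np

def make_noisy_oracle(H, m=50, mean=1.0, std=0.1, seed=0):
    """Return H_m(S) = (1/m) * sum_i xi_i * H(S), with xi_i drawn i.i.d. from N(mean, std^2)."""
    rng = np.random.default_rng(seed)

    def H_m(S):
        xi = rng.normal(mean, std, size=m)
        return float(np.mean(xi) * H(S))   # equals (1/m) * sum_i xi_i * H(S)

    return H_m

# usage with a toy cut-like function on V = {0,...,4} (placeholder)
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
H_cut = lambda S: sum((u in S) != (v in S) for u, v in edges)
H_noisy = make_noisy_oracle(H_cut, m=100)
print(H_cut({0, 1}), H_noisy({0, 1}))
```

Averaging $m$ draws implements $\tilde H_m$, whose deviation from the mean shrinks at the usual $O(1/\sqrt{m})$ rate, consistent with Proposition 4.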
Figure 2 shows the gap in discrete objective value for all algorithms on the two datasets, for an increasing number of samples $m$ (top), and for two fixed values of $m$, as a function of iterations (middle and bottom). We plot the best value achieved so far. As expected, the accuracy improves with more samples. In fact, this improvement is faster than the bounds in Proposition 4 and in (Blais et al., 2018). The objective values in the Two-moons data are smaller, which makes it easier to solve in the multiplicative noise setting (Prop. 4), as we indeed observe. Among the compared algorithms, ACCPM and MNP converge fastest, as also observed in (Bach, 2013) without noise, but they also seem to be the most sensitive to noise. In summary, these empirical results suggest that submodular minimization algorithms are indeed robust to noise, as predicted by our theory.

*Figure 2. Noisy submodular minimization results for Genrmf-long (right) and Two-moons (left) data: best achieved objective (log-scale) vs. number of samples (top); objective (log-scale) vs. iterations, for m = 50 (middle), m = 1000 (bottom-right), and m = 150 (bottom-left).*

### 5.2. Structured sparse learning

Our second set of experiments is structured sparse learning, where we aim to estimate a sparse parameter vector $x^\natural \in \mathbb{R}^d$ whose support is an interval. The range function $F_r(S) = \max(S) - \min(S) + 1$ if $S \neq \emptyset$, and $F_r(\emptyset) = 0$, is a natural regularizer to choose. $F_r$ is $\frac{1}{d-1}$-weakly DR-submodular (El Halabi et al., 2018). Another reasonable regularizer is the modified range function $F_{mr}(S) = d - 1 + F_r(S)$ for $S \neq \emptyset$, and $F_{mr}(\emptyset) = 0$, which is non-decreasing and submodular (Bach, 2010). As discussed in Section 4.1, no prior method provides a guaranteed approximate solution to Problem (3) with such regularizers, with the exception of some statistical assumptions, under which $x^\natural$ can be recovered using the tightest convex relaxation $\Theta_r$ of $F_r$ (El Halabi et al., 2018). Evaluating $\Theta_r$ involves a linear program with constraints corresponding to all possible interval sets. Such exhaustive search is not feasible in more complex settings.

*Figure 3. Structured sparsity results: support and estimation errors (log-scale) vs. measurement ratio (top); objective and corresponding $\alpha_T, \beta_T$ parameters vs. regularization parameter for n = 112 (middle) and n = 306 (bottom).*

We consider a simple linear regression setting in which $x^\natural \in \mathbb{R}^d$ has $k$ consecutive ones and is zero otherwise. We observe $y = A x^\natural + \epsilon$, where $A \in \mathbb{R}^{n \times d}$ is an i.i.d. Gaussian matrix with normalized columns, and $\epsilon \in \mathbb{R}^n$ is an i.i.d. Gaussian noise vector with standard deviation 0.01. We set $d = 250$, $k = 20$ and vary the number of measurements $n$ between $d/4$ and $2d$. We compare the solutions obtained by minimizing the least squares loss $\ell(x) = \frac{1}{2}\|y - Ax\|_2^2$ with the three regularizers: the range function $F_r$, where $H$ is optimized via exhaustive search (OPT-Range) or via PGM (PGM-Range); the modified range function $F_{mr}$, solved via exhaustive search (OPT-Mod Range) or via PGM (PGM-Mod Range); and the convex relaxation $\Theta_r$ (CR-Range), solved using CVX (Grant & Boyd, 2014).
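The two discrete regularizers compared here are straightforward to state in code; a small sketch (0-based indices, illustrative only) that also shows where $F_r$'s weak DR-submodularity constant $\alpha = \frac{1}{d-1}$ comes from:

```python
def range_cost(S):
    """F_r(S) = max(S) - min(S) + 1 for S != {}, and F_r({}) = 0; favors interval supports."""
    return 0 if not S else max(S) - min(S) + 1

def modified_range_cost(S, d):
    """F_mr(S) = d - 1 + F_r(S) for S != {}, and F_mr({}) = 0; non-decreasing and submodular."""
    return 0 if not S else d - 1 + range_cost(S)

# F_r violates diminishing returns at the empty set: adding element 9 to {}
# gains 1, while adding it to {0} gains 9 = d - 1, so the DR inequality
# F_r(i|A) >= alpha * F_r(i|B) only holds with alpha = 1/(d-1).
d = 10
print(range_cost({9}) - range_cost(set()))        # 1
print(range_cost({0, 9}) - range_cost({0}))       # 9
```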
The marginal gains of $G_\ell$ can be efficiently computed using rank-1 updates of the pseudo-inverse (Meyer, 1973). Figure 3 (top) displays the best achieved support error, in Hamming distance, and estimation error $\|\hat x - x^\natural\|_2 / \|x^\natural\|_2$ on the regularization path, where λ was varied between $10^{-4}$ and $10$. Figure 3 (middle and bottom) illustrates the objective value $H = \lambda F_r - G_\ell$ for PGM-Range, CR-Range, and OPT-Range, and $H = \lambda F_{mr} - G_\ell$ for PGM-Mod Range and OPT-Mod Range, and the corresponding parameters $\alpha_T, \beta_T$ defined in Remark 1, for two fixed values of $n$. Results are averaged over 5 runs.

We observe that PGM minimizes the objective with $F_{mr}$ almost exactly as $n$ grows. It performs a bit worse with $F_r$, which is expected since $F_r$ is not submodular. This is also reflected in the support and estimation errors. Moreover, $\alpha_T, \beta_T$ here reasonably predict the performance of PGM; larger values correlate with closer-to-optimal objective values. They are also more accurate than the worst-case α, β in Definition 1. Indeed, the $\alpha_T$ for the range function is much larger than the worst case $\frac{1}{d-1}$. Similarly, $\beta_T$ for $G_\ell$ is quite large and approaches 1 as $n$ grows, while in Proposition 5 the worst-case β is only guaranteed to be non-zero when $\ell$ is strongly convex. Finally, the convex approach with $\Theta_r$ essentially matches the performance of OPT-Range when $n \ge d$. In this regime, $G_\ell$ becomes nearly modular, hence the convex objective $\ell + \lambda \Theta_r$ starts approximating the convex closure of $\lambda F_r - G_\ell$.

## 6. Conclusion

We established new links between approximate submodularity and convexity, and used them to analyze the performance of PGM for unconstrained, possibly noisy, non-submodular minimization. This yielded the first approximation guarantee for this problem, with a matching lower bound establishing its optimality. We experimentally validated our theory, and illustrated the robustness of submodular minimization algorithms to noise and non-submodularity.

## Acknowledgments

This research was supported by a DARPA D3M award, NSF CAREER award 1553284, and NSF award 1717610. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense. The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper.

## References

Axelrod, B., Liu, Y. P., and Sidford, A. Near-optimal approximate discrete and continuous submodular function minimization. arXiv preprint arXiv:1909.00171, 2019.

Bach, F. Structured sparsity-inducing norms through submodular functions. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 118–126, 2010.

Bach, F. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.

Bai, W., Iyer, R., Wei, K., and Bilmes, J. Algorithms for optimizing the ratio of submodular functions. In Proceedings of the International Conference on Machine Learning, pp. 2751–2759, 2016.

Bertsekas, D. P. Nonlinear programming. Athena Scientific, 1995.

Bian, A. A., Buhmann, J. M., Krause, A., and Tschiatschek, S. Guarantees for greedy maximization of non-submodular functions with applications.
In Proceedings of the International Conference on Machine Learning, volume 70, pp. 498–507. JMLR.org, 2017.

Blais, E., Canonne, C. L., Eden, T., Levi, A., and Ron, D. Tolerant junta testing and the connection to submodular optimization and function isomorphism. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 2113–2132. Society for Industrial and Applied Mathematics, 2018.

Bogunovic, I., Scarlett, J., Krause, A., and Cevher, V. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 1507–1515, 2016.

Bogunovic, I., Zhao, J., and Cevher, V. Robust maximization of non-submodular objectives. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 890–899. PMLR, 09–11 Apr 2018. URL http://proceedings.mlr.press/v84/bogunovic18a.html.

Boykov, Y. and Kolmogorov, V. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis & Machine Intelligence, 26(9):1124–1137, 2004.

Bubeck, S. Theory of convex optimization for machine learning. arXiv preprint arXiv:1405.4980, 15, 2014.

Chakrabarty, D., Lee, Y. T., Sidford, A., and Wong, S. C.-w. Subquadratic submodular function minimization. In Proceedings of the ACM SIGACT Symposium on Theory of Computing, STOC 2017, pp. 1220–1231, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4528-6. doi: 10.1145/3055399.3055419. URL http://doi.acm.org/10.1145/3055399.3055419.

Chen, L., Feldman, M., and Karbasi, A. Weakly submodular maximization beyond cardinality constraints: Does randomization help greedy? Proceedings of the International Conference on Machine Learning, 2017.

Conforti, M. and Cornuéjols, G. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.

Cunningham, W. H. Decomposition of submodular functions. Combinatorica, 3(1):53–68, 1983.

Das, A. and Kempe, D. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. arXiv preprint arXiv:1102.3975, 2011.

Desautels, T., Krause, A., and Burdick, J. W. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15:4053–4103, 2014. URL http://jmlr.org/papers/v15/desautels14a.html.

Dughmi, S. Submodular functions: Extensions, distributions, and algorithms. A survey. arXiv preprint arXiv:0912.0322, 2009.

Edmonds, J. Submodular functions, matroids, and certain polyhedra. In Combinatorial Optimization – Eureka, You Shrink!, pp. 11–26. Springer, 2003.

El Halabi, M. Learning with Structured Sparsity: From Discrete to Convex and Back. PhD thesis, Ecole Polytechnique Fédérale de Lausanne, 2018.

El Halabi, M. and Cevher, V. A totally unimodular view of structured sparsity. Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 223–231, 2015.

El Halabi, M., Bach, F., and Cevher, V. Combinatorial penalties: Structure preserved by convex relaxations. Proceedings of the International Conference on Artificial Intelligence and Statistics, 2018.

Elenberg, E. R., Khanna, R., Dimakis, A.
G., Negahban, S., et al. Restricted strong convexity implies weak submodularity. The Annals of Statistics, 46(6B):3539–3568, 2018.

Feige, U., Mirrokni, V. S., and Vondrák, J. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153, 2011.

Fujishige, S. and Isotani, S. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7(1):3–17, 2011.

Gatmiry, K. and Gomez-Rodriguez, M. Non-submodular function maximization subject to a matroid constraint, with applications. CoRR, abs/1811.07863, 2019. URL http://arxiv.org/abs/1811.07863.

Goffin, J. L. and Vial, J. P. On the computation of weighted analytic centers and dual ellipsoids with the projective algorithm. Mathematical Programming, 60(1):81–92, Jun 1993. ISSN 1436-4646. doi: 10.1007/BF01580602. URL https://doi.org/10.1007/BF01580602.

González, J., Dai, Z., Hennig, P., and Lawrence, N. Batch Bayesian optimization via local penalization. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 648–657, 2016.

Grant, M. and Boyd, S. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.

Harshaw, C., Feldman, M., Ward, J., and Karbasi, A. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2634–2643. PMLR, 09–15 Jun 2019. URL http://proceedings.mlr.press/v97/harshaw19a.html.

Hassidim, A. and Singer, Y. Submodular optimization under noise. In Kale, S. and Shamir, O. (eds.), Proceedings of the Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pp. 1069–1122, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR. URL http://proceedings.mlr.press/v65/hassidim17a.html.

Hassidim, A. and Singer, Y. Optimization for approximate submodularity. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 394–405. Curran Associates Inc., 2018.

Horel, T. and Singer, Y. Maximization of approximately submodular functions. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Proceedings of the International Conference on Neural Information Processing Systems, pp. 3045–3053. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6236-maximization-of-approximately-submodular-functions.pdf.

Iyer, R. and Bilmes, J. Algorithms for approximate minimization of the difference between submodular functions, with applications. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI'12, pp. 407–417, Arlington, Virginia, United States, 2012. AUAI Press. ISBN 978-0-9749039-8-9. URL http://dl.acm.org/citation.cfm?id=3020652.3020697.

Iyer, R., Jegelka, S., and Bilmes, J. Monotone closure of relaxed constraints in submodular optimization: Connections between minimization and maximization. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI'14, pp. 360–369, Arlington, Virginia, United States, 2014. AUAI Press. ISBN 978-0-9749039-1-0. URL http://dl.acm.org/citation.cfm?id=3020751.3020789.

Iyer, R. K., Jegelka, S., and Bilmes, J. A. Curvature and optimal algorithms for learning and minimizing submodular functions.
In Proceedings of the International Conference on Neural Information Processing Systems, pp. 2742–2750, 2013.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the International Conference on Machine Learning, pp. 427–435, 2013.

Kawahara, Y., Iyer, R., and Bilmes, J. On approximate non-submodular minimization via tree-structured supermodularity. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 444–452, 2015.

Krause, A., Singh, A., and Guestrin, C. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284, 2008.

Kuhnle, A., Smith, J. D., Crawford, V. G., and Thai, M. T. Fast maximization of non-submodular, monotonic functions on the integer lattice. arXiv preprint arXiv:1805.06990, 2018.

Kyrillidis, A., Baldassarre, L., El Halabi, M., Tran-Dinh, Q., and Cevher, V. Structured Sparsity: Discrete and Convex Approaches, pp. 341–387. Springer International Publishing, Cham, 2015. ISBN 978-3-319-16042-9. doi: 10.1007/978-3-319-16042-9_12. URL https://doi.org/10.1007/978-3-319-16042-9_12.

Lee, Y. T., Sidford, A., and Wong, S. C.-w. A faster cutting plane method and its implications for combinatorial and convex optimization. In IEEE Annual Symposium on Foundations of Computer Science, pp. 1049–1065. IEEE, 2015.

Lehmann, B., Lehmann, D., and Nisan, N. Combinatorial auctions with decreasing marginal utilities. Games and Economic Behavior, 55(2):270–296, 2006.

Lin, H. and Bilmes, J. An application of the submodular principal partition to training data subset selection. In NIPS Workshop on Discrete Optimization in Machine Learning, 2010.

Lovász, L. Submodular functions and convexity, pp. 235–257. Springer Berlin Heidelberg, Berlin, Heidelberg, 1983. ISBN 978-3-642-68874-4. doi: 10.1007/978-3-642-68874-4_10. URL https://doi.org/10.1007/978-3-642-68874-4_10.

Meyer, Jr., C. D. Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics, 24(3):315–323, 1973.

Narasimhan, M., Jojic, N., and Bilmes, J. A. Q-clustering. In Weiss, Y., Schölkopf, B., and Platt, J. C. (eds.), Proceedings of the International Conference on Neural Information Processing Systems, pp. 979–986. MIT Press, 2006. URL http://papers.nips.cc/paper/2760-q-clustering.pdf.

Obozinski, G. and Bach, F. A unified perspective on convex structured sparsity: Hierarchical, symmetric, submodular norms and beyond. Working paper or preprint, December 2016. URL https://hal-enpc.archives-ouvertes.fr/hal-01412385.

Qian, C., Shi, J.-C., Yu, Y., Tang, K., and Zhou, Z.-H. Optimizing ratio of monotone set functions. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 2606–2612. AAAI Press, 2017a. ISBN 978-0-9992411-0-3. URL http://dl.acm.org/citation.cfm?id=3172077.3172251.

Qian, C., Shi, J.-C., Yu, Y., Tang, K., and Zhou, Z.-H. Subset selection under noise. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 3560–3570, 2017b.

Rapaport, F., Barillot, E., and Vert, J. Classification of arrayCGH data using fused SVM. Bioinformatics, 24(13):i375–i382, 2008.

Sakaue, S. Weakly modular maximization: Applications, hardness, tractability, and efficient algorithms. arXiv preprint arXiv:1805.11251, 2018.

Sakaue, S. Greedy and IHT algorithms for non-convex optimization with monotone costs of non-zeros. In Chaudhuri, K. and Sugiyama, M.
(eds.), Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 206–215. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/sakaue19a.html.

Singla, A., Tschiatschek, S., and Krause, A. Noisy submodular maximization via adaptive sampling with applications to crowdsourced image collection summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2037–2043. AAAI Press, 2016. URL http://dl.acm.org/citation.cfm?id=3016100.3016183.

Sviridenko, M., Vondrák, J., and Ward, J. Optimal approximation for submodular and supermodular optimization with bounded curvature. Mathematics of Operations Research, 42(4):1197–1218, 2017.

Svitkina, Z. and Fleischer, L. Submodular approximation: Sampling-based algorithms and lower bounds. SIAM Journal on Computing, 40(6):1715–1737, 2011.

Trevisan, L. Inapproximability of combinatorial optimization problems. arXiv preprint cs/0409043, 2004.

Vondrák, J. Submodularity in combinatorial optimization. PhD thesis, Charles University, 2007.

Wang, Y.-J., Xu, D.-C., Jiang, Y.-J., and Zhang, D.-M. Minimizing ratio of monotone non-submodular functions. Journal of the Operations Research Society of China, Mar 2019. ISSN 2194-6698. doi: 10.1007/s40305-019-00244-1. URL https://doi.org/10.1007/s40305-019-00244-1.