# Difference of Submodular Minimization via DC Programming

Marwa El Halabi 1 George Orfanides 2 Tim Hoheisel 2

Minimizing the difference of two submodular (DS) functions is a problem that naturally occurs in various machine learning problems. Although it is well known that a DS problem can be equivalently formulated as the minimization of the difference of two convex (DC) functions, existing algorithms do not fully exploit this connection. A classical algorithm for DC problems is called the DC algorithm (DCA). We introduce variants of DCA and its complete form (CDCA) that we apply to the DC program corresponding to DS minimization. We extend existing convergence properties of DCA, and connect them to convergence properties on the DS problem. Our results on DCA match the theoretical guarantees satisfied by existing DS algorithms, while providing a more complete characterization of convergence properties. In the case of CDCA, we obtain a stronger local minimality guarantee. Our numerical results show that our proposed algorithms outperform existing baselines on two applications: speech corpus selection and feature selection.

1. Introduction

We study the difference of submodular (DS) functions minimization problem

min_{X ⊆ V} F(X) := G(X) − H(X),   (1)

where G and H are normalized submodular functions (see Section 2 for definitions). We denote the minimum of (1) by F^*. Submodular functions are set functions that satisfy a diminishing returns property, which naturally occurs in a variety of machine learning applications. Many of these applications involve DS minimization, such as feature selection, probabilistic inference (Iyer & Bilmes, 2012), learning

1Samsung - SAIT AI Lab, Montreal. 2Department of Mathematics and Statistics, McGill University. Correspondence to: Marwa El Halabi .

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023.
Copyright 2023 by the author(s).

discriminatively structured graphical models (Narasimhan & Bilmes, 2005), and learning decision rule sets (Yang et al., 2021). In fact, this problem is ubiquitous, as any set function can be expressed as a DS function, though finding a DS decomposition has exponential complexity in general (Narasimhan & Bilmes, 2005; Iyer & Bilmes, 2012). Unlike submodular functions, which can be minimized in polynomial time, minimizing DS functions up to any constant multiplicative approximation factor requires exponential time, and obtaining any positive polynomial-time computable multiplicative approximation is NP-hard (Iyer & Bilmes, 2012, Theorems 5.1 and 5.2). Even finding a local minimum (see Definition 2.1) of DS functions is PLS-complete (Iyer & Bilmes, 2012, Section 5.3).

DS minimization was first studied in (Narasimhan & Bilmes, 2005), who proposed the submodular-supermodular (SubSup) procedure, an algorithm inspired by the convex-concave procedure (Yuille & Rangarajan, 2001), which monotonically reduces the objective function at every step and converges to a local minimum. Iyer & Bilmes (2012) extended the work of (Narasimhan & Bilmes, 2005) by proposing two other algorithms, the supermodular-submodular (SupSub) and the modular-modular (ModMod) procedures, which have lower per-iteration cost than the SubSup method, while satisfying the same theoretical guarantees.

The DS problem can be equivalently formulated as a difference of convex (DC) functions minimization problem (see Section 2). DC programs are well-studied problems for which a classical and popular algorithm is the DC algorithm (DCA) (Pham Dinh & Le Thi, 1997; Pham Dinh & Souad, 1988).
DCA has been successfully applied to a wide range of non-convex optimization problems, and several algorithms can be viewed as special cases of it, such as the convex-concave procedure, the expectation-maximization algorithm (Dempster et al., 1977), and the iterative shrinkage-thresholding algorithm (Chambolle et al., 1998); see (Le Thi & Pham Dinh, 2018) for an extensive survey on DCA. Existing DS algorithms, while inspired by DCA, do not fully exploit this connection to DC programming.

In this paper, we apply DCA and its complete form (CDCA) to the DC program equivalent to the DS problem. We establish new connections between the two problems which allow us to leverage convergence properties of DCA to obtain theoretical guarantees on the DS problem that match the ones satisfied by existing methods, and stronger ones when using CDCA. In particular, our key contributions are:

- We show that a special instance of DCA and CDCA, where iterates are integral, monotonically decreases the DS function value at every iteration, and converges with rate O(1/k) to a local minimum and strong local minimum (see Definition 2.1) of the DS problem, respectively. DCA reduces to SubSup in this case.
- We introduce variants of DCA and CDCA, where iterates are rounded at each iteration, which allow us to add regularization. We extend the convergence properties of DCA and CDCA to these variants.
- CDCA requires solving a concave minimization subproblem at each iteration. We show how to efficiently obtain an approximate stationary point of this subproblem using the Frank-Wolfe (FW) algorithm.
- We study the effect of adding regularization both theoretically and empirically.
- We demonstrate that our proposed methods outperform existing baselines empirically on two applications: speech corpus selection and feature selection.

1.1.
Additional related work

An accelerated variant of DCA (ADCA), which incorporates Nesterov's acceleration into DCA, was presented in (Nhat et al., 2018). We investigate the effect of acceleration in our experiments (Section 5). Kawahara & Washio (2011) proposed an exact branch-and-bound algorithm for DS minimization, which has exponential time complexity. Maehara & Murota (2015) proposed a discrete analogue of the continuous DCA for minimizing the difference of discrete convex functions, of which DS minimization is a special case, where the proposed algorithm reduces to SubSup. Several works studied a special case of the DS problem where G is modular (Sviridenko et al., 2017; Feldman, 2019; Harshaw et al., 2019), or approximately modular (Perrault et al., 2021), providing approximation guarantees based on greedy algorithms. El Halabi & Jegelka (2020) provided approximation guarantees for the related problem of minimizing the difference between an approximately submodular function and an approximately supermodular function. In this work, we focus on general DS minimization; we discuss some implications of our results for certain special cases in Appendix H.

2. Preliminaries

We begin by introducing our notation and relevant background on DS and DC minimization.

Notation: Given a ground set V = {1, …, d} and a set function F : 2^V → ℝ, we denote the marginal gain of adding an element i to a set X ⊆ V by F(i | X) = F(X ∪ {i}) − F(X). The indicator vector 1_X ∈ ℝ^d is the vector whose i-th entry is 1 if i ∈ X and 0 otherwise. Let S_d denote the set of permutations on V. Given σ ∈ S_d, set S^σ_k := {σ(1), …, σ(k)}, with S^σ_0 = ∅. The symmetric difference of two sets X, Y is denoted by X △ Y = (X \ Y) ∪ (Y \ X). Denote by Γ_0 the set of all proper lower semicontinuous convex functions on ℝ^d. We write ℝ̄ for ℝ ∪ {+∞}. Given a set C ⊆ ℝ^d, δ_C denotes the indicator function of C, taking value 0 on C and +∞ outside it. Throughout, ‖·‖ denotes the ℓ_2-norm.
DS minimization: A set function F is normalized if F(∅) = 0 and non-decreasing if F(X) ≤ F(Y) for all X ⊆ Y. F is submodular if it has diminishing marginal gains: F(i | X) ≥ F(i | Y) for all X ⊆ Y, i ∈ V \ Y; supermodular if −F is submodular; and modular if it is both submodular and supermodular. Given a vector x ∈ ℝ^d, x defines a modular set function as x(A) = Σ_{i ∈ A} x_i. Note that minimizing the difference of two submodular functions is equivalent to maximizing the difference of two submodular functions, and to minimizing or maximizing the difference of two supermodular functions.

Given the inapproximability of Problem (1), we are interested in obtaining approximate local minimizers.

Definition 2.1. Given ϵ ≥ 0, a set X ⊆ V is an ϵ-local minimum of F if F(X) ≤ F(X ∪ {i}) + ϵ for all i ∈ V \ X, and F(X) ≤ F(X \ {i}) + ϵ for all i ∈ X. Moreover, X is an ϵ-strong local minimum of F if F(X) ≤ F(Y) + ϵ for all Y ⊆ X and all Y ⊇ X.

In Appendix H, we show that if F is submodular then any ϵ-strong local minimum X̂ of F is also an ϵ-global minimum, i.e., F(X̂) ≤ F^* + ϵ. It was also shown in (Feige et al., 2011, Theorem 3.4) that if F is supermodular then any ϵ-strong local minimum X̂ satisfies min{F(X̂), F(V \ X̂)} ≤ F^*/3 + ϵ. We further show relaxed versions of these properties for approximately submodular and supermodular functions in Appendix H. Moreover, the two notions of approximate local minimality are similar if F is supermodular: any ϵ-local minimum of F is also an ϵd-strong local minimum of F (Feige et al., 2011, Lemma 3.3). However, in general, a local minimum can have an arbitrarily worse objective value than any strong local minimum, as illustrated in Example G.2.

Minimizing a set function F is equivalent to minimizing a continuous extension of F, called the Lovász extension (Lovász, 1983), on the hypercube [0, 1]^d.

Definition 2.2 (Lovász extension).
Given a normalized set function F, its Lovász extension f_L : ℝ^d → ℝ is defined as follows: given x ∈ ℝ^d and σ ∈ S_d with x_{σ(1)} ≥ … ≥ x_{σ(d)},

f_L(x) := Σ_{k=1}^d x_{σ(k)} F(σ(k) | S^σ_{k−1}).

We make use of the following well-known properties of the Lovász extension; see, e.g., (Bach, 2013), and (Jegelka & Bilmes, 2011, Lemma 1) for item g.

Proposition 2.3. For a normalized set function F, we have:

a) For all X ⊆ V, F(X) = f_L(1_X).

b) If F = G − H, then f_L = g_L − h_L.

c) min_{X ⊆ V} F(X) = min_{x ∈ [0,1]^d} f_L(x).

d) Rounding: Given x ∈ [0, 1]^d and σ ∈ S_d such that x_{σ(1)} ≥ … ≥ x_{σ(d)}, let k̂ ∈ argmin_{k = 0, 1, …, d} F(S^σ_k); then F(S^σ_{k̂}) ≤ f_L(x). We denote this operation by S^σ_{k̂} = Round_F(x).

e) f_L is convex if and only if F is submodular.

f) Let F be submodular and define its base polyhedron B(F) := {s ∈ ℝ^d | s(V) = F(V), s(A) ≤ F(A) for all A ⊆ V}. Greedy algorithm: Given x ∈ ℝ^d and σ ∈ S_d such that x_{σ(1)} ≥ … ≥ x_{σ(d)}, define y_{σ(k)} = F(σ(k) | S^σ_{k−1}); then y is a maximizer of max_{s ∈ B(F)} ⟨x, s⟩, f_L is the support function of B(F), i.e., f_L(x) = max_{s ∈ B(F)} ⟨x, s⟩, and y is a subgradient of f_L at x.

g) If F is submodular, then f_L is κ-Lipschitz, i.e., |f_L(x) − f_L(y)| ≤ κ‖x − y‖ for all x, y ∈ ℝ^d, with κ = 3 max_{X ⊆ V} |F(X)|. If F is also non-decreasing, then κ = F(V).

These properties imply that Problem (1) is equivalent to

min_{x ∈ [0,1]^d} f_L(x) = g_L(x) − h_L(x),   (2)

with g_L, h_L ∈ Γ_0. In particular, if X^* is a minimizer of (1), then 1_{X^*} is a minimizer of (2), and if x^* is a minimizer of (2), then Round_F(x^*) is a minimizer of (1).

DC programming: For a function f : ℝ^d → ℝ̄, its domain is defined as dom f = {x ∈ ℝ^d | f(x) < +∞}, and its Fenchel conjugate as f^*(y) = sup_{x ∈ ℝ^d} ⟨x, y⟩ − f(x). For ρ ≥ 0, f is ρ-strongly convex if f − (ρ/2)‖·‖² is convex. We denote by ρ(f) the supremum over such values ρ. We say that f is locally polyhedral convex if every point in its epigraph has a relative polyhedral neighbourhood (Durier, 1988).
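As a concrete illustration of Proposition 2.3, the greedy algorithm of item f and the rounding operation of item d can be sketched as follows. This is a toy implementation with names of our choosing; it is not code from the paper.

```python
import numpy as np

def lovasz_greedy(F, x):
    """Greedy algorithm (Prop. 2.3-f): returns f_L(x) = <x, y> and a
    subgradient y of f_L at x, where y[sigma(k)] = F(sigma(k) | S^sigma_{k-1})."""
    d = len(x)
    sigma = np.argsort(-x)          # a decreasing order of x
    y = np.empty(d)
    S, prev = set(), 0.0
    for k in sigma:
        S.add(int(k))
        val = F(frozenset(S))
        y[k] = val - prev           # marginal gain along the chain
        prev = val
    return float(x @ y), y

def round_F(F, x):
    """Rounding (Prop. 2.3-d): best set on the chain, F(S) <= f_L(x)."""
    sigma = np.argsort(-x)
    best, best_val = frozenset(), F(frozenset())
    S = set()
    for k in sigma:
        S.add(int(k))
        v = F(frozenset(S))
        if v < best_val:
            best, best_val = frozenset(S), v
    return best

# toy submodular F(X) = sqrt(|X|); at x = 1_{0,1}, f_L(x) = F({0,1}) = sqrt(2)
F = lambda X: len(X) ** 0.5
x = np.array([1.0, 1.0, 0.0, 0.0])
fl, y = lovasz_greedy(F, x)
```

Here `round_F` scans the whole chain S^σ_0 ⊆ … ⊆ S^σ_d, which costs d + 1 evaluations of F, matching the O(d log d + d·EO_F) cost stated later for subgradient computations.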
For a convex function f, ϵ ≥ 0 and x^0 ∈ dom f, the ϵ-subdifferential of f at x^0 is defined by ∂_ϵ f(x^0) = {y ∈ ℝ^d | f(x) ≥ f(x^0) + ⟨y, x − x^0⟩ − ϵ for all x ∈ ℝ^d}, while ∂f(x^0) stands for the exact subdifferential (ϵ = 0). We use the same notation to denote the ϵ-superdifferential of a concave function f at x^0, defined by ∂_ϵ f(x^0) = {y ∈ ℝ^d | f(x) ≤ f(x^0) + ⟨y, x − x^0⟩ + ϵ for all x ∈ ℝ^d}. We also define dom ∂_ϵ f = {x ∈ ℝ^d | ∂_ϵ f(x) ≠ ∅}. The ϵ-subdifferential of a function f ∈ Γ_0 and that of its conjugate f^* have the following relation (Hiriart-Urruty & Lemaréchal, 1993, Part II, Proposition 1.2.1).

Proposition 2.4. For any f ∈ Γ_0 and ϵ ≥ 0, we have y ∈ ∂_ϵ f(x) ⟺ f^*(y) + f(x) − ⟨y, x⟩ ≤ ϵ ⟺ x ∈ ∂_ϵ f^*(y).

A general DC program takes the form

min_{x ∈ ℝ^d} f(x) := g(x) − h(x),   (3)

where g, h ∈ Γ_0. We assume throughout the paper that the minimum of (3) is finite and denote it by f^*. The DC dual of (3) is given by (Pham Dinh & Le Thi, 1997)

f^* = min_{y ∈ ℝ^d} h^*(y) − g^*(y).   (4)

The main idea of DCA is to approximate h at each iteration k by its affine minorization h(x^k) + ⟨y^k, x − x^k⟩, with y^k ∈ ∂h(x^k), and minimize the resulting convex function. DCA can also be viewed as a primal-dual subgradient method. We give in Algorithm 1 an approximate version of DCA with inexact iterates. Note that ∂g^*(y^k) = argmin_{x ∈ ℝ^d} g(x) − ⟨y^k, x⟩, and any ϵ_x-solution x^{k+1} to this problem will satisfy x^{k+1} ∈ ∂_{ϵ_x} g^*(y^k), by Proposition 2.4.

Algorithm 1 Approximate DCA
1: ϵ, ϵ_x, ϵ_y ≥ 0, x^0 ∈ dom ∂g, k ← 0.
2: while f(x^k) − f(x^{k+1}) > ϵ do
3:   y^k ∈ ∂_{ϵ_y} h(x^k)
4:   x^{k+1} ∈ ∂_{ϵ_x} g^*(y^k)
5:   k ← k + 1
6: end while

The following lemma, which follows from Proposition 2.4, provides a sufficient condition for DCA to be well defined, i.e., one can construct the sequences {x^k} and {y^k} from an arbitrary initial point x^0 ∈ dom ∂g.

Lemma 2.5. DCA is well defined if dom ∂g ⊆ dom ∂h and dom ∂h^* ⊆ dom ∂g^*.

Since Problem (3) is non-convex, we are interested in notions of approximate stationarity.

Definition 2.6. For ϵ, ϵ_1, ϵ_2 ≥ 0, a point x is an (ϵ_1, ϵ_2)-critical point of g − h if ∂_{ϵ_1} g(x) ∩ ∂_{ϵ_2} h(x) ≠ ∅. Moreover, x is an ϵ-strong critical point if ∂h(x) ⊆ ∂_ϵ g(x).
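Algorithm 1 can be sketched on a toy one-dimensional DC decomposition. The example below is ours, not the paper's: we take f(x) = x² − 2|x| with g(x) = x², h(x) = 2|x|, so the subdifferential of h and the conjugate step ∂g^*(y) = argmin_x g(x) − yx = y/2 are available in closed form.

```python
# Minimal sketch of Algorithm 1 with exact iterates (eps_x = eps_y = 0);
# subgrad_h and argmin_g_linear are user-supplied oracles.

def dca(subgrad_h, argmin_g_linear, x0, f, tol=1e-9, max_iter=100):
    x = x0
    fx = f(x)
    for _ in range(max_iter):
        y = subgrad_h(x)               # y^k in dh(x^k)
        x_next = argmin_g_linear(y)    # x^{k+1} in dg*(y^k) = argmin_x g(x) - <y, x>
        f_next = f(x_next)
        if fx - f_next <= tol:         # stopping rule f(x^k) - f(x^{k+1}) <= eps
            return x_next
        x, fx = x_next, f_next
    return x

# toy DC problem f = g - h with g(x) = x^2, h(x) = 2|x| (critical points at +/-1)
f = lambda x: x ** 2 - 2 * abs(x)
x_star = dca(subgrad_h=lambda x: 2.0 if x >= 0 else -2.0,
             argmin_g_linear=lambda y: y / 2.0,   # argmin_x x^2 - yx
             x0=0.3, f=f)
```

Starting from x^0 = 0.3, the iterates jump to the critical point x = 1 with f(1) = −1 in one step, illustrating the monotone decrease of Theorem 3.1-a.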
Note that the two notions of criticality are equivalent when h is differentiable and ϵ_1 = ϵ, ϵ_2 = 0. The following proposition provides necessary and sufficient conditions for approximate local optimality based on approximate criticality.

Proposition 2.7. Let g, h ∈ Γ_0 and ϵ ≥ 0. Then we have:

a) Let x̂, x be two points satisfying ∂_{ϵ_1} g(x̂) ∩ ∂_{ϵ_2} h(x) ≠ ∅, for some ϵ_1, ϵ_2 ≥ 0 such that ϵ_1 + ϵ_2 = ϵ; then g(x̂) − h(x̂) ≤ g(x) − h(x) + ϵ. Moreover, if x̂ admits a neighbourhood U such that ∂_{ϵ_1} g(x̂) ∩ ∂_{ϵ_2} h(x) ≠ ∅ for all x ∈ U ∩ dom g, then x̂ is an ϵ-local minimum of g − h. Conversely, if x̂ is an ϵ-local minimum of g − h, then it is also an ϵ-strong critical point of g − h.

b) If h is locally polyhedral convex, then x̂ is an ϵ-local minimum of g − h if and only if it is an ϵ-strong critical point of g − h.

Proof sketch. This extends the conditions for ϵ = 0 in (Le Thi & Pham Dinh, 1997, Theorem 4 and Corollary 2) and (Hiriart-Urruty, 1989, Proposition 3.1) to ϵ ≥ 0. The proof is given in Appendix D.1.

DCA converges in objective values, and in iterates if g or h is strongly convex, to a critical point (Pham Dinh & Le Thi, 1997, Theorem 3). We can always make the DC components strongly convex by adding (ρ/2)‖·‖² to both g and h. A special instance of DCA, called complete DCA (CDCA), converges to a strong critical point, but requires solving concave minimization subproblems (Pham Dinh & Souad, 1988, Theorem 3). CDCA picks valid DCA iterates y^k, x^{k+1} that minimize the dual and primal DC objectives, respectively. We consider an approximate version of CDCA with the following iterates:

y^k ∈ argmin{h^*(y) − g^*(y) : y ∈ ∂h(x^k)} = argmin{⟨y, x^k⟩ − g^*(y) : y ∈ ∂h(x^k)},   (5a)

x^{k+1} ∈ argmin{g(x) − h(x) : x ∈ ∂_{ϵ_x} g^*(y^k)} = argmin{⟨x, y^k⟩ − h(x) : x ∈ ∂_{ϵ_x} g^*(y^k)}.   (5b)

3. DS Minimization via DCA

In this section, we apply DCA to the DC program (2) corresponding to DS minimization. We consider the DC decomposition f = g − h, where

g = g_L + δ_{[0,1]^d} + (ρ/2)‖·‖²  and  h = h_L + (ρ/2)‖·‖²,   (6)

with ρ ≥ 0.
Starting from x^0 ∈ [0, 1]^d, the approximate DCA iterates (with ϵ_y = 0) are then given by

y^k ∈ ρx^k + ∂h_L(x^k),   (7a)

x^{k+1} is an ϵ_x-solution of min_{x ∈ [0,1]^d} g_L(x) − ⟨x, y^k⟩ + (ρ/2)‖x‖².   (7b)

Note that the minimum f^* = F^* of (2) is finite, since F^* is finite. DCA is clearly well defined here; we discuss below how to obtain the iterates efficiently. One can also verify that the condition in Lemma 2.5 holds: dom ∂g = [0, 1]^d ⊆ dom ∂h = ℝ^d by Proposition 2.3-f, and dom ∂h^* = B(H) if ρ = 0 and ℝ^d otherwise, hence in both cases dom ∂h^* ⊆ dom ∂g^* = ℝ^d, by Proposition 2.3-b,c.

Computational complexity: A subgradient of h_L can be computed as described in Proposition 2.3-f in O(d log d + d·EO_H) time, with EO_H being the time needed to evaluate H on any set. An ϵ_x-solution of Problem (7b), for ϵ_x > 0, can be computed using the projected subgradient method (PGM) in O(dκ²/ϵ_x²) iterations when ρ = 0, and in O(2(κ + ρ√d)²/(ρϵ_x)) iterations when ρ > 0 (Bubeck, 2014, Theorems 3.1 and 3.5), where κ is the Lipschitz constant of g_L(x) − ⟨x, y^k⟩; see Proposition 2.3-g. The time per iteration of PGM is O(d log d + d·EO_G).

When ρ = 0, Problem (7b) is equivalent to a submodular minimization problem, since min_{x ∈ [0,1]^d} g_L(x) − ⟨x, y^k⟩ = min_{X ⊆ V} G(X) − y^k(X) by Proposition 2.3-b,c. Then we can take x^{k+1} = 1_{X^{k+1}}, where X^{k+1} ∈ argmin_{X ⊆ V} G(X) − y^k(X). Several algorithms have been developed for minimizing a submodular function in polynomial time, exactly or within arbitrary accuracy ϵ_x > 0. Inexact algorithms are more efficient, with the current best runtime O(d·EO_G/ϵ_x²) achieved by (Axelrod et al., 2019). In this case, DCA reduces to the SubSup procedure of (Narasimhan & Bilmes, 2005) and thus satisfies the same theoretical guarantees; see Appendix A. In what follows, we extend these guarantees to the general case where x^k is not integral and ρ ≥ 0, by leveraging convergence properties of DCA.
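The iterates (7a)-(7b) can be sketched end to end: y^k via the greedy algorithm on H, then PGM with a diminishing step size for (7b). This is a toy sketch with our own names and a synthetic (G, H) pair; it is not the paper's implementation (in particular, we use a fixed PGM budget rather than an ϵ_x-accuracy certificate).

```python
import numpy as np

def greedy_subgrad(F, x):
    """Subgradient of the Lovász extension of F at x (Prop. 2.3-f)."""
    sigma = np.argsort(-x)
    y = np.empty(len(x))
    S, prev = set(), 0.0
    for k in sigma:
        S.add(int(k))
        val = F(frozenset(S))
        y[k] = val - prev
        prev = val
    return y

def dca_ds(G, H, d, rho=0.0, iters=20, pgm_steps=200):
    """Approximate DCA for F = G - H on [0,1]^d, iterates (7a)-(7b)."""
    x = np.full(d, 0.5)
    for _ in range(iters):
        y = rho * x + greedy_subgrad(H, x)                # (7a)
        z = x.copy()
        for t in range(1, pgm_steps + 1):                 # PGM for (7b)
            g_sub = greedy_subgrad(G, z) - y + rho * z    # subgradient of the objective
            z = np.clip(z - g_sub / np.sqrt(t), 0.0, 1.0)
        x = z
    return x

# toy DS problem: G(X) = sqrt(|X|), H(X) = 0.9|X|; minimum of F at X = V
x = dca_ds(lambda X: len(X) ** 0.5, lambda X: 0.9 * len(X), d=3)
```

On this toy instance F(X) = √|X| − 0.9|X| is minimized at the full set, and the fractional iterate indeed concentrates near the all-ones vector.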
Theoretical guarantees: Existing convergence results of DCA in (Pham Dinh & Le Thi, 1997; Le Thi & Pham Dinh, 1997; 2005) consider exact iterates and exact convergence, i.e., f(x^k) = f(x^{k+1}), which may require an exponential number of iterations, as shown in (Byrnes, 2015, Theorem 3.4) for SubSup. We extend these results to handle inexact iterates and approximate convergence.

Theorem 3.1. Given any f = g − h, where g, h ∈ Γ_0, let {x^k} and {y^k} be generated by approximate DCA (Algorithm 1). Then for all t_x, t_y ∈ (0, 1] and k ∈ ℕ, letting ρ̄ = ρ(g)(1 − t_x) + ρ(h)(1 − t_y) and ϵ̄ = ϵ_x/t_x + ϵ_y/t_y, we have:

a) f(x^k) − f(x^{k+1}) ≥ (ρ̄/2)‖x^k − x^{k+1}‖² − ϵ̄.

b) For ϵ ≥ 0, if f(x^k) − f(x^{k+1}) ≤ ϵ, then x^k is an (ϵ′, ϵ_y)-critical point of g − h with y^k ∈ ∂_{ϵ′} g(x^k) ∩ ∂_{ϵ_y} h(x^k), and x^{k+1} is an (ϵ_x, ϵ′)-critical point of g − h with y^k ∈ ∂_{ϵ_x} g(x^{k+1}) ∩ ∂_{ϵ′} h(x^{k+1}), where ϵ′ = ϵ + ϵ_x + ϵ_y, and (ρ̄/2)‖x^k − x^{k+1}‖² ≤ ϵ + ϵ̄.

c) min_{k ∈ {0,1,…,K−1}} f(x^k) − f(x^{k+1}) ≤ (f(x^0) − f^*)/K.

d) If ρ(g) + ρ(h) > 0, then min_{k ∈ {0,1,…,K−1}} ‖x^k − x^{k+1}‖² ≤ 2(f(x^0) − f^* + Kϵ̄)/(ρ̄K).

Proof sketch. Items a and b with ϵ = ϵ_x = ϵ_y = 0 are proved in (Pham Dinh & Le Thi, 1997, Theorem 3). We extend them to ϵ, ϵ_x, ϵ_y ≥ 0 by leveraging properties of approximate subgradients. Item c is obtained by a telescoping sum.

Theorem 3.1 shows that approximate DCA decreases the objective f almost monotonically (up to ϵ̄), and converges in objective values with rate O(1/k), and in iterates with rate O(1/√k) if ρ̄ > 0, to an approximate critical point of g − h. We present in Appendix E.1 a more detailed version of Theorem 3.1 and its full proof. In particular, we relate f(x^k) − f(x^{k+1}) to a weaker measure of non-criticality, recovering the convergence rate provided in (Abbaszadehpeivasti et al., 2021, Corollary 4.1) on this measure. Approximate DCA with ϵ = 0, ϵ_x = ϵ_y ≥ 0 was considered in (Vo, 2015, Theorem 1.4), showing that any limit points x̂, ŷ of {x^k}, {y^k} satisfy ŷ ∈ ∂_{3ϵ_x} g(x̂) ∩ ∂_{ϵ_x} h(x̂) in this case. Our results are more general and tighter (at convergence, y^K ∈ ∂_{2ϵ_x} g(x^K) ∩ ∂_{ϵ_x} h(x^K) in this case).
For DS minimization, y^k can be easily computed exactly (ϵ_y = 0). We consider ϵ_y > 0 to provide convergence results of FW on the concave subproblem required in CDCA (see Section 4). The following corollary relates criticality on the DC problem (2) to local minimality on the DS problem (1).

Corollary 3.2. Given f = g − h as defined in (6), let {x^k} and {y^k} be generated by a variant of approximate DCA (7), where x^k is integral, i.e., x^k = 1_{X^k} for some X^k ⊆ V, and y^k − ρx^k is computed as in Proposition 2.3-f. Then for all k ∈ ℕ and ϵ ≥ 0, we have:

a) If f(x^k) − f(x^{k+1}) ≤ ϵ, then

F(X^k) ≤ F(S^σ_ℓ) + ϵ′ for all ℓ ∈ V,   (8)

where

ϵ′ = √(2ρd(ϵ + ϵ_x)) if ϵ + ϵ_x ≤ ρd/2, and ϵ′ = ρd/2 + ϵ + ϵ_x otherwise,   (9)

and σ ∈ S_d is the permutation used to compute y^k − ρx^k in Proposition 2.3-f.

b) Given d permutations σ_1, …, σ_d ∈ S_d corresponding to decreasing orders of x^k with different elements at σ(|X^k|) or σ(|X^k| + 1), and the corresponding subgradients y^k_{σ_1}, …, y^k_{σ_d} ∈ ∂h(x^k) chosen as in Proposition 2.3-f, if we choose x^{k+1} ∈ argmin_{i ∈ V} {f(x^{k+1}_{σ_i}) : x^{k+1}_{σ_i} ∈ ∂_{ϵ_x} g^*(y^k_{σ_i})}, then if f(x^k) − f(x^{k+1}) ≤ ϵ, Eq. (8) holds with σ = σ_i for all i ∈ V. Hence, X^k is an ϵ′-local minimum of F.

Proof sketch. We observe that y^k − ρx^k ∈ ∂h_L(1_{S^σ_ℓ}) for all ℓ ∈ V. Item a then follows from Theorem 3.1-b, Proposition 2.3-a,f, Proposition 2.7-a, and the relation between the ϵ-subdifferentials of g and g − (ρ/2)‖·‖². Item b follows from Item a. See Appendix E.2.

Theorem 3.1 and Corollary 3.2 show that DCA with integral iterates x^k decreases the objective F almost monotonically (up to ϵ̄), and converges to an ϵ′-local minimum of F after at most (f(x^0) − f^*)/ϵ iterations, if we consider O(d) permutations for computing y^k. By a similar argument, we can further guarantee that the returned solution cannot be improved, by more than ϵ′, by adding or removing any c elements, if we consider O(d^c) permutations for computing y^k. Taking ϵ_x = 0, ρ = 0 in Theorem 3.1 and Corollary 3.2, we recover all the theoretical properties of SubSup given in (Narasimhan & Bilmes, 2005; Iyer & Bilmes, 2012).
Effect of regularization: Theorem 3.1 shows that using a non-zero regularization parameter ρ > 0 ensures convergence in iterates. Regularization also affects the complexity of solving Problem (7b); as discussed earlier, ρ > 0 leads to a faster convergence rate (except for very small ρ). On the other hand, Corollary 3.2 shows that for fixed ϵ and ϵ_x, a larger ρ may lead to a poorer solution. In practice, we observe that a larger ρ leads to slower convergence in objective values f(x^k), but more accurate iterates x^k, with ρ > 0 always yielding the best performance with respect to F (see Appendix C.1).

Note that when ρ > 0 we can't restrict x^k to be integral, since the equivalence in Proposition 2.3-c does not hold in this case. It may also be advantageous not to restrict x^k to be integral even when ρ = 0, as we observe in our numerical results (Appendix C.3). A natural question arises here: can we still obtain an approximate local minimum of F in this case? Given a fractional solution x^K returned by DCA, we can easily obtain a set solution with a smaller objective F(X^K) = f_L(1_{X^K}) ≤ f_L(x^K) by rounding: X^K = Round_F(x^K), as described in Proposition 2.3-d. However, rounding a fractional solution x^K returned by DCA will not necessarily yield an approximate local minimum of F, even if x^K is a local minimum of f_L, as we show in Example G.1. A simple workaround would be to explicitly check if the rounded solution is an ϵ′-local minimum of F. If not, we can restart the algorithm from x^K = 1_{X̂^K}, where X̂^K = argmin_{|X △ X^K| = 1} F(X), similarly to what was proposed in (Byrnes, 2015, Algorithm 1) for SubSup. This will guarantee that DCA converges to an ϵ′-local minimum of F after at most (f(x^0) − f^*)/ϵ iterations (see Proposition E.4). Such a strategy is not feasible though if we want to guarantee convergence to an approximate strong local minimum of F, as we do in Section 4 with CDCA. We thus propose an alternative approach.
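The explicit local-minimality check used in the workaround above (Definition 2.1: compare F(X) against all single-element additions and removals) can be sketched as follows, on a toy set function of our own choosing.

```python
def is_eps_local_min(F, X, V, eps=0.0):
    """Definition 2.1: True iff F(X) <= F(X u {i}) + eps for all i in V \\ X
    and F(X) <= F(X \\ {i}) + eps for all i in X. Costs 2|V| evaluations of F."""
    fx = F(X)
    for i in V - X:
        if fx > F(X | {i}) + eps:
            return False
    for i in X:
        if fx > F(X - {i}) + eps:
            return False
    return True

# toy submodular "cut" function of a 4-clique: F(X) = |X| * (4 - |X|)
V = frozenset(range(4))
F = lambda X: len(X) * (4 - len(X))
```

On this toy function the empty set is a local minimum (any addition raises the value from 0 to 3), while a 2-element set is not (dropping to 3 elements decreases the value from 4 to 3), so the restart rule would move to that better neighbouring set.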
We introduce a variant of DCA, which we call DCAR, where we round x^k at each iteration.

DCA with rounding: Starting from x^0 ∈ {0, 1}^d, the approximate DCAR iterates are given by

y^k, x̄^{k+1} as in (7a) and (7b) respectively,   (10a)

x^{k+1} = 1_{X^{k+1}} where X^{k+1} = Round_F(x̄^{k+1}).   (10b)

Since y^k, x̄^{k+1} are standard approximate DCA iterates, the properties in Theorem 3.1 apply to them, with ϵ_y = 0 and x^{k+1} replaced by x̄^{k+1}. See Theorem E.5 for details. Since x^k is integral in DCAR, Corollary 3.2 also holds. In particular, DCAR converges to an ϵ′-local minimum of F after at most (f(x^0) − f^*)/ϵ iterations, if we consider O(d) permutations for computing y^k, with ϵ′ defined in (9).

4. DS Minimization via CDCA

As discussed in Section 2, CDCA is a special instance of DCA which is guaranteed to converge to a strong critical point. In this section, we apply CDCA to the DC program (2) corresponding to DS minimization, and show that the stronger guarantee on the DC program translates into a stronger guarantee on the DS problem. We use the same decomposition (6).

Computational complexity: CDCA requires solving a concave minimization problem for each iterate update. The constraint polytope ∂h(x^k) = ρx^k + ∂h_L(x^k) in Problem (5a) can have a number of vertices growing exponentially with the number of equal entries in x^k. Thus, it is not possible to efficiently obtain a global solution of Problem (5a) in general. However, we can efficiently obtain an approximate critical point. Denote the objective

φ_k(w) = ⟨w, x^k⟩ − g^*(w).   (11)

We use an approximate version of the FW algorithm, which, starting from w^0 ∈ ∂h(x^k), has the following iterates:

s^t ∈ ∂_ϵ φ_k(w^t) = x^k − ∂_ϵ g^*(w^t),   (12a)

v^t ∈ argmin{⟨s^t, w⟩ : w ∈ ∂h(x^k)},   (12b)

w^{t+1} = (1 − γ_t)w^t + γ_t v^t,   (12c)

where ϵ ≥ 0 and we use the greedy step size γ_t ∈ argmin_{γ ∈ [0,1]} φ_k((1 − γ)w^t + γv^t) = 1. We observe that with this step size, FW is a special case of DCA (with DC components ḡ = δ_{∂h(x^k)} and h̄ = −φ_k). Hence, Theorem 3.1 applies to it (with ϵ_x = 0, ϵ_y = ϵ).
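The FW iterates (12a)-(12c) with γ_t = 1 can be sketched as follows, using the tie-broken greedy algorithm (Proposition 4.2 below) as the linear minimization oracle over ∂h_L(x^k). Everything here is a toy stand-in of our own: a simple smooth concave φ replaces (11), and H is a synthetic submodular function.

```python
import numpy as np

def greedy_w(H, x, s):
    """Maximizer of <s, w> over dh_L(x): sort by decreasing x, break
    ties by decreasing s (Prop. 4.2), then take greedy marginals of H."""
    sigma = np.lexsort((-s, -x))      # last key (-x) is the primary sort key
    w = np.empty(len(x))
    S, prev = set(), 0.0
    for k in sigma:
        S.add(int(k))
        val = H(frozenset(S))
        w[k] = val - prev
        prev = val
    return w

def fw_concave(H, x_k, grad_phi, iters=50):
    """FW iterates (12a)-(12c) with greedy step gamma_t = 1 over dh_L(x_k)."""
    w = greedy_w(H, x_k, np.zeros(len(x_k)))   # some w^0 in dh_L(x_k)
    for _ in range(iters):
        s = grad_phi(w)                        # s^t: a supergradient of phi at w^t
        v = greedy_w(H, x_k, -s)               # v^t in argmin{<s^t, w>} via the LMO
        if s @ (w - v) <= 1e-12:               # FW gap ~ 0: approximate critical point
            break
        w = v                                  # gamma_t = 1
    return w

H = lambda X: np.sqrt(len(X))      # toy submodular H
x_k = np.array([1.0, 1.0, 0.0])    # integral iterate with tied entries
w = fw_concave(H, x_k, grad_phi=lambda w: -2 * w)   # toy phi(w) = -||w||^2
```

The returned w is a vertex of the polytope ∂h_L(x^k) at which the FW gap vanishes, mirroring the stationarity guarantee of Corollary 4.1.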
In particular, FW converges to a critical point with rate O(1/t). Convergence results of FW for non-convex problems are often presented in terms of the FW gap, defined as gap(w^t) := max_{w ∈ ∂h(x^k)} ⟨s^t, w^t − w⟩ (Lacoste-Julien, 2016). Our results imply the following bound on the FW gap (see Appendix F.1 for details).

Corollary 4.1. Given any f = g − h, where g, h ∈ Γ_0, and φ_k as defined in (11), let {w^t} be generated by approximate FW (12) with γ_t = 1. Then for all T ∈ ℕ, we have

min_{t ∈ {0,…,T−1}} gap(w^t) ≤ (φ_k(w^0) − min_{w ∈ ∂h(x^k)} φ_k(w))/T + ϵ.

Corollary 4.1 extends the result of (Yurtsever & Sra, 2022, Lemma 2.1)¹ to handle approximate supergradients of φ_k. A subgradient of h_L and an approximate subgradient of g^* can be computed as discussed in Section 3. The following proposition shows that the linear minimization problem (12b) can be solved exactly in O(d log d + d·EO_H) time.

Proposition 4.2. Given s, x ∈ ℝ^d, let a_1 > … > a_m denote the unique values of x, taken at sets A_1, …, A_m, i.e., A_1 ∪ … ∪ A_m = V and x_j = a_i for all i ∈ {1, …, m}, j ∈ A_i. Let σ ∈ S_d be a decreasing order of x where we break ties according to s, i.e., x_{σ(1)} ≥ … ≥ x_{σ(d)} and s_{σ(|C_{i−1}|+1)} ≥ … ≥ s_{σ(|C_i|)}, where C_i = A_1 ∪ … ∪ A_i for all i ∈ {1, …, m}. Define w_{σ(k)} = H(σ(k) | S^σ_{k−1}) for all k ∈ V; then w is a maximizer of max_{w ∈ ∂h_L(x)} ⟨s, w⟩.

Proof sketch. By Proposition 2.3-f, we have that w ∈ ∂h_L(x), and that any feasible solution is a maximizer of max_{w ∈ B(H)} ⟨w, x⟩. The claim then follows by the optimality conditions of this problem given in (Bach, 2013, Proposition 4.2). The full proof is in Appendix F.2.

Note that Problem (5b) reduces to a unique solution x^{k+1} = ∇g^*(y^k) when ρ > 0, since g^* is differentiable in this case. When ρ = 0, the constraint set ∂g^*(y^k) = argmin_{x ∈ [0,1]^d} g_L(x) − ⟨y^k, x⟩ is the convex hull of the minimizers of g_L(x) − ⟨y^k, x⟩ on {0, 1}^d (Bach, 2013, Proposition 3.7), which can be exponentially many.
One such trivial example is when the objective is zero, so that the set of minimizers is {0, 1}^d, in which case Problem (5b) is as challenging as the original DC problem. Fortunately, in what follows we show that solving Problem (5b) is not necessary to obtain an approximate strong local minimum of F; it is enough to pick any point in ∂_{ϵ_x} g^*(y^k) as in DCA.

¹The result therein is stated for φ_k continuously differentiable, but it does not actually require differentiability.

Theoretical guarantees: Since CDCA is a special case of DCA, all the guarantees discussed in Section 3 apply. In addition, CDCA is known to converge to a strong critical point (Pham Dinh & Souad, 1988, Theorem 3). We extend this to the variant with inexact iterates and approximate convergence.

Theorem 4.3. Given any f = g − h, where g, h ∈ Γ_0, let {x^k} and {y^k} be generated by a variant of approximate CDCA (5), where x^{k+1} is any point in ∂_{ϵ_x} g^*(y^k) (not necessarily a solution of Problem (5b)). Then, for ϵ ≥ 0, if f(x^k) − f(x^{k+1}) ≤ ϵ, then x^k is an (ϵ + ϵ_x)-strong critical point of g − h. Moreover, if h is locally polyhedral, then x^k is also an (ϵ + ϵ_x)-local minimum of f. This is the case for h given by (6) when ρ = 0.

The proof is given in Appendix F.3. It does not require that x^{k+1} is a solution of Problem (5b). However, it does require that y^k is a solution of Problem (5a). Whether a similar result holds when y^k is only an approximate critical point is an interesting question for future work. The next corollary relates strong criticality on the DC problem (2) to strong local minimality on the DS problem (1).

Corollary 4.4. Given f = g − h as defined in (6) and ε ≥ 0, let X̂ ⊆ V and x̂ = 1_{X̂}. If x̂ is an ε-strong critical point of g − h, then X̂ is an ε′-strong local minimum of F, where ε′ = √(2ρdε) if ε ≤ ρd/2, and ε′ = ρd/2 + ε otherwise. Conversely, if X̂ is an ε-strong local minimum of F, then x̂ is an ε-local minimum of f, and hence also an ε-strong critical point of g − h.
Proof sketch. We observe that for any x = 1_X corresponding to X ⊆ X̂ or X ⊇ X̂, we have ∂h_L(x̂) ∩ ∂h_L(x) ≠ ∅. The proof of the forward direction then follows from Proposition 2.7-a and the relation between the ε-subdifferentials of g and g − (ρ/2)‖·‖². For the converse direction, we argue that there exists a neighborhood B_δ(x̂) of x̂ such that any X = Round_F(x) for x ∈ B_δ(x̂) satisfies X ⊆ X̂ or X ⊇ X̂. The claim then follows from Proposition 2.3-d,a and Proposition 2.7-a. See Appendix F.4 for details.

Theorem 4.3 and Corollary 4.4 imply that CDCA with integral iterates x^k converges to an ϵ′-strong local minimum of F after at most (f(x^0) − f^*)/ϵ iterations, with ϵ′ as in (9).

Effect of regularization: The parameter ρ has the same effect on CDCA as discussed in Section 3 for DCA (Corollary 4.4 shows, like Corollary 3.2, that for fixed ϵ and ϵ_x, a larger ρ may lead to a poorer solution). Also, as in DCA, when ρ > 0 we can't restrict x^k in CDCA to be integral. Moreover, rounding only once at convergence is not enough to obtain even an approximate local minimum of F, as shown in Example G.1. Checking if a set is an approximate strong local minimum of F is computationally infeasible, thus it cannot be explicitly enforced. Instead, we propose a variant of CDCA, which we call CDCAR, where we round x^k at each iteration.

CDCA with rounding: Starting from x^0 ∈ {0, 1}^d, the approximate CDCAR iterates are given by

y^k, x̄^{k+1} as in (5a) and (5b) respectively,   (13a)

x^{k+1} = 1_{X^{k+1}} where X^{k+1} = Round_F(x̄^{k+1}).   (13b)

Since CDCAR is a special case of DCAR, all the properties of DCAR discussed in Section 3 apply. In addition, since y^k, x̄^{k+1} are standard approximate CDCA iterates, Theorem 4.3 applies to them, with x^{k+1} replaced by x̄^{k+1}. Since x^k is integral in CDCAR, Corollary 4.4 holds. In particular, CDCAR converges to an ϵ′-strong local minimum of F after at most (f(x^0) − f^*)/ϵ iterations, with ϵ′ defined in (9). See Corollary F.2 for details.
The guarantees of DCA and CDCA are equivalent when F is submodular, and similar when F is supermodular. As stated in Section 2, if F is supermodular then any ϵ′-local minimum of F is also an ϵ′d-strong local minimum. And when h is differentiable, which is the case in DS minimization only if H is modular and thus F is submodular, approximate weak and strong criticality of f are equivalent. In this case, both DCA and CDCA return an ϵ′-global minimum of F if x^k is integral; see Appendix H. However, in general, the objective value achieved by a set satisfying the guarantees in Corollary 3.2 can be arbitrarily worse than any strong local minimum, as illustrated in Example G.2. This highlights the importance of the stronger guarantee achieved by CDCA.

5. Experiments

In this section, we evaluate the empirical performance of our proposed methods on two applications: speech corpus selection and feature selection. We compare our proposed methods DCA, DCAR, CDCA and CDCAR to the state-of-the-art methods for DS minimization: SubSup, SupSub and ModMod (Narasimhan & Bilmes, 2005; Iyer & Bilmes, 2012). We also include an accelerated variant of DCA (ADCA) and of DCAR (ADCAR), with the acceleration proposed in (Nhat et al., 2018). We use the minimum norm point (MNP) algorithm (Fujishige & Isotani, 2011) for submodular minimization in SubSup, and the optimal Greedy algorithm of (Buchbinder et al., 2012, Algorithm 2) for submodular maximization in SupSub. We also compare with the MNP, PGM, and Greedy algorithms applied directly to the DS problem (1). We do not restrict ρ to zero or the iterates to be integral in DCA and CDCA (recall that DCA in this case reduces to SubSup). Instead, we vary ρ between 0 and 10, and round only once at convergence (though for evaluation purposes we do round at each iteration, but we do not update x^{k+1} with the rounded iterate).
Figure 1: Discrete and continuous objective values (log-scale) vs iterations on speech (top) and mushroom (bottom) datasets.

We also do not consider O(d) permutations for choosing y^k in DCA, DCAR, SubSup and ModMod, as required in Corollary 3.2 and (Iyer & Bilmes, 2012) to guarantee convergence to an approximate local minimum of F, as this is too computationally expensive (unless done fully in parallel). Instead, as in (Iyer & Bilmes, 2012), we consider three permutations to break ties in x^k: a random permutation, a permutation ordered according to the decreasing marginal gains of G, i.e., G(i | X^k \ i), or according to the decreasing marginal gains of F, i.e., F(i | X^k \ i); we try them in parallel at each iteration, then pick the one yielding the best objective F. We also apply this heuristic in CDCA and CDCAR to choose an initial feasible point w^0 ∈ ρx^k + ∂h_L(x^k) for FW (12); we pick the permutation yielding the smallest objective φ_k(w^0). We use f(x^k) − f(x^{k+1}) ≤ 10^{-6} as a stopping criterion in our methods, and X^{k+1} = X^k in SubSup, SupSub and ModMod as in (Iyer & Bilmes, 2012), and stop after a maximum number of iterations. To ensure convergence to a local minimum of F, we explicitly check for this as an additional stopping criterion in all methods except MNP, PGM and Greedy, and restart from the best neighboring set if it is not satisfied, as discussed in Section 3. For more details on the experimental set-up, see Appendix B. The code is available at https://github.com/SamsungSAILMontreal/difference-submodular-min.git.

Speech corpus selection: The goal of this problem is to find a subset of a large speech data corpus to rapidly evaluate new and expensive speech recognition algorithms.
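The permutation-based choice of y^k above is the greedy subgradient of the Lovász extension (computed as in Proposition 2.3-f): sort the ground set by decreasing x^k, break ties by the chosen marginal-gain scores, and take successive marginals of H along that order. A minimal sketch, assuming a set-function oracle H; the helper name and the `tiebreak` argument are illustrative, not the paper's interface.

```python
def lovasz_subgradient(H, x, tiebreak=None):
    """Subgradient of the Lovasz extension of H at x via Edmonds' greedy
    ordering: sort the ground set by decreasing x (ties broken by larger
    `tiebreak` score first), then set y[i] = H(S_i) - H(S_{i-1}), where
    S_i collects the first i elements of the order."""
    d = len(x)
    key = lambda i: (-x[i], -tiebreak[i] if tiebreak else 0)
    order = sorted(range(d), key=key)
    y = [0.0] * d
    S = set()
    prev = H(frozenset(S))
    for i in order:
        S.add(i)
        cur = H(frozenset(S))
        y[i] = cur - prev
        prev = cur
    return y
```

Trying several tie-breaking scores in parallel, as described above, amounts to calling this routine with different `tiebreak` vectors and keeping the result with the best objective.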
One approach is to select a subset of utterances X from the corpus V that simultaneously minimizes the vocabulary size and maximizes the total value of the data (Lin & Bilmes, 2011; Jegelka et al., 2011). Moreover, in some cases the importance of utterances decreases when they are selected together. This can be modeled by minimizing F(X) = λ√|N(X)| − Σ_{i=1}^r √(m(X ∩ V_i)), where N(X) is the set of distinct words that appear in the utterances X, m is a non-negative modular function with the weight m_j representing the importance of utterance j, and V_1 ∪ ⋯ ∪ V_r = V. We can write F as the difference of two non-decreasing submodular functions G(X) = λ√|N(X)| and H(X) = Σ_{i=1}^r √(m(X ∩ V_i)). Moreover, this problem is a special case of DS minimization where H is approximately modular. In particular, H is (1, β)-weakly DR-modular (see Definition H.1) with β ≥ min_{i∈[r]} min_{j∈V_i} √(m(j)/m(V_i)); the proof follows similarly to (Iyer et al., 2013, Lemma 3.3). The parameter β characterizes how close H is to being supermodular. This DS problem thus fits under the setting considered in (El Halabi & Jegelka, 2020) (with α = 1), for which PGM was shown to achieve the optimal approximation guarantee F(X̂) ≤ G(X*) − βH(X*) + ϵ for some ϵ > 0, where X* is a minimizer of F (see Corollary 1 and Theorem 2 therein). We show in Appendix H.1 that any variant of DCA and CDCA obtains the same approximation guarantee as PGM (see Proposition H.6 and the discussion below it). We use the same dataset as (Bach, 2013, Section 12.1), with d = |V| = 800 utterances and 1105 words. We choose λ = 1, the non-negative weights m_i randomly, and partition V into r = 10 groups of consecutive indices.

Feature selection: Given a set of features U_V = {U_1, U_2, ..., U_d}, the goal is to find a small subset of these features U_X = {U_i : i ∈ X} that works well when used to classify a class C. We thus want to select the subset which retains the most information from the original set U_V about C.
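For concreteness, the speech-selection objective F(X) = λ√|N(X)| − Σ_i √(m(X ∩ V_i)) can be evaluated as below. A minimal sketch, where `words`, `m`, and `parts` are hypothetical stand-ins for the word sets defining N(·), the modular weights, and the partition V_1, ..., V_r; this is not the paper's implementation.

```python
import math

def speech_objective(X, words, m, parts, lam=1.0):
    """F(X) = lam * sqrt(|N(X)|) - sum_i sqrt(m(X & V_i)), where
    words[j] is the set of distinct words in utterance j, m[j] is the
    importance weight of utterance j, and parts is the partition
    V_1, ..., V_r of the ground set."""
    NX = set().union(*(words[j] for j in X)) if X else set()
    G = lam * math.sqrt(len(NX))            # vocabulary-size term
    H = sum(math.sqrt(sum(m[j] for j in X & Vi)) for Vi in parts)
    return G - H
```

Both terms are non-decreasing submodular (a concave function of a coverage count, and concave functions of modular sums), matching the DS decomposition F = G − H above.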
This can be modeled by minimizing F(X) = λ|X| − I(U_X; C). The mutual information I(U_X; C) can be written as the difference of the entropy H(U_X) and the conditional entropy H(U_X | C), both of which are non-decreasing submodular. Hence F can be written as the difference of two non-decreasing submodular functions G(X) = λ|X| + H(U_X | C) and H(X) = H(U_X). We estimate the mutual information from the data. We use the Mushroom dataset from (Dua & Graff, 2017), which has 8124 instances with 22 categorical attributes, which we convert to d = 118 binary features. We randomly select 70% of the data as training data for the feature selection, and set λ = 10^{-4}.

Results: We plot in Fig. 1 the discrete objective values F(X^k) − min(F) and the continuous objective values f_L(x^k) − min(f_L), per iteration k, where min(F) and min(f_L) are the smallest values achieved by all compared methods. We only plot the continuous objective of the methods which minimize the continuous DC problem (2) instead of directly minimizing the DS problem (1), i.e., our methods and PGM. For DCAR and CDCAR, we plot the continuous objective values before rounding, i.e., f_L(x̄^k), since the continuous objective after rounding is equal to the discrete one, i.e., f_L(x^k) = F(X^k). Results are averaged over 3 random runs, with standard deviations shown as error bars. For clarity, we only include our methods with the ρ value achieving the smallest discrete objective value. We show the results for all ρ values in Appendix C.1. For a fair implementation-independent comparison, we use the number of FW (12) iterations as the x-axis for CDCA and CDCAR, since one iteration of FW has a similar cost to an iteration of the DCA variants. We only show the minimum objective achieved by SupSub, ModMod, MNP, PGM and Greedy, since their iteration time is significantly smaller than that of the DCA and CDCA variants. We show the results with respect to time in Appendix C.2.
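The paper estimates I(U_X; C) from data without specifying the estimator here; a simple choice is the plug-in estimator built from empirical entropies, using the identity I(U_X; C) = H(U_X) + H(C) − H(U_X, C), which is equivalent to the H(U_X) − H(U_X | C) form above. A minimal sketch with illustrative names:

```python
import math
from collections import Counter

def entropy(samples):
    """Plug-in (empirical) entropy, in bits, of a list of hashable samples."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def feature_sel_objective(X, data, labels, lam=1e-4):
    """F(X) = lam*|X| - I(U_X; C), with the mutual information estimated
    by plug-in entropies: I(U_X; C) = H(U_X) + H(C) - H(U_X, C)."""
    rows = [tuple(row[i] for i in sorted(X)) for row in data]
    joint = [r + (c,) for r, c in zip(rows, labels)]
    mi = entropy(rows) + entropy(labels) - entropy(joint)
    return lam * len(X) - mi
```

For a perfectly informative single binary feature this returns λ − 1 (one bit of mutual information, minus the cardinality penalty).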
We observe that, as expected, PGM obtains the same discrete objective value as the best variants of our methods on the speech dataset, where PGM and our methods achieve the same approximation guarantee, but a worse one on the mushroom dataset, where PGM has no guarantees. In terms of continuous objective value, though, PGM does worse than our methods on both datasets. Hence, a better f_L value does not necessarily yield a better F value after rounding. In both experiments, our methods reach a better F value than all other baselines, except SubSup, which reaches the same value as DCAR on the speech dataset and a similar value to our non-accelerated methods on the mushroom dataset. The complete variants of our methods, CDCA and CDCAR, perform better in terms of F values than their simple counterparts, DCA and DCAR, on the speech dataset. But on the mushroom dataset, CDCAR performs similarly to DCAR, while CDCA is worse than DCA. Hence, using the complete variant is not always advantageous. In terms of f_L values, CDCA and CDCAR perform worse than DCA and DCAR, respectively, on both datasets. Again this illustrates that a better f_L value does not always yield a better F value. Rounding at each iteration helps CDCA on both datasets (CDCAR converges faster than CDCA in F), but not DCA (DCAR reaches a worse F value than DCA). Note that unlike f_L(x^k), the objective values f_L(x̄^k) of DCAR and CDCAR are not necessarily approximately non-increasing (Theorem E.5-b does not apply to them), which we indeed observe on the mushroom dataset. Finally, we observe that adding regularization leads to better F values; the best ρ is non-zero for all our methods (see Appendix C.1 for a more detailed discussion of the effect of regularization). Acceleration helps in most cases but not all; DCAR and ADCAR perform the same on the speech dataset.

6. Conclusion

We introduce variants of DCA and CDCA for minimizing the DC program equivalent to DS minimization.
We establish novel links between the two problems, which allow us to match the theoretical guarantees of existing algorithms using DCA, and to achieve stronger ones using CDCA. Empirically, our proposed methods perform similarly to or better than all existing methods.

Acknowledgements

This research was enabled in part by support provided by Calcul Québec (https://www.calculquebec.ca/) and the Digital Research Alliance of Canada (https://alliancecan.ca/). George Orfanides was partially supported by NSERC CREATE INTERMATH-AI. Tim Hoheisel was partially supported by NSERC discovery grant RGPIN-2017-04035.

References

Abbaszadehpeivasti, H., de Klerk, E., and Zamani, M. On the rate of convergence of the difference-of-convex algorithm (DCA). arXiv preprint arXiv:2109.13566, 2021. (Cited on 5, 20)

Axelrod, B., Liu, Y. P., and Sidford, A. Near-optimal approximate discrete and continuous submodular function minimization. arXiv preprint arXiv:1909.00171, 2019. (Cited on 4)

Bach, F. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145-373, 2013. (Cited on 3, 6, 9, 13, 24, 25, 26)

Bian, A. A., Buhmann, J. M., Krause, A., and Tschiatschek, S. Guarantees for greedy maximization of non-submodular functions with applications. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 498-507. JMLR.org, 2017. (Cited on 26)

Bubeck, S. Theory of convex optimization for machine learning. arXiv preprint arXiv:1405.4980, 15, 2014. (Cited on 4)

Buchbinder, N., Feldman, M., Naor, J., and Schwartz, R. A tight linear time (1/2)-approximation for unconstrained submodular maximization. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pp. 649-658. IEEE, 2012. (Cited on 7, 29)

Byrnes, K. M. A tight analysis of the submodular-supermodular procedure. Discrete Applied Mathematics, 186:275-282, 2015. ISSN 0166-218X.
doi: https://doi.org/10.1016/j.dam.2015.01.026. URL https://www.sciencedirect.com/science/article/pii/S0166218X15000281. (Cited on 4, 6)

Chakrabarty, D., Jain, P., and Kothari, P. Provable submodular minimization via Fujishige-Wolfe's algorithm. Advances in Neural Information Processing Systems (NIPS), 2014. (Cited on 17)

Chambolle, A., De Vore, R., Lee, N., and Lucier, B. Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage. Image Processing, IEEE Transactions on, 7(3):319-335, 1998. (Cited on 1)

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1-22, 1977. (Cited on 1)

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. (Cited on 9)

Durier, R. On locally polyhedral convex functions. Trends in Mathematical Optimization, pp. 55-66, 1988. (Cited on 3)

El Halabi, M. and Jegelka, S. Optimal approximation for unconstrained non-submodular minimization. ICML, 2020. (Cited on 2, 8, 28)

Feige, U., Mirrokni, V. S., and Vondrák, J. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133-1153, 2011. (Cited on 2, 29)

Feldman, M. Guess free maximization of submodular and linear sums. In Friggstad, Z., Sack, J., and Salavatipour, M. R. (eds.), Algorithms and Data Structures - 16th International Symposium, WADS 2019, Edmonton, AB, Canada, August 5-7, 2019, Proceedings, volume 11646 of Lecture Notes in Computer Science, pp. 380-394. Springer, 2019. doi: 10.1007/978-3-030-24766-9_28. URL https://doi.org/10.1007/978-3-030-24766-9_28. (Cited on 2)

Fujishige, S. and Isotani, S. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7(1):3-17, 2011. (Cited on 7)

Ghadimi, S. Conditional gradient type methods for composite nonlinear and stochastic optimization. Math.
Program., 173(1-2):431-464, 2019. doi: 10.1007/s10107-017-1225-5. URL https://doi.org/10.1007/s10107-017-1225-5. (Cited on 20)

Harshaw, C., Feldman, M., Ward, J., and Karbasi, A. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2634-2643, Long Beach, California, USA, 09-15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/harshaw19a.html. (Cited on 2)

Hiriart-Urruty, J.-B. From convex optimization to nonconvex optimization. Necessary and sufficient conditions for global optimality. In Nonsmooth optimization and related topics, pp. 219-239. Springer, 1989. (Cited on 4, 18)

Iyer, R. and Bilmes, J. Algorithms for approximate minimization of the difference between submodular functions, with applications. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pp. 407-417, Arlington, Virginia, United States, 2012. AUAI Press. ISBN 978-0-9749039-8-9. URL http://dl.acm.org/citation.cfm?id=3020652.3020697. (Cited on 1, 5, 7, 8, 13)

Iyer, R. K., Jegelka, S., and Bilmes, J. A. Curvature and optimal algorithms for learning and minimizing submodular functions. In Advances in Neural Information Processing Systems, pp. 2742-2750, 2013. (Cited on 8)

Jegelka, S. and Bilmes, J. A. Online submodular minimization for combinatorial structures. In ICML, pp. 345-352. Citeseer, 2011. (Cited on 3)

Jegelka, S., Lin, H., and Bilmes, J. On fast approximate submodular minimization. In NIPS, pp. 460-468, 2011. (Cited on 8)

Kawahara, Y. and Washio, T. Prismatic algorithm for discrete DC programming problem. Advances in Neural Information Processing Systems, 24, 2011. (Cited on 2)

Lacoste-Julien, S. Convergence rate of Frank-Wolfe for non-convex objectives.
arXiv preprint arXiv:1607.00345, 2016. (Cited on 6)

Le Thi, H. A. and Pham Dinh, T. Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. Journal of Global Optimization, 11(3):253-285, 1997. (Cited on 4, 18)

Le Thi, H. A. and Pham Dinh, T. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1):23-46, 2005. (Cited on 4)

Le Thi, H. A. and Pham Dinh, T. DC programming and DCA: thirty years of developments. Mathematical Programming, 169(1):5-68, 2018. (Cited on 1)

Lehmann, B., Lehmann, D., and Nisan, N. Combinatorial auctions with decreasing marginal utilities. Games and Economic Behavior, 55(2):270-296, 2006. (Cited on 26)

Lin, H. and Bilmes, J. Optimal selection of limited vocabulary speech corpora. In Twelfth Annual Conference of the International Speech Communication Association, 2011. (Cited on 8)

Lovász, L. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pp. 235-257. Springer, 1983. (Cited on 2)

Maehara, T. and Murota, K. A framework of discrete DC programming by discrete convex analysis. Mathematical Programming, 152(1):435-466, 2015. (Cited on 2)

Narasimhan, M. and Bilmes, J. A. A submodular-supermodular procedure with applications to discriminative structure learning. In UAI'05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26-29, 2005, pp. 404-412. AUAI Press, 2005. URL https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=1243&proceeding_id=21. (Cited on 1, 4, 5, 7, 13)

Nhat, P. D., Le, H. M., and Le Thi, H. A. Accelerated difference of convex functions algorithm and its application to sparse binary logistic regression. In IJCAI, pp. 1369-1375, 2018. (Cited on 2, 7, 13)

Perrault, P., Healey, J., Wen, Z., and Valko, M.
On the approximation relationship between optimizing ratio of submodular (RS) and difference of submodular (DS) functions. arXiv preprint arXiv:2101.01631, 2021. (Cited on 2)

Pham Dinh, T. and Le Thi, H. A. Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289-355, 1997. (Cited on 1, 3, 4, 5)

Pham Dinh, T. and Souad, E. B. Duality in DC (difference of convex functions) optimization. Subgradient methods. Trends in Mathematical Optimization, pp. 277-293, 1988. (Cited on 1, 4, 7, 24)

Pham Dinh, T., Huynh, V. N., Le Thi, H. A., and Ho, V. T. Alternating DC algorithm for partial DC programming problems. Journal of Global Optimization, 82(4):897-928, 2022. (Cited on 19)

Rockafellar, R. T. Convex analysis. Princeton University Press, 1970. (Cited on 19)

Sviridenko, M., Vondrák, J., and Ward, J. Optimal approximation for submodular and supermodular optimization with bounded curvature. Mathematics of Operations Research, 42(4):1197-1218, 2017. (Cited on 2)

Urruty, J.-B. H. and Lemaréchal, C. Convex analysis and minimization algorithms. Springer-Verlag, 1993. (Cited on 3)

Vo, X. T. Learning with sparsity and uncertainty by difference of convex functions optimization. PhD thesis, Université de Lorraine, 2015. (Cited on 5)

Yang, F., He, K., Yang, L., Du, H., Yang, J., Yang, B., and Sun, L. Learning interpretable decision rule sets: A submodular optimization approach. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=pZHGKM9mAp. (Cited on 1)

Yuille, A. L. and Rangarajan, A. The concave-convex procedure (CCCP). In Dietterich, T., Becker, S., and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001. URL https://proceedings.
neurips.cc/paper/2001/file/a012869311d64a44b5a0d567cd20de04-Paper.pdf. (Cited on 1)

Yurtsever, A. and Sra, S. CCCP is Frank-Wolfe in disguise. arXiv preprint arXiv:2206.12014, 2022. (Cited on 6)

Table 1: Stopping criteria.

DCA, DCAR, ADCA, ADCAR: f(x^k) − f(x^{k+1}) ≤ 10^{-6}; k ≤ 30; X^{k+1} local minimum of F.
CDCA variants: f(x^k) − f(x^{k+1}) ≤ 10^{-6}; k + #FW iterations ≤ 30; X^{k+1} local minimum of F.
SubSup: X^{k+1} = X^k; k ≤ 30; X^{k+1} local minimum of F.
SupSub, ModMod: X^{k+1} = X^k; k ≤ 3·10^4; X^{k+1} local minimum of F.
MNP, PGM: gap ≤ 10^{-6}; k ≤ 3·10^4.
PGM in DCA variants and MNP in SubSup: gap ≤ 10^{-6}; k ≤ 10^3.
FW in CDCA variants: gap ≤ 10^{-6}; k ≤ 10.

A. SubSup as a Special Case of DCA

We show that the SubSup procedure proposed in (Narasimhan & Bilmes, 2005) is a special case of DCA. SubSup starts from X^0 ⊆ V, and makes the following updates at each iteration:

y^k_{σ(i)} = H(σ(i) | S^σ_{i−1}) for all i ∈ V, for σ ∈ S_d such that S^σ_{|X^k|} = X^k,
X^{k+1} ∈ argmin_{X⊆V} G(X) − y^k(X). (14)

Note that y^k ∈ ∂h_L(1_{X^k}) by Proposition 2.3-f and 1_{X^{k+1}} ∈ argmin_{x∈[0,1]^d} g_L(x) − ⟨x, y^k⟩ as discussed in Section 3, thus these are valid updates of DCA in Eq. (7) with ρ = ϵ_x = 0.

B. Experimental Setup Additional Details

In this section, we provide additional details on our experimental setup. As in (Iyer & Bilmes, 2012), we consider in ModMod and SupSub two modular upper bounds on G, which we try in parallel, picking the one which yields the best objective F. We set the parameter q in ADCA and ADCAR to 5, as done in (Nhat et al., 2018). We summarize the stopping criteria used in all methods and their subsolvers in Table 1. We pick the maximum number of iterations according to the complexity per iteration. We use the random seeds 42, 43, and 44. We use the implementation of MNP from the Matlab code provided in (Bach, 2013, Section 12.1) and implement the rest of the methods in Matlab.

C. Additional Empirical Results

In this section, we present some additional empirical results for the experiments presented in Section 5.
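The SubSup update (14) can be sketched directly from the two displayed steps: build the modular lower bound y^k of H from a permutation that lists X^k first, then minimize the submodular function G − y^k. A minimal sketch, using brute-force enumeration for the inner minimization purely for illustration (the experiments use MNP instead); the function name is illustrative.

```python
from itertools import chain, combinations

def subsup_step(G, H, Xk, d):
    """One SubSup iteration (Narasimhan & Bilmes, 2005) as a DCA step:
    (i) build the modular lower bound y^k of H along a permutation that
    lists the elements of X^k first, y^k_{sigma(i)} = H(sigma(i) | S_{i-1});
    (ii) minimize the submodular function G - y^k.  The minimization is
    brute force here for illustration only; the paper uses MNP."""
    order = sorted(range(d), key=lambda i: i not in Xk)  # X^k first
    y, S, prev = {}, set(), H(frozenset())
    for i in order:
        S.add(i)
        cur = H(frozenset(S))
        y[i] = cur - prev
        prev = cur
    subsets = chain.from_iterable(combinations(range(d), r) for r in range(d + 1))
    return min((frozenset(A) for A in subsets),
               key=lambda A: G(A) - sum(y[i] for i in A))
```

Since y^k lower-bounds H and agrees with it on X^k, each step monotonically decreases F = G − H, which is exactly the DCA descent property with ρ = ϵ_x = 0.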
C.1. Effect of regularization

We report the discrete and continuous objective values per iteration of our proposed methods, for all ρ values, on the speech dataset in Fig. 2 and the mushroom dataset in Fig. 3. We observe that the variants without rounding at each iteration converge slower in f_L for larger ρ values, though not always; e.g., DCA with ρ = 0.001 converges faster than with ρ = 0 on the speech dataset, and CDCA with ρ = 0.01 converges faster than with ρ = 0.1 on the mushroom dataset. The effect of ρ on the rounded variants is less clear; in most cases the methods with small ρ values perform worse, but for CDCAR on the speech dataset the opposite is true. We again observe that better performance w.r.t. f_L does not necessarily translate to better performance w.r.t. F. The effect of ρ on performance w.r.t. F varies with the different methods and datasets. But in all cases, the best F value is obtained with ρ > 0.

Recall that we use PGM to compute an ϵ_x-subgradient of g* to update x^k in the DCA variants (7b) and CDCA variants (5b), as well as in each iteration of FW (12) to update y^k in the CDCA variants (5a). As discussed in Section 3, PGM requires O(dκ²/ϵ_x²) iterations when ρ = 0 and O(2(κ + ρ√d)²/ρϵ_x) when ρ > 0, where κ is the Lipschitz constant of g_L(x) − ⟨x, y^k⟩. Figure 4 shows the gap reached by PGM at each iteration of the DCA variants, and the worst gap reached by PGM over all the approximate subgradient computations done at each iteration of the CDCA variants. As expected, a larger ρ leads to a more accurate solution (smaller gap) for a fixed number of PGM iterations (we used 1000). However, the accuracy at ρ = 0 is better than at the very small non-zero values ρ = 0.01, 0.001, for which the complexity O(2(κ + ρ√d)²/ρϵ_x) becomes larger than O(dκ²/ϵ_x²).
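The subsolver's structure can be illustrated with a generic projected subgradient method on [0,1]^d. This is a sketch, not the paper's implementation: `subgrad` is a hypothetical oracle returning a subgradient of the non-regularized part at x, and the 1/(ρ(t+1)) step size is the standard choice for ρ-strongly convex subgradient methods.

```python
def pgm(subgrad, d, rho=0.0, steps=1000, lr=None):
    """Projected (sub)gradient method on [0,1]^d for the subproblem
    min_x  phi(x) + (rho/2)||x||^2, given an oracle `subgrad` returning
    a subgradient of phi at x.  Uses the standard 1/(rho*(t+1)) step
    size when rho > 0 (strongly convex case), else a fixed step."""
    x = [0.5] * d
    for t in range(steps):
        step = 1.0 / (rho * (t + 1)) if rho > 0 else (lr or 0.01)
        g = subgrad(x)
        # gradient step on phi + (rho/2)||.||^2, then clip to [0,1]^d
        x = [min(1.0, max(0.0, xi - step * (gi + rho * xi)))
             for xi, gi in zip(x, g)]
    return x
```

With ρ > 0 the iterates converge at the O(1/ϵ_x) rate quoted above, versus O(1/ϵ_x²) for the non-strongly-convex case ρ = 0, which is the trade-off Figure 4 illustrates.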
Figure 2: Discrete and continuous objective values (log-scale) of our proposed methods for all ρ values vs iterations on speech dataset.

Figure 3: Discrete and continuous objective values (log-scale) of our proposed methods for all ρ values vs iterations on mushroom dataset.

Figure 4: PGM gap values (log-scale) of our proposed methods for all ρ values vs iterations on speech (top two rows) and mushroom (bottom two rows) datasets.

Figure 5: Discrete and continuous objective values (log-scale) vs time on speech (top) and mushroom (bottom) datasets. We include separate plots for non-DCA variants for visibility.

C.2. Running times

We report in Fig. 5 the discrete and continuous objective values with respect to time. We again only include our methods with the ρ value achieving the smallest discrete objective. As expected, the DCA variants (including SubSup) have a significantly higher computational cost than the other baselines.
Recall that SubSup is a special case of DCA with ρ = 0 and x^k chosen to be integral (see Appendix A and the computational complexity discussion in Section 3), so theoretically the cost of SubSup is the same as DCA with ρ = 0. In our experiments, we use the MNP algorithm for the submodular minimization in SubSup, min_{X⊆V} G(X) − y^k(X) = min_{x∈[0,1]^d} g_L(x) − ⟨x, y^k⟩, and PGM to solve Problem (7b), min_{x∈[0,1]^d} g_L(x) − ⟨x, y^k⟩ + (ρ/2)‖x‖², in DCA (MNP cannot be used for this problem when ρ > 0). MNP requires O(d·diam(B(G − y^k))²/ϵ_x²) iterations to obtain an ϵ_x-solution to min_{X⊆V} G(X) − y^k(X) (Chakrabarty et al., 2014, Theorems 4 and 5). We can bound diam(B(G − y^k)) ≤ 2κ, where recall that κ is the Lipschitz constant of g_L(x) − ⟨x, y^k⟩ given in Proposition 2.3-g. Hence MNP requires the same number of iterations, O(dκ²/ϵ_x²), as PGM with ρ = 0, while the time per iteration of MNP, O(d² + d log d + d·EO_G) (Chakrabarty et al., 2014, Proof of Theorem 1), is larger than that of PGM, O(d log d + d·EO_G) (see the computational complexity discussion in Section 3). Nevertheless, in our experiments, we observe that SubSup actually has a lower running time per iteration than DCA on the speech dataset (this is true even for DCA with ρ = 0; see Fig. 6), and a similar one on the mushroom dataset.

C.3. SubSup vs DCA and DCAR with ρ = 0

In this section, we compare the performance of SubSup with DCA and DCAR with ρ = 0. We plot the discrete objective values of these three algorithms with respect to both iterations and time in Fig. 6. We observe that SubSup performs similarly to DCAR with ρ = 0 in terms of F values, while DCA with ρ = 0 obtains slightly better F values on the speech dataset. Note that the only difference between SubSup and DCA with ρ = 0 is that SubSup chooses an integral solution in Problem (7b), using the MNP algorithm, while DCA chooses a possibly non-integral solution using the PGM algorithm.
Hence, it seems that there is some advantage to not restricting the x^k iterates of DCA to be integral in some cases. In terms of running time, SubSup has a lower iteration time than the other two algorithms on the speech dataset, and a similar one on the mushroom dataset (see the discussion in Appendix C.2).

Figure 6: Discrete objective values (log-scale) of three DCA variants with ρ = 0, vs iterations (left) and time (right), on speech (top) and mushroom (bottom) datasets.

D. Proofs of Section 2

D.1. Proof of Proposition 2.7

Proposition 2.7. Let g, h ∈ Γ₀ and ϵ ≥ 0. Then we have:

a) Let x̂, x be two points satisfying ∂_{ϵ₁}g(x̂) ∩ ∂_{ϵ₂}h(x) ≠ ∅, for some ϵ₁, ϵ₂ ≥ 0 such that ϵ₁ + ϵ₂ = ϵ; then g(x̂) − h(x̂) ≤ g(x) − h(x) + ϵ. Moreover, if x̂ admits a neighbourhood U such that ∂_{ϵ₁}g(x̂) ∩ ∂_{ϵ₂}h(x) ≠ ∅ for all x ∈ U ∩ dom g, then x̂ is an ϵ-local minimum of g − h. Conversely, if x̂ is an ϵ-local minimum of g − h, then it is also an ϵ-strong critical point of g − h.

b) If h is locally polyhedral convex, then x̂ is an ϵ-local minimum of g − h if and only if it is an ϵ-strong critical point of g − h.

Proof. a) The first part is an extension of (Le Thi & Pham Dinh, 1997, Theorem 4). Given y ∈ ∂_{ϵ₁}g(x̂) ∩ ∂_{ϵ₂}h(x), we have g(x̂) + ⟨y, x − x̂⟩ − ϵ₁ ≤ g(x) and h(x) + ⟨y, x̂ − x⟩ − ϵ₂ ≤ h(x̂). Hence, g(x̂) − h(x̂) ≤ g(x) − h(x) + ϵ₁ + ϵ₂. If x̂ admits a neighbourhood where this holds, then x̂ is an ϵ-local minimum of g − h. The converse direction extends a well-known property; see e.g., (Hiriart-Urruty, 1989, Proposition 3.1). Given an ϵ-local minimum x̂ of g − h, there exists a neighborhood U of x̂ such that g(x̂) − h(x̂) ≤ g(x) − h(x) + ϵ for all x ∈ U. Then for any ŷ ∈ ∂h(x̂), g(x) − g(x̂) ≥ h(x) − h(x̂) − ϵ ≥ ⟨ŷ, x − x̂⟩ − ϵ for all x ∈ U.
This is enough to have ŷ ∈ ∂_ϵg(x̂), since for any x ∈ R^d, we can apply the inequality to x′ = x̂ + τ(x − x̂) with τ > 0 small enough to have x′ ∈ U; then by convexity we get

τg(x) + (1 − τ)g(x̂) − g(x̂) ≥ g(x̂ + τ(x − x̂)) − g(x̂) ≥ ⟨ŷ, x̂ + τ(x − x̂) − x̂⟩ − ϵ,

which implies g(x) − g(x̂) ≥ ⟨ŷ, x − x̂⟩ − ϵ. Hence, ∂h(x̂) ⊆ ∂_ϵg(x̂).

b) This is an extension of (Le Thi & Pham Dinh, 1997, Corollary 2). If h is locally polyhedral convex, then for every x ∈ dom ∂h there exists a neighborhood U of x such that ∂h(x′) ⊆ ∂h(x) for all x′ ∈ U (Le Thi & Pham Dinh, 1997, Theorem 5). Given x̂ satisfying ∂h(x̂) ⊆ ∂_ϵg(x̂), we have that ∂h(x) ⊆ ∂_ϵg(x̂) for all x ∈ U, for some neighborhood U of x̂, which implies that x̂ is an ϵ-local minimum of g − h by Item a. The converse direction also follows from Item a.

Remark D.1. Let ri(S) denote the relative interior of a convex set S. Note that any x ∈ ri(dom g) ∩ ri(dom h) (which is equal to ri(dom g), since we assumed min_x g(x) − h(x) is finite) is an ϵ-local minimum of g − h for any ϵ > 0. Hence, ϵ-local minimality is meaningless on ri(dom g) ∩ ri(dom h) for any ϵ > 0. However, in this work, we are interested in integral ϵ-local minima of Problem (2), which lie on the boundary of dom(g_L + δ_{[0,1]^d}) = [0, 1]^d, and hence are not meaningless.

Proof. Since g, h are convex, they are continuous on ri(dom g) ∩ ri(dom h) (Rockafellar, 1970, Theorem 10.1). This implies that f is also continuous on ri(dom g) ∩ ri(dom h); hence for any x ∈ ri(dom g) ∩ ri(dom h) and any ϵ > 0, there exists δ > 0 such that for all x′ ∈ ri(dom g) ∩ ri(dom h) satisfying ‖x′ − x‖ < δ, we have |f(x′) − f(x)| < ϵ, hence f(x) < f(x′) + ϵ.

E. Proofs of Section 3

E.1. Proof of Theorem 3.1

Before proving Theorem 3.1, we need the following lemma.

Lemma E.1 (Lemma 5 in (Pham Dinh et al., 2022)). Let Φ be a ρ-strongly convex function with ρ ≥ 0. Then for any ϵ ≥ 0, t ∈ (0, 1], x ∈ dom Φ and y ∈ ∂_ϵΦ(x), we have for all z

Φ(z) ≥ Φ(x) + ⟨y, z − x⟩ + (ρ(1 − t)/2)‖z − x‖² − ϵ/t.

We now present a more detailed version of Theorem 3.1 and its proof.

Theorem E.2.
Given any f = g − h, where g, h ∈ Γ₀, let {x^k} and {y^k} be generated by approximate DCA (Algorithm 1), and define T_Φ(x^{k+1}) = Φ(x^k) − Φ(x^{k+1}) − ⟨y^k, x^k − x^{k+1}⟩ for any Φ ∈ Γ₀. Then for all t_x, t_y ∈ (0, 1] and k ∈ N, letting ρ = ρ(g)(1 − t_x) + ρ(h)(1 − t_y) and ϵ̄ = ϵ_x/t_x + ϵ_y/t_y, we have:

a) T_g(x^{k+1}) ≥ (ρ(g)(1 − t_x)/2)‖x^k − x^{k+1}‖² − ϵ_x/t_x and T_h(x^{k+1}) ≤ −(ρ(h)(1 − t_y)/2)‖x^k − x^{k+1}‖² + ϵ_y/t_y. Moreover, for any ϵ ≥ 0, if T_g(x^{k+1}) ≤ ϵ, then x^k is an (ϵ + ϵ_x, ϵ_y)-critical point of g − h, with y^k ∈ ∂_{ϵ+ϵ_x}g(x^k) ∩ ∂_{ϵ_y}h(x^k), and (ρ(g)(1 − t_x)/2)‖x^k − x^{k+1}‖² ≤ ϵ_x/t_x + ϵ. Conversely, if x^k ∈ ∂_{ϵ+ϵ_x}g*(y^k), then T_g(x^{k+1}) ≤ ϵ_x + ϵ. Similarly, if T_h(x^{k+1}) ≥ −ϵ, then x^{k+1} is an (ϵ_x, ϵ + ϵ_y)-critical point of g − h, with y^k ∈ ∂_{ϵ_x}g(x^{k+1}) ∩ ∂_{ϵ+ϵ_y}h(x^{k+1}), and (ρ(h)(1 − t_y)/2)‖x^k − x^{k+1}‖² ≤ ϵ_y/t_y + ϵ. Conversely, if y^k ∈ ∂_{ϵ+ϵ_y}h(x^{k+1}), then T_h(x^{k+1}) ≥ −ϵ_y − ϵ.

b) f(x^k) − f(x^{k+1}) = T_g(x^{k+1}) − T_h(x^{k+1}) ≥ (ρ/2)‖x^k − x^{k+1}‖² − ϵ̄.

c) For any ϵ ≥ 0, f(x^k) − f(x^{k+1}) ≤ ϵ if and only if T_g(x^{k+1}) − T_h(x^{k+1}) ≤ ϵ. In this case, x^k is an (ϵ₁ + ϵ_x, ϵ_y)-critical point of g − h, with y^k ∈ ∂_{ϵ₁+ϵ_x}g(x^k) ∩ ∂_{ϵ_y}h(x^k), and x^{k+1} is an (ϵ_x, ϵ₂ + ϵ_y)-critical point of g − h, with y^k ∈ ∂_{ϵ_x}g(x^{k+1}) ∩ ∂_{ϵ₂+ϵ_y}h(x^{k+1}), for some ϵ₁ + ϵ₂ = ϵ, ϵ₁ ≥ −ϵ_x, ϵ₂ ≥ −ϵ_y, and (ρ/2)‖x^k − x^{k+1}‖² ≤ ϵ + ϵ̄. Conversely, if x^k ∈ ∂_{ϵ_x+ϵ₁}g*(y^k) and y^k ∈ ∂_{ϵ_y+ϵ₂}h(x^{k+1}), then T_g(x^{k+1}) − T_h(x^{k+1}) ≤ ϵ_x + ϵ_y + ϵ and f(x^k) − f(x^{k+1}) ≤ ϵ_x + ϵ_y + ϵ.

d) min_{k∈{0,1,...,K−1}} T_g(x^{k+1}) − T_h(x^{k+1}) = min_{k∈{0,1,...,K−1}} f(x^k) − f(x^{k+1}) ≤ (f(x^0) − f*)/K.

e) If ρ(g) + ρ(h) > 0, then min_{k∈{0,1,...,K−1}} ‖x^k − x^{k+1}‖² ≤ (2/ρ)((f(x^0) − f*)/K + ϵ̄).

Proof. a) Since x^{k+1} ∈ ∂_{ϵ_x}g*(y^k), we have y^k ∈ ∂_{ϵ_x}g(x^{k+1}) by Proposition 2.4. By Lemma E.1 we have for all x ∈ R^d

g(x) ≥ g(x^{k+1}) + ⟨y^k, x − x^{k+1}⟩ + (ρ(g)(1 − t_x)/2)‖x − x^{k+1}‖² − ϵ_x/t_x, (15)

hence T_g(x^{k+1}) ≥ (ρ(g)(1 − t_x)/2)‖x^k − x^{k+1}‖² − ϵ_x/t_x. If T_g(x^{k+1}) ≤ ϵ, taking t_x = 1 in (15), we have for all x ∈ R^d

g(x) ≥ g(x^k) − ⟨y^k, x^k − x^{k+1}⟩ − ϵ + ⟨y^k, x − x^{k+1}⟩ − ϵ_x = g(x^k) + ⟨y^k, x − x^k⟩ − ϵ − ϵ_x,

so y^k ∈ ∂_{ϵ+ϵ_x}g(x^k) ∩ ∂_{ϵ_y}h(x^k) and x^k is an (ϵ + ϵ_x, ϵ_y)-critical point. Similarly, since y^k ∈ ∂_{ϵ_y}h(x^k), we have for all x ∈ R^d

h(x) ≥ h(x^k) + ⟨y^k, x − x^k⟩ + (ρ(h)(1 − t_y)/2)‖x − x^k‖² − ϵ_y/t_y, (16)

hence T_h(x^{k+1}) ≤ −(ρ(h)(1 − t_y)/2)‖x^k − x^{k+1}‖² + ϵ_y/t_y.
If T_h(x^{k+1}) ≥ −ϵ, taking t_y = 1 in (16), we have for all x ∈ R^d

h(x) ≥ h(x^{k+1}) + ⟨y^k, x^k − x^{k+1}⟩ − ϵ + ⟨y^k, x − x^k⟩ − ϵ_y = h(x^{k+1}) + ⟨y^k, x − x^{k+1}⟩ − ϵ − ϵ_y,

so y^k ∈ ∂_{ϵ_x}g(x^{k+1}) ∩ ∂_{ϵ+ϵ_y}h(x^{k+1}) and x^{k+1} is an (ϵ_x, ϵ + ϵ_y)-critical point. The converse directions follow directly from the definitions of approximate subdifferentials and T_Φ.

b) We have

f(x^k) − f(x^{k+1}) = g(x^k) − g(x^{k+1}) − ⟨y^k, x^k − x^{k+1}⟩ − (h(x^k) − h(x^{k+1}) − ⟨y^k, x^k − x^{k+1}⟩) = T_g(x^{k+1}) − T_h(x^{k+1}) ≥ (ρ/2)‖x^k − x^{k+1}‖² − ϵ̄,

where the inequality follows from Item a.

c) This follows from Items a and b, by choosing ϵ₁ = T_g(x^{k+1}) and ϵ₂ = −T_h(x^{k+1}).

d) This follows from Item b by a telescoping sum:

min_{k∈{0,1,...,K−1}} T_g(x^{k+1}) − T_h(x^{k+1}) = min_{k∈{0,1,...,K−1}} f(x^k) − f(x^{k+1}) ≤ (1/K) Σ_{k=0}^{K−1} [T_g(x^{k+1}) − T_h(x^{k+1})] = (1/K) Σ_{k=0}^{K−1} [f(x^k) − f(x^{k+1})] = (f(x^0) − f(x^K))/K ≤ (f(x^0) − f*)/K.

e) This follows from Items b and d:

min_{k∈{0,1,...,K−1}} ‖x^k − x^{k+1}‖² ≤ min_{k∈{0,1,...,K−1}} (2/ρ)(f(x^k) − f(x^{k+1}) + ϵ̄) ≤ (2/ρ)((f(x^0) − f*)/K + ϵ̄).

Note that f(x^k) − f(x^{k+1}) acts as a measure of non-criticality, since f(x^k) = f(x^{k+1}) implies that x^k and x^{k+1} are critical points when ϵ_x = ϵ_y = 0. Theorem E.2 also motivates min{T_g(x^{k+1}), −T_h(x^{k+1})} as a weaker measure of non-criticality, since min{T_g(x^{k+1}), −T_h(x^{k+1})} = 0 implies that x^k is a critical point when ϵ_x = ϵ_y = 0. Items a and d imply the following bound

min_{k∈{0,1,...,K−1}} min{T_g(x^{k+1}), −T_h(x^{k+1})} ≤ min{(f(x^0) − f*)/K + ϵ_y, (f(x^0) − f*)/K + ϵ_x},

which recovers the convergence rate provided in (Abbaszadehpeivasti et al., 2021, Corollary 4.1) on T_g(x^{k+1}), with ϵ_x = ϵ_y = 0. The criterion T_g(x^{k+1}) ≤ ϵ has also been used as a stopping criterion of FW for nonconvex problems; see Appendix F.1 and (Ghadimi, 2019, Eq. (2.6)).

E.2. Proof of Corollary 3.2

Before proving Corollary 3.2, we need the following lemma.

Lemma E.3. Let Φ be a convex function with bounded domain of diameter D, i.e., ‖x − z‖ ≤ D for all x, z ∈ dom Φ, and let Φ̃ = Φ + (ρ/2)‖·‖² for some ρ ≥ 0. Then for any x ∈ dom Φ, if y − ρx ∈ ∂_ϵΦ(x), then y ∈ ∂_ϵΦ̃(x). Conversely, if y ∈ ∂_ϵΦ̃(x), then y − ρx ∈ ∂_{ϵ′}Φ(x), where ϵ′ = √(2ρϵ)·D if ϵ ≤ ρD²/2, and ϵ′ = ρD²/2 + ϵ otherwise.
If y − ρx ∈ ∂_ϵΦ(x), we have for all z ∈ dom Φ,
Φ̄(z) = Φ(z) + (ρ/2)‖z‖² ≥ Φ(x) + ⟨y − ρx, z − x⟩ − ϵ + (ρ/2)‖z‖²
= Φ̄(x) + ⟨y, z − x⟩ + (ρ/2)‖z‖² − (ρ/2)‖x‖² − ρ⟨x, z − x⟩ − ϵ
= Φ̄(x) + ⟨y, z − x⟩ + (ρ/2)‖z − x‖² − ϵ
≥ Φ̄(x) + ⟨y, z − x⟩ − ϵ.
Hence, y ∈ ∂_ϵΦ̄(x). Conversely, if y ∈ ∂_ϵΦ̄(x), then by Lemma E.1, we have for all t ∈ (0, 1], z ∈ dom Φ,
Φ̄(z) ≥ Φ̄(x) + ⟨y, z − x⟩ + (ρ(1 − t)/2)‖z − x‖² − ϵ/t, so
Φ(z) ≥ Φ(x) + ⟨y − ρx, z − x⟩ − (ρt/2)‖z − x‖² − ϵ/t ≥ Φ(x) + ⟨y − ρx, z − x⟩ − ρtD²/2 − ϵ/t.
Hence, y − ρx ∈ ∂_{ϵ′}Φ(x) with ϵ′ = min_{t∈(0,1]} { ρtD²/2 + ϵ/t } = min{ D√(2ρϵ), ρD²/2 + ϵ }.
Corollary 3.2. Given f = g − h as defined in (6), let {x^k} and {y^k} be generated by a variant of approximate DCA (7), where x^k is integral, i.e., x^k = 1_{X^k} for some X^k ⊆ V, and y^k − ρx^k is computed as in Proposition 2.3-f. Then for all k ∈ ℕ, ϵ ≥ 0, we have:
a) If f(x^k) − f(x^{k+1}) ≤ ϵ, then
F(X^k) ≤ F(S^σ_ℓ) + ϵ′ for all ℓ ∈ V, (8)
where
ϵ′ = √(2ρd(ϵ + ϵ_x)) if ϵ + ϵ_x ≤ ρd/2, and ϵ′ = ρd/2 + ϵ + ϵ_x otherwise, (9)
and σ ∈ S_d is the permutation used to compute y^k − ρx^k in Proposition 2.3-f.
b) Given d permutations σ₁, ..., σ_d ∈ S_d, corresponding to decreasing orders of x^k with different elements at σ(|X^k|) or σ(|X^k| + 1), and the corresponding subgradients y^k_{σ₁}, ..., y^k_{σ_d} ∈ ∂h(x^k) chosen as in Proposition 2.3-f, if we choose x^{k+1} = argmin_{i∈V} {f(x^{k+1}_{σ_i}) : x^{k+1}_{σ_i} ∈ ∂_{ϵ_x}g*(y^k_{σ_i})}, then if f(x^k) − f(x^{k+1}) ≤ ϵ, Eq. (8) holds with σ = σ_i for all i ∈ V. Hence, X^k is an ϵ′-local minimum of F.
Proof. a) If f(x^k) − f(x^{k+1}) ≤ ϵ, we have by Theorem 3.1-b (with ϵ_y = 0) that y^k ∈ ∂_{ϵ+ϵ_x}g(x^k), which by Lemma E.3 implies that y^k − ρx^k ∈ ∂_{ϵ′}(g_L + δ_{[0,1]^d})(x^k), by taking D = max_{x,z∈dom(g_L+δ_{[0,1]^d})} ‖x − z‖ = √d. We observe that for any ℓ ∈ V, we have y^k − ρx^k ∈ ∂h_L(1_{S^σ_ℓ}) by Proposition 2.3-f. Hence, ∂_{ϵ′}(g_L + δ_{[0,1]^d})(x^k) ∩ ∂h_L(1_{S^σ_ℓ}) ≠ ∅, and by Proposition 2.7-a, f(x^k) ≤ f(1_{S^σ_ℓ}) + ϵ′. The statement then follows by Proposition 2.3-a.
b) Note that y^k_{σ_i}, x^{k+1}_{σ_i} for any i ∈ V are valid iterates for approximate DCA, so Item a applies to them. If f(x^k) − f(x^{k+1}) ≤ ϵ, then f(x^k) − f(x^{k+1}_{σ_i}) ≤ ϵ, since f(x^{k+1}) ≤ f(x^{k+1}_{σ_i}) for all i ∈ V. Hence, by Item a we have F(X^k) ≤ F(S^{σ_i}_ℓ) + ϵ′ for all i, ℓ ∈ V.
We now observe that for any j ∈ X^k there exists σ_i for some i ∈ V such that σ_i(|X^k|) = j and S^{σ_i}_{|X^k|−1} = X^k \ {j}. Similarly, for any j ∈ V \ X^k, there exists σ_i for some i ∈ V such that σ_i(|X^k| + 1) = j and S^{σ_i}_{|X^k|+1} = X^k ∪ {j}. Then X^k is an ϵ′-local minimum of F.
E.3. Convergence properties of DCA variants
In this section, we present convergence properties of the DCA variants discussed in Section 3. We start with the DCA variant presented in Algorithm 2, where at convergence we explicitly check whether rounding the current iterate yields an ϵ′-local minimum of F, and if not we restart from the best neighboring set.
Algorithm 2 Approximate DCA with local minimality stopping criterion
1: ϵ, ϵ′, ϵ_x ≥ 0, x⁰ ∈ dom h, k ← 0.
2: for k = 1, 2, ..., K do
3: y^k ∈ ∂h(x^k)
4: x^{k+1} ∈ ∂_{ϵ_x}g*(y^k)
5: if f(x^k) − f(x^{k+1}) ≤ ϵ then
6: X^{k+1} = Round_F(x^{k+1})
7: if X^{k+1} is an ϵ′-local minimum of F then
8: Stop
9: else
10: x^{k+1} = 1_{X̂^{k+1}} where X̂^{k+1} = argmin_{|X △ X^{k+1}|=1} F(X)
11: end if
12: end if
13: end for
Proposition E.4. Given f = g − h as defined in (6) and ϵ′ ≥ ϵ + ϵ_x, Algorithm 2 converges to an ϵ′-local minimum of F after at most (f(x⁰) − f*)/ϵ iterations.
Proof. Note that between each restart (line 10), Algorithm 2 is simply running approximate DCA, so Theorem 3.1 applies. For any iteration k ∈ ℕ, if the algorithm did not terminate, then either f(x^k) − f(x^{k+1}) > ϵ, or X^{k+1} is not an ϵ′-local minimum of F and thus F(X^{k+1}) > F(X̂^{k+1}) + ϵ′. In the second case, we have
f(1_{X̂^{k+1}}) = F(X̂^{k+1}) < F(X^{k+1}) − ϵ′ (by Proposition 2.3-a)
≤ f(x^{k+1}) − ϵ′ (by Proposition 2.3-d)
≤ f(x^k) + ϵ_x − ϵ′ (by Theorem 3.1-a with t_x = 1, ϵ_y = 0)
≤ f(x^k) − ϵ (since ϵ′ ≥ ϵ + ϵ_x).
Hence, the new x^{k+1} = 1_{X̂^{k+1}} will satisfy f(x^k) − f(x^{k+1}) ≥ ϵ. Thus f* ≤ f(x^k) ≤ f(x⁰) − kϵ and k ≤ (f(x⁰) − f*)/ϵ.
Next we present convergence properties of approximate DCAR (10).
Theorem E.5.
Given f = g − h as defined in (6), let {x^k}, {X^k}, {x̄^k} and {y^k} be generated by approximate DCAR (10), and define T_Φ(x̄^{k+1}) = Φ(x^k) − Φ(x̄^{k+1}) − ⟨y^k, x^k − x̄^{k+1}⟩ for any Φ ∈ Γ0. Then for all t_x ∈ (0, 1], k ∈ ℕ, we have:
a) T_g(x̄^{k+1}) ≥ (ρ(1 − t_x)/2)‖x^k − x̄^{k+1}‖² − ϵ_x/t_x, and T_h(x̄^{k+1}) ≤ −(ρ/2)‖x^k − x̄^{k+1}‖².
Moreover, for any ϵ ≥ 0, if T_g(x̄^{k+1}) ≤ ϵ, then x^k is an (ϵ + ϵ_x, 0)-critical point of g − h, with y^k ∈ ∂_{ϵ+ϵ_x}g(x^k) ∩ ∂h(x^k), and (ρ(1 − t_x)/2)‖x^k − x̄^{k+1}‖² ≤ ϵ_x/t_x + ϵ. Conversely, if x^k ∈ ∂_{ϵ+ϵ_x}g*(y^k), then T_g(x̄^{k+1}) ≤ ϵ_x + ϵ.
Similarly, if T_h(x̄^{k+1}) ≥ −ϵ, then x̄^{k+1} is an (ϵ_x, ϵ)-critical point of g − h, with y^k ∈ ∂_{ϵ_x}g(x̄^{k+1}) ∩ ∂_ϵh(x̄^{k+1}), and (ρ/2)‖x^k − x̄^{k+1}‖² ≤ ϵ. Conversely, if y^k ∈ ∂_ϵh(x̄^{k+1}), then T_h(x̄^{k+1}) ≥ −ϵ.
b) F(X^k) − F(X^{k+1}) ≥ f(x^k) − f(x̄^{k+1}) = T_g(x̄^{k+1}) − T_h(x̄^{k+1}) ≥ (ρ(2 − t_x)/2)‖x^k − x̄^{k+1}‖² − ϵ_x/t_x.
c) For any ϵ ≥ 0, if F(X^k) − F(X^{k+1}) ≤ ϵ, then f(x^k) − f(x̄^{k+1}) = T_g(x̄^{k+1}) − T_h(x̄^{k+1}) ≤ ϵ. In this case, x^k is an (ϵ_x + ϵ₁, 0)-critical point of g − h, with y^k ∈ ∂_{ϵ_x+ϵ₁}g(x^k) ∩ ∂h(x^k), and x̄^{k+1} is an (ϵ_x, ϵ₂)-critical point of g − h, with y^k ∈ ∂_{ϵ_x}g(x̄^{k+1}) ∩ ∂_{ϵ₂}h(x̄^{k+1}), for some ϵ₁ + ϵ₂ = ϵ, ϵ₁ ≥ −ϵ_x, ϵ₂ ≥ 0, and (ρ(2 − t_x)/2)‖x^k − x̄^{k+1}‖² ≤ ϵ + ϵ_x/t_x. Conversely, if x^k ∈ ∂_{ϵ_x+ϵ₁}g*(y^k) and y^k ∈ ∂_{ϵ₂}h(x̄^{k+1}), then T_g(x̄^{k+1}) − T_h(x̄^{k+1}) ≤ ϵ_x + ϵ and f(x^k) − f(x̄^{k+1}) ≤ ϵ_x + ϵ.
d) min_{k∈{0,1,...,K−1}} T_g(x̄^{k+1}) − T_h(x̄^{k+1}) = min_{k∈{0,1,...,K−1}} f(x^k) − f(x̄^{k+1}) ≤ min_{k∈{0,1,...,K−1}} F(X^k) − F(X^{k+1}) ≤ (F(X⁰) − F*)/K.
e) If ρ > 0, then min_{k∈{0,1,...,K−1}} ‖x^k − x̄^{k+1}‖² ≤ (2/(ρ(2 − t_x)))((F(X⁰) − F*)/K + ϵ_x/t_x).
Proof. Note that the iterates x̄^{k+1}, y^k are generated by an approximate DCA step from x^k, so Theorem E.2 applies to them.
a) The claim follows from Theorem E.2-a.
b) By Theorem E.2-b, we have f(x^k) − f(x̄^{k+1}) = T_g(x̄^{k+1}) − T_h(x̄^{k+1}) ≥ (ρ(2 − t_x)/2)‖x^k − x̄^{k+1}‖² − ϵ_x/t_x. By Proposition 2.3-a, we also have F(X^k) − F(X^{k+1}) = f(x^k) − f(x^{k+1}). The claim then follows since f(x^{k+1}) ≤ f(x̄^{k+1}) by Proposition 2.3-d.
c) This follows from Item b and Theorem E.2-c.
d) This follows from Item b by a telescoping sum.
e) This follows from Items b and d.
F. Proofs of Section 4
F.1. Proof of Corollary 4.1
Corollary 4.1.
Given any f = g − h, where g, h ∈ Γ0, and φ_k as defined in (11), let {w^t} be generated by approximate FW (12) with γ_t = 1. Then for all T ∈ ℕ, we have
min_{t∈{0,...,T−1}} gap(w^t) ≤ (φ_k(w⁰) − min_{w∈∂h(x^k)} φ_k(w))/T + ϵ.
Proof. We observe that approximate FW with γ_t = 1 is a special case of approximate DCA (Algorithm 1), with DC components ḡ = δ_{∂h(x^k)} and h̄ = −φ_k, and ϵ_x = 0, ϵ_y = ϵ. Indeed, we can write the approximate FW iterates as w⁰ ∈ ∂h(x^k) = dom ḡ, s^t ∈ ∂_ϵh̄(w^t), and w^{t+1} = v^t ∈ argmin_w { ḡ(w) − ⟨s^t, w⟩ } = ∂ḡ*(s^t), which are valid iterates of approximate DCA. We also show that ḡ, h̄ ∈ Γ0: we can assume w.l.o.g. that ∂h(x^k) ≠ ∅, otherwise the bound holds trivially; hence ḡ is proper. And since h ∈ Γ0, ∂h(x^k) is a closed and convex set, hence ḡ is a closed and convex function. We also have that h̄ is proper, since otherwise Problem (5a) would not have a finite minimum, which also implies that the minimum of the DC dual (4) is not finite, contradicting our assumption that the minimum of the DC problem (3) is finite. Finally, since the Fenchel conjugate g* is closed and convex, h̄ is also closed and convex. We can thus apply Theorem E.2. We get
min_{t∈{0,...,T−1}} T_ḡ(w^{t+1}) − T_h̄(w^{t+1}) ≤ (φ_k(w⁰) − min_{w∈∂h(x^k)} φ_k(w))/T (by Theorem E.2-d)
min_{t∈{0,...,T−1}} T_ḡ(w^{t+1}) ≤ (φ_k(w⁰) − min_{w∈∂h(x^k)} φ_k(w))/T + ϵ (by Theorem E.2-a with t_y = 1).
The claim now follows by noting that T_ḡ(w^{t+1}) = ⟨s^t, w^{t+1} − w^t⟩ = gap(w^t).
F.2. Proof of Proposition 4.2
Proposition 4.2. Given s, x ∈ ℝ^d, let a₁ > ⋯ > a_m denote the unique values of x, taken at the sets A₁, ..., A_m, i.e., A₁ ∪ ⋯ ∪ A_m = V and x_j = a_i for all i ∈ {1, ..., m}, j ∈ A_i, and let σ ∈ S_d be a decreasing order of x where we break ties according to s, i.e., x_{σ(1)} ≥ ⋯ ≥ x_{σ(d)} and s_{σ(|C_{i−1}|+1)} ≥ ⋯ ≥ s_{σ(|C_i|)}, where C_i = A₁ ∪ ⋯ ∪ A_i for all i ∈ {1, ..., m}. Define w*_{σ(k)} = H(σ(k) | S^σ_{k−1}) for all k ∈ V. Then w* is a maximizer of max_{w∈∂h_L(x)} ⟨s, w⟩.
Proof. By Proposition 2.3-f, w* ∈ ∂h_L(x), so it is a feasible solution.
Given any w h L(x), w is a maximizer of maxw B(H) w, s , hence it must satisfy w (Ci) = H(Ci) for all i {1, , m} (Bach, 2013, Proposition 4.2). We have k=1 sσ(|Ci 1|+k) H(σ(|Ci 1| + k) | Sσ |Ci 1|+k 1) w σ(|Ci 1|+k) k=1 (sσ(|Ci 1|+k) sσ(|Ci 1|+k+1))(H(Sσ |Ci 1|+k) w (Sσ |Ci 1|+k)) sσ(|Ci 1|+1)(H(Sσ |Ci 1|) w (Sσ |Ci 1|)) + sσ(|Ci|)(H(Sσ |Ci|) w (Sσ |Ci|)) The last inequality holds since w B(H) and Sσ |Ci| = Ci for all i {1, , m}. F.3. Proof of Theorem 4.3 To prove Theorem 4.3 we need the following lemma, which extends the result in (Pham Dinh & Souad, 1988, Theorem 2.3). Lemma F.1. For any ϵ 0, ˆx is an ϵ-strong critical point of g h if and only if there exists ˆy argmin{ y, ˆx g (y) : y h(ˆx)} such that ˆx ϵg (ˆy). Proof. If ˆx is an ϵ-strong critical point of g h, i.e., h(ˆx) ϵg(ˆx), then for every y h(ˆx), we have y ϵg(ˆx). In particular, this holds for ˆy argmin{ y, ˆx g (y) : y h(ˆx)}, hence ˆx ϵg (ˆy) by Proposition 2.4. Conversely, given ˆy argmin{ y, ˆx g (y) : y h(ˆx)} such that ˆx ϵg (ˆy), we have ˆy, ˆx g (ˆy) y, ˆx g (y), y h(ˆx). (17) Since ˆx ϵg (ˆy), we have by Proposition 2.4, g (ˆy) + g(ˆx) ˆy, ˆx ϵ. Combining this with (17) yields g(ˆx) ϵ y, ˆx g (y), y h(ˆx). By definition of g , we obtain g(ˆx) ϵ y, ˆx x + g(x), x Rd, y h(ˆx). Hence y ϵg(ˆx) for all y h(ˆx). Theorem 4.3. Given any f = g h, where g, h Γ0, let {xk} and {yk} be generated by variant of approximate CDCA (5), where xk+1 is any point in ϵxg (yk) (not necessarily a solution of Problem (5b)). Then, for ϵ 0, if f(xk) f(xk+1) ϵ, xk is an (ϵ + ϵx)-strong critical point of g h. Moreover, if h is locally polyhedral, then xk is also an (ϵ + ϵx)-local minimum of f. This is the case for h given by (6) when ρ = 0. Difference of submodular minimization via DC programming Proof. Since approximate CDCA is a special case of approximate DCA, with ϵy = 0, Theorem 3.1 applies. If f(xk) f(xk+1) ϵ, we have by Theorem 3.1-b that xk ϵx+ϵg (yk). Hence, by Lemma F.1 xk satisfies h(xk) ϵx+ϵg(xk). 
If h is locally polyhedral, this implies that x^k is an (ϵ + ϵ_x)-local minimum of f by Proposition 2.7-b.
F.4. Proof of Corollary 4.4
Corollary 4.4. Given f = g − h as defined in (6) and ε ≥ 0, let X̂ ⊆ V and x̂ = 1_X̂. If x̂ is an ε-strong critical point of g − h, then X̂ is an ε′-strong local minimum of F, where ε′ = √(2ρdε) if ε ≤ ρd/2, and ε′ = ρd/2 + ε otherwise. Conversely, if X̂ is an ε-strong local minimum of F, then x̂ is an ε-local minimum of f, and hence also an ε-strong critical point of g − h.
Proof. Assume that x̂ is an ε-strong critical point of g − h. We first observe that any vector x = 1_X corresponding to X ⊇ X̂ or X ⊆ X̂ has a common decreasing order with x̂; hence choosing ŷ as in Proposition 2.3-f according to this common order yields ŷ ∈ ∂h_L(x̂) ∩ ∂h_L(x), and ŷ + ρx̂ ∈ ∂h(x̂) ⊆ ∂_εg(x̂). By Lemma E.3, we thus have ŷ ∈ ∂_{ε′}(g_L + δ_{[0,1]^d})(x̂) and ∂_{ε′}(g_L + δ_{[0,1]^d})(x̂) ∩ ∂h_L(x) ≠ ∅. Proposition 2.7-a then implies that f_L(x̂) ≤ f_L(x) + ε′. Hence, X̂ is an ε′-strong local minimum of F by Proposition 2.3-a.
Conversely, assume X̂ is an ε-strong local minimum of F. We argue that there exists a neighborhood B_δ(x̂) = {x : ‖x − x̂‖_∞ < δ} of x̂, with δ = 1/4, such that any X = Round_F(x) for x ∈ B_δ(x̂) satisfies X ⊇ X̂ or X ⊆ X̂. To see this, note that for x ∈ B_δ(x̂), any σ ∈ S_d such that x_{σ(1)} ≥ ⋯ ≥ x_{σ(d)} would have S^σ_{|X̂|} = X̂, since x_i > x̂_i − δ > x̂_j + δ > x_j for all i ∈ X̂, j ∉ X̂. Since X = S^σ_k̂ where k̂ ∈ argmin_{k=0,1,...,d} F(S^σ_k), it must satisfy X ⊆ X̂ or X ⊇ X̂. As a result, we have by Proposition 2.3-d,a, f_L(x̂) = F(X̂) ≤ F(X) + ε ≤ f_L(x) + ε. Hence, x̂ is an ε-local minimum of f, and thus also an ε-strong critical point of g − h by Proposition 2.7-a.
F.5. Convergence properties of CDCAR
Corollary F.2. Let {x^k}, {x̄^{k+1}} and {y^k} be generated by an approximate version of CDCAR (13), where x̄^{k+1} ∈ ∂_{ϵ_x}g*(y^k) for some ϵ_x ≥ 0. Then all of the properties in Theorem E.5 hold.
In addition, if F(X^k) − F(X^{k+1}) ≤ ϵ for some ϵ ≥ 0, then x^k is an (ϵ + ϵ_x)-strong critical point of f, with ∂h(x^k) ⊆ ∂_{ϵ_x+ϵ}g(x^k), and X^k is an ϵ′-strong local minimum of F, where ϵ′ = √(2ρd(ϵ + ϵ_x)) if ϵ + ϵ_x ≤ ρd/2, and ϵ′ = ρd/2 + ϵ + ϵ_x otherwise. If ρ = 0, x^k is also an (ϵ + ϵ_x)-local minimum of f.
Proof. Since CDCAR is a special case of DCAR, all properties of the latter apply to the former. In addition, if F(X^k) − F(X^{k+1}) ≤ ϵ, we have by Theorem E.5-c that x^k ∈ ∂_{ϵ_x+ϵ}g*(y^k). Hence, by Lemma F.1, x^k satisfies ∂h(x^k) ⊆ ∂_{ϵ_x+ϵ}g(x^k). Hence, X^k is an ϵ′-strong local minimum of F by Corollary 4.4. If ρ = 0, h is polyhedral, hence x^k is an (ϵ + ϵ_x)-local minimum of f by Proposition 2.7-b.
G. Remarks on Local Optimality Conditions
The following example shows that rounding a fractional solution x^K returned by DCA or CDCA will not necessarily yield an ϵ-local minimum of F, for any ϵ ≥ 0, even if x^K is a local minimum of f_L. It also shows that the objective achieved by a local minimum of f_L can be arbitrarily worse than the minimum objective.
Example G.1. For any ϵ ≥ 0, α > ϵ, let V = {1, 2, 3}, G(X) = α|X|, and H : 2^V → ℝ be a set cover function defined as H(X) = α|∪_{i∈X} U_i|, where U₁ = {1}, U₂ = {1, 2}, U₃ = {1, 2, 3}. Then G is modular, H is submodular, and their corresponding Lovász extensions are g_L(x) = α(x₁ + x₂ + x₃) and h_L(x) = α(max{x₁, x₂, x₃} + max{x₂, x₃} + x₃); see e.g., (Bach, 2013, Section 6.3). The minimum value min_{X⊆V} G(X) − H(X) = −2α is achieved at X* = {3}. Consider the solution x̂ = (1, 0.5, 0); x̂ is a local minimum of f_L. To see this, note that for any vector x such that x₁ > x₂ > x₃, we have h_L(x) = g_L(x), hence f_L(x) = 0 = f_L(x̂). Accordingly, for any x in the neighborhood {x : ‖x − x̂‖_∞ < 0.25} of x̂, we have f_L(x) = 0 = f_L(x̂), thus x̂ is a local minimum of f_L.
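The claims in Example G.1 can be verified by brute force; the following is a minimal sketch (not part of the paper's code), with α fixed to 1:

```python
from itertools import chain, combinations

alpha = 1.0
V = [1, 2, 3]
U = {1: {1}, 2: {1, 2}, 3: {1, 2, 3}}  # covering sets of H

def G(X): return alpha * len(X)        # modular
def H(X):                              # set cover, submodular
    covered = set()
    for i in X:
        covered |= U[i]
    return alpha * len(covered)
def F(X): return G(X) - H(X)

subsets = list(chain.from_iterable(combinations(V, r) for r in range(4)))
vals = {S: F(S) for S in subsets}
best = min(vals, key=vals.get)
assert best == (3,) and vals[best] == -2 * alpha  # min F = -2*alpha at {3}

# Lovász extensions from the example, for x in [0,1]^3:
def g_L(x): return alpha * sum(x)
def h_L(x): return alpha * (max(x) + max(x[1], x[2]) + x[2])

# On the region x1 > x2 > x3 (e.g., around x_hat = (1, 0.5, 0)),
# g_L and h_L agree, so f_L = g_L - h_L vanishes identically there.
x_hat = (1.0, 0.5, 0.0)
assert g_L(x_hat) == h_L(x_hat)
```

The enumeration confirms both the value of the minimum and the flat region of f_L around x̂ used in the example.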
On the other hand, none of the sets ∅, {1}, {1, 2}, {1, 2, 3} obtained by rounding x̂ via Proposition 2.3-d is an ϵ-local minimum of F, since they all have objective value F(X̂) = 0, while adding or removing a single element yields an objective F(X) = −α (we can choose X to be {2}, {1, 3}, {2}, {2, 3}, respectively). Note that if x^k = x̂ at any iteration k of DCA (e.g., if we initialize at x⁰ = x̂) and ρ > 0, then x^{k+1} = x^k and DCA will terminate. To see this, note that h has a unique subgradient at x^k, namely y^k = ρx^k + α1, and x^k = argmin_{x∈[0,1]^d} { g_L(x) − ⟨y^k, x⟩ + (ρ/2)‖x‖² }. This also applies to CDCA, since DCA and CDCA coincide in this case (as ∂h(x^k) has a unique element). Note also that the objective at this local minimum, f_L(x̂) = 0, is arbitrarily worse than the minimum objective min_{x∈[0,1]³} f_L(x) = −2α.
Note that in the above example, the variant of DCA in Algorithm 2 would yield the optimal solution X* (e.g., if we pick ∅ as the rounded solution). The following example shows that the objective achieved by a set satisfying the guarantees in Corollary 3.2 can be arbitrarily worse than that of any strong local minimum. This highlights the importance of the stronger guarantee of CDCA.
Example G.2. Let V = {1, ..., d}, α > 0, and G, H : 2^V → ℝ be set cover functions defined as G(X) = α|∪_{i∈X} U^G_i|, where U^G_1 = {1}, U^G_2 = U^G_3 = {2}, U^G_4 = ⋯ = U^G_d = {3}, and H(X) = α|∪_{i∈X} U^H_i|, where U^H_1 = U^H_4 = ⋯ = U^H_d = {1}, U^H_2 = {2}, U^H_3 = {3}. Then G and H are submodular; see e.g., (Bach, 2013, Section 6.3). Consider X = {1}; X is a local minimum, since adding or removing any element results in the same objective F(X) = 0 or larger. We argue that X also satisfies the rest of the guarantees in Corollary 3.2, i.e., F(X) ≤ F(S^{σ_i}_ℓ) for all ℓ ∈ V, where σ₂, ..., σ_d ∈ S_d correspond to decreasing orders of 1_X with σ_i(2) = i. Each σ_i admits (d − 2)! valid choices.
Note that the only possible values of F(S^{σ_i}_ℓ) are 0, α and −α, with −α achieved only at S^{σ₂}_3 = S^{σ₃}_3 = {1, 2, 3}, with the choices of σ₂ starting with (1, 2, 3) and the choices of σ₃ starting with (1, 3, 2). So, for any other choices of σ₂ and σ₃, X satisfies the guarantees in Corollary 3.2. If the σ_i's are chosen uniformly at random, X would satisfy the guarantees in Corollary 3.2 with probability 1 − 2/(d − 2). On the other hand, any strong local minimum X̂ must contain {2, 3}, since otherwise the set X′ = X̂ ∪ ({2, 3} \ X̂) ⊋ X̂ has a smaller objective F(X′) ≤ F(X̂) − α, leading to a contradiction. It follows then that any strong local minimum satisfies F(X̂) ≤ F({2, 3}) = −α, which is also the optimal value, and arbitrarily better than the objective achieved by X.
H. Special Cases of DS Minimization
In this section, we discuss some implications of our results for special cases of the DS problem (1). To that end, we define two types of approximate submodularity and supermodularity, and show how they are related. First, we recall the notions of weak DR-submodularity/supermodularity, which were introduced in (Lehmann et al., 2006) and (Bian et al., 2017), respectively.
Definition H.1. A set function F is α-weakly DR-submodular, with α > 0, if F(i | A) ≥ αF(i | B), for all A ⊆ B, i ∈ V \ B. Similarly, F is β-weakly DR-supermodular, with β > 0, if F(i | B) ≥ βF(i | A), for all A ⊆ B, i ∈ V \ B. We say that F is (α, β)-weakly DR-modular if it satisfies both properties.
In the above definition, if F is non-decreasing, then α, β ∈ (0, 1]; if it is non-increasing, then α, β ≥ 1; and if it is neither (non-monotone), then α = β = 1. F is submodular (supermodular) iff α = 1 (β = 1), and modular iff both α = β = 1.
Next, we recall the following characterizations of submodularity and supermodularity: a set function F is submodular iff F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) for all A, B ⊆ V, and supermodular iff F(A) + F(B) ≤ F(A ∪ B) + F(A ∩ B).
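On small ground sets, these lattice characterizations can be checked directly by enumeration; a minimal sketch (the coverage and cardinality functions below are our own toy examples, not from the paper):

```python
from itertools import chain, combinations

def subsets(V):
    return [set(S) for S in
            chain.from_iterable(combinations(sorted(V), r) for r in range(len(V) + 1))]

def is_submodular(F, V):
    # Checks F(A) + F(B) >= F(A | B) + F(A & B) for all A, B (exponential in |V|).
    S = subsets(V)
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - 1e-9 for A in S for B in S)

# Toy functions: set coverage is submodular; cardinality is modular
# (i.e., both submodular and supermodular).
U = {1: {1}, 2: {1, 2}, 3: {1, 2, 3}}

def cover(X):
    return len(set().union(*(U[i] for i in X))) if X else 0

def card(X):
    return len(X)

V = {1, 2, 3}
assert is_submodular(cover, V)
assert is_submodular(card, V) and is_submodular(lambda X: -card(X), V)
```

Such an exhaustive check is only feasible for tiny V, but it is handy for validating candidate DS decompositions in examples like those below.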
We introduce other notions of approximate submodularity and supermodularity based on these properties. Definition H.2. A set function F is α-submodular, with α > 0, if F(A) + F(B) α(F(A B) + F(A B)), for all A, B V . Similarly, F is β-supermodular, with β > 0, if β(F(A) + F(B)) F(A B) + F(A B), for all A, B V . We say that F is (α, β)-modular if it satisfies both properties. In the above definition, if F is non-negative, then α, β (0, 1], if it is non-positive, then α, β 1, and if it is neither then α = β = 1. F is submodular (supermodular) iff α = 1 (β = 1) and modular iff both α = β = 1. The two types of approximate submodularity and supermodularity are related as follows. Difference of submodular minimization via DC programming Proposition H.3. F is α-weakly DR submodular iff F(A) + αF(B) F(A B) + αF(A B), A, B V. (18) If F is also normalized, then F is α-submodular. Similarly, F is β-weakly DR supermodular iff β F(B) F(A B) + 1 β F(A B), A, B V. (19) If F is also normalized, then F is β-supermodular. Proof. Given an α-weakly DR submodular function F, let A \ B = {i1, i2, , ir}. Then F(A B {i1, , ik}) F(A B {i1, , ik 1}) α(F(B {i1, , ik}) F(B {i1, , ik 1)), k = 1, , r F(A) F(A B) α(F(A B) F(B)) (by telescoping sum) Rearranging the terms yields (18). If F is also normalized, then F is α-submodular. To see this, note that if α < 1, F is non-decreasing and hence F(X) F( ) = 0, and if α > 1, F is non-increasing and hence F(X) F( ) = 0. Thus for any α > 0, we have F(X) αF(X) for any X V . In particular, applying this to X = B and X = A B, we obtain F(A) + F(B) F(A) + αF(B) F(A B) + αF(A B) α(F(A B) + F(A B)). Conversely, if F satisfies (18), then for all A B V , let A = A {i}, B = B , then F(A) + αF(B) F(A B) + αF(A B) F(A {i}) + αF(B ) F(A ) + αF(B {i}) F(i | A ) αF(i | B ). Hence F is α-weakly DR submodular. The remaining claims follow similarly. H.1. 
Approximately submodular functions We consider special cases of the DS problem (1) where F is approximately submodular. In Section 4, we showed that CDCA with integral iterates xk and CDCAR converge to an ϵ -strong local minimum of F when F(Xk) F(Xk+1) ϵ, where 2ρd(ϵ + ϵx) if ϵ + ϵx ρd 2 + ϵ + ϵx otherwise. (20) The following two propositions relate the approximate strong local minima of an approximately submodular function to its global minima, for two different notions of approximate submodularity. Proposition H.4. If F is an α-submodular function, then for any ε 0, any ε-strong local minimum ˆX of F satisfies F( ˆX) 1 2α 1(min X V F(X) + 2εα). Proof. Let X be an optimal solution. Since ˆX is an ε-strong local minimum of F, we have F( ˆX) F( ˆX X ) + ε and F( ˆX) F( ˆX X ) + ε. Hence, 2F( ˆX) F( ˆX X ) + F( ˆX X ) + 2ε α(F( ˆX) + F(X )) + 2ε F( ˆX) 1 2α 1(F(X ) + 2εα). Difference of submodular minimization via DC programming Proposition H.4 applies to the solutions returned by CDCA with integral iterates xk and CDCAR on the DS problem (1), with ε = ϵ . Moreover, when F is submodular, we have α = 1, then any ε-strong local minimum is a 2ε-global minimum of F in this case. In particular, if H is modular, DCA and CDCA with integral iterates xk, DCAR, and CDCAR, all converge to a 2ϵ -global minimum of F. This holds for the DCA variants since by Theorem 3.1-b, DCA converges to an (ϵ + ϵx, 0)-critical point of g h, and when H is modular, h is differentiable, hence any (ϵ + ϵx, 0)-critical point of g h is also an (ϵ + ϵx)-strong critical point, and by Corollary 4.4 it is also an ϵ -strong local minimum of F if it is integral. Proposition H.5. Given F = G H, if G is submodular and H is β-weakly DR-supermodular, then for any ε 0, any ε-strong local minimum ˆX of F satisfies F( ˆX) G(X ) βH(X ) + 2ε, where X is a minimizer of F. Proof. Since ˆX is an ε-strong local minimum of F, we have F( ˆX) F( ˆX X ) + ε and F( ˆX) F( ˆX X ) + ε. 
By Proposition H.3, H satisfies 1 β H( ˆX) + H(X ) 1 β H( ˆX X ) + H( ˆX X ) 1 β (H( ˆX X ) + H( ˆX X )). Hence, 2F( ˆX) F( ˆX X ) + F( ˆX X ) + 2ε 2F( ˆX) (G( ˆX) + G(X )) (H( ˆX) + βH(X )) + 2ε F( ˆX) G(X ) βH(X ) + 2ε. Proposition H.5 again applies to the solutions returned by CDCA with integral iterates xk and CDCAR on the DS problem (1), with ε = ϵ . This guarantee matches the one provided in (El Halabi & Jegelka, 2020, Corollary 1) in this case (though the result therein does not require H to be submodular), which is shown to be optimal (El Halabi & Jegelka, 2020, Theorem 2). The following proposition shows that a similar result to Proposition H.5 holds under a weaker assumption (recall from Corollary 4.4 that if ˆX is an ε-strong local minimum of F then 1 ˆ X is an (ε, 0)-critical point of g h). Proposition H.6. Given F = G H where G is submodular and H is β-weakly DR-supermodular, f = g h as defined in (6), ε 0, let ˆx be an (ε, 0)-critical point of g h, with ˆy εg(ˆx) h(ˆx), where ˆy ρˆx is computed as in Proposition 2.3-f. Then ˆX = Round F (ˆx) satisfies F( ˆX) G(X ) βH(X ) + ε , where X is a minimizer of F, and ε = 2ρdε if ε ρd 2 + ε otherwise. Proof. Since ˆy εg(ˆx), we have by Lemma E.3 that ˆy ρˆx ε (g L + δ[0,1]d)(ˆx). Hence, for all x [0, 1]d g L(x) g L(ˆx) + ˆy ρˆx, x ˆx ε . (21) Since H is β-weakly DR-supermodular and ˆy ρˆx is computed as in Proposition 2.3-f, we have by (El Halabi & Jegelka, 2020, Lemma 1), for all x Rd, βh L(x) h L(ˆx) ˆy ρˆx, x ˆx . (22) Combining (21) and (22), we obtain g L(x) βh L(x) g L(ˆx) h L(ˆx) ε . In particular, taking x = 1X , we have by Proposition 2.3-a,d, G(X ) βH(X ) = g L(x ) βh L(x ) f L(ˆx) ε F( ˆX) ε . Proposition H.6 applies to the solution returned by any variant of DCA and CDCA (including ones with non-integral iterates xk) on the DS problem (1), with ε = ϵ + ϵx, ε = ϵ . In particular, if H is modular (β = 1), they all obtain an ϵ -global minimum of F. 
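To make the α = 1, ε = 0 case of Proposition H.4 concrete: for a submodular F, every strong local minimum already attains the global minimum. A brute-force check on a small coverage instance (a sketch under our own toy instance, not the paper's code):

```python
from itertools import chain, combinations

def subsets(V):
    return [set(S) for S in
            chain.from_iterable(combinations(sorted(V), r) for r in range(len(V) + 1))]

def strong_local_minima(F, V, eps=0.0):
    # All X with F(X) <= F(Y) + eps for every Y with Y ⊆ X or Y ⊇ X.
    S = subsets(V)
    return [X for X in S if all(F(X) <= F(Y) + eps for Y in S if Y <= X or Y >= X)]

# A submodular instance: coverage minus cardinality (alpha = 1).
U = {1: {1}, 2: {1, 2}, 3: {1, 2, 3}}

def F(X):
    covered = set().union(*(U[i] for i in X)) if X else set()
    return len(covered) - len(X)

V = {1, 2, 3}
f_star = min(F(X) for X in subsets(V))
for X in strong_local_minima(F, V):
    assert F(X) == f_star  # alpha = 1, eps = 0: strong local minima are global
```

This matches Proposition H.4 with α = 1 and ε = 0, which gives F(X̂) ≤ F(X*).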
H.2. Approximately supermodular functions
We consider special cases of the DS problem (1) where F is approximately supermodular. In Section 3, we showed that DCA with integral iterates x^k and DCAR converge to an ϵ′-local minimum of F when F(X^k) − F(X^{k+1}) ≤ ϵ, with ϵ′ defined in (20). The following proposition shows that approximate local minima of a supermodular function are also approximate strong local minima.
Proposition H.7 (Lemma 3.3 in (Feige et al., 2011)). If F is a supermodular function, then for any ε ≥ 0, any ε-local minimum of F is also an εd-strong local minimum of F.
Proof. The proof follows in a similar way to (Feige et al., 2011, Lemma 3.3); we include it for completeness. Given an ε-local minimum X of F, for any X′ ⊆ X, let X \ X′ = {i₁, ..., i_k}. Then
F(X) − F(X′) = Σ_{ℓ=1}^k F(i_ℓ | X′ ∪ {i₁, ..., i_{ℓ−1}}) ≤ Σ_{ℓ=1}^k F(i_ℓ | X \ {i_ℓ}) ≤ kε ≤ dε.
We can show in a similar way that F(X) ≤ F(X′) + dε for any X′ ⊇ X.
The following proposition relates the approximate strong local minima of an approximately supermodular function to its global minima.
Proposition H.8. If F is a non-positive β-supermodular function, then for any ε ≥ 0, any ε-strong local minimum X̂ of F satisfies min{F(X̂), F(V \ X̂)} ≤ (1/(3β²)) min_{X⊆V} F(X) + (2/3)ε. In addition, if F is also symmetric, then X̂ satisfies F(X̂) ≤ (1/(2β)) min_{X⊆V} F(X) + ε.
Proof. This proposition generalizes (Feige et al., 2011, Theorem 3.4), and the proof follows in a similar way. Let X* be an optimal solution. Since X̂ is an ε-strong local minimum of F, we have F(X̂) ≤ F(X̂ ∪ X*) + ε and F(X̂) ≤ F(X̂ ∩ X*) + ε. Hence,
2F(X̂) + F(V \ X̂) ≤ F(X̂ ∪ X*) + F(X̂ ∩ X*) + F(V \ X̂) + 2ε ≤ (1/β)(F(X̂ ∩ X*) + F(X* \ X̂) + F(V)) + 2ε ≤ (1/β²)(F(X*) + F(∅) + F(V)) + 2ε.
If F is also symmetric, then
2F(X̂) ≤ F(X̂ ∩ X*) + F(X̂ ∪ (V \ X*)) + 2ε = F(X̂ ∩ X*) + F((V \ X̂) ∩ X*) + 2ε ≤ (1/β)(F(X*) + F(∅)) + 2ε.
Proposition H.8 applies to the solutions returned by CDCA with integral iterates x^k and CDCAR on the DS problem (1), with ε = ϵ′.
Moreover, when F is non-positive supermodular, we have β = 1; then the solutions returned by CDCA with integral iterates x^k and CDCAR satisfy min{F(X̂), F(V \ X̂)} ≤ (1/3)F* + (2/3)ϵ′, and F(X̂) ≤ (1/2)F* + ϵ′ if F is symmetric; and by Proposition H.7, the solutions returned by DCA with integral iterates x^k and DCAR satisfy min{F(X̂), F(V \ X̂)} ≤ (1/3)F* + (2/3)dϵ′, and F(X̂) ≤ (1/2)F* + dϵ′ if F is symmetric. These guarantees match the ones for the deterministic local search provided in (Feige et al., 2011, Theorem 3.4), which are optimal for symmetric functions (Feige et al., 2011, Theorem 4.5), but not for general non-positive supermodular functions, where a 1/2-approximation guarantee can be achieved (Buchbinder et al., 2012, Theorem 4.1). The non-positivity assumption in Proposition H.8 is necessary, as demonstrated by the following example.
Example H.9. Let V = {1, ..., 4}, α > 0, G(X) = 2α|X|, and H : 2^V → ℝ be a set cover function defined as H(X) = α|∪_{i∈X} U_i|, where U_i = {1, ..., i}. Then G, H are submodular functions, and F is supermodular but not non-positive, since F(V) = 4α > 0. Consider the solution X̂ = {2}: F(X̂) = 0, F(V \ X̂) = 2α, and X̂ is a strong local minimum of F, since adding or removing any number of elements yields the same objective or worse. On the other hand, the minimum is min_{X⊆V} F(X) = −2α, achieved at X* = {4}, which is arbitrarily better than min{F(X̂), F(V \ X̂)}.
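Example H.9 can also be verified by brute force; a minimal sketch with α fixed to 1 (not part of the paper's code):

```python
from itertools import chain, combinations

alpha = 1.0
V = {1, 2, 3, 4}
U = {i: set(range(1, i + 1)) for i in V}   # U_i = {1, ..., i}

def F(X):  # G - H with G(X) = 2*alpha*|X| and H a set cover function
    covered = set().union(*(U[i] for i in X)) if X else set()
    return 2 * alpha * len(X) - alpha * len(covered)

S = [set(T) for T in chain.from_iterable(combinations(sorted(V), r) for r in range(5))]
assert F(V) == 4 * alpha                   # F is not non-positive
assert min(F(X) for X in S) == -2 * alpha  # the minimum, attained at {4}
assert F({4}) == -2 * alpha

X_hat = {2}
assert F(X_hat) == 0
# X_hat is a strong local minimum: no subset or superset does better.
assert all(F(X_hat) <= F(Y) for Y in S if Y <= X_hat or Y >= X_hat)
```

The check confirms that the strong local minimum X̂ = {2} is arbitrarily worse (by 2α) than the global minimum, so the approximation guarantee of Proposition H.8 indeed fails without non-positivity.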