# discrete_and_continuous_difference_of_submodular_minimization__ad8ffced.pdf Discrete and Continuous Difference of Submodular Minimization George Orfanides 1 Tim Hoheisel 1 Marwa El Halabi 2 Submodular functions, defined on continuous or discrete domains, arise in numerous applications. We study the minimization of the difference of two submodular (DS) functions, over both domains, extending prior work restricted to set functions. We show that all functions on discrete domains and all smooth functions on continuous domains are DS. For discrete domains, we observe that DS minimization is equivalent to minimizing the difference of two convex (DC) functions, as in the set function case. We propose a novel variant of the DC Algorithm (DCA) and apply it to the resulting DC Program, obtaining comparable theoretical guarantees as in the set function case. The algorithm can be applied to continuous domains via discretization. Experiments demonstrate that our method outperforms baselines in integer compressive sensing and integer least squares. 1. Introduction Many problems in machine learning require solving a nonconvex problem, with potentially mixed discrete and continuous variables. In this paper, we investigate a broad class of such problems that possess a special structure, namely the minimization of the difference of two submodular (DS) functions, over both continuous and discrete domains of the form X = Qn i=1 Xi, where each Xi R is compact: min x X F(x) := G(x) H(x), (1) and where G and H are normalized submodular functions. Submodularity is an important property which naturally occurs in a variety of machine learning applications (Bilmes, 2022; Bach, 2013). Most of the submodular optimization 1Department of Mathematics and Statistics, Mc Gill University, Montreal 2Samsung AI Lab, Montreal. Correspondence to: George Orfanides , Marwa El Halabi . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). literature focuses on set functions, which can be equivalently viewed as functions on X = {0, 1}n. Throughout, we make this identification and refer to them as such. Submodular set functions can be minimized in polynomial time exactly or up to arbitrary accuracy (Axelrod et al., 2020), and maximized approximately in polynomial time (Buchbinder et al., 2015). Beyond set functions, submodularity extends to general lattices like X (Topkis, 1978). Bach (2019) extended results from submodular set function minimization to general domains X. In particular, he provided two polynomialtime algorithms for submodular minimization with arbitrary accuracy on discrete domains, which are also applicable to continuous domains via discretization. Axelrod et al. (2020) later developed a faster algorithm for this problem. Bian et al. (2017b); Niazadeh et al. (2020) extended results from submodular set function maximization to continuous domains, with Bian et al. (2017b) giving a deterministic 1/3-approximation, and Niazadeh et al. (2020) an optimal randomized 1/2-approximation, both in polynomial-time. In this paper, we extend this line of inquiry by generalizing DS set function minimization results to general domains X. However, unlike the special case where F is submodular, Problem (1) in general cannot be solved efficiently, neither exactly nor approximately. Indeed, even for set functions, any constant-factor multiplicative approximation requires exponential time, and any positive polynomial-time computable multiplicative approximation is NP-Hard (Iyer & Bilmes, 2012, Theorems 5.1, 5.2). Even obtaining a local minimum (Definition 4.3) is PLS complete (Iyer & Bilmes, 2012, Section 5.3). Prior work (Narasimhan & Bilmes, 2005; Iyer & Bilmes, 2012; El Halabi et al., 2023) developed descent algorithms for Problem (1) in the special case of set functions, which converge to a local minimum of F. Minimizing DS set functions is equivalent to minimizing the difference of two convex (DC) functions, obtained by replacing each submodular function by its Lov asz extension (Lov asz, 1983). DC programs capture most well-behaved non-convex problems on continuous domains. The DC algorithm (DCA) is a classical method to solve them (Le Thi & Pham Dinh, 2018). El Halabi et al. (2023) leverage this connection to minimize DS set functions by applying DCA variants to the equivalent DC program. They also show that the submodular-supermodular (Sub Sup) method of Narasimhan & Bilmes (2005) is a special case of one of their variants. Discrete and Continuous Difference of Submodular Minimization We adopt a similar approach here to solve Problem (1) for general discrete domains. We use the convex continuous extension introduced in Bach (2019) to reformulate the problem as a DC program, then apply a DCA variant, obtaining theoretical guarantees comparable to the set function case. As in Bach (2019), the algorithm is applicable to continuous domains via discretization. Our key contributions are: We show that any function on a discrete domain and any smooth function on a continuous domain can be expressed as a DS function, though finding a good DS decomposition can be challenging. We highlight applications with natural DS objectives, such as quadratic programming and sparse learning. For discrete domains, we introduce an efficient DCA variant for Problem (1), which is a descent method and converges to a local minimum at rate O(1/T). We demonstrate that our method empirically outperforms baselines on two applications: integer compressed sensing and integer least squares. Additional Related Work Maehara & Murota (2015) proposed a discrete analogue of DCA for minimizing the difference of two discrete convex (discrete DC) functions G, H : Zn Z {+ }. They also show that any function F : D Z on a finite D Zn is the restriction of a discrete DC function to D, making discrete DC functions essentially equivalent to integer valued DS functions on discrete domains. However, obtaining a DS decomposition is easier, since a discrete DC function is a special case of DS functions. In Section 3.2, we provide examples of applications which naturally have the form of Problem (1), but not of a discrete DC program, even if we ignore the integer valued restriction. For further details, see Appendix B.2. Several works studied optimizing DR-submodular functions, a subclass of submodular functions which are concave along non-negative directions. For set functions, DR-submodularity is equivalent to submodularity, but not for more general domains such as the integer lattice. Ene & Nguyen (2016) reduced DR-submodular functions on a bounded integer lattice X = Qn i=1{0, . . . , ki 1}, to submodular set functions on a ground set of size O(log(k)), enabling the application of set-function results to this setting. Approximation algorithms with provable guarantees for maximizing continuous DR-submodular functions on continuous domains have also been proposed (Bian et al., 2017a;b; Hassani et al., 2017; Mokhtari et al., 2017; Fazel & Sadeghi, 2023; Mualem & Feldman, 2023). Yu & K uc ukyavuz (2024) also gave a polynomial-time algorithm for DRsubmodular minimization with mixed-integer variables over box and monotonicity constraints, with no dependence on k. 2. Preliminaries We introduce here our notation and relevant background. Given n N, we let [n] = {1, . . . , n} and [0 : n] = {0} [n]. We use R = R {+ } and Z = Z {+ }. The standard basis vectors for Rn and Rn m are denoted by ei, i [n] and Eij, i [n], j [m], respectively. The vectors of all ones and all zeros in Rn are denoted by 1 and 0 respectively. The support of a vector x Rn is defined as supp(x) = {i | xi = 0}. For q > 0, the ℓq-(quasi)-norm is given by x q = (Pn i=1 xq i )1/q and the ℓ0-pseudo-norm by x 0 = |supp(x)|. We conflate x q q with x 0 for q = 0. Given x, y Rn, x y means that xi yi for all i [n]. We denote by , F and F the Frobenius inner product and norm respectively. We call X Rn m row nonincreasing if its rows are all non-increasing, i.e., Xi,j Xi,j+1 for all i [n], j [m 1]. We denote by {0, 1}n m , [0, 1]n m , and Rn m the sets of row nonincreasing matrices in {0, 1}n m, [0, 1]n m, and Rn m, respectively. Given a vector x Rm, a non-increasing permutation p = (p1, . . . , pm) of x satisfies xpi xpi+1 for all i [m]. Similarly, a non-increasing permutation (p, q) = (p1, q1), . . . , (pnm, qnm) of a matrix X Rn m satisfies Xpi,qi Xpi+1,qi+1 for all i [nm]. A non-increasing permutation (p, q) of X Rn m is called row-stable if it preserves the original order within rows when breaking ties, i.e., for any i, j [nm] with i < j, if pi = pj, then qi < qj. Throughout this paper, we consider a function F : X R defined on X = Qn i=1 Xi, the product of n compact subsets Xi R. The set Xi is typically a finite set such as Xi = {0, . . . , ki 1} (discrete domain) or a closed interval (continuous domain).We assume access to an evaluation oracle of F, which returns F(x) for any x X in time EOF . Whenever we state that F is differentiable on X, we mean that F is differentiable on an open set containing X. We say that F is L-Lipschitz continuous on X with respect to , if |F(x) F(y)| L x y for all x, y X. When not specified, is the ℓ2-norm. A function F is normalized if F(xmin) = 0 where xmin is the smallest element in X, non-decreasing (non-increasing) if F(x) F(y) (F(x) F(y)) for all x y, and monotone if it is non-decreasing or non-increasing. We assume without loss of generality that F is normalized. Submodularity A function F is said to be submodular if F(x) + F(y) F(min{x, y}) + F(max{x, y}), (2) for all x, y X, where the min and max are applied component-wise. We call F modular if Eq. (2) holds with equality, strictly submodular if the inequality is strict whenever x and y are not comparable, and (strictly) supermodular if F is (strictly) submodular. A function F is modular iff it is separable (Topkis, 1978, Theorem 3.3). Discrete and Continuous Difference of Submodular Minimization An equivalent diminishing-return characterization of submodularity is that F is submodular iff for all x X, i, j [n], i = j, ai, aj > 0 such that x + aiei, x + ajej X, F(x+aiei) F(x) F(x+aiei+ajej) F(x+ajej). (3) Strict submodularity then corresponds to a strict inequality in (3) (Topkis, 1978, Theorem 3.1-3.2). Other equivalent characterizations of submodularity hold for differentiable and twice-differentiable F (Topkis, 1978, Section 3). Proposition 2.1. Given F : X R, where each Xi is a closed interval, we have: a) If F is differentiable on X, then F is submodular iff for all x X, i, j [n], i = j, a > 0 such that x + aej X: xi (x + aej). (4) b) If F is twice-differentiable on X, then F is submodular iff for all x X, i = j: 2F xi xj (x) 0, i = j. (5) In both cases, F is also strictly submodular if the inequalities are strict. On continuous domains, submodular functions can be convex, concave, or neither (Bach, 2019, Section 2.2). For n = 1, any function is modular and strictly submodulars since any x, y R are comparable. (Strict) submodularity is preserved under addition, positive scalar multiplication, and restriction (Bach, 2019, Section 2.1). Submodular minimization Minimizing a submodular set function F is equivalent to minimizing a convex continuous extension of F, called the Lov asz extension (Lov asz, 1983), on the hypercube [0, 1]n. Bach (2019) extended this result to general submodular functions on X. To that end, he introduced a continuous extension of functions defined on X to the set of Radon probability measures on X. When X = Qn i=1[0 : ki 1], this extension has a simple, efficiently computable form (Bach, 2019, Section 4.1), given below. For simplicity, we assume all ki s are equal to k. Definition 2.2. Given a normalized function F : [0 : k 1]n R, its continuous extension f : Rn (k 1) R is defined as follows: Given X Rn (k 1) and (p, q) a non-increasing permutation of X, define yi [0 : k 1]n as y0 = 0 and yi = yi 1 + epi for i [(k 1)n]. Then, i=1 Xpi,qi F(yi) F(yi 1) (6) For X / Rn (k 1) , we set f (X) = + . Evaluating f (x) takes O(r log r + r EOF ) with r = nk. The value of f (X) is invariant to the permutation choice. We can define a bijection map Θ : {0, 1}n (k 1) [0 : k 1]n and its inverse as Θ(X) = X1, (7a) Θ 1(x) = X where Xi,j = ( 1, if j xi, 0, if j > xi . (7b) In the set function case, i.e., when k = 2, we have {0, 1}n (k 1) = {0, 1}n and [0, 1]n (k 1) = [0, 1]n, so Θ becomes the identity map, and f reduces to the Lov asz extension; see e.g. Bach (2013, Definition 3.1). We list below some key properties of the extension. Items a-e, Item g (general bound) are from Bach (2019), Item f follows directly from the definition of f , and we prove the bound for non-decreasing F in Item g in Appendix D. Proposition 2.3. For a normalized function F : [0 : k 1]n R and its continuous extension f , we have: a) For all X {0, 1}n (k 1) , f (X) = F(Θ(X)). b) If F is submodular, then minx [0:k 1]n F(x) = minx [0,1]n (k 1) f (X). c) Given X [0, 1]n (k 1) and {yi}(k 1)n i=0 as defined in Definition 2.2, let i argmin i [0:(k 1)n] F(yi) and x = yi , then F(x) f (X). We refer to this as rounding and denote it by x = Round F (X). d) f is convex iff F is submodular. e) Given X Rn (k 1) , a non-increasing permutation (p, q) of X, and {yi}(k 1)n i=1 defined as in Definition 2.2, define Y Rn (k 1) as Ypi,qi = F(yi) F(yi 1). If F is submodular, then Y is a subgradient of f at X. f) If F = G H, then f = g h . g) If F is submodular, then f is Lf -Lipschitz continuous with respect to the Frobenius norm, with Lf p (k 1)n maxx,i |F(x + ei) F(x)|. If F is also non-decreasing, then Lf F((k 1)1). Bach (2019, Section 5) provided two polynomial-time algorithms for minimizing submodular functions on discrete domains up to arbitrary accuracy ϵ 0; one using projected subgradient method and the other using Frank-Wolfe (FW) method. When X = [0 : k 1]n, both obtain an ϵ-minimum in O((nk Lf /ϵ)2 EOF ) time. Though FW is empirically faster; especially the pairwise FW variant of (Lacoste-Julien & Jaggi, 2015a), which we use in our experiments (Section 5). Axelrod et al. (2020, Theorem 9) gave a faster Discrete and Continuous Difference of Submodular Minimization randomized algorithm with runtime O(n(k Lf /ϵ)2 EOF ), based on projected stochastic subgradient method. DC Programming Given a Euclidean space E, let Γ0(E) be the set of proper lower semi-continuous convex functions from E into R. For a set C E, δC is the indicator function of C taking value 0 on C and + outside it. Given a function f : E R, its domain is defined as dom f = {x E | f(x) < + }. For f Γ0(E), ϵ 0, and x0 dom f, the ϵsubdifferential of f at x0 is defined by ϵf(x0) = y E f(x) f(x0) + y, x x0 ϵ, x E , while f(x0) stands for the exact subdifferential (ϵ = 0). An element of the ϵ-subdifferential is called an ϵ-subgradient, or simply a subgradient when ϵ = 0. A standard DC program takes the form min x E f(x) := g(x) h(x) (8) where g, h Γ0(E). We adopt the convention + (+ ) = + . The function f is called a DC function. DC-programs are generally non-convex and nondifferentiable. A point x is a global minimum of Problem (8) iff ϵh(x) ϵg(x) for all ϵ 0 (Hiriart-Urruty, 1989, Theorem 4.4). However, checking this condition is generally infeasible (Pham Dinh & Le Thi, 1997). Instead, we are interested in notions of approximate stationarity. In particular, we say that a point x is an ϵ-critical point of g h if ϵg(x) h(x) = . Criticality depends on the choice of the DC decomposition g h of f (Le Thi & Pham Dinh, 2018, Section 1.1) and implies local minimality over a restricted set (El Halabi et al., 2023, Proposition 2.7). Proposition 2.4. Let ϵ 0 and f = g h with g, h Γ0(E). If x, x E satisfy ϵg(x) h(x ) = , then f(x) f(x ) + ϵ DCA iteratively minimizes a convex majorant of (8), obtained by replacing h at iteration t by its affine minorization h(xt) + yt, x xt , with yt h(xt). Starting from x0 dom g, DCA iterates are given by yt h(xt) (9a) xt+1 argmin x E g(x) yt, x . (9b) We list now some properties of DCA (Pham Dinh & Le Thi, 1997, Theorem 3), (El Halabi et al., 2023, Theorem 3.1). Proposition 2.5. Given f = g h with g, h Γ0(E) and a finite minimum f , let ϵ, ϵx 0, and {xt}, {yt} be generated by DCA (9), where subproblem (9b) is solved up to accuracy ϵx. Then for all t, T N, we have: a) f(xt+1) f(xt) + ϵx. b) If f(xt) f(xt+1) ϵ, then xt is an ϵ + ϵx-critical point of g h with yt ϵ+ϵxg(xt) h(xt). c) mint [T 1] f(xt) f(xt+1) f(x0) f DCA is thus a descent method (up to ϵx) which converges (f(xt) f(xt+1) ϵ) to a critical point with rate O(1/T). We note that Item c is not specific to DCA, but instead follows from f(x T ) f . 3. Difference of Submodular functions We start by identifying which functions are expressible as DS functions, then highlight important applications where they naturally occur. 3.1. Representability The structure assumed in Problem (1) may seem arbitrary, but it is in fact very general. In particular, we prove that any function on a discrete domain and any smooth function on a continuous domain can be expressed as a DS function. Proposition 3.1. Given any normalized F : X R where each Xi R is finite, there exists a decomposition F = G H, where G, H : X R are normalized (strictly) submodular functions. Proof Sketch. We give a constructive proof, similar to the one in (Iyer & Bilmes, 2012, Lemma 3.1) for set functions. We choose a normalized strictly submodular function H : X R, then define G = F + |α| β H and H = |α| β H, where α 0 and β > 0 are lower bounds on the difference between the left and right hand sides of Ineq. (3) for F and H respectively. We verify that G and H are normalized and submodular, using Ineq. (3). Choosing α < 0 as a strict lower bound further guarantees strict submodularity. The full proof is given in Appendix E.1. Obtaining tight lower bounds α and β in the above proof requires exponential time in general, even for set functions (Iyer & Bilmes, 2012). We provide in Example 3.2 a valid choice for H, for which a tight β can be easily computed. For any F, we can use α = 4 maxx X |F(x)|, which is often easy to lower bound. Though loose bounds α and β result in slower optimization, as explained in Appendix A. Example 3.2. Let H : Rn R be the quadratic function H(x) = 1 2x Jx, where J is the matrix of all ones, and define H : X R as H(x) = H(x) H(xmin). Then, H is a normalized strictly submodular function. Proof Sketch. By Proposition 2.1-b, H is strictly submodular on any product of closed intervals. Since strict submodularity is preserved by restriction, H is also strictly submodular on X. The full proof is given in Appendix E.1. Discrete and Continuous Difference of Submodular Minimization In the above example, we have H(x + aiei) H(x) H(x + aiei + ajej) + H(x + ajej) = aiaj for any x X, ai, aj > 0. When X is discrete, we obtain a tight lower bound β = didj where didj are the distances between the closest two points in Xi, Xj respectively, for some j = i. The decomposition given in the proof of Proposition 3.1 cannot be applied to continuous domains. Indeed, if H is a continuous function, taking ai 0 or aj 0 in inequality (3) yields β = 0. However, a similar decomposition can be obtained in this case, if F satisfies some smoothness condition. Proposition 3.3. Given any normalized F : X R where each Xi is a closed interval, if F is differentiable and there exist LF 0 such that F xi (x + aej) LF a, for all x X, i = j, a > 0, xj + a Xj, then there exists a decomposition F = G H, where G, H : X R are normalized (strictly) submodular functions. Proof Sketch. The proof is similar to that of Proposition 3.1. We choose a normalized strictly submodular function H : X R, which is differentiable and satisfies H xi (x) H xi (x + aej) L Ha, for some L H > 0, then define G = F + LF L H H and H = LF L H H. We verify that G and H are normalized and submodular, using Proposition 2.1-a. Choosing LF > 0 as a strict lower bound further guarantees strict submodularity. The full proof is in Appendix E. One sufficient condition for F to satisfy the assumption in Proposition 3.3 is for its gradient to be LF -Lipschitz continuous with respect to the ℓ1-norm, i.e., F(x) F(y) LF x y 1 for all x, y X. It follows then that any twice continuously differentiable function on X is also a DS function, with LF = maxx X 2F(x) 1, (Beck, 2017, Theorem 5.12). In both cases, computing a tight Lipschitz constant LF has exponential complexity in general (Huang et al., 2023, Theorem 2.2). Though often one can easily derive a bound on LF . The function in Example 3.2 is again a valid choice for H, where L H = 1 is tight. Like DC functions, DS functions admit infinitely many DS decompositions. For the specific decompositions in Propositions 3.1 and 3.3, the best one is arguably the one with the tightest α, β and LF , L H, respectively. Finding the best DS decomposition in general is even more challenging, as it is unclear how to define best . This question is explored for set functions in Brandenburg et al. (2024), who study the complexity of decomposing a set function into a difference of submodular set functions such that their Lov asz extensions have as few pieces as possible. In Appendix B, we discuss the connection between DS functions and related non-convex function classes. In particular, we note that DS functions on continuous domains are not necessarily DC, and that the discrete DC functions consid- ered in Maehara & Murota (2015) are essentially equivalent to integer valued DS functions on discrete domains. 3.2. Applications As discussed in Section 3.1, Problem (1) covers most well behaved non-convex problems over discrete and continuous domains. Though finding a good DS decomposition can be expensive in general. We give here examples of applications which naturally have the form of Problem (1). Quadratic Programming Quadratic programs (QP) of the form minx X 1 2x Qx + c x, arise in numerous applications. This form includes box constrained QP (BCQP), where Xi s are all closed intervals, as well as integer and mixed-integer BCQP, where some or all Xi Zn. Such problems are NP-Hard if the objective is non-convex (De Angelis et al., 1997), or if some of the variables are discrete (Dinur et al., 1998). The objective in these QPs has a natural DS decomposition F = G H, with G(x) = x Q x + c x and H(x) = x ( Q+)x, where Q = min{Q, 0} and Q+ = max{Q, 0}. By Proposition 2.1-b, G and H are submodular, since c x is modular. When X is continuous, these QPs can also be written as DC programs, but this requires computing the minimum eigenvalue of Q. Sparse Learning Optimization problems of the form minx X ℓ(x) + λ x q q, where q [0, 1), λ 0, and ℓis a smooth function, arise in sparse learning, where the goal is to learn a sparse parameter vector from data. There, the loss ℓis often smooth and convex (e.g., square or logistic loss), and the non-convex regularizer x q q promotes sparsity. The domain is often unbounded so X can be set to x R for some R 0, or X Zn as in the integer compressed sensing problems we consider in Section 5.2. Using q < 1 makes the problem NP-Hard, even for continuous X (Chen et al., 2017), but can be preferable to the convex ℓ1-norm, as it leads to fewer biasing artifacts (Fan & Li, 2001). Note that the regularizer x q q is modular since it is separable. Hence, these problems are instances of Problem (1), where a DS decomposition of ℓcan be obtained as in Proposition 3.3. If ℓ(x) = Ax b 2 2 , we can use the same decomposition as in the QPs above, i.e., G(x) = x Q x+c x+λ x q q and H(x) = x ( Q+)x, with Q = A A and c = 2A b. These problems cannot be written as DC programs even when X is continuous, since x q q is not DC, as we prove in Proposition B.1. We show in Appendix B.2 that the natural DS decomposition in both applications is not a discrete DC decomposition as defined in Maehara & Murota (2015) and cannot easily be adapted into one for general discrete domains, even when ignoring the integer-valued restriction. Discrete and Continuous Difference of Submodular Minimization 4. Difference of Submodular Minimization In this section, we show that most well behaved instances of problem (1) can be reduced to DS minimization over a bounded integer lattice Qn i=1[0 : ki 1] for some ki N. Then, we address solving this special case. 4.1. Reduction to Integer Lattice Domain We first obverse that submodularity is preserved by any separable monotone reparametrization. A special case of the following proposition is stated in Bach (2019, Section 2.1) with X = X and m strictly increasing. Proposition 4.1. Given X = Qn i=1 Xi, X = Q i X i, with compact sets Xi, X i R, let F : X R and m : X X be a monotone function such that [m(x)]i = mi(xi). If F is submodular, then the function F : X R given by F (x) = F(m(x)) is submodular. Moreover, if m is strictly monotone, then F is submodular iff F is submodular. Proof Sketch. We observe that min{m(x), m(y)} + max{m(x), m(y)} = m(min{x, y}) + m(max{x, y}). The claim then follows by verifying that Eq. (2) holds. The full proof is given in Appendix F.1 Discrete case For discrete domains, the desired reduction follows from Proposition 4.1, by noting that in this case elements in X can be uniquely mapped to elements in Qn i=1[0 : ki 1] with ki = |Xi| via a strictly increasing map. Corollary 4.2. Minimizing any DS function F : X R, where each Xi R is a finite set, is equivalent to minimizing a DS function on Qn i=1[0 : ki 1] with ki = |Xi|. Proof. For each i [n], let Xi = {xi 0, . . . , xi ki 1} where elements are ordered in non-decreasing order, i.e., xi j < xi j+1 for all j [0 : ki 2]. Define the map mi : [0 : ki 1] Xi as mi(j) = xi j. Then mi is a strictly increasing bijection. By Proposition 4.1, then the function F : Qn i=1[0 : ki 1] R defined as F (x) = F(m(x)) with [m(x)]i = mi is a DS function. Continuous case We can convert continuous domains into discrete ones using discretization. We consider X = [0, 1]n wlog, since by Proposition 4.1 any DS function F defined on Qn i=1[ai, bi] can be reduced to a DS function F on [0, 1]n by translating and scaling. As done in Bach (2019, Section 5.1), given a function F : [0, 1]n R which is L-Lipschitz continuous with respect to the ℓ -norm and ϵ > 0, we define k = L/ϵ +1 and the function F : [0 : k 1]n R as F (x) = F(x/(k 1)). Then we have min x [0:k 1]n F (x) ϵ/2 min x [0,1]n F(x) min x [0:k 1]n F (x). Again by Proposition 4.1, if F is DS then F is also DS. Moreover, given any minimizer x of F , x /(k 1) is an ϵ 2-minimizer of F. It is worth noting that Lipschitz continuity is not necessary for bounding the discretization error. We show in Appendix F.2 how to handle the function F(x) = x q q, with q [0, 1), on a domain where it is not Lipschitz continuous. 4.2. Optimization over Integer Lattice We now address solving Problem (1) for X = Qn i=1[0 : ki 1]. For simplicity, we assume ki = k for all i [n], for some k N, i.e., X = [0 : k 1]n. The results can be easily extended to unequal ki s. Since Problem (1) is inapproximable, we will focus on obtaining approximate local minimizers. Given x X, we define the set of neighboring points of x in X as NX (x) := {x X | i [n], x = x ei }. Definition 4.3. Given ϵ 0 and x X, we call x an ϵ-local minimum of F if F(x) F(x ) + ϵ for all x NX (x). If X = {0, 1}n, we recover the definition of a local minimum of a set function. A natural approach to solve Problem (1) is to reduce it to DS set function minimization, using the same reduction as in submodular minimization (Bach, 2019, Section 4.4), and then apply the algorithms of Narasimhan & Bilmes (2005); Iyer & Bilmes (2012); El Halabi et al. (2023). However, this strategy is more expensive than solving the problem directly, even when F is submodular. We discuss this in Appendix C. Continuous relaxation We adopt a more direct approach to solve Problem (1), which generalizes the approach of El Halabi et al. (2023) for set functions. In particular, we relax the DS problem to an equivalent DC program, using the continuous extension (Definition 2.2) introduced in Bach (2019), then apply a variant of DCA to it. Recall that minimizing a submodular function F is equivalent to minimizing its continuous extension f (Proposition 2.3-b). We now observe that this equivalence continues to hold even when F is not submodular. Proposition 4.4. For any normalized function F : [0 : k 1]n R, we have min x [0:k 1]n F(x) = min X [0,1]n (k 1) f (X). (10) Moreover, if x is a minimizer of F then Θ 1(x ) is a minimizer of f , and if X is a minimizer of f then Round F (X ) is a minimizer of F. Recall that Θ is the bijection between {0, 1}n (k 1) and [0 : k 1]n defined in (7). The proof follows from the two properties of f in Proposition 2.3-a,c, and is given in Discrete and Continuous Difference of Submodular Minimization Appendix F.3. By Proposition 2.3-f, Problem (1) is then equivalent to min X [0,1]n (k 1) f (X) = g (x) h (x), (11) where g , h Γ0(Rn (k 1)) by Proposition 2.3-d,g, since G, H are submodular. Algorithm 1 DCA with local search 1: ϵ 0, T N, x0 X, X0 = Θ 1(x0) 2: for t = 1, . . . , T do 3: xt argminx NX (xt) F(x) 4: Choose a common non-increasing permutation (p, q) of Xt and Θ 1( xt) (preferably row-stable) 5: Choose Y t h(Xt) corresponding to (p, q) 6: Xt+1 argmin X [0,1]n (k 1) g (X) Y t, X F 7: if Xt+1 {0, 1}n (k 1) then 8: xt+1 = Θ( Xt+1), Xt+1 = Xt+1 9: else 10: xt+1 = Round F ( Xt+1), Xt+1 = Θ 1(xt+1) 11: end if 12: if F(xt) F(xt+1) ϵ then 13: Stop. 14: end if 15: end for Algorithm Problem (11) is a DC program on E = Rn (k 1), with f = f + δ[0,1]n (k 1) = g h, where g = g + δ[0,1]n (k 1) and h = h . Applying the standard DCA (9) to it gives a descent method (up to ϵx) which converges to a critical point XT of g h by Proposition 2.5. A feasible solution x T to Problem (1) can then be obtained by rounding, i.e., x T = Round F (XT ) [0 : k 1]n, which satisfies F(x T ) f (XT ), as shown in Proposition 2.3-c. However, even in the set functions case, x T is not necessarily an approximate local minimum of F, as shown by El Halabi et al. (2023, Example F.1). To address this, the authors proposed two variants of standard DCA that do obtain an approximate local minimum of F. One variant, which generalizes the Sub Sup method of Narasimhan & Bilmes (2005), is too expensive, as it requires trying O(n) subgradients per iteration. While the other variant checks at convergence if the rounded solution is an approximate local minimum, and if not restarts DCA from the best neighboring point. Both can be generalized to our setting. We provide an extension of the second, more efficient, variant and its theoretical guarantees in Appendix I. We introduce a new efficient variant of DCA, DCA with local search (DCA-LS), in Algorithm 1, which selects a single subgradient Y t using a local search step (lines 35), ensuring direct convergence to an approximate local minimum of F without restarts. The algorithm maintains a feasible solution xt to Problem (1) by rounding Xt+1 or applying the map Θ if Xt+1 is integral. Rounding can optionally be applied in the latter case as well. Finding a common permutation on line 4 is always possible. Indeed, the binary matrix Θ 1( xt) differs from Xt at only one element: (i, xt i + 1) if xt = xt + ei and (i, xt i 1) if xt = xt ei. Thus, we can choose any row-stable non-increasing permutation (p, q) of Xt such that (pd+1, qd+1) = (i, xt i + 1) if xt = xt + ei and (pd 1, qd 1) = (i, xt i 1) if xt = xt ei, where d = Xt 0. Computational complexity The cost of finding the best neighboring point xt is 2n EOF . Finding a valid permutation (p, q) on line 4 as discussed above costs O(nk). The corresponding subgradient Y t can then be computed as described in Proposition 2.3-e in O(nk EOH). The subproblem on line 6 is a convex problem which can be solved using projected subgradient method or FW methods. Furthermore, like in the set function case, this subproblem is equivalent to a submodular minimization problem. Indeed, we show in Proposition F.4 that the term Y t, X F corresponds to the continuous extension of the normalized modular function Ht(x) = Pn i=1 Pxi j=1 Y t ij. Hence, by Proposition 2.3-b,f, the subproblem is equivalent to minimizing the submodular function F t = G Ht. We can thus obtain an integral ϵx-solution Xt+1 {0, 1}n (k 1) , for any ϵx 0, in O(n(k Lf t /ϵ)2 EOF t) time using the algorithm of Axelrod et al. (2020). Rounding can be skipped in this case, and Xt+1 can be directly mapped to xt+1 via Θ in O(nk). So the total cost per iteration of DCA-LS is O(n(k Lf t /ϵ)2 EOF t + nk EOH). The choice of DS decomposition for F affects the runtime of DCA-LS. For instance, looser bounds α and β in the generic decomposition from Proposition 3.1 lead to a larger Lipschitz constant Lf t and thus a longer runtime (Appendix A). Theoretical guarantees Let F be the minimum of Problem (1). Note that the minimum f = F of the DC program (11) is finite, and that {Y t}, { Xt+1} are standard DCA iterates. Proposition 2.5-a,b then apply to them. The following theorem relates DCA properties on Problem (11) to ones on Problem (1), showing that DCA-LS is a descent method (up to ϵx) which converges to an (ϵ + ϵx)-local minimum of F in O(1/ϵ) iterations. Theorem 4.5. Let {xt} be generated by Algorithm 1, where the subproblem on line 6 is solved up to accuracy ϵx 0. For all t [T], ϵ 0, we have: a) F(xt+1) F(xt) + ϵx. Discrete and Continuous Difference of Submodular Minimization b) Let {yi}(k 1)n i=0 be the vectors corresponding to the permutation (p, q) from line 4, defined as in Definition 2.2. If (p, q) is row-stable and F(xt) F(xt+1) ϵ, then F(xt) F(yi) + ϵ + ϵx for all i [0 : (k 1)n]. c) Algorithm 1 converges to an (ϵ + ϵx)-local minimum of F after at most (F(x0) F )/ϵ iterations. Proof Sketch. Item a follows from Proposition 2.5-a and Proposition 2.3-a,c. Item b follows from Proposition 2.5-b and Proposition 2.4, by observing that if (p, q) is row-stable, then it is a common non-increasing permutation for Xt and {Θ 1(yi)}(k 1)n i=0 , and thus Y t is a common subgradient of h at all these points. The choice of (p, q) on line 4 also ensures the same holds for Θ 1( xt) even when (p, q) is not row-stable. Item c is then obtained by telescoping sums. The full proof is given in Appendix F.5. Restricting (p, q) to be row-stable is necessary to ensure that it is also a non-increasing permutation of {Θ 1(yi)}(k 1)n i=0 , which is needed for Item b to hold, but not for Items a and c, as discussed in the proof sketch. In the set function case (k = 2), this restriction is unnecessary, since any non-increasing permutation of X Rn (k 1) is trivially row-stable. DCA-LS can be modified to return a solution with a stronger local minimality guarantee, where F(x T ) F(x) + ϵ + ϵx for all x X such that x x T 1 c, for some c N, by setting xt argmin x xt 1 c F(X). This increases the cost of computing xt to O(nc EOF ). Theorem 4.5 generalizes the theoretical guarantees of El Halabi et al. (2023) to general discrete domains; recovering the same guarantees in the set function case. In Appendix I, we compare DCA-LS to an extension of the more efficient DCA variant from El Halabi et al. (2023), theoretically and empirically. We show that both variants have similar theoretical guarantees and computational complexity. In practice, however, DCA-LS performs better in some settings, sometimes by a large margin, but is often slower than DCA-Restart. Other variants (with regularization, acceleration, complete DCA) explored in El Halabi et al. (2023) are also applicable here. When regularization is used, rounding becomes necessary, as explained in Section 3 therein. Implications for non-integer domains When X is a general discrete or continuous domain, we can apply DCA-LS to the function F obtained via the reductions discussed in Section 4.1. Theorem 4.5 then hold for F . In the discrete case, recall that F (x) = F(m(x)), where m is the map defined in the proof of Corollary 4.2. The guarantee that xt is an (ϵ+ϵx)-local minimum of F implies that m(xt) is an (ϵ + ϵx)-local minimum of F, in the sense that modifying any coordinate i [n] to its previous or next value on the grid Xi does not reduce F by more than ϵ + ϵx. In the continuous case, assuming again X = [0, 1]n, recall that F (x) = F(x/(k 1)), where k = L/ϵ + 1 for some ϵ > 0, and L is the Lipschitz constant of F with respect to the ℓ -norm. Let xt = xt/(k 1). Then the local minimality guarantee on F implies that: F( xt) F( xt ei k 1) + ϵ + ϵx. This is only meaningful if ϵ > ϵ + ϵx, since the Lipschitz continuity of F already ensures F( xt) F( xt ei k 1) L/(k 1) ϵ . 5. Experiments We evaluate our proposed method, DCA-LS (Algorithm 1), on two applications: integer least squares and integer compressive sensing, where X Zn. We compare it with state-of-the-art baselines. We use Pairwise-FW (Lacoste-Julien & Jaggi, 2015b) to solve the submodular minimization subproblem at line 6. Results averaged over 100 runs are shown in Figure 1, with error bars for standard deviations. Experimental setups for each application are described below, with additional details given in Appendix G. The code is available at https://github.com/ Samsung SAILMontreal/cont-diffsubmin. 5.1. Integer Least Squares We consider the problem of recovering an integervalued vector x X from noisy linear measurements b = Ax + ξ, where A Rm n is a given measurement matrix with m n, and ξ Rm is a Gaussian noise vector ξ N(0, σ2Im). One approach for recovering x is to solve an integer least squares (ILS) problem minx X F(x) = Ax b 2 2. This is an integer BCQP with Q = A A and c = 2A b. Thus, we use the DS decomposition given in Section 3.2. ILS problems arise in many applications, including wireless communications (Damen et al., 2003), image processing (Blasinski et al., 2012), and neural network quantization (Frantar & Alistarh, 2022). We sample x uniformly from X = { 1, 0, 2, 3}n with n = 100, draw the entries of A i.i.d from N(0, 1), and vary m from n to 2n. The noise variance σ2 is set to achieve a target signal-to-noise ratio (SNRd B) of 20 d B. We include as baselines ADMM (Takapoui et al., 2020), Optimal Brain Quantizer (OBQ) (Frantar & Alistarh, 2022), and the relaxand-round (RAR) heuristic which solves the relaxed problem on [ 1, 3]n, then rounds the solution to the nearest point in X. DCA-LS and ADMM are initialized with the RAR solution, and OBQ with the relaxed solution. We obtain an optimal solution x using Gurobi (Gurobi Optimization, LLC, 2024) by rewriting the problem as a binary QP. We report in Figure 1 (top) three evaluation metrics; recovery probability Discrete and Continuous Difference of Submodular Minimization 1 1.2 1.4 1.6 1.8 2 m / n Recovery Probability ADMM RAR OBQ DCA-LS Optimal 1 1.2 1.4 1.6 1.8 2 m / n Bit Error Rate 1 1.2 1.4 1.6 1.8 2 m / n (F( x) ! F(x$))=F(x$) 0.2 0.4 0.6 0.8 1 m / n Recovery Probability DCA-LS DCA-LS-LASSO LASSO OMP 0.2 0.4 0.6 0.8 1 m / n Support Error 0.2 0.4 0.6 0.8 1 m / n Estimation Error Figure 1: Performance results, averaged over 100 runs, for the integer least squares with SNRd B = 20 d B, n = 100 (top) and for integer compressed sensing with SNRd B = 8 d B, n = 256 and s = 26 = 0.1n (bottom). (ratio of exactly recovered signals over number of runs), bit error rate BER(ˆx) = ˆx x 0/n, and relative objective gap with respect to x . We also compared with Babai s nearest plane algorithm (Babai, 1986, 2nd procedure), but excluded it here as it had worse objective gap and BER than all baselines, and similar recovery probability to ADMM. We observe that DCA-LS outperforms baselines on all metrics, and performs on par with the optimal solution starting from m/n = 1.2. OBQ matches the performance of DCALS from m/n = 1.5, while ADMM and RAR are both considerably worse across all metrics. In Appendix H.1, we include results using larger n = 400, as well as another setup where we fix m = n and vary the SNRd B. Gurobi cannot be used for n = 400 as it is too slow. DCA-LS outperforms other baselines on all metrics in these settings too. 5.2. Integer Compressed Sensing We consider an integer compressed sensing problem, where the goal is to recover an s-sparse integer vector x from noisy linear measurements with m n and s n. This problem arises in applications like wireless communications, collaborative filtering, and error correcting codes (Fukshansky et al., 2019). The signal can be recovered by solving minx X F(x) = Ax b 2 2 + λ x 0 with λ > 0. This is a special case of the sparse learning problem in Section 3.2, with q = 0. We use the same DS decomposition therein. We set X = { 1, 0, 1}n, n = 256, s = 26 = 0.1n , and draw A i.i.d from N(0, 1/m), with m varied from 26 to n. We choose a random support for x , then sample its non-zero entries i.i.d uniformly from { 1, 1}. The noise variance σ2 is set to achieve an SNRd B of 8 d B. We consider as baselines orthogonal matching pursuit (OMP) (Pati et al., 1993) and a variant of LASSO (Tibshirani, 1996) with the additional box constraint x [ 1, 1]n, solved using FISTA (Beck & Teboulle, 2009). Since both methods return solutions outside of X, we round the solutions to the nearest vector in X. As a non-convex method, the performance of DCA-LS is highly dependent on its initialization. We thus consider two different intialization, one with x0 = 0 (DCA-LS) and one with the solution from the box LASSO variant (DCA-LSLASSO). We evaluate methods on three metrics; recovery probability, support error |supp(ˆx) supp(x )|, and estimation error ˆx x 2/ x 2, where is the symmetric difference. Figure 1 (bottom) reports the best values, as λ is varied from 1 to 10 5 in DCA-LS and FISTA, and the sparsity of the OMP solution is varied from 1 to 1.5s . DCA-LS and DCA-LS-LASSO significantly outperform baselines in recovery probability. While DCA-LS outperforms OMP in all metrics, it performs worse than LASSO in estimation and support errors for m/n < 0.6, but overtakes it after. DCA-LS-LASSO outperforms baselines on all metrics, except for m/n (0.25, 0.4), where it slightly lags behind LASSO in estimation and support errors. These correspond to cases where x = x . Indeed, recall that DCA-LS is a descent method (up to ϵ) so DCA-LS-LASSO is guaranteed to obtain an objective value at least as good as LASSO. In Appendix H.3, we include results for s = 13 = 0.05n , and for another setup where we fix m/n = 0.5 and vary the SNRd B. We observe similar trends in these settings too. Discrete and Continuous Difference of Submodular Minimization Acknowledgements We thank Xiao-Wen Chang for helpful discussions. George Orfanides was partially supported by NSERC CREATE INTER-MATH-AI and a Fonds de recherche du Qu ebec Doctoral Research Scholarship B2X-326710. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Axelrod, B., Liu, Y. P., and Sidford, A. Near-optimal approximate discrete and continuous submodular function minimization. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 837 853. SIAM, 2020. Babai, L. On lov asz lattice reduction and the nearest lattice point problem. Combinatorica, 6:1 13, 1986. Bach, F. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145 373, 2013. Bach, F. Submodular functions: from discrete to continuous domains. Mathematical Programming, 175:419 459, 2019. Beck, A. First-order methods in optimization. SIAM, 2017. Beck, A. and Guttmann-Beck, N. Fom a matlab toolbox of first-order methods for solving convex optimization problems. Optimization Methods and Software, 34(1): 172 193, 2019. Beck, A. and Teboulle, M. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183 202, 2009. Becker, S. Cosamp and omp for sparse recovery. URL https://www.mathworks. com/matlabcentral/fileexchange/ 32402-cosamp-and-omp-for-sparse-recovery. Retrieved 2025-01-30. Bian, A., Levy, K., Krause, A., and Buhmann, J. M. Continuous dr-submodular maximization: Structure and algorithms. In Advances in Neural Information Processing Systems, pp. 486 496, 2017a. Bian, A. A., Mirzasoleiman, B., Buhmann, J., and Krause, A. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In Artificial Intelligence and Statistics, pp. 111 120. PMLR, 2017b. Bilmes, J. Submodularity in machine learning and artificial intelligence. ar Xiv preprint ar Xiv: 2202.00132, 2022. URL https://arxiv.org/abs/2202. 00132v2. Blasinski, H., Bulan, O., and Sharma, G. Per-colorantchannel color barcodes for mobile applications: An interference cancellation framework. IEEE Transactions on Image Processing, 22(4):1498 1511, 2012. Brandenburg, M.-C., Grillo, M., and Hertrich, C. Decomposition polyhedra of piecewise linear functions. ar Xiv preprint ar Xiv:2410.04907, 2024. Buchbinder, N., Feldman, M., Seffi, J., and Schwartz, R. A tight linear time (1/2)-approximation for unconstrained submodular maximization. SIAM Journal on Computing, 44(5):1384 1402, 2015. Chen, Y., Ge, D., Wang, M., Wang, Z., Ye, Y., and Yin, H. Strong np-hardness for sparse optimization with concave penalty functions. In International Conference on Machine Learning, pp. 740 747. PMLR, 2017. Damen, M. O., El Gamal, H., and Caire, G. On maximumlikelihood detection and the search for the closest lattice point. IEEE Transactions on information theory, 49(10): 2389 2402, 2003. De Angelis, P. L., Pardalos, P. M., and Toraldo, G. Quadratic programming with box constraints. Developments in global optimization, pp. 73 93, 1997. de Oliveira, W. The abc of dc programming. Set-Valued and Variational Analysis, 28:679 706, 2020. Dinur, I., Kindler, G., and Safra, S. Approximating-cvp to within almost-polynomial factors is np-hard. In Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No. 98CB36280), pp. 99 109. IEEE, 1998. El Halabi, M., Orfanides, G., and Hoheisel, T. Difference of submodular minimization via DC programming. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 9172 9201. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr.press/ v202/el-halabi23b.html. Ene, A. and Nguyen, H. L. A reduction for optimizing lattice submodular functions with diminishing returns. ar Xiv preprint ar Xiv:1606.08362, 2016. Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of Discrete and Continuous Difference of Submodular Minimization the American statistical Association, 96(456):1348 1360, 2001. Fazel, M. and Sadeghi, O. Fast first-order methods for monotone strongly dr-submodular maximization. In SIAM Conference on Applied and Computational Discrete Algorithms (ACDA23), pp. 169 179. SIAM, 2023. Frantar, E. and Alistarh, D. Optimal brain compression: A framework for accurate post-training quantization and pruning. Advances in Neural Information Processing Systems, 35:4475 4488, 2022. Fukshansky, L., Needell, D., and Sudakov, B. An algebraic perspective on integer sparse recovery. Applied Mathematics and Computation, 340:31 42, 2019. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2024. URL https://www.gurobi.com. Hassani, H., Soltanolkotabi, M., and Karbasi, A. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems, pp. 5843 5853, 2017. Hiriart-Urruty, J.-B. From convex optimization to nonconvex optimization. necessary and sufficient conditions for global optimality. In Nonsmooth optimization and related topics, pp. 219 239. Springer, 1989. Huang, J. W., Roberts, S. J., and Calliess, J.-P. On the sample complexity of lipschitz constant estimation. Transactions on Machine Learning Research, 2023. Iyer, R. and Bilmes, J. Algorithms for approximate minimization of the difference between submodular functions, with applications. In Proceedings of the Twenty Eighth Conference on Uncertainty in Artificial Intelligence, UAI 12, pp. 407 417, Arlington, Virginia, United States, 2012. AUAI Press. ISBN 978-0-9749039-89. URL http://dl.acm.org/citation.cfm? id=3020652.3020697. Kim, J., Halabi, M. E., Park, W., Schaefer, C. J., Lee, D., Park, Y., Lee, J. W., and Song, H. O. Guidedquant: Large language model quantization via exploiting end loss guidance. In Proceedings of the 41th International Conference on Machine Learning, 2025. Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of frank-wolfe optimization variants. In Advances in Neural Information Processing Systems, pp. 496 504, 2015a. Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of frank-wolfe optimization variants. Advances in neural information processing systems, 28, 2015b. Le Thi, H. A. and Pham Dinh, T. Dc programming and dca: thirty years of developments. Mathematical Programming, 169(1):5 68, 2018. Lov asz, L. Submodular functions and convexity. In Mathematical Programming The State of the Art, pp. 235 257. Springer, 1983. Maehara, T. and Murota, K. A framework of discrete dc programming by discrete convex analysis. Mathematical Programming, 152(1):435 466, 2015. Mokhtari, A., Hassani, H., and Karbasi, A. Conditional gradient method for stochastic submodular maximization: Closing the gap. ar Xiv preprint ar Xiv:1711.01660, 2017. Mordukhovich, B. and Nam, N. M. An easy path to convex analysis and applications. Springer Nature, 2023. Mualem, L. and Feldman, M. Resolving the approximability of offline and online non-monotone dr-submodular maximization over general convex sets. In Ruiz, F., Dy, J., and van de Meent, J.-W. (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 2542 2564. PMLR, 25 27 Apr 2023. URL https://proceedings.mlr.press/ v206/mualem23a.html. Murota, K. Discrete convex analysis. Mathematical Programming, 83:313 371, 1998. Murota, K. and Shioura, A. Relationship of m-/l-convex functions with discrete convex functions by miller and favati tardella. Discrete Applied Mathematics, 115(1-3): 151 176, 2001. Narasimhan, M. and Bilmes, J. A. A submodularsupermodular procedure with applications to discriminative structure learning. In UAI 05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26-29, 2005, pp. 404 412. AUAI Press, 2005. URL https://dslpitt.org/ uai/display Article Details.jsp?mmnu=1& smnu=2&article_id=1243&proceeding_id= 21. Niazadeh, R., Roughgarden, T., and Wang, J. R. Optimal algorithms for continuous non-monotone submodular and dr-submodular maximization. The Journal of Machine Learning Research, 21(1):4937 4967, 2020. Pati, Y. C., Rezaiifar, R., and Krishnaprasad, P. S. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar conference on signals, systems and computers, pp. 40 44. IEEE, 1993. Discrete and Continuous Difference of Submodular Minimization Pham Dinh, T. and Le Thi, H. A. Convex analysis approach to dc programming: theory, algorithms and applications. Acta mathematica vietnamica, 22(1):289 355, 1997. Takapoui, R., Moehle, N., Boyd, S., and Bemporad, A. A simple effective heuristic for embedded mixed-integer quadratic programming. International journal of control, 93(1):2 12, 2020. Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267 288, 1996. Topkis, D. M. Minimizing a submodular function on a lattice. Operations research, 26(2):305 321, 1978. Yu, Q. and K uc ukyavuz, S. On constrained mixed-integer dr-submodular minimization. Mathematics of Operations Research, 2024. Discrete and Continuous Difference of Submodular Minimization A. Effect of looser α and β bounds on performance In this section, we discuss how the bounds α and β in the generic decomposition given in Proposition 3.1 affect the runtime of Algorithm 1. As discussed in Section 4.2, the runtime of Algorithm 1, particularly for solving the submodular minimization subproblem at each iteration t, depends on the Lipschitz constant Lf t of f t . Recall that f t is the continuous extension of F t = G Ht, where Ht is the modular function whose continuous extension is Y t, X F . The decomposition in Proposition 3.1 defines G = F + α β H and H = α β H, for some normalized strictly submodular function H. Let Y t = β αY t, we can thus write f t (X) = f (X) + α β ( h (X) Y t, X F ). Looser bounds α and β result in a larger Lipschitz constant Lf t , and thus a longer runtime. To show that, we provide upper and lower bounds on Lf t which grow with |α| β . For any X, X Rn (k 1) , we have |f t (X) f t (X )| |f (X) f (X )| + |α| β | h (X) h (X )| + |α| β | Y t, X X F | (Lf + 2 |α| β L ht ) X X F Hence, Lf t Lf + 2 |α| β L ht . Note that Lf and L ht are independent of α and β. This upper bound indeed grows with |α| Since Y t h (Xt), we know from Proposition 7-b Bach (2019) that h (X) Y t, X for any X Rn (k 1) with equality if and only if Y t h (X). We can thus write h (Xt) = Y t, Xt F , or equivalently h (Xt) = Y t, Xt F . Taking X = Xt, we get |f t (X) f t (Xt)| = |f (X) f (Xt) + α β ( h (X) Y t, X F )| β | h (X) Y t, X F | |f (X) f (Xt)|. Hence, for any X Rn (k 1) , we have Lf t |f t (X) f t (Xt)| X Xt F β | h (X) Y t, X F | X Xt F |f (X) f (Xt)| Choosing any X = Xt such that Y t h (X), we get | h (X) Y t,X F | X Xt F > 0. Such X must exists, since otherwise h (X) = Y t, X F for all X Rn (k 1) , and H(X) = Pn i=1 Pxi j=1 Y t ij by Proposition F.4. Since H is strictly submodular, it can t be modular unless n = 1, otherwise Ineq (3) will not hold strictly. Then, this lower bound indeed also grows with |α| β . We can thus conclude that Lf t itself grows with |α| B. Connection to Related Non-convex Classes In this section, we discuss the relation of DS functions to other related classes of non-convex functions, namely DC and discrete DC functions. B.1. Relation to DC Functions We remark that the class of DS functions is not a subclass of DC functions. Since DC functions have continuous domains, it is evident that DS functions defined on discrete domains are not DC. For continuous domains, any univariate function which is not DC can serve as a counter-example, since as mentioned earlier when n = 1 any function is modular. The function F : [0, 1] R given by F(x) = 1 p |x 1/2| is one such example, given in de Oliveira (2020, Example 7). We provide below another counter-example for any n. Discrete and Continuous Difference of Submodular Minimization Proposition B.1. The function F : X R defined as F(x) = x q q with q [0, 1) (F(x) = x 0 if q = 0) and where each Xi is a closed interval, is a modular function but not a DC function, whenever 0 int X. Proof. We first note that F is a separable function, hence it is modular. Next, we recall that convex functions are locally Lipschitz on the interior of their domain (Mordukhovich & Nam, 2023, Corollary 2.27), i.e., for all x int X there exists a neighborhood U of x such that F is Lipschitz continuous on U int X. Since the difference of locally Lipschitz functions is also locally Lipschitz, DC functions are then also locally Lipschitz on the interior of their domain. We show that F is not locally Lipschitz at 0 int X, and thus it cannot be DC. We show that there exists {xk} X such that {xk} 0 and |F (xk)| xk 2 is unbounded. In particular, we take xk = 1 ke1. For any q [0, 1), we have xk 2 = (1/k)q 1/k = k1 q. Taking the limit as k , we see |F (xk) F (0)| xk 0 2 = |F (x)| The function in Proposition B.1 is used as a sparsity regularizer in sparse learning problems, which we showed in Section 3.2 are instances of Problem (1). B.2. Relation to Discrete DC Functions Next, we discuss the relation between DS functions on discrete domains and discrete DC functions considered in Maehara & Murota (2015). Therein, a function F : Zn Z is called discrete DC if it can be written as F = G H, where G, H : Zn Z are L -convex or M -convex functions. We recall the definitions of both types of functions, see e.g., Murota (1998, Chap. 1, Eq. (1.33) and (1.45)). Definition B.2 (L -convex). A function F : Zn R is called L -convex if it satisfies the discrete midpoint convexity; that is, for all x, y Zn, we have F(x) + F(y) F Definition B.3 (M -convex). A function F : Zn R is called M -convex if, for any x, y Zn and i [n] such that xi > yi, we have F(x) + F(y) min{F(x ei) + F(y + ei), min j [n],yi>xi F(x ei + ej) + F(y + ei ej)} (13) L -convex functions are a proper subclass of submodular functions on Zn (Maehara & Murota, 2015, Section 2.1), while M -convex functions are a proper subclass of supermodular functions on Zn (Murota & Shioura, 2001, Theorem 3.8, Example 3.10). It follows then that any discrete DC function is a DS function on Zn. This is of course not surprising in light of Proposition 3.1 (which still holds for functions defined on Zn). Corollary B.4 is the analogue of Proposition 3.1 for discrete DC functions, which was stated in Maehara & Murota (2015, Corollary 3.9-1) for functions defined on finite subsets of Zn. Corollary B.4. Given F : X Z, where each Xi R is a finite set, there exists an L - L discrete DC function F : Zn Z such that F(x) = F (m 1(x)) for all x X, where m : Qn i=1[0 : ki 1] X is a bijection with ki = |Xi|. Proof. Given F : X R, where each Xi R is a finite set, we define F : Qn i=1[0 : ki 1] R as F(x) = F(m(x)) where m : Qn i=1[0 : ki 1] X is the bijection defined in Corollary 4.2. Since F is defined on a finite subset of Zn, by Corollary 3.9-1 in Maehara & Murota (2015), there exists F : Zn Z such that F = G H , with L -convex functions G , H : Zn Z, and F (z) = F(z) for all z Qn i=1[0 : ki 1] and F (z) = 0 otherwise. We thus have for all x X, F(x) = F(m 1(x)) = F (m 1(x)). Since L -convex functions are submodular on Zn and submodularity is preserved by restriction, then the restrictions G |X , H |X : X Z of G , H to X are submodular, and the restriction F |X : X Z of F to X is a DS function with F = F |X = G |X H |X . Note however that F itself is not necessarily an L - L discrete DC function, since L -convexity is Discrete and Continuous Difference of Submodular Minimization not preserved by restriction. Nevertheless, minimizing F on X is equivalent to minimizing F on Zn if minx X F(x) 0, which can be assumed wlog. The class of discrete DC functions, in particular L - L functions, is thus essentially equivalent to the class of integer valued DS functions on discrete domains. Applications which are DS but not discrete DC: In Section 3.2, we outlined two classes of applications which have natural DS objectives. We now show that both types of applications do not have a natural discrete DC decomposition, for general discrete domains, even if we ignore the assumption in Maehara & Murota (2015) that F is integer valued. For QPs of the form minx X F(x) = 1 2x Qx + c x, F admits a natural discrete DC decomposition, when Xi s are uniform finite sets. However, this is not the case if Xi is a non-uniform finite set for some i, as in the integer least squares experiments we consider in Section 5.1. Such cases also arise for example in non-uniform quantization of neural networks (Kim et al., 2025). To see this, note that any L -convex function F : Z Z {+ } should satisfy for all x, y dom F, x+y 2 dom F. If we re minimizing a quadratic over a uniform grid, we can map from the uniform grid to Q i[0, ki 1] and the resulting function will still be a quadratic. We can then easily decompose the objective into L - L DC function, as we show in Proposition B.5. However, if the domain is a non-uniform grid, the resulting function is no longer a quadratic. Proposition B.5. Any quadratic objective F : Zn Z defined as F(x) = x Qx can be written as the difference of two L functions G(x) = x (Q + D)x and H(x) = x ( Q+ + D)x, where Q = min{Q, 0}, Q+ = max{Q, 0}, and D is a diagonal matrix with Dii = max{ P Proof. Let 2F be the L Hessian of F (Maehara & Murota, 2015, Section 3.2), then 2F = Q + Q . To see this note that for any x Zn, i [n], we have: F(x + ei) F(x) = e i Qx + x Qei + Qii. (14) Then for any j = i, we get: 2 ij F(x) = F(x + ej + ei) F(x + ej) (F(x + ei) F(x)) (15) = e i Q(x + ej) + (x + ej) Qei + Qii (e i Qx + x Qei + Qii) (16) = Qij + Qji (17) 2 ii F(x) = F(x + 1 + ei) F(x + 1) (F(x + ei) F(x)) X j =i 2 ij F(x) (18) = e i Q(x + 1) + (x + 1) Qei + Qii (e i Qx + x Qei + Qii) X j =i 2 ij F(x) (19) = e i Q1 + 1 Qei X j =i (Qij + Qji) (20) = 2Qii. (21) Hence, 2F = Q + Q . Similarly the L Hessian of G is 2G(x) = Q + (Q ) + 2D and of H os 2H(x) = Q+ (Q+) + 2D. Note that for any j = i we have 2 ij G(x) 0 and 2 ij H(x) 0. We also have j=1 2 ij G(x) = j=1 (Q ij + Q ji) + 2Dii 0. Similarly, n X j=1 2 ij H(x) = j=1 (Q+ ij + Q+ ji) + 2Dii 0. Hence, by the Hessian characterization of L -convexity (Maehara & Murota, 2015, Theorem 3.7), G and H are L - convex. Discrete and Continuous Difference of Submodular Minimization For sparse learning problems minx X ℓ(x) + λ x q q, where q [0, 1), λ 0, and ℓis a smooth function, we show below that x q q is not a discrete convex function. Proposition B.6. The function f(x) = x q q, q [0, 1) (when q = 0, we let f = 0), is not L -convex nor M -convex over Zn Proof. As a simple counterexample for the case q (0, 1), take n = 1 so that f : Z R+ is given as f(x) = |x|q. Taking x = 0 and y = 2, we have f(x) + f(y) f 2q + 0 2 (23) which, for any q [0, 1), is a contradiction. For n = 1, M -convexity definition simplifies to (the minimum over empty set is + ) f(x) + f(y) f(x 1) + f(x + 1), for all x > y. We again consider the counterexample x = 2, y = 0. When q = 0, we have f(x) + f(y) = 1 + 0 < 2 = f(x 1) + f(x + 1) which contradicts the M -convexity definition. Similarly, for q (0, 1), we have: f(x) + f(y) = 2q + 0 < 1 + 3q = f(x 1) + f(x + 1), where the inequality follows from the fact that |x + y|q < |x|q + |y|q, which implies that 2q = |3 1|q < 3q + 1q. Of course, in both cases, we can still use the generic decomposition given in Maehara & Murota (2015, Corollary 3.9-1), this would yield G and H with large Lipschitz constants, which would lead to slow optimization. C. Reduction to set function case In this section, we discuss the reduction of a submodular minimization problem minx X F(x) over the integer lattice X = [0 : k 1]n to the minimization of a submodular set function over {0, 1}n (k 1), given in Bach (2019, Section 4.4). As mentioned in Section 4.2, the same reduction can be used to reduce Problem (1) to DS set function minimization. We define the map π : {0, 1}n (k 1) {0, 1}n (k 1) as ( 1 if there exists j j, Xi,j = 1, 0 otherwise, (24) for all i [n], j [k 1], i.e., π(X) is the smallest row non-increasing matrix such that X π(X). Given Bi 0 for all i [n], define the set-function F : {0, 1}n (k 1) R as F(X) = F(Θ(π(X))) + i=1 Bi π(X)i Xi 1 (25) where π(X)i, Xi are the ith rows of π(X), X, respectively. Then minimizing F is equivalent to minimizing F; min X {0,1}n (k 1) F(X) = minx [0:k 1]n F(x). To ensure F is submodular, we choose Bi > 0 such that |F(y) F(x)| i=1 Bi|xi yi|, (26) for all x y X (Bach, 2019, Section 4.4). We thus reduced minimizing F to minimizing a submodular set function. Similarly, we can reduce Problem (1) with X = [0 : k 1]n to the following DS set function minimization: min X {0,1}n (k 1) G(X) H(X), Discrete and Continuous Difference of Submodular Minimization where G, H are submodular set functions defined as in (25). However, as discussed in Bach (2019, Section 4.4), this strategy adds extra parameters Bi which are often unkown, and leads to slower optimization due to the larger Lipschitz constant L f of the continuous extension f of F (which reduces to Lov asz extension in this case). In particular, while we can bound Lf p (k 1)n maxi Bi (Proposition 2.3-g), the bound for f is (2k 3) times larger, i.e., L f (2k 3) p (k 1)n maxi Bi. The time complexity of both submodular minimization algorithms in Bach (2019, Section 5) and the one in Axelrod et al. (2020) (the fastest known inexact submodular set function minimization algorithm) scales quadratically with the Lipschitz constant of the continuous extension: O(( nk Lf ϵ )2EOF ) and O(n(k Lf /ϵ)2 EOF ), respectively. Using the reduction F increases their complexity by at least a factor of O(k2). Moreover, the cost EO F of evaluating F can also be larger; if computed via Eq. (25) it incurs an additional O(nk) time, i.e., EO F = EOF + O(nk). This also applies in our setting, where F is a DS function, since DCA-LS requires solving a submodular minimization at each iteration, which will be similarly slower using the reduction. To validate this empirically, we compare the minimization of a submodular quadratic function F with that of its reduction F: min x [ 1,1]n F(x) = 1 2x Qx, (27) We set n = 50 and construct a symmetric matrix Q Rn n as follows: for i < j, draw Qi,j U[ 1/n,0] where U denotes the uniform distribution; and draw diagonal elements as Qi,i U[0,1]. The resulting Q has non-positive off-diagonal entries, making F submodular by Proposition 2.1-a. This construction also helps ensure the minimizer is randomly located within the lattice D. As discussed in Section 4.1, we can convert Problem (27) to a submodular minimization problem over X = [0 : k 1]n by discretization. In particular, we define F : X R as F (x) = F(m(x)) with m : X [ 1, 1]n, mi(x) = 2 k 1 xi 1. Then F is submodular by Proposition 4.1. To apply the reduction, we must find positive Bi s which satisfy (26) for the function F . One valid choice is to set Bi = B for all i [n], where B is the Lipschitz constant of F with respect to the ℓ1-norm. We start by bounding the Lipschitz constant of F over [ 1, 1]n with respect to the ℓ1-norm. For all x [ 1, 1]n, we have F(x) = Qx Q x Q , (28) Hence, F is L-Lipschitz continuous with respect to the ℓ1-norm with L = Q , which is equal to the maximum ℓ1-norm of the rows of Q. For any x, y X, we have |F (x) F (y)| = |F(m(x)) F(m(y))| (29) L m(x) m(y) 1 (30) = 2L k 1 x y 1. (31) Hence, F is Lipschitz continuous in the ℓ1-norm over [ 1, 1]n with constant B = 2L k 1. We can now define the reduction F of F as in (25). Then Problem (27) reduces to a submodular set-function minimization problem: min X {0,1}n (k 1) F(X). (32) A minimizer X of (32) yields a minimizer x X of F by setting x = Θ(X ). We numerically compare minimizing F and F over their respective domains to get a minimizer of Problem (27). We minimize F and F using pairwise FW, which we run for 200 iterations. Let ˆx, x X denote the resulting solutions, where x is obtained by applying Θ to the output of pairwise FW with F. We test different discretization levels k = 100, 200, . . . , 500 and initialize both methods at zero (i.e. 0 Rn for F and 0 Rn (k 1) for F). For each value of k, we run 50 random trials. We evaluate the two approaches on three metrics: the relative objective gap between ˆx and x, the duality gap given in Bach (2019, Section 5.2), and the running time. Average results are shown in Figure 2. We observe that x has a worse duality gap and objective value than ˆx and a longer running time, after the same number of iterations. This confirms that minimizing F is significantly slower than minimizing F , both in terms of convergence speed (due to the larger Lipschitz constant L f ) and per iteration runtime (due to the higher evaluation cost of F). Discrete and Continuous Difference of Submodular Minimization 100 200 300 400 500 k (F(~x) ! F( x))=F( x) 100 200 300 400 500 k Duality Gap 100 200 300 400 500 k Lattice Reduction Figure 2: Results averaged over 50 runs comparing the direct minimization of F on the lattice X (Lattice) and the minimization of its reduction F on {0, 1}n (k 1) (Reduction) on Problem (27). D. Proofs of Section 2 D.1. Proof of Proposition 2.3-g We prove here the bounds on the Lipschitz constant of the continuous extension of a submodular function given in Proposition 2.3-g. The general bound was stated in Bach (2019, Section 5.1), and follows directly from the definition of the subgradients (see Proposition 2.3-e). Lemma D.1. Given a normalized submodular function F : [0 : k 1]n R, its continuous extension f is Lf -Lipschitz continuous with respect to the Frobenius norm, with Lf p (k 1)n B where B = maxx,x+ei [0:k 1]n |F(x+ei) F(x)|. If F is also non-decreasing, then Lf F((k 1)1). Proof. For any X Rn (k 1) , let (p, q) be a non-increasing permutation of X and Y the corresponding subgradient of f at X, defined as in Proposition 2.3-e. The bound for the general case follows directly by bounding the Frobenius norm of Y . i=1 Y 2 pi,qi i=1 (F(yi) F(yi 1))2 Moreover, if F is non-decreasing, then Yi,j 0 for all i, j. By telescoping sums, we get: Y F Y 1,1 = i=1 F(yi) F(yi 1) = F(yn(k 1)) F(y0) = F((k 1)1) F(0). Since F is normalized, then we have Y F F((k 1)1). E. Proofs of Section 3 E.1. Proofs of Proposition 3.1 and Example 3.2 Proposition 3.1. Given any normalized F : X R where each Xi R is finite, there exists a decomposition F = G H, where G, H : X R are normalized (strictly) submodular functions. Discrete and Continuous Difference of Submodular Minimization Proof. We give a constructive proof, similar to the one provided in Iyer & Bilmes (2012, Lemma 3.1) for set functions. Let α 0 be any lower bound on how much F violates Eq. (3), i.e. for any x X, i = j [n], and ai, aj > 0 such that x + aiei, x + ajej X, we have F(x + aiei) F(x) F(x + aiei + ajej) + F(x + ajej) α. (33) Since F is finitely valued and defined on finitely many points, such a lower bound must exist. For example, we can use α = 4 maxx X |F(x)|. Note that α < 0 unless F is already submodular, in which case we can take G = F and H = 0. We now choose any normalized strictly submodular function H : X R. We provide an example in Example 3.2. H satisfies Eq. (3) with strict inequality. As each Xi is finite, there exists β > 0 such that H(x + aiei) H(x) H(x + aiei + ajej) + H(x + ajej) β. (34) whenever x X, i = j [n], and ai, aj > 0 are such that x + aiei, x + ajej X. We next construct G and H based on H. In particular, we define G = F + |α| β H and H = |α| β H. Since F and H are normalized, G and H are also normalized. Also, H is strictly submodular, since multiplying by a positive scalar preserves strict submodularity. We check that G is also submodular. For any x X, i = j [n], and ai, aj > 0 such that x + aiei, x + ajej X, we have G(x + aiei) G(x) G(x + aiei + ajej) + G(x + ajej) = F(x + aiei) F(x) F(x + aiei + ajej) + F(x + ajej) + |α| β H(x + aiei) H(x) H(x + aiei + ajej) + H(x + ajej) β β = 0. (35) Hence, G is submodular by Equation (3). This yields the desired decomposition F = G H, with G and H normalized submodular functions. Choosing α < 0 which strictly satisfies the inequality (33) (which is again always possible), we can further guarantee that G is strictly submodular, since the inequality (35) will become strict too. Example 3.2. Let H : Rn R be the quadratic function H(x) = 1 2x Jx, where J is the matrix of all ones, and define H : X R as H(x) = H(x) H(xmin). Then, H is a normalized strictly submodular function. Proof. Since each Xi is a compact subset of R, there exists an interval [ai, bi] such that Xi [ai, bi]. The function H is twice-differentiable on Qn i=1[ai, bi], and the Hessian of H has negative entries. Hence, it is strictly submodular on Qn i=1[ai, bi] by Proposition 2.1-b. Since strict submodularity is preserved by restricting H to X, H is also strictly submodular. It is also normalized by definition. E.2. Proof of Proposition 3.3 Proposition 3.3. Given any normalized F : X R where each Xi is a closed interval, if F is differentiable and there exist LF 0 such that F xi (x + aej) LF a, for all x X, i = j, a > 0, xj + a Xj, then there exists a decomposition F = G H, where G, H : X R are normalized (strictly) submodular functions. Proof. The proof is again constructive. Note that LF > 0 unless F is already submodular, in which case we can take G = F and H = 0. We now choose any normalized strictly submodular function H : X R, which is differentiable and there exists L H > 0 such that H xi (x) H xi (x + aej) L Ha, for all x X, i = j, a > 0 such that xj + a Xj. We provide an example in Example 3.2. We next construct G and H based on H. In particular, we define G = F + LF L H H and H = LF L H H. Since F and H are normalized, G and H are also normalized. Also, H is strictly submodular, since multiplying by a positive scalar preserves Discrete and Continuous Difference of Submodular Minimization strict submodularity. We check that G is also submodular. For any x X, j [n], a > 0 such that i = j and xj + a Xj, we have xi (x + aej) = F xi (x + aej) + LF xi (x + aej) L H L Ha = 0. (36) Hence, G is submodular by Proposition 2.1-a. This yields the desired decomposition F = G H, with G and H normalized submodular functions. Choosing LF > 0 which satisfies the strict inequality F xi (x + aej) > LF a, we can further guarantee that G is strictly submodular, since the inequality (36) will become strict too. F. Proofs of Section 4 F.1. Proof of Proposition 4.1 Proposition 4.1. Given X = Qn i=1 Xi, X = Q i X i, with compact sets Xi, X i R, let F : X R and m : X X be a monotone function such that [m(x)]i = mi(xi). If F is submodular, then the function F : X R given by F (x) = F(m(x)) is submodular. Moreover, if m is strictly monotone, then F is submodular iff F is submodular. Proof. For any monotone m, if F is submodular, then for any x, y X we have F (min{x, y}) + F (max{x, y}) = F m(min{x, y}) + F m(min{x, y}) = F min{m(x), m(y)} + F max{m(x), m(y)} F m(x) + F m(y) (F is submodular) = F (x) + F (y). Hence, F is submodular. If F is modular, the above inequality becomes an equality, implying that F is modular too. Moreover if m is strictly monotone, m is invertible, and we may write F(y) = F (m 1(y)). We show that the converse claim is also true in this case. Indeed, since the inverse of a strictly monotone function is again strictly monotone, we have by the first claim that if F is (sub)modular then F is also (sub)modular. F.2. Discretization of a non-Lipschitz function example In Section 4.1, we assumed that the function F : [0, 1]n R is Lipschitz continuous to bound its discretization error. However, Lipschitz continuity is not always necessary. We show below how to handle an example of a non-Lipschitz continuous function. In particular, we consider a function F : [ B, B]n R given by F(x) = ℓ(x)+ x q q, with q [0, 1), where ℓ : [0, 1]n R is a DS function L-Lipschitz continuous w.r.t the ℓ -norm. The modular function x q q is not Lipschitz continuous on the domain [ B, B]n. We first observe that the function |x|q satisfies a notion of Lipschitz continuity, for x = 0. Lemma F.1. The function F : R R defined as F(x) = |x|q with q (0, 1) satisfies ||y|q |x|q| 2 max{|y|, |x|}q 1|y x|, for every two scalars x, y = 0 of same sign. Proof. We assume without loss of generality that |y| |x| > 0. Let t > 0 be the smallest integer such that 2tq 1. Note that |y|q |x|q = |y|2q |x|2q |y|q + |x|q < |y|2q |x|2q Applying this relation recursively yields |y|q |x|q = |y|2tq |x|2tq |y|q Pt 1 i=0 2i 2tq|z|2tq 1|y x| |y|q(2t 1) 2|y|q 1|y x|, (37) Discrete and Continuous Difference of Submodular Minimization where the first inequality follows by the mean value theorem with z a scalar between x and y, and the 2nd inequality holds since by definition of t, we have 2t 1q < 1 2tq and thus |z|2tq 1 |y|2tq 1. We shift and scale the function F to obtain a function F : [0, 1]n R defined as F (x) = F(B(2x 1)), which remains DS by Proposition 4.1. Proposition F.2. Given q (0, 1), B > 0 and F : [0, 1]n R defined as F(x) = ℓ (x) + B(2x 1) q q, where ℓ : [0, 1]n R is a DS function L-Lipschitz continuous w.r.t the ℓ -norm, let k = 2 max{ 2L/ϵ , B(8n/ϵ)1/q } + 1 and define the function F : [0 : k 1]n R as F (x) = F(x/(k 1)). Then F is DS and we have min x [0:k 1]n F (x ) ϵ/2 min x [0,1]n F(x) min x [0:k 1]n F (x). (38) The same also holds if q = 0, with k = 2 L/ϵ + 1. Proof. Let ϕ : [0 : k 1]n [0, 1]n denote the map ϕ(x) = x/(k 1), so F (x) = F(ϕ(x)). Since F is DS and ϕ is monotone, then F is also DS by Proposition 4.1. Let π : [0, 1]n [0 : k 1]n be defined for all x [0, 1]n such that for each i [n], [π(x)]i = k 1 2 if xi = 1/2, and [π(x)]i = t where ϕ(t) is the nearest point to xi for t in [0 : k 1] \ k 1 2 , if xi = 1/2. Note that k 1 is an even number hence k 1 2 [0 : k 1]. For any x [0, 1]n, let x = π(x) be the vector obtained from the above map. Note that for any i, if xi = 1 2 then ϕ(x i) = 1 2, otherwise we have |xi ϕ(x i)| 1 k 1 and |ϕ(x i) 1 2| 1 k 1. Let Gi(xi) = |B(2xi 1)|q, we show that for all i where xi = 1/2, we have |Gi(xi) Gi(ϕ(x i))| 4B( 2B k 1)q 1|xi ϕ(x i)|. Note that the nearest points ϕ(t) to xi = 1/2, for t in [0 : k 1]\ k 1 2 , are ϕ(t) = 1/2 1/(k 1) which are equidistant from 1/2. Hence B(2xi 1) and B(2ϕ(x i) 1) will always have the same sign, and max{B(2xi 1), B(2ϕ(x i) 1)} 2B k 1. By Lemma F.1, we thus have |Gi(xi) Gi(ϕ(x i))| 2( 2B k 1)q 1|B(2xi 1) B(2ϕ(x i) 1)| 4B( 2B k 1)q 1|xi ϕ(x i)|. Putting everything together we get: |F(x) F (x )| = |F(x) F(ϕ(x ))| = |ℓ (x) ℓ (ϕ(x ))| + X xi =1/2 |Gi(xi) Gi(ϕ(x i))| L x ϕ(x ) + 4B( 2B xi =1/2 |xi ϕ(x i)| L k 1 + 4B( 2B L k 1 + 2n( 2B The same bound holds for q = 0 by noting that B(2xi 1) and B(2ϕ(x i) 1) have the same support, hence B(2xi 1) 0 = B(2ϕ(x i) 1) 0 (no discretization error). F.3. Proof of Proposition 4.4 Proposition 4.4. For any normalized function F : [0 : k 1]n R, we have min x [0:k 1]n F(x) = min X [0,1]n (k 1) f (X). (10) Discrete and Continuous Difference of Submodular Minimization Moreover, if x is a minimizer of F then Θ 1(x ) is a minimizer of f , and if X is a minimizer of f then Round F (X ) is a minimizer of F. Proof. Let x be a minimizer of F. By Proposition 2.3-a, we have F(x ) = f (Θ 1(x )). Since Θ 1(x ) [0, 1]n (k 1) , then min x [0:k 1]n F(x) min X [0,1]n (k 1) f (X). Let X is a minimizer of f . By Proposition 2.3-c, we have F(Round F (X )) f (X ). Since Round F (X ) [0 : k 1]n, we have min x [0:k 1]n F(x) min X [0,1]n (k 1) f (X). F.4. Continuous extension of a normalized modular function In this section, we derive the form of a normalized modular function and its continuous extension. Lemma F.3. A function F : [0 : k 1]n R is modular and normalized iff there exists W Rn (k 1) such that j=1 Wi,j for all x [0 : k 1]n. (39) Proof. We first recall that a function F is modular iff it is separable under addition (Topkis, 1978, Theorem 3.3). ( = ) If F is modular, then it is separable under addition, i.e. there exist Fi : [0 : k 1] R, i [n], such that F(x) = Pn i=1 Fi(xi) for any x [0 : k 1]n. As F is normalized, we also have that F(0) = Pn i=1 Fi(0) = 0. Now, define W Rn (k 1) such that Wi,j = Fi(j) Fi(j 1), for i [n], j [k 1]. We thus obtain i=1 Fi(xi) F(0) i=1 Fi(xi) Fi(0) j=1 Fi(j) Fi(j 1) (by telescoping sums) ( = ) If there exists a matrix W Rn (k 1) such that F(x) = Pn i=1 Pxi j=1 Wij, then F is separable under addition, and hence modular. Indeed, we can define the functions Fi : [0 : k 1] R, i [n], by Fi(x) = Pxi j=1 Wi,j. Then F(x) = Pn i=1 Fi(xi) for any x [0 : k 1]n. F is also normalized, since F(0) = 0. Proposition F.4. A function F : [0 : k 1]n R is modular and normalized, with F(x) = Pn i=1 Pxi j=1 Wij for all x [0 : k 1]n iff its continuous extension f is linear on its domain, with f (X) = W, X F for all X Rn (k 1) . Proof. Recall from Lemma F.3, that F is modular and normalized iff it can be written as F(x) = Pn i=1 Pxi j=1 Wij for some W Rn (k 1). ( = ) If F has a linear continuous extension such that f (X) = W, X F for all X Rn (k 1) for some W Rn (k 1). Then, for any x [0 : k 1]n, let X = Θ 1(x). By Proposition 2.3-a and the definition of the Θ 1 (7b), we have that F(x) = f (X) = W, X F = j=1 Wi,j Xi,j = j=1 Wij. (40) Discrete and Continuous Difference of Submodular Minimization ( = ) Suppose F is modular and normalized. By Definition 2.2, the continuous extension of F is equal to f on its domain. Since F and F are submodular, then both f and f are convex by Proposition 2.3-d, which implies that f is affine on its domain. Since F is normalized, we have f (0) = F(0) = 0 by Proposition 2.3-a, hence f is linear on its domain, i.e,. f (X) = W , X F for all X Rn (k 1) for some W Rn (k 1). By the above argument for the reverse direction, we must also have that W = W. F.5. Proof of Theorem 4.5 Before proving Theorem 4.5, we recall some properties of DCA that apply to iterates of Algorithm 1. Proposition F.5. Let ϵ, ϵx 0, and {Xt}, { Xt}, {Y t} be generated by Algorithm 1, where the subproblem at line 6 is solved up to accuracy ϵx. Then for all t N, we have: a) f( Xt+1) f(Xt) + ϵx. b) If f(Xt) f( Xt+1) ϵ, then Xt is an ϵ + ϵx-critical point of g h with Y t ϵ+ϵxg(Xt) h(Xt). Proof. Note that the iterates Xt+1, Y t are generated by a DCA step from Xt, with subproblem (9b) solved up to accuracy ϵx. The proposition follows then directly from Proposition 2.5-a,b. Theorem 4.5. Let {xt} be generated by Algorithm 1, where the subproblem on line 6 is solved up to accuracy ϵx 0. For all t [T], ϵ 0, we have: a) F(xt+1) F(xt) + ϵx. b) Let {yi}(k 1)n i=0 be the vectors corresponding to the permutation (p, q) from line 4, defined as in Definition 2.2. If (p, q) is row-stable and F(xt) F(xt+1) ϵ, then F(xt) F(yi) + ϵ + ϵx for all i [0 : (k 1)n]. c) Algorithm 1 converges to an (ϵ + ϵx)-local minimum of F after at most (F(x0) F )/ϵ iterations. Proof. We first observe that for all t [T], we have: F(xt) F(xt+1) = f(Xt) f(Xt+1) f(Xt) f( Xt+1). (41) Since by the extension property (Proposition 2.3-a), we have f(Xt+1) = F(xt+1) and f(Xt) = F(xt). And by rounding (Proposition 2.3-c), we have f(Xt+1) f( Xt+1). If Xt+1 is integral and rounding is skipped, this still holds trivially since Xt+1 = Xt+1. (a) This follows from Proposition F.5-a and Eq. (41) (b) If F(xt) F(xt+1) ϵ, then f(Xt) f( Xt+1) ϵ too, by Eq. (41). By Proposition F.5-b, we thus have Y t ϵ+ϵxg(Xt) h(Xt). We observe that by definition of yi, if (p, q) is a row-stable non-increasing permutation of Xt, then it is also a non-increasing permutation of Θ 1(yi), for all i [0 : (k 1)n]. Hence, Y t h(Θ 1(yi)) by Proposition 2.3-e, and Y t ϵ+ϵxg(Xt) h(Θ 1(yi)). Therefore, by Proposition 2.4 and Proposition 2.3-a, F(xt) = f(Xt) f(Θ 1(yi)) + ϵ + ϵx = F(yi) + ϵ + ϵx, (42) for any i [(k 1)n]. (c) We first argue that Algorithm 1 converges (F(xt) F(xt+1) ϵ) after at most (F(x0) F )/ϵ iterations. By telescoping sums, we have PT 1 t=0 F(xt) F(xt+1) = F(x0) F(x T ) F(x0) F . Hence, min t [T 1] F(xt) F(xt+1) F(x0) F Discrete and Continuous Difference of Submodular Minimization Taking T = (F(x0) F )/ϵ, we get that there exists some t [T 1] such that F(xt) F(xt+1) (F(x0) F )/T = ϵ. Since (p, q) is chosen on line 4 to be a common permutation of Xt and Θ 1( xt), then by the same argument as in Item b, we have at convergence F(xt) F( xt) + ϵ + ϵx. Since xt is the best neighboring point of xt, this implies that xt is an (ϵ + ϵx)-local minimum of F. G. Experimental Setup Additional Details We provide here additional details on the experimental setups used in Section 5. All methods were implemented in MATLAB. To solve the submodular minimization subproblem in DCA-LS (Algorithm 1-line 6), we use the pairwise FW algorithm from Bach (2019), available at http://www.di.ens.fr/ fbach/submodular_multi_online.zip. In both experiments, during each DCA iteration, we run pairwise FW for a maximum of 400 iterations, stopping earlier if the duality gap reaches 10 4. We warm start pairwise FW with the subproblem s solution from the previous DCA iteration. To achieve a desired target (SNRd B) of α 0, we choose the noise variance σ2 using the following formula, 10 b 2 2 ξ 2 2 , (44) where ξ N(0, Im), then set the noise vector to ξ = σξ . G.1. Additional Details for Section 5.1 The RAR solution is obtained by first solving the relaxed least squares problem x LS argmin x [ 1,3]n F(x) = Ax b 2 2 (45) then rounding x LS to the nearest point in X. We implement OBQ according to the pseudocode provided in Frantar & Alistarh (2022, Algorithm 3). But since in our setting we assume no access to x , we initialize OBQ with x LS. For the ADMM approach of Takapoui et al. (2020), we use the code from https://github.com/cvxgrp/miqp_admm? tab=readme-ov-file. In DCA-LS, we use a maximum of T = 50 outer iterations and set ϵ = 10 5. For n = 100, recall that we obtain an optimal solution using Gurobi 10.0.1 (Gurobi Optimization, LLC, 2024). But since the domain X = { 1, 0, 2, 3}n has unevenly spaced integers, Gurobi cannot solve it directly. Instead, we reformulate the problem into an equivalent unconstrained binary quadratic program (UBQP) min x {0,1}2n x Ux + d x (46) where U = M A AM, d = 2(1 A AM +b AM) and M Rn 2n is defined as M = In [3, 1], with denoting the Kronecker product. To see that the two problems are equivalent, the map Mx 1 defines a bijection between {0, 1}2n and X, and that the objective in (46) is equal to F(Mx 1), without the constant terms. G.2. Additional Details for Section 5.2 We solve the box constrained variant of Lasso min x [ 1,1]n Ax b 2 2 + λ x 1, (47) using the FISTA algorithm from Beck & Guttmann-Beck (2019) with default parameter settings, i.e., 1000 maximum iterations and xk+1 xk 10 5 stopping criterion (for other parameters see https://www.tau.ac.il/ becka/ solvers/fista). In DCA-LS, we use a maximum of T = 25 outer iterations, and set ϵ = 10 5. In DCA-LS, DCA-LSLASSO, and FISTA, we use λ = 10 i for i [0 : 5], and warm start with the solution of the previous λ value (in decreasing order). Our implementation of OMP is adapted from the one provided in Becker. We run OMP for a maximum of 1.5s iterations, stopping earlier if the residual reaches Ax b 2 ξ 2 Discrete and Continuous Difference of Submodular Minimization H. Additional Experiments We present here additional experimental results for the applications presented in Section 5. H.1. Additional Integer Least Squares Experiments 1 1.2 1.4 1.6 1.8 2 m / n Recovery Probability ADMM RAR OBQ DCA-LS 1 1.2 1.4 1.6 1.8 2 m / n Bit Error Rate 1 1.2 1.4 1.6 1.8 2 m / n (F( x) ! F(x\))=F(x\) 15 20 25 30 SNRd B Recovery Probability ADMM RAR OBQ DCA-LS Optimal 15 20 25 30 SNRd B Bit Error Rate 15 20 25 30 SNRd B (F( x) ! F(x$))=F(x$) 15 20 25 30 SNRd B Recovery Probability ADMM RAR OBQ DCA-LS 15 20 25 30 SNRd B Bit Error Rate 15 20 25 30 SNRd B (F( x) ! F(x\))=F(x\) Figure 3: Performance results, averaged over 100 runs, for integer least squares experiments: Varying m with n = 400 and SNRd B = 20 (top), varying SNRd B with m = n = 100 (middle), and varying SNRd B with m = n = 400 (bottom). We repeat the integer least squares experiment from Section 5.1 with n = 400. We also consider another setup where we fix m = n and vary the SNRd B. Performance results for both are reported in Figure 3. When using a signal of length n = 400, we compute the relative objective gap with respect to x , as it is very time consuming to obtain x using Gurobi. We again observe that DCA-LS outperforms baselines on all metrics. For n = 100, DCA-LS performs on par with the optimal solution x starting from SNRd B = 24 d B. OBQ matches the performance of DCA-LS from SNRd B = 26 d B onward. For n = 400 OBQ matches the performance of DCA-LS from SNRd B = 24 d B. Note that the relative objective gap in this case can be negative, since the true signal x is not necessarily the optimal solution for the integer least squares problem. Indeed, we see that this is the case when the noise is high; DCA-LS and OBQ obtain a better solution for SNRd B 18 d B and 15 d B, respectively. H.2. Times for Integer Least Squares Experiments We report the average running time of the compared methods on the integer least squares experiments from Section 5.1 and Appendix H.1 in Figure 4. The reported times for DCA-LS and ADMM do not include the time for the RAR initialization, and for OBQ do not include the time for the relaxed solution initialization (which has similar time as RAR). We observe that Discrete and Continuous Difference of Submodular Minimization DCA-LS is significantly faster that the optimal Gurobi solver, even for the small problem instance (n = 100 ). For the larger instance (n = 400), Gurobi is already too slow to be practical. While our algorithm is slower than heuristic baselines, it remains efficient, with a maximum runtime of 100 seconds for m = n = 400. 1 1.2 1.4 1.6 1.8 2 m / n ADMM RAR OBQ DCA-LS Optimal 1 1.2 1.4 1.6 1.8 2 m / n 15 20 25 30 SNRd B 15 20 25 30 SNRd B Figure 4: Running times (log-scale), averaged over 100 runs, for integer least squares experiments: (a) Varying m with n = 100 and SNRd B = 20, (b) varying m with n = 400 and SNRd B = 20, (c) varying SNRd B and m = n = 100, and (d) varying SNRd B and m = n = 400. Optimal solution is computed using Gurobi. H.3. Additional Integer Compressed Sensing experiments We repeat the integer compressed sensing experiment from Section 5.2, with another sparsity level s = 13 = 0.05n . We also consider another setup where we fix m/n = 0.5 and vary the SNRd B. Performance results for both are reported in Figure 5. When varying the number of measurements for s = 13, we again see that DCA-LS-LASSO outperforms all baselines and DCA-LS across all metrics. DCA-LS also outperforms the baselines in terms of recovery probability, but lags slightly behind in estimation error and support error when m/n 0.4. Similar trends emerge in the experiments where s = 13 and the amount of additive noise is varied. Again, we see that DCA-LS-LASSO and DCA-LS outperform in terms of recovery probability, and when SNRd B 3, outperform in estimation error and support error. Interestingly, when we decrease the sparsity to s = 26, we see that DCA-LS recovers signals earlier than the baselines, but does not reach a 100% recovery rate once SNRd B = 20. Similar to the experiment in Section 5.2, we see that DCA-LS-LASSO significantly outperforms all other methods, especially in terms of recovery probability. H.4. Times for Integer Compressed Sensing Experiments We report the average running time of the compared methods on the integer compressed sensing experiments from Section 5.2 and Appendix H.3 in Figure 6. For LASSO and DCA-LS, the reported times correspond to the sum of running times for the 6 λ values tried, λ = 1, 0.1, . . . , 10 5. The reported time for DCA-LS-LASSO include the time for the LASSO initialization. We also report the running time of LASSO and DCA-LS for each λ separately at m/n 0.2 and 0.5 (m = 50 and m = 123, respectively) with SNRd B = 8 in Figure 7. We observe that DCA-LS and DCA-LS-LASSO are slower than baselines, but they remain efficient, with a maximum total runtime of 37 seconds for DCA-LS and 71 seconds for DCA-LS-LASSO, for solving the problem for all 6 λ Discrete and Continuous Difference of Submodular Minimization 0.2 0.4 0.6 0.8 1 m / n Recovery Probability DCA-LS-LASSO DCA-LS LASSO OMP 0.2 0.4 0.6 0.8 1 m / n Estimation Error 0.2 0.4 0.6 0.8 1 m / n Support Error 0 5 10 15 20 SNRd B Recovery Probability 0 5 10 15 20 SNRd B Estimation Error 0 5 10 15 20 SNRd B Support Error 0 5 10 15 20 SNRd B Recovery Probability 0 5 10 15 20 SNRd B Estimation Error 0 5 10 15 20 SNRd B Support Error Figure 5: Performance results, averaged over 100 runs, for integer compressed sensing experiments with n = 256: Varying m with s = 13 = 0.05n and SNRd B = 8 d B (top), varying SNRd B with s = 13 = 0.05n and m/n = 0.5 (middle), and varying SNRd B with s = 26 = 0.1n and m/n = 0.5 (bottom). values. In terms of total runtime (Figure 6), DCA-LS-LASSO is faster than DCA-LS in some regimes: at high SNRd B with m/n = 0.5(for SNRd B 6 with s = 13, and SNRd B 9 with s = 26), and for m/n (0.35, 0.75) with SNRd B = 8, but slower in others. In terms of individual runtime per λ (Figure 7), DCA-LS-LASSO is faster than DCA-LS for λ 0.01. We also note that DCA-LS-LASSO s total runtime increases significantly when m/n 0.75. This is likely because for small λ values, the influence of the ℓ1-norm regularizer in LASSO is minimal, so DCA-LS-LASSO is effectively initialized with a least-squares solution, which can be worse than the x0 = 0 initialization used in DCA-LS. Inspecting the λ values that yielded the best performance for both DCA-LS-LASSO and DCA-LS, we found that they were always λ 0.01. So restricting λ to this range would not impact performance. In Figure 8, we plot the sum of running times for only λ = 1, 0.1, 0.01 with SNRd B = 8. With this restriction, DCA-LS-LASSO s total runtime no longer increases at m/n 0.75, and is actually lower than DCA-LS for all m/n. Discrete and Continuous Difference of Submodular Minimization 0.2 0.4 0.6 0.8 1 m / n OMP LASSO DCA-LS DCA-LS-LASSO 0.2 0.4 0.6 0.8 1 m / n 0 5 10 15 20 SNRd B 0 5 10 15 20 SNRd B Figure 6: Running times (log-scale), summed over all λ values and averaged over 100 runs, for integer compressed sensing experiments with n = 256: (a) Varying m with s = 13 and SNRd B = 8, (b) varying m with s = 26 and SNRd B = 8, (c) varying SNRd B with s = 13 and m/n = 0.5, and (d) varying SNRd B with s = 26 and m/n = 0.5. LASSO DCA-LS DCA-LS-LASSO Figure 7: Running times (log-scale) for each λ, averaged over 100 runs, for integer compressed sensing experiments with n = 256 and SNRd B = 8: (a) m/n 0.2 and s = 13, (b) m/n 0.5 and s = 13, (c) m/n 0.2 and s = 26, and (d) m/n 0.5 and s = 26. Discrete and Continuous Difference of Submodular Minimization 0.2 0.4 0.6 0.8 1 m / n OMP LASSO DCA-LS-LASSO DCA-LS 0.2 0.4 0.6 0.8 1 m / n Figure 8: Running times (log-scale), summed over λ = 1, 0.1, 0.01 and averaged over 100 runs, for integer compressed sensing experiments with n = 256: (a) Varying m with s = 13 and SNRd B = 8, and (b) varying m with s = 26 and SNRd B = 8. Discrete and Continuous Difference of Submodular Minimization I. DCA Variants Comparison As discussed in Section 4.2, El Halabi et al. (2023) proposed two variants of standard DCA (9) for the special case of Problem (1), where F is a set function (X = {0, 1}n). Both monotonically decrease the objective F (up to ϵx) and converge to a local minimum. In this section, we extend the more efficient of the two variants (Algorithm 2 therein) to general discrete domains and compare it with our proposed DCA variant (DCA-LS; Algorithm 1). We refer to this extension as DCA-Restart, and present it in Algorithm 2. Algorithm 2 DCA with restart 1: ϵ 0, ϵ ϵ, T N, X0 [0, 1]n (k 1) 2: for t = 1, . . . , T do 3: Choose Y t h(Xt) (preferably corresponding to a row-stable permutation) 4: Xt+1 argmin X [0,1]n (k 1) g (X) Y t, X F 5: if f(Xt) f(Xt+1) ϵ then 6: if Xt {0, 1}n (k 1) then 7: xt = Θ(Xt) 8: else 9: xt = Round F (Xt) 10: end if 11: xt argminx NX (xt) F(x) 12: if F(xt) F( xt) + ϵ then 13: Stop. 14: else 15: Xt+1 = Θ 1( xt) 16: end if 17: end if 18: end for Theoretical comparison Similar to the set function variant, DCA-Restart runs standard DCA (9) and, at convergence, checks if rounding the current iterate yields an approximate local minimum of F. If not, it restarts from the best neighboring point. We show in Theorem I.1 that DCA-Restart satisfies the same theoretical guarantees as DCA-LS (see Theorem 4.5). In particular, it also recovers the guarantees of El Halabi et al. (2023) in the special case of set functions. Theorem I.1. Let {Xt}, {xt} be generated by Algorithm 2, where the subproblem on line 4 is solved up to accuracy ϵx 0. For all t [T], ϵ 0, ϵ ϵ, we have: a) f(Xt+1) f(Xt) + ϵx. b) Let (p, q) be the permutation used to compute Y t in Proposition 2.3-e, and {yi}(k 1)n i=0 the vectors corresponding to (p, q), defined as in Definition 2.2. If (p, q) is row-stable and f(Xt) f(Xt+1) ϵ, then F(xt) F(yi) + ϵ + ϵx for all i [0 : (k 1)n]. c) Algorithm 2 converges to an ϵ -local minimum of F after at most (f(X0) f )/ϵ iterations. Proof. Note that between each restart (line 15), Algorithm 2 is simply running standard DCA (9), so Proposition 2.5 applies. a) This holds between each restart by Proposition 2.5-a. Whenever the algorithm restarts, we have F( xt) < F(xt) ϵ , i.e., xt is not an ϵ -local minimum. In this case, we have f(Xt+1) = F( xt) (by Proposition 2.3-a) f(Xt) ϵ (by Proposition 2.3-a,c) (48) Discrete and Continuous Difference of Submodular Minimization b) If f(Xt) f(Xt+1) ϵ, then by Proposition 2.5-b, we have Y t ϵ+ϵxg(Xt) h(Xt). If (p, q) is row-stable, then by definition of yi, (p, q) is a common non-increasing permutation for Xt and Θ 1(yi), for all i [0 : (k 1)n]. Hence, Y t h(Θ 1(yi)) by Proposition 2.3-e, and thus Y t ϵ+ϵxg(Xt) h(Θ 1(yi)). Therefore, F(xt) f(Xt) (by Proposition 2.3-a,c) f(Θ 1(yi)) + ϵ + ϵx (by Proposition 2.4) = F(yi) + ϵ + ϵx, (by Proposition 2.3-a) for any i [(k 1)n]. c) For any iteration t [T], if the algorithm did not terminate, then either f(Xt) f(Xt+1) > ϵ or F( xt) < F(xt) ϵ . In the second case, the algorithm restarts, then by Equation (48) we have f(Xt) f(Xt+1) ϵ, since ϵ ϵ. We therefore have F(x0) F(xt) tϵ, hence t < (f(X0) f )/ϵ. If the algorithm did terminate, then xt must be an ϵ -local minimum of F. DCA-Restart has a similar computational cost per iteration as DCA-LS. The only difference is that DCA-LS has an additional cost per iteration of O(n EOF ) for the local search step (line 3) and O(nk) to map Xt+1 to xt+1 (line 8). In DCA-Restart, these operations are only done at convergence (f(Xt) f(Xt+1) ϵ), which in the worst case can occur at every iteration. However, these differences have little impact on the overall per-iteration cost, which is dominated by the cost of solving the subproblem. Indeed, the total cost per iteration in DCA-Restart is O(n(k Lf t /ϵ)2 EOF t + nk EOH); the same as in DCA-LS. Moreover, the number of iterations for both variants is at most (f(X0) f )/ϵ. So, theoretically, both variants have very similar theoretical guarantees and runtime. Empirical comparison We empirically compare DCA-LS to DCA-Restart on all experiments included in the paper. We use the same parameters T and ϵ, and the same subproblem solver (pairwise-FW) in DCA-Restart as in DCA-LS; see Appendix G for how these are set in each experiment. We choose ϵ = 0, i.e., xt should be an exact local minimum if DCA-Restart stops before reaching the maximum number of iterations. We also choose a row-stable permutation (p, q) when computing Y t. We report their performance on integer least squares (ILS) in Figure 9 and corresponding running times in Figure 10. Similarly, Figure 13 and Figure 14 show their performance and running times on integer compressed sensing (ICS). We also plot the number of DCA outer and inner iterations for ILS in Figures 11 and 12 and for ICS in Figures 16 and 17. The reported numbers of DCA inner iterations are the total number of inner iterations, i.e., iterations of pairwise-FW, summed over all outer iterations t. For ICS, the reported numbers of iterations and running times correspond to the sum over the 6 values of λ tried, λ = 1, 0.1, . . . , 10 5. We also report the running time for each λ separately at m/n 0.2 and 0.5 (m = 50 and m = 123, respectively) with SNRd B = 8 in Figure 15. All results are again averaged over 100 runs, with error bars for standard deviations. We observe that DCA-LS matches or outperforms DCA-Restart on all experiments. The two variants perform similarly when initialized with a good solution (LASSO in ICS, RAR in ILS); otherwise, DCA-LS performs better, sometimes by a large margin (see 4th row in Figure 13). In terms of runtime, DCA-Restart is generally faster on both ILS and ICS. For both applications, DCA-Restart takes slightly more outer iterations to converge, but has a lower per-iteration runtime, as evidenced by its smaller number of inner iterations. Intuitively, we expected each step of DCA-LS to decrease the objective more than in DCA-Restart, because of its more careful permutation choice, and thus to converge faster. In practice, this choice seems to lead to better solutions in some cases, but not significantly faster convergence, and comes at the cost of increased subproblem complexity. The slower convergence of pairwise-FW in DCA-LS may be due to the subgradients Y t changing more between consecutive iterations, causing the subproblem s solution to vary more, and thus take longer to solve, given that we warm-start pairwise-FW with the previous iteration s solution. Overall, the choice between the two variants is problem dependent; for example, on whether a good initialization is available. Discrete and Continuous Difference of Submodular Minimization 1 1.2 1.4 1.6 1.8 2 m / n Recovery Probability DCA-LS DCA-Restart 1 1.2 1.4 1.6 1.8 2 m / n Bit Error Rate 1 1.2 1.4 1.6 1.8 2 m / n (F( x) ! F(x$))=F(x$) 1 1.2 1.4 1.6 1.8 2 m / n Recovery Probability DCA-LS DCA-Restart 1 1.2 1.4 1.6 1.8 2 m / n Bit Error Rate 1 1.2 1.4 1.6 1.8 2 m / n (F( x) ! F(x\))=F(x\) 15 20 25 30 SNRd B Recovery Probability DCA-LS DCA-Restart 15 20 25 30 SNRd B Bit Error Rate 15 20 25 30 SNRd B (F( x) ! F(x$))=F(x$) 15 20 25 30 SNRd B Recovery Probability DCA-LS DCA-Restart 15 20 25 30 SNRd B Bit Error Rate 15 20 25 30 SNRd B (F( x) ! F(x\))=F(x\) Figure 9: Performance results, averaged over 100 runs, for integer least squares experiments comparing two DCA variants: Varying m with n = 100 and SNRd B = 8 (first row), varying m with n = 400 and SNRd B = 8 (second row), varying SNRd B with m = n = 100 (third row), and varying SNRd B with m = n = 400 (fourth row). Discrete and Continuous Difference of Submodular Minimization 1 1.2 1.4 1.6 1.8 2 m / n DCA-LS DCA-Restart 1 1.2 1.4 1.6 1.8 2 m / n 15 20 25 30 SNRd B 15 20 25 30 SNRd B Figure 10: Running times, averaged over 100 runs, for integer least squares experiments comparing two DCA variants: (a) Varying m with n = 100 and SNRd B = 8, (b) varying m with n = 400 and SNRd B = 8, (c) varying SNRd B with m = n = 100, and (d) varying SNRd B with m = n = 400. 1 1.2 1.4 1.6 1.8 2 m / n DCA-LS DCA-Restart 1 1.2 1.4 1.6 1.8 2 m / n 15 20 25 30 SNRd B 15 20 25 30 SNRd B Figure 11: Number of DCA iterations, averaged over 100 runs, for integer least squares experiments: (a) Varying m with n = 100 and SNRd B = 8, (b) varying m with n = 400 and SNRd B = 8, (c) varying SNRd B with m = n = 100, and (d) varying SNRd B with m = n = 400. Discrete and Continuous Difference of Submodular Minimization 1 1.2 1.4 1.6 1.8 2 m / n Inner Iterations DCA-LS DCA-Restart 1 1.2 1.4 1.6 1.8 2 m / n Inner Iterations 15 20 25 30 SNRd B Inner Iterations 15 20 25 30 SNRd B Inner Iterations Figure 12: Number of DCA inner iterations, summed over all outer iterations t, averaged over 100 runs, for integer least squares experiments: (a) Varying m with n = 100 and SNRd B = 8, (b) varying m with n = 400 and SNRd B = 8, (c) varying SNRd B with m = n = 100, and (d) varying SNRd B with m = n = 400. Discrete and Continuous Difference of Submodular Minimization 0.2 0.4 0.6 0.8 1 m / n Recovery Probability DCA-LS-LASSO DCA-LS DCA-Restart-LASSO DCA-Restart 0.2 0.4 0.6 0.8 1 m / n Estimation Error 0.2 0.4 0.6 0.8 1 m / n Support Error 0.2 0.4 0.6 0.8 1 m / n Recovery Probability 0.2 0.4 0.6 0.8 1 m / n Estimation Error 0.2 0.4 0.6 0.8 1 m / n Support Error 0 5 10 15 20 SNRd B Recovery Probability 0 5 10 15 20 SNRd B Estimation Error 0 5 10 15 20 SNRd B Support Error 0 5 10 15 20 SNRd B Recovery Probability 0 5 10 15 20 SNRd B Estimation Error 0 5 10 15 20 SNRd B Support Error Figure 13: Performance results, averaged over 100 runs, for integer compressed sensing experiments comparing two DCA variants with n = 256: Varying m with s = 13 and SNRd B = 8 (first row), varying m with s = 26 and SNRd B = 8 (second row), varying SNRd B with m/n = 0.5 and s = 13 (third row), and varying SNRd B with m/n = 0.5 and s = 26 (fourth row). Discrete and Continuous Difference of Submodular Minimization 0.2 0.4 0.6 0.8 1 m / n DCA-Restart DCA-LS DCA-Restart-LASSO DCA-LS-LASSO 0.2 0.4 0.6 0.8 1 m / n 0 5 10 15 20 SNRd B 0 5 10 15 20 SNRd B Figure 14: Running times, summed over all λ values and averaged over 100 runs, for integer compressed sensing experiments comparing two DCA variants with n = 256: (a) Varying m with s = 13 and SNRd B = 8, (b) varying m with s = 26 and SNRd B = 8, (c) varying SNRd B with s = 13 and m/n = 0.5, and (d) varying SNRd B with s = 26 and m/n = 0.5. DCA-Restart DCA-LS DCA-Restart-LASSO DCA-LS-LASSO Figure 15: Running times for each λ, averaged over 100 runs, for integer compressed sensing experiments comparing two DCA variants with n = 256 and SNRd B = 8: (a) m/n 0.2 and s = 13, (b) m/n 0.5 and s = 13, (c) m/n 0.2 and s = 26, and (d) m/n 0.5 and s = 26. Discrete and Continuous Difference of Submodular Minimization 0.2 0.4 0.6 0.8 1 m / n DCA-Restart DCA-LS DCA-Restart-LASSO DCA-LS-LASSO 0.2 0.4 0.6 0.8 1 m / n 0 5 10 15 20 SNRd B 0 5 10 15 20 SNRd B Figure 16: Number of DCA iterations, summed over all λ values and averaged over 100 runs, for integer compressed sensing experiments with n = 256: (a) Varying m with s = 13 and SNRd B = 8, and (b) varying m with s = 26 and SNRd B = 8, (c) varying SNRd B with s = 13 and m/n = 0.5, and (d) varying SNRd B with s = 26 and m/n = 0.5 0.2 0.4 0.6 0.8 1 m / n Inner Iterations DCA-Restart DCA-LS DCA-Restart-LASSO DCA-LS-LASSO 0.2 0.4 0.6 0.8 1 m / n Inner Iterations 0 5 10 15 20 SNRd B Inner Iterations 0 5 10 15 20 SNRd B Inner Iterations Figure 17: Number of DCA inner iterations (log-scale), summed over all outer iterations t and all λ values, averaged over 100 runs, for integer compressed sensing experiments with n = 256: (a) Varying m with s = 13 and SNRd B = 8, and (b) varying m with s = 26 and SNRd B = 8, (c) varying SNRd B with s = 13 and m/n = 0.5, and (d) varying SNRd B with s = 26 and m/n = 0.5.