# Continuous Submodular Maximization: Beyond DR-Submodularity

Moran Feldman, Department of Computer Science, University of Haifa, Haifa 3498838, Israel. moranfe@cs.haifa.ac.il

Amin Karbasi, School of Engineering and Applied Science, Yale University, New Haven, CT 06520. amin.karbasi@yale.edu

## Abstract

In this paper, we propose the first continuous optimization algorithms that achieve a constant factor approximation guarantee for the problem of monotone continuous submodular maximization subject to a linear constraint. We first prove that a simple variant of the vanilla coordinate ascent, called COORDINATE-ASCENT+, achieves a ((e − 1)/(2e − 1) − ε)-approximation guarantee while performing O(n/ε) iterations, where the computational complexity of each iteration is roughly O(n/√ε + n log n) (here, n denotes the dimension of the optimization problem). We then propose COORDINATE-ASCENT++, which achieves the tight (1 − 1/e − ε)-approximation guarantee while performing the same number of iterations, but at a higher computational complexity of roughly O(n³/ε^2.5 + n³ log n/ε²) per iteration. However, the computation of each round of COORDINATE-ASCENT++ can be easily parallelized so that the computational cost per machine scales as O(n/√ε + n log n).

## 1 Introduction

Submodularity is a fundamental concept in combinatorial optimization, usually associated with discrete set functions [Fujishige, 1991]. As submodular functions formalize the intuitive notion of diminishing returns, and thus provide a useful structure, they appear in a wide range of modern machine learning applications, including various forms of data summarization [Lin and Bilmes, 2012, Mirzasoleiman et al., 2013], influence maximization [Kempe et al., 2003], sparse and deep representations [Balkanski et al., 2016, Oh Song et al., 2017], fairness Celis et al. [2016], Kazemi et al.
[2018], experimental design [Harshaw et al., 2019], neural network interpretability Elenberg et al. [2017], human-brain mapping [Salehi et al., 2017], adversarial robustness [Lei et al., 2018], and crowd teaching [Singla et al., 2014], to name a few. Moreover, submodularity ensures the tractability of the underlying combinatorial optimization problems, as minimization of submodular functions can be done exactly and (constrained) maximization of submodular functions can be done approximately. For more details regarding the theory and applications of submodular functions in machine learning and signal processing, we refer the interested reader to the recent surveys by Buchbinder and Feldman [2018] and Tohidi et al. [2020].

To capture an even larger set of applications, while providing rigorous guarantees, the discrete notion of submodularity has been generalized in various directions, including adaptive and interactive submodularity for sequential decision making problems [Golovin and Krause, 2011, Guillory and Bilmes, 2010], weak submodularity for general set functions with a bounded submodularity distance [Das and Kempe, 2011], and sequence submodularity for time series analysis [Tschiatschek et al., 2017, Mitrovic et al., 2019], among other variants. Very recently, a surge of new applications in machine learning and statistics motivated researchers to study continuous submodular functions [Bach, 2015, Wolsey, 1982a], a large class of non-convex/non-concave functions which may be optimized efficiently. In particular, it has been shown that continuous submodular minimization can be done exactly [Bach, 2015]. In contrast, for continuous submodular maximization, it is usually assumed that the continuous function is not only submodular, but also satisfies the extra condition of diminishing returns. Such functions are usually called continuous DR-submodular [Bian et al., 2017b].

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
We should highlight that even though in the discrete domain submodularity and diminishing returns are equivalent, in the continuous domain the diminishing returns condition implies continuous submodularity, but not vice versa. In this paper, we propose the first algorithms that achieve constant factor approximation guarantees for the maximization of a monotone continuous submodular function subject to a linear constraint. More specifically, our contributions can be summarized as follows:

- We develop a variant of the coordinate ascent algorithm, called COORDINATE-ASCENT+, that achieves a ((e − 1)/(2e − 1) − ε)-approximation guarantee while performing O(n/ε) iterations, where the computational complexity of each iteration is O(n√(B/ε) + n log n). Here, n and B denote the dimension of the optimization problem and the ℓ₁ radius of the constraint set, respectively.
- We then develop COORDINATE-ASCENT++, which achieves the tight (1 − 1/e − ε)-approximation guarantee while performing O(n/ε) iterations, where the computational complexity of each iteration is O(n³√B/ε^2.5 + n³ log n/ε²). Moreover, COORDINATE-ASCENT++ can be easily parallelized so that the computational complexity per machine in each round scales as O(n√(B/ε) + n log n).

Notably, to establish these results, we do not assume that the continuous submodular function satisfies the diminishing returns condition.

### 1.1 Related Work

Continuous submodular functions naturally arise in many machine learning applications such as Adwords for e-commerce and advertising [Mehta et al., 2007, Devanur and Jain, 2012], influence and revenue maximization [Bian et al., 2017b], robust budget allocation [Staib and Jegelka, 2017], multi-resolution data summarization [Bian et al., 2017b], learning assignments [Golovin et al., 2014], experimental design [Chen et al., 2018b], and MAP inference for determinantal point processes [Gillenwater et al., 2012, Hassani et al., 2019].
Continuous submodular functions have also been studied in statistics as negative log-densities of probability distributions. These distributions are referred to as multivariate totally positive of order 2 (MTP2) [Fallat et al., 2017], and classical examples are the multivariate logistic, Gamma and F distributions, as well as characteristic roots of random Wishart matrices [Karlin and Rinott, 1980].

The focus of the current work is to study continuous submodular maximization. Almost all the existing works in this area consider a proper subclass of continuous submodular functions, called continuous DR-submodular, which satisfies the diminishing returns condition. In particular, when first order information (i.e., exact or stochastic gradients) is available, Hassani et al. [2017] showed that (stochastic) gradient ascent achieves a 1/2-approximation guarantee for monotone continuous DR-submodular functions subject to a general convex body constraint.¹ Interestingly, one can achieve the tight approximation guarantee of 1 − 1/e by using conditional gradient methods [Bian et al., 2017b] or their efficient stochastic variants [Mokhtari et al., 2018a, Karbasi et al., 2019, Zhang et al., 2020]. A simple variant of the conditional gradient methods can also be applied to non-monotone DR-submodular functions, which results in a 1/e-approximation guarantee [Bian et al., 2017a, Mokhtari et al., 2018c, Hassani et al., 2019]. The only work we are aware of that goes beyond the above line of work, and also considers non-DR continuous submodular functions, is a recent work by Niazadeh et al. [2018], which developed a polynomial time algorithm with a tight 1/2-approximation guarantee for the problem of continuous submodular maximization subject to a box constraint.

Discrete and continuous submodular maximization problems are inherently related to one another through the multilinear extension [Calinescu et al., 2011a].
Indeed, maximization of the multilinear extension (along with a subsequent rounding) has led to the best theoretical results in many settings, including submodular maximization subject to various complex constraints [Feldman et al., 2011, Chekuri et al., 2014, Buchbinder and Feldman, 2019], online and bandit submodular maximization [Zhang et al., 2019, Chen et al., 2018a], decentralized solutions [Mokhtari et al., 2018b, Xie et al., 2019], and algorithms with low adaptivity complexity [Chekuri and Quanrud, 2019, Balkanski et al., 2019, Chen et al., 2019, Ene et al., 2019].

¹We note that the result of Hassani et al. [2017] can be viewed as an adaptation of an algorithm designed by Chekuri et al. [2011] for multilinear extensions of discrete monotone submodular functions.

Remark: Soma and Yoshida [2018] considered the closely related problem of maximizing a non-DR submodular function subject to a cardinality constraint over the integer lattice. In principle, any algorithm for their setting can be applied to our setting using a reduction, at the cost of a small decrease in the approximation guarantee and some increase in the time complexity.² However, an error was recently found in the analysis of the algorithm of Soma and Yoshida [2018], and therefore, this algorithm cannot currently be used to obtain an algorithm for our setting via the above-mentioned reduction. It is likely that the algorithm of Soma and Yoshida [2018] can be fixed by introducing an enumeration step similar to the one we describe in Section 5, but such a modification would significantly increase its time complexity. In Appendix A in the supplementary material we give the reduction from our setting to the setting of Soma and Yoshida [2018], and also describe the error that was found in their algorithm.

## 2 Preliminaries and Problem Formulation

We first recall a few standard definitions regarding submodular functions.
Even though submodularity is mostly considered in the discrete domain, the notion can be naturally extended to arbitrary lattices [Fujishige, 1991]. To this end, let us consider a subset of ℝ₊^d of the form X = ∏_{i=1}^d X_i, where each X_i is a compact subset of ℝ₊. A function F: X → ℝ₊ is submodular [Wolsey, 1982b] if for all (x, y) ∈ X × X we have

F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y),

where x ∨ y := max(x, y) (component-wise) and x ∧ y := min(x, y) (component-wise). A submodular function is monotone if for any x, y ∈ X such that x ≤ y we have F(x) ≤ F(y) (here, by x ≤ y we mean that every element of x is at most the corresponding element of y). The above definition includes the discrete notion of submodularity over a set by restricting each X_i to {0, 1}. In this paper, we mainly consider continuous submodular functions, where each X_i is a closed interval in ℝ₊. When F is twice differentiable, a continuous function is submodular if and only if all its cross-second-derivatives are non-positive [Bach, 2015], i.e.,

∀ i ≠ j, ∀ x ∈ X:  ∂²F(x)/(∂x_i ∂x_j) ≤ 0.

Thus, continuous submodular functions can be convex (e.g., F(x) = ∑_{i,j} φ_{i,j}(x_i − x_j) for φ_{i,j} convex), concave (e.g., F(x) = g(∑_{i=1}^n λ_i x_i) for g concave and λ_i's non-negative), and neither (e.g., the quadratic program F(x) = xᵀHx where all off-diagonal elements of H are non-positive). A proper subclass of continuous submodular functions, called DR-submodular [Bian et al., 2017b, Soma and Yoshida, 2015], consists of the functions for which, for all x, y ∈ X such that x ≤ y, every standard basis vector e_i ∈ ℝⁿ₊ and every non-negative number z ∈ ℝ₊ such that ze_i + x ∈ X and ze_i + y ∈ X, it holds that F(ze_i + x) − F(x) ≥ F(ze_i + y) − F(y).³ One can easily verify that for a differentiable DR-submodular function the gradient is an antitone mapping, i.e., for all x, y ∈ X such that x ≤ y we have ∇F(x) ≥ ∇F(y) [Bian et al., 2017b]. An important example of a DR-submodular function is the multilinear extension [Calinescu et al., 2011b].
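As a quick sanity check of the lattice definition above, the following sketch numerically verifies the inequality F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y) for a random quadratic F(x) = xᵀHx with non-positive off-diagonal entries; the dimension, random seed, and number of trials are illustrative choices of ours, not from the paper. (The diagonal of H can be arbitrary: diagonal terms cancel in the inequality, since {max(x_i, y_i), min(x_i, y_i)} = {x_i, y_i}.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# H with non-positive off-diagonal entries and an arbitrary diagonal.
H = -rng.random((n, n))
H = (H + H.T) / 2                            # symmetrize; still non-positive
np.fill_diagonal(H, rng.standard_normal(n))  # diagonal is unrestricted

F = lambda x: float(x @ H @ x)

for _ in range(1000):
    x, y = rng.random(n), rng.random(n)
    join, meet = np.maximum(x, y), np.minimum(x, y)  # x ∨ y and x ∧ y
    # The lattice submodular inequality (up to floating-point slack):
    assert F(x) + F(y) >= F(join) + F(meet) - 1e-9
```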
Maybe the simplest example of a continuous submodular function that is not DR-submodular is the quadratic function F(x) = xᵀHx + hᵀx + b where only the off-diagonal entries of H are required to be non-positive (there is no restriction on the diagonal entries). Moreover, as we mentioned in the introduction, continuous submodular functions naturally arise as the negative log-densities of probability distributions. For instance, a distribution p on X is called MTP2 if p(x)·p(y) ≤ p(x ∨ y)·p(x ∧ y) for all x, y ∈ X ⊆ ℝⁿ. MTP2 implies positive association between random variables. As an example, a multivariate Gaussian distribution is MTP2 if and only if its inverse covariance matrix has non-positive off-diagonal entries. Therefore, finding the most likely configuration in this setting amounts to maximizing a continuous submodular function.

²The reduction between the two settings is a bit subtle, and we would like to thank an anonymous referee for suggesting its crux idea to us.

³It is well-known that (non-DR) continuous submodularity is equivalent to requiring this inequality to hold whenever the vectors x and y also obey x_i = y_i.

The definition of DR-submodular functions implies that decreasing the coordinates of a vector affects the contribution of these coordinates to the value of the vector in at most a proportional way. More formally, if x and y are two vectors such that x, x + y ∈ X, then for every real value c ∈ [0, 1] it holds that F(x + cy) ≥ F(x) + c·[F(x + y) − F(x)]. This property is very helpful in algorithms designed for DR-submodular functions, since it allows one to lower bound the increase in the value of a given solution x when some coordinate i of x is increased, even when the increase does not make x_i as large as the value of coordinate i in some optimal solution.
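To make the gap between submodularity and DR-submodularity concrete, here is a minimal one-dimensional instance of the quadratic example above (the specific numbers are ours, for illustration only): F(x) = x² is continuous submodular, since there are no off-diagonal entries, yet its marginal gains grow with x, so the DR inequality F(ze_i + x) − F(x) ≥ F(ze_i + y) − F(y) fails for x ≤ y.

```python
# F(x) = x^2, i.e. H = [1]: no off-diagonal entries, hence submodular,
# but the positive diagonal breaks diminishing returns.
F = lambda x: x * x

x, y, z = 0.0, 1.0, 0.5      # x <= y, and a step of size z
gain_at_x = F(x + z) - F(x)  # 0.25
gain_at_y = F(y + z) - F(y)  # 1.25
# DR-submodularity would require gain_at_x >= gain_at_y; here it fails:
assert gain_at_x < gain_at_y
```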
For (non-DR) continuous submodular functions this cannot be done, and thus, a coordinate whose value cannot be increased to match its value in some optimal solution (due to the constraint and incorrect decisions already made by the algorithm) is basically useless for the algorithm. This renders the design of algorithms for (non-DR) continuous submodular functions much more challenging than for DR-submodular functions.

In this paper, we consider the following fundamental optimization problem:

max_{x ∈ X} F(x)  subject to  ‖x‖₁ ≤ B,   (1)

where F is a non-negative monotone continuous submodular function. Without loss of generality, we assume that each closed interval X_i is of the form [0, u_i], since otherwise, for X_i = [a_i, a_i + u_i], we can always define a corresponding continuous submodular function G(x) = F(x + a), where a = (a_1, ..., a_n). Similarly, we assume w.l.o.g. that u_i ≤ B for every coordinate i. We also assume that F is L-smooth, meaning that ‖∇F(x) − ∇F(y)‖₂ ≤ L·‖x − y‖₂ for some L ≥ 0 and all x, y ∈ X. Finally, note that replacing a linear constraint of the form ∑_{i=1}^n w_i·x_i ≤ B (where w_i > 0) with ‖x‖₁ ≤ B does not change the nature of the problem: we can simply define a corresponding function G(x) = F(∑_{i=1}^n x_i·e_i/w_i) and solve Problem (1). This change, of course, changes L by a factor of W = min_{1≤i≤n} w_i. Prior to our work, no constant approximation guarantee was known for Problem (1).

## 3 Plain Coordinate Ascent

In this section we present our plain coordinate ascent algorithm and analyze its guarantee. Our algorithm uses as a black box an algorithm for a one-dimensional optimization problem whose properties are summarized by the following proposition. We include the proof of this proposition in Appendix B in the supplementary material.

Proposition 3.1.
Given a point x ∈ [0, u], a coordinate i ∈ [n], bounds 0 < a ≤ b ≤ u_i − x_i and a positive parameter ε ∈ (0, 1), there is a polynomial time algorithm that runs in O(√(B/ε) + log(ε/a)) time and returns a value y ∈ [a, b] maximizing the ratio [F(x + y·e_i) − F(x)]/y up to an additive error of εL.

Using the algorithm whose existence is guaranteed by the last proposition, we can now formally state our coordinate ascent algorithm as Algorithm 1. This algorithm gets a quality control parameter ε ∈ (0, 1/4). We begin the analysis of Algorithm 1 with the following observation, which bounds its time complexity. The proof of this observation, and some other proofs appearing later in this paper, have been moved to Appendix D in the supplementary material due to space constraints. In a nutshell, the observation holds since each iteration of the main loop of Algorithm 1 produces progress in at least one of three ways: making x_j = u_j for some coordinate j, making ‖x‖₁ as large as B, or increasing ‖x‖₁ by at least δ.

Observation 3.2. The main loop of Algorithm 1 makes at most O(n/ε) iterations, each running in O(n√(B/ε) + n log n) time. Thus, the entire algorithm runs in O(n²√B/ε^1.5 + n² log n/ε) time.

Fix now some feasible solution y ∈ [0, u]. Intuitively, we say that an iteration of the main loop of Algorithm 1 is good (with respect to y) if, at the beginning of the iteration, the algorithm still has the option to increase each coordinate of x to be equal to the corresponding coordinate of y without violating the constraint (if a coordinate of x is already larger than the corresponding coordinate of y, that is fine as well). Formally, an iteration is good if the inequality y_i − x_i ≤ d'_i held in this iteration for every coordinate i ∈ C (before the vector x was updated at the end of the iteration).

Algorithm 1: COORDINATE-ASCENT(ε)
1 Let x ← 0 and δ ← εB/n.
2 while ‖x‖₁ < B do
3   Let C ⊆ [n] be the set of coordinates i ∈ [n] for which x_i < u_i (i.e., these coordinates can be increased in x to some positive extent without violating feasibility).
4   for every i ∈ C do
5     Let d'_i be the maximum amount by which x_i can be increased without violating feasibility. Formally, d'_i = min{u_i − x_i, B − ‖x‖₁}.
6     Use the algorithm suggested by Proposition 3.1 to find a value d_i ∈ [min{d'_i, δ}, d'_i] maximizing [F(x + d_i·e_i) − F(x)]/d_i up to an additive error of εL.
7   Let j be the coordinate of C maximizing [F(x + d_j·e_j) − F(x)]/d_j, and update x ← x + d_j·e_j.
8 return x.

Let ℓ denote the number of good iterations of the main loop of Algorithm 1, and let us denote by x^(h) the value of x after h iterations, for every 0 ≤ h ≤ ℓ. Using this notation, we can now state and prove the following lemma, which provides a lower bound on the value of x after any number of (good) iterations of Algorithm 1.

Lemma 3.3. For every vector y ∈ [0, u] and integer 0 ≤ h ≤ ℓ,

F(x^(h)) ≥ (1 − e^(−‖x^(h)‖₁/(‖y‖₁+εB)))·F(y) − ‖x^(h)‖₁·εL.

Proof. We prove the lemma by induction on h. For h = 0 we have ‖x^(h)‖₁ = 0, and the lemma follows from the non-negativity of F. Thus, it remains to prove the lemma for some h > 0 given that it holds for h − 1. From this point on we restrict our attention to iteration number h of Algorithm 1, and thus, when we refer to variables such as C and d_i, these variables should be understood as taking the values they are assigned in this iteration. Given this convention, for every i ∈ C, let us define a value o_i that is closest to y_i − x^(h−1)_i among all the values in the range to which d_i can belong. Formally,

o_i = min{max{y_i − x^(h−1)_i, min{δ, d'_i}}, d'_i} = max{y_i − x^(h−1)_i, min{δ, d'_i}}  for every i ∈ C,

where the second equality holds since the fact that the iteration we consider is a good iteration implies y_i − x^(h−1)_i ≤ d'_i.
Since o_i is a valid choice for d_i, we get by the definition of d_i that

[F(x^(h−1) + d_i·e_i) − F(x^(h−1))]/d_i ≥ [F(x^(h−1) + o_i·e_i) − F(x^(h−1))]/o_i − εL.

Using the definition of j and the submodularity of F, the last inequality implies

[F(x^(h−1) + d_j·e_j) − F(x^(h−1))]/d_j ≥ (∑_{i∈C} o_i·{[F(x^(h−1) + d_i·e_i) − F(x^(h−1))]/d_i})/(∑_{i∈C} o_i)
  ≥ (∑_{i∈C} [F(x^(h−1) + o_i·e_i) − F(x^(h−1))])/(∑_{i∈C} o_i) − εL
  ≥ [F(x^(h−1) + ∑_{i∈C} o_i·e_i) − F(x^(h−1))]/(∑_{i∈C} o_i) − εL.   (2)

To understand the rightmost side of the last inequality, we need the following two bounds:

∑_{i∈C} o_i ≤ ∑_{i∈C} max{y_i, δ} ≤ ∑_{i∈C} y_i + nδ ≤ ‖y‖₁ + εB,

and

x^(h−1) + ∑_{i∈C} o_i·e_i ≥ x^(h−1) + ∑_{i∈C} (y_i − x^(h−1)_i)·e_i ≥ y.

Plugging these bounds into Inequality (2), and using the monotonicity of F, we get

[F(x^(h−1) + d_j·e_j) − F(x^(h−1))]/d_j ≥ [F(y) − F(x^(h−1))]/(‖y‖₁ + εB) − εL.

Since x^(h) = x^(h−1) + d_j·e_j, the last inequality now yields the following lower bound on F(x^(h)):

F(x^(h)) ≥ F(x^(h−1)) + d_j·[F(y) − F(x^(h−1))]/(‖y‖₁ + εB) − εL·d_j
  = (1 − d_j/(‖y‖₁ + εB))·F(x^(h−1)) + d_j·F(y)/(‖y‖₁ + εB) − εL·d_j.

Finally, plugging into the last inequality the lower bound on F(x^(h−1)) given by the induction hypothesis, we get

F(x^(h)) ≥ (1 − d_j/(‖y‖₁ + εB))·{(1 − e^(−‖x^(h−1)‖₁/(‖y‖₁+εB)))·F(y) − ‖x^(h−1)‖₁·εL} + d_j·F(y)/(‖y‖₁ + εB) − εL·d_j
  ≥ (1 − e^(−(‖x^(h−1)‖₁+d_j)/(‖y‖₁+εB)))·F(y) − (‖x^(h−1)‖₁ + d_j)·εL
  = (1 − e^(−‖x^(h)‖₁/(‖y‖₁+εB)))·F(y) − ‖x^(h)‖₁·εL. ∎

Our next objective is to get an approximation guarantee for Algorithm 1 based on the last lemma. Such a guarantee appears below as Corollary 3.5. However, to prove it we also need the following observation, which shows that lower bounding the value of F(x) at some point during the execution of Algorithm 1 implies the same bound also for the value of the final solution of the algorithm.

Observation 3.4. The value of F(x) only increases during the execution of Algorithm 1.

Proof. The observation follows from the monotonicity of F, since d_j is always non-negative. ∎

Let opt be some optimal solution vector.

Corollary 3.5. Let x_CA be the vector outputted by Algorithm 1. Then F(x_CA) ≥ (1 − e^(−1 + max_{i∈[n]} u_i/B) − ε)·F(opt) − εBL.
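The main loop of Algorithm 1 can be sketched as follows. This is an illustrative simplification of ours: the exact subroutine of Proposition 3.1 is replaced by a naive grid search over step sizes, and the function name, grid resolution, and test objective are assumptions, not the paper's.

```python
import numpy as np

def coordinate_ascent(F, u, B, eps=0.1, grid=50):
    """Illustrative sketch of COORDINATE-ASCENT: repeatedly increase the
    coordinate whose best step maximizes the marginal-gain-to-step ratio.
    Proposition 3.1's exact subroutine is replaced by a grid search."""
    n = len(u)
    x = np.zeros(n)
    delta = eps * B / n  # minimum step size considered, as in Algorithm 1
    while x.sum() < B - 1e-12:
        best = None  # (ratio, coordinate, step)
        for i in range(n):
            d_max = min(u[i] - x[i], B - x.sum())  # room left on coordinate i
            if d_max <= 1e-12:
                continue
            e_i = np.zeros(n); e_i[i] = 1.0
            for d in np.linspace(min(delta, d_max), d_max, grid):
                ratio = (F(x + d * e_i) - F(x)) / d
                if best is None or ratio > best[0]:
                    best = (ratio, i, d)
        if best is None:  # every coordinate is saturated
            break
        _, j, d = best
        x[j] += d  # increase the best coordinate
    return x

# Usage on a monotone continuous submodular objective (concave of the sum):
F = lambda x: float(np.sqrt(1.0 + x.sum()))
x = coordinate_ascent(F, u=np.ones(3), B=2.0, eps=0.25)
```

Each pass either fills the budget, saturates a coordinate, or advances ‖x‖₁ by at least δ = εB/n, mirroring the progress argument behind Observation 3.2.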
The guarantee of Corollary 3.5 is close to an approximation ratio of 1 − 1/e when the upper bound u_i is small compared to B for every i ∈ [n]. In the next two sections we describe enhanced versions of our coordinate ascent algorithm that give an approximation guarantee which is independent of this assumption. We note that, formally, the analyses of these enhanced versions are independent of Corollary 3.5. However, the machinery used to prove this corollary is reused in these analyses.

## 4 Fast Enhanced Coordinate Ascent

In this section we describe one simple and fast way to enhance the plain coordinate ascent algorithm from Section 3, leading to the algorithm that we name COORDINATE-ASCENT+. Before describing COORDINATE-ASCENT+ itself, let us give a different formulation for the guarantee of Algorithm 1.

Lemma 4.1. There is a coordinate j ∈ [n] such that the output x_CA of Algorithm 1 has a value of at least (1 − 1/e − 2ε)·F(opt − opt_j·e_j) − εBL.

Proof. In this proof we use the notation from Section 3, and consider the last iteration ℓ' of the execution in which there is no coordinate i ∈ [n] such that opt_i − x_i > d'_i (where x_i represents here its value at the beginning of the iteration). If ℓ' is the last iteration of Algorithm 1, then all the iterations of Algorithm 1 are good when we choose y = opt. Thus, for this choice of y we get ‖x^(ℓ')‖₁ = B, and by Lemma 3.3 the value of the output x_CA = x^(ℓ') of Algorithm 1 is at least

(1 − e^(−B/(‖opt‖₁+εB)))·F(opt) − εBL ≥ (1 − e^(ε−1))·F(opt) − εBL,

where the inequality holds since ‖opt‖₁ ≤ B. This guarantee is stronger than the guarantee of the lemma (because of the monotonicity of F), and thus, completes the proof for the current case.

Consider now the case in which iteration ℓ' is not the last iteration of Algorithm 1. In this case we set j to be some coordinate in [n] for which the inequality opt_j − x^(ℓ')_j > d'_j holds. Choosing y = opt − opt_j·e_j, we get that Algorithm 1 has at least ℓ' good iterations.
There are now two cases to consider based on the relationship between ‖y‖₁ and B/2. If ‖y‖₁ ≥ B/2, then Lemma 3.3 and Observation 3.4 together imply that the value of x_CA is at least

F(x_CA) ≥ F(x^(ℓ')) ≥ (1 − e^(−(B−opt_j)/(B+εB−opt_j)))·F(opt − opt_j·e_j) − εBL ≥ (1 − e^(2ε−1))·F(opt − opt_j·e_j) − εBL ≥ (1 − 1/e − 2ε)·F(opt − opt_j·e_j) − εBL,

where the second inequality holds since ‖x^(ℓ')‖₁ ≤ B, but opt_j − x^(ℓ')_j > d'_j = B − ‖x^(ℓ')‖₁ (the last equality holds since the inequalities opt_j − x^(ℓ')_j > d'_j and u_j ≥ opt_j exclude the possibility of d'_j = u_j − x^(ℓ')_j). The penultimate inequality holds since, by our assumption, B − opt_j ≥ ‖y‖₁ ≥ B/2.

It remains to consider the case in which ‖y‖₁ < B/2. In this case, for every coordinate i ∈ [n] we have y_i − x_i ≤ y_i ≤ B/2 ≤ B − ‖x‖₁ as long as ‖x‖₁ ≤ B/2. Thus, all the iterations of Algorithm 1 are good until ‖x‖₁ grows larger than B/2, which implies ‖x^(ℓ')‖₁ ≥ B/2. Hence, Lemma 3.3 and Observation 3.4 allow us to lower bound F(x_CA) also by

F(x_CA) ≥ F(x^(ℓ')) ≥ (1 − e^(−(B/2)/(B/2+εB)))·F(opt − opt_j·e_j) − εBL ≥ (1 − e^(2ε−1))·F(opt − opt_j·e_j) − εBL ≥ (1 − 1/e − 2ε)·F(opt − opt_j·e_j) − εBL,

where the second inequality follows from the above discussion and the inequality ‖x^(ℓ')‖₁ ≤ B, which holds since x^(ℓ') is a feasible solution. ∎

We are now ready to present the enhanced algorithm COORDINATE-ASCENT+, which appears as Algorithm 2. The enhancement done in this algorithm, and its analysis, are related to an algorithm of Cohen and Katzir [2008] obtaining the same approximation guarantee for the special case of discrete monotone submodular functions.

Algorithm 2: COORDINATE-ASCENT+(ε)
1 Let x_CA be the solution produced by Algorithm 1 when run with ε.
2 Let x_CA+ be the best solution among the n + 1 solutions x_CA and {u_i·e_i}_{i∈[n]}.
3 return x_CA+.

It is clear that the time complexity of Algorithm 2 is dominated by the time complexity of Algorithm 1. Thus, we only need to analyze the approximation ratio of Algorithm 2.
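Algorithm 2's enhancement is a simple post-processing step: return the best of the plain coordinate-ascent output and the n singleton solutions. A minimal sketch (the `plain_ascent` callback standing in for Algorithm 1, and the toy objective, are hypothetical interfaces of ours):

```python
import numpy as np

def coordinate_ascent_plus(F, u, B, plain_ascent):
    """Sketch of COORDINATE-ASCENT+ (Algorithm 2): take the best among the
    plain coordinate-ascent solution and the n singleton solutions u_i*e_i."""
    n = len(u)
    candidates = [plain_ascent(F, u, B)]
    for i in range(n):
        s = np.zeros(n)
        s[i] = min(u[i], B)  # the paper assumes u_i <= B; clamp for safety
        candidates.append(s)
    return max(candidates, key=F)

# Usage with a trivial stand-in for Algorithm 1 that returns the zero vector:
F = lambda x: float(np.sqrt(1.0 + x.sum()))
best = coordinate_ascent_plus(F, u=np.array([3.0, 1.0]), B=2.0,
                              plain_ascent=lambda F, u, B: np.zeros(len(u)))
# `best` is the first singleton, since F depends only on the total mass
```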
This is done by the next theorem, whose proof relies on the fact that one of the solutions checked by Algorithm 2 is u_j·e_j for the coordinate j whose existence is guaranteed by Lemma 4.1.

Theorem 4.2. Algorithm 2 outputs a solution of value at least ((e − 1)/(2e − 1) − 2ε)·F(opt) − εBL ≥ (0.387 − 2ε)·F(opt) − εBL. It has O(n/ε) iterations, each running in O(n√(B/ε) + n log n) time, which yields a time complexity of O(n²√B/ε^1.5 + n² log n/ε).

## 5 Optimal Approximation Ratio

In this section we describe a more involved way to enhance the plain coordinate ascent algorithm from Section 3, which leads to the algorithm that we name COORDINATE-ASCENT++ and achieves the optimal approximation ratio of 1 − 1/e (up to some error term). This enhancement uses as a black box an algorithm for a one-dimensional optimization problem whose properties are summarized by the following proposition. We include the proof of this proposition in Appendix C in the supplementary material.

Proposition 5.1. Given a point x ∈ [0, u], a coordinate i ∈ [n], a target value F(x) ≤ v ≤ F(x ∨ u_i·e_i) and a positive parameter ε ∈ (0, 1), there is a polynomial time algorithm that runs in O(log(B/ε)) time and returns a value 0 ≤ y ≤ u_i − x_i such that:
- F(x + y·e_i) ≥ v − εL.
- There is no value 0 ≤ y' < y such that F(x + y'·e_i) ≥ v.

We can now give a simplified version of COORDINATE-ASCENT++, which appears as Algorithm 3. For simplicity, we assume in the description and analysis of this algorithm that n ≥ 3. If this is not the case, one can simulate it by adding dummy coordinates that do not affect the value of the objective function. Algorithm 3 starts by guessing two coordinates h₁ and h₂ that contribute a lot of value to opt. Then it constructs a solution x̂ with a small support using two executions of the algorithm whose existence is guaranteed by Proposition 5.1, one execution for each one of the coordinates h₁ and h₂.
It then completes the solution x̂ into a full solution by executing Algorithm 1 after contracting the coordinates h₁ and h₂, i.e., modifying the objective function so that it implicitly assumes that these coordinates take the values they take in x̂.

Algorithm 3: COORDINATE-ASCENT++ (SIMPLIFIED)(ε)
1 Guess the coordinate h₁ ∈ [n] maximizing F(opt_{h₁}·e_{h₁}) and the coordinate h₂ ∈ [n] \ {h₁} maximizing F(∑_{i∈{h₁,h₂}} opt_i·e_i).
2 Let x̂ ← 0.
3 for i = 1 to 2 do
4   Guess a value v_i obeying max{F(x̂), F(x̂ + opt_{h_i}·e_{h_i}) − ε·F(opt)} ≤ v_i ≤ F(x̂ + opt_{h_i}·e_{h_i}).
5   Let y_i be the value returned by the algorithm whose existence is guaranteed by Proposition 5.1 given x̂ as the input vector, the coordinate h_i and the target value v_i.
6   Update x̂ ← x̂ + y_i·e_{h_i}.
7 Execute Algorithm 1 on the instance obtained by removing the coordinates h₁ and h₂, replacing the objective function with F'(x') = F(x' + x̂) − F(x̂) and decreasing B by ‖x̂‖₁. Let x_CA be the output of Algorithm 1.
8 Return x̂ + x_CA (we denote this sum by x_CA++ in the analysis).

We begin the analysis of Algorithm 3 by bounding its time complexity.

Observation 5.2. Assuming the guesses made by Algorithm 3 do not require any time, Algorithm 3 has O(n/ε) iterations, each running in O(n√(B/ε) + n log n) time, which yields a time complexity of O(n²√B/ε^1.5 + n² log n/ε).

The next step in the analysis of Algorithm 3 is proving some properties of the vector x̂ = ∑_{i=1}^2 y_i·e_{h_i} produced by the first part of the algorithm. In a nutshell, these properties hold since the definition of v_j and the properties of Proposition 5.1 together show that the value chosen for x̂_{h_j} by the algorithm of Proposition 5.1 gives almost as much value as choosing opt_{h_j}, but never overestimates opt_{h_j}.

Lemma 5.3. F(x̂) ≥ F(∑_{j=1}^2 opt_{h_j}·e_{h_j}) − 2ε·F(opt) − 2εL and x̂ ≤ ∑_{j=1}^2 opt_{h_j}·e_{h_j}.

We are now ready to prove the approximation guarantee of Algorithm 3.
Intuitively, this proof is based on simply adding up the lower bound on F(x̂) given by Lemma 5.3 and the lower bound on F'(x_CA) given by Lemma 4.1. Some of the ideas used in the proof can be traced back to a recent result of Nutov [2020], who described an algorithm achieving (1 − 1/e)-approximation for the discrete version of the problem we consider (namely, maximizing a non-negative monotone discrete submodular function subject to a knapsack constraint) using O(n⁴) function evaluations.

Lemma 5.4. Algorithm 3 outputs a vector x_CA++ whose value is at least (1 − 1/e − 4ε)·F(opt) − ε(B + 2)L.

To get our final COORDINATE-ASCENT++ algorithm, we need to explain how to implement the guesses of Algorithm 3. The coordinates h₁ and h₂ can be guessed by simply iterating over all the possible pairs of two coordinates. Similarly, by the next observation, to get v_i it suffices to try all the possible values in the set

J(x̂, h_i) = {F(x̂) + ε·j·F(u_{h_i}·e_{h_i}) | j is a non-negative integer and F(x̂) + ε·j·F(u_{h_i}·e_{h_i}) ≤ F(x̂ + u_{h_i}·e_{h_i})}.

Observation 5.5. Consider the vector x̂ at the point in which Algorithm 3 guesses the value v_i. Then, there exists a value in the set J(x̂, h_i) obeying the requirements from v_i.

Our final COORDINATE-ASCENT++ algorithm appears as Algorithm 4. By the above discussion, the number of iterations it makes exceeds the number of iterations given by Observation 5.2 only by a factor of

n²·∏_{i=1}^2 |J(x̂, h_i)| ≤ n²·∏_{i=1}^2 (1 + [F(x̂ + u_{h_i}·e_{h_i}) − F(x̂)]/(ε·F(u_{h_i}·e_{h_i}))) = O(ε⁻²·n²),

where the equality holds since the submodularity and non-negativity of F imply F(x̂ + u_{h_i}·e_{h_i}) − F(x̂) ≤ F(u_{h_i}·e_{h_i}).

Algorithm 4: COORDINATE-ASCENT++(ε)
1 for every pair of distinct coordinates h₁, h₂ ∈ [n] do
2   Let x^(0) ← 0.
3   for every v₁ ∈ J(x^(0), h₁) do
4     Let y₁ be the value returned by the algorithm guaranteed by Proposition 5.1 given x^(0) as the input vector, the coordinate h₁ and the target value v₁.
5     Set x^(1) ← x^(0) + y₁·e_{h₁}.
6     for every v₂ ∈ J(x^(1), h₂) do
7       Let y₂ be the value returned by the algorithm guaranteed by Proposition 5.1 given x^(1) as the input vector, the coordinate h₂ and the target value v₂.
8       Update x̂ ← x^(1) + y₂·e_{h₂}.
9       Execute Algorithm 1 on the instance obtained by removing the coordinates h₁ and h₂, replacing the objective function with F'(x') = F(x' + x̂) − F(x̂) and decreasing B by ‖x̂‖₁. Let x_CA be the output of Algorithm 1.
10      Mark x̂ + x_CA as a candidate solution.
11 Return the solution maximizing F among all the solutions marked above as candidate solutions.

The next theorem summarizes the result we have proved in this section.

Theorem 5.6. For every ε ∈ (0, 1), Algorithm 4 is an algorithm for our problem that produces a solution of value at least (1 − 1/e − 4ε)·F(opt) − ε(B + 2)L. It has O(n³/ε³) iterations, each running in O(n√(B/ε) + n log n) time, which yields a time complexity of O(n⁴√B/ε^3.5 + n⁴ log n/ε³).

In all the loops of Algorithm 4, the iterations are independent, and thus, can be executed in parallel instead of sequentially. Hence, the parallel time required for Algorithm 4 equals the time complexity of Algorithm 3, which by Observation 5.2 is only O(n²√B/ε^1.5 + n² log n/ε).

## 6 Conclusion

In this paper, we provided the first constant factor approximation guarantees for the problem of maximizing a monotone continuous submodular function subject to a linear constraint. Crucially, our results do not rely on DR-submodularity. A natural open problem is to extend our algorithms to more involved kinds of constraints. To understand the difficulty of such an extension, let us recall that our algorithms resemble algorithms for maximizing a submodular set function subject to a knapsack constraint, even though all the coordinates have a coefficient of 1 in the constraint, which corresponds to a simpler cardinality constraint.
Intuitively, this is the case because our algorithms treat monotone (non-DR) continuous submodular functions as if they were monotone submodular set functions over an infinite ground set consisting of pairs (i, v), where i is a coordinate and v is the value assigned to this coordinate. Unfortunately, this intuitive reduction from continuous to set functions converts cardinality constraints into knapsack constraints, forcing us to employ knapsack techniques even when handling cardinality-like constraints. Similarly, other kinds of constraints also become more involved when the above intuitive reduction is applied to them, often leading to constraints that are difficult to handle.

## Broader Impact

In this paper, we provided the first constant factor approximation guarantees for a large class of non-convex functions with combinatorial structure. Since it is a theoretical result in nature, a broader impact discussion is not applicable.

## Acknowledgments and Disclosure of Funding

We would like to thank the anonymous referees for pointing out the work of Soma and Yoshida [2018] to us. The work of Moran Feldman was supported in part by ISF grants no. 1357/16 and 459/20. Amin Karbasi is partially supported by NSF (IIS-1845032), ONR (N00014-19-1-2406), AFOSR (FA9550-18-1-0160), and TATA Sons Private Limited.

## References

Francis R. Bach. Submodular Functions: from Discrete to Continuous Domains. CoRR, abs/1511.00394, 2015.

Eric Balkanski, Baharan Mirzasoleiman, and Yaron Singer. Learning Sparse Combinatorial Representations via Two-stage Submodular Maximization. In Proceedings of The 33rd International Conference on Machine Learning, pages 2207–2216, 2016.

Eric Balkanski, Aviad Rubinstein, and Yaron Singer. An optimal approximation for submodular maximization under a matroid constraint in the adaptive complexity model. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019.

An Bian, Kfir Yehuda Levy, Andreas Krause, and Joachim M. Buhmann.
Non-monotone continuous DR-submodular maximization: Structure and algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 486–496, 2017a.

Andrew An Bian, Baharan Mirzasoleiman, Joachim M. Buhmann, and Andreas Krause. Guaranteed Non-convex Optimization: Submodular Maximization over Continuous Domains. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017b.

Niv Buchbinder and Moran Feldman. Submodular functions maximization problems, 2018.

Niv Buchbinder and Moran Feldman. Constrained submodular maximization via a non-symmetric technique. Math. Oper. Res., 44(3):988–1005, 2019.

Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a submodular set function subject to a matroid constraint. SIAM Journal on Computing, 2011a.

Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740–1766, 2011b.

L. Elisa Celis, Amit Deshpande, Tarun Kathuria, and Nisheeth K. Vishnoi. How to be Fair and Diverse? CoRR, abs/1610.07183, 2016.

Chandra Chekuri and Kent Quanrud. Parallelizing greedy for submodular set function maximization in matroids and beyond. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019.

Chandra Chekuri, Jan Vondrák, and Rico Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), 2011.

Chandra Sekhar Chekuri, Jan Vondrák, and Rico Zenklusen. Submodular function maximization via the multilinear relaxation and contention resolution schemes. SIAM Journal on Computing, 2014.

Lin Chen, Christopher Harshaw, Hamed Hassani, and Amin Karbasi. Projection-free online optimization with stochastic gradient: From convexity to submodularity. In International Conference on Machine Learning, 2018a.
Lin Chen, Hamed Hassani, and Amin Karbasi. Online Continuous Submodular Maximization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1896–1905, 2018b.

Lin Chen, Moran Feldman, and Amin Karbasi. Unconstrained submodular maximization with constant adaptive complexity. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 102–113, 2019.

Reuven Cohen and Liran Katzir. The generalized maximum coverage problem. Inf. Process. Lett., 108(1):15–22, 2008. doi: 10.1016/j.ipl.2008.03.017. URL https://doi.org/10.1016/j.ipl.2008.03.017.

Abhimanyu Das and David Kempe. Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection. In International Conference on Machine Learning, pages 1057–1064, 2011.

Nikhil R. Devanur and Kamal Jain. Online matching with concave returns. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 137–144. ACM, 2012.

Ethan R. Elenberg, Alexandros G. Dimakis, Moran Feldman, and Amin Karbasi. Streaming Weak Submodularity: Interpreting Neural Networks on the Fly. In Advances in Neural Information Processing Systems, 2017.

Alina Ene, Huy L. Nguyen, and Adrian Vladu. Submodular maximization with matroid and packing constraints in parallel. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019.

Shaun Fallat, Steffen Lauritzen, Kayvan Sadeghi, Caroline Uhler, Nanny Wermuth, and Piotr Zwiernik. Total positivity in Markov structures. The Annals of Statistics, 45(3):1152–1184, 2017.

Moran Feldman, Joseph Naor, and Roy Schwartz. A unified continuous greedy algorithm for submodular maximization. In Foundations of Computer Science (FOCS), pages 570–579. IEEE, 2011.

Satoru Fujishige. Submodular Functions and Optimization, volume 58 of Annals of Discrete Mathematics. North-Holland, Amsterdam, 2nd edition, 1991. ISBN 0-444-88556-0.

Jennifer Gillenwater, Alex Kulesza, and Ben Taskar.
Near-optimal MAP inference for determinantal point processes. In Advances in Neural Information Processing Systems, pages 2735–2743, 2012.

Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.

Daniel Golovin, Andreas Krause, and Matthew J. Streeter. Online Submodular Maximization under a Matroid Constraint with Application to Learning Assignments. CoRR, abs/1407.1082, 2014.

Andrew Guillory and Jeff Bilmes. Interactive Submodular Set Cover. In International Conference on Machine Learning, pages 415–422. Omnipress, 2010.

Chris Harshaw, Moran Feldman, Justin Ward, and Amin Karbasi. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. In International Conference on Machine Learning, 2019.

Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Zebang Shen. Stochastic conditional gradient++: (Non-)convex minimization and continuous submodular maximization, 2019.

S. Hamed Hassani, Mahdi Soltanolkotabi, and Amin Karbasi. Gradient Methods for Submodular Maximization. In Advances in Neural Information Processing Systems, pages 5843–5853, 2017.

Amin Karbasi, Hamed Hassani, Aryan Mokhtari, and Zebang Shen. Stochastic continuous greedy++: When upper and lower bounds match. In Advances in Neural Information Processing Systems (NeurIPS), pages 13066–13076, 2019.

Samuel Karlin and Yosef Rinott. Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. Journal of Multivariate Analysis, 10(4):467–498, 1980.

Ehsan Kazemi, Morteza Zadimoghaddam, and Amin Karbasi. Scalable deletion-robust submodular maximization: Data summarization with privacy and fairness constraints. In International Conference on Machine Learning, pages 2544–2553, 2018.

David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network.
In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137–146. ACM, 2003.

Qi Lei, Lingfei Wu, Pin-Yu Chen, Alexandros G. Dimakis, Inderjit S. Dhillon, and Michael Witbrock. Discrete adversarial attacks and submodular optimization with applications to text classification. arXiv preprint arXiv:1812.00151, 2018.

Hui Lin and Jeff A. Bilmes. Learning Mixtures of Submodular Shells with Application to Document Summarization. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 479–490, 2012.

Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. AdWords and generalized online matching. Journal of the ACM (JACM), 2007.

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed Submodular Maximization: Identifying Representative Elements in Massive Data. In Advances in Neural Information Processing Systems, pages 2049–2057, 2013.

Marko Mitrovic, Ehsan Kazemi, Moran Feldman, Andreas Krause, and Amin Karbasi. Adaptive sequence submodularity. In Advances in Neural Information Processing Systems, pages 5353–5364, 2019.

Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Conditional gradient method for stochastic submodular maximization: Closing the gap. In International Conference on Artificial Intelligence and Statistics, pages 1886–1895, 2018a.

Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Decentralized submodular maximization: Bridging discrete and continuous settings. In International Conference on Machine Learning, 2018b.

Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv preprint arXiv:1804.09554, 2018c.

Rad Niazadeh, Tim Roughgarden, and Joshua R. Wang. Optimal algorithms for continuous non-monotone submodular and DR-submodular maximization. In Advances in Neural Information Processing Systems 31, pages 9617–9627, 2018.

Zeev Nutov, 2020.
Personal communication.

Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep metric learning via facility location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5382–5390, 2017.

Mehraveh Salehi, Amin Karbasi, Dustin Scheinost, and R. Todd Constable. A submodular approach to create individualized parcellations of the human brain. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 478–485. Springer, 2017.

Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In International Conference on Machine Learning, 2014.

Tasuku Soma and Yuichi Yoshida. A generalization of submodular cover via the diminishing return property on the integer lattice. In Advances in Neural Information Processing Systems, pages 847–855, 2015.

Tasuku Soma and Yuichi Yoshida. Maximizing monotone submodular functions over the integer lattice. Math. Program., 172(1-2):539–563, 2018. doi: 10.1007/s10107-018-1324-y. URL https://doi.org/10.1007/s10107-018-1324-y.

Matthew Staib and Stefanie Jegelka. Robust Budget Allocation via Continuous Submodular Functions. In Proceedings of the 34th International Conference on Machine Learning, pages 3230–3240, 2017.

Ehsan Tohidi, Rouhollah Amiri, Mario Coutino, David Gesbert, Geert Leus, and Amin Karbasi. Submodularity in action: From machine learning to signal processing applications. IEEE Signal Processing Magazine, 37(5):120–133, 2020.

Sebastian Tschiatschek, Adish Singla, and Andreas Krause. Selecting sequences of items via submodular maximization. In AAAI Conference on Artificial Intelligence, pages 2667–2673, 2017.

Laurence A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 1982a.

Laurence A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 1982b.
Jiahao Xie, Chao Zhang, Zebang Shen, Chao Mi, and Hui Qian. Decentralized gradient tracking for continuous DR-submodular maximization. In International Conference on Artificial Intelligence and Statistics, 2019.

Mingrui Zhang, Lin Chen, Hamed Hassani, and Amin Karbasi. Online continuous submodular maximization: From full-information to bandit feedback. In Advances in Neural Information Processing Systems, 2019.

Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. One sample stochastic Frank-Wolfe. In International Conference on Artificial Intelligence and Statistics, 2020.