# Continuous DR-submodular Maximization: Structure and Algorithms

An Bian (ETH Zurich, ybian@inf.ethz.ch), Kfir Y. Levy (ETH Zurich, yehuda.levy@inf.ethz.ch), Andreas Krause (ETH Zurich, krausea@ethz.ch), Joachim M. Buhmann (ETH Zurich, jbuhmann@inf.ethz.ch)

DR-submodular continuous functions are important objectives with wide real-world applications spanning MAP inference in determinantal point processes (DPPs) and mean-field inference for probabilistic submodular models, amongst others. DR-submodularity captures a subclass of non-convex functions that enables both exact minimization and approximate maximization in polynomial time. In this work we study the problem of maximizing non-monotone continuous DR-submodular functions under general down-closed convex constraints. We start by investigating geometric properties that underlie such objectives; for example, we prove a strong relation between (approximately) stationary points and the global optimum. These properties are then used to devise two optimization algorithms with provable guarantees. Concretely, we first devise a two-phase algorithm with a 1/4 approximation guarantee. This algorithm allows the use of existing methods for finding (approximately) stationary points as a subroutine, thus harnessing recent progress in non-convex optimization. Then we present a non-monotone FRANK-WOLFE variant with a 1/e approximation guarantee and a sublinear convergence rate. Finally, we extend our approach to a broader class of generalized DR-submodular continuous functions, which captures a wider spectrum of applications. Our theoretical findings are validated on synthetic and real-world problem instances.

## 1 Introduction

Submodularity is classically most well known for set function optimization, where it enables efficient minimization [23] and approximate maximization [31; 25] in polynomial time. Submodularity has recently been studied on the integer lattice [34; 33] and on continuous domains [3; 4; 36; 21], with significant theoretical results and practical applications. For set functions, it is well known that submodularity is equivalent to the diminishing returns (DR) property. However, this does not hold for integer-lattice functions or continuous functions, where the DR property defines a subclass of submodular functions, called DR-submodular functions.

In continuous domains, applying convex optimization techniques enables efficient minimization of submodular continuous functions [3; 36] (despite the non-convex nature of such objectives). In [4] it is further shown that continuous submodularity enables constant-factor approximation schemes for constrained monotone DR-submodular maximization and box-constrained non-monotone submodular maximization problems.

Many real-world non-convex problems, such as maximizing the softmax extension of DPPs, require maximizing a non-monotone DR-submodular function over a general down-closed convex constraint. Yet, current theory [3; 4; 36] does not apply to this general problem setting, which motivates us to develop guaranteed and efficient algorithms for such problems. Exploring the structure that underlies DR-submodularity is crucial to deriving guaranteed algorithms.
Combined with a notion of non-stationarity for constrained optimization problems and a new notion of strong DR-submodularity, we find a rich structure in the problem of continuous DR-submodular maximization. This in turn gives rise to two approximation algorithms with provable guarantees. Specifically, we make the following contributions:

- We bound the difference between the objective values of stationary points and the global optimum. Our analysis shows that the bound is even tighter if the objective is strongly DR-submodular (see Definition 3).
- Based on these geometric properties, we present two algorithms: (i) a two-phase FRANK-WOLFE-style algorithm with a 1/4 approximation guarantee that converges at a $1/\sqrt{k}$ rate; (ii) a non-monotone FRANK-WOLFE variant with a 1/e approximation guarantee and sublinear convergence. Even though the worst-case guarantee of the first algorithm is worse than that of the second, it offers several practical advantages, which we discuss in Section 4.2.
- We investigate a generalized class of submodular functions on conic lattices. This allows us to model a larger class of non-trivial applications, including logistic regression with a non-convex separable regularizer, non-negative PCA, etc. To optimize them, we provide a reduction that enables invoking algorithms for continuous submodular optimization.
- We experimentally demonstrate the applicability of our methods on both synthetic and real-world problem instances.

### 1.1 Problem Statement

Notation. We use boldface letters, e.g., x, to represent vectors and boldface capital letters, e.g., A, to denote matrices; $x_i$ is the $i$-th entry of $x$ and $A_{ij}$ is the $(i,j)$-th entry of $A$. We use $e_i$ to denote the standard $i$-th basis vector. $f(\cdot)$ denotes a continuous function and $F(\cdot)$ a set function. $[n] := \{1, \dots, n\}$ for an integer $n \ge 1$. $\|\cdot\|$ denotes the Euclidean norm by default. Given two vectors $x, y$, $x \le y$ means $x_i \le y_i, \forall i$; $x \vee y$ and $x \wedge y$ denote the coordinate-wise maximum and coordinate-wise minimum, respectively.

The general setup of constrained non-monotone DR-submodular (see Definition 1 below) maximization is

$$\max_{x \in \mathcal{P}} f(x), \qquad \text{(P)}$$

where $f : \mathcal{X} \to \mathbb{R}$ is continuous DR-submodular and $\mathcal{X} = \prod_{i=1}^n \mathcal{X}_i$, with each $\mathcal{X}_i$ an interval [3; 4]. W.l.o.g.,¹ we assume that the lower bound $\underline{u}$ of $\mathcal{X}$ is $0$, i.e., $\mathcal{X} = [0, \bar u]$. The set $\mathcal{P} \subseteq [0, \bar u]$ is assumed to be a down-closed convex set, where down-closedness means: $x \in \mathcal{P}$ and $0 \le y \le x$ implies $y \in \mathcal{P}$. The diameter of $\mathcal{P}$ is $D := \max_{x, y \in \mathcal{P}} \|x - y\|$, and it holds that $D \le \|\bar u\|$. We use $x^*$ to denote the global maximum of (P). One can assume $f$ is non-negative over $\mathcal{X}$, since otherwise one just needs to find a lower bound for the minimum function value of $f$ over $[0, \bar u]$ (and box-constrained submodular minimization can be solved to arbitrary precision in polynomial time [3]).

Over continuous domains, a DR-submodular function [4] is a submodular function with the diminishing returns (DR) property.

Definition 1 (DR-submodular & DR property). A function $f : \mathcal{X} \to \mathbb{R}$ is DR-submodular (has the DR property) if $\forall a \le b \in \mathcal{X}$, $\forall i \in [n]$, $\forall k \in \mathbb{R}_+$ such that $(k e_i + a)$ and $(k e_i + b)$ are still in $\mathcal{X}$, it holds that

$$f(k e_i + a) - f(a) \ge f(k e_i + b) - f(b). \qquad (1)$$

If $f$ is differentiable, one can show that Definition 1 is equivalent to $\nabla f$ being an antitone mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$. Furthermore, if $f$ is twice differentiable, the DR property is equivalent to all entries of its Hessian being non-positive, i.e., $\nabla^2_{ij} f(x) \le 0$, $\forall x \in \mathcal{X}$, $i, j \in [n]$. A function $f : \mathcal{X} \to \mathbb{R}$ is DR-supermodular iff $-f$ is DR-submodular.
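To make Definition 1 concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper) that instantiates a toy DR-submodular quadratic and checks both the non-positive-Hessian characterization and the diminishing-returns inequality (1) at randomly chosen points; the instance, names, and constants below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Toy DR-submodular quadratic f(x) = 0.5 x^T H x + h^T x with H <= 0 entrywise.
H = -rng.uniform(0.0, 1.0, size=(n, n))
H = 0.5 * (H + H.T)                 # symmetrize; entries remain non-positive
h = rng.uniform(0.0, 1.0, n)
f = lambda x: 0.5 * x @ H @ x + h @ x

# Twice-differentiable case: DR-submodularity <=> all Hessian entries are <= 0.
assert np.all(H <= 0)

# Spot-check the DR inequality (1): for a <= b, adding k*e_i to a gains at least
# as much as adding k*e_i to b.
a = rng.uniform(0.0, 0.5, n)
b = a + rng.uniform(0.0, 0.5, n)    # b >= a coordinate-wise
k, i = 0.3, 2
e_i = np.eye(n)[i]
assert f(k * e_i + a) - f(a) >= f(k * e_i + b) - f(b) - 1e-12
print("DR property verified on the toy instance")
```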
We also assume that $f$ has Lipschitz gradients.

¹ Since otherwise one can work with a new function $g(x) := f(x + \underline{u})$ that has $0$ as the lower bound of its domain; all properties of the function are preserved.

Definition 2. A function $f$ has $L$-Lipschitz gradients if for all $x, y \in \mathcal{X}$ it holds that

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|. \qquad (2)$$

A brief summary of related work appears in Section 6.

## 2 Motivating Real-world Examples

Many continuous objectives in practice turn out to be DR-submodular. Here we list several of them; more can be found in Appendix B.

Softmax extension. Determinantal point processes (DPPs) are probabilistic models of repulsion that have been used to model diversity in machine learning [26]. The constrained MAP (maximum a posteriori) inference problem of a DPP is an NP-hard combinatorial problem in general. Currently, the methods with the best approximation guarantees are based on either maximizing the multilinear extension [6] or the softmax extension [20], both of which are DR-submodular functions (details in Appendix F.1). The multilinear extension is given as an expectation over the original set function values, so evaluating this objective requires expensive sampling. In contrast, the softmax extension has a closed-form expression, which is much more appealing from a computational perspective. Let $L$ be the positive semidefinite kernel matrix of a DPP; its softmax extension is

$$f(x) = \log\det\big(\mathrm{diag}(x)(L - I) + I\big), \quad x \in [0,1]^n, \qquad (3)$$

where $I$ is the identity matrix and $\mathrm{diag}(x)$ is the diagonal matrix with diagonal entries $x$. MAP inference in DPPs then corresponds to the problem $\max_{x \in \mathcal{P}} f(x)$, where $\mathcal{P}$ is a down-closed convex constraint, e.g., a matroid polytope or a matching polytope.

Mean-field inference for log-submodular models. Log-submodular models [9] are a class of probabilistic models over subsets of a ground set $V = [n]$, where the log-densities are submodular set functions $F(S)$: $p(S) = \frac{1}{Z}\exp(F(S))$. The partition function $Z = \sum_{S \subseteq V} \exp(F(S))$ is typically hard to evaluate. One can use mean-field inference to approximate $p(S)$ by a factorized distribution $q_x(S) := \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j)$, $x \in [0,1]^n$, by minimizing the Kullback-Leibler divergence between $q_x$ and $p$, i.e., $\sum_{S \subseteq V} q_x(S) \log\frac{q_x(S)}{p(S)}$. It is

$$\mathrm{KL}(x) = -\sum_{S \subseteq V} \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j)\, F(S) + \sum_{i=1}^n \big[x_i \log x_i + (1 - x_i)\log(1 - x_i)\big] + \log Z.$$

$\mathrm{KL}(x)$ is DR-supermodular w.r.t. $x$ (details in Appendix F.1), so minimizing the Kullback-Leibler divergence $\mathrm{KL}(x)$ amounts to maximizing a DR-submodular function.

### 2.1 Motivating Example Captured by Generalized Submodularity on Conic Lattices

Submodular continuous functions can already model many scenarios. Yet, there are several interesting cases which are in general not (DR-)submodular but can still be captured by a generalized notion. This generalization enables us to develop polynomial-time algorithms with guarantees by using ideas from continuous submodular optimization. We present one representative objective here (more in Appendix B). In Appendix A we give the technical details on how they are covered by a class of submodular continuous functions over conic lattices.

Consider the logistic regression model with a non-convex separable regularizer. This flexibility may result in better statistical performance (e.g., in recovering discontinuities, [2]) compared to classical models with convex regularizers. Let $z^1, \dots, z^m \in \mathbb{R}^n$ be $m$ training points with corresponding binary labels $y \in \{\pm 1\}^m$.
Assume that the following mild condition is satisfied: for any fixed dimension $i$, all data points have the same sign in that coordinate, i.e., $\mathrm{sign}(z^j_i)$ is the same for all $j \in [m]$ (which can easily be achieved by rescaling if necessary). The task is to solve the following non-convex optimization problem,

$$\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{m}\sum_{j=1}^m f_j(x) + \lambda r(x), \qquad (4)$$

where $f_j(x) = \log\big(1 + \exp(-y_j x^\top z^j)\big)$ is the logistic loss, $\lambda > 0$ is the regularization parameter, and $r(x)$ is some non-convex separable regularizer. Such separable regularizers are popular in statistics; two notable choices are $r(x) = \sum_{i=1}^n \frac{\gamma x_i^2}{1 + \gamma x_i^2}$ and $r(x) = \sum_{i=1}^n \min\{\gamma x_i^2, 1\}$ (see [2]). Let us define a vector $\alpha \in \{\pm 1\}^n$ by $\alpha_i = \mathrm{sign}(z^j_i)$, $i \in [n]$, and let $l(x) := \frac{1}{m}\sum_{j=1}^m f_j(x)$. One can show that $l(x)$ is neither DR-submodular nor DR-supermodular. Yet, in Appendix A we will show that $l(x)$ is $\mathcal{K}_\alpha$-DR-supermodular, where the latter notion generalizes DR-supermodularity. Usually, one can assume that the optimal solution $x^*$ lies in some box $[\underline{u}, \bar u]$. Then the problem is an instance of constrained non-monotone $\mathcal{K}_\alpha$-DR-submodular maximization.

## 3 Underlying Properties of Constrained DR-submodular Maximization

In this section we present several properties arising in DR-submodular function maximization. First we show properties related to concavity of the objective along certain directions; then we establish the relation between locally stationary points and the global optimum (hence called the local-global relation). These properties will be used to derive guarantees for the algorithms in Section 4. All omitted proofs are in Appendix D.

### 3.1 Properties Along Non-negative/Non-positive Directions

A DR-submodular function $f$ is concave along any non-negative or non-positive direction [4]. Notice that DR-submodularity is a stronger condition than concavity along directions $v \in \mathbb{R}^n_+$: for instance, a concave function is concave along any direction, but it may not be DR-submodular. For a DR-submodular function with $L$-Lipschitz gradients, one can obtain the following quadratic lower bound using standard techniques, by combining the concavity along non-negative directions with the Lipschitz gradient condition in (2).

Quadratic lower bound. If $f$ is DR-submodular with an $L$-Lipschitz gradient, then for all $x \in \mathcal{X}$ and $v \in \mathbb{R}^n_+$, it holds that

$$f(x + v) \ge f(x) + \langle \nabla f(x), v\rangle - \frac{L}{2}\|v\|^2. \qquad (5)$$

It will be used in Section 4.2 for analyzing the non-monotone FRANK-WOLFE variant (Algorithm 2).

Strong DR-submodularity. DR-submodular objectives may be strongly concave along directions $v \in \mathbb{R}^n_+$, e.g., DR-submodular quadratic functions. We will show that such additional structure may be exploited to obtain stronger guarantees for the local-global relation.

Definition 3 (Strongly DR-submodular). A function $f$ is $\mu$-strongly DR-submodular ($\mu \ge 0$) if for all $x \in \mathcal{X}$ and $v \in \mathbb{R}^n_+$, it holds that

$$f(x + v) \le f(x) + \langle \nabla f(x), v\rangle - \frac{\mu}{2}\|v\|^2. \qquad (6)$$

### 3.2 Relation Between Approximately Stationary Points and Global Optimum

First of all, we present the following lemma, which motivates us to consider a non-stationarity measure for general constrained optimization problems.

Lemma 1. If $f$ is $\mu$-strongly DR-submodular, then for any two points $x, y$ in $\mathcal{X}$, it holds that

$$(y - x)^\top \nabla f(x) \ge f(x \vee y) + f(x \wedge y) - 2f(x) + \frac{\mu}{2}\|x - y\|^2. \qquad (7)$$

Lemma 1 implies that if $x$ is stationary (i.e., $\nabla f(x) = 0$), then $2f(x) \ge f(x \vee y) + f(x \wedge y) + \frac{\mu}{2}\|x - y\|^2$, which gives an implicit relation between $x$ and $y$. While in practice finding an exact stationary point is not easy, non-convex solvers usually arrive at an approximately stationary point, thus requiring a proper measure of non-stationarity for the constrained optimization problem.
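Before introducing that measure, here is a small numerical illustration of Lemma 1 (a minimal NumPy sketch of our own, not from the paper): for a toy DR-submodular quadratic, inequality (7) with $\mu = 0$ is checked at randomly sampled pairs of points. The instance below is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Toy DR-submodular quadratic (Hessian H has non-positive entries).
H = -rng.uniform(0.0, 1.0, size=(n, n))
H = 0.5 * (H + H.T)
h = rng.uniform(0.0, 1.0, n)
f = lambda x: 0.5 * x @ H @ x + h @ x
grad = lambda x: H @ x + h

# Lemma 1 with mu = 0:
#   (y - x)^T grad f(x) >= f(x v y) + f(x ^ y) - 2 f(x)
for _ in range(1000):
    x = rng.uniform(0.0, 1.0, n)
    y = rng.uniform(0.0, 1.0, n)
    lhs = (y - x) @ grad(x)
    rhs = f(np.maximum(x, y)) + f(np.minimum(x, y)) - 2.0 * f(x)
    assert lhs >= rhs - 1e-9
print("Lemma 1 (with mu = 0) holds on all sampled pairs")
```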
Non-stationarity measure. Looking at the LHS of (7), it is natural to use $\max_{y \in \mathcal{P}} (y - x)^\top \nabla f(x)$ as the non-stationarity measure; this happens to coincide with the measure proposed in the recent work of [27], and it can be calculated for free in Frank-Wolfe-style algorithms (e.g., Algorithm 3). In order to adapt it to the local-global relation, we give a slightly more general definition here: for any constraint set $\mathcal{Q} \subseteq \mathcal{X}$, the non-stationarity of a point $x \in \mathcal{Q}$ is

$$g_{\mathcal{Q}}(x) := \max_{v \in \mathcal{Q}} \langle v - x, \nabla f(x)\rangle \qquad \text{(non-stationarity)}. \qquad (8)$$

It always holds that $g_{\mathcal{Q}}(x) \ge 0$, and $x$ is a stationary point in $\mathcal{Q}$ iff $g_{\mathcal{Q}}(x) = 0$, so (8) is a natural generalization of the non-stationarity measure $\|\nabla f(x)\|$ for unconstrained optimization. As the next statement shows, $g_{\mathcal{Q}}(x)$ plays an important role in characterizing the local-global relation.

Proposition 1 (Local-Global Relation). Let $x$ be a point in $\mathcal{P}$ with non-stationarity $g_{\mathcal{P}}(x)$, and let $\mathcal{Q} := \{y \in \mathcal{P} \mid y \le \bar u - x\}$. Let $z$ be a point in $\mathcal{Q}$ with non-stationarity $g_{\mathcal{Q}}(z)$. It holds that

$$\max\{f(x), f(z)\} \ge \frac{1}{4}\big[f(x^*) - g_{\mathcal{P}}(x) - g_{\mathcal{Q}}(z)\big] + \frac{\mu}{8}\big(\|x - x^*\|^2 + \|z - z^*\|^2\big), \qquad (9)$$

where $z^* := x \vee x^* - x$.

Proof sketch of Proposition 1: The proof uses Lemma 1, the non-stationarity in (8), and a key observation in the following claim. The detailed proof is in Appendix D.2.

Claim 1. It holds that $f(x \vee x^*) + f(x \wedge x^*) + f(z \vee z^*) + f(z \wedge z^*) \ge f(x^*)$.

Note that [7; 20] propose a similar relation for the special cases of multilinear/softmax extensions, mainly by proving the same conclusion as in Claim 1. Their relation does not incorporate the properties of non-stationarity or strong DR-submodularity. They both use the proof idea of constructing a complicated auxiliary set function tailored to specific DR-submodular functions. We present a different proof method by directly utilizing the DR property on carefully constructed auxiliary points (e.g., $(x + z) \vee x^*$ in the proof of Claim 1).

## 4 Algorithms for Constrained DR-submodular Maximization

Based on these properties, we present two algorithms for solving (P). The first is based on the local-global relation, and the second is a FRANK-WOLFE variant adapted to the non-monotone setting. All omitted proofs are deferred to Appendix E.

### 4.1 An Algorithm Based on the Local-Global Relation

Algorithm 1: TWO-PHASE FRANK-WOLFE for non-monotone DR-submodular maximization
Input: $\max_{x \in \mathcal{P}} f(x)$, stopping tolerances $\epsilon_1, \epsilon_2$, #iterations $K_1, K_2$
1. $x \leftarrow$ NON-CONVEX FRANK-WOLFE$(f, \mathcal{P}, K_1, \epsilon_1, x^{(0)})$  // $x^{(0)} \in \mathcal{P}$
2. $\mathcal{Q} \leftarrow \mathcal{P} \cap \{y \in \mathbb{R}^n_+ \mid y \le \bar u - x\}$
3. $z \leftarrow$ NON-CONVEX FRANK-WOLFE$(f, \mathcal{Q}, K_2, \epsilon_2, z^{(0)})$  // $z^{(0)} \in \mathcal{Q}$
Output: $\arg\max\{f(x), f(z)\}$

We summarize the TWO-PHASE algorithm in Algorithm 1. It generalizes the two-phase method in [7; 20]. It invokes some non-convex solver (we use the NON-CONVEX FRANK-WOLFE algorithm of [27]; pseudocode is included as Algorithm 3 in Appendix C) to find approximately stationary points in $\mathcal{P}$ and $\mathcal{Q}$, respectively, and then returns the solution with the larger function value. Though we use NON-CONVEX FRANK-WOLFE as the subroutine here, it is worth noting that any algorithm that is guaranteed to find an approximately stationary point can be plugged into Algorithm 1 as the subroutine. We give an improved approximation bound by considering more properties of DR-submodular functions.
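For concreteness, the following is a minimal sketch of the two-phase structure of Algorithm 1 (our own NumPy illustration, not the authors' MATLAB implementation). It uses the simple down-closed box $\mathcal{P} = [0, u]$, for which the linear maximization oracle `lmo_box` has a closed form (a general down-closed polytope would require an LP); a plain Frank-Wolfe loop with step size $2/(k+2)$ stands in for the NON-CONVEX FRANK-WOLFE subroutine of [27], and the toy non-monotone quadratic is an assumption for illustration.

```python
import numpy as np

def lmo_box(g, upper):
    # Linear maximization oracle over the down-closed box [0, upper]:
    # argmax_{0 <= v <= upper} <v, g>. (A general polytope would need an LP.)
    return np.where(g > 0, upper, 0.0)

def nonconvex_frank_wolfe(grad, upper, x0, iters=100, tol=1e-6):
    # Plain Frank-Wolfe loop returning an approximately stationary point;
    # it stands in for the NON-CONVEX FRANK-WOLFE subroutine of [27].
    x = x0.copy()
    for k in range(iters):
        g = grad(x)
        v = lmo_box(g, upper)
        gap = (v - x) @ g                  # non-stationarity measure (8)
        if gap <= tol:
            break
        x = x + (2.0 / (k + 2.0)) * (v - x)
    return x

def two_phase_fw(f, grad, u, iters=100):
    # Phase 1: approximately stationary point x in P = [0, u].
    x = nonconvex_frank_wolfe(grad, u, np.zeros_like(u), iters)
    # Phase 2: restrict to Q = {y in P : y <= u - x} and find z.
    z = nonconvex_frank_wolfe(grad, u - x, np.zeros_like(u), iters)
    # Return the better of the two candidates.
    return x if f(x) >= f(z) else z

# Toy non-monotone DR-submodular quadratic on the box [0, u] (illustrative only).
rng = np.random.default_rng(2)
n = 8
H = -rng.uniform(0.0, 1.0, (n, n)); H = 0.5 * (H + H.T)
u = np.ones(n)
h = -0.2 * H.T @ u
f = lambda x: 0.5 * x @ H @ x + h @ x
grad = lambda x: H @ x + h
x_hat = two_phase_fw(f, grad, u)
print(x_hat, f(x_hat))
```

Any solver that returns an approximately stationary point could replace `nonconvex_frank_wolfe` here, mirroring the modularity discussed above.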
Borrowing the results from [27] for the NON-CONVEX FRANK-WOLFE subroutine, we obtain the following guarantee.

Theorem 1. The output of Algorithm 1 satisfies

$$\max\{f(x), f(z)\} \ge \frac{1}{4}\Big[f(x^*) - \min\Big\{\tfrac{\max\{2h_1,\, C_f(\mathcal{P})\}}{\sqrt{K_1 + 1}},\, \epsilon_1\Big\} - \min\Big\{\tfrac{\max\{2h_2,\, C_f(\mathcal{Q})\}}{\sqrt{K_2 + 1}},\, \epsilon_2\Big\}\Big] + \frac{\mu}{8}\big(\|x - x^*\|^2 + \|z - z^*\|^2\big), \qquad (10)$$

where $h_1 := \max_{x \in \mathcal{P}} f(x) - f(x^{(0)})$ and $h_2 := \max_{z \in \mathcal{Q}} f(z) - f(z^{(0)})$ are the initial suboptimalities, $C_f(\mathcal{P}) := \sup_{x, v \in \mathcal{P},\, \gamma \in [0,1],\, y = x + \gamma(v - x)} \frac{2}{\gamma^2}\big(f(y) - f(x) - (y - x)^\top \nabla f(x)\big)$ is the curvature of $f$ w.r.t. $\mathcal{P}$ ($C_f(\mathcal{Q})$ is defined analogously), and $z^* = x \vee x^* - x$.

Theorem 1 indicates that Algorithm 1 has a 1/4 approximation guarantee and a $1/\sqrt{k}$ rate. Despite this modest worst-case guarantee, it has good empirical performance, as demonstrated by the experiments in Section 5. Informally, this can be partially explained by the term $\frac{\mu}{8}(\|x - x^*\|^2 + \|z - z^*\|^2)$ in (10): if $x$ is far from $x^*$, this term augments the bound; if $x$ is close to $x^*$, then by the smoothness of $f$ it should be close to optimal.

### 4.2 The Non-monotone FRANK-WOLFE Variant

Algorithm 2: Non-monotone FRANK-WOLFE variant for DR-submodular maximization
Input: $\max_{x \in \mathcal{P}} f(x)$, prespecified step size $\gamma \in (0, 1]$
1. $x^{(0)} \leftarrow 0$, $t^{(0)} \leftarrow 0$, $k \leftarrow 0$  // $k$: iteration index, $t^{(k)}$: cumulative step size
2. while $t^{(k)} < 1$ do
3.   $v^{(k)} \leftarrow \arg\max_{v \in \mathcal{P},\, v \le \bar u - x^{(k)}} \langle v, \nabla f(x^{(k)})\rangle$  // shrunken LMO
4.   use uniform step size $\gamma_k = \gamma$; set $\gamma_k \leftarrow \min\{\gamma_k, 1 - t^{(k)}\}$
5.   $x^{(k+1)} \leftarrow x^{(k)} + \gamma_k v^{(k)}$, $t^{(k+1)} \leftarrow t^{(k)} + \gamma_k$, $k \leftarrow k + 1$
Output: $x^{(K)}$  // assuming there are $K$ iterations in total

Algorithm 2 summarizes the non-monotone FRANK-WOLFE variant, which is inspired by the unified continuous greedy algorithm of [13] for maximizing the multilinear extension of a submodular set function. It initializes the solution $x^{(0)}$ to $0$ and maintains $t^{(k)}$ as the cumulative step size. At iteration $k$, it maximizes the linearization of $f$ over a shrunken constraint set $\{v \mid v \in \mathcal{P}, v \le \bar u - x^{(k)}\}$, which differs from the classical LMO of Frank-Wolfe-style algorithms (hence we refer to it as the shrunken LMO). It then takes a step in the direction $v^{(k)}$ chosen by the LMO with a uniform step size $\gamma_k = \gamma$. The cumulative step size $t^{(k)}$ is used to ensure that the overall step sizes sum to one; thus the output $x^{(K)}$ is a convex combination of the LMO outputs and hence also lies in $\mathcal{P}$.

The shrunken LMO (Step 3) is the key difference compared to the monotone FRANK-WOLFE variant in [4]. The extra constraint $v \le \bar u - x^{(k)}$ is added to prevent overly aggressive growth of the solution, since in the non-monotone setting such aggressive growth may hurt the overall performance. The next theorem states the guarantees of Algorithm 2.

Theorem 2. Consider Algorithm 2 with uniform step size $\gamma$. For $k = 1, \dots, K$ it holds that

$$f(x^{(k)}) \ge t^{(k)} e^{-t^{(k)}} f(x^*) - \frac{LD^2}{2} k\gamma^2 - O(\gamma^2) f(x^*). \qquad (11)$$

By observing that $t^{(K)} = 1$ and applying Theorem 2, we get the following corollary.

Corollary 1. The output of Algorithm 2 satisfies $f(x^{(K)}) \ge e^{-1} f(x^*) - \frac{LD^2}{2K} - O(1/K^2) f(x^*)$.

Corollary 1 shows that Algorithm 2 enjoys a sublinear convergence rate towards some point $x^{(K)}$ inside $\mathcal{P}$, with a 1/e approximation guarantee.

Proof sketch of Theorem 2: The proof is by induction. To prepare the building blocks, we first show that the growth of $x^{(k)}$ is indeed bounded.

Lemma 2. Assume $x^{(0)} = 0$. For $k = 0, \dots, K - 1$, it holds that $x^{(k)}_i \le \bar u_i \big[1 - (1 - \gamma)^{t^{(k)}/\gamma}\big]$, $\forall i \in [n]$.

The following lemma then provides a lower bound that involves the global optimum.

Lemma 3 (Generalized from Lemma 7 in [8]). Given $\theta \in (0, \bar u]$, let $\lambda' := \min_{i \in [n]} \bar u_i / \theta_i$. Then for all $x \in [0, \theta]$, it holds that $f(x \vee x^*) \ge (1 - \tfrac{1}{\lambda'}) f(x^*)$.

The key ingredient for the induction is the relation between $f(x^{(k+1)})$ and $f(x^{(k)})$ given by the following claim.

Claim 2. For $k = 0, \dots, K - 1$ it holds that $f(x^{(k+1)}) \ge (1 - \gamma) f(x^{(k)}) + \gamma (1 - \gamma)^{t^{(k)}/\gamma} f(x^*) - \frac{LD^2}{2}\gamma^2$, which is derived by combining the quadratic lower bound in (5), Lemma 2, and Lemma 3.
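To make Algorithm 2 concrete, here is a minimal sketch of the non-monotone FRANK-WOLFE variant (our own NumPy illustration, not the authors' MATLAB implementation), again specialized to the down-closed box $\mathcal{P} = [0, u]$ so that the shrunken LMO has a closed form; for a general polytope the LMO would be a linear program. The toy instance is an assumption for illustration only.

```python
import numpy as np

def nonmonotone_fw(grad, u, gamma=0.05):
    # Sketch of the non-monotone Frank-Wolfe variant on the down-closed box P = [0, u].
    # For a general polytope the shrunken LMO below would be a linear program.
    x = np.zeros_like(u)
    t = 0.0                                  # cumulative step size
    while t < 1.0 - 1e-12:
        g = grad(x)
        cap = u - x                          # shrunken feasible set: 0 <= v <= u - x
        v = np.where(g > 0, cap, 0.0)        # shrunken LMO: argmax_{0<=v<=cap} <v, g>
        step = min(gamma, 1.0 - t)           # step sizes sum to one
        x = x + step * v
        t += step
    return x

# Same kind of toy non-monotone DR-submodular quadratic as before (illustrative only).
rng = np.random.default_rng(3)
n = 8
H = -rng.uniform(0.0, 1.0, (n, n)); H = 0.5 * (H + H.T)
u = np.ones(n)
h = -0.2 * H.T @ u
f = lambda x: 0.5 * x @ H @ x + h @ x
grad = lambda x: H @ x + h
x_out = nonmonotone_fw(grad, u, gamma=0.05)
print(x_out, f(x_out))
```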
Remarks on the two algorithms. Notice that though the TWO-PHASE algorithm has a worse guarantee than the non-monotone FRANK-WOLFE variant, it is still of interest: i) it allows flexibility in using a wide range of existing solvers for finding an (approximately) stationary point; ii) the guarantees we present rely on a worst-case analysis, and the empirical performance of the TWO-PHASE algorithm is often comparable to or better than that of the FRANK-WOLFE variant. This suggests exploring further properties of concrete problem classes that may favor the TWO-PHASE algorithm, which we leave for future work.

## 5 Experimental Results

We test the performance of the analyzed algorithms against the following baselines: 1) QUADPROGIP [39], a global solver for non-convex quadratic programming; 2) projected gradient ascent (PROJGRAD) with diminishing step sizes ($\frac{1}{k+1}$, with $k$ starting from 0). We run all algorithms for 100 iterations. For the subroutine (Algorithm 3) of TWO-PHASE FRANK-WOLFE, we set $\epsilon_1 = \epsilon_2 = 10^{-6}$ and $K_1 = K_2 = 100$. All synthetic results are averages over 20 repeated experiments. All experiments were implemented in MATLAB. Source code can be found at: https://github.com/bianan/non-monotone-dr-submodular.

### 5.1 DR-submodular Quadratic Programming

As a state-of-the-art global solver, QUADPROGIP² [39] can find the global optimum (possibly in exponential time), which was used to calculate the approximation ratios. Our problem instances are synthetic DR-submodular quadratic objectives with down-closed polytope constraints, i.e., $f(x) = \frac{1}{2}x^\top H x + h^\top x + c$ and $\mathcal{P} = \{x \in \mathbb{R}^n_+ \mid Ax \le b,\, x \le \bar u\}$ with $A \in \mathbb{R}^{m \times n}_{++}$ and $b \in \mathbb{R}^m_+$. Both objective and constraints were randomly generated, in the following two manners.

² We used the open-source code provided by [39], with the IBM CPLEX optimization studio (https://www.ibm.com/jm-en/marketplace/ibm-ilog-cplex) as the subroutine.

1) Uniform distribution. $H \in \mathbb{R}^{n \times n}$ is a symmetric matrix with uniformly distributed entries in $[-1, 0]$; $A \in \mathbb{R}^{m \times n}$ has uniformly distributed entries in $[\nu, \nu + 1]$, where $\nu = 0.01$ is a small positive constant used to make the entries of $A$ strictly positive.

[Figure 1: Results on DR-submodular quadratic instances with uniform distribution. Panels (a) $m = \lfloor 0.5n \rfloor$, (b) $m = n$, (c) $m = \lfloor 1.5n \rfloor$ plot the approximation ratio against the dimensionality $n$.]

2) Exponential distribution. The entries of $H$ and $A$ were sampled from exponential distributions $\mathrm{Exp}(\lambda)$ (for a random variable $y \ge 0$, the probability density is $\lambda e^{-\lambda y}$; for $y < 0$, the density is 0). Specifically, each entry of $H$ was sampled from $\mathrm{Exp}(1)$ and negated (so that the entries of $H$ are non-positive), and $H$ was then made symmetric. Each entry of $A$ was sampled from $\mathrm{Exp}(0.25) + \nu$, where $\nu = 0.01$ is a small positive constant.

In both of the above cases, we set $b = \mathbf{1}_m$, and $\bar u$ to be the tightest upper bound of $\mathcal{P}$, given by $\bar u_j = \min_{i \in [m]} \frac{b_i}{A_{ij}}$, $\forall j \in [n]$. In order to make $f$ non-monotone, we set $h = -0.2 * H^\top \bar u$. To make sure that $f$ is non-negative, we first solve $\min_{x \in \mathcal{P}} \frac{1}{2}x^\top H x + h^\top x$ using QUADPROGIP; denoting the solution by $\hat x$, we then set $c = -f(\hat x) + 0.1\,|f(\hat x)|$ (where $f$ is evaluated without the offset $c$).

The approximation ratios w.r.t. the dimensionality $n$ are plotted in Figures 1 and 2 for the two manners of data generation. We set the number of constraints to $m = \lfloor 0.5n \rfloor$, $m = n$ and $m = \lfloor 1.5n \rfloor$ in Figures 1a to 1c (and Figures 2a to 2c), respectively.
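Before turning to the results, here is a minimal NumPy sketch (our own illustration, not the released MATLAB code) of the uniform-distribution instance generation just described; the offset $c$, which the paper computes with the global solver QUADPROGIP, is omitted here since it only shifts the objective.

```python
import numpy as np

def make_quadratic_instance(n, m, nu=0.01, seed=0):
    # Uniform-distribution variant of the synthetic instances from Section 5.1:
    # f(x) = 0.5 x^T H x + h^T x over P = {x >= 0 : A x <= b, x <= u}.
    rng = np.random.default_rng(seed)
    H = rng.uniform(-1.0, 0.0, (n, n))
    H = 0.5 * (H + H.T)                      # symmetric, non-positive entries
    A = rng.uniform(nu, nu + 1.0, (m, n))    # strictly positive constraint matrix
    b = np.ones(m)
    u = (b[:, None] / A).min(axis=0)         # tightest box: u_j = min_i b_i / A_ij
    h = -0.2 * H.T @ u                       # makes the objective non-monotone
    f = lambda x: 0.5 * x @ H @ x + h @ x
    return f, A, b, u

f, A, b, u = make_quadratic_instance(n=12, m=6)
x0 = u / len(u)                              # a feasible point: A @ x0 <= b by construction
print(f(x0), bool(np.all(A @ x0 <= b + 1e-12)))
```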
[Figure 2: Results on quadratic instances with exponential distribution. Panels (a) $m = \lfloor 0.5n \rfloor$, (b) $m = n$, (c) $m = \lfloor 1.5n \rfloor$ plot the approximation ratio against the dimensionality $n$.]

One can see that TWO-PHASE FRANK-WOLFE usually performs best, PROJGRAD follows, and the non-monotone FRANK-WOLFE variant comes last. The good performance of TWO-PHASE FRANK-WOLFE can be partially explained by the strong DR-submodularity of quadratic functions, in light of Theorem 1. The performance of the two analyzed algorithms is consistent with the theoretical bounds: the approximation ratios of the FRANK-WOLFE variant are always well above 1/e.

### 5.2 Maximizing Softmax Extensions

With some derivation, one can see that the derivative of the softmax extension in (3) is

$$\nabla_i f(x) = \mathrm{tr}\big((\mathrm{diag}(x)(L - I) + I)^{-1}(L - I)_i\big), \quad \forall i \in [n],$$

where $(L - I)_i$ denotes the matrix obtained by zeroing all entries of $L - I$ except its $i$-th row. Letting $C := (\mathrm{diag}(x)(L - I) + I)^{-1}$ and $D := L - I$, one can see that $\nabla_i f(x)$ is the inner product of the $i$-th row of $D$ and the $i$-th column of $C$, which gives an efficient way to calculate the gradient $\nabla f(x)$.

Results on synthetic data. We generate the softmax objectives (see (3)) in the following way: first generate the $n$ eigenvalues $d \in \mathbb{R}^n_+$, each uniformly distributed in $[0, 1.5]$, and set $D = \mathrm{diag}(d)$. After generating a random unitary matrix $U$, we set $L = U D U^\top$. One can verify that $L$ is positive semidefinite with the entries of $d$ as its eigenvalues. We generate the down-closed polytope constraints in the same form and in the same way as for the DR-submodular quadratic functions, except that we set $b = 2 \cdot \mathbf{1}_m$.

[Figure 3: Results on softmax instances with polytope constraints generated from the uniform distribution. Panels (a) $m = \lfloor 0.5n \rfloor$, (b) $m = n$, (c) $m = \lfloor 1.5n \rfloor$ plot the function value against the dimensionality $n$.]

Function values returned by the different solvers w.r.t. $n$ are shown in Figure 3, for which the random polytope constraints were generated with the uniform distribution (results for constraints generated with the exponential distribution are deferred to Appendix G). The number of constraints was set to $m = \lfloor 0.5n \rfloor$, $m = n$ and $m = \lfloor 1.5n \rfloor$ in Figures 3a to 3c, respectively. One can observe that TWO-PHASE FRANK-WOLFE still has the best performance, the non-monotone FRANK-WOLFE variant follows, and PROJGRAD performs worst.

Real-world results on matched summarization. The task of matched summarization is to select a set of document pairs out of a corpus of documents, such that the two documents within a pair are similar and the overall set of pairs is as diverse as possible. The motivation for this task is very practical: it could be used, for example, to compare the opinions of various politicians on a range of representative topics. In our experiments, we used a setting similar to the one in [20]. We experimented on the 2012 US Republican debates data, which involves 8 candidates: Bachman, Cain, Gingrich, Huntsman, Paul, Perry, Romney and Santorum. Each task involves one pair of candidates, so in total there are $28 = 8 \cdot 7/2$ tasks. Figure 4a plots the function values returned by the three solvers, averaged over the 28 tasks, w.r.t. different values of a hyperparameter reflecting the matching quality (see [20] for details).

[Figure 4: Results on 2012 US Republican debates data. (a) Average function value over 28 tasks vs. the match quality controller; (b) objectives w.r.t. iterations.]

Figure 4b traces the objectives w.r.t. iterations for a specific candidate pair (Bachman, Romney).
For TWO-PHASE FRANK-WOLFE, the objectives of the selected phase are plotted. One can see that TWO-PHASE FRANK-WOLFE again achieves the best performance, while the performance of the non-monotone FRANK-WOLFE variant and PROJGRAD is comparable.

## 6 Related Work

Submodular optimization and, more broadly, non-convex optimization are extensively studied in the literature, which makes it very difficult to comprehensively survey all previous work. Here we only briefly summarize some of the most related papers.

Submodular optimization over integer lattices and continuous domains. Many results from submodular set function optimization have been generalized to the integer-lattice case [34; 33; 12; 24]. Of particular interest is the reduction [12] from an integer-lattice DR-submodular maximization problem to a submodular set function maximization problem. Submodular optimization over continuous domains has attracted considerable attention recently [3; 4; 36]. Two classes of functions that are covered by continuous submodularity are the Lovász extensions [28] and multilinear extensions [6] of submodular set functions. In particular, multilinear extensions of submodular set functions are also continuous DR-submodular [3], but with the special property that they are coordinate-wise linear. Combined with the rounding technique of contention resolution [7], maximizing multilinear extensions [38; 19; 13; 8; 11] has become the state-of-the-art method for submodular set function maximization. Some of the techniques for maximizing multilinear extensions [13; 7; 8] have inspired this work. However, we are the first to explore the rich properties and devise algorithms for the general constrained DR-submodular maximization problem over continuous domains.

Non-convex optimization. Non-convex optimization has received a surge of attention in the past years. One active research topic is reaching a stationary point for unconstrained optimization [35; 32; 1] or constrained optimization [18; 27]. However, without further assumptions, a stationary point need not yield any global approximation guarantee. The local-global relation (Proposition 1) provides a strong link between (approximately) stationary points and the global optimum, making it possible to flexibly incorporate progress in this area.

## 7 Conclusion

We have studied the problem of constrained non-monotone DR-submodular continuous maximization. We explored the structural properties of such problems and established a local-global relation. Based on these properties, we presented a TWO-PHASE algorithm with a 1/4 approximation guarantee and a non-monotone FRANK-WOLFE variant with a 1/e approximation guarantee. We further generalized continuous submodular functions to conic lattices, which enabled us to model a larger class of applications. Lastly, our theoretical findings were verified by synthetic and real-world experiments.

Acknowledgement. This research was partially supported by ERC StG 307036, by the Max Planck ETH Center for Learning Systems, and by the ETH Zürich Postdoctoral Fellowship program.

## References

[1] Allen-Zhu, Zeyuan and Hazan, Elad. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning (ICML), pp. 699–707, 2016.

[2] Antoniadis, Anestis, Gijbels, Irène, and Nikolova, Mila. Penalized likelihood regression for generalized linear models with non-quadratic penalties. Annals of the Institute of Statistical Mathematics, 63(3):585–615, 2011.
[3] Bach, Francis. Submodular functions: from discrete to continuous domains. arXiv preprint arXiv:1511.00394, 2015.

[4] Bian, Andrew An, Mirzasoleiman, Baharan, Buhmann, Joachim M., and Krause, Andreas. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 111–120, 2017.

[5] Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press.

[6] Calinescu, Gruia, Chekuri, Chandra, Pál, Martin, and Vondrák, Jan. Maximizing a submodular set function subject to a matroid constraint. In Integer Programming and Combinatorial Optimization, pp. 182–196. Springer, 2007.

[7] Chekuri, Chandra, Vondrák, Jan, and Zenklusen, Rico. Submodular function maximization via the multilinear relaxation and contention resolution schemes. SIAM Journal on Computing, 43(6):1831–1879, 2014.

[8] Chekuri, Chandra, Jayram, T. S., and Vondrák, Jan. On multiplicative weight updates for concave and submodular function maximization. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pp. 201–210. ACM, 2015.

[9] Djolonga, Josip and Krause, Andreas. From MAP to marginals: Variational inference in Bayesian submodular models. In Neural Information Processing Systems (NIPS), pp. 244–252, 2014.

[10] Eghbali, Reza and Fazel, Maryam. Designing smoothing functions for improved worst-case competitive ratio in online optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 3279–3287, 2016.

[11] Ene, Alina and Nguyen, Huy L. Constrained submodular maximization: Beyond 1/e. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 248–257.

[12] Ene, Alina and Nguyen, Huy L. A reduction for optimizing lattice submodular functions with diminishing returns. arXiv preprint arXiv:1606.08362, 2016.

[13] Feldman, Moran, Naor, Joseph, and Schwartz, Roy. A unified continuous greedy algorithm for submodular maximization. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pp. 570–579. IEEE, 2011.

[14] Friedland, S. and Gaubert, S. Submodular spectral functions of principal submatrices of a Hermitian matrix, extensions and applications. Linear Algebra and its Applications, 438(10):3872–3884, 2013.

[15] Fuchssteiner, Benno and Lusky, Wolfgang. Convex cones, volume 56. Elsevier, 2011.

[16] Fujishige, Satoru. Submodular functions and optimization, volume 58. Elsevier, 2005.

[17] Garg, Vijay K. Introduction to lattice theory with computer science applications. John Wiley & Sons, 2015.

[18] Ghadimi, Saeed, Lan, Guanghui, and Zhang, Hongchao. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.

[19] Gharan, Shayan Oveis and Vondrák, Jan. Submodular maximization by simulated annealing. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1098–1116. Society for Industrial and Applied Mathematics, 2011.

[20] Gillenwater, Jennifer, Kulesza, Alex, and Taskar, Ben. Near-optimal MAP inference for determinantal point processes. In Advances in Neural Information Processing Systems (NIPS), pp. 2735–2743, 2012.

[21] Hassani, Hamed, Soltanolkotabi, Mahdi, and Karbasi, Amin. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems (NIPS), pp. 5837–5847, 2017.

[22] Ito, Shinji and Fujimaki, Ryohei. Large-scale price optimization via network flow. In Advances in Neural Information Processing Systems (NIPS), pp. 3855–3863, 2016.
[23] Iwata, Satoru, Fleischer, Lisa, and Fujishige, Satoru. A combinatorial strongly polynomial algorithm for minimizing submodular functions. Journal of the ACM, 48(4):761–777, 2001.

[24] Khodabakhsh, Ali and Nikolova, Evdokia. Maximizing non-monotone DR-submodular functions with cardinality constraints. arXiv preprint arXiv:1611.09474, 2016.

[25] Krause, Andreas and Golovin, Daniel. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 3:19, 2012.

[26] Kulesza, Alex, Taskar, Ben, et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.

[27] Lacoste-Julien, Simon. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.

[28] Lovász, László. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pp. 235–257. Springer, 1983.

[29] Montanari, Andrea and Richard, Emile. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Transactions on Information Theory, 62(3):1458–1484, 2016.

[30] Motzkin, Theodore S. and Straus, Ernst G. Maxima for graphs and a new proof of a theorem of Turán. Canad. J. Math., 17(4):533–540, 1965.

[31] Nemhauser, George L., Wolsey, Laurence A., and Fisher, Marshall L. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265–294, 1978.

[32] Reddi, Sashank J., Sra, Suvrit, Poczos, Barnabas, and Smola, Alexander J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 1145–1153, 2016.

[33] Soma, Tasuku and Yoshida, Yuichi. A generalization of submodular cover via the diminishing return property on the integer lattice. In Advances in Neural Information Processing Systems (NIPS), pp. 847–855, 2015.

[34] Soma, Tasuku, Kakimura, Naonori, Inaba, Kazuhiro, and Kawarabayashi, Ken-ichi. Optimal budget allocation: Theoretical guarantee and efficient algorithm. In International Conference on Machine Learning (ICML), pp. 351–359, 2014.

[35] Sra, Suvrit. Scalable nonconvex inexact proximal splitting. In Advances in Neural Information Processing Systems (NIPS), pp. 530–538, 2012.

[36] Staib, Matthew and Jegelka, Stefanie. Robust budget allocation via continuous submodular functions. In International Conference on Machine Learning (ICML), pp. 3230–3240, 2017.

[37] Topkis, Donald M. Minimizing a submodular function on a lattice. Operations Research, 26(2):305–321, 1978.

[38] Vondrák, Jan. Optimal approximation for the submodular welfare problem in the value oracle model. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 67–74, 2008.

[39] Xia, Wei, Vera, Juan, and Zuluaga, Luis F. Globally solving non-convex quadratic programs via linear integer programming techniques. arXiv preprint arXiv:1511.02423, 2015.

[40] Zass, Ron and Shashua, Amnon. Nonnegative sparse PCA. Advances in Neural Information Processing Systems (NIPS), pp. 1561–1568, 2007.