# Differentiable Learning of Submodular Models

Josip Djolonga, Department of Computer Science, ETH Zurich, josipd@inf.ethz.ch
Andreas Krause, Department of Computer Science, ETH Zurich, krausea@ethz.ch

**Abstract.** Can we incorporate discrete optimization algorithms within modern machine learning models? For example, is it possible to incorporate in deep architectures a layer whose output is the minimal cut of a parametrized graph? Given that these models are trained end-to-end by leveraging gradient information, the introduction of such layers seems very challenging due to their non-continuous output. In this paper we focus on the problem of submodular minimization, for which we show that such layers are indeed possible. The key idea is that we can continuously relax the output without sacrificing guarantees. We provide an easily computable approximation to the Jacobian, complemented with a complete theoretical analysis. Finally, these contributions let us experimentally learn probabilistic log-supermodular models via a bi-level variational inference formulation.

## 1 Introduction

Discrete optimization problems are ubiquitous in machine learning. While the majority of them are provably hard, a commonly exploitable trait that renders some of them tractable is that of submodularity [1, 2]. In addition to capturing many useful phenomena, submodular functions can be minimized in polynomial time and also enjoy a powerful connection to convex optimization [3]. Both of these properties have been used to great effect in computer vision and machine learning, e.g., to compute the MAP configuration in undirected graphical models with long-reaching interactions [4] and higher-order factors [5], for clustering [6], to perform variational inference in log-supermodular models [7, 8], or to design norms useful for structured sparsity problems [9, 10].

Despite all the benefits of submodular functions, the question of how to learn them in a practical manner remains open. Moreover, if we want to open the toolbox of submodular optimization to modern practitioners, an intriguing question is how to use them in conjunction with deep networks. For instance, we need to develop mechanisms that enable them to be trained together in a fully end-to-end fashion. As a concrete example from the computer vision domain, consider the problem of image segmentation. Namely, we are given as input an RGB representation $x \in \mathbb{R}^{n \times 3}$ of an image captured by, say, a dashboard camera, and the goal is to identify the set of pixels $A \subseteq \{1, 2, \ldots, n\}$ that are occupied by pedestrians. While we could train a network $\theta\colon x \mapsto v \in \mathbb{R}^n$ to output per-pixel scores, it would be helpful, especially in domains with limited data, to bias the predictions by encoding some prior beliefs about the expected output. For example, we might prefer segmentations that are spatially consistent.
One common approach to encourage such configurations is to first define a graph over the image, $G = (V, E)$, by connecting neighbouring pixels, specify weights $w$ over the edges, and then solve the following graph-cut problem

$$A^\star(w, v) = \arg\min_{A \subseteq V} F(A) = \arg\min_{A \subseteq V} \sum_{\{i,j\} \in E} \underbrace{w_{i,j}\, [\![\, |A \cap \{i, j\}| = 1 \,]\!]}_{1 \text{ iff the predictions disagree}} \; + \; \sum_{i \in A} \underbrace{v_i}_{\text{pixel score}}. \tag{1}$$

While this can be easily seen as a module computing the best configuration as a function of the edge weights and per-pixel scores, incorporating it as a layer in a deep network seems at first glance to be a futile task. Even though the output is easily computable, it is discontinuous and has no Jacobian, which is necessary for backpropagation. However, as the above problem is an instance of submodular minimization, we can leverage its relationship to convexity and relax it to

$$y^\star(w, v) = \arg\min_{y \in \mathbb{R}^n} f(y) + \tfrac{1}{2}\|y\|^2 = \arg\min_{y \in \mathbb{R}^n} \sum_{\{i,j\} \in E} w_{i,j}\,|y_i - y_j| + v^T y + \tfrac{1}{2}\|y\|^2. \tag{2}$$

In addition to having a continuous output, this relaxation has a very strong connection with the discrete problem, as the discrete optimizer can be obtained by thresholding $y^\star$ as $A^\star = \{i \in V \mid y^\star_i > 0\}$. Moreover, as explained in §2, for every submodular function $F$ there exists an easily computable convex function $f$ so that this relationship holds. For general submodular functions, the negation of the solution to the relaxed problem (2) is known as the min-norm point [11]. In this paper we consider the problem of designing such modules that solve discrete optimization problems by leveraging this continuous formulation. To this end, our key technical contribution is to analyze the sensitivity of the min-norm point as a function of the parametrization of the function $f$. For the specific case above we will show how to compute $\partial y^\star / \partial w$ and $\partial y^\star / \partial v$.

Continuing with the segmentation example, we might want to train a conditional model $P(A \mid x)$ that can model the uncertainty in the predictions, to be used in downstream decision making. A rich class of models are log-supermodular models, i.e., those of the form $P(A \mid x) = \exp(-F_{\theta(x)}(A)) / Z(\theta(x))$ for some parametric submodular function $F_\theta$. While they can capture very useful interactions, they are very hard to train in the maximum likelihood setting due to the presence of the intractable normalizer $Z(\theta)$. However, Djolonga and Krause [8] have shown that for any such distribution we can find the closest fully factorized distribution $Q(\cdot \mid x)$ minimizing a specific information-theoretic divergence $D_\infty$. In other words, we can exactly compute $Q(\cdot \mid x) = \arg\min_{Q \in \mathcal{Q}} D_\infty(P(\cdot \mid x) \,\|\, Q)$, where $\mathcal{Q}$ is the family of fully factorized distributions. Most importantly, the optimal $Q$ can also be computed from the min-norm point. Thus, a reasonable objective would be to learn a model $\theta(x)$ so that the best approximate distribution $Q(\cdot \mid x)$ gives high likelihood to the training data points. This is a complicated bi-level optimization problem (as $Q$ implicitly depends on $\theta$) with an inner variational inference procedure, which we can again train end-to-end using our results. In other words, we can optimize the following pipeline end-to-end with respect to $\theta$:

$$x_i \;\longrightarrow\; \theta_i = \theta(x_i) \;\longrightarrow\; P(A) = \exp(-F_{\theta_i}(A))/Z(\theta_i) \;\longrightarrow\; Q = \arg\min_{Q' \in \mathcal{Q}} D_\infty(P \,\|\, Q') \;\longrightarrow\; Q(A_i \mid x_i). \tag{3}$$
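To make the relaxation concrete, the following is a minimal NumPy sketch (not from the paper) of the relaxed objective in eq. (2) and of the thresholding rule that recovers the discrete minimizer. The array layout of `edges`, `w` and `v` is an assumption, and actually solving the relaxation requires a min-norm solver (e.g., the one referenced in §6), which is not shown.

```python
import numpy as np

def relaxed_cut_objective(y, edges, w, v):
    """Objective of the relaxation in eq. (2):
    sum_{{i,j} in E} w_ij |y_i - y_j|  +  v^T y  +  0.5 ||y||^2.
    edges: integer array of shape (|E|, 2); w: (|E|,); v, y: (n,)."""
    i, j = edges[:, 0], edges[:, 1]
    pairwise = np.sum(w * np.abs(y[i] - y[j]))   # penalizes disagreeing neighbours
    return pairwise + v @ y + 0.5 * y @ y

def discrete_minimizer(y_star):
    """Recover the discrete optimizer of eq. (1) from the relaxed solution of eq. (2)
    by thresholding: A* = {i in V : y*_i > 0}."""
    return np.flatnonzero(y_star > 0)
```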
**Related work.** Sensitivity analysis of the set of optimal solutions has a long history in optimization theory [12]. The problem of argmin-differentiation of the specific case resulting from graph cuts (i.e., eq. (2)) has been considered in the computer vision literature, either by smoothing the objective [13], or by unrolling iterative methods [14]. The idea of training probabilistic models by evaluating them using the marginals produced by an approximate inference algorithm has been studied by Domke [15] for tree-reweighted belief propagation and mean field, and for continuous models by Tappen [16]. These methods either use the implicit function theorem, or unroll iterative optimization algorithms. The benefits of using an inconsistent estimator, which is what we do by optimizing eq. (3), in return for computationally tractable inference have been discussed by Wainwright [17]. Amos and Kolter [18] discuss how to efficiently argmin-differentiate quadratic programs by perturbing the KKT conditions, an idea that goes back to Boot [19]. We make an explicit connection to their work in Theorem 4. In Section 4 we harness the connection between the min-norm problem and isotonic regression, which has been exploited to obtain better duality certificates [2], and by Kumar and Bach [20] to design an active-set algorithm for the min-norm problem. Chakravarti [21] analyzes the sensitivity of the optimal isotonic regression point with respect to perturbations of the input, but does not discuss the directional derivatives of the problem. Recently, Dolhansky and Bilmes [22] have used deep networks to parametrize submodular functions. Discrete optimization is also used in structured prediction [23, 24] for the computation of the loss function, which is closely related to our work if we use discrete optimization only at the last layer. However, in this case we have the advantage that we allow for arbitrary loss functions to be applied to the solution of the relaxation.

**Contributions.** We develop a very efficient approximate method (§4) for the computation of the Jacobian of the min-norm problem, inspired by our analysis of isotonic regression in §3, where we derive results that might be of independent interest. Even more importantly from a practical perspective, this Jacobian has a very nice structure and we can multiply with it in linear time. This means that we can efficiently perform back-propagation if we use these layers in modern deep architectures. In §5 we show how to compute directional derivatives exactly in polynomial time, and give conditions under which our approximation is correct. This is also an interesting theoretical result, as it quantifies the stability of the min-norm point with respect to the model parameters. Lastly, we use our results to learn log-supermodular models in §6.

## 2 Background on Submodular Minimization

Let us introduce the necessary background on submodular functions. They are defined over subsets of some ground set, which in the remainder of this paper we will w.l.o.g. assume to be $V = \{1, 2, \ldots, n\}$. A function $F\colon 2^V \to \mathbb{R}$ is said to be submodular iff for all $A, B \subseteq V$ it holds that

$$F(A \cup B) + F(A \cap B) \le F(A) + F(B). \tag{4}$$

We will furthermore w.l.o.g. assume that $F$ is normalized so that $F(\emptyset) = 0$. A very simple family of submodular functions are the modular functions. These, seen as discrete analogues of linear functions, satisfy the above with equality and are given as $F(A) = \sum_{i \in A} m_i$ for some real numbers $m_i$. As is common practice in combinatorial optimization, we will treat any vector $m \in \mathbb{R}^n$ as a modular function $2^V \to \mathbb{R}$ defined as $m(A) = \sum_{i \in A} m_i$.
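As a concrete illustration of eq. (4), and not part of the paper, the sketch below builds the pairwise term of the cut function from eq. (1) and brute-force checks the submodular inequality on a tiny ground set; the helper names are hypothetical.

```python
import itertools

def cut_function(edges, w):
    """F(A) = sum of w_ij over edges with exactly one endpoint in A
    (the pairwise part of eq. (1)); edges is a list of pairs, w a list of weights."""
    def F(A):
        return sum(w_ij for (i, j), w_ij in zip(edges, w) if (i in A) != (j in A))
    return F

def is_submodular(F, V):
    """Brute-force check of eq. (4): F(A | B) + F(A & B) <= F(A) + F(B) for all A, B of V."""
    subsets = [frozenset(S) for r in range(len(V) + 1) for S in itertools.combinations(V, r)]
    return all(F(A | B) + F(A & B) <= F(A) + F(B) + 1e-9 for A in subsets for B in subsets)

# Example: the cut function of a path graph on three pixels with unit weights.
F = cut_function(edges=[(0, 1), (1, 2)], w=[1.0, 1.0])
assert is_submodular(F, V=[0, 1, 2])
```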
In addition to the graph cuts from the introduction (eq. (1)), another widely used class of functions are the concave-of-cardinality functions, i.e., those of the form $F(A) = h(|A|)$ for some concave $h\colon \mathbb{R} \to \mathbb{R}$ [5]. From eq. (4) we see that if we want to define a submodular function over a collection $D \subseteq 2^V$, it has to be closed under union and intersection. Such collections are known as lattices, and two examples that we will use are the simple lattice $2^V$ and the trivial lattice $\{\emptyset, V\}$. In the theory of submodular minimization, a critical object defined by a pair consisting of a submodular function $F$ and a lattice $D \supseteq \{\emptyset, V\}$ is the base polytope

$$B(F \mid D) = \{x \in \mathbb{R}^n \mid x(A) \le F(A) \text{ for all } A \in D\} \cap \{x \in \mathbb{R}^n \mid x(V) = F(V)\}. \tag{5}$$

We will also use the shorthand $B(F) = B(F \mid 2^V)$. Using the result of Edmonds [25], we know how to maximize a linear function over $B(F)$ in $O(n \log n)$ time with $n$ function evaluations of $F$. Specifically, to compute $\max_{y \in B(F)} z^T y$, we first choose a permutation $\sigma\colon V \to V$ that sorts $z$, i.e., so that $z_{\sigma(1)} \ge z_{\sigma(2)} \ge \cdots \ge z_{\sigma(n)}$. Then, a maximizer $f(\sigma) \in B(F)$ can be computed as

$$[f(\sigma)]_{\sigma(i)} = F(\{\sigma(i)\} \mid \{\sigma(1), \ldots, \sigma(i-1)\}), \tag{6}$$

where the marginal gain of $A$ given $B$ is defined as $F(A \mid B) = F(A \cup B) - F(B)$. Hence, we know how to compute the support function $f(z) = \sup_{y \in B(F)} z^T y$, which is known as the Lovász extension [3]. First, this function is indeed an extension, as $f(\mathbf{1}_A) = F(A)$ for all $A \subseteq V$, where $\mathbf{1}_A \in \{0,1\}^n$ is the indicator vector of the set $A$. Second, it is convex, as it is a supremum of linear functions. Finally, and most importantly, it lets us minimize submodular functions in polynomial time with convex optimization, because $\min_{A \subseteq V} F(A) = \min_{z \in [0,1]^n} f(z)$ and we can efficiently round the optimal continuous point to a discrete optimizer.

Another problem, with a smooth objective, which is also explicitly tied to the problem of minimizing $F$, is that of computing the min-norm point, which can be defined in two equivalent ways as

$$y^\star = \arg\min_{y \in B(F)} \tfrac{1}{2}\|y\|^2, \qquad \text{or equivalently} \qquad y^\star = -\arg\min_{y \in \mathbb{R}^n} f(y) + \tfrac{1}{2}\|y\|^2, \tag{7}$$

where the equivalence comes from strong Fenchel duality [2]. The connection with submodular minimization comes from the following lemma, which we have already hinted at in the introduction.

**Lemma 1** ([1, Lem. 7.4]). *Define $A_- = \{i \mid y^\star_i < 0\}$ and $A_0 = \{i \mid y^\star_i \le 0\}$. Then $A_-$ ($A_0$) is the unique smallest (largest) minimizer of $F$.*

Moreover, if instead of hard-thresholding we send the min-norm point through a sigmoid, the result has the following variational inference interpretation, which lets us optimize the pipeline in eq. (3).

**Lemma 2** ([8, Thm. 3]). *Define the infinite Rényi divergence between any distributions $P$ and $Q$ over $2^V$ as $D_\infty(P \,\|\, Q) = \sup_{A \subseteq V} \log\bigl(P(A)/Q(A)\bigr)$. For $P(A) \propto \exp(-F(A))$, the distribution minimizing $D_\infty$ over all fully factorized distributions $Q$ is given as*

$$Q(A) = \prod_{i \in A} \sigma(-y^\star_i) \prod_{i \notin A} \sigma(y^\star_i),$$

*where $\sigma(u) = 1/(1 + \exp(-u))$ is the sigmoid function.*
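The greedy step in eq. (6) and the resulting Lovász extension are straightforward to implement; the following is a hedged NumPy sketch (function names are mine, and `F` is assumed to be a normalized set function given as a Python callable on sets).

```python
import numpy as np

def greedy_vertex(F, z):
    """Edmonds' greedy algorithm (eq. (6)): returns the maximizer f(sigma) of z^T y over B(F),
    where sigma sorts z in decreasing order."""
    n = len(z)
    f_sigma = np.zeros(n)
    prefix, F_prev = set(), 0.0                  # F(empty set) = 0 by normalization
    for idx in np.argsort(-np.asarray(z)):       # sigma(1), ..., sigma(n)
        F_cur = F(prefix | {int(idx)})
        f_sigma[idx] = F_cur - F_prev            # marginal gain F(sigma(i) | {sigma(1..i-1)})
        prefix.add(int(idx))
        F_prev = F_cur
    return f_sigma

def lovasz_extension(F, z):
    """f(z) = f(sigma)^T z, the support function of B(F); satisfies f(1_A) = F(A)."""
    return float(greedy_vertex(F, z) @ np.asarray(z))
```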
## 3 Argmin-Differentiation of Isotonic Regression

We will first analyze a simpler problem, namely that of isotonic regression, defined as

$$y(x) = \arg\min_{y \in O} \tfrac{1}{2}\|y - x\|^2, \tag{8}$$

where $O = \{y \in \mathbb{R}^n \mid y_i \le y_{i+1} \text{ for } i = 1, 2, \ldots, n-1\}$. The connection to our problem will be made clear in Section 4, and it essentially follows from the fact that the Lovász extension is linear on $O$. In this section we will be interested in computing the Jacobian $\partial y / \partial x$, i.e., in understanding how the solution $y$ changes with respect to the input $x$. The function $y(x)$ is well-defined because of the strict convexity of the objective and the non-empty convex feasible set. Moreover, it can be easily computed in $O(n)$ time using the pool adjacent violators algorithm (PAVA) [26]. This is a well-studied problem in statistics; see e.g. [27].

To understand the behaviour of $y(x)$, we will start by stating the optimality conditions of problem (8). To simplify the notation, for any $A \subseteq V$ we will define $\mathrm{Mean}_x(A) = \frac{1}{|A|} \sum_{i \in A} x_i$. The optimality conditions can be stated via ordered partitions $\Pi = (B_1, B_2, \ldots, B_m)$ of $V$, meaning that the sets $B_j$ are disjoint, $\cup_{j=1}^m B_j = V$, and $\Pi$ is ordered so that $1 + \max_{i \in B_j} i = \min_{i \in B_{j+1}} i$. Specifically, for any such partition we define $y_\Pi = (y_1, y_2, \ldots, y_m)$, where $y_j = \mathrm{Mean}_x(B_j)\,\mathbf{1}_{|B_j|}$ and $\mathbf{1}_k = \{1\}^k$ is the vector of all ones. In other words, $y_\Pi$ is defined by taking block-wise averages of $x$ with respect to $\Pi$. By analyzing the KKT conditions of problem (8), we obtain the following well-known condition.

**Lemma 3** ([26]). *An ordered partition $\Pi = (B_1, B_2, \ldots, B_m)$ is optimal iff the following hold:*

1. *(Primal feasibility) For any two consecutive blocks $B_j$ and $B_{j+1}$ we have*
$$\mathrm{Mean}_x(B_j) \le \mathrm{Mean}_x(B_{j+1}). \tag{9}$$
2. *(Dual feasibility) For every block $B \in \Pi$ and each $i \in B$ define $\mathrm{Pre}_B(i) = \{j \in B \mid j \le i\}$. Then, the condition reads*
$$\mathrm{Mean}_x(\mathrm{Pre}_B(i)) - \mathrm{Mean}_x(B) \ge 0. \tag{10}$$

Points where eq. (9) is satisfied with equality are of special interest, because of the following result.

**Lemma 4.** *If for some $B_j$ and $B_{j+1}$ the first condition is satisfied with equality, we can merge the two sets so that the new, coarser partition $\Pi'$ will also be optimal.*

Thus, in the remainder of this section we will assume that the sets $B_j$ are chosen maximally. We will now introduce a notion that will be crucial in the subsequent analysis.

**Definition 1.** *For any block $B$, we say that $i \in B$ is a breakpoint if $\mathrm{Mean}_x(\mathrm{Pre}_B(i)) = \mathrm{Mean}_x(B)$ and it is not the right end-point of $B$ (i.e., $i < \max_{i' \in B} i'$).*

From an optimization perspective, any breakpoint corresponds to non-strict complementarity of the associated Lagrange multiplier. From a combinatorial perspective, breakpoints correspond to positions where we can refine $\Pi$ into a finer partition $\Pi'$ that gives rise to the same point, i.e., $y_\Pi = y_{\Pi'}$ (if we merge blocks using Lemma 4, the point where we merge them becomes a breakpoint). We can now discuss the differentiability of $y(x)$. Because projecting onto convex sets is a proximal operator and thus non-expansive, we have the following as an immediate consequence of Rademacher's theorem.

**Lemma 5.** *The function $y(x)$ is 1-Lipschitz continuous and differentiable almost everywhere.*

We will denote by $\partial^-_{x_k}$ and $\partial^+_{x_k}$ the left and right partial derivatives with respect to $x_k$. For any index $k$ we will denote by $u(k)$ (resp. $l(k)$) the breakpoint with the smallest (largest) coordinate larger (smaller) than $k$, defined as $+\infty$ ($-\infty$) if no such breakpoint exists. Moreover, denote by $\Pi(z)$ the partition of $V$ into the level sets of $z$, i.e., $\Pi(z) = \bigcup_{i=1}^n \{\{i' \in V \mid z_{i'} = z_i\}\}$.

**Theorem 1.** *Let $k$ be any coordinate and let $B \in \Pi(y(x))$ be the block containing $k$. Also define $B_+ = \{i \in B \mid i \ge u(k)\}$ and $B_- = \{i \in B \mid i \le l(k)\}$. Then, for any $i \in B$,*
$$\partial^+_{x_k}(y_i) = \frac{[\![\, i \in B \setminus B_- \,]\!]}{|B \setminus B_-|}, \qquad \partial^-_{x_k}(y_i) = \frac{[\![\, i \in B \setminus B_+ \,]\!]}{|B \setminus B_+|}.$$

Note that all of these derivatives agree iff there are no breakpoints, which by Lemma 5 means that the existence of breakpoints is an isolated phenomenon. In this case the Jacobian exists and has a very simple block-diagonal form. Namely, it is equal to

$$\frac{\partial y}{\partial x} = \Lambda(y(x)) \triangleq \mathrm{blkdiag}(C_{|B_1|}, C_{|B_2|}, \ldots, C_{|B_m|}), \tag{11}$$

where $C_k = \mathbf{1}_k \mathbf{1}_k^T / k$ is the averaging matrix with elements $1/k$. We will use $\Lambda(z)$ for the matrix taking block-wise averages with respect to the blocks $\Pi(z)$. As promised in the introduction, multiplication with this Jacobian, $\Lambda(y(x))\,u$, takes linear time, as we only have to perform block-wise averages.
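For reference, here is a minimal Python sketch (not the paper's implementation) of PAVA for the non-decreasing cone, together with the linear-time multiplication by the block-diagonal Jacobian of eq. (11); it assumes the breakpoint-free case.

```python
import numpy as np

def isotonic_regression(x):
    """Project x onto O = {y : y_1 <= ... <= y_n} with pool adjacent violators (PAVA).
    Returns the solution y and the list of blocks of Pi(y)."""
    blocks = []                                   # each block stored as [sum, count]
    for xi in x:
        blocks.append([float(xi), 1])
        # pool while two adjacent block means violate the ordering
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    y, part, start = [], [], 0
    for s, c in blocks:
        y.extend([s / c] * c)                     # block-wise averages of x
        part.append(list(range(start, start + c)))
        start += c
    return np.array(y), part

def jacobian_vector_product(part, u):
    """Multiply u by the Jacobian of eq. (11): block-wise averaging, in O(n)."""
    out = np.empty_like(u, dtype=float)
    for B in part:
        out[B] = np.mean(u[B])
    return out
```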
## 4 Min-Norm Differentiation

In this section we will assume that we have a function $F_\theta$, parametrized by some $\theta \in \mathbb{R}^d$, that we seek to learn. For example, we could have a mixture model

$$F_\theta(A) = \sum_{j=1}^{d} \theta_j\, G_j(A), \tag{12}$$

for some fixed submodular functions $G_j\colon 2^V \to \mathbb{R}$. In this case, to ensure that the resulting function is submodular, we also want to enforce $\theta_j \ge 0$ unless $G_j$ is modular. We would like to note that the discussion in this section goes beyond such models. Remember that the min-norm point is defined as

$$y_\theta = -\arg\min_{y \in \mathbb{R}^n} f_\theta(y) + \tfrac{1}{2}\|y\|^2, \tag{13}$$

where $f_\theta$ is the Lovász extension of $F_\theta$. Hence, we want to compute $\partial y_\theta / \partial \theta$. To make the connection with isotonic regression, remember how we evaluate the Lovász extension at $y$. First, we pick a permutation $\sigma$ that sorts $y$, and then evaluate it as $f_\theta(y) = f_\theta(\sigma)^T y$, where $f_\theta(\sigma)$ is defined in eq. (6). Hence, the Lovász extension is linear on the set of all vectors that are sorted by $\sigma$. Formally, for any permutation $\sigma$ the Lovász extension is equal to $f_\theta(\sigma)^T y$ on the order cone $O(\sigma) = \{y \mid y_{\sigma(n)} \le y_{\sigma(n-1)} \le \cdots \le y_{\sigma(1)}\}$. Given a permutation $\sigma$, if we constrain eq. (13) to $O(\sigma)$ we can replace $f_\theta(y)$ by the linear function $f_\theta(\sigma)^T y$, so that the problem reduces to

$$y_\theta(\sigma) = -\arg\min_{y \in O(\sigma)} \tfrac{1}{2}\|y + f_\theta(\sigma)\|^2, \tag{14}$$

which is an instance of isotonic regression if we relabel the elements of $V$ using $\sigma$. Roughly, the idea is to instead differentiate eq. (14) with $f_\theta(\sigma)$ computed at the optimal point $y_\theta$. However, because we can choose an arbitrary order among the elements with equal values, there may be multiple permutations that sort $y_\theta$, and this extra choice seems very problematic. Nevertheless, let us continue with this strategy and analyze the resulting approximations to the Jacobian. We propose the following approximation:

$$\frac{\partial y_\theta}{\partial \theta} \approx \widehat{J}_\sigma \triangleq \Lambda(y_\theta)\,\frac{\partial f_\theta(\sigma)}{\partial \theta} = \Lambda(y_\theta)\,\bigl[\,\partial_{\theta_1} f_\theta(\sigma) \mid \partial_{\theta_2} f_\theta(\sigma) \mid \cdots \mid \partial_{\theta_d} f_\theta(\sigma)\,\bigr],$$

where $\Lambda(y_\theta)$ is used as an approximation of a Jacobian which might not exist. Fortunately, due to the special structure of the linearizations, the gradient obtained using the above strategy does not depend on the specific permutation $\sigma$ that was chosen.

**Theorem 2.** *If $\partial_{\theta_k} F_\theta(A)$ exists for all $A \subseteq V$, the approximate Jacobians $\widehat{J}_\sigma$ are all equal and do not depend on the choice of $\sigma$. Specifically, for any element $i \in B \in \Pi(y_\theta)$, the entry in row $i$ and column $j$ is equal to*

$$\frac{1}{|B|}\, \partial_{\theta_j} F_\theta\bigl(B \mid \{\, i' \in V \mid [y_\theta]_i < [y_\theta]_{i'} \,\}\bigr). \tag{15}$$

*Proof sketch (details in the supplement).* Remember that $\Lambda(y_\theta)$ averages $f_\theta(\sigma)$ within each $B \in \Pi(y_\theta)$. Moreover, as $\sigma$ sorts $y_\theta$, the elements in $B$ must be placed consecutively. The coordinates of $f_\theta(\sigma)$ are marginal gains (6) and they telescope inside the mean, which yields the claimed quantity.

**Graph cuts.** As a special, but important, case, let us analyze what the approximate Jacobian looks like for a cut function (eq. (1)), in which case eq. (13) reduces to eq. (2). Let $\Pi(y(w, v)) = (B_1, B_2, \ldots, B_m)$. For any element $i \in V$ we will denote by $\eta(i) \in \{1, 2, \ldots, m\}$ the index of the block it belongs to. Then, the approximate Jacobian $\widehat{J}$ at $\theta = (w, v)$ has entries

$$\widehat{\partial}_{v_j}(y_i) = \frac{[\![\, \eta(i) = \eta(j) \,]\!]}{|B_{\eta(i)}|}, \qquad
\widehat{\partial}_{w_{i,j}}(y_k) = \begin{cases} \operatorname{sign}(y_i - y_j)\,\dfrac{1}{|B_{\eta(k)}|} & \text{if } \eta(k) = \eta(i), \\[4pt] \operatorname{sign}(y_j - y_i)\,\dfrac{1}{|B_{\eta(k)}|} & \text{if } \eta(k) = \eta(j), \\[4pt] 0 & \text{otherwise}, \end{cases}$$

where the sign function is defined to be zero if its argument is zero. Intuitively, increasing the modular term $v_i$ by $\delta$ will increase all the coordinates of $y$ that are in the same segment $B$ by $\delta/|B|$. On the other hand, increasing the weight of an edge $w_{i,j}$ will have no effect if $i$ and $j$ are already in the same segment of $y$, and otherwise it will pull the segments containing $i$ and $j$ together, by increasing the smaller one and decreasing the larger one. In the supplementary material we provide a PyTorch module that executes the backpropagation pass in $O(|V| + |E|)$ time in about 10 lines of code, and we also derive the approximate Jacobians for concave-of-cardinality and facility location functions.
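The supplement's PyTorch module is not reproduced here, but the following is a hedged sketch of what such an approximate backward pass could look like, using the entries above. The tensor layout (`edges` as an integer tensor of shape (|E|, 2), `block_id` holding the block index η(i) of each element as a LongTensor) and the function name are my own assumptions.

```python
import torch

def graph_cut_backward(grad_y, y, edges, block_id):
    """Approximate backward pass for the graph-cut layer of Section 4.
    grad_y: upstream gradient dL/dy (float, shape (n,)); y: layer output (n,);
    edges: LongTensor (|E|, 2); block_id: LongTensor (n,) with block index of each element.
    Returns (grad_w, grad_v) under the approximate Jacobian."""
    # dL/dv_j = block-wise average of grad_y over the block containing j
    block_size = torch.bincount(block_id).float()
    block_sum = torch.zeros_like(block_size).index_add_(0, block_id, grad_y)
    block_avg = block_sum / block_size
    grad_v = block_avg[block_id]
    # dL/dw_{ij}: zero if i and j share a block, otherwise the two blocks are pulled together
    i, j = edges[:, 0], edges[:, 1]
    s = torch.sign(y[i] - y[j])
    grad_w = s * (block_avg[block_id[i]] - block_avg[block_id[j]])
    grad_w = torch.where(block_id[i] == block_id[j], torch.zeros_like(grad_w), grad_w)
    return grad_w, grad_v
```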
## 5 Theoretical Analysis

We will now theoretically analyze the conditions under which our approximation is correct, and then give a characterization of the exact directional derivative, together with a polynomial-time algorithm that computes it. The first notion that will have implications for our analysis is that of (in)separability.

**Definition 2.** *The function $F\colon 2^V \to \mathbb{R}$ is said to be separable if there exists some $B \subseteq V$ such that $B \notin \{\emptyset, V\}$ and $F(V) = F(B) + F(V \setminus B)$.*

The term separable is indeed appropriate, as it implies that $F(A) = F(A \cap B) + F(A \cap (V \setminus B))$ for all $A \subseteq V$ [2, Prop. 4.3], i.e., the function splits as a sum of two functions on disjoint domains. Hence, we can split the problem into two (on $B$ and $V \setminus B$) and analyze them independently. We would like to point out that separability is checkable in cubic time using the algorithm of Queyranne [28]. To simplify the notation, we will assume that we want to compute the derivative at a point $\theta \in \mathbb{R}^d$ which results in the min-norm point $y^\star = y_\theta \in \mathbb{R}^n$. We will furthermore assume that $y^\star$ takes on the distinct values $\gamma_1 < \gamma_2 < \cdots < \gamma_k$ on the sets $B_1, B_2, \ldots, B_k$ respectively, and we will define the chain $\emptyset = A_0 \subset A_1 \subset A_2 \subset \cdots \subset A_k = V$ by $A_j = \cup_{j'=1}^{j} B_{j'}$. A central role in the analysis will be played by the set of constraints in $B(F_\theta)$ (see (5)) that are active at $y_\theta$, which makes sense given that we expect small perturbations of $\theta$ to result in small changes of $y_\theta$ as well.

**Definition 3.** *For any submodular function $F\colon 2^V \to \mathbb{R}$ and any point $z \in B(F)$ we shall denote by $\mathcal{D}_F(z)$ the lattice of tight sets of $z$ on $B(F)$, i.e., $\mathcal{D}_F(z) = \{A \subseteq V \mid z(A) = F(A)\}$.*

The fact that the above set is indeed a lattice is well-known [1]. Moreover, note that $\mathcal{D}_F(z) \supseteq \{\emptyset, V\}$. We will also define $\mathcal{D}^\star = \mathcal{D}_{F_\theta}(y^\star)$, i.e., the lattice of tight sets at the min-norm point.

### 5.1 When will the approximate approach work?

We will analyze sufficient conditions so that, irrespective of the choice of $\sigma$, the isotonic regression problem in eq. (14) has no breakpoints, and the left and right derivatives agree. To this end, for any $j \in \{1, 2, \ldots, k\}$ we define the submodular function $F_j\colon 2^{B_j} \to \mathbb{R}$ as $F_j(H) = F_\theta(H \mid A_{j-1})$, where we have dropped the dependence on $\theta$ as it will remain fixed throughout this section.

**Theorem 3.** *The approximate problem (14) is argmin-continuously differentiable, irrespective of the chosen permutation $\sigma$ sorting $y_\theta$, if and only if, for every $j$, any of the following equivalent conditions hold:*

(a) $\arg\min_{H \subseteq B_j} F_j(H) - F_j(B_j)\,|H|/|B_j| = \{\emptyset, B_j\}$;

(b) $y^\star_{B_j} \in \operatorname{relint}(B(F_j))$, i.e., $\mathcal{D}_{F_j}(y^\star_{B_j}) = \{\emptyset, B_j\}$, which is only possible if $F_j$ is inseparable.

In other words, we can equivalently say that the optimum has to lie in the interior of the face. Moreover, if $\theta \mapsto y_\theta$ is continuous (for example, if the correspondence $\theta \mapsto B(F_\theta)$ is hemicontinuous, by Berge's theorem), this result implies that the min-norm point is locally defined as averaging within the same blocks using (15), so that the approximate Jacobian is exact.
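Condition (a) can at least be sanity-checked by enumeration on very small blocks; the following brute-force sketch (not from the paper, exponential in |B_j|) does exactly that.

```python
import itertools

def approximation_is_exact(F_j, B_j):
    """Brute-force check of condition (a) in Theorem 3 for one block B_j (tiny blocks only):
    the minimizers of H -> F_j(H) - F_j(B_j) |H| / |B_j| over subsets H of B_j
    must be exactly {empty set, B_j}."""
    B_j = list(B_j)
    F_full = F_j(set(B_j))
    def phi(H):
        return F_j(set(H)) - F_full * len(H) / len(B_j)
    subsets = [S for r in range(len(B_j) + 1) for S in itertools.combinations(B_j, r)]
    best = min(phi(S) for S in subsets)
    minimizers = {frozenset(S) for S in subsets if abs(phi(S) - best) < 1e-9}
    return minimizers == {frozenset(), frozenset(B_j)}
```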
We would like to point out that one can obtain the same derivatives as the ones suggested in §4 by perturbing the KKT conditions, as done by Amos and Kolter [18]. If we use that approach, in addition to the computational challenges, there is the problem of non-uniqueness of the Lagrange multiplier; moreover, some valid multipliers might be zero for some of the active constraints. This would render the resulting linear system rank deficient, and it is not clear how to proceed. Remember that when we analyzed the isotonic regression problem in §3 we had non-differentiability due to exactly the same reason: zero multipliers for active constraints, which in that case correspond to the breakpoints.

**Theorem 4.** *For a specific Lagrange multiplier, there exists a solution to the perturbed KKT conditions derived by [18] that gives rise to the approximate Jacobians from Section 4. Moreover, this multiplier is unique if the conditions of Theorem 3 are satisfied.*

### 5.2 Exact computation

Unfortunately, computing the gradients exactly seems very complicated for arbitrary parametrizations $F_\theta$, and we will focus our attention on mixture models of the form given in eq. (12). The directions $v$ in which we compute the directional derivatives will in general have non-negative components $v_j$, unless $G_j$ is modular. By leveraging the theory of Shapiro [29], and exploiting the structure of both the min-norm point and the polyhedron $B(F_v \mid \mathcal{D}^\star)$, we arrive at the following result.

**Theorem 5.** *Assume that $F_\theta$ is inseparable and let $v$ be any direction such that $F_v$ is submodular. The directional derivative of $y_\theta$ at $\theta$ in direction $v$ is given by the solution of the following problem:*

$$\begin{aligned} \underset{d}{\text{minimize}} \quad & \tfrac{1}{2}\|d\|^2 \\ \text{subject to} \quad & d \in B(F_v \mid \mathcal{D}^\star), \quad \text{and} \quad d(A_j) = F_v(A_j) \ \text{ for } j \in \{1, 2, \ldots, k\}. \end{aligned} \tag{16}$$

First, note that this is again a min-norm problem, but now defined over a reduced lattice $\mathcal{D}^\star$ with $k$ additional equality constraints. Fortunately, due to these additional equalities, we can split the above problem into $k$ separate min-norm problems. Namely, for each $j \in \{1, 2, \ldots, k\}$ collect the lattice of tight sets that intersect $B_j$ as $\mathcal{D}^\star_j = \{H \cap B_j \mid H \in \mathcal{D}^\star\}$, and define the function $G_j\colon 2^{B_j} \to \mathbb{R}$ as $G_j(A) = F_v(A \mid A_{j-1})$; note that here the parameter vector is taken to be the direction $v$ in which we want to compute the derivative. Then, the block of the optimal solution of problem (16) corresponding to $B_j$ is equal to

$$d^\star_{B_j} = \arg\min_{y_j \in B(G_j \mid \mathcal{D}^\star_j)} \tfrac{1}{2}\|y_j\|^2, \tag{17}$$

which is again a min-norm point problem, where the base polytope is defined using the lattice $\mathcal{D}^\star_j$. We can then immediately draw a connection with the results from the previous subsection.

**Corollary 1.** *If all lattices $\mathcal{D}^\star_j$ are trivial, the solution of (17) agrees with the approximate Jacobian (15).*

How do we solve problem (16)? Fortunately, the divide-and-conquer algorithm of Groenevelt [30] can be used to find the min-norm point over arbitrary lattices. To do this, we have to compute for each $i \in B_j$ the unique smallest set $H^\star_i$ in $\arg\min_{H_j \ni i} F_j(H_j) - y^\star(H_j)$, which can be done using submodular minimization after applying the reduction of Schrijver [31]. To highlight the difference with the approximation from Section 4, let us consider a very simple case.

**Lemma 6.** *Assume that $G_j$ is equal to $G_j(A) = [\![\, i \in A \,]\!]$ for some $i \in B_j$. Then, the directional derivative is equal to $\mathbf{1}_D / |D|$, where $D = \{i' \mid i \in H^\star_{i'}\}$.*

Hence, while the approximate directional derivative would average over all elements in $B_j$, the true one averages only over a subset $D \subseteq B_j$ and is possibly sparser.
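For very small blocks, the quantities in Lemma 6 can also be computed by brute force; the sketch below is my own (exponential in |B_j|, and assuming the elements of B_j are integer indices into y*), and enumerates the smallest minimizers H*_{i'} directly instead of using the reduction of [31].

```python
import itertools
import numpy as np

def lemma6_directional_derivative(F_j, y_star, B_j, i):
    """Brute-force evaluation of Lemma 6 for G_j(A) = [[i in A]] (tiny blocks only):
    for each i' in B_j compute the smallest set H*_{i'} minimizing F_j(H) - y*(H) over
    sets H contained in B_j with i' in H, then average uniformly over D = {i' : i in H*_{i'}}."""
    B_j = list(B_j)
    def score(H):
        return F_j(set(H)) - sum(y_star[k] for k in H)
    D = []
    for i_prime in B_j:
        cands = [set(S) | {i_prime}
                 for r in range(len(B_j) + 1) for S in itertools.combinations(B_j, r)]
        best = min(score(H) for H in cands)
        # the smallest minimizer is the intersection of all minimizers (they form a lattice)
        H_star = set(B_j)
        for H in cands:
            if abs(score(H) - best) < 1e-9:
                H_star &= H
        if i in H_star:
            D.append(i_prime)
    d = np.zeros(len(y_star))
    d[D] = 1.0 / len(D)
    return d
```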
Lemma 6 gives us the exact directional derivatives for graph cuts, as each component $G_j$ will be either a cut function on two vertices, or a function of the form in Lemma 6. In the first case the directional derivative is zero, because $0 \in B(G_j) \subseteq B(G_j \mid \mathcal{D}^\star_j)$. In the second case, we can either solve exactly using Lemma 6, or use a more sophisticated approximation generalizing the result from [32]: given that $F_j$ is separable over $2^{B_j}$ iff the graph is disconnected, we can first separate the graph into connected components, and then take averages within each connected component instead of over $B_j$.

### 5.3 Structured attention and constraints

Recently, there has been an interest in the design of structured attention mechanisms, which, as we will now show, can be derived and furthermore generalized using the results in this paper. The first such mechanism is the sparsemax of Martins and Astudillo [33]. It takes as input a vector and projects it onto the probability simplex, which is the base polytope corresponding to $G(A) = \min\{|A|, 1\}$. Concurrently with this work, Niculae and Blondel [32] have analyzed the following problem

$$y^\star = \arg\min_{y \in B(G)} f(y) + \tfrac{1}{2}\|y - z\|^2 \tag{18}$$

for the special case when $B(G)$ is the simplex and $f$ is the Lovász extension of one of two specific submodular functions. We will consider the general case when $G$ can be any concave-of-cardinality function and $F$ is an arbitrary submodular function. Note that, if either $f(y)$ or the constraint were not present in problem (18), we could have simply leveraged the theory we have developed so far to differentiate it. Fortunately, as done by Niculae and Blondel [32], we can utilize the result of Yu [34] to significantly simplify (18). Namely, because projection onto $B(G)$ preserves the order of the coordinates [35, Lemma 1], we can write the optimal solution $y^\star$ of (18) as

$$y^\star = \arg\min_{x \in B(G)} \tfrac{1}{2}\|x - y'\|^2, \qquad \text{where} \qquad y' = \arg\min_{y} f(y) + \tfrac{1}{2}\|y - z\|^2.$$

We can hence split problem (18) into two subtasks: first compute $y'$, and then project it onto $B(G)$. As each operation reduces to a minimum-norm problem, we can differentiate each of them separately, and thus solve (18) by stacking two submodular layers one after the other.

## 6 Experiments

|              | CNN (mean) | CNN (std. dev.) | CNN+GC (mean) | CNN+GC (std. dev.) |
|--------------|-----------:|----------------:|--------------:|-------------------:|
| Accuracy     | 0.8103     | 0.1391          | 0.9121        | 0.1034             |
| NLL          | 0.3919     | 0.1911          | 0.2681        | 0.2696             |
| # Fg. Objs.  | 96.9       | 65.8            | 25.3          | 30.6               |

*Figure 1: Test set results. We see that incorporating a graph-cut solver improves both the accuracy and the negative log-likelihood (NLL), while producing consistent segmentations with fewer foreground objects.*

We consider the image segmentation task from the introduction, where we are given an RGB image $x \in \mathbb{R}^{n \times 3}$ and are supposed to predict the pixels $y \in \{0, 1\}^n$ containing the foreground object. We used the Weizmann horse segmentation dataset [36], which we split into training, validation and test splits of sizes 180, 50 and 98 respectively. The implementation was done in PyTorch (the code will be made available at https://www.github.com/josipd/nips-17-experiments), and to compute the min-norm point we used the algorithm from [37]. To make the problem more challenging, at training time we randomly selected and revealed only 0.1% of the training set labels. We first trained a convolutional neural network with two hidden layers that directly predicts the per-pixel labels, which we refer to as CNN. Then, we added a second model, which we call CNN+GC, that has the same architecture as the first one, but with an additional graph-cut layer, whose weights are parametrized by a convolutional neural network with one hidden layer. Details about the architectures can be found in the supplementary material. We train the models by maximizing the log-likelihood of the revealed pixels, which corresponds to the variational bi-level strategy (eq. (3)) due to Lemma 2.
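A minimal sketch of this training objective (not the paper's code), assuming marginals obtained from the min-norm point as in Lemma 2; the function and tensor names are illustrative.

```python
import torch
import torch.nn.functional as nnf

def revealed_pixel_nll(y_min_norm, labels, revealed_mask):
    """Average negative log-likelihood of the revealed pixels under the factorized
    distribution of Lemma 2, where Q(pixel i is foreground) = sigmoid(-y_i)."""
    marginals = torch.sigmoid(-y_min_norm)        # per-pixel foreground probabilities
    nll = nnf.binary_cross_entropy(marginals, labels.float(), reduction="none")
    mask = revealed_mask.float()
    return (nll * mask).sum() / mask.sum()
```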
We trained using SGD, Adagrad [38] and Adam [39], and chose the model with the best validation score. As evident from the results presented in Figure 1, adding the discrete layer improves not only the accuracy (after thresholding the marginals at 0.5) and the log-likelihood, but it also gives more coherent results, as it makes predictions with fewer connected components (i.e., foreground objects). Moreover, if we have a look at the predictions themselves in Figure 2, we can observe that the optimization layer not only removes spurious predictions, but there is also a qualitative difference in the marginals, as they are spatially more consistent.

*Figure 2: Comparison of results from both models on four instances from the test set (top: CNN, bottom: CNN+GC). We can see that adding the graph-cut layer helps not only quantitatively, but also qualitatively, as the predictions are more spatially regular and vary smoothly inside the segments.*

## 7 Conclusion

We have analyzed the sensitivity of the min-norm point for parametric submodular functions and provided both a very easy-to-implement, practical approximate algorithm for general objectives, and a strong theoretical result characterizing the true directional derivatives for mixtures. These results allow the use of submodular minimization inside modern deep architectures, and they are also immediately applicable to bi-level variational learning of log-supermodular models of arbitrarily high order. Moreover, we believe that the theoretical results open the new problem of developing algorithms that can compute not only the min-norm point, but also the associated derivatives.

**Acknowledgements.** The research was partially supported by ERC StG 307036 and a Google European PhD Fellowship.

## References

[1] S. Fujishige. *Submodular Functions and Optimization*. Annals of Discrete Mathematics, vol. 58. 2005.
[2] F. Bach. "Learning with submodular functions: a convex optimization perspective". *Foundations and Trends in Machine Learning* 6.2-3 (2013).
[3] L. Lovász. "Submodular functions and convexity". *Mathematical Programming: The State of the Art*. Springer, 1983, pp. 235–257.
[4] Y. Boykov and V. Kolmogorov. "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision". *IEEE Transactions on Pattern Analysis and Machine Intelligence* 26.9 (2004), pp. 1124–1137.
[5] P. Kohli, L. Ladicky, and P. H. Torr. "Robust higher order potentials for enforcing label consistency". *Computer Vision and Pattern Recognition (CVPR)*. 2008.
[6] M. Narasimhan, N. Jojic, and J. A. Bilmes. "Q-clustering". *Advances in Neural Information Processing Systems (NIPS)*. 2006, pp. 979–986.
[7] J. Djolonga and A. Krause. "From MAP to Marginals: Variational Inference in Bayesian Submodular Models". *Advances in Neural Information Processing Systems (NIPS)*. 2014.
[8] J. Djolonga and A. Krause. "Scalable Variational Inference in Log-supermodular Models". *International Conference on Machine Learning (ICML)*. 2015.
[9] F. R. Bach. "Shaping level sets with submodular functions". *Advances in Neural Information Processing Systems (NIPS)*. 2011.
[10] F. R. Bach. "Structured sparsity-inducing norms through submodular functions". *Advances in Neural Information Processing Systems (NIPS)*. 2010.
[11] S. Fujishige, T. Hayashi, and S. Isotani. *The Minimum-Norm-Point Algorithm Applied to Submodular Function Minimization and Linear Programming*. Research Institute for Mathematical Sciences (RIMS), Kyoto University, 2006.
[12] R. T. Rockafellar and R. J.-B. Wets. *Variational Analysis*. Vol. 317. Springer Science & Business Media, 2009.
[13] K. Kunisch and T. Pock. "A bilevel optimization approach for parameter learning in variational models". *SIAM Journal on Imaging Sciences* 6.2 (2013), pp. 938–983.
[14] P. Ochs, R. Ranftl, T. Brox, and T. Pock. "Bilevel optimization with nonsmooth lower level problems". *International Conference on Scale Space and Variational Methods in Computer Vision*. Springer, 2015, pp. 654–665.
[15] J. Domke. "Learning graphical model parameters with approximate marginal inference". *IEEE Transactions on Pattern Analysis and Machine Intelligence* 35.10 (2013), pp. 2454–2467.
[16] M. F. Tappen. "Utilizing variational optimization to learn Markov random fields". *Computer Vision and Pattern Recognition (CVPR)*. 2007.
[17] M. J. Wainwright. "Estimating the wrong graphical model: Benefits in the computation-limited setting". *Journal of Machine Learning Research (JMLR)* 7 (2006).
[18] B. Amos and J. Z. Kolter. "OptNet: Differentiable Optimization as a Layer in Neural Networks". *International Conference on Machine Learning (ICML)*. 2017.
[19] J. C. Boot. "On sensitivity analysis in convex quadratic programming problems". *Operations Research* 11.5 (1963), pp. 771–786.
[20] K. Kumar and F. Bach. "Active-set Methods for Submodular Minimization Problems". arXiv preprint arXiv:1506.02852 (2015).
[21] N. Chakravarti. "Sensitivity analysis in isotonic regression". *Discrete Applied Mathematics* 45.3 (1993), pp. 183–196.
[22] B. Dolhansky and J. Bilmes. "Deep Submodular Functions: Definitions and Learning". *Advances in Neural Information Processing Systems (NIPS)*. Barcelona, Spain, Dec. 2016.
[23] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. "Large margin methods for structured and interdependent output variables". *Journal of Machine Learning Research (JMLR)* 6 (2005), pp. 1453–1484.
[24] B. Taskar, C. Guestrin, and D. Koller. "Max-margin Markov networks". *Advances in Neural Information Processing Systems (NIPS)*. 2004, pp. 25–32.
[25] J. Edmonds. "Submodular functions, matroids, and certain polyhedra". *Combinatorial Structures and Their Applications* (1970), pp. 69–87.
[26] M. J. Best and N. Chakravarti. "Active set algorithms for isotonic regression; a unifying framework". *Mathematical Programming* 47.1-3 (1990), pp. 425–439.
[27] T. Robertson, F. T. Wright, and R. L. Dykstra. *Order Restricted Statistical Inference*. Wiley, 1988.
[28] M. Queyranne. "Minimizing symmetric submodular functions". *Mathematical Programming* 82.1-2 (1998), pp. 3–12.
[29] A. Shapiro. "Sensitivity Analysis of Nonlinear Programs and Differentiability Properties of Metric Projections". *SIAM Journal on Control and Optimization* 26.3 (1988), pp. 628–645.
[30] H. Groenevelt. "Two algorithms for maximizing a separable concave function over a polymatroid feasible region". *European Journal of Operational Research* 54.2 (1991).
[31] A. Schrijver. "A combinatorial algorithm minimizing submodular functions in strongly polynomial time". *Journal of Combinatorial Theory, Series B* 80.2 (2000), pp. 346–355.
[32] V. Niculae and M. Blondel. "A Regularized Framework for Sparse and Structured Neural Attention". arXiv preprint arXiv:1705.07704 (2017).
[33] A. Martins and R. Astudillo. "From softmax to sparsemax: A sparse model of attention and multi-label classification". *International Conference on Machine Learning (ICML)*. 2016.
[34] Y.-L. Yu. "On decomposing the proximal map". *Advances in Neural Information Processing Systems (NIPS)*. 2013, pp. 91–99.
[35] D. Suehiro, K. Hatano, S. Kijima, E. Takimoto, and K. Nagano. "Online prediction under submodular constraints". *International Conference on Algorithmic Learning Theory*. 2012.
[36] E. Borenstein and S. Ullman. "Combined top-down/bottom-up segmentation". *IEEE Transactions on Pattern Analysis and Machine Intelligence* 30.12 (2008), pp. 2109–2125.
[37] A. Barbero and S. Sra. "Modular proximal optimization for multidimensional total-variation regularization". arXiv preprint arXiv:1411.0589 (2014).
[38] J. Duchi, E. Hazan, and Y. Singer. "Adaptive subgradient methods for online learning and stochastic optimization". *Journal of Machine Learning Research (JMLR)* 12 (2011), pp. 2121–2159.
[39] D. Kingma and J. Ba. "Adam: A method for stochastic optimization". arXiv preprint arXiv:1412.6980 (2014).