# Deep Submodular Functions: Definitions & Learning

Brian Dolhansky, Dept. of Computer Science and Engineering, University of Washington, Seattle, WA 98105
Jeff Bilmes, Dept. of Electrical Engineering, University of Washington, Seattle, WA 98105

We propose and study a new class of submodular functions called deep submodular functions (DSFs). We define DSFs and situate them within the broader context of classes of submodular functions in relationship both to various matroid ranks and sums of concave composed with modular functions (SCMs). Notably, we find that DSFs constitute a strictly broader class than SCMs, thus motivating their use, but that they do not comprise all submodular functions. Interestingly, some DSFs can be seen as special cases of certain deep neural networks (DNNs), hence the name. Finally, we provide a method to learn DSFs in a max-margin framework, and offer preliminary results applying this both to synthetic and real-world data instances.

## 1 Introduction

Submodular functions are attractive models of many physical processes primarily because they possess an inherent naturalness to a wide variety of problems (e.g., they are good models of diversity, information, and cooperative costs) while at the same time they enjoy properties sufficient for efficient optimization. For example, submodular functions can be minimized without constraints in polynomial time [12] even though they lie within a $2^n$-dimensional cone in $\mathbb{R}^{2^n}$. Moreover, while submodular function maximization is NP-hard, submodular maximization is one of the easiest of the NP-hard problems since constant factor approximation algorithms are often available, e.g., in the cardinality constrained case, the classic $1 - 1/e$ result of Nemhauser [21] via the greedy algorithm. Other problems also have guarantees, such as submodular maximization subject to knapsack or multiple matroid constraints [8, 7, 18, 15, 16].

One of the critical problems associated with utilizing submodular functions in machine learning contexts is selecting which submodular function to use, and given that submodular functions lie in such a vast space with $2^n$ degrees of freedom, it is a non-trivial task to find one that works well, if not optimally. One approach is to attempt to learn the submodular function based on either queries of some form or based on data. This has led to results, mostly in the theory community, showing how learning submodularity can be harder or easier depending on how we judge what is being learnt. For example, it was shown that learning submodularity in the PMAC setting is fairly hard [2], although in some cases things are a bit easier [11]. In both of these cases, learning is over all points in the hypercube. Learning can be made easier if we restrict ourselves to learn within only a subfamily of submodular functions. For example, in [24, 19], it is shown that one can learn mixtures of submodular functions using a max-margin learning framework; here the components of the mixture are fixed and it is only the mixture parameters that are learnt, leading often to a convex optimization problem. In some cases, computing gradients of the convex problem can be done using submodular maximization [19], while in other cases, even a gradient requires minimizing a difference of two submodular functions [27]. Learning over restricted families rather than over the entire cone is desirable for the same reasons that any form of regularization in machine learning is useful.
By restricting the family over which learning occurs, it decreases the complexity of the learning problem, thereby increasing the chance that one finds a good model within that family. This can be seen as a classic bias-variance tradeoff, where increasing bias can reduce variance.

Figure 1: Left: a layered DSF with $K = 3$ layers (ground set $V = V^{(0)}$, layers $V^{(1)}$ and $V^{(2)}$ of meta-features, a final feature, and weight matrices $w^{(1)}, w^{(2)}, w^{(3)}$). Right: a 3-block DSF allowing layer skipping.

Up to now, learning over restricted families has apparently (to the authors' knowledge) been limited to learning mixtures over fixed components. This can be quite limited if the components are restricted, and if not, might require a very large number of components. Therefore, there is a need for a richer and more flexible family of submodular functions over which learning is still possible.

In this paper (and in [5]), we introduce a new family of submodular functions that we term deep submodular functions, or DSFs. DSFs strictly generalize, as we show below, many of the kinds of submodular functions that are useful in machine learning contexts. These include the so-called decomposable submodular functions, namely those that can be represented as a sum of concave composed with modular functions [25]. We describe the family of DSFs and place them in the context of the general submodular family. In particular, we show that DSFs strictly generalize standard decomposable functions, thus theoretically motivating the use of deeper networks as a family over which to learn. Moreover, DSFs can represent a variety of complex submodular functions such as laminar matroid rank functions. These matroid rank functions include the truncated matroid rank function [13] that is often used to show theoretical worst-case performance for many constrained submodular minimization problems. We also show, somewhat surprisingly, that like decomposable functions, DSFs are unable to represent all possible cycle matroid rank functions. This is interesting in and of itself since there are laminar matroids that are not cycle matroids. On the other hand, we show that the more general DSFs share a variety of useful properties with decomposable functions. Namely, they: (1) can leverage the vast amount of practical work on feature engineering that occurs in the machine learning community and its applications; (2) can operate on multi-modal data if the data can be featurized in the same space; (3) allow for training and testing on distinct sets since we can learn a function from the feature representation level on up, similar to the work in [19]; and (4) are useful for streaming applications since functions can be evaluated without requiring knowledge of the entire ground set. These advantages are made apparent in Section 2.

Interestingly, DSFs also share certain properties with deep neural networks (DNNs), which have become widely popular in the machine learning community. For example, a DNN with weights that are strictly non-negative corresponds to a DSF. This suggests, as we show in Section 5, that it is possible to develop a learning framework over DSFs leveraging DNN learning frameworks. Unlike standard deep neural networks, which typically are trained either in classification or regression frameworks, however, learning submodularity often takes the form of trying to adjust the parameters so that a set of summary data sets are offered a high value.
We therefore extend the max-margin learning framework of [24, 19] to apply to DSFs. Our approach can be seen as a max-margin learning approach for DNNs but restricted to DSFs. We show that DSFs can be learnt effectively in a variety of contexts (Section 6). In the below, we discuss basic definitions and an initial implementation of learning DSFs, while in [5] we provide complete definitions, properties, relationships to concavity, proofs, and a set of applications.

## 2 Background

Submodular functions are discrete set functions that have the property of diminishing returns. Assume a given finite size-$n$ set of objects $V$ (the large ground set of data items), where each $v \in V$ is a distinct data sample (e.g., a sentence, training pair, image, video, or even a highly structured object such as a tree or a graph). A valuation set function $f : 2^V \to \mathbb{R}$ that returns a real value for any subset $X \subseteq V$ is said to be submodular if for all $X \subseteq Y$ and $v \notin Y$ the following inequality holds: $f(X \cup \{v\}) - f(X) \ge f(Y \cup \{v\}) - f(Y)$. This means that the incremental value (or gain) of adding another sample $v$ to a subset decreases when the context in which $v$ is considered grows from $X$ to $Y$. We can define the gain of $v$ in the context of $X$ as $f(v|X) \triangleq f(X \cup \{v\}) - f(X)$. Thus, $f$ is submodular if $f(v|X) \ge f(v|Y)$. If the gain of $v$ is identical for all different contexts, i.e., $f(v|X) = f(v|Y)$ for all $X, Y \subseteq V$, then the function is said to be modular. A function might also have the property of being normalized ($f(\emptyset) = 0$) and monotone non-decreasing ($f(X) \le f(Y)$ whenever $X \subseteq Y$). If the negation of $f$, namely $-f$, is submodular, then $f$ is called supermodular.

A useful class of submodular functions in machine learning are decomposable functions [25], and one example of useful instances of these for applications are called feature-based functions. Given a set of non-negative monotone non-decreasing normalized ($\phi_i(0) = 0$) concave functions $\phi_i : \mathbb{R}_+ \to \mathbb{R}_+$ and a corresponding set of non-negative modular functions $m_i : V \to \mathbb{R}_+$, the function $f : 2^V \to \mathbb{R}_+$ defined as $f(X) = \sum_i \phi_i(m_i(X)) = \sum_i \phi_i(\sum_{x \in X} m_i(x))$ is known to be submodular. Such functions have been called decomposable in the past, but in this work we will refer to them as the family of sums of concave over modular functions (SCMs). SCMs have been shown to be quite flexible [25], being able to represent a diverse set of functions such as graph cuts, set cover functions, and multiclass queuing system functions, and they yield efficient algorithms for minimization [25, 22]. Such functions are useful also for applications involving maximization. Suppose that each element $v \in V$ is associated with a set of features $U$ in the sense of, say, how TF-IDF is used in natural language processing (NLP). Feature-based submodular functions are those defined via the set of features. Feature-based functions take the form $f(X) = \sum_{u \in U} w_u \phi_u(m_u(X))$, where $\phi_u$ is a non-decreasing non-negative univariate normalized concave function, $m_u(X)$ is a feature-specific non-negative modular function, and $w_u$ is a non-negative feature weight. The result is the class of feature-based submodular functions (instances of SCMs). Such functions have been successfully used for data summarization [29]. Another advantage of such functions is that they do not require the construction of a pairwise graph and therefore do not have quadratic cost as would, say, a facility location function (e.g., $f(X) = \sum_{v \in V} \max_{x \in X} w_{xv}$), or any function based on pair-wise distances, all of which have cost $O(n^2)$ to evaluate.
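To make the feature-based construction concrete, the following is a minimal Python sketch of evaluating a feature-based SCM; the square-root concavity and the toy feature counts are illustrative assumptions rather than anything prescribed in the paper.

```python
import math

def feature_based_scm(X, feature_counts, weights, phi=math.sqrt):
    """Evaluate f(X) = sum_u w_u * phi(m_u(X)) for a feature-based SCM.

    feature_counts[v][u] holds the non-negative count m_u(v) of feature u
    in element v; m_u(X) = sum_{v in X} m_u(v) is modular in X.
    phi must be a normalized, non-decreasing concave function on R_+.
    """
    # Accumulate the modular value m_u(X) for every feature u.
    m = {}
    for v in X:
        for u, count in feature_counts[v].items():
            m[u] = m.get(u, 0.0) + count
    # Apply the concave function per feature and take the weighted sum.
    return sum(weights.get(u, 1.0) * phi(m_u) for u, m_u in m.items())

# Toy (hypothetical) data: sentences as bags of unigram counts.
counts = {
    "s1": {"dog": 2, "park": 1},
    "s2": {"dog": 1, "ball": 3},
    "s3": {"cat": 2},
}
w = {"dog": 1.0, "park": 0.5, "ball": 0.5, "cat": 1.0}
print(feature_based_scm({"s1", "s2"}, counts, w))  # diminishing returns on the shared "dog" feature
```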
Feature functions have an evaluation cost of $O(n|U|)$, linear in the ground set $V$ size, and therefore are more scalable to large data set sizes. Finally, unlike the facility location and other graph-based functions, feature-based functions do not require the use of the entire ground set for each evaluation and hence are appropriate for streaming algorithms [1, 9] where future ground elements are unavailable. Defining $\psi : \mathbb{R}^n \to \mathbb{R}$ as $\psi(x) = \sum_i \phi_i(\langle m_i, x \rangle)$, we get a monotone non-decreasing concave function, which we refer to as a univariate sum of concaves (USC).

## 3 Deep Submodular Functions

While feature-based submodular functions are indisputably useful, their weakness lies in that the features themselves may not interact, although one feature $u$ might be partially redundant with another feature $u'$. For example, when describing a sentence via its component $n$-gram features, higher-order $n$-grams always include lower-order $n$-grams, so $n$-gram features are partially redundant. We may address this problem by utilizing an additional layer of nested concave functions, as in $f(X) = \sum_{s \in S} \omega_s \phi_s\big(\sum_{u \in U} w_{s,u} \phi_u(m_u(X))\big)$, where $S$ is a set of meta-features, $\omega_s$ is a meta-feature weight, $\phi_s$ is a non-decreasing concave function associated with meta-feature $s$, and $w_{s,u}$ is now a meta-feature-specific feature weight. With this construct, $\phi_s$ assigns a discounted value to the set of features in $U$, which can be used to represent feature redundancy. Interactions between the meta-features might be needed as well, and this can be done via meta-meta-features, and so on, resulting in a hierarchy of increasingly higher-level features.

We hence propose a new class of submodular functions that we call deep submodular functions (DSFs). They may make use of a series of disjoint sets (see Figure 1, left): $V = V^{(0)}$, which is the function's ground set, and additional sets $V^{(1)}, V^{(2)}, \dots, V^{(K)}$. $U = V^{(1)}$ can be seen as a set of features, $V^{(2)}$ as a set of meta-features, $V^{(3)}$ as a set of meta-meta-features, etc., up to $V^{(K)}$. The size of $V^{(i)}$ is $d_i = |V^{(i)}|$. Two successive sets (or layers) $i-1$ and $i$ are connected by a matrix $w^{(i)} \in \mathbb{R}_+^{d_i \times d_{i-1}}$, for $i \in \{1, \dots, K\}$. Given $v^i \in V^{(i)}$, define $w^{(i)}_{v^i}$ to be the row of $w^{(i)}$ corresponding to element $v^i$, so that $w^{(i)}_{v^i}(v^{i-1})$ is the element of matrix $w^{(i)}$ at row $v^i$ and column $v^{i-1}$. We may think of $w^{(i)}_{v^i} : V^{(i-1)} \to \mathbb{R}_+$ as a modular function defined on set $V^{(i-1)}$. Thus, this matrix contains $d_i$ such modular functions. Further, let $\phi_{v^k} : \mathbb{R}_+ \to \mathbb{R}_+$ be a non-negative non-decreasing concave function. Then, a $K$-layer DSF $f : 2^V \to \mathbb{R}_+$ can be expressed as follows, for any $A \subseteq V$:

$$f(A) = \phi_{v^K}\Bigg( \sum_{v^{K-1} \in V^{(K-1)}} w^{(K)}_{v^K}(v^{K-1}) \, \phi_{v^{K-1}}\bigg( \cdots \sum_{v^2 \in V^{(2)}} w^{(3)}_{v^3}(v^2) \, \phi_{v^2}\Big( \sum_{v^1 \in V^{(1)}} w^{(2)}_{v^2}(v^1) \, \phi_{v^1}\big( \sum_{a \in A} w^{(1)}_{v^1}(a) \big) \Big) \cdots \bigg) \Bigg) \qquad (1)$$

Submodularity follows since the composition $g(\cdot) = \phi(f(\cdot))$ of a monotone non-decreasing submodular function $f$ and a monotone non-decreasing concave function $\phi$ is submodular (Theorem 1 in [20]); a DSF is submodular via recursive application and since submodularity is closed under conic combinations.
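As an illustration of Equation (1), here is a minimal Python sketch that evaluates a layered DSF given non-negative weight matrices and elementwise concave activations; the layer sizes, square-root activations, and random weights are assumptions chosen only for the example.

```python
import numpy as np

def dsf_eval(A_indicator, weight_mats, phi=np.sqrt):
    """Evaluate a K-layer DSF f(A) per Equation (1).

    A_indicator: 0/1 vector of length |V| (the characteristic vector 1_A).
    weight_mats: list [w1, ..., wK], where wi has shape (d_i, d_{i-1})
                 and non-negative entries.
    phi: elementwise non-negative, non-decreasing concave function.
    """
    x = np.asarray(A_indicator, dtype=float)
    for w in weight_mats:
        assert np.all(w >= 0), "DSF weights must be non-negative"
        x = phi(w @ x)  # modular aggregation followed by a concave squash
    return float(x.sum())  # the final layer has d_K = 1 in the layered case

# Hypothetical sizes: |V| = 5 ground elements, 4 features, 2 meta-features, 1 output.
rng = np.random.default_rng(0)
W = [rng.random((4, 5)), rng.random((2, 4)), rng.random((1, 2))]
print(dsf_eval([1, 1, 0, 0, 1], W))
```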
A more general way to define a DSF (useful for the theorems below) uses recursion directly. We are given a directed acyclic graph (DAG) $G = (\mathcal{V}, E)$, where for any given node $v \in \mathcal{V}$ we say $\mathrm{pa}(v) \subseteq \mathcal{V}$ are the parents of (vertices pointing towards) $v$. A given size-$n$ subset of nodes $V \subseteq \mathcal{V}$ corresponds to the ground set, and for any $v \in V$, $\mathrm{pa}(v) = \emptyset$. A particular root node $r \in \mathcal{V} \setminus V$ has the distinction that $r \notin \mathrm{pa}(q)$ for any $q \in \mathcal{V}$. Given a non-ground node $v \in \mathcal{V} \setminus V$, we define the concave function $\psi_v : \mathbb{R}^V \to \mathbb{R}_+$ as

$$\psi_v(x) = \phi_v\Bigg( \sum_{u \in \mathrm{pa}(v) \setminus V} w_{uv}\,\psi_u(x) + \langle m_v, x \rangle \Bigg), \qquad (2)$$

where $\phi_v : \mathbb{R}_+ \to \mathbb{R}_+$ is a non-decreasing univariate concave function, $w_{uv} \in \mathbb{R}_+$, and $m_v \in \mathbb{R}_+^{\mathrm{pa}(v) \cap V}$ is a non-negative linear function that evaluates as $\langle m_v, x \rangle = \sum_{u \in \mathrm{pa}(v) \cap V} m_v(u) x(u)$ (i.e., $\langle m_v, x \rangle$ is a sparse dot-product over elements $\mathrm{pa}(v) \cap V \subseteq V$). In the base case, where $\mathrm{pa}(v) \subseteq V$, we therefore have $\psi_v(x) = \phi_v(\langle m_v, x \rangle)$, so $\psi_v(1_A)$ is an SCM function with only one term in the sum ($1_A$ is the characteristic vector of set $A$). A general DSF is defined as follows: for all $A \subseteq V$, $f(A) = \psi_r(1_A) + m_\pm(A)$, where $m_\pm : V \to \mathbb{R}$ is an arbitrary modular function (i.e., it may include positive and negative elements). From the perspective of defining a submodular function, there is no loss of generality by adding the final modular function $m_\pm$ to a monotone non-decreasing function; this is because any submodular function can be expressed as a sum of a monotone non-decreasing submodular function and a modular function [10].

This form of DSF is more general than the layered approach mentioned above which, in the current form, would partition $\mathcal{V} = \{V^{(0)}, V^{(1)}, \dots, V^{(K)}\}$ into layers, where for any $v \in V^{(i)}$, $\mathrm{pa}(v) \subseteq V^{(i-1)}$. Figure 1 (left) corresponds to a layered graph $G = (\mathcal{V}, E)$ where $r = v^3_1$ and $V = \{v^0_1, v^0_2, \dots, v^0_6\}$. Figure 1 (right) uses the same partitioning but where units are allowed to skip by more than one layer at a time. More generally, we can order the vertices in $\mathcal{V}$ with an order $\sigma$ so that $\{\sigma_1, \sigma_2, \dots, \sigma_n\} = V$ where $n = |V|$, $\sigma_m = r$ where $m = |\mathcal{V}|$, and where $\sigma_i \in \mathrm{pa}(\sigma_j)$ only if $i < j$. This allows an arbitrary pattern of skipping while maintaining submodularity.

The layered definition in Equation (1) is reminiscent of feed-forward deep neural networks (DNNs) owing to its multi-layered architecture. Interestingly, if one restricts the weights of a DNN at every layer to be non-negative, then for many standard hidden-unit activation functions the DNN constitutes a submodular function when given Boolean input vectors. The result follows for any activation function that is monotone non-decreasing and concave on the non-negative reals, such as the sigmoid, the hyperbolic tangent, and the rectified linear function. This suggests that DSFs can be trained in a fashion similar to DNNs, as is further developed in Section 5. The recursive definition of DSFs in Equation (2) is more general and, moreover, useful for the analysis in Section 4.

DSFs should be useful for many applications in machine learning. First, they retain the advantages of feature-based functions (i.e., they require neither $O(n^2)$ computation nor access to the entire ground set for evaluation). Hence, DSFs can be both fast and useful for streaming applications. Second, they allow for a nested hierarchy of features, similar to the advantages a deep model has over a shallow model. For example, a one-layer DSF must construct a valuation over a set of objects from a large number of low-level features, which can lead to fewer opportunities for feature sharing, while a deeper network fosters distributed representations, analogous to DNNs [3, 4]. Below, we show that DSFs constitute a strictly larger family of submodular functions than SCMs.
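Returning to the recursive DAG form in Equation (2), a small Python sketch of its evaluation follows; the dict-based graph encoding, the square-root concavities, and the toy DAG at the end are all hypothetical choices for illustration.

```python
import math

def psi(v, x, parents, w, m, phi=math.sqrt, ground=frozenset()):
    """Recursively evaluate psi_v(x) per Equation (2).

    parents[v]: list of parent nodes of v; ground: the ground-set nodes V.
    w[(u, v)]: non-negative weight on edge u -> v for internal parents u.
    m[v]: dict mapping ground parents u of v to non-negative weights m_v(u).
    x: dict mapping ground elements to coordinates (e.g., the vector 1_A).
    """
    total = 0.0
    for u in parents[v]:
        if u in ground:
            total += m[v].get(u, 0.0) * x.get(u, 0.0)  # sparse <m_v, x> term
        else:
            total += w[(u, v)] * psi(u, x, parents, w, m, phi, ground)
    return phi(total)

def dsf(A, root, parents, w, m, m_pm=None, **kw):
    """f(A) = psi_r(1_A) + m_pm(A), with m_pm an arbitrary modular function."""
    x = {a: 1.0 for a in A}
    extra = sum((m_pm or {}).get(a, 0.0) for a in A)
    return psi(root, x, parents, w, m, **kw) + extra

# Tiny hypothetical DAG: root r with one internal node c over ground set {a, b}.
parents = {"r": ["c", "b"], "c": ["a", "b"], "a": [], "b": []}
w = {("c", "r"): 1.0}
m = {"r": {"b": 0.5}, "c": {"a": 1.0, "b": 1.0}}
print(dsf({"a", "b"}, "r", parents, w, m, ground={"a", "b"}))
```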
## 4 Analysis of the DSF family

In this section, we formally place DSFs within the context of more general submodular functions. We show that DSFs strictly generalize SCMs while preserving many of their attractive attributes. We summarize the results of this section in Figure 2, which includes familial relationships amongst other classes of submodular functions (e.g., various matroid rank functions) that are useful for our main theorems.

Figure 2: Containment properties of the set of functions studied in this paper (cycle, partition, and laminar matroid rank functions within the class of all submodular functions).

Thanks to concave composition closure rules [6], the root function $\psi_r : \mathbb{R}^n \to \mathbb{R}$ in Eqn. (2) is a monotone non-decreasing multivariate concave function that, by the concave-submodular composition rule [20], yields a submodular function $\psi_r(1_A)$. It is widely known that any univariate concave function composed with a non-negative modular function yields a submodular function. However, given an arbitrary multivariate concave function this is not the case. Consider, for example, any concave function $\psi$ over $\mathbb{R}^2$ with the following evaluations: $\psi(0,0) = \psi(1,1) = 1$ and $\psi(0,1) = \psi(1,0) = 0$. Then $f(A) = \psi(1_A)$ is not submodular. Given a multivariate concave function $\psi : \mathbb{R}^n \to \mathbb{R}$, the superdifferential $\partial\psi(x)$ at $x$ is defined as
$$\partial\psi(x) = \{h \in \mathbb{R}^n : \psi(y) - \psi(x) \le \langle h, y \rangle - \langle h, x \rangle, \ \forall y \in \mathbb{R}^n\},$$
and a particular supergradient $h_x$ is a member $h_x \in \partial\psi(x)$. If $\psi$ is differentiable at $x$ then $\partial\psi(x) = \{\nabla\psi(x)\}$. A concave function is said to have an antitone superdifferential if for all $x \le y$ we have that $h_x \ge h_y$ for all $h_x \in \partial\psi(x)$ and $h_y \in \partial\psi(y)$ (here, $x \le y$ means $x(v) \le y(v)$ for all $v$). Apparently, the following result has not been previously reported.

Theorem 4.1. Let $\psi : \mathbb{R}^n \to \mathbb{R}$ be a concave function. If $\psi$ has an antitone superdifferential, then the set function $f : 2^V \to \mathbb{R}$ defined as $f(A) = \psi(1_A)$ for all $A \subseteq V$ is submodular.

A DSF's associated concave function has an antitone superdifferential. Concavity is not necessary in general (e.g., multilinear extensions of submodular functions are not concave but have properties analogous to an antitone superdifferential; see [5]).

Lemma 4.3. Composition of monotone non-decreasing scalar concave and antitone-superdifferential concave functions, and conic combinations thereof, preserves superdifferential antitonicity.

Corollary 4.3.1. The concave function $\psi_r$ associated with a DSF has an antitone superdifferential.

A matroid $M$ [12] is a set system $M = (V, \mathcal{I})$ where $\mathcal{I} = \{I_1, I_2, \dots\}$ is a set of subsets $I_i \subseteq V$ that are called independent. A matroid has the properties that $\emptyset \in \mathcal{I}$, that $\mathcal{I}$ is subclusive (i.e., given $I \in \mathcal{I}$ and $I' \subseteq I$, then $I' \in \mathcal{I}$), and that all maximally independent sets have the same size (i.e., given $A, B \in \mathcal{I}$ with $|A| < |B|$, there exists a $b \in B \setminus A$ such that $A + b \in \mathcal{I}$). The rank of a matroid, a set function $r : 2^V \to \mathbb{Z}_+$ defined as $r(A) = \max_{I \in \mathcal{I}} |I \cap A|$, is a powerful class of submodular functions. All matroids are uniquely defined by their rank function. All monotone non-decreasing non-negative rational submodular functions can be represented by grouping and then evaluating grouped ground elements in a matroid [12]. A particularly useful matroid is the partition matroid, where a partition $(V_1, V_2, \dots, V_\ell)$ of $V$ is formed, along with a set of capacities $k_1, k_2, \dots, k_\ell \in \mathbb{Z}_+$. Its rank function is defined as $r(X) = \sum_{i=1}^{\ell} \min(|X \cap V_i|, k_i)$ and, therefore, is an SCM, owing to the fact that $\phi_i(x) = \min(\langle x, 1_{V_i} \rangle, k_i)$ is a USC.
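For concreteness, here is a short Python sketch of a partition matroid rank function written directly in its SCM form; the partition blocks and capacities below are hypothetical.

```python
def partition_matroid_rank(X, blocks, caps):
    """r(X) = sum_i min(|X ∩ V_i|, k_i): an SCM whose concavities are truncations."""
    X = set(X)
    return sum(min(len(X & set(Vi)), ki) for Vi, ki in zip(blocks, caps))

# Hypothetical partition of V = {0,...,5} into three blocks with capacities 1, 2, 1.
blocks = [{0, 1}, {2, 3, 4}, {5}]
caps = [1, 2, 1]
print(partition_matroid_rank({0, 1, 2}, blocks, caps))  # -> 1 + 1 + 0 = 2
```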
A cycle matroid is a different type of matroid based on a graph $G = (V, E)$, where the rank $r(A)$ for $A \subseteq E$ is defined as the size of the maximum spanning forest (i.e., a spanning tree for each connected component) in the edge-induced subgraph $G_A = (V, A)$. From the perspective of matroids, we can consider classes of submodular functions via their rank. If a given type of matroid cannot represent another kind, their ranks lie in distinct families. To study where DSFs are situated in the space of all submodular functions, it is useful first to study results regarding matroid rank functions.

Lemma 4.4. There are partition matroids that are not cycle matroids.

In a laminar matroid, a generalization of a partition matroid, we start with a set $V$ and a family $\mathcal{F} = \{F_1, F_2, \dots\}$ of subsets $F_i \subseteq V$ that is laminar, namely that for all $i \ne j$ either $F_i \cap F_j = \emptyset$, or $F_i \subseteq F_j$, or $F_j \subseteq F_i$ (i.e., sets in $\mathcal{F}$ are either non-intersecting or comparable). In a laminar matroid, we also have for every $F \in \mathcal{F}$ an associated capacity $k_F \in \mathbb{Z}_+$. A set $I$ is independent if $|I \cap F| \le k_F$ for all $F \in \mathcal{F}$. A laminar family of sets can be organized in a tree, where there is one root $R \in \mathcal{F}$ in the tree that, w.l.o.g., can be $V$ itself. Then the immediate parents $\mathrm{pa}(F) \subseteq \mathcal{F}$ of a set $F \in \mathcal{F}$ in the tree are the set of maximal proper subsets of $F$ in $\mathcal{F}$, i.e., $\mathrm{pa}(F) = \{F' \in \mathcal{F} : F' \subset F \text{ and } \nexists F'' \in \mathcal{F} \text{ s.t. } F' \subset F'' \subset F\}$. We then define the following for all $F \in \mathcal{F}$:
$$r_F(A) = \min\Bigg( \sum_{F' \in \mathrm{pa}(F)} r_{F'}(A \cap F') + \Big|A \setminus \bigcup_{F' \in \mathrm{pa}(F)} F'\Big|, \ k_F \Bigg). \qquad (3)$$
A laminar matroid rank has the recursive definition $r(A) = r_R(A) = r_V(A)$. Hence, if the family $\mathcal{F}$ forms a partition of $V$, we have a partition matroid. More interestingly, when compared to Eqn. (2), we see that a laminar matroid rank function is an instance of a DSF with a tree-structured DAG rather than the non-tree DAGs in Figure 1. Thus, within the family of DSFs lie the truncated matroid rank functions used to show information-theoretic hardness for many constrained submodular optimization problems [13]. Moreover, laminar matroids strictly generalize partition matroids.

Lemma 4.5. Laminar matroids strictly generalize partition matroids.

Since a laminar matroid generalizes a partition matroid, this portends well for DSFs generalizing SCMs. Before considering that, we are already up against some limits of laminar matroids, i.e.:

Lemma 4.6 (peeling proof). Laminar matroids cannot represent all cycle matroids.

We call this proof a peeling proof since it recursively peels off each layer (in the sense of a DSF) of a laminar matroid rank until it boils down to a partition matroid rank function, where the base case is clear. The proof is elucidating, moreover, since it motivates the proof of Theorem 4.14 showing that DSFs extend SCMs. We also have the immediate corollary.

Corollary 4.6.1. Partition matroids cannot represent all cycle matroids.

We see that SCMs generalize partition matroid rank functions and DSFs generalize laminar matroid rank functions. We might expect, from the above results, that DSFs generalize SCMs, but this is not immediately obvious since SCMs are significantly more flexible than partition matroid rank functions because: (1) the concave functions need not be simple truncations at integers; (2) each term can have its own non-negative modular function; (3) there is no requirement to partition the ground elements over terms in an SCM; and (4) we may with relative impunity extend the family of SCMs to one where we add an additional arbitrary modular function (what we will call SCMMs below).
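Returning to the laminar rank recursion in Equation (3), a recursive Python sketch follows; the tree encoding via child lists, the specific laminar family, and the capacities are made-up illustrations.

```python
def laminar_rank(A, F, pa, caps, root):
    """Recursive laminar matroid rank r_F(A) per Equation (3).

    F: dict mapping a set-name to its member elements (a Python set).
    pa: dict mapping a set-name to the names of its maximal proper subsets in
        the laminar family (what the text calls its immediate "parents").
    caps: dict mapping a set-name to its capacity k_F.
    """
    A = set(A) & F[root]
    covered = set().union(*(F[c] for c in pa[root])) if pa[root] else set()
    inner = sum(laminar_rank(A & F[c], F, pa, caps, c) for c in pa[root])
    return min(inner + len(A - covered), caps[root])

# Hypothetical laminar family on V = {0,...,5}: root R = V with two nested blocks.
F = {"R": set(range(6)), "B1": {0, 1, 2}, "B2": {3, 4}}
pa = {"R": ["B1", "B2"], "B1": [], "B2": []}
caps = {"R": 3, "B1": 1, "B2": 2}
print(laminar_rank({0, 1, 3, 5}, F, pa, caps, "R"))  # -> min(1 + 1 + 1, 3) = 3
```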
We see, however, that SCMMs are also unable to represent the cycle matroid rank function over $K_4$, very much like the partition matroid rank function. Hence the above flexibility does not help in this case. We then show that DSFs strictly generalize SCMMs, which means that DSFs indeed provide a richer family of submodular functions to utilize, ones that, as discussed above, retain many of the advantages of SCMMs. We end the section by showing that DSFs, even with an additional arbitrary modular function, are still unable to represent the cycle matroid rank over $K_4$, implying that although DSFs extend SCMMs, they cannot express all monotone non-decreasing submodular functions.

We define the family of sums of concave over modular functions with an additional modular term (SCMMs) as those taking the form $f(A) = \sum_i \phi_i(m_i(A)) + m_\pm(A)$, with each $\phi_i$ and $m_i$ as in an SCM, but where $m_\pm : 2^V \to \mathbb{R}$ is an arbitrary modular function; if $m_\pm(\cdot) = 0$, the SCMM is an SCM. Before showing that DSFs extend SCMMs, we include a result showing that SCMMs are strictly smaller than the set of all submodular functions. We include an unpublished result [28] showing that SCMMs cannot represent the cycle matroid rank function, as described above, over the graph $K_4$.

Theorem 4.11 (Vondrák [28]). SCMMs $\subsetneq$ Space of Submodular Functions.

We next show that DSFs strictly generalize SCMMs, thus providing justification for using DSFs over SCMMs and, moreover, generalizing Lemma 4.5. The DSF we choose is, again, a laminar matroid, so SCMMs are unable to represent laminar matroid rank functions. Since DSFs generalize laminar matroid rank functions, the result follows.

Theorem 4.14. The DSF family is strictly larger than that of SCMMs.

The proof shows that it is not possible to represent a function of the form $f(X) = \min\big(\sum_i \min(|X \cap B_i|, k_i), k\big)$ using an SCMM. Theorem 4.14 also has an immediate consequence for concave functions.

Corollary 4.14.1. There exists a non-USC concave function with an antitone superdifferential.

The corollary follows since, as mentioned above, DSF functions have at their core a multivariate concave function with an antitone superdifferential, and thanks to Theorem 4.14 it is not always possible to represent this as a sum of concaves over linear functions. It is currently an open problem whether DSFs with $\ell$ layers extend the family of DSFs with $\ell' < \ell$ layers, for $\ell \ge 2$. Our final result shows that, while DSFs are richer than SCMMs, they still do not encompass all polymatroid functions. We show this by proving that the cycle matroid rank function on $K_4$ is not achievable with DSFs.

Theorem 4.15. DSFs $\subsetneq$ Polymatroids.

Proofs of these theorems and more may be found in [5].

## 5 Learning DSFs

As mentioned above, learning submodular functions is generally difficult [11, 13]. Learning mixtures of fixed submodular component functions [24, 19], however, can give good empirical results on several tasks, including image [27] and document [19] summarization. In these examples, rather than attempting to learn a function at all $2^n$ points, a max-margin approach is used only to approximate a submodular function on its large values. Typically when training a summarizer, one is given a ground set of items and a set of representative sets of excerpts (usually human generated), each of which summarizes the ground set. Within this setting, access to an oracle function $h(A)$ that, if available, could be used in a regression-style learning approach might not be available. Even if available, such learning is often overkill.
Thus, instead of trying to learn $h$ everywhere, we only seek to learn the parameters $w$ of a function $f_w$ that lives within some family, parameterized by $w$, so that if $B \in \operatorname{argmax}_{A \subseteq V : |A| \le k} f_w(A)$, then $h(B) \ge \alpha h(A^*)$ for some $\alpha \in [0, 1]$, where $A^* \in \operatorname{argmax}_{A \subseteq V : |A| \le k} h(A)$. In practice, this corresponds to selecting the best summary for a given document based on the learnt function, in the hope that it mirrors what a human believes to be best. Fortunately, the max-margin training approach directly addresses the above and is immediately applicable to learning DSFs. Also, given the ongoing research on learning DNNs, which have achieved state-of-the-art results on a plethora of machine learning tasks [17], and given the similarity between DSFs and DNNs, we may leverage DNN learning techniques (such as dropout, AdaGrad, learning rate scheduling, etc.) to our benefit.

### 5.1 Using max-margin learning for DSFs

Given an unknown but desired function $h : 2^V \to \mathbb{R}_+$ and a set of representative sets $\mathcal{S} = \{S_1, S_2, \dots\}$, with $S_i \subseteq V$ and where for each $S \in \mathcal{S}$, $h(S)$ is highly scored, a max-margin learning approach may be used to train a DSF $f$ so that if $A \in \operatorname{argmax}_{A' \subseteq V} f(A')$, then $h(A)$ is also highly scored by $h$. Under the large-margin approach [24, 19, 27], we learn the parameters $w$ of $f_w$ such that for all $S \in \mathcal{S}$, $f_w(S)$ is high, while for $A \in 2^V$, $f_w(A)$ is lower by some given loss. This may be performed by maximizing the loss-dependent margin so that for all $S \in \mathcal{S}$ and $A \in 2^V$, $f_w(S) \ge f_w(A) + \ell_S(A)$. For a given loss function $\ell_S(A)$, optimization reduces to finding parameters so that $f_w(S) \ge \max_{A \in 2^V} [f_w(A) + \ell_S(A)]$ is satisfied for $S \in \mathcal{S}$. The task of finding the maximizing set is known as loss-augmented inference (LAI) [26], which for general $\ell(A)$ is NP-hard. With regularization, and defining the hinge operator $(x)_+ = \max(0, x)$, the optimization becomes minimizing
$$\Big( \max_{A \in 2^V} [f(A) + \ell_S(A)] - f(S) \Big)_+ + \frac{\lambda}{2} \|w\|_2^2. \qquad (4)$$
Given an LAI procedure, the subgradient with respect to weight $w_i$ is $\nabla_{w_i} f(A) - \nabla_{w_i} f(S) + \lambda w_i$, and in the case of a DSF, each subgradient can be computed efficiently with backpropagation, similar to the approach of [23]; but to retain polymatroidality of $f$, projected gradient descent is used to ensure $w \ge 0$.

For arbitrary set functions $f(A)$ and $\ell_S(A)$, LAI is generally intractable. Even if $f(A)$ is submodular, the choice of loss can affect the computational feasibility of LAI. For submodular $\ell(A)$, the greedy algorithm can find an approximately maximizing set [21]. For supermodular $\ell(A)$, the task of solving $\max_{A \in 2^{V \setminus S}} [f(A) + \ell(A)]$ involves maximizing the difference of two submodular functions, and the submodular-supermodular procedure [14] can be used. Once a DSF is learnt, we may wish to find $\max_{A \subseteq V : |A| \le k} f(A)$, and this can be done, e.g., using the greedy algorithm when $m_\pm \ge 0$.
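The following Python sketch illustrates one max-margin update of the form in Equation (4) for a one-hidden-layer DSF, using greedy loss-augmented inference and a projection back onto non-negative weights; the architecture, the modular loss, the cardinality limit used inside LAI, and the step size are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def dsf_value(W, v, ind):
    """One-hidden-layer DSF: f(A) = sum_j v_j * sqrt(W_j . 1_A), with W, v >= 0."""
    return float(v @ np.sqrt(W @ ind))

def greedy_lai(W, v, loss, n, k):
    """Greedy loss-augmented inference: approx. argmax_{|A|<=k} f(A) + loss(A)."""
    A = np.zeros(n)
    for _ in range(k):
        gains = []
        for i in range(n):
            if A[i] == 0:
                B = A.copy(); B[i] = 1
                gains.append((dsf_value(W, v, B) + loss(B)
                              - dsf_value(W, v, A) - loss(A), i))
        best_gain, best_i = max(gains)
        if best_gain <= 0:
            break
        A[best_i] = 1
    return A

def max_margin_step(W, v, S_ind, lam=1e-3, lr=0.1):
    """One projected-(sub)gradient step on the hinge objective of Eq. (4)."""
    n = len(S_ind)
    loss = lambda B: float(np.sum(np.maximum(B - S_ind, 0))) / n  # modular loss
    A = greedy_lai(W, v, loss, n, int(S_ind.sum()))
    if dsf_value(W, v, A) + loss(A) - dsf_value(W, v, S_ind) > 0:  # hinge active
        gW, gv = lam * W, lam * v  # gradient of the regularizer
        for ind, sign in ((A, 1.0), (S_ind, -1.0)):
            pre = W @ ind + 1e-12
            gW += sign * 0.5 * (v / np.sqrt(pre))[:, None] * ind[None, :]
            gv += sign * np.sqrt(pre)
        W, v = W - lr * gW, v - lr * gv
    return np.maximum(W, 0), np.maximum(v, 0)  # projection keeps the DSF submodular

# Example usage with random non-negative initialization (hypothetical sizes):
rng = np.random.default_rng(0)
W, v = rng.random((8, 20)), rng.random(8)
S = np.zeros(20); S[[1, 4, 7]] = 1  # a "human" summary of size 3
W, v = max_margin_step(W, v, S)
```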
The task of summarization, however, might involve learning based on one set of ground sets and testing via a different (set of) ground set(s) [19, 27]. To do this, any particular element $v \in V$ may be represented by a vector of non-negative feature weights $(m_1(v), m_2(v), \dots)$ (e.g., $m_i(v)$ counts the number of times a unigram $i$ appears in sentence $v$), and the weight of feature $i$ for any set $A \subseteq V$ can be represented as the $i$-specific modular evaluation $m_i(A) = \sum_{a \in A} m_i(a)$. We can treat the set of modular functions $\{m_i : V \to \mathbb{R}_+\}_i$ as a matrix to be used as the first layer of a DSF (e.g., $w^{(1)}$ in Figure 1, left) that is fixed during the training of subsequent layers. This preserves submodularity, and allows all later layers (i.e., $w^{(2)}, w^{(3)}, \dots$) to be learnt generically over any set of objects that can be represented in the same feature space; this also allows training over one set of ground sets and testing on a totally separate set of ground sets. Max-margin learning, in fact, remains ignorant that this is happening since it sees the data only post feature representation. In fact, learning can be cross-modal: e.g., images and sentences, represented in the same feature space, can be learnt simultaneously. This is analogous to the shells of [19]. In that case, however, mixtures were learnt over fixed components, some of which required an $O(n^2)$ calculation for element-pair similarity scores. Via featurization in the first layer of a DSF, however, we may learn a DSF over a training set, preserving submodularity, avoid any $O(n^2)$ cost, and test on any new data represented in the same feature space.

Figure 3: (a), (b) show that matroid learning via a DSF is possible in a max-margin setting (greedy set value vs. training iteration for the cycle and laminar matroids); (c) shows that learning and generalization of a DSF can happen, via featurization, on real image data (average greedy V-ROUGE vs. epoch on train and test for the DSF, an SCMM, the human average, and the random average).

## 6 Empirical Experiments on Learning DSFs

We offer preliminary feasibility results showing it is possible to train a DSF on synthetic datasets and, via featurization, on a real image summarization dataset. The first synthetic experiment trains a DSF to learn a cycle matroid rank function on $K_4$. Although Theorem 4.15 shows a DSF cannot represent such a rank function everywhere, we show that the max-margin framework can learn a DSF that, when maximized via $\max_{A \subseteq V : |A| \le 3} f(A)$, does not return a 3-cycle (as is desirable). We used a simple two-layer DSF, where the first hidden layer consisted of four hidden units with square-root activation functions, and a normalized sigmoid $\hat{\sigma}(x) = 2(\sigma(x) - 0.5)$ at the output. Figure 3a shows that after sufficient learning iterations, greedy applied to the DSF returns independent size-three sets. Further analysis shows that the function is not learnt everywhere, as predicted by Theorem 4.15.

We next tested a scaled laminar matroid rank $r(A) = \tfrac{1}{8} \min\big( \sum_{i=1}^{10} \min(|A \cap B_i|, 1), 8 \big)$, where the $B_i$'s are each of size 10 and form a partition of $V$, with $|V| = 100$. Thus maximal independent sets $I \in \operatorname{argmax}_{I' \in \mathcal{I}} |I'|$ have $r(I) = 1$ with $|I| = 8$. A DSF is trained with a hidden layer of 10 units with activation $g(x) = \min(x, 1)$, and a normalized sigmoid $\hat{\sigma}$ at the output. We randomly generated 200 matroid bases and trained the network. The greedy solution to $\max_{A \subseteq V : |A| \le 8} f(A)$ on the learnt DSF produces sets that are maximally independent (Figure 3b).

For our real-world instance of learning DSFs, we use the dataset of [27], which consists of 14 distinct image sets, 100 images each. The task is to select the best 10-image summary in terms of a visual ROUGE-like function that is defined over a bag of visual features. For each of the 14 ground sets, we trained on the other 13 sets and evaluated the performance of the trained DSF on the test set. We use a simple DSF of the form $f(A) = \hat{\sigma}\big(\sum_u m_u(A)\big)$, where $m_u(A)$ is modular for feature $u$, and $\hat{\sigma}$ is a sigmoid. We used (diagonalized) AdaGrad, a decaying learning rate, weight decay, and dropout (which was critical for test-set performance).
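For reference, a minimal sketch of the normalized sigmoid and the simple featurized DSF form described above; the feature counts, ground-set size, and number of visual-word features are placeholders, not the actual data of [27].

```python
import numpy as np

def normalized_sigmoid(x):
    """sigma_hat(x) = 2*(sigma(x) - 0.5): equals 0 at x = 0 and is
    monotone non-decreasing and concave for x >= 0, a valid DSF activation."""
    return 2.0 * (1.0 / (1.0 + np.exp(-x)) - 0.5)

def featurized_dsf(A, feature_matrix):
    """f(A) = sigma_hat( sum_u m_u(A) ), where m_u(A) = sum_{a in A} m_u(a).

    feature_matrix[u, a] holds the non-negative count of visual word u in
    image a (the fixed first-layer modular functions).
    """
    ind = np.zeros(feature_matrix.shape[1])
    ind[list(A)] = 1.0
    return float(normalized_sigmoid(np.sum(feature_matrix @ ind)))

# Hypothetical data: 628 visual-word features over a ground set of 100 images.
rng = np.random.default_rng(0)
M = rng.random((628, 100))
print(featurized_dsf({3, 17, 42}, M))
```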
We compared to an SCMM of comparable complexity and number of parameters (i.e., the same form and features but a linear output), and the performance of the SCMM is much worse (Figure 3c), perhaps because of a DSF's depth. Notably, we only require $|U| = 628$ visual-word features (as covered in Section 5 of [27]), while the approach in [27] required 594 components of $O(n^2)$ graph values, or roughly 5.94 million precomputed values. The loss function is $\ell(A) = 1 - R(A)$, where $R(A)$ is a ROUGE-like function defined over visual words. During training, we achieve numbers comparable to [27]. We do not yet match the generalization results in [27], but we do not use strong $O(n^2)$ graph components, and we expect better results perhaps with a deeper network and/or better base features.

Acknowledgments: Thanks to Reza Eghbali and Kai Wei for useful discussions. This material is based upon work supported by the National Science Foundation under Grant No. IIS-1162606, the National Institutes of Health under award R01GM103544, and by a Google, a Microsoft, a Facebook, and an Intel research award. This work was supported in part by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

## References

[1] A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 671–680. ACM, 2014.

[2] M. Balcan and N. Harvey. Learning submodular functions. Technical report, arXiv:1008.2159, 2010.

[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[4] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[5] Jeffrey Bilmes and Wenruo Bai. Deep submodular functions. arXiv, abs/1701.08939, January 2017. http://arxiv.org/abs/1701.08939.

[6] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[7] Niv Buchbinder, Moran Feldman, Joseph (Seffi) Naor, and Roy Schwartz. Submodular maximization with cardinality constraints. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1433–1452. Society for Industrial and Applied Mathematics, 2014.

[8] Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740–1766, 2011.

[9] Chandra Chekuri, Shalmoli Gupta, and Kent Quanrud. Streaming algorithms for submodular function maximization. In International Colloquium on Automata, Languages, and Programming, pages 318–330. Springer, 2015.

[10] W. H. Cunningham. Testing membership in matroid polyhedra. Journal of Combinatorial Theory, Series B, 36:161–188, 1984.

[11] V. Feldman and J. Vondrák. Optimal bounds on approximation of submodular and XOS functions by juntas. CoRR, abs/1307.3301, 2013.

[12] S. Fujishige. Submodular Functions and Optimization. Number 58 in Annals of Discrete Mathematics. Elsevier Science, 2nd edition, 2005.

[13] M. X. Goemans, N. J. A. Harvey, S. Iwata, and V. Mirrokni. Approximating submodular functions everywhere. In SODA, pages 535–544, 2009.

[14] R. Iyer and J. Bilmes. Algorithms for approximate minimization of the difference between submodular functions, with applications. In Uncertainty in Artificial Intelligence (UAI), 2012.

[15] Rishabh Iyer and Jeff Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In Neural Information Processing Systems (NIPS), Lake Tahoe, CA, December 2013.
[16] Rishabh Iyer, Stefanie Jegelka, and Jeff A. Bilmes. Fast semidifferential-based submodular function optimization. In International Conference on Machine Learning (ICML), Atlanta, Georgia, 2013.

[17] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

[18] Jon Lee, Vahab S. Mirrokni, Viswanath Nagarajan, and Maxim Sviridenko. Non-monotone submodular maximization under matroid and knapsack constraints. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 323–332. ACM, 2009.

[19] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), Catalina Island, USA, July 2012. AUAI.

[20] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization, 2011.

[21] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14:265–294, 1978.

[22] R. Nishihara, S. Jegelka, and M. I. Jordan. On the convergence rate of decomposable submodular function minimization. In Advances in Neural Information Processing Systems, pages 640–648, 2014.

[23] W. Pei. Max-margin tensor neural network for Chinese word segmentation. Transactions of the Association for Computational Linguistics, pages 293–303, 2014.

[24] R. Sipos, P. Shivaswamy, and T. Joachims. Large-margin learning of submodular summarization models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 224–233. Association for Computational Linguistics, 2012.

[25] P. Stobbe and A. Krause. Efficient minimization of decomposable submodular functions. In NIPS, 2010.

[26] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning, pages 896–903, 2005.

[27] S. Tschiatschek, R. Iyer, H. Wei, and J. Bilmes. Learning mixtures of submodular functions for image collection summarization. In Neural Information Processing Systems (NIPS), Montreal, Canada, December 2014.

[28] J. Vondrák. Personal communication, 2011.

[29] K. Wei, Y. Liu, K. Kirchhoff, and J. Bilmes. Unsupervised submodular subset selection for speech data. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Florence, Italy, 2014.