# Differentiable Submodular Maximization

Sebastian Tschiatschek¹, Aytunc Sahin², Andreas Krause²
¹ Microsoft Research Cambridge  ² ETH Zurich
sebastian.tschiatschek@microsoft.com, aytunc.sahin@inf.ethz.ch, krausea@ethz.ch

We consider learning of submodular functions from data. These functions are important in machine learning and have a wide range of applications, e.g. data summarization, feature selection and active learning. Despite their combinatorial nature, submodular functions can be maximized approximately with strong theoretical guarantees in polynomial time. Typically, learning the submodular function and optimization of that function are treated separately, i.e. the function is first learned using a proxy objective and subsequently maximized. In contrast, we show how to perform learning and optimization jointly. By interpreting the output of greedy maximization algorithms as distributions over sequences of items and smoothing these distributions, we obtain a differentiable objective. In this way, we can differentiate through the maximization algorithms and optimize the model to work well with the optimization algorithm. We theoretically characterize the error made by our approach, yielding insights into the tradeoff between smoothness and accuracy. We demonstrate the effectiveness of our approach for jointly learning and optimizing on synthetic maximum cut data, and on real-world applications such as product recommendation and image collection summarization.

1 Introduction

Many important applications require the prediction of set-valued outputs, e.g. product recommendation, in which the output corresponds to a set of items that should be recommended to a user; object detection in images, in which the output is a set of bounding boxes [Felzenszwalb et al., 2010]; or extractive summarization, in which the output is a set of input objects (sentences in text summarization [Lin and Bilmes, 2010], images in image collection summarization [Tschiatschek et al., 2014], short scenes in video summarization [Zhang et al., 2016]). This problem is particularly challenging because the size of the output space (i.e. the number of possible subsets of some ground set, e.g. the set of all possible bounding boxes) increases exponentially with the size of the ground set.

The problem of learning to predict sets is most commonly approached by either specifying suitable set functions by hand or by first learning the set functions for the task at hand from data and then performing inference using the learned set function. To enjoy strong theoretical guarantees for inference via maximization, we are interested in using set functions that are submodular. The submodular set functions are fitted to the given training data (we consider the supervised case in which the training data contains the sets we wish to predict) using, for example, large margin methods [Lin and Bilmes, 2012; Tschiatschek et al., 2014], maximum likelihood estimation [Gillenwater et al., 2014; Tschiatschek et al., 2016] or by minimizing some kind of set-valued loss function [Dolhansky and Bilmes, 2016]. Finally, for inference, some variant of the greedy algorithm is used to maximize the learned set function. This approach suffers from the problem that training and testing are performed differently, which can lead to degraded performance. Furthermore, this approach precludes end-to-end training of the considered functions.
However, end-to-end training is desirable as it often leads to significant performance improvements, as recently demonstrated in various domains [Ren et al., 2015; Bahdanau et al., 2014].

We resolve this problem by proposing two differentiable algorithms for maximizing non-negative submodular set functions, with cardinality constraints and without constraints. These algorithms can easily be integrated into existing deep learning architectures. The algorithms define a likelihood over sets and are used for both training and testing. This enables end-to-end training of functions for predicting sets. Our algorithms are derived from the standard greedy algorithm [Nemhauser et al., 1978] and the double greedy algorithm [Buchbinder et al., 2015], which was originally proposed for maximizing positive non-monotone submodular functions and provides a 1/2 approximation guarantee. Interestingly, our algorithm for unconstrained maximization also provides approximation guarantees of the form $f(X_G) \geq \frac{1}{2} f(\mathrm{OPT}) - \epsilon(t)$, where $\epsilon(t)$ is a function of the parameter $t$ parameterizing our algorithm, $\mathrm{OPT}$ is an optimal solution and $X_G$ is the output of our algorithm. Our main contributions are:

1. We develop differentiable algorithms for end-to-end training of deep networks with set-valued outputs.
2. We theoretically characterize the additional errors imposed by smoothing used in the derivation of our algorithm for unconstrained maximization.
3. We demonstrate the excellent end-to-end learning performance of our approach on several challenging applications, including product recommendation and image collection summarization.

2 Notation & Problem Setting

Notation. Let $V = \{e_1, e_2, \ldots\}$ be a ground set of items. We want to perform subset selection from $V$. In some parts of the paper we consider $V$ to be fixed, while in others $V$ depends on the context. Note that we assume an arbitrary but fixed enumeration of the items in $V$. Furthermore, let $f : 2^V \to \mathbb{R}$ be a set function assigning a real-valued scalar to every set $S \subseteq V$. The function $f$ is submodular iff $f(A \cup \{e\}) - f(A) \geq f(B \cup \{e\}) - f(B)$ for all $e \in V$, $A \subseteq B \subseteq V \setminus \{e\}$. The function $f$ is monotone if $f(A) \leq f(B)$ for all $A \subseteq B$ and non-monotone otherwise. We denote the gain of adding an item $e \in V$ to a set of items $S$ by $\Delta^+_f(e \mid S) = f(S \cup \{e\}) - f(S)$ and the gain of removing the item $e \in S \subseteq V$ from a set $S$ by $\Delta^-_f(e \mid S) = f(S \setminus \{e\}) - f(S)$. We will omit the subscript $f$ whenever the function $f$ is clear from the context. For notational convenience we write $S + e$ instead of $S \cup \{e\}$ and $S - e$ instead of $S \setminus \{e\}$.
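To make the gain notation concrete, here is a minimal sketch (ours, not from the paper) using a small coverage function; the element names and covered points are made up for illustration.

```python
# Illustrative sketch (ours): the gain notation on a small coverage
# function, which is monotone submodular.
AREAS = {"e1": {1, 2}, "e2": {2, 3}, "e3": {3}}  # made-up covered points

def f(S):
    """Coverage: number of ground points covered by the elements of S."""
    covered = set()
    for e in S:
        covered |= AREAS[e]
    return len(covered)

def gain_add(e, S):
    """Delta^+(e | S) = f(S + e) - f(S)."""
    return f(S | {e}) - f(S)

def gain_remove(e, S):
    """Delta^-(e | S) = f(S - e) - f(S), for e in S."""
    return f(S - {e}) - f(S)

# Submodularity (diminishing returns): gains shrink as the context grows.
assert gain_add("e3", set()) >= gain_add("e3", {"e2"})  # 1 >= 0
assert gain_remove("e2", {"e1", "e2"}) == -1            # point 3 is lost
```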
Problem setting. We consider the problem of learning an unknown submodular function $f(S \mid V)$ from training data. The training data is given in the form of a collection of maximizers $\mathcal{X}$ (or approximate maximizers) of $f(S \mid V)$ for different ground sets $V$, i.e. $\mathcal{X} = \{(V_1, X_1), \ldots, (V_M, X_M)\}$. We assume that the ground set is not only an abstract set of elements but that for every element in the ground set we have additional information, e.g. an associated vector of features. For instance, in the case of image collection summarization, the ground set $V_i$ corresponds to the images in an image collection (each element of $V_i$ comes in the form of an actual image, i.e. pixel values) and the set $X_i$ is a human generated summary for that collection. Note that the training data can contain multiple samples for the same ground set, e.g. in image collection summarization there can be multiple human generated summaries for some particular image collection. The (approximate) maximizers in $\mathcal{X}$ either maximize $f(S \mid V)$ unconstrainedly, i.e. for all $(V, X) \in \mathcal{X}$ it holds that $f(X \mid V) \approx \max_{S \subseteq V} f(S \mid V)$, or cardinality constrainedly, i.e. $f(X \mid V) \approx \max_{S \subseteq V, |S| \leq k} f(S \mid V)$ for some cardinality constraint $k$.

Given an algorithm APPROXMAX that can approximately maximize a submodular function with or without cardinality constraints, our goal is to learn a set function $h_\theta(S \mid V)$ parameterized by $\theta$ such that if it is maximized with APPROXMAX, the returned solutions approximately maximize $f(S \mid V)$. For the image collection summarization application, maximizing $h_\theta(S \mid V)$ with APPROXMAX should generalize to new unseen ground sets $V$, i.e. yield subsets of images that summarize an unseen image collection well. For other applications, e.g. the product recommendation application, maximizing the function $h_\theta$ with APPROXMAX should yield good solutions conditioned on the selection of a subset of items (see the experimental section for details).

Our approach: learning to greedily maximize. We develop two probabilistic approximate maximization algorithms APPROXMAX that take a function $h_\theta(\cdot \mid V)$ parameterized by $\theta$ as input and output approximate maximizers of $h_\theta(\cdot \mid V)$. These algorithms are based on the standard greedy algorithm and the double greedy algorithm and are presented in detail in the next sections. Because of the probabilistic nature of the algorithms, they induce a distribution over sets $S \subseteq V$ as $P_{\text{APPROXMAX}}(S; h_\theta(\cdot \mid V)) = P(\text{APPROXMAX}(h_\theta(\cdot \mid V)) = S)$. We can thus equivalently view these probabilistic algorithms APPROXMAX as distributions over sets. To learn the parameters of the function $h_\theta$, we can thus maximize the likelihood of $\mathcal{X}$ under the induced distribution, i.e. we aim to find

$$\theta^* = \arg\max_\theta \sum_{(V, X) \in \mathcal{X}} \log P_{\text{APPROXMAX}}(X; h_\theta(\cdot \mid V)). \quad (1)$$

The parameters $\theta^*$ according to the above equation intimately connect the function $h_{\theta^*}$, the algorithm APPROXMAX, and the function $f$, i.e. APPROXMAX computes approximate maximizers of $f$ when applied to $h_{\theta^*}$. By ensuring that our developed approximate maximization algorithms have differentiable likelihoods, we can maximize our objective in (1) using gradient based optimization techniques. This enables the easy integration of our algorithms into deep learning architectures, which are commonly optimized using stochastic gradient descent techniques.

3 Differentiable Unconstrained Maximization

In this section we present our probabilistic algorithm PD2GREEDY for differentiable unconstrained maximization of non-negative submodular set functions. The algorithm is derived from the double greedy algorithm [Buchbinder et al., 2015] and presented in Algorithm 1. The algorithm works by iterating through the items in $V$ in a fixed order. In every iteration it computes the gain $a_i$ for adding the $i$th item to the set $X_{i-1}$ and the gain $b_i$ for removing the $i$th item from the set $Y_{i-1}$. It then compares these two gains and makes a probabilistic decision based on that comparison.

Algorithm 1 PD2GREEDY: Probabilistic diff. double-greedy
Require: Function $h_\theta : 2^V \to \mathbb{R}_{\geq 0}$
  $X_0 \leftarrow \emptyset$, $Y_0 \leftarrow V$
  for $i = 1, \ldots, |V|$ do
    $a_i = h_\theta(X_{i-1} + e_i) - h_\theta(X_{i-1}) \;\hat{=}\; \Delta^+(e_i \mid X_{i-1})$
    $b_i = h_\theta(Y_{i-1} - e_i) - h_\theta(Y_{i-1}) \;\hat{=}\; \Delta^-(e_i \mid Y_{i-1})$
    if $u \leq g(a_i, b_i)$, $u \sim U[0, 1]$ then  ▷ $U[0,1]$ is the uniform distribution on $[0, 1]$
      $X_i = X_{i-1} \cup \{e_i\}$, $Y_i = Y_{i-1}$
    else
      $Y_i = Y_{i-1} \setminus \{e_i\}$, $X_i = X_{i-1}$
    end if
  end for
  return Approximate maximizer $X_{|V|}$
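The following is a compact sketch of Algorithm 1 (ours), under the assumption that $h_\theta$ is available as a plain Python set function; the link function $g$ is passed in as a parameter, and the specific choices $g_1, \ldots, g_4$ are discussed next.

```python
import random

def pd2greedy(h, V, g):
    """Sketch of Algorithm 1 (PD2GREEDY). h: set function (set -> float),
    V: list fixing the order of the items, g: link function mapping the
    gains (a_i, b_i) to an inclusion probability in [0, 1]."""
    X, Y = set(), set(V)
    for e in V:
        a = h(X | {e}) - h(X)      # a_i = Delta^+(e_i | X_{i-1})
        b = h(Y - {e}) - h(Y)      # b_i = Delta^-(e_i | Y_{i-1})
        if random.random() <= g(a, b):
            X.add(e)               # keep e:  X_i = X_{i-1} + e_i
        else:
            Y.discard(e)           # drop e:  Y_i = Y_{i-1} - e_i
    return X                       # X_{|V|} (equals Y_{|V|})
```

Running it with a hard threshold link, e.g. `g = lambda a, b: float(a > b)`, recovers the deterministic behavior discussed below.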
The deterministic and randomized versions of the double greedy algorithm as presented in [Buchbinder et al., 2015] can be obtained from Algorithm 1 by specific choices of the function $g(a_i, b_i)$. In particular, setting $g(a, b) = g_1(a, b) := \mathbb{1}_{a > b}$, where $\mathbb{1}$ is the indicator function, results in the deterministic double greedy algorithm with a 1/3 approximation guarantee for maximizing positive non-monotone submodular functions. Setting $g(a, b) = g_2(a, b) := \frac{[a]_+}{[a]_+ + [b]_+}$, where $[x]_+ = \max\{0, x\}$, results in the randomized double greedy algorithm with a 1/2 approximation guarantee for maximization in expectation.

Note that the algorithm, independent of the particular choice of $g(a_i, b_i)$, induces a distribution over subsets of $V$. To specify this distribution, let $x$ be the binary indicator vector representing the set $X$ with respect to the assumed fixed ordering $e_1, e_2, \ldots$ of the elements in the ground set. Then, the distribution is

$$P_{\text{PD2GREEDY}}(X) = \prod_{i=1}^{|V|} g(a_i, b_i)^{x_i} \, (1 - g(a_i, b_i))^{1 - x_i}, \quad (2)$$

where $a_i = \Delta^+(e_i \mid \{e_j \mid j < i, x_j = 1\})$ and $b_i = \Delta^-(e_i \mid \{e_j \in V \mid j \geq i \text{ or } x_j = 1\})$. Obviously, this distribution depends on the order of the items, which we assumed to be fixed. In practice, this order influences the maximization performance, as substantiated by an example shortly.

While Equation (2) defines a distribution over sets, it is hard to optimize for $g = g_1$ and $g = g_2$ because of the non-smoothness of these functions. Thus we propose the following two natural smooth alternatives for the function $g$. First, $g(a, b) = g_3(a, b; t) := \mu(a, b; t)$, where $\mu(a, b; t) = \frac{1}{1 + \exp(-(a - b)/t)}$, is an approximation to the deterministic double greedy algorithm which becomes exact as $t \to 0$. Second, $g(a, b) = g_4(a, b; t) := \frac{\tilde{a}}{\tilde{a} + \tilde{b}}$, where $\tilde{a} = t \log(1 + \exp(a/t))$ and $\tilde{b} = t \log(1 + \exp(b/t))$, is an approximation to the randomized double greedy algorithm which becomes exact as $t \to 0$.

Interestingly, choosing one of these smooth alternatives for the function $g(a, b)$ still results in an algorithm with approximation guarantees with respect to the true optimum, as substantiated in the following theorems.

Theorem 1. Let $\epsilon > 0$ and $g(a, b) = g_4(a, b; t)$. For $t < \frac{2\epsilon}{|V| \log 2}$, the output $X_{|V|}$ of Algorithm 1 satisfies $f(X_{|V|}) \geq \frac{1}{2} f(\mathrm{OPT}) - \epsilon$ in expectation, where $\mathrm{OPT}$ is an optimal solution.

Similarly, for the approximation to the deterministic algorithm we have the following theorem.

Theorem 2. Let $\epsilon > 0$ and $g(a, b) = g_3(a, b; t)$. For $t < \frac{3\epsilon}{|V| \, W(1/e)}$, where $W$ denotes the Lambert $W$ function, the output $X_{|V|}$ of Algorithm 1 satisfies $f(X_{|V|}) \geq \frac{1}{3} f(\mathrm{OPT}) - \epsilon$ in expectation, where $\mathrm{OPT}$ is an optimal solution.

The full proofs are omitted due to space constraints and are available in the extended version of this paper [Tschiatschek et al., 2018]. The idea of the proofs of Theorems 1 and 2 is as follows: instead of using $g_1$ and $g_2$ in Algorithm 1, we use $g_3$ and $g_4$, respectively. Note that as $t \to 0$, the approximation of $g_1$ by $g_3$ becomes exact (the same holds for the approximation of $g_2$ by $g_4$). For nonzero $t$, the considered approximations change the probabilities of selecting each item $e_i$ during the execution of Algorithm 1 and introduce an (additive) error compared to executing Algorithm 1 with $g_1$ and $g_2$, respectively. We can guarantee that this additive error is below any desired error $\epsilon / |V|$ by choosing $t$ small enough. As the algorithm iterates over all items, the total error is then bounded by $\epsilon$.
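To make the statement of Theorem 1 concrete: for $|V| = 20$ and a target error of $\epsilon = 0.1$, it suffices to choose $t < 2 \cdot 0.1 / (20 \log 2) \approx 0.014$. The following sketch (ours, in PyTorch; it assumes the gains $a_i$ and $b_i$ along the trajectory of the target set have already been computed from $h_\theta$ as tensors) shows the smoothed link functions and the resulting differentiable log-likelihood of Equation (2):

```python
import torch
import torch.nn.functional as F

def g3(a, b, t):
    """Smoothed deterministic link: sigmoid((a - b) / t), exact as t -> 0."""
    return torch.sigmoid((a - b) / t)

def g4(a, b, t):
    """Smoothed randomized link: softplus replaces [.]_+, exact as t -> 0."""
    a_t = t * F.softplus(a / t)  # t * log(1 + exp(a / t)) ~ [a]_+
    b_t = t * F.softplus(b / t)
    return a_t / (a_t + b_t)     # denominator is strictly positive

def log_likelihood(a, b, x, t, link=g4):
    """log P(X) as in Eq. (2): a, b are the gain vectors along the
    trajectory determined by the 0/1 indicator vector x of the set X."""
    p = link(a, b, t).clamp(1e-6, 1 - 1e-6)  # avoid log(0)
    return (x * torch.log(p) + (1 - x) * torch.log(1 - p)).sum()
```

Because $g_3$ and $g_4$ are smooth in the gains, gradients of this log-likelihood flow through $a_i$ and $b_i$ into $\theta$, which is what makes the objective in (1) amenable to stochastic gradient descent.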
Theorems 1 and 2 characterize a tradeoff between the accuracy of the algorithm and the smoothness of the approximation: high accuracy (small $\epsilon$) requires low smoothness (small $t$) and vice versa.

Speeding up computations. Naively computing the likelihood in Equation (2) requires $4|V|$ function evaluations. By keeping track of the function values $f(X_i)$ and $f(Y_i)$, this can immediately be reduced to $2|V|$ function evaluations. However, for large $|V|$ this may still be prohibitive. Fortunately, many submodular functions allow for fast computation of all $a_i$ and $b_i$ values needed for evaluating (2). For example, for the modular function $f(S) = \sum_{e \in S} s_e$, where $s_e \in \mathbb{R}$, the gains for computing $P_{\text{PD2GREEDY}}(X)$ are $a_i = s_{e_i}$ and $b_i = -s_{e_i}$. Another example is the facility location function $f(S) = \max_{e \in S} w_e$, where $w_e \in \mathbb{R}_{\geq 0}$. In that case, $a_i$ can be iteratively computed as $a_i = \max\{c_{i-1}, w_{e_i}\} - c_{i-1}$, where $c_i = \max\{c_{i-1}, w_{e_i}\}$ if $e_i \in X$ and $c_i = c_{i-1}$ otherwise (setting $c_0 = 0$ for initialization); a short sketch at the end of this section illustrates this.

Non non-negative functions. Some of the functions we learn in the experiments in Section 5 are not guaranteed to be non-negative over their whole domain. This seems problematic because the above guarantees only hold for non-negative functions. Note, however, that any set function $\hat{f}(S)$ can be transformed into a non-negative set function $f(S)$ by defining $f(S) = \hat{f}(S) - \min_{S' \subseteq V} \hat{f}(S')$. This transformation does not affect the gains $a_i$ and $b_i$ computed in Algorithm 1; it only changes the values of the maxima.

Ordering of the items. The ordering of the items in the ground set influences the induced distributions over sets. Furthermore, it can have a significant impact on the achieved maximization performance, although the theoretical guarantees are independent of this ordering. To see this, consider the following toy example and PD2GREEDY with $g = g_1$: let $V = \{e_1, e_2\}$, $w_{e_1} = 2$, $w_{e_2} = 1$ and $f(S) = 2 + \max_{i \in S} w_i - |S|^2$ (with the maximum over the empty set taken as 0). This function is non-negative submodular with $f(\emptyset) = 2$, $f(\{e_1\}) = 3$, $f(\{e_2\}) = 2$, $f(V) = 0$. If the items in PD2GREEDY were considered in the order $(e_1, e_2)$, the algorithm would return $\{e_2\}$ with a function value of 2, while it would return $\{e_1\}$ with a function value of 3 if the items were considered in the reversed order $(e_2, e_1)$. Note that similar observations hold for all choices of $g$.

Connection to probabilistic submodular models. Algorithm PD2GREEDY defines a distribution over sets, and there is an interesting connection to probabilistic submodular models [Djolonga and Krause, 2014]. Specifically, in the case that $f(S)$ is a modular function, $g = g_3$ and $t = 2$, the induced distribution of PD2GREEDY corresponds to that of a log-modular distribution, i.e. items $e_i$ are contained in the output of PD2GREEDY with probability $\sigma(f(\{e_i\}))$, where $\sigma$ denotes the logistic sigmoid.
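The following sketch (ours) illustrates the fast gain computation from the paragraph on speeding up computations, for the function $f(S) = \max_{e \in S} w_e$ under our reading of the recursion with $c_0 = 0$:

```python
def facility_location_gains(w, x):
    """All a_i for f(S) = max_{e in S} w_e along the trajectory fixed by
    the 0/1 indicator vector x, in O(|V|) time. c tracks the running
    maximum over the kept items (c_0 = 0, our reading)."""
    gains, c = [], 0.0
    for w_e, x_i in zip(w, x):
        gains.append(max(c, w_e) - c)  # a_i = max{c_{i-1}, w_{e_i}} - c_{i-1}
        if x_i:
            c = max(c, w_e)            # e_i is kept: update the running max
    return gains
```

The $b_i$ values can be obtained in the same overall $O(|V|)$ time by combining this running maximum with precomputed suffix maxima of $w$.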
4 Diff. Cardinality Constrained Maximization

In this section we present our algorithm for differentiable cardinality constrained maximization of submodular functions. We assume that the cardinality constraint is $k$, i.e. $\forall (V, X) \in \mathcal{X}: f(X) \approx \max_{S \subseteq V, |S| = k} f(S)$. Our algorithm PGREEDY is presented in Algorithm 2. The algorithm iteratively builds up a solution of $k$ elements by adding one element in every iteration, starting from the empty set. The item added to the interim solution in iteration $i$ is selected randomly from all not yet selected items $V \setminus X_{i-1}$, where the probability $p_{e \mid X_{i-1}}$ of selecting item $e$ depends on the gain of adding $e$ to $X_{i-1}$. The selection probabilities are controlled by the temperature $t$. Assuming no ties, for $t \to 0$, PGREEDY is equivalent to the standard greedy algorithm, deterministically selecting the element with the highest marginal gain in every iteration. The algorithm naturally induces a differentiable distribution over sequences of items $\sigma = (\sigma_1, \ldots, \sigma_k)$ of length $k$ at temperature $t$ such that

$$P(\sigma) = \prod_{i=1}^{k} \frac{\exp\!\left(\frac{1}{t} \Delta^+_{h_\theta}(\sigma_i \mid X_{i-1})\right)}{\sum_{e \in V \setminus X_{i-1}} \exp\!\left(\frac{1}{t} \Delta^+_{h_\theta}(e \mid X_{i-1})\right)}, \quad (3)$$

where $X_{i-1} = \{\sigma_1, \ldots, \sigma_{i-1}\}$. From this distribution, we derive the probability of a set $S$ by summing over all sequences $\sigma$ of items consistent with $S$, i.e.

$$P(S) = \sum_{\sigma \in \Sigma(S)} P(\sigma), \quad (4)$$

where $\Sigma(S)$ is the set of permutations of the elements in $S$.

Algorithm 2 PGREEDY: Probabilistic differentiable greedy
Require: Function $h_\theta : 2^V \to \mathbb{R}_{\geq 0}$, cardinality constraint $k$, temperature $t$
  $X_0 \leftarrow \emptyset$
  for $i = 1, \ldots, k$ do
    $C \leftarrow V \setminus X_{i-1}$
    $\forall e \in C: \; p_{e \mid X_{i-1}} \leftarrow \dfrac{\exp\!\left(\frac{1}{t} \Delta^+_{h_\theta}(e \mid X_{i-1})\right)}{\sum_{e' \in C} \exp\!\left(\frac{1}{t} \Delta^+_{h_\theta}(e' \mid X_{i-1})\right)}$
    $e^* \leftarrow$ sample $e$ from $C$ with probability $p_{e \mid X_{i-1}}$
    $X_i = X_{i-1} \cup \{e^*\}$
  end for
  return Approximate maximizer $X_k$

As already briefly mentioned, the probability of a set $S$ depends on the temperature $t$. For $t = 0$ there is at most one permutation $\sigma$ of the items in $S$ for which $P(\sigma)$ is non-zero (assuming that there are no ties). In contrast, for $t \to \infty$, $P(\sigma)$ is constant over all permutations. Similarly to the case of PD2GREEDY, the choice of the temperature $t$ allows trading off accuracy (the standard greedy algorithm has a $(1 - 1/e)$ approximation guarantee for cardinality constrained maximization of non-negative monotone functions) against smoothness. The higher the temperature, the smoother the induced distribution, but the further the made decisions are from the greedy choice.

Learning with PGREEDY. Given training data $\mathcal{X}$, we can optimize the parameters of the function $h_\theta$ by maximizing the likelihood of $\mathcal{X}$ under the distribution (4). However, for large $k$, computing the summation over $\sigma \in \Sigma(S)$ is infeasible. We propose two approximations for this case. First, if $t$ is small, $P(S)$ can be accurately approximated by $P(\sigma^*(S))$, where $\sigma^*(S) = \arg\max_{\sigma \in \Sigma(S)} P(\sigma)$. However, to find $\sigma^*$ we would still have to search over all possible permutations of $S$. Hence, we propose to approximate $\sigma^*(S)$ by the permutation $\sigma^G(S)$ induced by the greedy order of the elements in $S$. Second, assuming there is no clear preference on the order of the items in $S$, i.e. $P(\sigma_i) \approx P(\sigma_j)$ for all $\sigma_i, \sigma_j \in \Sigma(S)$, we can approximate $P(S)$ as $P(S) \approx k! \, P(\sigma^R)$, where $\sigma^R$ is a random permutation of the items in $S$.

Testing. If $k$ is known at test time, e.g. if we aim to compute a summary of fixed size in the image collection summarization application, we can simply execute the standard greedy algorithm using the learned function $h_\theta$. This returns an approximate MAP solution for $P(S)$. However, we can also generate approximate maximizers of $h_\theta$ by invoking Algorithm 2. This can be beneficial in cases where we want to propose several candidate summaries to a user to choose from.
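A minimal sketch of Algorithm 2 (ours; it assumes $h_\theta$ is a plain Python set function) that samples a sequence and accumulates its log-probability under Equation (3):

```python
import math
import random

def pgreedy(h, V, k, t):
    """Sketch of Algorithm 2 (PGREEDY): sample k items with softmax
    probabilities over marginal gains at temperature t > 0; also return
    log P(sigma) of the sampled sequence as in Eq. (3)."""
    X, logp = set(), 0.0
    for _ in range(k):
        C = [e for e in V if e not in X]
        gains = [h(X | {e}) - h(X) for e in C]  # Delta^+_h(e | X_{i-1})
        m = max(g / t for g in gains)            # stabilize the softmax
        ws = [math.exp(g / t - m) for g in gains]
        Z = sum(ws)
        e_star = random.choices(C, weights=ws)[0]
        logp += math.log(ws[C.index(e_star)] / Z)
        X.add(e_star)
    return X, logp
```

For learning, the same per-step log-probabilities can be accumulated along a fixed permutation of a training set (the greedy permutation $\sigma^G(S)$ or a random permutation $\sigma^R$, as described above) instead of along sampled items.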
5 Experiments

5.1 Maximum Cut

Let $G(V, E, w)$ be a weighted undirected graph, where $V$ denotes the set of vertices, $E$ is the set of edges and $w$ denotes the set of non-negative edge weights $w_{ij}$, $(i, j) \in E$. The maximum cut problem is to find a subset of nodes $S$ such that the cut value $f(S) = \sum_{i \in S} \sum_{j \in V \setminus S} w_{ij}$ is maximized. The cut function $f(S)$ is non-negative, non-monotone and submodular and has, for instance, been used in semi-supervised learning [Wang et al., 2013].

We generate synthetic data for our maximum cut experiment as follows (a code sketch of this pipeline appears at the end of this subsection). We first generate $n$ vectors $x_1, \ldots, x_n \in \mathbb{R}^d$ by sampling their components uniformly from $[0, 1]$ and arrange them in an $n \times d$ matrix $X$. Using a projection matrix $P$ of size $k \times d$, we project each $x_i$ to the lower dimensional space $\mathbb{R}^k$. The weights $w$ of the graph are defined using an RBF kernel $K$ on these projected vectors. For the resulting graph instance $G(V, E, w)$ we compute the maximum cut $S^* \subseteq V$ using an IP (Integer Programming) formulation in Gurobi [Gurobi Optimization, 2016]. We sample $X_1, X_2, \ldots, X_m$ as described above and, using the same projection matrix $P$ and RBF kernel $K$, we generate weights $w_i$ for the $m$ graphs and find the maximum cut $S^*_i$ by IP for each of these graphs. From this data we compose our training set as $\mathcal{X} = \{(X_1, S^*_1), \ldots, (X_m, S^*_m)\}$. The test set is composed using the same strategy.

In this experiment, our aim is to learn a projection matrix $\hat{P}$ such that PD2GREEDY, if executed on the graph induced by the points $\hat{P} X_i$, returns $S^*_i$. Note that our goal is not necessarily to recover the original projection matrix $P$; we are only interested in a projection matrix that allows us to find high value cuts through PD2GREEDY. For each element of the test set, we compute the corresponding graph using $\hat{P}$, find the maximum cut using PD2GREEDY, and denote the resulting set by $S^l_{DG}$. We also compute the corresponding graphs using the original projection matrix $P$ and a random projection matrix $P_r$ (which acts as a baseline) and find the maximum cuts using PD2GREEDY; we refer to these cuts as $S^o_{DG}$ and $S^r_{DG}$, respectively. When the cut value is calculated using IP, we refer to it as $S^o_{IP}$. We compare the ratio of each cut value calculated by PD2GREEDY to the cut value calculated by IP. Results are shown in Table 1. We observe that cuts computed with PD2GREEDY using the learned projection matrix outperform cuts computed via random projection matrices and the original projection matrix.

Table 1: Performance of different cuts for varying temperature $t$ in PD2GREEDY. The cuts $S^l_{DG}$ computed from PD2GREEDY using the learned projection matrix outperform cuts computed via random projection matrices and the original projection matrix. This indicates that the learned projection matrix is optimized to accommodate our probabilistic greedy algorithm.

Parameters        | $S^l_{DG}$    | $S^o_{DG}$    | $S^r_{DG}$
n=20, t=2^{-3}    | 0.88 ± 0.008  | 0.86 ± 0.010  | 0.64 ± 0.012
n=20, t=2^{-2}    | 0.84 ± 0.007  | 0.80 ± 0.009  | 0.62 ± 0.012
n=20, t=2^{-1}    | 0.79 ± 0.006  | 0.74 ± 0.007  | 0.60 ± 0.008
n=20, t=1         | 0.74 ± 0.006  | 0.69 ± 0.007  | 0.60 ± 0.008

The number of nodes in the graph is $n$, the temperature of PD2GREEDY is $t$, and 0.3 is used as the bandwidth parameter for the RBF kernel. The original points $x_i$ are in $\mathbb{R}^{10}$ and they are projected into $\mathbb{R}^5$ using the first 5 coordinates. Our training and test sets consist of 800 and 200 elements, respectively. During training and testing, we used $g_4(a, b; t)$ to define the likelihood of sets as in Equation (2). For optimization, we used Adam [Kingma and Ba, 2015] with a batch size of 16 and an initial learning rate of 0.02. In every case, we benefit from learning the projection matrix.
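A sketch of the synthetic data generation described above (ours; the exact RBF parameterization is not specified in the text, so the bandwidth is used here as $\exp(-\lVert z_i - z_j \rVert^2 / (2 \cdot 0.3^2))$, and the Gurobi IP step that labels each instance with its optimal cut is omitted):

```python
import numpy as np

def make_maxcut_instance(n=20, d=10, k=5, bandwidth=0.3, rng=None):
    """Generate one synthetic max-cut instance: points, projection, and
    RBF-kernel edge weights. The IP labeling step is omitted."""
    if rng is None:
        rng = np.random.default_rng()
    X = rng.uniform(0.0, 1.0, size=(n, d))  # n points in [0, 1]^d
    P = np.eye(k, d)                         # first-k-coordinates projection
    Z = X @ P.T                              # projected points in R^k
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * bandwidth ** 2))   # RBF kernel edge weights
    np.fill_diagonal(W, 0.0)                 # no self-loops
    return X, W

def cut_value(W, S):
    """f(S) = sum_{i in S} sum_{j in V \\ S} w_ij."""
    S = np.asarray(sorted(S), dtype=int)
    T = np.setdiff1d(np.arange(W.shape[0]), S)
    return W[np.ix_(S, T)].sum()
```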
5.2 On the Relation Between Learning and Temperature

In this section we experimentally investigate the relation between the temperature and the difficulty of parameter learning, providing rather informal but intuitive explanations for the results of our paper. Our findings are based on the experimental setting of Section 5.1. Figure 1 shows training log-likelihoods for different temperatures over training epochs on synthetic data.

For low temperatures $t$ ($t = 2^{-5}$ and $t = 2^{-4}$), we observe that the log-likelihood quickly converges to a (relatively) low log-likelihood level. For these low temperatures, the link function $g_4$ is relatively close to $g_2$, the standard link function of the randomized double greedy algorithm. However, as $g_4$ becomes less smooth with decreasing $t$, the corresponding gradients become non-informative and learning is more difficult. For high temperatures $t$ ($t = 2^2$ and $t = 2^3$), we also observe that the log-likelihoods converge to a (relatively) low log-likelihood level. For these high temperatures, $g_4$ is very smooth but far from $g_2$. With increasing temperature, the probability distribution over sets becomes more uniform and non-informative. Because of the smoothness of $g_4$, the optimization becomes easier, but it converges to a distribution that does not put much probability mass on the approximate maximizers that we are interested in. For medium temperatures $t$ ($t = 2^{-3}$ and $t = 2^{-2}$) we observe the highest log-likelihoods. This is also reflected in the test set performance in our other synthetic experiments, cf. Table 1. For these medium temperatures, $g_4$ is smooth enough to provide informative gradients and close enough to the original link function $g_2$ to preserve the correct probability distribution over sets.

To summarize, there is a trade-off between the temperature and the difficulty of parameter learning. Although Theorems 1 and 2 suggest that using a temperature $t$ as small as possible should lead to the smallest additive error compared to the corresponding variants of the double greedy algorithm, using low temperatures $t$ can make training difficult. Using high temperatures $t$ also results in bad log-likelihoods, as the distribution over sets becomes uniform. The plot on the right hand side of Figure 1 shows the final log-likelihoods (at epoch 10) for different temperatures. From this plot, we can see that low and high temperatures end up at a lower log-likelihood than medium temperatures. Thus, it is advantageous to use temperatures $t$ that result in smooth link functions while simultaneously ensuring an informative probability distribution over sets.

5.3 Product Recommendation

We consider the Amazon baby registry data [Gillenwater et al., 2014]. The data consists of baby registry data collected from Amazon and split into several datasets according to product categories, e.g. safety, strollers, etc. The datasets contain 5,500 to 13,300 registries of users. The data is processed using 10-fold cross-validation. As the order of items matters for PD2GREEDY, we permute the items according to their empirical frequencies for each dataset. For every dataset and fold, we fit a facility location diversity (FLID) type model [Tschiatschek et al., 2016], i.e.

$$h_\theta(S) = \sum_{i \in S} u_i + \underbrace{\sum_{d=1}^{D} \Big( \max_{i \in S} w_{i,d} - \sum_{i \in S} w_{i,d} \Big)}_{\mathrm{div}(S)},$$

where $u_i \in \mathbb{R}$ encodes the utility of product $i$ and $w_{i,d} \in \mathbb{R}_{\geq 0}$ encodes the $d$th latent property of item $i$. The intuitive idea behind the term $\mathrm{div}(S)$ is to encode repulsive dependencies between items that all have the $d$th latent property.
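A sketch of the FLID score as written above (ours; `u` and `w` would be the learned parameters):

```python
import numpy as np

def flid_score(S, u, w):
    """FLID score: sum of utilities plus the diversity term div(S).
    u: shape [n]; w: shape [n, D] with w >= 0; S: set of item indices."""
    idx = np.asarray(sorted(S), dtype=int)
    if idx.size == 0:
        return 0.0
    utility = u[idx].sum()
    div = (w[idx].max(axis=0) - w[idx].sum(axis=0)).sum()  # <= 0: repulsion
    return utility + div
```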
We train FLID models by optimizing the likelihood of the training data in Equation (2) for $t = 1$ and the sigmoid approximation (FLID$_D$), and by optimizing the likelihood in Equation (4) for $t = 0.1$ (FLID$_G$). For optimization we used Adam with a batch size of 1 for FLID$_D$ and with a batch size of 100 for FLID$_G$. We exhaustively computed the sum in Equation (4) for sets with at most 5 items and otherwise approximated this sum by randomly selecting $5!$ of the summands (and scaling the result accordingly). We decayed the initial learning rate of 0.01 with a decay factor of 0.9 per epoch. We trained all models for 20 epochs. The number of latent dimensions $D$ is chosen as 10 for ground sets with at most 40 items and as 20 for larger ground sets.

Figure 1: Relation between learning and temperature. (left) Training log-likelihoods over the number of training epochs. The highest log-likelihoods are achieved for medium temperatures $t$, while low and high temperatures result in decreased log-likelihoods. (right) Log-likelihoods after convergence over temperature $t$. The highest log-likelihood is achieved for temperature $t = 2^{-3}$.

Performance evaluation. To assess the performance of our trained models, we compute the following performance measures and compare them to the performance of modular models (fully factorized models independently predicting the inclusion of any element in a set, without considering correlations) and determinantal point processes (DPPs) fitted by expectation maximization [Gillenwater et al., 2014]. Let $\mathcal{D} = \{R_1, \ldots, R_N\}$ be a set of registries and let $\tilde{\mathcal{D}} = \{R \in \mathcal{D} \mid |R| > 1\}$ be the set of registries in $\mathcal{D}$ with at least 2 items.

Relative improvement in likelihood (RLL). This measure quantifies the increase in likelihood for FLID and DPPs over a modular model. The RLL is computed as

$$\mathrm{RLL} = \frac{\sum_{R \in \mathcal{D}} \log P(R) - \mathrm{LL}_{\mathrm{mod}}}{|\mathrm{LL}_{\mathrm{mod}}|},$$

where $P(R)$ is the probability of $R$ under FLID/DPP and $\mathrm{LL}_{\mathrm{mod}} = \sum_{R \in \mathcal{D}} \log P_{\mathrm{modular}}(R)$ is the log-likelihood of the fully factorized model.

Fill-in accuracy (ACC). We test how accurately the considered models can predict items removed from a baby registry. We compute the fill-in accuracy as

$$\mathrm{ACC}(\mathcal{D}; \hat{f}) = \frac{1}{|\tilde{\mathcal{D}}|} \sum_{R = \{r_1, \ldots, r_k\} \in \tilde{\mathcal{D}}} \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\big[\hat{f}(R - r_i) = r_i\big],$$

where $\hat{f}$ is the function used for prediction. For FLID and modular functions, $\hat{f}(S) = \arg\max_{e \in V \setminus S} P(S + e)$. In the case of DPPs, $\hat{f}$ returns the element with the largest marginal conditioned on $S$.

Mean reciprocal rank (MRR). Instead of only counting correct predictions, this measure considers the order in which the models would predict items. Formally, MRR is defined as

$$\mathrm{MRR}(\mathcal{D}; \hat{f}) = \frac{1}{|\tilde{\mathcal{D}}|} \sum_{R = \{r_1, \ldots, r_k\} \in \tilde{\mathcal{D}}} \frac{1}{k} \sum_{i=1}^{k} \frac{1}{r(r_i; R - r_i)},$$

where $r(r_i; R - r_i) = j$ if the item $r_i$ would be predicted as the $j$th item.
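A sketch of the two fill-in measures (ours; `rank_items` is a hypothetical helper that orders all items outside $S$ by the model's preference for extending $S$, most preferred first):

```python
def fill_in_metrics(registries, rank_items):
    """Compute fill-in accuracy (ACC) and mean reciprocal rank (MRR)
    as defined above, averaged over registries with more than 1 item."""
    accs, mrrs = [], []
    for R in registries:
        if len(R) < 2:
            continue                       # keep only registries |R| > 1
        hits, recip = [], []
        for r in R:
            ranking = rank_items(R - {r})  # predict the held-out item r
            rank = ranking.index(r) + 1    # r(r_i; R - r_i)
            hits.append(1.0 if rank == 1 else 0.0)
            recip.append(1.0 / rank)
        accs.append(sum(hits) / len(R))
        mrrs.append(sum(recip) / len(R))
    return sum(accs) / len(accs), sum(mrrs) / len(mrrs)
```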
The results are shown in Table 2. We observe that FLID in most cases outperforms the modular model. The RLL of DPPs is in general higher than that of FLID$_D$. In many cases, FLID$_G$ significantly outperforms DPPs and FLID$_D$ in terms of fill-in accuracy and MRR.

Table 2: Performance of different models on the Amazon baby registry dataset: log-modular models fitted with maximum likelihood (modular), DPPs trained with the EM algorithm (DPP), FLID trained with PD2GREEDY (FLID$_D$), and FLID trained with PGREEDY (FLID$_G$). FLID$_D$, FLID$_G$ and DPPs clearly outperform the modular model in most cases. FLID$_G$ outperforms FLID$_D$ and DPPs in many cases in terms of fill-in accuracy and MRR. RLL is omitted for FLID$_G$ as it would require approximation due to the intractability of (4) for large sets. An RLL of 0.00 indicates a negative likelihood improvement.

           | RLL [%]       | ACC [%]                        | MRR [–]
Dataset    | DPP    FLID_D | modular DPP    FLID_D  FLID_G  | modular DPP    FLID_D  FLID_G
safety     | 11.60  12.47  | 16.06   16.20  16.67   16.72   | 29.55   30.02  30.34   30.36
carseats   | 8.77   8.78   | 13.76   14.72  15.03   16.18   | 28.75   30.14  30.56   31.33
strollers  | 8.54   9.58   | 14.42   16.79  19.69   21.75   | 27.36   29.90  32.95   34.49
furniture  | 10.53  10.59  | 15.82   16.08  15.93   17.21   | 30.11   30.73  30.35   31.21
health     | 2.89   1.83   | 9.94    10.32  10.30   12.07   | 21.42   22.31  21.87   23.26
bath       | 2.66   1.65   | 7.30    8.68   8.24    9.90    | 16.06   17.32  17.13   18.88
media      | 2.35   1.19   | 9.31    10.07  10.04   14.31   | 20.29   21.70  21.86   26.14
toys       | 2.22   1.04   | 9.85    11.65  11.59   14.78   | 21.93   23.40  23.64   26.23
bedding    | 1.55   0.24   | 17.17   17.43  16.87   17.53   | 27.59   28.04  27.59   28.01
apparel    | 0.93   0.00   | 13.47   13.30  13.31   13.28   | 20.88   21.19  21.16   22.00
diaper     | 0.90   0.25   | 10.13   10.13  10.24   14.55   | 18.03   18.56  19.10   23.78
gear       | 1.68   0.56   | 6.23    6.34   6.10    6.46    | 14.80   15.12  14.66   14.71
feeding    | 0.17   0.00   | 6.38    6.77   6.49    9.24    | 14.53   15.22  15.02   18.00

5.4 Image Collection Summarization

We consider the problem of image collection summarization, i.e. the problem of selecting a subset of images $S$ of an image collection $V$ such that the selected images summarize the content of the collection $V$ well (e.g. $S$ covers all important scenes in $V$ and the pictures in $S$ are diverse).

Data. We use the image collection summarization data from [Tschiatschek et al., 2014]. This data consists of 14 image collections $V_1, \ldots, V_{14}$, each consisting of 100 images, i.e. $V_i = \{I^{(i)}_1, \ldots, I^{(i)}_{100}\}$. The images capture diverse themes, e.g. traveling, shopping, etc. For each image collection, hundreds of human generated summaries were collected. We heuristically prune the original human summaries using the technique proposed in [Tschiatschek et al., 2014]. The remaining human summaries $H^{(i)} = \{H^{(i)}_1, \ldots, H^{(i)}_{N_i}\}$ are considered as approximate maximizers of an unknown function $f(S \mid V_i)$ that we want to learn.

Evaluation, model, training. We follow the evaluation from [Tschiatschek et al., 2014; Dolhansky and Bilmes, 2016]. That is, we train our model on 13 of the 14 image collections and test the model on the held out image collection. We report the average VROUGE score [Tschiatschek et al., 2014] on the held out image collection. The VROUGE score quantifies the quality of a summary for an image collection, capturing notions of coverage and diversity. We train a model using the visual features computed in [Tschiatschek et al., 2014], similar to [Dolhansky and Bilmes, 2016]. Our model processes the 628-dimensional feature vector of each image $i$ by first compressing it into a 100-dimensional vector $h_i$ via a fully connected layer with ReLU activations. From this intermediate representation we compute a quality score $q_i$ for every image $i$ by another fully connected layer. From the intermediate representations we also compute means, maxes and variances across the images for every feature. These quantities are transformed into a threshold vector $b$ through another fully connected layer with ReLU activation ($b$ is 100-dimensional, like the intermediate representation). The qualities, the threshold vector and the intermediate representations parameterize a set function

$$f(S) = \sum_{i \in S} q_i + \sum_{j=1}^{100} \min\Big(b_j, \sum_{i \in S} h_{i,j}\Big).$$

We use dropout to avoid overfitting. We optimize the expected VROUGE score induced by our proposed algorithm PGREEDY, using a temperature of $t = 1$ and randomly sampled permutations of the items in the training samples (all human generated summaries consist of 10 images, and hence computing the exact expected VROUGE score using (4) is infeasible). For optimization, we used Adam [Kingma and Ba, 2015] with a learning rate of 0.001 and weight decay selected via cross-validation.
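A sketch of this architecture as a PyTorch module (ours; the layer sizes follow the text, while everything else, including how the pooled statistics are concatenated, is our assumption):

```python
import torch
import torch.nn as nn

class SummarizerScore(nn.Module):
    """Per-image qualities q_i plus a saturating coverage term
    sum_j min(b_j, sum_{i in S} h_{i,j}), as described above."""

    def __init__(self, feat_dim=628, hid_dim=100):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(feat_dim, hid_dim), nn.ReLU())
        self.quality = nn.Linear(hid_dim, 1)
        # thresholds b from pooled [mean, max, var] statistics of h
        self.threshold = nn.Sequential(nn.Linear(3 * hid_dim, hid_dim),
                                       nn.ReLU())

    def forward(self, feats, mask):
        """feats: [n, 628] image features; mask: [n] 0/1 selection of S."""
        h = self.embed(feats)                              # [n, 100]
        stats = torch.cat([h.mean(0), h.max(0).values, h.var(0)])
        b = self.threshold(stats)                          # [100]
        q = self.quality(h).squeeze(-1)                    # [n]
        cover = torch.minimum(b, (mask[:, None] * h).sum(0)).sum()
        return (mask * q).sum() + cover
```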
Results. We achieve an average normalized VROUGE score of 0.81. This is comparable to the performance achieved in [Dolhansky and Bilmes, 2016]. The authors of [Tschiatschek et al., 2014] achieved an average VROUGE score of 1.13 for their best model by learning a mixture of more than 500 hand specified component functions. In our experiments, we observed that our model easily overfits the training data, so more elaborate regularization methods could potentially improve the performance of our model further.

6 Related Work

While traditionally deep neural nets are used to learn parameters for a fixed size input vector, there is a growing literature on learning parameters of set functions with deep neural nets. For instance, [Rezatofighi et al., 2017] considers learning the parameters of the likelihood of a set using deep neural nets, and [Zaheer et al., 2017] designs specific layers to create permutation invariant and equivariant functions for set prediction. [Balcan and Harvey, 2011] considers learning submodular functions in a PAC-style framework and provides lower bounds on their learnability. [Lin and Bilmes, 2012; Tschiatschek et al., 2014] learn mixtures of submodular functions by optimizing the weights using a large margin structured prediction framework. Since only the weights, not the components, are learned, the learning process is highly dependent on the representativeness of the functions used. [Dolhansky and Bilmes, 2016] develop a new class of submodular functions, similar to deep neural nets, and learn them in a maximum margin setting. Recently, there have been applications which combine deep learning and reinforcement learning to develop heuristics for combinatorial optimization problems. For instance, [Khalil et al., 2017] uses graph embeddings and reinforcement learning to learn heuristics for graph combinatorial optimization problems such as maximum cut.

7 Conclusions

We considered learning of submodular functions which are used for subset selection through greedy maximization. To this end, we proposed variants of two types of greedy algorithms that allow for approximate maximization of submodular set functions such that their output can be interpreted as a distribution over sets. This distribution is differentiable, can be used within deep learning frameworks and enables gradient based learning. Furthermore, we theoretically characterized the tradeoff between smoothness and accuracy for some of the considered algorithms. We demonstrated the effectiveness of our approach on different applications, including maximum cut and product recommendation, observing good empirical performance.

Acknowledgements

This research was partially supported by SNSF NRP 75 grant 407540 167212, SNSF CRSII2 147633 and ERC StG 307036.

References

[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

[Balcan and Harvey, 2011] Maria-Florina Balcan and Nicholas J. A. Harvey. Learning submodular functions. In Symposium on Theory of Computing (STOC), pages 793–802, 2011.

[Buchbinder et al., 2015] Niv Buchbinder, Moran Feldman, Joseph (Seffi) Naor, and Roy Schwartz. A tight linear time (1/2)-approximation for unconstrained submodular maximization. SIAM Journal on Computing, 44(5):1384–1402, 2015.

[Djolonga and Krause, 2014] Josip Djolonga and Andreas Krause.
From MAP to marginals: Variational inference in Bayesian submodular models. In Advances in Neural Information Processing Systems (NIPS), pages 244–252, 2014.

[Dolhansky and Bilmes, 2016] Brian W Dolhansky and Jeff A Bilmes. Deep submodular functions: Definitions and learning. In Advances in Neural Information Processing Systems (NIPS), pages 3404–3412, 2016.

[Felzenszwalb et al., 2010] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645, 2010.

[Gillenwater et al., 2014] Jennifer A Gillenwater, Alex Kulesza, Emily Fox, and Ben Taskar. Expectation-maximization for learning determinantal point processes. In Advances in Neural Information Processing Systems (NIPS), pages 3149–3157, 2014.

[Gurobi Optimization, 2016] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2016.

[Khalil et al., 2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems (NIPS), pages 6351–6361, 2017.

[Kingma and Ba, 2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[Lin and Bilmes, 2010] Hui Lin and Jeff Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 912–920. Association for Computational Linguistics, 2010.

[Lin and Bilmes, 2012] Hui Lin and Jeff Bilmes. Learning mixtures of submodular shells with application to document summarization. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 479–490. AUAI Press, 2012.

[Nemhauser et al., 1978] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.

[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99. Curran Associates, Inc., 2015.

[Rezatofighi et al., 2017] S Hamid Rezatofighi, Anton Milan, Ehsan Abbasnejad, Anthony Dick, Ian Reid, et al. DeepSetNet: Predicting sets with deep neural networks. In International Conference on Computer Vision (ICCV), pages 5257–5266. IEEE, 2017.

[Tschiatschek et al., 2014] Sebastian Tschiatschek, Rishabh K Iyer, Haochen Wei, and Jeff A Bilmes. Learning mixtures of submodular functions for image collection summarization. In Advances in Neural Information Processing Systems (NIPS), pages 1413–1421, 2014.

[Tschiatschek et al., 2016] Sebastian Tschiatschek, Josip Djolonga, and Andreas Krause. Learning probabilistic submodular diversity models via noise contrastive estimation. In Artificial Intelligence and Statistics (AISTATS), pages 770–779, 2016.

[Tschiatschek et al., 2018] Sebastian Tschiatschek, Aytunc Sahin, and Andreas Krause. Differentiable submodular maximization. arXiv preprint arXiv:1803.01785, 2018.

[Wang et al., 2013] Jun Wang, Tony Jebara, and Shih-Fu Chang. Semi-supervised learning using greedy max-cut. Journal of Machine Learning Research, 14(1):771–800, 2013.
[Zaheer et al., 2017] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems (NIPS), pages 3394–3404, 2017.

[Zhang et al., 2016] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. In European Conference on Computer Vision (ECCV), pages 766–782. Springer, 2016.