# Constrained Interacting Submodular Groupings

Andrew Cotter¹, Mahdi Milani Fard¹, Seungil You², Maya Gupta¹, Jeff Bilmes³

¹Google AI. ²Kakao Mobility. ³University of Washington, Seattle; work done while at Google AI. Correspondence to: Andrew Cotter, Jeff Bilmes.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We introduce the problem of grouping a finite set $V$ into $m$ blocks, where each block is a subset of $V$ and where: (i) the blocks are individually highly valued by a submodular function $f$ (both robustly and in the average case) while satisfying block-specific matroid constraints; and (ii) block scores interact, with blocks being jointly scored highly via $f$, thus making the blocks mutually non-redundant. Submodular functions are good models of information and diversity; thus, the above can be seen as grouping $V$ into matroid-constrained blocks that are both intra- and inter-diverse. Potential applications include forming ensembles of classification/regression models, partitioning data for parallel processing, and summarization. In the non-robust case, we reduce the problem to non-monotone submodular maximization subject to multiple matroid constraints. In the mixed robust/average case, we offer a bicriterion guarantee for a polynomial-time deterministic algorithm and a probabilistic guarantee for a randomized algorithm, as long as the involved submodular functions (including the inter-block interaction terms) are monotone. We close with a case study in which we use these algorithms to find high-quality, diverse ensembles of classifiers, showing good results.

1. Introduction

In recent years, submodular functions (Fujishige, 2005) have been used to address an increasingly wide variety of problems in machine learning and artificial intelligence. This includes energy functions in probabilistic models (Kohli et al., 2007; Gotovos et al., 2015; Djolonga et al., 2016), influence in social networks (Kempe et al., 2003; Mossel & Roch, 2007), crowd teaching (Singla et al., 2014), non-parametric Bayesian estimation (Reed & Ghahramani, 2013), document and speech summarization (Lin et al., 2009; Lin & Bilmes, 2011; Li et al., 2012), image summarization (Tschiatschek et al., 2014; Singla et al., 2016), and clustering (Narasimhan et al., 2005).

In this paper, we introduce and apply a new submodular optimization problem related to partitioning, covering, and packing (Schrijver, 2003). A set function $f : 2^V \to \mathbb{R}_+$ may be normalized (i.e., $f(\emptyset) = 0$), monotone non-decreasing (i.e., $f(A) \le f(B)$ whenever $A \subseteq B$), and/or submodular (i.e., $\forall A, B \subseteq V$, $f(A) + f(B) \ge f(A \cup B) + f(A \cap B)$). A function $g$ is said to be supermodular if $-g$ is submodular. A function $m$ is modular if it is both submodular and supermodular. An $m$-partition of $V$ is a set of $m$ subsets, called blocks, that are non-intersecting ($A^\pi_i \cap A^\pi_j = \emptyset$ for all $i \neq j$) and covering ($\bigcup_i A^\pi_i = V$). An $m$-covering is a set of blocks that is required only to be covering. In an $m$-covering, we might also have a multiplicity constraint, expressed as a positive integer valued vector $k = (k_v : v \in V)$ with $k_v \in \mathbb{Z}_+$. To be a $(k, m)$-covering, we must have an $m$-covering with no multiplicity violations, i.e., $|\{i \in [m] : v \in A^\pi_i\}| \le k_v$ for all $v \in V$. A packing is a set of blocks that is required only to be non-intersecting. When we wish to refer collectively either to a partition, a covering, or a packing, we use the term grouping. Given a finite set $V$ of size $n = |V|$, an integer $m \in [1, n]$, and $m$ monotone non-decreasing submodular functions $f_i : 2^V \to \mathbb{R}_+$ for $i \in [m]$, the problem we study finds a feasible $m$-partition, $m$-covering, or $m$-packing $\pi$ of $V$ into $m$ blocks $A^\pi_1, A^\pi_2, \ldots, A^\pi_m$ that are good in a way to be described below.
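As a concrete illustration of these definitions (a toy sketch of ours, not from the paper), a set-coverage function is normalized, monotone, and submodular, and the inequality $f(A) + f(B) \ge f(A \cup B) + f(A \cap B)$ can be verified exhaustively on a small ground set; the `cover` map below is invented for illustration:

```python
import itertools

def make_coverage_fn(cover):
    """cover maps each ground element to the set of 'topics' it covers.
    f(S) = number of distinct topics covered: normalized, monotone, submodular."""
    def f(S):
        topics = set()
        for v in S:
            topics |= cover[v]
        return len(topics)
    return f

cover = {"goldfish": {"fish", "pet"}, "carp": {"fish"},
         "crow": {"bird"}, "raven": {"bird", "myth"}}
f = make_coverage_fn(cover)
V = list(cover)

def is_submodular(f, V):
    # Check f(A) + f(B) >= f(A | B) + f(A & B) over every pair of subsets.
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in itertools.combinations(V, r)]
    return all(f(A) + f(B) >= f(A | B) + f(A & B)
               for A in subsets for B in subsets)

assert f(frozenset()) == 0   # normalized
assert is_submodular(f, V)   # submodular
```

The exhaustive check is only feasible for tiny ground sets, but it makes the lattice inequality concrete: once a topic is already covered, adding an element that covers it again contributes nothing extra.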
Feasibility of our groupings is expressed using matroids, which are powerful combinatorial objects that can express many useful constraints over sets. A matroid (Oxley, 2006) $M = (V, \mathcal{I})$ consists of a finite set $V$ and a non-empty set of independent subsets $\mathcal{I} = \{I_1, I_2, \ldots\}$, where $I_i \subseteq V$, that is down-closed ($A \subseteq B \in \mathcal{I} \Rightarrow A \in \mathcal{I}$) and in which all maximally independent sets have the same size (i.e., $\forall A, B \in \mathcal{I}$ with $|A| < |B|$, $\exists v \in B \setminus A$ having $A + v \in \mathcal{I}$).

For a feasible grouping to be good, it must have several properties. First, the blocks in the grouping should both be individually highly diverse and also all be highly diverse on average, where diversity is measured by the functions $\{f_i\}_{i \in [m]}$. For example, suppose the elements of the ground set include animal names: "goldfish" and "carp" have a similar meaning, as do "crow" and "raven". But fish are very different from birds, so each of the sets {goldfish, crow} and {carp, raven} is diverse. Second, and more importantly, different blocks should not be redundant w.r.t. each other. This avoids the case where two different blocks convey the same information but in different ways, something that might happen even if the two blocks are disjoint. For example, the sets $A_1 = \{\text{goldfish}, \text{crow}\}$ and $A_2 = \{\text{carp}, \text{raven}\}$ are similar and hence redundant, despite being disjoint, and thus would be undesirable blocks when chosen together. This distinction between disjointness and non-redundancy is particularly relevant in the context of submodular scoring functions where, as measured by a submodular function $f$, redundancy between $A_1$ and $A_2$ would mean that $f(A_1 \cup A_2) \approx f(A_1)$ and/or $f(A_1 \cup A_2) \approx f(A_2)$. We show below that this idea has a number of natural applications, one of which we evaluate in our case study section.

The paper is organized as follows. In the remainder of this section, we formally define our contributions (Section 1.1) and outline their utility in practice (Section 1.2).
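The two matroid axioms above (down-closure and the exchange property) can be checked exhaustively on a toy ground set; this sketch is ours, not from the paper:

```python
import itertools

def is_matroid(V, independent):
    # Exhaustively check the matroid axioms over all subsets of a small V:
    # non-emptiness, down-closure, and the exchange property.
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in itertools.combinations(V, r)]
    I = [S for S in subsets if independent(S)]
    if frozenset() not in I:
        return False
    down_closed = all(A in I for B in I for A in subsets if A <= B)
    exchange = all(any(A | {v} in I for v in B - A)
                   for A in I for B in I if len(A) < len(B))
    return down_closed and exchange

V = ["a", "b", "c", "d"]
assert is_matroid(V, lambda S: len(S) <= 2)           # 2-uniform matroid
assert not is_matroid(V, lambda S: "a" not in S or len(S) <= 1)
```

The second example is down-closed but fails the exchange axiom ({"a"} cannot be grown toward {"b", "c"}), so it is not a matroid even though it is a perfectly good down-closed family.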
This section also defines objectives that, when constrainedly optimized, achieve our stated grouping goals. Section 2 places this paper in the context of previous work, and demonstrates that our contributions are novel. Section 3 formally outlines our approach and Section 4 details how we achieve cross-block diversity. Section 5 provides algorithms for constrainedly optimizing our objectives. Notably, we show a bi-criterion multiplicative approximation ratio guarantee for a fast polynomial-time deterministic algorithm (Theorem 1), and also provide a randomized version of the algorithm giving a guarantee with high probability (using Lemma 4). The guarantees themselves are rather complex, and they are best appreciated in context, so we refer the reader to Theorem 1 and Lemma 4 for their statements. Lastly, Section 6 explores a case study application where we show our approach can be used to produce ensembles of machine learning models. We demonstrate that our approach improves on previous state-of-the-art results and, moreover, that the groupings achieve the aforementioned desired properties.

1.1. Contributions and Objectives

Our starting point is a recently introduced objective (Wei et al., 2015) that takes a convex combination of a robust and an average objective and finds a grouping $\pi$ that scores highly w.r.t.:

$$F_{ra}(\pi) = \lambda_1 \min_{i \in [m]} f_i(A^\pi_i) + \lambda_2 \sum_{i \in [m]} f_i(A^\pi_i), \qquad (1)$$

where the $\lambda_i$'s are non-negative coefficients and the $f_i$'s may be distinct submodular functions. Our first contribution is that, unlike Wei et al. (2015), which only handles partitions, we also handle coverings and packings. Our second contribution is the inclusion of more general block-specific constraints, expressed as intersections of matroids on an expanded ground set. For example, we may wish for blocks not to exceed a certain size, or for each block to correspond to a sub-tree of some graph (Section 3).
Our third contribution, and the most significant, is the introduction of cross-block interaction terms, enabling us to avoid groupings containing pairs of blocks that jointly score poorly. Our final objective is:

$$F(\pi) = F_{ra}(\pi) + \lambda_3 \min_{i,j \in [m],\, i<j} F_{i,j}(A^\pi_i, A^\pi_j) + \lambda_4 \sum_{i,j \in [m],\, i<j} F_{i,j}(A^\pi_i, A^\pi_j), \qquad (2)$$

where each $F_{i,j}$ is a cross-block interaction term jointly scoring the pair of blocks $(A^\pi_i, A^\pi_j)$.

2. Related Work

Maximizing Eq. (1) with $\lambda_1 = 1$ and $\lambda_2 = 0$ over the space of all partitions corresponds to the submodular fair allocation (SFA) problem, which cannot be approximated to a factor better than $1/2 + \epsilon$ for any $\epsilon > 0$ (Golovin, 2005). For submodular $f_i$'s, (Golovin, 2005) also gives a matching-based algorithm with a factor $1/(n - m + 1)$ approximation. A binary search algorithm (Khot & Ponnuswami, 2007) has a better factor of $1/(2m - 1)$ that is independent of $n$. A less practical approach uses an ellipsoid approximation (Goemans et al., 2009) of each submodular function and reduces SFA to its modular version, yielding an approximation factor of $1/(\sqrt{n}\, m^{1/4} \log n \log^{3/2} m)$. (Wei et al., 2015) shows that a greedy algorithm has a $1/m$ approximation when all $f_i$'s are the same. Fair allocation problems are also studied with sometimes non-submodular objectives (Ghodsi et al., 2017).

Figure 1. Illustration of the mapping from the original ground set $V$ to the expanded ground set $V'$. A subset $S \subseteq V'$ is shown as red dots. The resulting blocks are shown at the bottom, e.g., $v_3 \in A^\pi_1$, $\{v_1, v_i\} \subseteq A^\pi_2$, $v_n \in A^\pi_3$, etc.

Maximizing Eq. (1) with $\lambda_1 = 0$ and $\lambda_2 = 1$ over the space of all partitions corresponds to the submodular welfare problem (SW), which can be reduced to submodular maximization on an expanded ground set under a partition matroid constraint (Vondrák, 2008); using a greedy algorithm, this approach has a $1/2$ guarantee (Fisher et al., 1978). The multi-linear extension of a submodular function can be used in a continuous greedy approach that solves SW with a tight $(1 - 1/e)$ approximation factor (Vondrák, 2008). When $\lambda_1 > 0$ and $\lambda_2 > 0$, (Wei et al., 2015) offers two approaches.
The first takes the best of the two solutions computed under $\lambda_1 = 1 = 1 - \lambda_2$ and $\lambda_1 = 0 = 1 - \lambda_2$ to provide a $\max\!\left(\frac{\beta\alpha\lambda_1}{\beta + \alpha},\ \lambda_2\beta\right)$ guarantee, where $\alpha$ is the approximation factor for the SFA problem and $\beta$ the factor for the SW problem. A second, binary-search approach, the inspiration for Algorithm 1, finds a partition whose block objective value is at least $\max\!\left(\frac{\delta}{1 - \alpha + \delta},\ \lambda_2\alpha\right)(\mathrm{OPT} - \epsilon)$ for an $(\alpha - \delta)$ fraction of the blocks, where $\alpha$ is the approximation factor of a SW solver and $0 < \delta < \alpha$.

The novelty of our optimization problem is that: (i) we may form not only partitions but also $(k, m)$-coverings and packings; (ii) we utilize a set of $m$ matroids to define the feasibility of the individual blocks in a grouping; and (iii) we explicitly incorporate cross-block interaction terms.

3. Approach

As with the strategy for the submodular welfare problem (Vondrák, 2008), our approach to maximizing Eq. (2) starts by defining an expanded ground set $V'$ (Figure 1), consisting of $m$ disjoint copies of the original ground set $V$:

$$V' \triangleq \biguplus_{i=1}^{m} V^{(i)} = \{(v, i) : v \in V,\, i \in [m]\},$$

where $|V'| = nm$ and $\uplus$ is the disjoint union operator. $V'$ can be viewed as indexing into an $n \times m$ matrix, with $V^{(i)}$ (isomorphic to $V$) being the $i$th column, and $R(v) \triangleq \{(v, i) : i \in [m]\}$ (isomorphic to $[m]$) being the $v$th row. We also define a mapping from subsets $S \subseteq V'$ to the original ground set, and another mapping that selects the original ground set elements corresponding to those in the $i$th column, as follows:

$$\mathrm{abs}(S) \triangleq \{v \in V : \exists i \in [m] \text{ with } (v, i) \in S\}, \qquad \mathrm{col}(S, i) \triangleq \mathrm{abs}\big(S \cap V^{(i)}\big).$$

Given $S \subseteq V'$, a grouping $\pi$ is obtained by setting $A^\pi_i = \mathrm{col}(S, i)$ for all $i \in [m]$. Using these mappings, we will ultimately (Eq. (5)) define a new objective $F' : 2^{V'} \to \mathbb{R}$ on the expanded ground set that produces the valuation of a set $S \subseteq V'$ indirectly, via submodular functions defined on the original ground set, using $\mathrm{col}(S, i)$ to map subsets of $V'$ to subsets of $V$.

3.1.
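The expanded ground set and the abs/col mappings above translate directly into code; this is a small sketch of ours (the set-of-pairs representation is an illustrative choice, not the paper's):

```python
def expanded_ground_set(V, m):
    # V' = {(v, i) : v in V, i in [m]}, so |V'| = n * m.
    return {(v, i) for v in V for i in range(m)}

def abs_(S):
    # abs(S) = {v in V : there exists i with (v, i) in S}.
    return {v for (v, _) in S}

def col(S, i):
    # col(S, i) = abs(S intersected with V^(i)): the block with index i.
    return {v for (v, j) in S if j == i}

V = {"v1", "v2", "v3"}
Vp = expanded_ground_set(V, m=2)        # 6 pairs
S = {("v1", 0), ("v2", 0), ("v1", 1)}   # v1 is assigned to both blocks
blocks = [col(S, i) for i in range(2)]
```

Here `blocks` recovers $A^\pi_1 = \{v_1, v_2\}$ and $A^\pi_2 = \{v_1\}$; since $v_1$ appears in two columns, this $S$ corresponds to a covering rather than a partition.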
Partitionings, Packings, and Coverings

During optimization, we will take the feasible set $\mathcal{F} \subseteq 2^{V'}$ to be the intersection of the independent sets of zero or more matroids over $V'$. Such matroid independence constraints can be used to ensure that any feasible solution maps back to a partitioning, packing, or covering over $V$. When we wish for a partition, we can maximize $F'$ subject to a partition matroid on $V'$ whose independent sets are defined based on the rows. That is, the independent sets are defined as follows:

$$\mathcal{I}_k = \left\{ S \subseteq V' : \forall v \in V,\ |S \cap R(v)| \le k_v \right\}, \qquad (3)$$

where $k = (k_{v_1}, k_{v_2}, \ldots, k_{v_n})$ and $k_v = 1$ for all $v$. In words, at most one copy of each element of the original ground set may be present. To express $(k, m)$-covering constraints on $V$, we allow $k_v \ge 1$ for each $v$. A covering or partition is obtained when maximizing a monotone $F'$, since any candidate solution that is not yet a covering or partition can be made so by adding elements until all constraints are met with equality. Likewise, a packing constraint can be expressed using an $\ell$-uniform matroid with independent sets:

$$\mathcal{I}_\ell = \left\{ S \subseteq V' : |S| \le \ell \right\}. \qquad (4)$$

If we set $\mathcal{F} = \mathcal{I}_\ell \cap \mathcal{I}_k$, where $k = \mathbf{1}$ is the vector of all 1s, this expresses a packing constraint, and is the intersection of two matroids defined on $V'$. In fact, the intersection of a matroid and an $\ell$-uniform matroid is still a single matroid, called its $\ell$-truncation (Schrijver, 2003), and hence $\mathcal{I}_\ell \cap \mathcal{I}_k$ constitutes only a single matroid.

3.2. Block Constraints

Having discussed how matroid constraints on the rows of $V'$ can be used to express the partitioning, packing and covering problems, we now turn our attention to how matroid constraints on the columns can be used to represent more general constraints on the blocks. Imagine that each block is required to satisfy its own matroid independence constraint: we are given $m$ matroids $\{(V, \mathcal{I}_i)\}_{i \in [m]}$ with independent sets $\mathcal{I}_i$ for $i \in [m]$, where each matroid is defined over the original ground set $V$.
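Independence in the row-based matroids of Eqs. (3) and (4) amounts to simple counting, as in this sketch (helper names are ours, not the paper's):

```python
from collections import Counter

def in_partition_matroid(S, k):
    # Eq. (3): each original element v may appear in at most k[v] blocks.
    counts = Counter(v for (v, _) in S)
    return all(c <= k[v] for v, c in counts.items())

def in_uniform_matroid(S, ell):
    # Eq. (4): at most ell elements of V' may be selected in total.
    return len(S) <= ell

S = {("a", 0), ("a", 1), ("b", 1)}
assert in_partition_matroid(S, {"a": 2, "b": 1})      # fine as a covering
assert not in_partition_matroid(S, {"a": 1, "b": 1})  # violates a partition
assert in_uniform_matroid(S, 3) and not in_uniform_matroid(S, 2)
```

Checking both predicates at once corresponds to the $\ell$-truncation discussed above, which is itself a single matroid.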
Using the expanded ground set and taking $S \subseteq V'$, we have that $A^\pi_i \in \mathcal{I}_i$ if and only if $\mathrm{col}(S, i) \in \mathcal{I}_i$. Given a size-$m$ set of matroids $\{M_i\}_{i \in [m]}$ where $M_i = (V, \mathcal{I}_i)$, the matroid union theorem (Schrijver, 2003, Thm. 42.1a) states that a new matroid can be defined on $V'$ with independent sets $\mathcal{I}_b = \{I_1 \uplus I_2 \uplus \ldots \uplus I_m : I_i \in \mathcal{I}_i,\ i \in [m]\}$. Despite there being a matroid for each block, the disjoint union of these matroids is a single matroid on $V'$. One of the simplest examples of such constraints, and the one that we use in our case study (Section 6), simply places an upper bound on the number of elements within each block. The resulting matroid is a column-based analogue of Eq. (3).

4. Cross-block Interaction

In order to define the expanded objective $F'$ in terms of submodular functions on the original ground set $V$, we will define set functions via mappings from subsets of an expanded ground set to subsets of the original ground set. The next result shows that in some cases, composition with set-to-set mappings preserves submodularity or supermodularity.

Lemma 1. Let $V$, $V'$ be two ground sets and define a set-to-set mapping function $G : 2^{V'} \to 2^V$. Also, let $f : 2^V \to \mathbb{R}_+$ be monotone non-decreasing and submodular, and let $g : 2^V \to \mathbb{R}_+$ be monotone non-decreasing and supermodular. Then:

1. If $G$ is monotone non-decreasing (i.e., $G(S) \subseteq G(T)$ whenever $S \subseteq T$), then $f \circ G$ and $g \circ G$ are both monotone non-decreasing.¹

2. If, for all $S, T \subseteq V'$, $G(S \cup T) = G(S) \cup G(T)$ and $G(S \cap T) \subseteq G(S) \cap G(T)$, then $f \circ G : 2^{V'} \to \mathbb{R}_+$ is submodular.

3. If, for all $S, T \subseteq V'$, $G(S \cup T) \supseteq G(S) \cup G(T)$ and $G(S \cap T) = G(S) \cap G(T)$, then $g \circ G : 2^{V'} \to \mathbb{R}_+$ is supermodular.

Proof. In Appendix C.

The objective defined in Eq. (2) involves cross-block interaction terms via $F_{i,j}(A^\pi_i, A^\pi_j)$ for all $i, j \in [m]$. The ground set expansion defined in Section 3, combined with the above lemma, surprisingly allows many such interaction terms to be handled in a way that preserves submodularity. To this end, we define three additional set-to-set (i.e.,
$2^{V'} \to 2^V$) mappings, corresponding to the union, intersection, or symmetric difference of the $i$th and $j$th mapped subsets:

$$G^{\diamond}_{i,j}(S) \triangleq \mathrm{col}(S, i) \diamond \mathrm{col}(S, j), \qquad \text{where } \diamond = \cup, \cap, \text{ or } \triangle.$$

The following lemma, which largely follows from Lemma 1, shows that these can be used in a way that preserves sub/supermodularity:

Lemma 2. Let $f : 2^V \to \mathbb{R}$ be monotone non-decreasing submodular, $m : 2^V \to \mathbb{R}$ be non-negative modular, and $g : 2^V \to \mathbb{R}$ be monotone non-decreasing supermodular. Then $f \circ G^{\cup}_{i,j} : 2^{V'} \to \mathbb{R}$ is monotone non-decreasing submodular, $g \circ G^{\cap}_{i,j} : 2^{V'} \to \mathbb{R}$ is monotone non-decreasing supermodular, and $m \circ G^{\triangle}_{i,j} : 2^{V'} \to \mathbb{R}$ is non-negative submodular.

Proof. In Appendix C.

If one wishes to reduce the number of common elements in pairs of blocks, a useful cross-block interaction term is $|V| - |G^{\cap}_{i,j}(S)|$. Because the cardinality function is non-decreasing and modular, $|G^{\cap}_{i,j}(S)|$ is non-decreasing supermodular by Lemma 2. A submodular function (including the constant $|V|$) minus a supermodular function is submodular, and thus the above interaction function is submodular and non-increasing. In the partitioning and packing settings, if we have a non-decreasing submodular function $f$ that measures the diversity of a set, one could use $f\big(G^{\cup}_{i,j}(S)\big)$, which is both submodular and non-decreasing, as the cross-block interaction term, to encourage pairwise diversity. If one wants blocks to have large pairwise differences, then a natural cross-block interaction term is $|G^{\triangle}_{i,j}(S)|$. This function is again submodular, but is non-monotone (it is neither non-increasing nor non-decreasing).

¹Note that $\circ$ denotes function composition.

4.1. Submodular Approximation of $f \circ G^{\triangle}_{i,j}$

Unfortunately, $f \circ G^{\triangle}_{i,j}$ is not necessarily submodular, even for a submodular $f$, and has no obvious difference representation (Iyer & Bilmes, 2012). However, we can derive (non-monotone) submodular bounds based on the curvature.
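The three pairwise mappings are one-liners over the col mapping of Section 3; a sketch (again with our own helper names), using the cardinality-based interaction terms discussed above:

```python
def col(S, i):
    # col(S, i): original-ground-set elements assigned to block i.
    return {v for (v, j) in S if j == i}

def G(S, i, j, op):
    # G^op_{i,j}(S) = col(S, i) op col(S, j), for op in
    # {"u": union, "n": intersection, "d": symmetric difference}.
    Ai, Aj = col(S, i), col(S, j)
    return {"u": Ai | Aj, "n": Ai & Aj, "d": Ai ^ Aj}[op]

S = {("a", 0), ("b", 0), ("a", 1), ("c", 1)}
overlap = len(G(S, 0, 1, "n"))     # shared elements: {"a"}
difference = len(G(S, 0, 1, "d"))  # pairwise difference: {"b", "c"}
```

With a ground set `V`, `len(V) - overlap` would then be the non-increasing submodular redundancy penalty $|V| - |G^{\cap}_{i,j}(S)|$, and `difference` the non-monotone submodular term $|G^{\triangle}_{i,j}(S)|$.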
A normalized monotone submodular function $f : 2^V \to \mathbb{R}$ has curvature $c$ if $f(v \mid S) \ge (1 - c)f(v)$ for all $S \subseteq V$ and $v \in V \setminus S$, where $f(v \mid S) \triangleq f(S \cup \{v\}) - f(S)$ is the gain. Curvature is easily computed in $O(n)$ time since $c = 1 - \min_{v \in V} f(v \mid V \setminus v)/f(v)$, where $c \in [0, 1]$. Modular functions have $c = 0$, fully curved functions have $c = 1$, and submodular function classes can be restricted to those having a particular $c$, since many useful submodular functions have non-extreme curvature, e.g., sums of non-asymptoting concave functions composed with non-negative modular functions (Stobbe & Krause, 2010).

Given a set $X \subseteq V$, a normalized monotone submodular function $f$ with curvature $c$, and any ordering $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)$ of $V$ such that $X = \{\sigma_1, \sigma_2, \ldots, \sigma_{|X|}\}$, a modular subgradient $m^X_f$ of $f$ can be obtained (Fujishige, 2005) by setting $m^X_f(\sigma_i) = f(\sigma_i \mid \sigma_1, \sigma_2, \ldots, \sigma_{i-1})$, which satisfies $m^X_f(X) = f(X)$ and $m^X_f(Y) \le f(Y)$ for all $Y$. Since $f$ is monotone, the subgradient is non-negative. Also, submodularity ensures that $m^X_f(v) \ge (1 - c)f(v)$ and hence $f(Y) \ge m^X_f(Y) \ge (1 - c)f(Y)$ for any $X, Y$, where the second inequality follows since $f$ is normalized submodular. This enables us to obtain a non-monotone submodular lower bound via $f \circ G^{\triangle}_{i,j}(S) \ge m^X_f \circ G^{\triangle}_{i,j}(S) \ge (1 - c)\, f \circ G^{\triangle}_{i,j}(S)$ for any $X \subseteq V$, which approximates the original problem:

Lemma 3. Given any algorithm that produces a solution $\hat S$ having the property that $F(\hat S) + m^X_f \circ G^{\triangle}_{i,j}(\hat S) \ge \alpha \max_{S \in \mathcal{F}} \big[ F(S) + m^X_f \circ G^{\triangle}_{i,j}(S) \big]$ for $\alpha > 0$, then $\hat S$ also has the property that $F(\hat S) + f \circ G^{\triangle}_{i,j}(\hat S) \ge \alpha(1 - c) \max_{S \in \mathcal{F}} \big[ F(S) + f \circ G^{\triangle}_{i,j}(S) \big] = \alpha(1 - c)\,\mathrm{OPT}$.

Proof. In Appendix C.

5. Algorithms and Optimization

Eq. (2) can now be written in terms of the expanded ground set as $F' : 2^{V'} \to \mathbb{R}$:

$$F'(S) = \lambda_1 \min_{i \in [m]} f_i(\mathrm{col}(S, i)) + \lambda_2 \sum_{i \in [m]} f_i(\mathrm{col}(S, i)) + \lambda_3 \min_{i,j \in [m],\, i<j} F_{i,j}\big(\mathrm{col}(S, i), \mathrm{col}(S, j)\big) + \lambda_4 \sum_{i,j \in [m],\, i<j} F_{i,j}\big(\mathrm{col}(S, i), \mathrm{col}(S, j)\big), \qquad (5)$$

to be maximized subject to the intersection of multiple matroid constraints defined on the expanded ground set.

Algorithm 1:
1 Input: expanded ground set $V'$, objective $F'$, matroids $\mathcal{I}_1, \ldots, \mathcal{I}_k$, CALLBACK with approximation factor $\alpha$, tolerance $\epsilon$
2 Let $S = \emptyset$, $c_{\min} = 0$, and $c_{\max} = \eta_f$ (an upper bound on the objective)
3 While $c_{\max} - c_{\min} > \epsilon$:
4 Let $c = (c_{\max} + c_{\min})/2$ and $S_c = \mathrm{CALLBACK}(V', F'_c, k, \mathcal{I}_1, \ldots, \mathcal{I}_k)$
5 If $F'_c(S_c) < \alpha c$, then let $c_{\max} = c$; else let $c_{\min} = c$ and $S = S_c$
6 Return $S$
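The curvature and the modular subgradient $m^X_f$ above can be computed directly from oracle access to $f$; a sketch of ours (function names are not the paper's), using $f(S) = \sqrt{|S|}$ as the test function:

```python
import math

def curvature(f, V):
    # c = 1 - min_v f(v | V \ {v}) / f({v}), for a normalized monotone f.
    Vs = frozenset(V)
    return 1 - min((f(Vs) - f(Vs - {v})) / f(frozenset({v})) for v in V)

def modular_subgradient(f, V, X):
    # Order V so that the elements of X come first; then set
    # m(sigma_i) = f(sigma_i | sigma_1, ..., sigma_{i-1}).
    # This gives m(X) = f(X) and m(Y) <= f(Y) for every Y.
    order = sorted(X) + sorted(set(V) - set(X))
    m, prefix = {}, frozenset()
    for v in order:
        m[v] = f(prefix | {v}) - f(prefix)
        prefix = prefix | {v}
    return m

f = lambda S: math.sqrt(len(S))  # normalized, monotone, submodular
V = range(10)
X = {0, 1, 2}
m = modular_subgradient(f, V, X)
assert abs(sum(m[v] for v in X) - f(frozenset(X))) < 1e-9  # tight at X
```

The tightness at $X$ and the global lower bound are exactly what Lemma 3 exploits: optimizing with $m^X_f$ in place of $f$ costs at most a $(1 - c)$ factor.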
The matroid constraints ensure that the solution can be transformed back to the original ground set while preserving the approximation ratio. The algorithm used depends both on which terms are present (i.e., have nonzero associated $\lambda$s), and on whether the submodular functions are monotone.

When the overall objective is non-monotone submodular (so $\lambda_1 = 0$ and $\lambda_3 = 0$), the problem of maximizing Eq. (5) becomes one of non-monotone submodular maximization subject to matroid constraints. One can use algorithms such as Lee et al. (2010); Ward (2012), or the more recent, faster, and scalable approach given in Feldman et al. (2017). When using $f \circ G^{\triangle}_{i,j}(S)$ as a non-submodular block-pair reward for non-robust coverings, a modular approximation adjusts any guarantees by $1 - c$ (Lemma 3). The non-robust packing or partitioning problem reduces to monotone submodular maximization subject to two matroid constraints, for which there are a variety of good solutions. For example, the efficient greedy algorithm (Nemhauser & Wolsey, 1978) solves this problem with a $1/3$ guarantee, while more recent but also more complicated approaches, such as Ward (2012), can solve it with an approximation ratio of $(k + 3)/2 + \epsilon$ for $k$ matroids (here $k = 2$).

When all of the involved submodular functions are monotone non-decreasing, a single robust term can be handled (one of $\lambda_1$ or $\lambda_3$ may be nonzero). Here, we are inspired by an approach originally used for robust submodular optimization (Krause et al., 2008), where the goal is to find the max of the min over a set of submodular functions subject to a cardinality constraint. In Wei et al. (2015), this was extended to apply to a mixed robust/average objective over partitions. It turns out that essentially the same idea (this is Algorithm 1) applies to the more general case of coverings and packings, as well as when there are matroids constraining each block individually, and also when the block scores interact.
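The bisection idea behind Algorithm 1 can be sketched as follows. Here `callback(c)` stands in for the inner $\alpha$-approximate solver of the truncated objective $F'_c$; the toy callback (which simply reports the achievable level of a fixed optimum `opt = 0.7`) is ours, for illustration only:

```python
def robust_binary_search(callback, alpha, eta, eps):
    """Bisect on the achievable robust level c in [0, eta]. callback(c)
    returns (S_c, value); if value < alpha * c, level c is deemed
    unattainable and the upper bound shrinks, otherwise S_c is kept."""
    c_min, c_max, best = 0.0, eta, None
    while c_max - c_min > eps:
        c = (c_max + c_min) / 2.0
        S_c, value = callback(c)
        if value < alpha * c:
            c_max = c
        else:
            c_min, best = c, S_c
    return best, c_min

# Toy inner solver: the true optimum level is 0.7 and the solver is exact.
cb = lambda c: ({"solution"}, min(c, 0.7))
best, level = robust_binary_search(cb, alpha=1.0, eta=1.0, eps=1e-6)
```

With an exact callback the certified level converges to the true optimum; with an $\alpha$-approximate callback, the test against $\alpha c$ is what yields the bicriterion guarantee of Theorem 1.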
In brief, this algorithm proceeds by iteratively optimizing inner submodular optimization problems using a provided CALLBACK function, which is assumed to run in polynomial time and return an $\alpha$-approximation. The final algorithm achieves a nearly constant-factor approximation for a constant fraction of the blocks. As the fraction of the blocks shrinks, the guarantee for those blocks grows, and the guarantee holds simultaneously for a range over the fractions.

Theorem 1. A call to Algorithm 1 will perform $\log_2(\eta_f/\epsilon)$ calls to CALLBACK, each of which will perform polynomially many evaluations of $F'_c$, each of which evaluates all $m$ $f_i$'s once. The resulting set $S$ will satisfy the constraints, and for any $\gamma \in (0, \alpha)$ there will exist at least $m(\alpha - \gamma)/(1 - \gamma)$ indices $i \in [m]$ for which $f_i(\mathrm{col}(S, i)) > \gamma(\mathrm{OPT} - \epsilon)$, where $\mathrm{OPT} = \max_{S \in \bigcap_{i=1}^{k} \mathcal{I}_i} F'(S)$ is the optimum.

Proof. The proof technique follows that of Wei et al. (2015, Theorem 11), but we include it in Appendix C for completeness, and to show that it applies to our more general case (i.e., covers, packings, block-specific matroid constraints, and cross-block interactions).

This algorithm depends on a CALLBACK that deterministically returns an $\alpha$-approximation to an inner submodular optimization problem. However, many submodular maximization algorithms are randomized, and have approximation guarantees that hold only in expectation. With a bit of extra work, such an algorithm can be used as well. First, we must convert the in-expectation guarantee into a high-probability guarantee using the following lemma, which requires only an approximation bound:

Lemma 4. Let $A$ be a randomized algorithm for submodular maximization that has an $\alpha$-approximation guarantee in expectation, i.e., for which $\mathbb{E}[f(S)] \ge \alpha f(S^*)$, where $f$ is the submodular function we wish to maximize, $S$ is the result of algorithm $A$, and $S^*$ is the maximizer of $f$.
For parameters $\beta, \delta \in (0, 1)$, suppose that we run algorithm $A$ $k$ times, where $k = \left\lceil \ln\tfrac{1}{\delta} \big/ \ln\tfrac{1 - \alpha\beta}{1 - \alpha} \right\rceil$, yielding results $S_1, S_2, \ldots, S_k$. Take $S = \operatorname{argmax}_{S_i : i \in [k]} f(S_i)$ to be the best of these results. Then $S$ will have an approximation ratio of $\alpha\beta$, i.e., $f(S) \ge \alpha\beta f(S^*)$, with probability $1 - \delta$.

Proof. In Appendix C.

As was shown in Theorem 1, CALLBACK will be called at most $\log_2(\eta_f/\epsilon)$ times, so it follows from the union bound that if we use the procedure of Lemma 4 then, with probability $1 - \delta \log_2(\eta_f/\epsilon)$, every call to CALLBACK will return an $\alpha\beta$-approximation, and the result of Theorem 1 will hold (with $\alpha\beta$ substituted for $\alpha$).

The ability to handle a mixed robust/average objective, however, comes at a cost. Because Theorem 1 only applies when all of the involved submodular functions are monotone non-decreasing, we cannot use the intersection- and symmetric-difference-based cross-block interaction terms; only the union-based terms are possible. Non-monotone interaction terms can only be used with a non-robust objective. Finding an algorithm that can handle both robustness and non-monotone interactions is therefore an interesting open problem that we leave to future work.

6. Case Study

We validate our proposed approach with a case study in the setting of Canini et al. (2016), in which the task is to construct an ensemble-of-lattices machine learning model. Each lattice model (Gupta et al., 2016) in the ensemble is defined on a subset of the features (intuitively, two features interact non-linearly if they are included in the same lattice, and interact only linearly if they are not), so our primary goal is to choose subsets of features that interact well with each other, with our secondary goal being to reduce redundancy in pairs of subsets. Notice that this is not a feature selection problem: typically, every feature will be included in at least one lattice; the task is to determine which features should interact non-linearly.
For more details, please see Appendix A. Our goals are to demonstrate that (i) we can successfully find good approximate maximizers of the proposed objective function, and (ii) the inclusion of pairwise diversity terms results in improved diversity. We compare to two baselines, the Crystals and Random Tiny Lattices (RTL) algorithms of Canini et al. (2016). The first of these (the current state-of-the-art) is essentially a heuristic for choosing diverse ensembles, while the second simply chooses each lattice's features uniformly at random.

The dataset contains 463,154 samples with 29 informative features plus a binary label indicating whether a particular visual element should be displayed on a web page. The dataset was randomly partitioned into training, validation and testing subsets containing 80%, 10% and 10% of the data, respectively (the validation set was only used for hyperparameter optimization of the baseline Crystals algorithm).

6.1. Choice of f

Our ground set $V$ consists of the $n = 29$ features. We began by finding a submodular function $f : 2^V \to \mathbb{R}_+$ for which $f(S)$ represents roughly how well a single lattice model on the features in $S$ would perform. To this end, we chose $f$ to have the form $f(S) = \beta + \sum_{A \in \mathcal{A}} \alpha_A \sqrt{|A \cap S|}$, where $\mathcal{A}$ consists of all 1- and 2-element subsets of $V$. Observe that $\sqrt{|A \cap S|}$ is submodular, non-negative, and monotone non-decreasing, so if $\beta$ and the $\alpha_A$'s are non-negative, then $f$ will likewise be submodular, non-negative, and monotone. Based on 9,191 random subsets of sizes between two and ten, we learned the $\beta$ and $\alpha_A$ parameters to minimize the squared error between $f(S)$ and the training accuracy of a lattice model trained on the features contained in $S$. The result is the $f$ that we use throughout.

6.2. Covering

Our goal here is essentially identical to that of Canini et al.
(2016): we seek to choose $m = 8$ lattices, each containing up to 8 features (via a matroid constraint), by finding the $S \subseteq V'$ maximizing:

$$\sum_{i \in [m]} f(\mathrm{col}(S, i)) + \lambda_4 \sum_{i,j \in [m],\, i \neq j} \left| G^{\triangle}_{i,j}(S) \right|. \qquad (6)$$

The 8 lattices together should have good performance (the first term), and the feature subsets should be relatively pairwise distinct, i.e., have large symmetric differences (this is the second term). We henceforth refer to the first (intra-block diversity) term, representing the individual quality of the lattices, as the quality term, and the second (not including the $\lambda_4$ scaling), representing the inter-block diversity, as the diversity term.

We optimized Eq. (6) for various choices of $\lambda_4$ using the randomized algorithm of Feldman et al. (2017) combined with the procedure of Lemma 4, with $\beta = 0.5$ and $\delta = 0.1$. Each optimization took between 2 and 30 seconds on a Xeon E5-2690. The results are shown in Figure 2. The left-hand plot shows that, as the trade-off parameter $\lambda_4$ increases, the relative magnitude of the quality term decreases, and that of the diversity term increases, as expected. The right-hand plot shows that, when the $\lambda_4$ parameter is sufficiently large, the diversity term is balanced with the quality term (or is larger), and the ensembles found by our approach outperform those of both the state-of-the-art Crystals algorithm (which uses heuristics to encourage diversity) and the 90th percentile of RTLs, albeit by a small amount. More importantly, the leftmost points in the right-hand plot of Figure 2, in which the diversity portion of the objective is essentially zero, have significantly worse testing accuracies than those with larger $\lambda_4$s. This indicates that the use of pairwise diversity terms may be broadly beneficial to submodular grouping problems.

Figure 2. (Left) The magnitudes of the quality and diversity terms of Eq. (6) as functions of the trade-off parameter $\lambda_4$, averaged over ten independent runs. (Right) The testing accuracies of our proposed approach, as well as the Crystals and Random Tiny Lattices (RTL) algorithms of Canini et al. (2016), again as functions of $\lambda_4$. The plotted results of our algorithm are averaged over ten independent runs, with the larger point being that for which the validation accuracy was maximized. For the RTL results, we trained 1000 random models. The lower and upper bounds of the plot are the worst and best testing accuracies (respectively) over all models, and the dotted horizontal lines are the testing accuracies of the RTL model at the 10th, 20th, ..., 90th percentiles.

6.3. Partitioning and Packing

Appendix B contains additional case studies exploring the efficacy of the mixed robust/average objective for partitioning and packing, where the task is to maximize:

$$\sum_{i \in [m]} f(\mathrm{col}(S, i)) + \lambda_3 \min_{i,j \in [m],\, i \neq j} f\big(G^{\cup}_{i,j}(S)\big). \qquad (7)$$

Unlike in Eq. (6), the diversity term is a minimum over monotone non-decreasing submodular functions, instead of a sum over non-monotone submodular functions. The results demonstrate that Algorithm 1 is effective at optimizing this objective, but also reveal that, for this problem and dataset, the above diversity term is not helpful, and can be harmful if the quality term is overpowered. The lesson is that cross-block diversity is not a magic bullet: it must be chosen appropriately for the problem.

6.4. Discussion

We have introduced a new class of submodular optimization problems involving grouping ground elements together into multiple sets; as far as we know, it is the first to involve block-block interaction terms as well as general (matroid intersection) block constraints.
Another potential application of our method is sensitivity analysis of machine learning systems (i.e., does an ML model vary greatly when trained on different representative but mutually diverse subsets of the training data?) and also a form of robustness analysis (i.e., how does an ML system perform when tested on different representative but mutually diverse subsets of test data?). These are important questions, considering, for example, the recent interest in adversarial examples in ML. Also, since convex combinations of submodular components preserve submodularity, it might in the future be interesting to consider some form of Pareto frontier of solution sets for different convex mixtures.

Acknowledgments: This work was done in part during a time when author Bilmes was visiting Google AI research and also visiting the Simons Institute for the Theory of Computing. This material is based upon work supported by the National Science Foundation under Grant No. IIS-1162606, the National Institutes of Health under award R01GM103544, and by a Google, a Microsoft, a Facebook, and an Intel research award. This work was supported in part by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

References

Asadpour, A. and Saberi, A. An approximation algorithm for max-min fair allocation of indivisible goods. SICOMP, 2010.

Canini, K., Cotter, A., Gupta, M., Milani Fard, M., and Pfeifer, J. Fast and flexible monotonic functions with ensembles of lattices. In NIPS, pp. 2919–2927, 2016.

Djolonga, J., Tschiatschek, S., and Krause, A. Variational inference in mixed probabilistic submodular models. In Neural Information Processing Systems (NIPS), December 2016.

Feldman, M., Harshaw, C., and Karbasi, A. Greed is good: Near-optimal submodular maximization via greedy optimization. arXiv preprint arXiv:1704.01652, 2017.

Fisher, M., Nemhauser, G., and Wolsey, L.
An analysis of approximations for maximizing submodular set functions II. In Polyhedral Combinatorics, 1978.

Fujishige, S. Submodular functions and optimization, volume 58. Elsevier, 2005.

Ghodsi, M., Hajiaghayi, M., Seddighin, M., Seddighin, S., and Yami, H. Fair allocation of indivisible goods: Improvement and generalization. arXiv preprint arXiv:1704.00222, 2017.

Goemans, M., Harvey, N., Iwata, S., and Mirrokni, V. Approximating submodular functions everywhere. In SODA, 2009.

Golovin, D. Max-min fair allocation of indivisible goods. Technical Report CMU-CS-05-144, 2005.

Gotovos, A., Hassani, S. H., and Krause, A. Sampling from probabilistic submodular models. In NIPS, December 2015.

Gupta, M. R., Cotter, A., Pfeifer, J., Voevodski, K., Canini, K., Mangylov, A., Moczydlowski, W., and van Esbroeck, A. Monotonic calibrated interpolated look-up tables. JMLR, 17(109):1–47, 2016.

Iyer, R. K. and Bilmes, J. A. Algorithms for approximate minimization of the difference between submodular functions, with applications. CoRR, abs/1207.0560, 2012. URL http://arxiv.org/abs/1207.0560.

Kempe, D., Kleinberg, J., and Tardos, E. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146. ACM, 2003.

Khot, S. and Ponnuswami, A. Approximation algorithms for the max-min allocation problem. In APPROX, 2007.

Kittler, J., Hatef, M., Duin, R. P. W., and Matas, J. On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell., 20(3):226–239, March 1998.

Kohli, P., Kumar, M. P., and Torr, P. H. P3 & beyond: Solving energies with higher order cliques. In CVPR '07, pp. 1–8. IEEE, 2007.

Krause, A., McMahan, B., Guestrin, C., and Gupta, A. Robust submodular observation selection. JMLR, 2008.

Lee, J., Sviridenko, M., and Vondrák, J.
Submodular maximization over multiple matroids via generalized exchange properties. Math. Oper. Res., 35(4):795–806, 2010.

Li, J., Li, L., and Li, T. Multi-document summarization via submodularity. Applied Intelligence, 37(3):420–430, 2012.

Lin, H. and Bilmes, J. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 510–520. Association for Computational Linguistics, 2011.

Lin, H., Bilmes, J., and Xie, S. Graph-based submodular selection for extractive summarization. In Automatic Speech Recognition & Understanding (ASRU), pp. 381–386. IEEE, 2009.

Mossel, E. and Roch, S. On the submodularity of influence in social networks. In Proceedings of the thirty-ninth annual ACM Symposium on Theory of Computing, pp. 128–134. ACM, 2007.

Narasimhan, M., Jojic, N., and Bilmes, J. Q-clustering. In NIPS, 2005.

Nemhauser, G. and Wolsey, L. Best algorithms for approximating the maximum of a submodular set function. Mathematics of Operations Research, 3(3):177–188, 1978.

Oxley, J. Matroid theory. Oxford University Press, USA, 2006.

Reed, C. and Ghahramani, Z. Scaling the Indian buffet process via submodular maximization. ICML, 2013.

Schrijver, A. Combinatorial optimization: polyhedra and efficiency, volume 24. Springer Verlag, 2003.

Singla, A., Bogunovic, I., Bartok, G., Karbasi, A., and Krause, A. Near-optimally teaching the crowd to classify. In ICML, pp. 154–162, 2014.

Singla, A., Tschiatschek, S., and Krause, A. Noisy submodular maximization via adaptive sampling with applications to crowdsourced image collection summarization. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Stobbe, P. and Krause, A. Efficient minimization of decomposable submodular functions. In NIPS, 2010.

Tschiatschek, S., Iyer, R. K., Wei, H., and Bilmes, J. A.
Learning mixtures of submodular functions for image collection summarization. In Advances in Neural Information Processing Systems, pp. 1413–1421, 2014.

Vondrák, J. Optimal approximation for the submodular welfare problem in the value oracle model. In STOC, 2008.

Ward, J. A (k+3)/2-approximation algorithm for monotone submodular k-set packing and general k-exchange systems. In LIPIcs-Leibniz International Proceedings in Informatics, volume 14. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2012.

Wei, K., Iyer, R. K., Wang, S., Bai, W., and Bilmes, J. A. Mixed robust/average submodular partitioning: Fast algorithms, guarantees, and applications. In NIPS, 2015.