# Learning Submodular Losses with the Lovász Hinge

Jiaqian Yu (JIAQIAN.YU@CENTRALESUPELEC.FR) and Matthew B. Blaschko (MATTHEW.BLASCHKO@INRIA.FR)
CentraleSupélec & Inria, Grande Voie des Vignes, 92295 Châtenay-Malabry, France

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Learning with non-modular losses is an important problem when sets of predictions are made simultaneously. The main tools for constructing convex surrogate loss functions for set prediction are margin rescaling and slack rescaling. In this work, we show that these strategies lead to tight convex surrogates iff the underlying loss function is increasing in the number of incorrect predictions. However, gradient or cutting-plane computation for these functions is NP-hard for non-supermodular loss functions. We propose instead a novel convex surrogate loss function for submodular losses, the Lovász hinge, which leads to O(p log p) complexity with O(p) oracle accesses to the loss function to compute a gradient or cutting plane. As a result, we have developed the first tractable convex surrogates in the literature for submodular losses. We demonstrate the utility of this novel convex surrogate through a real-world image labeling task.

## 1. Introduction

Statistical learning has largely addressed problems in which a loss function decomposes over individual training samples. However, there are many circumstances in which non-modular losses can lead to better end results. This is the case when multiple outputs of a prediction system are used as the basis of a decision-making process that leads to a single real-world outcome. These dependencies in the effect of the predictions (rather than a statistical dependency between predictions themselves) are therefore properly incorporated into a learning system through a non-modular loss function. In this paper, we aim to provide a theoretical and algorithmic foundation for a novel class of learning algorithms that make feasible learning with submodular losses, an important subclass of non-modular losses that is currently infeasible with existing algorithms.

Convex surrogate loss functions are central to the practical application of empirical risk minimization. Straightforward principles have been developed for the design of convex surrogates for binary classification and regression (Bartlett et al., 2006), and in the structured output setting margin and slack rescaling are two principles for defining convex surrogates for more general output spaces (Tsochantaridis et al., 2005). Despite the apparent flexibility of margin and slack rescaling in their ability to bound arbitrary loss functions, there are fundamental limitations to our ability to apply these methods in practice: (i) they provide only loose upper bounds to certain loss functions, (ii) computing a gradient or cutting plane is NP-hard for submodular loss functions, and (iii) consistency results are lacking in general (McAllester, 2007; Tewari & Bartlett, 2007). In practice, modular losses, such as Hamming loss, are often applied to maintain tractability, although non-modular losses, such as the intersection-over-union loss, have been applied in the structured prediction setting (Blaschko & Lampert, 2008; Nowozin, 2014). Non-modular losses have also been (implicitly) considered in the context of multilabel classification problems.
(Cheng et al., 2010) uses the Hamming loss and subset 0-1 loss, which are modular, and a rank loss, which is supermodular; (Petterson & Caetano, 2011) introduces submodular pairwise potentials, not submodular loss functions, while using a non-submodular loss based on the F-score. (Li & Lin, 2014) uses (weighted) Hamming loss, which is modular, but also proposes a new tree-based algorithm for training; (Doppa et al., 2014) uses modular losses, e.g. Hamming loss, as well as F1 loss, which is non-submodular. However, non-supermodular loss functions are substantially rarer in the literature.

In this work, we introduce an alternate principle to construct convex surrogate loss functions for submodular losses, based on the Lovász extension of a set function. The Lovász extension of a submodular function is its convex closure, and has been used in other machine learning contexts, e.g. (Bach, 2010; Iyer & Bilmes, 2013). We analyze the settings in which margin and slack rescaling are tight convex surrogates by finding necessary and sufficient conditions for the surrogate function to be an extension of a set function. Although margin and slack rescaling generate extensions of some submodular set functions, their optimization is NP-hard. We therefore propose a novel convex surrogate for submodular functions based on the Lovász extension, which we call the Lovász hinge. In contrast to margin and slack rescaling, the Lovász hinge provides a tight convex surrogate to all submodular loss functions, and computation of a gradient or cutting plane can be achieved in O(p log p) time with a linear number of oracle accesses to the loss function. We demonstrate empirically fast convergence of a cutting-plane optimization strategy applied to the Lovász hinge, and show that optimization of a submodular loss results in lower average loss on the test set.

In Section 2 we introduce the notion of a submodular loss function in the context of empirical risk minimization. The Structured Output SVM is one of the most popular objectives for empirical risk minimization of interdependent outputs, and we demonstrate its properties on non-modular loss functions in Section 3. In Section 4 we introduce the Lovász hinge, and we empirically demonstrate its performance on an image labeling task on the Microsoft COCO dataset in Section 5.

## 2. Submodular Loss Functions

In empirical risk minimization (ERM), we approximate the risk R of a prediction function f : X → Y by an empirical sum over losses incurred on a finite sample, using e.g. an i.i.d. sampling assumption (Vapnik, 1995):

$$\frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, f(x_i)). \quad (1)$$

Central to the practical application of the ERM principle, one must approximate, or upper bound, the discrete loss function Δ with a convex surrogate. We will identify the creation of a convex surrogate for a specific loss function with an operator that maps a function with a discrete domain to one with a continuous domain. In particular, we will study the case that the discrete domain is a set of p binary predictions. In this case we denote

$$\mathcal{Y} = \{-1, +1\}^p \quad (2)$$
$$f(x) = \operatorname{sign}(g(x)) \quad (3)$$
$$\Delta : \{-1, +1\}^p \times \{-1, +1\}^p \to \mathbb{R}_+ \quad (4)$$
$$\mathbf{B}\Delta : \{-1, +1\}^p \times \mathbb{R}^p \to \mathbb{R} \quad (5)$$

where B is an operator that constructs the surrogate loss function from Δ, and g : X → R^p is a parametrized prediction function to be optimized by ERM. A key property is the relationship between BΔ and Δ. In particular, we are interested in when a given surrogate strategy B yields an extension of Δ.
We make this notion formal by identifying {−1, +1}^p with a given p-dimensional unit hypercube of R^p (cf. Definition 3). We say that BΔ(y, ·) is an extension of Δ(y, ·) iff the two functions are equal over the vertices of this unit hypercube. We focus on function extensions as they ensure a tight relationship between the discrete loss and the convex surrogate.

### 2.1. Set Functions and Submodularity

Many optimization problems involve the minimization (or maximization) of a set function, i.e. a function defined on the power set of a given base set V. Submodular functions play an important role among set functions, similar to that of convex functions on vector spaces. Submodular functions may be defined through several equivalent properties. We use the following definition (Fujishige, 2005):

Definition 1. A set function l : P(V) → R is submodular if and only if for all subsets A, B ⊆ V,

$$l(A) + l(B) \geq l(A \cup B) + l(A \cap B).$$

A function is supermodular iff its negative is submodular, and a function is modular iff it is both submodular and supermodular. A modular function can be written as a dot product between a binary vector in {0, 1}^p (where p = |V|) encoding a subset of V and a coefficient vector in R^p which uniquely identifies the modular function. For example, Hamming loss is a modular function with a coefficient vector of all ones, and a subset defined by the entries that differ between two vectors.

Necessary in the sequel is the notion of monotone set functions.

Definition 2. A set function l : P(V) → R is increasing if and only if for all subsets A ⊆ V and elements x ∈ V \ A, l(A) ≤ l(A ∪ {x}).

In this paper, we consider loss functions for multiple outputs that are set functions where inclusion in a set is defined by a corresponding prediction being incorrect:

$$\Delta(y, \tilde{y}) = l(\{i \mid \tilde{y}_i \neq y_i\}) \quad (6)$$

for some set function l. Such functions are typically increasing, though it is possible to conceive of a sensible loss function that may be non-increasing. With these notions, we now turn to an analysis of margin and slack rescaling, and show necessary and sufficient conditions for these operators to yield an extension of the underlying discrete loss function.
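To make Definition 1 and the set-function form of the loss in Equation (6) concrete, the following minimal Python sketch (ours; the function names and the concave example loss are illustrative and not taken from the paper) represents a loss by the set of mispredicted indices and verifies submodularity by brute force over a small base set:

```python
import itertools
import math

def mispredicted_set(y_true, y_pred):
    """Indices where the prediction disagrees with the ground truth, as in Eq. (6)."""
    return frozenset(i for i, (a, b) in enumerate(zip(y_true, y_pred)) if a != b)

def l_concave(A, alpha=1.0):
    """A hypothetical submodular loss: a concave function of the number of errors |A|."""
    return 1.0 - math.exp(-alpha * len(A))

def is_submodular(l, V):
    """Brute-force check of Definition 1: l(A) + l(B) >= l(A | B) + l(A & B) for all A, B."""
    subsets = [frozenset(s) for r in range(len(V) + 1)
               for s in itertools.combinations(V, r)]
    return all(l(A) + l(B) >= l(A | B) + l(A & B) - 1e-12
               for A in subsets for B in subsets)

V = range(4)
print(is_submodular(l_concave, V))                          # True: concave in |A|
print(is_submodular(lambda A: len(A) ** 2, V))              # False: supermodular
print(l_concave(mispredicted_set((1, 1, -1), (1, -1, 1))))  # loss after 2 mispredictions
```

Hamming loss corresponds to l(A) = |A|, which passes the same check, since modular functions are both submodular and supermodular.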
## 3. Existing Convex Surrogates

A general problem is to learn a mapping f from inputs x ∈ X to discrete outputs (labels) y ∈ Y. The Structured Output SVM (SOSVM) is a popular framework for doing so in the regularized risk minimization framework. The approach that the SOSVM pursues is to learn a function h : X × Y → R over input/output pairs, from which a prediction is derived by maximizing f(x) = arg max_y h(x, y) over the responses for a given input x. The SOSVM framework assumes h to be represented by an inner product between an element of a reproducing kernel Hilbert space (RKHS) and some combined feature representation of inputs and outputs φ(x, y),

$$h(x, y; w) = \langle w, \phi(x, y) \rangle, \quad (7)$$

although the notions of margin and slack rescaling may be applied to other function spaces, including random forests and deep networks.

A bounded loss function Δ : Y × Y → R quantifies the loss associated with a prediction ỹ while the true value is y, and is used to rescale the constraints. The margin-rescaling and slack-rescaling formulations are

$$\min_{w, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i, \qquad \forall i,\ \forall \tilde{y} \in \mathcal{Y}: \quad (8)$$
$$\langle w, \phi(x_i, y_i)\rangle - \langle w, \phi(x_i, \tilde{y})\rangle \geq \Delta(y_i, \tilde{y}) - \xi_i \quad (9)$$

or

$$\langle w, \phi(x_i, y_i)\rangle - \langle w, \phi(x_i, \tilde{y})\rangle \geq 1 - \frac{\xi_i}{\Delta(y_i, \tilde{y})}, \quad (10)$$

respectively. In the sequel, we will consider the case that each x_i ∈ X is an ordered set of p elements in an RKHS H, and that y_i ∈ Y is a binary vector in {−1, +1}^p. We consider feature maps such that

$$\langle w, \phi(x_i, y_i)\rangle = \sum_{j=1}^{p} \langle w^j, x_i^j \rangle\, y_i^j. \quad (11)$$

Given this family of joint feature maps, we may identify the jth dimension of g (cf. Equation (3)) with ⟨w^j, x_i^j⟩. Therefore arg max_y h(x, y; w) = sign(g(x)) and h(x, y) = ⟨g(x), y⟩.

### 3.1. Analysis of Margin and Slack Rescaling

We now turn to the problem of determining necessary and sufficient conditions for margin and slack rescaling to yield an extension of the underlying set loss function. We denote the operators that map the loss function to its margin-rescaling and slack-rescaling convex surrogates by M and S, respectively. These operators have the same signature as B in Equation (5):

$$\mathbf{M}\Delta(y, g(x)) := \max_{\tilde{y} \in \mathcal{Y}} \Delta(y, \tilde{y}) + \langle g(x), \tilde{y}\rangle - \langle g(x), y\rangle \quad (12)$$
$$\mathbf{S}\Delta(y, g(x)) := \max_{\tilde{y} \in \mathcal{Y}} \Delta(y, \tilde{y})\big(1 + \langle g(x), \tilde{y}\rangle - \langle g(x), y\rangle\big), \quad (13)$$

respectively. In order to analyse whether MΔ and SΔ are extensions of Δ, we construct a mapping to a p-dimensional vector space using the following definition:

Definition 3. A convex surrogate function BΔ(y, ·) is an extension when BΔ(y, ·) = Δ(y, ·) on the vertices of the 0-1 unit cube under the mapping to R^p (cf. Fig. 1):

$$\forall i \in \{1, \dots, p\}, \quad [u]_i = 1 - \langle w^i, x^i \rangle\, y_i. \quad (14)$$

In the sequel, we will use the notation for l : P(V) → R as in Equation (6). Note that at the vertices, Δ(y, ·) takes the following values:

$$l(\emptyset) \ \text{at} \ 0_p, \quad (15)$$
$$l(I) \ \text{at} \ \{v \mid v \in \{0, 1\}^p,\ v_i = 1 \Leftrightarrow i \in I\}. \quad (16)$$

We call (16) the value of l at the vertex I.

### 3.2. Slack rescaling

Proposition 1. SΔ(y, ·) is an extension of a set function Δ(y, ·) iff Δ(y, ·) is an increasing function.

Proof. First we demonstrate the necessity. Given that SΔ(y, ·) is an extension of Δ(y, ·), we analyse whether Δ(y, ·) is an increasing function. Specifically, there are two cases to determine the values of SΔ(y, ·):

1. if u = 0, SΔ(y, g(x)) = l(∅) according to Equation (6), where u is defined as in Equation (14);
2. if u ∈ R^p \ {0}, let I = {i | u_i ≠ 0}; then, according to Equation (13), SΔ(y, g(x)) takes the value

$$\max_{I \in \mathcal{P}(V)} l(I)\Big(1 - \sum_{i \in I} \langle w^i, x^i \rangle\, y_i\Big). \quad (17)$$

As SΔ(y, g(x)) is an extension of Δ(y, sign(g(x))), by Definition 3, SΔ(y, g(x)) = Δ(y, sign(g(x))) at the vertices as in (15) and (16). The case u = 0 is trivial, so the first case holds for arbitrary (including increasing) l. Considering the second case, in which SΔ(y, g(x)) equals Equation (17), let I2 denote the maximizing set in Equation (17). As SΔ(y, g(x)) is an extension, SΔ(y, g(x)) = Δ(y, sign(g(x))) at the vertex I2, so

$$\forall I_1 \in \mathcal{P}(V) \setminus \{\emptyset\}, \quad l(I_2) \geq l(I_1)\big(1 - |(V \setminus I_2) \cap I_1|\big). \quad (18)$$

This again leads to two cases:

1. if |(V \ I2) ∩ I1| = 0, then (V \ I2) ∩ I1 = ∅, which implies I1 ⊆ I2, and from Equation (18) we get l(I2) ≥ l(I1). This implies that l, and therefore Δ, is increasing;
2. if |(V \ I2) ∩ I1| ≥ 1, the rhs of Equation (18) is non-positive, so the inequality is implied by l(I2) ≥ 0, which always holds.

To conclude, given that SΔ(y, ·) is an extension of a set function Δ(y, ·), it is always the case that Δ is increasing.

To demonstrate the sufficiency, we need to show that Equation (18) always holds if l is increasing. We note that (18) holds if |(V \ I2) ∩ I1| = 0, because if I1 ⊆ I2 then l(I2) ≥ l(I1). If |(V \ I2) ∩ I1| ≥ 1, (18) always holds, even for arbitrary l. We conclude that SΔ(y, g(x)) = Δ(y, sign(g(x))) at the vertices as in Equation (16); the case of Equation (15) is trivial, as u = 0. So SΔ(y, ·) yields an extension of Δ(y, ·) if Δ(y, ·) is increasing.
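Proposition 1 can be verified numerically for small p. The brute-force sketch below is ours, exponential in p, and purely illustrative (note that with h(x, y) = ⟨g(x), y⟩, flipping one label changes the inner product by 2; this scaling does not affect the conclusion that the surrogate is an extension iff l is increasing). It evaluates the slack-rescaled surrogate of Equation (13) at every vertex of the unit cube from Definition 3:

```python
import itertools

def slack_rescaled(l, y, g):
    """Brute-force slack rescaling (Eq. 13) with h(x, y) = <g, y>; exponential in p."""
    p = len(y)
    h = lambda yy: sum(gi * yi for gi, yi in zip(g, yy))
    best = 0.0
    for yb in itertools.product((-1, +1), repeat=p):
        I = frozenset(i for i in range(p) if yb[i] != y[i])
        best = max(best, l(I) * (1.0 + h(yb) - h(y)))
    return best

def is_extension(surrogate, l, p=3):
    """Check Definition 3: the surrogate equals l at every vertex of the unit cube."""
    y = tuple([+1] * p)
    for u in itertools.product((0, 1), repeat=p):
        g = tuple((1 - ui) * yi for ui, yi in zip(u, y))    # so that u_i = 1 - g_i * y_i
        I = frozenset(i for i in range(p) if u[i] == 1)
        if abs(surrogate(l, y, g) - l(I)) > 1e-9:
            return False
    return True

increasing = lambda I: len(I) ** 0.5                        # increasing (and submodular)
non_increasing = lambda I: 0.0 if not I else (1.0 if len(I) == 1 else 0.4)
print(is_extension(slack_rescaled, increasing))             # True  (Proposition 1)
print(is_extension(slack_rescaled, non_increasing))         # False (not increasing)
```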
### 3.3. Margin rescaling

It is a necessary, but not sufficient, condition that Δ(y, ỹ) be increasing for margin rescaling to yield an extension. However, we note that for all increasing Δ(y, ỹ) there exists a positive scaling γ ∈ R such that margin rescaling yields an extension. This is an important result for regularized risk minimization, as we may simply rescale Δ to guarantee that margin rescaling yields an extension, and simultaneously scale the regularization parameter such that the relative contribution of the regularizer and the loss is unchanged at the vertices of the unit cube.

Proposition 2. For all increasing set functions l such that there exists a y for which MΔ(y, ·) is not an extension of Δ(y, ·), we can always find a positive scale factor γ, specific to l, such that margin rescaling yields an extension. We denote MγΔ and γΔ as the rescaled functions.

Proof. Similar to Proposition 1, we analyse two cases to determine the values of MγΔ(y, g(x)):

1. if u = 0, MγΔ(y, g(x)) = γ l(∅), where u is defined as in Equation (14). It is typically the case that l(∅) = 0, but this is not a technical requirement;
2. if u ≠ 0, let I = {i | u_i ≠ 0}; then MγΔ(y, g(x)) takes the value

$$\max_{I \in \mathcal{P}(V)} \gamma\, l(I) - \sum_{i \in I} \langle w^i, x^i \rangle\, y_i. \quad (19)$$

To satisfy Definition 3, we must find a γ > 0 such that MγΔ(y, g(x)) = γΔ(y, sign(g(x))) at the vertices. The case u = 0 is trivial, so the first case holds for arbitrary γ > 0. For the second case, let I2 denote the maximizing set in Equation (19). By the extension property, MγΔ(y, g(x)) = γΔ(y, sign(g(x))) at the vertex I2. The scale factor should therefore satisfy

$$\forall I_1 \in \mathcal{P}(V) \setminus \{\emptyset\}, \quad \gamma\big(l(I_2) - l(I_1)\big) \geq -|(V \setminus I_2) \cap I_1|, \quad (20)$$

which leads to the following cases:

1. if |(V \ I2) ∩ I1| = 0, we have (V \ I2) ∩ I1 = ∅, which implies I1 ⊆ I2. Equation (20) reduces to

$$\gamma\big(l(I_2) - l(I_1)\big) \geq 0, \quad (21)$$

and l is an increasing function, so l(I1) ≤ l(I2) and Equation (21) always holds as γ > 0;

2. if |(V \ I2) ∩ I1| ≠ 0, we need to discuss the relationship between l(I1) and l(I2):

(a) if l(I2) = l(I1), then Equation (20) becomes 0 ≥ −|(V \ I2) ∩ I1|, for which the rhs is negative, so it always holds;

(b) if l(I2) > l(I1), then

$$\gamma \geq \frac{-|(V \setminus I_2) \cap I_1|}{l(I_2) - l(I_1)}, \quad (22)$$

for which the rhs is negative, so the condition is redundant with γ > 0;

(c) if l(I2) < l(I1), then

$$\gamma \leq \frac{|(V \setminus I_2) \cap I_1|}{l(I_1) - l(I_2)}, \quad (23)$$

for which the rhs is strictly positive, so it becomes an upper bound on γ.

In summary, the scale factor γ should satisfy the following constraint for an increasing loss function l: for all I1, I2 ∈ P(V) \ {∅} with l(I1) > l(I2),

$$0 < \gamma \leq \frac{|(V \setminus I_2) \cap I_1|}{l(I_1) - l(I_2)}.$$

Finally, we note that the rightmost ratio is always strictly positive.

### 3.4. Complexity of subgradient computation

Although we have proven that slack and margin rescaling yield extensions of the underlying discrete loss under fairly general conditions, their key shortcoming is the complexity of computing subgradients for submodular losses. The subgradient computation for slack and margin rescaling requires computing

$$\arg\max_{\tilde{y}} \Delta(y, \tilde{y})\big(1 + h(x, \tilde{y}) - h(x, y)\big) \quad \text{and} \quad \arg\max_{\tilde{y}} \Delta(y, \tilde{y}) + h(x, \tilde{y}),$$

respectively. Both require the maximization of the loss, which corresponds to supermodular minimization in the case of a submodular loss. This computation is NP-hard, and such loss functions are not feasible with these existing methods in practice. Furthermore, approximate inference, e.g. based on (Nemhauser et al., 1978), leads to poor convergence when used to train a structured output SVM, resulting in a high error rate (Finley & Joachims, 2008).
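For illustration, one simple greedy local search of the kind referred to above (and of the kind used for the msub and ssub baselines in Section 5) might look as follows; this is our own sketch under that assumption, not the authors' implementation:

```python
def greedy_loss_augmented_inference(l, y, g):
    """Greedy heuristic for the margin-rescaling argmax over ybar of
    Delta(y, ybar) + <g, ybar>, which is NP-hard for submodular l.
    Flips one label at a time as long as the objective improves."""
    p = len(y)
    ybar = list(y)

    def objective(yb):
        I = frozenset(i for i in range(p) if yb[i] != y[i])
        return l(I) + sum(gi * yi for gi, yi in zip(g, yb))

    improved = True
    while improved:
        improved = False
        current = objective(ybar)
        for i in range(p):
            ybar[i] = -ybar[i]                 # tentatively flip label i
            if objective(ybar) > current + 1e-12:
                improved = True                # keep the flip and restart the sweep
                break
            ybar[i] = -ybar[i]                 # revert the flip
    return tuple(ybar)
```

Because such a procedure can return a poor maximizer, the surrogate and its cutting planes are only approximated, which is the failure mode analysed by Finley & Joachims (2008) and observed for the msub and ssub baselines below.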
We therefore introduce the Lovász hinge as an alternative operator to construct feasible convex surrogates for submodular losses.

## 4. Lovász Hinge

We now construct a convex surrogate for submodular losses. This surrogate is based on a fundamental result by Lovász relating submodular set functions and piecewise linear convex functions. The Lovász extension allows the extension of a set function defined on the vertices of the hypercube {0, 1}^p to the full hypercube [0, 1]^p (Lovász, 1983):

Definition 4. The Lovász extension l̂ of a set function l, l̂ : [0, 1]^p → R, is defined as follows: for s ∈ [0, 1]^p with decreasing components s_{π_1} ≥ s_{π_2} ≥ … ≥ s_{π_p} ordered by a permutation π = (π_1, π_2, …, π_p),

$$\hat{l}(s) = \sum_{j=1}^{p} s_{\pi_j}\big(l(\{\pi_1, \dots, \pi_j\}) - l(\{\pi_1, \dots, \pi_{j-1}\})\big).$$

It has been proven that a set function l is submodular iff its Lovász extension is convex (Lovász, 1983). Based on this definition, we propose our novel convex surrogate for submodular functions:

Definition 5. The Lovász hinge, L, is defined as the unique operator such that, for l submodular,

$$\mathbf{L}\Delta(y, g(x)) := \Big(\sum_{j=1}^{p} s_{\pi_j}\big(l(\{\pi_1, \dots, \pi_j\}) - l(\{\pi_1, \dots, \pi_{j-1}\})\big)\Big)_+ \quad (25)$$

where (·)_+ = max(·, 0), π is a permutation ordering the components of s decreasingly as in Definition 4,

$$s_{\pi_j} = 1 - g_{\pi_j}(x)\, y_{\pi_j}, \quad (26)$$

and g_{π_j}(x) is the π_j-th dimension of g(x).

If l is increasing, when s_{π_j} becomes negative we may threshold the corresponding components to zero, as in the standard hinge loss. With this thresholding of negative s_{π_j} to zero, the Lovász hinge coincides exactly with slack rescaling in the special case of a modular loss, and coincides with an SVM in the case of Hamming loss. In the case that l is non-increasing, the thresholding strategy no longer coincides with the Lovász extension over the unit cube and will not yield an extension. We therefore do not apply thresholding in the non-increasing case, but we still have that LΔ ≥ 0 and is convex.¹

¹ Source code is available for download at https://sites.google.com/site/jiaqianyu08/lovaszhinge.

As the computation of a cutting plane or loss gradient is precisely the same procedure as computing a value of the Lovász extension, we have the same computational complexity, which is O(p log p) to sort the p coefficients, followed by O(p) oracle accesses to the loss function (Lovász, 1983). This is precisely an application of the greedy algorithm to optimize a linear program over the submodular polytope (Edmonds, 1971). In our implementation, we have employed a one-slack cutting-plane optimization with ℓ2 regularization analogous to (Joachims et al., 2009). We observe empirical convergence of the primal-dual gap at a rate comparable to that of a structured output SVM (Fig. 3).
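Concretely, a value and subgradient of the Lovász hinge in Definition 5 can be obtained by sorting the margins of Equation (26) and taking discrete derivatives of l along the resulting chain of nested sets. The NumPy sketch below is our own illustration of this procedure (variable names are ours; the optional thresholding of negative margins corresponds to the increasing case discussed above):

```python
import numpy as np

def lovasz_hinge(l, y, g, threshold=True):
    """Lovasz hinge (Definition 5) and a subgradient w.r.t. g, in O(p log p)
    plus p + 1 oracle calls to the set function l."""
    y = np.asarray(y, dtype=float)
    g = np.asarray(g, dtype=float)
    s = 1.0 - g * y                              # margins s_i = 1 - g_i(x) y_i (Eq. 26)
    if threshold:                                # thresholding, for increasing l
        s = np.maximum(s, 0.0)
    pi = np.argsort(-s)                          # permutation sorting s decreasingly
    # Discrete derivatives of l along the chain of nested sets {pi_1, ..., pi_j}.
    nested = [l(frozenset(pi[:k])) for k in range(len(y) + 1)]
    deltas = np.diff(np.asarray(nested, dtype=float))
    value = float(np.dot(s[pi], deltas))
    grad = np.zeros_like(g)
    if value > 0.0:                              # outer (.)_+ : zero subgradient otherwise
        active = deltas * (s[pi] > 0) if threshold else deltas
        grad[pi] = -y[pi] * active               # d s_i / d g_i = -y_i
        return value, grad
    return 0.0, grad
```

The returned value/subgradient pair is exactly what a subgradient or cutting-plane solver needs; in particular, it can be plugged into a one-slack cutting-plane method with ℓ2 regularization of the kind mentioned above.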
### 4.1. Visualization of convex surrogates

For visualization of the loss surfaces, in this section we consider a simple binary classification problem on two elements with a non-modular loss function l:

$$\mathcal{X} := \mathbb{R}^{d \times 2}, \qquad \mathcal{Y} := \{-1, +1\}^2.$$

With different values for l(∅), l({1}), l({2}) and l({1, 2}), we obtain different modularity and monotonicity properties of the function l, which then defines Δ. We illustrate the Lovász hinge, slack and margin rescaling for the following cases: (i) submodular increasing: l(∅) = 0, l({1}) = l({2}) = 1, l({1, 2}) = 1.2; (ii) submodular non-increasing: l(∅) = 0, l({1}) = l({2}) = 1, l({1, 2}) = 0.4; and (iii) supermodular increasing: l(∅) = 0, l({1}) = l({2}) = 1, l({1, 2}) = 2.8.

In Fig. 1, the x axis represents the value of s^1_i, the y axis the value of s^2_i in Eq. (26), and the z axis is the convex loss function given by Equation (12), Equation (13), and Definition 5, respectively. We plot the values of l as solid dots at the vertices of the hypercube. We observe that all the solid dots touch the surfaces, which empirically validates that the surrogates are extensions of the discrete loss. Here we set l to be symmetric functions; the extension property can also be validated for asymmetric increasing set functions.

[Figure 1: panels (a)-(d) show the Lovász hinge with submodular increasing and submodular non-increasing l; panels (e)-(h) show margin rescaling with submodular increasing and supermodular increasing l; panels (i)-(l) show slack rescaling with submodular increasing and supermodular increasing l; each case is shown from two views.]

Figure 1. We introduce a novel convex surrogate for submodular losses, the Lovász hinge. We show here the Lovász hinge, margin and slack rescaling surfaces with different loss functions l from different views; the x and y axes represent the values of s^1_i and s^2_i in Eq. (26); the z axis represents the value of the convex surrogate; the solid red dots represent the values of l at the vertices of the unit hypercube. The convex surrogate strategies yield extensions of the discrete loss.

## 5. Experimental Results

In this section, we consider a task in which multiple labels need to be predicted simultaneously and for which a submodular loss over labels is to be minimized. If a subsequent prediction task focuses on detecting a dinner scene, the initial multi-label prediction should emphasize that all labels, e.g. people, dining tables, forks, and knives, must be correct within a given image. This contrasts with a traditional multi-label prediction task in which the loss function decomposes over the individual predictions. The misprediction of a single label, e.g. person, will preclude the chance to predict correctly the combination of all labels. This corresponds exactly to the property of diminishing returns of a submodular function. Fig. 2 shows example images from the Microsoft COCO dataset (Lin et al., 2014). When using classic modular losses such as 0-1 loss, the classifier is trained to minimize the sum of incorrect predictions, so the complex interaction between label mispredictions is not considered. In this work, we use a new submodular loss function and apply the Lovász hinge to enable efficient convex risk minimization.

Figure 2. Examples from the Microsoft COCO dataset. Fig. 2(a) contains all the categories of interest (cf. Section 5); Fig. 2(b) contains dining table, fork and cup; Fig. 2(c) is not a dining scene but contains people.

Microsoft COCO. The Microsoft COCO dataset (Lin et al., 2014) is an image recognition, segmentation, and captioning dataset. It contains more than 70 categories, more than 300,000 images, and around 5 captions per image. We have used frequent itemset mining (Uno et al., 2004) to determine the most common combination of categories in an image. For sets of size 6, these are: person, cup, fork, knife, chair and dining table. For the experiments, we repeatedly sample sets of images containing k (k = 0, 1, 2, …, 6) categories. The training/validation set and testing set have the same distribution.
For a single iteration, we sample 1050 images for the training/validation set, including 150 images each for sets containing k (k = 0, 1, 2, …, 6) of the target labels. More than 12,000 images from the entire dataset are sampled at least once as we repeat the experiments to compute statistical significance. We use OverFeat (Sermanet et al., 2014) to extract image features, following the procedure of (Razavian et al., 2014). OverFeat has been trained for the image classification task on ImageNet ILSVRC 2013, and has achieved good performance on a range of image classification problems, including the Microsoft COCO dataset.

Submodular losses. In this task, we first define a submodular loss function as follows:

$$\Delta(y, \tilde{y}) := 1 - \exp(-\alpha |I|) \quad (27)$$

where I is the set of mispredicted labels. 1 − exp(−α|I|) is a concave function depending only on the size of I; as a consequence, it is a submodular function. The empirical results with this submodular loss are shown in Table 1 and Fig. 3(a).

We have carried out experiments with another submodular loss function:

$$\Delta(y, \tilde{y}) := 1 - \exp(-|I|) + \langle \beta, (\mathbf{1} - y \odot \tilde{y})/2 \rangle \quad (28)$$

where ⊙ is the Hadamard product. The final summand is a modular function that penalizes labels proportionately to the coefficient vector β > 0. In the experiments, we order the category labels as person, dining table, chair, cup, fork, and knife, and we set β = [1, 0.8, 0.7, 0.6, 0.5, 0.4]^T according to the size of the object. The results with this submodular loss are shown in Table 2 and Fig. 3(b).

We also train and test with the 0-1 loss, which is equivalent to an SVM. We compare different losses employed during training and during testing. msub and ssub denote the use of the submodular loss with margin and slack rescaling, respectively. As this optimization is NP-hard, we have employed the simple application of the greedy approach, as is common in (non-monotone) submodular maximization (e.g. (Krause & Golovin, 2014)).

Results. We repeated each experiment 10 times with random sampling in order to obtain an estimate of the average performance. Table 1 and Table 2 show the cross comparison of average loss values (with standard error) using different loss functions during training and during testing for the COCO dataset. As predicted by theory, training with the same loss function as used during testing yields the best results. Slack and margin rescaling fail due to the necessity of approximate inference, which results in a poor discriminant function. By contrast, the Lovász hinge yields the best performance when the submodular loss is used to evaluate the test predictions. We do not expect that optimizing the submodular loss should give the best performance when the 0-1 loss is used to evaluate the test predictions; indeed, in this case the Lovász hinge trained on the 0-1 loss corresponds with the best performing system.

Fig. 3(a) and Fig. 3(b) show, for the two experiments, the primal-dual gap as a function of the number of cutting-plane iterations using the Lovász hinge with the submodular loss, as well as for an SVM (labeled 0-1) and margin and slack rescaling (labeled msub and ssub). This demonstrates that the empirical convergence of the Lovász hinge is at a rate comparable to an SVM, and that it is feasible to optimize in practice for real-world problems.
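For concreteness, both evaluation losses, Equations (27) and (28), depend on the prediction only through the set I of mispredicted labels; a minimal sketch (ours; the β values are those stated above, and alpha is left as a parameter):

```python
import math

def loss_concave(I, alpha=1.0):
    """Submodular loss of Eq. (27): 1 - exp(-alpha * |I|) on the mispredicted set I."""
    return 1.0 - math.exp(-alpha * len(I))

def loss_weighted(I, beta):
    """Submodular loss of Eq. (28): concave term plus a modular, label-weighted term."""
    return 1.0 - math.exp(-len(I)) + sum(beta[i] for i in I)

# Label order used in the experiments: person, dining table, chair, cup, fork, knife.
beta = [1.0, 0.8, 0.7, 0.6, 0.5, 0.4]
I = frozenset({0, 3})                       # e.g. 'person' and 'cup' mispredicted
print(loss_concave(I), loss_weighted(I, beta))
```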
Table 1. Comparison of average loss values (with standard error) using different loss functions during training and during testing. lsub is as in Equation (27). Training with the same loss function as used during testing yields the best results. Slack and margin rescaling fail due to the necessity of approximate inference, which results in a poor discriminant function. The Lovász hinge trained on the appropriate loss yields the best performance in both cases.

| training loss | lsub (test) | 0-1 (test) |
|---|---|---|
| lsub | 0.7908 ± 0.0113 | 1.3035 ± 0.0182 |
| 0-1  | 0.7941 ± 0.0102 | 1.2863 ± 0.0175 |
| ssub | 0.8739 ± 0.0133 | 1.4397 ± 0.0219 |
| msub | 0.8722 ± 0.0115 | 1.4365 ± 0.0206 |

Table 2. The cross comparison of average loss values (with standard error) using different loss functions during training and during testing. lsub is as in Equation (28) (cf. comments for Table 1).

| training loss | lsub (test) | 0-1 (test) |
|---|---|---|
| lsub | 1.3767 ± 0.0143 | 1.3003 ± 0.0176 |
| 0-1  | 1.3813 ± 0.0135 | 1.2975 ± 0.0152 |
| ssub | 1.4711 ± 0.0153 | 1.3832 ± 0.0156 |
| msub | 1.4811 ± 0.0117 | 1.4016 ± 0.0136 |

Figure 3. The primal-dual gap as a function of the number of cutting-plane iterations using the Lovász hinge with submodular loss, an SVM (labeled 0-1), and margin and slack rescaling with greedy inference (labeled msub and ssub). Fig. 3(a) is for the experiment using Equation (27) and Fig. 3(b) for Equation (28). This demonstrates that empirical convergence of the Lovász hinge is at a rate comparable to an SVM, and that it is feasible to optimize in practice for real-world problems.

## 6. Discussion & Conclusions

In this work, we have introduced a novel convex surrogate loss function, the Lovász hinge, which makes tractable for the first time learning with submodular loss functions. In contrast to margin and slack rescaling, computation of the gradient or cutting plane can be achieved in O(p log p) time; margin and slack rescaling are NP-hard to optimize in this case.

We have proven necessary and sufficient conditions for margin and slack rescaling to yield tight convex surrogates to a discrete loss function. These conditions are that the discrete loss be a (properly scaled) increasing function. However, it may be of interest to consider non-increasing functions in some domains. The Lovász hinge can be applied also to non-increasing functions.

We have demonstrated the correctness and utility of the Lovász hinge on a natural image categorization task. We have shown that training by minimization of the Lovász hinge applied to multiple submodular loss functions results in a lower empirical test error than existing methods, as one would expect from a correctly defined convex surrogate. Slack and margin rescaling both fail in practice as approximate inference does not yield a good approximation of the discriminant function. The causes of this have been studied in a different context in (Finley & Joachims, 2008), but are effectively due to (i) repeated approximate inference compounding errors, and (ii) erroneous early termination due to underestimation of the primal objective. We empirically observe that the Lovász hinge delivers much better performance by contrast, and makes no approximations in the subgradient computation. Exact inference should yield a good predictor for slack and margin rescaling, but subexponential optimization only exists if P = NP. Therefore, the Lovász hinge is the only polynomial-time option in the literature for learning with such losses.

The introduction of this novel strategy for constructing convex surrogate loss functions for submodular losses points to many interesting areas for future research.
Among them are the definition and characterization of useful loss functions in specific application areas. Furthermore, theoretical convergence results for a cutting-plane optimization strategy are of interest.

## Acknowledgements

This work is partially funded by ERC Grant 259112 and FP7-MC-CIG 334380. The first author is supported by a fellowship from the China Scholarship Council.

## References

Bach, Francis R. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, pp. 118–126, 2010.

Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Blaschko, Matthew B. and Lampert, Christoph H. Learning to localize objects with structured output regression. In Forsyth, David, Torr, Philip, and Zisserman, Andrew (eds.), European Conference on Computer Vision, volume 5302 of Lecture Notes in Computer Science, pp. 2–15, 2008.

Cheng, Weiwei, Hüllermeier, Eyke, and Dembczynski, Krzysztof J. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the International Conference on Machine Learning, pp. 279–286, 2010.

Doppa, Janardhan Rao, Yu, Jun, Ma, Chao, Fern, Alan, and Tadepalli, Prasad. HC-search for multi-label prediction: An empirical study. In Proceedings of the AAAI Conference on Artificial Intelligence, 2014.

Edmonds, Jack. Matroids and the greedy algorithm. Mathematical Programming, 1(1):127–136, 1971.

Finley, Thomas and Joachims, Thorsten. Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, pp. 304–311, 2008.

Fujishige, Satoru. Submodular Functions and Optimization. Elsevier, 2005.

Iyer, Rishabh and Bilmes, Jeff. The Lovász-Bregman divergence and connections to rank aggregation, clustering, and web ranking: Extended version. Uncertainty in Artificial Intelligence, 2013.

Joachims, Thorsten, Finley, Thomas, and Yu, Chun-Nam. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.

Krause, Andreas and Golovin, Daniel. Submodular function maximization. In Bordeaux, Lucas, Hamadi, Youssef, and Kohli, Pushmeet (eds.), Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.

Li, Chun-Liang and Lin, Hsuan-Tien. Condensed filter tree for cost-sensitive multi-label classification. In Proceedings of the 31st International Conference on Machine Learning, pp. 423–431, 2014.

Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C. Lawrence. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.

Lovász, László. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pp. 235–257. Springer, 1983.

McAllester, David. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, 2007.

Nemhauser, George L., Wolsey, Laurence A., and Fisher, Marshall L. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265–294, 1978.

Nowozin, Sebastian. Optimal decisions from probabilistic models: The intersection-over-union case. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

Petterson, James and Caetano, Tibério S. Submodular multi-label learning. In Advances in Neural Information Processing Systems, pp. 1512–1520, 2011.
Razavian, Ali Sharif, Azizpour, Hossein, Sullivan, Josephine, and Carlsson, Stefan. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 512–519, 2014.

Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michael, Fergus, Rob, and LeCun, Yann. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations, 2014.

Tewari, Ambuj and Bartlett, Peter L. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.

Tsochantaridis, Ioannis, Joachims, Thorsten, Hofmann, Thomas, and Altun, Yasemin. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(9):1453–1484, 2005.

Uno, Takeaki, Asai, Tatsuya, Uchida, Yuzo, and Arimura, Hiroki. An efficient algorithm for enumerating closed patterns in transaction databases. In Discovery Science, pp. 16–31. Springer, 2004.

Vapnik, Vladimir N. The Nature of Statistical Learning Theory. Springer, 1995.