# Structured Prediction with Projection Oracles

Mathieu Blondel
NTT Communication Science Laboratories, Kyoto, Japan
mathieu@mblondel.org
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

**Abstract.** We propose in this paper a general framework for deriving loss functions for structured prediction. In our framework, the user chooses a convex set including the output space and provides an oracle for projecting onto that set. Given that oracle, our framework automatically generates a corresponding convex and smooth loss function. As we show, adding a projection as the output layer provably makes the loss smaller. We identify the marginal polytope, the output space's convex hull, as the best convex set on which to project. However, because the projection onto the marginal polytope can sometimes be expensive to compute, we allow the use of any convex superset instead, with a potentially cheaper-to-compute projection. Since efficient projection algorithms are available for numerous convex sets, this allows us to construct loss functions for a variety of tasks. On the theoretical side, when combined with calibrated decoding, we prove that our loss functions can be used as a consistent surrogate for a (potentially non-convex) target loss function of interest. We demonstrate our losses on label ranking, ordinal regression and multilabel classification, confirming the improved accuracy enabled by projections.

## 1 Introduction

The goal of supervised learning is to learn a mapping that links an input to an output, using examples of such pairs. This task is noticeably more difficult when the output objects have a structure, i.e., when they are not mere vectors. This is the so-called structured prediction setting [4], which has numerous applications in natural language processing, computer vision and computational biology. We focus in this paper on the surrogate loss framework, in which a convex loss is used as a proxy for a (potentially non-convex) target loss of interest.

Existing convex losses for structured prediction come with different trade-offs. On one hand, the structured perceptron [16] and hinge [52] losses only require access to a maximum a-posteriori (MAP) oracle for finding the highest-scoring structure, while the conditional random field (CRF) [29] loss requires access to a marginal inference oracle, for evaluating the expectation under a Gibbs distribution. Since marginal inference is generally considered harder than MAP inference, for instance containing #P-complete counting problems, this makes the CRF loss less widely applicable. On the other hand, unlike the structured perceptron and hinge losses, the CRF loss is smooth, which is crucial for fast convergence, and comes with a probabilistic model, which is important for dealing with uncertainty. Unfortunately, when combined with MAP decoding, these losses are typically inconsistent, meaning that their optimal estimator does not converge to the optimal estimator of the target loss function. Recently, several works [15, 26, 39, 31] showed good results and obtained consistency guarantees by combining a simple squared loss with calibrated decoding. Since these approaches only require a decoding oracle at test time and no oracle at train time, this raises the question of whether structural information is even beneficial during training.

In this paper, we propose loss functions for structured prediction using a different kind of oracle: projections.
Kullback-Leibler projections onto various polytopes have been used to derive online algorithms [24, 56, 49, 1], but it is not obvious how to extract a loss from these works. In our framework, the user chooses a convex set containing the output space and provides an oracle for projecting onto that set. Given that oracle, we automatically generate an associated loss function. As we show, incorporating a projection as the output layer provably makes the loss smaller. We identify the marginal polytope, the output space's convex hull, as the best convex set on which to project. However, because the projection onto the marginal polytope can sometimes be expensive to compute, we allow the use of any convex superset instead, with a potentially cheaper-to-compute projection. When using the marginal polytope as the convex set, our loss comes with an implicit probabilistic model.

Our contributions are summarized as follows:

- Based upon Fenchel-Young losses [11, 12], we introduce projection-based losses in a broad setting.
- We give numerous examples of useful convex polytopes and their associated projections.
- We study consistency w.r.t. a target loss of interest when combined with calibrated decoding, extending a recent analysis [38] to the more general projection-based losses. We exhibit a trade-off between computational cost and statistical estimation.
- We demonstrate our losses on label ranking, ordinal regression and multilabel classification, confirming the improved accuracy enabled by projections.

**Notation.** We denote the probability simplex by $\triangle^p := \{q \in \mathbb{R}^p_+ : \|q\|_1 = 1\}$, the domain of $\Omega \colon \mathbb{R}^p \to \mathbb{R} \cup \{\infty\}$ by $\mathrm{dom}(\Omega) := \{u \in \mathbb{R}^p : \Omega(u) < \infty\}$, and the Fenchel conjugate of $\Omega$ by $\Omega^*(\theta) := \sup_{u \in \mathrm{dom}(\Omega)} \langle u, \theta\rangle - \Omega(u)$. We denote $[k] := \{1, \dots, k\}$.

## 2 Background and related work

**Surrogate loss framework.** The goal of structured prediction is to learn a mapping $f \colon \mathcal{X} \to \mathcal{Y}$, from an input $x \in \mathcal{X}$ to an output $y \in \mathcal{Y}$, minimizing the expected target risk $\mathcal{L}(f) := \mathbb{E}_{(X,Y)\sim\rho}\, L(f(X), Y)$, where $\rho$ is a typically unknown distribution over $\mathcal{X} \times \mathcal{Y}$ and $L \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a potentially non-convex target loss. We focus in this paper on surrogate methods, which attack the problem in two main phases. During the training phase, the labels $y \in \mathcal{Y}$ are first mapped to $\phi(y) \in \Theta$ using an encoding or embedding function $\phi \colon \mathcal{Y} \to \Theta$. In this paper, we focus on $\Theta := \mathbb{R}^p$, but some works consider general Hilbert spaces [15, 26, 31]. In most cases, $\phi(y)$ will be a zero-one encoding of the parts of $y$, i.e., $\phi(y) \in \{0, 1\}^p$. Given a surrogate loss $S \colon \Theta \times \Theta \to \mathbb{R}_+$, a model $g \colon \mathcal{X} \to \Theta$ (e.g., a neural network or a linear model) is then learned so as to minimize the surrogate risk $\mathcal{S}(g) := \mathbb{E}_{(X,Y)\sim\rho}\, S(g(X), \phi(Y))$. This allows us to leverage the usual empirical risk minimization framework in the space $\Theta$. During the prediction phase, given an input $x \in \mathcal{X}$, a model prediction $\theta = g(x) \in \Theta$ is pulled back to a valid output $\hat{y} \in \mathcal{Y}$ using a decoding function $d \colon \Theta \to \mathcal{Y}$. This is summarized in the following diagram:

$$x \in \mathcal{X} \;\xrightarrow{\;g\ (\text{model})\;}\; \theta \in \Theta \;\xrightarrow{\;d\ (\text{decoding})\;}\; \hat{y} \in \mathcal{Y}. \qquad (1)$$

Commonly used decoders include the pre-image oracle [55, 17, 25], $\theta \mapsto \operatorname{argmin}_{y \in \mathcal{Y}} S(\theta, \phi(y))$, and the maximum a-posteriori inference oracle [16, 52, 29], which finds the highest-scoring structure:

$$\mathrm{MAP}(\theta) := \operatorname*{argmax}_{y \in \mathcal{Y}}\ \langle\theta, \phi(y)\rangle. \qquad (2)$$

In the remainder of this paper, for conciseness, we will use $S(\theta, y)$ as a shorthand for $S(\theta, \phi(y))$, but it is useful to bear in mind that surrogate losses are always really defined over vector spaces.
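To make the two-phase pipeline in (1)-(2) concrete, the following sketch (ours, not from the paper's code) instantiates it for a toy multilabel task with $k = 3$ labels, using brute-force enumeration of $\mathcal{Y}$ for MAP decoding and the squared loss as a placeholder surrogate. All helper names (`phi`, `squared_loss`, `MAP`) are illustrative.

```python
# Minimal sketch of the surrogate framework of Eq. (1)-(2) on a toy multilabel task.
import itertools
import numpy as np

k = 3
# Output space Y: all subsets of {0, 1, 2}; phi(y) is the 0-1 indicator encoding.
Y = [set(s) for r in range(k + 1) for s in itertools.combinations(range(k), r)]

def phi(y):
    """Zero-one encoding phi(y) in {0, 1}^k."""
    return np.array([float(i in y) for i in range(k)])

def squared_loss(theta, y):
    """A surrogate loss S(theta, phi(y)); here the squared loss as an example."""
    return 0.5 * np.sum((phi(y) - theta) ** 2)

def MAP(theta):
    """MAP decoding: argmax_{y in Y} <theta, phi(y)>, by brute-force enumeration."""
    return max(Y, key=lambda y: theta @ phi(y))

theta = np.array([1.2, -0.3, 0.5])   # model output theta = g(x)
print(squared_loss(theta, {0, 2}))   # training-time surrogate loss
print(MAP(theta))                    # prediction-time decoding -> {0, 2}
```

Real structured problems of course replace the enumeration of $\mathcal{Y}$ by a dedicated MAP oracle.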
**Examples of surrogate losses.** We now review classical examples of loss functions that fall within this framework. The structured perceptron [16] loss is defined by

$$S_{\mathrm{SP}}(\theta, y) := \max_{y' \in \mathcal{Y}}\ \langle\theta, \phi(y')\rangle - \langle\theta, \phi(y)\rangle. \qquad (3)$$

Clearly, it requires a MAP inference oracle at training time in order to compute subgradients w.r.t. $\theta$. The structured hinge loss used by structured support vector machines [52] is a simple variant of (3) with an additional loss term. Classically, this term is assumed to satisfy an affine decomposition, so that only a MAP oracle is needed. The conditional random field (CRF) [29] loss, on the other hand, requires a so-called marginal inference oracle [54] for evaluating the expectation under the Gibbs distribution $p(y; \theta) \propto e^{\langle\theta, \phi(y)\rangle}$. The loss and the oracle are defined by

$$S_{\mathrm{CRF}}(\theta, y) := \log \sum_{y' \in \mathcal{Y}} e^{\langle\theta, \phi(y')\rangle} - \langle\theta, \phi(y)\rangle \quad\text{and}\quad \mathrm{marginal}(\theta) := \mathbb{E}_{Y \sim p}[\phi(Y)] \propto \sum_{y \in \mathcal{Y}} e^{\langle\theta, \phi(y)\rangle}\, \phi(y).$$

When $\phi(y)$ is a zero-one encoding of the parts of $y$ (i.e., a bit vector), $\mathrm{marginal}(\theta)$ can be interpreted as a marginal distribution over parts of the structures. The CRF loss is smooth and comes with a probabilistic model, but its applicability is hampered by the fact that marginal inference is generally harder than MAP inference. This is for instance the case for permutation-prediction problems, where exact marginal inference is intractable [53, 50, 44] but MAP inference can be computed exactly.

*Figure 1: Proposed framework in the Euclidean geometry. Left: each black point represents the vector encoding $\phi(y)$ of one possible structure $y \in \mathcal{Y}$. We require choosing a convex set $\mathcal{C}$ including the encoded output space $\phi(\mathcal{Y})$. The best choice is $\mathcal{M} := \mathrm{conv}(\phi(\mathcal{Y}))$, the convex hull of $\phi(\mathcal{Y})$, but we can use any superset $\mathcal{C}$ of it with a potentially cheaper-to-compute projection. Setting $\mathcal{C} = \mathbb{R}^p$, our loss $S_{\mathcal{C}}(\theta, y)$ (omitting the superscript $\Psi$) recovers the squared loss (i.e., no projection). Right: when $\theta$ belongs to the interior of $N_{\mathcal{C}}(v)$, the normal cone of $\mathcal{C}$ at a vertex $v$, the projection $P_{\mathcal{C}}(\theta) := \operatorname{argmin}_{u \in \mathcal{C}} \|u - \theta\|_2$ hits the vertex $v$ and the angle formed by $\theta$, $P_{\mathcal{C}}(\theta)$ and $\phi(y)$ is obtuse. In this case, $S_{\mathcal{C}}(\theta, y)$ is a strict upper bound on $\ell_{\mathcal{C}}(\theta, y) := \frac{1}{2}\|\phi(y) - P_{\mathcal{C}}(\theta)\|_2^2$. When $\theta$ is not in the normal cone of $\mathcal{C}$ at any vertex, the angle is right and the two losses coincide, $S_{\mathcal{C}}(\theta, y) = \ell_{\mathcal{C}}(\theta, y)$.*

**Consistency.** When working with surrogate losses, an important question is whether the surrogate and target risks are consistent, that is, whether an estimator $g$ minimizing $\mathcal{S}(g)$ produces an estimator $d \circ g$ minimizing $\mathcal{L}$. Although this question has been widely studied in the multiclass setting [58, 6, 51, 35] and in other specific settings [21, 45], it is only recently that it was studied in a fully general structured prediction setting. The structured perceptron, hinge and CRF losses are generally not consistent when using MAP as the decoder $d$ [38]. Inspired by kernel dependency estimation [55, 17, 25], several works [15, 26, 31] showed good empirical results and proved consistency by combining the squared loss $S_{\mathrm{sq}}(\theta, y) := \frac{1}{2}\|\phi(y) - \theta\|_2^2$ with calibrated decoding (no oracle is needed during training). A drawback of this loss, however, is that it does not make use of the output space $\mathcal{Y}$ during training, ignoring precious structural information. More recently, the consistency of the CRF loss in combination with calibrated decoding was analyzed in [38].
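Continuing the toy multilabel example above, the next sketch spells out the structured perceptron loss (3), the CRF loss and the marginal inference oracle by explicit enumeration of $\mathcal{Y}$; in practice these maxima and sums are delegated to MAP and marginal inference oracles rather than enumerated. This is an illustrative sketch, not the paper's implementation.

```python
# Brute-force versions of the structured perceptron and CRF losses on the toy task.
import itertools
import numpy as np
from scipy.special import logsumexp

k = 3
Y = [set(s) for r in range(k + 1) for s in itertools.combinations(range(k), r)]

def phi(y):
    return np.array([float(i in y) for i in range(k)])

def perceptron_loss(theta, y):
    """S_SP(theta, y) = max_{y'} <theta, phi(y')> - <theta, phi(y)>."""
    return max(theta @ phi(yp) for yp in Y) - theta @ phi(y)

def crf_loss(theta, y):
    """S_CRF(theta, y) = log sum_{y'} exp<theta, phi(y')> - <theta, phi(y)>."""
    scores = np.array([theta @ phi(yp) for yp in Y])
    return logsumexp(scores) - theta @ phi(y)

def marginal_inference(theta):
    """E_{Y~p(.;theta)}[phi(Y)], with p(y; theta) proportional to exp<theta, phi(y)>."""
    scores = np.array([theta @ phi(yp) for yp in Y])
    p = np.exp(scores - logsumexp(scores))
    return p @ np.vstack([phi(yp) for yp in Y])

theta = np.array([1.2, -0.3, 0.5])
print(perceptron_loss(theta, {0, 2}), crf_loss(theta, {0, 2}))
print(marginal_inference(theta))  # a point inside the marginal polytope
```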
## 3 Structured prediction with projection oracles

In this section, we build upon Fenchel-Young losses [11, 12] to derive a class of smooth loss functions leveraging structural information through a different kind of oracle: projections. Our losses are applicable to a large variety of tasks (including permutation problems, for which CRF losses are intractable) and come with consistency guarantees when combined with calibrated decoding (cf. §5).

**Fenchel-Young losses.** The aforementioned perceptron, hinge and CRF losses all belong to the class of Fenchel-Young losses [11, 12]. The Fenchel-Young loss generated by $\Omega$ is defined by

$$S_\Omega(\theta, y) := \Omega^*(\theta) + \Omega(\phi(y)) - \langle\theta, \phi(y)\rangle. \qquad (4)$$

As shown in [11, 12], $S_\Omega(\theta, y)$ satisfies the following desirable properties:

- Non-negativity: $S_\Omega(\theta, y) \ge 0$,
- Zero loss: $S_\Omega(\theta, y) = 0 \iff \nabla\Omega^*(\theta) = \phi(y)$,
- Convexity: $S_\Omega(\theta, y)$ is convex in $\theta$,
- Smoothness: if $\Omega$ is $\frac{1}{\beta}$-strongly convex, then $S_\Omega(\theta, y)$ is $\beta$-smooth,
- Gradient as residual (generalizing the squared loss): $\nabla_\theta S_\Omega(\theta, y) = \nabla\Omega^*(\theta) - \phi(y)$.

In the Fenchel duality perspective, $\theta = g(x)$ belongs to the dual space $\mathrm{dom}(\Omega^*) = \Theta = \mathbb{R}^p$ and is thus unconstrained. This is convenient, as it places no restriction on the model outputs $\theta = g(x)$. On the other hand, $\phi(y)$ belongs to the primal space $\mathrm{dom}(\Omega)$, which must include the encoded output space $\phi(\mathcal{Y})$, i.e., $\phi(\mathcal{Y}) \subseteq \mathrm{dom}(\Omega)$, and is typically constrained. The gradient $\nabla\Omega^*$ is a mapping from $\mathrm{dom}(\Omega^*)$ to $\mathrm{dom}(\Omega)$, and $S_\Omega$ can be seen as a loss with mixed arguments, one from each of these two spaces. The theory of Fenchel-Young losses was recently extended to infinite spaces in [34].

**Projection-based losses.** Let the Bregman divergence generated by $\Psi$ be defined as $D_\Psi(u, v) := \Psi(u) - \Psi(v) - \langle\nabla\Psi(v), u - v\rangle$. The Bregman projection of $\nabla\Psi^*(\theta)$ onto a closed convex set $\mathcal{C}$ is

$$P^\Psi_{\mathcal{C}}(\theta) := \operatorname*{argmin}_{u \in \mathcal{C}}\ D_\Psi(u, \nabla\Psi^*(\theta)). \qquad (5)$$

Intuitively, $\nabla\Psi^*$ maps the unconstrained predictions $\theta = g(x)$ to $\mathrm{dom}(\Psi)$, ensuring that the Bregman projection is well-defined. Let us define the Kullback-Leibler divergence by $\mathrm{KL}(u, v) := \sum_i u_i \log\frac{u_i}{v_i}$. Two examples of generating function $\Psi$ are $\Psi(u) = \frac{1}{2}\|u\|_2^2$ with $\mathrm{dom}(\Psi) = \mathbb{R}^p$ and $\nabla\Psi^*(\theta) = \theta$, and $\Psi(u) = \langle u, \log u\rangle$ with $\mathrm{dom}(\Psi) = \mathbb{R}^p_+$ and $\nabla\Psi^*(\theta) = e^{\theta - 1}$. This leads to the Euclidean projection $\operatorname{argmin}_{u \in \mathcal{C}} \|u - \theta\|_2$ and the KL projection $\operatorname{argmin}_{u \in \mathcal{C}} \mathrm{KL}(u, e^{\theta - 1})$, respectively.

Our key insight is to use a projection onto a chosen convex set $\mathcal{C}$ as an output layer. If $\mathcal{C}$ contains the encoded output space, i.e., $\phi(\mathcal{Y}) \subseteq \mathcal{C}$, then $\phi(y) \in \mathcal{C}$ for any ground truth $y \in \mathcal{Y}$. Therefore, if $\nabla\Psi^*(\theta) \notin \mathcal{C}$, then $P^\Psi_{\mathcal{C}}(\theta)$ is necessarily a better prediction than $\nabla\Psi^*(\theta)$, since it is closer to $\phi(y)$ in the sense of $D_\Psi$. If $\nabla\Psi^*(\theta)$ already belongs to $\mathcal{C}$, then $P^\Psi_{\mathcal{C}}(\theta) = \nabla\Psi^*(\theta)$ and thus $P^\Psi_{\mathcal{C}}(\theta)$ is as good as $\nabla\Psi^*(\theta)$. To summarize, we have

$$D_\Psi(\phi(y), P^\Psi_{\mathcal{C}}(\theta)) \le D_\Psi(\phi(y), \nabla\Psi^*(\theta)) \quad \forall\, \theta \in \Theta,\ y \in \mathcal{Y}.$$

Therefore, it is natural to choose $\theta$ so as to minimize the compositional loss $\ell^\Psi_{\mathcal{C}}(\theta, y) := D_\Psi(\phi(y), P^\Psi_{\mathcal{C}}(\theta))$. Unfortunately, $\ell^\Psi_{\mathcal{C}}$ is non-convex in $\theta$ in general, and $\nabla_\theta\, \ell^\Psi_{\mathcal{C}}(\theta, y)$ requires computing the Jacobian of $P^\Psi_{\mathcal{C}}(\theta)$, which could be difficult, depending on $\mathcal{C}$. Other works have considered the output of an optimization program as input to a loss [48, 20, 8], but these methods are non-convex too and typically require unrolling the program's iterations. We address these issues using Fenchel-Young losses.
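As a small numerical illustration of the inequality above (ours, under the Euclidean assumption $\Psi(u) = \frac{1}{2}\|u\|_2^2$, so that $\nabla\Psi^*(\theta) = \theta$ and $D_\Psi$ is half the squared Euclidean distance), the following sketch evaluates the compositional loss $\ell^\Psi_{\mathcal{C}}$ with the easy-to-project superset $\mathcal{C} = [0, 1]^p$ and compares it to the unprojected divergence.

```python
# Sketch of the compositional loss ell^Psi_C in the Euclidean geometry with
# C = [0, 1]^p as an easy-to-project superset of phi(Y). Illustrative only.
import numpy as np

def euclidean_projection_cube(theta):
    """P_C(theta) for C = [0, 1]^p: coordinate-wise clipping."""
    return np.clip(theta, 0.0, 1.0)

def bregman_euclidean(u, v):
    """D_Psi(u, v) = 0.5 ||u - v||^2 for Psi = 0.5 ||.||^2."""
    return 0.5 * np.sum((u - v) ** 2)

def compositional_loss(theta, phi_y):
    """ell_C(theta, y) = D_Psi(phi(y), P_C(theta))."""
    return bregman_euclidean(phi_y, euclidean_projection_cube(theta))

theta = np.array([1.8, -0.7, 0.4])      # unconstrained model output
phi_y = np.array([1.0, 0.0, 1.0])       # ground-truth encoding, inside C
print(compositional_loss(theta, phi_y))  # 0.18: loss after projection
print(bregman_euclidean(phi_y, theta))   # 0.745: squared loss without projection
# The projected prediction is never farther from phi(y) than theta itself.
```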
**Convex upper bound.** We now set the generating function $\Omega$ of the Fenchel-Young loss (4) to $\Omega = \Psi + I_{\mathcal{C}}$, where $I_{\mathcal{C}}$ denotes the indicator function of $\mathcal{C}$. We assume that $\Psi$ is of Legendre type [46, 54], meaning that it is strictly convex and that $\nabla\Psi$ explodes at the boundary of the interior of $\mathrm{dom}(\Psi)$. This assumption is satisfied by both $\Psi(u) = \frac{1}{2}\|u\|_2^2$ and $\Psi(u) = \langle u, \log u\rangle$. With that assumption, as shown in [11, 12], we obtain $\nabla\Omega^*(\theta) = P^\Psi_{\mathcal{C}}(\theta)$ for all $\theta \in \Theta$, allowing us to use Fenchel-Young losses. For brevity, let us define the Fenchel-Young loss generated by $\Omega = \Psi + I_{\mathcal{C}}$ as

$$S^\Psi_{\mathcal{C}}(\theta, y) := S_{\Psi + I_{\mathcal{C}}}(\theta, y). \qquad (6)$$

From the properties of Fenchel-Young losses, we have $S^\Psi_{\mathcal{C}}(\theta, y) = 0 \iff P^\Psi_{\mathcal{C}}(\theta) = \phi(y)$ and $\nabla_\theta S^\Psi_{\mathcal{C}}(\theta, y) = P^\Psi_{\mathcal{C}}(\theta) - \phi(y)$. Moreover, as shown in [11, 12], $S^\Psi_{\mathcal{C}}(\theta, y)$ upper-bounds $\ell^\Psi_{\mathcal{C}}(\theta, y)$:

$$\ell^\Psi_{\mathcal{C}}(\theta, y) \le S^\Psi_{\mathcal{C}}(\theta, y) \quad \forall\, \theta \in \Theta,\ y \in \mathcal{Y}. \qquad (7)$$

Note that if $\mathcal{C} = \mathrm{dom}(\Psi)$ (the largest possible set), then $S^\Psi_{\mathcal{C}}(\theta, y) = D_\Psi(\phi(y), \nabla\Psi^*(\theta))$. In particular, with $\Psi = \frac{1}{2}\|\cdot\|_2^2$ and $\mathcal{C} = \mathbb{R}^p$, $S^\Psi_{\mathcal{C}}(\theta, y)$ recovers the squared loss $S_{\mathrm{sq}}(\theta, y) = \frac{1}{2}\|\phi(y) - \theta\|_2^2$.

**Choosing the projection set.** Recall that $\mathcal{C}$ should be a convex set such that $\phi(\mathcal{Y}) \subseteq \mathcal{C}$. The next (new) proposition, a simple consequence of (4), gives an argument in favor of using smaller sets.

**Proposition 1** (Using smaller sets results in a smaller loss). *Let $\mathcal{C}$, $\mathcal{C}'$ be two closed convex sets such that $\mathcal{C} \subseteq \mathcal{C}' \subseteq \mathrm{dom}(\Psi)$. Then, $S^\Psi_{\mathcal{C}}(\theta, y) \le S^\Psi_{\mathcal{C}'}(\theta, y)$ for all $\theta \in \Theta$ and $y \in \mathcal{Y}$.*

As a corollary, combined with (7), we have $\ell^\Psi_{\mathcal{C}}(\theta, y) \le S^\Psi_{\mathcal{C}}(\theta, y) \le D_\Psi(\phi(y), \nabla\Psi^*(\theta))$, and in particular, when $\Psi(u) = \frac{1}{2}\|u\|_2^2$, noticing that $S_{\mathrm{sq}} = S^\Psi_{\mathbb{R}^p}$, we have

$$\ell^\Psi_{\mathcal{C}}(\theta, y) = \tfrac{1}{2}\|\phi(y) - P^\Psi_{\mathcal{C}}(\theta)\|_2^2 \;\le\; S^\Psi_{\mathcal{C}}(\theta, y) \;\le\; \tfrac{1}{2}\|\phi(y) - \theta\|_2^2 = S_{\mathrm{sq}}(\theta, y).$$

Therefore, the Euclidean projection $P^\Psi_{\mathcal{C}}(\theta)$ always achieves a smaller squared loss than $\theta = g(x)$. This is intuitive, as $\mathcal{C}$ is a smaller region than $\mathbb{R}^p$ and is guaranteed to include the ground truth $\phi(y)$. Our loss $S^\Psi_{\mathcal{C}}$ is a convex and structurally informed middle ground between $\ell^\Psi_{\mathcal{C}}$ and $S_{\mathrm{sq}}$.

How should we choose $\mathcal{C}$? The smallest convex set $\mathcal{C}$ such that $\phi(\mathcal{Y}) \subseteq \mathcal{C}$ is the convex hull of $\phi(\mathcal{Y})$,

$$\mathcal{M} := \mathrm{conv}(\phi(\mathcal{Y})) := \{\mathbb{E}_{Y \sim q}[\phi(Y)] \colon q \in \triangle^{|\mathcal{Y}|}\} \subseteq \Theta. \qquad (8)$$

When $\phi(y)$ is a zero-one encoding of the parts of $y$, $\mathcal{M}$ is also known as the marginal polytope [54], since any point inside it can be interpreted as a marginal distribution over parts of the structures. The loss $S^\Psi_{\mathcal{C}}$ with $\mathcal{C} = \mathcal{M}$ and $\Psi(u) = \frac{1}{2}\|u\|_2^2$ is exactly the SparseMAP loss proposed in [37]. More generally, we can use any superset $\mathcal{C}$ of $\mathcal{M}$, with a potentially cheaper-to-compute projection. For instance, when $\phi(y)$ uses a zero-one encoding, the marginal polytope is always contained in the unit cube, i.e., $\mathcal{M} \subseteq [0, 1]^p$, whose projection is very cheap to compute. We show in our experiments that even just using the unit cube typically improves over the squared loss. However, an advantage of using $\mathcal{C} = \mathcal{M}$ is that $P^\Psi_{\mathcal{M}}(\theta)$ produces a convex combination of structures, i.e., an expectation.

**Smoothness.** The well-known equivalence between the strong convexity of a function and the smoothness of its Fenchel conjugate implies that the following three statements are all equivalent:

- $\Psi$ is $\frac{1}{\beta}$-strongly convex w.r.t. a norm $\|\cdot\|$ over $\mathcal{C}$,
- $P^\Psi_{\mathcal{C}}$ is $\beta$-Lipschitz continuous w.r.t. the dual norm $\|\cdot\|_*$ over $\mathbb{R}^p$,
- $S^\Psi_{\mathcal{C}}$ is $\beta$-smooth in its first argument w.r.t. $\|\cdot\|_*$ over $\mathbb{R}^p$.

With the Euclidean geometry, since $\Psi(u) = \frac{1}{2}\|u\|_2^2$ is 1-strongly convex over $\mathbb{R}^p$ w.r.t. $\|\cdot\|_2$, we have that $S^\Psi_{\mathcal{C}}$ is 1-smooth w.r.t. $\|\cdot\|_2$ regardless of $\mathcal{C}$. With the KL geometry, the situation is different. The fact that $\Psi(u) = \langle u, \log u\rangle$ is 1-strongly convex w.r.t. $\|\cdot\|_1$ over $\mathcal{C} = \triangle^p$ is well known (this is Pinsker's inequality). The next proposition, proved in §C.1, shows that this straightforwardly extends to any bounded $\mathcal{C}$ and that the strong convexity constant is inversely proportional to the size of $\mathcal{C}$.

**Proposition 2** (Strong convexity of $\Psi(u) = \langle u, \log u\rangle$ over a bounded set). *Let $\mathcal{C} \subseteq \mathbb{R}^d_+$ and $\beta := \sup_{u \in \mathcal{C}} \|u\|_1$. Then, $\Psi$ is $\frac{1}{\beta}$-strongly convex w.r.t. $\|\cdot\|_1$ over $\mathcal{C}$.*

This implies that $S^\Psi_{\mathcal{C}}$ is $\beta$-smooth w.r.t. $\|\cdot\|_\infty$. Since a smaller $\beta$ means a smoother loss, this is another argument for preferring smaller sets $\mathcal{C}$. With the best choice $\mathcal{C} = \mathcal{M}$, we obtain $\beta = \sup_{y \in \mathcal{Y}} \|\phi(y)\|_1$.
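The sandwich $\ell^\Psi_{\mathcal{C}} \le S^\Psi_{\mathcal{C}} \le S_{\mathrm{sq}}$ implied by (7) and Proposition 1 can be checked numerically. In the Euclidean case, $\Omega^*(\theta) = \langle\theta, P_{\mathcal{C}}(\theta)\rangle - \frac{1}{2}\|P_{\mathcal{C}}(\theta)\|_2^2$, so $S^\Psi_{\mathcal{C}}$ is computable from the projection alone. The sketch below (ours, illustrative, comparing $\mathcal{C} = [0, 1]^p$ with $\mathcal{C} = \mathbb{R}^p$) does exactly this.

```python
# Euclidean Fenchel-Young loss S_C(theta, y) from an arbitrary projection oracle,
# and a numerical check of ell_C <= S_C <= S_sq. Illustrative sketch.
import numpy as np

def fy_loss_euclidean(theta, phi_y, project):
    """S_C(theta, y) with Psi = 0.5||.||^2: Omega*(theta) + 0.5||phi(y)||^2 - <theta, phi(y)>."""
    p = project(theta)  # P_C(theta) = argmin_{u in C} ||u - theta||_2
    omega_star = theta @ p - 0.5 * p @ p
    return omega_star + 0.5 * phi_y @ phi_y - theta @ phi_y

project_cube = lambda t: np.clip(t, 0.0, 1.0)   # C = [0, 1]^p
project_rp = lambda t: t                        # C = R^p (no projection)

theta = np.array([1.8, -0.7, 1.4])
phi_y = np.array([0.0, 0.0, 1.0])

ell = 0.5 * np.sum((phi_y - project_cube(theta)) ** 2)   # compositional loss
s_cube = fy_loss_euclidean(theta, phi_y, project_cube)   # our loss, C = [0, 1]^p
s_sq = fy_loss_euclidean(theta, phi_y, project_rp)       # squared loss, C = R^p
print(ell, s_cube, s_sq)   # 0.5 <= 1.3 <= 1.945, as predicted
```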
**Computation.** Assuming $\mathcal{C}$ is compact (closed and bounded), the Euclidean projection can always be computed using Frank-Wolfe or active-set algorithms, provided access to a linear maximization oracle $\mathrm{LMO}_{\mathcal{C}}(v) := \operatorname{argmax}_{u \in \mathcal{C}} \langle u, v\rangle$. Note that in the case $\mathcal{C} = \mathcal{M}$, assuming that $\phi$ is injective, meaning that it has a left inverse, MAP inference reduces to an LMO, since $\mathrm{MAP}(\theta) = \phi^{-1}(\mathrm{LMO}_{\mathcal{M}}(\theta))$ (the LMO can be viewed as a linear program, whose solutions always hit a vertex $\phi(y)$ of $\mathcal{M}$). The KL projection is more problematic, but Frank-Wolfe variants have been proposed [7, 27]. In the next section, we focus on examples of sets for which an efficient dedicated projection oracle is available.

## 4 Examples of convex polytopes and corresponding projections

**Probability simplex.** For multiclass classification, we set $\mathcal{Y} = [k]$, where $k$ is the number of classes. With $\phi(y) = e_y$, the one-hot encoding of $y$, MAP inference (2) becomes $\mathrm{MAP}(\theta) = \operatorname{argmax}_{i \in [k]} \theta_i$. The marginal polytope defined in (8) is now $\mathcal{M} = \triangle^k$, the probability simplex. The Euclidean and KL projections onto $\mathcal{C} = \mathcal{M}$ then correspond to the sparsemax [32] and softmax transformations. We therefore recover the sparsemax and logistic losses as natural special cases of $S^\Psi_{\mathcal{C}}$. Note that, although the CRF loss [29] also comprises the logistic loss as a special case, it no longer coincides with our loss in the structured case.

*Figure 2: Examples of convex polytopes: (a) probability simplex, (b) unit cube, (c) knapsack polytope, (d) Birkhoff polytope, (e) permutahedron, (f) order simplex.*

**Unit cube.** For multilabel classification, we choose $\mathcal{Y} = 2^{[k]}$, the power set of $[k]$. Let us set $\phi(y) = \sum_{i=1}^{|y|} e_{y_i} \in \{0, 1\}^k$, the label indicator vector of $y$ (i.e., $\phi(y)_i = 1$ if $i \in y$ and $0$ otherwise). MAP inference corresponds to predicting each label independently: for each label $i \in [k]$, if $\theta_i > 0$ we predict $i$, otherwise we do not. The marginal polytope is now $\mathcal{M} = [0, 1]^k$, the unit cube. Each vertex is in bijection with one possible subset of $[k]$. The Euclidean projection of $\theta$ onto $\mathcal{M}$ is a coordinate-wise clipping of $\theta$, i.e., $\max(\min(\theta_i, 1), 0)$ for all $i \in [k]$. The KL projection is $\min(1, e^{\theta_i - 1})$ for all $i \in [k]$. More generally, whenever the $\phi$ for the task at hand uses a 0-1 encoding, we can use the unit cube as a superset with a computationally cheap projection.
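For concreteness, here are the four closed-form projections mentioned above (probability simplex and unit cube, in the Euclidean and KL geometries) in a minimal, illustrative implementation of ours; the simplex routines are the sparsemax [32] and softmax transformations.

```python
# Euclidean and KL projections onto the probability simplex and the unit cube.
import numpy as np
from scipy.special import logsumexp

def sparsemax(theta):
    """Euclidean projection of theta onto the simplex (the sparsemax transformation [32])."""
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z) - 1.0
    ind = np.arange(1, theta.size + 1)
    cond = z - cssv / ind > 0
    tau = cssv[cond][-1] / ind[cond][-1]
    return np.maximum(theta - tau, 0.0)

def softmax(theta):
    """KL projection of exp(theta - 1) onto the simplex."""
    return np.exp(theta - logsumexp(theta))

def cube_projection_euclidean(theta):
    """Euclidean projection onto [0, 1]^k: coordinate-wise clipping."""
    return np.clip(theta, 0.0, 1.0)

def cube_projection_kl(theta):
    """KL projection of exp(theta - 1) onto [0, 1]^k: min(1, exp(theta - 1))."""
    return np.minimum(1.0, np.exp(theta - 1.0))

theta = np.array([0.5, 1.5, 0.1])
print(sparsemax(theta))                  # [0., 1., 0.] -- sparse
print(softmax(theta))                    # dense, strictly positive
print(cube_projection_euclidean(theta))  # [0.5, 1. , 0.1]
print(cube_projection_kl(theta))         # [0.607, 1., 0.407]
```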
**Knapsack polytope.** We now set $\mathcal{Y} = \{y \in 2^{[k]} \colon l \le |y| \le u\}$, the subsets of $[k]$ of bounded size, where we assume $0 \le l \le u \le k$. This is useful for multilabel classification with a known lower bound $l \in \mathbb{N}$ and upper bound $u \in \mathbb{N}$ on the number of labels per sample. Setting again $\phi(y) = \sum_{i=1}^{|y|} e_{y_i} \in \{0, 1\}^k$, MAP inference is equivalent to the integer linear program $\operatorname{argmax}_{\phi(y) \in \{0,1\}^k} \langle\theta, \phi(y)\rangle$ s.t. $l \le \langle\phi(y), \mathbf{1}\rangle \le u$. Let $\pi$ be a permutation sorting $\theta$ in descending order. An optimal solution is

$$\phi(y)_i = \begin{cases} 1 & \text{if } l > 0 \text{ and } i \in \{\pi_1, \dots, \pi_l\}, \\ 1 & \text{else if } i \in \{\pi_1, \dots, \pi_u\} \text{ and } \theta_i > 0, \\ 0 & \text{otherwise.} \end{cases}$$

The marginal polytope is an instance of knapsack polytope [2]. It is equal to $\mathcal{M} = \{\mu \in [0, 1]^k \colon l \le \langle\mu, \mathbf{1}\rangle \le u\}$ and is illustrated in Figure 2c with $k = 3$, $l = 0$ and $u = 2$ (i.e., we keep all elements of $2^{[3]}$ except $\{1, 2, 3\}$). The next proposition, proved in §C.2, shows how to efficiently project onto $\mathcal{M}$.

**Proposition 3** (Efficient Euclidean and KL projections onto $\mathcal{M}$). *Let $\nu$ be the projection of $\nabla\Psi^*(\theta)$ onto the unit cube (cf. the unit cube paragraph). If $l \le \langle\nu, \mathbf{1}\rangle \le u$, then $\nu$ is optimal. Otherwise, return the projection of $\nabla\Psi^*(\theta)$ onto $\{\mu \in [0, 1]^k \colon \langle\mu, \mathbf{1}\rangle = m\}$, where $m = u$ if $\langle\nu, \mathbf{1}\rangle > u$ and $m = l$ otherwise.*

The total cost is $O(k)$ in the Euclidean case and $O(k \log k)$ in the KL case (cf. §C.2 for details).

**Birkhoff polytope.** We view ranking as a structured prediction problem and let $\mathcal{Y}$ be the set of permutations $\pi$ of $[k]$. Setting $\phi(\pi) \in \{0, 1\}^{k \times k}$ to the permutation matrix associated with $\pi$, MAP inference becomes the linear assignment problem $\mathrm{MAP}(\theta) = \operatorname{argmax}_{\pi \in \mathcal{Y}} \sum_{i=1}^k \theta_{i, \pi_i}$ and can be computed exactly using the Hungarian algorithm [28]. The marginal polytope $\mathcal{M}$ becomes the Birkhoff polytope [10], the set of doubly stochastic matrices

$$\mathcal{M} = \{P \in \mathbb{R}^{k \times k} \colon P^\top \mathbf{1} = \mathbf{1},\ P \mathbf{1} = \mathbf{1},\ 0 \le P \le 1\}.$$

Noticeably, marginal inference is known to be #P-complete [53, 50, §3.5], since it corresponds to computing a matrix permanent. In contrast, the KL projection onto the Birkhoff polytope can be computed using the Sinkhorn algorithm [47, 18]. The Euclidean projection can be computed using Dykstra's algorithm [19] or dual approaches [13]. For both projections, the cost of obtaining an $\epsilon$-precise solution is $O(k^2/\epsilon)$. To obtain cheaper projections, we can also use [13, 38] the set $[\triangle^k]^k$ of row-stochastic matrices, a strict superset of the Birkhoff polytope and strict subset of the unit cube $[0, 1]^{k \times k}$:

$$[\triangle^k]^k := \underbrace{\triangle^k \times \cdots \times \triangle^k}_{k\ \text{times}} = \{P \in \mathbb{R}^{k \times k} \colon P \mathbf{1} = \mathbf{1},\ 0 \le P \le 1\} \supset \mathcal{M}.$$

Projections onto $[\triangle^k]^k$ reduce to $k$ row-wise projections onto $\triangle^k$, for a worst-case total cost of $O(k^2 \log k)$ in the Euclidean case and $O(k^2)$ in the KL case.

**Permutahedron.** We again consider ranking and let $\mathcal{Y}$ be the set of permutations $\pi$ of $[k]$, but use a different encoding. This time, we define $\phi(\pi) = (w_{\pi_1}, \dots, w_{\pi_k}) \in \mathbb{R}^k$, where $w \in \mathbb{R}^k$ is a prescribed vector of weights, which, without loss of generality, we assume sorted in descending order. MAP inference becomes $\mathrm{MAP}(\theta) = \operatorname{argmax}_{\pi \in \mathcal{Y}} \sum_{i=1}^k \theta_i w_{\pi_i} = \operatorname{argmax}_{\pi \in \mathcal{Y}} \sum_{i=1}^k \theta_{\pi^{-1}_i} w_i$, where $\pi^{-1}$ denotes the inverse permutation of $\pi$. The MAP solution is thus the inverse of the permutation sorting $\theta$ in descending order, and can be computed in $O(k \log k)$ time. When $w = (k, \dots, 1)$, which we use in our experiments, $\mathcal{M}$ is known as the permutahedron. For arbitrary $w$, we follow [30] and call $\mathcal{M}$ the permutahedron induced by $w$. Its vertices correspond to the permutations of $w$. Importantly, the Euclidean projection onto $\mathcal{M}$ reduces to a sort, which takes $O(k \log k)$, followed by isotonic regression, which takes $O(k)$ [57, 36]. Bregman projections reduce to isotonic optimization [30].

**Order simplex.** We again set $\mathcal{Y} = [k]$ but now consider the ordinal regression setting, where there is an intrinsic order $1 \prec \cdots \prec k$. We need to use an encoding $\phi$ that takes this order into account. Inspired by the all-threshold method [42, 38], we set $\phi(y) = \sum_{1 \le i < y} e_i$.
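Returning to the Birkhoff polytope above, the KL projection of $\nabla\Psi^*(\theta) = e^{\theta - 1}$ can be approximated by Sinkhorn scaling, i.e., alternately normalizing rows and columns. The sketch below is ours and illustrative only (fixed iteration count, no log-domain stabilization), not the implementation used in the paper's experiments.

```python
# Minimal Sinkhorn-scaling sketch for the KL projection onto the Birkhoff polytope:
# alternately normalize rows and columns of exp(theta - 1) until both sums are ~1.
import numpy as np

def kl_projection_birkhoff(theta, num_iters=200):
    """Approximate KL projection of exp(theta - 1) onto doubly stochastic matrices."""
    P = np.exp(theta - 1.0)
    for _ in range(num_iters):
        P /= P.sum(axis=1, keepdims=True)   # make rows sum to 1
        P /= P.sum(axis=0, keepdims=True)   # make columns sum to 1
    return P

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 4))
P = kl_projection_birkhoff(theta)
print(P.sum(axis=0), P.sum(axis=1))  # both close to the all-ones vector
```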