Structured Prediction with Stronger Consistency Guarantees

Anqi Mao, Courant Institute, New York, NY 10012, aqmao@cims.nyu.edu
Mehryar Mohri, Google Research & CIMS, New York, NY 10011, mohri@google.com
Yutao Zhong, Courant Institute, New York, NY 10012, yutao@cims.nyu.edu

We present an extensive study of surrogate losses for structured prediction supported by H-consistency bounds. These are recently introduced guarantees that are more relevant to learning than Bayes-consistency, since they are not asymptotic and since they take into account the hypothesis set H used. We first show that no nontrivial H-consistency bound can be derived for widely used surrogate structured prediction losses. We then define several new families of surrogate losses, including structured comp-sum losses and structured constrained losses, for which we prove H-consistency bounds and thus Bayes-consistency. These loss functions readily lead to new structured prediction algorithms with stronger theoretical guarantees, based on their minimization. We describe efficient algorithms for minimizing several of these surrogate losses, including a new structured logistic loss.

1 Introduction

In most applications, the output labels of learning problems have some structure that is crucial to consider. This includes natural language processing applications, where the output may be a sentence, a sequence of part-of-speech tags, a parse tree, or a dependency graph. It also includes image annotation, image segmentation, computer vision, video annotation, object recognition, motion estimation, computational photography, bioinformatics, and many other important applications. Several algorithms have been designed in the past for structured prediction tasks, including Conditional Random Fields (CRFs) [Lafferty et al., 2001a, Gimpel and Smith, 2010], StructSVM [Tsochantaridis et al., 2005a], Maximum-Margin Markov Networks (M3N) [Taskar et al., 2003a], kernel-regression-based algorithms [Cortes et al., 2005, 2007], Voted CRF and StructBoost [Cortes et al., 2016], search-based methods [Daumé III et al., 2009, Doppa et al., 2014, Lam et al., 2015, Chang et al., 2015, Ross et al., 2011], and a variety of deep learning techniques [Jurafsky and Martin, 2009, Vinyals et al., 2015a, Nadeau and Sekine, 2007, Zhang et al., 2008, Wu et al., 2016, Lucchi et al., 2013, Vinyals et al., 2015b]; see Appendix A for a more comprehensive list of references and discussion.

Structured prediction tasks inherently involve a natural loss function based on substructures, which could be the Hamming loss, the n-gram loss, the edit-distance loss, or some other sequence similarity-based loss or task-specific structured loss. Many of the algorithms previously mentioned overlook this inherent structured loss by simply minimizing the cross-entropy loss. In contrast, the surrogate loss functions minimized by algorithms such as CRFs [Lafferty et al., 2001a, Gimpel and Smith, 2010], M3N [Taskar et al., 2003a], StructSVM [Tsochantaridis et al., 2005a] or Voted CRF and StructBoost [Cortes et al., 2016] do take into account the natural structured loss of the task. But are these structured prediction loss functions consistent? What guarantees can we rely on when minimizing them over a restricted hypothesis set that does not include all measurable functions? Can we derive non-asymptotic guarantees?
This paper deals precisely with these theoretical problems in structured prediction.

Previous work. We include a detailed discussion of consistency in structured prediction in Appendix A. Here, we briefly discuss previous work by Osokin et al. [2017]. To our knowledge, this is one of the only studies proving Bayes-consistency for a family of loss functions in structured prediction (see also [Nowak et al., 2020] and other related references in Appendix A). The surrogate losses the authors proposed are the following quadratic losses (see also [Zhang, 2004]), defined for any function h mapping X × Y to R and any loss function ℓ between output labels by
$\forall (x,y) \in X \times Y,\quad L_{\mathrm{quad}}(h,x,y) = \sum_{y' \in Y} \big[\ell(y',y) + h(x,y')\big]^2.$  (1)
However, the authors only consider the hypothesis set of linear scoring functions. Moreover, the feature vector in their setting only depends on the input x and ignores the label y. In many applications such as natural language processing, however, it is critical to allow for features that depend both on the input sequence and the output sequence, parse tree, or dependency graph. Finally, in this formulation, the structured prediction problem is cast as a regression problem. Thus, as shown below, the loss function derived is non-standard, even in the binary classification case, where ℓ = ℓ_{0-1} is the zero-one loss and Y = {y_1, y_2}. In this simple case, L_quad(h,x,y_1) can be expressed as
$L_{\mathrm{quad}}(h,x,y_1) = \sum_{y' \in Y} \big[\ell_{0\text{-}1}(y',y_1) + h(x,y')\big]^2 = h(x,y_1)^2 + \big(1 + h(x,y_2)\big)^2.$  (2)
This is not a typical formulation since it incorporates the magnitude of individual scores. In contrast, in the standard binary classification scenario, only the difference between scores matters.

Structure of the paper. We present an extensive study of surrogate losses for structured prediction supported by H-consistency bounds. These are recently introduced guarantees that are more relevant to learning than Bayes-consistency, since they are not asymptotic and since they take into account the hypothesis set H used. We first show that no non-trivial H-consistency bound or even Bayes-consistency can be derived for widely used surrogate structured prediction losses (Section 3). We then define several new families of surrogate losses, including structured comp-sum losses (Section 4) and structured constrained losses (Section 5), for which we prove H-consistency bounds and thus Bayes-consistency. These loss functions readily lead to new structured prediction algorithms with stronger theoretical guarantees, based on their minimization. We also describe efficient gradient computation algorithms for several of these surrogate losses, including a new structured logistic loss (Section 6).

2 Preliminaries

Learning scenario. We consider the standard structured prediction scenario with input space X and output space Y = {1,...,n}. The outputs may be discrete objects with overlapping structure, such as sequences, images, graphs, parse trees, or lists. We assume that the output can be decomposed into l substructures. The substructures could represent, for example, words or tokens, or other subsequences along a sequence, resulting in the decomposition of the output space Y as Y = Y_1 × ⋯ × Y_l. Here, each Y_j represents the set of possible labels or classes that can be assigned to the j-th substructure.

Scoring function. Structured prediction is typically formulated via scoring functions mapping X × Y to R, which assign a score to each possible class y ∈ Y. Let H be a family of such scoring functions.
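As a concrete illustration of this setup, here is a minimal sketch of a linear scoring function and of the argmax prediction rule made precise just below, including the highest-index tie-breaking convention; the weight vector, feature map, and function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Minimal sketch of a scoring function h(x, y) and of argmax decoding.
# The weight vector and feature map are illustrative assumptions.

def make_linear_scorer(w, feature_map):
    """Return a scoring function h(x, y) = w . Psi(x, y)."""
    return lambda x, y: float(np.dot(w, feature_map(x, y)))

def predict(h, x, labels):
    """Prediction h(x): a label maximizing the score h(x, y).

    Ties are broken deterministically in favor of the highest index,
    matching the convention adopted in the text below.
    """
    scores = np.array([h(x, y) for y in labels])
    return labels[int(np.flatnonzero(scores == scores.max())[-1])]
```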
For any h H, we denote by h(x) its prediction for the input x X, which is the output y Y that maximizes the score h(x,y), that is, h(x) = argmaxy Y h(x,y), with a fixed deterministic strategy to break ties in selecting the label with the highest score. For simplicity, we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. We denote by Hall the family of all measurable scoring functions. Generalization error and target loss. Given a distribution D over X Y and a loss function L H X Y R, the generalization error of a hypothesis h H and the best-in-class generalization error are defined as follows: RL(h) = E (x,y) D[L(h,x,y)] and R L,H = inf h H RL(h). In structured prediction, the goal is to select a hypothesis h H with small generalization error with respect to a target loss function L(h,x,y), which can be written as L(h,x,y) = ℓ(h(x),y) for some non-negative auxiliary function ℓ(y ,y) with any y ,y Y. ℓis assumed to be symmetric, that is, ℓ(y ,y) = ℓ(y,y ). This is a natural assumption since all instances of ℓthat we are familiar with in structured prediction admit this property. A significant characteristic of structured prediction is that the target loss function can be decomposed along the substructures Yk. As an example, we may use the Hamming loss as the target loss function, which is defined as Lham(h,x,y) = ℓham(h(x),y), where ℓham(y ,y) = 1 l l j=1 1y j yj with any y j,yj Yj; we can also use the zero-one loss as the target loss function, which is defined as L0 1(h,x,y) = ℓ0 1(h(x),y), where ℓ0 1(y ,y) = 1y y. Note that L0 1 can be viewed as a special case of Lham when l = 1. We denote by ℓmax = maxy ,y Y ℓ(y ,y) the maximal value of a target loss function. Without loss of generality, we assume that ℓmax 1, which can be achieved by normalizing the function ℓ. Consistency guarantees and surrogate loss. Optimizing the target loss functions in structured prediction for many choices of the hypothesis sets is NP-hard because they are not convex. One common method to address this issue is to resort to surrogate loss functions. Different surrogate loss functions readily lead to different structured prediction algorithms. A natural learning guarantee for such surrogate losses is Bayes-consistency, which guarantees that minimizing the generalization error for a surrogate loss Lsur over Hall also leads to the minimization of generalization error for the target loss L. Definition 1 (Bayes-consistency). A surrogate loss Lsur is Bayes-consistent in structured prediction, if for any target loss ℓ, hypothesis hn Hall and any distribution, (RLsur(hn) R Lsur,Hall n + ÐÐÐ 0) Ô (RL(hn) R L,Hall n + ÐÐÐ 0). (3) Bayes-consistency is an asymptotic guarantee and does not take into account typical hypothesis sets used in structured prediction algorithms, such as linear models or neural networks. To tackle these issues, recent work by Awasthi, Mao, Mohri, and Zhong [2022a,b] propose a stronger consistency guarantee, referred to as H-consistency bounds, which are bounds relating the estimation error of the target loss to the estimation error of a surrogate loss (see also [Awasthi et al., 2021a,b, Mao et al., 2023h,e,f, Zheng et al., 2023, Mao et al., 2023b,g,d, Mohri et al., 2023, Mao et al., 2023c,a, Awasthi et al., 2023a,b]): Definition 2 (H-consistency bounds). 
Given a subset of the hypothesis class H Hall, a surrogate loss Lsur admits a H-consistency bound in structured prediction, if for some non-decreasing function f R+ R+, a bound of the following form holds for any target loss ℓ, hypothesis h H and any distribution: RL(h) R L,H f(RLsur(h) R Lsur,H). (4) As pointed out by Awasthi et al. [2022a,b], H-consistency bounds are the state-of-the-art consistency guarantees for surrogate losses. They are much stronger and more informative than Bayes-consistency, since they account for hypothesis sets H adopted and provide a quantitative, non-asymptotic relation between surrogate losses and target losses. H-consistency bounds can imply Bayes-consistency when taking H to be Hall. In the next sections, we will present an extensive study of surrogate losses for structured prediction supported by H-consistency bounds. Conditional regret and minimizability gap. We denote by p(x) = (p(x,1),...,p(x,n)) the conditional distribution of Y given X = x. Then, the conditional error of a hypothesis h for a loss function L, denoted by CL(h,x), can be expressed as CL(h,x) = E y x[ℓ(h(x),y)] = y Y p(x,y)ℓ(h(x),y). We further define the best-in-class conditional error and the conditional regret as C L(H,x) = infh H CL(h,x) and CL,H(h,x) = CL(h,x) C L(H,x) respectively. The generalization error can then be rewritten as RL(h) = Ex[CL(h,x)]. A key quantity appearing in H-consistency bounds is the minimizability gap ML(H), which measures the difference between the best-in-class generalization error and the expected best-in-class conditional error for a loss function L and a hypothesis set H: ML(H) = R L(H) Ex[C L(H,x)]. This is an inherent quantity that we cannot hope to minimize or estimate. As shown by Steinwart [2007, Theorem 3.2], the minimizability gaps vanish ML(Hall) = 0 for the family of all measurable functions. More generally, the minimizability gaps vanish when the best-in-class error coincides with the Bayes-error, that is, R ℓ(H) = R ℓ(Hall) [Awasthi et al., 2022b, Mao et al., 2023h]. The following result characterizes the best-in-class conditional error and the conditional regret for a target loss L, which will be helpful for proving H-consistency bounds in structured prediction. We denote by H(x) the set of all possible predictions on a input x generated by hypotheses in H: H(x) = {h(x) h H}. The proof is given in Appendix B. Lemma 3. The best-in-class conditional error and the conditional regret for a target loss L in structured prediction can be expressed as follows: C L,H(x) = min y H(x) y Y p(x,y)ℓ(y ,y) CL,H(h,x) = y Y p(x,y)ℓ(h(x),y) min y H(x) y Y p(x,y)ℓ(y ,y). 3 Structured max losses In this section, we examine the loss functions associated to several prominent structured prediction algorithms. We show that, while they are natural, none of them is Bayes-consistent, which implies that they cannot be supported by H-consistency bounds either. More generally, we consider the following family of surrogate loss functions proposed in [Cortes, Kuznetsov, Mohri, and Yang, 2016], which we refer to as structured max losses: (x,y) X Y, Lmax(h,x,y) = max y y Φℓ(y ,y)(h(x,y) h(x,y )), (5) where Φu R R+ is an upper bound on v u1v 0 for any u R+. In this formulation, different choices of Φu can lead to different structured prediction algorithms. Specifically, as shown by Cortes et al. [2016], the following choices of Φu(v) recover many well-known algorithms: Φu(v) = max(0,u(1 v)): Struct SVM [Tsochantaridis et al., 2005b]. 
Φu(v) = max(0,u v): Max-Margin Markov Networks (M3N) [Taskar et al., 2003b]. Φu(v) = log(1 + eu v): Conditional Random Field (CRF) [Lafferty et al., 2001b]. Φu(v) = ue v: Struct Boost [Cortes et al., 2016]. The following gives a general negative result for Lmax that holds under broad assumptions. Theorem 4 (Negative results of Lmax). Assume that n > 2 and that Φu(v) is convex and nonincreasing for u = 1. Then, the max structured loss Lmax is not Bayes-consistent. The proof is included in Appendix C. It is straightforward to see that the assumption of Theorem 4 holds for all the choices of Φu listed above. Thus, the theorem rules out consistency guarantees for any of the loss functions associated to the structured prediction algorithms mentioned above: Struct SVM, M3N, CRF, Structboost. Furthermore, Theorem 4 provides negative results for a broad and generalized family of loss functions, collectively referred to as structured max loss. This extends the scope of existing research, as previous works had only addressed the inconsistency of specific instances within the structured max loss category, such as that of M3N [Osokin et al., 2017, Ciliberto et al., 2016, Nowak et al., 2020]. 4 Structured comp-sum losses In this section, we first analyze the Voted CRF loss function, which incorporates the auxiliary loss function ℓin the CRF loss and which has been used in several previous studies. Next, we introduce a new family of loss functions for structured predictions that we prove to admit strong consistency guarantees. 4.1 Voted Conditional Random Field (VCRF) We first study a family of surrogate losses called Voted Conditional Random Field (VCRF), which corresponds to the structured prediction algorithm defined in [Cortes et al., 2016]: (x,y) X Y, LVCRF(h,x,y) = log[ eh(x,y) y Y eh(x,y )+ℓ(y,y ) ] = log y Y eℓ(y,y )+h(x,y ) h(x,y) . This loss function has also been presented as the softmax margin [Gimpel and Smith, 2010] or the reward-augmented maximum likelihood [Norouzi et al., 2016]. It can be viewed as the softmax variant of the M3N loss. Indeed, the loss function for M3N can be written as follows: L(h,x,y) = max y max(0,ℓ(y ,y) + h(x,y ) h(x,y)). (6) If we replace the maximum function with the softmax, we obtain L(h,x,y) = log y Y emax(0,ℓ(y ,y)+h(x,y ) h(x,y)) = log y Y max(1,eℓ(y ,y)+h(x,y ) h(x,y)) . (7) Next, we show that, as with the loss function for M3N, the VCRF loss function LVCRF is inconsistent. Theorem 5 (Negative result of LVCRF). The Voted Conditional Random Field LVCRF is not Bayesconsistent. The proof is included in Appendix D. The key observation in the proof is that the conditional error of VCRF loss function can be reduced to a specific form when the target loss function L decouples, which can lead to a different Bayes classifier from that of the target loss function. To the best of our knowledge, no prior studies in the literature have explored the consistency of the VCRF loss formulation. The most closely related discussions center around a specialized instance of the multi-class logistic loss (also referred to as Conditional Random Field in that context), in which ℓ(y ,y) disappears within the framework of the Voted Conditional Random Field. The previous works by Osokin et al. [2017], Ciliberto et al. [2016], Nowak et al. [2020] point out that the multi-class logistic loss cannot be consistent in structured prediction due to the absence of the target loss function within its formulation. 
Instead, our result shows that, even when integrating the target loss ℓ(y′, y) within its formulation, the Voted Conditional Random Field cannot be consistent. Along with Theorem 4, these results rule out consistency guarantees for commonly used surrogate loss functions in structured prediction.

4.2 Structured comp-sum loss functions

In this section, we define a family of new loss functions for structured prediction that are not only Bayes-consistent but also supported by H-consistency bounds. These are loss functions that can be viewed as the generalization to structured prediction of loss functions defined via a composition and a sum, which have been referred to as comp-sum losses in [Mao et al., 2023h]. Thus, we will refer to them as structured comp-sum losses. They are defined as follows:
$\forall (x,y) \in X \times Y,\quad L_{\mathrm{comp}}(h,x,y) = \sum_{y' \in Y} \bar\ell(y',y)\, \Phi_1\Big(\sum_{y'' \in Y} \Phi_2\big(h(x,y'') - h(x,y')\big)\Big),$  (8)
where $\bar\ell(y',y) = 1 - \ell(y',y)$, $\Phi_1\colon \mathbb{R}_+ \to \mathbb{R}_+$ is a non-decreasing auxiliary function and $\Phi_2\colon \mathbb{R} \to \mathbb{R}_+$ a non-decreasing auxiliary function. Formulation (8) can also be viewed as a weighted comp-sum loss, if we interpret $\bar\ell(\cdot, y)$ as a weight vector. Specifically, we can choose $\Phi_2(v) = e^v$ and $\Phi_1(v) = \log(v)$, $\Phi_1(v) = v - 1$, $\Phi_1(v) = \frac{1}{\alpha}\big(1 - \frac{1}{v^\alpha}\big)$ with $\alpha \in (0,1)$, or $\Phi_1(v) = 1 - \frac{1}{v}$, which leads to the new surrogate losses for structured prediction defined in Table 1. These surrogate losses are novel strict generalizations of their counterparts in the standard multi-class classification case where ℓ = ℓ_{0-1}. More precisely, when ℓ = ℓ_{0-1}, $L^{\mathrm{comp}}_{\mathrm{log}}$ coincides with the logistic loss [Verhulst, 1838, 1845, Berkson, 1944, 1951]; $L^{\mathrm{comp}}_{\mathrm{exp}}$ coincides with the sum-exponential loss [Weston and Watkins, 1998, Awasthi et al., 2022b]; $L^{\mathrm{comp}}_{\mathrm{gce}}$ coincides with the generalized cross-entropy loss [Zhang and Sabuncu, 2018]; and $L^{\mathrm{comp}}_{\mathrm{mae}}$ coincides with the mean absolute error loss [Ghosh et al., 2017].

Table 1: A new family of surrogate losses for structured prediction: structured comp-sum losses.
- $\Phi_1(v) = \log(v)$, structured logistic loss: $L^{\mathrm{comp}}_{\mathrm{log}} = -\sum_{y' \in Y} \bar\ell(y',y) \log\Big[\frac{e^{h(x,y')}}{\sum_{y'' \in Y} e^{h(x,y'')}}\Big]$.
- $\Phi_1(v) = v - 1$, structured sum-exponential loss: $L^{\mathrm{comp}}_{\mathrm{exp}} = \sum_{y' \in Y} \bar\ell(y',y) \sum_{y'' \neq y'} e^{h(x,y'') - h(x,y')}$.
- $\Phi_1(v) = \frac{1}{\alpha}\big[1 - \frac{1}{v^\alpha}\big]$, structured generalized cross-entropy loss: $L^{\mathrm{comp}}_{\mathrm{gce}} = \sum_{y' \in Y} \bar\ell(y',y)\, \frac{1}{\alpha}\Big[1 - \Big[\frac{e^{h(x,y')}}{\sum_{y'' \in Y} e^{h(x,y'')}}\Big]^\alpha\Big]$.
- $\Phi_1(v) = 1 - \frac{1}{v}$, structured mean absolute error loss: $L^{\mathrm{comp}}_{\mathrm{mae}} = \sum_{y' \in Y} \bar\ell(y',y)\Big[1 - \frac{e^{h(x,y')}}{\sum_{y'' \in Y} e^{h(x,y'')}}\Big]$.

We will show that these structured comp-sum losses benefit from H-consistency bounds in structured prediction, when H is a symmetric and complete hypothesis set. A hypothesis set H is symmetric if there exists a family F of real-valued functions such that $\{[h(x,1),\ldots,h(x,n)] \colon h \in H\} = \{[f_1(x),\ldots,f_n(x)] \colon f_1,\ldots,f_n \in F\}$ for any x ∈ X. Thus, the choice of the scoring functions does not depend on the order of the categories in Y. A hypothesis set H is complete if it can generate scores that span R, that is, $\{h(x,y) \colon h \in H\} = \mathbb{R}$ for any (x,y) ∈ X × Y. As shown by Awasthi et al. [2022b] and Mao et al. [2023h], these assumptions are general and hold for common hypothesis sets used in practice, such as the family of linear hypotheses, that of multi-layer feed-forward neural networks, and of course that of all measurable functions.

Theorem 6 (H-consistency bound of Lcomp). Assume that H is symmetric and complete. Then, for any target loss ℓ, any hypothesis h ∈ H and any distribution, we have
$R_L(h) - R^*_{L,H} \le \Gamma\big(R_{L_{\mathrm{comp}}}(h) - R^*_{L_{\mathrm{comp}},H} + M_{L_{\mathrm{comp}},H}\big) - M_{L,H},$  (9)
where $\Gamma(t) = 2\sqrt{t}$ when $L_{\mathrm{comp}} = L^{\mathrm{comp}}_{\mathrm{log}}$ or $L^{\mathrm{comp}}_{\mathrm{exp}}$; $\Gamma(t) = 2\sqrt{n^\alpha t}$ when $L_{\mathrm{comp}} = L^{\mathrm{comp}}_{\mathrm{gce}}$; and $\Gamma(t) = n t$ when $L_{\mathrm{comp}} = L^{\mathrm{comp}}_{\mathrm{mae}}$.
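To make the definitions in Table 1 concrete, the following NumPy sketch evaluates the four structured comp-sum losses on a single example over a small, explicitly enumerable label set; the function names and the dense matrix representation of ℓ(y′, y) are illustrative assumptions, not part of the algorithms described here.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax of a score vector."""
    z = np.exp(scores - scores.max())
    return z / z.sum()

def structured_comp_sum_loss(scores, y, loss_matrix, variant="log", alpha=0.5):
    """Structured comp-sum loss of Eq. (8) for a single example.

    scores:      shape (n,), the scores h(x, y') for every label y' in Y.
    y:           index of the observed label.
    loss_matrix: shape (n, n), entries l(y', y) of the target loss in [0, 1].
    """
    lbar = 1.0 - loss_matrix[:, y]                 # weights lbar(y', y) = 1 - l(y', y)
    p = softmax(scores)                            # e^{h(x,y')} / sum_{y''} e^{h(x,y'')}
    if variant == "log":                           # structured logistic loss
        return float(np.sum(lbar * -np.log(p)))
    if variant == "exp":                           # structured sum-exponential loss
        diffs = np.exp(scores[None, :] - scores[:, None])  # entry (y', y''): e^{h(y'')-h(y')}
        return float(np.sum(lbar * (diffs.sum(axis=1) - 1.0)))
    if variant == "gce":                           # structured generalized cross-entropy loss
        return float(np.sum(lbar * (1.0 - p ** alpha) / alpha))
    if variant == "mae":                           # structured mean absolute error loss
        return float(np.sum(lbar * (1.0 - p)))
    raise ValueError(f"unknown variant: {variant}")
```

The weights lbar(·, y) are exactly the ℓ̄ of (8), so setting ℓ = ℓ_{0-1} recovers the standard multi-class losses mentioned above.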
Theorem 6 represents a consolidated result for the four structured comp-sum losses, with the proofs for each being presented separately in Appendix E. The key step of the proof is to upper bound the conditional regret of the target loss (Lemma 3) by that of a surrogate loss. To achieve this, we upper bound the best-in-class conditional error by the conditional error of a carefully selected hypothesis h_µ ∈ H. The resulting softmax S_µ of this hypothesis differs from the original softmax S corresponding to h only on two labels. Theorem 6 admits as special cases the H-consistency bounds of Mao et al. [2023h] given for standard multi-class classification (ℓ = ℓ_{0-1}) and significantly extends them to the general structured prediction scenario. Let us emphasize that our proof technique is novel and distinct from the approach used in [Mao et al., 2023h], which only applies to the special case where ℓ is the zero-one loss and cannot be generalized to an arbitrary target loss ℓ. In their proof, the authors choose h_µ based on individual scores h(x,y), rather than the softmax. Consequently, when ℓ ≠ ℓ_{0-1}, as is common in structured prediction, the resulting optimization problem in µ can be very intricate and a closed-form expression of its solution cannot be derived. Our new proof method overcomes this limitation: by viewing the softmax of a hypothesis as a unit and introducing a pseudo-conditional distribution q, we are able to solve a simple constrained optimization problem in µ within the structured prediction scenario.

By Steinwart [2007, Theorem 3.2], the minimizability gaps M_{Lcomp,H} and M_{L,H} vanish for the family of all measurable functions. Therefore, when H = H_all, the H-consistency bounds provided in Theorem 6 imply the Bayes-consistency of these structured comp-sum losses.

Corollary 7. The structured comp-sum loss L_comp is Bayes-consistent for L_comp = L^comp_log, L^comp_exp, L^comp_gce, and L^comp_mae.

In fact, Theorem 6 provides stronger quantitative bounds than Bayes-consistency when the minimizability gaps vanish: if the estimation error of the structured comp-sum loss, $R_{L_{\mathrm{comp}}}(h) - R^*_{L_{\mathrm{comp}},H}$, is reduced to ϵ, then the estimation error of the target loss, $R_L(h) - R^*_{L,H}$, is upper bounded by $2\sqrt{\epsilon}$ for the structured logistic loss and the structured sum-exponential loss, $2\sqrt{n^\alpha \epsilon}$ for the structured generalized cross-entropy loss, and $n\epsilon$ for the structured mean absolute error loss.

5 Structured constrained loss functions

In this section, we introduce another new family of surrogate losses for structured prediction that we prove to admit H-consistency bounds. We present a novel generalization of the constrained losses [Lee et al., 2004, Awasthi et al., 2022b] to structured prediction. Thus, we refer to them as structured constrained losses and define them as follows:
$\forall (x,y) \in X \times Y,\quad L_{\mathrm{cstnd}}(h,x,y) = \sum_{y' \in Y} \Phi_{\ell(y',y)}\big(-h(x,y')\big),$  (10)
with the constraint that $\sum_{y \in Y} h(x,y) = 0$, where $\Phi_u\colon \mathbb{R} \to \mathbb{R}_+$ is an upper bound on $v \mapsto u\,1_{v \le 0}$ for any $u \in \mathbb{R}_+$.

Table 2: A new family of surrogate losses for structured prediction: structured constrained losses (all subject to the constraint $\sum_{y \in Y} h(x,y) = 0$).
- $\Phi_u(v) = u e^{-v}$, structured constrained exponential loss: $L^{\mathrm{cstnd}}_{\mathrm{exp}} = \sum_{y' \in Y} \ell(y',y)\, e^{h(x,y')}$.
- $\Phi_u(v) = u \max\{0, 1 - v\}^2$, structured constrained squared-hinge loss: $L^{\mathrm{cstnd}}_{\mathrm{sq\text{-}hinge}} = \sum_{y' \in Y} \ell(y',y) \max\{0, 1 + h(x,y')\}^2$.
- $\Phi_u(v) = u \max\{0, 1 - v\}$, structured constrained hinge loss: $L^{\mathrm{cstnd}}_{\mathrm{hinge}} = \sum_{y' \in Y} \ell(y',y) \max\{0, 1 + h(x,y')\}$.
- $\Phi_u(v) = u \min\{\max\{0, 1 - \frac{v}{\rho}\}, 1\}$, structured constrained ρ-margin loss: $L^{\mathrm{cstnd}}_{\rho} = \sum_{y' \in Y} \ell(y',y) \min\{\max\{0, 1 + \frac{h(x,y')}{\rho}\}, 1\}$.
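As an illustration of definition (10) and Table 2, the sketch below evaluates the structured constrained losses on a single example; the sum-to-zero constraint is enforced here by centering the scores, which is one simple (assumed, not prescribed) way of parameterizing a constrained hypothesis, and the function names and dense loss-matrix representation are likewise illustrative assumptions.

```python
import numpy as np

def structured_constrained_loss(scores, y, loss_matrix, variant="hinge", rho=1.0):
    """Structured constrained loss of Eq. (10) for a single example.

    scores:      shape (n,), raw scores for every label y' in Y.
    y:           index of the observed label.
    loss_matrix: shape (n, n), entries l(y', y) of the target loss.
    """
    h = scores - scores.mean()                 # enforce sum_y h(x, y) = 0 by centering
    weights = loss_matrix[:, y]                # l(y', y); the y' = y term vanishes since l(y, y) = 0
    if variant == "exp":                       # Phi_u(v) = u * exp(-v)
        return float(np.sum(weights * np.exp(h)))
    if variant == "sq_hinge":                  # Phi_u(v) = u * max(0, 1 - v)^2
        return float(np.sum(weights * np.maximum(0.0, 1.0 + h) ** 2))
    if variant == "hinge":                     # Phi_u(v) = u * max(0, 1 - v)
        return float(np.sum(weights * np.maximum(0.0, 1.0 + h)))
    if variant == "rho_margin":                # Phi_u(v) = u * min(max(0, 1 - v/rho), 1)
        return float(np.sum(weights * np.minimum(np.maximum(0.0, 1.0 + h / rho), 1.0)))
    raise ValueError(f"unknown variant: {variant}")
```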
In standard constrained loss formulation, a single-variable function Φ(v) that defines a margin-based loss is used. In (10), the single-variable function Φ(v) is generalized to being a function of two variables Φu(v), which depends on both the target loss and the scores, to accommodate the structured prediction scenario. Specifically, we can choose Φu(v) = ue v, Φu(v) = umax{0,1 v}2, Φu(v) = umax{0,1 v}, Φu(v) = umin{max{0,1 v/ρ},1}, which lead to new surrogate losses for structured prediction defined in Table 2. These surrogate losses are novel generalization of their corresponding counterparts [Lee et al., 2004, Awasthi et al., 2022b] in standard multi-class classification, where ℓ= ℓ0 1. As with structured comp-sum losses, we will show that these structured constrained losses benefit from H-consistency bounds in structured prediction as well, for any symmetric and complete hypothesis set H. Theorem 8 (H-consistency bound of Lcstnd). Assume that H is symmetric and complete. Then, for any target loss ℓ, hypothesis h H and any distribution, we have RL(h) R L,H Γ(RLcstnd(h) R Lcstnd,H + MLcstnd,H) ML,H. (11) where Γ(t) = 2 ℓmaxt when Lcstnd = Lcstnd exp ; Γ(t) = 2 t when Lcstnd = Lcstnd sq hinge; and Γ(t) = t when Lcstnd = Lcstnd hinge or Lcstnd ρ . The proof is included in Appendix F. As for Theorem 8, the key part of the proof is to upper bound the conditional regret of the target loss (Lemma 3) by that of a surrogate loss. Here too, we introduce a pseudo-conditional distribution q, which can be viewed as a weighted distribution of the original one, p(x), with weights given by the target loss function. Then, we upper bound the best-in-class conditional error by the conditional error of a carefully selected hypothesis hµ H. As shown by Steinwart [2007, Theorem 3.2], for the family of all measurable functions, the minimizability gaps vanish: MLcstnd,H = 0 and ML,H = 0. Therefore, when H = Hall, the H-consistency bounds provided in Theorem 6 imply the Bayes-consistency of these structured constrained losses. Corollary 9. The structured constrained loss Lcstnd is Bayes-consistent for Lcstnd = Lcstnd exp , Lcstnd sq hinge, Lcstnd hinge, and Lcstnd ρ . As with the cases of structured comp-sum losses, Theorem 8 provides in fact stronger quantitative bounds than Bayes-consistency. They show that that if the estimation error of the structured constrained loss RLcomp(h) R Lcomp,H is reduced to ϵ, the estimation error of the target loss RL(h) R L,H is upper bounded by 2 ℓmaxϵ for Lcstnd exp , 2 ϵ for Lcstnd sq hinge and ϵ for Lcstnd hinge and Lcstnd ρ . It is important to note that we can upper bound the minimizability gap by the approximation error, or finer terms depending on the magnitude of the parameter space as in [Mao et al., 2023h]. Furthermore, our H-consistency bounds (Theorems 6 and 8) can be used to derive finite sample learning bounds for a hypothesis set H. These bounds depend on the Rademacher complexity of the hypothesis set and the loss function, as well as an upper bound on the minimizability gap for the surrogate loss. 6 Optimization of Lcomp log and Lcomp exp In this section, we show that the gradient of the structured logistic loss Lcomp log can be computed efficiently at any point (xi,yi) and therefore that this loss function is both supported by H-consistency bounds and is of practical use. We similarly show that for Lcomp exp in Appendix G.2. Fix the labeled pair (xi,yi) and h H. 
Observe that Lcomp log (h,xi,yi) can be equivalently rewritten as follows: Lcomp log (h,xi,yi) = y Y ℓ(y ,yi)log y Y eh(xi,y ) h(xi,y ) ℓ(y ,yi)h(xi,y ) + y Y ℓ(y ,yi) log y Y eh(xi,y ) = y Y ℓ(y ,yi)h(xi,y ) + ℓi log Zh,i, where ℓi = y Y ℓ(y ,yi), and Zh,i = y Y eh(xi,y). Note that ℓi does not depend on h and can be pre-computed. Modulo normalization, this quantity is the average similarity of yi to Y, if we interpret ℓ= 1 ℓas a similarity. While Y may be very large, this can be often computed straightforwardly for most loss functions ℓ. For example, for the Hamming loss, for sequences of length l, we have l E[ l k=1 (1 1y k yi,k)] = 1 l k=1 E[1y k=yi,k] = 1 Thus, in this case, ℓi does not depend on i and is a universal constant. Similarly, ℓi can be shown to be a constant for many other losses. Hypothesis set. For the remaining of this section, to simplify the presentation, we will consider the hypothesis set of linear functions H = {x w Ψ(x,y) w Rd}, where Ψ is a feature mapping from X Y to Rd. Note that a number of classical structured prediction algorithms adopt the same linear hypothesis set: Struct SVM [Tsochantaridis et al., 2005b], Max-Margin Markov Networks (M3N) [Taskar et al., 2003b], Conditional Random Field (CRF) [Lafferty et al., 2001b], Voted Conditional Random Field (VCRF) [Cortes et al., 2016]. Our algorithms can also be incorporated into standard procedures for training neural network architectures (see [Cortes et al., 2018], Appendix B). Markovian features. We will also assume Markovian features, as is common in structured prediction. Features used in practice often admit this property. Furthermore, in the absence of any such assumption, it is known that learning and inference in general are intractable. We will largely adopt here the definitions and notation from [Cortes et al., 2016] and will consider the common case where Y is a set of sequences of length l over a finite alphabet of size r. Other structured problems can be treated in similar ways. We will denote by ε the empty string and for any sequence y = (y1,...,yl) Y, we will denote by ys s = (ys,...,ys ) the substring of y starting at index s and ending at s . For convenience, for s 0, we define ys by ys = ε. We will assume that the feature vector Ψ admits a Markovian property of order q, that is it can be decomposed as follows for any (x,y) X Y: Ψ(x,y) = l s=1 ψ(x,ys s q+1,s). (12) for some position-dependent feature vector function ψ defined over X q [l]. We note that we can write Ψ = p k=1 Ψk with Ψk = (0,...,Ψk,...,0). In the following, abusing the notation, we will simply write Ψk instead of Ψk. Each Ψk corresponds to a Markovian feature vector based only on k-grams, p is the largest k. Thus, for any x X and y Y, we have p k=1 Ψk(x,y). (13) For any k [1,p], let ψk denote the position-dependent feature vector function corresponding to Ψk. Also, for any x X and y l, define ψ by ψ(x,ys s p+1,s) = p k=1 ψk(x,ys s k+1,s). Observe then that we can write p k=1 Ψk(x,y) = l s=1 ψk(x,ys s k+1,s) = l s=1 p k=1 ψk(x,ys s k+1,s) = l s=1 ψ(x,ys s p+1,s). Gradient computation. Adopting the shorthand w for h, we can rewrite the loss at (xi,yi) as: Lcomp log (w,xi,yi) = w y Y ℓ(y ,yi)Ψ(xi,y ) + ℓi log Zw,i. Thus, the gradient of Lcomp log at any w Rd is given by Lcomp log (w) = y Y ℓ(y ,yi)Ψ(xi,y ) + ℓi y Y y Y ew Ψ(xi,y ) Ψ(xi,y) ℓ(y ,yi)Ψ(xi,y ) + ℓi E y qw[Ψ(xi,y)], where qw is defined for all y Y by qw(y) = ew Ψ(xi,y) Zw with Zw = y Y ew Ψ(xi,y). 
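For intuition, the following sketch computes this gradient exactly by brute-force enumeration of Y, which is only practical for very small alphabets and sequence lengths; the feature map, the Hamming-loss helper, and all names below are illustrative assumptions rather than the efficient procedure developed next.

```python
import itertools
import numpy as np

def hamming(y_prime, y):
    """Normalized Hamming loss between two equal-length label sequences."""
    return sum(a != b for a, b in zip(y_prime, y)) / len(y)

def grad_structured_logistic(w, x_i, y_i, feature_map, alphabet, length, loss=hamming):
    """Exact gradient of the structured logistic loss at (x_i, y_i) by enumerating Y.

    Implements  grad = - sum_{y'} lbar(y', y_i) Psi(x_i, y')
                       + lbar_i * E_{y ~ q_w}[Psi(x_i, y)],
    with lbar = 1 - loss, lbar_i = sum_{y'} lbar(y', y_i), and
    q_w(y) proportional to exp(w . Psi(x_i, y)).
    """
    outputs = list(itertools.product(alphabet, repeat=length))
    feats = np.array([feature_map(x_i, y) for y in outputs])   # shape (|Y|, d)
    lbar = np.array([1.0 - loss(y, y_i) for y in outputs])     # lbar(y, y_i)
    scores = feats @ w                                         # w . Psi(x_i, y)
    q = np.exp(scores - scores.max())
    q /= q.sum()                                               # distribution q_w over Y
    return -feats.T @ lbar + lbar.sum() * (feats.T @ q)
```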
Note that the sum defining these terms is over a number of sequences y that is exponential in r and that the computation appears to be therefore challenging. The following lemma gives the expression of the gradient of Lcomp log and helps identify the most computationally challenging terms. Lemma 10. For any w Rd, the gradient of Lcomp log can be expressed as follows: Lcomp log (w) = l s=1 z p[ℓi Qw(z,s) L(z,s)] ψ(xi,z,s), where Qw(z,s) = y ys s p+1=z qw(y) and L(z,s) = y ys s p+1=z ℓ(y,yi). Proof. Using the decomposition of the feature vector, we can write: ℓ(y,yi)Ψ(xi,y) = y l ℓ(y,yi) l s=1 ψ(xi,ys s p+1,s) = l s=1 z p y ys s p+1=z E y qw[Ψ(xi,y)] = y l qw(y) l s=1 ψ(xi,ys s p+1,s) = l s=1 z p y ys s p+1=z qw(y) This completes the proof. In light of this result, the bottleneck in the gradient computation is the evaluation of Qw(z,s) and L(z,s) for all s [l] and z p. In previous work [Cortes, Kuznetsov, Mohri, and Yang, 2016, Cortes, Kuznetsov, Mohri, Storcheus, and Yang, 2018], it was shown that the quantities Qw(z,s) can be determined efficiently, all together, by running two single-source shortest-distance algorithms over the (+, ) semiring on an appropriate weighted finite automaton (WFA). The overall time complexity of the computation of all quantities Qw(z,s), z p and s [l], is then in O(lrp), where r = . We now analyze the computation of L(z,s) for a fixed z p and s [l]. Note that, unlike Qw(z,s), this term does not depend on w and can therefore be computed once and for all, before any gradient computation. The sum defining L(z,s) is over all sequences y that admit the substring z at position s. Rational losses. In Appendix G.1, we also give an efficient algorithm for the computation of the quantities L(z,s) in the case of Markovian losses. Here, we present an efficient algorithm for their computation in the important case of rational losses. This is a general family of loss functions based on rational kernels [Cortes, Haffner, and Mohri, 2004] that includes, in particular, n-gram losses, which can be defined for a pair of sequences (y,y ) as the negative inner product of the vectors of n-gram counts of y and y . Our algorithm bears some similarity to that of Cortes et al. [2018] for the computation of the gradient of the VCRF loss function. It is however distinct because the structured prediction loss function we are considering and our definition of rational loss are both different. We will adopt a similar notation and terminology. Recall that for any sequence y, we denote by yi the symbol in its ith position and by yj i = yiyi+1 yj the substring of y starting at position i and ending at j. We denote by EA the set of transitions of a WFA A. Let U be a weighted finite-state transducer (WFST) over the (+, ) semiring over the reals, with as both the input and output alphabet. Then, we define the rational loss associated to U for all y,y by ℓ(y,y ) = U(y,y ). 0 1 a:a/1 2 b:b/1 3/1 a:a/1 Figure 1: Illustration of the WFA Y for = {a,b} and l = 3, and the WFA Yi, where yi = aba. Let Y denote a WFA over the (+, ) semiring accepting the set of all sequences of length l with weight one and let Yi denote the WFA accepting only yi with weight one. Then, by definition, the weighted transducer Y U Yi obtained by composition maps each sequence y in l to yi with weight U(y,yi). The WFA Π1(Y U Yi) derived from that transducer by projection on the input (that is by removing output labels) is associating to each sequence y weight U(y,yi). 
We use weighted determinization [Mohri, 1997] to compute an equivalent deterministic WFA denote M. As shown by Cortes et al. [2015][Theorem 3], M can be computed in polynomial time. M admits a unique path labeled with any sequence y l and the weight of that path is U(y,yi). The weight of that accepting path is obtained by multiplying the weights of its transitions and that of the final state. a:a/1 (b, 2) Figure 2: Illustration of the WFA N for = {a,b}, p = 2 and l = 2. We now define a deterministic p-gram WFA N that accepts all sequences y l with each of its states (z ,s) encoding a (p 1)- gram z read to reach it and the position s in the sequence y at which it is reached. The transitions of N are therefore defined as follows with weight one: EN = {((ys 1 s p+1,s 1),a,1,(ys 1 s p+2a,s)) y l,a ,s [l]}. The initial state is (ϵ,0) and the final states are those with the second element of the pair (the position) being l. Note that, by construction, N is deterministic. Then, the composition (or intersection) WFA N M still associates the same weight as M to each input string y l. However, the states in that composition help us compute L(z,s). In particular, for any z p and s [l], let E(z,s) be the set of transitions of N M constructed by pairing the transition ((zp 1 1 ,s 1),zp,ω(z,s),(zp 2,s)) in N with a transition (q M,zp,ω,q M) in M. They admit the following form: E(z,s) ={((q N,q M),zp,ω,(q N,q M)) EN M q N = (zp 1 1 ,s 1)}. (14) The WFA N M is deterministic as a composition of two deterministic WFAs. Thus, there is a unique path labeled with a sequence y l in N M and y admits the substring z ending at position s iff that path goes through a transition in E(z,s) when reaching position s. Therefore, to compute L(z,s), it suffices for us to compute the sum of the weights of all paths in N M going through a transition in E(z,s). This can be done straightforwardly using the forward-backward algorithm or two single-source shortest-distance algorithm over the (+, ) semiring [Mohri, 2002a], one from the initial state, the other one from the final states. Since N M is acyclic and admits O(l p) transitions, we can compute all the quantities L(z,s), s [l] and z p, in time O(l p). 7 Conclusion Our detailed study revealed shortcomings in commonly used surrogate loss functions and algorithms for structured prediction, prompting the introduction of new, strongly consistent alternatives. These findings not only enhance the theoretical and algorithmic foundations of structured prediction but also pave the way for the development of practical and effective solutions. In upcoming work, we will report an extensive empirical analysis of our algorithms. Our work provides tools and insights for future algorithm design in this domain, promising advancements in both theory and application. T. E. Ahmad, L. Brogat-Motte, P. Laforgue, and F. d Alché Buc. Sketch in, sketch out: Accelerating both learning and inference for structured prediction with kernels. ar Xiv preprint ar Xiv:2302.10128, 2023. C. Allauzen and M. Mohri. An efficient pre-determinization algorithm. In O. H. Ibarra and Z. Dang, editors, Implementation and Application of Automata, 8th International Conference, CIAA 2003, Santa Barbara, California, USA, July 16-18, 2003, Proceedings, volume 2759 of Lecture Notes in Computer Science, pages 83 95. Springer, 2003. C. Allauzen and M. Mohri. An optimal pre-determinization algorithm for weighted transducers. Theor. Comput. Sci., 328(1-2):3 18, 2004. P. Awasthi, N. Frank, A. Mao, M. Mohri, and Y. Zhong. 
Calibration and consistency of adversarial surrogate losses. In Advances in Neural Information Processing Systems, 2021a. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. A finer calibration analysis for adversarial robustness. ar Xiv preprint ar Xiv:2105.01550, 2021b. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, 2022a. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Multi-class H-consistency bounds. In Advances in neural information processing systems, 2022b. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for adversarial robustness. In International Conference on Artificial Intelligence and Statistics, pages 10077 10094, 2023a. P. Awasthi, A. Mao, M. Mohri, and Y. Zhong. DC-programming for neural network optimizations. Journal of Global Optimization, 2023b. D. Belanger and A. Mc Callum. Structured prediction energy networks. In International Conference on Machine Learning, pages 983 992, 2016. D. Belanger, B. Yang, and A. Mc Callum. End-to-end learning for structured prediction energy networks. In International Conference on Machine Learning, pages 429 439, 2017. S. Belharbi, R. Hérault, C. Chatelain, and S. Adam. Deep neural networks regularization for structured output prediction. Neurocomputing, 281:169 177, 2018. J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357 -365, 1944. J. Berkson. Why I prefer logits to probits. Biometrics, 7(4):327 -339, 1951. M. Blondel. Structured prediction with projection oracles. In Advances in neural information processing systems, 2019. L. Brogat-Motte, A. Rudi, C. Brouard, J. Rousu, and F. d Alché Buc. Learning output embeddings in structured prediction. ar Xiv preprint ar Xiv:2007.14703, 2020. V. A. Cabannes, F. Bach, and A. Rudi. Fast rates for structured prediction. In Conference on Learning Theory, pages 823 865, 2021. V. Cabannnes, A. Rudi, and F. Bach. Structured prediction with partial labelling through the infimum loss. In International Conference on Machine Learning, pages 1230 1239, 2020. K. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III, and J. Langford. Learning to search better than your teacher. In Proceedings of ICML, 2015. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. ar Xiv preprint ar Xiv:1412.7062, 2014. L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun. Learning deep structured models. In International Conference on Machine Learning, pages 1785 1794, 2015. H. Choi, O. Meshi, and N. Srebro. Fast and scalable structural svm with slack rescaling. In Artificial Intelligence and Statistics, pages 667 675, 2016. C. Ciliberto, L. Rosasco, and A. Rudi. A consistent regularization approach for structured prediction. In Advances in neural information processing systems, 2016. C. Ciliberto, F. Bach, and A. Rudi. Localized structured prediction. In Advances in Neural Information Processing Systems, 2019. C. Ciliberto, L. Rosasco, and A. Rudi. A general framework for consistent structured prediction with implicit loss embeddings. The Journal of Machine Learning Research, 21(1):3852 3918, 2020. R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of machine learning research, 12(ARTICLE):2493 2537, 2011. C. Corro. 
On the inconsistency of separable losses for structured prediction. ar Xiv preprint ar Xiv:2301.10810, 2023. C. Cortes, P. Haffner, and M. Mohri. Rational kernels: Theory and algorithms. J. Mach. Learn. Res., 5:1035 1062, 2004. C. Cortes, M. Mohri, and J. Weston. A general regression technique for learning transductions. In L. D. Raedt and S. Wrobel, editors, Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, volume 119 of ACM International Conference Proceeding Series, pages 153 160. ACM, 2005. C. Cortes, M. Mohri, and J. Weston. A General Regression Framework for Learning String-to-String Mappings. In Predicting Structured Data. MIT Press, 2007. C. Cortes, V. Kuznetsov, and M. Mohri. Ensemble methods for structured prediction. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1134 1142. JMLR.org, 2014a. C. Cortes, V. Kuznetsov, and M. Mohri. Learning ensembles of structured prediction rules. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 1 12. The Association for Computer Linguistics, 2014b. C. Cortes, V. Kuznetsov, M. Mohri, and M. K. Warmuth. On-line learning algorithms for path experts with non-additive losses. In Proceedings of COLT, 2015. C. Cortes, V. Kuznetsov, M. Mohri, and S. Yang. Structured prediction theory based on factor graph complexity. In Advances in Neural Information Processing Systems, 2016. C. Cortes, V. Kuznetsov, M. Mohri, D. Storcheus, and S. Yang. Efficient gradient computation for structured output learning with rational and tropical losses. In Advances in Neural Information Processing Systems, 2018. H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 75 (3):297 325, 2009. J. Domke. Learning graphical model parameters with approximate marginal inference. IEEE transactions on pattern analysis and machine intelligence, 35(10):2454 2467, 2013. J. R. Doppa, A. Fern, and P. Tadepalli. Structured prediction via output space search. JMLR, 15(1): 1317 1350, 2014. P. Dragone, S. Teso, and A. Passerini. Neuro-symbolic constraint programming for structured prediction. ar Xiv preprint ar Xiv:2103.17232, 2021. S. Edunov, M. Ott, M. Auli, D. Grangier, and M. Ranzato. Classical structured prediction losses for sequence to sequence learning. ar Xiv preprint ar Xiv:1711.04956, 2017. J. Finocchiaro, R. Frongillo, and B. Waggoner. An embedding framework for consistent polyhedral surrogates. In Advances in neural information processing systems, 2019. A. Ghosh, H. Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI conference on artificial intelligence, 2017. K. Gimpel and N. A. Smith. Softmax-margin crfs: Training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 733 736, 2010. C. Graber and A. Schwing. Graph structured prediction energy networks. In Advances in Neural Information Processing Systems, 2019. C. Graber, O. Meshi, and A. Schwing. Deep structured prediction with nonlinear output transformations. In Advances in Neural Information Processing Systems, 2018. M. Gygli, M. Norouzi, and A. Angelova. 
Deep value networks learn to evaluate and iteratively refine structured outputs. In International Conference on Machine Learning, pages 1341 1351, 2017. J. R. Hershey, J. L. Roux, and F. Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. ar Xiv preprint ar Xiv:1409.2574, 2014. Z. Huang, W. Xu, and K. Yu. Bidirectional lstm-crf models for sequence tagging. ar Xiv preprint ar Xiv:1508.01991, 2015. M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recognition. ar Xiv preprint ar Xiv:1412.5903, 2014. H. Jang, S. Mo, and S. Ahn. Diffusion probabilistic models for graph-structured prediction. ar Xiv preprint ar Xiv:2302.10506, 2023. N. Jiang, M. Zhang, W.-J. van Hoeve, and Y. Xue. Constraint reasoning embedded structured prediction. Journal of Machine Learning Research, 23(345):1 40, 2022. D. Jurafsky and J. H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., 2009. Y. Kim, C. Denton, L. Hoang, and A. M. Rush. Structured attention networks. In International Conference on Learning Representations, 2017. K. Kunisch and T. Pock. A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences, 6(2):938 983, 2013. J. Lafferty, A. Mc Callum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, 2001a. J. D. Lafferty, A. Mc Callum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, pages 282 289, 2001b. M. Lam, J. R. Doppa, S. Todorovic, and T. G. Dietterich. Hc-search for structured prediction in computer vision. In CVPR, 2015. M. Larsson, A. Arnab, S. Zheng, P. Torr, and F. Kahl. Revisiting deep structured models for pixel-level labeling with gradient-based inference. SIAM Journal on Imaging Sciences, 11(4):2610 2628, 2018. Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004. Y. Li and R. Zemel. Mean-field networks. ar Xiv preprint ar Xiv:1410.5884, 2014. T. Liu, Y. Jiang, N. Monath, R. Cotterell, and M. Sachan. Autoregressive structured prediction with language models. ar Xiv preprint ar Xiv:2210.14698, 2022. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431 3440, 2015. Y. Lu and B. Huang. Structured output learning with conditional generative flows. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5005 5012, 2020. A. Lucchi, L. Yunpeng, and P. Fua. Learning for structured prediction using approximate subgradient descent with working sets. In Proceedings of CVPR, 2013. C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts, 1999. A. Mao, C. Mohri, M. Mohri, and Y. Zhong. Two-stage learning to defer with multiple experts. In Advances in neural information processing systems, 2023a. A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds: Characterization and extensions. In Advances in Neural Information Processing Systems, 2023b. A. Mao, M. Mohri, and Y. Zhong. Principled approaches for learning to defer with multiple experts. 
ar Xiv preprint ar Xiv:2310.14774, 2023c. A. Mao, M. Mohri, and Y. Zhong. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms. ar Xiv preprint ar Xiv:2310.14772, 2023d. A. Mao, M. Mohri, and Y. Zhong. H-consistency bounds for pairwise misranking loss surrogates. In International conference on Machine learning, 2023e. A. Mao, M. Mohri, and Y. Zhong. Ranking with abstention. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023f. A. Mao, M. Mohri, and Y. Zhong. Theoretically grounded loss functions and algorithms for scorebased multi-class abstention. ar Xiv preprint ar Xiv:2310.14770, 2023g. A. Mao, M. Mohri, and Y. Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, 2023h. C. Meister, T. Vieira, and R. Cotterell. Best-first beam search. Transactions of the Association for Computational Linguistics, 8:795 809, 2020. C. Mohri, D. Andor, E. Choi, M. Collins, A. Mao, and Y. Zhong. Learning to reject with a fixed predictor: Application to decontextualization. ar Xiv preprint ar Xiv:2301.09044, 2023. M. Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23 (2):269 311, 1997. M. Mohri. Generic ϵ-removal algorithm for weighted automata. In S. Yu and A. Paun, editors, Implementation and Application of Automata, 5th International Conference, CIAA 2000, London, Ontario, Canada, July 24-25, 2000, Revised Papers, volume 2088 of Lecture Notes in Computer Science, pages 230 242. Springer, 2000. M. Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321 350, 2002a. M. Mohri. Generic ϵ-removal and input ϵ-normalization algorithms for weighted transducers. Int. J. Found. Comput. Sci., 13(1):129 143, 2002b. M. Mohri. Weighted automata algorithms. In Handbook of Weighted Automata, pages 213 254. Springer, 2009. M. Mohri and M. Riley. Weighted determinization and minimization for large vocabulary speech recognition. In G. Kokkinakis, N. Fakotakis, and E. Dermatas, editors, Fifth European Conference on Speech Communication and Technology, EUROSPEECH 1997, Rhodes, Greece, September 22-25, 1997, pages 131 134. ISCA, 1997. M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, second edition, 2018. D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3 26, January 2007. M. Norouzi, S. Bengio, N. Jaitly, M. Schuster, Y. Wu, D. Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, 2016. A. Nowak, F. Bach, and A. Rudi. Sharp analysis of learning with discrete losses. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1920 1929, 2019. A. Nowak, F. Bach, and A. Rudi. Consistent structured prediction with max-min margin markov networks. In International Conference on Machine Learning, pages 7381 7391, 2020. A. Nowak, A. Rudi, and F. Bach. On the consistency of max-margin losses. In International Conference on Artificial Intelligence and Statistics, pages 4612 4633, 2022. A. Nowak-Vila, F. Bach, and A. Rudi. A general theory for structured prediction with smooth convex surrogates. ar Xiv preprint ar Xiv:1902.01958, 2019. A. Osokin, F. Bach, and S. Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. 
In Advances in Neural Information Processing Systems, 2017. P. Pan, P. Liu, Y. Yan, T. Yang, and Y. Yang. Adversarial localized energy network for structured prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5347 5354, 2020. D. Patel, P. Dangati, J.-Y. Lee, M. Boratko, and A. Mc Callum. Modeling label space interactions in multi-label classification using box embeddings. In International Conference on Learning Representations, 2022. V. K. Pillutla, V. Roulet, S. M. Kakade, and Z. Harchaoui. A smoother way to train structured prediction models. In Advances in Neural Information Processing Systems, 2018. S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of AISTATS, 2011. A. G. Schwing and R. Urtasun. Fully connected deep structured networks. ar Xiv preprint ar Xiv:1503.02351, 2015. I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225 287, 2007. V. Stoyanov, A. Ropson, and J. Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 725 733, 2011. I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 2014. K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. ar Xiv preprint ar Xiv:1503.00075, 2015. B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003a. B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Advances in neural information processing systems, 2003b. I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453 1484, Dec. 2005a. I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, and Y. Singer. Large margin methods for structured and interdependent output variables. Journal of machine learning research, 6(9), 2005b. L. Tu and K. Gimpel. Learning approximate inference networks for structured prediction. ar Xiv preprint ar Xiv:1803.03376, 2018. L. Tu and K. Gimpel. Benchmarking approximate inference methods for neural structured prediction. ar Xiv preprint ar Xiv:1904.01138, 2019. L. Tu, R. Y. Pang, and K. Gimpel. Improving joint training of inference networks and structured prediction energy networks. ar Xiv preprint ar Xiv:1911.02891, 2019. P. F. Verhulst. Notice sur la loi que la population suit dans son accroissement. Correspondance mathématique et physique, 10:113 -121, 1838. P. F. Verhulst. Recherches mathématiques sur la loi d accroissement de la population. Nouveaux Mémoires de l Académie Royale des Sciences et Belles-Lettres de Bruxelles, 18:1 -42, 1845. O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In Proceedings of NIPS, 2015a. O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of CVPR, 2015b. S. Wang, S. Fidler, and R. Urtasun. Proximal deep structured models. Advances in Neural Information Processing Systems, 2016. J. Weston and C. Watkins. Multi-class support vector machines. Technical report, Citeseer, 1998. Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. 
Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Łukasz Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google s neural machine translation system: Bridging the gap between human and machine translation. Co RR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144. D. Zhang, L. Sun, and W. Li. A structured prediction approach for statistical machine translation. In Proceedings of IJCNLP, 2008. T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004. Z. Zhang and M. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems, 2018. C. Zheng, G. Wu, F. Bao, Y. Cao, C. Li, and J. Zhu. Revisiting discriminative vs. generative classifiers: Theory and implications. ar Xiv preprint ar Xiv:2302.02334, 2023. K. Zheng and A. Pronobis. From pixels to buildings: End-to-end probabilistic deep networks for large-scale semantic mapping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3511 3518, 2019. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1529 1537, 2015. Contents of Appendix A Related work 18 B Proof of Lemma 3 19 C Proofs for structured max losses 19 D Proofs for Voted Conditional Random Field 20 E Proofs for structured comp-sum losses 21 E.1 Structured logistic loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 E.2 Structured sum-exponential loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 E.3 Structured generalized cross-entropy loss . . . . . . . . . . . . . . . . . . . . . . . . . 23 E.4 Structured mean absolute error loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 F Proofs for structured constrained losses 26 F.1 Structured constrained exponential loss . . . . . . . . . . . . . . . . . . . . . . . . . . 26 F.2 Structured constrained squared-hinge loss . . . . . . . . . . . . . . . . . . . . . . . . . 27 F.3 Structured constrained hinge loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 F.4 Structured constrained ρ-margin loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 G Efficient gradient computation and inference 31 G.1 Efficient gradient computation for Lcomp log . . . . . . . . . . . . . . . . . . . . . . . . . . 31 G.2 Efficient gradient computation for Lcomp exp . . . . . . . . . . . . . . . . . . . . . . . . . . 32 G.3 Efficient Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 A Related work Structured prediction and neural networks. 
A variety of deep learning techniques have been used for structured prediction tasks, including a unified neural network architecture for natural language processing [Collobert et al., 2011], energy-model-based structured prediction, including structured prediction energy networks (SPENs) and various inference methods [Belanger and McCallum, 2016, Tu and Gimpel, 2018, Larsson et al., 2018, Tu and Gimpel, 2019, Tu et al., 2019, Graber and Schwing, 2019, Pan et al., 2020], attention networks incorporating richer structural distributions [Kim et al., 2017], tree-structured long short-term memory (LSTM) networks [Tai et al., 2015], sequence-to-sequence learning [Sutskever et al., 2014, Edunov et al., 2017], a memory-reduced variant of best-first beam search [Meister et al., 2020], end-to-end learning for SPENs [Belanger et al., 2017], conditional generative flows [Lu and Huang, 2020], proximal methods [Wang et al., 2016], CRFs using deep features [Jaderberg et al., 2014, Huang et al., 2015, Chen et al., 2014, Schwing and Urtasun, 2015, Chen et al., 2015], non-iterative feed-forward predictors [Stoyanov et al., 2011, Domke, 2013, Kunisch and Pock, 2013, Hershey et al., 2014, Li and Zemel, 2014, Belharbi et al., 2018, Zheng et al., 2015], deep value networks [Gygli et al., 2017], fully convolutional networks [Long et al., 2015], constraint reasoning tools [Dragone et al., 2021, Jiang et al., 2022], multi-label box models [Patel et al., 2022], probabilistic deep networks [Zheng and Pronobis, 2019, Jang et al., 2023], autoregressive methods [Liu et al., 2022], nonlinear output transformations and embeddings [Graber et al., 2018, Brogat-Motte et al., 2020], smoothing methods [Pillutla et al., 2018] and structural training [Choi et al., 2016, Ahmad et al., 2023]. Let us also mention ensemble algorithms for structured prediction [Cortes, Kuznetsov, and Mohri, 2014a,b], which can be used to combine several algorithms for this problem.

Consistency in structured prediction. Here, we discuss in detail previous work on consistency in structured prediction [Osokin et al., 2017, Ciliberto et al., 2016, Blondel, 2019, Nowak et al., 2020, 2022, Ciliberto et al., 2019, 2020, Nowak-Vila et al., 2019, Nowak et al., 2019, Cabannes et al., 2021, Cabannnes et al., 2020, Corro, 2023]. Osokin et al. [2017], Nowak et al. [2020] and Nowak et al. [2022] pointed out that max-margin Markov networks (M3N) and, more generally, structural SVMs may not be Bayes-consistent. Instead, Osokin et al. [2017] proposed the first Bayes-consistent surrogate loss in the structured prediction setting, the quadratic surrogate (QS) loss. A general theory of QS losses was further developed in [Nowak-Vila et al., 2019, Nowak et al., 2019]. However, as pointed out in Section 1, the quadratic surrogate loss formulation casts the structured prediction problem as a regression problem and is not a typical formulation, even in the binary classification case. Nowak et al. [2020] proposed a consistent method, max-min margin Markov networks (M4N), derived from first principles for the binary SVM. However, this method is restricted to SVM-type loss functions. Instead, we propose broad families of surrogate losses, which can be naturally derived from common multi-class losses, including the logistic loss. Nowak et al. [2022] addressed the inconsistency of the Max-Margin loss in structured prediction by introducing the notion of Restricted-Max-Margin, where the maximization is performed over a subset of the original domain.
Their method is based on an implicit embedding [Finocchiaro et al., 2019]; a general framework for structured prediction has been further developed by Ciliberto et al. [2020]. However, these methods only apply to polyhedral-type surrogates, which are not as smooth as the logistic loss. Thus, the resulting surrogate losses may not be favorable from the optimization point of view. Instead, our novel families of surrogate losses are very general and can be smooth, including a new structured logistic loss, for which we describe efficient gradient computation algorithms.

Ciliberto et al. [2016] focused on a least squares surrogate loss function and a corresponding framework. In this framework, the structured prediction problem is cast as a regression problem. They derived a regularization approach to structured prediction from the least squares surrogate loss and proved the Bayes-consistency of that approach. Ciliberto et al. [2019] focused on a local structure-adapted framework for structured prediction. They proposed a novel structured prediction algorithm that adaptively leverages locality in the learning problem. Ciliberto et al. [2020] developed a general framework for structured prediction based on implicit embeddings. Their methods lead to polyhedral-type surrogate losses that benefit from Bayes-consistency. On the other hand, our work presents an extensive study of surrogate losses for structured prediction supported by H-consistency bounds. Unlike the surrogate losses studied in previous work, the formulations of our proposed surrogate losses, including structured comp-sum losses and structured constrained losses, are completely novel and do not cast the structured prediction problem as a regression problem. Furthermore, we prove stronger consistency guarantees that imply Bayes-consistency for these newly proposed families of surrogate losses.

Other related work on structured prediction includes: projection-based losses for structured prediction [Blondel, 2019]; fast convergence rates for general structured prediction problems [Cabannes et al., 2021]; a unified framework for dealing with partial labelling [Cabannnes et al., 2020]; and an analysis of the inconsistency of separable negative log-likelihood losses for structured prediction [Corro, 2023].

B Proof of Lemma 3

Lemma 3. The best-in-class conditional error and the conditional regret for a target loss $L$ in structured prediction can be expressed as follows:
$$\mathcal{C}^*_{L,\mathcal{H}}(x) = \min_{y' \in \mathcal{H}(x)} \sum_{y \in \mathcal{Y}} p(x,y)\,\ell(y',y), \qquad
\Delta\mathcal{C}_{L,\mathcal{H}}(h,x) = \sum_{y \in \mathcal{Y}} p(x,y)\,\ell(h(x),y) - \min_{y' \in \mathcal{H}(x)} \sum_{y \in \mathcal{Y}} p(x,y)\,\ell(y',y).$$

Proof. By definition, the conditional $L$-risk can be expressed as follows:
$$\mathcal{C}_{L}(h,x) = \sum_{y \in \mathcal{Y}} p(x,y)\,\ell(h(x),y). \qquad (15)$$
Since $\{h(x) : h \in \mathcal{H}\} = \mathcal{H}(x)$, the best-in-class conditional error can be expressed as follows:
$$\mathcal{C}^*_{L,\mathcal{H}}(x) = \min_{y' \in \mathcal{H}(x)} \sum_{y \in \mathcal{Y}} p(x,y)\,\ell(y',y),$$
which proves the first part of the lemma. By definition,
$$\Delta\mathcal{C}_{L,\mathcal{H}}(h,x) = \mathcal{C}_{L}(h,x) - \mathcal{C}^*_{L,\mathcal{H}}(x) = \sum_{y \in \mathcal{Y}} p(x,y)\,\ell(h(x),y) - \min_{y' \in \mathcal{H}(x)} \sum_{y \in \mathcal{Y}} p(x,y)\,\ell(y',y).$$

C Proofs for structured max losses

Theorem 4 (Negative results for $L_{\max}$). Assume that $n > 2$ and that $\Phi_u(v)$ is convex and non-increasing for $u = 1$. Then, the structured max loss $L_{\max}$ is not Bayes-consistent.

Proof. For the structured max loss $L_{\max}$, the conditional $L_{\max}$-risk can be expressed as follows:
$$\mathcal{C}_{L_{\max}}(h,x) = \sum_{y \in \mathcal{Y}} p(x,y)\,\max_{y' \neq y} \Phi_{\ell(y',y)}\big(h(x,y) - h(x,y')\big).$$
Take $\ell(y',y) = 1_{y' \neq y}$ to be the zero-one loss. Since $\ell(y',y) = 1$ for any $y' \neq y$, the conditional $L_{\max}$-risk can be reformulated as follows:
$$\mathcal{C}_{L_{\max}}(h,x) = \sum_{y \in \mathcal{Y}} p(x,y)\,\max_{y' \neq y} \Phi_{1}\big(h(x,y) - h(x,y')\big).$$
Consider a distribution supported on a singleton domain $\{x\}$. Take $y_1 \neq y_2 \in \mathcal{Y}$ such that $y_1 \neq n$ and $y_2 \neq n$. We define the conditional distribution by $p(x,y_1) = p(x,y_2) = \frac{1}{2}$ and $p(x,y) = 0$ for all other $y \in \mathcal{Y}$. Then, by using the fact that $\Phi_1(v)$ is convex and non-increasing, we obtain
$$
\begin{aligned}
\mathcal{R}_{L_{\max}}(h) = \mathcal{C}_{L_{\max}}(h,x)
&= \tfrac{1}{2}\max_{y' \neq y_1}\Phi_1\big(h(x,y_1)-h(x,y')\big) + \tfrac{1}{2}\max_{y' \neq y_2}\Phi_1\big(h(x,y_2)-h(x,y')\big)\\
&= \tfrac{1}{2}\Phi_1\Big(h(x,y_1)-\max_{y' \neq y_1}h(x,y')\Big) + \tfrac{1}{2}\Phi_1\Big(h(x,y_2)-\max_{y' \neq y_2}h(x,y')\Big) && (\Phi_1 \text{ is non-increasing})\\
&\geq \Phi_1\Big(\tfrac{1}{2}h(x,y_1) + \tfrac{1}{2}h(x,y_2) - \tfrac{1}{2}\max_{y' \neq y_1}h(x,y') - \tfrac{1}{2}\max_{y' \neq y_2}h(x,y')\Big) && (\Phi_1 \text{ is convex})\\
&\geq \Phi_1(0), && (\Phi_1 \text{ is non-increasing})
\end{aligned}
$$
where the last inequality uses the fact that the argument is non-positive, since $h(x,y_1) \leq \max_{y' \neq y_2}h(x,y')$ and $h(x,y_2) \leq \max_{y' \neq y_1}h(x,y')$. The equality can be achieved by $h^* \in \mathcal{H}$ defined by $h^*(x,1) = h^*(x,2) = \dots = h^*(x,n)$. Therefore, $h^*$ is a Bayes classifier of the structured max loss. Note that $h^*(x) = n$ (with ties broken in favor of the label with the highest index, as elsewhere in this appendix). However, by Lemma 3, in such a case, the Bayes classifier $h^*_\ell$ of the target loss satisfies
$$h^*_\ell(x) = \operatorname{argmin}_{y' \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} p(x,y)\,\ell(y',y) = \operatorname{argmin}_{y' \in \mathcal{Y}} \big(p(x,y_1)1_{y' \neq y_1} + p(x,y_2)1_{y' \neq y_2}\big) = y_1 \text{ or } y_2.$$
Thus, we obtain $h^* \neq h^*_\ell$. Therefore, $L_{\max}$ is not Bayes-consistent.

D Proofs for Voted Conditional Random Field

Theorem 5 (Negative result for $L_{\mathrm{VCRF}}$). The Voted Conditional Random Field loss $L_{\mathrm{VCRF}}$ is not Bayes-consistent.

Proof. For the Voted Conditional Random Field loss $L_{\mathrm{VCRF}}$, the conditional $L_{\mathrm{VCRF}}$-risk can be expressed as follows:
$$\mathcal{C}_{L_{\mathrm{VCRF}}}(h,x) = \sum_{y \in \mathcal{Y}} p(x,y)\,\log\Big(\sum_{y' \in \mathcal{Y}} e^{\ell(y,y') + h(x,y') - h(x,y)}\Big).$$
Consider a distribution supported on a singleton domain $\{x\}$. Note that $\mathcal{R}_{L_{\mathrm{VCRF}}} = \mathcal{C}_{L_{\mathrm{VCRF}}}$ is convex with respect to $h(x,y)$, $y = 1,\dots,n$. To find the global minimum, we differentiate $\mathcal{C}_{L_{\mathrm{VCRF}}}$ with respect to $h(x,y)$ for any $y \in \mathcal{Y}$ and set the derivatives to zero. Thus, we obtain for any $y \in \mathcal{Y}$,
$$-\,p(x,y)\,\frac{\sum_{y' \neq y} e^{\ell(y,y')+h(x,y')-h(x,y)}}{\sum_{y' \in \mathcal{Y}} e^{\ell(y,y')+h(x,y')-h(x,y)}}
\;+\; \sum_{\bar y \neq y} p(x,\bar y)\,\frac{e^{\ell(\bar y,y)+h(x,y)-h(x,\bar y)}}{\sum_{y' \in \mathcal{Y}} e^{\ell(\bar y,y')+h(x,y')-h(x,\bar y)}} \;=\; 0. \qquad (16)$$
Using the fact that $\sum_{y' \neq y} e^{\ell(y,y')+h(x,y')-h(x,y)} = \sum_{y' \in \mathcal{Y}} e^{\ell(y,y')+h(x,y')-h(x,y)} - e^{\ell(y,y)+h(x,y)-h(x,y)}$ to further simplify the left-hand side of (16), we obtain for any $y \in \mathcal{Y}$,
$$p(x,y) = \sum_{\bar y \in \mathcal{Y}} p(x,\bar y)\,\frac{e^{\ell(\bar y,y)+h(x,y)-h(x,\bar y)}}{\sum_{y' \in \mathcal{Y}} e^{\ell(\bar y,y')+h(x,y')-h(x,\bar y)}}
= \sum_{\bar y \in \mathcal{Y}} p(x,\bar y)\,\frac{e^{\ell(\bar y,y)+h(x,y)}}{\sum_{y' \in \mathcal{Y}} e^{\ell(\bar y,y')+h(x,y')}}. \qquad (17)$$
Consider a target loss function $\ell$ such that $e^{\ell(y,y')} = \Phi_y\Phi_{y'}$, that is $\ell(y,y') = \log(\Phi_y) + \log(\Phi_{y'})$, where $\Phi$ is a function mapping $\mathcal{Y}$ to $\mathbb{R}_+$. For this special choice of the target loss function, the expression of $\ell(y,y')$ decouples and (17) can be simplified to
$$p(x,y) = \sum_{\bar y \in \mathcal{Y}} p(x,\bar y)\,\frac{\Phi_{\bar y}\Phi_{y}\,e^{h(x,y)}}{\sum_{y' \in \mathcal{Y}} \Phi_{\bar y}\Phi_{y'}\,e^{h(x,y')}} = \frac{\Phi_y\,e^{h(x,y)}}{\sum_{y' \in \mathcal{Y}} \Phi_{y'}\,e^{h(x,y')}}. \qquad (18)$$
Therefore, for a Bayes classifier $h^*$ of the Voted Conditional Random Field loss, (18) gives
$$\frac{\Phi_1\,e^{h^*(x,1)}}{p(x,1)} = \dots = \frac{\Phi_n\,e^{h^*(x,n)}}{p(x,n)},$$
which implies that
$$h^*(x) = \operatorname{argmax}_{y' \in \mathcal{Y}} h^*(x,y') = \operatorname{argmax}_{y' \in \mathcal{Y}} e^{h^*(x,y')} = \operatorname{argmax}_{y' \in \mathcal{Y}} \frac{p(x,y')}{\Phi_{y'}}.$$
However, by Lemma 3, in such a case, the Bayes classifier $h^*_\ell$ of the target loss satisfies
$$h^*_\ell(x) = \operatorname{argmin}_{y' \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} p(x,y)\,\ell(y',y) = \operatorname{argmin}_{y' \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} p(x,y)\,\big(\log(\Phi_y)+\log(\Phi_{y'})\big) = \operatorname{argmin}_{y' \in \mathcal{Y}} \Phi_{y'}.$$
Thus, we obtain $h^* \neq h^*_\ell$ in general. Indeed, consider the case where $p(x,y) = \frac{\Phi_y^2}{\sum_{k=1}^n \Phi_k^2}$ for all $y \in \mathcal{Y}$. Then,
$$h^*(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \frac{\Phi_y}{\sum_{k=1}^n \Phi_k^2} = \operatorname{argmax}_{y \in \mathcal{Y}} \Phi_y \neq \operatorname{argmin}_{y \in \mathcal{Y}} \Phi_y = h^*_\ell(x)$$
when the $\{\Phi_y : y \in \mathcal{Y}\}$ are not all equal. Therefore, $L_{\mathrm{VCRF}}$ is not Bayes-consistent.

E Proofs for structured comp-sum losses

E.1 Structured logistic loss

Theorem 11 (H-consistency bound of $L^{\mathrm{comp}}_{\log}$). Assume that $\mathcal{H}$ is symmetric and complete. Then, for any target loss $\ell$, any hypothesis $h \in \mathcal{H}$ and any distribution,
$$\mathcal{R}_{L}(h) - \mathcal{R}^*_{L,\mathcal{H}} \leq 2\Big(\mathcal{R}_{L^{\mathrm{comp}}_{\log}}(h) - \mathcal{R}^*_{L^{\mathrm{comp}}_{\log},\mathcal{H}} + \mathcal{M}_{L^{\mathrm{comp}}_{\log},\mathcal{H}}\Big)^{\frac{1}{2}} - \mathcal{M}_{L,\mathcal{H}}. \qquad (19)$$
Proof.
For the comp-sum structured loss Lcomp log , the conditional Lcomp log -risk can be expressed as follows: CLcomp log (h,x) = y Y p(x,y) y Y ℓ(y ,y)log( eh(x,y ) y Y eh(x,y ) ) = y Y log( eh(x,y ) y Y eh(x,y ) ) y Y p(x,y)ℓ(y ,y) = y Y log(S(x,y ))q(x,y ), where we denote by q(x,y ) = y Y p(x,y)ℓ(y ,y) [0,1] and S(x,y) = eh(x,y) y Y eh(x,y ) [0,1] with the constraint that y Y S(x,y) = 1. Let ymin = argminy Y y Y p(x,y)ℓ(y ,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymin and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ [ S(x,ymin),S(x,h(x))]} H such that Sµ(x, ) = ehµ(x, ) y Y ehµ(x,y ) take the following values: S(x,y) if y / {ymin,h(x)} S(x,ymin) + µ if y = h(x) S(x,h(x)) µ if y = ymin. (20) Note that Sµ satisfies the constraint: Sµ(x,y) = y Y S(x,y) = 1, µ [ S(x,ymin),S(x,h(x))]. By (20) and using the fact that H(x) = Y when H is symmetric, we obtain CLcomp,H(h,x) = CLcomp(h,x) C Lcomp(H,x) CLcomp(h,x) inf µ RCLcomp(hµ,x) = sup µ [ S(x,ymin),S(x,h(x))] {q(x,ymin)[ log(S(x,ymin)) + log(S(x,h(x)) µ)] + q(x,h(x))[ log(S(x,h(x))) + log(S(x,ymin) + µ)]}. Differentiating with respect to µ yields the optimal value µ = q(x,h(x))S(x,h(x)) q(x,ymin)S(x,ymin) q(x,ymin)+q(x,h(x)) . Plugging in that value gives: CLcomp,H(h,x) q(x,ymin)log (S(x,h(x)) + S(x,ymin))q(x,ymin) S(x,ymin)(q(x,ymin) + q(x,h(x))) + q(x,h(x))log (S(x,h(x)) + S(x,ymin))q(x,h(x)) S(x,h(x))(q(x,ymin) + q(x,h(x))) . Differentiating with respect to S shows that the minimum is attained for S(x,h(x)) = S(x,ymin), which gives: CLcomp,H(h,x) q(x,ymin)log 2q(x,ymin) q(x,ymin) + q(x,h(x)) + q(x,h(x))log 2q(x,h(x)) q(x,ymin) + q(x,h(x)) (q(x,h(x)) q(x,ymin))2 2(q(x,h(x)) + q(x,ymin)) (alog 2a a+b + blog 2b 2(a+b), a,b [0,1] [Mohri et al., 2018, Proposition E.7]) (q(x,h(x)) q(x,ymin))2 4 (0 q(x,h(x)) + q(x,ymin) 2) = ( y Y p(x,y)ℓ(h(x),y) y Y p(x,y)ℓ(ymin,y)) 2 4 CL,H(h,x)2. (by Lemma 3 and H(x) = Y) Since the function t t2 4 is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, (RL(h) R L,H + ML,H) 2 4 = (EX[ CL,H(h,x)])2 E X[ CL,H(h,x)2 E X[ CLcomp,H(h,x)] = RLcomp(h) R Lcomp,H + MLcomp,H, which leads to RL(h) R L,H 2(RLcomp log (h) R Lcomp log ,H + MLcomp log ,H) E.2 Structured sum-exponential loss Theorem 12 (H-consistency bound of Lcomp exp ). Assume that H is symmetric and complete. Then, for any target loss ℓ, hypothesis h H and any distribution, RL(h) R L,H 2(RLcomp exp (h) R Lcomp exp ,H + MLcomp exp ,H) 1 2 ML,H. (21) Proof. For the comp-sum structured loss Lcomp exp , the conditional Lcomp exp -risk can be expressed as follows: CLcomp exp (h,x) = y Y p(x,y) y Y ℓ(y ,y) y y eh(x,y ) h(x,y ) = y Y ( 1 S(x,y ) 1)q(x,y ), where we denote by q(x,y ) = y Y p(x,y)ℓ(y ,y) [0,1] and S(x,y) = eh(x,y) y Y eh(x,y ) [0,1] with the constraint that y Y S(x,y) = 1. Let ymin = argminy Y y Y p(x,y)ℓ(y ,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymin and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ [ S(x,ymin),S(x,h(x))]} H such that Sµ(x, ) = ehµ(x, ) y Y ehµ(x,y ) take the following values: S(x,y) if y / {ymin,h(x)} S(x,ymin) + µ if y = h(x) S(x,h(x)) µ if y = ymin. (22) Note that Sµ satisfies the constraint: Sµ(x,y) = y Y S(x,y) = 1, µ [ S(x,ymin),S(x,h(x))]. 
By (22) and using the fact that H(x) = Y when H is symmetric, we obtain CLcomp,H(h,x) = CLcomp(h,x) C Lcomp(H,x) CLcomp(h,x) inf µ RCLcomp(hµ,x) = sup µ [ S(x,ymin),S(x,h(x))] {q(x,ymin)[ 1 S(x,ymin) 1 S(x,h(x)) µ] + q(x,h(x))[ 1 S(x,h(x)) 1 S(x,ymin) + µ]} = q(x,ymin) S(x,ymin) + q(x,h(x)) S(x,h(x)) ( q(x,ymin) + q(x,h(x))) 2 S(x,ymin) + S(x,h(x)) (differentiating with respect to µ to optimize, optimal µ = q(x,h(x))S(x,h(x)) q(x,ymin)S(x,ymin) q(x,h(x)) ) q(x,h(x))) 2 (differentiating with respect to S to minimize, minimum is attained when S(x,h(x)) = S(x,ymin) = 1 (q(x,h(x)) q(x,ymin))2 q(x,h(x)) + q(x,ymin)) 2 (q(x,h(x)) q(x,ymin))2 b 2, a,b [0,1],a + b 2) = ( y Y p(x,y)ℓ(h(x),y) y Y p(x,y)ℓ(ymin,y)) 2 4 CL,H(h,x)2. (by Lemma 3 and H(x) = Y) Since the function t t2 4 is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, (RL(h) R L,H + ML,H) 2 4 = (EX[ CL,H(h,x)])2 E X[ CL,H(h,x)2 E X[ CLcomp,H(h,x)] = RLcomp(h) R Lcomp,H + MLcomp,H, which leads to RL(h) R L,H 2(RLcomp exp (h) R Lcomp exp ,H + MLcomp exp ,H) E.3 Structured generalized cross-entropy loss Theorem 13 (H-consistency bound of Lcomp gce ). Assume that H is symmetric and complete. Then, for any target loss ℓ, hypothesis h H and any distribution, RL(h) R L,H 2n α 2 (RLcomp gce (h) R Lcomp gce ,H + MLcomp gce ,H) 1 2 ML,H. (23) Proof. For the comp-sum structured loss Lcomp gce , the conditional Lcomp gce -risk can be expressed as follows: CLcomp gce (h,x) = y Y p(x,y) y Y α 1 ( eh(x,y ) y Y eh(x,y ) ) 1 ( eh(x,y ) y Y eh(x,y ) ) α y Y p(x,y)ℓ(y ,y) α y Y (1 S(x,y )α)q(x,y ), where we denote by q(x,y ) = y Y p(x,y)ℓ(y ,y) [0,1] and S(x,y) = eh(x,y) y Y eh(x,y ) [0,1] with the constraint that y Y S(x,y) = 1. Let ymin = argminy Y y Y p(x,y)ℓ(y ,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymin and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ [ S(x,ymin),S(x,h(x))]} H such that Sµ(x, ) = ehµ(x, ) y Y ehµ(x,y ) take the following values: S(x,y) if y / {ymin,h(x)} S(x,ymin) + µ if y = h(x) S(x,h(x)) µ if y = ymin. (24) Note that Sµ satisfies the constraint: Sµ(x,y) = y Y S(x,y) = 1, µ [ S(x,ymin),S(x,h(x))]. By (24) and using the fact that H(x) = Y when H is symmetric, we obtain CLcomp,H(h,x) = CLcomp(h,x) C Lcomp(H,x) CLcomp(h,x) inf µ RCLcomp(hµ,x) α sup µ [ S(x,ymin),S(x,h(x))] {q(x,ymin)[ S(x,ymin)α + (S(x,h(x)) µ)α] + q(x,h(x))[ S(x,h(x))α + (S(x,ymin) + µ)α]} α(S(x,h(x)) + S(x,ymin))α(q(x,ymin) 1 1 α + q(x,h(x)) 1 1 α ) 1 α αq(x,ymin)S(x,ymin)α 1 αq(x,h(x))S(x,h(x))α (differentiating with respect to µ to optimize, optimum µ = q(x,h(x)) 1 1 α S(x,h(x)) q(x,ymin) 1 1 α S(x,ymin) q(x,ymin) 1 1 α +q(x,h(x)) 1 1 α ) 1 αnα [2α(q(x,ymin) 1 1 α + q(x,h(x)) 1 1 α ) 1 α q(x,ymin) q(x,h(x))] (differentiating with respect to S to minimize, minimum is attained when S(x,h(x)) = S(x,ymin) = 1 (q(x,h(x)) q(x,ymin))2 4nα (( a 1 1 α +b 1 1 α 2 ) 4 (a b)2, a,b [0,1], 0 a + b 1) = ( y Y p(x,y)ℓ(h(x),y) y Y p(x,y)ℓ(ymin,y)) 2 = CL,H(h,x)2 4nα . (by Lemma 3 and H(x) = Y) Since the function t t2 4nα is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, (RL(h) R L,H + ML,H) 2 4nα = (EX[ CL,H(h,x)])2 E X[ CL,H(h,x)2 E X[ CLcomp,H(h,x)] = RLcomp(h) R Lcomp,H + MLcomp,H, which leads to RL(h) R L,H 2n α 2 (RLcomp gce (h) R Lcomp gce ,H + MLcomp gce ,H) E.4 Structured mean absolute error loss Theorem 14 (H-consistency bound of Lcomp mae ). 
Assume that H is symmetric and complete. Then, for any target loss ℓ, hypothesis h H and any distribution, RL(h) R L,H n(RLcomp mae (h) R Lcomp mae ,H + MLcomp mae ,H) ML,H. (25) Proof. For the comp-sum structured loss Lcomp mae , the conditional Lcomp mae -risk can be expressed as follows: CLcomp mae (h,x) = y Y p(x,y) y Y ℓ(y ,y)(1 eh(x,y ) y Y eh(x,y ) ) = y Y (1 eh(x,y ) y Y eh(x,y ) ) y Y p(x,y)ℓ(y ,y) = y Y (1 S(x,y ))q(x,y ), where we denote by q(x,y ) = y Y p(x,y)ℓ(y ,y) [0,1] and S(x,y) = eh(x,y) y Y eh(x,y ) [0,1] with the constraint that y Y S(x,y) = 1. Let ymin = argminy Y y Y p(x,y)ℓ(y ,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymin and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ [ S(x,ymin),S(x,h(x))]} H such that Sµ(x, ) = ehµ(x, ) y Y ehµ(x,y ) take the following values: S(x,y) if y / {ymin,h(x)} S(x,ymin) + µ if y = h(x) S(x,h(x)) µ if y = ymin. (26) Note that Sµ satisfies the constraint: Sµ(x,y) = y Y S(x,y) = 1, µ [ S(x,ymin),S(x,h(x))]. By (26) and using the fact that H(x) = Y when H is symmetric, we obtain CLcomp,H(h,x) = CLcomp(h,x) C Lcomp(H,x) CLcomp(h,x) inf µ RCLcomp(hµ,x) = sup µ [ S(x,ymin),S(x,h(x))] {q(x,ymin)[ S(x,ymin) + S(x,h(x)) µ] + q(x,h(x))[ S(x,h(x)) + S(x,ymin) + µ]} = q(x,ymin)S(x,h(x)) q(x,h(x))S(x,h(x)) (differentiating with respect to µ to optimize, optimum µ = S(x,ymin)) n(q(x,ymin) q(x,h(x))) (differentiating with respect to S to minimize, minimum is attained when S(x,h(x)) = 1 = y Y p(x,y)ℓ(h(x),y) y Y p(x,y)ℓ(ymin,y) = CL,H(h,x) n (by Lemma 3 and H(x) = Y) Therefore, we obtain for any hypothesis h H and any distribution, RL(h) R L,H + ML,H n = EX[ CL,H(h,x)] n = E X[ CLcomp,H(h,x)] = RLcomp(h) R Lcomp,H + MLcomp,H, which leads to RL(h) R L,H n(RLcomp mae (h) R Lcomp mae ,H + MLcomp mae ,H) ML,H. F Proofs for structured constrained losses F.1 Structured constrained exponential loss Theorem 15 (H-consistency bound of Lcstnd exp ). Assume that H is symmetric and complete. Then, for any target loss ℓ, hypothesis h H and any distribution, RL(h) R L,H 2 ℓmax(RLcstnd exp (h) R Lcstnd exp ,H + MLcstnd exp ,H) 1 2 ML,H. (27) Proof. Denote by q(x,y ) = y Y p(x,y)ℓ(y ,y) [0,ℓmax]. For the constrained structured loss Lcstnd exp , the conditional Lcstnd exp -risk can be expressed as follows: CLcstnd exp (h,x) = y Y p(x,y) y Y ℓ(y,y )eh(x,y ) = y Y eh(x,y )q(x,y ). Let ymin = argminy Y y Y p(x,y)ℓ(y ,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymin and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ R} H such that hµ(x, ) take the following values: h(x,y) if y / {ymin,h(x)} h(x,ymin) + µ if y = h(x) h(x,h(x)) µ if y = ymin. (28) Note that the hypotheses hµ satisfies the constraint: hµ(x,y) = y Y h(x,y) = 0, µ R. Since y Y h(x,y) = 0, there must be non-negative scores. By definition of h(x) as a maximizer, we must thus have h(x,h(x)) 0. 
By (28) and using the fact that H(x) = Y when H is symmetric, we obtain CLcstnd,H(h,x) = CLcstnd(h,x) C Lcstnd(H,x) CLcstnd(h,x) inf µ RCLcstnd(hµ,x) = sup µ R {q(x,ymin)(eh(x,ymin) eh(x,h(x)) µ) + q(x,h(x))(eh(x,h(x)) eh(x,ymin)+µ)} q(x,h(x))eh(x,h(x)) q(x,ymin)eh(x,ymin)) (differentiating with respect to µ to optimize, optimum µ = 1 2 log q(x,ymin)eh(x,h(x)) q(x,h(x))eh(x,ymin) ) eh(x,h(x))( q(x,h(x))) 2 (eh(x,h(x)) eh(x,ymin) and q(x,h(x)) q(x,ymin)) q(x,h(x))) 2 (h(x,h(x)) 0) = q(x,h(x)) q(x,ymin) q(x,ymin) + 1 4ℓmax (q(x,h(x)) q(x,ymin))2 (0 q(x,y) ℓmax) = 1 4ℓmax CL,H(h,x)2. (by Lemma 3 and H(x) = Y) Since the function t t2 4ℓmax is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, (RL(h) R L,H + ML,H) 2 4ℓmax = (EX[ CL,H(h,x)])2 E X[ CL,H(h,x)2 E X[ CLcstnd,H(h,x)] = RLcstnd(h) R Lcstnd,H + MLcstnd,H, which leads to RL(h) R L,H 2 ℓmax(RLcstnd exp (h) R Lcstnd exp ,H + MLcstnd exp ,H) F.2 Structured constrained squared-hinge loss Theorem 16 (H-consistency bound of Lcstnd sq hinge). Assume that H is symmetric and complete. Then, for any target loss ℓ, hypothesis h H and any distribution, RL(h) R L,H (RLcstnd sq hinge(h) R Lcstnd sq hinge,H + MLcstnd sq hinge,H) 1 2 ML,H. (29) Proof. Denote by q(x,y ) = y Y p(x,y)ℓ(y ,y) [0,ℓmax]. For the constrained structured loss Lcstnd sq hinge, the conditional Lcstnd sq hinge-risk can be expressed as follows: CLcstnd sq hinge(h,x) = y Y p(x,y) y Y ℓ(y ,y)max{0,1 + h(x,y )} 2 = y Y max{0,1 + h(x,y )} 2q(x,y ). Let ymin = argminy Y y Y p(x,y)ℓ(y ,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymin and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ R} H such that hµ(x, ) take the following values: h(x,y) if y / {ymin,h(x)} h(x,ymin) + µ if y = h(x) h(x,h(x)) µ if y = ymin. (30) Note that the hypotheses hµ satisfies the constraint: hµ(x,y) = y Y h(x,y) = 0, µ R. Since y Y h(x,y) = 0, there must be non-negative scores. By definition of h(x) as a maximizer, we must thus have h(x,h(x)) 0. By (30) and using the fact that H(x) = Y when H is symmetric, we obtain CLcstnd,H(h,x) = CLcstnd(h,x) C Lcstnd(H,x) CLcstnd(h,x) inf µ RCLcstnd(hµ,x) = sup µ R {q(x,ymin)(max{0,1 + h(x,ymin)}2 max{0,1 + h(x,h(x)) µ}2) + q(x,h(x))(max{0,1 + h(x,h(x))}2 max{0,1 + h(x,ymin) + µ}2)} (1 + h(x,h(x)))2(q(x,ymin) q(x,h(x)))2 (differentiating with respect to µ to optimize) (q(x,h(x)) q(x,ymin))2 (h(x,h(x)) 0) = y Y p(x,y)ℓ(h(x),y) y Y p(x,y)ℓ(ymin,y) = CL,H(h,x)2. (by Lemma 3 and H(x) = Y) Since the function t t2 is convex, by Jensen s inequality, we obtain for any hypothesis h H and any distribution, (RL(h) R L,H + ML,H) 2 = (E X[ CL,H(h,x)]) E X[ CL,H(h,x)2] E X[ CLcstnd,H(h,x)] = RLcstnd(h) R Lcstnd,H + MLcstnd,H, which leads to RL(h) R L,H (RLcstnd sq hinge(h) R Lcstnd sq hinge,H + MLcstnd sq hinge,H) F.3 Structured constrained hinge loss Theorem 17 (H-consistency bound of Lcstnd hinge). Assume that H is symmetric and complete. Then, for any target loss ℓ, hypothesis h H and any distribution, RL(h) R L,H RLcstnd hinge(h) R Lcstnd hinge,H + MLcstnd hinge,H ML,H. (31) Proof. Denote by q(x,y ) = y Y p(x,y)ℓ(y ,y) [0,ℓmax]. For the constrained structured loss Lcstnd hinge, the conditional Lcstnd hinge-risk can be expressed as follows: CLcstnd hinge(h,x) = y Y p(x,y) y Y ℓ(y ,y)max{0,1 + h(x,y )} = y Y max{0,1 + h(x,y )}q(x,y ). 
Let ymin = argminy Y y Y p(x,y)ℓ(y ,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymin and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ R} H such that hµ(x, ) take the following values: h(x,y) if y / {ymin,h(x)} h(x,ymin) + µ if y = h(x) h(x,h(x)) µ if y = ymin. (32) Note that the hypotheses hµ satisfies the constraint: hµ(x,y) = y Y h(x,y) = 0, µ R. Since y Y h(x,y) = 0, there must be non-negative scores. By definition of h(x) as a maximizer, we must thus have h(x,h(x)) 0. By (32) and using the fact that H(x) = Y when H is symmetric, we obtain CLcstnd,H(h,x) = CLcstnd(h,x) C Lcstnd(H,x) CLcstnd(h,x) inf µ RCLcstnd(hµ,x) = sup µ R {q(x,ymin)(max{0,1 + h(x,ymin)} max{0,1 + h(x,h(x)) µ}) + q(x,h(x))(max{0,1 + h(x,h(x))} max{0,1 + h(x,ymin) + µ})} (1 + h(x,h(x)))(q(x,h(x)) q(x,ymin)) (differentiating with respect to µ to optimize) q(x,h(x)) q(x,ymin) (h(x,h(x)) 0) = y Y p(x,y)ℓ(h(x),y) y Y p(x,y)ℓ(ymin,y) = CL,H(h,x). (by Lemma 3 and H(x) = Y) Therefore, we obtain for any hypothesis h H and any distribution, RL(h) R L,H + ML,H = E X[ CL,H(h,x)] E X[ CLcstnd,H(h,x)] = RLcstnd(h) R Lcstnd,H + MLcstnd,H, which leads to RL(h) R L,H RLcstnd hinge(h) R Lcstnd hinge,H + MLcstnd hinge,H ML,H. F.4 Structured constrained ρ-margin loss Theorem 18 (H-consistency bound of Lcstnd ρ ). Assume that H is symmetric and complete. Then, for any target loss ℓ, hypothesis h H and any distribution, RL(h) R L,H RLcstnd ρ (h) R Lcstnd ρ ,H + MLcstnd ρ ,H ML,H. (33) Proof. Denote by q(x,y ) = y Y p(x,y)ℓ(y ,y) [0,ℓmax]. For the constrained structured loss Lcstnd ρ , the conditional Lcstnd ρ -risk can be expressed as follows: CLcstnd ρ (h,x) = y Y p(x,y) y Y ℓ(y ,y)max{0,1 + h(x,y )} = y Y max{0,1 + h(x,y )}q(x,y ). Let ymin = argminy Y y Y p(x,y)ℓ(y ,y), where we choose the label with the highest index under the natural ordering of labels as the tie-breaking strategy. For any h H such that h(x) ymin and x X, by the symmetry and completeness of H, we can always find a family of hypotheses {hµ µ R} H such that hµ(x, ) take the following values: h(x,y) if y / {ymin,h(x)} h(x,ymin) + µ if y = h(x) h(x,h(x)) µ if y = ymin. (34) Note that the hypotheses hµ satisfies the constraint: hµ(x,y) = y Y h(x,y) = 0, µ R. Since y Y h(x,y) = 0, there must be non-negative scores. By definition of h(x) as a maximizer, we must thus have h(x,h(x)) 0. By (34) and using the fact that H(x) = Y when H is symmetric, we obtain CLcstnd,H(h,x) = CLcstnd(h,x) C Lcstnd(H,x) CLcstnd(h,x) inf µ RCLcstnd(hµ,x) = sup µ R {q(x,ymin)(min{max{0,1 + h(x,ymin) ρ },1} min{max{0,1 + h(x,h(x)) µ + q(x,h(x))(min{max{0,1 + h(x,h(x)) ρ },1} min{max{0,1 + h(x,ymin) + µ q(x,h(x)) q(x,ymin) (differentiating with respect to µ to optimize) = y Y p(x,y)ℓ(h(x),y) y Y p(x,y)ℓ(ymin,y) = CL,H(h,x). (by Lemma 3 and H(x) = Y) Therefore, we obtain for any hypothesis h H and any distribution, RL(h) R L,H + ML,H = E X[ CL,H(h,x)] E X[ CLcstnd,H(h,x)] = RLcstnd(h) R Lcstnd,H + MLcstnd,H, which leads to RL(h) R L,H RLcstnd ρ (h) R Lcstnd ρ ,H + MLcstnd ρ ,H ML,H. G Efficient gradient computation and inference Here, we describe efficient algorithms for the computation of the gradients for the loss functions Lcomp log and Lcomp exp . We also briefly discuss an efficient algorithm for inference. 
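Throughout this section, the key structural assumption is that the feature vector decomposes over $p$-grams, $\Psi(x,y) = \sum_{s=1}^{l} \psi(x, y^s_{s-p+1}, s)$ (see the proof of Lemma 19 below). The following minimal sketch (Python; the local feature function `psi`, the scalar weight `w`, and the toy alphabet are hypothetical choices made purely for illustration, not the paper's setup) evaluates such a decomposed score by explicit enumeration over $\Sigma^l$. Its cost grows as $|\Sigma|^l$, which is precisely what the WFA-based algorithms of this section avoid.

```python
import itertools
import math

# Toy setting: alphabet Sigma, sequence length l, p-gram features (all hypothetical).
SIGMA = ["a", "b"]
L = 4          # sequence length l
P = 2          # p-gram order

def psi(x, z, s):
    """Hypothetical local feature value for the p-gram z ending at position s."""
    return float(len(set(z))) + 0.1 * s

def score(w, x, y):
    """Decomposed score w . Psi(x, y) = sum_s w * psi(x, y_{s-p+1..s}, s)."""
    total = 0.0
    for s in range(1, len(y) + 1):
        z = "".join(y[max(0, s - P):s])   # p-gram ending at position s (shorter near the start)
        total += w * psi(x, z, s)
    return total

def brute_force_log_partition(w, x):
    """log sum_{y in Sigma^l} exp(w . Psi(x, y)) by explicit enumeration:
    |Sigma|^l terms, which the WFA-based algorithms below avoid."""
    scores = [score(w, x, y) for y in itertools.product(SIGMA, repeat=L)]
    m = max(scores)
    return m + math.log(sum(math.exp(v - m) for v in scores))

if __name__ == "__main__":
    print("log-partition:", brute_force_log_partition(w=0.5, x="x0"))
    print("number of enumerated sequences:", len(SIGMA) ** L)
```

Even for this toy alphabet, the enumeration grows exponentially with $l$; the algorithms in G.1–G.3 replace it with dynamic programming over a lattice of $(p-1)$-gram states.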
G.1 Efficient gradient computation for $L^{\mathrm{comp}}_{\log}$

We first present an efficient algorithm for the computation of the quantities $\mathsf{L}(z,s)$ in the important case of rational losses, next in the case of Markovian losses.

Rational losses. Rational losses form a general family of loss functions based on rational kernels [Cortes et al., 2004] that includes, in particular, $n$-gram losses, which can be defined for a pair of sequences $(y,y')$ as the negative inner product of the vectors of $n$-gram counts of $y$ and $y'$. Our algorithm bears some similarity to that of Cortes et al. [2018] for the computation of the gradient of the VCRF loss function. It is, however, distinct, because the structured prediction loss function we are considering and our definition of a rational loss are both different. We will adopt a similar notation and terminology.

Recall that, for any sequence $y$, we denote by $y_i$ the symbol in its $i$th position and by $y^j_i = y_i y_{i+1}\cdots y_j$ the substring of $y$ starting at position $i$ and ending at position $j$. We denote by $E_A$ the set of transitions of a WFA $A$. Let $\mathsf{U}$ be a weighted finite-state transducer (WFST) over the $(+,\times)$ semiring over the reals, with $\Sigma$ as both the input and output alphabet. Then, we define the rational loss associated to $\mathsf{U}$ for all $y,y'$ by $\ell(y,y') = \mathsf{U}(y,y')$.

[Figure 3: Illustration of the WFA $\mathsf{Y}$ for $\Sigma = \{a,b\}$ and $l = 3$, and the WFA $\mathsf{Y}_i$, where $y_i = aba$.]

Let $\mathsf{Y}$ denote a WFA over the $(+,\times)$ semiring accepting the set of all sequences of length $l$ with weight one, and let $\mathsf{Y}_i$ denote the WFA accepting only $y_i$ with weight one. Then, by definition, the weighted transducer $\mathsf{Y}\circ\mathsf{U}\circ\mathsf{Y}_i$ obtained by composition maps each sequence $y$ in $\Sigma^l$ to $y_i$ with weight $\mathsf{U}(y,y_i)$. The WFA $\Pi_1(\mathsf{Y}\circ\mathsf{U}\circ\mathsf{Y}_i)$ derived from that transducer by projection on the input (that is, by removing output labels) associates to each sequence $y$ the weight $\mathsf{U}(y,y_i)$. We use weighted determinization [Mohri, 1997] to compute an equivalent deterministic WFA, denoted by $\mathsf{M}$. As shown by Cortes et al. [2015, Theorem 3], $\mathsf{M}$ can be computed in polynomial time. $\mathsf{M}$ admits a unique path labeled with any sequence $y \in \Sigma^l$ and the weight of that path is $\mathsf{U}(y,y_i)$. The weight of that accepting path is obtained by multiplying the weights of its transitions and that of the final state.

[Figure 4: Illustration of the WFA $\mathsf{N}$ for $\Sigma = \{a,b\}$, $p = 2$ and $l = 2$.]

We now define a deterministic $p$-gram WFA $\mathsf{N}$ that accepts all sequences $y \in \Sigma^l$, with each of its states $(z',s)$ encoding the $(p-1)$-gram $z'$ read to reach it and the position $s$ in the sequence $y$ at which it is reached. The transitions of $\mathsf{N}$ are therefore defined as follows, with weight one:
$$E_N = \big\{\big((y^{s-1}_{s-p+1},\,s-1),\,a,\,1,\,(y^{s-1}_{s-p+2}a,\,s)\big) : y \in \Sigma^l,\; a \in \Sigma,\; s \in [l]\big\}.$$
The initial state is $(\epsilon,0)$ and the final states are those whose second element (the position) is $l$. Note that, by construction, $\mathsf{N}$ is deterministic. Then, the composition (or intersection) WFA $\mathsf{N}\circ\mathsf{M}$ still associates the same weight as $\mathsf{M}$ to each input string $y \in \Sigma^l$. However, the states in that composition help us compute $\mathsf{L}(z,s)$. In particular, for any $z \in \Sigma^p$ and $s \in [l]$, let $\mathsf{E}(z,s)$ be the set of transitions of $\mathsf{N}\circ\mathsf{M}$ constructed by pairing the transition $\big((z^{p-1}_1,s-1),\,z_p,\,1,\,(z^p_2,s)\big)$ in $\mathsf{N}$ with a transition $(q_M,\,z_p,\,\omega,\,q'_M)$ in $\mathsf{M}$. They admit the following form:
$$\mathsf{E}(z,s) = \big\{\big((q_N,q_M),\,z_p,\,\omega,\,(q'_N,q'_M)\big) \in E_{\mathsf{N}\circ\mathsf{M}} : q_N = (z^{p-1}_1,\,s-1)\big\}. \qquad (35)$$
The WFA $\mathsf{N}\circ\mathsf{M}$ is deterministic as a composition of two deterministic WFAs.
Thus, there is a unique path labeled with a given sequence $y \in \Sigma^l$ in $\mathsf{N}\circ\mathsf{M}$, and $y$ admits the substring $z$ ending at position $s$ iff that path goes through a transition in $\mathsf{E}(z,s)$ when reaching position $s$. Therefore, to compute $\mathsf{L}(z,s)$, it suffices to compute the sum of the weights of all paths in $\mathsf{N}\circ\mathsf{M}$ going through a transition in $\mathsf{E}(z,s)$. This can be done straightforwardly using the forward-backward algorithm, or two single-source shortest-distance algorithms over the $(+,\times)$ semiring [Mohri, 2002a], one from the initial state, the other from the final states. Since $\mathsf{N}\circ\mathsf{M}$ is acyclic and admits $O(l\,|\Sigma|^p)$ transitions, we can compute all the quantities $\mathsf{L}(z,s)$, $s \in [l]$ and $z \in \Sigma^p$, in time $O(l\,|\Sigma|^p)$.

Markovian loss. We consider adopting a Markovian assumption, which is commonly adopted in natural language processing [Manning and Schütze, 1999]. We will assume that $\ell$ can be decomposed as follows for all $y,y' \in \Sigma^l$:
$$\ell(y,y') = \prod_{t=1}^{l} \ell_t\big(y^t_{t-p+1},\,y'\big).$$
Thus, we can write:
$$\mathsf{L}(z,s) = \sum_{y \colon y^s_{s-p+1}=z}\;\prod_{t=1}^{l} \ell_t\big(y^t_{t-p+1},\,y_i\big).$$
To efficiently compute $\mathsf{L}(z,s)$, we will use a WFA representation similar to the one used by Cortes et al. [2016, 2018] and, for convenience, will adopt a similar notation. $\mathsf{L}(z,s)$ coincides with a flow computation in a WFA $\mathsf{A}$ that we now define. $\mathsf{A}$ has the following set of states:
$$Q_A = \big\{(y^t_{t-p+1},\,t) : y \in \Sigma^l,\; t = 0,\dots,l\big\},$$
with $I_A = (\epsilon,0)$ its single initial state, $F_A = \{(y^l_{l-p+1},\,l) : y \in \Sigma^l\}$ its set of final states, and a transition from state $(y^{t-1}_{t-p+1},\,t-1)$ to state $(y^{t-1}_{t-p+2}b,\,t)$ with label $b$ and weight $\omega(y^{t-1}_{t-p+1}b,\,t) = \ell_t(y^{t-1}_{t-p+1}b,\,y_i)$; that is, the following set of transitions:
$$E_A = \big\{\big((y^{t-1}_{t-p+1},\,t-1),\,b,\,\omega(y^{t-1}_{t-p+1}b,\,t),\,(y^{t-1}_{t-p+2}b,\,t)\big) : y \in \Sigma^l,\; b \in \Sigma,\; t \in [l]\big\}.$$
By construction, $\mathsf{A}$ is deterministic. The weight of a path in $\mathsf{A}$ is obtained by multiplying the weights of its constituent transitions. In view of that, $\mathsf{L}(z,s)$ can be seen as the sum of the weights of all paths in $\mathsf{A}$ going through the transition from state $(z^{p-1}_1,\,s-1)$ to $(z^p_2,\,s)$ with label $z_p$. For any state $(y^t_{t-p+1},\,t) \in Q_A$, we denote by $\alpha\big((y^t_{t-p+1},\,t)\big)$ the sum of the weights of all paths in $\mathsf{A}$ from the initial state $I_A$ to $(y^t_{t-p+1},\,t)$, and by $\beta\big((y^t_{t-p+1},\,t)\big)$ the sum of the weights of all paths from $(y^t_{t-p+1},\,t)$ to a final state. Then, $\mathsf{L}(z,s)$ is given by
$$\mathsf{L}(z,s) = \alpha\big((z^{p-1}_1,\,s-1)\big)\cdot \omega(z,s)\cdot \beta\big((z^p_2,\,s)\big).$$
Since $\mathsf{A}$ is acyclic, $\alpha$ and $\beta$ can be computed for all states in linear time in the size of $\mathsf{A}$, using a single-source shortest-distance algorithm over the $(+,\times)$ semiring or the so-called forward-backward algorithm. Thus, since $\mathsf{A}$ admits $O(l\,|\Sigma|^p)$ transitions, we can also compute all the quantities $\mathsf{L}(z,s)$, $s \in [l]$ and $z \in \Sigma^p$, in time $O(l\,|\Sigma|^p)$.

G.2 Efficient gradient computation for $L^{\mathrm{comp}}_{\exp}$

In this section, we provide a brief overview of the gradient computation for $L^{\mathrm{comp}}_{\exp}$, which is similar to the approach used for $L^{\mathrm{comp}}_{\log}$. Note that the loss $L^{\mathrm{comp}}_{\exp}$ on a given point $(x_i,y_i)$ can be expressed as follows:
$$
L^{\mathrm{comp}}_{\exp}
= \sum_{y' \in \Sigma^l} \ell(y',y_i)\sum_{y'' \neq y'} e^{h(x_i,y'') - h(x_i,y')}
= \Big[\sum_{y'' \in \Sigma^l} e^{h(x_i,y'')}\Big]\Big[\sum_{y' \in \Sigma^l} \ell(y',y_i)\,e^{-h(x_i,y')}\Big] - \sum_{y' \in \Sigma^l}\ell(y',y_i)
= \Big[\sum_{y'' \in \Sigma^l} e^{w\cdot\Psi(x_i,y'')}\Big]\Big[\sum_{y' \in \Sigma^l} \ell(y',y_i)\,e^{-w\cdot\Psi(x_i,y')}\Big] - \sum_{y' \in \Sigma^l}\ell(y',y_i).
$$
The gradient of $L^{\mathrm{comp}}_{\exp}$ is therefore given by
$$
\nabla L^{\mathrm{comp}}_{\exp}(w) = \Big[\sum_{y'' \in \Sigma^l} e^{w\cdot\Psi(x_i,y'')}\,\Psi(x_i,y'')\Big]\Big[\sum_{y' \in \Sigma^l}\ell(y',y_i)\,e^{-w\cdot\Psi(x_i,y')}\Big]
- \Big[\sum_{y'' \in \Sigma^l} e^{w\cdot\Psi(x_i,y'')}\Big]\Big[\sum_{y' \in \Sigma^l}\ell(y',y_i)\,e^{-w\cdot\Psi(x_i,y')}\,\Psi(x_i,y')\Big].
$$
An efficient computation of these terms is not straightforward, since the summations run over an exponential number of sequences $y'$. However, we will leverage the Markovian property of the features to design an efficient computation. This approach is similar to what we demonstrated earlier for $L^{\mathrm{comp}}_{\log}$.
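The same lattice of $(p-1)$-gram states underlies the computations for both losses. To make the forward-backward flow computation described above concrete, here is a small self-contained sketch (Python; the per-position factors `ell_t`, the reference sequence, the toy alphabet and all other names are hypothetical choices for illustration, assuming a multiplicatively decomposed Markovian loss as above, and are not the paper's code). It computes the $\alpha$ and $\beta$ flows over the lattice and recovers $\mathsf{L}(z,s)$ as $\alpha\cdot\omega\cdot\beta$, checked against brute-force enumeration.

```python
import itertools

SIGMA = ["a", "b"]
L = 4            # sequence length l
P = 2            # p-gram order
Y_REF = "abab"   # hypothetical reference sequence y_i

def ell_t(z, t, y_ref):
    """Hypothetical per-position factor ell_t(z, y_ref) of a multiplicatively
    decomposed loss ell(y, y_ref) = prod_t ell_t(...); any positive function works here."""
    ref = y_ref[max(0, t - P):t]
    return 1.0 + sum(c != r for c, r in zip(z, ref))

def states(t):
    """States at position t: all (p-1)-grams (shorter near the start)."""
    return ["".join(q) for q in itertools.product(SIGMA, repeat=min(P - 1, t))]

def next_state(q_prev, a):
    return (q_prev + a)[-(P - 1):] if P > 1 else ""

def forward_backward(y_ref):
    """alpha[(q, t)]: sum of weights of all paths from the initial state to (q, t);
    beta[(q, t)]: sum of weights of all paths from (q, t) to a final state.
    Path weights are products of the transition weights ell_t."""
    alpha = {(q, t): 0.0 for t in range(L + 1) for q in states(t)}
    alpha[("", 0)] = 1.0
    for t in range(1, L + 1):
        for q_prev in states(t - 1):
            for a in SIGMA:
                z = (q_prev + a)[-P:]
                alpha[(next_state(q_prev, a), t)] += alpha[(q_prev, t - 1)] * ell_t(z, t, y_ref)
    beta = {(q, t): 0.0 for t in range(L + 1) for q in states(t)}
    for q in states(L):
        beta[(q, L)] = 1.0
    for t in range(L, 0, -1):
        for q_prev in states(t - 1):
            for a in SIGMA:
                z = (q_prev + a)[-P:]
                beta[(q_prev, t - 1)] += ell_t(z, t, y_ref) * beta[(next_state(q_prev, a), t)]
    return alpha, beta

def L_forward_backward(z, s, y_ref, alpha, beta):
    """L(z, s) = alpha((z_1^{p-1}, s-1)) * omega(z, s) * beta((z_2^p, s)), for s >= p."""
    return alpha[(z[:-1], s - 1)] * ell_t(z, s, y_ref) * beta[(z[1:], s)]

def L_brute_force(z, s, y_ref):
    """Direct enumeration of all y with the p-gram z ending at position s."""
    total = 0.0
    for y in itertools.product(SIGMA, repeat=L):
        if "".join(y[s - P:s]) == z:
            weight = 1.0
            for t in range(1, L + 1):
                weight *= ell_t("".join(y[max(0, t - P):t]), t, y_ref)
            total += weight
    return total

if __name__ == "__main__":
    alpha, beta = forward_backward(Y_REF)
    for z in ("aa", "ab", "ba", "bb"):
        for s in range(P, L + 1):
            assert abs(L_forward_backward(z, s, Y_REF, alpha, beta)
                       - L_brute_force(z, s, Y_REF)) < 1e-9
    print("forward-backward flows match brute-force enumeration")
```

Replacing the factor `ell_t` by a factor of the form `ell_t * exp(-w . psi)` yields, under the same assumptions, the flow quantities needed for the gradient computation discussed next.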
We start by identifying the most computationally challenging terms by rewriting the expression of the gradient of $L^{\mathrm{comp}}_{\exp}$ in the following lemma.

Lemma 19. For any $w \in \mathbb{R}^d$, the gradient of $L^{\mathrm{comp}}_{\exp}$ can be expressed as follows:
$$\nabla L^{\mathrm{comp}}_{\exp}(w) = \sum_{s=1}^{l}\sum_{z \in \Sigma^p}\big[\mathsf{N}_w\,\mathsf{Q}_w(z,s) - \mathsf{Z}_w\,\mathsf{C}_w(z,s)\big]\,\psi(x_i,z,s),$$
where $\mathsf{Q}_w(z,s) = \sum_{y \colon y^s_{s-p+1}=z} e^{w\cdot\Psi(x_i,y)}$, $\mathsf{C}_w(z,s) = \sum_{y \colon y^s_{s-p+1}=z} \ell(y,y_i)\,e^{-w\cdot\Psi(x_i,y)}$ and $\mathsf{N}_w = \sum_{y \in \Sigma^l}\ell(y,y_i)\,e^{-w\cdot\Psi(x_i,y)}$.

Proof. Using the decomposition of the feature vector, $\Psi(x_i,y) = \sum_{s=1}^{l}\psi(x_i,y^s_{s-p+1},s)$, we can write:
$$\sum_{y \in \Sigma^l} e^{w\cdot\Psi(x_i,y)}\,\Psi(x_i,y) = \sum_{y \in \Sigma^l} e^{w\cdot\Psi(x_i,y)}\sum_{s=1}^{l}\psi(x_i,y^s_{s-p+1},s) = \sum_{s=1}^{l}\sum_{z \in \Sigma^p}\Big[\sum_{y \colon y^s_{s-p+1}=z} e^{w\cdot\Psi(x_i,y)}\Big]\psi(x_i,z,s)$$
and
$$\sum_{y \in \Sigma^l} \ell(y,y_i)\,e^{-w\cdot\Psi(x_i,y)}\,\Psi(x_i,y) = \sum_{y \in \Sigma^l} \ell(y,y_i)\,e^{-w\cdot\Psi(x_i,y)}\sum_{s=1}^{l}\psi(x_i,y^s_{s-p+1},s) = \sum_{s=1}^{l}\sum_{z \in \Sigma^p}\Big[\sum_{y \colon y^s_{s-p+1}=z}\ell(y,y_i)\,e^{-w\cdot\Psi(x_i,y)}\Big]\psi(x_i,z,s).$$
This completes the proof.

It was shown by Cortes et al. [2016, 2018] that all of the quantities $\mathsf{Q}_w(z,s)$, for $z \in \Sigma^p$ and $s \in [l]$, and $\mathsf{Z}_w$ can be computed efficiently in time $O(l\,r^p)$, where $r = |\Sigma|$. Thus, the remaining bottleneck in the gradient computation suggested by Lemma 19 is the evaluation of the quantities $\mathsf{C}_w(z,s)$, for $z \in \Sigma^p$ and $s \in [l]$, and $\mathsf{N}_w$. As with the loss function $L^{\mathrm{comp}}_{\log}$ discussed in the previous section, we will analyze the computation of these quantities first in the case of rational losses, next in that of Markovian losses.

Rational losses. We adopt the same notation as in the case of the $L^{\mathrm{comp}}_{\log}$ loss, with the same definition of a rational loss: $\ell$ is a rational loss if there exists a WFST $\mathsf{U}$ over the $(+,\times)$ semiring over the reals, with $\Sigma$ as both the input and output alphabet, such that for all $y,y'$ we have $\ell(y,y') = \mathsf{U}(y,y')$. Our algorithm is also somewhat similar to the one described for the $L^{\mathrm{comp}}_{\log}$ loss or that of Cortes et al. [2018] for the computation of the gradient of the VCRF loss function. There are, however, several differences here too, because the quantities computed, and thus the automata operations required, are distinct.

Exactly as in the case of the $L^{\mathrm{comp}}_{\log}$ loss, we first define a deterministic WFA $\mathsf{M}$ over the $(+,\times)$ semiring that can be computed in polynomial time and that admits a unique path labeled with any sequence $y \in \Sigma^l$, whose weight is $\mathsf{U}(y,y_i)$. Next, as in [Cortes et al., 2018], we define a deterministic WFA $\mathsf{A}$ such that
$$\mathsf{A}(y) = e^{-w\cdot\Psi(x_i,y)} = \prod_{s=1}^{l} e^{-w\cdot\psi(x_i,y^s_{s-p+1},s)}.$$
The set of states $Q_A$ of $\mathsf{A}$ is defined as $Q_A = \{(y^s_{s-p+1},\,s) : y \in \Sigma^l,\; s = 0,\dots,l\}$, with $I_A = (\epsilon,0)$ its single initial state, $F_A = \{(y^l_{l-p+1},\,l) : y \in \Sigma^l\}$ its set of final states, and with a transition from state $(y^{s-1}_{s-p+1},\,s-1)$ to state $(y^{s-1}_{s-p+2}a,\,s)$ with label $a$ and weight $e^{-w\cdot\psi(x_i,\,y^{s-1}_{s-p+1}a,\,s)}$; that is, the following set of transitions:
$$E_A = \big\{\big((y^{s-1}_{s-p+1},\,s-1),\,a,\,e^{-w\cdot\psi(x_i,\,y^{s-1}_{s-p+1}a,\,s)},\,(y^{s-1}_{s-p+2}a,\,s)\big) : y \in \Sigma^l,\; a \in \Sigma,\; s \in [l]\big\}.$$
Then, by definition of composition or intersection [Mohri, 2009], the WFA $\mathsf{M}\circ\mathsf{A}$ is deterministic and admits a unique path labeled with any given $y \in \Sigma^l$, whose weight is
$$(\mathsf{M}\circ\mathsf{A})(y) = \mathsf{M}(y)\,\mathsf{A}(y) = \ell(y,y_i)\,e^{-w\cdot\Psi(x_i,y)}.$$

[Figure 5: Illustration of the WFA $\mathsf{A}$ for $\Sigma = \{a,b\}$, $p = 2$ and $l = 2$.]

Now, $\mathsf{N}_w$ coincides with the sum of the weights of all accepting paths in this WFA. Thus, since $\mathsf{M}\circ\mathsf{A}$ is acyclic, it can be computed in time linear in the size of $\mathsf{M}\circ\mathsf{A}$, that is, its number of transitions. For any $s \in [l]$ and $z \in \Sigma^p$, $\mathsf{C}_w(z,s)$ is the sum of the weights of all paths in $\mathsf{M}\circ\mathsf{A}$ labeled with a sequence $y$ admitting $z$ as a substring ending at position $s$. The states in the composition $\mathsf{M}\circ\mathsf{A}$ help us compute $\mathsf{C}_w(z,s)$.
As in the case of the $L^{\mathrm{comp}}_{\log}$ loss, for any $z \in \Sigma^p$ and $s \in [l]$, we define $\mathsf{E}(z,s)$ as the set of transitions of $\mathsf{M}\circ\mathsf{A}$ constructed by pairing a transition $(q_M,\,z_p,\,\omega_M,\,q'_M)$ in $\mathsf{M}$ with the transition $\big((z^{p-1}_1,\,s-1),\,z_p,\,\omega(z,s),\,(z^p_2,\,s)\big)$ in $\mathsf{A}$. They admit the following form:
$$\mathsf{E}(z,s) = \big\{\big((q_M,q_A),\,z_p,\,\omega_M\,\omega(z,s),\,(q'_M,q'_A)\big) \in E_{\mathsf{M}\circ\mathsf{A}} : q_A = (z^{p-1}_1,\,s-1)\big\}. \qquad (37)$$
The WFA $\mathsf{M}\circ\mathsf{A}$ is deterministic as a composition of two deterministic WFAs. Thus, there is a unique path labeled with a given sequence $y \in \Sigma^l$ in $\mathsf{M}\circ\mathsf{A}$, and $y$ admits the substring $z$ ending at position $s$ iff that path goes through a transition in $\mathsf{E}(z,s)$ when reaching position $s$. Therefore, to compute $\mathsf{C}_w(z,s)$, it suffices to compute the sum of the weights of all paths in $\mathsf{M}\circ\mathsf{A}$ going through a transition in $\mathsf{E}(z,s)$. This can be done straightforwardly using the forward-backward algorithm, or two single-source shortest-distance algorithms over the $(+,\times)$ semiring [Mohri, 2002a], one from the initial state, the other from the final states. Since $\mathsf{M}\circ\mathsf{A}$ is acyclic and admits $O(l\,|\Sigma|^p)$ transitions, we can compute all the quantities $\mathsf{C}_w(z,s)$, $s \in [l]$ and $z \in \Sigma^p$, in time $O(l\,|\Sigma|^p)$.

Markovian loss. Here, we adopt the Markovian assumption and assume that $\ell$ can be decomposed as follows for all $y,y' \in \Sigma^l$: $\ell(y,y') = \prod_{t=1}^{l}\ell_t(y^t_{t-p+1},\,y')$. Thus, the quantity $\mathsf{C}_w(z,s)$ can be written as:
$$
\mathsf{C}_w(z,s)
= \sum_{y \colon y^s_{s-p+1}=z} \Big[\prod_{t=1}^{l}\ell_t(y^t_{t-p+1},\,y_i)\Big]\,e^{-w\cdot\sum_{k=1}^{l}\psi(x_i,\,y^k_{k-p+1},\,k)}
= \sum_{y \colon y^s_{s-p+1}=z} \Big[\prod_{t=1}^{l}\ell_t(y^t_{t-p+1},\,y_i)\Big]\prod_{k=1}^{l} e^{-w\cdot\psi(x_i,\,y^k_{k-p+1},\,k)}
= \sum_{y \colon y^s_{s-p+1}=z} \prod_{t=1}^{l}\ell_t(y^t_{t-p+1},\,y_i)\,e^{-w\cdot\psi(x_i,\,y^t_{t-p+1},\,t)}.
$$
Then, we can proceed as in the Markovian-loss case for the loss function $L^{\mathrm{comp}}_{\log}$, except that, instead of the WFA $\mathsf{A}$ used there, we define here a similar WFA $\mathsf{A}'$. The only difference is that the weight $\omega(y^{t-1}_{t-p+1}b,\,t) = \ell_t(y^{t-1}_{t-p+1}b,\,y_i)$ used for the WFA $\mathsf{A}$ is replaced with $\omega'(y^{t-1}_{t-p+1}b,\,t) = \ell_t(y^{t-1}_{t-p+1}b,\,y_i)\,e^{-w\cdot\psi(x_i,\,y^{t-1}_{t-p+1}b,\,t)}$. With the same argument, we can compute all quantities $\mathsf{C}_w(z,s)$, $s \in [l]$ and $z \in \Sigma^p$, in time $O(l\,r^p)$. The quantity $\mathsf{N}_w$ can also be computed efficiently in time $O(l\,r^p)$, since it is the sum of the weights of all accepting paths in $\mathsf{A}'$.

G.3 Efficient Inference

We focused on the problem of efficient computation of the gradient. Inference is also a key problem in structured prediction, since the label with the highest score must be determined out of an exponentially large set of possible ones. However, for the linear hypotheses considered in the previous sections, this problem can be tackled efficiently, since it can be cast as a shortest-distance problem in a directed acyclic graph, as in [Cortes, Kuznetsov, Mohri, and Yang, 2016]. More generally, efficient gradient computation, efficient inference and other related algorithms can benefit from standard weighted automata and transducer optimization algorithms such as ε-removal [Mohri, 2000, 2002b] and determinization [Mohri, 1997, Mohri and Riley, 1997, Allauzen and Mohri, 2003, 2004] (see also the survey chapter [Mohri, 2009]).
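As an illustration of this reduction, the following minimal sketch (Python; the feature function `psi`, the scalar weight `w`, and the toy alphabet are hypothetical stand-ins chosen for illustration, not the paper's setup) computes the highest-scoring sequence by dynamic programming over the same acyclic lattice of $(p-1)$-gram states used above, and checks the result against brute-force enumeration.

```python
import itertools

SIGMA = ["a", "b"]
L = 4   # sequence length l
P = 2   # p-gram order

def psi(x, z, s):
    """Hypothetical local feature value for the p-gram z ending at position s."""
    return float(len(set(z))) + 0.1 * s

def viterbi_inference(w, x):
    """argmax_{y in Sigma^L} sum_s w * psi(x, y_{s-p+1..s}, s), computed as a
    longest-path problem in the acyclic lattice of (p-1)-gram states."""
    frontier = {"": (0.0, "")}   # state -> (best prefix score, best prefix string)
    for t in range(1, L + 1):
        new_frontier = {}
        for q_prev, (sc, prefix) in frontier.items():
            for a in SIGMA:
                z = (q_prev + a)[-P:]
                q_next = (q_prev + a)[-(P - 1):] if P > 1 else ""
                cand = sc + w * psi(x, z, t)
                if q_next not in new_frontier or cand > new_frontier[q_next][0]:
                    new_frontier[q_next] = (cand, prefix + a)
        frontier = new_frontier
    return max(frontier.values(), key=lambda v: v[0])   # (best score, best sequence)

def brute_force_inference(w, x):
    def score(y):
        return sum(w * psi(x, "".join(y[max(0, s - P):s]), s) for s in range(1, L + 1))
    best_y = max(itertools.product(SIGMA, repeat=L), key=score)
    return score(best_y), "".join(best_y)

if __name__ == "__main__":
    vit = viterbi_inference(0.7, "x0")
    brute = brute_force_inference(0.7, "x0")
    assert abs(vit[0] - brute[0]) < 1e-9
    print("dynamic-programming inference matches brute force:", vit)
```

The dynamic program visits $O(l\,|\Sigma|^p)$ transitions, in contrast with the $|\Sigma|^l$ sequences examined by the brute-force search; the same lattice structure is what makes the shortest-distance formulation of inference efficient.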