# H-Consistency Guarantees for Regression

Anqi Mao¹, Mehryar Mohri²,¹, Yutao Zhong¹

¹Courant Institute of Mathematical Sciences, New York, NY; ²Google Research, New York, NY. Correspondence to: Anqi Mao, Mehryar Mohri, Yutao Zhong.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

We present a detailed study of H-consistency bounds for regression. We first present new theorems that generalize the tools previously given to establish H-consistency bounds. This generalization proves essential for analyzing H-consistency bounds specific to regression. Next, we prove a series of novel H-consistency bounds for surrogate loss functions of the squared loss, under the assumption of a symmetric distribution and a bounded hypothesis set. This includes positive results for the Huber loss, all ℓp losses with p ≥ 1, and the squared ϵ-insensitive loss, as well as a negative result for the ϵ-insensitive loss used in Support Vector Regression (SVR). We further leverage our analysis of H-consistency for regression to derive principled surrogate losses for adversarial regression (Section 5). This readily establishes novel algorithms for adversarial regression, for which we report favorable experimental results in Section 6.

1. Introduction

Learning algorithms often optimize loss functions that differ from the originally specified task. In classification, this divergence typically arises from the computational intractability of optimizing the original loss, or because it lacks desirable properties such as differentiability or smoothness. In regression, the shift may occur because the surrogate loss exhibits more favorable characteristics, such as handling outliers or ensuring sparser solutions. For instance, the Huber loss and the ℓ1 loss are used to mitigate the impact of outliers, since the squared loss is known to be sensitive to their presence, while ϵ-insensitive losses promote sparsity. But what guarantees do we have when training with a loss function distinct from the target squared loss? Addressing this question can have significant implications for the design of regression algorithms. It can also strongly benefit the design of useful surrogate losses for other related problems, such as adversarial regression, as we shall see.

The statistical properties of surrogate losses have been extensively studied in the past. In particular, the Bayes-consistency of various convex loss functions, including margin-based surrogate losses in binary classification (Zhang, 2004a; Bartlett, Jordan, and McAuliffe, 2006) and other loss function families for multi-class classification (Zhang, 2004b; Tewari & Bartlett, 2007; Steinwart, 2007), has been examined in detail. However, prior work by Long & Servedio (2013) has highlighted the limitations of Bayes-consistency, since it does not account for the hypothesis set adopted. They established that, for some hypothesis sets and distributions, algorithms minimizing Bayes-consistent losses may retain a constant expected error, while others minimizing inconsistent losses have an expected error approaching zero. This indicates the significant role of the chosen hypothesis set in consistency.
Recent seminal work by Awasthi, Mao, Mohri, and Zhong (2022a;b) and Mao, Mohri, and Zhong (2023f;c;e;b) has analyzed H-consistency bounds for broad families of surrogate losses in binary classication, multi-class classification, structured prediction, and abstention (Mao et al., 2023a). These bounds are more informative than Bayes-consistency since they are hypothesis set-specific and do not require the entire family of measurable functions. Moreover, they offer finite sample, non-asymptotic guarantees. In light of these recent guarantees, the following questions naturally arise: Can we derive a non-asymptotic analysis of regression taking into account the hypothesis set? How can we benefit from that analysis? While there is some previous work exploring Bayesconsistency in regression (Caponnetto, 2005; Christmann & Steinwart, 2007; Steinwart, 2007), we are not aware of any prior H-consistency bounds or similar finite sample guarantees for surrogate losses in regression, such as, for example, the Huber loss or the squared ϵ-insensitive loss. H-Consistency Guarantees for Regression This paper presents the first in-depth study of H-consistency bounds in the context of regression. We first present new theorems that generalize the tools previously given by Awasthi et al. (2022a;b) and Mao et al. (2023f;c;e;b) to establish Hconsistency bounds (Section 3). This generalization proves essential in regression for analyzing H-consistency bounds for surrogate losses such as Huber loss and the squared ϵ-insensitive loss. It also provides finer bounds for the ℓ1 loss. Next, we prove a series of H-consistency bounds for surrogate loss functions of the squared loss, under the assumption of a symmetric distribution and a bounded hypothesis set (Section 4). We prove the first H-consistency bound for the Huber loss, which is a commonly used surrogate loss used to handle outliers, contingent upon a specific condition concerning the Huber loss parameter δ and the distribution mass around the mean. We further prove that this condition is necessary when H is realizable. We then extend our analysis to cover H-consistency bounds for ℓp losses, for all values of p 1. In particular, remarkably, we give guarantees for the ℓ1 loss and ℓp losses with p (1,2). We further analyze the ϵ-insensitive and the squared ϵ-insensitive losses integral to the definition of the SVR (Support Vector Regression) and quadratic SVR algorithms (Vapnik, 2000). These loss functions and SVR algorithms admit the benefit of yielding sparser solutions. We give the first H-consistency bound for the quadratic ϵ-insensitive loss. We also prove a negative result for the ϵ-insensitive loss: this loss function used in the definition of SVR does not admit H-consistency bounds with respect to the squared loss, even under some additional assumptions on the parameter ϵ and the distribution. Subsequently, leveraging our analysis of H-consistency for regression, we derive principled surrogate losses for adversarial regression (Section 5). This readily establishes a novel algorithm for adversarial regression, for which we report favorable experimental results in Section 6. Previous work. Bayes-consistency has been extensively studied in various learning problems. 
These include binary classification (Zhang, 2004a; Bartlett et al., 2006), multi-class classification (Zhang, 2004b; Tewari & Bartlett, 2007; Narasimhan et al., 2015; Finocchiaro et al., 2019; Wang & Scott, 2020; Frongillo & Waggoner, 2021; Wang & Scott, 2023), ranking (Menon & Williamson, 2014; Gao & Zhou, 2015; Uematsu & Lee, 2017), multi-label classification (Gao & Zhou, 2011; Koyejo et al., 2015; Zhang et al., 2020), structured prediction (Ciliberto et al., 2016; Osokin et al., 2017; Blondel, 2019), and ordinal regression (Pedregosa et al., 2017). The concept of H-consistency has been studied under the realizable assumption in (Long & Servedio, 2013; Zhang & Agarwal, 2020). The notion of H-consistency bounds in classification is due to Awasthi et al. (2022a;b). H-consistency bounds have been further analyzed in scenarios such as multi-class classification (Mao et al., 2023f; Zheng et al., 2023; Mao et al., 2023b; 2024f), ranking (Mao et al., 2023c;d), structured prediction (Mao et al., 2023e), top-k classification (Mao et al., 2024e), and abstention and deferral (Mao et al., 2023a; 2024b;c;a;d; Mohri et al., 2024). However, in the context of regression, there is limited work on the consistency properties of surrogate losses. The main related work we are aware of is (Caponnetto, 2005; Christmann & Steinwart, 2007; Steinwart, 2007). In particular, Steinwart (2007) studied Bayes-consistency for a family of regression surrogate losses including ℓp, but without presenting any non-asymptotic bound. Nevertheless, we partly benefit from this previous work; in particular, we adopt the same symmetric and bounded distribution assumption.

2. Preliminaries

Bounded regression. We first introduce the learning scenario of bounded regression. We denote by X the input space, by Y a measurable subset of R, and by D a distribution over X × Y. As for other supervised learning problems, the learner receives a labeled sample S = ((x1, y1), ..., (xm, ym)) drawn i.i.d. according to D. The measure of error is based on the magnitude of the difference between the predicted real-valued label and the true label. The function used to measure the error is denoted by L: Y × Y → R+, and we let L(h, x, y) = L(h(x), y) be the associated loss function. Some common examples of loss functions used in regression are the squared loss ℓ2, defined by L(y', y) = (y' − y)^2 for all y', y ∈ Y, or, more generally, the ℓp loss, defined by L(y', y) = |y' − y|^p for p ≥ 1. The squared loss is known to be quite sensitive to outliers. An alternative, more robust surrogate loss is the Huber loss ℓδ (Huber, 1964), which is defined for a parameter δ > 0 as the following combination of the ℓ2 and ℓ1 loss functions: L(y', y) = (1/2)(y' − y)^2 if |y' − y| ≤ δ, and L(y', y) = δ|y' − y| − (1/2)δ^2 otherwise. The ϵ-insensitive loss ℓϵ and the squared ϵ-insensitive loss ℓϵ^sq (Vapnik, 2000) are defined by L(y', y) = max{|y' − y| − ϵ, 0} and L(y', y) = max{|y' − y|^2 − ϵ^2, 0}, respectively, for some ϵ > 0.

Bayes-Consistency. Given a loss function L, we denote by E_L(h) the generalization error of a hypothesis h ∈ H and by E*_L(H) the best-in-class error for a hypothesis set H: E_L(h) = E_{(x,y)∼D}[L(h, x, y)] and E*_L(H) = inf_{h∈H} E_L(h). A desirable property of surrogate losses in regression is Bayes-consistency (Zhang, 2004a; Bartlett et al., 2006; Steinwart, 2007), that is, minimizing the surrogate loss L over the family of all measurable functions H_all leads to the minimization of the squared loss ℓ2 over H_all.
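Before formalizing consistency, note that the loss functions above can be written compactly in code. The following is a minimal NumPy sketch of ours; the function names and the default parameter values are illustrative, not part of any library:

```python
import numpy as np

def squared_loss(y_pred, y):                 # ℓ2(y', y) = (y' - y)^2
    return (y_pred - y) ** 2

def lp_loss(y_pred, y, p=1.5):               # ℓp(y', y) = |y' - y|^p, p >= 1
    return np.abs(y_pred - y) ** p

def huber_loss(y_pred, y, delta=1.0):        # ℓδ: quadratic for |y' - y| <= δ, linear beyond
    r = np.abs(y_pred - y)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

def eps_insensitive(y_pred, y, eps=0.1):     # ℓϵ(y', y) = max(|y' - y| - ϵ, 0)
    return np.maximum(np.abs(y_pred - y) - eps, 0.0)

def sq_eps_insensitive(y_pred, y, eps=0.1):  # ℓϵ^sq(y', y) = max(|y' - y|^2 - ϵ^2, 0)
    return np.maximum((y_pred - y) ** 2 - eps ** 2, 0.0)
```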
We say that L is Bayes-consistent with respect to ℓ2, if, for all distribu- H-Consistency Guarantees for Regression tions and sequences of {hn}n N Hall, limn + EL(hn) E L(Hall) = 0 implies limn + Eℓ2(hn) E ℓ2(Hall) = 0. Bayes-consistency stands as an essential prerequisite for a surrogate loss. Nonetheless, it has some shortcomings: it is only an asymptotic property and it fails to account for the hypothesis set H (Awasthi et al., 2022a;b). H-Consistency bounds. In contrast with Bayesconsistency, H-Consistency bounds take into account the specific hypothesis set H and are non-asymptotic. Given a hypothesis set H, we say that a regression loss function L admits an H-consistency bound with respect to ℓ2 (Awasthi et al., 2022a;b), if for some non-decreasing function f R+ R+, for all distributions and all h H, the following inequality holds: Eℓ2(h) E ℓ2(H) f(EL(h) E L(H)). Thus, when the L-estimation error can be reduced to some η > 0, the squared loss estimation error is upper bounded by f(η). An H-Consistency bound is a stronger and more informative property than Bayes-consistency, which is implied by taking the limit. In the next section, we will prove H-consistency bounds for several common surrogate regression losses with respect to the squared loss ℓ2. A by-product of these guarantees is the Bayes-consistency of these losses. For a regression loss function L and a hypothesis h, the generalization error can be expressed as follows: EL(h) = E x[E y[L(h(x),y) x]] = E x[CL(h,x)], where CL(h,x) is the conditional error Ey[L(h(x),y) x]. We also write C L(H,x) to denote the best-in-class conditional error defined by C L(H,x) = infh H CL(h,x). The conditional regret or calibration gap, CL,H(h,x), measures the difference between the conditional error of h and the best-in-class conditional error: CL,H(h,x) = CL(h,x) C L(H,x). A generalization of conditional regret is the conditional ϵ-regret, defined as: [ CL,H(h,x)]ϵ = CL,H(h,x)1 CL,H(h,x)>ϵ. A key term appearing in our bounds is the minimizability gap, defined for a loss function L and a hypothesis set H as ML(H) = E L(H) Ex[C L(H,x)]. It quantifies the discrepancy between the best-in-class generalization error and the expected best-in-class conditional error. An alternative expression for the minimizability gap is: ML(H) = infh H Ex[CL(H,x)] Ex[infh H CL(H,x)]. Due to the super-additivity of the infimum, the minimizability gap is always non-negative. As shown by Steinwart (2007, Lemma 2.5, Theorem 3.2), for the family of all measurable functions, the equality E L(Hall) = Ex[C L(Hall,x)] holds. Thus, the minimizability gap can be bounded above by the approximation error E L(H) E L(Hall). The minimizability gap becomes zero when when H = Hall or, more broadly, when E L(H) = E L(Hall). 3. General H-Consistency Theorems To derive H-consistency bounds for regression, we first give two key theorems establishing that if a convex or concave function provides an inequality between the conditional regrets of regression loss functions L1 and L2, then this inequality translates into an H-consistency bound involving the minimizability gaps of L1 and L2. Theorem 3.1 (General H-consistency bound convex function). Let D denote a distribution over X Y. Assume that there exists a convex function Ψ R+ R with Ψ(0) 0, a positive function α H X R + with supx X α(h,x) < + for all h H, and ϵ 0 such that the following holds for all h H, x X: Ψ([ CL2,H(h,x)]ϵ) α(h,x) CL1,H(h,x). 
Then, for any hypothesis h H, the following inequality holds: Ψ(EL2(h) E L2(H) + ML2(H)) [sup x X α(h,x)](EL1(h) E L1(H) + ML1(H)) + max{Ψ(0),Ψ(ϵ)}. Theorem 3.2 (General H-consistency bound concave function). Let D denote a distribution over X Y. Assume that there exists a concave function Γ R+ R, a positive function α H X R + with supx X α(h,x) < + for all h H, and ϵ 0 such that the following holds for all h H, x X: [ CL2,H(h,x)]ϵ Γ(α(h,x) CL1,H(h,x)). Then, for any hypothesis h H, the following inequality holds EL2(h) E L2(H) + ML2(H) Γ([sup x X α(h,x)](EL1(h) E L1(H) + ML1(H))) + ϵ. In the special case where Γ(x) = x 1 q for some q 1 with conjugate p 1, that is 1 q = 1, for any h H, the following inequality holds, assuming EX[α 1 p < + for all h H: EL2(h) E L2(H) + ML2(H) 1 p E X[EL1(h) E L1(H) + ML1(H)] Theorems 3.1 and 3.2 provide significantly more general tools for establishing H-consistency bounds than previous results from (Awasthi et al., 2022a, Theorems 1 and 2) and (Awasthi et al., 2022b, Theorems 1 and 2) for binary and multi-class classification. They offer a more general framework for establishing consistency bounds by allowing H-Consistency Guarantees for Regression for non-constant functions α. This generalization is crucial for analyzing consistency bounds in regression, where α may not be constant for certain surrogate losses (e.g., Huber loss, squared ϵ-insensitive loss). Our generalized theorems also enable finer consistency bounds, as demonstrated later in the case of the ℓ1 loss. The proofs of Theorems 3.1 and 3.2 are included in Appendix A. To leverage these general theorems, we will characterize the best-in-class conditional error and the conditional regret of the squared loss. We first introduce some definitions we will need. We say that the conditional distribution is bounded by B > 0 if, for all x X,P( Y B X = x) = 1. We say that a hypothesis set H is bounded by B > 0 if, h(x) B for all h H and x X, and all values in [ B,+B] are attainable by h(x), h H. The conditional mean of the distribution at x is denoted as: µ(x) = E[y x]. Theorem 3.3. Assume that the conditional distribution and the hypothesis set H are bounded by B > 0. Then, the bestin-class conditional error and the conditional regret of the squared loss can be characterized as: for all h H,x X, C ℓ2(H,x) = Cℓ2(µ(x),x) = E[y2 x] (µ(x))2 Cℓ2,H(h,x) = (h(x) µ(x))2. Refer to Appendix A for the proof. As in (Steinwart, 2007), for our analysis, we will focus specifically on symmetric distributions, where the conditional mean and the conditional median coincide. This is because, otherwise, as shown by Steinwart (2007, Proposition 4.14) the squared loss is essentially the only distance-based and locally Lipschitz continuous loss function that is Bayes-consistent with respect to itself for all bounded conditional distributions. A distribution D over X Y is said to be symmetric if and only if for all x X, there exits y0 R such that Dy x(y0 A) = Dy x(y0 + A) for all measurable A [0,+ ). The next result characterizes the best-in-class predictor for any symmetric regression loss functions for such distributions. Theorem 3.4. Let ψ R R be a symmetric function such that ψ(x) = ψ( x) for all x R. Furthermore, ψ(x) 0 for all x in its domain and it holds that ψ(0) = 0. Assume that the conditional distribution and the hypothesis set H is bounded by B > 0. Assume that the distribution is symmetric and the regression loss function is given by L(y ,y) = ψ(y y). Then, we have C L(H,x) = CL(µ(x),x). 
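For intuition, the characterization in Theorem 3.3 is a consequence of the standard decomposition of the conditional squared-loss error; a worked restatement of the identity established in Appendix A.3:

```latex
\mathcal{C}_{\ell_2}(h,x)
  = \mathbb{E}\bigl[(h(x)-y)^2 \mid x\bigr]
  = \bigl(h(x)-\mu(x)\bigr)^2 + \mathbb{E}\bigl[y^2 \mid x\bigr] - \mu(x)^2 .
```

Since the conditional distribution is bounded by B, we have μ(x) ∈ [−B, +B], and since all values in [−B, +B] are attainable by hypotheses in H, the first term can be driven to zero within H. This yields the best-in-class conditional error E[y^2 | x] − (μ(x))^2 and the conditional regret (h(x) − μ(x))^2.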
The proof is included in Appendix A. It is straightforward to see that all the previously mentioned regression loss functions satisfy the assumptions in Theorem 3.4. Therefore, for these loss functions, the best-in-class conditional error is directly characterized by Theorem 3.4. Furthermore, if we have x µ(x) H, then under the same assumption, we have E L(H) = Ex[C L(H,x)] = Ex[CL(µ(x),x)] and thus the minimizability gap vanishes: ML(H) = 0. Definition 3.5. A hypothesis set H is said to be realizable if the function that maps x to the conditional mean µ(x) is included in H: x µ(x) H. Corollary 3.6. Under the same assumption as in Theorem 3.4, for realizable hypothesis sets, we have ML(H) = 0. 4. H-Consistency Bounds for Regression In this section, we will analyze the H-consistency of several regression loss functions with respect to the squared loss. 4.1. Huber Loss The Huber loss ℓδ (h,x,y) 1 2(h(x) y)21 h(x) y δ + (δ h(x) y 1 2δ2)1 h(x) y >δ is a frequently used loss function in regression for dealing with outliers. It imposes quadratic penalties on small errors and linear penalties on larger ones. The next result provides H-consistency bounds for the Huber loss with respect to the squared loss. Theorem 4.1. Assume that the distribution is symmetric, the conditional distribution and the hypothesis set H are bounded by B > 0. Assume that pmin(δ) = infx X P(0 µ(x) y δ x) is positive. Then, for all h H, the following H-consistency bound holds: Eℓ2(h) E ℓ2(H) + Mℓ2(H) pmin(δ) (Eℓδ(h) E ℓδ(H) + Mℓδ(H)). The proof is presented in Appendix B.1. It leverages the general Theorem 3.1 with α(h,x) = P(0 µ(x) y δ x). Note that the previous established general tools for Hconsistency bounds (Awasthi et al., 2022a;b) require α to be constant, which is not applicable in this context. This underscores the necessity of generalizing previous tools to accommodate any positive function α. As shown by Corollary 3.6, when H is realizable, the minimizability gap vanishes. Thus, by Theorem 4.1, we obtain the following corollary. Corollary 4.2. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Assume that pmin(δ) = infx X P(0 µ(x) y δ x) is positive. Then, for all h H, the following H-consistency bound holds: Eℓ2(h) E ℓ2(H) max{ 2B pmin(δ) (Eℓδ(h) E ℓδ(H)). Corollary 4.2 implies the Bayes-consistency of the Huber loss when pmin(δ) > 0, by taking the limit on both sides of the bound. Note that, as the value of δ increases, 2B H-Consistency Guarantees for Regression decreases and pmin(δ) increases, which improves the linear dependency on the Huber loss estimation error in this bound. However, this comes at the price of an Huber loss more similar to the squared loss and thus a higher sensitivity to outliers. Thus, selecting an appropriate value for δ involves considering these trade-offs. The bound is uninformative when the probability mass pmin(δ) is zero. However, the following theorem shows that the condition pmin(δ) > 0 is necessary and that otherwise, in general, the Huber loss is not H-consistent with respect to the squared loss. Theorem 4.3. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, the Huber loss ℓδ is not H-consistent with respect to the squared loss. 
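The construction used in the proof, described next, can also be checked numerically. Below is a minimal sketch of ours: the specific numbers B = 2, μ(x) = 0.5, y = −0.5 and δ = 0.5 are an arbitrary instance of a symmetric two-point conditional distribution with |μ(x) − y| > δ, so that pmin(δ) = 0. Over constant (hence realizable) predictors, both h ≡ y + δ and h ≡ μ(x) minimize the Huber error, while only h ≡ μ(x) minimizes the squared loss.

```python
import numpy as np

# Two-point symmetric conditional distribution at a single x (our toy instance):
# P(Y = y) = P(Y = 2*mu - y) = 1/2, with |mu - y| > delta, so pmin(delta) = 0.
B, mu, y, delta = 2.0, 0.5, -0.5, 0.5
labels = np.array([y, 2 * mu - y])            # [-0.5, 1.5], both within [-B, B]

def huber(r, d):                              # Huber penalty on a residual r
    r = np.abs(r)
    return np.where(r <= d, 0.5 * r**2, d * r - 0.5 * d**2)

grid = np.linspace(-B, B, 4001)               # constant hypotheses h ≡ c, realizable
huber_err = np.array([huber(c - labels, delta).mean() for c in grid])
sq_err = np.array([((c - labels) ** 2).mean() for c in grid])

for c in (y + delta, mu):                     # the two predictors from the proof sketch
    i = np.argmin(np.abs(grid - c))
    print(f"h = {c:+.2f}: Huber excess = {huber_err[i] - huber_err.min():.6f}, "
          f"squared excess = {sq_err[i] - sq_err.min():.6f}")
# Both have (numerically) zero Huber excess, but h = y + delta has strictly
# positive squared-loss excess: the Huber loss is not H-consistent here.
```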
Refer to Appendix B.1 for the proof, which consists of considering a distribution that concentrates on an input x with P(Y = y x) = 1 2 = P(Y = 2µ(x) y x), where B y < µ(x) B and µ(x) y > δ. Then, we show that both h x y + δ and h x µ(x) are best-in-class predictors of the Huber loss, while the best-in-class-predictor of the squared loss is uniquely h x µ(x). 4.2. ℓp Loss Here, we analyze ℓp loss functions for any p 1: ℓp (h,x,y) h(x) y p. We show that this family of loss functions benefits from H-consistency bounds with respect to the squared loss assuming, when adopting the same symmetry and boundedness assumptions as in the previous section. Theorem 4.4. Assume that the distribution is symmetric, and that the conditional distribution and the hypothesis set H are bounded by B > 0. Then, for all h H and p 1, the following H-consistency bound holds: Eℓ2(h) E ℓ2(H) + Mℓ2(H) Γ(Eℓp(h) E ℓp(H) + Mℓp(H)), where Γ(t) = supx X,y Y{ h(x) y + µ(x) y }t for p = 1, Γ(t) = 2 (8B)p 2p(p 1) t for p (1,2], and Γ(t) = t 2 p for p 2. The proof is included in Appendix B.2. Note that for p = 1, Γ can be further upper bounded as follows: Γ(t) = supx X supy Y{ h(x) y + µ(x) y }t 4B t since the conditional distribution and the hypothesis set H are bounded by B > 0. This upper bound can also be obtained by using general theorems in Section 3 with α 1. However, our generalized theorems, which apply to any positive function α, yield a finer bound for the ℓ1 loss. This further shows that our generalized theorems are not only useful but can also yield finer bounds. The key term appearing in the bounds is the minimizability gap Mℓp(H) = E ℓp(H) Ex[C ℓp(H,x)], which is helpful for comparing the bounds between ℓp losses for different p 1. For example, for the ℓ1 and ℓ2 loss, by Theorem 3.4, we have Mℓ1(H) = E ℓ1(H) Ex[Ey[ µ(x) y ]] and Mℓ2(H) = E ℓ2(H) Ex[Ey[ µ(x) y 2]]. Thus, in the de- terministic case, both Ey[ µ(x) y ] and Ey[ µ(x) y 2] vanish, and Mℓ2 = E ℓ2(H) (E ℓ1(H)) 2 = (Mℓ1)2. In particular, when H is realizable, we have Mℓp(H) = Mℓ2(H) = 0. This yields the following result. Corollary 4.5. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, for all h H and p 1, the following H-consistency bound holds: Eℓ2(h) E ℓ2(H) Γ(Eℓp(h) E ℓp(H)), where Γ(t) = supx X,y Y{ h(x) y + µ(x) y }t for p = 1, Γ(t) = 2 (8B)p 2p(p 1) t for p (1,2], and Γ(t) = t 2 p for p 2. Corollary 4.5 shows that when the estimation error of ℓp is reduced to ϵ, the estimation error of the squared loss (Eℓ2(h) E ℓ2(H)) is upper bounded by ϵ 2 p for p > 2, and by ϵ for 1 p 2, which is more favorable, modulo a multiplicative constant. 4.3. Squared ϵ-Insensitive Loss The ϵ-insensitive loss and the squared ϵ-insensitive loss functions are used in the support vector regression (SVR) algorithms (Vapnik, 2000). The use of these loss functions results in sparser solutions, characterized by fewer support vectors for the SVR algorithms. Moreover, the selection of the parameter ϵ determines a trade-off between accuracy and sparsity: larger ϵ values yield increasingly sparser solutions. We first provide a positive result for the squared ϵ-insensitive loss ℓsq ϵ (h,x,y) max{ h(x) y 2 ϵ2,0}, by showing that it admits an H-consistency bound with respect to ℓ2. Theorem 4.6. Assume that the distribution is symmetric, and that the conditional distribution and the hypothesis set H are bounded by B > 0. Assume that pmin(ϵ) = infx X P(µ(x) y ϵ x) is positive. 
Then, for all h H, the following H-consistency bound holds: Eℓ2(h) E ℓ2(H) + Mℓ2(H) Eℓsq ϵ(h) E ℓsq ϵ(H) + Mℓsq ϵ(H) H-Consistency Guarantees for Regression The proof is presented in Appendix B.3. It requires the use of Theorem 3.1 with α(h,x) = P(µ(x) y ϵ x). As in the case of the Huber loss, the previous established general tools for H-consistency bounds (Awasthi et al., 2022a;b) do not apply here. Our generalization of previous tools proves essential for analyzing H-consistency bounds in regression. By Corollary 3.6, for realizable hypothesis sets, the minimizability gap vanishes. Thus, by Theorem 4.6, we obtain the following corollary. Corollary 4.7. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Assume that pmin(ϵ) = infx X P(µ(x) y ϵ x) is positive. Then, for all h H, the following H-consistency bound holds: Eℓ2(h) E ℓ2(H) Eℓsq ϵ(h) E ℓsq ϵ(H) By taking the limit on both sides of the bound of Corollary 4.7, we can infer the H-consistency of the squared ϵ-insensitive loss under the assumption pmin(ϵ) > 0. Note that increasing ϵ diminishes pmin(ϵ), making the bound less favorable. Conversely, smaller ϵ values enhance the linear dependency bound but may hinder solution sparsity. Therefore, selecting the optimal ϵ involves balancing the trade-off between linear dependency and sparsity. When pmin(ϵ) approaches zero, the bound derived from Corollary 4.7 becomes less informative. However, as demonstrated in the subsequent theorem, the squared ϵ-insensitive loss fails to exhibit H-consistency with the squared loss if the condition pmin(ϵ) > 0 is not satisfied. Theorem 4.8. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, the squared ϵ-insensitive loss ℓsq ϵ is not H-consistent. The proof is given in Appendix B.3. It consists of considering a distribution that concentrates on an input x with P(Y = y x) = 1 2 = P(Y = 2µ(x) y x), where B y < µ(x) B and µ(x) y < ϵ. Then, we show that both h x y + ϵ and h x µ(x) are best-in-class predictors of the squared ϵ-insensitive loss, while the best-inclass-predictor of the squared loss is uniquely h x µ(x). 4.4. ϵ-Insensitive Loss Here, we present negative results for the ϵ-insensitive loss ℓϵ (h,x,y) max{ h(x) y ϵ,0} used in the SVR algorithm, by showing that even under the assumption infx X P(µ(x) y ϵ) > 0 or infx X P(0 µ(x) y ϵ) > 0, it is not H-consistent with respect to the squared loss. The proofs of Theorems 4.9 and 4.10 are included in Appendix B.4. In the proof, we consider distributions that concentrate on an input x, leading to both h x y + ϵ and h x µ(x) being the best-in-class predictors for the ϵ-insensitive loss. Theorem 4.9. Assume that the distribution is symmetric and satisfies infx X P(µ(x) y ϵ x) > 0. Assume that the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, the ϵ-insensitive loss ℓϵ is not H-consistent with respect to the squared loss. Theorem 4.10. Assume that the distribution is symmetric and satisfies pmin(ϵ) = infx X P(0 µ(x) y ϵ x) > 0. Assume further that the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, the ϵ-insensitive loss ℓϵ is not H-consistent with respect to the squared loss. Tightness. 
Our H-consistency bounds are uniform bounds that hold for all distributions meeting the conditions of the theorems and for every hypothesis within the hypothesis set. The boundedness of the hypothesis set ensures that all values in [ B,+B] are attainable by h(x), h H. This means our bounds can be achieved with a specific hypothesis and a distribution that concentrates on a single point that fulfills the conditions. Therefore, our bounds are, in fact, tight. In Lemma B.4 of Appendix B.5, we show an example illustrating the tightness of our bounds for the Huber loss and ℓ1 loss. In Table 1, we provide a summary of the surrogate losses discussed above. 4.5. Generalization Bounds We can use our H-consistency bounds to derive bounds on the squared loss estimation error of a surrogate loss minimizer. For a labeled sample S = ((x1,y1),...,(xm,ym)) drawn i.i.d. according to D, let h S H be the empirical minimizer of a regression loss function L over S and RL m(H) the Rademacher complexity of the hypothesis set {(x,y) L(h(x),y) h H}. We denote by BL an upper bound of the regression loss function L. Then, the following generalization bound holds. Theorem 4.11. Assume that the distribution is symmetric, the conditional distribution and the hypothesis set H are bounded by B > 0. Then, for any δ > 0, with probability at least 1 δ over the draw of an i.i.d. sample S of size m, the following squared loss estimation bound holds for h S: Eℓ2( h S) E ℓ2(H) Γ(ML(H) + 4RL m(H) + 2BL δ 2m ) Mℓ2(H). where Γ(t) = supx X supy{ h S(x) y + µ(x) y }t for L = ℓ1, Γ(t) = 2 (8B)p 2p(p 1) t for L = ℓp, p (1,2], Γ(t) = H-Consistency Guarantees for Regression Table 1. Summary of surrogate losses in standard regression. Surrogates Γ(t) α(h,x) supx X α(h,x) Main results Huber loss max{ 2B δ ,2}t 1 P(0 µ(x) y δ x) 1 pmin(δ) Theorem 4.1 and Corollary 4.2 ℓp, p 2 t p 2 1 1 Theorem 4.4 and Corollary 4.5 ℓp, 1 < p 2 2 (8B)p 2p(p 1) t 1 1 Theorem 4.4 and Corollary 4.5 ℓp, p = 1 t supy Y{ h(x) y + µ(x) y } supx X,y Y{ h(x) y + µ(x) y } Theorem 4.4 and Corollary 4.5 Squared ϵ-insensitive t 1 2P(µ(x) y ϵ x) 1 2pmin(ϵ) Theorem 4.6 and Corollary 4.7 t 2 p for L = ℓp, p 2, Γ(t) = max{ 2B δ ,2} pmin(δ) t for L = ℓδ, and Γ(t) = 1 2pmin(ϵ) t for L = ℓsq ϵ. The proof is included in Appendix C. Theorem 4.11 provides the first finite-sample guarantees for the squared loss estimation error of the empirical minimizer of the Huber loss, squared ϵ-insensitive loss, ℓ1 loss, and more generally ℓp loss. The proof leverages the H-consistency bounds of Theorems 4.1, 4.4, 4.6, along with standard Rademacher complexity bounds (Mohri et al., 2018). Under the boundedness assumption, we have h(x) y h(x) + y 2B. Thus, an upper bound BL for the regression loss function can be derived. For example, for the ℓp loss, we have h(x) y p (2B)p and thus BL = (2B)p. 5. Application to Adversarial Regression In this section, we show how the H-consistency guarantees we presented in the previous section can be applied to the design of new algorithms for adversarial regression. Deep neural networks are known to be vulnerable to small adversarial perturbations around input data (Krizhevsky et al., 2012; Szegedy et al., 2013; Sutskever et al., 2014). 
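This vulnerability is not specific to deep networks: already for a linear predictor h(x) = w·x + b under an ℓ∞ perturbation of size γ, the worst-case squared loss admits the closed form (|w·x + b − y| + γ‖w‖₁)², attained by perturbing each coordinate by ±γ. The following toy sketch (ours, with made-up w, x and y) verifies this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 10, 0.01
w, b = rng.normal(size=d), 0.3               # a toy linear predictor h(x) = w.x + b
x, y = rng.normal(size=d), 1.2               # a toy labeled point

r = w @ x + b - y                            # clean residual
worst = (np.abs(r) + gamma * np.linalg.norm(w, 1)) ** 2   # closed-form worst case
x_adv = x + gamma * np.sign(w) * np.sign(r)  # perturbation attaining it
assert np.isclose((w @ x_adv + b - y) ** 2, worst)

# no point of the l_inf ball does worse than the closed form
deltas = rng.uniform(-gamma, gamma, size=(100_000, d))
rand_losses = ((x[None] + deltas) @ w + b - y) ** 2
assert rand_losses.max() <= worst + 1e-12
```

For linear hypotheses, the adversarial term sup over ‖x′ − x‖∞ ≤ γ of |h(x′) − h(x)| appearing later in (2) similarly reduces to γ‖w‖₁.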
Despite extensive previous work aimed at improving the robustness of neural networks, this often comes with a reduction in standard (non-adversarial) accuracy, leading to a trade-off between adversarial and standard generalization errors in both the classification (Madry et al., 2017; Tsipras et al., 2018; Zhang et al., 2019; Raghunathan et al., 2019; Awasthi et al., 2021a;b; Min et al., 2021; Javanmard & Soltanolkotabi, 2022; Ma et al., 2022; Taheri et al., 2022; Dobriban et al., 2023; Awasthi et al., 2023b;a) and regression scenarios (Javanmard et al., 2020; Dan et al., 2020; Xing et al., 2021; Hassani & Javanmard, 2022; Liu et al., 2023; Ribeiro & Sch on, 2023; Ribeiro et al., 2023). In the context of adversarial classification, Zhang et al. (2019) proposed algorithms seeking a trade-off between these two types of errors, by using the theory of Bayesconsistent binary classification surrogate losses functions. More recently, Mao et al. (2023f) introduced enhanced algorithms by minimizing smooth adversarial comp-sum losses, leveraging the H-consistency guarantee of comp-sum losses in multi-class classification. Building on the insights from these previous studies, we aim to leverage our novel H-consistency theory tailored for standard regression to introduce a family of new loss functions for adversarial regression, termed as smooth adversarial regression losses. Minimizing these loss functions readily leads to new algorithms for adversarial regression. 5.1. Adversarial Squared Loss In adversarial regression, the target adversarial generalization error is measured by the worst squared loss under the bounded γ perturbation of x. This is defined for any (h,x,y) H X Y as follows: ℓ2(h,x,y) = sup x x x γ (h(x ) y)2 where denotes a norm on X, typically an ℓp norm for p 1. We refer to ℓ2 as the adversarial squared loss. By adding and subtracting the standard squared loss ℓ2, for any (h,x,y) H X Y, we can write ℓ2 as follows: ℓ2(h,x,y) = (h(x) y)2 + supx x x γ(h(x ) y)2 (h(x) y)2. Then, assuming that the conditional distribution and the hypothesis set H are bounded by B > 0, we can write sup x x x γ (h(x ) y)2 (h(x) y)2 = sup x x x γ (h(x ) h(x))(h(x ) + h(x) + y) sup x x x γ 3B h(x ) h(x) . ( h(x) B, y B) Thus, the ℓ2 loss can be upper bounded as follows for all (x,y): ℓ2(h,x,y) (h(x) y)2+ν sup x x x γ h(x ) h(x) , (1) where ν 3B is a positive constant. 5.2. Smooth Adversarial Regression Losses Let L be a standard regression loss function that admits an H-consistency bound with respect to the squared loss: Eℓ2(h) E ℓ2(H) Γ(EL(h) E L(H)). H-Consistency Guarantees for Regression To trade-off the adversarial and standard generalization errors in regression, by using (1), we can upper bound the difference between the adversarial generalization and the best-in-class standard generalization error as follows: E ℓ2(h) E ℓ2(H) Eℓ2(h) E ℓ2(H) + ν sup x x x γ h(x ) h(x) Γ(EL(h) E L(H)) + ν sup x x x γ h(x ) h(x) . Thus, by Corollaries 4.2, 4.5 and 4.7, we obtain the following guarantees with respect to the adversarial squared loss. The proofs are presented in Appendix D. Theorem 5.1. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Assume that pmin(δ) = infx X P(0 µ(x) y δ x) is positive. Then, for any ν 3B and all h H, the following bound holds: E ℓ2(h) E ℓ2(H) max{ 2B pmin(δ) (Eℓδ(h) E ℓδ(H)) + ν sup x x x γ h(x ) h(x) . Theorem 5.2. 
Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, for any ν 3B and all h H, the following bound holds: E ℓ2(h) E ℓ2(H) Γ(Eℓp(h) E ℓp(H)) + ν sup x x x γ h(x ) h(x) , where Γ(t) = t 2 p if p 2, 2 (8B)p 2p(p 1) t for p (1,2) and 4B t, if p = 1. Theorem 5.3. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Assume that pmin(ϵ) = infx X P(µ(x) y ϵ x) is positive. Then, for any ν 3B and all h H, the following bound holds: E ℓ2(h) E ℓ2(H) Eℓsq ϵ(h) E ℓsq ϵ(H) 2pmin(ϵ) + ν sup x x x γ h(x ) h(x) . Theorems 5.1, 5.2 and 5.3 suggest minimizing L(h,x,y) + τ sup x x x γ h(x ) h(x) (2) where L can be chosen as ℓδ, ℓp and ℓsq ϵ and τ > 0 is a parameter. For simplicity, we use the parameter τ to approximate the effect of the functional form Γ in these bounds, as with the approach adopted in (Zhang et al., 2019). This approach could help trade-off the standard part and the adversarial (regularization) part. Given that L is a convex function of h, the minimization of (2) can be achieved equivalently and more efficiently by applying the standard Lagrange method. This allows for the replacement of the ℓ1 norm with its square, since the regularization term can be moved to a constraint, where it can then be squared. We refer to (2) as smooth adversarial regression loss functions. They can be obtained by augmenting the standard regression loss function such as the Huber loss, the ℓp loss and the ϵ-insensitive loss with a natural smoothness term. Minimizing the regularized empirical smooth adversarial regression loss functions leads to a new family of algorithms for adversarial regression, smooth adversarial regression algorithms. In the next section, we report experimental results illustrating the effectiveness of these new algorithms, in particular in terms of the trade-off between the adversarial accuracy and standard accuracy, as guaranteed by Theorems 5.1, 5.2 and 5.3. We will show that these algorithms outperform the direct minimization of the adversarial squared loss. 6. Experiments In this section, we demonstrate empirically the effectiveness of the smooth adversarial regression algorithms introduced in the previous section. Experimental settings. We studied two real-world datasets: the Diabetes dataset (Efron et al., 2004) and the Diverse MAGIC wheat dataset (Scott et al., 2021), and adopted the same exact settings for feature engineering as (Ribeiro et al., 2023, Example 3 and Example 5 in Appendix D). For the sake of a fair comparison, we used a linear hypothesis set. We considered an ℓ perturbation with perturbation size γ {0.001,0.005,0.01} for adversarial training. For our smooth adversarial regression losses (2), we chose L = ℓ2, the squared loss, and L = ℓδ with δ = 0.2, the Huber loss, setting τ = 1 as the default. Other choices for the regression loss functions and the value of τ may yield better performance, which can typically be selected by cross-validation in practice. Both our smooth adversarial regression losses and the adversarial squared loss were optimized using the CVXPY library (Diamond & Boyd, 2016). Evaluation. We report the standard error, measured by the squared loss (or MSE) on the test data, and the robust error, measured by the adversarial squared loss with ℓ perturbation and the corresponding perturbation size used for training. 
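For the linear hypothesis set used in these experiments, sup over ‖x′ − x‖∞ ≤ γ of |h(x′) − h(x)| equals γ‖w‖₁, so the objective (2) can be written directly as a convex program. The following is a minimal CVXPY sketch of ours, not the exact code used for the reported results: data loading, the feature engineering of (Ribeiro et al., 2023), and solver settings are omitted, and the synthetic data at the end is only a usage illustration. Note that cp.huber(r, M) equals r² for |r| ≤ M and 2M|r| − M² otherwise, i.e., twice the Huber loss ℓδ, hence the factor 1/2.

```python
import cvxpy as cp
import numpy as np

def fit_smooth_adv_regression(X, y, gamma, loss="huber", delta=0.2, tau=1.0):
    """Minimal sketch: minimize (1/m) sum_i L(w.x_i + b, y_i) + tau * gamma * ||w||_1,
    i.e., the objective (2) for linear hypotheses under l_inf perturbations, since
    sup_{||x'-x||_inf <= gamma} |h(x') - h(x)| = gamma * ||w||_1 for h(x) = w.x + b."""
    m, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    residual = X @ w + b - y
    if loss == "huber":
        # cp.huber is twice the Huber loss used here; rescale by 1/2.
        data_term = cp.sum(0.5 * cp.huber(residual, delta)) / m
    else:                                     # squared loss
        data_term = cp.sum_squares(residual) / m
    smooth_term = tau * gamma * cp.norm1(w)   # adversarial (regularization) part of (2)
    cp.Problem(cp.Minimize(data_term + smooth_term)).solve()
    return w.value, b.value

# Example usage on synthetic data (ours):
X = np.random.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * np.random.randn(200)
w_hat, b_hat = fit_smooth_adv_regression(X, y, gamma=0.01)
```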
We averaged both errors over five runs and report the standard deviation for both our smooth adversarial regression losses and the adversarial squared loss. Results. Tables 2 and 3 present the experimental results of H-Consistency Guarantees for Regression Table 2. Comparison of the performance of the ADV-SQ algorithm and our smooth adversarial regression algorithms for L = ℓ2 and L = ℓδ for ℓ adversarial training with perturbation size γ {0.001, 0.005, 0.01} on the Diverse MAGIC wheat dataset. METHOD SIZE CLEAN ROBUST ADV-SQ 0.001 1.28 0.10 1.32 0.11 OURS (L = ℓ2) 1.28 0.09 1.32 0.09 OURS (L = ℓδ) 1.30 0.08 1.34 0.09 ADV-SQ 0.005 1.30 0.09 1.53 0.10 OURS (L = ℓ2) 1.26 0.09 1.46 0.10 OURS (L = ℓδ) 1.03 0.09 1.12 0.10 ADV-SQ 0.01 1.30 0.08 1.78 0.11 OURS (L = ℓ2) 1.22 0.11 1.62 0.14 OURS (L = ℓδ) 0.97 0.02 1.01 0.02 Table 3. Comparison of the performance of the ADV-SQ algorithm and our smooth adversarial regression algorithms for L = ℓ2 and L = ℓδ for ℓ adversarial training with perturbation size γ {0.001, 0.005, 0.01} on the Diabetes dataset. METHOD SIZE CLEAN ROBUST ADV-SQ 0.001 2.53 0.48 2.57 0.49 OURS (L = ℓ2) 1.24 0.21 1.26 0.21 OURS (L = ℓδ) 1.31 0.15 1.32 0.15 ADV-SQ 0.005 1.12 0.12 1.18 0.13 OURS (L = ℓ2) 0.80 0.04 0.82 0.04 OURS (L = ℓδ) 0.78 0.06 0.79 0.06 ADV-SQ 0.01 0.83 0.05 0.87 0.05 OURS (L = ℓ2) 0.74 0.05 0.76 0.05 OURS (L = ℓδ) 0.81 0.05 0.82 0.05 our adversarial regression algorithms with both the squared (L = ℓ2) and Huber (L = ℓδ) losses. The results suggest that these algorithms consistently surpass the adversarial squared loss in clean and robust error metrics across various settings. In particular, on the Diabetes dataset with a perturbation size of γ = 0.01, our method (L = ℓ2) outperforms the adversarial squared loss by more than 0.5% in both robust error and clean error. Similarly, on the Diverse MAGIC wheat dataset with a perturbation size of γ = 0.01, our method (L = ℓδ) surpasses the adversarial squared loss by more than 0.3% in terms of robust and clean errors. Remarkably, the surrogate loss using the Huber loss occasionally outperforms the squared loss variant. This highlights the importance of using surrogate losses, even within the adversarial training framework, to enhance performance. 7. Discussion Benefits of surrogate losses in regression. As already underscored, the main benefit of surrogate losses in regression, such as the Huber loss and the ℓ1 loss, is to reduce the influence of outliers. Our experimental results further demonstrate their advantage in adversarial regression settings, where we see the Huber loss-based surrogate occasionally outperforming the squared loss-based one. Smooth adversarial regression losses. The analysis of Section 5.1 can be used to derive surrogate losses for adversarial regression scenarios beyond the adversarial squared loss, such as the adversarial ℓ1 loss. In this context, the formulation of the smooth adversarial regression loss replaces the absolute value with the local Lipschitz constant of the target loss. To establish guarantees for these new surrogate losses, the H-consistency bounds shown in Section 4 can be extended to other target losses in regression, such as the ℓ1 loss, an analysis we have already undertaken. For instance, we can establish H-consistency bounds for ℓ2 losses with respect to ℓ1. Additionally, it would also be interesting to explore the connection between the surrogate losses and Moreau envelope theory (see, for example, (Zhou et al., 2022)). 
We note that the regularization term supx x x γ h(x ) h(x) in the smooth adversarial regression loss closely relates to the local Lipschitz constant and the gradient norm, which are established methods in adversarially robust training (Hein & Andriushchenko, 2017; Finlay & Oberman, 2019; Yang et al., 2020; Gouk et al., 2021). Technical significance and novelty. In Section 3 we presented a more general framework crucial for establishing Hconsistency bounds by allowing for non-constant functions α. Our proof techniques for establishing H-consistency bounds are entirely novel and differ significantly from those of the previous work on H-consistency bounds in the classification setting, due to the continuity of the target loss function and the infinite nature of the label space. This requires, in particular, proving a series of inequalities (e.g., Lemma B.1, Lemma B.2, Lemma B.3) adapted to each surrogate loss case to lower bound the conditional regret of the surrogate by the conditional regret of the squared loss. 8. Conclusion We presented the first study of H-consistency bounds for regression, by generalizing existing tools to prove such guarantees. Leveraging our generalized tools, we proved a series of novel H-consistency bounds for surrogate losses of the squared loss. These theoretical insights have immediate practical implications, illustrated by the formulation of principled surrogate losses for adversarial regression and the subsequent design of effective new algorithms. Our H-consistency guarantees can guide the design of new algorithms for adversarial regression and inform future analyses of surrogate losses for other target losses in regression. Impact Statement This work aims to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. H-Consistency Guarantees for Regression Awasthi, P., Frank, N., Mao, A., Mohri, M., and Zhong, Y. Calibration and consistency of adversarial surrogate losses. In Advances in Neural Information Processing Systems, pp. 9804 9815, 2021a. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. A finer calibration analysis for adversarial robustness. ar Xiv preprint ar Xiv:2105.01550, 2021b. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Hconsistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pp. 1117 1174, 2022a. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Multi-class H-consistency bounds. In Advances in neural information processing systems, pp. 782 795, 2022b. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. DCprogramming for neural network optimizations. Journal of Global Optimization, 2023a. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Theoretically grounded loss functions and algorithms for adversarial robustness. In International Conference on Artificial Intelligence and Statistics, pp. 10077 10094, 2023b. Bartlett, P. L., Jordan, M. I., and Mc Auliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. Blondel, M. Structured prediction with projection oracles. In Advances in neural information processing systems, 2019. Caponnetto, A. A note on the role of squared loss in regression. Technical report, Massachusetts Institute of Technology, 2005. Christmann, A. and Steinwart, I. Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli, 13(3), 2007. Ciliberto, C., Rosasco, L., and Rudi, A. 
A consistent regularization approach for structured prediction. In Advances in neural information processing systems, 2016. Clarkson, J. A. Uniformly convex spaces. Transactions of the American Mathematical Society, 40(3):396 414, 1936. Dan, C., Wei, Y., and Ravikumar, P. Sharp statistical guaratees for adversarially robust gaussian classification. In International Conference on Machine Learning, pp. 2345 2355, 2020. Diamond, S. and Boyd, S. Cvxpy: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909 2913, 2016. Dobriban, E., Hassani, H., Hong, D., and Robey, A. Provable tradeoffs in adversarially robust classification. IEEE Transactions on Information Theory, 2023. Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. The Annals of Statistics, 32(2):407 451, 2004. Finlay, C. and Oberman, A. M. Scaleable input gradient regularization for adversarial robustness. ar Xiv preprint ar Xiv:1905.11468, 2019. Finocchiaro, J., Frongillo, R., and Waggoner, B. An embedding framework for consistent polyhedral surrogates. In Advances in neural information processing systems, 2019. Frongillo, R. and Waggoner, B. Surrogate regret bounds for polyhedral losses. In Advances in Neural Information Processing Systems, pp. 21569 21580, 2021. Gao, W. and Zhou, Z.-H. On the consistency of multi-label learning. In Conference on learning theory, pp. 341 358, 2011. Gao, W. and Zhou, Z.-H. On the consistency of AUC pairwise optimization. In International Joint Conference on Artificial Intelligence, 2015. Gouk, H., Frank, E., Pfahringer, B., and Cree, M. J. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 110:393 416, 2021. Hassani, H. and Javanmard, A. The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. ar Xiv preprint ar Xiv:2201.05149, 2022. Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in neural information processing systems, 2017. Huber, P. J. Robust estimation of a location parameter. Ann. Math. Statist, 35:73 -101, 1964. Javanmard, A. and Soltanolkotabi, M. Precise statistical analysis of classification accuracies for adversarial training. The Annals of Statistics, 50(4):2127 2156, 2022. Javanmard, A., Soltanolkotabi, M., and Hassani, H. Precise tradeoffs in adversarial training for linear regression. In Conference on Learning Theory, pp. 2034 2078, 2020. H-Consistency Guarantees for Regression Koyejo, O. O., Natarajan, N., Ravikumar, P. K., and Dhillon, I. S. Consistent multilabel classification. In Advances in Neural Information Processing Systems, 2015. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097 1105, 2012. Liu, C., Jiao, Y., Wang, J., and Huang, J. Non-asymptotic bounds for adversarial excess risk under misspecified models. ar Xiv preprint ar Xiv:2309.00771, 2023. Long, P. and Servedio, R. Consistency versus realizable Hconsistency for multiclass classification. In International Conference on Machine Learning, pp. 801 809, 2013. Ma, X., Wang, Z., and Liu, W. On the tradeoff between robustness and fairness. In Advances in Neural Information Processing Systems, pp. 26230 26241, 2022. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. 
Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Mao, A., Mohri, C., Mohri, M., and Zhong, Y. Two-stage learning to defer with multiple experts. In Advances in neural information processing systems, 2023a. Mao, A., Mohri, M., and Zhong, Y. H-consistency bounds: Characterization and extensions. In Advances in Neural Information Processing Systems, 2023b. Mao, A., Mohri, M., and Zhong, Y. H-consistency bounds for pairwise misranking loss surrogates. In International conference on Machine learning, 2023c. Mao, A., Mohri, M., and Zhong, Y. Ranking with abstention. In ICML 2023 Workshop The Many Facets of Preference Based Learning, 2023d. Mao, A., Mohri, M., and Zhong, Y. Structured prediction with stronger consistency guarantees. In Advances in Neural Information Processing Systems, 2023e. Mao, A., Mohri, M., and Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, 2023f. Mao, A., Mohri, M., and Zhong, Y. Principled approaches for learning to defer with multiple experts. In International Symposium on Artificial Intelligence and Mathematics, 2024a. Mao, A., Mohri, M., and Zhong, Y. Predictor-rejector multiclass abstention: Theoretical analysis and algorithms. In International Conference on Algorithmic Learning Theory, pp. 822 867, 2024b. Mao, A., Mohri, M., and Zhong, Y. Theoretically grounded loss functions and algorithms for score-based multi-class abstention. In International Conference on Artificial Intelligence and Statistics, pp. 4753 4761, 2024c. Mao, A., Mohri, M., and Zhong, Y. Regression with multiexpert deferral. ar Xiv preprint ar Xiv:2403.19494, 2024d. Mao, A., Mohri, M., and Zhong, Y. Top-k classification and cardinality-aware prediction. ar Xiv preprint ar Xiv:2403.19625, 2024e. Mao, A., Mohri, M., and Zhong, Y. A universal growth rate for learning with smooth surrogate losses. ar Xiv preprint ar Xiv:2405.05968, 2024f. Menon, A. K. and Williamson, R. C. Bayes-optimal scorers for bipartite ranking. In Conference on Learning Theory, pp. 68 106, 2014. Min, Y., Chen, L., and Karbasi, A. The curious case of adversarially robust models: More data can help, double descend, or hurt generalization. In Uncertainty in Artificial Intelligence, pp. 129 139, 2021. Mohri, C., Andor, D., Choi, E., Collins, M., Mao, A., and Zhong, Y. Learning to reject with a fixed predictor: Application to decontextualization. In International Conference on Learning Representations, 2024. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. MIT press, 2018. Narasimhan, H., Ramaswamy, H., Saha, A., and Agarwal, S. Consistent multiclass algorithms for complex performance measures. In International Conference on Machine Learning, pp. 2398 2407, 2015. Osokin, A., Bach, F., and Lacoste-Julien, S. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, 2017. Pedregosa, F., Bach, F., and Gramfort, A. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18:1 35, 2017. Raghunathan, A., Xie, S. M., Yang, F., Duchi, J. C., and Liang, P. Adversarial training can hurt generalization. ar Xiv preprint ar Xiv:1906.06032, 2019. Ribeiro, A. H. and Sch on, T. B. Overparameterized linear regression under adversarial attacks. IEEE Transactions on Signal Processing, 71:601 614, 2023. Ribeiro, A. H., Zachariah, D., Bach, F., and Sch on, T. B. 
Regularization properties of adversarially-trained linear regression. In Advances in Neural Information Processing Systems, 2023. H-Consistency Guarantees for Regression Scott, M. F., Fradgley, N., Bentley, A. R., Brabbs, T., Corke, F., Gardner, K. A., Horsnell, R., Howell, P., Ladejobi, O., Mackay, I. J., et al. Limited haplotype diversity underlies polygenic trait architecture across 70 years of wheat breeding. Genome Biology, 22(1):1 30, 2021. Steinwart, I. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225 287, 2007. Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104 3112, 2014. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. Taheri, H., Pedarsani, R., and Thrampoulidis, C. Asymptotic behavior of adversarial training in binary linear classification. In IEEE International Symposium on Information Theory (ISIT), pp. 127 132, 2022. Tewari, A. and Bartlett, P. L. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(36):1007 1025, 2007. Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. ar Xiv preprint ar Xiv:1805.12152, 2018. Uematsu, K. and Lee, Y. On theoretically optimal ranking functions in bipartite ranking. Journal of the American Statistical Association, 112(519):1311 1322, 2017. Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, 2000. Wang, Y. and Scott, C. Weston-Watkins hinge loss and ordered partitions. In Advances in neural information processing systems, pp. 19873 19883, 2020. Wang, Y. and Scott, C. D. On classification-calibration of gamma-phi losses. ar Xiv preprint ar Xiv:2302.07321, 2023. Xing, Y., Zhang, R., and Cheng, G. Adversarially robust estimate and risk analysis in linear regression. In International Conference on Artificial Intelligence and Statistics, pp. 514 522, 2021. Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R. R., and Chaudhuri, K. A closer look at accuracy vs. robustness. In Advances in neural information processing systems, pp. 8588 8601, 2020. Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. ar Xiv preprint ar Xiv:1901.08573, 2019. Zhang, M. and Agarwal, S. Bayes consistency vs. Hconsistency: The interplay between surrogate loss functions and the scoring function class. In Advances in Neural Information Processing Systems, pp. 16927 16936, 2020. Zhang, M., Ramaswamy, H. G., and Agarwal, S. Convex calibrated surrogates for the multi-label f-measure. In International Conference on Machine Learning, pp. 11246 11255, 2020. Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56 85, 2004a. Zhang, T. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004b. Zheng, C., Wu, G., Bao, F., Cao, Y., Li, C., and Zhu, J. Revisiting discriminative vs. generative classifiers: Theory and implications. ar Xiv preprint ar Xiv:2302.02334, 2023. Zhou, L., Koehler, F., Sur, P., Sutherland, D. J., and Srebro, N. A non-asymptotic moreau envelope theory for high-dimensional generalized linear models. 
In Advances in Neural Information Processing Systems, pp. 21286 21299, 2022. H-Consistency Guarantees for Regression Contents of Appendix A Proofs of general H-consistency theorems 14 A.1 Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A.2 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A.3 Proof of Theorem 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.4 Proof of Theorem 3.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B Proofs of H-consistency bounds for common surrogate losses 16 B.1 H-consistency of ℓδ with respect to ℓ2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 B.2 H-consistency of ℓp with respect to ℓ2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.3 H-consistency of ℓsq ϵ with respect to ℓ2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 B.4 H-consistency of ℓϵ with respect to ℓ2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 B.5 Tightness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 C Proofs of generalization bound 24 D Proofs of adversarial regression 25 D.1 Proof of Theorem 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 D.2 Proof of Theorem 5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 D.3 Proof of Theorem 5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 H-Consistency Guarantees for Regression A. Proofs of general H-consistency theorems A.1. Proof of Theorem 3.1 Theorem 3.1 (General H-consistency bound convex function). Let D denote a distribution over X Y. Assume that there exists a convex function Ψ R+ R with Ψ(0) 0, a positive function α H X R + with supx X α(h,x) < + for all h H, and ϵ 0 such that the following holds for all h H, x X: Ψ([ CL2,H(h,x)]ϵ) α(h,x) CL1,H(h,x). Then, for any hypothesis h H, the following inequality holds: Ψ(EL2(h) E L2(H) + ML2(H)) [sup x X α(h,x)](EL1(h) E L1(H) + ML1(H)) + max{Ψ(0),Ψ(ϵ)}. Proof. For any h H, we can write Ψ(EL2(h) E L2,H + ML2,H) = Ψ(E X[ CL2,H(h,x)]) E X[Ψ( CL2,H(h,x))] (Jensen s ineq.) = E X[Ψ( CL2,H(h,x)1 CL2,H(h,x)>ϵ + CL2,H(h,x)1 CL2,H(h,x) ϵ)] E X[Ψ( CL2,H(h,x)1 CL2,H(h,x)>ϵ) + Ψ( CL2,H(h,x)1 CL2,H(h,x) ϵ)] (Ψ(0) 0) E X[α(h,x) CL1,H(h,x)] + sup t [0,ϵ] Ψ(t) (assumption) [sup x X α(h,x)]E x[ CL1,H(h,x)] + sup t [0,ϵ] Ψ(t) (H older s ineq.) = [sup x X α(h,x)](EL1(h) E L1,H + ML1,H) + max{Ψ(0),Ψ(ϵ)}, (convexity of Ψ) which completes the proof. A.2. Proof of Theorem 3.2 Theorem 3.2 (General H-consistency bound concave function). Let D denote a distribution over X Y. Assume that there exists a concave function Γ R+ R, a positive function α H X R + with supx X α(h,x) < + for all h H, and ϵ 0 such that the following holds for all h H, x X: [ CL2,H(h,x)]ϵ Γ(α(h,x) CL1,H(h,x)). Then, for any hypothesis h H, the following inequality holds EL2(h) E L2(H) + ML2(H) Γ([sup x X α(h,x)](EL1(h) E L1(H) + ML1(H))) + ϵ. 
A.2. Proof of Theorem 3.2

Theorem 3.2 (General H-consistency bound, concave function). Let D denote a distribution over X × Y. Assume that there exist a concave function Γ: R+ → R, a positive function α: H × X → R+* with sup_{x ∈ X} α(h, x) < +∞ for all h ∈ H, and ϵ ≥ 0 such that the following holds for all h ∈ H and x ∈ X:

    [∆C_{L2,H}(h, x)]_ϵ ≤ Γ(α(h, x) ∆C_{L1,H}(h, x)).

Then, for any hypothesis h ∈ H, the following inequality holds:

    E_{L2}(h) − E*_{L2}(H) + M_{L2}(H) ≤ Γ([sup_{x ∈ X} α(h, x)] (E_{L1}(h) − E*_{L1}(H) + M_{L1}(H))) + ϵ.

In the special case where Γ(x) = x^{1/q} for some q ≥ 1 with conjugate number p ≥ 1, that is, 1/p + 1/q = 1, the following inequality holds for any h ∈ H, assuming E_X[α^{p/q}(h, x)] < +∞ for all h ∈ H:

    E_{L2}(h) − E*_{L2}(H) + M_{L2}(H) ≤ E_X[α^{p/q}(h, x)]^{1/p} [E_{L1}(h) − E*_{L1}(H) + M_{L1}(H)]^{1/q} + ϵ.

Proof. For any h ∈ H, we can write

    E_{L2}(h) − E*_{L2}(H) + M_{L2}(H)
      = E_X[C_{L2}(h, x) − C*_{L2}(H, x)]
      = E_X[∆C_{L2,H}(h, x)]
      = E_X[∆C_{L2,H}(h, x) 1_{∆C_{L2,H}(h,x) > ϵ} + ∆C_{L2,H}(h, x) 1_{∆C_{L2,H}(h,x) ≤ ϵ}]
      ≤ E_X[Γ(α(h, x) ∆C_{L1,H}(h, x))] + ϵ    (assumption)
      ≤ Γ(E_X[α(h, x) ∆C_{L1,H}(h, x)]) + ϵ    (Jensen's ineq.)
      ≤ Γ([sup_{x ∈ X} α(h, x)] E_X[∆C_{L1,H}(h, x)]) + ϵ    (Hölder's ineq.)
      = Γ([sup_{x ∈ X} α(h, x)] (E_{L1}(h) − E*_{L1}(H) + M_{L1}(H))) + ϵ.

When Γ(x) = x^{1/q} for some q ≥ 1 with conjugate number p, starting from the fourth inequality above, we can write

    E_{L2}(h) − E*_{L2}(H) + M_{L2}(H)
      ≤ E_X[α^{1/q}(h, x) ∆C^{1/q}_{L1,H}(h, x)] + ϵ
      ≤ E_X[α^{p/q}(h, x)]^{1/p} E_X[∆C_{L1,H}(h, x)]^{1/q} + ϵ    (Hölder's ineq.)
      = E_X[α^{p/q}(h, x)]^{1/p} [E_{L1}(h) − E*_{L1}(H) + M_{L1}(H)]^{1/q} + ϵ.

This completes the proof.

A.3. Proof of Theorem 3.3

Theorem 3.3. Assume that the conditional distribution and the hypothesis set H are bounded by B > 0. Then, the best-in-class conditional error and the conditional regret of the squared loss can be characterized as follows: for all h ∈ H and x ∈ X,

    C*_{ℓ2}(H, x) = C_{ℓ2}(µ(x), x) = E[y² | x] − (µ(x))²,
    ∆C_{ℓ2,H}(h, x) = (h(x) − µ(x))².

Proof. By definition,

    C*_{ℓ2}(H, x) = inf_{h ∈ H} E[(h(x) − y)² | x]
      = inf_{h ∈ H} [(h(x) − E[y | x])² + E[y² | x] − (E[y | x])²]
      = E[y² | x] − (E[y | x])²,

    ∆C_{ℓ2,H}(h, x) = E[(h(x) − y)² | x] − inf_{h ∈ H} E[(h(x) − y)² | x]
      = (h(x) − E[y | x])² + E[y² | x] − (E[y | x])² − (E[y² | x] − (E[y | x])²)
      = (h(x) − E[y | x])².

This completes the proof.

A.4. Proof of Theorem 3.4

Theorem 3.4. Let ψ: R → R be a symmetric convex function, that is ψ(x) = ψ(−x) for all x ∈ R, with ψ(x) ≥ 0 for all x in its domain and ψ(0) = 0. Assume that the conditional distribution and the hypothesis set H are bounded by B > 0. Assume that the distribution is symmetric and that the regression loss function is given by L(y′, y) = ψ(y′ − y). Then, we have C*_L(H, x) = C_L(µ(x), x).

Proof. By the symmetry of the distribution with respect to µ(x), for any h ∈ H and x ∈ X, the random variables y and 2µ(x) − y have the same conditional law, and thus we can write

    E_y[ψ(h(x) − y) | x]
      = (E_y[ψ(h(x) − y) | x] + E_y[ψ(h(x) − 2µ(x) + y) | x]) / 2    (symmetry of the distribution)
      = (E_y[ψ(h(x) − y) | x] + E_y[ψ(−h(x) + 2µ(x) − y) | x]) / 2    (ψ is symmetric)
      = E_y[(ψ(h(x) − y) + ψ(−h(x) + 2µ(x) − y)) / 2 | x]
      ≥ E_y[ψ(µ(x) − y) | x],    (Jensen's inequality, by the convexity of ψ)

where the equality is achieved when h(x) = µ(x) ∈ H. This completes the proof.
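The following is a small numerical sanity check of Theorem 3.4 (ours, not part of the paper): for a symmetric two-point conditional distribution and a symmetric convex ψ, here taken to be the Huber function, the conditional error over a grid of candidate predictions is minimized at the conditional mean µ(x). All numerical values are illustrative placeholders.

import numpy as np

# Sanity check of Theorem 3.4: the conditional error of a symmetric convex loss
# under a symmetric conditional distribution is minimized at the conditional mean.
B, delta = 1.0, 0.3
mu = 0.2
ys = np.array([mu - 0.5, mu + 0.5])   # two-point distribution, symmetric around mu
probs = np.array([0.5, 0.5])

def psi(t, d=delta):                  # Huber function
    t = np.abs(t)
    return np.where(t <= d, 0.5 * t ** 2, d * t - 0.5 * d ** 2)

hs = np.linspace(-B, B, 2001)         # candidate values h(x) in [-B, B]
cond_err = np.array([probs @ psi(h - ys) for h in hs])
print(hs[np.argmin(cond_err)])        # approximately 0.2 = mu(x)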
B. Proofs of H-consistency bounds for common surrogate losses

B.1. H-consistency of ℓδ with respect to ℓ2

Define the function g as g: t ↦ (1/2) t² 1_{|t| ≤ δ} + (δ|t| − (1/2)δ²) 1_{|t| > δ}. Consider the function F defined over [−2B, 2B]² by F(x, y) = (g(x + y) + g(x − y))/2 − g(y). We first prove a useful lemma.

Lemma B.1. For any x, y ∈ [−2B, 2B] with |y| ≤ δ, the following inequality holds:

    F(x, y) ≥ (1/2) min{δ/(2B), 1/2} x².

Proof. Given the definition of g and the symmetry of F with respect to y ↦ −y, we can assume, without loss of generality, that 0 ≤ y ≤ δ. We analyze the following cases.

Case I: |x + y| ≤ δ, |x − y| ≤ δ, 0 ≤ y ≤ δ. In this case,

    F(x, y) = ((1/2)(x + y)² + (1/2)(x − y)²)/2 − (1/2)y² = (1/2)x².

Case II: |x + y| ≤ δ, |x − y| > δ, 0 ≤ y ≤ δ. In this case, we must have −y − δ ≤ x < y − δ and δ ≥ y ≥ max{−x − δ, x + δ}. Thus,

    2F(x, y) = (1/2)(x + y)² + δ(y − x) − (1/2)δ² − y²    (x − y < 0)
             = −(1/2)y² + (x + δ)y + (1/2)x² − δx − (1/2)δ².

This is a concave quadratic function of y whose vertex y = x + δ lies at or to the left of the admissible range, so its minimum over that range is attained at y = δ, where it equals (1/2)x². Hence F(x, y) ≥ (1/4)x².

Case III: |x + y| > δ, |x − y| ≤ δ, 0 ≤ y ≤ δ. In this case, we must have δ − y < x ≤ y + δ and δ ≥ y ≥ max{δ − x, x − δ}. Thus,

    2F(x, y) = δ(x + y) − (1/2)δ² + (1/2)(x − y)² − y²    (x + y > 0)
             = −(1/2)y² + (δ − x)y + (1/2)x² + δx − (1/2)δ².

Again a concave quadratic in y whose minimum over the admissible range is attained at y = δ, where it equals (1/2)x². Hence F(x, y) ≥ (1/4)x².

Case IV: x + y > δ, x − y > δ, 0 ≤ y ≤ δ. In this case, we must have x > y + δ ≥ δ and 0 ≤ y < min{x − δ, δ}. Thus,

    2F(x, y) = δ(x + y) − (1/2)δ² + δ(x − y) − (1/2)δ² − y²    (x + y > 0 and x − y > 0)
             = −y² + 2δx − δ².

If δ < x ≤ 2δ, then min{x − δ, δ} = x − δ and, taking the infimum of the quadratic over the admissible range (attained as y → x − δ),

    2F(x, y) ≥ −(x − δ)² + 2δx − δ² = −x² + 4δx − 2δ² ≥ (1/2)x².    (δ < x ≤ 2δ)

If 2δ < x ≤ 2B, then min{x − δ, δ} = δ and, taking the infimum over the admissible range (attained as y → δ),

    2F(x, y) ≥ −δ² + 2δx − δ² = 2δ(x − δ) ≥ δx ≥ (δ/(2B)) x².    (2δ < x ≤ 2B)

Case V: x + y < −δ, x − y < −δ, 0 ≤ y ≤ δ. In this case, we must have x < −(y + δ) ≤ −δ and 0 ≤ y < min{−x − δ, δ}. Thus,

    2F(x, y) = −δ(x + y) − (1/2)δ² − δ(x − y) − (1/2)δ² − y²    (x + y < 0 and x − y < 0)
             = −y² − 2δx − δ².

If −2δ ≤ x < −δ, then min{−x − δ, δ} = −x − δ and

    2F(x, y) ≥ −(x + δ)² − 2δx − δ² = −x² − 4δx − 2δ² ≥ (1/2)x².    (−2δ ≤ x < −δ)

If −2B ≤ x < −2δ, then min{−x − δ, δ} = δ and

    2F(x, y) ≥ −δ² − 2δx − δ² = 2δ(−x − δ) ≥ δ|x| ≥ (δ/(2B)) x².    (−2B ≤ x < −2δ)

In summary, in all cases F(x, y) ≥ (1/2) min{δ/(2B), 1/2} x², which completes the proof.

Theorem 4.1. Assume that the distribution is symmetric and that the conditional distribution and the hypothesis set H are bounded by B > 0. Assume that pmin(δ) = inf_{x ∈ X} P(0 ≤ µ(x) − y ≤ δ | x) is positive. Then, for all h ∈ H, the following H-consistency bound holds:

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) ≤ [max{2B/δ, 2} / pmin(δ)] (E_{ℓδ}(h) − E*_{ℓδ}(H) + M_{ℓδ}(H)).

Proof. By Theorem 3.4, C*_{ℓδ}(H, x) = C_{ℓδ}(µ(x), x), and thus, for all h ∈ H and x ∈ X,

    ∆C_{ℓδ,H}(h, x)
      = E_y[(1/2)(h(x) − y)² 1_{|h(x)−y| ≤ δ} + (δ|h(x) − y| − (1/2)δ²) 1_{|h(x)−y| > δ} | x]
        − E_y[(1/2)(µ(x) − y)² 1_{|µ(x)−y| ≤ δ} + (δ|µ(x) − y| − (1/2)δ²) 1_{|µ(x)−y| > δ} | x]
      = E_y[(g(h(x) − µ(x) + (µ(x) − y)) + g(h(x) − µ(x) − (µ(x) − y)))/2 − g(µ(x) − y) | x]    (the distribution is symmetric with respect to µ(x))
      = E_y[F(h(x) − µ(x), µ(x) − y) | x]
      ≥ 2 P(0 ≤ µ(x) − y ≤ δ | x) E_y[F(h(x) − µ(x), µ(x) − y) | 0 ≤ µ(x) − y ≤ δ]    (F ≥ 0 and F(x, ·) is even; symmetry of the distribution)
      ≥ P(0 ≤ µ(x) − y ≤ δ | x) min{δ/(2B), 1/2} (h(x) − µ(x))²    (Lemma B.1; |h(x) − µ(x)| ≤ 2B, |µ(x) − y| ≤ 2B)
      = P(0 ≤ µ(x) − y ≤ δ | x) min{δ/(2B), 1/2} ∆C_{ℓ2,H}(h, x).

By Theorem 3.1 or 3.2 with α(h, x) = 1/(P(0 ≤ µ(x) − y ≤ δ | x) min{δ/(2B), 1/2}), we obtain

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) ≤ [max{2B/δ, 2} / pmin(δ)] (E_{ℓδ}(h) − E*_{ℓδ}(H) + M_{ℓδ}(H)).

Theorem 4.3. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, without the condition pmin(δ) > 0 of Theorem 4.1, the Huber loss ℓδ is not H-consistent with respect to the squared loss.

Proof. Consider a distribution that concentrates on a single input x. Choose y, µ(x), δ ∈ R such that −B ≤ y < µ(x) ≤ B and µ(x) − y > δ. Consider the conditional distribution P(Y = y | x) = 1/2 = P(Y = 2µ(x) − y | x). Thus, the distribution is symmetric with respect to y = µ(x). For such a distribution, the best-in-class predictor for the squared loss is h*(x) = µ(x). However, for the Huber loss, we have

    C_{ℓδ}(h, x) = E_y[(1/2)(h(x) − y)² 1_{|h(x)−y| ≤ δ} + (δ|h(x) − y| − (1/2)δ²) 1_{|h(x)−y| > δ} | x]
      = (1/2)[(1/2)(h(x) − y)² 1_{|h(x)−y| ≤ δ} + (δ|h(x) − y| − (1/2)δ²) 1_{|h(x)−y| > δ}]
        + (1/2)[(1/2)(h(x) − 2µ(x) + y)² 1_{|h(x)−2µ(x)+y| ≤ δ} + (δ|h(x) − 2µ(x) + y| − (1/2)δ²) 1_{|h(x)−2µ(x)+y| > δ}].

Thus, plugging in h: x ↦ y + δ and h*: x ↦ µ(x), we obtain

    C_{ℓδ}(h, x) = (1/2)[(1/2)δ² + δ(2µ(x) − 2y − δ) − (1/2)δ²] = δ(µ(x) − y) − (1/2)δ²    (h(x) − y = δ and h(x) − 2µ(x) + y < −δ)
    C_{ℓδ}(h*, x) = (1/2)(δ(µ(x) − y) − (1/2)δ²) + (1/2)(δ(µ(x) − y) − (1/2)δ²) = δ(µ(x) − y) − (1/2)δ².    (h*(x) − y > δ and h*(x) − 2µ(x) + y < −δ)

Therefore, C_{ℓδ}(h, x) = C_{ℓδ}(h*, x), and both h and h* are best-in-class predictors for the Huber loss. This implies that the Huber loss is not H-consistent with respect to the squared loss, and thus that the condition of Theorem 4.1 is necessary in the realizable setting.
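The construction in the proof of Theorem 4.3 can be checked numerically; the following snippet (ours, with illustrative constants) shows that when both support points lie more than δ away from µ(x), the Huber conditional error takes the same value at h(x) = y + δ and at h*(x) = µ(x), while the squared-loss conditional error is strictly smaller at µ(x).

import numpy as np

# Numerical illustration of the Theorem 4.3 construction (illustrative constants).
delta = 0.1
y, mu = -0.5, 0.5                      # mu - y = 1.0 > delta
ys = np.array([y, 2 * mu - y])         # P(Y = y | x) = P(Y = 2 mu - y | x) = 1/2
probs = np.array([0.5, 0.5])

def huber(t, d=delta):
    t = np.abs(t)
    return np.where(t <= d, 0.5 * t ** 2, d * t - 0.5 * d ** 2)

def cond_err(h, loss):
    return probs @ loss(h - ys)

for h in (y + delta, mu):
    print(h, cond_err(h, huber), cond_err(h, lambda t: t ** 2))
# Huber conditional errors coincide (0.095 with these constants); the squared-loss
# conditional error is strictly smaller at mu(x).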
B.2. H-consistency of ℓp with respect to ℓ2

Theorem 4.4. Assume that the distribution is symmetric and that the conditional distribution and the hypothesis set H are bounded by B > 0. Then, for all h ∈ H and p ≥ 1, the following H-consistency bound holds:

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) ≤ Γ(E_{ℓp}(h) − E*_{ℓp}(H) + M_{ℓp}(H)),

where Γ(t) = sup_{x ∈ X, y ∈ Y}{|h(x) − y| + |µ(x) − y|} t for p = 1, Γ(t) = 2t/((8B)^{p−2} p(p−1)) for p ∈ (1, 2], and Γ(t) = t^{2/p} for p ≥ 2.

Proof. We analyze the three regimes of p in turn.

Case I: p ≥ 2. By Theorem 3.4, we can write, for all h ∈ H and x ∈ X,

    ∆C_{ℓp,H}(h, x) = E_y[|h(x) − y|^p − |µ(x) − y|^p | x]
      = E_y[(|h(x) − y|^p + |h(x) − 2µ(x) + y|^p)/2 − |µ(x) − y|^p | x]    (the distribution is symmetric with respect to µ(x))
      = E_y[(|h(x) − µ(x) + (µ(x) − y)|^p + |h(x) − µ(x) − (µ(x) − y)|^p)/2 − |µ(x) − y|^p | x]
      ≥ |h(x) − µ(x)|^p    (by Clarkson's inequality (Clarkson, 1936))
      = ((h(x) − µ(x))²)^{p/2}
      = (∆C_{ℓ2,H}(h, x))^{p/2}.

By Theorem 3.1, we have (E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H))^{p/2} ≤ E_{ℓp}(h) − E*_{ℓp}(H) + M_{ℓp}(H), that is,

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) ≤ (E_{ℓp}(h) − E*_{ℓp}(H) + M_{ℓp}(H))^{2/p}.

Case II: 1 < p ≤ 2. In this case, Clarkson's inequality cannot be used directly. We first prove a useful lemma.

Lemma B.2. For any x, y ∈ [−B, B] and 1 < p ≤ 2, the following inequality holds:

    |x + y|^p + |x − y|^p − 2|y|^p ≥ (2B)^{p−2} p(p−1) x².

Proof. For any y ∈ [−B, B], consider the function f_y: x ↦ |x + y|^p + |x − y|^p − 2|y|^p − (2B)^{p−2} p(p−1) x². Its first and second derivatives are

    f_y′(x) = p|x + y|^{p−1} sign(x + y) + p|x − y|^{p−1} sign(x − y) − 2(2B)^{p−2} p(p−1) x,
    f_y″(x) = p(p−1)|x + y|^{p−2} + p(p−1)|x − y|^{p−2} − 2(2B)^{p−2} p(p−1).

Thus, using 1 < p ≤ 2 and |x + y| ≤ 2B, |x − y| ≤ 2B, we have, for all x ∈ [−B, B],

    f_y″(x) ≥ p(p−1)(2B)^{p−2} + p(p−1)(2B)^{p−2} − 2(2B)^{p−2} p(p−1) = 0.

Therefore, f_y is convex. Since f_y′(0) = 0, x = 0 achieves the minimum: for all x, y ∈ [−B, B], f_y(x) ≥ f_y(0) = 0. This completes the proof.

By Theorem 3.4, we can write, for all h ∈ H and x ∈ X,

    ∆C_{ℓp,H}(h, x)
      = E_y[(|h(x) − µ(x) + (µ(x) − y)|^p + |h(x) − µ(x) − (µ(x) − y)|^p)/2 − |µ(x) − y|^p | x]    (the distribution is symmetric with respect to µ(x))
      ≥ ((8B)^{p−2} p(p−1)/2) (h(x) − µ(x))²    (by Lemma B.2 applied with the bound 4B, since |h(x) − µ(x)| ≤ 4B and |µ(x) − y| ≤ 4B)
      = ((8B)^{p−2} p(p−1)/2) ∆C_{ℓ2,H}(h, x).

By Theorem 3.1, we have

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) ≤ [2/((8B)^{p−2} p(p−1))] (E_{ℓp}(h) − E*_{ℓp}(H) + M_{ℓp}(H)).

Case III: p = 1. By Theorem 3.4, we can write, for all h ∈ H and x ∈ X,

    ∆C_{ℓ2,H}(h, x) = E_y[(h(x) − y)² − (µ(x) − y)² | x]
      = E_y[(|h(x) − y| + |µ(x) − y|)(|h(x) − y| − |µ(x) − y|) | x]
      ≤ sup_{y ∈ Y}{|h(x) − y| + |µ(x) − y|} E_y[|h(x) − y| − |µ(x) − y| | x]
      = sup_{y ∈ Y}{|h(x) − y| + |µ(x) − y|} ∆C_{ℓ1,H}(h, x).

By Theorem 3.1 or 3.2 with α(h, x) = sup_{y ∈ Y}{|h(x) − y| + |µ(x) − y|}, we have

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) ≤ sup_{x ∈ X} sup_{y ∈ Y}{|h(x) − y| + |µ(x) − y|} (E_{ℓ1}(h) − E*_{ℓ1}(H) + M_{ℓ1}(H)).

B.3. H-consistency of ℓsq_ϵ with respect to ℓ2

Define the function g as g: t ↦ max{t² − ϵ², 0}. Consider the function F defined over R² by F(x, y) = (g(x + y) + g(x − y))/2 − g(y). We first prove a useful lemma.

Lemma B.3. For any x ∈ R and |y| ≥ ϵ, the following inequality holds: F(x, y) ≥ x².

Proof. Given the definition of g and the symmetry of F with respect to y ↦ −y, we can assume, without loss of generality, that y ≥ ϵ. We analyze the following cases.

Case I: |x + y| > ϵ, |x − y| > ϵ, y ≥ ϵ. In this case,

    F(x, y) = ((x + y)² − ϵ² + (x − y)² − ϵ²)/2 − y² + ϵ² = x².

Case II: |x + y| > ϵ, |x − y| ≤ ϵ, y ≥ ϵ. In this case, we must have x ≥ 0, y − ϵ ≤ x ≤ y + ϵ and x + ϵ ≥ y ≥ max{x − ϵ, ϵ}. Thus,

    F(x, y) = ((x + y)² − ϵ²)/2 − (y² − ϵ²) = (−y² + 2xy + x² + ϵ²)/2.

This is a concave quadratic in y, so its minimum over the admissible range is attained at an endpoint; its value at y = x + ϵ (and at y = x − ϵ) equals x², and its value at y = ϵ, admissible only when x ≤ 2ϵ, equals (x² + 2xϵ)/2 ≥ x². Hence F(x, y) ≥ x².

Case III: |x + y| ≤ ϵ, |x − y| > ϵ, y ≥ ϵ. By the symmetry F(x, y) = F(−x, y), this case reduces to Case II with x replaced by −x, so again F(x, y) ≥ x².

Case IV: |x + y| ≤ ϵ, |x − y| ≤ ϵ, y ≥ ϵ. In this case, we must have x = 0 and y = ϵ, so F(x, y) = 0 + 0 − 0 = 0 = x².

In summary, in all cases F(x, y) ≥ x², which completes the proof.
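Lemma B.3 can also be verified numerically on a grid; the following check (ours, with an illustrative value of ϵ) confirms F(x, y) ≥ x² whenever |y| ≥ ϵ.

import numpy as np

# Grid check of Lemma B.3: with g(t) = max(t^2 - eps^2, 0) and
# F(x, y) = (g(x + y) + g(x - y)) / 2 - g(y), we expect F(x, y) >= x^2 for |y| >= eps.
eps = 0.3
g = lambda t: np.maximum(t ** 2 - eps ** 2, 0.0)
F = lambda x, y: 0.5 * (g(x + y) + g(x - y)) - g(y)

xs = np.linspace(-3.0, 3.0, 601)
ys = np.concatenate([np.linspace(-3.0, -eps, 301), np.linspace(eps, 3.0, 301)])
X, Y = np.meshgrid(xs, ys)
print(np.min(F(X, Y) - X ** 2) >= -1e-9)   # True on this grid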
Theorem 4.6. Assume that the distribution is symmetric and that the conditional distribution and the hypothesis set H are bounded by B > 0. Assume that pmin(ϵ) = inf_{x ∈ X} P(µ(x) − y ≥ ϵ | x) is positive. Then, for all h ∈ H, the following H-consistency bound holds:

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) ≤ [E_{ℓsq_ϵ}(h) − E*_{ℓsq_ϵ}(H) + M_{ℓsq_ϵ}(H)] / (2 pmin(ϵ)).

Proof. By Theorem 3.4, we can write, for all h ∈ H and x ∈ X,

    ∆C_{ℓsq_ϵ,H}(h, x) = E_y[max{(h(x) − y)² − ϵ², 0} | x] − E_y[max{(µ(x) − y)² − ϵ², 0} | x]
      = E_y[(g(h(x) − µ(x) + (µ(x) − y)) + g(h(x) − µ(x) − (µ(x) − y)))/2 − g(µ(x) − y) | x]    (the distribution is symmetric with respect to µ(x))
      = E_y[F(h(x) − µ(x), µ(x) − y) | x]
      ≥ 2 P(µ(x) − y ≥ ϵ | x) E_y[F(h(x) − µ(x), µ(x) − y) | µ(x) − y ≥ ϵ]    (F ≥ 0 and F(x, ·) is even; symmetry of the distribution)
      ≥ 2 P(µ(x) − y ≥ ϵ | x) (h(x) − µ(x))²    (by Lemma B.3)
      = 2 P(µ(x) − y ≥ ϵ | x) ∆C_{ℓ2,H}(h, x).

By Theorem 3.1 or 3.2 with α(h, x) = 1/(2 P(µ(x) − y ≥ ϵ | x)), we have

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) ≤ [E_{ℓsq_ϵ}(h) − E*_{ℓsq_ϵ}(H) + M_{ℓsq_ϵ}(H)] / (2 pmin(ϵ)).

Theorem 4.8. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, without the condition pmin(ϵ) > 0 of Theorem 4.6, the squared ϵ-insensitive loss ℓsq_ϵ is not H-consistent with respect to the squared loss.

Proof. Consider a distribution that concentrates on a single input x. Choose y, µ(x), ϵ ∈ R such that −B ≤ y < µ(x) ≤ B and µ(x) − y < ϵ. Consider the conditional distribution P(Y = y | x) = 1/2 = P(Y = 2µ(x) − y | x). Thus, the distribution is symmetric with respect to y = µ(x). For such a distribution, the best-in-class predictor for the squared loss is h*(x) = µ(x). However, for the squared ϵ-insensitive loss, we have

    C_{ℓsq_ϵ}(h, x) = E_y[max{(h(x) − y)² − ϵ², 0} | x]
      = (1/2) max{(h(x) − y)² − ϵ², 0} + (1/2) max{(h(x) − 2µ(x) + y)² − ϵ², 0}.

Thus, plugging in h: x ↦ y + ϵ and h*: x ↦ µ(x), we obtain

    C_{ℓsq_ϵ}(h, x) = 0    (h(x) − y = ϵ and −ϵ < h(x) − 2µ(x) + y < ϵ)
    C_{ℓsq_ϵ}(h*, x) = 0.    (0 < h*(x) − y < ϵ and −ϵ < h*(x) − 2µ(x) + y < 0)

Therefore, C_{ℓsq_ϵ}(h, x) = C_{ℓsq_ϵ}(h*, x), and both h and h* are best-in-class predictors for the squared ϵ-insensitive loss. This implies that the squared ϵ-insensitive loss is not H-consistent with respect to the squared loss.

B.4. H-consistency of ℓϵ with respect to ℓ2

Here, we present negative results for the ϵ-insensitive loss ℓϵ: (h, x, y) ↦ max{|h(x) − y| − ϵ, 0} used in the SVR algorithm, by showing that even under the assumption inf_{x ∈ X} P(µ(x) − y ≥ ϵ | x) > 0 or inf_{x ∈ X} P(0 ≤ µ(x) − y ≤ ϵ | x) > 0, it is not H-consistent with respect to the squared loss. In the proofs, we consider distributions that concentrate on a single input x, leading to both h: x ↦ y + ϵ and h*: x ↦ µ(x) being best-in-class predictors for the ϵ-insensitive loss.

Theorem 4.9. Assume that the distribution is symmetric and satisfies inf_{x ∈ X} P(µ(x) − y ≥ ϵ | x) > 0. Assume that the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, the ϵ-insensitive loss ℓϵ is not H-consistent with respect to the squared loss.

Proof. Consider a distribution that concentrates on a single input x. Choose y, µ(x), ϵ ∈ R such that −B ≤ y < µ(x) ≤ B and µ(x) − y > ϵ. Consider the conditional distribution P(Y = y | x) = 1/2 = P(Y = 2µ(x) − y | x). Thus, the distribution is symmetric with respect to y = µ(x). For such a distribution, the best-in-class predictor for the squared loss is h*(x) = µ(x). However, for the ϵ-insensitive loss, we have

    C_{ℓϵ}(h, x) = E_y[max{|h(x) − y| − ϵ, 0} | x]
      = (1/2) max{|h(x) − y| − ϵ, 0} + (1/2) max{|h(x) − 2µ(x) + y| − ϵ, 0}.

Thus, plugging in h: x ↦ y + ϵ and h*: x ↦ µ(x), we obtain

    C_{ℓϵ}(h, x) = (1/2)(2µ(x) − 2y − 2ϵ) = µ(x) − y − ϵ    (h(x) − y = ϵ and h(x) − 2µ(x) + y < −ϵ)
    C_{ℓϵ}(h*, x) = (1/2)(µ(x) − y − ϵ) + (1/2)(µ(x) − y − ϵ) = µ(x) − y − ϵ.    (h*(x) − y > ϵ and h*(x) − 2µ(x) + y < −ϵ)

Therefore, C_{ℓϵ}(h, x) = C_{ℓϵ}(h*, x), and both h and h* are best-in-class predictors for the ϵ-insensitive loss. This implies that the ϵ-insensitive loss is not H-consistent with respect to the squared loss.
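The following snippet (ours, with illustrative constants) makes the mechanism behind Theorems 4.9 and 4.10 concrete: on the two-point distribution used in the proof, the conditional error of the ϵ-insensitive loss is constant over the whole interval [y + ϵ, 2µ(x) − y − ϵ], so minimizing it cannot single out µ(x).

import numpy as np

# Illustration of the flat region of the eps-insensitive conditional error.
eps = 0.1
y, mu = -0.5, 0.5                      # mu - y = 1.0 > eps
ys = np.array([y, 2 * mu - y])
probs = np.array([0.5, 0.5])

eps_ins = lambda t: np.maximum(np.abs(t) - eps, 0.0)
cond_err = lambda h: probs @ eps_ins(h - ys)

for h in (y + eps, 0.0, mu, 2 * mu - y - eps):
    print(h, cond_err(h))              # all equal mu - y - eps = 0.9 here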
Theorem 4.10. Assume that the distribution is symmetric and satisfies pmin(ϵ) = inf_{x ∈ X} P(0 ≤ µ(x) − y ≤ ϵ | x) > 0. Assume further that the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, the ϵ-insensitive loss ℓϵ is not H-consistent with respect to the squared loss.

Proof. Consider a distribution that concentrates on a single input x. Choose y, µ(x), ϵ ∈ R such that −B ≤ y < µ(x) ≤ B and µ(x) − y < ϵ. Consider the conditional distribution P(Y = y | x) = 1/2 = P(Y = 2µ(x) − y | x). Thus, the distribution is symmetric with respect to y = µ(x). For such a distribution, the best-in-class predictor for the squared loss is h*(x) = µ(x). However, for the ϵ-insensitive loss, we have

    C_{ℓϵ}(h, x) = E_y[max{|h(x) − y| − ϵ, 0} | x]
      = (1/2) max{|h(x) − y| − ϵ, 0} + (1/2) max{|h(x) − 2µ(x) + y| − ϵ, 0}.

Thus, plugging in h: x ↦ y + ϵ and h*: x ↦ µ(x), we obtain

    C_{ℓϵ}(h, x) = 0    (h(x) − y = ϵ and −ϵ < h(x) − 2µ(x) + y < ϵ)
    C_{ℓϵ}(h*, x) = 0.    (0 < h*(x) − y < ϵ and −ϵ < h*(x) − 2µ(x) + y < 0)

Therefore, C_{ℓϵ}(h, x) = C_{ℓϵ}(h*, x), and both h and h* are best-in-class predictors for the ϵ-insensitive loss. This implies that the ϵ-insensitive loss is not H-consistent with respect to the squared loss.

B.5. Tightness

Lemma B.4. Suppose that the distribution and the hypothesis set H satisfy the conditions of Theorem 4.1. Then, for any t ∈ [0, δ²], there exist a distribution and a hypothesis h ∈ H such that

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) = [max{2B/δ, 2}/pmin(δ)] (E_{ℓδ}(h) − E*_{ℓδ}(H) + M_{ℓδ}(H)) = t.

For any t ∈ [0, 4B²], there exist a distribution and a hypothesis h ∈ H such that

    E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) = sup_{x ∈ X} sup_{y ∈ Y}{|h(x) − y| + |µ(x) − y|} (E_{ℓ1}(h) − E*_{ℓ1}(H) + M_{ℓ1}(H)) = t.

Proof. Consider a distribution D that concentrates on a single input x.

For the Huber loss, choose the distribution D and y, µ(x), δ ∈ R such that µ(x) = y and δ = B, and let h be a hypothesis such that (h(x) − y)² = t ≤ δ². In this case,

    ∆C_{ℓ2}(h, x) = E_y[(h(x) − y)² − (µ(x) − y)² | x] = (h(x) − y)², and
    ∆C_{ℓδ}(h, x) = E_y[(1/2)(h(x) − y)² 1_{|h(x)−y| ≤ δ} + (δ|h(x) − y| − (1/2)δ²) 1_{|h(x)−y| > δ} | x] = (1/2)(h(x) − y)².

Thus, E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) = ∆C_{ℓ2}(h, x) = t and [max{2B/δ, 2}/pmin(δ)] (E_{ℓδ}(h) − E*_{ℓδ}(H) + M_{ℓδ}(H)) = 2 ∆C_{ℓδ}(h, x) = t.

For the ℓ1 loss, choose the distribution D and y, µ(x) ∈ R such that µ(x) = y, and let h be a hypothesis such that (h(x) − y)² = t ≤ 4B². In this case, ∆C_{ℓ2}(h, x) = E_y[(h(x) − y)² − (µ(x) − y)² | x] = (h(x) − y)² and ∆C_{ℓ1}(h, x) = E_y[|h(x) − y| − |µ(x) − y| | x] = |h(x) − y|. Thus, E_{ℓ2}(h) − E*_{ℓ2}(H) + M_{ℓ2}(H) = ∆C_{ℓ2}(h, x) = t and sup_{x ∈ X} sup_{y ∈ Y}{|h(x) − y| + |µ(x) − y|} (E_{ℓ1}(h) − E*_{ℓ1}(H) + M_{ℓ1}(H)) = |h(x) − y| ∆C_{ℓ1}(h, x) = (h(x) − y)² = t.

C. Proofs of generalization bound

Theorem 4.11. Assume that the distribution is symmetric and that the conditional distribution and the hypothesis set H are bounded by B > 0. Then, for any δ > 0, with probability at least 1 − δ over the draw of an i.i.d. sample S of size m, the following squared loss estimation bound holds for the empirical surrogate minimizer ĥ_S:

    E_{ℓ2}(ĥ_S) − E*_{ℓ2}(H) ≤ Γ(M_L(H) + 4 R^L_m(H) + 2 B_L √(log(2/δ)/(2m))) − M_{ℓ2}(H),

where Γ(t) = sup_{x ∈ X} sup_{y ∈ Y}{|ĥ_S(x) − y| + |µ(x) − y|} t for L = ℓ1, Γ(t) = 2t/((8B)^{p−2} p(p−1)) for L = ℓp with p ∈ (1, 2], Γ(t) = t^{2/p} for L = ℓp with p ≥ 2, Γ(t) = [max{2B/δ, 2}/pmin(δ)] t for L = ℓδ, and Γ(t) = t/(2 pmin(ϵ)) for L = ℓsq_ϵ.

Proof.
By the standard Rademacher complexity bounds (Mohri et al., 2018), for any δ > 0, with probability at least 1 − δ over the draw of an i.i.d. sample S of size m, the following holds for all h ∈ H:

    |E_L(h) − Ê_{L,S}(h)| ≤ 2 R^L_m(H) + B_L √(log(2/δ)/(2m)).

Fix ϵ > 0. By the definition of the infimum, there exists h′ ∈ H such that E_L(h′) ≤ E*_L(H) + ϵ. By the definition of ĥ_S as an empirical minimizer, we have

    E_L(ĥ_S) − E*_L(H) = E_L(ĥ_S) − Ê_{L,S}(ĥ_S) + Ê_{L,S}(ĥ_S) − E*_L(H)
      ≤ E_L(ĥ_S) − Ê_{L,S}(ĥ_S) + Ê_{L,S}(h′) − E*_L(H)
      ≤ E_L(ĥ_S) − Ê_{L,S}(ĥ_S) + Ê_{L,S}(h′) − E_L(h′) + ϵ
      ≤ 2 [2 R^L_m(H) + B_L √(log(2/δ)/(2m))] + ϵ.

Since the inequality holds for all ϵ > 0, it implies

    E_L(ĥ_S) − E*_L(H) ≤ 4 R^L_m(H) + 2 B_L √(log(2/δ)/(2m)).

Plugging this inequality into the bounds of Theorems 4.1, 4.4 and 4.6 completes the proof.

D. Proofs of adversarial regression

D.1. Proof of Theorem 5.1

Theorem 5.1. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Assume that pmin(δ) = inf_{x ∈ X} P(0 ≤ µ(x) − y ≤ δ | x) is positive. Then, for any ν ≥ 3B and all h ∈ H, the following bound holds:

    Ẽ_{ℓ2}(h) − Ẽ*_{ℓ2}(H) ≤ [max{2B/δ, 2}/pmin(δ)] (E_{ℓδ}(h) − E*_{ℓδ}(H)) + ν E[sup_{x′: ‖x − x′‖ ≤ γ} |h(x′) − h(x)|].

Proof. By (1), we have

    Ẽ_{ℓ2}(h) − Ẽ*_{ℓ2}(H) ≤ E_{ℓ2}(h) − E*_{ℓ2}(H) + ν E[sup_{x′: ‖x − x′‖ ≤ γ} |h(x′) − h(x)|]
      ≤ [max{2B/δ, 2}/pmin(δ)] (E_{ℓδ}(h) − E*_{ℓδ}(H)) + ν E[sup_{x′: ‖x − x′‖ ≤ γ} |h(x′) − h(x)|].    (Corollary 4.2)

This completes the proof.

D.2. Proof of Theorem 5.2

Theorem 5.2. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Then, for any ν ≥ 3B and all h ∈ H, the following bound holds:

    Ẽ_{ℓ2}(h) − Ẽ*_{ℓ2}(H) ≤ Γ(E_{ℓp}(h) − E*_{ℓp}(H)) + ν E[sup_{x′: ‖x − x′‖ ≤ γ} |h(x′) − h(x)|],

where Γ(t) = t^{2/p} for p ≥ 2, Γ(t) = 2t/((8B)^{p−2} p(p−1)) for p ∈ (1, 2), and Γ(t) = 4B t for p = 1.

Proof. By (1), we have

    Ẽ_{ℓ2}(h) − Ẽ*_{ℓ2}(H) ≤ E_{ℓ2}(h) − E*_{ℓ2}(H) + ν E[sup_{x′: ‖x − x′‖ ≤ γ} |h(x′) − h(x)|]
      ≤ Γ(E_{ℓp}(h) − E*_{ℓp}(H)) + ν E[sup_{x′: ‖x − x′‖ ≤ γ} |h(x′) − h(x)|],    (Corollary 4.5)

with Γ as specified above. This completes the proof.

D.3. Proof of Theorem 5.3

Theorem 5.3. Assume that the distribution is symmetric, the conditional distribution is bounded by B > 0, and the hypothesis set H is realizable and bounded by B > 0. Assume that pmin(ϵ) = inf_{x ∈ X} P(µ(x) − y ≥ ϵ | x) is positive. Then, for any ν ≥ 3B and all h ∈ H, the following bound holds:

    Ẽ_{ℓ2}(h) − Ẽ*_{ℓ2}(H) ≤ [E_{ℓsq_ϵ}(h) − E*_{ℓsq_ϵ}(H)] / (2 pmin(ϵ)) + ν E[sup_{x′: ‖x − x′‖ ≤ γ} |h(x′) − h(x)|].

Proof. By (1), we have

    Ẽ_{ℓ2}(h) − Ẽ*_{ℓ2}(H) ≤ E_{ℓ2}(h) − E*_{ℓ2}(H) + ν E[sup_{x′: ‖x − x′‖ ≤ γ} |h(x′) − h(x)|]
      ≤ [E_{ℓsq_ϵ}(h) − E*_{ℓsq_ϵ}(H)] / (2 pmin(ϵ)) + ν E[sup_{x′: ‖x − x′‖ ≤ γ} |h(x′) − h(x)|].    (Corollary 4.7)

This completes the proof.
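For concreteness, the following is a minimal sketch (ours, not the paper's implementation) of the kind of training objective suggested by Theorem 5.1: an empirical Huber term plus ν times the smoothness term sup_{‖x′ − x‖ ≤ γ} |h(x′) − h(x)|. It assumes a linear hypothesis set with ℓ2-bounded perturbations, in which case that term equals γ‖w‖2; all constants (δ, ν, γ, step size, synthetic data) are placeholders.

import numpy as np

# Sketch of a surrogate objective for adversarial regression:
#   empirical Huber loss + nu * gamma * ||w||_2   (linear h(x) = w . x).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

delta, nu, gamma, lr = 1.0, 0.1, 0.05, 0.01
w = np.zeros(5)

def huber_grad(r, d=delta):            # derivative of the Huber loss in the residual
    return np.where(np.abs(r) <= d, r, d * np.sign(r))

for _ in range(2000):
    r = X @ w - y
    grad = X.T @ huber_grad(r) / len(y)            # gradient of the empirical Huber term
    norm_w = np.linalg.norm(w)
    if norm_w > 0:
        grad = grad + nu * gamma * w / norm_w      # subgradient of nu * gamma * ||w||_2
    w = w - lr * grad
print(w)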