# Reweighting Improves Conditional Risk Bounds

Published in Transactions on Machine Learning Research (01/2025)

Yikai Zhang (Yikai.Zhang@morganstanley.com), Machine Learning Research, Morgan Stanley
Jiahe Lin (Jiahe.Lin@morganstanley.com), Machine Learning Research, Morgan Stanley
Fengpei Li (Fengpei.Li@morganstanley.com), Machine Learning Research, Morgan Stanley
Songzhu Zheng (Songzhu.Zheng@morganstanley.com), Machine Learning Research, Morgan Stanley
Anant Raj (Anant.Raj@morganstanley.com), Machine Learning Research, Morgan Stanley
Anderson Schneider (Anderson.Schneider@morganstanley.com), Machine Learning Research, Morgan Stanley
Yuriy Nevmyvaka (Yuriy.Nevmyvaka@morganstanley.com), Machine Learning Research, Morgan Stanley

Reviewed on OpenReview: https://openreview.net/forum?id=MvYddudHuE
Correspondence to: Yikai Zhang (zhangyikai91@gmail.com); equal contribution.

Abstract

In this work, we study the weighted empirical risk minimization (weighted ERM) schema, in which an additional data-dependent weight function is incorporated when the empirical risk is minimized. We show that under a general balanceable Bernstein condition, one can design a weighted ERM estimator that achieves superior performance in certain sub-regions over the one obtained from standard ERM, and the superiority manifests itself through a data-dependent constant term in the error bound. These sub-regions correspond to large-margin ones in classification settings and low-variance ones in heteroscedastic regression settings, respectively. Our findings are supported by evidence from synthetic data experiments.

1 Introduction

The empirical risk minimization (ERM) schema plays an important role in tackling modern machine learning tasks. Given a set of samples $S_n \triangleq \{z_i \triangleq (x_i, y_i)\}_{i=1}^n$ that are often assumed i.i.d., ERM estimates the target hypothesis $f^*$ within some hypothesis class $\mathcal{F}$ by minimizing the empirical risk based on $S_n$, that is, $\hat{f} \in \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{z_i \in S_n} \ell(f, z_i)$, where $\ell$ is some loss function. The method is widely adopted due to its ease of use and generalization power. Let $z \triangleq (x, y)$ and $f^* \in \arg\min_{f \in \mathcal{F}} \mathbb{E}[\ell(f, z)]$; the excess risk is defined as
$$\text{Excess Risk} \triangleq \mathbb{E}\,\ell(\hat{f}, z) - \mathbb{E}\,\ell(f^*, z).$$
Under suitable conditions, ERM achieves low excess risk with high probability. Specifically, it is known to achieve a minimax optimal rate of $O(n^{-1/2})$ for learning tasks under general settings (Vapnik & Chervonenkis, 1974; Talagrand, 1994; Boucheron et al., 2005), and an $O(n^{-1})$ fast rate in more restrictive ones (Koltchinskii & Panchenko, 2000; Mendelson, 2002a; Bartlett et al., 2005; Boucheron et al., 2005), where matching lower bounds have also been established (Massart & Nédélec, 2006; Zhivotovskiy & Hanneke, 2018). More recently, alternative risk minimization schemas that outperform ERM on certain sub-regions have been proposed in several works. Notably, Namkoong & Duchi (2017) introduce a distributionally robust optimization schema that is provably superior to ERM in scenarios where ERM is susceptible to noise, e.g., when the loss is piecewise linear and the resulting under/over-estimation of the slope has a sizable impact on the solution. In Xu & Zeevi (2024), a carefully designed moment penalization function is introduced, and it achieves a superior rate to ERM within the large-variance hypothesis class.
Bousquet & Zhivotovskiy (2021) introduces a rejection option to achieve fast rate of convergence in the absence of margin conditions; note that these conditions are typically required for establishing fast rate for ERM in settings absent of the rejection option. The aforementioned studies point to different directions of improvement over the standard ERM. In this work, we leverage the problem-dependent structure to improve upon ERM, which provably outperforms in high confidence regions. Specifically, instead of minimizing directly the empirical loss, a re-weighted version based on some weight function ω( ) is considered, which leads to an estimator that minimizes the weighted empirical risk of the following form: bfw ERM arg minf F 1 n i=1 ω(xi)ℓ(f; zi). (1) We show that such an estimator improves the conditional excessive risk, given by Ez ℓ( bf; z) ℓ(f ; z)|{ω(x) > c} , with the conditioning region characterized by a level set associated with the data-dependent weight. Such an improvement can be of interest to practitioners in settings akin to those considered in the selective learning literature, where the focus is on model outcomes associated with selected sub-regions. Our approach is inspired by an analysis of a Bernstein-type condition of the form Var[h(z)] BE[h(z)] for all h in some function class H (e.g., Bartlett & Mendelson, 2006), which is often required for local Rademacher complexity-based analysis (Bartlett et al., 2005). In particular, one way to derive the O(n 1) fast rate is to leverage some variance-aware inequalities such as the Bernstein inequality (Boucheron et al., 2005) and Talagrand s inequality (Talagrand, 1994; Bartlett et al., 2005), which gives rise to intermediate results of the form: Ez[ℓ( bf, z) ℓ(f , z)] 1 zi Sn {ℓ( bf, zi) ℓ(f , zi)} + a Var[ℓ( bf, z) ℓ(f , z)] for some problem-dependent constant a and b. Under a Bernstein-type condition, the variance term on the right-hand side of inequality (2) is replaced by BEz[ℓ( bf, z) ℓ(f , z)]. The multiplier B is usually chosen in a conservative way; for example, it is chosen as the inverse of the minimum margin present in the Massart noise condition (Massart & N ed elec, 2006) for classification problems or the uniform upper bound of regression loss in bounded regression settings (Boucheron et al., 2005; Bartlett et al., 2005). On the other hand, a vanilla conservative choice of B in Var[h] BE[h] is not necessarily optimal. Under the weighted ERM schema where a carefully designed weighted empirical risk in (1) is considered, the Bernstein condition could be balanced as: Var[ω(x)h(x)] E[ω(x)h(x)]. Crucially, the constant B can be eliminated, which subsequently leads to a tighter bound (up to some problem-dependent constant) in excess risk in certain sub-regions. Specifically, within the context of classification and heteroscedastic bounded regression settings, these regions can be characterized as the large-margin and the low-variance ones, respectively. More generally, the conclusion is applicable to learning tasks whose loss function satisfies a Balanceable Bernstein Condition when the weight function can be designed accordingly. Contribution The major contribution of this paper can be summarized as follows. Published in Transactions on Machine Learning Research (01/2025) We consider the weighted ERM schema and investigate its theoretical properties; in particular, we show that it admits a tighter bound on excessive risk up to some problem-dependent constant, conditional on specific sub-regions. 
The enhanced conditional excessive risk bound also implies an improved unconditional excessive risk bound, provided that the noise satisfies certain conditions. See Table 1 for a comparison. Empirically, we demonstrate that with a properly designed weight function, one can achieve superior performance in selected regions, respectively for heteroscedastic regression and classification tasks.

The proofs of the theorems rely on an alternative version of Theorem 3.3 from Bartlett et al. (2005), which is based on a relaxed Bernstein-type condition that allows for an $\varepsilon$ additive error. The exact arguments and strategies adopted are technical tools that are of independent interest to the community. As a by-product, the $O(1/n)$ fast rate for learning the variance function $\sigma^2(x)$ derived in the regression example (Theorem 4.6) improves over existing results in Zhang et al. (2023), which have an $O(1/\sqrt{n})$ rate.

| Task | Setting | ERM | Weighted ERM |
| --- | --- | --- | --- |
| Classification | risk function | $\ell(f;z) \triangleq \mathbf{1}\{f(x) \neq y\}$ | $\omega \cdot \ell(f;z) \triangleq \hat{\omega}(x)\,\mathbf{1}\{f(x) \neq y\}$ |
| | general setting | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z)] \lesssim \varepsilon/\gamma$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z)] \lesssim \varepsilon/\gamma$ |
| | large-margin region $R \triangleq \{\omega^*(x) > c\}$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z) \mid R] \lesssim \frac{\varepsilon}{\gamma\, P(R)}$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z) \mid R] \lesssim \frac{\varepsilon}{c\, P(R)}$ |
| | low-margin diminishing $(1 - P(R))\,c^2 \lesssim \varepsilon$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z)] \lesssim \varepsilon/\gamma$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z)] \lesssim \varepsilon/c$ |
| Regression | risk function | $\ell(f;z) \triangleq (f(x) - y)^2$ | $\omega \cdot \ell(f;z) \triangleq (f(x) - y)^2/\hat{\sigma}^2(x)$ |
| | general setting | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z)] \lesssim \varepsilon/\gamma^2$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z)] \lesssim \varepsilon/\gamma^2$ |
| | low-variance region $R \triangleq \{\sigma^2_*(x) < 1/c\}$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z) \mid R] \lesssim \frac{\varepsilon}{\gamma^2 P(R)}$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z) \mid R] \lesssim \frac{\varepsilon}{\gamma c\, P(R)}$ |
| | large-variance diminishing $(1 - P(R))\,c \lesssim \varepsilon$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z)] \lesssim \varepsilon/\gamma^2$ | $\mathbb{E}_z[\ell(\hat{f};z) - \ell(f^*;z)] \lesssim \varepsilon/(\gamma c)$ |

Table 1: Comparison between existing results and ours. Let $z \triangleq (x, y)$. ERM entries for classification are implied by Massart & Nédélec (2006), where the excessive risk bound is minimax optimal; ERM entries for regression are implied by Bartlett et al. (2005); the weighted ERM entries are established in this work. Throughout the table, all results are derived given sample size $\widetilde{\Theta}(\frac{1}{\varepsilon})$. In the classification task, $f^*(x)$ is the Bayes optimal classifier and $\omega^*(x)$ is the margin function calculated as $2P[y = f^*(x) \mid x] - 1$; $\hat{\omega}(x)$ is the approximated margin and $\gamma$ represents the minimum margin. For the regression task, $f^*(x) = \mathbb{E}[y \mid x]$, $\sigma^2_*(x) = \mathrm{Var}(y \mid x)$, and $\hat{\sigma}^2(x)$ is an approximation of $\sigma^2_*(x)$; $1/\gamma$ is taken to be the maximum possible variance. We denote $P(\cdot)$ as the probability mass of a set of events. For the general region, weighted ERM recovers at least the same rate as ERM. For the large-margin region or low-variance region, weighted ERM improves ERM by a factor of $(c/\gamma) > 1$ on the conditional excessive risk bound. In the case of classification we provide a lower bound construction showing the tightness of the conditional risk (Theorem 4.4). The improved conditional excessive risk of weighted ERM also implies an improved original excessive risk under the low-margin or large-variance diminishing condition. To clarify, it is not necessary for the value of $c$ to be large; having $c$ as a constant, such as 0.1, is sufficient, especially when $\gamma$ approaches a vanishing value, e.g., when $\gamma$ is of order $\varepsilon$.

2 Related Work

In this section, we provide a brief literature review of related work.
On the theory front, we review some representative results in standard ERM, under both the slow rate and the fast rate regimes; on the methodology front, we establish connections to existing literature where Gaussian maximum likelihood estimation (MLE) is performed in heteroscedastic regression settings, and note that MLE coincides with weighted ERM when ℓ2 loss is used with the inverse variance function being the weight. Finally, given the connection between the results established in this work and those in selective learning, we also briefly review results for the case where a rejection option is allowed. Published in Transactions on Machine Learning Research (01/2025) Empirical risk minimization The combination of VC-tool and Rademacher complexity analysis (Vapnik & Chervonenkis, 1974; Koltchinskii & Panchenko, 2000; Mendelson, 2002b; Boucheron et al., 2005; Bartlett et al., 2005) constitutes one of the most widely-adopted frameworks in deriving risk bounds for ERM. Under the slow rate regime, its generalization error bound is at the rate of O(n 1/2), and such rate can be established using a uniform convergence argument (Vapnik & Chervonenkis, 1974; Talagrand, 1994; Boucheron et al., 2005) over an unrestricted function class. On the other hand, fast rate can be achieved if one moves away from an arbitrary hypothesis (Mendelson, 2002a; Boucheron et al., 2005; Bartlett et al., 2005). By restricting to a subset of the function class that has small variance and considering the Rademacher averages associated with this small subset (i.e., localized ) as a complexity term, sharper bounds can be obtained visa-vis the case where global averages are used. In particular, under a Bernstein-type condition (Massart & N ed elec, 2006), fast rate up to O(n 1) can be obtained in well-specified (or realizable, equivalently) settings where the optimal model lies within the hypothesis class. For binary classification, this condition can be satisfied by imposing Massart or Tsybakov noise condition (Mammen & Tsybakov, 1999; Tsybakov, 2004; Massart & N ed elec, 2006). The same fast rate can be obtained for bounded regression (Bartlett et al., 2005; Massart & N ed elec, 2006) or learning problems whose loss function satisfies strong convexity and Lipschitz condition (Klochkov & Zhivotovskiy, 2021), or when it is self-concordant (Bach, 2010) or exp-concave (Koren & Levy, 2015). Maximum likelihood estimation and weighted ERM The weighted empirical risk minimization has been adopted to tackle several challenges in machine learning including distribution shifting (Cortes et al., 2008; 2010; Ge et al., 2024), censored observation (Ausset et al., 2022) and reinforcement learning (Jiang & Li, 2016; Xie et al., 2019; Min et al., 2021). In particular, in heteroscedastic regression settings where the variance depends on the input, a negative Gaussian likelihood-based formulation coincides with weighted ERM, when the latter uses ℓ2 loss and the inverse of the conditional variance as the weight. In particular, in the view of weighted ERM, samples with higher conditional variance (and potentially more noisy) are down-weighted and therefore contribute less to the overall empirical risk. The conditional variance can be estimated using either parametric models (Daye et al., 2012; Zhang et al., 2023) or kernel methods (Cawley et al., 2004). 
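To make the MLE/weighted-ERM correspondence explicit, consider the Gaussian negative log-likelihood of a single observation with mean $f(x_i)$ and conditional variance $\sigma^2(x_i)$ (a textbook identity, stated here only for illustration):
$$-\log p(y_i \mid x_i) = \frac{(y_i - f(x_i))^2}{2\sigma^2(x_i)} + \frac{1}{2}\log \sigma^2(x_i) + \frac{1}{2}\log(2\pi).$$
When the variance function is held fixed (e.g., plugged in from a separate estimate), the terms involving $\sigma^2(x_i)$ do not depend on $f$, so maximizing the likelihood over $f$ reduces to
$$\hat{f} \in \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \omega(x_i)\,(y_i - f(x_i))^2, \qquad \omega(x_i) = \frac{1}{\sigma^2(x_i)},$$
i.e., weighted ERM with the squared loss and inverse-variance weights.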
Despite the wide applicability of such a formulation (Kendall & Gal, 2017; Lakshminarayanan et al., 2017; Shah et al., 2022; Seitzer et al., 2022), little has been done in comparing the sample efficiency between estimators based on the weighted and those of the standard ERM. In this work, we aim to fill this gap, by explicitly analyzing the sample efficiency in the weighted ERM case and juxtaposing its performance to that in the standard ERM. Reject option and selective risk Along a line of work that slightly deviates from the ERM is the learning with rejection option; e.g., Chow s reject option model (Chow, 1970). The model allows one to refrain from making predictions on hard instances at the inference stage with some abstention cost; better precision is obtained in exchange for lower coverage, and the selective risk is only evaluated on the covered subset. Such a learning schema has applications in domains such as finance (Pidan & El-Yaniv, 2011) and health care (Hanczar & Dougherty, 2008), and can be generalized to other tasks (Mozannar & Sontag, 2020). Extensive work has been done to understand the trade-off between coverage ratio and precision (Herbei & Wegkamp, 2006; Bartlett & Wegkamp, 2008; Yuan & Wegkamp, 2010; El-Yaniv et al., 2010; Cortes et al., 2016), along with the statistical properties (Bousquet & Zhivotovskiy, 2021; Zhang et al., 2024) associated with the learning procedure. At the conceptual level, weighted ERM can be viewed as a soft counterpart to learning with the rejection option. In particular, instead of adopting a hard in-or-out rule to improve selective risk, performing weighted ERM can lead to similar benefit in a soft manner, with the weight typically coming from plug-in estimates (Herbei & Wegkamp, 2006; Bartlett & Wegkamp, 2008; Franc et al., 2023). In practice, the weight can be the estimated margin or the inverse variance function, in classification and regression settings, respectively. 3 A Road Map Notation We denote vectors by bold-faced letters (e.g., x) and scalar variables by lower-case regular ones (e.g., y). Let (X, P, B) be a probability space, where X denotes the sample space, P the probability measure and B the Borel σ-field. The expectation with respect to the probability distribution of a random Published in Transactions on Machine Learning Research (01/2025) vector x X is denoted by Ex[ ]; the subscript is omitted whenever there is no ambiguity in the respective context. For a random vector z (x, y) defined on the product sample space Z X Y (equipped with the corresponding Borel σ-field), its probability measure is denoted by P. We use z D to denote that z is sampled from data generative process (DGP) D. Throughout the remainder of this manuscript, we use xi with a subscript i to denote the ith random sample drawn from X; yi and zi are analogously defined. We also denote W : X 7 [0, c1] as a hypothesis classes with finite peudo-dimension d P < . Finally, we use Θ( ) and O( ) to denote counterparts of the standard Θ( ), O( ) notation while suppressing the poly-logarithmic factors; the notation f g means f cg for some universal constant c, and is analogously defined. We say g f if cg f cg for some universal constants c and c. Problem setup Consider a sequence of i.i.d. samples z1:n {zi = (xi, yi)}n i=1 drawn from sample space Z, with xi s being the input and yi s the output; let f : X 7 Y, f F and ℓbe some loss function that is bounded and Lipschitz. 
In this study, we focus on analyzing the excess risk of the ERM and the weighted ERM estimators (see definitions below) on certain sub-regions.

Definition 1 (Empirical risk and the ERM estimator). Given $z_{1:n}$ and $f \in \mathcal{F}$, for loss function $\ell: \mathcal{F} \times \mathcal{Z} \mapsto [0, a]$, we define the empirical loss as $L_{\mathrm{ERM}}(f; z_{1:n}) = \frac{1}{n}\sum_{i=1}^n \ell(f; z_i)$; the ERM estimator is given by
$$\hat{f}_{\mathrm{ERM}} \in \arg\min_{f \in \mathcal{F}} L_{\mathrm{ERM}} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \ell(f; z_i). \qquad (3)$$

Definition 2 (Weighted empirical risk and the weighted ERM estimator). Given $z_{1:n}$, $f \in \mathcal{F}$ and some weight function $\omega \in \mathcal{W}: \mathcal{X} \mapsto [0, c_1]$, for loss function $\ell: \mathcal{F} \times \mathcal{Z} \mapsto [0, a]$, we define the weighted empirical loss as $L_{\text{w-ERM}}(f; z_{1:n}) = \frac{1}{n}\sum_{i=1}^n \omega(x_i)\ell(f; z_i)$; the weighted ERM estimator is correspondingly given in the form of
$$\hat{f}_{\text{w-ERM}} \in \arg\min_{f \in \mathcal{F}} L_{\text{w-ERM}} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \omega(x_i)\ell(f; z_i). \qquad (4)$$

We provide a brief overview of the main results established in Section 4. As briefly mentioned in Section 1, under the weighted ERM schema, an improved bound can be derived by encapsulating the weight function inside the appropriate terms of the Bernstein-type condition, namely,
$$\underbrace{\mathrm{Var}[h] \leq B\,\mathbb{E}[h]}_{\text{(vanilla Bernstein-type condition)}} \qquad \text{versus} \qquad \underbrace{\mathrm{Var}[\omega(x)h(x)] \leq \mathbb{E}[\omega(x)h(x)]}_{\text{(balanceable Bernstein condition)}}.$$
Such an encapsulation step does not alter the rate of $O(n^{-1})$ in the risk bound; however, the constant on which the bound depends will change, which then leads to a tighter bound for selected regions, as manifested through an improved problem-dependent constant. This selected region is determined by the ratio $B/(B'\omega(x))$; both $B$ and $B'$ potentially depend on some $\gamma$ that characterizes the uniform lower or upper bound of the weight, depending on the setting (classification or regression). Finally, note that a Bernstein-type condition can be satisfied by imposing some margin condition or boundedness, in the classification setting with 0-1 loss and in heteroscedastic regression settings with $\ell_2$ loss, respectively.

To derive the desired risk bounds, we adopt the following road map and formalize the arguments in Section 4. We first establish the risk bounds of the weighted ERM estimator (in Theorem 4.1 for classification and in Theorem 4.5 for regression, resp.), assuming that the weight function $\hat{\omega}(x)$ used in the estimation is sufficiently close to the true one $\omega^*(x)$. Note that $\omega^*(\cdot)$ coincides with the margin function in the case of classification and the inverse of the variance function in the case of regression.
Note that our results point to the possibility of leveraging weighted ERM to achieve superior performance in certain regions, provided that the weight function is carefully designed and the region is chosen accordingly. See also results presented in Section 5 based on synthetic data experiments that attests to our theoretical claims, as well as how an estimate of the weight function can be readily obtained from data, and the weighted ERM schema be operationalized in two steps. 4 Main Results Before stating our main results that are applicable to a more general setting, we first present results for two specific ones, namely, classification under margin condition (Section 4.1) and bounded heteroskedastic regression (Section 4.2) settings. Our result suggests that weighted ERM can improve the standard ERM error bound by a problem-dependent constant, in regions with high margin in the former and those with low variance in the latter. In Section 4.3, we present our main results for a more general setting; specifically, we show that a similar property for weighted ERM holds, as long as the loss function under consideration satisfies a balanceable Bernstein type condition. 4.1 Classification with/without margin condition We formalize the classification problem considered in this paper, which largely follows from the general setting of Massart/benign label noise (Massart & N ed elec, 2006; Hanneke, 2009; Diakonikolas et al., 2019). Let F {f : X 7 { 1, 1}} be a hypothesis class with finite VC-dimension d VC(F) < , G {η : X 7 [0, 1]} be a hypothesis class with finite pseudo-dimension d P (G) < , and W {ω : X 7 [γ, 1]} be some other hypothesis class with some fixed constant 0 γ 1. The data generative process (DGP) for z (x, y) can be characterized as follows: ( 1 with probability η (x) 1 with probability 1 η (x) , where η (x) P(y = 1|x). (5) For the family of DGP described in (5), η (x) captures the conditional probability. The label y can be equivalently viewed as satisfying y|x 2Ber(η (x)) 1, where Ber(p) denotes a Bernoulli random variable with success rate p. Let f be the target hypothesis or the Bayes optimal classifier, defined as f (x) arg max c { 1,1} P(y = c|x), The VC dimension of F = {f : X 7 { 1, 1}} is the largest integer d such that SF(d) = 2, where SF(k) is the value of the growth function, namely, the largest cardinality {(f(x1), f(x2), ..., f(xk)) : f F} among all x1, ..., xk X (Vapnik & Chervonenkis, 1971). The pseudo-dimension of G = {g : X 7 [l, u]} is the VC-dimension of the hypothesis class H = {h : X R 7 { 1, 1} | h(x, t) = sign(g(x) t), g G, t R, x X} (Pollard, 1990). Published in Transactions on Machine Learning Research (01/2025) that is, f (x) is the value of c { 1, 1} that maximizes the conditional probability, which also satisfies f (x) = 21{η (x) 1 2} 1. Let ω be the feature-dependent margin function given by ω (x) 2 P(y = f (x)|x) 1, and note that ω (x) |η (x) 1/2|. Throughout the remainder of this subsection, we consider the well-specified setting (Massart & N ed elec, 2006; Bousquet & Zhivotovskiy, 2021) by assuming f F, η G and ω W. We choose ℓ(f, z) = 1{f(x) = y} as the loss function and propose to use the margin function ω (x) as the weight to perform a weighted ERM; this leads to a weighted risk of the form ω (x)1{f(x) = y}. In practice, when ω (x) is not available, one can use its estimate while still achieving similar results under mild assumptions. We first give an informal overview of the results established. 
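Before giving that overview, a concrete reference sketch for the objects just introduced may help: the following minimal Python snippet samples from a DGP of the form (5), forms the Bayes classifier $f^*$, the margin weight $\omega^*(x) = 2P(y = f^*(x)\mid x) - 1$, and evaluates the weighted 0-1 empirical risk. The specific choice of $\eta^*$ below is made up purely for illustration and is not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta_star(x):
    # Illustrative conditional probability P(y = 1 | x); any map into [0, 1] works.
    return 1.0 / (1.0 + np.exp(-3.0 * x))

def sample_dgp(n):
    # y | x ~ 2 Ber(eta*(x)) - 1, as in (5).
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(rng.uniform(size=n) < eta_star(x), 1, -1)
    return x, y

def bayes_classifier(x):
    # f*(x) = 2 * 1{eta*(x) >= 1/2} - 1.
    return np.where(eta_star(x) >= 0.5, 1, -1)

def margin_weight(x):
    # omega*(x) = 2 P(y = f*(x) | x) - 1 = |2 eta*(x) - 1|.
    return np.abs(2.0 * eta_star(x) - 1.0)

def weighted_empirical_risk(f_values, x, y, weight_fn):
    # (1/n) sum_i omega(x_i) * 1{f(x_i) != y_i}, the weighted 0-1 empirical risk.
    return np.mean(weight_fn(x) * (f_values != y))

x, y = sample_dgp(5000)
f_star = bayes_classifier(x)
print("weighted risk of f*: ", weighted_empirical_risk(f_star, x, y, margin_weight))
print("weighted risk of -f*:", weighted_empirical_risk(-f_star, x, y, margin_weight))
```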
Typically, a standard margin condition (Massart & N ed elec, 2006; Bousquet & Zhivotovskiy, 2021) requires the minimum margin to satisfy γ infx ω (x) > 0. For standard ERM, Bernstein-type condition (Bartlett et al., 2005) is satisfied in the form of: Varz[1{y = f(x)} 1{y = f (x)}] (1/γ) Ez[(1{y = f(x)} 1{y = f (x)})] whereas in the case of weighted ERM, it is satisfied in the following form: Varz[ω (x)(1{y = f(x)} 1{y = f (x)})] Ez[ω (x)(1{y = f(x)} 1{y = f (x)})]. (6) Crucially, in the latter case, the 1/γ factor is removed, which subsequently leads to an improved bound. In particular, Equation (6) does not require the margin condition γ > 0 (Massart & N ed elec, 2006), i.e., γ can be zero; this suggests that even if the vanilla empirical risk does not satisfy the Bernstein condition, it is still possible to utilize a weight function to establish a Bernstein-type condition. Theorem 4.1 formally states this result. Theorem 4.1 (Risk Bound for the case of Classification). Suppose that we have bω( ) W s.t. Ex[(bω(x) ω (x))2] ε is satisfied. Let Sn = {(xi, yi)}n i=1 be i.i.d. samples drawn according to the DGP described in (5), and they are independent of the ones used for estimating bω. Let bfw ERM be the weighted ERM estimator as defined in (4), with the weight substituted by bω(xi) s. Then the following hold simultaneously with probability at least 1 δ for some ε > 0, δ > 0: Ez[ω (x)(ℓ( bfw ERM; z) ℓ(f ; z))] ε, Ez[bω(x)(ℓ( bfw ERM; z) ℓ(f ; z))] ε, (7) provided that the sample size n satisfies n d V C(F) log( 1 The next theorem shows that there exists a DGP that conforms with the description in (5), such that when the estimation is performed based on samples drawn from it, the risk of the weighted ERM estimator on some sub-region can be arbitrarily close to zero, provided that the sample size grows commensurately; on the other hand, the risk of the ERM estimator on the same region is bounded below by a constant, irrespective of the sample size. Here the sub-region is characterized by a large-margin level set, and the risk on this sub-region can be viewed as the selective risk. Theorem 4.2. There exists a DGP that belongs to the DGP family satisfying (5), such that under the same assumption as in Theorem 4.1, when the estimation is performed based on i.i.d. samples Sn = {(xi, yi)}n i=1 drawn from it, the following hold simultaneously for the ERM estimator (as per defined in (3)) and weighted ERM estimator (as per defined in (4) with the weight function substituted by bω(xi) s), with sample size n = 64d log(d) log( 1 with probability at least 0.1, Ex[1{ bf ERM = f }|ω (x) > γ] 0.015; with probability at least 1 δ, Ex[1{ bfw ERM = f }|ω (x) > γ] ε. Remark 1 (On the bounds established). Some discussion of the implications of the bounds established in Theorems 4.1 and 4.2 is provided next. First note that the low/high margin region can be characterized through {x : ω(x) < c} (and {x : ω(x) > c}, resp.); the excess risk bound, given by Ez[1{ bf = y} 1{f = y}], can be further derived for weighted ERM based on (7). This enables a direct comparison between the bound of the weighted ERM and that of ERM, depending on the property of P(ω (x) c). Published in Transactions on Machine Learning Research (01/2025) Improved bounds in the large-margin region. 
For any c [0, 1), given large-margin region characterized by {x : ω (x) > c} with P(ω (x) > c) > 0 and Θ 1 ε samples, the risk bound of an ERM estimator (e.g., Massart & N ed elec, 2006, equation (7)) implies the following excess risk within the region: Ez[1{ bf ERM(x) = y} 1{f (x) = y}|ω (x) c] ε γP(ω (x) > c); for the weighted ERM estimator, it achieves Ez[1{ bfw ERM(x) = y} 1{f (x) = y}|ω (x) c] ε c P(ω (x) > c), (8) which improves the ERM bound by a factor of (γ/c). In particular, in Theorem 4.2, a lower bound construction for the conditional risk on the large-margin region is presented, to demonstrate that there exists a scenario where the ERM estimator fails with constant probability, while the weighted ERM one achieves standard PAC guarantee (Valiant, 1984). Note that under Chow s rejection rule (Chow, 1970) which optimally balances the trade-off between coverage and accuracy, the non-reject region coincides precisely with the large-margin region described above. These analyses establish the benefit of the weighted ERM procedure where data is weighed by the margin function ω (x). In particular, its major advantage lies in the region where c is large compared to γ, and such advantage vanishes as (γ/c) 1. Later in Theorem 4.4, a pathological scenario achieving a matching lower bound is presented, which implies the minimax optimality of the bound presented in (8). Improved bounds under the low-margin diminishing condition. The low-margin diminishing condition holds when there exists c such that P(ω (x) c)c2 ε. One can view the set {x|ω (x) ε} as the collection of x with low margin , whose corresponding y s have high label noise. The condition P(ω (x) c)c2 ε describes settings where the low margin set has diminishing mass. Under such a condition, Equation (8) implies the following improved excessive risk bound: Ez[1{ bf = y} 1{f = y}] ε/c, (9) which enjoys improvement of a factor γ/c when compared to the excessive risk of ERM. The condition P(ω (x) c)c2 ε could be satisfied when c ε and P(ω (x) c) 0.1. The derivation is presented in Appendix A.2. Fast rate with/without margin condition. Under standard margin condition γ > 0, based on the Corollary 3 from Massart & N ed elec (2006), both ERM and weighted ERM achieves a Ez[ω (x)(1{ bf = y} 1{f = y})] ε excessive risk with Θ 1 ε samples, which is the informationtheoretic optimality (Massart & N ed elec, 2006; Zhivotovskiy & Hanneke, 2018). Notably, Theorem 4.1 implies such a result given the fact that ω (x) γ. When the standard margin condition is not satisfied, e.g., for cases x X with η (x) = 0.5, the corresponding y effectively becomes a pure noise. In this case, a fast rate of O(1/n) cannot be attained due to these extremely noisy data points. However, it is still possible to attain a fast rate with a slight modification of the vanilla empirical risk. Specifically, in Theorem 2.1 of Bousquet & Zhivotovskiy (2021) establishes risk bounds that enjoy a fast rate under Chow s rejection option framework (Chow, 1970), without requiring a margin condition. This is achieved by introducing an artificial margin through the inclusion of a rejection option, which allows for the removal of those extremely noisy data points during both the learning and inference processes. Similarly, the weighted-ERM approach follows a similar principle by utilizing a weight function ω (x) to down-weigh such extremely noisy data points with ω (x) = 0. 
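For completeness, the step from (7) to the conditional bound (8) can be spelled out as follows (a sketch that ignores constants, writing $\hat{f}$ for $\hat{f}_{\text{w-ERM}}$). For the 0-1 loss, $\mathbb{E}_y[\ell(\hat{f}; z) - \ell(f^*; z) \mid x] = \omega^*(x)\,\mathbf{1}\{\hat{f}(x) \neq f^*(x)\}$, so the first inequality in (7) reads
$$\mathbb{E}_x\big[\omega^*(x)^2\,\mathbf{1}\{\hat{f}(x) \neq f^*(x)\}\big] \lesssim \varepsilon.$$
On $R \triangleq \{\omega^*(x) > c\}$ we have $\omega^*(x)^2 \geq c\,\omega^*(x)$, hence
$$\mathbb{E}_z\big[\ell(\hat{f}; z) - \ell(f^*; z) \,\big|\, R\big] = \frac{\mathbb{E}_x\big[\omega^*(x)\,\mathbf{1}\{\hat{f}(x) \neq f^*(x)\}\,\mathbf{1}_R\big]}{P(R)} \lesssim \frac{\varepsilon}{c\,P(R)},$$
which is exactly the bound in (8).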
Both Theorems 4.1 and 4.2 require that the surrogate bω be reasonably close to ω , and such a condition is presented in the form of Ex[(bω(x) ω (x))2] ε. One may wonder if the task of approximating ω is statistically more challenging than learning the classifier itself in terms of sample efficiency. Theorem 4.3 addresses this concern and suggests that under standard assumptions, the required condition holds with high probability and the sample complexity of learning ω ( ) is comparable to that of learning f ( ). Published in Transactions on Machine Learning Research (01/2025) Theorem 4.3 (Risk bound for estimating ω ). Given i.i.d. samples S n = {(xi, yi)}n i=1 drawn from the DGP described in (5), let bη = arg minη G 1 n P zi S n(η(xi) yi)2. Then the following holds with probability at least 1 δ for some ε > 0, δ > 0: Ex[(bη(x) η (x))2] ε, provided that the sample size n satisfies n d P (G) log( 1 δ ) ε . Further, let bω(x) |bη(x) 1/2|, and the above result further implies that Ex[(bω(x) ω (x))2)] ε. Remark 2. Both Theorems 4.1 and 4.2 effectively assume the availability of a sufficiently-well estimated weight function bω( ), and that the subsequent weighted estimation procedure is carried out on samples Sn independent of those used for obtaining bω( ). In practice, this can be operationalized via sample-splitting; namely, one equally splits the training set into two halves at random, and uses one half for weight estimation and the other half for obtaining bf. Such a procedure would increase the generalization error bound by a factor of two, in exchange for the problem-dependent constant improvement in the large margin sub-region. The following corollary can be derived by combining Theorems 4.1 and 4.3, which takes into account all the randomness embedded in the training samples: Corollary 1. Under the assumptions in Theorems 4.1 and 4.3, let bω be the one defined in Theorem 4.3, then the following hold simultaneously with probability at least 1 2δ for some ε > 0: Ez[ω (x)(ℓ( bfw ERM; z) ℓ(f ; z))] ε, Ez[bω(x)(ℓ( bfw ERM; z) ℓ(f ; z))] ε. (10) Next we present our lower bound results for the conditional excessive risk bound. Theorem 4.4. There exists a set of DGPs that is in accordance with the description in (5), such that for i.i.d. samples Sn = {(xi, yi)}n i drawn from (any) such DGPs with n 1 ε, the following inequality holds: inf f F sup D c Ez D[1{f(x) = y} 1{f (x) = y}, ω (x) c] ε. Theorem 4.4 suggests that for all f F, there exists D where the following holds: Ez D[1{f(x) = y} 1{f (x) = y}|ω (x) c] ε c P(ω (x) > c); in other words, there exists a family of data generating process satisfying (5), such that the conditional excessive risk of any estimator irrespective of the learning procedure cannot improve upon the bound established in (8), in the absence of any additional assumptions. The theorem implies that the conditional excessive risk bound of weighted ERM estimator in (8) is tight. Selective inference in practice While the optimal weight function ω (x) is not available in practice, Theorem 4.1 admits an ε-approximation of ω (x) in the PAC sense, that is, one can have control over the selective risk of the weighted ERM estimator reasonably well using any good estimates of the margin function. Note that using the plug-in estimate is a standard procedure adopted in the literature related to selective classification (Herbei & Wegkamp, 2006), where users have the option to abstain from predicting data with high uncertainty or low margin. 
This is pertinent in scenarios where prioritizing accuracy in the low uncertainty region (conditional risk) takes precedence over accuracy across the entire domain (unconditional risk). The result in Theorem 4.1 suggests the same, that by using a good estimate of the margin to re-weight the empirical risk, one can improve the conditional risk at the inference stage. 4.2 Bounded heteroscedastic regression The regression problem considered in this section is formalized next. Let y R be generated according to y = f (x) + p σ2 (x) ξ, x X, (11) Published in Transactions on Machine Learning Research (01/2025) where f (x) E(y|x), σ2 (x) Var(Y |x), and ξ ( c2, c2) is some bounded noise with zero mean and unit variance. f (x) is effectively the target hypothesis. Without loss of generality, let F {f : X 7 [ 1, 1]}, G {σ2 : X 7 [c3, 1/γ]}, 0 < c3 < 1 be hypothesis classes with finite pseudo-dimensions d P (F) < and d P (G) < , respectively. Note that for the range of σ2, we assume that c3 is bounded away from 0 and therefore the variance is non-vanishing; on the other hand, the upper bound satisfies 1/γ 1 and thus we do not preclude scenarios where the variance dominates the mean function, similar to the settings considered in Zhang et al. (2023). Throughout this subsection, we consider the well-specified learning setting, namely, f (x) F, σ2 (x) G. Additionally, let W {ω : X 7 [γ, 1/c3]} be some other hypothesis class for the weight function ω (x), which satisfies ω (x) 1/(σ2) (x) To perform weighted ERM, we adopt the mean-squared-error loss given by ℓ(f, z) = (y f(x))2, and a data-dependent weight that coincides with the inverse of the variance function σ 2(x); this leads to weighted loss function of the form (y f(x))2 σ2 (x) . Note that in practice, σ2 (x) is unavailable and one can replace it with some estimate, while still achieving similar results, provided that the estimate approximates the truth reasonably well. We first give a high-level account of the results established. Let σ2 (x) 1/γ be the maximum variance as defined in Zhang et al. (2023), then the following holds (see derivation in the appendix): Varz[(y f(x))2 (y f (x))2] (8/γ) Ez[(y f(x))2 (y f (x))2]. (12) The expression in (12) suggests that the Bernstein-type condition is satisfied with B = 4/γ; an analysis similar to Corollary 3.7 in Bartlett et al. (2005) further leads to the following risk bound Ex[(f (x) bf ERM(x))2] = O 1 γ2n . For weighted ERM, σ (x) (or ω (x) 1/σ2 (x), equivalently) is introduced to balance the inequality in (12), resulting in the Bernstein-type condition to hold in the following form: σ2 (x) C(y f (x))2 σ2 (x) C(y f (x))2 where C = 1/2(1 + 4/c3). Once again, leveraging the results in Bartlett et al. (2005) gives Ex[ 1 σ2 (x)(f (x) bfw ERM(x))2] = O 1 γn , which effectively removes the 1/γ factor. These statements are formally stated next. Theorem 4.5. Suppose we have bσ2 G s.t. Ex[(1/bσ2(x) 1/σ2 (x))2] ε/c2 3. Given i.i.d samples Sn = {(xi, yi)}n i=1 that are drawn according to the DGP and independent of those used for obtaining bσ2, let bfw ERM be the weighted ERM estimator defined in Equation (4) with the weight function substituted by 1/bσ2( ); then the following holds simultaneously with probability at least 1 δ for some ε > 0, δ > 0: 1 σ2 (x)( bfw ERM(x) f (x))2 ε and Ex 1 bσ2(x)( bfw ERM(x) f (x))2 ε, provided that the sample size n satisfies n d(F)(log( 1 ε )+log(1/γ) log(c3)+log( 1 δ )) γεc2 3 . Remark 3. The risk bounds in Theorem 4.5 could be made analogous to those in Theorem 4.1. 
A standard analysis using techniques in Bartlett et al. (2005) implies that the ERM estimator achieves Ex ( bf ERM(x) f (x))2 ε using Θ 1 γ2ε samples whereas the weighted ERM one achieves Ex 1 σ2 (x)( bfw ERM(x) f (x))2 γε samples. Similar to the conclusions in the classification task, weighted ERM achieves sample efficiency at least comparable to ERM in the general region, and is superior in the small-variance region as depicted by σ2 (x) 1/c, with a problem-dependent constant γ/c improvement. Additionally, by using the negative log-likelihood loss, the sample complexity of learning σ2 ( ) is comparable to that of learning f ( ), as presented next in Theorem 4.6. Next we provide some guarantees for learning the σ2(x) function. Here we seek to obtain bσ2 by minimizing the negative log-likelihood loss, while restricting (µ, σ2) to be in the hypothesis class e F e G, e F F, e G G, so that the normalized residual square is bounded: (f (x) µ(x)+ρ σ2(x) 4c2 2, ρ ( c2, c2), x X. Published in Transactions on Machine Learning Research (01/2025) Theorem 4.6 (Risk bound for estimating σ2 ). Let S n = {(xi, yi)}n i=1 be i.i.d. samples drawn according to the DGP in (11) and bµ, bσ2 arg min(µ,σ2) e F e G 1 n (xi,yi) S n h log(σ2(xi)) + (yi µ(xi))2 Then for any ε > 0, δ > 0, the following holds with probability at least 1 δ: 1 bσ2(x) 1 σ2 (x) provided that the sample size satisfies n T1T 3 2 (d P (G) + d P (F)(T3 + T4 + T5) where T1 = (1 + c2 2 + 1/c2 3), T2 = (c2 2 + log2(1/γ)), T3 = log(1/c2 3 + c2 2/c2 3 + c2 2/c2 3γ), T4 = log( 1 ε), T5 = log( 1 Note that the PAC guarantee for learning bσ2(x) has been studied in Zhang et al. (2023) which admits a bound at the rate of O 1 n , whereas the bound in Theorem 4.6 is of the order O 1 n . See Appendix A.9 for a discussion that compares the two and the proof for the above theorem. The following corollary combines Theorems 4.5 and 4.6 and takes into account all the randomness embedded in the samples: Corollary 2. Under the setting considered in Theorem 4.5 and Theorem 4.6, with bσ2(x) be the one defined in Theorem 4.6, the following hold simultaneously with probability at least 1 2δ: 1 σ2 (x)( bf(x) f (x))2 ε and Ex 1 bσ2(x)( bf(x) f (x))2 ε. 4.3 General case Next, we generalize the classification and regression examples presented in Sections 4.1 and 4.2 above. Consider hypothesis class F {f : X 7 [ 1, 1]} with complexity measure d(F) < , and W {ω : X 7 [γ, c1]} with complexity measure d(W) < , and c1 > 0, γ > 0 . Assume that the target hypothesis f F and the true weight ω W. A balanceable Bernstein-type condition is given in the following assumption, together with some other technical assumptions required for the loss function. Assumption 1. Let D : F F X 7 [0, b] be uniformly bounded function that captures the excess risk. The following are assumed to hold for the loss function ℓ( , ): Lipschitzness and uniform boundedness. z, f |ℓ(f, z) ℓ(f , z)| a; z, f1, f2 |ℓ(f1, z) ℓ(f2, z)| L|f1 f2|. (14) Under semi-random noise label. Suppose the DGP satisfies semi-random noise label (Diakonikolas et al., 2021; Pia et al., 2022) and the following are satisfied: Ez[ℓ(f, z) ℓ(f , z)] = Ex[ω (x)D(f (x), f(x))] Ey[(ℓ(f, z) ℓ(f , z))2] D(f (x), f(x)) (15) Note that here we effectively assume that ω is bounded away from zero to accommodate both the case of classification and regression. However, note that in the case of classification, the assumption ω being bounded away from zero is equivalent to the margin condition from Massart & N ed elec (2006). 
The weighted ERM framework is actually agnostic to this condition and allows ω to be zero. Published in Transactions on Machine Learning Research (01/2025) Balanceable Bernstein condition. Under the same semi-random noise label assumption for the DGP, there exists a uniformly bounded ω(x) that the following holds: Varz[ω(x)(ℓ(f, z) ℓ(f , z))] Ez[ω(x)(ℓ(f, z) ℓ(f , z))]. (16) For (16), if one replaces ω by its uniform lower bound, standard Bernstein-type condition Var[h(x)] BEx[h(x)] can be recovered with ω(x) = 1 and B = 1/γ. On the other hand, there exists ω(x) that balances the ratio between Var[ω(x)h(x)] and E[ω(x)h(x)], such that Bernstein-type condition holds with an improved multiplier. In particular, one can set ω( ) ω ( ) so that B = 1, as in (6) and (13). Remark 4 (Assumption 1 in classification and regression settings). We provide concrete examples for how expressions in Assumption 1 manifest in classification and regression settings, respectively. For classification settings, let ℓ(f; z) 1{f(x) = y} and D(f , bf, x) 1{f (x) = bf(x)}. It is easy to verify that (14) is satisfied with L = 1, a = 1; (15) is satisfied with ω (x) |2η (x) 1| (see derivation in Appendix A.1); and (16) holds as established in (6) For regression settings, let ℓ(f; z) (y f(x))2 and D(f , bf, x) σ2 (x) C (f (x) bf(x))2 for some constant C. (14) is satisfied with a = 8 + (8c2 2/ γ), L = c2/ γ where b = 4/(Cγ), c1 = 1 2; (15) holds with ω (x) C σ2 (x); and (16) holds as established in (13). Theorem 4.7. Suppose Assumption 1 holds. Let f F and ω W, and suppose we have bω W s.t. Ex[(bω(x) ω (x))2] ε b. Given i.i.d. samples Sn = {(xi, yi)}n i=1 drawn from the data generative process, let bfw ERM arg minf F 1 n P zi Sn bω(x)ℓ(f; zi). Then for any ε > 0, δ > 0, the following holds simultaneously with probability at least 1 δ: Ex[bω2(x)D(f , bfw ERM, x)] ε, Ex[ω 2(x)D(f , bfw ERM, x)] ε, and Ex[bω(x)ω (x)D(f , bfw ERM, x)] ε, as long as the sample size requirement is satisfied: n c2 1a2(d(F) log( 1 ε) + log(c1L) + log( 1 δ )) ε + c1a log( 1 Remark 5. The proof of Theorem 4.7 is deferred to the appendix. A major hurdle in completing the proof comes from the inaccessibility of ω (x) and thus one needs to use bω(x) instead; this is out of practical consideration as one can only access an estimated version. To overcome this challenge, it suffices to require Ex[(bω(x) ω (x))2] ε/b, with which one can show the weighted empirical risk satisfies an ε additive error version of the Bernstein type condition, namely, Var[h] BE[h] + ε. To this end, we prove an alternative version of Theorem 3.3 in Bartlett et al. (2005) under this relaxed Bernstein-type condition, and show that the weighted ERM achieves a fast rate in the generalization error bound. 5 Synthetic Data Experiments We present results from synthetic data experiments to support our theoretical claims, respectively for regression and classification settings. For both settings, we follow a two-step procedure, in which we first perform ERM to obtain estimates for the mean and the weight, followed by a reweighting step. Subsequently, we compare two sets of estimates resp. from ERM and weighted ERM in terms of their selective risk, where the selective set is chosen over a range with varying coverage determined by the variance or the margin, depending on the setting under consideration. 
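A minimal sketch of this two-step schema, written generically, is given below; the function names and the sample-splitting choice (which follows Remark 2) are ours for illustration, and the experiments in this section parametrize both steps with MLPs rather than the placeholders used here.

```python
import numpy as np

def two_step_weighted_fit(x, y, fit_weight, fit_weighted_model, rng=None):
    """Generic two-step weighted ERM sketch (assumed interface, for illustration only).

    1) ERM step on one half of the data to estimate the weight function
       (margin in classification, inverse variance in regression);
    2) weighted ERM step on the other half, reweighting each loss term.

    `fit_weight(x, y) -> weight_fn` and `fit_weighted_model(x, y, w) -> model`
    are user-supplied estimators.
    """
    rng = np.random.default_rng(rng)
    n = len(x)
    idx = rng.permutation(n)
    half = n // 2
    i1, i2 = idx[:half], idx[half:]

    # Step 1: estimate the weight function on the first half (sample splitting,
    # so the weighted fit below uses samples independent of the weight estimate).
    weight_fn = fit_weight(x[i1], y[i1])

    # Step 2: weighted ERM on the second half.
    w = weight_fn(x[i2])
    model = fit_weighted_model(x[i2], y[i2], w)
    return model, weight_fn
```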
For both experiments, the size of the training set is set at 2e4, to ensure that the algorithm has access to adequate number of samples and circumvent any potential issues due to lack of fit, although empirically the conclusion broadly holds even with much smaller sample sizes. 5.1 Experiments under regression settings We consider regression settings in the presence of heteroscedasticity, similar to the ones used in Skafte et al. (2019); Seitzer et al. (2022). The true data generating process is given by a univariate regression with x R of the form y = f (x) + p σ2 (x) ξ, E(ξ) = 0; Var(ξ) = 1. Published in Transactions on Machine Learning Research (01/2025) The mean f (x) x sin(x) is a sinusoidal function; the scale function of the additive noise ξ depends on the value of x, and is given by (σ2) (x) (0.09)(1 + x2). The regressor x is sampled uniformly from [0, 10] and the noise ξ is standard Gaussian. Figure 1a provides a visualization of the data resulting from this DGP. (a) Visualization of the DGP and the estimates. The true mean f (x) and the two-standard deviation bands are in black dotted lines. The mean estimates from ERM (blue) and w ERM (orange) are in solid lines, and the shaded area corresponds to the twostandard deviation derived from the variance estimate of the ERM step. (b) Risk over the selective set with varying coverage α; the solid lines correspond the risk on the test (sub)set averaged over 10 runs of the experiment, and the shaded area correspond to 1 standard deviation. Figure 1: Regression setting: underlying true data, estimates from ERM and weighted ERM, and the selective risk We consider the following estimation procedure using ℓ2 loss; f(x) and σ2(x) are both parametrized by multi-layer perceptrons (MLP): An ERM step that gives rise to the mean and the variance estimates, that is, bf ERM(x) := arg min f i=1 (yi f(xi))2 and bσ2(x) := arg min σ2 log σ2(x) + (yi bf ERM(xi))2 A reweighting step, with the weight given by the precision estimate from the ERM step: bfw ERM(x) := arg min f i=1 ω(xi) yi f(xi) 2 where ω(xi) := 1/bσ2(xi). Once bf ERM(x) and bfw ERM(x) are obtained, on the test set, we consider evaluating their risk over a range of selective set with varying coverage α [0, 1]. Concretely, at evaluation time, the selective risk is calculated as Rα := E h f (x) f(x) 2 | σ2(x) qα(σ2) i , where qα(σ2) is the α quantile of the variance over the entire domain; a low-coverage (i.e., small α) selective set corresponds to the low-variance region. Empirically, the risk is obtained by substituting f by either bf ERM or bfw ERM for each test data point in the selective set then taking the average, with bσ(x) coming from the ERM step. The cut-off qα(σ2) is determined by the empirical quantile of the estimated σ2 on the validation set. Figure 1a presents the mean estimate from ERM (blue) and weighted ERM (orange) respectively, although the quality of the fit is satisfactory for both cases and therefore they largely overlap with the truth and becomes hard to distinguish, visually. Figure 1b displays the risk over the selective set with varying coverage α. As it can be seen from the plot, weighted ERM has an advantage over the ERM estimate in the low-coverage region, as manifested by a lower selective risk, and the advantage diminishes as the coverage α increases. This is in accordance with the theoretical results established in Section 4.2. 
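For concreteness, the selective-risk computation used in this evaluation can be sketched as follows; variable names are illustrative, and the cutoff is the empirical $\alpha$-quantile of the estimated variance on a validation set, as described above.

```python
import numpy as np

def selective_risk(f_true, f_hat, sigma2_hat, alpha, sigma2_val):
    """Empirical selective risk R_alpha on a test set (illustrative sketch).

    f_true, f_hat: true mean f*(x) and an estimate, evaluated on test points;
    sigma2_hat:    estimated variance on the same test points (from the ERM step);
    sigma2_val:    estimated variance on a validation set, used for the cutoff;
    alpha:         coverage level in (0, 1]; small alpha keeps only low-variance points.
    """
    cutoff = np.quantile(sigma2_val, alpha)   # empirical alpha-quantile q_alpha
    keep = sigma2_hat <= cutoff               # selective set: low-variance region
    if not np.any(keep):
        return np.nan
    return np.mean((f_true[keep] - f_hat[keep]) ** 2)

# Usage sketch (hypothetical arrays f_star, f_erm, f_werm, s2_test, s2_val):
# risks = {a: (selective_risk(f_star, f_erm,  s2_test, a, s2_val),
#              selective_risk(f_star, f_werm, s2_test, a, s2_val))
#          for a in np.linspace(0.1, 1.0, 10)}
```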
As a remark, the practical implication for such results is that in certain scenarios (e.g., some finance applications) where one takes actions only when there is high confidence and abstains otherwise (and therefore being selective ), a weighted ERM procedure can be leveraged to obtain more refined estimates for the region of interest where actions would take place. Published in Transactions on Machine Learning Research (01/2025) 5.2 Experiments under classification settings For classification, we consider the following data-generating process for illustration purpose, in which extremely noisy data points are present. The features xi R2 are sampled from class-conditional Gaussian with equal covariance, that is P(ci = k) = pk; P(xi|ci = k) N(µk, Σ). Here we consider 4 clusters with k {0, 0 , 1, 1 }, where p0 = 0.5, p0 = 0.25, p1 = 0.20, p1 = 0.05; µ0 = ( 10, 0) , µ0 = ( 3, 0) , µ1 = (3, 0) , µ1 = (12, 0) , Σ = [ 2 0.5 0.5 2 ]. Let ϕ (xi) P ci {1, 1 } | xi = X k {1,1 } pk p(xi; µk, Σ) / X k {0,0 ,1,1 } pk p(xi; µk, Σ) , where p(xi; µk, Σ) denotes the pdf corresponding to N(µk, Σ) evaluated at xi, and the equality follows from Bayes rule. Set yinitial i 1{ϕ (xi) > 1/2}. Further, we inject noise to the class labels for points that are in cluster 0 , such that their final labels are flipped with probability pflip, that is: yi = 1 yinitial i (w.p. pflip) if ci = 0 otherwise yinitial i . It is worth noting that given that above-mentioned DGP, the theoretical decision boundary is linear; additionally, the theoretical margin is given by ω (xi) |η (xi) 1/2|, where η (xi) 1{ϕ (xi) > 1/2} if ci = 0 otherwise pflip. In this setting, the flipping probability pflip is set at 0.49. Similar to the case of regression, we consider the following procedure that entails two steps, using the cross-entropy loss: An ERM step: we obtain the estimated margin bω(xi) = |bη(xi) 1/2| where bη(xi) is the estimated Bayes-optimal classifier, and a linear decision boundary that gives rise to the class labels bf(xi)ERM. A reweighting step with the weight set at bw(xi), which then gives rise to an updated decision boundary and the associated class label bf(xi)w ERM. The selective risk is then evaluated as Rα := E h 1{f = bf} | ω(x) qα(ω) i , f := 1{η > 0.5}. where qα(ω) is the α quantile of the margin over the entire domain; a low-coverage (i.e., small α) selective set corresponds to the large-margin region. Empirically, the evaluation is done by averaging the risk on the corresponding test (sub)set, and y is substituted by either bf(xi)ERM or bf(xi)w ERM; the cut off qα(ω) is empirically determined by the quantile of the estimated margin on the validation set. (a) Visualization of the DGP. The top panel displays the class labels after injecting label noise, and the bottom shows class labels based on the Bayes optimal classifier, respectively (b) Estimated class labels based on ERM (top) and weighted ERM (bottom), respectively (c) Risk over the select set with varying coverage α. the solid lines correspond the risk on the test (sub)set averaged over 10 runs of the experiment, and the shaded area corresponds to 1 standard deviation. Figure 2: Classification setting: underlying true data, estimates from ERM and weighted ERM and the selective risk Published in Transactions on Machine Learning Research (01/2025) Figure 2a provides a representative view of the data resulting from this DGP; Figure 2b displays the estimated class labels from the ERM (top) and the weighted ERM (bottom) step, respectively. 
As it can be seen from the figure, the latter largely aligns with the class labels dictated by the Bayes-optimal classifier, whereas the former has a noticeable number of points near the decision boundary being mis-labeled. Figure 2c provides a comparison of the selective risk for the two sets of estimates. In this specific setting, weighted ERM dominates. 6 Discussion, Limitation, and Future Work To conclude, this work investigates the generalization error bound of weighted ERM under the fast-rate regime. We show that by additionally incorporating a carefully-designed weight function in each loss term, estimators based on weighted ERM can achieve a tighter bound in selected regions by a problem-dependent constant, when compared with the one from standard ERM. This finding leads to practical applications where one can use plug-in estimates of the weight function to obtain superior performance in sub-regions through a two-step procedure, as demonstrated in our synthetic data experiments. It is worth noting that recent work Zhai et al. (2023) considers a generalized reweighting scheme where samples are reweighted dynamically during training; in spirit, this is similar to the procedure considered in our work, yet the two differ in the following aspects: (1) the starting point of Zhai et al. (2023) is the generalization of distributional robust optimization (DRO) algorithms; in particular, under a DRO setting where distributional shift is present yet the test distribution is close to the training one, at the conceptual level, training should focus on the hard cases. In our work, on the contrary, the setting under consideration can be made analogous to soft abstention, and thus the weight schema further upweights the easy cases. (2) The result established in Zhai et al. (2023) states that the generalized reweighting procedure leads to a solution that is close to the ERM one, in that the points they converge to are close; this is largely done by analyzing the properties of the estimates between successive iterates. On the other hand, this work is concerned with the statistical error bound of the weighted ERM estimator in particular, its superiority over the standard ERM one in the high-confidence region. There are several limitations of this work and directions that could be further explored. Throughout the paper, we consider well-specified settings. There are several difficulties in regards to the extension to mis-specified settings where the target hypothesis can potentially live outside of the hypothesis class in question. Possible extension includes exploring mis-specified settings with surrogate losses using tools developed in Awasthi et al. (2022); Mao et al. (2023), and leveraging several recent results which show that fast rate could be achieved under mis-specified setting via model selection aggregation (Tsybakov, 2003; Bousquet & Zhivotovskiy, 2021; Kanade et al., 2022). Other recent work that touched upon this issue includes Puchkin & Zhivotovskiy (2022), where to establish the desired results the authors require both the diameter of the hypothesis class and the star number to be finite. Alternatively, tools that can work in specific setups may be introduced to characterize the approximation error; e.g., if one considers a setting similar to that in Kohler & Langer (2021), namely, the hypothesis class being a set of fully connected DNNs and the target hypothesis being in the class of (p, C)-smooth functions, then the approximation error bound can be calibrated. 
The results developed in this work can potentially be extended to these settings, which however requires more involved analysis to handle various components (e.g., the approximation error of the weight function bω(x)) and very specific assumptions on the exact setup. Separately, our analysis is limited to Bernstein-type condition (Lee et al., 1996; Bartlett et al., 2005) of the form Var[h] BE[h]. To study classification problems under Tsybakov noise condition (Mammen & Tsybakov, 1999; Tsybakov, 2004) and regression with ℓp risk (Bartlett & Mendelson, 2006), a more generalized form E[h2] B(E[h])β, β (0, 1] is required. Finally, for some other settings such as Offset Rademacher Complexity (Liang et al., 2015) and small-ball condition (Mendelson, 2018), where the fast rate has been established, we are optimistic they can also enjoy some problem-dependent constant improvement by exploring the structure of semi-random noise label with a properly designed weight function. Published in Transactions on Machine Learning Research (01/2025) Ausset, G., Cl emen con, S., and Portier, F. Empirical risk minimization under random censorship. Journal of Machine Learning Research, 2022. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. H-consistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pp. 1117 1174. PMLR, 2022. Bach, F. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384 414, 2010. Bartlett, P. L. and Mendelson, S. Empirical minimization. Probability Theory and Related Fields, 135(3): 311 334, 2006. Bartlett, P. L. and Wegkamp, M. H. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(8), 2008. Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. The Annals of Statistics, 33(4):1497 1537, 2005. Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: A survey of some recent advances. ESAIM: probability and statistics, 9:323 375, 2005. Bousquet, O. and Zhivotovskiy, N. Fast classification rates without standard margin assumptions. Information and Inference: A Journal of the IMA, 10(4):1389 1421, 2021. Bousquet, O., Boucheron, S., and Lugosi, G. Introduction to statistical learning theory. In Summer school on machine learning, pp. 169 207. Springer, 2003. Cawley, G. C., Talbot, N. L., Foxall, R. J., Dorling, S. R., and Mandic, D. P. Heteroscedastic kernel ridge regression. Neurocomputing, 57:105 124, 2004. Chow, C. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16 (1):41 46, 1970. Cortes, C., Mohri, M., Riley, M., and Rostamizadeh, A. Sample selection bias correction theory. In Algorithmic Learning Theory: 19th International Conference, ALT 2008, Budapest, Hungary, October 13-16, 2008. Proceedings 19, pp. 38 53. Springer, 2008. Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. Advances in Neural Information Processing Systems, 23, 2010. Cortes, C., De Salvo, G., and Mohri, M. Learning with rejection. In Algorithmic Learning Theory: 27th International Conference, ALT 2016, Bari, Italy, October 19-21, 2016, Proceedings 27, pp. 67 82. Springer, 2016. Daye, Z. J., Chen, J., and Li, H. High-dimensional heteroscedastic regression with an application to e QTL data analysis. Biometrics, 68(1):316 326, 2012. Diakonikolas, I., Gouleakis, T., and Tzamos, C. Distribution-independent PAC learning of halfspaces with Massart noise. 
Diakonikolas, I., Park, J. H., and Tzamos, C. ReLU regression with Massart noise. Advances in Neural Information Processing Systems, 34:25891-25903, 2021.

Dudley, R. M. Uniform Central Limit Theorems, volume 142. Cambridge University Press, 2014.

El-Yaniv, R. et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(5), 2010.

Franc, V., Prusa, D., and Voracek, V. Optimal strategies for reject option classifiers. Journal of Machine Learning Research, 24(11):1-49, 2023.

Ge, J., Tang, S., Fan, J., Ma, C., and Jin, C. Maximum likelihood estimation is all you need for well-specified covariate shift. In The Twelfth International Conference on Learning Representations, 2024.

Hanczar, B. and Dougherty, E. R. Classification with reject option in gene expression data. Bioinformatics, 24(17):1889-1895, 2008.

Hanneke, S. Theoretical Foundations of Active Learning. Carnegie Mellon University, 2009.

Haussler, D. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217-232, 1995.

Herbei, R. and Wegkamp, M. H. Classification with reject option. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, pp. 709-721, 2006.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652-661. PMLR, 2016.

Kanade, V., Rebeschini, P., and Vaskevicius, T. Exponential tail local Rademacher complexity risk bounds without the Bernstein condition. arXiv preprint arXiv:2202.11461, 2022.

Kendall, A. and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.

Klein, P. and Young, N. E. On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms. SIAM Journal on Computing, 44(4):1154-1172, 2015.

Klochkov, Y. and Zhivotovskiy, N. Stability and deviation optimal risk bounds with convergence rate o(1/n). Advances in Neural Information Processing Systems, 34:5065-5076, 2021.

Kohler, M. and Langer, S. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231-2249, 2021.

Koltchinskii, V. and Panchenko, D. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pp. 443-457. Springer, 2000.

Koren, T. and Levy, K. Fast rates for exp-concave empirical risk minimization. Advances in Neural Information Processing Systems, 28, 2015.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.

Lee, W. S., Bartlett, P. L., and Williamson, R. C. The importance of convexity in learning with squared loss. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pp. 140-146, 1996.

Liang, T., Rakhlin, A., and Sridharan, K. Learning with square loss: Localization through offset Rademacher complexity. In Conference on Learning Theory, pp. 1260-1285. PMLR, 2015.

Mammen, E. and Tsybakov, A. B. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808-1829, 1999.

Mao, A., Mohri, M., and Zhong, Y. H-consistency bounds: Characterization and extensions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Massart, P. and Nédélec, É. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326-2366, 2006.

Mendelson, S. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977-1991, 2002a.

Mendelson, S. Rademacher averages and phase transitions in Glivenko-Cantelli classes. IEEE Transactions on Information Theory, 48(1):251-263, 2002b.

Mendelson, S. Learning without concentration for general loss functions. Probability Theory and Related Fields, 171(1-2):459-502, 2018.

Min, Y., Wang, T., Zhou, D., and Gu, Q. Variance-aware off-policy evaluation with linear function approximation. Advances in Neural Information Processing Systems, 34:7598-7610, 2021.

Mozannar, H. and Sontag, D. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning, pp. 7076-7087. PMLR, 2020.

Namkoong, H. and Duchi, J. C. Variance-based regularization with convex objectives. Advances in Neural Information Processing Systems, 30, 2017.

Pia, A. D., Ma, M., and Tzamos, C. Clustering with queries under semi-random noise. In Loh, P.-L. and Raginsky, M. (eds.), Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. 5278-5313. PMLR, 02-05 Jul 2022. URL https://proceedings.mlr.press/v178/pia22a.html.

Pidan, D. and El-Yaniv, R. Selective prediction of financial trends with hidden Markov models. Advances in Neural Information Processing Systems, 24, 2011.

Pollard, D. Empirical Processes, volume 2, pp. 43-50. Institute of Mathematical Statistics, 1990.

Puchkin, N. and Zhivotovskiy, N. Exponential savings in agnostic active learning through abstention. IEEE Transactions on Information Theory, 68(7):4651-4665, 2022. doi: 10.1109/TIT.2022.3156592.

Seitzer, M., Tavakoli, A., Antic, D., and Martius, G. On the pitfalls of heteroscedastic uncertainty estimation with probabilistic neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=aPOpXlnV1T.

Shah, A., Bu, Y., Lee, J. K., Das, S., Panda, R., Sattigeri, P., and Wornell, G. W. Selective regression under fairness criteria. In International Conference on Machine Learning, pp. 19598-19615. PMLR, 2022.

Skafte, N., Jørgensen, M., and Hauberg, S. Reliable training and estimation of variance networks. Advances in Neural Information Processing Systems, 32, 2019.

Talagrand, M. Sharper bounds for Gaussian and empirical processes. The Annals of Probability, pp. 28-76, 1994.

Tsybakov, A. B. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135-166, 2004.

Tsybakov, A. B. Optimal rates of aggregation. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pp. 303-313. Springer, 2003.

Valiant, L. G. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

van der Vaart, A. W. and Wellner, J. A. Weak Convergence and Empirical Processes. Springer, 1996.

Vapnik, V. and Chervonenkis, A. Theory of Pattern Recognition, 1974.

Vapnik, V. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264-280, 1971.

Vidyasagar, M. Learning and Generalisation: with Applications to Neural Networks. Springer Science & Business Media, 2013.
Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. Advances in Neural Information Processing Systems, 32, 2019.

Xu, Y. and Zeevi, A. Towards optimal problem dependent generalization error bounds in statistical learning theory. Mathematics of Operations Research, 2024.

Yuan, M. and Wegkamp, M. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(1), 2010.

Zhai, R., Dan, C., Kolter, J. Z., and Ravikumar, P. K. Understanding why generalized reweighting does not improve over ERM. In The Eleventh International Conference on Learning Representations, 2023.

Zhang, Y., Lin, J., Li, F., Adler, Y., Rasul, K., Schneider, A., and Nevmyvaka, Y. Risk bounds on aleatoric uncertainty recovery. In International Conference on Artificial Intelligence and Statistics, pp. 6015-6036. PMLR, 2023.

Zhang, Y., Zheng, S., Dalirrooyfard, M., Wu, P., Schneider, A., Raj, A., Nevmyvaka, Y., and Chen, C. Learning to abstain from uninformative data. Transactions on Machine Learning Research, 2024. ISSN 2835-8856.

Zhivotovskiy, N. and Hanneke, S. Localization of VC classes: Beyond local Rademacher complexities. Theoretical Computer Science, 742:27-49, 2018.

A Proofs and Discussions

In this section, we include proofs for the results established in Section 4 and the corresponding discussions. We first introduce additional notation and definitions that are used in the ensuing development. Let $[n] \triangleq \{1, \dots, n\}$. We use $\{x\}_{i \in [n]} \equiv x_{1:n}$ to denote a sample of $x$ of size $n$; $\{z\}_{i \in [n]} \equiv z_{1:n} = \{(x_i, y_i)\}_{i=1}^n$ is analogously defined, and the samples are drawn i.i.d. from $P$. Given $z_{1:n}$, we use $P_n$ to denote the empirical measure. We use $\|\cdot\|_p$ to denote the $\ell_p$ norm of vectors and $\|\cdot\|_{L_p}$ to denote the $L_p$ norm of random variables under $P$. The indicator function $1\{x \in A\}$ equals 1 when the condition $x \in A$ is true and 0 otherwise. Denote $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. Let $\sigma_1, \dots, \sigma_n$ be $n$ independent Rademacher random variables. For a function $h : \mathcal{Z} \to \mathbb{R}$, define
$$P_n h \triangleq \frac{1}{n}\sum_{i=1}^{n} h(z_i), \qquad Ph \triangleq \mathbb{E}_z[h(z)].$$
For a family of functions $\mathcal{H} \subseteq \{h : \mathcal{Z} \to \mathbb{R}\}$, the (empirical) Rademacher complexity and the Rademacher average are defined as
$$R_n \mathcal{H} = \mathbb{E}_{\sigma_{1:n}}\Big[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i h(z_i)\Big], \qquad R\mathcal{H} = \mathbb{E}_{z_{1:n}, \sigma_{1:n}}\Big[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i h(z_i)\Big].$$
We further define the local Rademacher complexity and the local Rademacher average with radius $r$ as $R_n\{h \in \mathcal{H} : P_n h^2 \le r\}$ and $R\{h \in \mathcal{H} : Ph^2 \le r\}$, respectively.

Definition 3 (Star Hull; see also Bousquet et al. (2003); Bartlett et al. (2005)). The star hull of a set of functions $\mathcal{F}$ is defined as $\star\mathcal{F} \triangleq \{\alpha f : f \in \mathcal{F}, \alpha \in [0, 1]\}$.

Definition 4 (Sub-Root Function (Bousquet et al., 2003; Bartlett et al., 2005)). A function $\psi : [0, \infty) \to [0, \infty)$ is sub-root if (i) $\psi$ is non-decreasing, (ii) $\psi$ is non-negative, and (iii) $\psi(r)/\sqrt{r}$ is non-increasing. We say $r^*$ is a fixed point of $\psi$ if $\psi(r^*) = r^*$.

The following definitions of VC-dimension and pseudo-dimension can be found in Vapnik & Chervonenkis (1971), Pollard (1990), and van der Vaart & Wellner (1996). We include them here for the sake of completeness.

Definition 5 (VC-dimension; Vapnik & Chervonenkis (1971)). The VC-dimension $d_{VC}(\mathcal{F})$ of a hypothesis class $\mathcal{F} = \{f : \mathcal{X} \mapsto \{1, -1\}\}$ is the largest cardinality of a set $S \subseteq \mathcal{X}$ such that for every $S' \subseteq S$ there exists $f \in \mathcal{F}$ with
$$f(x) = \begin{cases} 1 & \text{if } x \in S', \\ -1 & \text{if } x \in S \setminus S'. \end{cases}$$

Definition 6 (Pseudo-dimension; Pollard (1990)).
The pseudo-dimension $d_P(\mathcal{G})$ of a real-valued hypothesis class $\mathcal{G} = \{g : \mathcal{X} \mapsto [l, u]\}$ is the VC-dimension of the hypothesis class $\mathcal{H} = \{h : \mathcal{X} \times \mathbb{R} \mapsto \{-1, 1\} \mid h(x, t) = \mathrm{sign}(g(x) - t),\ g \in \mathcal{G},\ t \in \mathbb{R}\}$.

A.1 Derivation of Equation 6

The following derivation establishes the connection between $\mathbb{E}_z[\ell(f; z) - \ell(f^*; z)]$ and $\mathbb{E}_x[\omega^*(x)\,1\{f \neq f^*\}]$; it can be found in Boucheron et al. (2005) and is included here for completeness.
$$\begin{aligned}
\mathbb{E}_z[\ell(f; z) - \ell(f^*; z)] &= \mathbb{E}_{x,y}\big[1\{y \neq f(x)\} - 1\{y \neq f^*(x)\}\big] \\
&= \mathbb{E}_{x}\big[\mathbb{P}[y = f^*(x) \mid x]\,1\{f^*(x) \neq f(x)\} + \mathbb{P}[y \neq f^*(x) \mid x]\,1\{f^*(x) = f(x)\} - \mathbb{P}[y \neq f^*(x) \mid x]\big] && (17)\\
&= \mathbb{E}_{x}\big[\mathbb{P}[y = f^*(x) \mid x]\,1\{f^*(x) \neq f(x)\} - \mathbb{P}[y \neq f^*(x) \mid x]\,1\{f^*(x) \neq f(x)\}\big] \\
&= \mathbb{E}_{x}\big[\big(\mathbb{P}[y = f^*(x) \mid x] - \mathbb{P}[y \neq f^*(x) \mid x]\big)\,1\{f^*(x) \neq f(x)\}\big] \\
&= \mathbb{E}_x\big[\omega^*(x)\,1\{f \neq f^*\}\big].
\end{aligned}$$
Next, we derive Equation 6 (a Monte Carlo sanity check of this balanced condition is sketched in Appendix B):
$$\begin{aligned}
\mathrm{Var}_{x,y}\big[\omega^*(x)(1\{y \neq f(x)\} - 1\{y \neq f^*(x)\})\big] &\le \mathbb{E}_{x,y}\big[\omega^{*2}(x)(1\{y \neq f(x)\} - 1\{y \neq f^*(x)\})^2\big] \\
&= \mathbb{E}_x\big[\omega^{*2}(x)\,1\{f^*(x) \neq f(x)\}\big] \\
&= \mathbb{E}_{x,y}\big[\omega^*(x)(1\{y \neq f(x)\} - 1\{y \neq f^*(x)\})\big].
\end{aligned}$$

A.2 Derivation of Equation 9

To see Equation 9, we first bound $\mathbb{E}_x[1\{\omega^*(x) \ge c\}\,\omega^*(x)\,1\{f \neq f^*\}]$. By leveraging Equation (17) together with the bound $\mathbb{E}_{x,y}[\omega^*(x)(1\{f \neq y\} - 1\{f^* \neq y\})] \le \varepsilon$, the following holds:
$$\mathbb{E}_{x,y}\big[\omega^*(x)(1\{f \neq y\} - 1\{f^* \neq y\})\big] = \mathbb{E}_x\big[\omega^{*2}(x)\,1\{f \neq f^*\}\big] \le \varepsilon, \qquad (18)$$
which implies
$$c\,\mathbb{E}_x\big[1\{\omega^*(x) \ge c\}\,\omega^*(x)\,1\{f \neq f^*\}\big] \le \mathbb{E}_x\big[1\{\omega^*(x) \ge c\}\,\omega^{*2}(x)\,1\{f \neq f^*\}\big] \le \varepsilon, \qquad (19)$$
and thus $\mathbb{E}_x[1\{\omega^*(x) \ge c\}\,\omega^*(x)\,1\{f \neq f^*\}] \le \varepsilon/c$. In addition,
$$\mathbb{E}_x\big[1\{\omega^*(x) < c\}\,\omega^*(x)\,1\{f \neq f^*\}\big] = \mathbb{E}_x\big[\omega^*(x)\,1\{f \neq f^*\} \,\big|\, \omega^*(x) < c\big]\,\mathbb{P}(\omega^*(x) < c) \le c\,\mathbb{P}(\omega^*(x) < c). \qquad (20)$$
Combining Equations (19) and (20) gives
$$\mathbb{E}_{x,y}\big[1\{f \neq y\} - 1\{f^* \neq y\}\big] = \mathbb{E}_x\big[\omega^*(x)\,1\{f \neq f^*\}\big] = \mathbb{E}_x\big[1\{\omega^*(x) \ge c\}\,\omega^*(x)\,1\{f \neq f^*\}\big] + \mathbb{E}_x\big[1\{\omega^*(x) < c\}\,\omega^*(x)\,1\{f \neq f^*\}\big] \le c\,\mathbb{P}(\omega^*(x) < c) + \frac{\varepsilon}{c}.$$

A.3 Derivation of Equation 12

Equation 12 is a standard result; we include the derivation here for the sake of completeness. Writing $y = f^*(x) + \xi\sqrt{\sigma^{2*}(x)}$,
$$\begin{aligned}
\mathrm{Var}_{x,y}\big[(y - f(x))^2 - (y - f^*(x))^2\big] &= \mathrm{Var}_{x,\xi}\Big[(f^*(x) - f(x))\big(f^*(x) + f(x) - 2f^*(x) - 2\xi\sqrt{\sigma^{2*}(x)}\big)\Big] \\
&\le \mathbb{E}_{x,\xi}\Big[(f^*(x) - f(x))^2\big(f(x) - f^*(x) - 2\xi\sqrt{\sigma^{2*}(x)}\big)^2\Big] \\
&\le \mathbb{E}_x\Big[(f^*(x) - f(x))^2\big((f(x) - f^*(x))^2 + 4\sigma^{2*}(x)\big)\Big] \\
&= \mathbb{E}_x\Big[\mathbb{E}_y\big[(y - f(x))^2 - (y - f^*(x))^2 \,\big|\, x\big]\,\big((f(x) - f^*(x))^2 + 4\sigma^{2*}(x)\big)\Big],
\end{aligned}$$
which, under the bounded heteroscedastic setting considered here, is at most a problem-dependent constant (determined by the uniform bounds on $f$, $f^*$, and $\sigma^{2*}(x)$, and hence by $\gamma$) times $\mathbb{E}_{x,y}\big[(y - f(x))^2 - (y - f^*(x))^2\big]$.

A.4 Proof of Theorem 4.1

The proof readily follows from invoking Theorem 4.7 with $\ell(f; z) \triangleq 1\{f(x) \neq y\}$, $\omega^*(x) = |2\eta^*(x) - 1|$, and $D(f_1, f_2, x) = 1\{f_1 \neq f_2\}$. It can be easily verified that Assumption 1 is satisfied with $L = 1$, $a = 1$, $b = 1$, $\gamma = \inf_x |2\eta^*(x) - 1|$.

A.5 Proof of Theorem 4.3

Proof. Let $R(\eta, z_i) \triangleq (\eta(x_i) - y_i)^2$. We define the following function class:
$$\mathcal{H} \triangleq \Delta_R \circ \mathcal{G} \triangleq \big\{\Delta_R(\eta; \eta^*, z) = R(\eta; z) - R(\eta^*; z) : \eta \in \mathcal{G}\big\},$$
which is a composition of the hypothesis class $\mathcal{G}$, the loss function $R$, and the difference operation $\Delta$. For simplicity we write $\Delta_{R,\eta} = \Delta_R(\eta; \eta^*, z)$ when there is no ambiguity. By definition of $\hat\eta$, we have $P_n R(\hat\eta; z) \le P_n R(\eta^*; z)$, and hence $P_n \Delta_{R,\hat\eta} \le 0$. Next we bound $P\Delta_{R,\hat\eta} - P_n\Delta_{R,\hat\eta}$. Since $|y - \eta(x)| \in [0, 1]$ and $\eta(x) \in [0, 1]$, it can be easily verified that $\mathrm{Var}_{z \sim P}[\Delta_{R,\eta}] \le 2\,\mathbb{E}_{z \sim P}[\Delta_{R,\eta}]$. To invoke Theorem 3.3 in Bartlett et al. (2005), we need to find a sub-root function $\psi(r)$ such that $\psi(r) \ge 2\,\mathbb{E}\,R_n\{h \in \mathcal{H} : \mathbb{E}[h^2] \le r\}$. To this end, we analyze the local Rademacher average $\mathbb{E}\,R_n\{h \in \mathcal{H} : \mathbb{E}[h^2] \le r\}$:
$$\mathbb{E}\,R_n(\Delta_R \circ \mathcal{G}, r) = \mathbb{E}_{S_n, \sigma_{1:n}}\Big[\sup_{\eta \in \mathcal{G},\ \mathbb{E}[\Delta_{R,\eta}^2] \le r} \frac{1}{n}\sum_{i=1}^{n} \sigma_i \Delta_{R,\eta}(z_i)\Big].$$
The following analysis largely follows the proof of Corollary 3.7 in Bartlett et al. (2005). Since $\Delta_{R,\eta}$ is uniformly bounded by 2, for any $r \ge \psi(r)$, Corollary 2.2 in Bartlett et al. (2005) implies that with probability at least $1 - \frac{1}{n}$, $\{h \in \mathcal{H} : Ph^2 \le r\} \subseteq \{h \in \mathcal{H} : P_n h^2 \le 2r\}$.
Let E {h H : Ph2 r} {h H : Pnh2 2r}, then the following holds: ERn{ H, Ph2 r} P[E]E[Rn{ H, Ph2 r}|E] + P[Ec]E[Rn{ H, Ph2 r}|Ec] E[Rn{ H, Pnh2 2r}] + 2 Since r = ψ(r ), r satisfies the following r 20BERn{ H, Pnh2 2r } + 22 log n Next we leverage Dudley s entropy integral (Dudley, 2014) to upper bound ERn{ H, Pnh2 2r }, using the integral of covering number. Specifically, by applying the chaining bound, it follows from Theorem B.7 (Bartlett et al., 2005) that E[Rn( H, Pnh2 2r )] const n E Z log N2(ε, H, x1:n)dε, (22) where const represents some universal constant. Next we bound the covering number log N2(ε, H, x1:n) by log N2(ε, G, x1:n). We show that for all x1:n, any ε2-cover of G is a ε-cover of H. Specifically, let V [0, 1]n be an ε-cover of H on x1:n so that for all η G, v1:n V so that q i [n](η(xi) vi)2 ε. Now we show Published in Transactions on Machine Learning Research (01/2025) that for any z1:n, the family of (vi yi)2 (η i yi)2, i [n] is an ε-cover for H, where η i P(yi = 1|xi): v u u t 1 i [n] ((vi yi)2 (η yi)2 R,η)2 = i [n] ((vi yi)2 (η(xi) yi)2)2 i [n] (vi η(xi))2(vi + η(xi) 2yi)2 4ε. Combine the above inequality with Corollary 3.7 from Bartlett et al. (2005), we have the following upper bound on the entropy number of H: log N2(ε, H, x1:n) log N2 ε + 1 log N2 Next we bound const n E R log N2(ε, H, x1:n)dε from (22). Note that by Haussler s covering number bound (Haussler, 1995), the following inequality holds: log N2 ε 8, G, x1:n cd P log 1 ε , we can therefore derive the following inequalities: const n E Z log N2(ε, H, x1:n)dε const n E Z const n E Z d P (G)r log(1/r ) n2 + d P (G)r log(n/ed P (G)) where const represents some universal constants that may change from line to line. Together with (21) one can solve for r d P (G) log( n d P (G)) Since P R,η ε, we have P|bη η |2 ε, given that the following equality holds: P R,η = Ez[R(bη; f , z) R(η ; f , z)] = Ex,y[(bη(x) y)2 (η (x) y)2] = Ex[|bη(x) η (x)|2]. (as η (x) E[y]) Note that since |bη 1 2 |bη η |2, the following readily follows: Ex h (|bη(x) 1 2| |η (x) 1 2|)2i ε. (23) A.6 Proof of Theorem 4.4 Proof. The proof is by construction. We construct the full support of x to be X = X1 X2. For all x X1, let ω (x) = |η (x) 1 2| c and x X2, let ω (x) = |η (x) 1 2| = 0. By decomposing the excessive risk we Published in Transactions on Machine Learning Research (01/2025) Ex,y[1{ bf = y} 1{f = y}] = Ex,y[1{ bf = y} 1{f = y}|x X1] P[x X2] + Ex,y[1{ bf = y} 1{f = y}|x X2] P[x X2] | {z } =0,since for all x X2, P(y = 1|x) = P(y = 1|x) The lower bound of term Ex,y[1{ bf = y} 1{f = y}|x X1] 1 cn could be established by Theorem 4 in Massart & N ed elec (2006). A.7 Proof of Theorem 4.2 Throughout this proof, we use superscript j to index the jth coordinate of a vector x (e.g., xj), and this is to be distinguished from the subscript i that indexes the samples. Proof. The proof is by construction. Let γ be the minimum margin. Let F = {1 sign(x β)|β Rd}, X Sd j=1 Ej, Ej α ej, α {[ 2, 1] { 0.1} [1, 2] {0.1}}, j [d] where ej is j-th standard basis. Let 21{1 sign(x) 0} 1 + 1{ x = 0.1} + 1{ x 1}γ (24) Note 1 sign(x β) could be viewed as a composition class of the d-dimensional half-space and the interval function, and its VC-dimension could be bounded as d log(d) (Vidyasagar, 2013). Consequently, the choice of n implies that Ex[1{ bfw ERM = f }|ω (x) > γ] ε for any given bω that satisfies Ex[(ω(x) ω (x))2] ε . 
Consider following data generative process: j Unif{1, 2, ..., d} 0.1, with prob γ 32 0.1, with prob γ 32 Unif(1, 2), with prob 1 3γ 32 Unif( 2, 1), with prob γ 32 x = α ej ( 1, with prob η (x) 1 with prob 1 η (x) Note that the following version of the Chernoff inequality will be applied multiple times in the proof: i Xi m E[X]| ξm E[X]] 2e ξ2m EX/3. (25) By setting ξ = 0.5 in inequality (25) and taking a union bound over E1:d we have P j, s.t., |Ej \ Sn| 64 log d log(1/δ) 32 log d log(1/δ) 2de 5 log d log(1/δ) 2de 5 log(d/δ) δ. (26) The inequality above implies that with probability at least 1 δ, j [d], we have 32d log(d) log(1/δ) γε |Ej \ Sn| 96d log(d) log(1/δ) Next we show that for each j [d], for sufficiently small γ with constant probability, bβj ERM = β . We first present our argument for the case where d = 1. W.O.L.G, we focus on the case where X = X 1. Given Published in Transactions on Machine Learning Research (01/2025) x1 1:n, let x1 ( n1):( 1) x1 (1):(n2):(n2+n3) be an ordering of xn with x1 i x1 i+1 x1 1 < 0 < x1 1 x1 n2 < x1 n2+i < x1 n2+n3 < x1 n3, where x(1):(n2) = 0.1 and n1 + n2 + n3 = n. Since 32 log(1/δ) γε n 96 log(1/δ) by Chernoff inequality in (25) with ξ = 0.5 and picking τ = γ, we have log(1/δ) ε n2 3 log(1/δ) ε and τ32 log(1/δ) γε τn3 τ96 log(1/δ) γε that hold simultaneously with probability at least 1 2δ. Consider τ fraction of positive samples: x(n2+1):( τn3 ). Given n2, by Lemma 2, we have , P (n2+ τn3 ) X i=n2+1 1{y(i) = f (x(i))} (1 2 + γ)(1 ξ)τn3 0.2e 6ξ2τn3( 1 2 +γ). (28) By inequality (28) with ξ = 3γ, we have that with probability at least 0.2e 54γ2 log(d) log(1/δ)( 1 (n2+ τn3 ) X i=n2+1 1{y(i) = f (x(i))} (1 2 + γ)(1 3γ)n3 (n2+ τn3 ) X i=n2+1 1{y(i) = f (x(i))} (1 + 2γ)(1 3γ)n3 (n2+ τn3 ) X i=n2+1 1{y(i) = f (x(i))} (1 γ n2+( τn3 ) X i=n2+1 1{y(i) = f (x(i))} + i=1 1{y(i) = f (x(i))} (n2+ τn3 ) X i=n2+1 1{y(i) = f (x(i))}. It can be easily verified that inequality (29) implies that c βj ERM x(τn3). To ensure that 0.2e 54γ2 log(d) log(1/δ)( 1 2 +γ) ε 0.12, it suffices to pick γ = q log(d) log(1/δ) 55ε . Since with probability 0.12 we have bβj ERM x(τn3) > 0.1, which implies that Ex[1{ bf ERM = f }|ω (x) > γ, x E1] 0.5. With markov inequality, we have with probability at least 0.1, 0.03 fraction of E1:d has bβj ERM > 0.1, which implies that Ex[1{ bf ERM = f }|ω (x) > γ, x Ej] 0.5. We have with probability at least 0.1, Ex[1{ bf ERM = f }|ω (x) > γ] 0.015. A.8 Proof of Theorem 4.5 The proof readily follows from Theorem 4.7 with ℓ(f; z) (y f(x))2, ω (x) = C σ2 (x) and D(f (x), bf(x)) = C (f bf)2. It can be easily verified that Assumption 1 is satisfied with a = 8 + 8c2 2 γ , b = 4 Cγ , L = c2 γ . A.9 Proof of Theorem 4.6 and discussion Our analysis of learning c σ2(x) is based on the following negative log-likelihood loss: ℓNLL(σ2, f, z) log(σ2(x)) + (y f(x))2 σ2(x) . (30) Published in Transactions on Machine Learning Research (01/2025) In particular, we seek to minimize risk of the form in Equation (30) while restricting f and σ2 to be in hypothesis classes e F F, e G G that satisfy (y f(x))2 σ2(x) 4c2 2, uniformly for all z = (x, y). For learning σ2 (x). For simplicity, we assume Eξ[(ξ2 1)2] being a constant. Verification of Bernstein-type condition We show that the risk function in Equation (30) satisfies some Bernstein-type condition. Since (y f(x))2 σ2(x) 4c2 2, it then follows that |ℓNLL(σ2, f, z) ℓNLL(σ2 , f , z)| 5c2 2 + 2 log( c2 γ ). In following analysis, we let C = 5c2 2 + 2 log( c2 γ ). 
The expectation term then satisfies: Ez[ℓNLL(σ2, f, z) ℓNLL(σ2 , f , z)] (f(x) f (x))2 σ2(x) 1 + σ2 (x) σ2(x) log σ2 (x) (f(x) f (x))2 max{ξ4, ξ2 σ2 (x) the last inequality leverages the fact that for all t > 0 t ) 1 t + (t 1)2 2 max(1, t). (33) For the variance term, the following holds: Varz[(ℓNLL(σ2, f, z) ℓNLL(σ2 , f , z))] = Ez[(ℓNLL(σ2, f, z) ℓNLL(σ2 , f , z))2] Ez[(ℓNLL(σ2, f, z) ℓNLL(σ2 , f , z))]2 (f(x) f (x))2 σ2(x) ξ2 + ξ2σ2 (x) σ2 (x)(f(x) f (x)) σ2(x) log σ2 (x) this can be further decomposed as (ξ2 1) σ2 (x) σ2(x) 1 2ξ p σ2 (x)(f(x) f (x)) (f(x) f (x))2 σ2(x) 1 + σ2 (x) σ2(x) log σ2 (x) (f(x) f (x))2 σ2(x) 1 + σ2 (x) σ2(x) log σ2 (x) (ξ2 1) σ2 (x) σ2(x) 1 2ξ p σ2 (x)(f(x) f (x)) (ξ2 1)2 σ2 (x) | {z } Term I σ2 (x)(f(x) f (x)) | {z } Term II + CEz[ℓNLL(σ2, f, z) ℓNLL(σ2 , f , z)] | {z } Term III Bounding Term I: Due to the fact that for all t > 0, max{2, 2t}( log(t) 1 + t) (t 1)2, the following holds: 2 max 1, σ2 (x) σ2(x) log σ2 (x) 2ξ2 max 1, σ2 (x) σ2(x) log σ2 (x) σ2(x) 1 2 . Published in Transactions on Machine Learning Research (01/2025) On the other hand, since for all f e F, σ2 e G, (y f(x))2 σ2(x) 4c2 2, we further have ξ2σ2 (x) σ2(x) 2c2 2 + 4/c2 3. Term I could be bounded as: (ξ2 1)2 σ2 (x) σ2(x) 1 2 4c2 2 + 4 σ2(x) log σ2 (x) Bounding Term II: Since for all f e F, σ2 e G, (y f(x))2 σ2(x) 4c2 2, we further have q σ2(x) 2c2 + 2/c3. Consequently, we have σ2 (x)(f(x) f (x)) 2 4c2 2 + 4 (f(x) f (x))2 As a result, the following holds: Varz[(ℓNLL(σ2, f, z) ℓNLL(σ2 , f , z))] C + c2 2 + 1 Ez[ℓNLL(σ2, f, z) ℓNLL(σ2 , f , z)]. Next, we move on to the proof of the theorem statement. Proof. Let B = C + c2 2 + 1 , the Bernstein constant. Define the following function class: H L e F e G ℓNLL(σ2, f; σ2 , f , z) = ℓNLL(σ2, f, z) ℓNLL(σ2 , f , z) : f e F, σ2 e G which is a composite function of hypothesis class e F e G, loss function L and the difference operation . For simplicity we let ℓ,σ2,f = ℓNLL(bσ2, bf; σ2 , f , z) when there is no ambiguity. By definition of bf, c σ2 , we have PnℓNLL(bσ2, bf, z) PnℓNLL(σ2 , f , z) and therefore Pn ℓ,σ2,f 0. Next we bound P ℓ,σ2,f Pn ℓ,σ2,f. According to (34), we have Varz[ ℓ,σ2,f] BEz[ ℓ,σ2,f]. To invoke Theorem 3.3 in Bartlett et al. (2005), we need to find a sub-root function ψ(r) such that ψ(r) 2BE Pn{ ℓ,σ2,f H : E[h2] r}. To find ψ(r), we show some analysis on the Local Rademacher Average ERn{ ℓ,σ2,f H : E[h2] r}. Note that ERn(H, r) = ESnσ1:n h sup f e F,σ2 e G,Ex,y[ ℓ,σ2,f ] r i=1 σi ℓ,σ2,f i , and we have h(x) [ C, C]. By leveraging some analysis from the proof in Corollary 3.7 in Bartlett et al. (2005), we bound {h H : Ph2 r} using {h H : Pnh2 r} where the latter could be applied in the entropy integral. Since ℓ,σ2,f is uniformly bounded by 2C, for any r ψ(r), Corollary 2.2 in Bartlett et al. (2005) implies that with probability at least 1 1 n, {h H : Ph2 r} {h H : Pnh2 2r}. Let E be event that {h H : Ph2 r} {h H : Pnh2 2r} holds, then we have ERn{ H, Ph2 r} P[E]E[Rn{ H, Ph2 r}|E] + P[Ec]E[Rn{ H, Ph2 r}|Ec] E[Rn{ H, Pnh2 2r}] + 2C2 Since r = ψ(r ), r satisfies r 20BCERn{ H, Pnh2 2r } + 22C2 log n Published in Transactions on Machine Learning Research (01/2025) Next we leverage the entropy integral (Dudley, 2014) to upper bound ERn{ H, Pnh2 2r } using the integral of covering number. Apply the chaining bound, it follows from Theorem B.7 (Bartlett et al., 2005) that E[Rn( H, Pnh2 2r )] const n E Z log N2(ε, H, x1:n)dε, (36) where const is some universal constant. 
Next we bound the covering number log N2(ε, H, x1:n) by log N2(ε, e G, x1:n) + log N2(ε, e F, x1:n). We show that for all x1:n, the composition of an ε-cover of e F and an ε-cover of e G gives rise to a an ε-cover of H. Specifically, let V [c3, 1 γ ]n be an ε-cover of e G on x1:n so that for all σ2 e G, v1:n V so that q i [n](σ2(xi) vi)2 ε 4c2 2/c3+1/γc3 , U [ 1, 1]n be an ε-cover of e F on x1:n so that for all f e F, u1:n U so that q i [n](f(xi) ui)2 ε 1/c2 3(c2 2/γ+16). Next, we show that for any z1:n, the family of vi (f (xi) yi)2 σ2 (xi) + log vi σ2 (xi) , i [n] is an ε-cover for H vi (f (xi) yi)2 σ2 (xi) + log vi σ2 (xi) ℓ,σ2,f vi (f(xi) yi)2 σ2(xi) + log vi σ2(xi) 2 vi (f(xi) yi)2 vi + (f(xi) yi)2 vi (f(xi) yi)2 σ2(xi) + log vi σ2(xi) 2 (ui f(xi))(ui + f(xi) 2yi) vi + (f(xi) yi)2 σ2(xi) (σ2(xi) vi) vi + log vi σ2(xi) 2 i [n] 1/c2 3(c2 2/γ + 16)(ui f(xi))2 + (4c2 2/c3 + 1/γc3)2(vi σ2(xi))2 Clearly, above inequality implies that N2(ε, H, x1:n) N2 ε 4c2 2/c3+1/γc3 , e G, x1:n 1/c2 3(c2 2/γ+16), e F, x1:n Combining the above inequality with Corollary 3.7 from Bartlett et al. (2005) we can bound the entropy number of H: log N2(ε, H, x1:n) log N2 ε 4c2 2/c3 + 1/γc3 , e G, x1:n 1/c2 3(c2 2/γ + 16) , e F, x1:n Next we bound const n E R log N2(ε, H, x1:n)dε from (36). Note that ε 4c2 2/c3 + 1/γc3 , e G, x1:n 1/c2 3(c2 2/γ + 16) , e F, x1:n c(d P ( e G) + d P ( e F)) log c2 2 + 1 + c2 2/γ ε 1 Published in Transactions on Machine Learning Research (01/2025) Consequently, const n E Z log N2(ε, H, x1:n)dε const n E Z (d P ( e G) + d P ( e F)) (d P ( e G) + d P ( e F))r log(1/c2 3 + c2 2/c2 3 + c2 2/c2 3γ)/r ) n (d P ( e G) + d P ( e F))2 n2 + (d P ( e G) + d P ( e F))r log((1/c2 3 + c2 2/c2 3 + c2 2/c2 3γ)n/e(d P ( e F) + d P ( e G))) n , where const corresponds to some universal constant. Together with (35) one can solve for r B2C2(d P (G) + d P (F)) log((1/c2 3 + c2 2/c2 3 + c2 2/c2 3γ)n/(d P (G) + d P (F))) n . Since P ℓNLL(bσ2, bf; σ2 , f , z) = Ex (f(x) f (x))2 σ2(x) 1 + σ2 (x) σ2(x) log σ2 (x) σ2(x) ε 1/c2 3 , one can leverage the inequality log( 1 t ) 1 t + (t 1)2 2 max(1,t2). to conclude that 1 c σ2(x) 1 σ2 (x) This completes the proof. Discussion Note that the risk bound of learning the variance function using NLL loss is also studied in Zhang et al. (2023), wherein Theorem 1 suggests a rate of order O 1 n . On the other hand, the bound in Theorem 4.6 of this work is of the order O 1 n . Such improvement of learning σ2 (x) might be of independent interest. Compared to risk bounds in Zhang et al. (2023), the major improvement comes from an application of the Local Rademacher Complexity (Bartlett et al., 2005) analysis under a Bernstein-type condition. A.10 Proof of Theorem 4.7 and discussion Proposition 1. Suppose Assumption 1 holds. Let f F and ω W, and suppose we have bω W s.t. Ex[(bω(x) ω (x))2] ε b. Given i.i.d. samples Sn = {(xi, yi)}n i=1 drawn from the data generative process, let bf arg minf F 1 n P zi Sn bω(x)ℓ(f; zi). Then for any ε > 0, δ > 0, the following holds simultaneously with probability at least 1 δ: Ex[bω2(x)D(f (x), bf(x))] ε, Ex[ω 2(x)D(f (x), bf(x))] ε, Ex[bω(x)ω (x)D(f (x), bf(x))] ε, as long as the sample size requirement is satisfied: n c2 1a2(d(F) log( 1 ε) + log(L) + log(c1) + log( 1 δ )) ε + c1a log( 1 Proof. 
We start by defining the following function class: H ω ℓ F bω(x)ℓ(f; f z) = bω(x)ℓ(f; z) bω(x)ℓ(f ; z) : f F Published in Transactions on Machine Learning Research (01/2025) which is a composite function of hypothesis class F, loss function ℓand the difference operation . To simplify the notation, we denote ℓ,f = bω(x)ℓ(f; f , z) when there is no ambiguity. Due to the fact that bω(x)ℓ( bf; z) is the empirical minimizer, we have: Pnbω(x)ℓ( bf; z) Pnbω(x)ℓ(f ; z), and thus Pn ℓ,b f 0. Leveraging such fact, next we show that P ℓ,b f is small via an empirical process argument. Note by assumption we have P ℓ,b f = Ex,y[bω(x)(ℓ( bf, z) ℓ(f , z))] = Ex[bω(x)ω (x)D(f (x), bf(x))], P 2 ℓ,b f ( P ℓ,b f)2 =Varz[bω(x)(ℓ( bf, z) ℓ(f , z))] Ez[bω2(x)(ℓ( bf, z) ℓ(f , z))2] Ex[bω2(x)D(f (x), bf(x))]. Since Ex[(bω(x) ω (x))2] ε b, and D(f (x), bf(x)) b we have Ex[(bω(x) ω (x))2D(f (x), bf(x))] ε, which implies that Ex[bω2(x)D(f (x), bf(x))] 2Ex[bω(x)ω (x)D(f (x), bf(x))] + ε. To apply Lemma 1 next we find a subroot function ψ(r) that ψ(r) 2E Pn{ ℓ,b f H : E[h2] r}. Note, we have h(x) [ c1a, c1a]. To find ψ(r), we first analyze on the Local Rademacher Average ERn{ ℓ,b f H : E[h2] r}. ERn( ω ℓ F, r) =ESnσ1:n sup f F,Ex,y[ 2 ℓ,f ] r By Lemma 3.4 from Bartlett et al. (2005), it suffices to choose ψ(r) 10c1a ERn{ H, Ph2 r} + 11c2 1a2 log n n . The following analysis largely follows from the proof in Corollary 3.7 in Bartlett et al. (2005) which aims to bound ERn{ H, Ph2 r} using ERn{ H, Pnh2 r} . Since ℓ,b f is uniformly bounded by 2c1a, for any r ψ(r), Corollary 2.2 in Bartlett et al. (2005) implies that with probability at least 1 1 n, {h H : Ph2 r} {h H : Pnh2 2r}. Let E be event that {h H : Ph2 r} {h H : Pnh2 2r} holds, above implies ERn{ H, Ph2 r} P[E]E[Rn{ H, Ph2 r}|E] + P[Ec]E[Rn{ H, Ph2 r}|Ec] E[Rn{ H, Pnh2 2r}] + 2c2 1a2 Since r = ψ(r ), r satisfies r 100c1a ERn{ H, Pnh2 2r } + 50c2 1a2 log n Next we leverage Dudley s chaining bound (Dudley, 2014) to upper bound ERn{ H, Pnh2 2r } using the integral of covering number. Apply the chaining bound, it follows from Bartlett et al. (2005) Theorem B.7 that E[Rn( H, Pnh2 2r )] const n E Z log N2(ε, H, x1:n)dε, (38) Published in Transactions on Machine Learning Research (01/2025) where const is some universal constant. Next we bound the covering number log N2(ε, H, x1:n) by log N2(ε, F, x1:n). We show that for all x1:n, any ε/c1L-cover of F is a ε-cover of H. Specifically, let V [ 1, 1]n be an ε/Lc1-cover of F on x1:n so that for all f F, v1:n V so that q i [n](f(xi) vi)2 ε c1L. Now we show that for any z1:n, the family of bω(xi)(ℓ(vi, zi) ℓ(f (xi), zi)), i [n] is an ε-cover for H: v u u t 1 bω(xi)(ℓ(vi, zi) ℓ(f (xi), zi)) ℓ,f bω(xi)(ℓ(vi, zi) ℓ( bf(xi), zi)) 2 i [n] bω2(xi)(vi f (xi))2 4ε. Next, combining the above inequality with Corollary 3.7 from Bartlett et al. (2005), we have the following inequality on the entropy number: log N2(ε, H, x1:n) log N2 ε + 1 log N2 ε 8c1L, F, x1:n Next we bound const n E R log N2(ε, H, x1:n)dε from (38). Note that by Haussler s covering number bound (Haussler, 1995), we have: log N2 ε 8c1L, F, x1:n . Plugging such bound into the entropy integral yields: const n E Z log N2(ε, H, x1:n)dε const n E Z const n E Z d(F)r log(c1L/r ) n2 + d(F)r log(c1Ln/ed(F)) where const represents some universal constant. Together with (37), one can solve for r c2 1a2d(F) log(c1Ln/d(F)) Since Pn ℓ,b f 0, by Lemma 1 we have P ℓ,b f 2ε with probability at least 1 δ where the randomness comes from the training data Sn. 
By assumption we have Ex[(bω(x) ω (x))2] ε b, and D(f (x), bf(x)) b , with probability at least 1 δ, Ex[(bω(x) ω (x))2D(f (x), bf(x))] ε, which implies that Ex[ω 2(x)D(f (x), bf(x))] 2Ex[bω(x)ω (x)D(f (x), bf(x))] + ε 3ε Ex[bω2(x)D(f (x), bf(x))] 2Ex[bω(x)ω (x)D(f (x), bf(x))] + ε 3ε. Published in Transactions on Machine Learning Research (01/2025) It can be easily verified that Ex[bω2(x)D(f (x), bf(x))] 3ε = Ex[D(f (x), bf(x))|bω(x) c] 3ε c2P(bω(x) c). B Technical Lemmas Lemma 1. Let F be a class of functions ranging in [a, b] and assume that there are some functional T : H R+ and some constant B such that for every h H, Var(h) T(h) BP[h] + ε. Let ψ be a subroot function and r be the fixed point of ψ. Assume the ψ satisfies, for any r r , ψ(r) BERn{h H : T(h) r}. Then with c1 = 704 and c2 = 26, for any K > 1 and every t > 1 with probability at least 1 e t, h H, P[h] K K 1 Pnh + c1K B r + t(11(b a) + c2BK) n + const ε Also with probability at least 1 e t, h H, Pn[h] K + 1 B r + t(11(b a) + c2BK) n + const ε where Pf = Ex[h(x)] and Pn = 1 n Pn i=1 h(xi). Proof. The proof is similar to the proof of Theorem 3.3 from Bartlett et al. (2005) except that here we are modifying some step so as to apply the argument under the condition T(h) BPh + ε, instead of the original condition T(h) BPf. We introduce notations and concepts: given class H, λ > 1 and r > 0, we let w(h) = min{rλk, k N, rλk T(h)} and set Gr = r w(h)h, h H . And similar to Bartlett et al. (2005) we define V + r = sup g Gr {Pg Png}, V r = sup g Gr {Png Pg}. Next we modify the proof step of Lemma 3.8 from Bartlett et al. (2005). Suppose K > 1, λ > 0 and r > 0. We aim to prove the following two claims: if V + r r λBK then f F, Pf K K 1Pnf + r λBK + ε K 1; (39) if V r r λBK then f F, Pnf K + 1 K Pf + r λBK + ε When T(h) < r, we use the same conclusion as the one in Lemma 3.8 in Bartlett et al. (2005): Ph Pnh + V + r Pnh + r λBK . In the case T(h) > r, we have w(h) = rλk with k > 0 and T(h) (rλk 1, rλk]. Moreover, g = h λk , Pg Png + V + r thus Ph λk + V + r . Since T(h) > rλk 1, we have: Ph Pnh + λk V + r < Pnh + λT(h)V + r r Pnh + Ph = Ph K K 1Pnh + ε K 1 + r λBK Published in Transactions on Machine Learning Research (01/2025) Let r r , applying Theorem 2.1 from Bartlett et al. (2005), we have for all 0 < δ 1, with probability at least 1 δ: V + r 2(1 + α)ERn Gr + 2r log(1/δ) n + (b a)(1 Let H(u, v) {h MH : u T(h) v} and define k to be the smallest integer that rλk+1 Bb + ε. By assumption we have T(h) BE[h]+ε, and ψ(r) be a sub-root function that ϕ(r) BERn{h H : T(h) r} we have: ERn Gr ERn H(0, r) + E sup h H(0,Bb+ε) ERH(0, r) + j=0 E sup h H(rλj,rλj+1) r w(h)Rnh = ERH(0, r) + j=0 λ j E sup h H(rλj,rλj+1) Rnh j=0 λ jψ(rλj+1). Since ψ is a sub-root function, we have for all β 1, ψ(βr) βψ(r). Hence, j=0 λ j/2). Similar to Bartlett et al. (2005) we can setting λ = 4 to bound RHS by 5ψ(r) B . Since r r = ψ(r) p r/r ψ(r ) = rr , we have: Vr+ 10(1 + α) n + (b a)(1 Next we set A = 10(1 + α) r B + 2 log(1/δ) n and C = (b a)(1/3 + 1/α) log(1/δ)/n so that V + r+ A r + C. It can be verified that r can be chosen such that V + r r λBK . The largest solution of A r + C = r λBK , denoted as r0. One can verify that r0 is no less than λ2A2B2K2 which is no less than r . Meanwhile r0 (λBK)2A2 + 2λBKC, by the claims in (39) and (40), one can show that for all h H, Ph K K 1Pnh + λBKA2 + 2C + ε = K K 1Ph + λBK(100(1 + α2) r B2 + 20(1 + α) 2 log(1/δ)r n + 2 log(1/δ) Setting α = 1/10 use the fact that 2 uv u α + αv completes the proof. 
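As an aside before the next lemma, the balanced Bernstein condition derived in Appendix A.1 (Equation 6), namely that the weighted excess loss h(z) = ω*(x)(1{y ≠ f(x)} − 1{y ≠ f*(x)}) with ω*(x) = |2η*(x) − 1| satisfies Var[h] ≤ E[h], can be checked numerically. The following is a minimal Monte Carlo sketch; the sigmoid form of η* and the particular sub-optimal classifier f are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def eta_star(x):                      # P(y = 1 | x); an arbitrary smooth choice
    return 1.0 / (1.0 + np.exp(-3.0 * x))

n = 2_000_000
x = rng.uniform(-2.0, 2.0, size=n)
y = (rng.uniform(size=n) < eta_star(x)).astype(int)

f_star = (eta_star(x) >= 0.5).astype(int)    # Bayes classifier f*(x)
f = (x >= 0.3).astype(int)                   # an arbitrary non-Bayes classifier
omega = np.abs(2.0 * eta_star(x) - 1.0)      # omega*(x)

# Weighted excess loss h(z) = omega*(x) * (1{y != f(x)} - 1{y != f*(x)}).
h = omega * ((y != f).astype(float) - (y != f_star).astype(float))

print("E[h]   =", h.mean())
print("Var[h] =", h.var())
print("Var[h] <= E[h] (up to Monte Carlo error):", h.var() <= h.mean() + 1e-3)
```

The check mirrors the chain in Appendix A.1: Var[h] ≤ E[ω*²(x) 1{f ≠ f*}] = E[h], so the Bernstein constant is balanced to 1 once the excess loss is reweighted by ω*.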
Next lemma is largely from Lemma 5.2 in Klein & Young (2015) and Zhang et al. (2024). We include the proof here for completeness. Lemma 2 (Chernoff type lower bound). Let X be the average of k independent, 0/1 random variables (r.v.). For any ϵ (0, 1/2] and p (0, 1/2], assuming ϵpk 6, pk 6, ε 1 3, we have: If each r.v. is 1 with probability p, then P[X (1 ϵ)p] 0.2e 6ϵ2pk. If each r.v. is 1 with probability p, then P[X (1 + ϵ)p] 0.2e 6ϵ2pk. Published in Transactions on Machine Learning Research (01/2025) Proof. With Stirling s approximation, with i! approximated by 2πi(i/e)ieλ with λ [1/(12i + 1), 1/12i] one can show: k ℓ Since P[X (1 ϵ)p] = P(1 ε)p i=0 P[X = i k], it suffices to provide a lower bound for P(1 ε)p i=(1 2ε)p P[X = i where P[X = i pi(1 p)k i (Klein & Young, 2015). To this end, let ℓ= (1 2ϵ)pk . Given the fact that εpk 6, we have (1 2ϵ)pk 1 ℓ (1 2ϵ)pk. We have P(1 ε)p i=(1 2ε)p P[X = i k] is at least εpk P[X = ℓ k ] = εpk e k ℓ pℓ(1 p)k ℓ. From Equation (41) we know that we need to bound A = 1 2πℓand B = k ℓ ℓ k k ℓ k ℓpℓ(1 p)k ℓ. For term A, since εpk 6, l (1 2ε)pk thus we need pk 9e 2ε ε to get 2ε 2π(1 2ε) e ε2pk. Since ε 1 it suffices to have pk 16. To bound B we need to show: k l pl(1 p)k l e 4ε2pk. ℓ ℓpℓ 1 (1 2ε) l and (1 p)k l k k l k l = (1 p)k k(1 p)+1+2εpk k l we have: (1 2ε)ℓ 1 + 1 + 2εpk k ℓ e 4ε2p2k 1 p +2εpk 2εpk+4ε2pk+2 7e4ε2pk.
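Lemma 2 can likewise be illustrated by simulation. The sketch below is illustrative only; the values of p, ε, and k are arbitrary choices satisfying εpk ≥ 6 and ε ≤ 1/3. It estimates the lower-tail probability by Monte Carlo and compares it with the claimed lower bound 0.2·exp(−6ε²pk).

```python
import numpy as np

rng = np.random.default_rng(2)

p, eps, k = 0.3, 0.1, 250      # arbitrary choices with eps*p*k >= 6 and eps <= 1/3
trials = 200_000

# X is the average of k independent Bernoulli(p) random variables.
X = rng.binomial(k, p, size=trials) / k

tail_prob = np.mean(X <= (1 - eps) * p)
lower_bound = 0.2 * np.exp(-6 * eps**2 * p * k)

print(f"P[X <= (1 - eps) p]   ~= {tail_prob:.4f}")
print(f"0.2 exp(-6 eps^2 p k)  = {lower_bound:.4f}")
print("lower bound holds:", tail_prob >= lower_bound)
```

The lemma is a reverse (lower-bound) analogue of the Chernoff bound; it is used in the proof of Theorem 4.2 to guarantee that, with constant probability, sufficiently many flipped labels appear among the sampled points. The simulation merely confirms the direction of the inequality for one parameter setting.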