# Towards Problem-dependent Optimal Learning Rates

Yunbei Xu, Columbia University, New York, NY 10027, yunbei.xu@gsb.columbia.edu
Assaf Zeevi, Columbia University, New York, NY 10025, assaf@gsb.columbia.edu

We study problem-dependent rates, i.e., generalization errors that scale tightly with the variance or the effective loss at the "best hypothesis." Existing uniform convergence and localization frameworks, the most widely used tools to study this problem, often fail to simultaneously provide parameter localization and optimal dependence on the sample size. As a result, existing problem-dependent rates are often rather weak when the hypothesis class is "rich" and the worst-case bound of the loss is large. In this paper we propose a new framework based on a "uniform localized convergence" principle. We provide the first (moment-penalized) estimator that achieves the optimal variance-dependent rate for general "rich" classes; we also establish improved loss-dependent rates for standard empirical risk minimization.

1 Introduction

Problem Statement. Consider the following statistical learning setting. Assume that a random sample $z$ follows an unknown distribution $P$ with support $\mathcal{Z}$. For each realization of $z$, let $\ell(\cdot\,; z)$ be a real-valued loss function, defined over the hypothesis class $\mathcal{H}$. Let $h^* \in \mathcal{H}$ be the optimal hypothesis that minimizes the population risk $P\ell(h; z) := \mathbb{E}[\ell(h; z)]$. Given $n$ i.i.d. samples $\{z_i\}_{i=1}^n$ drawn from $P$, our goal, roughly speaking, is to "learn" a hypothesis $\hat h \in \mathcal{H}$ that makes the generalization error
$$\mathcal{E}(\hat h) := P\ell(\hat h; z) - P\ell(h^*; z)$$
as small as possible. This pursuit is ubiquitous in machine learning, statistics and stochastic optimization. Let $V^*$ and $L^*$ be the variance and the "effective loss" at the best hypothesis $h^*$:
$$V^* := \mathrm{Var}[\ell(h^*; z)], \qquad L^* := P\big[\ell(h^*; z) - \inf_{h \in \mathcal{H}} \ell(h; z)\big].$$
We study finite-sample generalization errors that scale tightly with $V^*$ or $L^*$, which we call problem-dependent rates, without invoking strong convexity or margin conditions. While the direct dependence of $\mathcal{E}(\hat h)$ on the sample size $n$ is often well-understood, it typically only reflects an "asymptotic" perspective, placing less emphasis on the scale of the problem-dependent parameters $V^*$ and $L^*$.

Main challenges. In the absence of strong convexity and margin conditions, perhaps the most popular framework to study problem-dependent rates is the traditional "local Rademacher complexity" analysis [2, 7, 17], which has become a standard tool in learning theory. However, as we will discuss later, this analysis makes the "direct dependence" on the sample size $n$ sub-optimal for all "rich" (i.e., non-parametric) classes. The absence of a more precise localization analysis also challenges the design of more refined estimation procedures. For example, designing estimators to achieve variance-dependent rates requires penalizing the empirical second moment to achieve the "right" bias-variance trade-off. Most antecedent work is predicated on either the traditional "local Rademacher complexity" analysis [13, 4] or coarser approaches [9, 15]. Thus, to the best of our knowledge, the question of optimal problem-dependent rates for general rich classes is still open.
Contributions. We introduce a new framework to study localization in statistical learning, dubbed "uniform localized convergence," which simultaneously provides optimal "direct dependence" on the sample size and correct scaling with problem-dependent parameters. This framework resolves some fundamental limitations of existing localization analyses. We employ the above ideas to design the first estimator that achieves optimal variance-dependent rates for general function classes. The derivation is based on a novel two-stage procedure that optimally penalizes the empirical (centered) second moment. We also establish improved loss-dependent rates for standard empirical risk minimization, which has computational advantages. When assuming suitable curvature or margin conditions, much progress on problem-dependent rates has been made under particular formulations, such as supervised learning with strong convexity [11, 12, 8]. Our proposed framework leads to substantial progress in these and several other problem settings; a detailed account can be found in the full version of this work [18].

Organization. Section 2 introduces our proposed "uniform localized convergence" principle. Section 3 provides preliminaries. Section 4 presents the loss-dependent rate. Section 5 presents the variance-dependent rate. Section 6 illustrates our findings in two examples: non-parametric classes and VC classes.

2 The "uniform localized convergence" principle

2.1 The current blueprint

Denote the empirical risk $P_n\ell(h; z) := \frac{1}{n}\sum_{i=1}^n \ell(h; z_i)$, and consider the following straightforward decomposition of the generalization error:
$$\mathcal{E}(\hat h) = (P - P_n)\ell(\hat h; z) + \big[P_n\ell(\hat h; z) - P_n\ell(h^*; z)\big] + (P_n - P)\ell(h^*; z). \tag{2.1}$$
The main difficulty in studying $\mathcal{E}(\hat h)$ comes from bounding the first term $(P - P_n)\ell(\hat h; z)$, since $\hat h$ depends on the $n$ samples. The simplest approach, which does not achieve problem-dependent rates, is to bound the uniform error $\sup_{h \in \mathcal{H}} (P - P_n)\ell(h; z)$ over the entire hypothesis class $\mathcal{H}$. In order to obtain problem-dependent rates, a natural modification is to consider uniform convergence over localized subsets of $\mathcal{H}$.

We first give an overview of the traditional "local Rademacher complexity" analysis [2, 7, 17]. Consider a generic function class $\mathcal{F}$ that we wish to concentrate, which consists of real-valued functions defined on $\mathcal{Z}$ (e.g., one can set $f(z) = \ell(h; z)$). Denote $Pf := \mathbb{E}[f(z)]$ and $P_n f := \frac{1}{n}\sum_{i=1}^n f(z_i)$, and denote by $\psi(r; \delta)$ a surrogate function that upper bounds the uniform error within a localized region $\{f \in \mathcal{F} : T(f) \le r\}$, where we call $T : \mathcal{F} \to \mathbb{R}_+$ the "measurement functional." Formally, let $\psi$ be a function that maps $[0, \infty) \times (0, 1)$ to $[0, \infty)$, which possibly depends on the observed samples $\{z_i\}_{i=1}^n$. Assume that $\psi$ satisfies, for arbitrary fixed $\delta$ and $r$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}: T(f) \le r} (P - P_n)f \le \psi(r; \delta). \tag{2.2}$$
By default we ask $\psi(r; \delta)$ to be a non-decreasing and non-negative function. The main result of the traditional "local Rademacher complexity" analysis can be stated as follows (adapted from [2, Section 3.2]).

Statement 1 (current blueprint). Assume that $\psi$ is a sub-root function, i.e., $\psi(r; \delta)/\sqrt{r}$ is non-increasing with respect to $r \in \mathbb{R}_+$. Assume the Bernstein condition $T(f) \le B_e\, Pf$ for some $B_e > 0$ and all $f \in \mathcal{F}$. Then with probability at least $1 - \delta$, for all $f \in \mathcal{F}$ and $K > 1$,
$$(P - P_n)f \le \frac{1}{K}\, Pf + \frac{100(K-1)\, r^*}{B_e}, \tag{2.3}$$
where $r^*$ is the "fixed point" solution of the equation $r = B_e \psi(r; \delta)$.

Since its inception, Statement 1 has become a standard tool in learning theory.
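To make Statement 1 concrete, consider the canonical parametric case (a worked example of ours; the constants are illustrative and logarithmic factors in $1/\delta$ are suppressed). For a class of "dimension" $d$ one typically has $\psi(r; \delta) \asymp \sqrt{dr/n}$, which is sub-root, and the fixed point can be computed in closed form:
$$r = B_e\sqrt{\frac{dr}{n}} \;\Longrightarrow\; r^* = \frac{B_e^2\, d}{n}, \qquad \text{so} \qquad \frac{r^*}{B_e} = \frac{B_e\, d}{n}.$$
Here $\psi$ is exactly proportional to $\sqrt{r}$, which, as Statement 2 below formalizes, is the one regime where the blueprint loses nothing.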
However, Statement 1 requires a rather technical proof, and it appears to be loose when compared with the original assumption (2.2): ideally, we would like to directly extend (2.2) to hold uniformly without sacrificing any accuracy. Moreover, some assumptions in the statement are restrictive and might not be necessary.

2.2 Key ideas of the "uniform localized convergence" principle

We provide a surprisingly simple approach which greatly improves and simplifies the current blueprint. While Statement 1 relies heavily on restrictive assumptions like the "sub-root" property of $\psi$ and the Bernstein condition, the following proposition holds essentially without any restrictions.

Proposition 1 (uniform localized convergence). For a function class $\mathcal{F}$ and a functional $T : \mathcal{F} \to [0, R]$, assume there is a function $\psi(r; \delta)$, non-decreasing with respect to $r$, that satisfies: $\forall \delta \in (0, 1)$, $\forall r \in (0, R]$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}: T(f) \le r} (P - P_n)f \le \psi(r; \delta). \tag{2.4}$$
Then, given any $\delta \in (0, 1)$ and $r_0 \in (0, R]$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$, either $T(f) \le r_0$ or
$$(P - P_n)f \le \psi\Big(2T(f);\ \delta\big(\log_2 \tfrac{2R}{r_0}\big)^{-1}\Big). \tag{2.5}$$

We note that both $T$ and $\psi$ are allowed to be sample-dependent in the above. The key intuition behind Proposition 1 is that the uniform restatement of the "localized" argument (2.4) is nearly cost-free, because the deviations $(P - P_n)f$ can be controlled solely by the real-valued functional $T(f)$. As a result, we essentially only require uniform convergence over an interval $[r_0, R]$. The "cost" of this uniform convergence, namely, the additional $\log_2 \frac{2R}{r_0}$ term in (2.5), will only appear in the form $\log\big(\log_2(\frac{2R}{r_0})/\delta\big)$ in high-probability bounds, which is of a negligible $O(\log\log n)$ order in general.

Formally, we apply a "peeling" technique: we take $r_k = 2^k r_0$ for $k = 1, 2, \dots, \lceil \log_2 \frac{R}{r_0} \rceil$, and we use the union bound to extend (2.4) to hold simultaneously for all these $r_k$. Then for any $f \in \mathcal{F}$ such that $T(f) > r_0$, there exists a non-negative integer $k$ such that $2^k r_0 < T(f) \le 2^{k+1} r_0$. By the non-decreasing property of the $\psi$ function, we then have
$$(P - P_n)f \le \psi\Big(r_{k+1};\ \delta\big(\log_2 \tfrac{2R}{r_0}\big)^{-1}\Big) \le \psi\Big(2T(f);\ \delta\big(\log_2 \tfrac{2R}{r_0}\big)^{-1}\Big),$$
which is exactly (2.5).
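The peeling bookkeeping is short enough to state as code. Below is a minimal Python sketch of our own (the function names and the toy surrogate are illustrations, not part of the paper's formal development): given a surrogate $\psi$, it returns the deviation bound that (2.5) assigns to a single $f$ with localization value $T(f)$.

```python
import math

def localized_deviation_bound(psi, T_f, r0, R, delta):
    """Peeling argument behind Proposition 1: union-bound (2.4) over the
    shells r_k = 2^k * r0; a function f with T(f) > r0 is bounded through
    the shell it falls into, giving psi(2*T(f); delta / log2(2R/r0))."""
    num_shells = math.ceil(math.log2(2 * R / r0))  # size of the union bound
    delta_shell = delta / num_shells               # confidence per shell
    if T_f <= r0:
        return None  # below the localization floor; (2.5) makes no claim
    return psi(2 * T_f, delta_shell)

# Toy surrogate of non-parametric shape (illustrative constants only):
n, rho = 10_000, 0.5
psi = lambda r, d: math.sqrt(r ** (1 - rho) / n) + math.sqrt(math.log(1 / d) / n)
print(localized_deviation_bound(psi, T_f=0.3, r0=1e-8, R=4.0, delta=0.01))
```

Note how the confidence parameter only shrinks by the factor `num_shells`, which enters the bound as $\log(\text{num\_shells}/\delta)$, i.e., the $O(\log\log n)$ cost discussed above.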
Interestingly, the proof of the classical result (Statement 1) relies on relatively heavy machinery that includes more complicated peeling and re-weighting arguments (see [2, Section 3.1.4]). That analysis, however, obscures the key intuition that we elucidate in inequality (2.5). The results presented in this paper essentially originate from the noticeable gap between Proposition 1 and Statement 1, illustrated by the following (informal) conclusion.

Statement 2 (improvements over the current blueprint; informal statement). Under the assumptions of Statement 1, Proposition 1 provides a strict improvement over Statement 1. In particular, the slower $\psi$ grows, the larger the gap between the two results, and the bounds become identical only when $\psi$ is proportional to $\sqrt{r}$, i.e., when the function class $\mathcal{F}$ is parametric and not "rich."

Formalizing and rigorously justifying this conclusion is relatively straightforward: taking the "optimal choice" of $K$ in Statement 1, we can re-write its conclusion (2.3) as
$$(P - P_n)f \le O\Big(\sqrt{\tfrac{r^* Pf}{B_e}} + \tfrac{r^*}{B_e}\Big),$$
where the right-hand side is of order $\sqrt{r^* Pf / B_e}$ when $Pf > r^*/B_e$, and of order $r^*/B_e$ when $Pf \le r^*/B_e$. Our result (2.5) is also of order $r^*/B_e$ when $Pf \le r^*/B_e$. However, for every $f$ such that $Pf > r^*/B_e$, it is straightforward to verify that under the assumptions of Statement 1,
$$\begin{aligned}
\psi(2T(f); \delta) &\le \psi(2B_e\, Pf; \delta) && [\text{Bernstein condition: } T(f) \le B_e\, Pf] \\
&\le \sqrt{\frac{2B_e\, Pf}{r^*}}\; \psi(r^*; \delta) && [\psi(r; \delta) \text{ is sub-root and } 2B_e\, Pf > r^*] \\
&= \sqrt{\frac{2 r^*\, Pf}{B_e}} && [r^* \text{ is the fixed point of } B_e\psi(r; \delta)].
\end{aligned} \tag{2.6}$$
Therefore, the bound $\psi(2T(f); \delta) \le \sqrt{2 r^* Pf / B_e}$ established in (2.6) shows that the "uniform localized convergence" argument (2.5) strictly improves over Statement 1: (2.5) is never worse, and it is strictly better whenever the sub-root inequality above is loose.

3 Preliminaries

Our results on problem-dependent rates essentially only require the loss function to be uniformly bounded by $[-B, B]$, i.e., $|\ell(h; z)| \le B$ for all $h \in \mathcal{H}$ and $z \in \mathcal{Z}$. This is a standard assumption used in almost all previous works that do not invoke curvature conditions or rely on other problem-specific structure. Extensions to unbounded targets can be obtained via truncation techniques (see, e.g., [5]), and our problem-dependent results allow $B$ to be very large, potentially scaling with $n$. We represent the complexity through a surrogate function $\psi(r; \delta)$ that satisfies, for all $\delta \in (0, 1)$,
$$\sup_{f \in \mathcal{F}: P[f^2] \le r} (P - P_n)f \le \psi(r; \delta) \tag{3.1}$$
with probability at least $1 - \delta$, where $\mathcal{F}$ is taken to be the excess loss class
$$\ell \circ \mathcal{H} - \ell \circ h^* := \{z \mapsto \ell(h; z) - \ell(h^*; z) : h \in \mathcal{H}\}. \tag{3.2}$$
To achieve non-trivial complexity control (and ensure existence of the fixed point), we only consider "meaningful" surrogate functions as stated below.

Definition 1 (meaningful surrogate function). A bivariate function $\psi(r; \delta)$ defined over $[0, \infty) \times (0, 1)$ is called a meaningful surrogate function if it is non-decreasing, non-negative and bounded with respect to $r$ for every fixed $\delta \in (0, 1)$.

We note that the above does not place significant restrictions on the choice of the surrogate function: the left-hand side of (3.1) is itself non-decreasing and non-negative; and the boundedness requirement can always be met by setting $\psi(r; \delta) = \psi(4B^2; \delta)$ for all $r \ge 4B^2$. We now give the formal definition of fixed points.

Definition 2 (fixed point). Given a non-decreasing, non-negative and bounded function $\phi(r)$ defined over $[0, \infty)$, we define the fixed point of $\phi(r)$ to be $\sup\{r > 0 : \phi(r) \ge r\}$. Equivalently, the fixed point of $\phi(r)$ is the maximal solution to the equation $\phi(r) = r$.

Given a bounded class $\mathcal{F}$, empirical process theory provides a general way to construct a surrogate function by upper bounding the "local Rademacher complexity" $\Re\{f \in \mathcal{F} : P[f^2] \le r\}$ (see Lemma 4 in Appendix H). We give the definition of Rademacher complexity for completeness.

Definition 3 (Rademacher complexity). For a function class $\mathcal{F}$ that consists of mappings from $\mathcal{Z}$ to $\mathbb{R}$, define
$$\Re\mathcal{F} := \mathbb{E}_{z,\upsilon}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \upsilon_i f(z_i)\Big], \qquad \Re_n\mathcal{F} := \mathbb{E}_{\upsilon}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \upsilon_i f(z_i)\Big],$$
as the Rademacher complexity and the empirical Rademacher complexity of $\mathcal{F}$, respectively, where $\{\upsilon_i\}_{i=1}^n$ are i.i.d. Rademacher variables for which $\mathrm{Prob}(\upsilon_i = 1) = \mathrm{Prob}(\upsilon_i = -1) = \frac{1}{2}$, $\mathbb{E}_z$ means taking expectations over $\{z_i\}_{i=1}^n$, and $\mathbb{E}_\upsilon$ means taking expectations over $\{\upsilon_i\}_{i=1}^n$.

Furthermore, Dudley's integral bound (Lemma 3 in Appendix H) provides one general way to construct a computable upper bound on the local Rademacher complexity via the covering number of $\mathcal{F}$. We give the definition of covering numbers.

Definition 4 (covering number and metric entropy). An $\varepsilon$-cover of a function class $\mathcal{F}$ with respect to the $L_2(P_n)$ metric is a set $\{f_1, \dots, f_m\} \subseteq \mathcal{F}$ such that for each $f \in \mathcal{F}$ there exists $i \in \{1, \dots, m\}$ with $\sqrt{P_n(f(z) - f_i(z))^2} \le \varepsilon$. The covering number $\mathcal{N}(\varepsilon, \mathcal{F}, L_2(P_n))$ is the cardinality of the smallest $\varepsilon$-cover. We call $\log \mathcal{N}(\varepsilon, \mathcal{F}, L_2(P_n))$ the metric entropy.
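Before moving on, here is a minimal Monte Carlo sketch of ours for Definition 3 (the function name and array layout are illustrative assumptions): it estimates the empirical Rademacher complexity of a finite, or finitely discretized, function class by averaging over random sign draws.

```python
import numpy as np

def empirical_rademacher(fvals, n_draws=2000, seed=0):
    """Estimate R_n F = E_v [ sup_{f in F} (1/n) sum_i v_i f(z_i) ].

    fvals: (m, n) array; row j holds the values (f_j(z_1), ..., f_j(z_n))
    of the j-th function in a finite class F on the fixed sample.
    """
    rng = np.random.default_rng(seed)
    m, n = fvals.shape
    v = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    # (n_draws, m): empirical sign-correlation of every f with every draw.
    corr = v @ fvals.T / n
    return corr.max(axis=1).mean()  # average of the per-draw suprema

# Example: 50 random bounded functions evaluated on n = 200 samples.
fvals = np.random.default_rng(1).uniform(-1, 1, size=(50, 200))
print(empirical_rademacher(fvals))  # roughly of order sqrt(log(50)/200)
```

Restricting the rows of `fvals` to those with $P_n[f^2] \le r$ gives a plug-in estimate of the local Rademacher complexity $\Re_n\{f \in \mathcal{F} : P_n[f^2] \le r\}$ used below.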
4 Loss-dependent rates via empirical risk minimization

In this section we are interested in loss-dependent rates, which should scale tightly with $L^* := P[\ell(h^*; z) - \inf_{h \in \mathcal{H}} \ell(h; z)]$, the best achievable "effective loss" on $\mathcal{H}$. The following theorem characterizes the loss-dependent rate of empirical risk minimization (ERM) via a surrogate function $\psi$, its fixed point $r^*$, the effective loss $L^*$ and $B$.

Theorem 1 (loss-dependent rate of ERM). For the excess loss class $\mathcal{F}$ in (3.2), assume there is a meaningful surrogate function $\psi(r; \delta)$ that satisfies, $\forall \delta \in (0, 1)$ and $r > 0$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}: P[f^2] \le r} (P - P_n)f \le \psi(r; \delta).$$
Then the empirical risk minimizer $\hat h_{\mathrm{ERM}} \in \arg\min_{h \in \mathcal{H}}\{P_n\ell(h; z)\}$ satisfies, for any fixed $\delta \in (0, 1)$ and $r_0 \in (0, 4B^2)$, with probability at least $1 - \delta$,
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le \psi\Big(24BL^*;\ \frac{\delta}{C_{r_0}}\Big) + \frac{r^*}{B} + r_0,$$
where $C_{r_0} = 2\log_2 \frac{8B^2}{r_0}$, and $r^*$ is the fixed point of $r \mapsto 6B\psi(8r; \frac{\delta}{C_{r_0}})$.

Remarks. 1) The term $r_0$ is negligible since it can be arbitrarily small. One can simply set $r_0 = B^2/n^4$, which will be much smaller than $r^*$ in general ($r^*$ is at least of order $B^2\log\frac{1}{\delta}/n$ in the traditional "local Rademacher complexity" analysis). In high-probability bounds, $C_{r_0}$ will only appear in the form $\log(C_{r_0}/\delta)$, which is of a negligible $O(\log\log n)$ order, so $C_{r_0}$ can be viewed as an absolute constant for all practical purposes. As a result, our generalization error bound can be viewed as being of the order
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le O\Big(\psi(BL^*; \delta) + \frac{r^*}{B}\Big). \tag{4.1}$$
2) By using the empirical "effective loss," $P_n[\ell(\hat h_{\mathrm{ERM}}; z) - \inf_{h \in \mathcal{H}} \ell(h; z)]$, to estimate $L^*$, the loss-dependent rate can be estimated from data without knowledge of $L^*$. We defer the details to Appendix A.

Comparison to existing results. Under additional restrictions (to be explained later), the traditional analysis (2.3) leads to a loss-dependent rate of the order [2]
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le O\Big(\sqrt{\frac{L^* r^*}{B}} + \frac{r^*}{B}\Big), \tag{4.2}$$
which is strictly worse than our result (4.1), by the reasoning following Statement 2. When $BL^* \le O(r^*)$, both (4.1) and (4.2) are dominated by the order $r^*/B$, so there is no difference between them. However, when $BL^* \ge \Omega(r^*)$, our result (4.1) will be of order $\psi(BL^*; \delta)$ and the previous result (4.2) will be of order $\sqrt{L^* r^*/B}$. In this case, the square-root function $\sqrt{L^* r^*/B}$ is only a coarse relaxation of $\psi(BL^*; \delta)$: as the traditional analysis requires $\psi$ to be sub-root, we can compare the two orders via
$$\psi(BL^*; \delta) \overset{\text{sub-root}}{\le} \sqrt{\frac{BL^*}{r^*}}\,\psi(r^*; \delta) \overset{\text{fixed point}}{=} O\Big(\sqrt{\frac{L^* r^*}{B}}\Big). \tag{4.3}$$
The "sub-root" inequality (the first inequality in (4.3)) becomes an equality when $\psi(r; \delta) = O(\sqrt{dr/n})$ in the parametric case, where $d$ is the parametric dimension. However, when $\mathcal{F}$ is rich, $\psi(r; \delta)/\sqrt{r}$ will be strictly decreasing, so the "sub-root" inequality can become quite loose. For example, when $\mathcal{F}$ is a non-parametric class we often have $\psi(r; \delta) = O(\sqrt{r^{1-\rho}/n})$ for some $\rho \in (0, 1)$. The richer $\mathcal{F}$ is (e.g., the larger $\rho$ is), the looser the "sub-root" inequality. This intuition will be validated via examples in Section 6.

Theorem 1 also applies to broader settings than previous results. For example, in [2] it is assumed that the loss is non-negative, and their original result only adapts to $P\ell(h^*; z)$ rather than the "effective loss" $L^*$. Our proof (see Appendix D) is quite different, as we bypass the Bernstein condition (which is traditionally implied by non-negativity, but not satisfied by the class used here), bypass the sub-root assumption on $\psi$, and adapt to the "better" parameter $L^*$.
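Theorem 1 (and the results of the next section) are stated in terms of the fixed point of a scaled surrogate (Definition 2). For a sub-root $\phi$ the set $\{r > 0 : \phi(r) \ge r\}$ is an interval, so the fixed point can be computed numerically by bisection; below is a minimal sketch of ours (function names and constants are illustrative assumptions).

```python
import math

def fixed_point(phi, r_max, tol=1e-12):
    """Compute sup{r > 0 : phi(r) >= r} for a non-decreasing phi.
    For a sub-root phi, {r : phi(r) >= r} is an interval [0, r*],
    so bisection between 0 and any r_max with phi(r_max) < r_max works."""
    lo, hi = 0.0, r_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if phi(mid) >= mid:
            lo = mid  # still inside {r : phi(r) >= r}
        else:
            hi = mid
    return lo

# Example with a sub-root surrogate psi(r) = sqrt(r^{1-rho}/n) and B = 1,
# scaled as in Theorem 1 (confidence terms suppressed):
n, rho, B = 10_000, 0.5, 1.0
phi = lambda r: 6 * B * math.sqrt((8 * r) ** (1 - rho) / n)
print(fixed_point(phi, r_max=4 * B**2))  # ~ (B^2/n)^{1/(1+rho)} up to constants
```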
5 Variance-dependent rates via moment penalization

The loss-dependent rate proved in Theorem 1 contains a complexity parameter $BL^*$ within its $\psi$ function, which may still be much larger than the optimal variance $V^*$. Despite its prevalent use in practice, standard empirical risk minimization is unable to achieve variance-dependent rates in general. An example is given in [13] where $V^* = 0$ and the optimal rate is at most $O(\log n/n)$, while $\mathcal{E}(\hat h_{\mathrm{ERM}})$ is proved to decay no faster than $n^{-1/2}$.

We follow the path of penalizing empirical second moments (or variances) [9, 15, 13, 4] to design an estimator that achieves the "right" bias-variance trade-off for general, potentially "rich," classes. Our proposed estimator simultaneously achieves correct scaling with $V^*$ and minimax-optimal dependence on the sample size $n$. Besides empirical first and second moments, it only depends on the boundedness parameter $B$, a computable surrogate function $\psi$, and the confidence parameter $\delta$. All of these quantities are essentially assumed known in previous works: e.g., [9, 15] require covering numbers of the loss class, which imply a computable surrogate $\psi$ via Dudley's integral bound; and the estimators in [13, 4] rely on the fixed point $r^*$ of a computable surrogate $\psi$.

In order to adapt to $V^*$, we use a sample-splitting two-stage estimation procedure (this idea is inspired by the prior work [4]). Without loss of generality, we assume access to a data set of size $2n$. We split it into the "primary" data set $S$ and the "auxiliary" data set $S'$, both of size $n$. We denote by $P_n$ the empirical distribution of the "primary" data set, and by $P_{S'}$ the empirical distribution of the "auxiliary" data set.

Strategy 1 (the two-stage sample-splitting estimation procedure). At the first stage, we derive a preliminary estimate of $L_0^* := P\ell(h^*; z)$ via the "auxiliary" data set $S'$, which we refer to as $\hat L_{S'}$. Then, at the second stage, we perform regularized empirical risk minimization on the "primary" data set $S$, which penalizes the centered second moment $P_n[(\ell(h; z) - \hat L_{S'})^2]$.

As we will present later, it is rather trivial to obtain a qualified preliminary estimate $\hat L_{S'}$ via empirical risk minimization. Therefore, we first introduce the second-stage moment-penalized estimator, which is more crucial and interesting.

Strategy 2 (the second-stage moment-penalized estimator). Consider the excess loss class $\mathcal{F}$ in (3.2). Let $\psi(r; \delta)$ be a meaningful surrogate function that satisfies, $\forall \delta \in (0, 1)$, $\forall r > 0$, with probability at least $1 - \delta$,
$$4\Re_n\{f \in \mathcal{F} : P_n[f^2] \le 2r\} + \sqrt{\frac{2r\log\frac{8}{\delta}}{n}} + \frac{9B\log\frac{8}{\delta}}{n} \le \psi(r; \delta).$$
Denote $C_n = 4\log_2 n + 10$. Given a fixed $\delta \in (0, 1)$, let the estimator $\hat h_{\mathrm{MP}}$ be
$$\hat h_{\mathrm{MP}} \in \arg\min_{h \in \mathcal{H}} \left\{ P_n\ell(h; z) + \psi\Big(16 P_n\big[(\ell(h; z) - \hat L_{S'})^2\big];\ \frac{\delta}{C_n}\Big) \right\}. \tag{5.1}$$

Given an arbitrary preliminary estimate $\hat L_{S'} \in [-B, B]$, we can prove that the generalization error of the moment-penalized estimator $\hat h_{\mathrm{MP}}$ is at most
$$\mathcal{E}(\hat h_{\mathrm{MP}}) \le 2\psi\Big(c_0\big[V^* + (\hat L_{S'} - L_0^*)^2 + r^*\big];\ \frac{\delta}{C_n}\Big) \tag{5.2}$$
with probability at least $1 - \delta$, where $c_0$ is an absolute constant and $r^*$ is the fixed point of $r \mapsto 16B\psi(r; \frac{\delta}{C_n})$. Moreover, the first-stage estimation error will be negligible if
$$(\hat L_{S'} - L_0^*)^2 \le O(r^*). \tag{5.3}$$
It is rather elementary to show that performing standard empirical risk minimization on $S'$ suffices to satisfy (5.3), under the additional assumption that $\psi$ is a "sub-root" function.
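To make Strategies 1 and 2 concrete, here is a minimal sketch of ours for a finite hypothesis class (the array layout and function names are our own assumptions; the surrogate $\psi$ is assumed to be supplied, e.g., via Dudley's integral bound):

```python
import numpy as np

def two_stage_moment_penalized(loss_S, loss_aux, psi, delta):
    """Two-stage procedure of Section 5 for a finite class of m hypotheses.

    loss_S, loss_aux: (m, n) arrays; row j holds the losses l(h_j; z_i)
    on the primary sample S and the auxiliary sample S', respectively.
    psi(r, delta): a computable meaningful surrogate function.
    Returns the index of the moment-penalized hypothesis h_MP.
    """
    m, n = loss_S.shape
    C_n = 4 * np.log2(n) + 10
    # Stage 1: preliminary estimate of L_0* = P l(h*; z) by ERM on S'.
    L_hat = loss_aux.mean(axis=1).min()
    # Stage 2: penalize the empirical centered second moment on S, as in (5.1).
    emp_risk = loss_S.mean(axis=1)
    second_moment = ((loss_S - L_hat) ** 2).mean(axis=1)
    penalty = np.array([psi(16 * v, delta / C_n) for v in second_moment])
    return int(np.argmin(emp_risk + penalty))
```

For infinite classes the same objective (5.1) is minimized over $\mathcal{H}$; the computational question of doing so efficiently is outside the scope of this sketch.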
We now give our theorem on the generalization error of this two-stage procedure.

Theorem 2 (variance-dependent rate). Let $\hat L_{S'} = \inf_{h \in \mathcal{H}} P_{S'}\ell(h; z)$ be attained via empirical risk minimization on the auxiliary data set $S'$. Assume that the meaningful surrogate function $\psi(r; \delta)$ is "sub-root," i.e., $\psi(r; \delta)/\sqrt{r}$ is non-increasing over $r \in [0, 4B^2]$ for every fixed $\delta$. Then for any $\delta \in (0, \frac{1}{2})$, the moment-penalized estimator of Strategy 2 satisfies, with probability at least $1 - 2\delta$,
$$\mathcal{E}(\hat h_{\mathrm{MP}}) \le 2\psi\Big(c_1(V^* + r^*);\ \frac{\delta}{C_n}\Big),$$
where $r^*$ is the fixed point of $r \mapsto B\psi(r; \frac{\delta}{C_n})$ and $c_1$ is an absolute constant.

Remarks. 1) In high-probability bounds, $C_n$ will only appear in the form $\log(C_n/\delta)$, which is of a negligible $O(\log\log n)$ order, so there is little difference in viewing $C_n$ as an absolute constant. 2) The "sub-root" assumption in Theorem 2 is only used to bound the first-stage estimation error (see (5.3)). This assumption is not needed for the result (5.2) on the second-stage moment-penalized estimator. 3) Replacing $V^*$ with an empirical centered second moment, we can prove a fully data-dependent generalization error bound that is computable from data without knowledge of $V^*$. We leave the full discussion to Appendix A.

Comparison to existing results. The best variance-dependent rate attained by existing estimators is of the order [4]
$$O\Big(\frac{\sqrt{V^* r^*}}{B} + \frac{r^*}{B}\Big),$$
which is strictly worse than the rate proved in Theorem 2. The reasoning is similar to Statement 2 and the explanation after Theorem 1: when $V^* \le O(r^*)$ the two results are essentially identical, but our estimator can perform much better when $V^* \ge \Omega(r^*)$. Because $\psi$ is sub-root and $r^*$ is the fixed point, we can compare the orders of the two rates via
$$\psi(V^*; \delta) \overset{\text{sub-root}}{\le} \sqrt{\frac{V^*}{r^*}}\,\psi(r^*; \delta) \overset{\text{fixed point}}{=} O\Big(\frac{\sqrt{V^* r^*}}{B}\Big).$$
Since variance-dependent rates are generally used in applications that require robustness or exhibit a large worst-case boundedness parameter, $V^* \ge \Omega(r^*)$ is the more critical regime, where one wants to ensure that estimation performance does not degrade.

Discussion. From the perspective of our "uniform localized convergence" principle, the most obvious difficulty in proving Theorem 2 is establishing (5.2): the empirical second moment is sample-dependent, whereas standard tools in empirical process theory (e.g., Talagrand's concentration inequality; see Lemma 4) require the localized subsets to be independent of the samples. The core techniques in our proof overcome this difficulty by concentrating data-dependent localized subsets to data-independent ones. This idea may be of independent interest; we defer the details to Appendix E.

The tightness of our variance-dependent rates depends on the tightness of the computable surrogate function $\psi$. When covering numbers of the excess loss class are given, a direct choice is Dudley's integral bound (Lemma 3 in Appendix H), which is known to be rate-optimal for many important classes. Previous approaches usually take a simpler regularization term [9, 4] that is proportional to the square root of the empirical second moment (or empirical variance). From our viewpoint, that type of penalization is "too aggressive" for rich classes. [13] propose a regularization term that preserves convexity of the empirical risk. However, based on an equivalence proved in their paper, it shares the limitations of the approaches that penalize the square root of the empirical variance.

6 Discussion and illustrative examples

6.1 Discussion

Recall that our loss-dependent rates and variance-dependent (moment-penalized) rates are of the orders
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le O\Big(\psi(BL^*; \delta) + \frac{r^*}{B}\Big) \quad\text{and}\quad \mathcal{E}(\hat h_{\mathrm{MP}}) \le O\Big(\psi(V^*; \delta) + \frac{r^*}{B}\Big), \tag{6.1}$$
respectively. In contrast, the best known loss-dependent rates [2] and variance-dependent rates [4] are of the orders
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le O\Big(\sqrt{\frac{L^* r^*}{B}} + \frac{r^*}{B}\Big) \quad\text{and}\quad \mathcal{E}(\hat h_{\mathrm{previous}}) \le O\Big(\frac{\sqrt{V^* r^*}}{B} + \frac{r^*}{B}\Big), \tag{6.2}$$
respectively (we use $\hat h_{\mathrm{previous}}$ to denote the previous best known moment-penalized estimator proposed in [4]).
To illustrate the noticeable gaps between our new results and previously known ones, we compare the two variance-dependent rates in (6.1) and (6.2) on two important families of "rich" classes: non-parametric classes of polynomial growth and VC-type classes. The implications of this comparison apply similarly to the loss-dependent rates.

Before presenting the advantages of the new problem-dependent rates, we discuss how to compute them. In Theorem 1 and Theorem 2, the class of concentrated functions, $\mathcal{F}$, is the excess loss class $\ell \circ \mathcal{H} - \ell \circ h^*$ in (3.2). As mentioned in earlier sections, a general way to obtain the $\psi$ function is Dudley's integral bound (Lemma 3 in Appendix H). Knowledge of the metric entropy of the excess loss class can be used to calculate Dudley's integral bound and construct the surrogate function $\psi$ needed in our theorems. Note that there is no difference between the metric entropy of the excess loss class and the metric entropy of the loss class itself: from the definitions of covering number and metric entropy, one has
$$\log \mathcal{N}(\varepsilon, \ell \circ \mathcal{H} - \ell \circ h^*, L_2(P_n)) = \log \mathcal{N}(\varepsilon, \ell \circ \mathcal{H}, L_2(P_n)).$$
We comment that almost all existing theoretical works that discuss general function classes and losses [2, 9, 15, 4] impose metric entropy conditions on the loss class/excess loss class rather than on the hypothesis class, and we follow that line as well to allow for a seamless comparison of results. As a complement, we discuss how to obtain such metric entropy conditions for practical applications in Appendix B.

6.2 Non-parametric classes of polynomial growth

Example 1 (non-parametric classes of polynomial growth). Consider a loss class $\ell \circ \mathcal{H}$ with the metric entropy condition
$$\log \mathcal{N}(\varepsilon, \ell \circ \mathcal{H}, L_2(P_n)) \le O(\varepsilon^{-2\rho}), \tag{6.3}$$
where $\rho \in (0, 1)$ is a constant. Using Dudley's integral bound to find $\psi$ and solving $r \le O(B\psi(r; \delta))$, it is not hard to verify that
$$\psi(r; \delta) = O\Big(\sqrt{\frac{r^{1-\rho}}{n}}\Big), \qquad r^* = O\Big(\Big(\frac{B^2}{n}\Big)^{\frac{1}{1+\rho}}\Big),$$
up to logarithmic and confidence-dependent factors. As a result, our variance-dependent rate (6.1) is of the order
$$\mathcal{E}(\hat h_{\mathrm{MP}}) \le O\Big(\sqrt{\frac{(V^*)^{1-\rho}}{n}} + \frac{r^*}{B}\Big), \tag{6.4}$$
which is $O\big((V^*)^{\frac{1-\rho}{2}} n^{-\frac{1}{2}}\big)$ when $V^* \ge \Omega(r^*)$. In contrast, the previous best-known result (6.2) is of the order
$$\mathcal{E}(\hat h_{\mathrm{previous}}) \le O\Big(\sqrt{V^*}\, B^{-\frac{\rho}{1+\rho}}\, n^{-\frac{1}{2+2\rho}} + \frac{r^*}{B}\Big), \tag{6.5}$$
which is $O\big(\sqrt{V^*}\, B^{-\frac{\rho}{1+\rho}}\, n^{-\frac{1}{2+2\rho}}\big)$ when $V^* \ge \Omega(r^*)$. Therefore, for an arbitrary choice of $n$, $V^*$, $B$, the "sub-optimality gap" is
$$\text{ratio between (6.5) and (6.4)} = \frac{\sqrt{V^*}\, B^{-\frac{\rho}{1+\rho}}\, n^{-\frac{1}{2+2\rho}}}{(V^*)^{\frac{1-\rho}{2}}\, n^{-\frac{1}{2}}} = \Big(V^* \big(\tfrac{n}{B^2}\big)^{\frac{1}{1+\rho}}\Big)^{\frac{\rho}{2}}, \tag{6.6}$$
which can be arbitrarily large and grows polynomially with $n$. We consider two stylized regimes (we write $a \asymp b$ when the left-hand side and the right-hand side are of the same order).

The more "traditional" regime: $B \asymp 1$, $V^* \asymp n^{-a}$ where $a > 0$ is a fixed constant. This regime captures traditional supervised learning problems where $B$ is not large, but one wants to use the relatively small order of $V^*$ to achieve "faster" rates.

The "high-risk" regime: $B \asymp n^b$ where $b > 0$ is a fixed constant, and $V^* \ll B^2$ (i.e., $V^*$ is much smaller than order $n^{2b}$). This regime captures modern "high-risk" learning problems such as counterfactual risk minimization [15], policy learning [1], and supervised learning with a limited number of samples. In those settings, the worst-case boundedness parameter is considered to scale with $n$, so one wants to avoid (or reduce) the dependence on $B$.

In both regimes, generalization errors obtained via naive (non-localized) uniform convergence arguments are worse than our approach by factors polynomial in $n$, so we only need to compare with previous variance-dependent rates.
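As a quick sanity check on Example 1, the following numeric sketch of ours evaluates the order-level expressions for (6.4), (6.5) and the gap (6.6) at illustrative values of $n$, $B$, $V^*$, $\rho$ (all absolute constants set to 1, so only the growth in $n$ is meaningful):

```python
import numpy as np

def example1_rates(n, B, V, rho):
    """Order-level rates of Example 1 with all absolute constants set to 1."""
    r_star = (B**2 / n) ** (1 / (1 + rho))                      # fixed point
    ours = np.sqrt(V ** (1 - rho) / n) + r_star / B             # order of (6.4)
    prev = np.sqrt(V) * B ** (-rho / (1 + rho)) * n ** (-1 / (2 + 2 * rho)) \
           + r_star / B                                         # order of (6.5)
    return ours, prev

# "Traditional" regime: B ~ 1, V* = n^{-1/4}, rho = 1/2 (so a < 1/(1+rho)).
for n in [1e4, 1e6, 1e8]:
    ours, prev = example1_rates(n, B=1.0, V=n ** -0.25, rho=0.5)
    print(f"n={n:.0e}  prev/ours={prev / ours:.2f}")  # gap (6.6) grows with n
```

At these parameter values the gap (6.6) equals $n^{5/48}$ up to constants, so the printed ratio grows polynomially with $n$, as claimed.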
The "traditional" regime. The "sub-optimality gap" (6.6) becomes $\big(V^* n^{\frac{1}{1+\rho}}\big)^{\frac{\rho}{2}}$. It is quite clear that when $V^* \asymp n^{-a}$ with $0 < a < \frac{1}{1+\rho}$, our variance-dependent rate improves over all previous generalization error rates by factors polynomial in $n$.

The "high-risk" regime. We restrict our attention to the simple case $B^{\frac{2}{1+\rho}} \le V^* \le 4B^2$ to gain some insight; here our result exhibits an improvement of order at least $O(n^{\frac{\rho}{2}\cdot\frac{1}{1+\rho}})$ relative to the previous result. Clearly, the larger $\rho$, the greater the improvement. Letting $\rho \to 1$, our improvement can be as large as $O(n^{\frac{1}{4}})$.

6.3 VC-type classes

Our next example considers VC-type classes. Although this classical example has been extensively studied in learning theory, our results provide strict improvements over antecedents.

Example 2 (VC-type classes). One general definition of VC-type classes (which need not be binary-valued) uses the metric entropy condition: consider a loss class $\ell \circ \mathcal{H}$ that satisfies
$$\log \mathcal{N}(\varepsilon, \ell \circ \mathcal{H}, L_2(P_n)) \le O\Big(d \log \frac{1}{\varepsilon}\Big),$$
where $d$ is the so-called Vapnik-Chervonenkis (VC) dimension [16]. Using Dudley's integral bound to find the surrogate $\psi$ and solving $r \le O(B\psi(r; \delta))$, it can be proven [7] that
$$\psi(r; \delta) = O\Big(\sqrt{\frac{dr \log \frac{8B^2}{r}}{n}}\Big), \qquad r^* = O\Big(\frac{B^2 d \log n}{n}\Big),$$
up to confidence-dependent factors. Recently, [4] proposed a moment-penalized estimator whose generalization error is of the rate
$$\mathcal{E}(\hat h_{\mathrm{previous}}) \le O\Big(\sqrt{\frac{d V^* \log n}{n}} + \frac{B d \log n}{n}\Big)$$
in the worst case, without invoking other assumptions. This result has an $O(\log n)$ gap relative to the $\Omega\big(\sqrt{dV^*/n}\big)$ lower bound [3], which holds for arbitrary sample size. There is much recent interest in determining when the sub-optimal $\log n$ factor can be removed [1, 4]. By applying Theorem 2, our refined moment-penalized estimator gives a generalization error bound of the tighter rate
$$\mathcal{E}(\hat h_{\mathrm{MP}}) \le O\Big(\sqrt{\frac{d V^* \log \frac{8B^2}{V^*}}{n}} + \frac{B d \log n}{n}\Big).$$
This closes the $O(\log n)$ gap in the regime $V^* \ge \Omega\big(\frac{B^2}{(\log n)^\alpha}\big)$, where $\alpha > 0$ is an arbitrary positive constant. Though this is not the central regime, it is the first positive result that closes the notorious $O(\log n)$ gap without invoking any additional assumptions on the loss/hypothesis class (e.g., the rather complex "capacity function" assumption introduced in [4]). We anticipate that additional improvements are possible under further assumptions on the hypothesis class and the loss function.

Broader Impact

Problem-dependent rates are relevant to many practical aspects of machine learning. Our theory may be useful in the design of estimators and the certification of confidence intervals in causal inference and counterfactual learning problems, as the focus there is typically on avoiding worst-case parameter dependence (see also the discussion on "counterfactual risk minimization" in Appendix B). We derive improved loss-dependent guarantees for empirical risk minimization, one of the most widely used methods for constructing estimators in machine learning practice. It is possible that our strategy of penalizing the empirical variance may find applications in fairness, as variance-penalized estimators have been shown to be effective in such problems [6]. Having said that, this work is not expected to have any direct societal consequence absent further adaptation to specific applications.

Funding Disclosure

This work did not receive third party funding or third party support.

References

[1] Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
[2] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
[3] Luc Devroye and Gábor Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28(7):1011–1018, 1995.
[4] Dylan J. Foster and Vasilis Syrgkanis.
Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
[5] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
[6] Tatsunori B. Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. arXiv preprint arXiv:1806.08010, 2018.
[7] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443–457. Springer, 2000.
[8] Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset Rademacher complexity. In Conference on Learning Theory, pages 1260–1285, 2015.
[9] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
[10] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4(Oct):839–860, 2003.
[11] Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25–39, 2014.
[12] Shahar Mendelson. Learning without concentration for general loss functions. Probability Theory and Related Fields, 171(1-2):459–502, 2018.
[13] Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
[14] Karthik Sridharan. Note on refined Dudley integral covering number bound. Unpublished manuscript, https://www.cs.cornell.edu/~sridharan/dudley.pdf, 2010.
[15] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
[16] Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
[17] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[18] Yunbei Xu and Assaf Zeevi. Towards optimal problem dependent generalization error bounds in statistical learning theory. arXiv preprint arXiv:2011.06186, 2020.