# Towards Problem-dependent Optimal Learning Rates

Yunbei Xu, Columbia University, New York, NY 10027, yunbei.xu@gsb.columbia.edu
Assaf Zeevi, Columbia University, New York, NY 10025, assaf@gsb.columbia.edu

We study problem-dependent rates, i.e., generalization errors that scale tightly with the variance or the effective loss at the "best hypothesis." Existing uniform convergence and localization frameworks, the most widely used tools to study this problem, often fail to simultaneously provide parameter localization and optimal dependence on the sample size. As a result, existing problem-dependent rates are often rather weak when the hypothesis class is "rich" and the worst-case bound of the loss is large. In this paper we propose a new framework based on a "uniform localized convergence" principle. We provide the first (moment-penalized) estimator that achieves the optimal variance-dependent rate for general "rich" classes; we also establish improved loss-dependent rates for standard empirical risk minimization.

1 Introduction

Problem Statement. Consider the following statistical learning setting. Assume that a random sample $z$ follows an unknown distribution $P$ with support $\mathcal{Z}$. For each realization of $z$, let $\ell(\cdot\,; z)$ be a real-valued loss function, defined over the hypothesis class $\mathcal{H}$. Let $h^* \in \mathcal{H}$ be the optimal hypothesis that minimizes the population risk $P\ell(h; z) := \mathbb{E}[\ell(h; z)]$. Given $n$ i.i.d. samples $\{z_i\}_{i=1}^n$ drawn from $P$, our goal, roughly speaking, is to "learn" a hypothesis $\hat h \in \mathcal{H}$ that makes the generalization error
$$\mathcal{E}(\hat h) := P\ell(\hat h; z) - P\ell(h^*; z)$$
as small as possible. This pursuit is ubiquitous in machine learning, statistics and stochastic optimization. Let $V^*$ and $L^*$ be the variance and the "effective loss" at the best hypothesis $h^*$:
$$V^* := \mathrm{Var}[\ell(h^*; z)], \qquad L^* := P\big[\ell(h^*; z) - \inf_{h \in \mathcal{H}} \ell(h; z)\big].$$
We study finite-sample generalization errors that scale tightly with $V^*$ or $L^*$, which we call problem-dependent rates, without invoking strong convexity or margin conditions. While the direct dependence of $\mathcal{E}(\hat h)$ on the sample size $n$ is often well-understood, it typically only reflects an "asymptotic" perspective, placing less emphasis on the scale of the problem-dependent parameters $V^*$ and $L^*$.

Main challenges. In the absence of strong convexity and margin conditions, perhaps the most popular framework to study problem-dependent rates is the traditional "local Rademacher complexity" analysis [2, 7, 17], which has become a standard tool in learning theory. However, as we will discuss later, this analysis makes the "direct dependence" on the sample size $n$ sub-optimal for all "rich" (i.e., non-parametric) classes. The absence of a more precise localization analysis also challenges the design of more refined estimation procedures. For example, designing estimators to achieve variance-dependent rates requires penalizing the empirical second moment to achieve the "right" bias-variance trade-off. Most antecedent work is predicated on either the traditional "local Rademacher complexity" analysis [13, 4] or coarser approaches [9, 15]. Thus, to the best of our knowledge, the question of optimal problem-dependent rates for general rich classes is still open.
Contributions. We introduce a new framework to study localization in statistical learning, dubbed "uniform localized convergence," which simultaneously provides optimal "direct dependence" on the sample size and correct scaling with problem-dependent parameters. This framework resolves some fundamental limitations of existing localization analyses. We employ the above ideas to design the first estimator that achieves optimal variance-dependent rates for general function classes. The derivation is based on a novel two-stage procedure that optimally penalizes the empirical (centered) second moment. We also establish improved loss-dependent rates for standard empirical risk minimization, which has computational advantages. When assuming suitable curvature or margin conditions, much progress on problem-dependent rates has been made under particular formulations, such as supervised learning with strong convexity [11, 12, 8]. Our proposed framework leads to substantial progress in these and several other problem settings; a detailed account can be found in the full version of this work [18].

Organization. Section 2 introduces our proposed "uniform localized convergence" principle. Section 3 provides preliminaries. Section 4 presents the loss-dependent rate. Section 5 presents the variance-dependent rate. Section 6 illustrates our findings in two examples: non-parametric classes and VC classes.

2 The "uniform localized convergence" principle

2.1 The current blueprint

Denote the empirical risk $P_n\ell(h; z) := \frac{1}{n}\sum_{i=1}^n \ell(h; z_i)$, and consider the following straightforward decomposition of the generalization error:
$$\mathcal{E}(\hat h) = (P - P_n)\ell(\hat h; z) + \big[P_n\ell(\hat h; z) - P_n\ell(h^*; z)\big] + (P_n - P)\ell(h^*; z). \tag{2.1}$$
The main difficulty in studying $\mathcal{E}(\hat h)$ comes from bounding the first term $(P - P_n)\ell(\hat h; z)$, since $\hat h$ depends on the $n$ samples. The simplest approach, which does not achieve problem-dependent rates, is to bound the uniform error $\sup_{h \in \mathcal{H}} (P - P_n)\ell(h; z)$ over the entire hypothesis class $\mathcal{H}$. In order to obtain problem-dependent rates, a natural modification is to consider uniform convergence over localized subsets of $\mathcal{H}$.

We first give an overview of the traditional "local Rademacher complexity" analysis [2, 7, 17]. Consider a generic function class $\mathcal{F}$ that we wish to concentrate, which consists of real-valued functions defined on $\mathcal{Z}$ (e.g., one can set $f(z) = \ell(h; z)$). Denote $Pf := \mathbb{E}[f(z)]$ and $P_n f := \frac{1}{n}\sum_{i=1}^n f(z_i)$, and denote by $\psi(r; \delta)$ a surrogate function that upper bounds the uniform error within a localized region $\{f \in \mathcal{F} : T(f) \le r\}$, where we call $T : \mathcal{F} \to \mathbb{R}_+$ the "measurement functional." Formally, let $\psi$ be a function that maps $[0, \infty) \times (0, 1)$ to $[0, \infty)$, which possibly depends on the observed samples $\{z_i\}_{i=1}^n$. Assume that $\psi$ satisfies, for arbitrary fixed $\delta$ and $r$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}: T(f) \le r} (P - P_n)f \le \psi(r; \delta). \tag{2.2}$$
By default we ask $\psi(r; \delta)$ to be a non-decreasing and non-negative function. The main result of the traditional "local Rademacher complexity" analysis can be stated as follows (adapted from [2, Section 3.2]).

Statement 1 (current blueprint). Assume that $\psi$ is a sub-root function, i.e., $\psi(r; \delta)/\sqrt{r}$ is non-increasing with respect to $r \in \mathbb{R}_+$. Assume the Bernstein condition $T(f) \le B_e\, Pf$ for some $B_e > 0$ and all $f \in \mathcal{F}$. Then with probability at least $1 - \delta$, for all $f \in \mathcal{F}$ and $K > 1$,
$$(P - P_n)f \le \frac{1}{K}\, Pf + \frac{100(K-1)\, r^*}{B_e}, \tag{2.3}$$
where $r^*$ is the "fixed point" solution of the equation $r = B_e \psi(r; \delta)$.

Since its inception, Statement 1 has become a standard tool in learning theory.
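To make Statement 1 concrete, consider the canonical parametric case (a worked example of ours; the constants are illustrative and logarithmic factors in $1/\delta$ are suppressed). For a class of "dimension" $d$ one typically has $\psi(r; \delta) \asymp \sqrt{dr/n}$, which is sub-root, and the fixed point can be computed in closed form:
$$r = B_e\sqrt{\frac{dr}{n}} \;\Longrightarrow\; r^* = \frac{B_e^2\, d}{n}, \qquad \text{so} \qquad \frac{r^*}{B_e} = \frac{B_e\, d}{n}.$$
Here $\psi$ is exactly proportional to $\sqrt{r}$, which, as Statement 2 below formalizes, is the one regime where the blueprint loses nothing.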
However, Statement 1 requires a rather technical proof, and it appears to be loose when compared with the original assumption (2.2): ideally, we would like to directly extend (2.2) to hold uniformly without sacrificing any accuracy. Moreover, some assumptions in the statement are restrictive and might not be necessary.

2.2 Key ideas of the "uniform localized convergence" principle

We provide a surprisingly simple approach which greatly improves and simplifies the current blueprint. While Statement 1 relies heavily on restrictive assumptions like the "sub-root" property of $\psi$ and the Bernstein condition, the following proposition holds essentially without any restrictions.

Proposition 1 (uniform localized convergence). For a function class $\mathcal{F}$ and a functional $T : \mathcal{F} \to [0, R]$, assume there is a function $\psi(r; \delta)$, non-decreasing with respect to $r$, that satisfies: $\forall \delta \in (0, 1)$, $\forall r \in (0, R]$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}: T(f) \le r} (P - P_n)f \le \psi(r; \delta). \tag{2.4}$$
Then, given any $\delta \in (0, 1)$ and $r_0 \in (0, R]$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$, either $T(f) \le r_0$ or
$$(P - P_n)f \le \psi\Big(2T(f);\ \delta\big(\log_2 \tfrac{2R}{r_0}\big)^{-1}\Big). \tag{2.5}$$

We note that both $T$ and $\psi$ are allowed to be sample-dependent in the above. The key intuition behind Proposition 1 is that the uniform restatement of the "localized" argument (2.4) is nearly cost-free, because the deviations $(P - P_n)f$ can be controlled solely by the real-valued functional $T(f)$. As a result, we essentially only require uniform convergence over an interval $[r_0, R]$. The "cost" of this uniform convergence, namely, the additional $\log_2 \frac{2R}{r_0}$ term in (2.5), will only appear in the form $\log\big(\log_2(\frac{2R}{r_0})/\delta\big)$ in high-probability bounds, which is of a negligible $O(\log\log n)$ order in general.

Formally, we apply a "peeling" technique: we take $r_k = 2^k r_0$ for $k = 1, 2, \dots, \lceil \log_2 \frac{R}{r_0} \rceil$, and we use the union bound to extend (2.4) to hold simultaneously for all these $r_k$. Then for any $f \in \mathcal{F}$ such that $T(f) > r_0$, there exists a non-negative integer $k$ such that $2^k r_0 < T(f) \le 2^{k+1} r_0$. By the non-decreasing property of the $\psi$ function, we then have
$$(P - P_n)f \le \psi\Big(r_{k+1};\ \delta\big(\log_2 \tfrac{2R}{r_0}\big)^{-1}\Big) \le \psi\Big(2T(f);\ \delta\big(\log_2 \tfrac{2R}{r_0}\big)^{-1}\Big),$$
which is exactly (2.5).
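The peeling bookkeeping is short enough to state as code. Below is a minimal Python sketch of our own (the function names and the toy surrogate are illustrations, not part of the paper's formal development): given a surrogate $\psi$, it returns the deviation bound that (2.5) assigns to a single $f$ with localization value $T(f)$.

```python
import math

def localized_deviation_bound(psi, T_f, r0, R, delta):
    """Peeling argument behind Proposition 1: union-bound (2.4) over the
    shells r_k = 2^k * r0; a function f with T(f) > r0 is bounded through
    the shell it falls into, giving psi(2*T(f); delta / log2(2R/r0))."""
    num_shells = math.ceil(math.log2(2 * R / r0))  # size of the union bound
    delta_shell = delta / num_shells               # confidence per shell
    if T_f <= r0:
        return None  # below the localization floor; (2.5) makes no claim
    return psi(2 * T_f, delta_shell)

# Toy surrogate of non-parametric shape (illustrative constants only):
n, rho = 10_000, 0.5
psi = lambda r, d: math.sqrt(r ** (1 - rho) / n) + math.sqrt(math.log(1 / d) / n)
print(localized_deviation_bound(psi, T_f=0.3, r0=1e-8, R=4.0, delta=0.01))
```

Note how the confidence parameter only shrinks by the factor `num_shells`, which enters the bound as $\log(\text{num\_shells}/\delta)$, i.e., the $O(\log\log n)$ cost discussed above.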
Interestingly, the proof of the classical result (Statement 1) relies on relatively heavy machinery that includes more complicated peeling and re-weighting arguments (see [2, Section 3.1.4]). That analysis, however, obscures the key intuition that we elucidate in inequality (2.5). The results presented in this paper essentially originate from the noticeable gap between Proposition 1 and Statement 1, illustrated by the following (informal) conclusion.

Statement 2 (improvements over the current blueprint; informal statement). Under the assumptions of Statement 1, Proposition 1 provides a strict improvement over Statement 1. In particular, the slower $\psi$ grows, the larger the gap between the two results, and the bounds become identical only when $\psi$ is proportional to $\sqrt{r}$, i.e., when the function class $\mathcal{F}$ is parametric and not "rich."

Formalizing and rigorously justifying this conclusion is relatively straightforward: taking the "optimal choice" of $K$ in Statement 1, we can re-write its conclusion (2.3) as
$$(P - P_n)f \le O\Big(\sqrt{\tfrac{r^* Pf}{B_e}} + \tfrac{r^*}{B_e}\Big),$$
where the right-hand side is of order $\sqrt{r^* Pf / B_e}$ when $Pf > r^*/B_e$, and of order $r^*/B_e$ when $Pf \le r^*/B_e$. Our result (2.5) is also of order $r^*/B_e$ when $Pf \le r^*/B_e$. However, for every $f$ such that $Pf > r^*/B_e$, it is straightforward to verify that under the assumptions of Statement 1,
$$\begin{aligned}
\psi(2T(f); \delta) &\le \psi(2B_e\, Pf; \delta) && [\text{Bernstein condition: } T(f) \le B_e\, Pf] \\
&\le \sqrt{\frac{2B_e\, Pf}{r^*}}\; \psi(r^*; \delta) && [\psi(r; \delta) \text{ is sub-root and } 2B_e\, Pf > r^*] \\
&= \sqrt{\frac{2 r^*\, Pf}{B_e}} && [r^* \text{ is the fixed point of } B_e\psi(r; \delta)].
\end{aligned} \tag{2.6}$$
Therefore, the bound $\psi(2T(f); \delta) \le \sqrt{2 r^* Pf / B_e}$ established in (2.6) shows that the "uniform localized convergence" argument (2.5) strictly improves over Statement 1: (2.5) is never worse, and it is strictly better whenever the sub-root inequality above is loose.

3 Preliminaries

Our results on problem-dependent rates essentially only require the loss function to be uniformly bounded by $[-B, B]$, i.e., $|\ell(h; z)| \le B$ for all $h \in \mathcal{H}$ and $z \in \mathcal{Z}$. This is a standard assumption used in almost all previous works that do not invoke curvature conditions or rely on other problem-specific structure. Extensions to unbounded targets can be obtained via truncation techniques (see, e.g., [5]), and our problem-dependent results allow $B$ to be very large, potentially scaling with $n$. We represent the complexity through a surrogate function $\psi(r; \delta)$ that satisfies, for all $\delta \in (0, 1)$,
$$\sup_{f \in \mathcal{F}: P[f^2] \le r} (P - P_n)f \le \psi(r; \delta) \tag{3.1}$$
with probability at least $1 - \delta$, where $\mathcal{F}$ is taken to be the excess loss class
$$\ell \circ \mathcal{H} - \ell \circ h^* := \{z \mapsto \ell(h; z) - \ell(h^*; z) : h \in \mathcal{H}\}. \tag{3.2}$$
To achieve non-trivial complexity control (and ensure existence of the fixed point), we only consider "meaningful" surrogate functions as stated below.

Definition 1 (meaningful surrogate function). A bivariate function $\psi(r; \delta)$ defined over $[0, \infty) \times (0, 1)$ is called a meaningful surrogate function if it is non-decreasing, non-negative and bounded with respect to $r$ for every fixed $\delta \in (0, 1)$.

We note that the above does not place significant restrictions on the choice of the surrogate function: the left-hand side of (3.1) is itself non-decreasing and non-negative; and the boundedness requirement can always be met by setting $\psi(r; \delta) = \psi(4B^2; \delta)$ for all $r \ge 4B^2$. We now give the formal definition of fixed points.

Definition 2 (fixed point). Given a non-decreasing, non-negative and bounded function $\phi(r)$ defined over $[0, \infty)$, we define the fixed point of $\phi(r)$ to be $\sup\{r > 0 : \phi(r) \ge r\}$. Equivalently, the fixed point of $\phi(r)$ is the maximal solution to the equation $\phi(r) = r$.

Given a bounded class $\mathcal{F}$, empirical process theory provides a general way to construct a surrogate function by upper bounding the "local Rademacher complexity" $\Re\{f \in \mathcal{F} : P[f^2] \le r\}$ (see Lemma 4 in Appendix H). We give the definition of Rademacher complexity for completeness.

Definition 3 (Rademacher complexity). For a function class $\mathcal{F}$ that consists of mappings from $\mathcal{Z}$ to $\mathbb{R}$, define
$$\Re\mathcal{F} := \mathbb{E}_{z,\upsilon}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \upsilon_i f(z_i)\Big], \qquad \Re_n\mathcal{F} := \mathbb{E}_{\upsilon}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \upsilon_i f(z_i)\Big],$$
as the Rademacher complexity and the empirical Rademacher complexity of $\mathcal{F}$, respectively, where $\{\upsilon_i\}_{i=1}^n$ are i.i.d. Rademacher variables for which $\mathrm{Prob}(\upsilon_i = 1) = \mathrm{Prob}(\upsilon_i = -1) = \frac{1}{2}$, $\mathbb{E}_z$ means taking expectations over $\{z_i\}_{i=1}^n$, and $\mathbb{E}_\upsilon$ means taking expectations over $\{\upsilon_i\}_{i=1}^n$.

Furthermore, Dudley's integral bound (Lemma 3 in Appendix H) provides one general way to construct a computable upper bound on the local Rademacher complexity via the covering number of $\mathcal{F}$. We give the definition of covering numbers.

Definition 4 (covering number and metric entropy). An $\varepsilon$-cover of a function class $\mathcal{F}$ with respect to the $L_2(P_n)$ metric is a set $\{f_1, \dots, f_m\} \subseteq \mathcal{F}$ such that for each $f \in \mathcal{F}$ there exists $i \in \{1, \dots, m\}$ with $\sqrt{P_n(f(z) - f_i(z))^2} \le \varepsilon$. The covering number $\mathcal{N}(\varepsilon, \mathcal{F}, L_2(P_n))$ is the cardinality of the smallest $\varepsilon$-cover. We call $\log \mathcal{N}(\varepsilon, \mathcal{F}, L_2(P_n))$ the metric entropy.
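Before moving on, here is a minimal Monte Carlo sketch of ours for Definition 3 (the function name and array layout are illustrative assumptions): it estimates the empirical Rademacher complexity of a finite, or finitely discretized, function class by averaging over random sign draws.

```python
import numpy as np

def empirical_rademacher(fvals, n_draws=2000, seed=0):
    """Estimate R_n F = E_v [ sup_{f in F} (1/n) sum_i v_i f(z_i) ].

    fvals: (m, n) array; row j holds the values (f_j(z_1), ..., f_j(z_n))
    of the j-th function in a finite class F on the fixed sample.
    """
    rng = np.random.default_rng(seed)
    m, n = fvals.shape
    v = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    # (n_draws, m): empirical sign-correlation of every f with every draw.
    corr = v @ fvals.T / n
    return corr.max(axis=1).mean()  # average of the per-draw suprema

# Example: 50 random bounded functions evaluated on n = 200 samples.
fvals = np.random.default_rng(1).uniform(-1, 1, size=(50, 200))
print(empirical_rademacher(fvals))  # roughly of order sqrt(log(50)/200)
```

Restricting the rows of `fvals` to those with $P_n[f^2] \le r$ gives a plug-in estimate of the local Rademacher complexity $\Re_n\{f \in \mathcal{F} : P_n[f^2] \le r\}$ used below.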
4 Loss-dependent rates via empirical risk minimization

In this section we are interested in loss-dependent rates, which should scale tightly with $L^* := P[\ell(h^*; z) - \inf_{h \in \mathcal{H}} \ell(h; z)]$, the best achievable "effective loss" on $\mathcal{H}$. The following theorem characterizes the loss-dependent rate of empirical risk minimization (ERM) via a surrogate function $\psi$, its fixed point $r^*$, the effective loss $L^*$ and $B$.

Theorem 1 (loss-dependent rate of ERM). For the excess loss class $\mathcal{F}$ in (3.2), assume there is a meaningful surrogate function $\psi(r; \delta)$ that satisfies, $\forall \delta \in (0, 1)$ and $r > 0$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}: P[f^2] \le r} (P - P_n)f \le \psi(r; \delta).$$
Then the empirical risk minimizer $\hat h_{\mathrm{ERM}} \in \arg\min_{h \in \mathcal{H}}\{P_n\ell(h; z)\}$ satisfies, for any fixed $\delta \in (0, 1)$ and $r_0 \in (0, 4B^2)$, with probability at least $1 - \delta$,
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le \psi\Big(24BL^*;\ \frac{\delta}{C_{r_0}}\Big) + \frac{r^*}{B} + r_0,$$
where $C_{r_0} = 2\log_2 \frac{8B^2}{r_0}$, and $r^*$ is the fixed point of $r \mapsto 6B\psi(8r; \frac{\delta}{C_{r_0}})$.

Remarks. 1) The term $r_0$ is negligible since it can be arbitrarily small. One can simply set $r_0 = B^2/n^4$, which will be much smaller than $r^*$ in general ($r^*$ is at least of order $B^2\log\frac{1}{\delta}/n$ in the traditional "local Rademacher complexity" analysis). In high-probability bounds, $C_{r_0}$ will only appear in the form $\log(C_{r_0}/\delta)$, which is of a negligible $O(\log\log n)$ order, so $C_{r_0}$ can be viewed as an absolute constant for all practical purposes. As a result, our generalization error bound can be viewed as being of the order
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le O\Big(\psi(BL^*; \delta) + \frac{r^*}{B}\Big). \tag{4.1}$$
2) By using the empirical "effective loss," $P_n[\ell(\hat h_{\mathrm{ERM}}; z) - \inf_{h \in \mathcal{H}} \ell(h; z)]$, to estimate $L^*$, the loss-dependent rate can be estimated from data without knowledge of $L^*$. We defer the details to Appendix A.

Comparison to existing results. Under additional restrictions (to be explained later), the traditional analysis (2.3) leads to a loss-dependent rate of the order [2]
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le O\Big(\sqrt{\frac{L^* r^*}{B}} + \frac{r^*}{B}\Big), \tag{4.2}$$
which is strictly worse than our result (4.1), by the reasoning following Statement 2. When $BL^* \le O(r^*)$, both (4.1) and (4.2) are dominated by the order $r^*/B$, so there is no difference between them. However, when $BL^* \ge \Omega(r^*)$, our result (4.1) will be of order $\psi(BL^*; \delta)$ and the previous result (4.2) will be of order $\sqrt{L^* r^*/B}$. In this case, the square-root function $\sqrt{L^* r^*/B}$ is only a coarse relaxation of $\psi(BL^*; \delta)$: as the traditional analysis requires $\psi$ to be sub-root, we can compare the two orders via
$$\psi(BL^*; \delta) \overset{\text{sub-root}}{\le} \sqrt{\frac{BL^*}{r^*}}\,\psi(r^*; \delta) \overset{\text{fixed point}}{=} O\Big(\sqrt{\frac{L^* r^*}{B}}\Big). \tag{4.3}$$
The "sub-root" inequality (the first inequality in (4.3)) becomes an equality when $\psi(r; \delta) = O(\sqrt{dr/n})$ in the parametric case, where $d$ is the parametric dimension. However, when $\mathcal{F}$ is rich, $\psi(r; \delta)/\sqrt{r}$ will be strictly decreasing, so the "sub-root" inequality can become quite loose. For example, when $\mathcal{F}$ is a non-parametric class we often have $\psi(r; \delta) = O(\sqrt{r^{1-\rho}/n})$ for some $\rho \in (0, 1)$. The richer $\mathcal{F}$ is (e.g., the larger $\rho$ is), the looser the "sub-root" inequality. This intuition will be validated via examples in Section 6.

Theorem 1 also applies to broader settings than previous results. For example, in [2] it is assumed that the loss is non-negative, and their original result only adapts to $P\ell(h^*; z)$ rather than the "effective loss" $L^*$. Our proof (see Appendix D) is quite different, as we bypass the Bernstein condition (which is traditionally implied by non-negativity, but not satisfied by the class used here), bypass the sub-root assumption on $\psi$, and adapt to the "better" parameter $L^*$.
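Theorem 1 (and the results of the next section) are stated in terms of the fixed point of a scaled surrogate (Definition 2). For a sub-root $\phi$ the set $\{r > 0 : \phi(r) \ge r\}$ is an interval, so the fixed point can be computed numerically by bisection; below is a minimal sketch of ours (function names and constants are illustrative assumptions).

```python
import math

def fixed_point(phi, r_max, tol=1e-12):
    """Compute sup{r > 0 : phi(r) >= r} for a non-decreasing phi.
    For a sub-root phi, {r : phi(r) >= r} is an interval [0, r*],
    so bisection between 0 and any r_max with phi(r_max) < r_max works."""
    lo, hi = 0.0, r_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if phi(mid) >= mid:
            lo = mid  # still inside {r : phi(r) >= r}
        else:
            hi = mid
    return lo

# Example with a sub-root surrogate psi(r) = sqrt(r^{1-rho}/n) and B = 1,
# scaled as in Theorem 1 (confidence terms suppressed):
n, rho, B = 10_000, 0.5, 1.0
phi = lambda r: 6 * B * math.sqrt((8 * r) ** (1 - rho) / n)
print(fixed_point(phi, r_max=4 * B**2))  # ~ (B^2/n)^{1/(1+rho)} up to constants
```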
5 Variance-dependent rates via moment penalization

The loss-dependent rate proved in Theorem 1 contains a complexity parameter $BL^*$ within its $\psi$ function, which may still be much larger than the optimal variance $V^*$. Despite its prevalent use in practice, standard empirical risk minimization is unable to achieve variance-dependent rates in general. An example is given in [13] where $V^* = 0$ and the optimal rate is at most $O(\log n/n)$, while $\mathcal{E}(\hat h_{\mathrm{ERM}})$ is proved to decay no faster than $n^{-1/2}$.

We follow the path of penalizing empirical second moments (or variances) [9, 15, 13, 4] to design an estimator that achieves the "right" bias-variance trade-off for general, potentially "rich," classes. Our proposed estimator simultaneously achieves correct scaling with $V^*$ and minimax-optimal dependence on the sample size $n$. Besides empirical first and second moments, it only depends on the boundedness parameter $B$, a computable surrogate function $\psi$, and the confidence parameter $\delta$. All of these quantities are essentially assumed known in previous works: e.g., [9, 15] require covering numbers of the loss class, which imply a computable surrogate $\psi$ via Dudley's integral bound; and the estimators in [13, 4] rely on the fixed point $r^*$ of a computable surrogate $\psi$.

In order to adapt to $V^*$, we use a sample-splitting two-stage estimation procedure (this idea is inspired by the prior work [4]). Without loss of generality, we assume access to a data set of size $2n$. We split it into the "primary" data set $S$ and the "auxiliary" data set $S'$, both of size $n$. We denote by $P_n$ the empirical distribution of the "primary" data set, and by $P_{S'}$ the empirical distribution of the "auxiliary" data set.

Strategy 1 (the two-stage sample-splitting estimation procedure). At the first stage, we derive a preliminary estimate of $L_0^* := P\ell(h^*; z)$ via the "auxiliary" data set $S'$, which we refer to as $\hat L_{S'}$. Then, at the second stage, we perform regularized empirical risk minimization on the "primary" data set $S$, which penalizes the centered second moment $P_n[(\ell(h; z) - \hat L_{S'})^2]$.

As we will present later, it is rather trivial to obtain a qualified preliminary estimate $\hat L_{S'}$ via empirical risk minimization. Therefore, we first introduce the second-stage moment-penalized estimator, which is more crucial and interesting.

Strategy 2 (the second-stage moment-penalized estimator). Consider the excess loss class $\mathcal{F}$ in (3.2). Let $\psi(r; \delta)$ be a meaningful surrogate function that satisfies, $\forall \delta \in (0, 1)$, $\forall r > 0$, with probability at least $1 - \delta$,
$$4\Re_n\{f \in \mathcal{F} : P_n[f^2] \le 2r\} + \sqrt{\frac{2r\log\frac{8}{\delta}}{n}} + \frac{9B\log\frac{8}{\delta}}{n} \le \psi(r; \delta).$$
Denote $C_n = 4\log_2 n + 10$. Given a fixed $\delta \in (0, 1)$, let the estimator $\hat h_{\mathrm{MP}}$ be
$$\hat h_{\mathrm{MP}} \in \arg\min_{h \in \mathcal{H}} \left\{ P_n\ell(h; z) + \psi\Big(16 P_n\big[(\ell(h; z) - \hat L_{S'})^2\big];\ \frac{\delta}{C_n}\Big) \right\}. \tag{5.1}$$

Given an arbitrary preliminary estimate $\hat L_{S'} \in [-B, B]$, we can prove that the generalization error of the moment-penalized estimator $\hat h_{\mathrm{MP}}$ is at most
$$\mathcal{E}(\hat h_{\mathrm{MP}}) \le 2\psi\Big(c_0\big[V^* + (\hat L_{S'} - L_0^*)^2 + r^*\big];\ \frac{\delta}{C_n}\Big) \tag{5.2}$$
with probability at least $1 - \delta$, where $c_0$ is an absolute constant and $r^*$ is the fixed point of $r \mapsto 16B\psi(r; \frac{\delta}{C_n})$. Moreover, the first-stage estimation error will be negligible if
$$(\hat L_{S'} - L_0^*)^2 \le O(r^*). \tag{5.3}$$
It is rather elementary to show that performing standard empirical risk minimization on $S'$ suffices to satisfy (5.3), under the additional assumption that $\psi$ is a "sub-root" function.
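To make Strategies 1 and 2 concrete, here is a minimal sketch of ours for a finite hypothesis class (the array layout and function names are our own assumptions; the surrogate $\psi$ is assumed to be supplied, e.g., via Dudley's integral bound):

```python
import numpy as np

def two_stage_moment_penalized(loss_S, loss_aux, psi, delta):
    """Two-stage procedure of Section 5 for a finite class of m hypotheses.

    loss_S, loss_aux: (m, n) arrays; row j holds the losses l(h_j; z_i)
    on the primary sample S and the auxiliary sample S', respectively.
    psi(r, delta): a computable meaningful surrogate function.
    Returns the index of the moment-penalized hypothesis h_MP.
    """
    m, n = loss_S.shape
    C_n = 4 * np.log2(n) + 10
    # Stage 1: preliminary estimate of L_0* = P l(h*; z) by ERM on S'.
    L_hat = loss_aux.mean(axis=1).min()
    # Stage 2: penalize the empirical centered second moment on S, as in (5.1).
    emp_risk = loss_S.mean(axis=1)
    second_moment = ((loss_S - L_hat) ** 2).mean(axis=1)
    penalty = np.array([psi(16 * v, delta / C_n) for v in second_moment])
    return int(np.argmin(emp_risk + penalty))
```

For infinite classes the same objective (5.1) is minimized over $\mathcal{H}$; the computational question of doing so efficiently is outside the scope of this sketch.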
We now give our theorem on the generalization error of this two-stage procedure.

Theorem 2 (variance-dependent rate). Let $\hat L_{S'} = \inf_{h \in \mathcal{H}} P_{S'}\ell(h; z)$ be attained via empirical risk minimization on the auxiliary data set $S'$. Assume that the meaningful surrogate function $\psi(r; \delta)$ is "sub-root," i.e., $\psi(r; \delta)/\sqrt{r}$ is non-increasing over $r \in [0, 4B^2]$ for every fixed $\delta$. Then for any $\delta \in (0, \frac{1}{2})$, the moment-penalized estimator of Strategy 2 satisfies, with probability at least $1 - 2\delta$,
$$\mathcal{E}(\hat h_{\mathrm{MP}}) \le 2\psi\Big(c_1(V^* + r^*);\ \frac{\delta}{C_n}\Big),$$
where $r^*$ is the fixed point of $r \mapsto B\psi(r; \frac{\delta}{C_n})$ and $c_1$ is an absolute constant.

Remarks. 1) In high-probability bounds, $C_n$ will only appear in the form $\log(C_n/\delta)$, which is of a negligible $O(\log\log n)$ order, so there is little difference in viewing $C_n$ as an absolute constant. 2) The "sub-root" assumption in Theorem 2 is only used to bound the first-stage estimation error (see (5.3)). This assumption is not needed for the result (5.2) on the second-stage moment-penalized estimator. 3) Replacing $V^*$ with an empirical centered second moment, we can prove a fully data-dependent generalization error bound that is computable from data without knowledge of $V^*$. We leave the full discussion to Appendix A.

Comparison to existing results. The best variance-dependent rate attained by existing estimators is of the order [4]
$$O\Big(\frac{\sqrt{V^* r^*}}{B} + \frac{r^*}{B}\Big),$$
which is strictly worse than the rate proved in Theorem 2. The reasoning is similar to Statement 2 and the explanation after Theorem 1: when $V^* \le O(r^*)$ the two results are essentially identical, but our estimator can perform much better when $V^* \ge \Omega(r^*)$. Because $\psi$ is sub-root and $r^*$ is the fixed point, we can compare the orders of the two rates via
$$\psi(V^*; \delta) \overset{\text{sub-root}}{\le} \sqrt{\frac{V^*}{r^*}}\,\psi(r^*; \delta) \overset{\text{fixed point}}{=} O\Big(\frac{\sqrt{V^* r^*}}{B}\Big).$$
Since variance-dependent rates are generally used in applications that require robustness or exhibit a large worst-case boundedness parameter, $V^* \ge \Omega(r^*)$ is the more critical regime, where one wants to ensure that estimation performance does not degrade.

Discussion. From the perspective of our "uniform localized convergence" principle, the most obvious difficulty in proving Theorem 2 is establishing (5.2): the empirical second moment is sample-dependent, whereas standard tools in empirical process theory (e.g., Talagrand's concentration inequality; see Lemma 4) require the localized subsets to be independent of the samples. The core techniques in our proof overcome this difficulty by concentrating data-dependent localized subsets to data-independent ones. This idea may be of independent interest; we defer the details to Appendix E.

The tightness of our variance-dependent rates depends on the tightness of the computable surrogate function $\psi$. When covering numbers of the excess loss class are given, a direct choice is Dudley's integral bound (Lemma 3 in Appendix H), which is known to be rate-optimal for many important classes. Previous approaches usually take a simpler regularization term [9, 4] that is proportional to the square root of the empirical second moment (or empirical variance). From our viewpoint, that type of penalization is "too aggressive" for rich classes. [13] propose a regularization term that preserves convexity of the empirical risk. However, based on an equivalence proved in their paper, it shares the limitations of the approaches that penalize the square root of the empirical variance.

6 Discussion and illustrative examples

6.1 Discussion

Recall that our loss-dependent rates and variance-dependent (moment-penalized) rates are of the orders
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le O\Big(\psi(BL^*; \delta) + \frac{r^*}{B}\Big) \quad\text{and}\quad \mathcal{E}(\hat h_{\mathrm{MP}}) \le O\Big(\psi(V^*; \delta) + \frac{r^*}{B}\Big), \tag{6.1}$$
respectively. In contrast, the best known loss-dependent rates [2] and variance-dependent rates [4] are of the orders
$$\mathcal{E}(\hat h_{\mathrm{ERM}}) \le O\Big(\sqrt{\frac{L^* r^*}{B}} + \frac{r^*}{B}\Big) \quad\text{and}\quad \mathcal{E}(\hat h_{\mathrm{previous}}) \le O\Big(\frac{\sqrt{V^* r^*}}{B} + \frac{r^*}{B}\Big), \tag{6.2}$$
respectively (we use $\hat h_{\mathrm{previous}}$ to denote the previous best known moment-penalized estimator proposed in [4]).
To illustrate the noticeable gaps between our new results and previously known ones, we compare the two variance-dependent rates in (6.1) and (6.2) on two important families of "rich" classes: non-parametric classes of polynomial growth and VC-type classes. The implications of this comparison apply similarly to the loss-dependent rates.

Before presenting the advantages of the new problem-dependent rates, we discuss how to compute them. In Theorem 1 and Theorem 2, the class of concentrated functions, $\mathcal{F}$, is the excess loss class $\ell \circ \mathcal{H} - \ell \circ h^*$ in (3.2). As mentioned in earlier sections, a general way to obtain the $\psi$ function is Dudley's integral bound (Lemma 3 in Appendix H). Knowledge of the metric entropy of the excess loss class can be used to calculate Dudley's integral bound and construct the surrogate function $\psi$ needed in our theorems. Note that there is no difference between the metric entropy of the excess loss class and the metric entropy of the loss class itself: from the definitions of covering number and metric entropy, one has
$$\log \mathcal{N}(\varepsilon, \ell \circ \mathcal{H} - \ell \circ h^*, L_2(P_n)) = \log \mathcal{N}(\varepsilon, \ell \circ \mathcal{H}, L_2(P_n)).$$
We comment that almost all existing theoretical works that discuss general function classes and losses [2, 9, 15, 4] impose metric entropy conditions on the loss class/excess loss class rather than on the hypothesis class, and we follow that line as well to allow for a seamless comparison of results. As a complement, we discuss how to obtain such metric entropy conditions for practical applications in Appendix B.

6.2 Non-parametric classes of polynomial growth

Example 1 (non-parametric classes of polynomial growth). Consider a loss class $\ell \circ \mathcal{H}$ with the metric entropy condition
$$\log \mathcal{N}(\varepsilon, \ell \circ \mathcal{H}, L_2(P_n)) \le O(\varepsilon^{-2\rho}), \tag{6.3}$$
where $\rho \in (0, 1)$ is a constant. Using Dudley's integral bound to find $\psi$ and solving $r \le O(B\psi(r; \delta))$, it is not hard to verify that
$$\psi(r; \delta) = O\Big(\sqrt{\frac{r^{1-\rho}}{n}}\Big), \qquad r^* = O\Big(\Big(\frac{B^2}{n}\Big)^{\frac{1}{1+\rho}}\Big),$$
up to logarithmic and confidence-dependent factors. As a result, our variance-dependent rate (6.1) is of the order
$$\mathcal{E}(\hat h_{\mathrm{MP}}) \le O\Big(\sqrt{\frac{(V^*)^{1-\rho}}{n}} + \frac{r^*}{B}\Big), \tag{6.4}$$
which is $O\big((V^*)^{\frac{1-\rho}{2}} n^{-\frac{1}{2}}\big)$ when $V^* \ge \Omega(r^*)$. In contrast, the previous best-known result (6.2) is of the order
$$\mathcal{E}(\hat h_{\mathrm{previous}}) \le O\Big(\sqrt{V^*}\, B^{-\frac{\rho}{1+\rho}}\, n^{-\frac{1}{2+2\rho}} + \frac{r^*}{B}\Big), \tag{6.5}$$
which is $O\big(\sqrt{V^*}\, B^{-\frac{\rho}{1+\rho}}\, n^{-\frac{1}{2+2\rho}}\big)$ when $V^* \ge \Omega(r^*)$. Therefore, for an arbitrary choice of $n$, $V^*$, $B$, the "sub-optimality gap" is
$$\text{ratio between (6.5) and (6.4)} = \frac{\sqrt{V^*}\, B^{-\frac{\rho}{1+\rho}}\, n^{-\frac{1}{2+2\rho}}}{(V^*)^{\frac{1-\rho}{2}}\, n^{-\frac{1}{2}}} = \Big(V^* \big(\tfrac{n}{B^2}\big)^{\frac{1}{1+\rho}}\Big)^{\frac{\rho}{2}}, \tag{6.6}$$
which can be arbitrarily large and grows polynomially with $n$. We consider two stylized regimes (we write $a \asymp b$ when the left-hand side and the right-hand side are of the same order).

The more "traditional" regime: $B \asymp 1$, $V^* \asymp n^{-a}$ where $a > 0$ is a fixed constant. This regime captures traditional supervised learning problems where $B$ is not large, but one wants to use the relatively small order of $V^*$ to achieve "faster" rates.

The "high-risk" regime: $B \asymp n^b$ where $b > 0$ is a fixed constant, and $V^* \ll B^2$ (i.e., $V^*$ is much smaller than order $n^{2b}$). This regime captures modern "high-risk" learning problems such as counterfactual risk minimization [15], policy learning [1], and supervised learning with a limited number of samples. In those settings, the worst-case boundedness parameter is considered to scale with $n$, so one wants to avoid (or reduce) the dependence on $B$.

In both regimes, generalization errors obtained via naive (non-localized) uniform convergence arguments are worse than our approach by factors polynomial in $n$, so we only need to compare with previous variance-dependent rates.
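As a quick sanity check on Example 1, the following numeric sketch of ours evaluates the order-level expressions for (6.4), (6.5) and the gap (6.6) at illustrative values of $n$, $B$, $V^*$, $\rho$ (all absolute constants set to 1, so only the growth in $n$ is meaningful):

```python
import numpy as np

def example1_rates(n, B, V, rho):
    """Order-level rates of Example 1 with all absolute constants set to 1."""
    r_star = (B**2 / n) ** (1 / (1 + rho))                      # fixed point
    ours = np.sqrt(V ** (1 - rho) / n) + r_star / B             # order of (6.4)
    prev = np.sqrt(V) * B ** (-rho / (1 + rho)) * n ** (-1 / (2 + 2 * rho)) \
           + r_star / B                                         # order of (6.5)
    return ours, prev

# "Traditional" regime: B ~ 1, V* = n^{-1/4}, rho = 1/2 (so a < 1/(1+rho)).
for n in [1e4, 1e6, 1e8]:
    ours, prev = example1_rates(n, B=1.0, V=n ** -0.25, rho=0.5)
    print(f"n={n:.0e}  prev/ours={prev / ours:.2f}")  # gap (6.6) grows with n
```

At these parameter values the gap (6.6) equals $n^{5/48}$ up to constants, so the printed ratio grows polynomially with $n$, as claimed.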
The "traditional" regime. The "sub-optimality gap" (6.6) becomes $\big(V^* n^{\frac{1}{1+\rho}}\big)^{\frac{\rho}{2}}$. It is quite clear that when $V^* \asymp n^{-a}$ with $0 < a < \frac{1}{1+\rho}$, our variance-dependent rate improves over all previous generalization error rates by factors polynomial in $n$.

The "high-risk" regime. We restrict our attention to the simple case $B^{\frac{2}{1+\rho}} \le V^* \le 4B^2$ to gain some insight; here our result exhibits an improvement of order at least $O(n^{\frac{\rho}{2}\cdot\frac{1}{1+\rho}})$ relative to the previous result. Clearly, the larger $\rho$, the greater the improvement. Letting $\rho \to 1$, our improvement can be as large as $O(n^{\frac{1}{4}})$.

6.3 VC-type classes

Our next example considers VC-type classes. Although this classical example has been extensively studied in learning theory, our results provide strict improvements over antecedents.

Example 2 (VC-type classes). One general definition of VC-type classes (which need not be binary-valued) uses the metric entropy condition: consider a loss class $\ell \circ \mathcal{H}$ that satisfies
$$\log \mathcal{N}(\varepsilon, \ell \circ \mathcal{H}, L_2(P_n)) \le O\Big(d \log \frac{1}{\varepsilon}\Big),$$
where $d$ is the so-called Vapnik-Chervonenkis (VC) dimension [16]. Using Dudley's integral bound to find the surrogate $\psi$ and solving $r \le O(B\psi(r; \delta))$, it can be proven [7] that
$$\psi(r; \delta) = O\Big(\sqrt{\frac{dr \log \frac{8B^2}{r}}{n}}\Big), \qquad r^* = O\Big(\frac{B^2 d \log n}{n}\Big),$$
up to confidence-dependent factors. Recently, [4] proposed a moment-penalized estimator whose generalization error is of the rate
$$\mathcal{E}(\hat h_{\mathrm{previous}}) \le O\Big(\sqrt{\frac{d V^* \log n}{n}} + \frac{B d \log n}{n}\Big)$$
in the worst case, without invoking other assumptions. This result has an $O(\log n)$ gap relative to the $\Omega\big(\sqrt{dV^*/n}\big)$ lower bound [3], which holds for arbitrary sample size. There is much recent interest in determining when the sub-optimal $\log n$ factor can be removed [1, 4]. By applying Theorem 2, our refined moment-penalized estimator gives a generalization error bound of the tighter rate
$$\mathcal{E}(\hat h_{\mathrm{MP}}) \le O\Big(\sqrt{\frac{d V^* \log \frac{8B^2}{V^*}}{n}} + \frac{B d \log n}{n}\Big).$$
This closes the $O(\log n)$ gap in the regime $V^* \ge \Omega\big(\frac{B^2}{(\log n)^\alpha}\big)$, where $\alpha > 0$ is an arbitrary positive constant. Though this is not the central regime, it is the first positive result that closes the notorious $O(\log n)$ gap without invoking any additional assumptions on the loss/hypothesis class (e.g., the rather complex "capacity function" assumption introduced in [4]). We anticipate that additional improvements are possible under further assumptions on the hypothesis class and the loss function.

Broader Impact

Problem-dependent rates are relevant to many practical aspects of machine learning. Our theory may be useful in the design of estimators and the certification of confidence intervals in causal inference and counterfactual learning problems, as the focus there is typically on avoiding worst-case parameter dependence (see also the discussion on "counterfactual risk minimization" in Appendix B). We derive improved loss-dependent guarantees for empirical risk minimization, one of the most widely used methods for constructing estimators in machine learning practice. It is possible that our strategy of penalizing the empirical variance may find applications in fairness, as variance-penalized estimators have been shown to be effective in such problems [6]. Having said that, this work is not expected to have any direct societal consequence absent further adaptation to specific applications.

Funding Disclosure

This work did not receive third party funding or third party support.

References

[1] Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
[2] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
[3] Luc Devroye and Gábor Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28(7):1011–1018, 1995.
[4] Dylan J. Foster and Vasilis Syrgkanis.
Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
[5] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
[6] Tatsunori B. Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. arXiv preprint arXiv:1806.08010, 2018.
[7] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443–457. Springer, 2000.
[8] Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset Rademacher complexity. In Conference on Learning Theory, pages 1260–1285, 2015.
[9] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
[10] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4(Oct):839–860, 2003.
[11] Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25–39, 2014.
[12] Shahar Mendelson. Learning without concentration for general loss functions. Probability Theory and Related Fields, 171(1-2):459–502, 2018.
[13] Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
[14] Karthik Sridharan. Note on refined Dudley integral covering number bound. Unpublished manuscript, https://www.cs.cornell.edu/~sridharan/dudley.pdf, 2010.
[15] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
[16] Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
[17] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[18] Yunbei Xu and Assaf Zeevi. Towards optimal problem dependent generalization error bounds in statistical learning theory. arXiv preprint arXiv:2011.06186, 2020.