# Localization, Convexity, and Star Aggregation

Suhas Vijaykumar
Statistics Center and Dept. of Economics, Massachusetts Institute of Technology, Cambridge, MA 02139
suhasv@mit.edu

35th Conference on Neural Information Processing Systems (NeurIPS 2021). Supported by the MIT Jerry A. Hausman Graduate Dissertation Fellowship.

Offset Rademacher complexities have been shown to provide tight upper bounds for the square loss in a broad class of problems including improper statistical learning and online learning. We show that the offset complexity can be generalized to any loss that satisfies a certain general convexity condition. Further, we show that this condition is closely related to both exponential concavity and self-concordance, unifying apparently disparate results. By a novel geometric argument, many of our bounds translate to improper learning in a non-convex class with Audibert's star algorithm. Thus, the offset complexity provides a versatile analytic tool that covers both convex empirical risk minimization and improper learning under entropy conditions. Applying the method, we recover the optimal rates for proper and improper learning with the $p$-loss for $1 < p < \infty$, and show that improper variants of empirical risk minimization can attain fast rates for logistic regression and other generalized linear models.

## 1 Introduction

A central question in statistical learning theory asks which properties of the empirical risk minimizer can be inferred without regard to the underlying data distribution. In particular, the convergence rate of empirical risk minimization has been shown to depend both on distribution-specific properties, such as Tsybakov's margin condition (Tsybakov, 2004), and on distribution-independent or global properties, such as the VC dimension of the class and the curvature of the loss (Mendelson, 2002). The local Rademacher complexities of Bartlett et al. (2005) address this issue by giving bounds that are data dependent and also adapt to the global properties of the learning problem. More recently, the offset Rademacher complexities of Rakhlin and Sridharan (2014) and Liang et al. (2015) have been shown to provide similar bounds for the square loss in several contexts. The offset complexity provides a simplified analysis that transparently relates the geometry of the square loss to statistical properties of the estimator.

We extend offset Rademacher complexities to systematically study how curvature of the loss translates to fast rates of convergence in proper and improper statistical learning. We show that many results in the literature can be studied in a common geometric framework: from classical results on uniform convexity of the loss (Bartlett et al., 2006; Mendelson, 2002), to newer results on exponential concavity (van Erven et al., 2015; Mehta, 2017) and self-concordance (Bilodeau et al., 2020; Marteau-Ferey et al., 2019). Remarkably, our analysis applies simultaneously to convex empirical risk minimization and to improper learning with the star algorithm of Audibert (2008). In each case, excess risk is tightly bounded by the offset Rademacher complexity. Thus, we give a clean and unified treatment of fast rates in proper and improper learning with curved loss.

Our contributions may be summarized as follows.

1. We generalize the offset Rademacher complexities of Liang et al. (2015) to show that when the loss is $(\mu, d)$-convex (defined below), excess risk is bounded by a corresponding offset Rademacher process.
2. We show in this general setting that the star algorithm of Audibert (2008) satisfies an upper bound matching the convex case, allowing us to simultaneously treat convex risk minimization and improper learning under general entropy conditions.

3. We show that $(\mu, d)$-convexity captures the behavior of exponentially concave, uniformly convex, and self-concordant losses, leading to fast upper bounds for learning under the $p$-loss, the relative entropy, and the logistic loss.

4. We observe that generalized linear models (GLMs) may be expressed as learning in a non-convex class subject to the exponentially concave relative entropy loss, and introduce a general procedure, the regularized star estimator, which attains fast rates in arbitrary GLMs from a mixture of at most two proper estimators.

In the case of $k$-ary logistic regression with bounded features $X \in \mathbb{R}^q$ and each predictor norm bounded by $B$, our upper bounds on the excess risk take the form
$$\frac{kq}{n}\,\operatorname{polylog}(n, k, B, 1/\rho) \tag{1}$$
with probability at least $1 - \rho$. Thus, in the i.i.d. setting, fast rates that depend only logarithmically on $B$ are achievable by a simple, improper variant of empirical risk minimization.

### 1.1 Relation to Prior Work

This paper examines the localization or fast-rate phenomenon, where the excess loss vanishes at a rate exceeding the model's uniform law of large numbers (Koltchinskii and Panchenko, 2000; Tsybakov, 2004). Recently, Liang et al. (2015) showed that localization for the square loss is captured by the so-called offset Rademacher complexity. In addition to being analytically straightforward, the offset complexity appears in a minimax analysis of online regression (Rakhlin and Sridharan, 2014), suggesting a unified treatment of online and statistical learning. In parallel work, van Erven et al. (2015) showed that exp-concavity of the loss, often studied in online optimization, implies fast rates in the statistical setting via the probabilistic central condition (see also Juditsky et al. (2008); Mehta (2017)).

Our work unifies these results. In particular, we show that the offset complexity upper bound can be generalized to arbitrary losses satisfying a certain curvature condition that includes exp-concave losses. These results considerably extend the scope of offset Rademacher complexities in the statistical setting, as we illustrate; we also believe they can refine the minimax analysis of online regression with exp-concave loss, which we leave to subsequent work.

Another noteworthy aspect of our work is its treatment of improper learning, that is, minimizing loss compared to the class $F$ with an estimator that need not belong to $F$. Unlike prior work, which typically involved aggregation over a finite subset of $F$ (Rakhlin et al., 2017; Audibert, 2009), we provide a completely unified analysis that simultaneously leads to optimal rates for convex empirical risk minimization and improper learning under general entropy conditions. In particular, we circumvent the issue that passing to the convex hull of certain classes can blow up their complexity and lead to slower rates (Mendelson, 2002).

We apply our analysis to generalized linear models (GLMs), such as logistic regression, where standard procedures surprisingly do not attain fast rates (Hazan et al., 2014). Inspired by Foster et al. (2018), we note that any GLM can be recast as learning in a non-convex class subject to the exp-concave relative entropy.
The star algorithm therefore attains fast rates for a wide class of generalized linear models, improving on prior results that exploit strong convexity (see e.g. Rigollet (2012)). Unlike other recent works on improper logistic regression, we do not prove a polynomial time complexity for our procedure, nor do we study the online setting; these are left as open problems for future work. However, our procedure remains of interest for its simplicity (it outputs a fixed mixture of two predictors, as opposed to optimizing at each query) and for its applicability to arbitrary GLMs (Foster et al., 2018; Mourtada and Gaïffas, 2020; Jézéquel et al., 2020).

## 2 Margin inequalities

We begin our discussion with the following definition, which extends classical uniform convexity (see e.g. Chidume (2009)) by giving a quantitative bound on the extent to which the plane tangent to $f$ at $(y, f(y))$ lies below the graph of $f$.

Definition 1. Let $\mu : \mathbb{R} \to \mathbb{R}$ be increasing and convex with $\mu(0) = 0$, and let $d(x, y)$ be a pseudometric on the vector space $B$. A differentiable convex function $f : B \to \mathbb{R}$ with gradient $\nabla f$ is $(\mu, d)$-convex if for all $x, y \in B$,
$$f(x) - f(y) - \langle \nabla f|_y,\, x - y \rangle \ge \mu(d(x, y)). \tag{2}$$

In particular, this condition implies that the lower sets of $f$,
$$f_{\le x} = \{\, z \mid f(z) \le f(x) \,\},$$
are strictly convex, and that for any closed and convex $K$ there is a unique $\hat x \in \arg\min_{x \in K} f(x)$. We adopt the notation that for a subset $S \subseteq B$, $\partial S$ denotes the boundary of $S$, $S^\circ = S \setminus \partial S$ denotes the interior, and $\operatorname{co} S$ denotes the convex hull.

Remark 2. At this point, we are intentionally vague about the role of the metric, $d$. We shall subsequently see that the choice of $d$ is crucial for producing good upper bounds. The conventional choice is $d(x, y) = \|x - y\|$ and $\mu(z) = \alpha z^p$, corresponding to $p$-uniform convexity (with modulus $\alpha$). However, our results on self-concordant and exp-concave losses use non-standard choices of $d$.

By a standard argument, given in Lemma 3 below, the condition (2) on the loss implies that the empirical risk minimizer in a convex class satisfies a geometric margin inequality. Intuitively, if another hypothesis performs nearly as well as the empirical minimizer in the sample, it must also be $d$-close to the empirical minimizer.

Lemma 3. Let $K$ be a closed, convex subset of $B$. If $f$ is $(\mu, d)$-convex and $\hat x$ minimizes $f$ over $K$, then for all $x \in K$,
$$f(x) - f(\hat x) \ge \mu(d(x, \hat x)). \tag{3}$$

Proof. Let $z_\lambda = \lambda x + (1 - \lambda)\hat x$ be the line segment interpolating $x$ and $\hat x$, and put $\varphi(\lambda) = f(z_\lambda)$. By convexity of $K$, we must have $z_\lambda \in K$, hence $\varphi$ is minimized at $\lambda = 0$ by optimality of $\hat x$. Hence, $\varphi(\lambda) - \varphi(0) \ge 0$ for all $\lambda$, and $\varphi$ must satisfy $\varphi'|_{\lambda = 0} = \langle \nabla f|_{\hat x},\, x - \hat x \rangle \ge 0$. Thus,
$$f(x) - f(\hat x) \ge f(x) - f(\hat x) - \langle \nabla f|_{\hat x},\, x - \hat x \rangle \ge \mu(d(x, \hat x)),$$
proving the lemma.

Figure 1: Illustration of Lemma 3. Convexity implies the existence of a direct path $z_\lambda$ from $x$ to $\hat x$.

Figure 2: Illustration of Proposition 4. Depicted is the plane containing $(x, \hat x, \tilde x)$; $C$ is the minimal cone containing $f_{\le \tilde x}$ with vertex at $\hat x$. By optimality of $\tilde x$, $S$ cannot intersect the interior of $C \cap f_{\le \hat x}$, implying existence of one of the indicated paths.

### 2.1 The star algorithm

We now introduce our first main result, which extends the above analysis to the problem of learning in a non-convex class. Surprisingly, in the general setting of $(\mu, d)$-convexity, an inequality matching (3) is always attained by a simple, two-step variant of empirical risk minimization. Let us now introduce the procedure. Given $x \in S$, the star hull of $S$ with respect to $x$ is defined by
$$\operatorname{star}(x, S) = \{\, \lambda x + (1 - \lambda)s \mid s \in S,\ \lambda \in [0, 1] \,\}.$$
The procedure may be summarized as follows.
```
procedure STAR(S, f)
    Find x̂ that minimizes f over S
    Find x̃ that minimizes f over star(x̂, S)
    Output x̃
end procedure
```

Our next goal is to show that if $S$ is an arbitrary (not necessarily convex) set and $\tilde x$ is the output of the star algorithm, then
$$f(x) - f(\tilde x) \ge \mu\big(\tfrac{1}{3}\, d(x, \tilde x)\big) \tag{4}$$
whenever $f$ is $(\mu, d)$-convex. In particular, if $f$ is the empirical risk and $S$ is the function class, then $\tilde x$ is the star estimator introduced by Audibert (2008). Thus, whenever $(\mu, d)$-convexity leads to an upper bound for empirical risk minimization in a convex class via the inequality (3), the same bound is enjoyed up to constants by the star algorithm in an arbitrary class.

To prove (4), recall how we proved Lemma 3: we showed that if the line segment $\lambda \mapsto z_\lambda$ which linearly interpolates between any two points $x$ and $y$ lies entirely in the set $f_{\le x} \setminus f_{\le y}^\circ$, so that $f(z_\lambda) \ge f(y)$, then we have that $f(x) - f(y) \ge \mu(d(x, y))$. This is illustrated in Figure 1. Unfortunately, the existence of such a line segment $z_\lambda$ connecting $x$ and $\tilde x$ relies on the convexity of $S$. In order to extend the argument as we want to, we make two observations. Firstly, the line segment $\lambda \mapsto z_\lambda$ can be chosen to be any line segment connecting $\partial f_{\le x}$ to $\partial f_{\le y}$ and lying entirely in $f_{\le x} \setminus f_{\le y}^\circ$. Then, since $f_{\le x}$ and $f_{\le y}$ are level sets, the same argument yields
$$f(x) - f(y) = f(z_1) - f(z_0) \ge \mu(d(z_1, z_0)). \tag{5}$$
Secondly, we can show that there exist at most three such line segments which together comprise a path from $x$ to $\tilde x$. By the triangle inequality, at least one of these segments has length greater than $\tfrac{1}{3}\, d(\tilde x, x)$. Combining this with (5) yields the desired inequality (4). We state this as a proposition and prove it formally in Appendix A. A straightforward illustration of the proof, following the preceding discussion, is provided in Figure 2.

Proposition 4. Let $f$ be $(\mu, d)$-convex. Suppose $\hat x$ minimizes $f$ over $S$ and $\tilde x$ minimizes $f$ over $\operatorname{star}(\hat x, S)$. Then, for any $x \in S$,
$$f(x) - f(\tilde x) \ge \mu\big(\tfrac{1}{3}\, d(x, \tilde x)\big).$$
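To make the two-step procedure concrete, the following minimal sketch implements the star estimator for a finite (hence non-convex) class under the square loss. The code is our own illustration, not part of the paper: the function name `star_estimator`, the grid search over $\lambda$, and the toy class of constant predictors are assumptions made for the example. When the loss is convex in the prediction, step 2 is a one-dimensional convex problem on each segment, so the grid search could be replaced by an exact line search.

```python
import numpy as np

def star_estimator(F, y, loss, n_grid=101):
    """Two-step star estimator (sketch): ERM over F, then ERM over star(f_hat, F).

    F is a list of fitted-value vectors (one candidate predictor per entry);
    loss(pred, y) returns the empirical risk of a prediction vector.
    """
    # Step 1: empirical risk minimizer f_hat over the (possibly non-convex) class F.
    f_hat = min(F, key=lambda f: loss(f, y))

    # Step 2: minimize over the star hull star(f_hat, F), i.e. over all segments
    # {lam * f_hat + (1 - lam) * f : lam in [0, 1]} for f in F (grid search here).
    best, best_risk = f_hat, loss(f_hat, y)
    for f in F:
        for lam in np.linspace(0.0, 1.0, n_grid):
            mix = lam * f_hat + (1.0 - lam) * f
            r = loss(mix, y)
            if r < best_risk:
                best, best_risk = mix, r
    return best

# Example: square loss over a small non-convex (finite) class of constant predictors.
rng = np.random.default_rng(0)
y = 0.3 + rng.normal(size=200)
F = [np.full_like(y, c) for c in (-1.0, 0.0, 1.0)]
sq_loss = lambda pred, y: float(np.mean((pred - y) ** 2))
f_tilde = star_estimator(F, y, sq_loss)
print("star estimate (a constant):", round(float(f_tilde[0]), 3),
      " sample mean:", round(float(y.mean()), 3))
```

On this toy example the empirical risk minimizer over the three constants returns $0$, while the star estimate moves along the segment toward $1$ and lands near the sample mean, illustrating how the star hull recovers much of the benefit of convexity.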
## 3 From margins to upper bounds

We will now show how the previous section's margin inequalities lead to upper bounds on the excess risk.

### 3.1 Statistical setting and notation

First, let us introduce the statistical setting and some notation. We observe data in the form of $n$ i.i.d. pairs $\{(X_i, Y_i)\}_{i=1}^n \subset D \times \mathbb{R}$ on a common probability space, $P$. Given a set $F$ of predictors $f : D \to \mathbb{R}$ and a loss function $\psi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, we aim to minimize the excess $\psi$-risk relative to $F$,
$$\mathcal{E}(\hat f; F, \psi) = \mathbb{E}\,\psi(\hat f(X), Y) - \inf_{f \in F}\big\{\mathbb{E}\,\psi(f(X), Y)\big\}.$$
Here, $(X, Y)$ is an independent pair with the same law as the observed data $(X_i, Y_i)$. We will suppress the dependence on $F$ and $\psi$ where confusion does not arise. We will use the shorthand $h$ or $h_i$ to denote the random variables $h(X)$ or $h(X_i)$, and will denote the function $(x, y) \mapsto \psi(h(x), y)$ by simply $\psi(h)$. Frequently, we combine these two abuses of notation to write $\psi_i(h) = \psi(h(X_i), Y_i)$. We write $\psi \circ F$ for the corresponding class of functions $\{\psi(h) \mid h \in F\}$, and will use $\mathbb{E}_n$ to denote the empirical expectation $\frac{1}{n}\sum_i \delta_{(X_i, Y_i)}$.

### 3.2 Empirical margin inequality

Our first lemma extends $(\mu, d)$-convexity of the loss $x \mapsto \psi(x, y)$ to the empirical risk.

Lemma 5. Suppose that for each $y$, $x \mapsto \psi(x, y)$ is $(\mu, d)$-convex in $x$. Then the empirical $\psi$-risk, given by $f \mapsto \mathbb{E}_n\psi(f)$, is $(\mu, d_n)$-convex in $f$, where $d_n = \mu^{-1}\big(\mathbb{E}_n(\mu \circ d)\big)$.

Proof. By $(\mu, d)$-convexity, we have for each $1 \le i \le n$
$$\psi_i(f) - \psi_i(g) - \langle \psi'|_{g_i},\, f_i - g_i \rangle \ge \mu(d(f_i, g_i)).$$
Averaging these inequalities, noting that the inner product and sub-gradient commute with the averaging operation, and also that $\mu \circ \mu^{-1}\big(\mathbb{E}_n(\mu \circ d)\big) = \mathbb{E}_n(\mu \circ d)$, yields Lemma 5.

Now, if $F$ is convex with empirical risk minimizer $\hat f$, applying Lemma 3 with $x$ chosen to be the population risk minimizer $f^*$ gives
$$\mathbb{E}_n\psi(f^*) - \mathbb{E}_n\psi(\hat f) \ge \mathbb{E}_n\,\mu\big(d(\hat f, f^*)\big). \tag{6}$$
Similarly, if $F$ is not convex and $\tilde f$ is produced by the star algorithm applied to $\mathbb{E}_n\psi$, we have by Proposition 4
$$\mathbb{E}_n\psi(f^*) - \mathbb{E}_n\psi(\tilde f) \ge \mathbb{E}_n\,\mu\big(d(\tilde f, f^*)/3\big). \tag{7}$$
We note that in our applications, $\mu$ will typically have the form $x \mapsto c x^2$, so (6) and (7) are comparable up to a constant factor of 9. To avoid complication, we will only work with the more general inequality (7). The reader may note that in a convex class the star estimator coincides with the empirical risk minimizer, so we are treating both cases simultaneously at the expense of slightly larger constants.

Example 6 (Square loss). Before proceeding, it is helpful to consider the square loss $\psi(x, a) = (x - a)^2$ as an example. A convenient aspect of $(\mu, d)$-convexity is that the optimal $\mu \circ d$ can be computed as
$$\mu(d(x, y)) = \psi(x) - \psi(y) - \langle \psi'|_y,\, x - y \rangle = \lim_{\lambda \to 0}\, \frac{1}{\lambda}\Big[\lambda\psi(x) + (1 - \lambda)\psi(y) - \psi(\lambda x + (1 - \lambda)y)\Big],$$
which in this case is simply $(x - y)^2$. Taking $d(x, y) = |x - y|$ and $\mu(z) = z^2$, we recover the familiar inequality
$$\mathbb{E}_n(Y - f^*)^2 - \mathbb{E}_n(Y - \hat f)^2 \ge \mathbb{E}_n(f^* - \hat f)^2.$$
Moving to the non-convex case with the star algorithm introduces a factor of $1/9$ on the right-hand side, improving on Liang et al. (2015) by a constant. This analysis extends, with little complication, to the $p$-loss for all $1 < p < \infty$. We will discuss these results in Section 5.1.

### 3.3 The offset empirical process

We have now derived a family of margin inequalities for the star estimator,
$$\mathbb{E}_n\psi(f^*) - \mathbb{E}_n\psi(\tilde f) - \mathbb{E}_n\,\mu\big(d(f^*, \tilde f)/3\big) \ge 0. \tag{8}$$
Adding this to the excess risk gives us the upper bound
$$\mathcal{E}(\tilde f) = \mathbb{E}\psi(\tilde f) - \mathbb{E}\psi(f^*) \le (\mathbb{E}_n - \mathbb{E})\big(\psi(f^*) - \psi(\tilde f)\big) - \mathbb{E}_n\,\mu\big(d(\tilde f, f^*)/3\big) \le \sup_{f \in F^\star}\Big\{ (\mathbb{E}_n - \mathbb{E})\big(\psi(f^*) - \psi(f)\big) - \mathbb{E}_n\,\mu\big(d(f, f^*)/3\big) \Big\} \tag{9}$$
for $F^\star = \bigcup_{\lambda \in [0,1]}\big(\lambda F + (1 - \lambda)F\big)$ (the Minkowski sum). After symmetrization and contraction arguments, which follow along the lines of Liang et al. (2015), we have the following result. The proof is given in Appendix B.

Proposition 7. Let $F$ be a model class, $\psi$ a $(\mu, d)$-convex loss, and $f^*$ the population minimizer of the $\psi$-risk. Then, the star estimator $\tilde f$ satisfies the excess risk bound
$$\mathbb{E}\,\Psi\big(\mathcal{E}(\tilde f)\big) \le \mathbb{E}\,\Psi\bigg( \sup_{f \in F^\star} \frac{1}{n}\sum_{i=1}^n \Big[ 2\varepsilon_i\big(\psi_i(f^*) - \psi_i(f)\big) - (1 + \varepsilon_i)\,\mu\big(\tfrac{1}{3}\, d(f_i, f^*_i)\big) \Big] \bigg), \tag{10}$$
where the $\varepsilon_i$ are i.i.d. symmetric Rademacher random variables, $F^\star = \bigcup_{\lambda \in [0,1]}\big(\lambda F + (1 - \lambda)F\big)$, and $\Psi : \mathbb{R} \to \mathbb{R}$ is any increasing, convex function.

Remark 8. Proposition 7 quantifies localization for the loss $\psi$. The excess risk is upper bounded by a non-centered Rademacher complexity with the asymmetric term $\mu\big(d(f_i, f^*_i)/3\big)$. This term captures the fact that hypotheses $f$ which are far from $f^*$ perform worse in expectation, and are thus unlikely to be selected. As a result, these hypotheses do not contribute much to the excess risk.

Remark 9. This bound contains the unknown quantity $f^* \in F$, and using it requires taking the supremum over possible $f^*$. This leaves open the possibility of exploiting prior information of the form $f^* \in P \subseteq F$. While our bound is stated in expectation, a more elaborate symmetrization argument can provide data-dependent bounds on the excess risk; see Liang et al. (2015).
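To give a feel for the quantity on the right-hand side of (10), the following Monte Carlo sketch (ours, not the paper's) estimates the offset complexity for the square loss, where $\mu(z) = z^2$ and $d(x, y) = |x - y|$, over a finite class of constant predictors. For simplicity the supremum is taken over the finite class itself rather than the enlargement $F^\star$, and the names `offset_complexity`, `preds`, and `fstar` are ours.

```python
import numpy as np

def offset_complexity(preds, fstar, y, n_mc=2000, seed=0):
    """Monte Carlo estimate of the offset complexity in Proposition 7 for the
    square loss (mu(z) = z**2, d(x, y) = |x - y|), with the supremum taken over
    the finite list `preds` instead of the full enlargement F-star."""
    rng = np.random.default_rng(seed)
    psi_star = (fstar - y) ** 2                          # psi_i(f*)
    psis = [(f - y) ** 2 for f in preds]                 # psi_i(f)
    offsets = [((f - fstar) / 3.0) ** 2 for f in preds]  # mu(d(f_i, f*_i) / 3)
    n = len(y)
    vals = np.empty(n_mc)
    for t in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)            # Rademacher multipliers
        vals[t] = max(
            np.mean(2.0 * eps * (psi_star - p) - (1.0 + eps) * off)
            for p, off in zip(psis, offsets)
        )
    return float(vals.mean())

rng = np.random.default_rng(1)
y = 0.3 + rng.normal(size=200)
preds = [np.full_like(y, c) for c in np.linspace(-1.0, 1.0, 21)]
fstar = np.full_like(y, 0.3)   # population risk minimizer for this toy distribution
print("estimated offset complexity:", offset_complexity(preds, fstar, y))
```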
## 4 Exp-concave localization

We will now investigate how our upper bound (10) applies to $\eta$-exp-concave losses, namely those for which $e^{-\eta\psi}$ is concave, or equivalently $\psi''/(\psi')^2 \ge \eta$. It turns out that this condition leads to very natural bounds on the excess risk. To provide intuition as to why, note that exploiting the bound (10) requires comparing the risk increments $\psi(f) - \psi(f^*)$ to the offset $\mu(d(f, f^*))$. Under strong convexity, we have an offset of the form $\alpha|f - f^*|^2$ and must resort to a Lipschitz bound of the form $\psi(f) - \psi(f^*) \le \|\psi\|_{\mathrm{lip}}\,|f - f^*|$ to make the necessary comparison. Thus, we have upper bounded the first derivative uniformly by $\|\psi\|_{\mathrm{lip}}$ and lower bounded the second derivative uniformly by $\alpha$. It turns out that exp-concave losses possess an offset term of the form $|\psi(f) - \psi(f^*)|^2$. Thus, localization naturally occurs at the same scale as the risk increments. Exp-concavity only requires the ratio $\psi''/(\psi')^2$ to stay bounded below, which is crucial for the log loss, where $\|\psi\|_{\mathrm{lip}} = \infty$ but $\psi''/(\psi')^2 = 1$.

First, we'll establish a particular version of $(\mu, d)$-convexity for the logarithmic loss. We will subsequently extend it to all exp-concave losses.

Lemma 10. For all $x, y \in [\delta, 1]$,
$$(-\ln x) - (-\ln y) - \langle (-\ln)'|_y,\, x - y \rangle \ge \frac{|\ln x - \ln y|^2}{2\ln(1/\delta) \vee 4}. \tag{11}$$

Proof. The left-hand side of (11) is equal to $\ln(y/x) + (x - y)/y = e^{-z} - 1 + z$ with the substitution $z = \ln(y) - \ln(x)$. Finally, we verify in Lemma G3 that
$$e^{-z} - 1 + z \ge \frac{z^2}{2\ln(1/\delta) \vee 4}, \tag{12}$$
since our assumption $x, y \in [\delta, 1]$ implies $|z| \le \ln(1/\delta)$.

The inequality (11) establishes that the logarithmic loss, when restricted to $[\delta, 1]$, is $(\mu, d)$-convex for $\mu : x \mapsto x^2/(2\ln(1/\delta) \vee 4)$ and $d(x, y) = |\ln x - \ln y|$. This choice of $d$ captures an important fact: the logarithm is more convex precisely where it is faster varying. Remarkably, a simple reduction shows that the same property is enjoyed by bounded, $\eta$-exp-concave losses. Thus, we neatly recover past results on exp-concave learning (cf. Mehta (2017)), and extend them to the improper setting.

Proposition 11. Suppose $\psi$ is an $\eta$-exp-concave loss function which satisfies the uniform bound $0 \le \psi \le m$. Then, for each $x$ and $y$,
$$\psi(x) - \psi(y) - \langle \psi'|_y,\, x - y \rangle \ge \frac{|\psi(x) - \psi(y)|^2}{2m \vee (4/\eta)}. \tag{13}$$

Proof. First note that since $e^{-\eta\psi}$ is concave, we have
$$\exp\big(-\eta\psi(\lambda x + (1 - \lambda)y)\big) \ge \lambda e^{-\eta\psi(x)} + (1 - \lambda)e^{-\eta\psi(y)}.$$
Note that equality holds at $\lambda = 0$. Taking logarithms of both sides and passing to the upper sub-gradient at $\lambda = 0$ therefore yields
$$-\eta\,\langle \psi'|_y,\, x - y \rangle \ge \big\langle (\ln)'|_{e^{-\eta\psi(y)}},\, e^{-\eta\psi(x)} - e^{-\eta\psi(y)} \big\rangle. \tag{14}$$
We then apply Lemma 10 with $x$ and $y$ replaced by $e^{-\eta\psi(x)}$ and $e^{-\eta\psi(y)}$, and with $\delta = e^{-\eta m}$. Using (14) to bound the gradient term then yields Proposition 11.

Plugging (13) into Proposition 7 and making a contraction argument yields the following upper bound for the excess $\psi$-risk. The full argument is in Appendix C.

Theorem 12. Let $\psi$ be an $\eta$-exp-concave loss function taking values in $[0, m]$. Then the star estimator in $F$ satisfies the excess risk bound
$$\mathbb{E}\,\Psi\big(\mathcal{E}(\tilde f)\big) \le \mathbb{E}\,\Psi\bigg( \sup_{f, g \in F^\star} \frac{1}{n}\sum_{i=1}^n \Big[ 4\varepsilon_i\big(\psi_i(f) - \psi_i(g)\big) - \eta\big(\psi_i(f) - \psi_i(g)\big)^2 \Big] \bigg), \tag{15}$$
where $\Psi$ is any increasing, convex function and $F^\star = \bigcup_{\lambda \in [0,1]}\big(\lambda F + (1 - \lambda)F\big)$. Alternatively, when $\psi$ is $p$-uniformly convex with modulus $\alpha$ and $\|\psi\|_{\mathrm{lip}}$-Lipschitz, we have
$$\mathbb{E}\,\Psi\big(\mathcal{E}(\tilde f)\big) \le \mathbb{E}\,\Psi\bigg( \sup_{f, g \in F^\star} \frac{1}{n}\sum_{i=1}^n \Big[ 4\|\psi\|_{\mathrm{lip}}\,\varepsilon_i(f_i - g_i) - \alpha|f_i - g_i|^p \Big] \bigg). \tag{16}$$

Remark 13. An especially nice feature of (15), as well as (18) below, is that all quantities appearing in the bound are invariant to affine re-parameterizations of the learning problem.
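The exp-concave margin inequality (13) rests on the elementary modulus (12). The short script below is our own sanity check, not part of the paper's argument: it verifies (12) on a grid for a particular $\delta$.

```python
import numpy as np

# Check e^{-z} - 1 + z >= z^2 / (2 ln(1/delta) v 4) for |z| <= ln(1/delta),
# the elementary inequality (12) behind Lemma 10 and Proposition 11.
delta = 1e-3
L = np.log(1.0 / delta)
z = np.linspace(-L, L, 200001)
lhs = np.exp(-z) - 1.0 + z
rhs = z ** 2 / max(2.0 * L, 4.0)
print("inequality (12) holds on the grid:", bool(np.all(lhs >= rhs - 1e-12)))
print("smallest slack on the grid:", float((lhs - rhs).min()))
```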
Self-concordance. We close the section by considering self-concordant losses $\psi$, namely those for which $|\psi'''| \le 2(\psi'')^{3/2}$. In order to see how localization for self-concordant losses can be fit into the above framework, we rely on the following inequality, which was also used by Bilodeau et al. (2020).

Lemma 14 (Nesterov (2014, Theorem 4.1.7)). Suppose $\psi$ is self-concordant, and define the local norm $\|z\|_{\psi, w} = \sqrt{z^2\,\psi''(w)}$. Then
$$\psi(x) - \psi(y) - \langle \psi'|_y,\, x - y \rangle \ge \|x - y\|_{\psi, y} - \ln\big(1 + \|x - y\|_{\psi, y}\big). \tag{17}$$

The first thing that one should note is that, when one takes $\psi$ to be the negative logarithm, the above is exactly equivalent to the modulus (12) appearing in the proof of Lemma 10. Thus, our upper bound for exp-concave losses is a consequence of self-concordance of the logarithm. The path to general upper bounds via self-concordance is significantly more involved, so we will only sketch it. Applying (17) to the excess risk and performing standard manipulations, one arrives at the following upper bound, proved in Appendix D.

Proposition 15. If $\psi$ is a self-concordant loss and $\hat f$ is the empirical risk minimizer in a convex class $F$, then
$$\mathbb{E}\,\Psi\big(\mathcal{E}(\hat f)\big) \le \mathbb{E}\,\Psi\bigg( \sup_{f \in F} \frac{1}{n}\sum_{i=1}^n \Big[ 4\big(\psi_i(f) - \psi_i(f^*)\big)\varepsilon_i - \omega\big(\|f_i - f^*_i\|_{\psi, f^*_i}\big) \Big] \bigg) \tag{18}$$
for $\omega(z) = z - \ln(1 + z)$ and $\|z\|_{\psi, w} = \sqrt{z^2\,\psi''(w)}$, where $(\varepsilon_i)_{i=1}^n$ are independent, symmetric Rademacher random variables and $\Psi$ is any increasing, convex function.

Comparing this to (16), the scale of the offset term has been improved from the modulus of strong convexity, which is the minimum second derivative, to the second derivative at the risk minimizer, $f^*$. Expressing the increments $\psi_i(f) - \psi_i(f^*)$ at a scale comparable to $\|f_i - f^*_i\|_{\psi, f^*_i}$ is known to be achievable when $f$ belongs to the so-called Dikin ellipsoid of $(\mathbb{E}_n\psi, f^*)$, requiring a more careful argument (see e.g. Marteau-Ferey et al. (2019)). Notably, we do not know how to extend this bound to the improper setting due to the local nature of the offset term.

## 5 Applications

We will now discuss several applications of Theorem 12. First, we'll sketch how the previous section's results straightforwardly imply fast rates. We'll then discuss applications, first to the $p$-loss and then to generalized linear models. Proofs in both cases are to be found in Appendix F. While the most general bound uses chaining, and is provided by Theorem 19 below, the essence of the argument is simple; we will sketch it now.

Let $S_\epsilon(A, \nu)$ be a minimal covering of $A$ in $L_2(\nu)$ at resolution $\epsilon$, and define the entropy numbers $H_2(\epsilon, G) = \sup_\nu \ln\big(\#S_\epsilon(G, \nu)\big)$, where the supremum is over probability measures $\nu$.² Putting $S_\epsilon = S_\epsilon(\psi \circ F^\star, P_n)$ and applying (15) with $\Psi(z) = \exp(\lambda n z)$ gives
$$\mathbb{E}\exp\big(\lambda n(\mathcal{E}(\tilde f) - \epsilon)\big) \le \mathbb{E}\exp\bigg(\lambda n \sup_{s, t \in S_\epsilon} \frac{1}{n}\sum_{i=1}^n \Big[ 4\varepsilon_i(s_i - t_i) - \eta(s_i - t_i)^2 \Big]\bigg) \le \sum_{s, t \in S_\epsilon} \mathbb{E}\,\mathbb{E}_\varepsilon \exp\bigg(\sum_{i=1}^n \Big[ 4\lambda\varepsilon_i(s_i - t_i) - \lambda\eta(s_i - t_i)^2 \Big]\bigg),$$
where we first moved the supremum outside the exponential (by monotonicity of $\exp$) and then bounded it by a sum (by positivity of the arguments). Here, $\mathbb{E}_\varepsilon$ denotes an expectation with respect to the multipliers $\varepsilon_i$, conditional upon the data. Applying Hoeffding's lemma to the inner expectations with respect to $\varepsilon_i$ gives
$$\sum_{s, t \in S_\epsilon} \mathbb{E}\exp\bigg(\sum_{i=1}^n \Big[ 2\lambda^2(s_i - t_i)^2 - \lambda\eta(s_i - t_i)^2 \Big]\bigg).$$
Choosing $\lambda = (36m \vee 72/\eta)^{-1}$ gives $\mathbb{E}\exp\big(\lambda n(\mathcal{E}(\tilde f) - \epsilon)\big) \le \mathbb{E}(\#S_\epsilon)^2 \le \exp\{2 H_2(\epsilon, \psi \circ F^\star)\}$, so by a Chernoff bound
$$P\bigg( \mathcal{E}(\tilde f) \ge \epsilon + \big(36m \vee 72/\eta\big)\,\frac{2 H_2(\epsilon, \psi \circ F^\star) + \ln(1/\rho)}{n} \bigg) \le \rho. \tag{19}$$

The result (19) extends Mehta (2017, Theorem 1) to the non-convex and improper setting for the price of a factor 9, and differs from the standard bound in numerous ways. Firstly, the Lipschitz constant $\|\psi\|_{\mathrm{lip}}$ is replaced by the range, $m$. Here, $\|\psi\|_{\mathrm{lip}}$ appears only implicitly in the covering numbers, entering logarithmically in small classes. For the log loss, this is an exponential improvement. Next, the modulus of strong convexity is replaced by the exp-concavity modulus $\eta$. This implies that the optimal $1/n$ rate can be obtained, in both proper and improper settings, for losses which are not strongly convex but are exp-concave, notably regression with the $p$-loss for $p > 2$. With further work, it is also possible to derive high-probability bounds where $\nu$ is restricted to be the empirical distribution of the observed data; see Remark 9. Lastly, we show in Appendix Lemma G4 that the entropy numbers of $\psi \circ F^\star$ are equivalent to those of $\psi \circ F$ in all but finite classes, where the gap is at most $\ln(1/\epsilon)$. On the other hand, when the entropy numbers of $\psi \circ F$ are polynomial, particularly $\epsilon^{-q}$ with $q < 2$, then they can be polynomially smaller than those of $\psi \circ (\operatorname{co} F)$ (Carl et al., 1999).

² Our results can be sharpened by restricting the supremum to empirical probability measures on $n$ data points.
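As a rough illustration of how (19) produces fast rates, the sketch below numerically minimizes its right-hand side over the resolution $\epsilon$ under a hypothetical parametric entropy assumption $H_2(\epsilon) \approx p\ln(3/\epsilon)$. Both the assumption and the function `rhs_of_19` are ours, chosen only to exhibit the $(p\log n)/n$ behavior of the bound.

```python
import numpy as np

def rhs_of_19(n, p_dim, m=1.0, eta=1.0, rho=0.05):
    """Numerically minimize eps + (36m v 72/eta) * (2*H2(eps) + ln(1/rho)) / n
    over eps, under the illustrative parametric assumption H2(eps) = p*ln(3/eps)."""
    const = max(36.0 * m, 72.0 / eta)
    eps = np.logspace(-10, 0, 500)
    h2 = p_dim * np.log(3.0 / eps)
    return float(np.min(eps + const * (2.0 * h2 + np.log(1.0 / rho)) / n))

for n in (10**3, 10**4, 10**5, 10**6):
    b = rhs_of_19(n, p_dim=10)
    # n * bound grows only logarithmically, i.e. the bound itself decays like (p log n)/n.
    print(f"n = {n:>7d}:  bound ~ {b:.5f}   n * bound ~ {n * b:.0f}")
```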
### 5.1 Regression with $p$-loss

A surprising consequence of (19) concerns regression with the $p$-loss for $p > 2$, which is $p$-uniformly convex but not strongly convex. When the inputs to the loss are bounded, the standard argument pursued by Mendelson (2002) and Bartlett et al. (2006) (and also Rakhlin and Sridharan (2015) in the online setting) exploits uniform convexity, producing the suboptimal rate exponent $p/(2p - 2) < 1$. However, the $p$-loss for inputs in $[-B, B]$ is $B^{-p}$-exp-concave, so we recover optimal $(1/n)$-type rates matching the square loss (cf. Audibert (2009)). Indeed, the suboptimal rates occur exactly when one plugs (16) into the above argument instead of (15). For $1 < p \le 2$, the loss is both strongly convex and exp-concave, so the two bounds coincide up to constants.

Corollary 16 (cf. Mendelson (2002, Theorem 5.1)). Let $\psi(f, y) = |f - y|^p$ for $p > 1$ and let the class $F$ and response $Y$ take values in $[-B, B]$. Then there exists a universal $C(p, B) = O(p\, 2^p B^p)$ such that the star estimator $\tilde f$ has excess $\psi$-risk bounded as
$$P\bigg( \mathcal{E}(\tilde f; F) \ge \epsilon + \frac{C(p, B)\big(H_2(\epsilon/C_{p,B}, F) + \ln(1/\epsilon) + \ln(1/\rho)\big)}{n} \bigg) \le \rho.$$
By a minor extension to Tsybakov (2003, Theorem 1), this result is sharp for each $p$.

### 5.2 Improper maximum likelihood

We next derive upper bounds for learning under the log loss, $\psi : f \mapsto -\ln f$, in a class $L$. If we interpret $f$ as the likelihood of an observation, empirical risk minimization coincides with maximum likelihood. In this setting, a generalized linear model for multi-class prediction refers to the case where $L = \{(x, y) \mapsto \langle \phi(f(x)), y \rangle \mid f \in F\}$ for some base class $F$. Here the link function $\phi$ outputs probability distributions over $[k]$, and $y \in \mathbb{R}^k$ is a basis vector encoding the label. These models are typically formulated as minimizing the loss $\psi[\phi] : f \mapsto -\ln\langle \phi(f), y \rangle$ over $f \in F$, which is equivalent for proper learning. Since $L$ may be non-convex even for convex $F$, however, one must consider improper predictors $\hat f \notin L$ in order to exploit localization of the log loss. When $\phi$ is surjective, these yield improper predictors under $\psi[\phi]$ by composing with a right-inverse, $\phi^{-}$.

To illustrate, we will derive our quoted bound (1) for logistic regression. In this setting, Hazan et al. (2014) showed that improper learning is necessary for fast rates with sub-exponential dependence on the predictor norm, $B$. This was achieved by Foster et al. (2018), whose results fall neatly into the aforementioned recipe: they propose a mixed prediction corresponding to an exponentially weighted average, which is simply another estimator that takes values in the convex hull of $L$.

For classes not bounded away from 0, where the improvement is most dramatic, we need to replace $L$ by the $\delta$-regularization $L_\delta = (1 - \delta)L + \delta$. The approximation error is controlled as follows.

Lemma 17 (Foster et al. (2018)). For all $f$ and $\delta \in (0, 1/2]$, the excess risk relative to $L_\delta$ satisfies
$$\mathcal{E}(f; L_\delta) \ge \mathcal{E}(f; L) - 2\delta.$$
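To make the regularization step concrete, the sketch below (ours) implements one common variant that mixes each predicted distribution with the uniform distribution over the $k$ labels; the text's $L_\delta = (1 - \delta)L + \delta$ instead adds mass directly to the realized likelihood, but the effect is the same in spirit. Mixing caps the log loss at roughly $\ln(k/\delta)$ while increasing the loss of any fixed predictor by at most $-\ln(1 - \delta) \le 2\delta$, which is the mechanism behind Lemma 17.

```python
import numpy as np

def regularize(probs, delta):
    """Mix each row of an (n, k) array of predicted distributions with uniform mass."""
    k = probs.shape[1]
    return (1.0 - delta) * probs + delta / k

rng = np.random.default_rng(0)
n, k, delta = 1000, 3, 0.01
logits = rng.normal(size=(n, k))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(k, size=n)

p = probs[np.arange(n), labels]                   # realized likelihoods under the model
p_reg = regularize(probs, delta)[np.arange(n), labels]

print("log loss:            ", float(-np.log(p).mean()))
print("regularized log loss:", float(-np.log(p_reg).mean()))
print("per-sample increase is at most -ln(1 - delta) =", float(-np.log(1.0 - delta)),
      "and the regularized loss never exceeds ln(k/delta) =", float(np.log(k / delta)))
```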
We now state our result.

Corollary 18. With probability at least $1 - \rho$, the star estimator $\tilde f_\delta$ in $L_\delta$ satisfies
$$\mathcal{E}(\tilde f_\delta; L) \le \epsilon + 2\delta + \frac{C\ln(1/\delta)\big(H_2(\delta\epsilon, L) + \ln(1/\epsilon\delta) + \ln(1/\rho)\big)}{n}.$$
Let $L$ be the generalized linear model corresponding to $F_B = \{x \mapsto Wx \mid W \in \mathbb{R}^{k \times q},\ \|W\|_2 \le B\}$ with $A$-Lipschitz, surjective link $\phi$ and features $X \in \mathbb{R}^q$ that satisfy $\|X\|_2 \le R\sqrt{q}$. Then with probability at least $1 - \rho$,
$$\mathcal{E}(\phi^{-} \circ \tilde f_\delta; F_B) \le \frac{\ln(n)}{n}\Big\{ C k q \ln(A B R n k) + \ln(1/\rho) \Big\}. \tag{22}$$

Our bound for logistic regression, quoted in the introduction, follows after noting that the soft-max function is 1-Lipschitz and surjective.

Before investigating how these bounds translate to larger classes, namely those with $H_2(\epsilon, F)$ of order $\epsilon^{-q}$ for small $\epsilon$, we need to state the chaining version of our bound, proved in Appendix E.

Theorem 19. Let $\psi$ be an $\eta$-exp-concave loss taking values in $[0, m]$. Then, with probability at least $1 - 9e^{-z}$, the star estimator $\tilde f$ applied to $(\psi, F)$ satisfies
$$\mathcal{E}(\tilde f) \lesssim \inf_{0 \le \alpha \le \gamma} \bigg\{ \alpha + \frac{1}{\sqrt n}\int_\alpha^\gamma \sqrt{H_2(s)}\,ds + \frac{2 H_2(\gamma) + z}{c\, n} \bigg\},$$
where $H_2(s) = H_2(s, \psi \circ F^\star)$ and $c = 36^{-1}(1/m \wedge \eta/2)$.

In classes with entropy $\epsilon^{-q}$, our results fall into two regimes. The first covers losses such as the logistic loss, for which $\ln \circ\, \phi$ is $A$-Lipschitz. Here, the covering numbers of $\ln \circ L_\delta$ differ from those of $F$ by the constant $A^q$, and we match the optimal rates for the square loss (Rakhlin et al., 2017). For general $L$, the covering numbers of $\ln \circ L_\delta$ may contain an extra factor $\delta^{-q}$, leading to suboptimal rates. Suboptimality follows from combining the minimax regret analysis of Bilodeau et al. (2020), which produces the average regret $n^{-1/(1+q)}$, with a standard online-to-batch reduction. We note that the upper bound of that paper is non-constructive, whereas these upper rates come with an explicit algorithm. Tightening our analysis to close this gap, or proposing an algorithm that attains the correct rates, remains a significant open problem.

Corollary 20. Consider a generalized linear model with $A$-Lipschitz loss $f \mapsto -\ln\langle \phi(f), y \rangle$. Suppose the entropy numbers $H_2(\epsilon; F)$ are of order $\epsilon^{-q}$. Then, the regularized star estimator $\phi^{-} \circ \tilde f_\delta$ with $\delta = 1/n$ satisfies the rates appearing on the left of (24). On the other hand, for an arbitrary class $L$ taking values in $[0, 1]$ subject to the log loss, the regularized star estimator $\tilde f_\delta$ for appropriately chosen $\delta$ attains the rates appearing on the right. Here, the symbol $\lesssim_\rho$ denotes an upper bound that holds with probability $1 - \rho$, hiding universal constants and a multiplicative factor $\ln(1/\rho)$.

| | $\mathcal{E}(\phi^{-} \circ \tilde f_\delta)$ (Lipschitz GLM) | $\mathcal{E}(\tilde f_\delta)$ (general $L$, log loss) |
|---|---|---|
| $q < 2$ | $\lesssim_\rho A^q\, n^{-2/(2+q)}\ln(n)$ | $\lesssim_\rho n^{-1/(1+3q/2)}$ |
| $q = 2$ | $\lesssim_\rho A^q\, n^{-1/2}\ln(n)$ | $\lesssim_\rho n^{-1/4}\ln(n)$ |
| $q > 2$ | $\lesssim_\rho A^q\, n^{-1/q}$ | $\lesssim_\rho n^{-1/(2q)}$ |

(24)

In closing, we would like to highlight the additional, unresolved problem of designing a computationally efficient implementation of the regularized star estimator, in particular for log-concave link functions $\phi$, where the first step is a convex program, such as logistic regression. This problem is left to future work.

## Acknowledgments and Disclosure of Funding

This work was partially funded by the MIT Jerry A. Hausman Graduate Dissertation Fellowship. We would like to thank Alexander Rakhlin, Victor Chernozhukov, and Anna Mikusheva for their extensive guidance and for many helpful conversations. We would also like to thank Stefan Stein and Claire Lazar Reich for their helpful feedback on the manuscript. All errors are, however, the fault of the author.

## References

Jean-Yves Audibert. Progressive mixture rules are deviation suboptimal. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 41-48. Curran Associates, Inc., 2008.

Jean-Yves Audibert.
Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4):1591-1646, 2009. doi: 10.1214/08-AOS623. URL https://doi.org/10.1214/08-AOS623.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, August 2005. ISSN 0090-5364, 2168-8966. doi: 10.1214/009053605000000282.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association, 101(473):138-156, 2006. ISSN 0162-1459.

Blair Bilodeau, Dylan Foster, and Daniel Roy. Tight Bounds on Minimax Regret under Logarithmic Loss via Self-Concordance. In International Conference on Machine Learning, pages 919-929. PMLR, November 2020.

Bernd Carl, Ioanna Kyrezi, and Alain Pajor. Metric Entropy of Convex Hulls in Banach Spaces. Journal of the London Mathematical Society, 60(3):871-896, December 1999. ISSN 0024-6107. doi: 10.1112/S0024610799008005.

Charles Chidume. Inequalities in Uniformly Convex Spaces. In Charles Chidume, editor, Geometric Properties of Banach Spaces and Nonlinear Iterations, Lecture Notes in Mathematics, pages 29-44. Springer, London, 2009. ISBN 978-1-84882-190-3. doi: 10.1007/978-1-84882-190-3_4.

Dylan J. Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik Sridharan. Logistic Regression: The Importance of Being Improper. In Conference On Learning Theory, pages 167-208. PMLR, July 2018.

Elad Hazan, Tomer Koren, and Kfir Y. Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of the 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 197-209, Barcelona, Spain, June 2014. PMLR.

Rémi Jézéquel, Pierre Gaillard, and Alessandro Rudi. Efficient improper learning for online logistic regression. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2085-2108. PMLR, 09-12 Jul 2020. URL https://proceedings.mlr.press/v125/jezequel20a.html.

A. Juditsky, P. Rigollet, and A. B. Tsybakov. Learning by mirror averaging. Annals of Statistics, 36(5):2183-2206, October 2008. ISSN 0090-5364, 2168-8966. doi: 10.1214/07-AOS546.

Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher Processes and Bounding the Risk of Function Learning. In Evarist Giné, David M. Mason, and Jon A. Wellner, editors, High Dimensional Probability II, Progress in Probability, pages 443-457, Boston, MA, 2000. Birkhäuser. ISBN 978-1-4612-1358-1. doi: 10.1007/978-1-4612-1358-1_29.

Michel Ledoux and Michel Talagrand. Rademacher Averages. In Michel Ledoux and Michel Talagrand, editors, Probability in Banach Spaces: Isoperimetry and Processes, Ergebnisse der Mathematik und ihrer Grenzgebiete, pages 89-121. Springer, Berlin, Heidelberg, 1991. ISBN 978-3-642-20212-4. doi: 10.1007/978-3-642-20212-4_6.

Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with Square Loss: Localization through Offset Rademacher Complexity. In Conference on Learning Theory, pages 1260-1285, June 2015.

Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance. In Conference on Learning Theory, pages 2294-2340. PMLR, June 2019.

Nishant Mehta.
Fast rates with high probability in exp-concave statistical learning. In Artificial Intelligence and Statistics, pages 1085-1093. PMLR, April 2017.

S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977-1991, July 2002. ISSN 0018-9448. doi: 10.1109/TIT.2002.1013137.

Jaouad Mourtada and Stéphane Gaïffas. An improper estimator with optimal excess risk in misspecified density estimation and logistic regression. arXiv:1912.10784 [math, stat], 2020.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, first edition, 2014. ISBN 1-4613-4691-6.

Dmitry Panchenko. Symmetrization approach to concentration inequalities for empirical processes. Annals of Probability, 31(4):2068-2081, October 2003. ISSN 0091-1798, 2168-894X. doi: 10.1214/aop/1068646378.

Alexander Rakhlin and Karthik Sridharan. Online Non-Parametric Regression. In Conference on Learning Theory, pages 1232-1264, May 2014.

Alexander Rakhlin and Karthik Sridharan. Online Nonparametric Regression with General Loss Functions. arXiv:1501.06598 [cs, math, stat], January 2015.

Alexander Rakhlin, Karthik Sridharan, and Alexandre B. Tsybakov. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789-824, May 2017. ISSN 1350-7265. doi: 10.3150/14-BEJ679.

Philippe Rigollet. Kullback-Leibler aggregation and misspecified generalized linear models. Annals of Statistics, 40(2):639-665, April 2012. ISSN 0090-5364, 2168-8966. doi: 10.1214/11-AOS961.

Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23, pages 2199-2207. Curran Associates, Inc., 2010.

Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135-166, February 2004. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1079120131.

Alexandre B. Tsybakov. Optimal Rates of Aggregation. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Learning Theory and Kernel Machines, Lecture Notes in Computer Science, pages 303-313, Berlin, Heidelberg, 2003. Springer. ISBN 978-3-540-45167-9. doi: 10.1007/978-3-540-45167-9_23.

Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast Rates in Statistical and Online Learning. Journal of Machine Learning Research, 16(54):1793-1861, 2015. ISSN 1533-7928.