# Agnostic Learnability of Halfspaces via Logistic Loss

Ziwei Ji¹, Kwangjun Ahn², Pranjal Awasthi³, Satyen Kale³, Stefani Karp³ ⁴

Abstract

We investigate approximation guarantees provided by logistic regression for the fundamental problem of agnostic learning of homogeneous halfspaces. Previously, for a certain broad class of well-behaved distributions on the examples, Diakonikolas et al. (2020d) proved an $\tilde{\Omega}(\mathrm{OPT})$ lower bound, while Frei et al. (2021b) proved an $\tilde{O}(\sqrt{\mathrm{OPT}})$ upper bound, where OPT denotes the best zero-one/misclassification risk of a homogeneous halfspace. In this paper, we close this gap by constructing a well-behaved distribution such that the global minimizer of the logistic risk over this distribution only achieves $\Omega(\sqrt{\mathrm{OPT}})$ misclassification risk, matching the upper bound in (Frei et al., 2021b). On the other hand, we also show that if we impose a radial-Lipschitzness condition in addition to well-behaved-ness on the distribution, logistic regression on a ball of bounded radius reaches $\tilde{O}(\mathrm{OPT})$ misclassification risk. Our techniques also show that for any well-behaved distribution, regardless of radial Lipschitzness, we can overcome the $\Omega(\sqrt{\mathrm{OPT}})$ lower bound for logistic loss simply at the cost of one additional convex optimization step involving the hinge loss, and attain $\tilde{O}(\mathrm{OPT})$ misclassification risk. This two-step convex optimization algorithm is simpler than previous methods obtaining this guarantee, all of which require solving $O(\log(1/\mathrm{OPT}))$ minimization problems.

Part of this work was done when Ziwei Ji and Kwangjun Ahn were interns at Google. ¹Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA. ²Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA. ³Google Research, New York City, NY, USA. ⁴School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Ziwei Ji. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

In this paper, we consider the fundamental problem of agnostically learning homogeneous halfspaces. Specifically, we assume there is an unknown distribution $P$ over $\mathbb{R}^d \times \{-1, +1\}$ to which we have access in the form of independent and identically distributed samples drawn from $P$. A sample from $P$ consists of an input feature vector $x \in \mathbb{R}^d$ and a binary label $y \in \{-1, +1\}$. Our goal is to compete with a homogeneous linear classifier $\bar{u}$ (i.e., one that predicts the label $\mathrm{sign}(\langle \bar{u}, x\rangle)$ for input $x$) that achieves the optimal zero-one risk of $\mathrm{OPT} > 0$ over $P$; formally, this means
$$\Pr_{(x,y)\sim P}\big[\mathrm{sign}(\langle \bar{u}, x\rangle) \ne y\big] = \mathrm{OPT}.$$
Alternatively, we can think of the labels of the examples as first generated by $\bar{u}$, after which an OPT fraction of the labels are adversarially corrupted. There have been many algorithmic and hardness results on this topic; see Section 1.1 for a discussion.

A very natural heuristic for solving the problem is to use logistic regression. However, the analysis of logistic regression for this problem is still largely incomplete, even though it is one of the most fundamental algorithms in machine learning. One reason for this is that it can return extremely poor solutions in the worst case: Ben-David et al. (2012) showed that the minimizer of the logistic risk may attain a zero-one risk as bad as $1 - \mathrm{OPT}$ on an adversarially-constructed distribution.
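To make the two risks concrete, the following minimal numpy sketch (not from the paper; the data, the corruption model, and the classifier are illustrative placeholders) computes the empirical zero-one risk and the empirical logistic risk of a homogeneous linear classifier. Random label flips stand in for the adversarial corruption that defines OPT.

```python
import numpy as np

def zero_one_risk(w, X, y):
    """Empirical misclassification rate of the homogeneous halfspace x -> sign(<w, x>)."""
    return np.mean(np.sign(X @ w) != y)

def logistic_risk(w, X, y):
    """Empirical logistic risk: (1/n) * sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

# Toy data: labels from a ground-truth halfspace u_bar, then a small fraction
# of labels flipped at random (a stand-in for the adversarial corruption OPT).
rng = np.random.default_rng(0)
d, n, flip_rate = 5, 20_000, 0.05
X = rng.standard_normal((n, d))
u_bar = np.zeros(d)
u_bar[0] = 1.0
y = np.sign(X @ u_bar)
flip = rng.random(n) < flip_rate
y[flip] = -y[flip]

print(zero_one_risk(u_bar, X, y))  # roughly the corruption level, i.e. ~OPT
print(logistic_risk(u_bar, X, y))  # the convex surrogate that logistic regression minimizes
```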
Because of this worst-case behavior, much attention has been devoted to certain well-behaved distributions, for which much better results can be obtained. However, even when the marginal distribution on the feature space, $P_x$, is assumed to be isotropic log-concave, in a recent work, Diakonikolas et al. (2020d) proved an $\tilde{\Omega}(\mathrm{OPT})$ lower bound on the zero-one risk for any convex surrogate, including the logistic loss. On the positive side, in another recent work, Frei et al. (2021b) proved that vanilla gradient descent on the logistic loss can attain a zero-one risk of $\tilde{O}(\sqrt{\mathrm{OPT}})$, as long as $P_x$ satisfies some well-behavedness conditions. (See Sections 1.1 and 3 for precise details.)

The above results still leave a big gap between the upper and the lower bounds, raising the question of identifying the fundamental limits of logistic regression for this problem. In this work, we study this question and develop the following set of results.

A matching $\Omega(\sqrt{\mathrm{OPT}})$ lower bound. In Section 2, we construct a distribution $Q$ over $\mathbb{R}^2 \times \{-1, +1\}$, and prove a lower bound for logistic regression that matches the upper bound in (Frei et al., 2021b), thereby closing the gap in recent works (Diakonikolas et al., 2020d; Frei et al., 2021b). Specifically, the marginal distribution $Q_x$ is isotropic and bounded, and satisfies all the well-behaved-ness conditions from the aforementioned papers, but the global minimizer of the logistic risk on $Q$ only attains $\Omega(\sqrt{\mathrm{OPT}})$ zero-one risk on $Q$.

An $\tilde{O}(\mathrm{OPT})$ upper bound for radially Lipschitz densities. The lower bound mentioned above shows that one needs to make additional assumptions to prove better bounds. In Section 3, we show that by making a radial Lipschitzness assumption in addition to well-behaved-ness, it is indeed possible to achieve the near-optimal $\tilde{O}(\mathrm{OPT})$ zero-one risk via logistic regression. In particular, our upper bound result holds if the projection of $P_x$ onto any two-dimensional subspace has a Lipschitz continuous density. Moreover, our upper bound analysis is versatile: it can recover the $\tilde{O}(\sqrt{\mathrm{OPT}})$ guarantee for general well-behaved distributions shown by Frei et al. (2021b), and it also works for the hinge loss, which motivates a simple and efficient two-phase algorithm, as we describe next.

An $\tilde{O}(\mathrm{OPT})$ upper bound for general well-behaved distributions with a two-phase algorithm. Motivated by our analysis, in Section 4, we describe a simple two-phase algorithm that achieves $\tilde{O}(\mathrm{OPT})$ risk for general well-behaved distributions, without assuming radial Lipschitzness. Thus, we show that the cost of avoiding the radial Lipschitzness condition is simply an additional convex loss minimization. Our two-phase algorithm consists of logistic regression followed by stochastic gradient descent with the hinge loss (i.e., the perceptron algorithm) over a restricted domain and with a warm start. For general well-behaved distributions, the first phase can only achieve an $\tilde{O}(\sqrt{\mathrm{OPT}})$ guarantee; however, we show that the second phase can boost the upper bound to $\tilde{O}(\mathrm{OPT})$.

Previously, for any given $\epsilon > 0$, Diakonikolas et al. (2020d) designed a nonconvex optimization algorithm that can achieve an $O(\mathrm{OPT} + \epsilon)$ risk using $\tilde{O}(d/\epsilon^4)$ samples. Their algorithm requires guessing OPT within a constant multiplicative factor via a binary search and running a nonconvex SGD using each guess as an input. Similarly, prior algorithms achieving an $O(\mathrm{OPT} + \epsilon)$ risk involve solving multiple rounds of convex loss minimization (Awasthi et al., 2014; Daniely, 2015).
In contrast, our two-phase algorithm is simply logistic regression followed by a perceptron algorithm, and the output is guaranteed to have an $O(\mathrm{OPT}\ln(1/\mathrm{OPT}) + \epsilon)$ zero-one risk using only $\tilde{O}(d/\epsilon^2)$ samples.

1.1. Related work

The problem of agnostic learning of halfspaces has a long and rich history (Kearns et al., 1994). Here we survey the results most relevant to our work. It is well known that in the distribution-independent setting, even weak agnostic learning is computationally hard (Feldman et al., 2006; Guruswami & Raghavendra, 2009; Daniely, 2016). As a result, most algorithmic results have been obtained under assumptions on the marginal distribution $P_x$ over the examples. The work of Kalai et al. (2008) designed algorithms that achieve $\mathrm{OPT} + \epsilon$ error for any $\epsilon > 0$ in time $d^{\mathrm{poly}(1/\epsilon)}$ for isotropic log-concave densities and for the uniform distribution over the hypercube. There is also recent evidence that removing the exponential dependence on $1/\epsilon$, even for Gaussian marginals, is computationally hard (Klivans & Kothari, 2014; Diakonikolas et al., 2020a; Goel et al., 2020).

As a result, another line of work aims to design algorithms with polynomial running time and sample complexity (in $d$ and $1/\epsilon$) that achieve an error of $g(\mathrm{OPT}) + \epsilon$, for $g$ a simple function. Along these lines, Klivans et al. (2009) designed a polynomial-time algorithm that attains $\tilde{O}(\mathrm{OPT}^{1/3}) + \epsilon$ zero-one risk for isotropic log-concave distributions. Awasthi et al. (2014) improved the upper bound to $O(\mathrm{OPT}) + \epsilon$, using a localization-based algorithm. Balcan & Zhang (2017) further extended the algorithm to more general $s$-concave distributions. The work of Daniely (2015) further provided a PTAS guarantee: an error of $(1 + \eta)\mathrm{OPT} + \epsilon$ for any desired constant $\eta > 0$ via an improper learner.

In a recent work, Diakonikolas et al. (2020d) studied the problem for distributions satisfying certain well-behavedness conditions, which include isotropy and certain regularity conditions on the projection of $P_x$ onto any 2-dimensional subspace (see Assumption 3.2 for a subset of these conditions). This class of distributions includes any isotropic log-concave distribution, such as the standard Gaussian. In addition to their nonconvex optimization method discussed above, for any convex, nonincreasing, and nonconstant loss function, they also showed an $\Omega(\mathrm{OPT}\ln(1/\mathrm{OPT}))$ lower bound for log-concave marginals and an $\Omega(\mathrm{OPT}^{1-1/s})$ lower bound for $s$-heavy-tailed marginals. In another recent work, Frei et al. (2021b) assumed $P_x$ satisfies a soft-margin condition: for anti-concentrated marginals such as isotropic log-concave marginals, this assumes $\Pr\big[|\langle\bar{u}, x\rangle| \le \gamma\big] = O(\gamma)$ for any $\gamma > 0$. For sub-exponential distributions with soft margins, they proved an $\tilde{O}(\sqrt{\mathrm{OPT}})$ upper bound for gradient descent on the logistic loss, which can be improved to $O(\sqrt{\mathrm{OPT}})$ for bounded distributions. Note that these upper bounds and the lower bounds in (Diakonikolas et al., 2020d) do not match: if $P_x$ is sub-exponential, then Diakonikolas et al. (2020d) only gave an $\tilde{\Omega}(\mathrm{OPT})$ lower bound, while if $P_x$ is $s$-heavy-tailed, then the upper bound in (Frei et al., 2021b) becomes worse.
Finally, some prior works on agnostic learning of halfspaces have considered various extensions of the problem, such as active agnostic learning (Awasthi et al., 2014; Yan & Zhang, 2017), agnostic learning of sparse halfspaces with sample complexity scaling logarithmically in the ambient dimensionality (Shen & Zhang, 2021), and agnostic learning under weaker noise models such as random classification noise (Blum et al., 1998; Dunagan & Vempala, 2008), Massart's noise model (Awasthi et al., 2015; 2016; Zhang et al., 2020; Diakonikolas et al., 2019; 2020b; 2021; Chen et al., 2020), and the Tsybakov noise model (Diakonikolas et al., 2020c; Zhang & Li, 2021). We do not consider these extensions in our work.

1.2. Notation

Let $\|\cdot\|$ denote the $\ell_2$ (Euclidean) norm. Given $r > 0$, let $B(r) := \{x : \|x\| \le r\}$ denote the Euclidean ball with radius $r$. Given two nonzero vectors $u$ and $v$, let $\phi(u, v) \in [0, \pi]$ denote the angle between them. Given a data distribution $P$ over $\mathbb{R}^d \times \{-1, +1\}$, let $P_x$ denote the marginal distribution of $P$ on the feature space $\mathbb{R}^d$. We will frequently need the projection of the input features onto a two-dimensional subspace $V$; in such cases, it will be convenient to use polar coordinates $(r, \theta)$ for the associated calculations, such as parameterizing the density with respect to the Lebesgue measure as $p_V(r, \theta)$.

Given a nonincreasing loss function $\ell: \mathbb{R} \to \mathbb{R}$, we consider the population risk
$$\mathcal{R}_\ell(w) := \mathbb{E}_{(x,y)\sim P}\big[\ell(y\langle w, x\rangle)\big],$$
and the corresponding empirical risk
$$\widehat{\mathcal{R}}_\ell(w) := \frac{1}{n}\sum_{i=1}^n \ell(y_i\langle w, x_i\rangle),$$
defined over $n$ i.i.d. samples drawn from $P$. We will focus on the logistic loss $\ell_{\log}(z) := \ln(1 + e^{-z})$ and the hinge loss $\ell_h(z) := \max\{-z, 0\}$. Let $\mathcal{R}_{\log} := \mathcal{R}_{\ell_{\log}}$ for simplicity, and also define $\widehat{\mathcal{R}}_{\log}$, $\mathcal{R}_h$ and $\widehat{\mathcal{R}}_h$ similarly. Let $\mathcal{R}_{0\text{-}1}(w) := \Pr_{(x,y)\sim P}\big[y \ne \mathrm{sign}(\langle w, x\rangle)\big]$ denote the population zero-one risk.

2. An $\Omega(\sqrt{\mathrm{OPT}})$ lower bound for logistic loss

In this section, we construct a distribution $Q$ over $\mathbb{R}^2 \times \{-1, +1\}$ which satisfies standard regularity conditions in (Diakonikolas et al., 2020d; Frei et al., 2021a), but the global minimizer $w^*$ of the population logistic risk $\mathcal{R}_{\log}$ on $Q$ only achieves a zero-one risk of $\Omega(\sqrt{\mathrm{OPT}})$. Our focus on the global logistic optimizer is motivated by the lower bounds from (Diakonikolas et al., 2020d); in particular, this means that the large classification error is not caused by the sampling error.

The distribution $Q$ has four parts $Q_1$, $Q_2$, $Q_3$, and $Q_4$, as described below. It can be verified that if $\mathrm{OPT} \le 1/16$, the construction is valid.

1. The feature distribution of $Q_1$ consists of two squares, each with edge length $\sqrt{2\,\mathrm{OPT}}/2$ (and hence probability mass $\mathrm{OPT}/2$) and density 1: one with label $-1$, and the other with label $+1$.

2. The feature distribution of $Q_2$ is supported on $[0, \sqrt{\mathrm{OPT}}] \times [0, 1] \,\cup\, [-\sqrt{\mathrm{OPT}}, 0] \times [-1, 0]$ with density 1, and the label is given by $\mathrm{sign}(x_1)$.

3. Let $q_3 := \frac{2}{3}\sqrt{\mathrm{OPT}}\,(1 - \mathrm{OPT})$; then $Q_3$ consists of two squares: one has edge length $\sqrt{q_3/2}$, center $(1, 0)$, density 1 and label $+1$, and the other has edge length $\sqrt{q_3/2}$, center $(-1, 0)$, density 1 and label $-1$.

4. The feature distribution of $Q_4$ is the uniform distribution over the unit ball $B(1) := \{x : \|x\| \le 1\}$ with density $q_4 := \frac{1 - \mathrm{OPT} - 2\sqrt{\mathrm{OPT}} - q_3}{\pi}$, and the label is given by $\mathrm{sign}(x_1)$.

Figure 1. An illustration of $Q$ when $\mathrm{OPT} = 1/16$. Red areas denote the $+1$ label, while blue areas denote the $-1$ label. The parts $Q_1$, $Q_2$ and $Q_3$ are marked in the figure, while $Q_4$ is supported on the unit ball and marked by horizontal lines.

Note that the correct label is given by $\mathrm{sign}(x_1)$ on $Q_2$, $Q_3$ and $Q_4$; therefore $\bar{u} := (1, 0)$ is our ground-truth solution, which is wrong only on the noisy part $Q_1$.
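Theorem 2.1 below asserts that $Q_x$ has zero mean and satisfies $\mathbb{E}[x_1x_2] = \mathbb{E}[x_1^2 - x_2^2] = 0$. Given a sampler for a candidate construction, these moment conditions can be sanity-checked by Monte Carlo; the snippet below is a generic sketch of such a check (not from the paper), with a standard Gaussian used as a placeholder sampler.

```python
import numpy as np

def moment_check(sample_fn, n=500_000, seed=0):
    """Monte Carlo estimates of E[x], E[x1*x2], and E[x1^2 - x2^2] for a 2-d marginal.

    `sample_fn(n, rng)` should return an (n, 2) array of feature draws; for a
    construction such as Q_x all three estimates should be close to 0.
    """
    rng = np.random.default_rng(seed)
    x = sample_fn(n, rng)
    return {
        "E[x]": x.mean(axis=0),
        "E[x1*x2]": np.mean(x[:, 0] * x[:, 1]),
        "E[x1^2 - x2^2]": np.mean(x[:, 0] ** 2 - x[:, 1] ** 2),
    }

# Placeholder sampler (standard Gaussian); a sampler for Q_x would be plugged in instead.
print(moment_check(lambda n, rng: rng.standard_normal((n, 2))))
```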
Here is our lower bound result.

Theorem 2.1. Suppose $\mathrm{OPT} \le 1/100$, and let $Q_x$ denote the marginal distribution of $Q$ on the feature space. It holds that $\mathbb{E}_{x\sim Q_x}[x] = 0$, $\mathbb{E}_{x\sim Q_x}[x_1x_2] = 0$, and $\mathbb{E}_{x\sim Q_x}[x_1^2 - x_2^2] = 0$. Moreover, the population logistic risk $\mathcal{R}_{\log}$ has a global minimizer $w^*$, and
$$\mathcal{R}_{0\text{-}1}(w^*) = \Pr\big[y \ne \mathrm{sign}(\langle w^*, x\rangle)\big] = \Omega(\sqrt{\mathrm{OPT}}).$$

Note that we can further normalize $Q_x$ to unit variance and make it isotropic. Then it is easy to check that $Q_x$ satisfies the well-behaved-ness conditions in (Diakonikolas et al., 2020d), and the soft-margin and sub-exponential conditions in (Frei et al., 2021b). In particular, our lower bound matches the upper bound in (Frei et al., 2021b).

2.1. Proof of Theorem 2.1

Here is a proof sketch of Theorem 2.1; the full proof is given in Appendix B. First, a basic calculation shows that $Q_x$ is isotropic up to a constant multiplicative factor. Specifically, $Q_1$, $Q_2$ and $Q_4$ are constructed to make the risk lower bound proof work, while $Q_3$ is included to make $Q$ isotropic. It turns out that $Q_3$ does not change the risk lower bound proof too much: the reason is that we will prove Theorem 2.1 by contradiction, and show that if its conclusion does not hold, then $\nabla\mathcal{R}_{\log}(w^*) \ne 0$. We show this mainly using $Q_1$, $Q_2$ and $Q_4$, but $Q_3$ does not significantly change the argument either, since it is highly aligned with the ground-truth solution $\bar{u} := (1, 0)$. As a result, if we assume the conclusion of Theorem 2.1 does not hold, which means $Q_3$ is also aligned with $w^*$, then $\ell_{\log}(y\langle w^*, x\rangle)$ will be close to 0 on $Q_3$, and we can still obtain a nonzero $\nabla\mathcal{R}_{\log}(w^*)$ and derive a contradiction.

Next we consider the risk lower bound. We only need to show that $\phi(\bar{u}, w^*)$, the angle between $\bar{u}$ and $w^*$, is $\Omega(\sqrt{\mathrm{OPT}})$, since it then follows that $w^*$ is wrong on an $\Omega(\sqrt{\mathrm{OPT}})$ fraction of $Q_4$, which is enough since $Q_4$ accounts for more than half of the distribution $Q$. Note that the minimizer of the logistic risk on $Q_4$ by itself is infinitely far in the direction of $\bar{u}$. However, this would incur a large risk on $Q_1$. By balancing these two parts, we can show that by moving along the direction of $\bar{u}$ by a distance of $\Theta(1/\sqrt{\mathrm{OPT}})$, we can achieve a logistic risk of $O(\sqrt{\mathrm{OPT}})$.

Lemma 2.2. Suppose $\mathrm{OPT} \le 1/100$, and let $\bar{w} := (\bar{r}, 0)$ where $\bar{r} = 3/\sqrt{\mathrm{OPT}}$; then $\mathcal{R}_{\log}(\bar{w}) \le 5\sqrt{\mathrm{OPT}}$.

Next we consider the global minimizer $w^*$ of $\mathcal{R}_{\log}$, which exists since $\mathcal{R}_{\log}$ has bounded sub-level sets. Let $(r^*, \theta^*)$ denote the polar coordinates of $w^*$. We will assume $|\theta^*| \le \sqrt{\mathrm{OPT}}/30$, and derive a contradiction. In our construction, $Q_3$ and $Q_4$ are symmetric with respect to the horizontal axis, and they will induce the ground-truth solution. However, $Q_1$ and $Q_2$ are skew, and they will pull $w^*$ upward, meaning we actually have $\theta^* \in [0, \sqrt{\mathrm{OPT}}/30]$.

The first observation is an upper bound on $r^*$: if $r^*$ is too large, then the risk of $w^*$ over $Q_1$ will already be larger than $\mathcal{R}_{\log}(\bar{w})$ for the $\bar{w}$ constructed in Lemma 2.2, a contradiction.

Lemma 2.3. Suppose $\mathrm{OPT} \le 1/100$ and $\theta^* \in [0, \sqrt{\mathrm{OPT}}/30]$. Then $r^* \le 10/\sqrt{\mathrm{OPT}}$.

However, our next lemma shows that under the above conditions, the gradient of $\mathcal{R}_{\log}$ at $w^*$ does not vanish, which contradicts the definition of $w^*$.

Lemma 2.4. Suppose $\mathrm{OPT} \le 1/100$. Then for any $w = (r, \theta)$ with $0 \le r \le 10/\sqrt{\mathrm{OPT}}$ and $0 \le \theta \le \sqrt{\mathrm{OPT}}/30$, it holds that $\nabla\mathcal{R}_{\log}(w) \ne 0$.

To prove Lemma 2.4, consider an arbitrary $w = (r, \theta)$ under the conditions of Lemma 2.4. For simplicity, let us first look at the case $\theta = 0$. In this case, note that $y\langle w, x\rangle \le r\sqrt{\mathrm{OPT}} \le 10$ on $Q_2$, therefore $|\ell_{\log}'(y\langle w, x\rangle)|$ is bounded away from 0 on $Q_2$.
We can then show that $Q_2$ contributes a component of magnitude at least $C_1\sqrt{\mathrm{OPT}}$, with a negative sign, to $\langle\nabla\mathcal{R}_{\log}(w), e_2\rangle$, where $e_2 = (0, 1)$ and $C_1$ is some universal constant. Moreover, $Q_1$ also induces a nonpositive component in the gradient along $e_2$, while $Q_3$ and $Q_4$ induce a zero component along $e_2$. As a result, $\langle\nabla\mathcal{R}_{\log}(w), e_2\rangle < 0$, and thus $\nabla\mathcal{R}_{\log}(w)$ is nonzero. Now if $0 \le \theta \le C_2\sqrt{\mathrm{OPT}}$ for some small enough constant $C_2$ ($1/30$ in our case), we can show that $Q_3$ and $Q_4$ cannot cancel the effect of $Q_2$, and it still holds that $\langle\nabla\mathcal{R}_{\log}(w), e_2\rangle < 0$.

3. An $\tilde{O}(\mathrm{OPT})$ upper bound for logistic loss with radial Lipschitzness

The $\Omega(\sqrt{\mathrm{OPT}})$ lower bound construction in Section 2 shows that further assumptions on the distribution are necessary in order to improve the upper bound on the zero-one risk of the logistic regression solution. In particular, we note that the distribution $Q$ constructed in Section 2 has a discontinuous density. In this section, we show that if we simply add a very mild Lipschitz continuity condition on the density, then we can achieve $\tilde{O}(\mathrm{OPT})$ zero-one risk using logistic regression.

First, we formally provide the standard assumptions from prior work. Because of the lower bound for $s$-heavy-tailed distributions from (Diakonikolas et al., 2020d) (cf. Section 1.1), to get an $\tilde{O}(\mathrm{OPT})$ zero-one risk, we need to assume $P_x$ has a light tail. Following (Frei et al., 2021b), we will either consider a bounded distribution, or assume $P_x$ is sub-exponential as defined below (cf. (Vershynin, 2018, Proposition 2.7.1 and Section 3.4.4)).

Definition 3.1. We say $P_x$ is $(\alpha_1, \alpha_2)$-sub-exponential for constants $\alpha_1, \alpha_2 > 0$ if, for any unit vector $v$ and any $t > 0$,
$$\Pr_{x\sim P_x}\big[|\langle v, x\rangle| \ge t\big] \le \alpha_1\exp(-t/\alpha_2).$$

We also need the next assumption, which is part of the well-behaved-ness conditions from (Diakonikolas et al., 2020d).

Assumption 3.2. There exist constants $U, R > 0$ and a function $\sigma: \mathbb{R}_+ \to \mathbb{R}_+$ such that if we project $P_x$ onto an arbitrary two-dimensional subspace $V$, the corresponding density $p_V$ satisfies $p_V(r, \theta) \ge 1/U$ for all $r \le R$, $p_V(r, \theta) \le \sigma(r)$ for all $r \ge 0$, $\int_0^\infty \sigma(r)\,dr \le U$, and $\int_0^\infty r\sigma(r)\,dr \le U$.

While Assumption 3.2 may look a bit technically involved, it basically consists of some mild concentration and anti-concentration conditions. In particular, for a broad class of distributions including isotropic log-concave distributions, the sub-exponential condition and Assumption 3.2 hold with $\alpha_1, \alpha_2, U, R$ all being universal constants.

Finally, as discussed earlier, the previous conditions are also satisfied by $Q$ from Section 2, and thus to get the improved $\tilde{O}(\mathrm{OPT})$ risk bound, we need the following radial Lipschitz continuity assumption.

Assumption 3.3. There exists a measurable function $\kappa: \mathbb{R}_+ \to \mathbb{R}_+$ such that for any two-dimensional subspace $V$,
$$|p_V(r, \theta) - p_V(r, \theta')| \le \kappa(r)\,|\theta - \theta'|.$$

We will see that Assumption 3.3 is crucial for the upper bound analysis in Lemma 3.13. For some concrete examples, note that if $P_x$ is radially symmetric (e.g., standard Gaussian), then its projection onto any two-dimensional subspace $V$ is also radially symmetric, therefore we can let $\kappa(r) = 0$. On the other hand, if $p_V$ is $\lambda$-Lipschitz continuous on $\mathbb{R}^2$ under $\ell_2$ (e.g., a general Gaussian), then it implies $|p_V(r, \theta) - p_V(r, \theta')| \le \lambda r|\theta - \theta'|$, therefore we can let $\kappa(r) = \lambda r$.
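For completeness, here is the short calculation behind the second example (an elementary step worked out here, not quoted from the paper): for two points at the same radius $r$, $\ell_2$-Lipschitzness of $p_V$ gives
$$|p_V(r,\theta) - p_V(r,\theta')| \;\le\; \lambda\,\big\|(r\cos\theta,\, r\sin\theta) - (r\cos\theta',\, r\sin\theta')\big\| \;=\; 2\lambda r\,\Big|\sin\tfrac{\theta-\theta'}{2}\Big| \;\le\; \lambda r\,|\theta - \theta'|,$$
since the chord between the two points is bounded by the arc length $r|\theta - \theta'|$; hence $\kappa(r) = \lambda r$ indeed satisfies Assumption 3.3.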
Now we can state our main results. In the following, we denote the unit linear classifier with the optimal zero-one risk by $\bar{u}$, with $\mathcal{R}_{0\text{-}1}(\bar{u}) = \mathrm{OPT} \in (0, 1/e)$. Our first result shows that, with Assumption 3.3, minimizing the logistic risk yields a solution with $\tilde{O}(\mathrm{OPT})$ zero-one risk.

Theorem 3.4. Under Assumptions 3.2 and 3.3, let $w^*$ denote the global minimizer of $\mathcal{R}_{\log}$.

1. If $\|x\| \le B$ almost surely, then $\mathcal{R}_{0\text{-}1}(w^*) = O\big((1 + C_\kappa)\mathrm{OPT}\big)$, where $C_\kappa := \int_0^B \kappa(r)\,dr$.

2. If $P_x$ is $(\alpha_1, \alpha_2)$-sub-exponential, then $\mathcal{R}_{0\text{-}1}(w^*) = O\big((1 + C_\kappa)\mathrm{OPT}\ln(1/\mathrm{OPT})\big)$, where $C_\kappa := \int_0^{3\alpha_2\ln(1/\mathrm{OPT})} \kappa(r)\,dr$.

Remark 3.5. Given Theorem 3.4, we only need to estimate $C_\kappa$ to get a concrete bound. First, for radially symmetric distributions, since $\kappa(r) = 0$, we have $C_\kappa = 0$. On the other hand, if $p_V$ is $\lambda$-Lipschitz continuous on $\mathbb{R}^2$, then we can let $\kappa(r) = \lambda r$, and then by definition, we can show $C_\kappa \le \lambda B^2/2$ in the bounded case, and $C_\kappa \le 9\lambda\alpha_2^2\ln(1/\mathrm{OPT})^2/2$ in the sub-exponential case.

Theorem 3.4 shows that with radial Lipschitzness, the global minimizer can attain $\tilde{O}(\mathrm{OPT})$ zero-one risk; next we also give an algorithmic result. Given a target error $\epsilon \in (0, 1)$, we consider projected gradient descent on the empirical risk with a norm bound of $1/\sqrt{\epsilon}$: let $w_0 := 0$, and
$$w_{t+1} := \Pi_{B(1/\sqrt{\epsilon})}\Big[w_t - \eta\nabla\widehat{\mathcal{R}}_{\log}(w_t)\Big]. \tag{1}$$
Our next result shows that projected gradient descent can also give an $\tilde{O}(\mathrm{OPT})$ risk. Note that for the two cases discussed below (bounded or sub-exponential), we use the corresponding $C_\kappa$ defined in Theorem 3.4.

Theorem 3.6. Suppose Assumptions 3.2 and 3.3 hold.

1. If $\|x\| \le B$ almost surely, then with $\eta = 4/B^2$, using $O\big((B+1)^2\ln(1/\delta)/\epsilon^4\big)$ samples and $O\big(B^2/\epsilon^{5/2}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ satisfying $\mathcal{R}_{0\text{-}1}(w_t) = O\big((1 + C_\kappa)(\mathrm{OPT} + \epsilon)\big)$.

2. On the other hand, if $P_x$ is $(\alpha_1, \alpha_2)$-sub-exponential, then with $\eta = \tilde{\Theta}(1/d)$, using $\tilde{O}\big(d\ln(1/\delta)^3/\epsilon^4\big)$ samples and $\tilde{O}\big(d\ln(1/\delta)^2/\epsilon^{5/2}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ with $\mathcal{R}_{0\text{-}1}(w_t) = O\big((1 + C_\kappa)(\mathrm{OPT}\ln(1/\mathrm{OPT}) + \epsilon)\big)$.

Next we give proof outlines of our results; the full proofs are given in Appendix C. For simplicity, here we focus only on the bounded case, while the sub-exponential case will be handled in Appendix C. Although the proofs of the two cases share some similarity, we want to emphasize that the sub-exponential case does not follow by simply truncating the distribution to a certain radius and thus reducing to the bounded case. The reason is that the truncation radius can be as large as $\sqrt{d}$, while for the bounded case in our results, $B$ is considered a constant independent of $d$ and hidden in the $O(\cdot)$ notation; therefore this truncation argument would introduce a $\mathrm{poly}(d)$ dependency in the final bound. By contrast, our zero-one risk upper bounds for sub-exponential distributions only depend on $\alpha_1, \alpha_2, U$ and $R$, but do not depend on $d$.

3.1. Proof of Theorems 3.4 and 3.6

Theorems 3.4 and 3.6 rely on the following key lemma, which provides a zero-one risk bound on near-optimal solutions to the logistic regression problem. It basically says that a near-optimal solution with reasonably large norm also attains a good zero-one risk.

Lemma 3.7. Under Assumptions 3.2 and 3.3, suppose $\hat{w}$ satisfies $\mathcal{R}_{\log}(\hat{w}) \le \mathcal{R}_{\log}(\|\hat{w}\|\bar{u}) + \epsilon_\ell$ for some $\epsilon_\ell \in [0, 1)$. If $\|x\| \le B$ almost surely, then
$$\mathcal{R}_{0\text{-}1}(\hat{w}) = O\left(\max\left\{\mathrm{OPT},\ \sqrt{\frac{\epsilon_\ell}{\|\hat{w}\|}},\ \frac{C_\kappa}{\|\hat{w}\|^2}\right\}\right).$$

In this subsection, we will sketch the proofs of Theorems 3.4 and 3.6 using Lemma 3.7; the details are given in Appendix C.1. In the next subsection, we will prove Lemma 3.7.

We first prove Theorem 3.4. Note that by Lemma 3.7, it suffices to show that $\|w^*\| = \Omega(1/\sqrt{\mathrm{OPT}})$ (since $\epsilon_\ell = 0$ in this case), which is true due to the next result.

Lemma 3.8. Under Assumption 3.2, if $\|x\| \le B$ almost surely and $\mathrm{OPT} < \frac{R^4}{200U^3B}$, then $\|w^*\| = \Omega(1/\sqrt{\mathrm{OPT}})$.
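As a concrete illustration of the procedure analyzed in Theorem 3.6, here is a minimal numpy sketch of the projected gradient descent update (1) on the empirical logistic risk; the data interface, step size, and iteration budget are placeholders rather than the constants from the theorem.

```python
import numpy as np

def projected_gd_logistic(X, y, eps, eta, T):
    """Projected gradient descent (1) on the empirical logistic risk over the ball B(1/sqrt(eps)).

    X: (n, d) features, y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    radius = 1.0 / np.sqrt(eps)
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)
        # grad of (1/n) sum log(1 + exp(-m_i)) is -(1/n) sum sigmoid(-m_i) y_i x_i;
        # sigmoid(-m) = (1 - tanh(m/2)) / 2 is used for numerical stability.
        coef = y * 0.5 * (1.0 - np.tanh(margins / 2.0))
        grad = -(X.T @ coef) / n
        w = w - eta * grad
        norm = np.linalg.norm(w)
        if norm > radius:  # Euclidean projection back onto B(1/sqrt(eps))
            w = w * (radius / norm)
    return w
```

The projection step simply rescales any iterate that leaves the ball $B(1/\sqrt{\epsilon})$, which is exactly the operator $\Pi_{B(1/\sqrt{\epsilon})}$ in Equation (1).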
Next we prove Theorem 3.6. Once again motivated by Lemma 3.7, we need to show that projected gradient descent can achieve a near-optimal logistic risk and a large norm. Recall that given the target (zero-one) error $\epsilon \in (0, 1)$, we run projected gradient descent on a Euclidean ball with radius $1/\sqrt{\epsilon}$ (cf. Equation (1)). Using standard optimization and generalization analyses, we can prove the following guarantee on $\mathcal{R}_{\log}(w_t)$.

Lemma 3.9. Let the target optimization error $\epsilon_\ell \in (0, 1)$ and the failure probability $\delta \in (0, 1/e)$ be given. If $\|x\| \le B$ almost surely, then with $\eta = 4/B^2$, using $O\big(\frac{(B+1)^2\ln(1/\delta)}{\epsilon\epsilon_\ell^2}\big)$ samples and $O\big(\frac{B^2}{\epsilon\epsilon_\ell}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ satisfying
$$\mathcal{R}_{\log}(w_t) \le \mathcal{R}_{\log}(\|w_t\|\bar{u}) + \epsilon_\ell. \tag{2}$$

We also need the following lower bound on $\|w_t\|$.

Lemma 3.10. Under Assumption 3.2, suppose $\epsilon_\ell \le \sqrt{\epsilon}$ and that Equation (2) holds. If $\|x\| \le B$ almost surely and $\mathrm{OPT} < \frac{R^4}{500U^3B}$, then $\|w_t\| = \Omega\big(\min\big\{\frac{1}{\sqrt{\epsilon}}, \frac{1}{\sqrt{\mathrm{OPT}}}\big\}\big)$.

Now to prove Theorem 3.6, we simply need to combine Lemmas 3.7, 3.9 and 3.10 with $\epsilon_\ell = \epsilon^{3/2}$.

3.2. Proof of Lemma 3.7

Here we give a proof sketch of Lemma 3.7; the details are given in Appendix C.2. As mentioned before, here we focus on the bounded setting; in the appendix, we also prove a version of Lemma 3.7 for the sub-exponential setting (cf. Lemma C.4). One remark is that some of the lemmas in the proof are also true for the hinge loss, and this fact will be crucial in the later discussion regarding our two-phase algorithm (cf. Section 4).

Let $\bar{w} := \|\hat{w}\|\bar{u}$, and consider $\ell \in \{\ell_{\log}, \ell_h\}$. The first step is to express $\mathcal{R}_\ell(\hat{w}) - \mathcal{R}_\ell(\bar{w})$ as the sum of three terms, and then bound them separately. The first term is given by
$$\mathcal{R}_\ell(\hat{w}) - \mathcal{R}_\ell(\bar{w}) - \mathbb{E}\Big[\ell\big(\mathrm{sign}(\langle\bar{w}, x\rangle)\langle\hat{w}, x\rangle\big) - \ell\big(\mathrm{sign}(\langle\bar{w}, x\rangle)\langle\bar{w}, x\rangle\big)\Big], \tag{3}$$
the second term is given by
$$\mathbb{E}\Big[\ell\big(\mathrm{sign}(\langle\bar{w}, x\rangle)\langle\hat{w}, x\rangle\big) - \ell\big(\mathrm{sign}(\langle\hat{w}, x\rangle)\langle\hat{w}, x\rangle\big)\Big], \tag{4}$$
and the third term is given by
$$\mathbb{E}\Big[\ell\big(\mathrm{sign}(\langle\hat{w}, x\rangle)\langle\hat{w}, x\rangle\big) - \ell\big(\mathrm{sign}(\langle\bar{w}, x\rangle)\langle\bar{w}, x\rangle\big)\Big], \tag{5}$$
where the expectations are taken over $P_x$.

We first bound term (3), which is the approximation error of replacing the true label $y$ with the label given by $\bar{u}$. Since $\ell(-z) - \ell(z) = z$ for the logistic loss and the hinge loss, we have the following equality:
$$\text{term (3)} = \mathbb{E}\Big[\mathbf{1}_{y \ne \mathrm{sign}(\langle\bar{w}, x\rangle)}\, y\langle\bar{w} - \hat{w}, x\rangle\Big].$$
The approximation error can be bounded as below, using the tail bound on $P_x$ and the fact $\mathcal{R}_{0\text{-}1}(\bar{w}) = \mathrm{OPT}$.

Lemma 3.11. For $\ell \in \{\ell_{\log}, \ell_h\}$, if $\|x\| \le B$ almost surely, then $|\text{term (3)}| \le B\|\bar{w} - \hat{w}\|\,\mathrm{OPT}$.

Next we bound term (4). Note that we only need to consider the case where $\langle\bar{w}, x\rangle$ and $\langle\hat{w}, x\rangle$ have different signs; in this case, we can use the property $\ell(-z) - \ell(z) = z$ again and show the next result.

Lemma 3.12. Under Assumption 3.2, for $\ell \in \{\ell_{\log}, \ell_h\}$, $\text{term (4)} \ge \frac{4R^3}{3U\pi^2}\|\hat{w}\|\,\phi(\hat{w}, \bar{w})^2$.

Lastly, we consider term (5). Note that it is 0 for the hinge loss $\ell_h$, because $\ell_h(z) = 0$ when $z \ge 0$. For the logistic loss, term (5) is also 0 if $P_x$ is radially symmetric; in general, we will bound it using Assumption 3.3.

Lemma 3.13. For $\ell = \ell_h$, term (5) is 0. For $\ell = \ell_{\log}$, under Assumption 3.3, if $\|x\| \le B$ almost surely, then $|\text{term (5)}| \le 12C_\kappa\,\phi(\hat{w}, \bar{w})/\|\hat{w}\|$, where $C_\kappa := \int_0^B \kappa(r)\,dr$.

Now we are ready to prove Lemma 3.7. For simplicity, here we let $\phi$ denote $\phi(\hat{w}, \bar{w})$. For bounded distributions, Lemmas 3.11 to 3.13 imply
$$C_1\|\hat{w}\|\phi^2 \le \epsilon_\ell + B\|\bar{w} - \hat{w}\|\,\mathrm{OPT} + C_2C_\kappa\,\phi/\|\hat{w}\| \le \epsilon_\ell + B\|\hat{w}\|\phi\,\mathrm{OPT} + C_2C_\kappa\,\phi/\|\hat{w}\|,$$
where $C_1 = 4R^3/(3U\pi^2)$ and $C_2 = 12$. It follows that at least one of the following three cases is true:

1. $C_1\|\hat{w}\|\phi^2 \le 3\epsilon_\ell$, which implies $\phi = O\big(\sqrt{\epsilon_\ell/\|\hat{w}\|}\big)$;

2. $C_1\|\hat{w}\|\phi^2 \le 3B\|\hat{w}\|\phi\,\mathrm{OPT}$, and it follows that $\phi = O(\mathrm{OPT})$;

3. $C_1\|\hat{w}\|\phi^2 \le 3C_2C_\kappa\,\phi/\|\hat{w}\|$, and it follows that $\phi = O\big(C_\kappa/\|\hat{w}\|^2\big)$.
Therefore we can show a bound on the angle between $\bar{w}$ and $\hat{w}$, which further implies a zero-one risk bound for $\hat{w}$, in light of (Diakonikolas et al., 2020d, Claim 3.4), which is stated below.

Lemma 3.14. Under Assumption 3.2,
$$\mathcal{R}_{0\text{-}1}(\hat{w}) - \mathcal{R}_{0\text{-}1}(\bar{w}) \le \Pr\big[\mathrm{sign}(\langle\hat{w}, x\rangle) \ne \mathrm{sign}(\langle\bar{w}, x\rangle)\big] \le 2U\phi(\hat{w}, \bar{w}).$$

3.3. Recovering the general $\tilde{O}(\sqrt{\mathrm{OPT}})$ upper bound

Frei et al. (2021b) showed an $\tilde{O}(\sqrt{\mathrm{OPT}})$ upper bound under the soft-margin and sub-exponential conditions. Here we give an alternative proof of this result using our proof technique. The result in this section will later serve as a guarantee for the first phase of our two-phase algorithm (cf. Section 4) that achieves $\tilde{O}(\mathrm{OPT})$ risk.

Recall that the only place we need Assumption 3.3 is in the proof of Lemma 3.13. However, even without Assumption 3.3, we can still prove the following general bound, which only needs Assumption 3.2.

Lemma 3.15. Under Assumption 3.2, for $\ell = \ell_{\log}$, $|\text{term (5)}| \le 12U/\|\hat{w}\|$.

Now with Lemma 3.15, we can prove a weaker but more general version of Lemma 3.7 (cf. Theorem C.15). Further invoking Lemmas 3.9 and 3.10 (cf. Lemmas C.6 and C.7 for the corresponding sub-exponential results), and letting $\epsilon_\ell = \sqrt{\epsilon}$, we can show the next result. We present the bound in terms of the angle instead of the zero-one risk for later applications in Section 4.

Lemma 3.16. Given the target error $\epsilon \in (0, 1)$ and the failure probability $\delta \in (0, 1/e)$, consider projected gradient descent (1). If $\|x\| \le B$ almost surely, then with $\eta = 4/B^2$, using $O\big(\frac{(B+1)^2\ln(1/\delta)}{\epsilon^2}\big)$ samples and $O\big(\frac{B^2}{\epsilon^{3/2}}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ with $\phi(w_t, \bar{u}) = O\big(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon}\big)$. On the other hand, if $P_x$ is $(\alpha_1, \alpha_2)$-sub-exponential, then with $\eta = \tilde{\Theta}(1/d)$, using $\tilde{O}\big(\frac{d\ln(1/\delta)^3}{\epsilon^2}\big)$ samples and $\tilde{O}\big(\frac{d\ln(1/\delta)^2}{\epsilon^{3/2}}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ with $\phi(w_t, \bar{u}) = O\big(\sqrt{\mathrm{OPT}\ln(1/\mathrm{OPT})} + \sqrt{\epsilon}\big)$.

The proofs of the results above are given in Appendix C.3.

4. An $\tilde{O}(\mathrm{OPT})$ upper bound with hinge loss

We now show how to avoid Assumption 3.3 and achieve an $\tilde{O}(\mathrm{OPT})$ zero-one risk bound using an extra step of hinge loss minimization. The key observation here is that the only place where Assumption 3.3 is used is in Lemma 3.13, for bounding term (5) for the logistic loss. However, as noted in Lemma 3.13, for the hinge loss, term (5) is conveniently 0. So a version of Lemma 3.7 holds for the hinge loss, without using Assumption 3.3, and dropping the third term $C_\kappa/\|\hat{w}\|^2$ in the max. Thus, to get an $\tilde{O}(\mathrm{OPT})$ upper bound, it is enough to minimize the hinge loss to find a solution $\hat{w}$ such that $\|\hat{w}\| = \Omega(1)$ and $\mathcal{R}_h(\hat{w}) \le \mathcal{R}_h(\|\hat{w}\|\bar{u}) + \epsilon_\ell$ for some $\epsilon_\ell = \tilde{O}\big((\mathrm{OPT} + \epsilon)^2\big)$.

However, there is still one remaining challenge: note that the global minimizer of $\mathcal{R}_h$ is given by 0, while if we add the explicit requirement $\|\hat{w}\| = \Omega(1)$, the problem becomes nonconvex. Fortunately, we can bypass having to solve this nonconvex problem by leveraging the solution of the logistic regression problem, which is guaranteed to make an angle of at most $\tilde{O}(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon})$ with $\bar{u}$, even without Assumption 3.3, by Lemma 3.16. This solution, represented by a unit vector $v$, gives us a warm start for hinge loss minimization. Specifically, suppose we optimize the hinge loss over the halfspace
$$D := \big\{w \in \mathbb{R}^d : \langle w, v\rangle \ge 1\big\}; \tag{6}$$
then any solution we find must have norm at least 1. Furthermore, using the fact that $\phi(v, \bar{u}) = \tilde{O}(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon})$ and the positive homogeneity of the hinge loss, we can also conclude that the optimizer of the hinge loss satisfies $\mathcal{R}_h(\hat{w}) \le \mathcal{R}_h(\|\hat{w}\|\bar{u}) + \epsilon_\ell$, giving us the desired solution.
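Before the formal description and guarantees below, here is a minimal numpy sketch of this two-phase approach (logistic-regression warm start, then perceptron-style projected SGD over $D$); the sampler, step sizes, and iteration counts are placeholders rather than the settings from Theorem 4.1, and the phase-2 step mirrors the projected SGD update (7) given below.

```python
import numpy as np

def two_phase_halfspace(sample, eps, eta1=0.1, T1=500, eta2=0.01, T2=100_000, seed=0):
    """Two-phase sketch: logistic regression warm start, then perceptron-style
    projected SGD on ell_h(z) = max(-z, 0) over D = {w : <w, v> >= 1}.

    `sample(n, rng)` returns (X, y) with y in {-1, +1}; all hyperparameters are placeholders.
    """
    rng = np.random.default_rng(seed)

    # Phase 1: projected GD on the empirical logistic risk, norm bound 1/sqrt(eps).
    X, y = sample(50_000, rng)
    n, d = X.shape
    radius = 1.0 / np.sqrt(eps)
    w = np.zeros(d)
    for _ in range(T1):
        m = y * (X @ w)
        w -= eta1 * (-(X.T @ (y * 0.5 * (1.0 - np.tanh(m / 2.0)))) / n)
        nrm = np.linalg.norm(w)
        if nrm > radius:
            w *= radius / nrm
    v = w / np.linalg.norm(w)  # warm-start unit direction (assumed nonzero for this sketch)

    # Phase 2: one-sample projected SGD over D, started at w0 = v (on the boundary of D).
    w = v.copy()
    for _ in range(T2):
        Xt, yt = sample(1, rng)
        xt, yt = Xt[0], yt[0]
        if yt * (xt @ w) <= 0:      # subgradient of max(-z, 0) is -1 for z <= 0 (convention at 0)
            w += eta2 * yt * xt     # perceptron-style correction
        gap = 1.0 - (w @ v)
        if gap > 0:                 # Euclidean projection onto the halfspace {<w, v> >= 1}
            w += gap * v
    return w                        # predict with sign(<w, x>)
```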
While the above analysis does yield a simple two-phase polynomial time algorithm for getting an e O(OPT) zeroone risk bound, closer analysis reveals a sample complexity requirement of e O(1/ϵ4). We can improve the sample complexity requirement to e O(1/ϵ2) by doing a custom analysis of SGD on the hinge loss (i.e., perceptron, (Novikoff, 1963)) inspired by the above considerations. Thus we get the following two-phase algorithm1: 1. Run projected gradient descent under the settings of Lemma 3.16, and find a unit vector v such that ϕ(v, u) is O OPT + ϵ for bounded distributions, or O p OPT ln(1/OPT) + ϵ for sub-exponential distributions. 2. Run projected SGD over the domain D defined in Equation (6) starting from w0 := v: at step t, we sample (xt, yt) P, and let wt+1 := ΠD h wt ηℓ h yt wt, xt ytxt i . (7) Here, we set the convention that ℓ h(0) = 1. Below, we present the results regarding the expected zeroone risk for simplicity; we note that the results can be turned into high-probability bounds using the repeated probability amplification technique. Theorem 4.1. Given the target error ϵ (0, 1/e), suppose Assumption 3.2 holds. 1. First, for bounded distributions, with η = Θ(ϵ), for all T = Ω(1/ϵ2), E min 0 t 0 be given, then 2 ρ(1 e rρ) Z 2π 0 ℓlog rρ cos(θ) r dθ 8 Proof. First note that by symmetry, Z 2π 0 ℓlog rρ cos(θ) r dθ = 4 Z π 2 0 ℓlog rρ cos(θ) r dθ. On the upper bound, note that ℓlog rρ cos(θ) is increasing as θ goes from 0 to π 2 , and moreover sin(θ) 2 2 for θ π 2 , therefore 0 ℓlog rρ cos(θ) r dθ 8 Z π 2 π 4 ℓlog rρ cos(θ) r dθ 8 π 4 ℓlog rρ cos(θ) rρ sin(θ) dθ. Also because ℓlog(z) exp( z), 0 ℓlog rρ cos(θ) r dθ 8 π 4 exp rρ cos(θ) rρ sin(θ) dθ On the lower bound, note that ℓlog(z) 1 2 exp( z) for z 0, therefore 0 ℓlog rρ cos(θ) r dθ = 4 Z π 2 0 ℓlog rρ cos(θ) r dθ 2 Z π 2 0 exp rρ cos(θ) r dθ 0 exp rρ cos(θ) rρ sin(θ) dθ Lemma A.2. Given w, w Rd, suppose Pr(x,y) P y = sign w, x = OPT. If x B almost surely, then 1y =sign( w,x ) w , x B w OPT. If Px is (α1, α2)-sub-exponential, and OPT 1 1y =sign( w,x ) w , x (1 + 2α1)α2 w OPT ln 1 OPT Proof. If x B almost surely, then 1y =sign( w,x ) w , x B w E(x,y) P 1y =sign( w,x ) Agnostic Learnability of Halfspaces via Logistic Loss Below we assume Px is (α1, α2)-sub-exponential. Let νx := w , x ; we first give some tail bounds for νx. Since Px is (α1, α2)-sub-exponential, for any t > 0, we have , equivalently Pr |νx| t α1 exp t α2 w Let µ(t) := Pr |νx| t . Given any threshold τ > 0, integration by parts gives E h 1|νx| τ|νx| i = Z τ t dµ(t) = τµ(τ) + Z τ µ(t) dt α1 α2 w + τ exp τ α2 w Now let τ := α2 w ln 1 OPT . Note that 1y =sign( w,x ) w , x = E(x,y) P 1|νx| τ1y =sign( w,x )|νx| + E(x,y) P 1|νx| τ1y =sign( w,x )|νx| . We bound the two parts separately. When |νx| τ, we have E 1|νx| τ1y =sign( w,x )|νx| τE 1y =sign( w,x ) = τ OPT = α2 w OPT ln 1 OPT On the other hand, when |νx| τ, Equation (12) gives 1|νx| τ1y =sign( w,x )|νx| E h 1|νx| τ|νx| i 1 + ln 1 OPT 2α1α2 w OPT ln 1 OPT where we also use OPT 1 e. To sum up, 1y =sign( w,x ) w , x (1 + 2α1)α2 w OPT ln 1 OPT B. Omitted proofs from Section 2 In this section, we will prove Theorem 2.1. First, we bound the density and support of Qx. Lemma B.1. If OPT 1 100, then it holds that q3 1 15, and 1 2π q4 1 π. As a result, Qx is supported on B(2) := x x 2 with its density bounded by 2. Proof. For q3, we have OPT(1 OPT) 2 For Q4, its total measure can be bounded as below: OPT q3 1 1 100 2 therefore q4 1 2π. The upper bound q4 1 π is trivial. 
Agnostic Learnability of Halfspaces via Logistic Loss On the support of Qx, note that for Q1, the largest ℓ2 norm is given by For Q2, the largest ℓ2 norm can be bounded by For Q3, the largest ℓ2 norm can be bounded by Finally, it is easy to verify that if OPT 1 100, then Q1, Q2 and Q3 do not overlap, therefore the density of Q is bounded by 1 + 1 Next we verify that Qx is isotropic up to a multiplicative factor. We first note the following fact; its proof is straightforward and omitted. Lemma B.2. It holds that Z a+ δ 2 xy dy dx = abδ2, and Z a+ δ 2 (x2 y2) dy dx = (a2 b2)δ2. Then we can prove the following result. Lemma B.3. It holds that Ex Qx[x] = 0, and Ex Qx[x1x2] = 0, and Ex Qx x2 1 x2 2 = 0. Proof. It follows from the symmetry of Q that Ex Qx[x] = 0. To verify Ex Qx[x1x2] = 0, note that the expectation of x1x2 is 0 on Q3 and Q4, and thus we only need to check Q1 and Q2. First, due to Lemma B.2, we have E(x,y) Q1 [x1x2] = OPT Additionally, E(x,y) Q2 [x1x2] = 2 Z 0 x1x2 dx2 dx1 = OPT Therefore Ex Qx[x1x2] = 0. Finally, note that the expectation of x2 1 x2 2 is 0 on Q1 due to Lemma B.2, and also 0 on Q4 due to symmetry; therefore we only need to consider Q2 and Q3. We have E(x,y) Q2 h x2 1 x2 2 i = 2 Z x2 1 x2 2 dx2 dx1 = 2 Since E(x,y) Q3 x2 1 x2 2 = q3 by Lemma B.2, it follows that Ex Qx x2 1 x2 2 = 0. Next, we give a proof of the risk lower bound of Theorem 2.1. For simplicity, in this section we will let R denote Rlog. For i = 1, 2, 3, 4, we also let Ri(w) := E(x,y) Qi ℓlog y w, x ; therefore R(w) := P4 i=1 Ri(w). We first prove Lemma 2.2, showing that there exists a solution w with w = Θ 1 and R( w) = O Proof of Lemma 2.2. We consider R1, R2, R3 and R4 respectively. Agnostic Learnability of Halfspaces via Logistic Loss 1. For Q1, note that the minimum of y w, x is Because ℓlog(z) z + 1 when z 0, and OPT 1 100, we have R1( w) ℓlog 2. For Q2, we have R2( w) = 2 Z 0 ℓlog(x1 r) dx2 dx1 = 2 Z 0 ℓlog(x1 r) dx1 0 exp( x1 r) dx1 where we use ℓlog(z) exp( z). 3. For Q3, the minimum of y w, x is where we use q3 1 15 by Lemma B.1. Further note that ℓlog(z) 1/z when z > 0, we have R3( w) q3ℓlog 2 r/3 1 10 r. R4( w) = Z 1 0 ℓlog r r cos(θ) q4r dθ dr 1 0 ℓlog r r cos(θ) r dθ dr, where we use q4 1 π from Lemma B.1. Lemma A.1 then implies Putting everything together, we have R( w) = R1( w) + R2( w) + R3( w) + R4( w) r + 1 10 r + 8 Agnostic Learnability of Halfspaces via Logistic Loss Next we prove Lemma 2.3, the upper bound on w . Proof of Lemma 2.3. Let Let φ denote the angle between u and v, then and it follows that the angle between v and w is bounded by Moreover, note that the maximum of y w , x on Q1 is given by w , v r v cos π Additionally because ℓlog(z) > z, we have R(w ) R1(w ) ℓlog OPT, then R(w ) > 5 OPT, which contradicts the definition of w in light of Lemma 2.2. Therefore r 10 Next we prove Lemma 2.4. Proof of Lemma 2.4. Let w = (r, θ), where 0 r 10 OPT and 0 θ 30 . We will consider the projection of R(w) onto the direction e2 := (0, 1), and show that this projection cannot be zero. 1. For Q1, the gradient of this part has a negative inner product with e2, due to the construction of Q1 and the fact ℓ log < 0. 2. For Q2, the inner product between e2 and the gradient of this part is given by 0 ℓ log(x1w1 + x2w2)x2 dx2 dx1. 
(13) Note that x1w1 rx1, while x2w2 = x2r sin (θ) rθ 10 and that ℓ log is increasing, therefore ℓ log(x1w1 + x2w2) ℓ log Agnostic Learnability of Halfspaces via Logistic Loss We can then upper bound Equation (13) as follows: Equation (13) 2 Z Now we consider two cases. If r OPT 2, then it follows from the convexity of ℓlog that Equation (13) 1 OPT ℓ log(3) On the other hand, if r OPT 2, then Equation (13) 1 Therefore, it always holds that Equation (13) 3. For Q3, the gradient of this part can have a positive inner product with e2. For simplicity, let ρ := 1 2 . To upper bound this inner product, it is enough to consider the region given by [1 ρ, 1 + ρ] [ ρ, 0] [ 1 ρ, 1 + ρ] [0, ρ] . Moreover, note that y w, x 0 on Q3, therefore ℓ log y w, x 1 2. Therefore the inner product between e2 and the gradient of Q3 can be upper bounded by (note that x2 0 in the integral) 2x2 dx2 dx1 = ρ3 = q3 16 where we use q3 1 15 by Lemma B.1 and q3 2 OPT by its definition. 4. For Q4, we further consider two cases. (a) Consider the part of Q4 with polar angles in ( π 2 ). By symmetry, the gradient of this part is along the direction with polar angle π + θ, and it has a negative inner product with e2. (b) Consider the part of Q4 with polar angles in ( π 2 + 2θ) ( π 2 + 2θ). We can verify that the gradient of this part has a positive inner product with e2; moreover, since 1 < ℓ log < 0, this inner product can be upper bounded by 0 r cos(θ )q4r dθ dr = 2q4 1 3 sin(2θ) 4θ where we also use q4 1 π and sin(z) z for z 0. As a result, item 3 and item 4(b) cannot cancel item 2, and thus R(w) cannot be 0. Now we are ready to prove the risk lower bound of Theorem 2.1. Proof of Theorem 2.1 risk lower bound. It is clear that R has bounded sub-level sets, and therefore can be globally minimized. Let the polar coordinates of the global minimizer be given by (r , θ ), where |θ | π. Assume that Agnostic Learnability of Halfspaces via Logistic Loss 30 i ; due to Q1 and Q2, it actually follows that θ h 0, 30 i . Lemma 2.3 then implies r 10 and then Lemma 2.4 implies R(w ) = 0, a contradiction. It then follows that w is wrong on a θ π portion of Q4. Since the total measure of Q4 is more than half due to Lemma B.1, we have C. Omitted proofs from Section 3 In this section, we provide omitted proofs from Section 3. First, we prove some general results that will be used later. Lemma C.1. Under Assumption 3.2, for any w Rd, E ℓlog w, x 12U Proof. Let v denote an arbitrary vector orthogonal to w, and let p denote the density of the projection of Px onto the space spanned by w and v. Then we have E ℓlog w, x = Z 0 ℓlog r w cos(θ) p(r, θ)r dθ dr. Invoking Assumption 3.2, we have E ℓlog w, x Z 0 ℓlog r w cos(θ) r dθ Lemma A.1 then implies E ℓlog w, x Z Then it follows from Assumption 3.2 that E ℓlog w, x 8 Next, we note that following the direction of the ground-truth solution u can achieve e O OPT logistic risk. Lemma C.2. Given ρ > 0, under Assumption 3.2, if x B almost surely, then Rlog(ρ u) 12U ρ + ρB OPT, with inf ρ>0 Rlog(ρ u) while if Px is (α1, α2)-sub-exponential, then Rlog(ρ u) 12U ρ + (1 + 2α1)α2ρ OPT ln 1 OPT inf ρ>0 Rlog(ρ u) 50(1 + 2α1)α2U OPT ln 1 OPT Agnostic Learnability of Halfspaces via Logistic Loss Proof. Note that Rlog(ρ u) = E(x,y) P h ℓlog y ρ u, x i ℓlog ρ u, x + E(x,y) P ℓlog y ρ u, x ℓlog ρ u, x . Since ℓlog( z) ℓlog(z) = z, and also invoking Lemma C.1, we have Rlog(ρ u) = Ex Px ℓlog ρ u, x + E(x,y) P 1y =sign( u,x ) ( y) ρ u, x ρ + E(x,y) P 1y =sign( u,x ) ( y) ρ u, x . 
If x B almost surely, then Lemma A.2 further implies Rlog(ρ u) 12U ρ + ρB OPT, inf ρ>0 Rlog(ρ u) 2 If Px is (α1, α2)-sub-exponential, then Lemma A.2 further implies Rlog(ρ u) 12U ρ + (1 + 2α1)α2ρ OPT ln 1 OPT and therefore inf ρ>0 Rlog(ρ u) 2 12(1 + 2α1)α2U OPT ln 1 OPT 50(1 + 2α1)α2U OPT ln 1 OPT Next we prove a risk lower bound, that will later be used to prove lower bounds on w and wt . Lemma C.3. Under Assumption 3.2, given w Rd, if R w 2, then while if R w 2, then Rlog(w) R U w . Proof. First, since ℓlog(z) ℓlog |z| , Rlog(w) = E(x,y) P h ℓlog y w, x i Ex Px ℓlog w, x . (14) Let v denote an arbitrary vector that is orthogonal to w, and let p denote the density of the projection of Px onto the space spanned by w and v. Without loss of generality, we can assume w has polar angle 0. Then Equation (14) becomes 0 ℓlog r w cos(θ) p(r, θ)r dθ dr. Agnostic Learnability of Halfspaces via Logistic Loss Assumption 3.2 and Lemma A.1 then imply 0 ℓlog r w cos(θ) r dθ dr e R w 1 + R w . If R w 2, then because e z 1 + z z2 4 when 0 z 2, we have U 1 w 2 R2 w 2 Otherwise if R w 2, then because e z 1 + z z 2 when z 2, we have U 1 w 2 R w 2 = R U w . C.1. Omitted proofs from Section 3.1 In this section, we prove Theorems 3.4 and 3.6. First, we state the following general version of Lemma 3.7, which also handles sub-exponential distributions; it will be proved in Appendix C.2. Lemma C.4 (Lemma 3.7, including the sub-exponential case). Under Assumptions 3.2 and 3.3, suppose ˆw satisfies Rlog( ˆw) Rlog( ˆw u) + ϵℓfor some ϵℓ [0, 1). 1. If x B almost surely, then R0 1( ˆw) = O max OPT, q ϵℓ ˆ w , Cκ ˆ w 2 2. If Px is (α1, α2)-sub-exponential and ˆw = Ω(1), then R0 1( ˆw) = O max OPT ln(1/OPT), q ϵℓ ˆ w , Cκ ˆ w 2 Next, we prove the following norm lower bound on w , which covers Lemma 3.8 and also the sub-exponential case. Lemma C.5 (Lemma 3.8, including the sub-exponential case). Under Assumption 3.2, if x B almost surely and OPT < R4 200U 3B , then w = Ω 1 ; if Px is (α1, α2)-sub-exponential and OPT ln(1/OPT) < R4 200(1+2α1)α2U 3 , then w = Ω 1 OPT ln(1/OPT) Proof. Suppose x B almost surely. Since OPT < R4 200U 3B , Lemma C.2 implies Rlog(w ) inf ρ>0 Rlog(ρ u) 200U 3B = R2 Therefore it follows from Lemma C.3 that R w 2, and R U w Rlog(w ) inf ρ>0 Rlog(ρ u) Agnostic Learnability of Halfspaces via Logistic Loss which implies Now suppose Px is (α1, α2)-sub-exponential. Since OPT ln 1 OPT < R4 200(1+2α1)α2U 3 , Lemma C.2 implies inf ρ>0 Rlog(ρ u) 50(1 + 2α1)α2U OPT ln 1 OPT 50(1 + 2α1)α2U R4 200(1 + 2α1)α2U 3 = R2 Therefore it follows from Lemma C.3 that R w 2, and R U w Rlog(w ) inf ρ>0 Rlog(ρ u) 50(1 + 2α1)α2U OPT ln 1 OPT which implies 50(1 + 2α1)α2U OPT ln(1/OPT) . Now we can prove Theorem 3.4. Proof of Theorem 3.4. If x B almost surely, Lemma C.4 implies R0 1(w ) = O max OPT, Cκ w 2 If OPT R4 200U 3B , then Theorem 3.4 holds vacuously; otherwise Lemma C.5 ensures w = Ω 1 R0 1(w ) = O max {OPT, Cκ OPT} = O (1 + Cκ)OPT . The proof of the sub-exponential case is similar. Next, we analyze project gradient descent. First we restate Lemmas 3.9 and 3.10, and also handle sub-exponential distributions. Lemma C.6 (Lemma 3.9, including the sub-exponential case). Let the target optimization error ϵℓ (0, 1) and the failure probability δ (0, 1/e) be given. If x B almost surely, then with η = 4/B2, using O (B+1)2 ln(1/δ) and O B2 ϵϵℓ iterations, with probability 1 δ, projected gradient descent outputs wt satisfying Rlog(wt) min 0 ρ 1/ ϵ Rlog(ρ u) + ϵℓ. 
(15) If Px is (α1, α2)-sub-exponential, then with η = eΘ(1/d), using e O d ln(1/δ)3 samples and e O d ln(1/δ)2 iterations, with probability 1 δ, projected gradient descent outputs wt satisfying Equation (15). Lemma C.7 (Lemma 3.10, including the sub-exponential case). Under Assumption 3.2, suppose and that Equation (15) holds. If x B almost surely and OPT < R4 500U 3B , then wt = Ω min n 1 ϵ, 1 On the other hand, if Px is (α1, α2)-sub-exponential, and OPT ln(1/OPT) < R4 500U 3(1+2α1)α2 , then it holds that wt = Ω min n 1 ϵ, 1 OPT ln(1/OPT) Agnostic Learnability of Halfspaces via Logistic Loss Next we prove Lemmas C.6 and C.7. We first consider bounded distributions, and then handle sub-exponential distributions. For simplicity, in the rest of this subsection we will use R and b R to denote Rlog and b Rlog, respectively. Bounded distributions. First, here are some standard optimization and generalization results for projected gradient descent. Lemma C.8. If xi B for all 1 i n, then b R is B2 4 -smooth. Moreover, if w0 := 0 and η 4 B2 , then for all t 1, b R(wt) min w B(1/ ϵ) b R(w) + 1 2ηϵt. Proof. Note that ℓlog is 1 4-smooth. To show b R is B2 4 -smooth, note that given any w, w Rd, b R(w) b R(w ) = ℓ log yi w, xi ℓ log yi w , xi yixi ℓ log yi w, xi ℓ log yi w , xi B yi w, xi yi w , xi i=1 w w B = B2 The following analysis basically comes from the proof of (Bubeck, 2014, Theorem 6.3); we include it for completeness, and also handle the last iterate. Let w := arg minw B(1/ ϵ) b R(w). Convexity gives b R(wt) b R(w ) D b R(wt), wt w E = D b R(wt), wt wt+1 E + D b R(wt), wt+1 w E . Smoothness implies D b R(wt), wt wt+1 E b R(wt) b R(wt+1) + B2/4 2 wt wt+1 2 b R(wt) b R(wt+1) + 1 2η wt wt+1 2. On the other hand, the projection step ensures D b R(wt), wt+1 w E 1 η wt wt+1, wt+1 w wt w 2 wt+1 w 2 wt wt+1 2 . b R(wt) b R(w ) b R(wt) b R(wt+1) + 1 wt w 2 wt+1 w 2 , which implies b R(wt+1) b R(w ) 1 wt w 2 wt+1 w 2 . (16) Agnostic Learnability of Halfspaces via Logistic Loss Next we show that b R(wt+1) b R(wt). Smoothness implies b R(wt+1) b R(wt) D b R(wt), wt+1 wt E + B2/4 2 wt+1 wt 2 η wt wt+1 2 + B2/4 2 wt+1 wt 2 η wt wt+1 2 + 1 2η wt+1 wt 2 2η wt+1 wt 2, where we also use the property of the projection step on the second line. It now follow from Equation (16) and b R(wt+1) b R(wt) that for t 1, b R(wt) b R(w ) + w0 w 2 2ηt b R(w ) + 1 2ηϵt. Lemma C.9. If x B almost surely, then with probability 1 δ, for all w B 1 ϵ , R(w) b R(w) 2B ϵn + 3 B ϵ + 1 r Proof. Note that ℓlog(z) |z| + 1, therefore ℓlog y w, x w x + 1 B ϵ + 1. Since ℓlog is 1-Lipschitz continuous, (Shalev-Shwartz & Ben-David, 2014, Theorem 26.5, Lemma 26.9, Lemma 26.10) imply that with probability 1 δ, for all w B 1 ϵ , R(w) b R(w) 2B ϵn + 3 B ϵ + 1 r Next we can just apply the same technique and get a uniform deviation bound on b R(w) R(w). We can now prove Lemma C.6. Proof of Lemma C.6 for bounded distributions. Lemma C.8 implies that b R(wt) min 0 ρ 1/ ϵ R(ρ u) 1 2ηϵt = B2 Moreover, Lemma C.9 ensures with probability 1 δ, for all w B2 1 ϵ , b R(w) R(w) 2B ϵn + 3 B ϵ + 1 r Therefore, to ensure R(wt) min0 ρ 1/ ϵ R(ρ u) ϵℓ, we only need steps, and O (B + 1)2 ln(1/δ) Agnostic Learnability of Halfspaces via Logistic Loss Next we prove the norm lower bound on wt . Proof of Lemma C.7. First, we consider the case x B almost surely. It follows from Lemma C.2 that ρ + ρB OPT. (17) 12U B OPT. We consider two cases below, ρ 1 ϵ or ρ 1 ϵ. First, we assume ρ 1 ϵ. 
Then by the conditions of Lemma C.7 and Equation (17), we have R(wt) R( ρ u) + ϵℓ 2 12UB OPT + ϵ It then follows from Lemma C.3 that R wt 2, and R U wt R(wt) R( ρ u) + ϵℓ 2 12UB OPT + ϵ. since ρ 1 ϵ, As a result, R U wt = O OPT , which implies wt = Ω 1 Next, assume ρ 1 ϵ, which implies that 12U ϵ, and B OPT 12Uϵ. Moreover, Equation (17) implies R 1 ϵ u 12U ϵ + 1 ϵB OPT 12U ϵ + 1 ϵ12Uϵ = 24U ϵ. Then because R(wt) R 1 ϵ u + ϵℓ 24U ϵ + ϵ < 24U it further follows from Lemma C.3 that R wt 2, and R U wt R 1 ϵ u + ϵℓ 24U ϵ + ϵ, therefore wt = Ω 1 ϵ . Now assume Px is (α1, α2)-sub-exponential. Lemma C.2 implies ρ + (1 + 2α1)α2ρ OPT ln 1 OPT 12U (1 + 2α1)α2 OPT ln(1/OPT), and similarly consider the two cases ρ 1 ϵ and ρ 1 ϵ, we can finish the proof. Agnostic Learnability of Halfspaces via Logistic Loss Now we are ready to prove Theorem 3.6. Proof of Theorem 3.6 for bounded distributions. First, note that if ϵ or OPT does not satisfy the conditions of Lemma C.7, then Theorem 3.6 holds vacuously. Under the conditions of Lemmas C.6 and C.7, let ϵℓ:= ϵ3/2, we have that projected gradient descent can find wt satisfying Rlog(wt) min 0 ρ 1/ ϵ Rlog(ρ u) + ϵ3/2 Rlog wt u + ϵ3/2, wt = Ω min 1 ϵ, 1 Now we just need to invoke Lemma C.4. If ϵ OPT, then wt = Ω 1 , and Lemma C.4 implies OPT, Cκ OPT = O (1 + Cκ)OPT . If ϵ OPT, then wt = Ω 1 ϵ , and similarly we can show = O max OPT, q ϵ3/2 ϵ, Cκϵ = O (1 + Cκ)(OPT + ϵ) . The sample and iteration complexity follow from Lemma C.6 and that ϵℓ= ϵ3/2. Sub-exponential distributions. Next we handle (α1, α2)-sub-exponential distributions. We will prove Lemma C.6 for sub-exponential distributions; the rest of the proof is similar to the bounded case and thus omitted. Let the target zero-one error ϵ, the target optimization error ϵℓ, and failure probability δ be given. Given r > 0, we overload the notation a little bit and let δ(r) := dα1 exp In particular, note that j=1 Pr |xj| r Let B > 1 be large enough such that 1 δ(B) 100(B+1)2 ln(4/δ)/(ϵϵ2 ℓ) 1 δ, and α1(α2 + B) exp B We have the following bound on B. Lemma C.10. To satisfy Equation (18), it is enough to let Agnostic Learnability of Halfspaces via Logistic Loss Proof. First, we let B α2 d ln(2dα1) to ensure δ(B) 1/2. Since for 0 z 1/2, we have e z 1 z e 2z, to satisfy the first condition of Equation (18), it is enough to ensure e δ(B) 200(B+1)2 ln(4/δ)/(ϵϵ2 ℓ) e δ, equivalently δ(B) δϵϵ2 ℓ 200(B + 1)2 ln(4/δ). Invoking the definition of δ(B), we only need 200(B + 1)2dα1 ln(4/δ) In other words, it is enough if B = Ω d ln d ϵϵℓδ . Similarly, to satisfy the second condition of Equation (18), we only need B α2 ln α1(α2 + B) and it is enough if B = Ω d ln d ϵϵℓδ . Now we define a truncated logistic loss ℓ log as following: ℓ log(z) := ℓlog(z) if z B ϵ. We also let R (w) and b R (w) denote the population and empirical risk with the truncated logistic loss. We have the next result. Lemma C.11. Suppose B > 1 is chosen according to Equation (18). Using a constant step size 4/B2, and 100(B + 1)2 ln(4/δ) ϵϵ2 ℓ samples, and B2 4ϵϵℓ steps, with probability 1 2δ, projected gradient descent can ensure R (wt) min 0 ρ 1/ ϵ R(ρ u) + ϵℓ. Proof. It follows from Equation (18) that with probability 1 δ, it holds that xi B for all training examples. 
Therefore Lemma C.8 implies that b R(wt) min 0 ρ 1/ ϵ b R(ρ u) + B2 Since xi B, and the domain is B(1/ ϵ), it follows that b R (wt) min 0 ρ 1/ ϵ b R (ρ u) + B2 Letting t = B2 4ϵϵℓ, we get b R (wt) min 0 ρ 1/ ϵ b R (ρ u) + ϵℓ Agnostic Learnability of Halfspaces via Logistic Loss Note that by the construction of the truncated logistic loss, it holds that ℓ log(z) B ϵ + 1. Then by invoking the standard Rademacher complexity results (Shalev-Shwartz & Ben-David, 2014, Theorem 26.5, Lemma 26.9, Lemma 26.10), and recall that we work under the event xi B for all training examples, we can show with probability 1 2δ that for all w B(1/ ϵ), R (w) b R (w) 2B ϵn + 3 B ϵ + 1 r n + 3(B + 1) ϵ Letting n = 100(B+1)2 ln(4/δ) ϵϵ2 ℓ , we have R (w) b R (w) ϵℓ It then follows from Equations (19) and (20) that with probability 1 2δ, R (wt) min 0 ρ 1/ ϵ R (ρ u) + ϵℓ min 0 ρ 1/ ϵ R(ρ u) + ϵℓ, where we use ℓ log ℓlog in the last inequality. Finally, we show that R (wt) is close to R(wt). Lemma C.12. For all w B(1/ ϵ), it holds that R (w) R(w) ϵℓ. Proof. Note that if ℓlog y w, x = ℓ log y w, x , then y w, x B/ ϵ, which implies w, x B/ ϵ. Moreover, in this case ℓlog y w, x ℓ log y w, x ℓlog y w, x ℓlog(0) w, x . R(w) R (w) = Ex Px h ℓlog y w, x ℓ log y w, x i Ex Px w, x 1| w,x | B/ ϵ We can then invoke Equation (12) and get R(w) R (w) α1 exp B α2 w ϵ Note that the right hand side of Equation (21) is increasing with w , therefore we can let w be 1/ ϵ and get R(w) R (w) α1 α2 + B ϵ exp B where we use Equation (18) in the last inequality. Now putting everything together, under the conditions of Lemma C.11, with probability 1 2δ, projected gradient descent ensures R(wt) min0 ρ 1/ ϵ R(ρ u) + 2ϵℓ. Moreover, by applying Lemma C.10 to Lemma C.11, we can see the sample complexity is e O d ln(1/δ)3/(ϵϵ2 ℓ) , and the iteration complexity is e O d ln(1/δ)2/(ϵϵℓ) . Agnostic Learnability of Halfspaces via Logistic Loss C.2. Omitted proofs from Section 3.2 In this section, we prove Lemma C.4. We first prove the following approximation bound after we replace the true label with the label given by the ground-truth solution, which covers Lemma 3.11 and sub-exponential distributions. Lemma C.13 (Lemma 3.11, including the sub-exponential case). For ℓ {ℓlog, ℓh}, if x B almost surely, |term (3)| B w ˆw OPT. If Px is (α1, α2)-sub-exponential, then |term (3)| (1 + 2α1)α2 w ˆw OPT ln(1/OPT). Proof. Note that for both the logistic loss and the hinge loss, it holds that ℓ( z) ℓ(z) = z, therefore term (3) = E(x,y) P 1y =sign( w,x ) y w ˆw, x , (22) It then follows from the triangle inequality that |term (3)| E(x,y) P 1y =sign( w,x ) w ˆw, x Now we can invoke Lemma A.2 with w = w and w = w ˆw to prove Lemma C.13. Next we prove the lower bound on term (4). Proof of Lemma 3.12. Note that in term (4), we only care about ˆw, x and w, x , therefore we can focus on the twodimensional space spanned by w and ˆw. Let ϕ denote the angle between w and ˆw. Without loss of generality, we can consider the following graph, where we put w at angle 0, and ˆw at angle ϕ. We divide the graph into four parts given by different polar angles: (i) ( π 2 + ϕ), (ii) ( π 2 ), (iii) ( π 2 + ϕ), and (iv) ( π 2 ). Note that term (4) is 0 on parts (ii) and (iv), therefore we only need to consider parts (i) and (iii): term (4) = E(i) and (iii) ℓ sign w, x ˆw, x ℓ sign ˆw, x ˆw, x = E(i) and (iii) h sign w, x ˆw, x i . Here we use the fact that ℓ( z) ℓ(z) = z for both the logistic loss and the hinge loss. 
Agnostic Learnability of Halfspaces via Logistic Loss For simplicity, let p denote the density of the projection of Px onto the space spanned by ˆw and w. Under Assumption 3.2, we have term (4) = E(i) and (iii) h sign w, x ˆw, x i 2 r ˆw cos(ϕ θ)p(r, θ)r dθ dr + Z π 2 r ˆw cos(θ ϕ)p(r, θ)r dθ dr 0 r ˆw sin(θ)r dθ dr = 2R3 ˆw 1 cos(ϕ) 3U 4R3 ˆw ϕ2 where we use the fact that 1 cos(ϕ) 2ϕ2 π2 for all ϕ [0, π]. Next, we prove the following upper bound on term (5), covering Lemma 3.13 and the sub-exponential case. Lemma C.14 (Lemma 3.13, including the sub-exponential case). For ℓ= ℓh, term (5) is 0. For ℓ= ℓlog, under Assumption 3.3, if x B almost surely, then |term (5)| 12Cκ ϕ( ˆw, w) where Cκ := R B 0 κ(r) dr, while if Px is (α1, α2)-sub-exponential, then |term (5)| 2α1OPT2 + 12Cκ ϕ( ˆw, w) where Cκ := R 3α2 ln(1/OPT) 0 κ(r) dr. Proof. For the hinge loss, term (5) is 0 simply because ℓh(z) = 0 when z 0. Next we consider the logistic loss. Note that term (5) only depends on ˆw, x and w, x , therefore we can focus on the subspace spanned by ˆw and w. For simplicity, let p denote the density function of the projection of Px onto the space spanned by ˆw and w. Moreover, without loss of generality we can assume w has polar angle 0 while ˆw has polar angle ϕ, where we let ϕ denote ϕ( ˆw, w) for simplicity. It then follows that term (5) = Z 0 ℓlog r ˆw cos(θ ϕ) p(r, θ)r dθ dr Z 0 ℓlog r ˆw cos(θ) p(r, θ)r dθ dr 0 ℓlog r ˆw cos(θ) p(r, θ + ϕ) p(r, θ) r dθ dr. First, if x B almost surely, then |term (5)| Z B 0 ℓlog r ˆw cos(θ) p(r, θ + ϕ) p(r, θ) r dθ dr 0 ℓlog r ˆw cos(θ) κ(r)ϕ r dθ dr 0 ℓlog r ˆw cos(θ) r dθ Then Lemma A.1 implies |term (5)| ϕ Z B 2 ˆw dr = 8 2Cκ ϕ ˆw 12Cκ ϕ ˆw . Agnostic Learnability of Halfspaces via Logistic Loss Next, assume Px is (α1, α2)-sub-exponential. For a 2-dimensional random vector x sampled according to p V , note that Letting B := 2 2α2 ln 1 OPT , we get Pr x B 2α1OPT2. Since ℓlog(z) 1 when z 0, we have term (5) 2α1OPT2 0 ℓlog r ˆw cos(θ ϕ) p(r, θ)r dθ dr Z B 0 ℓlog r ˆw cos(θ) p(r, θ)r dθ dr. Invoking the previous bound for bounded distributions, we get term (5) 2α1OPT2 + 12 ϕ ˆw Z 2 2α2 ln( 1 OPT) 0 κ(r) dr 2α1OPT2 + 12Cκ ϕ ˆw , where Cκ := R 3α2 ln( 1 OPT) 0 κ(r) dr. Similarly, we can show term (5) 2α1OPT2 + 12Cκ ϕ ˆw . Next we prove Lemma 3.14, which is basically (Diakonikolas et al., 2020d, Claim 3.4). Proof of Lemma 3.14. Under Assumption 3.2, we have Pr sign ˆw, x = sign w, x 2ϕ( ˆw, w) Z 0 σ(r)r dr 2Uϕ( ˆw, w). Lastly, we prove Lemma C.4 for sub-exponential distributions. Proof of Lemma C.4, sub-exponential distributions. For simplicity, let ϕ denotes ϕ( ˆw, w). Lemmas 3.12, C.13 and C.14 imply C1 ˆw ϕ2 ϵℓ+ C2 w ˆw OPT ln 1 OPT + C3OPT2 + C4Cκ ϕ ˆw ϵℓ+ C2 ˆw ϕ OPT ln 1 OPT + C3OPT2 + C4Cκ ϕ ˆw , where C1 = 4R3 3Uπ2 , and C2 = (1 + 2α1)α2, and C3 = 2α1, and C4 = 12. It follows that at least one of the following four cases is true: 1. C1 ˆw ϕ2 4ϵℓ, which implies ϕ = O p 2. C1 ˆw ϕ2 4C2 ˆw ϕ OPT ln 1 OPT , which implies ϕ = O OPT ln 1 OPT . 3. C1 ˆw ϕ2 4C3OPT2, which implies ϕ = O(OPT) since ˆw = Ω(1). C1 ˆw ϕ2 4C2Cκ ϕ ˆw , which implies ϕ = O Cκ Finally, we just need to invoke Lemma 3.14 to finish the proof. Agnostic Learnability of Halfspaces via Logistic Loss C.3. Omitted proofs from Section 3.3 We first prove the upper bound of term (5) under Assumption 3.2, without assuming the radially Lipschitz condition. Proof of Lemma 3.15. Note that term (5) E ℓlog sign ˆw, x ˆw, x = E ℓlog ˆw, x 12U where we invoke Lemma C.1 at the end. 
Similarly, we can show term (5) 12U Next we prove a general result similar to Lemma C.4. Theorem C.15. Under Assumption 3.2, suppose ˆw satisfies Rlog( ˆw) Rlog( ˆw u) + ϵℓfor some ϵℓ [0, 1). If x B almost surely, then ϕ( ˆw, u) = O If Px is (α1, α2)-sub-exponential and ˆw = Ω(1), then ϕ( ˆw, u) = O OPT ln 1 OPT Proof. For simplicity, let ϕ denote ϕ( ˆw, u). Consider the case x B almost surely. The condition Rlog( ˆw) Rlog( ˆw u) + ϵℓ, and Lemmas 3.12, 3.15 and C.13 imply C1 ˆw ϕ2 ϵℓ+ B w ˆw OPT + C2 ϵℓ+ B ˆw ϕ OPT + C2 where C1 = 4R3/(3Uπ2) and C2 = 12U. Now at least one of the following three cases is true: 1. C1 ˆw ϕ2 3ϵℓ, which implies ϕ = O p 2. C1 ˆw ϕ2 3B ˆw ϕ OPT, which implies ϕ = O(OPT); 3. C1 ˆw ϕ2 3C1/ ˆw , which implies ϕ = O(1/ ˆw ). The proof of the sub-exponential case is similar. Now we prove Lemma 3.16. Proof of Lemma 3.16. First, if ϵ or OPT does not satisfy the conditions of Lemma C.7, then Lemma 3.16 holds vacuously; therefore in the following we consider the settings of Lemmas C.6 and C.7 with ϵℓ= ϵ. First, if x B almost surely, Equation (15) and Theorem C.15 imply ϕ(wt, u) = O Agnostic Learnability of Halfspaces via Logistic Loss and moreover Lemma C.7 implies wt = Ω min 1 ϵ, 1 If ϵ OPT, then wt = Ω 1 ϕ(wt, u) = O max OPT, q = O max OPT, q ϵ If ϵ OPT, then wt = Ω 1 ϵ , and ϕ(wt, u) = O max OPT, q = O max OPT, q ϵ ϵ, ϵ The proof for the sub-exponential case is similar. D. Omitted proofs from Section 4 In this section, we prove Theorem 4.1. We first prove a bound on Rh( u). Lemma D.1. If x B almost surely, then Rh( u) B OPT, while if Px is (α1, α2)-sub-exponential, then Rh( u) (1 + 2α1)α2 OPT ln(1/OPT). Proof. Note that Rh( u) = E(x,y) P h ℓh y u, x i = E(x,y) P 1sign( u,x =y) u, x . It then follows from Lemma A.2 that if x B almost surely, then Rh( u) B OPT, while if Px is (α1, α2)-sub-exponential, then Rh( u) (1 + 2α1)α2 OPT ln 1 OPT Next we prove the following result, which covers Lemma 4.2 but also handles sub-exponential distributions. Lemma D.2 (Lemma 4.2, including the sub-exponential case). Suppose Assumption 3.2 holds. Consider an arbitrary w D, and let ϕ denote ϕ(w, u). If x B almost surely, then Rh( r u) Rh( w u) + O (OPT + ϵ)2 Agnostic Learnability of Halfspaces via Logistic Loss Rh(w) Rh( w u) 4R3 3Uπ2 w ϕ2 B w ϕ OPT. If Px is (α1, α2)-sub-exponential, then Rh( r u) Rh( w u) + O OPT ln(1/OPT) + ϵ 2 Rh(w) Rh( w u) 4R3 (1 + 2α1)α2 w ϕ OPT ln(1/OPT). Proof. First assume x B almost surely. Note that ℓh is positive homogeneous, and thus for any positive constant c, we have Rh(cw) = c Rh(w). Therefore, if r w , then Rh( r u) = r w Rh( w u) Rh( w u). If r w , then Rh( r u) = Rh w u + Rh( u) r w Rh w u + Rh( u) ( r 1) , since w 1 for all w D. Recall that r := 1 v, u = 1 cos ϕ(v, u) 1 1 ϕ(v, u)2/2, and therefore the first-phase of algorithm ensures r = 1 + O(OPT + ϵ) for bounded distributions, and r = 1 + O OPT ln(1/OPT) + ϵ for sub-exponential distributions. It then follows that for bounded distributions, Rh( r u) Rh wt u + Rh( u) O(OPT + ϵ) Rh wt u + B OPT O(OPT + ϵ) = Rh wt u + O (OPT + ϵ)2 , where we apply Lemma D.1 at the end. It also follows directly from Lemmas 3.12, C.13 and C.14 that Rh(w) Rh( w u) 4R3 3Uπ2 w ϕ2 B w w u OPT 3Uπ2 w ϕ2 B w ϕ OPT. The proof for the sub-exponential case is similar. Next we prove Theorem 4.1. We first consider the bounded case. Proof of Theorem 4.1, bounded distribution. Here we assume x B almost surely. 
We will show that under the conditions of Theorem 4.1, then E min 0 t 0, it holds that E h x 21 x τ i dα1 τ 2 + 2 dα2τ + 2dα2 2 exp Proof. First recall that j=1 Pr |xj| τ Let µ(τ) := Pr x τ . Integration by parts gives E h x 21 x τ i = Z τ r2 ( dµ(r)) = τ 2µ(τ) + Z τ 2rµ(r) dr τ 2δ(τ) + Z τ 2rδ(r) dr. Calculation gives E h x 21 x τ i dα1 τ 2 + 2 dα2τ + 2dα2 2 exp Now we are ready to prove Theorem 4.1 for sub-exponential distributions. Proof of Theorem 4.1, sub-exponential distributions. At step t, we have wt+1 r u 2 wt r u 2 2η D ℓ h yt wt, xt ytxt, wt r u E + η2ℓ h yt wt, xt 2 xt 2 = wt r u 2 2η D ℓ h yt wt, xt ytxt, wt r u E η2ℓ h yt wt, xt xt 2, (29) where we use (ℓ h)2 = ℓ h. Next we bound E(xt,yt) ℓ h yt wt, xt xt 2 . Let τ := dα2 ln(d/ϵ). When xt τ, we have E h ℓ h yt wt, xt xt 21 xt τ i τ 2M(wt) dα2 2M(wt) ln(d/ϵ)2. On the other hand, when xt τ, Lemma D.3 implies E h ℓ h yt wt, xt xt 21 xt τ i E h xt 21 xt τ i dα1 O d ln(d/ϵ)2 ϵ d = O dϵ ln(d/ϵ)2 , where we also use ln(1/ϵ) > 1, since ϵ < 1/e. To sum up, E(xt,yt) ℓ h yt wt, xt xt 2 Cd M(wt) + ϵ ln(d/ϵ)2 for some constant C. Now taking the expectation with respect to (xt, yt) on both sides of Equation (29), we have E h wt+1 r u 2i wt r u 2 2η Rh(wt) Rh( r u) + η2Cd M(wt) + ϵ ln(d/ϵ)2. (30) Similarly to the bounded case, we will show that E min 0 t