# Agnostic Learnability of Halfspaces via Logistic Loss

Ziwei Ji¹, Kwangjun Ahn², Pranjal Awasthi³, Satyen Kale³, Stefani Karp³ ⁴

Abstract

We investigate approximation guarantees provided by logistic regression for the fundamental problem of agnostic learning of homogeneous halfspaces. Previously, for a certain broad class of well-behaved distributions on the examples, Diakonikolas et al. (2020d) proved an $\tilde{\Omega}(\mathrm{OPT})$ lower bound, while Frei et al. (2021b) proved an $\tilde{O}(\sqrt{\mathrm{OPT}})$ upper bound, where OPT denotes the best zero-one/misclassification risk of a homogeneous halfspace. In this paper, we close this gap by constructing a well-behaved distribution such that the global minimizer of the logistic risk over this distribution only achieves $\Omega(\sqrt{\mathrm{OPT}})$ misclassification risk, matching the upper bound in (Frei et al., 2021b). On the other hand, we also show that if we impose a radial-Lipschitzness condition in addition to well-behaved-ness on the distribution, logistic regression on a ball of bounded radius reaches $\tilde{O}(\mathrm{OPT})$ misclassification risk. Our techniques also show that for any well-behaved distribution, regardless of radial Lipschitzness, we can overcome the $\Omega(\sqrt{\mathrm{OPT}})$ lower bound for logistic loss simply at the cost of one additional convex optimization step involving the hinge loss, and attain $\tilde{O}(\mathrm{OPT})$ misclassification risk. This two-step convex optimization algorithm is simpler than previous methods obtaining this guarantee, all of which require solving $O(\log(1/\mathrm{OPT}))$ minimization problems.

Part of this work was done when Ziwei Ji and Kwangjun Ahn were interns at Google. ¹Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA. ²Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA. ³Google Research, New York City, NY, USA. ⁴School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Ziwei Ji. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

In this paper, we consider the fundamental problem of agnostically learning homogeneous halfspaces. Specifically, we assume there is an unknown distribution $P$ over $\mathbb{R}^d \times \{-1, +1\}$ to which we have access in the form of independent and identically distributed samples drawn from $P$. A sample from $P$ consists of an input feature vector $x \in \mathbb{R}^d$ and a binary label $y \in \{-1, +1\}$. Our goal is to compete with a homogeneous linear classifier $\bar{u}$ (i.e., one that predicts the label $\mathrm{sign}(\langle \bar{u}, x\rangle)$ for input $x$) that achieves the optimal zero-one risk of $\mathrm{OPT} > 0$ over $P$; formally, this means
$$\Pr_{(x,y)\sim P}\big[\mathrm{sign}(\langle \bar{u}, x\rangle) \ne y\big] = \mathrm{OPT}.$$
Alternatively, we can think of the labels of the examples as first generated by $\bar{u}$, after which an OPT fraction of the labels are adversarially corrupted. There have been many algorithmic and hardness results on this topic; see Section 1.1 for a discussion.

A very natural heuristic for solving the problem is to use logistic regression. However, the analysis of logistic regression for this problem is still largely incomplete, even though it is one of the most fundamental algorithms in machine learning. One reason for this is that it can return extremely poor solutions in the worst case: Ben-David et al. (2012) showed that the minimizer of the logistic risk may attain a zero-one risk as bad as $1 - \mathrm{OPT}$ on an adversarially-constructed distribution.
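To make the two risks concrete, the following minimal numpy sketch (not from the paper; the data, the corruption model, and the classifier are illustrative placeholders) computes the empirical zero-one risk and the empirical logistic risk of a homogeneous linear classifier. Random label flips stand in for the adversarial corruption that defines OPT.

```python
import numpy as np

def zero_one_risk(w, X, y):
    """Empirical misclassification rate of the homogeneous halfspace x -> sign(<w, x>)."""
    return np.mean(np.sign(X @ w) != y)

def logistic_risk(w, X, y):
    """Empirical logistic risk: (1/n) * sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

# Toy data: labels from a ground-truth halfspace u_bar, then a small fraction
# of labels flipped at random (a stand-in for the adversarial corruption OPT).
rng = np.random.default_rng(0)
d, n, flip_rate = 5, 20_000, 0.05
X = rng.standard_normal((n, d))
u_bar = np.zeros(d)
u_bar[0] = 1.0
y = np.sign(X @ u_bar)
flip = rng.random(n) < flip_rate
y[flip] = -y[flip]

print(zero_one_risk(u_bar, X, y))  # roughly the corruption level, i.e. ~OPT
print(logistic_risk(u_bar, X, y))  # the convex surrogate that logistic regression minimizes
```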
Because of this worst-case behavior, much attention has been devoted to certain well-behaved distributions, for which much better results can be obtained. However, even when the marginal distribution on the feature space, $P_x$, is assumed to be isotropic log-concave, in a recent work, Diakonikolas et al. (2020d) proved an $\tilde{\Omega}(\mathrm{OPT})$ lower bound on the zero-one risk for any convex surrogate, including the logistic loss. On the positive side, in another recent work, Frei et al. (2021b) proved that vanilla gradient descent on the logistic loss can attain a zero-one risk of $\tilde{O}(\sqrt{\mathrm{OPT}})$, as long as $P_x$ satisfies some well-behavedness conditions. (See Sections 1.1 and 3 for precise details.)

The above results still leave a big gap between the upper and the lower bounds, raising the question of identifying the fundamental limits of logistic regression for this problem. In this work, we study this question and develop the following set of results.

A matching $\Omega(\sqrt{\mathrm{OPT}})$ lower bound. In Section 2, we construct a distribution $Q$ over $\mathbb{R}^2 \times \{-1, +1\}$, and prove a lower bound for logistic regression that matches the upper bound in (Frei et al., 2021b), thereby closing the gap in recent works (Diakonikolas et al., 2020d; Frei et al., 2021b). Specifically, the marginal distribution $Q_x$ is isotropic and bounded, and satisfies all the well-behaved-ness conditions from the aforementioned papers, but the global minimizer of the logistic risk on $Q$ only attains $\Omega(\sqrt{\mathrm{OPT}})$ zero-one risk on $Q$.

An $\tilde{O}(\mathrm{OPT})$ upper bound for radially Lipschitz densities. The lower bound mentioned above shows that one needs to make additional assumptions to prove better bounds. In Section 3, we show that by making a radial Lipschitzness assumption in addition to well-behaved-ness, it is indeed possible to achieve the near-optimal $\tilde{O}(\mathrm{OPT})$ zero-one risk via logistic regression. In particular, our upper bound result holds if the projection of $P_x$ onto any two-dimensional subspace has a Lipschitz continuous density. Moreover, our upper bound analysis is versatile: it can recover the $\tilde{O}(\sqrt{\mathrm{OPT}})$ guarantee for general well-behaved distributions shown by Frei et al. (2021b), and it also works for the hinge loss, which motivates a simple and efficient two-phase algorithm, as we describe next.

An $\tilde{O}(\mathrm{OPT})$ upper bound for general well-behaved distributions with a two-phase algorithm. Motivated by our analysis, in Section 4, we describe a simple two-phase algorithm that achieves $\tilde{O}(\mathrm{OPT})$ risk for general well-behaved distributions, without assuming radial Lipschitzness. Thus, we show that the cost of avoiding the radial Lipschitzness condition is simply an additional convex loss minimization. Our two-phase algorithm consists of logistic regression followed by stochastic gradient descent with the hinge loss (i.e., the perceptron algorithm) over a restricted domain and with a warm start. For general well-behaved distributions, the first phase can only achieve an $\tilde{O}(\sqrt{\mathrm{OPT}})$ guarantee; however, we show that the second phase can boost the upper bound to $\tilde{O}(\mathrm{OPT})$.

Previously, for any given $\epsilon > 0$, Diakonikolas et al. (2020d) designed a nonconvex optimization algorithm that can achieve an $O(\mathrm{OPT} + \epsilon)$ risk using $\tilde{O}(d/\epsilon^4)$ samples. Their algorithm requires guessing OPT within a constant multiplicative factor via a binary search and running a nonconvex SGD using each guess as an input. Similarly, prior algorithms achieving an $O(\mathrm{OPT} + \epsilon)$ risk involve solving multiple rounds of convex loss minimization (Awasthi et al., 2014; Daniely, 2015).
In contrast, our two-phase algorithm is simply logistic regression followed by a perceptron algorithm, and the output is guaranteed to have an $O(\mathrm{OPT}\ln(1/\mathrm{OPT}) + \epsilon)$ zero-one risk using only $\tilde{O}(d/\epsilon^2)$ samples.

1.1. Related work

The problem of agnostic learning of halfspaces has a long and rich history (Kearns et al., 1994). Here we survey the results most relevant to our work. It is well known that in the distribution-independent setting, even weak agnostic learning is computationally hard (Feldman et al., 2006; Guruswami & Raghavendra, 2009; Daniely, 2016). As a result, most algorithmic results have been obtained under assumptions on the marginal distribution $P_x$ over the examples. The work of Kalai et al. (2008) designed algorithms that achieve $\mathrm{OPT} + \epsilon$ error for any $\epsilon > 0$ in time $d^{\mathrm{poly}(1/\epsilon)}$ for isotropic log-concave densities and for the uniform distribution over the hypercube. There is also recent evidence that removing the exponential dependence on $1/\epsilon$, even for Gaussian marginals, is computationally hard (Klivans & Kothari, 2014; Diakonikolas et al., 2020a; Goel et al., 2020).

As a result, another line of work aims to design algorithms with polynomial running time and sample complexity (in $d$ and $1/\epsilon$) that achieve an error of $g(\mathrm{OPT}) + \epsilon$, for $g$ a simple function. Along these lines, Klivans et al. (2009) designed a polynomial-time algorithm that attains $\tilde{O}(\mathrm{OPT}^{1/3}) + \epsilon$ zero-one risk for isotropic log-concave distributions. Awasthi et al. (2014) improved the upper bound to $O(\mathrm{OPT}) + \epsilon$, using a localization-based algorithm. Balcan & Zhang (2017) further extended the algorithm to more general $s$-concave distributions. The work of Daniely (2015) further provided a PTAS guarantee: an error of $(1 + \eta)\mathrm{OPT} + \epsilon$ for any desired constant $\eta > 0$ via an improper learner.

In a recent work, Diakonikolas et al. (2020d) studied the problem for distributions satisfying certain well-behavedness conditions, which include isotropy and certain regularity conditions on the projection of $P_x$ onto any 2-dimensional subspace (see Assumption 3.2 for a subset of these conditions). This class of distributions includes any isotropic log-concave distribution, such as the standard Gaussian. In addition to their nonconvex optimization method discussed above, for any convex, nonincreasing, and nonconstant loss function, they also showed an $\Omega(\mathrm{OPT}\ln(1/\mathrm{OPT}))$ lower bound for log-concave marginals and an $\Omega(\mathrm{OPT}^{1-1/s})$ lower bound for $s$-heavy-tailed marginals. In another recent work, Frei et al. (2021b) assumed $P_x$ satisfies a soft-margin condition: for anti-concentrated marginals such as isotropic log-concave marginals, this assumes $\Pr\big[|\langle\bar{u}, x\rangle| \le \gamma\big] = O(\gamma)$ for any $\gamma > 0$. For sub-exponential distributions with soft margins, they proved an $\tilde{O}(\sqrt{\mathrm{OPT}})$ upper bound for gradient descent on the logistic loss, which can be improved to $O(\sqrt{\mathrm{OPT}})$ for bounded distributions. Note that these upper bounds and the lower bounds in (Diakonikolas et al., 2020d) do not match: if $P_x$ is sub-exponential, then Diakonikolas et al. (2020d) only gave an $\tilde{\Omega}(\mathrm{OPT})$ lower bound, while if $P_x$ is $s$-heavy-tailed, then the upper bound in (Frei et al., 2021b) becomes worse.
Finally, some prior works on agnostic learning of halfspaces have considered various extensions of the problem, such as active agnostic learning (Awasthi et al., 2014; Yan & Zhang, 2017), agnostic learning of sparse halfspaces with sample complexity scaling logarithmically in the ambient dimensionality (Shen & Zhang, 2021), and agnostic learning under weaker noise models such as random classification noise (Blum et al., 1998; Dunagan & Vempala, 2008), Massart's noise model (Awasthi et al., 2015; 2016; Zhang et al., 2020; Diakonikolas et al., 2019; 2020b; 2021; Chen et al., 2020), and the Tsybakov noise model (Diakonikolas et al., 2020c; Zhang & Li, 2021). We do not consider these extensions in our work.

1.2. Notation

Let $\|\cdot\|$ denote the $\ell_2$ (Euclidean) norm. Given $r > 0$, let $B(r) := \{x : \|x\| \le r\}$ denote the Euclidean ball with radius $r$. Given two nonzero vectors $u$ and $v$, let $\phi(u, v) \in [0, \pi]$ denote the angle between them. Given a data distribution $P$ over $\mathbb{R}^d \times \{-1, +1\}$, let $P_x$ denote the marginal distribution of $P$ on the feature space $\mathbb{R}^d$. We will frequently need the projection of the input features onto a two-dimensional subspace $V$; in such cases, it will be convenient to use polar coordinates $(r, \theta)$ for the associated calculations, such as parameterizing the density with respect to the Lebesgue measure as $p_V(r, \theta)$.

Given a nonincreasing loss function $\ell: \mathbb{R} \to \mathbb{R}$, we consider the population risk
$$\mathcal{R}_\ell(w) := \mathbb{E}_{(x,y)\sim P}\big[\ell(y\langle w, x\rangle)\big],$$
and the corresponding empirical risk
$$\widehat{\mathcal{R}}_\ell(w) := \frac{1}{n}\sum_{i=1}^n \ell(y_i\langle w, x_i\rangle),$$
defined over $n$ i.i.d. samples drawn from $P$. We will focus on the logistic loss $\ell_{\log}(z) := \ln(1 + e^{-z})$ and the hinge loss $\ell_h(z) := \max\{-z, 0\}$. Let $\mathcal{R}_{\log} := \mathcal{R}_{\ell_{\log}}$ for simplicity, and also define $\widehat{\mathcal{R}}_{\log}$, $\mathcal{R}_h$ and $\widehat{\mathcal{R}}_h$ similarly. Let $\mathcal{R}_{0\text{-}1}(w) := \Pr_{(x,y)\sim P}\big[y \ne \mathrm{sign}(\langle w, x\rangle)\big]$ denote the population zero-one risk.

2. An $\Omega(\sqrt{\mathrm{OPT}})$ lower bound for logistic loss

In this section, we construct a distribution $Q$ over $\mathbb{R}^2 \times \{-1, +1\}$ which satisfies standard regularity conditions in (Diakonikolas et al., 2020d; Frei et al., 2021a), but the global minimizer $w^*$ of the population logistic risk $\mathcal{R}_{\log}$ on $Q$ only achieves a zero-one risk of $\Omega(\sqrt{\mathrm{OPT}})$. Our focus on the global logistic optimizer is motivated by the lower bounds from (Diakonikolas et al., 2020d); in particular, this means that the large classification error is not caused by the sampling error.

The distribution $Q$ has four parts $Q_1$, $Q_2$, $Q_3$, and $Q_4$, as described below. It can be verified that if $\mathrm{OPT} \le 1/16$, the construction is valid.

1. The feature distribution of $Q_1$ consists of two squares, each with edge length $\sqrt{2\,\mathrm{OPT}}/2$ (and hence probability mass $\mathrm{OPT}/2$) and density 1: one with label $-1$, and the other with label $+1$.

2. The feature distribution of $Q_2$ is supported on $[0, \sqrt{\mathrm{OPT}}] \times [0, 1] \,\cup\, [-\sqrt{\mathrm{OPT}}, 0] \times [-1, 0]$ with density 1, and the label is given by $\mathrm{sign}(x_1)$.

3. Let $q_3 := \frac{2}{3}\sqrt{\mathrm{OPT}}\,(1 - \mathrm{OPT})$; then $Q_3$ consists of two squares: one has edge length $\sqrt{q_3/2}$, center $(1, 0)$, density 1 and label $+1$, and the other has edge length $\sqrt{q_3/2}$, center $(-1, 0)$, density 1 and label $-1$.

4. The feature distribution of $Q_4$ is the uniform distribution over the unit ball $B(1) := \{x : \|x\| \le 1\}$ with density $q_4 := \frac{1 - \mathrm{OPT} - 2\sqrt{\mathrm{OPT}} - q_3}{\pi}$, and the label is given by $\mathrm{sign}(x_1)$.

Figure 1. An illustration of $Q$ when $\mathrm{OPT} = 1/16$. Red areas denote the $+1$ label, while blue areas denote the $-1$ label. The parts $Q_1$, $Q_2$ and $Q_3$ are marked in the figure, while $Q_4$ is supported on the unit ball and marked by horizontal lines.

Note that the correct label is given by $\mathrm{sign}(x_1)$ on $Q_2$, $Q_3$ and $Q_4$; therefore $\bar{u} := (1, 0)$ is our ground-truth solution, which is wrong only on the noisy part $Q_1$.
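Theorem 2.1 below asserts that $Q_x$ has zero mean and satisfies $\mathbb{E}[x_1x_2] = \mathbb{E}[x_1^2 - x_2^2] = 0$. Given a sampler for a candidate construction, these moment conditions can be sanity-checked by Monte Carlo; the snippet below is a generic sketch of such a check (not from the paper), with a standard Gaussian used as a placeholder sampler.

```python
import numpy as np

def moment_check(sample_fn, n=500_000, seed=0):
    """Monte Carlo estimates of E[x], E[x1*x2], and E[x1^2 - x2^2] for a 2-d marginal.

    `sample_fn(n, rng)` should return an (n, 2) array of feature draws; for a
    construction such as Q_x all three estimates should be close to 0.
    """
    rng = np.random.default_rng(seed)
    x = sample_fn(n, rng)
    return {
        "E[x]": x.mean(axis=0),
        "E[x1*x2]": np.mean(x[:, 0] * x[:, 1]),
        "E[x1^2 - x2^2]": np.mean(x[:, 0] ** 2 - x[:, 1] ** 2),
    }

# Placeholder sampler (standard Gaussian); a sampler for Q_x would be plugged in instead.
print(moment_check(lambda n, rng: rng.standard_normal((n, 2))))
```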
Here is our lower bound result.

Theorem 2.1. Suppose $\mathrm{OPT} \le 1/100$, and let $Q_x$ denote the marginal distribution of $Q$ on the feature space. It holds that $\mathbb{E}_{x\sim Q_x}[x] = 0$, $\mathbb{E}_{x\sim Q_x}[x_1x_2] = 0$, and $\mathbb{E}_{x\sim Q_x}[x_1^2 - x_2^2] = 0$. Moreover, the population logistic risk $\mathcal{R}_{\log}$ has a global minimizer $w^*$, and
$$\mathcal{R}_{0\text{-}1}(w^*) = \Pr\big[y \ne \mathrm{sign}(\langle w^*, x\rangle)\big] = \Omega(\sqrt{\mathrm{OPT}}).$$

Note that we can further normalize $Q_x$ to unit variance and make it isotropic. Then it is easy to check that $Q_x$ satisfies the well-behaved-ness conditions in (Diakonikolas et al., 2020d), and the soft-margin and sub-exponential conditions in (Frei et al., 2021b). In particular, our lower bound matches the upper bound in (Frei et al., 2021b).

2.1. Proof of Theorem 2.1

Here is a proof sketch of Theorem 2.1; the full proof is given in Appendix B. First, a basic calculation shows that $Q_x$ is isotropic up to a constant multiplicative factor. Specifically, $Q_1$, $Q_2$ and $Q_4$ are constructed to make the risk lower bound proof work, while $Q_3$ is included to make $Q$ isotropic. It turns out that $Q_3$ does not change the risk lower bound proof too much: the reason is that we will prove Theorem 2.1 by contradiction, and show that if its conclusion does not hold, then $\nabla\mathcal{R}_{\log}(w^*) \ne 0$. We show this mainly using $Q_1$, $Q_2$ and $Q_4$, but $Q_3$ does not significantly change the argument either, since it is highly aligned with the ground-truth solution $\bar{u} := (1, 0)$. As a result, if we assume the conclusion of Theorem 2.1 does not hold, which means $Q_3$ is also aligned with $w^*$, then $\ell_{\log}(y\langle w^*, x\rangle)$ will be close to 0 on $Q_3$, and we can still obtain a nonzero $\nabla\mathcal{R}_{\log}(w^*)$ and derive a contradiction.

Next we consider the risk lower bound. We only need to show that $\phi(\bar{u}, w^*)$, the angle between $\bar{u}$ and $w^*$, is $\Omega(\sqrt{\mathrm{OPT}})$, since it then follows that $w^*$ is wrong on an $\Omega(\sqrt{\mathrm{OPT}})$ fraction of $Q_4$, which is enough since $Q_4$ accounts for more than half of the distribution $Q$. Note that the minimizer of the logistic risk on $Q_4$ by itself is infinitely far in the direction of $\bar{u}$. However, this would incur a large risk on $Q_1$. By balancing these two parts, we can show that by moving along the direction of $\bar{u}$ by a distance of $\Theta(1/\sqrt{\mathrm{OPT}})$, we can achieve a logistic risk of $O(\sqrt{\mathrm{OPT}})$.

Lemma 2.2. Suppose $\mathrm{OPT} \le 1/100$, and let $\bar{w} := (\bar{r}, 0)$ where $\bar{r} = 3/\sqrt{\mathrm{OPT}}$; then $\mathcal{R}_{\log}(\bar{w}) \le 5\sqrt{\mathrm{OPT}}$.

Next we consider the global minimizer $w^*$ of $\mathcal{R}_{\log}$, which exists since $\mathcal{R}_{\log}$ has bounded sub-level sets. Let $(r^*, \theta^*)$ denote the polar coordinates of $w^*$. We will assume $|\theta^*| \le \sqrt{\mathrm{OPT}}/30$, and derive a contradiction. In our construction, $Q_3$ and $Q_4$ are symmetric with respect to the horizontal axis, and they will induce the ground-truth solution. However, $Q_1$ and $Q_2$ are skew, and they will pull $w^*$ upward, meaning we actually have $\theta^* \in [0, \sqrt{\mathrm{OPT}}/30]$.

The first observation is an upper bound on $r^*$: if $r^*$ is too large, then the risk of $w^*$ over $Q_1$ will already be larger than $\mathcal{R}_{\log}(\bar{w})$ for the $\bar{w}$ constructed in Lemma 2.2, a contradiction.

Lemma 2.3. Suppose $\mathrm{OPT} \le 1/100$ and $\theta^* \in [0, \sqrt{\mathrm{OPT}}/30]$. Then $r^* \le 10/\sqrt{\mathrm{OPT}}$.

However, our next lemma shows that under the above conditions, the gradient of $\mathcal{R}_{\log}$ at $w^*$ does not vanish, which contradicts the definition of $w^*$.

Lemma 2.4. Suppose $\mathrm{OPT} \le 1/100$. Then for any $w = (r, \theta)$ with $0 \le r \le 10/\sqrt{\mathrm{OPT}}$ and $0 \le \theta \le \sqrt{\mathrm{OPT}}/30$, it holds that $\nabla\mathcal{R}_{\log}(w) \ne 0$.

To prove Lemma 2.4, consider an arbitrary $w = (r, \theta)$ under the conditions of Lemma 2.4. For simplicity, let us first look at the case $\theta = 0$. In this case, note that $y\langle w, x\rangle \le r\sqrt{\mathrm{OPT}} \le 10$ on $Q_2$, therefore $|\ell_{\log}'(y\langle w, x\rangle)|$ is bounded away from 0 on $Q_2$.
We can then show that $Q_2$ contributes a component of magnitude at least $C_1\sqrt{\mathrm{OPT}}$, with a negative sign, to $\langle\nabla\mathcal{R}_{\log}(w), e_2\rangle$, where $e_2 = (0, 1)$ and $C_1$ is some universal constant. Moreover, $Q_1$ also induces a nonpositive component in the gradient along $e_2$, while $Q_3$ and $Q_4$ induce a zero component along $e_2$. As a result, $\langle\nabla\mathcal{R}_{\log}(w), e_2\rangle < 0$, and thus $\nabla\mathcal{R}_{\log}(w)$ is nonzero. Now if $0 \le \theta \le C_2\sqrt{\mathrm{OPT}}$ for some small enough constant $C_2$ ($1/30$ in our case), we can show that $Q_3$ and $Q_4$ cannot cancel the effect of $Q_2$, and it still holds that $\langle\nabla\mathcal{R}_{\log}(w), e_2\rangle < 0$.

3. An $\tilde{O}(\mathrm{OPT})$ upper bound for logistic loss with radial Lipschitzness

The $\Omega(\sqrt{\mathrm{OPT}})$ lower bound construction in Section 2 shows that further assumptions on the distribution are necessary in order to improve the upper bound on the zero-one risk of the logistic regression solution. In particular, we note that the distribution $Q$ constructed in Section 2 has a discontinuous density. In this section, we show that if we simply add a very mild Lipschitz continuity condition on the density, then we can achieve $\tilde{O}(\mathrm{OPT})$ zero-one risk using logistic regression.

First, we formally provide the standard assumptions from prior work. Because of the lower bound for $s$-heavy-tailed distributions from (Diakonikolas et al., 2020d) (cf. Section 1.1), to get an $\tilde{O}(\mathrm{OPT})$ zero-one risk, we need to assume $P_x$ has a light tail. Following (Frei et al., 2021b), we will either consider a bounded distribution, or assume $P_x$ is sub-exponential as defined below (cf. (Vershynin, 2018, Proposition 2.7.1 and Section 3.4.4)).

Definition 3.1. We say $P_x$ is $(\alpha_1, \alpha_2)$-sub-exponential for constants $\alpha_1, \alpha_2 > 0$ if, for any unit vector $v$ and any $t > 0$,
$$\Pr_{x\sim P_x}\big[|\langle v, x\rangle| \ge t\big] \le \alpha_1\exp(-t/\alpha_2).$$

We also need the next assumption, which is part of the well-behaved-ness conditions from (Diakonikolas et al., 2020d).

Assumption 3.2. There exist constants $U, R > 0$ and a function $\sigma: \mathbb{R}_+ \to \mathbb{R}_+$ such that if we project $P_x$ onto an arbitrary two-dimensional subspace $V$, the corresponding density $p_V$ satisfies $p_V(r, \theta) \ge 1/U$ for all $r \le R$, $p_V(r, \theta) \le \sigma(r)$ for all $r \ge 0$, $\int_0^\infty \sigma(r)\,dr \le U$, and $\int_0^\infty r\sigma(r)\,dr \le U$.

While Assumption 3.2 may look a bit technically involved, it basically consists of some mild concentration and anti-concentration conditions. In particular, for a broad class of distributions including isotropic log-concave distributions, the sub-exponential condition and Assumption 3.2 hold with $\alpha_1, \alpha_2, U, R$ all being universal constants.

Finally, as discussed earlier, the previous conditions are also satisfied by $Q$ from Section 2, and thus to get the improved $\tilde{O}(\mathrm{OPT})$ risk bound, we need the following radial Lipschitz continuity assumption.

Assumption 3.3. There exists a measurable function $\kappa: \mathbb{R}_+ \to \mathbb{R}_+$ such that for any two-dimensional subspace $V$,
$$|p_V(r, \theta) - p_V(r, \theta')| \le \kappa(r)\,|\theta - \theta'|.$$

We will see that Assumption 3.3 is crucial for the upper bound analysis in Lemma 3.13. For some concrete examples, note that if $P_x$ is radially symmetric (e.g., standard Gaussian), then its projection onto any two-dimensional subspace $V$ is also radially symmetric, therefore we can let $\kappa(r) = 0$. On the other hand, if $p_V$ is $\lambda$-Lipschitz continuous on $\mathbb{R}^2$ under $\ell_2$ (e.g., a general Gaussian), then it implies $|p_V(r, \theta) - p_V(r, \theta')| \le \lambda r|\theta - \theta'|$, therefore we can let $\kappa(r) = \lambda r$.
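For completeness, here is the short calculation behind the second example (an elementary step worked out here, not quoted from the paper): for two points at the same radius $r$, $\ell_2$-Lipschitzness of $p_V$ gives
$$|p_V(r,\theta) - p_V(r,\theta')| \;\le\; \lambda\,\big\|(r\cos\theta,\, r\sin\theta) - (r\cos\theta',\, r\sin\theta')\big\| \;=\; 2\lambda r\,\Big|\sin\tfrac{\theta-\theta'}{2}\Big| \;\le\; \lambda r\,|\theta - \theta'|,$$
since the chord between the two points is bounded by the arc length $r|\theta - \theta'|$; hence $\kappa(r) = \lambda r$ indeed satisfies Assumption 3.3.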
Now we can state our main results. In the following, we denote the unit linear classifier with the optimal zero-one risk by $\bar{u}$, with $\mathcal{R}_{0\text{-}1}(\bar{u}) = \mathrm{OPT} \in (0, 1/e)$. Our first result shows that, with Assumption 3.3, minimizing the logistic risk yields a solution with $\tilde{O}(\mathrm{OPT})$ zero-one risk.

Theorem 3.4. Under Assumptions 3.2 and 3.3, let $w^*$ denote the global minimizer of $\mathcal{R}_{\log}$.

1. If $\|x\| \le B$ almost surely, then $\mathcal{R}_{0\text{-}1}(w^*) = O\big((1 + C_\kappa)\mathrm{OPT}\big)$, where $C_\kappa := \int_0^B \kappa(r)\,dr$.

2. If $P_x$ is $(\alpha_1, \alpha_2)$-sub-exponential, then $\mathcal{R}_{0\text{-}1}(w^*) = O\big((1 + C_\kappa)\mathrm{OPT}\ln(1/\mathrm{OPT})\big)$, where $C_\kappa := \int_0^{3\alpha_2\ln(1/\mathrm{OPT})} \kappa(r)\,dr$.

Remark 3.5. Given Theorem 3.4, we only need to estimate $C_\kappa$ to get a concrete bound. First, for radially symmetric distributions, since $\kappa(r) = 0$, we have $C_\kappa = 0$. On the other hand, if $p_V$ is $\lambda$-Lipschitz continuous on $\mathbb{R}^2$, then we can let $\kappa(r) = \lambda r$, and then by definition, we can show $C_\kappa \le \lambda B^2/2$ in the bounded case, and $C_\kappa \le 9\lambda\alpha_2^2\ln(1/\mathrm{OPT})^2/2$ in the sub-exponential case.

Theorem 3.4 shows that with radial Lipschitzness, the global minimizer can attain $\tilde{O}(\mathrm{OPT})$ zero-one risk; next we also give an algorithmic result. Given a target error $\epsilon \in (0, 1)$, we consider projected gradient descent on the empirical risk with a norm bound of $1/\sqrt{\epsilon}$: let $w_0 := 0$, and
$$w_{t+1} := \Pi_{B(1/\sqrt{\epsilon})}\Big[w_t - \eta\nabla\widehat{\mathcal{R}}_{\log}(w_t)\Big]. \tag{1}$$
Our next result shows that projected gradient descent can also give an $\tilde{O}(\mathrm{OPT})$ risk. Note that for the two cases discussed below (bounded or sub-exponential), we use the corresponding $C_\kappa$ defined in Theorem 3.4.

Theorem 3.6. Suppose Assumptions 3.2 and 3.3 hold.

1. If $\|x\| \le B$ almost surely, then with $\eta = 4/B^2$, using $O\big((B+1)^2\ln(1/\delta)/\epsilon^4\big)$ samples and $O\big(B^2/\epsilon^{5/2}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ satisfying $\mathcal{R}_{0\text{-}1}(w_t) = O\big((1 + C_\kappa)(\mathrm{OPT} + \epsilon)\big)$.

2. On the other hand, if $P_x$ is $(\alpha_1, \alpha_2)$-sub-exponential, then with $\eta = \tilde{\Theta}(1/d)$, using $\tilde{O}\big(d\ln(1/\delta)^3/\epsilon^4\big)$ samples and $\tilde{O}\big(d\ln(1/\delta)^2/\epsilon^{5/2}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ with $\mathcal{R}_{0\text{-}1}(w_t) = O\big((1 + C_\kappa)(\mathrm{OPT}\ln(1/\mathrm{OPT}) + \epsilon)\big)$.

Next we give proof outlines of our results; the full proofs are given in Appendix C. For simplicity, here we focus only on the bounded case, while the sub-exponential case will be handled in Appendix C. Although the proofs of the two cases share some similarity, we want to emphasize that the sub-exponential case does not follow by simply truncating the distribution to a certain radius and thus reducing to the bounded case. The reason is that the truncation radius can be as large as $\sqrt{d}$, while for the bounded case in our results, $B$ is considered a constant independent of $d$ and hidden in the $O(\cdot)$ notation; therefore this truncation argument would introduce a $\mathrm{poly}(d)$ dependency in the final bound. By contrast, our zero-one risk upper bounds for sub-exponential distributions only depend on $\alpha_1, \alpha_2, U$ and $R$, but do not depend on $d$.

3.1. Proof of Theorems 3.4 and 3.6

Theorems 3.4 and 3.6 rely on the following key lemma, which provides a zero-one risk bound on near-optimal solutions to the logistic regression problem. It basically says that a near-optimal solution with reasonably large norm also attains a good zero-one risk.

Lemma 3.7. Under Assumptions 3.2 and 3.3, suppose $\hat{w}$ satisfies $\mathcal{R}_{\log}(\hat{w}) \le \mathcal{R}_{\log}(\|\hat{w}\|\bar{u}) + \epsilon_\ell$ for some $\epsilon_\ell \in [0, 1)$. If $\|x\| \le B$ almost surely, then
$$\mathcal{R}_{0\text{-}1}(\hat{w}) = O\left(\max\left\{\mathrm{OPT},\ \sqrt{\frac{\epsilon_\ell}{\|\hat{w}\|}},\ \frac{C_\kappa}{\|\hat{w}\|^2}\right\}\right).$$

In this subsection, we will sketch the proofs of Theorems 3.4 and 3.6 using Lemma 3.7; the details are given in Appendix C.1. In the next subsection, we will prove Lemma 3.7.

We first prove Theorem 3.4. Note that by Lemma 3.7, it suffices to show that $\|w^*\| = \Omega(1/\sqrt{\mathrm{OPT}})$ (since $\epsilon_\ell = 0$ in this case), which is true due to the next result.

Lemma 3.8. Under Assumption 3.2, if $\|x\| \le B$ almost surely and $\mathrm{OPT} < \frac{R^4}{200U^3B}$, then $\|w^*\| = \Omega(1/\sqrt{\mathrm{OPT}})$.
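As a concrete illustration of the procedure analyzed in Theorem 3.6, here is a minimal numpy sketch of the projected gradient descent update (1) on the empirical logistic risk; the data interface, step size, and iteration budget are placeholders rather than the constants from the theorem.

```python
import numpy as np

def projected_gd_logistic(X, y, eps, eta, T):
    """Projected gradient descent (1) on the empirical logistic risk over the ball B(1/sqrt(eps)).

    X: (n, d) features, y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    radius = 1.0 / np.sqrt(eps)
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)
        # grad of (1/n) sum log(1 + exp(-m_i)) is -(1/n) sum sigmoid(-m_i) y_i x_i;
        # sigmoid(-m) = (1 - tanh(m/2)) / 2 is used for numerical stability.
        coef = y * 0.5 * (1.0 - np.tanh(margins / 2.0))
        grad = -(X.T @ coef) / n
        w = w - eta * grad
        norm = np.linalg.norm(w)
        if norm > radius:  # Euclidean projection back onto B(1/sqrt(eps))
            w = w * (radius / norm)
    return w
```

The projection step simply rescales any iterate that leaves the ball $B(1/\sqrt{\epsilon})$, which is exactly the operator $\Pi_{B(1/\sqrt{\epsilon})}$ in Equation (1).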
Next we prove Theorem 3.6. Once again motivated by Lemma 3.7, we need to show that projected gradient descent can achieve a near-optimal logistic risk and a large norm. Recall that given the target (zero-one) error $\epsilon \in (0, 1)$, we run projected gradient descent on a Euclidean ball with radius $1/\sqrt{\epsilon}$ (cf. Equation (1)). Using standard optimization and generalization analyses, we can prove the following guarantee on $\mathcal{R}_{\log}(w_t)$.

Lemma 3.9. Let the target optimization error $\epsilon_\ell \in (0, 1)$ and the failure probability $\delta \in (0, 1/e)$ be given. If $\|x\| \le B$ almost surely, then with $\eta = 4/B^2$, using $O\big(\frac{(B+1)^2\ln(1/\delta)}{\epsilon\epsilon_\ell^2}\big)$ samples and $O\big(\frac{B^2}{\epsilon\epsilon_\ell}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ satisfying
$$\mathcal{R}_{\log}(w_t) \le \mathcal{R}_{\log}(\|w_t\|\bar{u}) + \epsilon_\ell. \tag{2}$$

We also need the following lower bound on $\|w_t\|$.

Lemma 3.10. Under Assumption 3.2, suppose $\epsilon_\ell \le \sqrt{\epsilon}$ and that Equation (2) holds. If $\|x\| \le B$ almost surely and $\mathrm{OPT} < \frac{R^4}{500U^3B}$, then $\|w_t\| = \Omega\big(\min\big\{\frac{1}{\sqrt{\epsilon}}, \frac{1}{\sqrt{\mathrm{OPT}}}\big\}\big)$.

Now to prove Theorem 3.6, we simply need to combine Lemmas 3.7, 3.9 and 3.10 with $\epsilon_\ell = \epsilon^{3/2}$.

3.2. Proof of Lemma 3.7

Here we give a proof sketch of Lemma 3.7; the details are given in Appendix C.2. As mentioned before, here we focus on the bounded setting; in the appendix, we also prove a version of Lemma 3.7 for the sub-exponential setting (cf. Lemma C.4). One remark is that some of the lemmas in the proof are also true for the hinge loss, and this fact will be crucial in the later discussion regarding our two-phase algorithm (cf. Section 4).

Let $\bar{w} := \|\hat{w}\|\bar{u}$, and consider $\ell \in \{\ell_{\log}, \ell_h\}$. The first step is to express $\mathcal{R}_\ell(\hat{w}) - \mathcal{R}_\ell(\bar{w})$ as the sum of three terms, and then bound them separately. The first term is given by
$$\mathcal{R}_\ell(\hat{w}) - \mathcal{R}_\ell(\bar{w}) - \mathbb{E}\Big[\ell\big(\mathrm{sign}(\langle\bar{w}, x\rangle)\langle\hat{w}, x\rangle\big) - \ell\big(\mathrm{sign}(\langle\bar{w}, x\rangle)\langle\bar{w}, x\rangle\big)\Big], \tag{3}$$
the second term is given by
$$\mathbb{E}\Big[\ell\big(\mathrm{sign}(\langle\bar{w}, x\rangle)\langle\hat{w}, x\rangle\big) - \ell\big(\mathrm{sign}(\langle\hat{w}, x\rangle)\langle\hat{w}, x\rangle\big)\Big], \tag{4}$$
and the third term is given by
$$\mathbb{E}\Big[\ell\big(\mathrm{sign}(\langle\hat{w}, x\rangle)\langle\hat{w}, x\rangle\big) - \ell\big(\mathrm{sign}(\langle\bar{w}, x\rangle)\langle\bar{w}, x\rangle\big)\Big], \tag{5}$$
where the expectations are taken over $P_x$.

We first bound term (3), which is the approximation error of replacing the true label $y$ with the label given by $\bar{u}$. Since $\ell(-z) - \ell(z) = z$ for the logistic loss and the hinge loss, we have the following equality:
$$\text{term (3)} = \mathbb{E}\Big[\mathbf{1}_{y \ne \mathrm{sign}(\langle\bar{w}, x\rangle)}\, y\langle\bar{w} - \hat{w}, x\rangle\Big].$$
The approximation error can be bounded as below, using the tail bound on $P_x$ and the fact $\mathcal{R}_{0\text{-}1}(\bar{w}) = \mathrm{OPT}$.

Lemma 3.11. For $\ell \in \{\ell_{\log}, \ell_h\}$, if $\|x\| \le B$ almost surely, then $|\text{term (3)}| \le B\|\bar{w} - \hat{w}\|\,\mathrm{OPT}$.

Next we bound term (4). Note that we only need to consider the case where $\langle\bar{w}, x\rangle$ and $\langle\hat{w}, x\rangle$ have different signs; in this case, we can use the property $\ell(-z) - \ell(z) = z$ again and show the next result.

Lemma 3.12. Under Assumption 3.2, for $\ell \in \{\ell_{\log}, \ell_h\}$, $\text{term (4)} \ge \frac{4R^3}{3U\pi^2}\|\hat{w}\|\,\phi(\hat{w}, \bar{w})^2$.

Lastly, we consider term (5). Note that it is 0 for the hinge loss $\ell_h$, because $\ell_h(z) = 0$ when $z \ge 0$. For the logistic loss, term (5) is also 0 if $P_x$ is radially symmetric; in general, we will bound it using Assumption 3.3.

Lemma 3.13. For $\ell = \ell_h$, term (5) is 0. For $\ell = \ell_{\log}$, under Assumption 3.3, if $\|x\| \le B$ almost surely, then $|\text{term (5)}| \le 12C_\kappa\,\phi(\hat{w}, \bar{w})/\|\hat{w}\|$, where $C_\kappa := \int_0^B \kappa(r)\,dr$.

Now we are ready to prove Lemma 3.7. For simplicity, here we let $\phi$ denote $\phi(\hat{w}, \bar{w})$. For bounded distributions, Lemmas 3.11 to 3.13 imply
$$C_1\|\hat{w}\|\phi^2 \le \epsilon_\ell + B\|\bar{w} - \hat{w}\|\,\mathrm{OPT} + C_2C_\kappa\,\phi/\|\hat{w}\| \le \epsilon_\ell + B\|\hat{w}\|\phi\,\mathrm{OPT} + C_2C_\kappa\,\phi/\|\hat{w}\|,$$
where $C_1 = 4R^3/(3U\pi^2)$ and $C_2 = 12$. It follows that at least one of the following three cases is true:

1. $C_1\|\hat{w}\|\phi^2 \le 3\epsilon_\ell$, which implies $\phi = O\big(\sqrt{\epsilon_\ell/\|\hat{w}\|}\big)$;

2. $C_1\|\hat{w}\|\phi^2 \le 3B\|\hat{w}\|\phi\,\mathrm{OPT}$, and it follows that $\phi = O(\mathrm{OPT})$;

3. $C_1\|\hat{w}\|\phi^2 \le 3C_2C_\kappa\,\phi/\|\hat{w}\|$, and it follows that $\phi = O\big(C_\kappa/\|\hat{w}\|^2\big)$.
Therefore we can show a bound on the angle between $\bar{w}$ and $\hat{w}$, which further implies a zero-one risk bound for $\hat{w}$, in light of (Diakonikolas et al., 2020d, Claim 3.4), which is stated below.

Lemma 3.14. Under Assumption 3.2,
$$\mathcal{R}_{0\text{-}1}(\hat{w}) - \mathcal{R}_{0\text{-}1}(\bar{w}) \le \Pr\big[\mathrm{sign}(\langle\hat{w}, x\rangle) \ne \mathrm{sign}(\langle\bar{w}, x\rangle)\big] \le 2U\phi(\hat{w}, \bar{w}).$$

3.3. Recovering the general $\tilde{O}(\sqrt{\mathrm{OPT}})$ upper bound

Frei et al. (2021b) showed an $\tilde{O}(\sqrt{\mathrm{OPT}})$ upper bound under the soft-margin and sub-exponential conditions. Here we give an alternative proof of this result using our proof technique. The result in this section will later serve as a guarantee for the first phase of our two-phase algorithm (cf. Section 4) that achieves $\tilde{O}(\mathrm{OPT})$ risk.

Recall that the only place we need Assumption 3.3 is in the proof of Lemma 3.13. However, even without Assumption 3.3, we can still prove the following general bound, which only needs Assumption 3.2.

Lemma 3.15. Under Assumption 3.2, for $\ell = \ell_{\log}$, $|\text{term (5)}| \le 12U/\|\hat{w}\|$.

Now with Lemma 3.15, we can prove a weaker but more general version of Lemma 3.7 (cf. Theorem C.15). Further invoking Lemmas 3.9 and 3.10 (cf. Lemmas C.6 and C.7 for the corresponding sub-exponential results), and letting $\epsilon_\ell = \sqrt{\epsilon}$, we can show the next result. We present the bound in terms of the angle instead of the zero-one risk for later applications in Section 4.

Lemma 3.16. Given the target error $\epsilon \in (0, 1)$ and the failure probability $\delta \in (0, 1/e)$, consider projected gradient descent (1). If $\|x\| \le B$ almost surely, then with $\eta = 4/B^2$, using $O\big(\frac{(B+1)^2\ln(1/\delta)}{\epsilon^2}\big)$ samples and $O\big(\frac{B^2}{\epsilon^{3/2}}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ with $\phi(w_t, \bar{u}) = O\big(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon}\big)$. On the other hand, if $P_x$ is $(\alpha_1, \alpha_2)$-sub-exponential, then with $\eta = \tilde{\Theta}(1/d)$, using $\tilde{O}\big(\frac{d\ln(1/\delta)^3}{\epsilon^2}\big)$ samples and $\tilde{O}\big(\frac{d\ln(1/\delta)^2}{\epsilon^{3/2}}\big)$ iterations, with probability $1 - \delta$, projected gradient descent outputs $w_t$ with $\phi(w_t, \bar{u}) = O\big(\sqrt{\mathrm{OPT}\ln(1/\mathrm{OPT})} + \sqrt{\epsilon}\big)$.

The proofs of the results above are given in Appendix C.3.

4. An $\tilde{O}(\mathrm{OPT})$ upper bound with hinge loss

We now show how to avoid Assumption 3.3 and achieve an $\tilde{O}(\mathrm{OPT})$ zero-one risk bound using an extra step of hinge loss minimization. The key observation here is that the only place where Assumption 3.3 is used is in Lemma 3.13, for bounding term (5) for the logistic loss. However, as noted in Lemma 3.13, for the hinge loss, term (5) is conveniently 0. So a version of Lemma 3.7 holds for the hinge loss, without using Assumption 3.3, and dropping the third term $C_\kappa/\|\hat{w}\|^2$ in the max. Thus, to get an $\tilde{O}(\mathrm{OPT})$ upper bound, it is enough to minimize the hinge loss to find a solution $\hat{w}$ such that $\|\hat{w}\| = \Omega(1)$ and $\mathcal{R}_h(\hat{w}) \le \mathcal{R}_h(\|\hat{w}\|\bar{u}) + \epsilon_\ell$ for some $\epsilon_\ell = \tilde{O}\big((\mathrm{OPT} + \epsilon)^2\big)$.

However, there is still one remaining challenge: note that the global minimizer of $\mathcal{R}_h$ is given by 0, while if we add the explicit requirement $\|\hat{w}\| = \Omega(1)$, the problem becomes nonconvex. Fortunately, we can bypass having to solve this nonconvex problem by leveraging the solution of the logistic regression problem, which is guaranteed to make an angle of at most $\tilde{O}(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon})$ with $\bar{u}$, even without Assumption 3.3, by Lemma 3.16. This solution, represented by a unit vector $v$, gives us a warm start for hinge loss minimization. Specifically, suppose we optimize the hinge loss over the halfspace
$$D := \big\{w \in \mathbb{R}^d : \langle w, v\rangle \ge 1\big\}; \tag{6}$$
then any solution we find must have norm at least 1. Furthermore, using the fact that $\phi(v, \bar{u}) = \tilde{O}(\sqrt{\mathrm{OPT}} + \sqrt{\epsilon})$ and the positive homogeneity of the hinge loss, we can also conclude that the optimizer of the hinge loss satisfies $\mathcal{R}_h(\hat{w}) \le \mathcal{R}_h(\|\hat{w}\|\bar{u}) + \epsilon_\ell$, giving us the desired solution.
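Before the formal description and guarantees below, here is a minimal numpy sketch of this two-phase approach (logistic-regression warm start, then perceptron-style projected SGD over $D$); the sampler, step sizes, and iteration counts are placeholders rather than the settings from Theorem 4.1, and the phase-2 step mirrors the projected SGD update (7) given below.

```python
import numpy as np

def two_phase_halfspace(sample, eps, eta1=0.1, T1=500, eta2=0.01, T2=100_000, seed=0):
    """Two-phase sketch: logistic regression warm start, then perceptron-style
    projected SGD on ell_h(z) = max(-z, 0) over D = {w : <w, v> >= 1}.

    `sample(n, rng)` returns (X, y) with y in {-1, +1}; all hyperparameters are placeholders.
    """
    rng = np.random.default_rng(seed)

    # Phase 1: projected GD on the empirical logistic risk, norm bound 1/sqrt(eps).
    X, y = sample(50_000, rng)
    n, d = X.shape
    radius = 1.0 / np.sqrt(eps)
    w = np.zeros(d)
    for _ in range(T1):
        m = y * (X @ w)
        w -= eta1 * (-(X.T @ (y * 0.5 * (1.0 - np.tanh(m / 2.0)))) / n)
        nrm = np.linalg.norm(w)
        if nrm > radius:
            w *= radius / nrm
    v = w / np.linalg.norm(w)  # warm-start unit direction (assumed nonzero for this sketch)

    # Phase 2: one-sample projected SGD over D, started at w0 = v (on the boundary of D).
    w = v.copy()
    for _ in range(T2):
        Xt, yt = sample(1, rng)
        xt, yt = Xt[0], yt[0]
        if yt * (xt @ w) <= 0:      # subgradient of max(-z, 0) is -1 for z <= 0 (convention at 0)
            w += eta2 * yt * xt     # perceptron-style correction
        gap = 1.0 - (w @ v)
        if gap > 0:                 # Euclidean projection onto the halfspace {<w, v> >= 1}
            w += gap * v
    return w                        # predict with sign(<w, x>)
```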
While the above analysis does yield a simple two-phase polynomial time algorithm for getting an e O(OPT) zeroone risk bound, closer analysis reveals a sample complexity requirement of e O(1/ϵ4). We can improve the sample complexity requirement to e O(1/ϵ2) by doing a custom analysis of SGD on the hinge loss (i.e., perceptron, (Novikoff, 1963)) inspired by the above considerations. Thus we get the following two-phase algorithm1: 1. Run projected gradient descent under the settings of Lemma 3.16, and find a unit vector v such that ϕ(v, u) is O OPT + ϵ for bounded distributions, or O p OPT ln(1/OPT) + ϵ for sub-exponential distributions. 2. Run projected SGD over the domain D defined in Equation (6) starting from w0 := v: at step t, we sample (xt, yt) P, and let wt+1 := ΠD h wt ηℓ h yt wt, xt ytxt i . (7) Here, we set the convention that ℓ h(0) = 1. Below, we present the results regarding the expected zeroone risk for simplicity; we note that the results can be turned into high-probability bounds using the repeated probability amplification technique. Theorem 4.1. Given the target error ϵ (0, 1/e), suppose Assumption 3.2 holds. 1. First, for bounded distributions, with η = Θ(ϵ), for all T = Ω(1/ϵ2), E min 0 t 0 be given, then 2 ρ(1 e rρ) Z 2π 0 ℓlog rρ cos(θ) r dθ 8 Proof. First note that by symmetry, Z 2π 0 ℓlog rρ cos(θ) r dθ = 4 Z π 2 0 ℓlog rρ cos(θ) r dθ. On the upper bound, note that ℓlog rρ cos(θ) is increasing as θ goes from 0 to π 2 , and moreover sin(θ) 2 2 for θ π 2 , therefore 0 ℓlog rρ cos(θ) r dθ 8 Z π 2 π 4 ℓlog rρ cos(θ) r dθ 8 π 4 ℓlog rρ cos(θ) rρ sin(θ) dθ. Also because ℓlog(z) exp( z), 0 ℓlog rρ cos(θ) r dθ 8 π 4 exp rρ cos(θ) rρ sin(θ) dθ On the lower bound, note that ℓlog(z) 1 2 exp( z) for z 0, therefore 0 ℓlog rρ cos(θ) r dθ = 4 Z π 2 0 ℓlog rρ cos(θ) r dθ 2 Z π 2 0 exp rρ cos(θ) r dθ 0 exp rρ cos(θ) rρ sin(θ) dθ Lemma A.2. Given w, w Rd, suppose Pr(x,y) P y = sign w, x = OPT. If x B almost surely, then 1y =sign( w,x ) w , x B w OPT. If Px is (α1, α2)-sub-exponential, and OPT 1 1y =sign( w,x ) w , x (1 + 2α1)α2 w OPT ln 1 OPT Proof. If x B almost surely, then 1y =sign( w,x ) w , x B w E(x,y) P 1y =sign( w,x ) Agnostic Learnability of Halfspaces via Logistic Loss Below we assume Px is (α1, α2)-sub-exponential. Let νx := w , x ; we first give some tail bounds for νx. Since Px is (α1, α2)-sub-exponential, for any t > 0, we have , equivalently Pr |νx| t α1 exp t α2 w Let µ(t) := Pr |νx| t . Given any threshold τ > 0, integration by parts gives E h 1|νx| τ|νx| i = Z τ t dµ(t) = τµ(τ) + Z τ µ(t) dt α1 α2 w + τ exp τ α2 w Now let τ := α2 w ln 1 OPT . Note that 1y =sign( w,x ) w , x = E(x,y) P 1|νx| τ1y =sign( w,x )|νx| + E(x,y) P 1|νx| τ1y =sign( w,x )|νx| . We bound the two parts separately. When |νx| τ, we have E 1|νx| τ1y =sign( w,x )|νx| τE 1y =sign( w,x ) = τ OPT = α2 w OPT ln 1 OPT On the other hand, when |νx| τ, Equation (12) gives 1|νx| τ1y =sign( w,x )|νx| E h 1|νx| τ|νx| i 1 + ln 1 OPT 2α1α2 w OPT ln 1 OPT where we also use OPT 1 e. To sum up, 1y =sign( w,x ) w , x (1 + 2α1)α2 w OPT ln 1 OPT B. Omitted proofs from Section 2 In this section, we will prove Theorem 2.1. First, we bound the density and support of Qx. Lemma B.1. If OPT 1 100, then it holds that q3 1 15, and 1 2π q4 1 π. As a result, Qx is supported on B(2) := x x 2 with its density bounded by 2. Proof. For q3, we have OPT(1 OPT) 2 For Q4, its total measure can be bounded as below: OPT q3 1 1 100 2 therefore q4 1 2π. The upper bound q4 1 π is trivial. 
Agnostic Learnability of Halfspaces via Logistic Loss On the support of Qx, note that for Q1, the largest ℓ2 norm is given by For Q2, the largest ℓ2 norm can be bounded by For Q3, the largest ℓ2 norm can be bounded by Finally, it is easy to verify that if OPT 1 100, then Q1, Q2 and Q3 do not overlap, therefore the density of Q is bounded by 1 + 1 Next we verify that Qx is isotropic up to a multiplicative factor. We first note the following fact; its proof is straightforward and omitted. Lemma B.2. It holds that Z a+ δ 2 xy dy dx = abδ2, and Z a+ δ 2 (x2 y2) dy dx = (a2 b2)δ2. Then we can prove the following result. Lemma B.3. It holds that Ex Qx[x] = 0, and Ex Qx[x1x2] = 0, and Ex Qx x2 1 x2 2 = 0. Proof. It follows from the symmetry of Q that Ex Qx[x] = 0. To verify Ex Qx[x1x2] = 0, note that the expectation of x1x2 is 0 on Q3 and Q4, and thus we only need to check Q1 and Q2. First, due to Lemma B.2, we have E(x,y) Q1 [x1x2] = OPT Additionally, E(x,y) Q2 [x1x2] = 2 Z 0 x1x2 dx2 dx1 = OPT Therefore Ex Qx[x1x2] = 0. Finally, note that the expectation of x2 1 x2 2 is 0 on Q1 due to Lemma B.2, and also 0 on Q4 due to symmetry; therefore we only need to consider Q2 and Q3. We have E(x,y) Q2 h x2 1 x2 2 i = 2 Z x2 1 x2 2 dx2 dx1 = 2 Since E(x,y) Q3 x2 1 x2 2 = q3 by Lemma B.2, it follows that Ex Qx x2 1 x2 2 = 0. Next, we give a proof of the risk lower bound of Theorem 2.1. For simplicity, in this section we will let R denote Rlog. For i = 1, 2, 3, 4, we also let Ri(w) := E(x,y) Qi ℓlog y w, x ; therefore R(w) := P4 i=1 Ri(w). We first prove Lemma 2.2, showing that there exists a solution w with w = Θ 1 and R( w) = O Proof of Lemma 2.2. We consider R1, R2, R3 and R4 respectively. Agnostic Learnability of Halfspaces via Logistic Loss 1. For Q1, note that the minimum of y w, x is Because ℓlog(z) z + 1 when z 0, and OPT 1 100, we have R1( w) ℓlog 2. For Q2, we have R2( w) = 2 Z 0 ℓlog(x1 r) dx2 dx1 = 2 Z 0 ℓlog(x1 r) dx1 0 exp( x1 r) dx1 where we use ℓlog(z) exp( z). 3. For Q3, the minimum of y w, x is where we use q3 1 15 by Lemma B.1. Further note that ℓlog(z) 1/z when z > 0, we have R3( w) q3ℓlog 2 r/3 1 10 r. R4( w) = Z 1 0 ℓlog r r cos(θ) q4r dθ dr 1 0 ℓlog r r cos(θ) r dθ dr, where we use q4 1 π from Lemma B.1. Lemma A.1 then implies Putting everything together, we have R( w) = R1( w) + R2( w) + R3( w) + R4( w) r + 1 10 r + 8 Agnostic Learnability of Halfspaces via Logistic Loss Next we prove Lemma 2.3, the upper bound on w . Proof of Lemma 2.3. Let Let φ denote the angle between u and v, then and it follows that the angle between v and w is bounded by Moreover, note that the maximum of y w , x on Q1 is given by w , v r v cos π Additionally because ℓlog(z) > z, we have R(w ) R1(w ) ℓlog OPT, then R(w ) > 5 OPT, which contradicts the definition of w in light of Lemma 2.2. Therefore r 10 Next we prove Lemma 2.4. Proof of Lemma 2.4. Let w = (r, θ), where 0 r 10 OPT and 0 θ 30 . We will consider the projection of R(w) onto the direction e2 := (0, 1), and show that this projection cannot be zero. 1. For Q1, the gradient of this part has a negative inner product with e2, due to the construction of Q1 and the fact ℓ log < 0. 2. For Q2, the inner product between e2 and the gradient of this part is given by 0 ℓ log(x1w1 + x2w2)x2 dx2 dx1. 
(13) Note that x1w1 rx1, while x2w2 = x2r sin (θ) rθ 10 and that ℓ log is increasing, therefore ℓ log(x1w1 + x2w2) ℓ log Agnostic Learnability of Halfspaces via Logistic Loss We can then upper bound Equation (13) as follows: Equation (13) 2 Z Now we consider two cases. If r OPT 2, then it follows from the convexity of ℓlog that Equation (13) 1 OPT ℓ log(3) On the other hand, if r OPT 2, then Equation (13) 1 Therefore, it always holds that Equation (13) 3. For Q3, the gradient of this part can have a positive inner product with e2. For simplicity, let ρ := 1 2 . To upper bound this inner product, it is enough to consider the region given by [1 ρ, 1 + ρ] [ ρ, 0] [ 1 ρ, 1 + ρ] [0, ρ] . Moreover, note that y w, x 0 on Q3, therefore ℓ log y w, x 1 2. Therefore the inner product between e2 and the gradient of Q3 can be upper bounded by (note that x2 0 in the integral) 2x2 dx2 dx1 = ρ3 = q3 16 where we use q3 1 15 by Lemma B.1 and q3 2 OPT by its definition. 4. For Q4, we further consider two cases. (a) Consider the part of Q4 with polar angles in ( π 2 ). By symmetry, the gradient of this part is along the direction with polar angle π + θ, and it has a negative inner product with e2. (b) Consider the part of Q4 with polar angles in ( π 2 + 2θ) ( π 2 + 2θ). We can verify that the gradient of this part has a positive inner product with e2; moreover, since 1 < ℓ log < 0, this inner product can be upper bounded by 0 r cos(θ )q4r dθ dr = 2q4 1 3 sin(2θ) 4θ where we also use q4 1 π and sin(z) z for z 0. As a result, item 3 and item 4(b) cannot cancel item 2, and thus R(w) cannot be 0. Now we are ready to prove the risk lower bound of Theorem 2.1. Proof of Theorem 2.1 risk lower bound. It is clear that R has bounded sub-level sets, and therefore can be globally minimized. Let the polar coordinates of the global minimizer be given by (r , θ ), where |θ | π. Assume that Agnostic Learnability of Halfspaces via Logistic Loss 30 i ; due to Q1 and Q2, it actually follows that θ h 0, 30 i . Lemma 2.3 then implies r 10 and then Lemma 2.4 implies R(w ) = 0, a contradiction. It then follows that w is wrong on a θ π portion of Q4. Since the total measure of Q4 is more than half due to Lemma B.1, we have C. Omitted proofs from Section 3 In this section, we provide omitted proofs from Section 3. First, we prove some general results that will be used later. Lemma C.1. Under Assumption 3.2, for any w Rd, E ℓlog w, x 12U Proof. Let v denote an arbitrary vector orthogonal to w, and let p denote the density of the projection of Px onto the space spanned by w and v. Then we have E ℓlog w, x = Z 0 ℓlog r w cos(θ) p(r, θ)r dθ dr. Invoking Assumption 3.2, we have E ℓlog w, x Z 0 ℓlog r w cos(θ) r dθ Lemma A.1 then implies E ℓlog w, x Z Then it follows from Assumption 3.2 that E ℓlog w, x 8 Next, we note that following the direction of the ground-truth solution u can achieve e O OPT logistic risk. Lemma C.2. Given ρ > 0, under Assumption 3.2, if x B almost surely, then Rlog(ρ u) 12U ρ + ρB OPT, with inf ρ>0 Rlog(ρ u) while if Px is (α1, α2)-sub-exponential, then Rlog(ρ u) 12U ρ + (1 + 2α1)α2ρ OPT ln 1 OPT inf ρ>0 Rlog(ρ u) 50(1 + 2α1)α2U OPT ln 1 OPT Agnostic Learnability of Halfspaces via Logistic Loss Proof. Note that Rlog(ρ u) = E(x,y) P h ℓlog y ρ u, x i ℓlog ρ u, x + E(x,y) P ℓlog y ρ u, x ℓlog ρ u, x . Since ℓlog( z) ℓlog(z) = z, and also invoking Lemma C.1, we have Rlog(ρ u) = Ex Px ℓlog ρ u, x + E(x,y) P 1y =sign( u,x ) ( y) ρ u, x ρ + E(x,y) P 1y =sign( u,x ) ( y) ρ u, x . 
If x B almost surely, then Lemma A.2 further implies Rlog(ρ u) 12U ρ + ρB OPT, inf ρ>0 Rlog(ρ u) 2 If Px is (α1, α2)-sub-exponential, then Lemma A.2 further implies Rlog(ρ u) 12U ρ + (1 + 2α1)α2ρ OPT ln 1 OPT and therefore inf ρ>0 Rlog(ρ u) 2 12(1 + 2α1)α2U OPT ln 1 OPT 50(1 + 2α1)α2U OPT ln 1 OPT Next we prove a risk lower bound, that will later be used to prove lower bounds on w and wt . Lemma C.3. Under Assumption 3.2, given w Rd, if R w 2, then while if R w 2, then Rlog(w) R U w . Proof. First, since ℓlog(z) ℓlog |z| , Rlog(w) = E(x,y) P h ℓlog y w, x i Ex Px ℓlog w, x . (14) Let v denote an arbitrary vector that is orthogonal to w, and let p denote the density of the projection of Px onto the space spanned by w and v. Without loss of generality, we can assume w has polar angle 0. Then Equation (14) becomes 0 ℓlog r w cos(θ) p(r, θ)r dθ dr. Agnostic Learnability of Halfspaces via Logistic Loss Assumption 3.2 and Lemma A.1 then imply 0 ℓlog r w cos(θ) r dθ dr e R w 1 + R w . If R w 2, then because e z 1 + z z2 4 when 0 z 2, we have U 1 w 2 R2 w 2 Otherwise if R w 2, then because e z 1 + z z 2 when z 2, we have U 1 w 2 R w 2 = R U w . C.1. Omitted proofs from Section 3.1 In this section, we prove Theorems 3.4 and 3.6. First, we state the following general version of Lemma 3.7, which also handles sub-exponential distributions; it will be proved in Appendix C.2. Lemma C.4 (Lemma 3.7, including the sub-exponential case). Under Assumptions 3.2 and 3.3, suppose ˆw satisfies Rlog( ˆw) Rlog( ˆw u) + ϵℓfor some ϵℓ [0, 1). 1. If x B almost surely, then R0 1( ˆw) = O max OPT, q ϵℓ ˆ w , Cκ ˆ w 2 2. If Px is (α1, α2)-sub-exponential and ˆw = Ω(1), then R0 1( ˆw) = O max OPT ln(1/OPT), q ϵℓ ˆ w , Cκ ˆ w 2 Next, we prove the following norm lower bound on w , which covers Lemma 3.8 and also the sub-exponential case. Lemma C.5 (Lemma 3.8, including the sub-exponential case). Under Assumption 3.2, if x B almost surely and OPT < R4 200U 3B , then w = Ω 1 ; if Px is (α1, α2)-sub-exponential and OPT ln(1/OPT) < R4 200(1+2α1)α2U 3 , then w = Ω 1 OPT ln(1/OPT) Proof. Suppose x B almost surely. Since OPT < R4 200U 3B , Lemma C.2 implies Rlog(w ) inf ρ>0 Rlog(ρ u) 200U 3B = R2 Therefore it follows from Lemma C.3 that R w 2, and R U w Rlog(w ) inf ρ>0 Rlog(ρ u) Agnostic Learnability of Halfspaces via Logistic Loss which implies Now suppose Px is (α1, α2)-sub-exponential. Since OPT ln 1 OPT < R4 200(1+2α1)α2U 3 , Lemma C.2 implies inf ρ>0 Rlog(ρ u) 50(1 + 2α1)α2U OPT ln 1 OPT 50(1 + 2α1)α2U R4 200(1 + 2α1)α2U 3 = R2 Therefore it follows from Lemma C.3 that R w 2, and R U w Rlog(w ) inf ρ>0 Rlog(ρ u) 50(1 + 2α1)α2U OPT ln 1 OPT which implies 50(1 + 2α1)α2U OPT ln(1/OPT) . Now we can prove Theorem 3.4. Proof of Theorem 3.4. If x B almost surely, Lemma C.4 implies R0 1(w ) = O max OPT, Cκ w 2 If OPT R4 200U 3B , then Theorem 3.4 holds vacuously; otherwise Lemma C.5 ensures w = Ω 1 R0 1(w ) = O max {OPT, Cκ OPT} = O (1 + Cκ)OPT . The proof of the sub-exponential case is similar. Next, we analyze project gradient descent. First we restate Lemmas 3.9 and 3.10, and also handle sub-exponential distributions. Lemma C.6 (Lemma 3.9, including the sub-exponential case). Let the target optimization error ϵℓ (0, 1) and the failure probability δ (0, 1/e) be given. If x B almost surely, then with η = 4/B2, using O (B+1)2 ln(1/δ) and O B2 ϵϵℓ iterations, with probability 1 δ, projected gradient descent outputs wt satisfying Rlog(wt) min 0 ρ 1/ ϵ Rlog(ρ u) + ϵℓ. 
(15) If Px is (α1, α2)-sub-exponential, then with η = eΘ(1/d), using e O d ln(1/δ)3 samples and e O d ln(1/δ)2 iterations, with probability 1 δ, projected gradient descent outputs wt satisfying Equation (15). Lemma C.7 (Lemma 3.10, including the sub-exponential case). Under Assumption 3.2, suppose and that Equation (15) holds. If x B almost surely and OPT < R4 500U 3B , then wt = Ω min n 1 ϵ, 1 On the other hand, if Px is (α1, α2)-sub-exponential, and OPT ln(1/OPT) < R4 500U 3(1+2α1)α2 , then it holds that wt = Ω min n 1 ϵ, 1 OPT ln(1/OPT) Agnostic Learnability of Halfspaces via Logistic Loss Next we prove Lemmas C.6 and C.7. We first consider bounded distributions, and then handle sub-exponential distributions. For simplicity, in the rest of this subsection we will use R and b R to denote Rlog and b Rlog, respectively. Bounded distributions. First, here are some standard optimization and generalization results for projected gradient descent. Lemma C.8. If xi B for all 1 i n, then b R is B2 4 -smooth. Moreover, if w0 := 0 and η 4 B2 , then for all t 1, b R(wt) min w B(1/ ϵ) b R(w) + 1 2ηϵt. Proof. Note that ℓlog is 1 4-smooth. To show b R is B2 4 -smooth, note that given any w, w Rd, b R(w) b R(w ) = ℓ log yi w, xi ℓ log yi w , xi yixi ℓ log yi w, xi ℓ log yi w , xi B yi w, xi yi w , xi i=1 w w B = B2 The following analysis basically comes from the proof of (Bubeck, 2014, Theorem 6.3); we include it for completeness, and also handle the last iterate. Let w := arg minw B(1/ ϵ) b R(w). Convexity gives b R(wt) b R(w ) D b R(wt), wt w E = D b R(wt), wt wt+1 E + D b R(wt), wt+1 w E . Smoothness implies D b R(wt), wt wt+1 E b R(wt) b R(wt+1) + B2/4 2 wt wt+1 2 b R(wt) b R(wt+1) + 1 2η wt wt+1 2. On the other hand, the projection step ensures D b R(wt), wt+1 w E 1 η wt wt+1, wt+1 w wt w 2 wt+1 w 2 wt wt+1 2 . b R(wt) b R(w ) b R(wt) b R(wt+1) + 1 wt w 2 wt+1 w 2 , which implies b R(wt+1) b R(w ) 1 wt w 2 wt+1 w 2 . (16) Agnostic Learnability of Halfspaces via Logistic Loss Next we show that b R(wt+1) b R(wt). Smoothness implies b R(wt+1) b R(wt) D b R(wt), wt+1 wt E + B2/4 2 wt+1 wt 2 η wt wt+1 2 + B2/4 2 wt+1 wt 2 η wt wt+1 2 + 1 2η wt+1 wt 2 2η wt+1 wt 2, where we also use the property of the projection step on the second line. It now follow from Equation (16) and b R(wt+1) b R(wt) that for t 1, b R(wt) b R(w ) + w0 w 2 2ηt b R(w ) + 1 2ηϵt. Lemma C.9. If x B almost surely, then with probability 1 δ, for all w B 1 ϵ , R(w) b R(w) 2B ϵn + 3 B ϵ + 1 r Proof. Note that ℓlog(z) |z| + 1, therefore ℓlog y w, x w x + 1 B ϵ + 1. Since ℓlog is 1-Lipschitz continuous, (Shalev-Shwartz & Ben-David, 2014, Theorem 26.5, Lemma 26.9, Lemma 26.10) imply that with probability 1 δ, for all w B 1 ϵ , R(w) b R(w) 2B ϵn + 3 B ϵ + 1 r Next we can just apply the same technique and get a uniform deviation bound on b R(w) R(w). We can now prove Lemma C.6. Proof of Lemma C.6 for bounded distributions. Lemma C.8 implies that b R(wt) min 0 ρ 1/ ϵ R(ρ u) 1 2ηϵt = B2 Moreover, Lemma C.9 ensures with probability 1 δ, for all w B2 1 ϵ , b R(w) R(w) 2B ϵn + 3 B ϵ + 1 r Therefore, to ensure R(wt) min0 ρ 1/ ϵ R(ρ u) ϵℓ, we only need steps, and O (B + 1)2 ln(1/δ) Agnostic Learnability of Halfspaces via Logistic Loss Next we prove the norm lower bound on wt . Proof of Lemma C.7. First, we consider the case x B almost surely. It follows from Lemma C.2 that ρ + ρB OPT. (17) 12U B OPT. We consider two cases below, ρ 1 ϵ or ρ 1 ϵ. First, we assume ρ 1 ϵ. 
Then by the conditions of Lemma C.7 and Equation (17), we have R(wt) R( ρ u) + ϵℓ 2 12UB OPT + ϵ It then follows from Lemma C.3 that R wt 2, and R U wt R(wt) R( ρ u) + ϵℓ 2 12UB OPT + ϵ. since ρ 1 ϵ, As a result, R U wt = O OPT , which implies wt = Ω 1 Next, assume ρ 1 ϵ, which implies that 12U ϵ, and B OPT 12Uϵ. Moreover, Equation (17) implies R 1 ϵ u 12U ϵ + 1 ϵB OPT 12U ϵ + 1 ϵ12Uϵ = 24U ϵ. Then because R(wt) R 1 ϵ u + ϵℓ 24U ϵ + ϵ < 24U it further follows from Lemma C.3 that R wt 2, and R U wt R 1 ϵ u + ϵℓ 24U ϵ + ϵ, therefore wt = Ω 1 ϵ . Now assume Px is (α1, α2)-sub-exponential. Lemma C.2 implies ρ + (1 + 2α1)α2ρ OPT ln 1 OPT 12U (1 + 2α1)α2 OPT ln(1/OPT), and similarly consider the two cases ρ 1 ϵ and ρ 1 ϵ, we can finish the proof. Agnostic Learnability of Halfspaces via Logistic Loss Now we are ready to prove Theorem 3.6. Proof of Theorem 3.6 for bounded distributions. First, note that if ϵ or OPT does not satisfy the conditions of Lemma C.7, then Theorem 3.6 holds vacuously. Under the conditions of Lemmas C.6 and C.7, let ϵℓ:= ϵ3/2, we have that projected gradient descent can find wt satisfying Rlog(wt) min 0 ρ 1/ ϵ Rlog(ρ u) + ϵ3/2 Rlog wt u + ϵ3/2, wt = Ω min 1 ϵ, 1 Now we just need to invoke Lemma C.4. If ϵ OPT, then wt = Ω 1 , and Lemma C.4 implies OPT, Cκ OPT = O (1 + Cκ)OPT . If ϵ OPT, then wt = Ω 1 ϵ , and similarly we can show = O max OPT, q ϵ3/2 ϵ, Cκϵ = O (1 + Cκ)(OPT + ϵ) . The sample and iteration complexity follow from Lemma C.6 and that ϵℓ= ϵ3/2. Sub-exponential distributions. Next we handle (α1, α2)-sub-exponential distributions. We will prove Lemma C.6 for sub-exponential distributions; the rest of the proof is similar to the bounded case and thus omitted. Let the target zero-one error ϵ, the target optimization error ϵℓ, and failure probability δ be given. Given r > 0, we overload the notation a little bit and let δ(r) := dα1 exp In particular, note that j=1 Pr |xj| r Let B > 1 be large enough such that 1 δ(B) 100(B+1)2 ln(4/δ)/(ϵϵ2 ℓ) 1 δ, and α1(α2 + B) exp B We have the following bound on B. Lemma C.10. To satisfy Equation (18), it is enough to let Agnostic Learnability of Halfspaces via Logistic Loss Proof. First, we let B α2 d ln(2dα1) to ensure δ(B) 1/2. Since for 0 z 1/2, we have e z 1 z e 2z, to satisfy the first condition of Equation (18), it is enough to ensure e δ(B) 200(B+1)2 ln(4/δ)/(ϵϵ2 ℓ) e δ, equivalently δ(B) δϵϵ2 ℓ 200(B + 1)2 ln(4/δ). Invoking the definition of δ(B), we only need 200(B + 1)2dα1 ln(4/δ) In other words, it is enough if B = Ω d ln d ϵϵℓδ . Similarly, to satisfy the second condition of Equation (18), we only need B α2 ln α1(α2 + B) and it is enough if B = Ω d ln d ϵϵℓδ . Now we define a truncated logistic loss ℓ log as following: ℓ log(z) := ℓlog(z) if z B ϵ. We also let R (w) and b R (w) denote the population and empirical risk with the truncated logistic loss. We have the next result. Lemma C.11. Suppose B > 1 is chosen according to Equation (18). Using a constant step size 4/B2, and 100(B + 1)2 ln(4/δ) ϵϵ2 ℓ samples, and B2 4ϵϵℓ steps, with probability 1 2δ, projected gradient descent can ensure R (wt) min 0 ρ 1/ ϵ R(ρ u) + ϵℓ. Proof. It follows from Equation (18) that with probability 1 δ, it holds that xi B for all training examples. 
Therefore Lemma C.8 implies that b R(wt) min 0 ρ 1/ ϵ b R(ρ u) + B2 Since xi B, and the domain is B(1/ ϵ), it follows that b R (wt) min 0 ρ 1/ ϵ b R (ρ u) + B2 Letting t = B2 4ϵϵℓ, we get b R (wt) min 0 ρ 1/ ϵ b R (ρ u) + ϵℓ Agnostic Learnability of Halfspaces via Logistic Loss Note that by the construction of the truncated logistic loss, it holds that ℓ log(z) B ϵ + 1. Then by invoking the standard Rademacher complexity results (Shalev-Shwartz & Ben-David, 2014, Theorem 26.5, Lemma 26.9, Lemma 26.10), and recall that we work under the event xi B for all training examples, we can show with probability 1 2δ that for all w B(1/ ϵ), R (w) b R (w) 2B ϵn + 3 B ϵ + 1 r n + 3(B + 1) ϵ Letting n = 100(B+1)2 ln(4/δ) ϵϵ2 ℓ , we have R (w) b R (w) ϵℓ It then follows from Equations (19) and (20) that with probability 1 2δ, R (wt) min 0 ρ 1/ ϵ R (ρ u) + ϵℓ min 0 ρ 1/ ϵ R(ρ u) + ϵℓ, where we use ℓ log ℓlog in the last inequality. Finally, we show that R (wt) is close to R(wt). Lemma C.12. For all w B(1/ ϵ), it holds that R (w) R(w) ϵℓ. Proof. Note that if ℓlog y w, x = ℓ log y w, x , then y w, x B/ ϵ, which implies w, x B/ ϵ. Moreover, in this case ℓlog y w, x ℓ log y w, x ℓlog y w, x ℓlog(0) w, x . R(w) R (w) = Ex Px h ℓlog y w, x ℓ log y w, x i Ex Px w, x 1| w,x | B/ ϵ We can then invoke Equation (12) and get R(w) R (w) α1 exp B α2 w ϵ Note that the right hand side of Equation (21) is increasing with w , therefore we can let w be 1/ ϵ and get R(w) R (w) α1 α2 + B ϵ exp B where we use Equation (18) in the last inequality. Now putting everything together, under the conditions of Lemma C.11, with probability 1 2δ, projected gradient descent ensures R(wt) min0 ρ 1/ ϵ R(ρ u) + 2ϵℓ. Moreover, by applying Lemma C.10 to Lemma C.11, we can see the sample complexity is e O d ln(1/δ)3/(ϵϵ2 ℓ) , and the iteration complexity is e O d ln(1/δ)2/(ϵϵℓ) . Agnostic Learnability of Halfspaces via Logistic Loss C.2. Omitted proofs from Section 3.2 In this section, we prove Lemma C.4. We first prove the following approximation bound after we replace the true label with the label given by the ground-truth solution, which covers Lemma 3.11 and sub-exponential distributions. Lemma C.13 (Lemma 3.11, including the sub-exponential case). For ℓ {ℓlog, ℓh}, if x B almost surely, |term (3)| B w ˆw OPT. If Px is (α1, α2)-sub-exponential, then |term (3)| (1 + 2α1)α2 w ˆw OPT ln(1/OPT). Proof. Note that for both the logistic loss and the hinge loss, it holds that ℓ( z) ℓ(z) = z, therefore term (3) = E(x,y) P 1y =sign( w,x ) y w ˆw, x , (22) It then follows from the triangle inequality that |term (3)| E(x,y) P 1y =sign( w,x ) w ˆw, x Now we can invoke Lemma A.2 with w = w and w = w ˆw to prove Lemma C.13. Next we prove the lower bound on term (4). Proof of Lemma 3.12. Note that in term (4), we only care about ˆw, x and w, x , therefore we can focus on the twodimensional space spanned by w and ˆw. Let ϕ denote the angle between w and ˆw. Without loss of generality, we can consider the following graph, where we put w at angle 0, and ˆw at angle ϕ. We divide the graph into four parts given by different polar angles: (i) ( π 2 + ϕ), (ii) ( π 2 ), (iii) ( π 2 + ϕ), and (iv) ( π 2 ). Note that term (4) is 0 on parts (ii) and (iv), therefore we only need to consider parts (i) and (iii): term (4) = E(i) and (iii) ℓ sign w, x ˆw, x ℓ sign ˆw, x ˆw, x = E(i) and (iii) h sign w, x ˆw, x i . Here we use the fact that ℓ( z) ℓ(z) = z for both the logistic loss and the hinge loss. 
Agnostic Learnability of Halfspaces via Logistic Loss For simplicity, let p denote the density of the projection of Px onto the space spanned by ˆw and w. Under Assumption 3.2, we have term (4) = E(i) and (iii) h sign w, x ˆw, x i 2 r ˆw cos(ϕ θ)p(r, θ)r dθ dr + Z π 2 r ˆw cos(θ ϕ)p(r, θ)r dθ dr 0 r ˆw sin(θ)r dθ dr = 2R3 ˆw 1 cos(ϕ) 3U 4R3 ˆw ϕ2 where we use the fact that 1 cos(ϕ) 2ϕ2 π2 for all ϕ [0, π]. Next, we prove the following upper bound on term (5), covering Lemma 3.13 and the sub-exponential case. Lemma C.14 (Lemma 3.13, including the sub-exponential case). For ℓ= ℓh, term (5) is 0. For ℓ= ℓlog, under Assumption 3.3, if x B almost surely, then |term (5)| 12Cκ ϕ( ˆw, w) where Cκ := R B 0 κ(r) dr, while if Px is (α1, α2)-sub-exponential, then |term (5)| 2α1OPT2 + 12Cκ ϕ( ˆw, w) where Cκ := R 3α2 ln(1/OPT) 0 κ(r) dr. Proof. For the hinge loss, term (5) is 0 simply because ℓh(z) = 0 when z 0. Next we consider the logistic loss. Note that term (5) only depends on ˆw, x and w, x , therefore we can focus on the subspace spanned by ˆw and w. For simplicity, let p denote the density function of the projection of Px onto the space spanned by ˆw and w. Moreover, without loss of generality we can assume w has polar angle 0 while ˆw has polar angle ϕ, where we let ϕ denote ϕ( ˆw, w) for simplicity. It then follows that term (5) = Z 0 ℓlog r ˆw cos(θ ϕ) p(r, θ)r dθ dr Z 0 ℓlog r ˆw cos(θ) p(r, θ)r dθ dr 0 ℓlog r ˆw cos(θ) p(r, θ + ϕ) p(r, θ) r dθ dr. First, if x B almost surely, then |term (5)| Z B 0 ℓlog r ˆw cos(θ) p(r, θ + ϕ) p(r, θ) r dθ dr 0 ℓlog r ˆw cos(θ) κ(r)ϕ r dθ dr 0 ℓlog r ˆw cos(θ) r dθ Then Lemma A.1 implies |term (5)| ϕ Z B 2 ˆw dr = 8 2Cκ ϕ ˆw 12Cκ ϕ ˆw . Agnostic Learnability of Halfspaces via Logistic Loss Next, assume Px is (α1, α2)-sub-exponential. For a 2-dimensional random vector x sampled according to p V , note that Letting B := 2 2α2 ln 1 OPT , we get Pr x B 2α1OPT2. Since ℓlog(z) 1 when z 0, we have term (5) 2α1OPT2 0 ℓlog r ˆw cos(θ ϕ) p(r, θ)r dθ dr Z B 0 ℓlog r ˆw cos(θ) p(r, θ)r dθ dr. Invoking the previous bound for bounded distributions, we get term (5) 2α1OPT2 + 12 ϕ ˆw Z 2 2α2 ln( 1 OPT) 0 κ(r) dr 2α1OPT2 + 12Cκ ϕ ˆw , where Cκ := R 3α2 ln( 1 OPT) 0 κ(r) dr. Similarly, we can show term (5) 2α1OPT2 + 12Cκ ϕ ˆw . Next we prove Lemma 3.14, which is basically (Diakonikolas et al., 2020d, Claim 3.4). Proof of Lemma 3.14. Under Assumption 3.2, we have Pr sign ˆw, x = sign w, x 2ϕ( ˆw, w) Z 0 σ(r)r dr 2Uϕ( ˆw, w). Lastly, we prove Lemma C.4 for sub-exponential distributions. Proof of Lemma C.4, sub-exponential distributions. For simplicity, let ϕ denotes ϕ( ˆw, w). Lemmas 3.12, C.13 and C.14 imply C1 ˆw ϕ2 ϵℓ+ C2 w ˆw OPT ln 1 OPT + C3OPT2 + C4Cκ ϕ ˆw ϵℓ+ C2 ˆw ϕ OPT ln 1 OPT + C3OPT2 + C4Cκ ϕ ˆw , where C1 = 4R3 3Uπ2 , and C2 = (1 + 2α1)α2, and C3 = 2α1, and C4 = 12. It follows that at least one of the following four cases is true: 1. C1 ˆw ϕ2 4ϵℓ, which implies ϕ = O p 2. C1 ˆw ϕ2 4C2 ˆw ϕ OPT ln 1 OPT , which implies ϕ = O OPT ln 1 OPT . 3. C1 ˆw ϕ2 4C3OPT2, which implies ϕ = O(OPT) since ˆw = Ω(1). C1 ˆw ϕ2 4C2Cκ ϕ ˆw , which implies ϕ = O Cκ Finally, we just need to invoke Lemma 3.14 to finish the proof. Agnostic Learnability of Halfspaces via Logistic Loss C.3. Omitted proofs from Section 3.3 We first prove the upper bound of term (5) under Assumption 3.2, without assuming the radially Lipschitz condition. Proof of Lemma 3.15. Note that term (5) E ℓlog sign ˆw, x ˆw, x = E ℓlog ˆw, x 12U where we invoke Lemma C.1 at the end. 
Similarly, we can show term (5) 12U Next we prove a general result similar to Lemma C.4. Theorem C.15. Under Assumption 3.2, suppose ˆw satisfies Rlog( ˆw) Rlog( ˆw u) + ϵℓfor some ϵℓ [0, 1). If x B almost surely, then ϕ( ˆw, u) = O If Px is (α1, α2)-sub-exponential and ˆw = Ω(1), then ϕ( ˆw, u) = O OPT ln 1 OPT Proof. For simplicity, let ϕ denote ϕ( ˆw, u). Consider the case x B almost surely. The condition Rlog( ˆw) Rlog( ˆw u) + ϵℓ, and Lemmas 3.12, 3.15 and C.13 imply C1 ˆw ϕ2 ϵℓ+ B w ˆw OPT + C2 ϵℓ+ B ˆw ϕ OPT + C2 where C1 = 4R3/(3Uπ2) and C2 = 12U. Now at least one of the following three cases is true: 1. C1 ˆw ϕ2 3ϵℓ, which implies ϕ = O p 2. C1 ˆw ϕ2 3B ˆw ϕ OPT, which implies ϕ = O(OPT); 3. C1 ˆw ϕ2 3C1/ ˆw , which implies ϕ = O(1/ ˆw ). The proof of the sub-exponential case is similar. Now we prove Lemma 3.16. Proof of Lemma 3.16. First, if ϵ or OPT does not satisfy the conditions of Lemma C.7, then Lemma 3.16 holds vacuously; therefore in the following we consider the settings of Lemmas C.6 and C.7 with ϵℓ= ϵ. First, if x B almost surely, Equation (15) and Theorem C.15 imply ϕ(wt, u) = O Agnostic Learnability of Halfspaces via Logistic Loss and moreover Lemma C.7 implies wt = Ω min 1 ϵ, 1 If ϵ OPT, then wt = Ω 1 ϕ(wt, u) = O max OPT, q = O max OPT, q ϵ If ϵ OPT, then wt = Ω 1 ϵ , and ϕ(wt, u) = O max OPT, q = O max OPT, q ϵ ϵ, ϵ The proof for the sub-exponential case is similar. D. Omitted proofs from Section 4 In this section, we prove Theorem 4.1. We first prove a bound on Rh( u). Lemma D.1. If x B almost surely, then Rh( u) B OPT, while if Px is (α1, α2)-sub-exponential, then Rh( u) (1 + 2α1)α2 OPT ln(1/OPT). Proof. Note that Rh( u) = E(x,y) P h ℓh y u, x i = E(x,y) P 1sign( u,x =y) u, x . It then follows from Lemma A.2 that if x B almost surely, then Rh( u) B OPT, while if Px is (α1, α2)-sub-exponential, then Rh( u) (1 + 2α1)α2 OPT ln 1 OPT Next we prove the following result, which covers Lemma 4.2 but also handles sub-exponential distributions. Lemma D.2 (Lemma 4.2, including the sub-exponential case). Suppose Assumption 3.2 holds. Consider an arbitrary w D, and let ϕ denote ϕ(w, u). If x B almost surely, then Rh( r u) Rh( w u) + O (OPT + ϵ)2 Agnostic Learnability of Halfspaces via Logistic Loss Rh(w) Rh( w u) 4R3 3Uπ2 w ϕ2 B w ϕ OPT. If Px is (α1, α2)-sub-exponential, then Rh( r u) Rh( w u) + O OPT ln(1/OPT) + ϵ 2 Rh(w) Rh( w u) 4R3 (1 + 2α1)α2 w ϕ OPT ln(1/OPT). Proof. First assume x B almost surely. Note that ℓh is positive homogeneous, and thus for any positive constant c, we have Rh(cw) = c Rh(w). Therefore, if r w , then Rh( r u) = r w Rh( w u) Rh( w u). If r w , then Rh( r u) = Rh w u + Rh( u) r w Rh w u + Rh( u) ( r 1) , since w 1 for all w D. Recall that r := 1 v, u = 1 cos ϕ(v, u) 1 1 ϕ(v, u)2/2, and therefore the first-phase of algorithm ensures r = 1 + O(OPT + ϵ) for bounded distributions, and r = 1 + O OPT ln(1/OPT) + ϵ for sub-exponential distributions. It then follows that for bounded distributions, Rh( r u) Rh wt u + Rh( u) O(OPT + ϵ) Rh wt u + B OPT O(OPT + ϵ) = Rh wt u + O (OPT + ϵ)2 , where we apply Lemma D.1 at the end. It also follows directly from Lemmas 3.12, C.13 and C.14 that Rh(w) Rh( w u) 4R3 3Uπ2 w ϕ2 B w w u OPT 3Uπ2 w ϕ2 B w ϕ OPT. The proof for the sub-exponential case is similar. Next we prove Theorem 4.1. We first consider the bounded case. Proof of Theorem 4.1, bounded distribution. Here we assume x B almost surely. 
We will show that under the conditions of Theorem 4.1, then E min 0 t 0, it holds that E h x 21 x τ i dα1 τ 2 + 2 dα2τ + 2dα2 2 exp Proof. First recall that j=1 Pr |xj| τ Let µ(τ) := Pr x τ . Integration by parts gives E h x 21 x τ i = Z τ r2 ( dµ(r)) = τ 2µ(τ) + Z τ 2rµ(r) dr τ 2δ(τ) + Z τ 2rδ(r) dr. Calculation gives E h x 21 x τ i dα1 τ 2 + 2 dα2τ + 2dα2 2 exp Now we are ready to prove Theorem 4.1 for sub-exponential distributions. Proof of Theorem 4.1, sub-exponential distributions. At step t, we have wt+1 r u 2 wt r u 2 2η D ℓ h yt wt, xt ytxt, wt r u E + η2ℓ h yt wt, xt 2 xt 2 = wt r u 2 2η D ℓ h yt wt, xt ytxt, wt r u E η2ℓ h yt wt, xt xt 2, (29) where we use (ℓ h)2 = ℓ h. Next we bound E(xt,yt) ℓ h yt wt, xt xt 2 . Let τ := dα2 ln(d/ϵ). When xt τ, we have E h ℓ h yt wt, xt xt 21 xt τ i τ 2M(wt) dα2 2M(wt) ln(d/ϵ)2. On the other hand, when xt τ, Lemma D.3 implies E h ℓ h yt wt, xt xt 21 xt τ i E h xt 21 xt τ i dα1 O d ln(d/ϵ)2 ϵ d = O dϵ ln(d/ϵ)2 , where we also use ln(1/ϵ) > 1, since ϵ < 1/e. To sum up, E(xt,yt) ℓ h yt wt, xt xt 2 Cd M(wt) + ϵ ln(d/ϵ)2 for some constant C. Now taking the expectation with respect to (xt, yt) on both sides of Equation (29), we have E h wt+1 r u 2i wt r u 2 2η Rh(wt) Rh( r u) + η2Cd M(wt) + ϵ ln(d/ϵ)2. (30) Similarly to the bounded case, we will show that E min 0 t