Published as a conference paper at ICLR 2024

AN EFFICIENT TESTER-LEARNER FOR HALFSPACES

Aravind Gollakota, Apple (aravindg@cs.utexas.edu)
Adam R. Klivans, UT Austin (klivans@cs.utexas.edu)
Konstantinos Stavropoulos, UT Austin (kstavrop@cs.utexas.edu)
Arsen Vasilyan, MIT (vasilyan@mit.edu)

We give the first efficient algorithm for learning halfspaces in the testable learning model recently defined by Rubinfeld and Vasilyan (RV23). In this model, a learner certifies that the accuracy of its output hypothesis is near optimal whenever the training set passes an associated test, and training sets drawn from some target distribution must pass the test. This model is more challenging than distribution-specific agnostic or Massart noise models, where the learner is allowed to fail arbitrarily if the distributional assumption does not hold. We consider the setting where the target distribution is the standard Gaussian in d dimensions and the label noise is either Massart or adversarial (agnostic). For Massart noise, our tester-learner runs in polynomial time and outputs a hypothesis with (information-theoretically optimal) error opt + ϵ (and extends to any fixed strongly log-concave target distribution). For adversarial noise, our tester-learner obtains error O(opt) + ϵ in polynomial time. Prior work on testable learning ignores the labels in the training set and checks that the empirical moments of the covariates are close to the moments of the base distribution. Here we develop new tests of independent interest that make critical use of the labels and combine them with the moment-matching approach of (GKK23). This enables us to implement a testable variant of the algorithm of (DKTZ20a; DKTZ20b) for learning noisy halfspaces using nonconvex SGD.

1 INTRODUCTION

Learning halfspaces in the presence of noise is one of the most basic and well-studied problems in computational learning theory. A large body of work has obtained results for this problem under a variety of different noise models and distributional assumptions (see e.g. (BH21) for a survey). A major issue with common distributional assumptions such as Gaussianity, however, is that they can be hard or impossible to verify in the absence of any prior information. The recently defined model of testable learning (RV23) addresses this issue by replacing such assumptions with efficiently testable ones. In this model, the learner is required to work with an arbitrary input distribution D_XY and verify any assumptions it needs to succeed. It may choose to reject a given training set, but if it accepts, it is required to output a hypothesis with error close to opt(C, D_XY), the optimal error achievable over D_XY by any function in a concept class C. Further, whenever the training set is drawn from a distribution D_XY whose marginal is truly a well-behaved target distribution D* (such as the standard Gaussian), the algorithm is required to accept with high probability. Such an algorithm, or tester-learner, is then said to testably learn C with respect to target marginal D*. (See Definition 2.1.) Note that unlike ordinary distribution-specific agnostic learners, a tester-learner must take some nontrivial action regardless of the input distribution.

The work of (RV23; GKK23) established foundational algorithmic and statistical results for this model and showed that testable learning is in general provably harder than ordinary distribution-specific agnostic learning.
As one of their main algorithmic results, they showed tester-learners for the class of halfspaces over R^d that succeed whenever the target marginal is Gaussian (or one of a more general class of distributions), achieving error opt + ϵ in time and sample complexity d^{Õ(1/ϵ²)}. This matches the running time of ordinary distribution-specific agnostic learning of halfspaces over the Gaussian using the standard approach of (KKMS08). Their testers are simple and label-oblivious, and are based on checking whether the low-degree empirical moments of the unknown marginal match those of the target D*.

These works essentially resolve the question of designing tester-learners achieving error opt + ϵ for halfspaces, matching known hardness results for (ordinary) agnostic learning (GGK20; DKZ20; DKPZ21). Their running time, however, necessarily scales exponentially in 1/ϵ. A long line of research has sought to obtain more efficient algorithms at the cost of relaxing the optimality guarantee (ABL17; DKS18; DKTZ20a; DKTZ20b). These works give polynomial-time algorithms achieving bounds of the form opt + ϵ and O(opt) + ϵ for the Massart and agnostic setting respectively under structured distributions (see Section 1.1 for more discussion). The main question we consider here is whether such guarantees can be obtained in the testable learning framework.

Our contributions. In this work we design the first tester-learners for halfspaces that run in fully polynomial time in all parameters. We match the optimality guarantees of fully polynomial-time learning algorithms under Gaussian marginals for the Massart noise model (where the labels arise from a halfspace but are flipped by an adversary with probability at most η) as well as for the agnostic model (where the labels can be completely arbitrary). In fact, for the Massart setting our guarantee holds with respect to any chosen target marginal D* that is isotropic and strongly log-concave, and the same is true of the agnostic setting albeit with a slightly weaker guarantee.

Theorem 1.1 (Formally stated as Theorem 4.1). Let C be the class of origin-centered halfspaces over R^d, and let D* be any isotropic strongly log-concave distribution. In the setting where the labels are corrupted with Massart noise at rate at most η < 1/2, C can be testably learned w.r.t. D* up to error opt + ϵ using poly(d, 1/ϵ, 1/(1 − 2η)) time and sample complexity.

Theorem 1.2 (Formally stated as Theorem 5.1). Let C be as above. In the adversarial noise or agnostic setting where the labels are completely arbitrary, C can be testably learned w.r.t. N(0, I_d) up to error O(opt) + ϵ using poly(d, 1/ϵ) time and sample complexity.

Our techniques. The tester-learners we develop are significantly more involved than prior work on testable learning. We build on the nonconvex optimization approach to learning noisy halfspaces due to (DKTZ20a; DKTZ20b) as well as the structural results on fooling functions of halfspaces using moment matching due to (GKK23). Unlike the label-oblivious, global moment tests of (RV23; GKK23), our tests make crucial use of the labels and check local properties of the distribution in regions described by certain candidate vectors. These candidates are approximate stationary points of a natural nonconvex surrogate of the 0-1 loss, obtained by running gradient descent.
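For intuition, the surrogate in question is a smoothed ramp; its exact construction is given in Proposition C.1 as a piecewise polynomial of degree 3. The following Python sketch is purely illustrative and is not the implementation analyzed in the paper: the cubic coefficients are obtained by solving the four boundary constraints stated in Proposition C.1, and the surrogate form L_σ(w) = E[ℓ_σ(−y⟨w, x⟩)] for unit-norm w follows the convention of (DKTZ20a).

```python
import numpy as np

def ramp_coeffs(sigma):
    """Coefficients of the cubic piece l_plus(t) = a1*t^3 + a2*t^2 + a3*t + a4 that
    solve the four constraints of Proposition C.1:
      l_plus(sigma/2) = 1, l_plus'(sigma/2) = 0, l_plus(sigma/6) = 2/3, l_plus'(sigma/6) = 1/sigma."""
    return -9.0 / sigma**3, 15.0 / (2 * sigma**2), -3.0 / (4 * sigma), 5.0 / 8.0

def smooth_ramp(t, sigma):
    """Smooth ramp l_sigma: 0 for t <= -sigma/2, 1 for t >= sigma/2,
    slope 1/sigma on [-sigma/6, sigma/6], cubic splices in between."""
    a1, a2, a3, a4 = ramp_coeffs(sigma)
    t = np.asarray(t, dtype=float)
    out = np.empty_like(t)
    out[t <= -sigma / 2] = 0.0
    out[t >= sigma / 2] = 1.0
    mid = np.abs(t) <= sigma / 6
    out[mid] = 0.5 + t[mid] / sigma
    plus = (t > sigma / 6) & (t < sigma / 2)
    out[plus] = a1 * t[plus]**3 + a2 * t[plus]**2 + a3 * t[plus] + a4
    minus = (t < -sigma / 6) & (t > -sigma / 2)
    # the left piece is defined by symmetry: l_minus(t) = 1 - l_plus(-t)
    out[minus] = 1.0 - (a1 * (-t[minus])**3 + a2 * (-t[minus])**2 + a3 * (-t[minus]) + a4)
    return out

def surrogate_loss(w, X, y, sigma):
    """Empirical surrogate loss: the average of l_sigma(-y <w, x>) over the sample."""
    w = w / np.linalg.norm(w)
    return smooth_ramp(-y * (X @ w), sigma).mean()
```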
When the distribution is known to be well-behaved, (DKTZ20a; DKTZ20b) showed that any such stationary point is in fact a good solution (for technical reasons we must use a slightly different surrogate loss). Their proof relies crucially on structural geometric properties that hold for these well-behaved distributions, an important one being that the probability mass of any region close to the origin is proportional to its geometric measure. In the testable learning setting, we must efficiently check this property for candidate solutions. Since these regions may be described as intersections of halfspaces, we may hope to apply the moment-matching framework of (GKK23). Naïvely, however, they only allow us to check in polynomial time that the probability masses of such regions are within an additive constant of what they should be under the target marginal. But we can view these regions as sub-regions of a known band described by our candidate vector. By running moment tests on the distribution conditioned on this band and exploiting the full strength of the moment-matching framework, we are able to effectively convert our weak additive approximations to good multiplicative ones. This allows us to argue that our stationary points are indeed good solutions.

Independent and Subsequent Works. In this paper we provide the first efficient tester-learners for halfspaces when the noise is either adversarial or Massart. In independent and concurrent work by (DKK+23), an efficient tester-learner for homogeneous halfspaces achieving error O(opt) + ϵ for Gaussian target marginals is also provided, but they do not provide any results for arbitrary strongly log-concave target marginals (see Theorem 5.1) or a guarantee of opt + ϵ for Massart noise. In subsequent work by (GKSV23), our techniques were used to provide tester-learners that are not tailored to a single target distribution, but are guaranteed to accept any member of a large family of distributions. Although their main results are more general, their approach crucially extends our approach here. Moreover, on the technical side, the proof we give here shows how to make use of the moment-matching approach of (GKK23) to provide fully polynomial-time efficient tester-learners, which might be of independent interest.

1.1 RELATED WORK

We provide a partial summary of some of the most relevant prior and related work on efficient algorithms for learning halfspaces in the presence of adversarial label or Massart noise, and refer the reader to (BH21) for a survey.

In the distribution-specific agnostic setting where the marginal is assumed to be isotropic and log-concave, (KLS09) showed an algorithm achieving error O(opt^{1/3}) + ϵ for the class of origin-centered halfspaces. (ABL17) later obtained O(opt) + ϵ using an approach that introduced the principle of iterative localization, where the learner focuses attention on a band around a candidate halfspace in order to produce an improved candidate. (Dan15) used this principle to obtain a PTAS for agnostically learning halfspaces under the uniform distribution on the sphere, and (BZ17) extended it to more general s-concave distributions. Further works in this line include (YZ17; Zha18; ZSA20; ZL21). (DKTZ20b) introduced the simplest approach yet, based entirely on nonconvex SGD, and showed that it achieves O(opt) + ϵ for origin-centered halfspaces over a wide class of structured distributions. Other related works include (DKS18; DKTZ22).
In the Massart noise setting with noise rate bounded by η, work of (DGT19) gave the first efficient distribution-free algorithm achieving error η + ϵ; further improvements and followups include (DKT21; DTK22). However, the optimal error opt achievable by a halfspace may be much smaller than η, and it has been shown that there are distributions where achieving error competitive with opt as opposed to η is computationally hard (DK22; DKMR22). As a result, the distribution-specific setting remains well-motivated for Massart noise. Early distribution-specific algorithms were given by (ABHU15; ABHZ16), but a key breakthrough was the nonconvex SGD approach introduced by (DKTZ20a), which achieved error opt + ϵ for origin-centered halfspaces efficiently over a wide range of distributions. This was later generalized by (DKK+22).

1.2 TECHNICAL OVERVIEW

Our starting point is the nonconvex optimization approach to learning noisy halfspaces due to (DKTZ20a; DKTZ20b). The algorithms in these works consist of running SGD on a natural nonconvex surrogate Lσ for the 0-1 loss, namely a smooth version of the ramp loss. The key structural property shown is that if the marginal distribution is structured (e.g. log-concave) and the slope of the ramp is picked appropriately, then any w that has large angle with an optimal w* cannot be an approximate stationary point of the surrogate loss Lσ, i.e. that ‖∇Lσ(w)‖ must be large. This is proven by carefully analyzing the contributions to the gradient norm from certain critical regions of span(w, w*), and crucially using the distributional assumption that the probability masses of these regions are proportional to their geometric measures. (See Fig. 3.)

In the testable learning setting, the main challenge we face in adapting this approach is checking such a property for the unknown distribution we have access to. A preliminary observation is that the critical regions of span(w, w*) that we need to analyze are rectangles, and are hence functions of a small number of halfspaces. Encouragingly, one of the key structural results of the prior work of (GKK23) pertains to fooling such functions. Concretely, they show that whenever the true marginal D_X matches moments of degree at most Õ(1/τ²) with a target D* that satisfies suitable concentration and anticoncentration properties, then |E_{D_X}[f] − E_{D*}[f]| ≤ τ for any f that is a function of a small number of halfspaces. If we could run such a test and ensure that the probabilities of the critical regions over our empirical marginal are also related to their areas, then we would have a similar stationary point property.

However, the difficulty is that since we wish to run in fully polynomial time, we can only hope to fool such functions up to τ that is a constant. Unfortunately, this is not sufficient to analyze the probability masses of the critical regions we care about, as they may be very small. The chief insight that lets us get around this issue is that each critical region R is in fact of a very specific form, namely a rectangle that is axis-aligned with w: R = {x : ⟨w, x⟩ ∈ [−σ, σ] and ⟨v, x⟩ ∈ [α, β]} for some values α, β, σ and some v orthogonal to w. Moreover, we know w, meaning we can efficiently estimate the probability P_{D_X}[⟨w, x⟩ ∈ [−σ, σ]] up to constant multiplicative factors without needing moment tests.
Denoting the band {x : ⟨w, x⟩ ∈ [−σ, σ]} by T and writing P_{D_X}[R] = P_{D_X}[⟨v, x⟩ ∈ [α, β] | x ∈ T] · P_{D_X}[T], it turns out that we should expect P_{D_X}[⟨v, x⟩ ∈ [α, β] | x ∈ T] = Θ(1), as this is what would occur under the structured target distribution D*. (Such a localization property is also at the heart of the algorithms for approximately learning halfspaces of, e.g., (ABL17; Dan15).) To check this, it suffices to run tests that ensure that P_{D_X}[⟨v, x⟩ ∈ [α, β] | x ∈ T] is within an additive constant of this probability under D*.

We can now describe the core of our algorithm (omitting some details such as the selection of the slope of the ramp). First, we run SGD on the surrogate loss Lσ to arrive at an approximate stationary point and candidate vector w (technically a list of such candidates). Then, we define the band T based on w, and run tests on the empirical distribution conditioned on T. Specifically, we check that the low-degree empirical moments conditioned on T match those of D* conditioned on T, and then apply the structural result of (GKK23) to ensure conditional probabilities of the form P_{D_X}[⟨v, x⟩ ∈ [α, β] | x ∈ T] match P_{D*}[⟨v, x⟩ ∈ [α, β] | x ∈ T] up to a suitable additive constant. This suffices to ensure that even over our empirical marginal, the particular stationary point w we have is indeed close in angular distance to an optimal w*.

A final hurdle that remains, often taken for granted under structured distributions, is that closeness in angular distance ∠(w, w*) does not immediately translate to closeness in terms of agreement, P[sign(⟨w, x⟩) = sign(⟨w*, x⟩)], over our unknown marginal. Nevertheless, we show that when the target distribution is Gaussian, we can run polynomial-time tests that ensure that an angle of θ = ∠(w, w*) translates to disagreement of at most O(θ). When the target distribution is a general strongly log-concave distribution, we show a slightly weaker relationship: for any k ∈ N, we can run tests requiring time d^{Õ(k)} that ensure that an angle of θ translates to disagreement of at most O(√k · θ^{1−1/k}).

In the Massart noise setting, we can make ∠(w, w*) arbitrarily small, and so obtain our opt + ϵ guarantee for any target strongly log-concave distribution in polynomial time. In the adversarial noise setting, we face a more delicate tradeoff and can only make ∠(w, w*) as small as Θ(opt). When the target distribution is Gaussian, this is enough to obtain final error O(opt) + ϵ in polynomial time. When the target distribution is a general strongly log-concave distribution, we instead obtain Õ(opt) + ϵ in quasipolynomial time.

2 PRELIMINARIES

Notation and setup. Throughout, the domain will be X = R^d, and labels will lie in Y = {±1}. The unknown joint distribution over X × Y that we have access to will be denoted by D_XY, and its marginal on X will be denoted by D_X. The target marginal on X will be denoted by D*. We use the following convention for monomials: for a multi-index α = (α1, . . . , αd) ∈ Z^d_{≥0}, x^α denotes ∏_i x_i^{αi}, and |α| = Σ_i αi denotes its total degree. We use C to denote a concept class mapping R^d to {±1}, which throughout this paper will be the class of halfspaces or functions of halfspaces over R^d. We use opt(C, D_XY) to denote the optimal error inf_{f∈C} P_{(x,y)∼D_XY}[f(x) ≠ y], or just opt when C and D_XY are clear from context.

We recall the definitions of the noise models we consider. In the Massart noise model, the labels satisfy P_{y∼D_XY|x}[y ≠ sign(⟨w*, x⟩) | x] = η(x), where η(x) ≤ η < 1/2 for all x.
In the adversarial label noise or agnostic model, the labels may be completely arbitrary. In both cases, the learner's goal is to produce a hypothesis with error competitive with opt.

We now formally define testable learning. The following definition is an equivalent reframing of the original definition (RV23, Def 4), folding the (label-aware) tester and learner into a single tester-learner.

Definition 2.1 (Testable learning, (RV23)). Let C be a concept class mapping R^d to {±1}. Let D* be a certain target marginal on R^d. Let ϵ, δ > 0 be parameters, and let ψ : [0, 1] → [0, 1] be some function. We say C can be testably learned w.r.t. D* up to error ψ(opt) + ϵ with failure probability δ if there exists a tester-learner A meeting the following specification. For any distribution D_XY on R^d × {±1}, A takes in a large sample S drawn from D_XY, and either rejects S or accepts and produces a hypothesis h : R^d → {±1}. Further, the following conditions must be met:

(a) (Soundness.) Whenever A accepts and produces a hypothesis h, with probability at least 1 − δ (over the randomness of S and A), h must satisfy P_{(x,y)∼D_XY}[h(x) ≠ y] ≤ ψ(opt(C, D_XY)) + ϵ.

(b) (Completeness.) Whenever D_XY truly has marginal D*, A must accept with probability at least 1 − δ (over the randomness of S and A).

3 TESTING PROPERTIES OF STRONGLY LOG-CONCAVE DISTRIBUTIONS

In this section we define the testers that we will need for our algorithm. All the proofs from this section can be found in Appendix B. We begin with a structural lemma that strengthens the key structural result of (GKK23), stated here as Proposition A.3. It states that even when we restrict an isotropic strongly log-concave D* to a band around the origin, moment matching suffices to fool functions of halfspaces whose weights are orthogonal to the normal of the band.

Proposition 3.1. Let D* be an isotropic strongly log-concave distribution. Let w ∈ S^{d−1} be any fixed direction. Let p be a constant. Let f : R^d → R be a function of p halfspaces of the form in Eq. (A.2), with the additional restriction that its weights v_i ∈ S^{d−1} satisfy ⟨v_i, w⟩ = 0 for all i. For some σ ∈ [0, 1], let T denote the band {x : |⟨w, x⟩| ≤ σ}. Let D be any distribution such that D|_T matches moments of degree at most k = Õ(1/τ²) with D*|_T up to an additive slack of d^{−Õ(k)}. Then |E_{D*}[f | T] − E_D[f | T]| ≤ τ.

We now describe some of the testers that we use. First, we need a tester that ensures that the distribution is concentrated in every single direction. More formally, the tester checks that the moments of the distribution along any direction are small.

Proposition 3.2. For any isotropic strongly log-concave D*, there exist a constant C1 and a tester T1 that takes a set S ⊆ R^d × {±1}, an even k ∈ N, and a parameter δ ∈ (0, 1), and runs in time poly(d^k, |S|, log(1/δ)). Let D̄ denote the uniform distribution over S. If T1 accepts, then for any v ∈ S^{d−1},

E_{(x,y)∼D̄}[⟨v, x⟩^k] ≤ (C1·k)^{k/2}.    (3.1)

Moreover, if S is obtained by taking at least (d^k · log^k(1/δ))^{C1} i.i.d. samples from a distribution whose R^d-marginal is D*, the test T1 passes with probability at least 1 − δ.

Secondly, we will use a tester that makes sure the distribution is not concentrated too close to a specific hyperplane. This is one of the properties we will need in order to employ the localization technique of (ABL17).

Proposition 3.3.
For any isotropic strongly log-concave D*, there exist constants C2, C3 and a tester T2 that takes a set S ⊆ R^d × {±1}, a vector w ∈ S^{d−1}, and parameters σ, δ ∈ (0, 1), and runs in time poly(d, |S|, log(1/δ)). Let D̄ denote the uniform distribution over S. If T2 accepts, then

P_{(x,y)∼D̄}[|⟨w, x⟩| ≤ σ] ∈ (C2·σ, C3·σ).    (3.2)

Moreover, if S is obtained by taking at least (100/(K1·σ²)) · log(1/δ) i.i.d. samples from a distribution whose R^d-marginal is D*, the test T2 passes with probability at least 1 − δ.

Finally, in order to use the localization idea of (ABL17) in a manner similar to (DKTZ20b), we need to make sure that the distribution is well-behaved also within a band around a certain hyperplane. The main property of the distribution that we establish is that functions of constantly many halfspaces have expectations very close to what they would be under our distributional assumption. As we show later in this work, having the aforementioned property allows us to derive many other properties that strongly log-concave distributions have, including many of the key properties that make the localization technique successful.

Proposition 3.4. For any isotropic strongly log-concave D* and any constant C4, there exist a constant C5 and a tester T3 that takes a set S ⊆ R^d × {±1}, a vector w ∈ S^{d−1}, and parameters σ, τ, δ ∈ (0, 1), and runs in time poly(d^{Õ(1/τ²)}, 1/σ, |S|, log(1/δ)). Let D̄ denote the uniform distribution over S, let T denote the band {x : |⟨w, x⟩| ≤ σ}, and let F_w denote the set of {±1}-valued functions of C4 halfspaces whose weight vectors are orthogonal to w. If T3 accepts, then for every f ∈ F_w,

|E_{x∼D*}[f(x) | x ∈ T] − E_{(x,y)∼D̄}[f(x) | x ∈ T]| ≤ τ,    (3.3)

max_{v∈S^{d−1}: ⟨v,w⟩=0} |E_{x∼D*}[⟨v, x⟩² | x ∈ T] − E_{(x,y)∼D̄}[⟨v, x⟩² | x ∈ T]| ≤ τ.    (3.4)

Moreover, if S is obtained by taking at least (1/(τσ)) · d^{Õ(1/τ²)} · log^{C5}(1/δ) i.i.d. samples from a distribution whose R^d-marginal is D*, the test T3 passes with probability at least 1 − δ.

4 TESTABLY LEARNING HALFSPACES WITH MASSART NOISE

In this section we prove that we can testably learn halfspaces with Massart noise with respect to isotropic strongly log-concave distributions (see Definition A.1).

Theorem 4.1 (Tester-Learner for Halfspaces with Massart Noise). Let D_XY be a distribution over R^d × {±1} and let D* be an isotropic strongly log-concave distribution over R^d. Let C be the class of origin-centered halfspaces in R^d. Then, for any η < 1/2, ϵ > 0 and δ ∈ (0, 1), there exists an algorithm (Algorithm 1) that testably learns C w.r.t. D* up to excess error ϵ and error probability at most δ in the Massart noise model with rate at most η, using time and a number of samples from D_XY that are polynomial in d, 1/ϵ, 1/(1 − 2η) and log(1/δ).

Algorithm 1: Tester-learner for halfspaces
Input: Training sets S1, S2, parameters σ, δ, α
Output: A near-optimal weight vector w, or rejection
  Run PSGD on the empirical loss Lσ over S1 to get a list L of candidate vectors.
  Test whether L contains an α-approximate stationary point w of the empirical loss Lσ over S2. Reject if no such w exists.
  For each candidate w′ in {w, −w}:
    Let B_{w′}(σ) denote the band {x : |⟨w′, x⟩| ≤ σ}. Let F_{w′} denote the class of functions of at most two halfspaces with weights orthogonal to w′. Let δ′ = Θ(δ).
    Run T1(S2, k = 2, δ′) to verify that the empirical marginal is approximately isotropic. Reject if T1 rejects.
    Run T2(S2, w′, σ, δ′) to verify that the empirical mass of B_{w′}(σ) is Θ(σ). Reject if T2 rejects.
    Run T3(S2, w′, σ/6, τ, δ′) and T3(S2, w′, σ/2, τ, δ′) for a suitable constant τ to verify that the empirical distribution conditioned on B_{w′}(σ/6) and on B_{w′}(σ/2) fools F_{w′} up to τ. Reject if T3 rejects.
    Estimate the empirical error of w′ on S2.
  If all tests have accepted, output the w′ ∈ {w, −w} with the best empirical error.

To show our result, we revisit the approach of (DKTZ20a) for learning halfspaces with Massart noise under well-behaved distributions. Their result is based on the idea of minimizing a surrogate loss that is nonconvex, but whose stationary points correspond to halfspaces with low error. They also require that their surrogate loss is sufficiently smooth, so that one can find a stationary point efficiently. While the distributional assumptions that are used to demonstrate that stationary points of the surrogate loss can be discovered efficiently are mild, the main technical lemma, which demonstrates that any stationary point suffices, requires assumptions that are not necessarily testable. We establish a label-dependent approach for testing, making use of tests that are applied during the course of our algorithm.

We consider a slightly different surrogate loss than the one used in (DKTZ20a). In particular, for σ > 0, we let

Lσ(w) = E_{(x,y)∼D_XY}[ℓσ(−y⟨w, x⟩/‖w‖₂)],    (4.1)

where ℓσ : R → [0, 1] is a smooth approximation to the ramp function with the properties described in Proposition C.1 (see Appendix C), obtained using a piecewise polynomial of degree 3. Unlike the standard logistic function, our loss function has derivative exactly 0 away from the origin (for |t| > σ/2). This makes the analysis of the gradient of Lσ easier, since the contribution from points lying outside a certain band is exactly 0.

The smoothness allows us to run PSGD to obtain stationary points efficiently, and we now state the convergence lemma we need.

Proposition 4.2 (PSGD Convergence, Lemmas 4.2 and B.2 in (DKTZ20a)). Let Lσ be as in Equation (4.1) with σ ∈ (0, 1], ℓσ as described in Proposition C.1, and D_XY such that the marginal D_X on R^d satisfies Property (3.1) for k = 2. Then, for any ϵ′ > 0 and δ′ ∈ (0, 1), there is an algorithm whose time and sample complexity is O(d/σ⁴ + log(1/δ′)/(ϵ′⁴σ⁴)), which, having access to samples from D_XY, outputs a list L of vectors w ∈ S^{d−1} with |L| = O(d/σ⁴ + log(1/δ′)/(ϵ′⁴σ⁴)) such that there exists w ∈ L with ‖∇_w Lσ(w)‖₂ ≤ ϵ′, with probability at least 1 − δ′. In particular, the algorithm performs Stochastic Gradient Descent on Lσ projected on S^{d−1} (PSGD).

It now suffices to show that, upon performing PSGD on Lσ, for some appropriate choice of σ, we acquire a list of vectors that testably contains a vector which is approximately optimal. We first prove the following lemma, whose distributional assumptions are relaxed compared to the corresponding structural Lemma 3.2 of (DKTZ20a). In particular, instead of requiring the marginal distribution to be "well-behaved", we assume that the quantities of interest (for the purposes of our proof) have expected values under the true marginal distribution that are close, up to multiplicative factors, to their expected values under some well-behaved (in fact, strongly log-concave) distribution.
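As a concrete illustration of Proposition 4.2, the following Python sketch performs projected SGD on the sphere for the empirical surrogate loss. It is not the implementation analyzed in the paper: the step size, iteration count, and initialization are placeholders rather than the constants required by the proposition, and the ramp derivative reuses the cubic coefficients solved from Proposition C.1 in the earlier sketch.

```python
import numpy as np

def ramp_deriv(t, sigma):
    """Derivative of the smooth ramp of Proposition C.1: an even function equal to
    1/sigma on [-sigma/6, sigma/6], 0 outside [-sigma/2, sigma/2], cubic's derivative in between."""
    a1, a2, a3 = -9.0 / sigma**3, 15.0 / (2 * sigma**2), -3.0 / (4 * sigma)
    t = np.abs(t)
    cubic = 3 * a1 * t**2 + 2 * a2 * t + a3
    return np.where(t <= sigma / 6, 1.0 / sigma,
                    np.where(t < sigma / 2, cubic, 0.0))

def psgd_candidates(X, y, sigma, steps=None, lr=None, seed=0):
    """Projected SGD on the unit sphere for the empirical surrogate loss; every
    iterate is kept as a candidate, mirroring the list L of Proposition 4.2.
    `steps` and `lr` are illustrative placeholders, not the paper's constants."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    steps = steps or 10 * d
    lr = lr or sigma / np.sqrt(steps)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    candidates = []
    for _ in range(steps):
        i = rng.integers(n)
        x_i, y_i = X[i], y[i]
        margin = x_i @ w
        # stochastic gradient of l_sigma(-y <w, x> / ||w||) at unit-norm w,
        # using grad_w(<w, x>/||w||) = x - <w, x> w
        g = -ramp_deriv(margin, sigma) * y_i * (x_i - margin * w)
        w = w - lr * g
        w /= np.linalg.norm(w)          # project back onto the unit sphere
        candidates.append(w.copy())
    return candidates
```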
While some of the quantities of interest have values that are miniscule and estimating them up to multiplicative factors could be too costly, it turns out that the source of their vanishing scaling can be completely attributed to factors of the form P[| w, x | σ] (where σ is small), which, due to standard concentration arguments, can be approximated up to multiplicative factors, given w Sd 1 and σ > 0 (see Proposition 3.3). As a result, we may estimate the remaining factors up to sufficiently small additive constants (see Proposition 3.4) to get multiplicative overall closeness to the well behaved baseline. We defer the proof of the following Lemma to Appendix C.1. Lemma 4.3. Let Lσ be as in Equation equation 4.1 with σ (0, 1], ℓσ as described in Proposition C.1, let w Sd 1 and consider DXY such that the marginal DX on Rd satisfies Properties equation 3.2 and equation 3.3 for C4 = 2 and accuracy τ. Let w Sd 1 define an optimum halfspace and let η < 1/2 be an upper bound on the rate of the Massart noise. Then, there are constants c1, c2, c3 > 0 such that if w Lσ(w) 2 < c1(1 2η) and τ c2, then (w, w ) c3 1 2η σ or ( w, w ) c3 1 2η σ Combining Proposition 4.2 and Lemma 4.3, we get that for any choice of the parameter σ (0, 1], by running PSGD on Lσ, we can construct a list of vectors of polynomial size (in all relevant parameters) that testably contains a vector that is close to the optimum weight vector. In order to link the zero-one loss to the angular similarity between a weight vector and the optimum vector, we use the following Proposition (for the proof, see Appendix C.2). Proposition 4.4. Let DXY be a distribution over Rd { 1}, w arg minw Sd 1 PDXY[y = sign( w, x )] and w Sd 1. Then, for any θ (w, w ), θ [0, π/4], if the marginal DX on Rd satisfies Property equation 3.1 for C1 > 0 and some even k N and Property equation 3.2 with σ set to (C1k) k 2(k+1) (tan θ) k k+1 , then, there exists a constant c > 0 such that the following is true. P DXY[y = sign( w, x )] opt + c k1/2 θ1 1 k+1 . Proof of Theorem 4.1. Throughout the proof we consider δ to be a sufficiently small polynomial in all the relevant parameters. Each of the failure events will have probability at least δ and their number will be polynomial in all the relevant parameters, so by the union bound, we may pick δ so that the probability of failure is at most δ. The algorithm we run is Algorithm 1, with appropriate selection of parameters and given samples S1, S2, each of which are sufficiently large sets of independent samples from the true unknown distribution DXY. For some σ (0, 1] to be defined later, we run PSGD on the empirical loss Lσ over S1 as described in Proposition 4.2 with ϵ = c1(1 2η)σ/4, where c1 is given by Lemma 4.3. Published as a conference paper at ICLR 2024 Figure 1: Critical regions in the proofs of main structural lemmas (Lemmas 4.3, 5.2). We analyze the contributions of the regions labeled A1, A2 to the quantities A1, A2 in the proofs. Specifically, the regions A1 (which have height σ/3 so that the value of ℓ σ(xw) for any x in these regions is exactly 1/σ, by Proposition C.1) form a subset of the region G, and their probability mass under DX is (up to a multiplicative factor) a lower bound on the quantity A1 (see Eq equation C.3). Similarly, the region A2 is a subset of the intersection of Gc with the band of height σ, and has probability mass that is (up to a multiplicative factor) an upper bound on the quantity A2 (see Eq equation C.4). 
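To make the localized tests run by Algorithm 1 concrete, here is a hedged Python sketch of simplified versions of T2 and of the degree-2 part of T3, written for a standard Gaussian target. The actual testers of Propositions 3.3 and 3.4 handle any strongly log-concave target, compare all moments up to degree Õ(1/τ²), and use constants that are not reproduced here; both functions below assume w is (or is normalized to) a unit vector.

```python
import numpy as np
from math import erf, sqrt

def band_mass_test(X, w, sigma, slack):
    """In the spirit of tester T2 (Proposition 3.3): accept iff the empirical mass of
    the band {x : |<w, x>| <= sigma} is additively close to its mass under the target.
    For a standard Gaussian target that mass is P[|N(0,1)| <= sigma]; `slack` stands in
    for the Theta(sigma) tolerance of the paper."""
    w = w / np.linalg.norm(w)
    empirical = np.mean(np.abs(X @ w) <= sigma)
    target = erf(sigma / sqrt(2.0))
    return abs(empirical - target) <= slack

def conditional_second_moment_test(X, w, sigma, tau):
    """A degree-2 check in the spirit of Eq. (3.4) of tester T3: inside the band, every
    direction v orthogonal to w should have conditional second moment within tau of its
    target value (which equals 1 for the standard Gaussian conditioned on the band)."""
    w = w / np.linalg.norm(w)
    in_band = X[np.abs(X @ w) <= sigma]
    if len(in_band) == 0:
        return False
    P = np.eye(X.shape[1]) - np.outer(w, w)        # projector onto w's orthogonal complement
    Y = in_band @ P
    cov = Y.T @ Y / len(in_band)
    eigs = np.sort(np.linalg.eigvalsh(cov))[1:]    # drop the zero eigenvalue along w
    return bool(np.all(np.abs(eigs - 1.0) <= tau))
```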
By Proposition 4.2, we get a list L of vectors w Sd 1 with |L| = poly(d, 1/σ) such that there is w L with w Lσ(w) 2 < 1 2c1(1 2η) under the true distribution, if the marginal is isotropic. Having acquired the list L using sample S1, we use the independent samples in S2 to test whether L contains an approximately stationary point of the empirical loss on S2. If this is not the case, then we may safely reject: for large enough |S1|, if the distribution is indeed isotropic strongly logconcave, there is an approximate stationary of the population loss in L and if |S2| is large enough, the gradient of the empirical loss on S2 will be close to the gradient of the population loss on each of the elements of L, due to appropriate concentration bounds for log-concave distributions as well as the fact that the elements of L are independent from S2. For the following, let w be a point such that w Lσ(w) 2 < c1(1 2η) under the empirical distribution over S2 In Lemma 4.3 and Proposition 4.4 we have identified certain properties of the marginal distribution that are sufficient for our purposes, given that L contains an approximately stationary point of the empirical (surrogate) loss on S2. Our testers T1, T2, T3 verify that these properties hold for the empirical marginal over our sample S2, and it will be convenient to analyze the optimality of our algorithm purely over S2. In particular, we will need to require that |S2| is sufficiently large, so that when the true marginal is indeed the target D , our testers succeed with high probability (for the corresponding sample complexity, see Propositions 3.2, 3.3 and 3.4). Moreover, by standard generalization theory, since the VC dimension of halfspaces is only O(d) and for us |S2| is a large poly(d, 1/ϵ), both the error of our final output and the optimal error over S2 will be close to that over DXY. So in what follows, we will abuse notation and refer to the uniform distribution over S2 as DXY and the optimal error over S2 simply as opt. We proceed with some basic tests. Throughout the rest of the algorithm, whenever a tester fails, we reject, otherwise we proceed. First, we run testers T2 with inputs (w, σ/2, δ ) and (w, σ/6, δ ) (Proposition 3.3) and T3 with inputs (w, σ/2, c2, δ ) and with (w, σ/6, c2, δ ) (Proposition 3.4, c2 as defined in Lemma 4.3). This ensures that for the approximate stationary point w of the Lσ, the probability within the band Bw(σ/2) = {x : | w, x | σ/2} is Θ(σ) (and similarly for Bw(σ/6)) and moreover that our marginal conditioned on each of the bands fools (up to an additive constant) functions of halfspaces with weights orthogonal to w. As a result, we may apply Lemma 4.3 to w and form a list of 2 vectors {w, w} which contains some w with (w , w ) c2σ/(1 2η) (where c3 is as defined in Lemma 4.3). Published as a conference paper at ICLR 2024 We run T1 (Proposition 3.2) with k = 2 to verify that the marginals are approximately isotropic and we use T2 once again, with appropriate parameters for each w and its negation, to apply Proposition 4.4 and get that {w, w} contains a vector w with P DXY[y = sign( w , x )] opt + c θ2/3, where (w , w ) θ := c2σ/ p By picking σ = Θ(ϵ3/2(1 2η)) we have P DXY[y = sign( w , x )] opt + ϵ . However, we do not know which of the weight vectors in {w, w} is the one guaranteed to achieve small error. 
In order to discover this vector, we estimate the probability of error of each of the corresponding halfspaces (which can be done efficiently, due to Hoeffding s bound) and pick the one with the smallest error. This final step does not require any distributional assumptions and we do not need to perform any further tests. 5 TESTABLY LEARNING HALFSPACES IN THE AGNOSTIC SETTING In this section, we provide our result on efficiently and testably learning halfspaces in the agnostic setting with respect to isotropic strongly log-concave target marginals. We defer the proofs to Appendix D. The algorithm we use is once more Algorithm 1, but we call it multiple times for different choices of the parameter σ, reject if any call rejects and output the vector that achieved the minimum empirical error overall, otherwise. Also, the tester T1 is called for a general k (not necessarily k = 2). Theorem 5.1 (Efficient Tester-Learner for Halfspaces in the Agnostic Setting). Let DXY be a distribution over Rd { 1} and let D be a strongly log-concave distribution over Rd (Definition A.1). Let C be the class of origin centered halfspaces in Rd. Then, for any even k N, any ϵ > 0 and δ (0, 1), there exists an algorithm that agnostically testably learns C w.r.t. D up to error O(k1/2 opt1 1 k+1 )+ϵ, where opt = minw Sd 1 PDXY[y = sign( w, x )], and error probability at most δ, using time and a number of samples from DXY that are polynomial in d O(k), (1/ϵ) O(k) and (log(1/δ))O(k). In particular, by picking some appropriate k log2 d, we obtain error O(opt) + ϵ in quasipolynomial time and sample complexity, i.e. poly(2polylog d, ( 1 ϵ )polylog d). To prove Theorem 5.1, we may follow a similar approach as the one we used for the case of Massart noise. However, in this case, the main structural lemma regarding the quality of the stationary points involves an additional requirement about the parameter σ. In particular, σ cannot be arbitrarily small with respect to the error of the optimum halfspace, because, in this case, there is no upper bound on the amount of noise that any specific point x might be associated with. As a result, picking σ to be arbitrarily small would imply that our algorithm only considers points that lie within a region that has arbitrarily small probability and can hence be completely corrupted with the adversarial opt budget. On the other hand, the polynomial slackness that the testability requirement introduces (through Proposition 4.4) between the error we achieve and the angular distance guarantee we can get via finding a stationary point of Lσ (which is now coupled with opt), appears to the exponent of the guarantee we achieve in Theorem 5.1. Lemma 5.2. Let Lσ be as in Equation equation 4.1 with σ (0, 1], ℓσ as described in Proposition C.1, let w Sd 1 and consider DXY such that the marginal DX on Rd satisfies Properties equation 3.2, equation 3.3 and equation 3.4 for w with C4 = 2 and accuracy parameter τ. Let opt be the minimum error achieved by some origin centered halfspace and let w Sd 1 be a corresponding vector. Then, there are constants c1, c2, c3, c4 > 0 such that if opt c1σ, w Lσ(w) 2 < c2, and τ c3 then either (w, w ) c4σ or ( w, w ) c4σ. We obtain our main result for Gaussian target marginals by refining Proposition 4.4 for the specific case when the target marginal distribution D is the standard multivariate Gaussian distribution. 
The algorithm for the Gaussian case is similar to the one of Theorem 5.1, but it runs different tests for the improved version (see Proposition D.1) of Proposition 4.4. Theorem 5.3. In Theorem 5.1, if D is the standard Gaussian in d dimensions, we obtain error O(opt) + ϵ in polynomial time and sample complexity, i.e. poly(d, 1/ϵ, log(1/δ)). Published as a conference paper at ICLR 2024 ACKNOWLEDGMENTS We wish to thank the anonymous reviewers of ICLR 2024 for their constructive feedback. Aravind Gollakota was at UT Austin while this work was done, supported by NSF award AF-1909204 and the NSF AI Institute for Foundations of Machine Learning (IFML). Adam R. Klivans was supported by NSF award AF-1909204 and the NSF AI Institute for Foundations of Machine Learning (IFML). Konstantinos Stavropoulos was supported by NSF award AF-1909204, the NSF AI Institute for Foundations of Machine Learning (IFML), and by scholarships from Bodossaki Foundation and Leventis Foundation. Arsen Vasilyan was supported in part by NSF awards CCF-2006664, DMS-2022448, CCF-1565235, CCF-1955217, CCF-2310818, Big George Fellowship and Fintech@CSAIL. Part of this work was done while Arsen Vasilyan was visiting UT Austin. [ABHU15] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pages 167 190. PMLR, 2015. [ABHZ16] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Conference on Learning Theory, pages 152 192. PMLR, 2016. [ABL17] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM (JACM), 63(6):1 27, 2017. [BH21] Maria-Florina Balcan and Nika Haghtalab. Noise in classification. Beyond the Worst Case Analysis of Algorithms, page 361, 2021. [BZ17] Maria-Florina F Balcan and Hongyang Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. Advances in Neural Information Processing Systems, 30, 2017. [Dan15] Amit Daniely. A ptas for agnostically learning halfspaces. In Conference on Learning Theory, pages 484 502. PMLR, 2015. [DGT19] Ilias Diakonikolas, Themis Gouleakis, and Christos Tzamos. Distribution-independent pac learning of halfspaces with massart noise. Advances in Neural Information Processing Systems, 32, 2019. [DK22] Ilias Diakonikolas and Daniel Kane. Near-optimal statistical query hardness of learning halfspaces with massart noise. In Conference on Learning Theory, pages 4258 4282. PMLR, 2022. [DKK+22] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning general halfspaces with general massart noise under the gaussian distribution. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 874 885, 2022. [DKK+23] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, Sihan Liu, and Nikos Zarifis. Efficient testable learning of halfspaces with adversarial label noise. ar Xiv preprint ar Xiv:2303.05485, 2023. [DKMR22] Ilias Diakonikolas, Daniel Kane, Pasin Manurangsi, and Lisheng Ren. Cryptographic hardness of learning halfspaces with massart noise. In Advances in Neural Information Processing Systems, 2022. [DKPZ21] Ilias Diakonikolas, Daniel M Kane, Thanasis Pittas, and Nikos Zarifis. The optimality of polynomial regression for agnostic learning under gaussian marginals in the sq model. 
In Conference on Learning Theory, pages 1552 1584. PMLR, 2021. Published as a conference paper at ICLR 2024 [DKS18] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1061 1073, 2018. [DKT21] Ilias Diakonikolas, Daniel Kane, and Christos Tzamos. Forster decomposition and learning halfspaces with noise. Advances in Neural Information Processing Systems, 34:7732 7744, 2021. [DKTZ20a] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning halfspaces with massart noise under structured distributions. In Conference on Learning Theory, pages 1486 1513. PMLR, 2020. [DKTZ20b] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Non-convex sgd learns halfspaces with adversarial label noise. Advances in Neural Information Processing Systems, 33:18540 18549, 2020. [DKTZ22] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning general halfspaces with adversarial label noise via online gradient descent. In International Conference on Machine Learning, pages 5118 5141. PMLR, 2022. [DKZ20] Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. Near-optimal sq lower bounds for agnostically learning halfspaces and relus under gaussian marginals. Advances in Neural Information Processing Systems, 33:13586 13596, 2020. [DTK22] Ilias Diakonikolas, Christos Tzamos, and Daniel M Kane. A strongly polynomial algorithm for approximate forster transforms and its application to halfspace learning. ar Xiv preprint ar Xiv:2212.03008, 2022. [GGK20] Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147 2158, 2020. [GKK23] Aravind Gollakota, Adam R Klivans, and Pravesh K Kothari. A moment-matching approach to testable learning and a new characterization of rademacher complexity. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023. To appear. [GKSV23] Aravind Gollakota, Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Tester-learners for halfspaces: Universal algorithms. ar Xiv preprint ar Xiv:2305.11765, 2023. [KKMS08] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777 1805, 2008. [KLS09] Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(12), 2009. [RV23] Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023. To appear. [SW14] Adrien Saumard and Jon A Wellner. Log-concavity and strong log-concavity: a review. Statistics surveys, 8:45, 2014. [YZ17] Songbai Yan and Chicheng Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. Advances in Neural Information Processing Systems, 30, 2017. [Zha18] Chicheng Zhang. Efficient active learning of sparse halfspaces. In Conference on Learning Theory, pages 1856 1880. PMLR, 2018. [ZL21] Chicheng Zhang and Yinan Li. Improved algorithms for efficient active learning halfspaces with massart and tsybakov noise. In Conference on Learning Theory, pages 4526 4527. PMLR, 2021. Published as a conference paper at ICLR 2024 [ZSA20] Chicheng Zhang, Jie Shen, and Pranjal Awasthi. 
Efficient active learning of sparse halfspaces with arbitrary bounded noise. Advances in Neural Information Processing Systems, 33:7184 7197, 2020. Published as a conference paper at ICLR 2024 A STRONGLY LOG-CONCAVE DISTRIBUTIONS We also formally define the class of strongly log-concave distributions, which is the class that our target marginal D is allowed to belong to, and collect some useful properties of such distributions. We will state the definition for isotropic D (i.e. with mean 0 and covariance I) for simplicity. Definition A.1 (Strongly log-concave distribution, see e.g. (SW14, Def 2.8)). We say an isotropic distribution D on Rd is strongly log-concave if the logarithm of its density q is a strongly concave function. Equivalently, q can be written as q(x) = r(x)γκ2I(x) (A.1) for some log-concave function r and some constant κ > 0, where γκ2I denotes the density of the spherical Gaussian N(0, κ2I). Proposition A.2 (see e.g. (SW14)). Let D be an isotropic strongly log-concave distribution on Rd with density q. (a) Any orthogonal projection of D onto a subspace is also strongly log-concave. (b) There exist constants U, R such that q(x) U for all x, and q(x) 1/U for all x R. (c) There exist constants U and κ such that q(x) U γκ2I(x) for all x. (d) There exist constants K1, K2 such that for any σ [0, 1] and any v Sd 1, P[| v, x | σ] (K1σ, K2σ). (e) There exists a constant K3 such that for any k N, E[| v, x |k] (K3k)k/2. (f) Let α = (α1, . . . , αd) Zd 0 be a multi-index with total degree |α| = P i αi = k, and let xα = Q i xαi i . There exists a constant K4 such that for any such α, E[|xα|] (K4k)k/2. For (a), see e.g. (SW14, Thm 3.7). The other properties follow readily from Eq. (A.1), which allows us to treat the density as subgaussian. A key structural fact that we will need about strongly log-concave distributions is that approximately matching moments of degree at most e O(1/τ 2) with such a D is sufficient to fool any function of a constant number of halfspaces up to an additive τ. Proposition A.3 (Variant of (GKK23, Thm 5.6)). Let p be a fixed constant, and let F be the class of all functions of p halfspaces mapping Rd to { 1} of the form f(x) = g sign( v1, x + θ1), . . . , sign( vp, x + θp) (A.2) for some g : { 1}p { 1} and weights vi Sd 1. Let D be any target marginal such that for every i, the projection vi, x has subgaussian tails and is anticoncentrated: (a) P[| vi, x | > t] exp( Θ(t2)), and (b) for any interval [a, b], P[ vi, x [a, b]] Θ(|b a|). Let D be any distribution such that for all monomials xα = Q i xαi of total degree |α| = P E D [xα] E D[xα] c|α| for some sufficiently small constant c (in particular, it suffices to have d e O(k) moment closeness for every α). Then E D [f] E D[f] e O 1 Note that this is a variant of the original statement of (GKK23, Thm 5.6), which requires that the 1D projection of D along any direction satisfy suitable concentration and anticoncentration. Indeed, an inspection of their proof reveals that it suffices to verify these properties for projections only along the directions {vi}i [p] as opposed to all directions. This is because to fool a function f of the form above, their proof only analyzes the projected distribution ( v1, x , . . . , vp, x ) on Rp, and requires only concentration and anticoncentration for each individual projection vi, x . Published as a conference paper at ICLR 2024 B PROOFS FOR SECTION 3 B.1 PROOF OF PROPOSITION 3.1 Our plan is to apply Proposition A.3. 
To do so, we must verify that D |T satisfies the assumptions required. In particular, it suffices to verify that the 1D projection along any direction orthogonal to w has subgaussian tails and is anticoncentrated. Let v Sd 1 be any direction that is orthogonal to w. By Proposition A.2(d), we may assume that PD [T] Ω(σ). To verify subgaussian tails, we must show that for any t, PD |T [| v, x | > t] exp( Ct2) for some constant C. The main fact we use is Proposition A.2(c), i.e. that any strongly log-concave density is pointwise upper bounded by a Gaussian density times a constant. Write P D |T [| v, x | > t] = PD [ v, x > t and w, x [ σ, σ]] PD [ w, x [ σ, σ]] . The claim now follows from the fact that the numerator is upper bounded by a constant times the corresponding probability under a Gaussian density, which is at most O(exp( C t2)σ) for some constant C , and that the denominator is Ω(σ). To check anticoncentration, for any interval [a, b], write P D |T [ v, x [a, b]] = PD [ v, x [a, b] and w, x [ σ, σ]] PD [ w, x [ σ, σ]] . After projecting onto span(v, w) (an operation that preserves logconcavity), the numerator is the probability mass under a rectangle with side lengths |b a| and 2σ, which is at most O(σ|b a|) as by Proposition A.2(b) the density is pointwise upper bounded by a constant. The claim follows since the denominator is Ω(σ). Now we are ready to apply Proposition A.3. We see that if D|T matches moments of degree at most k with D |T up to an additive slack of d O(k), then | ED [f | T] ED[f | T]| e O(1/ k). Rewriting in terms of τ gives the theorem. B.2 PROOF OF PROPOSITION 3.2 The tester T1 does the following: 1. For all α Zd 0 with |α| = k: (a) Compute the corresponding moment E(x,y) D xα := 1 |S| P (b) If E(x,y) D[xα] Ex D [xα] > 1 dk then reject. 2. If all the checks above passed, accept. First, we claim that for some absolute constant C1, if the tester above accepts, we have E(x,y) D[( v, x )k] (C1k)k/2 for any v Sd 1. To show this, we first recall that by Proposition A.2(e) it is the case that E(x,y) D [( v, x )k] (K3k)k/2. But we have E (x,y) D[( v, x )k] E (x,y) D [( v, x )k] X E (x,y) D[xα] E x D [xα] dk max α:|α|=k E (x,y) D[xα] E x D [xα] 1 Together with the bound E(x,y) D [( v, x )k] (K3k)k/2, the above implies that E(x,y) D[( v, x )k] (C1k)k/2 for some constant C1. Now, we need to show that if the elements of S are chosen i.i.d. from D , and |S| dk, log 1 δ k C1 then the tester above accepts with probability at least 1 δ. Consider any specific Published as a conference paper at ICLR 2024 multi-index α Zd 0 with |α| = k. Now, by Proposition A.2(f) we have the following: xα E z D [zα] 2 log(1/δ) 2 log(1/δ) X E x D (xα)ℓ E z D [zα] 2 log(1/δ) ℓ 2 log(1/δ) X ℓ=0 (K4ℓk)ℓk/2(K4k)k(2 log(1/δ) ℓ)/2 2 log(1/δ)(2K4 log(1/δ)k)log(1/δ)k This, together with Markov s inequality implies that x S xα E x D [xα] dk(3K4k log(1/δ))k/2 Since S is obtained by taking at least |S| dk, log 1 δ k C1 , for sufficiently large C1 we see that the above is upper-bounded by 1 dk δ. Taking a union bound over all α Zd 0 with |α| = k, we see that with probability at least 1 δ the tester T1 accepts, finishing the proof. B.3 PROOF OF PROPOSITION 3.3 Let K1 be as in part (d) of Proposition A.2. The tester T2 computes the fraction of elements in S that are in T. If this fraction is K1σ/2-close to Px D [| w, x | σ], the algorithm accepts. The algorithm rejects otherwise. Now, from (d) of Proposition A.2 we have that Px D [| w, x | σ] [K1σ, K2σ]. 
Therefore, if the fraction of elements in S that belong in T is K1σ/100-close to Px D [| w, x | σ], then this quantity is in [K1σ/2, (K2 + K1/2)σ] as required. Finally, if |S| 100 K1σ2 log 1 δ by standard Hoeffding bound, with probability at least 1 δ we indeed have that the fraction of elements in S that are in T is K1σ/2-close to Px D [| w, x | σ]. B.4 PROOF OF PROPOSITION 3.4 The tester T3 does the following: 1. Runs the tester T2 from Proposition 3.3. If T2 rejects, T3 rejects as well. 2. Let S|T be the set of elements in S for which x T. 3. Let k = O(1/τ 2) be chosen as in Proposition 3.1. 4. For all α Zd 0 with |α| = k: (a) Compute the corresponding moment E(x,y) D[xα | x T] := 1 |S|T | P x S|T xα. (b) If E(x,y) D[xα | x T] Ex D [xα | x T] > τ dk d O(k) then reject, where the polylogarithmic in d O(k) is chosen to satisfy the additive slack condition in Proposition 3.1. 5. If all the checks above passed, accept. First, we argue that if the checks above pass, then Equations 3.3 and 3.4 will hold. If the tester passes, Equation 3.3 follows immediately from the guarantees in step (4b) of T3 together with Proposition 3.1. Equation 3.4, in turn, is proven as follows: E (x,y) D[( v, x )2] E (x,y) D [( v, x )2] X E (x,y) D[xα] E x D [xα] d2 max α:|α|=2 E (x,y) D[xα] E x D [xα] τ Published as a conference paper at ICLR 2024 Now, we need to show that if the elements of S are chosen i.i.d. from D , and |S| ... then the tester above accepts with probability at least 1 δ. Consider any specific mult-index α Zd 0 with |α| = k. Now, by Proposition A.2(f) we have for any positive integer ℓthe following: h (xα)ℓ i (K4ℓk)k/2 But by Proposition A.2(d) we have that Px D [x T] = Px D [| x, w | σ] K1σ. Therefore, the density of the distribution D |T (which is defined as the distribution one obtains by taking D and conditioning on x T) is upper bounded by the product of the density of the distribution D and 1 K1σ. This allows us to bound h (xα)ℓ | x T i 1 K1σ E x D h (xα)ℓ i (K4ℓk)k/2 K1σ This implies that xα E z D [zα | z T] 2 log(1/δ) | x T 2 log(1/δ) X h (xα)ℓ| x T i E x D [(xα | x T]) 2 log(1/δ) ℓ 1 (K1σ)2 log(1/δ) 2 log(1/δ) X ℓ=0 (K4ℓk)ℓk/2(K4k)k(2 log(1/δ) ℓ)/2 1 (K1σ)2 log(1/δ) 2 log(1/δ)(2K4 log(1/δ)k)log(1/δ)k This, together with Markov s inequality implies that x S xα E x D [xα] dk d O(k) # d O(k)(3K4k log(1/δ))k/2 !2 log(1/δ) Now, recall that the tester T2 in step (1) accepted, we have |S|T | 1 C2σ|S|. Since S is obtained by taking at least |S| 1 τ 1 σ d 1 τ2 log C5( 1 τ2 log C5( 1 τ ) C5 , for sufficiently large C5 we see that the expression above is upper-bounded by 1 dk δ. Taking a union bound over all α Zd 0 with |α| = k, we see that with probability at least 1 δ the tester T3 accepts, finishing the proof. C PROOFS FROM SECTION 4 We first present the following Proposition, which ensures that we can form a loss function with certain desired properties. Proposition C.1. There are constants c, c > 0, such that for any σ > 0, there exists a continuously differentiable function ℓσ : R [0, 1] with the following properties. 1. For any t [ σ/6, σ/6], ℓσ(t) = 1 2. For any t > σ/2, ℓσ(t) = 1 and for any t < σ/2, ℓσ(t) = 0. 3. For any t R, ℓ σ(t) [0, c/σ], ℓ σ(t) = ℓ σ( t) and |ℓ σ(t)| c /σ2. Proof. We define ℓσ as follows. 2, if |t| σ 6 1, if t > σ 2 0, if t < σ 2 ℓ+(t), t ( σ 2 ] ℓ (t), t [ σ Published as a conference paper at ICLR 2024 Figure 2: The function ℓσ used to smoothly approximate the ramp. for some appropriate functions ℓ+, ℓ . 
It is sufficient that we pick ℓ+ satisfying the following conditions (then ℓ would be defined symmetrically, i.e., ℓ (t) = 1 ℓ+( t)). ℓ+(σ/2) = 1 and ℓ+ (σ/2) = 0. ℓ+(σ/6) = 2/3 and ℓ+ (σ/6) = 1/σ. ℓ+ is defined and bounded, except, possibly on σ/6 and/or σ/2. We therefore need to satisfy four equations for ℓ+. So we set ℓ+ to be a degree 3 polynomial: ℓ+(t) = a1t3 + a2t2 + a3t + a4. Whenever σ > 0, the system has a unique solution that satisfies the desired inequalities. In particular, we may solve the equation to get a1 = 9/σ3, a2 = 15/(2σ2), a3 = 3/(4σ) and a4 = 5/8. For the resulting function (see Figure 2 below and Figure 4 in the appendix) we have that there are constants c, c > 0 such that ℓ+ (t) [0, c/σ] and |ℓ+ (t)| c /σ2 for any t [σ/6, σ/2]. C.1 PROOF OF LEMMA 4.3 We will prove the contrapositive of the claim, namely, that there are constants c1, c2, c3 > 0 such that if (w, w ), ( w, w ) > c3 1 2η σ, and τ c2, then w Lσ(w) 2 c1(1 2η). Consider the case where (w, w ) < π/2 (otherwise, perform the same argument for w). Let v be a unit vector orthogonal to w that can be expressed as a linear combination of w and w and for which v, w = 0. Then {v, w} is an orthonormal basis for V = span(w, w ). For any vector x Rd, we will use the following notation: xw = w, x , xv = v, x . It follows that proj V (x) = xww + xvv, where proj V is the operator that orthogonally projects vectors on V . Using the fact that w( w, x / w 2) = x w, x w = x xww for any w Sd 1, the interchangeability of the gradient and expectation operators and the fact that ℓ σ is an even function we get that w Lσ(w) = E h ℓ σ(| w, x |) y (x xww) i Since the projection operator proj V is a contraction, we have w Lσ(w) 2 proj V w Lσ(w) 2, and we can therefore restrict our attention to a simpler, two dimensional problem. In particular, since proj V (x) = xww + xvv, we get proj V w Lσ(w) 2 = E h ℓ σ(|xw|) y xvv i 2 = E h ℓ σ(|xw|) y xv i = E h ℓ σ(|xw|) sign( w , x ) (1 2 1{y = sign( w , x )}) xv i Published as a conference paper at ICLR 2024 Figure 3: Critical regions in the proofs of main structural lemmas (Lemmas 4.3, 5.2). We analyze the contributions of the regions labeled A1, A2 to the quantities A1, A2 in the proofs. Specifically, the regions A1 (which have height σ/3 so that the value of ℓ σ(xw) for any x in these regions is exactly 1/σ, by Proposition C.1) form a subset of the region G, and their probability mass under DX is (up to a multiplicative factor) a lower bound on the quantity A1 (see Eq equation C.3). Similarly, the region A2 is a subset of the intersection of Gc with the band of height σ, and has probability mass that is (up to a multiplicative factor) an upper bound on the quantity A2 (see Eq equation C.4). Let F(y, x) denote 1 2 1{y = sign( w , x )}. We may write xv as |xv| sign(xv) and let G R2 such that sign(xv) sign( w , x ) = 1 iff x G. Then, sign(xv) sign( w , x ) = 1{x G} 1{x G}. We get proj V w Lσ(w) 2 = = E h ℓ σ(|xw|) (1{x G} 1{x G}) F(y, x) |xv| i E h ℓ σ(|xw|) 1{x G} F(y, x) |xv| i E h ℓ σ(|xw|) 1{x G} F(y, x) |xv| i Let A1 = E[ℓ σ(|xw|) 1{x G} F(y, x) |xv|] and A2 = E[ℓ σ(|xw|) 1{x G} F(y, x) |xv|]. (See Figure 3.) Note that Ey|x[F(y, x)] = 1 2η(x) [1 2η, 1], where 1 2η > 0. Therefore, we have that A1 (1 2η) E[ℓ σ(|xw|) 1{x G} |xv|] and A2 E[ℓ σ(|xw|) 1{x G} |xv|]. Note that due to Proposition C.1, ℓ σ(|xw|) c/σ for some constant c and ℓ σ(|xw|) = 0 whenever |xw| > σ/2. 
C.1 PROOF OF LEMMA 4.3

We prove the contrapositive of the claim, namely, that there are constants $c_1,c_2,c_3>0$ such that if $\measuredangle(w,w^*),\measuredangle(-w,w^*)>c_3\frac{\sigma}{1-2\eta}$ and $\tau\le c_2$, then $\|\nabla_wL_\sigma(w)\|_2\ge c_1(1-2\eta)$.

Consider the case where $\theta:=\measuredangle(w,w^*)<\pi/2$ (otherwise, perform the same argument for $-w$). Let $v$ be a unit vector orthogonal to $w$ that can be expressed as a linear combination of $w$ and $w^*$ and for which $\langle v,w^*\rangle\ge0$. Then $\{v,w\}$ is an orthonormal basis for $V=\mathrm{span}(w,w^*)$. For any vector $x\in\mathbb{R}^d$, we use the notation $x_w=\langle w,x\rangle$ and $x_v=\langle v,x\rangle$. It follows that $\mathrm{proj}_V(x)=x_ww+x_vv$, where $\mathrm{proj}_V$ is the operator that orthogonally projects vectors onto $V$.

Using the fact that $\nabla_w\!\left(\frac{\langle w,x\rangle}{\|w\|_2}\right)=x-\langle w,x\rangle w=x-x_ww$ for any $w\in\mathbb{S}^{d-1}$, the interchangeability of the gradient and expectation operators, and the fact that $\ell_\sigma'$ is an even function, we get
$$\nabla_wL_\sigma(w)=\mathbb{E}\left[\ell_\sigma'(|\langle w,x\rangle|)\,y\,(x-x_ww)\right].$$
Since the projection operator $\mathrm{proj}_V$ is a contraction, we have $\|\nabla_wL_\sigma(w)\|_2\ge\|\mathrm{proj}_V(\nabla_wL_\sigma(w))\|_2$, and we can therefore restrict our attention to a simpler, two-dimensional problem. In particular, since $\mathrm{proj}_V(x-x_ww)=x_vv$, we get
$$\|\mathrm{proj}_V(\nabla_wL_\sigma(w))\|_2=\left\|\mathbb{E}\left[\ell_\sigma'(|x_w|)\,y\,x_v\,v\right]\right\|_2=\left|\mathbb{E}\left[\ell_\sigma'(|x_w|)\,y\,x_v\right]\right|=\left|\mathbb{E}\left[\ell_\sigma'(|x_w|)\,\mathrm{sign}(\langle w^*,x\rangle)\,\big(1-2\cdot\mathbb{1}\{y\ne\mathrm{sign}(\langle w^*,x\rangle)\}\big)\,x_v\right]\right|.$$

Figure 3: Critical regions in the proofs of the main structural lemmas (Lemmas 4.3 and 5.2). We analyze the contributions of the regions labeled $A_1$, $A_2$ to the quantities $A_1$, $A_2$ in the proofs. Specifically, the regions $A_1$ (which have height $\sigma/3$, so that the value of $\ell_\sigma'(x_w)$ for any $x$ in these regions is exactly $1/\sigma$, by Proposition C.1) form a subset of the region $G$, and their probability mass under $D_X$ is (up to a multiplicative factor) a lower bound on the quantity $A_1$ (see Eq. (C.3)). Similarly, the region $A_2$ is a subset of the intersection of $G^c$ with the band of height $\sigma$, and has probability mass that is (up to a multiplicative factor) an upper bound on the quantity $A_2$ (see Eq. (C.4)).

Let $F(y,x)$ denote $1-2\cdot\mathbb{1}\{y\ne\mathrm{sign}(\langle w^*,x\rangle)\}$. We may write $x_v$ as $|x_v|\,\mathrm{sign}(x_v)$, and let $G$ be the set of points $x$ for which $\mathrm{sign}(x_v)\,\mathrm{sign}(\langle w^*,x\rangle)=1$ (a region determined by the two-dimensional projection onto $V$). Then $\mathrm{sign}(x_v)\,\mathrm{sign}(\langle w^*,x\rangle)=\mathbb{1}\{x\in G\}-\mathbb{1}\{x\notin G\}$. We get
$$\|\mathrm{proj}_V(\nabla_wL_\sigma(w))\|_2=\left|\mathbb{E}\left[\ell_\sigma'(|x_w|)\,\big(\mathbb{1}\{x\in G\}-\mathbb{1}\{x\notin G\}\big)\,F(y,x)\,|x_v|\right]\right|\ge\mathbb{E}\left[\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\in G\}\,F(y,x)\,|x_v|\right]-\mathbb{E}\left[\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\notin G\}\,F(y,x)\,|x_v|\right].$$
Let $A_1=\mathbb{E}[\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\in G\}\,F(y,x)\,|x_v|]$ and $A_2=\mathbb{E}[\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\notin G\}\,F(y,x)\,|x_v|]$. (See Figure 3.) Note that $\mathbb{E}_{y|x}[F(y,x)]=1-2\eta(x)\in[1-2\eta,1]$, where $1-2\eta>0$. Therefore, we have
$$A_1\ge(1-2\eta)\,\mathbb{E}\left[\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\in G\}\,|x_v|\right]\qquad\text{and}\qquad A_2\le\mathbb{E}\left[\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\notin G\}\,|x_v|\right].$$
Note that, due to Proposition C.1, $\ell_\sigma'(|x_w|)\le c/\sigma$ for some constant $c$, and $\ell_\sigma'(|x_w|)=0$ whenever $|x_w|>\sigma/2$. Therefore, if $U_2$ is the band $B_w(\sigma/2)=\{x:|x_w|\le\sigma/2\}$, we have
$$A_2\le\frac{c}{\sigma}\,\mathbb{E}\left[\mathbb{1}\{x\notin G\}\,\mathbb{1}\{x\in U_2\}\,|x_v|\right].\tag{C.1}$$
Moreover, for each individual $x$, we have $\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\in G\}\,|x_v|\ge0$, due to the properties of $\ell_\sigma'$ (Proposition C.1). Hence, for any set $U_1\subseteq\mathbb{R}^d$ we have
$$A_1\ge(1-2\eta)\,\mathbb{E}\left[\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\in G\}\,\mathbb{1}\{x\in U_1\}\,|x_v|\right].$$
Setting $U_1=B_w(\sigma/6)=\{x:|x_w|\le\sigma/6\}$, by Proposition C.1 we get $\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\in U_1\}=\frac1\sigma\,\mathbb{1}\{x\in U_1\}$, so
$$A_1\ge\frac{1-2\eta}{\sigma}\,\mathbb{E}\left[\mathbb{1}\{x\in G\}\,\mathbb{1}\{x\in U_1\}\,|x_v|\right].\tag{C.2}$$
We now observe that, by the definitions of $G,U_1,U_2$, for any constant $R>0$ there exist constants $c',c''>0$ such that if $\sigma/\tan\theta<c'R$ (the points in $\mathbb{R}^2$ where the boundary of $G$ intersects either $U_1$ or $U_2$ have projections on $v$ of magnitude $\Theta(\sigma/\tan\theta)$), then
$$\mathbb{1}\{x\in G\}\,\mathbb{1}\{x\in U_1\}\ge\mathbb{1}\{|x_v|\in[c'R,2c'R]\}\,\mathbb{1}\{x\in U_1\}\quad\text{and}\quad\mathbb{1}\{x\notin G\}\,\mathbb{1}\{x\in U_2\}\le\mathbb{1}\{|x_v|\le c''\sigma/\tan\theta\}\,\mathbb{1}\{x\in U_2\}.$$
By equations (C.1) and (C.2), we get the following bounds, whose graphical representations can be found in Figure 3:
$$A_1\ge\frac{c'R(1-2\eta)}{\sigma}\,\mathbb{E}\left[\mathbb{1}\{|x_v|\in[c'R,2c'R]\}\,\mathbb{1}\{x\in U_1\}\right]\tag{C.3}$$
$$A_2\le\frac{c\,c''}{\tan\theta}\,\mathbb{E}\left[\mathbb{1}\{|x_v|\le c''\sigma/\tan\theta\}\,\mathbb{1}\{x\in U_2\}\right]\tag{C.4}$$
So far, we have used no distributional assumptions. Now, consider the corresponding expectations under the target marginal $D^*$ (which we assumed to be strongly log-concave):
$$I_1=\mathbb{E}_{D^*}\left[\mathbb{1}\{|x_v|\in[c'R,2c'R]\}\,\mathbb{1}\{x\in U_1\}\right],\qquad I_2=\mathbb{E}_{D^*}\left[\mathbb{1}\{|x_v|\le c''\sigma/\tan\theta\}\,\mathbb{1}\{x\in U_2\}\right].$$
Any strongly log-concave distribution enjoys the well-behavedness properties defined by (DKTZ20a); therefore, if $R$ is picked to be small enough, then $I_1$ and $I_2$ are of order $\Theta(\sigma)$ (due to upper and lower bounds on the two-dimensional marginal density over $V$ within constant-radius balls, i.e., anti-anticoncentration and anticoncentration). Moreover, by Proposition A.2, $\mathbb{P}_{D^*}[x\in U_1]$ and $\mathbb{P}_{D^*}[x\in U_2]$ are both of order $\Theta(\sigma)$. Hence, there exist constants $c_1',c_2'>0$ such that for the corresponding conditional expectations under $D^*$ we have
$$\mathbb{E}_{D^*}\left[\mathbb{1}\{|x_v|\in[c'R,2c'R]\}\mid x\in U_1\right]\ge c_1',\qquad\mathbb{E}_{D^*}\left[\mathbb{1}\{|x_v|\le c''\sigma/\tan\theta\}\mid x\in U_2\right]\le c_2'.$$
By assumption, Property 3.3 holds and, therefore, if $\tau\le\min(c_1',c_2')/2=:c_2$, then under the true marginal the first conditional expectation is at least $c_1'-\tau\ge c_1'/2$ and the second is at most $c_2'+\tau\le2c_2'$. Moreover, by Property 3.2, under the true marginal $\mathbb{P}[x\in U_1]$ and $\mathbb{P}[x\in U_2]$ are both $\Theta(\sigma)$. Hence, in total, we get that for some constants $c_1,c_2>0$,
$$A_1\ge c_1(1-2\eta)\qquad\text{and}\qquad A_2\le c_2\,\frac{\sigma}{\tan\theta}.$$
Hence, if we pick $\sigma=\Theta((1-2\eta)\tan\theta)$ with a small enough constant, we get the desired result.

C.2 PROOF OF PROPOSITION 4.4

In what follows, all probabilities and expectations are over $D_{XY}$. First, we observe that
$$\mathbb{P}[y\ne\mathrm{sign}(\langle w,x\rangle)]\le\mathbb{P}[y\ne\mathrm{sign}(\langle w,x\rangle),\,y=\mathrm{sign}(\langle w^*,x\rangle)]+\mathbb{P}[y\ne\mathrm{sign}(\langle w^*,x\rangle)]\le\mathbb{P}[\mathrm{sign}(\langle w,x\rangle)\ne\mathrm{sign}(\langle w^*,x\rangle)]+\mathrm{opt}.$$
Then, we observe that, since $D_{XY}$ satisfies Property 3.2, we have $\mathbb{P}[|\langle w,x\rangle|\le\sigma]\le C_3\sigma$. Moreover,
$$\mathbb{P}[\mathrm{sign}(\langle w,x\rangle)\ne\mathrm{sign}(\langle w^*,x\rangle),\,|\langle w,x\rangle|>\sigma]\le\mathbb{P}\left[|\langle v,x\rangle|\ge\frac{\sigma}{\tan\theta}\right],$$
where $v$ is some unit vector perpendicular to $w$ (lying in $\mathrm{span}(w,w^*)$). Using Markov's inequality, we get
$$\mathbb{P}\left[|\langle v,x\rangle|\ge\frac{\sigma}{\tan\theta}\right]\le\left(\frac{\tan\theta}{\sigma}\right)^k\,\mathbb{E}\left[|\langle v,x\rangle|^k\right].$$
But, since $D_{XY}$ satisfies Property 3.1, there is a constant $C_1>0$ such that $\mathbb{E}[|\langle v,x\rangle|^k]\le(C_1k)^{k/2}$. Thus
$$\mathbb{P}[\mathrm{sign}(\langle w,x\rangle)\ne\mathrm{sign}(\langle w^*,x\rangle)]\le\mathbb{P}[|\langle w,x\rangle|\le\sigma]+\mathbb{P}[\mathrm{sign}(\langle w,x\rangle)\ne\mathrm{sign}(\langle w^*,x\rangle),\,|\langle w,x\rangle|>\sigma]\le C_3\sigma+(C_1k)^{k/2}\left(\frac{\tan\theta}{\sigma}\right)^k.$$
By picking $\sigma$ appropriately so as to balance the two terms (note that this is a different $\sigma$ than the one in Lemma 4.3), we get the desired result.
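For concreteness, the balancing step can be written out explicitly (a routine calculation with the constants above; it is not spelled out in the original text):
$$C_3\sigma=(C_1k)^{k/2}\left(\frac{\tan\theta}{\sigma}\right)^k\iff\sigma=\left(\frac{(C_1k)^{k/2}}{C_3}\right)^{\frac{1}{k+1}}(\tan\theta)^{\frac{k}{k+1}}=O\!\left(\sqrt{C_1k}\,(\tan\theta)^{1-\frac{1}{k+1}}\right),$$
so that the disagreement probability is at most $2C_3\sigma=O\!\left(\sqrt{k}\,\theta^{1-\frac{1}{k+1}}\right)$ for $\theta\le\pi/4$, which is exactly the form of the bound invoked in the proof of Theorem 5.1 below.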
D PROOFS FROM SECTION 5

D.1 PROOF OF THEOREM 5.1

We follow the same steps as in the proof of Theorem 4.1. Once more, we draw a sufficiently large sample so that our testers are guaranteed to accept with high probability when the true marginal is indeed the target marginal $D^*$, and so that we have generalization, i.e., the guarantee that any approximate minimizer of the empirical error (the error on the uniform empirical distribution over the sample drawn) is also an approximate minimizer of the true error. The algorithm we use is once more Algorithm 1, but this time we make multiple calls with different parameters $\sigma$ (and we run $T_1$ with a higher $k$, as we will see shortly) and reject if any of these calls rejects. If we accept, we output the result of the execution of Algorithm 1 with the minimum empirical error.

The main difference between the Massart case and the agnostic case is that in the former we were able to pick $\sigma$ arbitrarily small, while in the latter we face a more delicate tradeoff: to balance the competing contributions to the gradient norm, we must ensure that $\sigma$ is at least $\Theta(\mathrm{opt})$ while also ensuring that it is not too large. Since we do not know the value of $\mathrm{opt}$, we need to search over a space of possible values for $\sigma$ that is only polynomially large in the relevant parameters (similar to the approach of (DKTZ20b)). In our case, we may discretize the space $(0,1]$ of possible values for $\sigma$ up to accuracy $\Theta\big((\frac{\epsilon}{\sqrt{k}})^{1+1/k}\big)$ and form a list of $\mathrm{poly}(k/\epsilon)$ possible values for $\sigma$, one of which will satisfy $c_1\sigma-\Theta\big((\frac{\epsilon}{\sqrt{k}})^{1+1/k}\big)\le\mathrm{opt}\le c_1\sigma$. Hence, we perform the same testing-and-learning process for each of the possible values of $\sigma$ and obtain a list of candidate vectors which is still of polynomial size.

The final step is, again, to use Proposition 4.4, after running tester $T_1$ with parameter $k$ (Proposition 3.2) and tester $T_2$ with appropriate parameters for each of the candidate weight vectors. We get that our list contains a vector $w$ with
$$\mathbb{P}_{D_{XY}}[y\ne\mathrm{sign}(\langle w,x\rangle)]\le\mathrm{opt}+c\,k^{1/2}\,\theta^{1-\frac{1}{k+1}},\qquad\text{where }\measuredangle(w,w^*)\le\theta:=c_2\sigma$$
for a $\sigma$ such that $c_1\sigma-\Theta\big((\frac{\epsilon}{\sqrt{k}})^{1+1/k}\big)\le\mathrm{opt}\le c_1\sigma$. Consequently,
$$\mathbb{P}_{D_{XY}}[y\ne\mathrm{sign}(\langle w,x\rangle)]\le\mathrm{opt}+c\,k^{1/2}\,\theta^{1-\frac{1}{k+1}}\le\mathrm{opt}+O\!\left(\sqrt{k}\,\mathrm{opt}^{1-\frac{1}{k+1}}\right)+\epsilon.$$
However, we do not know which of the weight vectors in our list is the one guaranteed to achieve small error. In order to discover this vector, we estimate the probability of error of each of the corresponding halfspaces (which can be done efficiently, by Hoeffding's bound) and pick the one with the smallest empirical error. This final step does not require any distributional assumptions, and we do not need to perform any further tests.
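The grid-search-and-select logic just described can be summarized in the following sketch. The helpers run_algorithm_1 and run_testers are hypothetical stand-ins for Algorithm 1 and for the testers $T_1$, $T_2$, $T_3$ with the parameters used above; only the discretization of $\sigma$ and the final selection by empirical error are made concrete.

```python
import numpy as np

def select_candidate(X, y, run_algorithm_1, run_testers, eps, k):
    """Schematic grid search over sigma with final selection by empirical error.

    run_algorithm_1(X, y, sigma) -> unit vector w   (stand-in for Algorithm 1)
    run_testers(X, y, sigma, w)  -> bool            (stand-in for the testers)
    Returns the selected halfspace, or None if any tester rejects.
    """
    # Grid covering (0, 1] with spacing Theta((eps / sqrt(k)) ** (1 + 1/k)).
    step = (eps / np.sqrt(k)) ** (1.0 + 1.0 / k)
    sigmas = np.arange(step, 1.0 + step, step)

    candidates = []
    for sigma in sigmas:
        w = run_algorithm_1(X, y, sigma)
        if not run_testers(X, y, sigma, w):
            return None                      # reject: some distributional check failed
        candidates.append(w)

    # Pick the candidate with the smallest empirical 0-1 error; this step needs
    # no distributional assumptions and no further testing.
    errors = [np.mean(np.sign(X @ w) != y) for w in candidates]
    return candidates[int(np.argmin(errors))]
```

In the Massart setting a single, sufficiently small value of $\sigma$ suffices, so the loop over the grid is the only structural difference between the two cases.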
In order to obtain our $\tilde{O}(\mathrm{opt})$ guarantee (in quasipolynomial time), observe first that we may assume without loss of generality that $\mathrm{opt}\ge1/d^{C}$ for some constant $C$: if instead $\mathrm{opt}=o(1/d^2)$, say, then a sample of $O(d)$ points will, with high probability, be noiseless, and simple linear programming will recover a consistent halfspace that generalizes. Moreover, we may assume that $\mathrm{opt}\le1/10$, since otherwise achieving $O(\mathrm{opt})$ is trivial (we may output an arbitrary halfspace). Let us adapt our algorithm so that we run tester $T_1$ (see Proposition 3.2) multiple times, once for each $k=1,2,\ldots,\log^2d$ (this changes our time and sample complexity by at most a $\mathrm{polylog}(d)$ factor in the exponent). Then Proposition 4.4 holds for some $k\in[\log(1/\mathrm{opt}),2\log(1/\mathrm{opt})]$, since this interval has length at least $1$ (and therefore contains an integer) and $2\log(1/\mathrm{opt})\le2C\log d\le\log^2d$ (for large enough $d$). Therefore, by picking the best candidate, we get a guarantee of order
$$\sqrt{k}\,\mathrm{opt}^{1-\frac{1}{k+1}}\le\sqrt{k}\,\mathrm{opt}^{1-\frac1k}=\sqrt{k}\cdot\mathrm{opt}\cdot\mathrm{opt}^{-\frac1k}\le\sqrt{2\log(1/\mathrm{opt})}\cdot\mathrm{opt}\cdot\mathrm{opt}^{-\frac{1}{\log(1/\mathrm{opt})}}\le2\sqrt{2\log(1/\mathrm{opt})}\cdot\mathrm{opt}=\tilde{O}(\mathrm{opt})$$
(since $\log(1/\mathrm{opt})\le k\le2\log(1/\mathrm{opt})$). This concludes the proof of Theorem 5.1.

D.2 PROOF OF LEMMA 5.2

In the agnostic case, the proof is analogous to the proof of Lemma 4.3. The difference is that the random variable $F(y,x)=1-2\cdot\mathbb{1}\{y\ne\mathrm{sign}(\langle w^*,x\rangle)\}$ no longer has a conditional expectation given $x$ that is lower bounded by a constant. Instead, we consider an additional term $A_3$, corresponding to the part $-2\cdot\mathbb{1}\{y\ne\mathrm{sign}(\langle w^*,x\rangle)\}$, and the term $A_1$ is no longer scaled by the factor $(1-2\eta)$ as in Lemma 4.3. Hence, with similar arguments, we have
$$\|\nabla_wL_\sigma(w)\|_2\ge A_1-A_2-A_3,$$
where $A_1\ge c_1$, $A_2\le c_2\frac{\sigma}{\tan\theta}$ and (using the properties of $\ell_\sigma'$ as in Lemma 4.3 and the Cauchy–Schwarz inequality)
$$A_3=2\,\mathbb{E}\left[\ell_\sigma'(|x_w|)\,\mathbb{1}\{x\in G\}\,\mathbb{1}\{y\ne\mathrm{sign}(\langle w^*,x\rangle)\}\,|x_v|\right]\le\frac{2c}{\sigma}\,\mathbb{E}\left[\mathbb{1}\{x\in U_2\}\,\mathbb{1}\{y\ne\mathrm{sign}(\langle w^*,x\rangle)\}\,|x_v|\right]$$
$$\le\frac{2c}{\sigma}\,\sqrt{\mathbb{E}\left[\mathbb{1}\{x\in U_2\}\,(x_v)^2\right]}\cdot\sqrt{\mathbb{E}\left[\mathbb{1}\{y\ne\mathrm{sign}(\langle w^*,x\rangle)\}\right]}=\frac{2c}{\sigma}\,\sqrt{\mathbb{E}\left[\langle v,x\rangle^2\mid x\in U_2\right]\cdot\mathbb{P}[x\in U_2]}\cdot\sqrt{\mathrm{opt}}.$$
Similarly to our approach in the proof of Lemma 4.3, we can use the assumed Properties 3.2 and 3.4 to get that $A_3\le c_3\sqrt{\mathrm{opt}/\sigma}$ for some constant $c_3$, which gives that, in order for the gradient norm to be small while the angle $\theta$ remains large, we must have $\mathrm{opt}\ge\Omega(\sigma)$.
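One convenient way to record the conclusion of this argument (the threshold $c_1/3$ below is chosen purely for illustration) is
$$\|\nabla_wL_\sigma(w)\|_2\ \ge\ c_1-c_2\,\frac{\sigma}{\tan\theta}-c_3\sqrt{\frac{\mathrm{opt}}{\sigma}},$$
so that if $\|\nabla_wL_\sigma(w)\|_2\le c_1/3$, then either $\tan\theta\le\frac{3c_2}{c_1}\,\sigma$ or $\mathrm{opt}\ge\big(\frac{c_1}{3c_3}\big)^2\,\sigma$. This is the dichotomy exploited in the proof of Theorem 5.1: for $\sigma$ in the right range relative to $\mathrm{opt}$, every approximate stationary point of $L_\sigma$ is already angularly close to $w^*$.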
D.3 PROOF OF THEOREM 5.3

Before presenting the proof of Theorem 5.3, we prove the following proposition, which is, essentially, a stronger version of Proposition 4.4 for the specific case when the target marginal distribution $D^*$ is the standard multivariate Gaussian. Proposition D.1 is what allows us to obtain an $O(\mathrm{opt})$ guarantee when the target distribution is the standard Gaussian.

Proposition D.1. Let $D_{XY}$ be a distribution over $\mathbb{R}^d\times\{\pm1\}$, let $w^*\in\arg\min_{w'\in\mathbb{S}^{d-1}}\mathbb{P}_{D_{XY}}[y\ne\mathrm{sign}(\langle w',x\rangle)]$ and let $w\in\mathbb{S}^{d-1}$. Let $\theta\ge\measuredangle(w,w^*)$ and suppose that $\theta\in(0,\pi/4]$. Then, for a sufficiently large constant $C$, there is a tester which, given $\delta\in(0,1)$, $\theta$, $w$ and a set $S$ of samples from $D_X$ of size at least $(d/\delta)^{C}$, runs in time $\mathrm{poly}(\frac1\theta,d,\log\frac1\delta)$ and with probability at least $1-\delta$ satisfies the following specifications:
- If the distribution $D_X$ is $N(0,I_d)$, the tester accepts.
- If the tester accepts, then $\mathbb{P}_{x\sim S}[\mathrm{sign}(\langle w^*,x\rangle)\ne\mathrm{sign}(\langle w,x\rangle)]\le O(\theta)$.

Proof. The testing algorithm does the following:
1. Given: integer $d$, set $S\subseteq\mathbb{R}^d$, $w\in\mathbb{S}^{d-1}$, $\theta\in(0,\pi/4]$ and $\delta\in(0,1)$.
2. Let $\mathrm{proj}_{\perp w}:\mathbb{R}^d\to\mathbb{R}^{d-1}$ denote the operator that maps a vector $x\in\mathbb{R}^d$ to its projection onto the $(d-1)$-dimensional subspace of $\mathbb{R}^d$ that is orthogonal to $w$.
3. For each integer $i$ with $|i|\le k$, where $k:=\big\lceil\frac{\sqrt{2\log(1/\theta)}}{\theta}\big\rceil$:
   (a) $S_i\leftarrow\{x\in S:\langle w,x\rangle\in[i\theta,(i+1)\theta]\}$.
   (b) If $\frac{|S_i|}{|S|}>2\theta$, reject.
   (c) If $\left\|\frac{1}{|S_i|}\sum_{x\in S_i}(\mathrm{proj}_{\perp w}(x))(\mathrm{proj}_{\perp w}(x))^{T}-I_{d-1}\right\|_{\mathrm{op}}>0.1$, reject.
4. If $\frac{1}{|S|}\sum_{x\in S}\mathbb{1}\{|\langle w,x\rangle|>\sqrt{2\log(1/\theta)}\}>5\theta$, reject.
5. If this step is reached, accept.

If the tester accepts, then the following hold for some sufficiently large constant $C'>0$, where $v\in\mathbb{R}^d$ denotes the unit vector that is perpendicular to $w$, lies within the plane defined by $w$ and $w^*$, and satisfies $\langle v,w^*\rangle\ge0$:
1. $\mathbb{P}_{x\sim S}[\langle w,x\rangle\in[\theta i,\theta(i+1)]]\le C'\theta$ for every $i$ with $|i|\le k$;
2. $\mathbb{P}_{x\sim S_i}\!\left[|\langle v,x\rangle|>\frac{\theta}{\tan\theta}(|i|-1)\right]\le\frac{C'}{(|i|-1)^2}$ for every $i$ with $2\le|i|\le k$ (this follows from step (3c) via Chebyshev's inequality, using $\theta/\tan\theta\ge\pi/4$ for $\theta\le\pi/4$);
3. $\mathbb{P}_{x\sim S}\!\left[|\langle w,x\rangle|>\sqrt{2\log(1/\theta)}\right]\le5\theta$.

Then, for $\mathrm{Strip}_i=\{x\in\mathbb{R}^d:\langle w,x\rangle\in[\theta i,\theta(i+1)]\}$, note that any point of $\mathrm{Strip}_i$ on which $w$ and $w^*$ disagree must satisfy $|\langle v,x\rangle|>\frac{\theta}{\tan\theta}(|i|-1)$, so we have
$$\mathbb{P}_{x\sim S}[\mathrm{sign}(\langle w,x\rangle)\ne\mathrm{sign}(\langle w^*,x\rangle)]\le\sum_{i=-k}^{k}\mathbb{P}_{x\sim S}[x\in\mathrm{Strip}_i]\cdot\mathbb{P}_{x\sim S}\!\left[|\langle v,x\rangle|>\frac{\theta}{\tan\theta}(|i|-1)\,\Big|\,x\in\mathrm{Strip}_i\right]+\mathbb{P}_{x\sim S}\!\left[|\langle w,x\rangle|>\sqrt{2\log(1/\theta)}\right]$$
$$\le C'\theta\left(4+\sum_{|i|\ge2}\frac{C'}{(|i|-1)^2}\right)+5\theta=O(\theta),$$
where for the strips with $|i|\le1$ we simply bounded the conditional probability by $1$.

Now, suppose the distribution $D_X$ is indeed the standard Gaussian $N(0,I_d)$. We would like to show that the tester accepts with probability at least $1-\delta$. Since $D_X=N(0,I_d)$, for $x\sim D_X$ the projection $\langle w,x\rangle$ is distributed as $N(0,1)$. This implies that:
- for every $i$ with $|i|\le k$, $\mathbb{P}_{x\sim N(0,I_d)}[\langle w,x\rangle\in[i\theta,(i+1)\theta]]\le\frac{\theta}{\sqrt{2\pi}}\le\frac{\theta}{2}$;
- for every $i$ with $|i|\le k$, $\mathbb{P}_{x\sim N(0,I_d)}[\langle w,x\rangle\in[i\theta,(i+1)\theta]]\ge\theta\cdot\min_{|t|\le\sqrt{2\log(1/\theta)}+\theta}\frac{e^{-t^2/2}}{\sqrt{2\pi}}\ge\frac{\theta^2}{10}$;
- $\mathbb{P}_{x\sim N(0,I_d)}\!\left[|\langle w,x\rangle|>\sqrt{2\log(1/\theta)}\right]\le2\theta$.

Therefore, by a standard Hoeffding bound (and a union bound over the $O(k)$ strips), for a sufficiently large absolute constant $C$ we have, with probability at least $1-\frac{\delta}{4}$ over the choice of $S$, that every empirical strip probability $\mathbb{P}_{x\sim S}[\langle w,x\rangle\in[i\theta,(i+1)\theta]]$ is within $\frac{\theta^2}{20}$ of its Gaussian value — in particular it is at most $\theta\le2\theta$, so step (3b) never rejects, and at least $\frac{\theta^2}{20}$ — and moreover $\mathbb{P}_{x\sim S}[|\langle w,x\rangle|>\sqrt{2\log(1/\theta)}]\le3\theta\le5\theta$, so step (4) does not reject either.

Finally, we show that, conditioned on the above, the probability of rejection in step (3c) is also small.

Fact D.2. There is an absolute constant $C''$ such that, given a set $S'$ of i.i.d. samples from $N(0,I_{d-1})$ with $|S'|\ge C''(d+\log(1/\delta'))$, with probability at least $1-\delta'$ we have
$$\left\|\frac{1}{|S'|}\sum_{z\in S'}zz^{T}-I_{d-1}\right\|_{\mathrm{op}}\le0.1.$$

Now, since each sample $x_j\in S$ is drawn i.i.d. from $N(0,I_d)$, the random variables $\langle w,x_j\rangle$ and $\mathrm{proj}_{\perp w}(x_j)$ are all independent of one another. Since all the events we conditioned on depend only on $\{\langle w,x_j\rangle\}_j$, the vectors $\{\mathrm{proj}_{\perp w}(x_j)\}_j$ are still distributed as i.i.d. samples from $N(0,I_{d-1})$. Recall that one of the events we have already conditioned on is that $\mathbb{P}_{x\sim S}[\langle w,x\rangle\in[i\theta,(i+1)\theta]]\ge\frac{\theta^2}{20}$ for every $i$ with $|i|\le k$, which lower bounds the ratio $|S_i|/|S|$ by $\theta^2/20$. Since, as described above, for the elements of each $S_i$ the vectors $\mathrm{proj}_{\perp w}(x)$ are distributed as i.i.d. samples from $N(0,I_{d-1})$, we can apply Fact D.2 (with a union bound over the strips) to conclude that, for a sufficiently large absolute constant $C$, when $|S|\ge(d/\delta)^{C}$ we have, with probability at least $1-\frac{\delta}{4}$, for every $i$ with $|i|\le k$,
$$\left\|\frac{1}{|S_i|}\sum_{x\in S_i}(\mathrm{proj}_{\perp w}(x))(\mathrm{proj}_{\perp w}(x))^{T}-I_{d-1}\right\|_{\mathrm{op}}\le0.1.$$
Overall, this allows us to conclude that with probability at least $1-\delta$ the tester accepts.

We now present the proof of Theorem 5.3. In the proof of Theorem 5.1, when the target distribution is the standard Gaussian in $d$ dimensions, we may apply Proposition D.1 (and run the corresponding tester) instead of Proposition 4.4, in order to ensure that our list contains a vector $w$ with
$$\mathbb{P}_{D_{XY}}[y\ne\mathrm{sign}(\langle w,x\rangle)]\le\mathbb{P}_{D_{XY}}[y\ne\mathrm{sign}(\langle w^*,x\rangle)]+\mathbb{P}_{D_{XY}}[\mathrm{sign}(\langle w^*,x\rangle)\ne\mathrm{sign}(\langle w,x\rangle)]\le\mathrm{opt}+O(\theta),$$
where $\measuredangle(w,w^*)\le\theta:=c_2\sigma$ and $\sigma$ is such that $c_1\sigma-\Theta(\epsilon)\le\mathrm{opt}\le c_1\sigma$, which gives the desired $O(\mathrm{opt})+\epsilon$ bound. To obtain a value of $\sigma$ with this property, we once again discretize the space $(0,1]$ of possible values for $\sigma$, this time up to accuracy $\Theta(\epsilon)$.

Figure 4: The (normalized) first two derivatives of the function $\ell_\sigma$ used to define the nonconvex surrogate loss $L_\sigma$. The normalization is appropriate since $\ell_\sigma'$ and $\ell_\sigma''$ are homogeneous in $1/\sigma$ and $1/\sigma^2$ respectively; in particular, $\ell_\sigma'\le O(1/\sigma)$ and $|\ell_\sigma''|\le O(1/\sigma^2)$ everywhere.
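Finally, as a concrete illustration of the tester from Proposition D.1, here is a simplified Python sketch. The function name is ours, the truncation level follows the reconstruction above, and the shortcut that skips strips with too few points is a simplification of the proposition's sample-size requirement.

```python
import numpy as np

def strip_tester(X, w, theta):
    """Simplified sketch of the tester from Proposition D.1.

    Splits the sample according to <w, x> into strips of width theta and, inside
    each strip, checks that the empirical covariance of the projection onto the
    hyperplane orthogonal to w is close to the identity; it also checks that no
    strip is too heavy and that the tail beyond the truncation level is light.
    """
    n, d = X.shape
    w = w / np.linalg.norm(w)
    proj = X @ w                                       # <w, x> for every sample

    # Orthonormal basis B (d x (d-1)) of the hyperplane orthogonal to w.
    B = np.linalg.svd(np.eye(d) - np.outer(w, w))[0][:, : d - 1]
    Z = X @ B                                          # projections orthogonal to w

    cutoff = np.sqrt(2 * np.log(1 / theta))            # truncation level from the text
    n_strips = int(np.ceil(cutoff / theta))

    for i in range(-n_strips, n_strips):
        in_strip = (proj >= i * theta) & (proj < (i + 1) * theta)
        if in_strip.mean() > 2 * theta:                # step (3b): strip too heavy
            return False
        m = int(in_strip.sum())
        if m < 10 * d:                                 # sketch-only shortcut: the proposition assumes
            continue                                   # |S| is large enough for every relevant strip
        cov = Z[in_strip].T @ Z[in_strip] / m
        if np.max(np.abs(np.linalg.eigvalsh(cov - np.eye(d - 1)))) > 0.1:
            return False                               # step (3c): covariance far from identity

    if (np.abs(proj) > cutoff).mean() > 5 * theta:     # step (4): tail check
        return False
    return True
```

In the overall algorithm, this check would be run once per candidate direction produced by the $\sigma$-grid, with $\theta$ set to $c_2\sigma$.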