Tester-Learners for Halfspaces: Universal Algorithms

Aravind Gollakota (Apple, aravindg@cs.utexas.edu)
Adam R. Klivans (UT Austin, klivans@cs.utexas.edu)
Konstantinos Stavropoulos (UT Austin, kstavrop@cs.utexas.edu)
Arsen Vasilyan (MIT, vasilyan@mit.edu)

We give the first tester-learner for halfspaces that succeeds universally over a wide class of structured distributions. Our universal tester-learner runs in fully polynomial time and has the following guarantee: the learner achieves error O(opt) + ϵ on any labeled distribution that the tester accepts, and moreover, the tester accepts whenever the marginal is any distribution that satisfies a Poincaré inequality. In contrast to prior work on testable learning, our tester is not tailored to any single target distribution but rather succeeds for an entire target class of distributions. The class of Poincaré distributions includes all strongly log-concave distributions, and, assuming the Kannan–Lovász–Simonovits (KLS) conjecture, includes all log-concave distributions. In the special case where the label noise is known to be Massart, our tester-learner achieves error opt + ϵ while accepting all log-concave distributions unconditionally (without assuming KLS). Our tests rely on checking hypercontractivity of the unknown distribution using a sum-of-squares (SOS) program, and crucially make use of the fact that Poincaré distributions are certifiably hypercontractive in the SOS framework.

1 Introduction

In this paper we study the recent model of testable learning, due to Rubinfeld and Vasilyan [RV23]. Testable learning addresses a key issue with essentially all known algorithms for the basic problem of agnostic learning, in which a learner is required to produce a hypothesis competitive with the best-fitting hypothesis in a concept class C. The issue is that these algorithms make distributional assumptions (such as Gaussianity or log-concavity) that are in general hard to verify. This means that in the absence of any prior information about the distribution or the optimal achievable error, it can be hard to check that the learner has even succeeded at meeting its guarantee.

In the testable learning model, the learning algorithm, or tester-learner, is given access to labeled examples from an unknown distribution and may either reject or accept the unknown distribution. If it accepts, it must successfully produce a near-optimal hypothesis. Moreover, it is also required to accept whenever the unknown distribution truly has a certain target marginal D*. Work of [RV23, GKK23, GKSV23, DKK+23] provided tester-learners for a range of basic classes (including halfspaces, intersections of halfspaces, and more) with respect to particular target marginals D* (such as the standard Gaussian). All of these algorithms, however, have the shortcoming that they are closely tailored to the particular target marginal D* that is chosen. Indeed, their tests would reject many well-behaved distributions that are appreciably far from D*. A highly natural question from both a theoretical and a practical perspective is: can we design tester-learners that accept a wide class of distributions simultaneously, without being tailored to any particular one?

In this work we answer this question in the affirmative by introducing and studying universally testable learning. We formally define this model as follows.
Definition 1.1 (Universally Testable Learning). Let C be a concept class mapping R^d to {±1}. Let D be a family of distributions over R^d. Let ϵ, δ > 0 be parameters, and let ψ : [0, 1] → [0, 1] be some function. We say C can be universally testably learned w.r.t. D up to error ψ(opt) + ϵ with failure probability δ if there exists a tester-learner A meeting the following specification. For any distribution DXY on R^d × {±1}, A takes in a large sample S drawn from DXY, and either rejects S or accepts and produces a hypothesis h : R^d → {±1} such that:

(a) (Soundness.) With probability at least 1 − δ over the sample S the following is true: if A accepts, then the output h satisfies P_{(x,y)~DXY}[h(x) ≠ y] ≤ ψ(opt(C, DXY)) + ϵ, where opt(C, DXY) = inf_{f∈C} P_{(x,y)~DXY}[f(x) ≠ y].

(b) (Completeness.) Whenever the marginal of DXY lies within D, A accepts with probability at least 1 − δ over the sample S.

In this terminology, the original definition of testable learning reduces to the special case where D = {D*}. We stress that while the prior work of [GKK23] allowed D* to be, say, any fixed strongly log-concave distribution, their tester-learners are still tailored to the particular D* that is selected. This is because their tests rely on checking that the unknown distribution closely matches moments with D*. By contrast, a universal tester-learner must accept all marginals in a family D.

Our main contribution in this paper is the first universal tester-learner for the class of halfspaces with respect to a broad family of structured continuous distributions. This family is the set of all distributions with bounded Poincaré constant (see Definition 2.4) and some mild concentration and anti-concentration properties (see Definition 2.1). It captures all strongly log-concave distributions, and in fact, under the well-known Kannan–Lovász–Simonovits (KLS) conjecture (see Conjecture 2.6), it captures all log-concave distributions as well. Our universal tester-learner significantly generalizes the main result of [GKSV23], who showed comparable guarantees only for the case where the target marginal is the standard Gaussian.

Theorem 1.2 (Universal Tester-Learner for Halfspaces; formally stated as Theorem 4.1). Let C be the class of origin-centered halfspaces over R^d. Let D be the class of Θ(1)-nice and Θ(1)-Poincaré distributions (see Definitions 2.1 and 2.4), which includes all isotropic strongly log-concave and, under KLS, all isotropic log-concave distributions. Then C can be universally testably learned w.r.t. D up to error O(opt) + ϵ in poly(d, 1/ϵ) time and sample complexity.

A special and well-studied case of interest is when the label noise follows the Massart model, i.e., the label of every example is flipped by an adversary with probability at most η. In this case we are able to handle a considerably larger class D while also providing a stronger guarantee.

Theorem 1.3 (Universal Tester-Learner for Massart Halfspaces; formally stated as Theorem 4.1). Let C be the class of origin-centered halfspaces over R^d. Let D be the class of poly(d)-nice and poly(d)-Poincaré distributions, which includes all isotropic log-concave distributions (unconditionally). Suppose the label noise follows the Massart model with noise rate at most η < 1/2. Then C can be universally testably learned w.r.t. D up to error opt + ϵ in poly(d, 1/ϵ, 1/(1 − 2η)) time and sample complexity.

Technical Overview. We first describe the key reasons why prior tester-learners were tailored to a specific target D*.
All known polynomial-time algorithms for agnostically learning halfspaces up to error O(opt) + ϵ require some concentration and anti-concentration properties from the input marginal distribution (encapsulated e.g. in Definition 2.1). While concentration is relatively straightforward to check (e.g. by checking that the moments do not grow at too fast a rate), the key challenge in designing tester-learners for halfspaces is to check anti-concentration. All prior tester-learners [RV23, GKK23, GKSV23, DKK+23] use the heavy machinery of moment-matching to achieve this. This approach relies on establishing structural properties of the following type: if D* is a well-behaved distribution (e.g. a strongly log-concave distribution), and D approximately matches D* in its low-degree moments, then D is also well-behaved (in particular, anti-concentrated). A canonical statement of such a property is the main pseudorandomness result of [GKK23] (see Theorem 5.6 therein), which establishes that approximate moment-matching fools functions of a constant number of halfspaces. Applying this property inherently requires comparing the low-degree moments of D with those of D*. Such tests do (implicitly) succeed universally for the class of all distributions that match low-degree moments with D* (e.g., if D* is the uniform distribution over the hypercube, moment matching would accept all k-wise independent distributions). Definition 1.1, however, seeks a far broader kind of universality. Our tests are not tailored to a single target in any way, and are intended to succeed over practical classes of distributions that are commonly considered in learning theory (e.g., log-concave distributions).¹

We overcome this hurdle and design a conceptually simple way of checking anti-concentration without requiring the hammer of moment-matching. Our approach follows and improves on the roadmap used by [GKSV23] to design efficient tester-learners for halfspaces using non-convex SGD (building on [DKTZ20a, DKTZ20b]). Let us outline this approach at a high level (a more detailed technical overview may be found in [GKSV23, Sec 1.2]). The tester-learner first computes a stationary point w of a certain smooth version of the ramp loss, a surrogate for the 0-1 loss. Let w* be any solution achieving 0-1 error opt. The tester-learner now checks distributional properties of the unknown marginal D that ensure that w is close in angular distance to w* (specifically, they ensure the contrapositive, namely that any w that has large gradient norm must have large angle with w*). By a more careful analysis of the gradient norm than in [GKSV23] (see Proposition 4.2), we are able to reduce to showing the following weak anti-concentration property. Let v denote any unit vector orthogonal to w, and let DT denote D restricted to the band T = {x : |⟨w, x⟩| ≤ σ} (where the width σ is carefully selected according to certain constraints). Then the property we need is that P_{x~DT}[|⟨v, x⟩| ≥ Θ(1)] ≥ Θ(1). Our key observation is that the classical Paley–Zygmund inequality applied to the random variable Z = ⟨v, x⟩², where x ~ DT, already gives us the following type of anti-concentration:

P_{x~DT}[ Z > E[Z]/2 ] ≥ E[Z]² / (4 E[Z²]).

This turns out to suffice for our purposes provided we can show a hypercontractivity property for Z, namely that E[Z²] ≤ Θ(1)·E[Z]² (as well as that E[Z] = Θ(1), which is just a second moment constraint). Our main algorithmic idea is to use a sum-of-squares (SOS) program to check hypercontractivity of the random variable Z.
To do so, we crucially leverage a result due to [KS17] stating that any D that has bounded Poincaré constant is certifiably hypercontractive in the SOS framework (and it turns out this extends to DT as well). This means that we can run a certain polynomial-time semidefinite program that checks hypercontractivity of Z over the sample, and whenever D is in fact Poincaré, we are guaranteed that the test will pass with high probability (see Proposition 3.5). This is sufficient to ensure that the stationary point w we have computed is indeed close in angular distance to w*.

In order to finally arrive at our main results, we need to run further tests which ensure that the disagreement between our computed w and any (unknown) optimum w* is bounded by the angle between them, i.e., P_{x~D}[sign(⟨w, x⟩) ≠ sign(⟨w*, x⟩)] ≤ O(∠(w, w*)) (see Lemma 3.1). This in turn guarantees that w has error O(opt) + ϵ. We stress that while [GKSV23] introduced similar testers for the special case of Gaussian marginals, our tests succeed universally with respect to a broad family of distributions including some heavy-tailed distributions (see Definition 2.1). From a technical perspective, prior to our work, such tests either produced a suboptimal bound, or required estimating the operator norms of a polynomial number of random matrices formed using rejection sampling. We significantly simplify this approach by showing that it is sufficient to estimate the operator norm of a single random matrix. Finally, to obtain our improved results for the Massart setting, it turns out that the proof admits certain simplifications that guarantee final error opt + ϵ while also allowing a wider range of Poincaré distributions.

¹One may wonder if it is possible to test whether the low-degree moments of the input marginal D match any distribution in a family D (e.g., all strongly log-concave distributions) without directly comparing to a specific D*. This amounts to testing whether a given (approximate) low-degree moment tensor lies within a large set of target low-degree moment tensors, and would indeed suffice for universally testable learning. This general problem, however, seems highly challenging to solve directly.

Related Work. There is a large body of work on agnostic learning algorithms for halfspaces that run in fully polynomial time. We briefly mention only those that are most closely related to our work; please see [BH21] for a survey as well as [GKSV23, Sec 1.1] for further related work. Following a long line of work on distribution-specific agnostic learners for halfspaces [KLS09, ABL17, Dan15, BZ17, YZ17, Zha18, ZSA20, ZL21], the work of [DKTZ20a] introduced a particularly simple approach for the Massart setting, based solely on non-convex SGD. This work, which sets the template that our approach also follows, achieved the information-theoretically optimal error of opt + ϵ for origin-centered Massart halfspaces over a wide range of structured distributions (and was later extended to general halfspaces by [DKK+22]). The non-convex SGD approach was then generalized by [DKTZ20b] to show an O(opt) + ϵ guarantee for the fully agnostic setting. The testable learning model was introduced by the work of [RV23], who showed a tester-learner for halfspaces achieving error opt + ϵ in time d^{Õ(1/ϵ⁴)} for the case where the target marginal is Gaussian.
Subsequently, [GKK23] provided a general algorithmic framework based on moment-matching for this problem, and showed a tester-learner for halfspaces requiring only time d^{Õ(1/ϵ²)} with respect to any fixed strongly log-concave marginal (matching known lower bounds for ordinary agnostic learning over Gaussian marginals [GGK20, DKZ20, DKPZ21, DKR23]). The work most closely related to the present one is that of [GKSV23] (see also [DKK+23]), who showed fully polynomial-time tester-learners for halfspaces achieving error O(opt) + ϵ in the agnostic setting and opt + ϵ in the Massart setting for the case where the target marginal is the Gaussian. As detailed in the technical overview, their tests rely crucially on moment-matching and are tailored to a specific target marginal. By contrast, our tests check hypercontractivity using an SOS program and succeed universally for a wide class of certifiably hypercontractive distributions.

Certifying distributional properties such as hypercontractivity is an important aspect of a large body of work on robust algorithmic statistics using the SOS framework. We will not attempt to summarize this literature here and direct the reader to [KS17, BK21] for overviews of related work, as well as to [FKP+19] for a textbook treatment. The notion of certifiable anti-concentration has also been studied (see e.g. [KKK19, RY20, BK21]), but it turns out not to be directly useful for our purposes, as it is only known to hold for distributions satisfying very strong conditions such as rotational symmetry.

Limitations and Further Work. Open directions in testable learning (and universally testable learning) include the design of (efficient) tester-learners for concept classes other than the class of halfspaces, e.g., functions of halfspaces or neurons with other activations (like ReLU or sigmoid).

2 Preliminaries

Notation and Terminology. For what follows, we consider DXY to be an unknown joint distribution over X × Y from which we receive independent samples, and its marginal on X will be denoted by DX. In particular X = R^d, and labels will lie in Y = {±1}. We will use C to denote a concept class mapping R^d to {±1}, which throughout this paper will be the class of halfspaces or functions of halfspaces over R^d. We use opt(C, DXY) to denote the optimal error inf_{f∈C} P_{(x,y)~DXY}[f(x) ≠ y], or just opt when C and DXY are clear from context. We recall that in the Massart noise model, the labels satisfy P_{y~DXY|x}[y ≠ sign(⟨w*, x⟩) | x] = η(x), with η(x) ≤ η < 1/2 for all x. When we have adversarial noise (i.e., when we are in the agnostic model), the labels can be completely arbitrary. In both cases, the goal is to produce a hypothesis whose error is competitive with opt. We use E to denote the expectation of a random variable in brackets (or, correspondingly, P for the probability of an event), either over the unknown joint distribution or over the empirical distribution with respect to a sample S (e.g., E_{Z~S}[f(Z)] = (1/|S|) Σ_{Z∈S} f(Z)).

Definitions and Distributional Assumptions. For the problem of learning halfspaces in the agnostic and Massart noise models, all known polynomial-time algorithms that achieve computationally optimal guarantees require that the marginal distribution have at least the following nice properties, previously defined by, e.g., [DKTZ20b].

Definition 2.1 (Nice Distributions). For a given constant λ ≥ 1, we consider the class of λ-nice distributions over R^d to be the distributions that satisfy the following properties:
1. For any unit vector v in R^d, the distribution satisfies E[⟨v, x⟩²] ∈ [1/λ, λ]. (bounded spectrum)

2. For any two-dimensional subspace V, the corresponding marginal density q_V(x) satisfies q_V(x) ≥ 1/λ for any ∥x∥₂ ≤ 1/λ. (anti-anti-concentration)

3. For any two-dimensional subspace V, the corresponding marginal density q_V(x) satisfies q_V(x) ≤ Q(∥x∥₂) for some function Q : R₊ → R₊ such that sup_{r≥0} Q(r) ≤ λ and also ∫₀^∞ r^k Q(r) dr ≤ λ for each k = 1, 3, 5. (anti-concentration and concentration)

In the testable learning framework, however, the corresponding results provide testable guarantees with respect to target marginals that are isotropic strongly log-concave [GKSV23], which is a strictly stronger condition than the one of Definition 2.1 (see Proposition 2.3 below). We now provide the standard definition of (strongly) log-concave distributions.

Definition 2.2 ((Strongly) Log-Concave Distributions [SW14]). We say that a distribution over R^d is (β-strongly) log-concave if its density can be written as e^{−φ}, where φ is a (β-strongly) convex function on R^d (for some β > 0).

Proposition 2.3 (Log-Concave Distributions are Nice [LV07]). There exists a universal constant λ ≥ 1 such that any isotropic log-concave distribution is λ-nice.

In this work, we provide universally testable guarantees with respect to the class of nice distributions with bounded Poincaré constant (see Definition 2.4 below).

Definition 2.4 (Poincaré Distributions). For a given value γ > 0, we say that a distribution over R^d is γ-Poincaré if Var(f(x)) ≤ γ·E[∥∇f(x)∥₂²] for any differentiable function f : R^d → R.

Although it is not clear whether one can efficiently obtain testable guarantees for the problem of learning noisy halfspaces under nice marginals (which is known to be an efficiently solvable problem in the non-testable setting [DKTZ20a, DKTZ20b]), by restricting our attention to nice distributions that, additionally, have bounded Poincaré constant, we obtain efficient learning results, even in the universally testable setting. Our results are strictly stronger than the ones in [GKSV23], since we capture isotropic strongly log-concave distributions universally, due to Proposition 2.3 and the fact that strongly log-concave distributions are also Poincaré, as per Proposition 2.5 below.

Proposition 2.5 (Strongly Log-Concave Distributions are Poincaré, [SW14, Proposition 10.1]). Any (1/γ)-strongly log-concave distribution is γ-Poincaré.

Furthermore, under a long-standing conjecture about the geometry of convex bodies [KLS95], our results capture the family of all isotropic log-concave distributions.

Conjecture 2.6 (Kannan–Lovász–Simonovits Conjecture [KLS95]; reformulation from [LV18]). There is a universal constant γ > 0 for which any isotropic log-concave distribution is γ-Poincaré.

3 Universal Testers

In this section, we present two basic testers that serve as the building blocks of the universal tester-learners we provide in the next section. The testers in this section might be of independent interest; their appeal is that they succeed even when the input distribution is specified only up to certain bounds on a few of its statistics. In fact, the family of distributions for which each such tester succeeds is of infinite size, even non-parametric.
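Before stating the testers, the following minimal numpy sketch (our own illustration, not one of the paper's algorithms) shows the kind of statistics they examine for a single known pair of orthogonal directions w, v: the mass of the band |⟨w, x⟩| ≤ σ and the conditional second and fourth moments of ⟨v, x⟩, which feed into the Paley–Zygmund bound used in Section 3.2. The actual testers must certify such bounds simultaneously over all unknown directions v orthogonal to w, which is exactly what the spectral and sum-of-squares relaxations below are for; the function name and thresholds here are ours.

```python
# Illustrative sketch only (not one of the paper's testers): for a *known* pair
# of orthogonal unit vectors w, v, estimate the band mass and the conditional
# second/fourth moments of <v, x>, and read off the Paley-Zygmund lower bound.
import numpy as np

def band_statistics(X, w, v, sigma):
    """X: (n, d) sample matrix; w, v: orthogonal unit vectors; sigma: band width."""
    in_band = np.abs(X @ w) <= sigma           # indicator of the band T
    if not in_band.any():                      # degenerate case: empty band
        return 0.0, 0.0, 0.0, 0.0
    band_mass = in_band.mean()                 # estimate of P[|<w,x>| <= sigma]
    Z = (X[in_band] @ v) ** 2                  # Z = <v,x>^2 conditioned on the band
    m1, m2 = Z.mean(), (Z ** 2).mean()         # E[Z] and E[Z^2] over the band
    # Paley-Zygmund: P[Z > E[Z]/2 | band] >= E[Z]^2 / (4 E[Z^2]).
    pz_lower_bound = m1 ** 2 / (4 * m2)
    return band_mass, m1, m2, pz_lower_bound

# Example usage with placeholder thresholds c1, c2 (illustrative only):
# mass, m1, m2, pz = band_statistics(X, w, v, sigma)
# weakly_anticoncentrated = (m1 >= c1) and (m2 <= c2 * m1 ** 2)  # hypercontractivity of Z
```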
3.1 Universal Tester for Bounding Local Halfspace Disagreement First, we present a universal tester that checks, given a parameter vector w, whether a set of samples S is such that bounding the angular distance of w from an optimum parameter vector, implies that the corresponding halfspace disagrees with the (unknown) optimum halfspace only on a bounded fraction of points in S. This property ensures that if w is close to the optimum parameter vector, then it is also an approximate empirical risk minimizer. The tester universally accepts samples from nice distributions with high probability (Definition 2.1). Lemma 3.1 (Universally Testable Bound for Local Halfspace Disagreement). Let DXY be a distribution over Rd { 1}, w Sd 1, θ (0, π/4], λ 1 and δ (0, 1). Then, for a sufficiently large constant C, there is a tester that given δ, θ, w and a set S of samples from DX with size at least C d4 θ2δ , runs in time poly d, 1 δ and satisfies the following specifications: (a) If the tester accepts S, then for every unit vector w Rn satisfying (w, w ) θ we have P x S[sign( w , x ) = sign( w, x )] C θ λC (b) If the distribution DX is λ-nice, the tester accepts S with probability 1 δ. The proof of Lemma 3.1 simplifies and improves the proof of a similar but weaker result in [GKSV23] (see their Proposition 4.5). The initial tester exploited the observation that the probability of disagreement between two halfspaces can be upper bounded by a sum of products, where each product has two terms: one corresponding to the probability of falling in a (known) strip orthogonal to w and one corresponding to the probability of having large enough inner product with some unknown vector orthogonal to w, conditioned in the (known) strip. The first term can be controlled by estimating the probability of falling in a (known) strip, while the second follows by Chebyshev s inequality, after estimating the largest eigenvalue of the covariance matrix conditioned in the known strip. This approach introduces a number of complications, including the fact that conditioning requires rejection sampling, which, in turn requires a lower bound on the probability of falling inside each strip. We propose a simpler tester that controls all of the terms of the sum simultaneously by estimating the largest eigenvalue of a single covariance matrix (without conditioning). Upper and lower bounds on the eigenvalues of random symmetric matrices can be universally tested with testers that are guaranteed to accept when the elements of the matrix have bounded second moments (spectral tester of Proposition A.2). We present our full proof in Appendix B.1. 3.2 Universally Testable Weak Anti-Concentration We now provide an important universal tester, which ensures that for a given vector w, a sample set S and any unknown unit vector v orthogonal to w, among the samples falling within a (known) strip orthogonal to w, at least a constant fraction is absolutely correlated with v by a constant. In other words, the tester ensures that the conditional empirical distribution is weakly anti-concentrated in every direction. The tester universally accepts nice distributions that have bounded Poincaré constant. Lemma 3.2 (Universally Testable Weak Anti-Concentration). Let D be a distribution over Rd. Then, there is a universal constant C > 0 and a tester that given a unit vector w Rd, δ (0, 1), γ > 0, λ 1, σ 1 2λ and a set S of i.i.d. 
samples from D with size at least C d4 σ2δ log(d)λC, runs in time poly(d, λ, 1 γ )) and satisfies the following specifications (a) If the tester accepts S, then for any unit vector v Rd with v, w = 0 we have | v, x | 1 CλC | w, x | σ 1 CλCγ4 (b) If D is γ-Poincaré and λ-nice, then the tester accepts S with probability at least 1 δ. The proof of Lemma 3.2 is based on a simple fact from probability that is true for any non-negative random variable and ensures that the mass assigned to the tails is lower bounded by the ratio of the square of its expectation to the second moment. Proposition 3.3 (Paley Zygmund Inequality). For any non-negative random variable Z, we have P[Z > E[Z]/2] 1 In the special case where Z follows the distribution of v, x 2 conditioned on | w, x | σ for some unitary orthogonal vectors v, w, some σ > 0 and some random variable x whose distribution is, say, 1-nice (see Definition 2.1), one can show that E[Z] is lower bounded by a constant and E[Z2] is upper bounded by another constant, so Z assigns a non-trivial mass to a set that is bounded away from zero. This property is useful in the context of learning noisy halfspaces, as we show in the following section (see Proposition 4.2 and Lemma 4.3). However, testing algorithms that check whether such a property holds for given w and σ, are guaranteed to succeed when the marginal distribution has, additionally, bounded Poincaré constant. The main part of the proof that requires a bounded Poincaré constant, is testing whether E[Z2] is bounded uniformly over the set of unit vectors v orthogonal to w, since Z2 = v, x 4, where v is unknown. We use the following result from [KS17]. Proposition 3.4 (Certifiable Hypercontractivity of Poincaré Distributions, Theorem 4.1 in [KS17]). Let δ (0, 1), γ > 0 and let D be a γ-Poincaré distribution over Rd. Let S be a set of independent samples from D with size at least (2d log(4d/δ))4. Consider the constrained maximization problem arg max v 2=1 E x S[ v, x 4] (3.1) Then, the optimum solution of the degree-4 sum-of-squares relaxation of the problem (3.1) has value at most Cγ4 for some universal constant C, with probability at least 1 δ over the sample S. Using Proposition 3.4, we are able to provide a universal tester for bounding the empirical fourth moments. The tester solves an appropriate SDP relaxation of the (hard) problem [HL13] of finding the direction with maximum fourth moment and is guaranteed to succeed if x has Poincaré parameter bounded by a known value. Proposition 3.5 (Hypercontractivity Tester). Let D be a distribution over Rd. Then, there is a tester that given δ (0, 1), γ > 0 and a set S of i.i.d. samples from D with size at least (2d log(4d/δ))4, runs in time poly(d, log 1 γ ) and satisfies the following specifications (a) If the tester accepts S, then for any unit vector v Rd we have E x S[ v, x 4] C γ4 , where C is some universal constant. (b) If the distribution D is γ-Poincaré, then the tester accepts S with probability at least 1 δ. Proof. The tester does the following: 1. Solves a degree-4 sum-of-squares relaxation of problem (3.1) up to accuracy γ4. (For a formal definition of the relaxed problem, see Problem (2.3) in [KS17].) 2. If the solution has value larger than (C 1)γ4, then reject. Otherwise accept. The computational complexity of the tester is poly(|S|, d, log 1 γ ), since the problem it solves can be written as a semidefinite program [Sho87, Par00, Nes00, Las01]. 
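For intuition, the following is a simplified cvxpy sketch of an SDP relaxation in the spirit of step 1: it maximizes a pseudo-moment surrogate of (v ⊗ v)(v ⊗ v)ᵀ subject to positive semidefiniteness, full symmetry of the fourth-order pseudo-moments, and a normalization coming from the sphere constraint. This is an illustration under our own simplifications, not the exact degree-4 sum-of-squares program of [KS17] (their Problem (2.3)), which may carry further constraints; the acceptance threshold is a placeholder and the O(d⁴) constraint loop is naive.

```python
# Simplified sketch of an SDP relaxation in the spirit of the degree-4 SOS
# relaxation of problem (3.1); the tester actually solves Problem (2.3) of
# [KS17].  Assumes cvxpy and numpy; threshold and names are ours.
import itertools
import numpy as np
import cvxpy as cp

def hypercontractivity_check(X, threshold):
    """X: (n, d) sample matrix.  Accept iff the relaxation value is <= threshold."""
    n, d = X.shape

    def idx(i, j):
        return i * d + j

    # A[(i,j),(k,l)] = empirical E[x_i x_j x_k x_l] arranged as a d^2 x d^2 matrix,
    # so that E_S[<v,x>^4] = (v (x) v)^T A (v (x) v).
    Y = np.einsum('ni,nj->nij', X, X).reshape(n, d * d)   # each row is x tensor x
    A = (Y.T @ Y) / n

    # M plays the role of a pseudo-moment matrix standing in for (v(x)v)(v(x)v)^T.
    M = cp.Variable((d * d, d * d), symmetric=True)
    constraints = [M >> 0]

    # Full symmetry: the entry for ((i,j),(k,l)) depends only on the multiset {i,j,k,l}.
    seen = {}
    for i, j, k, l in itertools.product(range(d), repeat=4):
        key = tuple(sorted((i, j, k, l)))
        if key in seen:
            i0, j0, k0, l0 = seen[key]
            constraints.append(M[idx(i, j), idx(k, l)] == M[idx(i0, j0), idx(k0, l0)])
        else:
            seen[key] = (i, j, k, l)

    # Normalization: pseudo-expectation of ||v||^4 equals 1 (from the unit-sphere constraint).
    constraints.append(sum(M[idx(i, i), idx(j, j)] for i in range(d) for j in range(d)) == 1)

    value = cp.Problem(cp.Maximize(cp.sum(cp.multiply(A, M))), constraints).solve()
    return value <= threshold
```

Any unit vector v yields a feasible point M = (v ⊗ v)(v ⊗ v)ᵀ of this program, so its optimal value upper bounds the empirical fourth moment in every direction, mirroring the soundness argument that follows.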
If the tester accepts S, then we know that the optimal solution of the relaxed problem is at most Cγ4 and we also know that any solution of the initial problem (3.1) has value at most equal to the value of the relaxation. Therefore E[ v, x 4] Cγ4, for any v Sd 1. On the other hand, if the true distribution D is γ-Poincaré, then, with probability at least 1 δ, we have that the solution found in step 3.2 has, with probability at least 1 δ, value at most C γ4 for some universal constant C , due to Proposition 3.4. In order to ensure that the tester will accept with probability at least 1 δ, it suffices to pick C = C + 1. We provide the full proof of Lemma 3.2, in Appendix B.2. The tests we perform include a spectral tester that accepts with high probability when the distribution of x is nice (similar to the spectral tester used for Lemma 3.1), a tester of the probability that | w, x | σ and the hypercontractivity tester of Proposition 3.5. 4 Universal Tester-Learners for Halfspaces In this section, we present our main result on universally testable learning of halfspaces. Theorem 4.1 (Efficient Universal Tester-Learner for Halfspaces). Let DXY be any distribution over Rd { 1}. Let C be the class of origin centered halfspaces in Rd. Then, for any λ 1, γ > 0, ϵ > 0 and δ (0, 1), there exists an universal tester-learner for C w.r.t. the class of λ-nice and γ-Poincaré marginals up to error poly(λ) (1+γ4) opt+ϵ, where opt = minw Sd 1 PDXY[y = sign( w, x )], and error probability at most δ, using a number of samples and running time poly(d, λ, γ, 1 Moreover, if the noise is Massart with given rate η < 1/2, then the algorithm achieves error opt + ϵ with time and sample complexity poly(d, λ, γ, 1 ϵ , 1 1 2η, log 1 Our proof follows a surrogate loss minimization approach that has been used for classical learning of noisy halfspaces [DKTZ20a, DKTZ20b] as well as classical (non-universal) testable learning [GKSV23]. In particular, the algorithm runs Projected Stochastic Gradient Descent (see A.5) on a surrogate loss whose stationary points are shown to be close to optimum parameter vectors under certain distributional assumptions. In the regular testable learning setting, given a stationary point, the above property can be tested with respect to any (fixed and known) target strongly log-concave marginal as shown by [GKSV23]. For such a stationary point, more tests are used in order to ensure bounds on local halfspace disagreement. We provide some delicate refinements of the proofs in [GKSV23] that enable us to substitute their testers with the universal testers of Section 3. We use the following surrogate loss function which was also used in [GKSV23]. Lσ(w; DXY) = E (x,y) DXY In Equation (4.1), the function ℓσ is a smoothed version of the step function as in Proposition A.4. In order to analyze the properties of the stationary points of the surrogate loss, we provide the following refinement of results implicit in [GKSV23, DKTZ20a, DKTZ20b]. We show that the gradient of the surrogate loss is lower bounded by the difference between certain quantities that are controlled by the marginal distribution (see Figure 1). We stress that we do not use any assumptions for the marginal distribution in this step. Prior work included similar bounds, but the corresponding quantities were different. We need to be more precise and provide the following result, whose proof is based on two dimensional geometry and can be found in Appendix C.1. Proposition 4.2 (Modification from [GKSV23, DKTZ20a, DKTZ20b]). 
For a distribution DXY over Rd { 1} let opt be the minimum error achieved by some origin-centered halfspace and w Sd 1 a corresponding vector. Consider Lσ as in Equation (4.1) for σ > 0 and let η < 1/2. Let w Sd 1 with (w, w ) = θ < π 2 and v span(w, w ) such that v, w = 0 and v, w < 0. Then, for some universal constant C > 0 and any α σ 2 tan θ we have w Lσ(w; DXY) 2 A1 A2 A3, where A1 = α C σ P h | v, x | α and | w, x | σ A2 = C tan θ P h | w, x | σ i and A3 = C E h v, x 2 1{| w,x | σ Moreover, if the noise is Massart with rate η, then w Lσ(w; DXY) 2 (1 2η)A1 A2. If the marginal distribution is nice, then the quantities A1, A2 and A3 are such that σ can be chosen accordingly so that stationary points of the surrogate loss (or their inverses) are close to some optimum vector (see Proposition A.3 for properties of nice distributions). We use some simple tests (e.g., estimate the probability of falling in a strip, P[| w, x | σ/2] and appropriate spectral testers) as well as our universal tester for weak anti-concentration (see 3.2) to establish bounds on quantities A1, A2 and A3 which ensure that the desired property holds for a given vector w, under no distributional assumptions. The tester in the following result universally accepts nice distributions with bounded Poincaré parameter. The formal proof can be found in Appendix C.2. Lemma 4.3 (Universally Testable Structure of Surrogate Loss). Let DXY be any distribution over Rd { 1}. Consider Lσ as in Equation (4.1). Then, there is a universal constant C > 0 and a tester that given a unit vector w Rd, δ (0, 1), η < 1/2, γ > 0, λ 1, σ 1 CλC and a set S of i.i.d. samples from DXY with size at least C d4 σ2δ log(d)λC, runs in time poly(d, λ, 1 γ )) and satisfies the following specifications (a) If the tester accepts S, then, the following statements are true for the minimum error opt S achieved by some origin-centered halfspace on S and the optimum vector w S Sd 1 If the noise is Massart with associated rate η and w Lσ(w; S) 2 1 2η CλCγ4 then either (w, w S) CλC(1+γ4) 1 2η σ or ( w, w S) CλC(1+γ4) If the noise is adversarial with opt S σ CλC and w Lσ(w; S) 2 < 1 CλCγ4 then either (w, w S) CλC(1 + γ4) σ or ( w, w S) CλC(1 + γ4) σ. (b) If the marginal DX is λ-nice and γ-Poincaré, then the tester accepts S with probability at least 1 δ. We now give the algorithm for δ 1/3 since we can reduce the probability of failure with repetition (repeat O(log 1 δ ) times, accept if the rate of acceptance is Ω(1) and output the halfspace achieving the minimum test error among the halfspaces returned). The algorithm receives λ 1, γ > 0, ϵ > 0 and η (0, 1/2) {1} (say η = 1 when we are in the agnostic case) and does the following for some appropriately large universal constants C1, C2 > 0. 1. First, initialize E = ϵ C1λC1 , and let Σ be a list of real numbers and A be a positive real number, where Σ and A are defined as follows. If η = 1, then Σ is an E C1λC1 -cover of the interval 0, 1 C1λC1 and A = 1 C1λC1γ4 . Otherwise, let Σ = E (1 2η) C1λC1(1+γ4) and A = 1 2η C1λC1γ4 . 2. Draw a set S1 of C2 λd γϵ C2 i.i.d. samples from DXY and run PSGD, as specified in Proposition A.5 with ϵ A, δ δ C1 on the loss Lσ for each σ Σ. 3. Form a list L with all the pairs of the form (w, σ) where w Sd 1 is some iterate of the PSGD subroutine performed on Lσ. 4. Draw a fresh set S2 of C2 λd γϵ C2 i.i.d. samples from DXY and compute for each (w, σ) L the value w Lσ(w; S2) 2. If, for some σ Σ, w Lσ(w; S2) 2 > A for all (w, σ) L, then reject. 5. 
Update L by keeping for each σ Σ only one pair of the form (w, σ) for which we have w Lσ(w; S2) 2 A. 6. Run the following tests for each (w, σ) L. (This will ensure that part (a) of Lemma 4.3 holds for each of the elements of L, i.e., that any stationary point of the loss Lσ that lies in L is angularly close to the empirical risk minimizer2.). If P(x,y) S2[| w, x | σ 6 ] σ C1λC1 or P(x,y) S2[| w, x | σ 2 ] > σ C1λC1, then reject. Compute the (d 1) (d 1) matrices M + S2 and M S2 as follows:3 M + S2 = E (x,y) S2 h (proj w x)(proj w x)T 1{| w,x | σ M S2 = E (x,y) S2 h (proj w x)(proj w x)T 1{| w,x | σ Reject if the maximum singular value of M + S2 is greater than σ C1λC1. Reject if the minimum singular value of M S2 is less than σ C1λC1 . Run the hypercontractivity tester on S = {proj w x : (x, y) S2 and | w, x | σ}, i.e., solve an appropriate SDP (see Prop. 3.5 with γ γ, δ δ/C1) and reject if the solution is larger than a specified threshold. 7. Set θ = (1+γ4)σ Aγ4 , and run the following tests for each pair of the form (w, σ) and ( w, σ) where (w, σ) L. (This will ensure that part (a) of Lemma 3.1 is activated, i.e., that the distance of a vector from the empirical risk minimizer is an accurate proxy for the error of the corresponding halfspace.) If P(x,y) S2[| w, x | θ] > C1λC1θ then reject. Compute the (d 1) (d 1) matrix MS2 as follows:4 MS2 = E (x,y) S2 (proj w x)(proj w x)T (i 1)2 1{| w, x | [(i 1)θ, iθ)} If MS op > C1θλC1, then reject. 8. Otherwise, accept and output the vector w that achieves the smallest empirical error on S2 among the vectors in the list L. This concludes the algorithm. The full proof of Theorem 4.1 may be found in Appendix C.3. 2Or the same holds for the inverse vector. 3The operator proj w : Rd Rd 1 projects vectors on the hyperplane orthogonal to w. 4Note that only at most |S2| many terms below are non-zero, hence MS2 can be computed efficiently. Acknowledgments and Disclosure of Funding We wish to thank the anonymous reviewers of Neur IPS 2023 for their constructive feedback. Aravind Gollakota was at UT Austin while this work was done, supported by NSF award AF-1909204 and the NSF AI Institute for Foundations of Machine Learning (IFML). Adam R. Klivans was supported by NSF award AF-1909204 and the NSF AI Institute for Foundations of Machine Learning (IFML). Konstantinos Stavropoulos was supported by NSF award AF-1909204, the NSF AI Institute for Foundations of Machine Learning (IFML), and by scholarships from Bodossaki Foundation and Leventis Foundation. Arsen Vasilyan was supported in part by NSF awards CCF-2006664, DMS2022448, CCF-1565235, CCF-1955217, CCF-2310818, Big George Fellowship and Fintech@CSAIL. Part of this work was done while Arsen Vasilyan was visiting UT Austin. [ABL17] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM (JACM), 63(6):1 27, 2017. 1 [BH21] Maria-Florina Balcan and Nika Haghtalab. Noise in classification. Beyond the Worst Case Analysis of Algorithms, page 361, 2021. 1 [BK21] Ainesh Bakshi and Pravesh K Kothari. List-decodable subspace recovery: Dimension independent error in polynomial time. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1279 1297. SIAM, 2021. 1 [BZ17] Maria-Florina F Balcan and Hongyang Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. Advances in Neural Information Processing Systems, 30, 2017. 1 [Dan15] Amit Daniely. 
A ptas for agnostically learning halfspaces. In Conference on Learning Theory, pages 484 502. PMLR, 2015. 1 [DKK+22] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning general halfspaces with general massart noise under the gaussian distribution. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 874 885, 2022. 1 [DKK+23] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, Sihan Liu, and Nikos Zarifis. Efficient testable learning of halfspaces with adversarial label noise. ar Xiv preprint ar Xiv:2303.05485, 2023. 1, 1 [DKPZ21] Ilias Diakonikolas, Daniel M Kane, Thanasis Pittas, and Nikos Zarifis. The optimality of polynomial regression for agnostic learning under gaussian marginals in the sq model. In Conference on Learning Theory, pages 1552 1584. PMLR, 2021. 1 [DKR23] Ilias Diakonikolas, Daniel M Kane, and Lisheng Ren. Near-optimal cryptographic hardness of agnostically learning halfspaces and relu regression under gaussian marginals. ar Xiv preprint ar Xiv:2302.06512, 2023. 1 [DKTZ20a] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning halfspaces with massart noise under structured distributions. In Conference on Learning Theory, pages 1486 1513. PMLR, 2020. 1, 2, 4, 4, 4.2, A.5, C.1 [DKTZ20b] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Non-convex sgd learns halfspaces with adversarial label noise. Advances in Neural Information Processing Systems, 33:18540 18549, 2020. 1, 2, 2, 4, 4, 4.2, C.1 [DKZ20] Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. Near-optimal sq lower bounds for agnostically learning halfspaces and relus under gaussian marginals. Advances in Neural Information Processing Systems, 33:13586 13596, 2020. 1 [FKP+19] Noah Fleming, Pravesh Kothari, Toniann Pitassi, et al. Semialgebraic proofs and efficient algorithm design. Foundations and Trends in Theoretical Computer Science, 14(1-2):1 221, 2019. 1 [GGK20] Surbhi Goel, Aravind Gollakota, and Adam Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33:2147 2158, 2020. 1 [GKK23] Aravind Gollakota, Adam R Klivans, and Pravesh K Kothari. A moment-matching approach to testable learning and a new characterization of rademacher complexity. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023. To appear. 1, 1, 1 [GKSV23] Aravind Gollakota, Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. An efficient tester-learner for halfspaces. ar Xiv preprint ar Xiv:2302.14853, 2023. 1, 1, 1, 2, 2, 3.1, 4, 4, 4.2, A.4, A.5, C.1 [HL13] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are np-hard. Journal of the ACM (JACM), 60(6):1 39, 2013. 3.2 [KKK19] Sushrut Karmalkar, Adam Klivans, and Pravesh Kothari. List-decodable linear regression. Advances in neural information processing systems, 32, 2019. 1 [KLS95] Ravi Kannan, László Lovász, and Miklós Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13:541 559, 1995. 2, 2.6 [KLS09] Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(12), 2009. 1 [KS17] Pravesh K Kothari and Jacob Steinhardt. Better agnostic clustering via relaxed tensor norms. ar Xiv preprint ar Xiv:1711.07465, 2017. 1, 3.2, 3.4, 3.2 [Las01] Jean B Lasserre. 
New positive semidefinite relaxations for nonconvex quadratic programs. Advances in Convex Analysis and Global Optimization: Honoring the Memory of C. Caratheodory (1873 1950), pages 319 331, 2001. 3.2 [LV07] László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307 358, 2007. 2.3 [LV18] Yin Tat Lee and Santosh S Vempala. The Kannan-Lovász-Simonovits conjecture. ar Xiv preprint ar Xiv:1807.03465, 2018. 2.6 [Nes00] Yurii Nesterov. Squared functional systems and optimization problems. High performance optimization, pages 405 440, 2000. 3.2 [Par00] Pablo A Parrilo. Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. California Institute of Technology, 2000. 3.2 [RV23] Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. Proceedings of the fifty-fifth annual ACM Symposium on Theory of Computing, 2023. To appear. 1, 1 [RY20] Prasad Raghavendra and Morris Yau. List decodable learning via sum of squares. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 161 180. SIAM, 2020. 1 [Sho87] N.Z. Shor. Quadratic optimization problems. Izv. Akad. Nauk SSSR Tekhn. Kibernet., 1987(1):128 139, 222, 1987. 3.2 [SW14] Adrien Saumard and Jon A Wellner. Log-concavity and strong log-concavity: a review. Statistics surveys, 8:45, 2014. 2.2, 2.5 [YZ17] Songbai Yan and Chicheng Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. Advances in Neural Information Processing Systems, 30, 2017. 1 [Zha18] Chicheng Zhang. Efficient active learning of sparse halfspaces. In Conference on Learning Theory, pages 1856 1880. PMLR, 2018. 1 [ZL21] Chicheng Zhang and Yinan Li. Improved algorithms for efficient active learning halfspaces with massart and tsybakov noise. In Conference on Learning Theory, pages 4526 4527. PMLR, 2021. 1 [ZSA20] Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. Advances in Neural Information Processing Systems, 33:7184 7197, 2020. 1 A Technical Lemmas In this section, we provide a list of technical results that we use in our proofs. Lemma A.1 (Preservation of Poincaré constant). Let I be an open interval in R and q : Rd R+ the density of a γ-Poincaré distribution. Let v Sd 1 and q v : Rd 1 R+ be the density of the distribution resulting from conditioning q to x v I and projecting on the subspace perpendicular to v. Then, the distribution corresponding to q v is γ-Poincaré. Proof. Assume, without loss of generality, that v = ed. We have that xd I q(x 0 and a set S of i.i.d. samples from D with size at least 2λd4 θ2δ , runs in time poly(d, 1 θ, |S|) and satisfies the following specifications (a) If the tester accepts, then, for z S, ES[zz T ] θ 2Id (resp. ES[zz T ] 2θId). (b) If, for z D, ED[(zizj)2] λ and ED[zz T ] θId (resp. ED[zz T ] θId), then the tester accepts with probability at least 1 δ. Proof. The tester receives λ, a set S and δ (0, 1) and does the following: 1. Compute the matrix MS = ES[zz T ]. 2. If the minimum (resp. maximum) eigenvalue of MS is larger than θ 2 (resp. smaller than 2θ), then accept. Otherwise reject. Clearly, if the tester accepts, then the desired property is satisfied by construction. 
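For reference, here is a minimal numpy sketch of the spectral tester just described (the function name and interface are ours; the eigenvalue thresholds θ/2 and 2θ are those of step 2):

```python
# Minimal numpy sketch of the spectral tester above: estimate M_S = E_S[z z^T]
# and compare its extreme eigenvalue against the threshold theta.
import numpy as np

def spectral_test(Z, theta, lower=True):
    """Z: (n, m) matrix whose rows are the vectors z in S.
    lower=True checks the minimum-eigenvalue condition (accept iff > theta/2);
    lower=False checks the maximum-eigenvalue condition (accept iff < 2*theta)."""
    M_S = (Z.T @ Z) / Z.shape[0]          # empirical second-moment matrix E_S[z z^T]
    eigvals = np.linalg.eigvalsh(M_S)     # M_S is symmetric positive semidefinite
    return eigvals.min() > theta / 2 if lower else eigvals.max() < 2 * theta
```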
If the distribution D satisfies the conditions of part (b), we can show that for MD = Ez D[zz T ] we have 2, with probability at least 1 δ which implies that MS θ 2Id (and MS (θ + θ 2)Id 2θId). In particular, we have that (MS)ij = ES[zizj], and by Chebyshev s inequality we have P |(MS)ij (MD)ij| > θ θ2|S| E z D[(zizj)2] 4λd2 θ2|S| δ d 2 By a union bound, we get that MS MD max θ 2d with probability at least 1 δ and hence MS MD op d MS MD max θ 2, which concludes the proof. Proposition A.3. Let c 0, λ 1, σ 1 2λ and D be a λ-nice distribution over Rd. Then, for any unit vectors w, v, v , u, u Rd with w, v = w, v = 0 and for some universal constant C > 0 we have (i) P[| w, x | σ] = 2σ αC, for some α [ 1 (ii) E[ v, x 2 1{| w, x | σ}] = 2σ αC, for some α [ 1 (iii) E[ x, u 2 x, u 2] = αC, for some α Cλ. (iv) E[ v, x 2 1{| w, x | [c, c + σ]}] 2σ αC, for some α Cλ. Proof. We start by deriving property (i). Recall the function Q from the definition of a λ-nice distribution, which upper-bounds the density of any two-dimensional projection of a λ-nice distribution we see that: P[| w, x | σ] = Z σ x2= qspan(v,w)(x1, x2) dx1dx2 x2 1 + x2 2 Now, note that the region {(x1, x2) : |x1| σ} is a subset of the set {(x1, x2) : |x2| σ|x1|} {(x1, x2) : |x1| σ & |x2| 1}. Therefore: Z σ x2 1 + x2 2 4 arcsin(σ) Z r=0 2πr Q (r) dr + Z σ x2 1 + x2 2 dx1dx2 O(σλ) Note that in the last line above, we bounded the first term via the bound R r=0 r Q (r) dr λ from the definition of λ-nice distributions. Likewise, we bounded the second term via the inequality Q(r) λ from the definition of λ-nice distributions. Overall, we get E[ v, x 2 1{| w, x | σ}] O(σλ) Now, we shall lower-bound the same quantity. We have P[| w, x | σ] = Z σ x2= qspan(v,w)(x1, x2) dx1dx2 2λ qspan(v,w)(x1, x2) dx1dx2 Now, since σ 1 2λ via the premise of the lemma, we see that the whole region of integration on the right side of the set {(x1, x2) : p x2 1 + x2 2 1 λ}. From the definition of λ-nice distributions, the density qspan(v,w) is lower-bounded by 1/λ in this region. Therefore, we have P[| w, x | σ] 2σ which finishes the proof of property (i). Now, we derive property (ii). Recall the function Q from the definition of a λ-nice distribution, which upper-bounds the density of any two-dimensional projection of a λ-nice distribution we see that: E[ v, x 2 1{| w, x | σ}] = Z σ x2= x2 2 qspan(v,w)(x1, x2) dx1dx2 x2= x2 2 Q q x2 1 + x2 2 Now, note that the region {(x1, x2) : |x1| σ} is a subset of the set {(x1, x2) : |x2| σ|x1|} {(x1, x2) : |x1| σ & |x2| 1}. Therefore: Z σ x2= x2 2 Q q x2 1 + x2 2 4 arcsin(σ) Z r=0 2πr3Q (r) dr + Z σ x2= 1 x2 2 Q q x2 1 + x2 2 dx1dx2 O(σλ) Note that in the last line above, we bounded the first term via the bound on R r=0 r3Q (r) dr from the definition of λ-nice distributions. Likewise, we bounded the second term via the inequality Q(r) λ from the definition of λ-nice distributions. Therefore, we get E[ v, x 2 1{| w, x | σ}] O(σλ) Now, we shall lower-bound the same quantity. We have E[ v, x 2 1{| w, x | σ}] = Z σ x2= x2 2 qspan(v,w)(x1, x2) dx1dx2 2λ x2 2 qspan(v,w)(x1, x2) dx1dx2 Now, since σ 1 2λ via the premise of the lemma, we see that the whole region of integration on the right side of the set {(x1, x2) : p x2 1 + x2 2 1 λ}. From the definition of λ-nice distributions, the density qspan(v,w) is lower-bounded by 1/λ in this region. Therefore, we have E[ v, x 2 1{| w, x | σ}] 2σ λ = σ 2λ4 , which finishes the proof of property (ii). We proceed to property (iii). 
We will denote the angle between v and v as β, which allows us to write E[ x, v 2 x, v 2] = Z x2= x2 1(x1 cos β + x2 sin β)2qspan(v,w)(x1, x2) dx1dx2 x2= x2 1(x1 cos β + x2 sin β)2 Q q x2 1 + x2 2 x2= (x2 1 + x2 2)2 Q q x2 1 + x2 2 r=0 2πr5Q(r) dr 2πλ, which finishes the proof of property (iii). Finally, we prove property (iv). For β 0 we have Z r2 + β dr = Z 1 r2 + β dr + Z r2 + β dr (since supr 0 Q(r) λ) r = 1+β (r 3 βr )Q(r ) dr (by setting r = p r=0 r3Q(r) dr (since βr Q(r) 0 for any r 0) Applying the above inequality to the quantity of property (iv), we get the desired result. E[ v, x 2 1{| w, x | [c, c + σ]}] = Z |x1| [c,c+σ] x2= x2 2 qspan(v,w) dx1dx2 |x1| [c,c+σ] x2= x2 2 Q q x2 1 + x2 2 dx1dx2 |x1| [c,c+σ] x2 1 + r2 dr dx1 |x1| [c,c+σ] (4λ) dx1 8λσ This concludes the proof of Proposition A.3. Proposition A.4 (Proposition 4.2 of [GKSV23]). There is a universal constant C > 0, such that for any σ > 0, there exists a continuously differentiable function ℓσ : R [0, 1] with the following properties. 1. For any t [ σ/6, σ/6], ℓσ(t) = 1 2. For any t > σ/2, ℓσ(t) = 1 and for any t < σ/2, ℓσ(t) = 0. 3. For any t R, ℓ σ(t) [0, C/σ], ℓ σ(t) = ℓ σ( t) and |ℓ σ(t)| C/σ2. Proposition A.5 (PSGD Convergence [DKTZ20a], restated in [GKSV23]). Let Lσ be as in Equation (4.1) with σ (0, 1], ℓσ as described in Proposition A.4, λ 1 and DXY such that the marginal DX on Rd is λ-nice. Then for some universal constant C > 0 and for any ϵ > 0 and δ (0, 1), there is an algorithm whose time and sample complexity is O( λCd σ4 + λC log(1/δ) ϵ4σ4 ), which, having access to samples from DXY, outputs a list L of vectors w Sd 1 with |L| = O( λCd σ4 + λC log(1/δ) ϵ4σ4 ) so that there exists w L with w Lσ(w; DXY) 2 ϵ , with probability at least 1 δ . In particular, the algorithm performs Stochastic Gradient Descent on Lσ Projected on Sd 1 (PSGD). B Proofs from Section 3 B.1 Proof of Lemma 3.1 We restate Lemma 3.1 here for convenience. Lemma B.1 (Lemma 3.1). Let DXY be a distribution over Rd { 1}, w Sd 1, θ (0, π/4], λ 1 and δ (0, 1). Then, for a sufficiently large constant C, there is a tester that given δ, θ, w and a set S of samples from DX with size at least C d4 θ2δ , runs in time poly d, 1 δ and satisfies the following specifications: (a) If the tester accepts S, then for every unit vector w Rn satisfying (w, w ) θ we have P x S[sign( w , x ) = sign( w, x )] C θ λC (b) If the distribution DX is λ-nice, the tester accepts S with probability 1 δ. Proof. The testing algorithm receives integer d, set S Rd, w Sd 1, θ (0, π/4], λ 1 and δ (0, 1) and does the following for some sufficiently large universal constant C1 > 0: 1. If Px S [| w, x | [0, θ]] > C1θλC1, then reject. 2. Let proj w : Rd Rd 1 denote the operator that given any vector in Rd, it outputs its projection into the (d 1)-dimensional subspace of Rd that is orthogonal to w. 3. Compute the (d 1) (d 1) matrix MS as follows5: (proj w x)(proj w x)T (i 1)2 1{| w, x | [(i 1)θ, iθ)} 4. Run the spectral tester of Proposition A.2 on MS given δ δ, λ C1λC1 and θ C1 2 θλC1, i.e., compute MS op and if MS op > C1θλC1, then reject. Otherwise, accept. First, suppose the test accepts. For the following, consider the vector w Rd to be an arbitrary unit vector and v Rd to be the unit vector that is perpendicular to w, lies within the plane defined by w and w and v, w 0. 
Then we have: P x S[sign( w , x ) = sign( w, x )] h | v, x | > θ tan θ (i 1) | {z } Implies | v,x |>(i 1)/2 & | w, x | [(i 1)θ, iθ] i P x S [| w, x | [0, θ]] | {z } C1θλC1 Ex S v, x 21| w,x | [(i 1)θ,iθ] (i 1)2 | {z } proj w v, M proj w v M op C1θλC1 For part (b), we suppose that the distribution DX is indeed λ-nice. We will show that with probability at least 1 δ, the tester will accept, i.e., that P x S [| w, x | [0, θ]] C1θλC1 and (B.1) MS op C1θλC1 (B.2) We first observe that the corresponding quantities under distribution DX due to Proposition A.3. In particular, we have that for some universal constant C > 0 P x DX [| w, x | [0, θ]] C θλC and (B.3) E x DX[ v , x 2 1{| w, x | [c, c + θ]}] C θλC for any v Sd 1 and c 0 (B.4) If we let MDX = EDX [MS], we get that MDX op = sup u Sd 2 u T MDX u = sup v Sd 1: v ,w =0 (proj w v )T MDX (proj w v ) 1 (i 1)2 sup v Sd 1 E x DX[ v , x 2 1{| w, x | [c, c + θ]}] 1 (i 1)2 sup v Sd 1 C θλC C π2 5Note that only at most |S| many terms below are non-zero, hence MS can be computed efficiently. By Proposition A.2, in order to satisfy expression (B.2), it remains to show that Ez D[(zℓzj)2] C1λC1 for any ℓ, j [d], where z is defined as follows (i 1) 1| w,x | [(i 1)θ,iθ). Since Ez D[(zℓzj)2] Ez D[ u, x 2 u , x 2], for some unit vectors u, u Sd 1 (orthogonal to w), the desired bound follows from Proposition A.3. It remains to bound the absolute distance between the quantities of the left hand side of expressions (B.1) and (B.3). This can be achieved by an application of the Hoeffding bound, since the empirical version of the quantity is the average of independent Bernoulli random variables. B.2 Proof of Lemma 3.2 We restate Lemma 3.2 here for convenience. Lemma B.2 (Universally Testable Weak Anti-Concentration). Let D be a distribution over Rd. Then, there is a universal constant C > 0 and a tester that given a unit vector w Rd, δ (0, 1), γ > 0, λ 1, σ 1 2λ and a set S of i.i.d. samples from D with size at least C d4 σ2δ log(d)λC, runs in time poly(d, λ, 1 γ )) and satisfies the following specifications (a) If the tester accepts S, then for any unit vector v Rd with v, w = 0 we have | v, x | 1 CλC | w, x | σ 1 CλCγ4 (b) If D is γ-Poincaré and λ-nice, then the tester accepts S with probability at least 1 δ. Proof. The testing algorithm receives a set S Rd, w Sd 1, δ (0, 1), γ > 0, λ 1 and σ 1 2λ and does the following for some sufficiently large C1 > 0: 1. If Px S[| w, x | σ] > 2σ C1λC1, then reject. 2. Compute the (d 1) (d 1) matrix MS as follows: MS = E x S (proj w x)(proj w x)T 1{| w, x σ|} 3. Run the spectral tester of Proposition A.2 on MS given δ δ, λ C1λC1 and θ 2σ C1λC1 , i.e., reject if the minimum singular value of MS is less than 2σ C1λC1 . 4. Run the hypercontractivity tester (Prop. 3.5) on S = {proj w x : x S and | w, x | σ}, i.e., solve an appropriate SDP and reject if the solution is larger than a specified threshold. Otherwise, accept. For part (a), we apply the Paley Zygmund inequality to the random variable Z = v, x 2 condtitioned on | w, x | σ and get h v, x 2 | w, x | σ i | w, x | σ (Ex S[ v, x 2 | | w, x | σ])2 4 Ex S[ v, x 4 | | w, x | σ] Note that since v, w = 0, we have v, x = proj w v, proj w x (where v 2 = proj w v 2). 
Therefore, since S has passed the spectral tester as well as the tester for the probability of lying within the strip | w, x | σ, we have that h v, x 2 | w, x | σ i = Ex S v, x 2 1{| w, x | σ} Px S[| w, x | σ] 1 2C1λ2C1 Moreover, {x S : | w, x | σ} has passed the hypercontractivity tester, and therefore, according to Proposition 3.5 we have h v, x 4 | w, x | σ i C1 γ4 Combining the above inequalities we conclude the proof of part (a). For part (b), we assume that D is indeed λ-nice and γ-Poincaré. We first use Proposition A.3 as well as a Hoeffding bound, to get that Px S[| w, x | σ] [ 2σ C λC , 2σ C λC ] with probability at least 1 δ/3 over S (since |S| is large enough), for some universal constant C > 0. Then, we use part (ii) of Proposition A.3 to lower bound the minimum eigenvalue of MD = ED[MS] by 4σ C λC . Using part (iii) of Proposition A.3 to bound the second moment of each of the elements of MD, we may use Proposition A.2 to get that MS 2σ C λC Id 1 (and our spectral test passes) with probability at least 1 δ/3. It remains to show that the hypercontractivity tester will accept with probability at least 1 δ/3 (since, then, the result follows from a union bound). We acquire samples from the hypercontractivity tester through rejection sampling (we keep only the samples within the strip). Since the probability of falling inside the strip is at least 2σ C λC , the number of samples we will keep is at least |S | |S|σ C λC , for some large enough constant C > 0 (due to Chernoff bound) and with probability at least 1 δ/6. We now apply Lemma A.1 to get that the distribution of proj w x conditioned on the strip | w, x | σ is γ-Poincaré, since D is also γ-Poincaré. Hence, the hypercontractivity tester accepts with probability at least 1 δ/6 due to Proposition 3.5. C Proofs from Section 4 C.1 Proof of Proposition 4.2 We restate Proposition 4.2 here for completeness. Proposition C.1 (Modification from [GKSV23, DKTZ20a, DKTZ20b]). For a distribution DXY over Rd { 1} let opt be the minimum error achieved by some origin-centered halfspace and w Sd 1 a corresponding vector. Consider Lσ as in Equation (4.1) for σ > 0 and let η < 1/2. Let w Sd 1 with (w, w ) = θ < π 2 and v span(w, w ) such that v, w = 0 and v, w < 0. Then, for some universal constant C > 0 and any α σ 2 tan θ we have w Lσ(w; DXY) 2 A1 A2 A3, where A1 = α C σ P h | v, x | α and | w, x | σ A2 = C tan θ P h | w, x | σ i and A3 = C E h v, x 2 1{| w,x | σ Moreover, if the noise is Massart with rate η, then w Lσ(w; DXY) 2 (1 2η)A1 A2. Proof. The proof is a slight modification of a part of the proof of Lemma 4.4 in [GKSV23], but we present it here for completeness. For any vector x Rd, let: xw = w, x and xv = v, x . It follows that proj V (x) = xve1 + xwe2, where proj V is the operator that orthogonally projects vectors on V . Using the fact that w( w, x / w 2) = x w, x w = x xww for any w Sd 1, the interchangeability of the gradient and expectation operators and the fact that ℓ σ is an even function we get that w Lσ(w) = E h ℓ σ(| w, x |) y (x xww) i Since the projection operator proj V is a contraction, we have w Lσ(w) 2 proj V w Lσ(w) 2, and we can therefore restrict our attention to a simpler, two dimensional problem. In particular, since proj V (x) = xve1 + xwe2, we get proj V w Lσ(w) 2 = E h ℓ σ(|xw|) y xv i = E h ℓ σ(|xw|) sign( w , x ) (1 2 1{y = sign( w , x )}) xv i Let F(y, x) denote 1 2 1{y = sign( w , x )}. We may write xv as |xv| sign(xv) and let G R2 such that sign(xv) sign( w , x ) = 1 iff x G. 
Proof. The proof is a slight modification of a part of the proof of Lemma 4.4 in [GKSV23], but we present it here for completeness. For any vector $x \in \mathbb{R}^d$, let $x_w = \langle w, x\rangle$ and $x_v = \langle v, x\rangle$. It follows that $\mathrm{proj}_V(x) = x_v e_1 + x_w e_2$, where $\mathrm{proj}_V$ is the operator that orthogonally projects vectors onto $V = \mathrm{span}(w, w^*)$. Using the fact that $\nabla_w\big(\langle w, x\rangle / \|w\|_2\big) = x - \langle w, x\rangle\, w$ for any $w \in \mathbb{S}^{d-1}$, the interchangeability of the gradient and expectation operators, and the fact that $\ell'_\sigma$ is an even function, we get that
$$\nabla_w L_\sigma(w) = \mathbb{E}\big[\ell'_\sigma(|\langle w, x\rangle|)\, y\, (x - \langle w, x\rangle\, w)\big].$$
Since the projection operator $\mathrm{proj}_V$ is a contraction, we have $\|\nabla_w L_\sigma(w)\|_2 \ge \|\mathrm{proj}_V\, \nabla_w L_\sigma(w)\|_2$, and we can therefore restrict our attention to a simpler, two-dimensional problem. In particular, since $\mathrm{proj}_V(x) = x_v e_1 + x_w e_2$, we get
$$\|\mathrm{proj}_V\, \nabla_w L_\sigma(w)\|_2 = \big|\mathbb{E}\big[\ell'_\sigma(|x_w|)\, y\, x_v\big]\big| = \big|\mathbb{E}\big[\ell'_\sigma(|x_w|)\, \mathrm{sign}(\langle w^*, x\rangle)\,\big(1 - 2\cdot\mathbf{1}\{y \neq \mathrm{sign}(\langle w^*, x\rangle)\}\big)\, x_v\big]\big|.$$
Let $F(y, x)$ denote $1 - 2\cdot\mathbf{1}\{y \neq \mathrm{sign}(\langle w^*, x\rangle)\}$. We may write $x_v$ as $|x_v|\,\mathrm{sign}(x_v)$ and let $G \subseteq \mathbb{R}^2$ be the region on which $\mathrm{sign}(x_v)\,\mathrm{sign}(\langle w^*, x\rangle) = -1$.

[Figure 1: The Gaussian mass in each of the regions labelled $A_1$ and $A_2$ is proportional to the corresponding term appearing in the statement of Proposition 4.2. As $\sigma$ tends to $0$, the Gaussian mass of region $A_2$ shrinks faster than that of region $A_1$, since both the height ($\sigma$) and the width ($\sigma/\tan\theta$) of $A_2$ are proportional to $\sigma$, while the width of $A_1$ is not affected (its height is $\sigma/3$). Lemma 4.3 demonstrates that a similar property is universally testable under any nice Poincaré distribution.]

Then $\mathrm{sign}(x_v)\,\mathrm{sign}(\langle w^*, x\rangle) = \mathbf{1}\{x \notin G\} - \mathbf{1}\{x \in G\}$, and we get
$$\|\mathrm{proj}_V\, \nabla_w L_\sigma(w)\|_2 = \big|\mathbb{E}\big[\ell'_\sigma(|x_w|)\,(\mathbf{1}\{x \notin G\} - \mathbf{1}\{x \in G\})\, F(y, x)\, |x_v|\big]\big| \ge \mathbb{E}\big[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \in G\}\, F(y, x)\, |x_v|\big] - \mathbb{E}\big[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \notin G\}\, F(y, x)\, |x_v|\big].$$
Let $A'_1 = \mathbb{E}[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \in G\}\, F(y, x)\, |x_v|]$ and $A'_2 = \mathbb{E}[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \notin G\}\, F(y, x)\, |x_v|]$. In the Massart noise case, $\mathbb{E}_{y|x}[F(y, x)] = 1 - 2\eta(x) \in [1-2\eta, 1]$, where $1 - 2\eta > 0$. Therefore, we have $A'_1 \ge (1-2\eta)\, \mathbb{E}[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \in G\}\, |x_v|]$. When the noise is adversarial, we have $A'_1 \ge \mathbb{E}[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \in G\}\, |x_v|] - 2\,\mathbb{E}[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \in G\}\, \mathbf{1}\{y \neq \mathrm{sign}(\langle w^*, x\rangle)\}\, |x_v|]$.

For any $\alpha \ge \tfrac{\sigma}{2\tan\theta}$, we have that
$$\mathbb{E}\big[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \in G\}\, |x_v|\big] \ge \mathbb{E}\big[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \in G\}\, \mathbf{1}\{|x_w| \le \tfrac{\sigma}{6}\}\, |x_v|\big] \qquad \text{(since all terms are non-negative)}$$
$$\ge \frac{1}{C\sigma}\, \mathbb{E}\big[\mathbf{1}\{x \in G\}\, \mathbf{1}\{|x_w| \le \tfrac{\sigma}{6}\}\, |x_v|\big] \qquad \text{(by Proposition A.4)}$$
$$\ge \frac{\alpha}{C\sigma}\, \mathbb{E}\big[\mathbf{1}\{x \in G\}\, \mathbf{1}\{|x_w| \le \tfrac{\sigma}{6}\}\, \mathbf{1}\{|x_v| \ge \alpha\}\big] \ge \frac{\alpha}{C\sigma}\, \mathbb{E}\big[\mathbf{1}\{|x_w| \le \tfrac{\sigma}{6}\}\, \mathbf{1}\{|x_v| \ge \alpha\}\big] \qquad \text{(see Figure 1)}$$
$$= \frac{\alpha}{C\sigma}\, \mathbb{P}\big[\,|x_w| \le \tfrac{\sigma}{6} \text{ and } |x_v| \ge \alpha\,\big] \;\stackrel{\mathrm{def}}{=}\; A_1.$$
Moreover, for some universal constant $C > 0$, we similarly have
$$\mathbb{E}\big[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \notin G\}\, F(y, x)\, |x_v|\big] \le \mathbb{E}\big[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \notin G\}\, |x_v|\big] \qquad \text{(since } F(y, x) \le 1\text{)}$$
$$\le \frac{C}{\sigma}\, \mathbb{E}\big[\mathbf{1}\{|x_w| \le \tfrac{\sigma}{2}\}\, \mathbf{1}\{x \notin G\}\, |x_v|\big] \qquad \text{(by Proposition A.4)}$$
$$\le \frac{C}{\sigma}\, \mathbb{E}\big[\mathbf{1}\{|x_w| \le \tfrac{\sigma}{2}\}\, \mathbf{1}\{|x_v| \le \tfrac{\sigma}{2\tan\theta}\}\, |x_v|\big] \qquad \text{(see Figure 1)}$$
$$\le \frac{C}{2\tan\theta}\, \mathbb{E}\big[\mathbf{1}\{|x_w| \le \tfrac{\sigma}{2}\}\, \mathbf{1}\{|x_v| \le \tfrac{\sigma}{2\tan\theta}\}\big] \le \frac{C}{2\tan\theta}\, \mathbb{P}\big[\,|x_w| \le \tfrac{\sigma}{2}\,\big] \le A_2.$$
Hence, we have shown that, in the Massart noise case, we have $\|\nabla_w L_\sigma(w)\|_2 \ge (1-2\eta)A_1 - A_2$, as desired. For the adversarial noise case, it remains to bound the quantity
$$2\,\mathbb{E}\big[\ell'_\sigma(|x_w|)\, \mathbf{1}\{x \in G\}\, \mathbf{1}\{y \neq \mathrm{sign}(\langle w^*, x\rangle)\}\, |x_v|\big] \le \frac{C}{\sigma}\, \mathbb{E}\big[\mathbf{1}\{x \in G\}\, \mathbf{1}\{|x_w| \le \tfrac{\sigma}{2}\}\, \mathbf{1}\{y \neq \mathrm{sign}(\langle w^*, x\rangle)\}\, |x_v|\big]$$
$$\le \frac{C}{\sigma}\, \mathbb{E}\big[\mathbf{1}\{|x_w| \le \tfrac{\sigma}{2}\}\, \mathbf{1}\{y \neq \mathrm{sign}(\langle w^*, x\rangle)\}\, |x_v|\big] \le \frac{C}{\sigma}\Big(\mathbb{E}\big[|x_v|^2\, \mathbf{1}\{|x_w| \le \tfrac{\sigma}{2}\}\big]\cdot \mathbb{P}\big[y \neq \mathrm{sign}(\langle w^*, x\rangle)\big]\Big)^{1/2} \;\stackrel{\mathrm{def}}{=}\; A_3,$$
where the final inequality follows from the Cauchy-Schwarz inequality.

C.2 Proof of Lemma 4.3

We restate Lemma 4.3 here for convenience.

Lemma C.2 (Universally Testable Structure of the Surrogate Loss). Let $D_{XY}$ be any distribution over $\mathbb{R}^d \times \{\pm 1\}$. Consider $L_\sigma$ as in Equation (4.1). Then, there is a universal constant $C > 0$ and a tester that, given a unit vector $w \in \mathbb{R}^d$, $\delta \in (0,1)$, $\eta < 1/2$, $\gamma > 0$, $\lambda \ge 1$, $\sigma \le \frac{1}{C\lambda^{C}}$ and a set $S$ of i.i.d. samples from $D_{XY}$ of size at least $C\frac{d^4}{\sigma^2\delta}\log(d)\,\lambda^{C}$, runs in time $\mathrm{poly}(d, \lambda, \frac{1}{\gamma})$ and satisfies the following specifications.

(a) If the tester accepts $S$, then the following statements are true for the minimum error $\mathrm{opt}_S$ achieved by some origin-centered halfspace on $S$ and the corresponding optimum vector $w^*_S \in \mathbb{S}^{d-1}$.

If the noise is Massart with associated rate $\eta$ and $\|\nabla_w L_\sigma(w; S)\|_2 \le \frac{1-2\eta}{C\lambda^{C}\gamma^4}$, then either $\theta(w, w^*_S) \le \frac{C\lambda^{C}(1+\gamma^4)}{1-2\eta}\,\sigma$ or $\theta(-w, w^*_S) \le \frac{C\lambda^{C}(1+\gamma^4)}{1-2\eta}\,\sigma$.

If the noise is adversarial with $\mathrm{opt}_S \le \frac{\sigma}{C\lambda^{C}}$ and $\|\nabla_w L_\sigma(w; S)\|_2 < \frac{1}{C\lambda^{C}\gamma^4}$, then either $\theta(w, w^*_S) \le C\lambda^{C}(1+\gamma^4)\,\sigma$ or $\theta(-w, w^*_S) \le C\lambda^{C}(1+\gamma^4)\,\sigma$.

(b) If the marginal $D_X$ is $\lambda$-nice and $\gamma$-Poincaré, then the tester accepts $S$ with probability at least $1-\delta$.
Proof of Lemma 4.3. The testing algorithm receives $w \in \mathbb{S}^{d-1}$, $\delta \in (0,1)$, $\eta < 1/2$, $\gamma > 0$, $\lambda \ge 1$, $\sigma \le \frac{1}{2\lambda}$ and a set $S \subseteq \mathbb{R}^d \times \{\pm 1\}$, and does the following for some sufficiently large constant $C_1 > 0$ (a Python sketch of the matrix checks in steps 1 to 4 follows the list):

1. If $\mathbb{P}_{(x,y) \sim S}[\,|\langle w, x\rangle| \le \tfrac{\sigma}{6}\,] \le \frac{\sigma}{C_1\lambda^{C_1}}$ or $\mathbb{P}_{(x,y) \sim S}[\,|\langle w, x\rangle| \le \tfrac{\sigma}{2}\,] > \sigma C_1\lambda^{C_1}$, then reject.

2. Compute the $(d-1)\times(d-1)$ matrices $M^+_S$ and $M^-_S$ as follows:
$$M^+_S = \mathbb{E}_{(x,y) \sim S}\big[(\mathrm{proj}_{\perp w}\, x)(\mathrm{proj}_{\perp w}\, x)^T\, \mathbf{1}\{|\langle w, x\rangle| \le \tfrac{\sigma}{2}\}\big], \qquad M^-_S = \mathbb{E}_{(x,y) \sim S}\big[(\mathrm{proj}_{\perp w}\, x)(\mathrm{proj}_{\perp w}\, x)^T\, \mathbf{1}\{|\langle w, x\rangle| \le \tfrac{\sigma}{6}\}\big].$$

3. Run the (maximum singular value) spectral tester of Proposition A.2 on $M^+_S$ with $\delta \leftarrow \frac{\delta}{4}$, $\lambda \leftarrow C_1\lambda^{C_1}$ and $\theta \leftarrow \frac{C_1\sigma\lambda^{C_1}}{2}$, i.e., reject if the maximum singular value of $M^+_S$ is greater than $\sigma C_1\lambda^{C_1}$.

4. Run the (minimum singular value) spectral tester of Proposition A.2 on $M^-_S$ with $\delta \leftarrow \frac{\delta}{4}$, $\lambda \leftarrow C_1\lambda^{C_1}$ and $\theta \leftarrow \frac{2\sigma}{C_1\lambda^{C_1}}$, i.e., reject if the minimum singular value of $M^-_S$ is less than $\frac{\sigma}{C_1\lambda^{C_1}}$.

5. Run the hypercontractivity tester on $S' = \{\mathrm{proj}_{\perp w}\, x : (x, y) \in S \text{ and } |\langle w, x\rangle| \le \tfrac{\sigma}{6}\}$, i.e., solve an appropriate SDP (see Prop. 3.5 with $\gamma \leftarrow \gamma$, $\delta \leftarrow \delta/4$) and reject if the solution is larger than a specified threshold. Otherwise, accept.
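The following minimal sketch mirrors the strip and spectral checks of steps 1 through 4 under stated assumptions: the two strip widths ($\sigma/6$ and $\sigma/2$) and the thresholds follow the reconstruction above and are illustrative; the hypercontractivity check of step 5 is omitted here and would reuse an SDP-based test as in the earlier sketch.

```python
import numpy as np

def strip_moment_checks(S, w, sigma, lam, C1):
    """Sketch of steps 1-4 of the tester in the proof of Lemma 4.3.

    S      : (n, d) array of unlabeled sample points (labels are not used here).
    w      : unit vector in R^d.
    sigma  : surrogate-loss smoothing parameter.
    lam, C1: niceness parameter and a large universal constant.
    Thresholds and strip widths are illustrative, not the exact constants
    of the formal statement.  Returns True if all checks pass.
    """
    n, d = S.shape
    margins = np.abs(S @ w)
    narrow = margins <= sigma / 6          # strip used for the A1 term
    wide = margins <= sigma / 2            # strip used for the A2, A3 terms

    # Step 1: the narrow strip must carry enough mass, the wide strip not too much.
    if narrow.mean() <= sigma / (C1 * lam**C1) or wide.mean() > sigma * C1 * lam**C1:
        return False

    # Orthonormal basis of the subspace orthogonal to w.
    B = np.linalg.svd(np.outer(w, w))[0][:, 1:]

    def strip_matrix(mask):
        X = S[mask] @ B
        return (X.T @ X) / n               # E_S[(proj x)(proj x)^T 1{strip}]

    M_plus, M_minus = strip_matrix(wide), strip_matrix(narrow)

    # Step 3: maximum singular value of M_plus must not be too large.
    if np.linalg.eigvalsh(M_plus).max() > sigma * C1 * lam**C1:
        return False

    # Step 4: minimum singular value of M_minus must not be too small.
    if np.linalg.eigvalsh(M_minus).min() < sigma / (C1 * lam**C1):
        return False
    return True
```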
For part (a), we suppose that the testing algorithm has accepted $S$. Therefore, $S$ has passed all the tests required for part (a) of Lemma 3.2, and there exists a universal constant $C' > 0$ such that
$$\mathbb{P}_{(x,y) \sim S}\Big[\,|\langle v, x\rangle| \ge \tfrac{1}{C'\lambda^{C'}} \,\Big|\, |\langle w, x\rangle| \le \tfrac{\sigma}{6}\,\Big] \ge \frac{1}{C'\lambda^{C'}\gamma^4}.$$
Moreover, we have
$$\frac{\sigma}{C'\lambda^{C'}} < \mathbb{P}_{(x,y) \sim S}\big[\,|\langle w, x\rangle| \le \tfrac{\sigma}{6}\,\big] \le \mathbb{P}_{(x,y) \sim S}\big[\,|\langle w, x\rangle| \le \tfrac{\sigma}{2}\,\big] < \sigma C'\lambda^{C'},$$
and, for every unit vector $v$ with $\langle v, w\rangle = 0$,
$$\mathbb{E}_{(x,y) \sim S}\big[\langle v, x\rangle^2\, \mathbf{1}\{|\langle w, x\rangle| \le \tfrac{\sigma}{2}\}\big] \le \sigma C'\lambda^{C'} \qquad \text{and} \qquad \mathbb{E}_{(x,y) \sim S}\big[\langle v, x\rangle^2\, \mathbf{1}\{|\langle w, x\rangle| \le \tfrac{\sigma}{6}\}\big] \ge \frac{\sigma}{C'\lambda^{C'}}.$$
Since Proposition 4.2 holds for any distribution, it also holds for the empirical distribution (uniform over $S$). We apply Proposition 4.2 with $\alpha = \frac{1}{C'\lambda^{C'}}$ to lower bound $\|\nabla_w L_\sigma(w; S)\|_2$ (or $\|\nabla_w L_\sigma(-w; S)\|_2$) as follows:
$$\|\nabla_w L_\sigma(w; S)\|_2 \ge A_1(\alpha) - A_2 - A_3 \quad \text{(adversarial noise case)}, \qquad \|\nabla_w L_\sigma(w; S)\|_2 \ge (1-2\eta)\, A_1(\alpha) - A_2 \quad \text{(Massart noise case)}.$$
Combining the above inequalities with the bounds implied by the fact that $S$ has passed the tests concludes the proof of part (a), since (after observing that $\tan\theta \ge \theta$, and using the condition on $\mathrm{opt}_S$ to absorb the $A_3$ term in the adversarial case) we get, for some universal constant $C'' > 0$,
$$\|\nabla_w L_\sigma(w; S)\|_2 \ge \frac{1}{C''\lambda^{C''}\gamma^4} - \frac{C''\lambda^{C''}\sigma}{\theta} \quad \text{(adversarial noise case)}, \qquad \|\nabla_w L_\sigma(w; S)\|_2 \ge \frac{1-2\eta}{C''\lambda^{C''}\gamma^4} - \frac{C''\lambda^{C''}\sigma}{\theta} \quad \text{(Massart noise case)},$$
which, combined with the assumed upper bounds on $\|\nabla_w L_\sigma(w; S)\|_2$, rearranges to the claimed bounds on $\theta(w, w^*_S)$ (or on $\theta(-w, w^*_S)$).

For part (b), we follow a recipe similar to the one used to prove part (b) of Lemma 3.2, i.e., we use the following reasoning to show that the tests pass with probability at least $1-\delta$:

1. We assume that the marginal distribution $D_X$ is $\lambda$-nice and $\gamma$-Poincaré.

2. We use Proposition A.3 to bound the values of the tested quantities under the true distribution.

3. We use appropriate concentration results (Hoeffding/Chernoff bounds and Proposition A.2) to show that, since $|S|$ is large enough, each of the empirical quantities at hand does not deviate much from its mean.

This concludes the proof of Lemma 4.3.

C.3 Proof of the Main Theorem

We restate the main theorem here for convenience.

Theorem C.3 (Efficient Universal Tester-Learner for Halfspaces). Let $D_{XY}$ be any distribution over $\mathbb{R}^d \times \{\pm 1\}$ and let $\mathcal{C}$ be the class of origin-centered halfspaces in $\mathbb{R}^d$. Then, for any $\lambda \ge 1$, $\gamma > 0$, $\epsilon > 0$ and $\delta \in (0,1)$, there exists a universal tester-learner for $\mathcal{C}$ w.r.t. the class of $\lambda$-nice and $\gamma$-Poincaré marginals up to error $\mathrm{poly}(\lambda)\,(1+\gamma^4)\,\mathrm{opt} + \epsilon$, where $\mathrm{opt} = \min_{w \in \mathbb{S}^{d-1}} \mathbb{P}_{D_{XY}}[y \neq \mathrm{sign}(\langle w, x\rangle)]$, and error probability at most $\delta$, using a number of samples and running time $\mathrm{poly}(d, \lambda, \gamma, \frac{1}{\epsilon}, \log\frac{1}{\delta})$. Moreover, if the noise is Massart with given rate $\eta < 1/2$, then the algorithm achieves error $\mathrm{opt} + \epsilon$ with time and sample complexity $\mathrm{poly}(d, \lambda, \gamma, \frac{1}{\epsilon}, \frac{1}{1-2\eta}, \log\frac{1}{\delta})$.

Proof of Theorem 4.1. Note that we give the algorithm for $\delta \leftarrow \delta' = 1/3$, since we can reduce the probability of failure by repetition (repeat $O(\log\frac{1}{\delta})$ times, accept if the rate of acceptance is $\Omega(1)$, and output the halfspace achieving the minimum test error among the halfspaces returned; a sketch of this wrapper appears at the end of this section). For the reader's convenience, we now restate the algorithm of page 9 (together with the algorithm we include additional detail relevant to the analysis; an end-to-end Python sketch of steps 2 to 8 follows the list). The algorithm receives $\lambda \ge 1$, $\gamma > 0$, $\epsilon > 0$ and $\eta \in (0, 1/2) \cup \{1\}$ (with the convention that $\eta = 1$ in the agnostic case) and does the following for some appropriately large universal constants $C_1, C_2 > 0$:

1. First, create a set of parameters $\Sigma$ and parameters $E = \frac{\epsilon}{C_1\lambda^{C_1}}$ and $A > 0$ as follows. If $\eta = 1$, then $\Sigma$ is an $\frac{E}{C_1\lambda^{C_1}}$-cover of the interval $\big[0, \frac{1}{C_1\lambda^{C_1}}\big]$ and $A = \frac{1}{C_1\lambda^{C_1}\gamma^4}$. Otherwise, let $\Sigma = \big\{\frac{E(1-2\eta)}{C_1\lambda^{C_1}(1+\gamma^4)}\big\}$ and $A = \frac{1-2\eta}{C_1\lambda^{C_1}\gamma^4}$.

2. Then, draw a set $S_1$ of $C_2\big(\frac{\lambda d}{\gamma\epsilon}\big)^{C_2}$ i.i.d. samples from $D_{XY}$ and run PSGD, as specified in Proposition A.5 with $\epsilon \leftarrow A$ and $\delta \leftarrow \frac{\delta'}{C_1}$, on the loss $L_\sigma$ for each $\sigma \in \Sigma$.

3. Form a list $L$ with all the pairs of the form $(w, \sigma)$, where $w \in \mathbb{S}^{d-1}$ is some iterate of the PSGD subroutine performed on $L_\sigma$.

4. Draw a fresh set $S_2$ of $C_2\big(\frac{\lambda d}{\gamma\epsilon}\big)^{C_2}$ i.i.d. samples from $D_{XY}$ and compute the value $\|\nabla_w L_\sigma(w; S_2)\|_2$ for each $(w, \sigma) \in L$. If, for some $\sigma \in \Sigma$, we have $\|\nabla_w L_\sigma(w; S_2)\|_2 > A$ for all $(w, \sigma) \in L$, then reject.

5. Update $L$ by keeping, for each $\sigma \in \Sigma$, only one pair of the form $(w, \sigma)$ for which $\|\nabla_w L_\sigma(w; S_2)\|_2 \le A$.

6. Run the following tests for each $(w, \sigma) \in L$ to ensure that part (a) of Lemma 4.3 holds for each of the elements of $L$, i.e., that any stationary point of the surrogate loss that lies in $L$ is angularly close to the empirical risk minimizer (or the same holds for the antipodal vector $-w$).

If $\mathbb{P}_{(x,y) \sim S_2}[\,|\langle w, x\rangle| \le \tfrac{\sigma}{6}\,] \le \frac{\sigma}{C_1\lambda^{C_1}}$ or $\mathbb{P}_{(x,y) \sim S_2}[\,|\langle w, x\rangle| \le \tfrac{\sigma}{2}\,] > \sigma C_1\lambda^{C_1}$, then reject.

Compute the $(d-1)\times(d-1)$ matrices $M^+_{S_2}$ and $M^-_{S_2}$ as follows:
$$M^+_{S_2} = \mathbb{E}_{(x,y) \sim S_2}\big[(\mathrm{proj}_{\perp w}\, x)(\mathrm{proj}_{\perp w}\, x)^T\, \mathbf{1}\{|\langle w, x\rangle| \le \tfrac{\sigma}{2}\}\big], \qquad M^-_{S_2} = \mathbb{E}_{(x,y) \sim S_2}\big[(\mathrm{proj}_{\perp w}\, x)(\mathrm{proj}_{\perp w}\, x)^T\, \mathbf{1}\{|\langle w, x\rangle| \le \tfrac{\sigma}{6}\}\big].$$

Run the (maximum singular value) spectral tester of Proposition A.2 on $M^+_{S_2}$ with $\delta \leftarrow \frac{\delta'}{C_1}$, $\lambda \leftarrow C_1\lambda^{C_1}$ and $\theta \leftarrow \frac{C_1\sigma\lambda^{C_1}}{2}$, i.e., reject if the maximum singular value of $M^+_{S_2}$ is greater than $\sigma C_1\lambda^{C_1}$.

Run the (minimum singular value) spectral tester of Proposition A.2 on $M^-_{S_2}$ with $\delta \leftarrow \frac{\delta'}{C_1}$, $\lambda \leftarrow C_1\lambda^{C_1}$ and $\theta \leftarrow \frac{2\sigma}{C_1\lambda^{C_1}}$, i.e., reject if the minimum singular value of $M^-_{S_2}$ is less than $\frac{\sigma}{C_1\lambda^{C_1}}$.

Run the hypercontractivity tester on $S' = \{\mathrm{proj}_{\perp w}\, x : (x, y) \in S_2 \text{ and } |\langle w, x\rangle| \le \tfrac{\sigma}{6}\}$, i.e., solve an appropriate SDP (see Prop. 3.5 with $\gamma \leftarrow \gamma$, $\delta \leftarrow \delta'/C_1$) and reject if the solution is larger than a specified threshold.

7. Run the following tests for each pair of the form $(w, \sigma)$ and $(-w, \sigma)$, where $(w, \sigma) \in L$, to ensure that part (a) of Lemma 3.1 is activated, i.e., that the angular distance of a vector from the empirical risk minimizer is an accurate proxy for the error of the corresponding halfspace. Set $\theta(\sigma) = C_1\lambda^{C_1}(1+\gamma^4)\,\sigma$.

If $\mathbb{P}_{(x,y) \sim S_2}[\,|\langle w, x\rangle| \le \theta(\sigma)\,] > C_1\lambda^{C_1}\theta(\sigma)$, then reject.

Compute the $(d-1)\times(d-1)$ matrix $M_{S_2}$ as follows:$^6$
$$M_{S_2} = \mathbb{E}_{(x,y) \sim S_2}\Big[\sum_{i \ge 2} \frac{(\mathrm{proj}_{\perp w}\, x)(\mathrm{proj}_{\perp w}\, x)^T}{(i-1)^2}\, \mathbf{1}\{|\langle w, x\rangle| \in [(i-1)\theta(\sigma),\, i\theta(\sigma))\}\Big].$$

Run the spectral tester of Proposition A.2 on $M_{S_2}$ with $\delta \leftarrow \frac{\delta'}{C_1}$, $\lambda \leftarrow C_1\lambda^{C_1}$ and $\theta \leftarrow \frac{C_1\theta(\sigma)\lambda^{C_1}}{2}$, i.e., compute $\|M_{S_2}\|_{\mathrm{op}}$ and reject if $\|M_{S_2}\|_{\mathrm{op}} > C_1\theta(\sigma)\lambda^{C_1}$.

8. Otherwise, accept and output the vector $w$ that achieves the smallest empirical error on $S_2$ among the vectors in the list $L$.
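As an end-to-end illustration of steps 2 through 8, the driver below strings the earlier sketches together. Everything here is hedged: `psgd_on_sphere` is a plain projected SGD simplification of the PSGD routine of Proposition A.5, `grad_fn` is assumed to be the placeholder surrogate sketched earlier, and `tests_pass` is a hypothetical callable bundling the checks of steps 6 and 7 (for instance the strip/spectral sketch above plus a hypercontractivity check).

```python
import numpy as np

def psgd_on_sphere(X, y, sigma, grad_fn, steps=500, lr=0.1, seed=0):
    """Plain projected SGD on the unit sphere for the surrogate loss.

    grad_fn(w, X, y, sigma) -> (loss, grad); a simplification of the PSGD
    routine of Proposition A.5.  Returns the list of iterates.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    iterates = [w.copy()]
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=min(64, len(X)))   # mini-batch
        _, g = grad_fn(w, X[idx], y[idx], sigma)
        w = w - lr * g
        w /= np.linalg.norm(w)                                 # project back to the sphere
        iterates.append(w.copy())
    return iterates

def universal_tester_learner(X1, y1, X2, y2, sigmas, A, grad_fn, tests_pass):
    """High-level driver mirroring steps 2-8 of the algorithm above (sketch).

    sigmas     : the grid Sigma of smoothing parameters.
    A          : gradient-norm acceptance threshold.
    tests_pass : callable (X, w, sigma) -> bool standing in for the tests of
                 steps 6-7 (an assumed interface).
    Returns (accepted, w_hat).
    """
    candidates = []                                  # the list L
    for sigma in sigmas:
        # Steps 2-3: collect PSGD iterates for this sigma.
        iterates = psgd_on_sphere(X1, y1, sigma, grad_fn)
        # Steps 4-5: keep an iterate with small empirical gradient norm on S2.
        best = min(iterates, key=lambda w: np.linalg.norm(grad_fn(w, X2, y2, sigma)[1]))
        if np.linalg.norm(grad_fn(best, X2, y2, sigma)[1]) > A:
            return False, None                       # reject
        candidates.append((best, sigma))
    # Steps 6-7: distribution tests around each candidate and its antipode.
    for w, sigma in candidates:
        if not (tests_pass(X2, w, sigma) and tests_pass(X2, -w, sigma)):
            return False, None                       # reject
    # Step 8: output the candidate with the smallest empirical error on S2.
    def err(w):
        return np.mean(np.sign(X2 @ w) != y2)
    w_hat = min((w for w, _ in candidates), key=err)
    return True, w_hat
```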
For the following, let $\alpha = 1$ in the Massart noise case and $\alpha = C_1\lambda^{C_1}(1+\gamma^4)$ in the adversarial noise case. Consider also $\mathrm{opt}_{S_2}$ to be the error of the origin-centered halfspace with the minimum empirical error on $S_2$, and $w^*_{S_2}$ the corresponding optimum vector.

Soundness. We first prove the soundness condition, i.e., that the following implication holds with probability at least $1-\delta'$ over the samples: if the tester accepts, then $\mathbb{P}_{D_{XY}}[y \neq \mathrm{sign}(\langle w, x\rangle)] \le \alpha\,\mathrm{opt} + \epsilon$ for the output vector $w$. The tester accepts only if, for every $\sigma \in \Sigma$, there is some $w \in L$ with $\|\nabla_w L_\sigma(w; S_2)\|_2 \le A$ (step 4) and for which part (a) of each of Lemmas 4.3 (step 6) and 3.1 (step 7) is activated. Therefore, in the Massart noise case, for any $\sigma \in \Sigma$, there is some $w$ such that either $(w, \sigma) \in L$ or $(-w, \sigma) \in L$, and also
$$\theta(w, w^*_{S_2}) \le \frac{C'\lambda^{C'}(1+\gamma^4)}{1-2\eta}\,\sigma \;\stackrel{\mathrm{def}}{=}\; \theta \quad \text{(C.1)} \qquad \text{and} \qquad \mathbb{P}_{S_2}\big[y \neq \mathrm{sign}(\langle w, x\rangle)\big] \le \mathrm{opt}_{S_2} + C'\lambda^{C'}\theta. \quad \text{(C.2)}$$
In the adversarial noise case, the above hold conditionally on $\sigma$ being such that $\mathrm{opt}_{S_2} \le \frac{\sigma}{C'\lambda^{C'}}$. Therefore, in the Massart noise case, the above are true for $\sigma = \frac{E(1-2\eta)}{C_1\lambda^{C_1}(1+\gamma^4)}$, which gives
$$\mathbb{P}_{S_2}\big[y \neq \mathrm{sign}(\langle w, x\rangle)\big] \le \mathrm{opt}_{S_2} + C'\lambda^{C'} E.$$
In the agnostic case, condition (C.2) is true for some $\sigma \in \big[0, \frac{1}{C_1\lambda^{C_1}}\big]$ such that $\frac{\sigma}{C'\lambda^{C'}}\cdot\frac{1}{C_1\lambda^{C_1}} \le \mathrm{opt}_{S_2} \le \frac{\sigma}{C'\lambda^{C'}}$, unless $\mathrm{opt}_{S_2} > \frac{1}{C_1 C'\lambda^{C_1+C'}}$, in which case any halfspace has error at most $1 \le \mathrm{opt}_{S_2}\cdot C_1 C'\lambda^{C_1+C'}$. Hence we get
$$\mathbb{P}_{S_2}\big[y \neq \mathrm{sign}(\langle w, x\rangle)\big] \le \mathrm{poly}(\lambda)\,(1+\gamma^4)\,\mathrm{opt}_{S_2} + C'\lambda^{C'} E.$$
Soundness then follows from the fact that, if $|S_2|$ is sufficiently large (but still polynomial in every parameter, since the VC dimension of the class of halfspaces in $\mathbb{R}^d$ is $d+1$), then $|\mathrm{opt}_{S_2} - \mathrm{opt}| \le \frac{\epsilon}{C_1\lambda^{C_1}(1+\gamma^4)}$ with probability at least $1-\delta'$.

$^6$Note that only at most $|S_2|$ of the terms in the sum defining $M_{S_2}$ are non-zero, hence $M_{S_2}$ can be computed efficiently.

Completeness. Suppose now that the marginal is indeed $\lambda$-nice and $\gamma$-Poincaré. Then, for sufficiently large $S_1$, after step 3 the list $L$ will contain a stationary point of $L_\sigma(\cdot\,; D_{XY})$ for each $\sigma \in \Sigma$, due to Proposition A.5. If $S_2$ is large enough, then steps 4, 6 and 7 will each accept with probability at least $1-\frac{\delta'}{C_1}$, due to part (b) of Lemmas 4.3 and 3.1, as well as the fact that each coordinate of $\nabla_w L_\sigma(w; S_2)$ has bounded second moment (Proposition A.3), and therefore $\nabla_w L_\sigma(w; S_2)$ is concentrated around $\nabla_w L_\sigma(w; D_{XY})$ for any fixed $w$ such that $(w, \sigma) \in L$ (we also need a union bound over $L$). Hence, in total, the tester accepts with probability at least $1-\delta'$.
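Finally, the confidence amplification mentioned at the start of the proof (run the base algorithm with constant failure probability, repeat $O(\log\frac{1}{\delta})$ times, accept if a constant fraction of runs accept, and return the halfspace with the smallest validation error) admits the following minimal sketch. The wrapper `run_once`, the validation set, and the specific repetition count and acceptance fraction are illustrative assumptions.

```python
import numpy as np

def amplify_confidence(run_once, delta, X_val, y_val):
    """Repetition wrapper described in the proof of Theorem 4.1 (sketch).

    run_once : callable () -> (accepted: bool, w: np.ndarray or None), a single
               execution of the base tester-learner with failure probability 1/3
               (assumed to draw its own fresh samples).
    delta    : target failure probability.
    X_val, y_val : held-out labeled data used to pick the best returned halfspace.
    """
    reps = int(np.ceil(36 * np.log(2.0 / delta)))        # O(log(1/delta)) repetitions
    outcomes = [run_once() for _ in range(reps)]
    accepted = [w for ok, w in outcomes if ok and w is not None]

    # Accept only if a constant fraction of the repetitions accepted
    # (the fraction 1/2 is an illustrative choice).
    if len(accepted) < reps / 2:
        return False, None

    def val_err(w):
        return np.mean(np.sign(X_val @ w) != y_val)

    return True, min(accepted, key=val_err)
```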