# reliable_learning_of_halfspaces_under_gaussian_marginals__58d520d3.pdf

Reliable Learning of Halfspaces under Gaussian Marginals

Ilias Diakonikolas University of Wisconsin-Madison ilias@cs.wisc.edu

Lisheng Ren University of Wisconsin-Madison lren29@wisc.edu

Nikos Zarifis University of Wisconsin-Madison zarifis@wisc.edu

We study the problem of PAC learning halfspaces in the reliable agnostic model of Kalai et al. (2012). The reliable PAC model captures learning scenarios where one type of error is costlier than the others. Our main positive result is a new algorithm for reliable learning of Gaussian halfspaces on Rd with sample and computational complexity

d O(log(min{1/α,1/ϵ})) min(2log(1/ϵ)O(log(1/α)), 2poly(1/ϵ)) ,

where ϵ is the excess error and α is the bias of the optimal halfspace. We complement our upper bound with a Statistical Query lower bound suggesting that the dΩ(log(1/α)) dependence is best possible. Conceptually, our results imply a strong computational separation between reliable agnostic learning and standard agnostic learning of halfspaces in the Gaussian setting.

1 Introduction

Halfspaces (or Linear Threshold Functions) is the class of functions f : Rd { 1} of the form f(x) = sign( w, x t), where w Rd is called the weight vector and t is called the threshold. The problem of learning halfspaces is one of the classical and most well-studied problems in machine learning going back to the Perceptron algorithm [Ros58] and has had great impact on many other influential techniques, including SVMs [Vap98] and Ada Boost [FS97].

Here we focus on learning halfspaces from random labeled examples. The computational complexity of this task crucially depends on the choice of the underlying model. For example, in the realizable PAC model (i.e., with clean labels), the problem is known to be efficiently solvable (see, e.g., [MT94]) via a reduction to linear programming. Unfortunately, this method is quite fragile and breaks down in the presence of noisy labels. In the noisy setting, the computational complexity of the problem depends on the choice of noise model and distributional assumptions. In this work, we study the problem of distribution-specific PAC learning of halfspaces, with respect to Gaussian marginals, in the reliable agnostic model of [KKM12]. Formally, we have the following definition.

Definition 1.1 ((Positive) Reliable Learning of Gaussian Halfspaces). Let Hd be the class of halfspaces on Rd. Given 0 < ϵ < 1 and i.i.d. samples (x, y) from a distribution D supported on Rd { 1}, where the marginal Dx is the standard Gaussian N(0, I), we say that an algorithm reliably learns Hd to error ϵ if the algorithm with high probability outputs a hypothesis h : Rd { 1} such that Pr(x,y) D[h(x) = 1 y = 1] ϵ and Pr(x,y) D[h(x) = 1 y = 1] OPT + ϵ,

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

where OPT def = minf G(D) Pr(x,y) D[f(x) = 1 y = 1], with

G(D) = {f Hd : Pr (x,y) D[f(x) = 1 y = 1] = 0} {f(x) = 1} .

We say that f = argminf G(D) Pr(x,y) D[f(x) = 1 y = 1] is the optimal halfspace on distribution D and for the conditions above that hypothesis h satisfies, we say that h is ϵ-reliable with respect to Hd on distribution D.

In other words, a (positive) reliable learner makes almost no false positive errors, while also maintaining the best possible false negative error as compared to any hypothesis with no false positive errors. Note that if there is no hypothesis (in the target class) that makes no false positive errors, then essentially we have no requirement for the false negative error of the returned hypothesis. The algorithm can then simply return the 1 constant hypothesis.

The reliable agnostic PAC model was introduced by [KKM12] as a one-sided analogue of the classical agnostic model [Hau92, KSS94], and has since been extensively studied in a range of works; see, e.g., [KKM12, KT14, GKKT17, DJ19, GKKM20, KK21]. The underlying motivation for this definition comes from learning situations where mistakes of one type (e.g., false positive) should be avoided at all costs. Typical examples include spam detection where incorrectly detecting spam emails is much less costly than mislabeling an important email as spam and detecting network failures where false negatives may be particularly harmful. Such scenarios motivate the study of reliable learning, which can be viewed as minimizing a loss function for different costs for false positive and false negative errors (see, e.g., [Dom99, Elk01]). In a historical context, reliable learning is related to the Neyman-Pearson criterion [NP33] in hypothesis testing, which shows that the optimal strategy to minimize one type of error subject to another type of error being bounded is to threshold the likelihood ratio function. More recently, reliable learning has been shown [KK21] to be equivalent to the PQ-learning model [GKKM20] a recently introduced learning model motivated by covariance shift.

The algorithmic task of agnostic reliable learning can be quite challenging. While reliable learning can be viewed as minimizing a loss function with different cost for false positive and false negative error, in general, such a loss function will result in a non-convex optimization problem. As pointed out in [KKM12], distribution-specific reliable learning can be efficiently reduced to distribution-specific agnostic learning (such reduction preserves the marginal distribution). A natural question, serving as a motivation for this work, is whether reliable learning is qualitatively easier computationally for natural concept classes and halfspaces in particular.

Before we state our contributions, we briefly summarize prior work on agnostically learning Gaussian halfspaces. Recall that, in agnostic learning, the goal is to output a hypothesis with error OPT + ϵ, where OPT is the optimal 0-1 error within the target class. Prior work has essentially characterized the computational complexity of agnostically learning Gaussian halfspaces. Specifically, it is known that the L1 polynomial regression algorithm of [KKMS08] is an agnostic learner with complexity d O(1/ϵ2) [DGJ+09, DKN10]. Moreover, there is strong evidence that this complexity upper bound is tight, both in the Statistical Query (SQ) model [DKZ20, GGK20, DKPZ21] and under plausible cryptographic assumptions [DKR23, Tie23]. It is worth noting that the aforementioned hardness results hold even for the subclass of homogeneous halfspaces.

Given the aforementioned reduction of reliable learning to agnostic learning [KKM12], one can use L1-regression as a reliable agnostic learner for Gaussian halfspaces, leading to complexity d O(1/ϵ2). Prior to this work, this was the best (and only known) reliable halfspace learner. Given the fundamental nature of this problem, it is natural to ask if one can do better for reliable learning.

Is it possible to develop faster algorithms for reliably learning Gaussian halfspaces compared to agnostic learning?

In this work, we provide an affirmative answer to this question.

To formally state our main result, we need the notion of the bias of a Boolean function. Definition 1.2 (Bias of Boolean Function). We define the bias α [0, 1/2] of h : Rd { 1} under

the Gaussian distribution as α = α(h) def = min(Prx Nd[h(x) = 1], Prx Nd[h(x) = 1]).

The main result of this paper is encapsulated in the following theorem:

Theorem 1.3 (Main Result). Let D be a joint distribution of (x, y) supported on Rd { 1} with marginal Dx = Nd and let α be the bias of the optimal halfspace on distribution D (with respect to Definition 1.1). There is an algorithm that uses N = d O(log(min{1/α,1/ϵ})) min 2log(1/ϵ)O(log(1/α)), 2poly(1/ϵ) many samples from D, runs in poly(N, d) time, and with high probability returns a hypothesis h(x) = sign( w, x t) that is ϵ-reliable with respect to Hd. Moreover, for ϵ < α/2, any SQ algorithm for the problem requires complexity dΩ(log(1/α)).

For more detailed theorem statements of our upper and lower bounds, see Appendix B for the algorithmic result and Appendix C for the SQ hardness result.

Theorem 1.3 gives a significantly more efficient algorithm (in terms of both sample complexity and computational complexity) for reliably learning Gaussian halfspaces, compared to agnostic learning. Specifically, as long as α > 0 is a universal constant, the overall complexity is polynomial in d and quasi-polynomial in 1/ϵ as opposed to dpoly(1/ϵ) in the agnostic setting. While the complexity of our algorithm (namely, the multiplicative factor that is independent of d) increases as the optimal halfspace becomes more biased, it is always bounded above by the complexity of agnostic learning.

Finally, we note that our algorithm also applies to the fully reliable learning model [KKM12], since one can easily reduce the fully reliable learning to positive reliable and negative reliable learning (as observed in [KKM12]).

1.1 Overview of Techniques

Instead of directly considering the optimal halfspace for a distribution D (as defined in Definition 1.1), we introduce the following definition as a relaxation.

Definition 1.4 (Reliability Condition). We say that a distribution D supported on X { 1} satisfies the reliability condition with respect to f : X 7 { 1} if Pr(x,y) D[f(x) = +1 y = 1] = 0 .

Notice that h being ϵ-reliable on distribution D is equivalent to h having better false negative error compared with any f such that D satisfies the reliability condition with respect to f (instead of compared with the optimal halfspace). At the same time, given any fixed f, such a definition allows the adversary to arbitrarily corrupt negative labels, thus allowing for more manageable analysis.

We start with a brief overview of our SQ lower bound construction, followed by the high-level idea enabling our nearly matching algorithm.

SQ Lower Bound As follows from Definition 1.4 and the subsequent discussion, the adversary in reliable learning can arbitrarily corrupt the negative labels. We leverage this idea to construct an instance where the corruption level suffices to match many moments, meaning that labels y are uncorrelated with any polynomial of degree at most log(1/α). To prove this, we construct an (infinite) feasibility Linear Program (LP) with the following properties: if the LP is feasible, there exists an instance for which no low-degree polynomial correlates with the labels; see Lemma 3.3. To show the existence of such an instance, we leverage a technique from [DKPZ21] that exploits (an infinite generalization of) LP duality. Specifically, we show that it is possible to add noise to the labels whose uncorrupted value is negative so that all low-degree moments are uncorrelated with the observed labels y. This implies that no algorithm that relies on low-degree polynomials can succeed for reliable learning. This allows us to show that SQ algorithms fail if they do not use queries with tolerance at most 1/dΩ(log(1/α)) or exponentially many queries.

Reliable Learning Algorithm As discussed in the previous paragraph, an adversary can arbitrarily corrupt the negative labels which can make various algorithmic approaches fail. In more detail, the adversary can corrupt Ω(log(1/α)) moments; that is, any SQ algorithm needs to rely on higher-order moment information. One of the main difficulties of this setting is that there may exist weight vectors w, w with w w 2 Ω(1) while at the same time the 0-1 error of w and w is nearly the same. This might suggest that no approach can verify that the algorithm decreases the angle between the current weight vector and the optimal vector. To overcome this obstacle, we develop an algorithm that performs a random walk over a low-dimensional subspace. To establish the correctness of our

algorithm, we prove that at some point during its execution, the algorithm will find a weight vector sufficiently close to the optimal vector.

To show that such a low-dimensional subspace exists, we proceed as follows: We first prove that there exists a non-trivial polynomial p of degree O(log(1/α)) that correlates with the negative labels (by the term nontrivial, we are referring to the requirement that we cannot use the constant polynomial, as this trivially correlates with the labels); see Claim B.5. This implies that the low-degree moment tensors, i.e., E(x,y) D[1{y = 1}x k] for some k between 1 and O(log(1/α)), correlate nontrivially with (w ) k, where w is the target weight vector. Thus, we can use these moment tensors to construct a low-dimensional subspace. We leverage the structure of the problem, namely that the (observed) negative labels are uncorrupted, to show that the correlation is in fact almost constant. As a consequence, this allows us to construct a subspace whose dimension depends only on the degree of the polynomial p (which is O(log(1/α))) and the desired excess error ϵ.

We then proceed as follows: in each round, as long as the current solution w is not optimal, i.e., there are negative samples on the region {x Rd : w x t 0}, our algorithm conditions on a thin strip, projects the points x on the orthogonal complement of w, and reapplies the above structure result. This leads (with constant probability) to an increase in the correlation between w and w . Assuming that the correlation is increased by β in each round with probability at least 1/3, we roughly need at most 1/β successful updates. Therefore, if we run our algorithm for 31/β steps, it is guaranteed that with probability at least 1/3 we will find a vector almost parallel to w .

Prior Algorithmic Techniques Roughly speaking, prior techniques for reliable learning relied on some variant of L1-regression. In that sense, our algorithmic approach departs from a direct polynomial approximation. Concretely, for distribution-free reliable learning, the algorithm from [KT14] uses a one-sided variant of L1 regression where instead of approximating the target function in L1 norm, they use the hinge loss instead. For distribution-specific (and Gaussian in particular) reliable learning of halfspaces, the only previous approach uses the reduction of [KKM12] to agnostic learning. While our algorithm also leverages polynomial approximation, our approach employs significantly different ideas and we believe it provides a novel perspective for this problem.

Technical Comparison with Prior Work Our algorithm shares similarities with the algorithm of [DKK+22] for learning Gaussian halfspaces with Massart noise (a weaker semi-random noise model). Both algorithms perform a random walk in order to converge to the target halfpace. Having said so, our algorithm is fundamentally different than theirs for the following reasons. The algorithm of [DKK+22] partitions Rd into sufficiently angles and searches in each one of them for a direction to update the current hypothesis at random. Our algorithm instead conditions on y = 1, which are the points that are uncorrupted, and uses the guarantee (that we establish) that the first Ω(log(1/α)) moments of the distribution (conditioned on y = 1) correlate with the unknown optimal hypothesis. This leads to an algorithm with significantly faster runtime.

Our SQ lower bound leverages techniques from [DKPZ21]. Essentially, we formulate a linear program to construct a noise function that makes all low-degree polynomials uncorrelated with the labels y. This condition suffices to show that no SQ algorithm can solve the problem without relying on higher moments. To prove the existence of such a noise function, we use (a generalization of) LP duality of an infinite LP, which provides the necessary conditions for the existence of such a function.

1.2 Related Work

This work is part of the broader research program of characterizing the efficient learnability of natural concept classes with respect to challenging noise models. The complexity of this general learning task depends on the underlying class, the distributional assumptions and the choice of noise model. A long line of prior work has made substantial progress in this direction for the class of halfspaces under Gaussians (and other natural distributions), and for both semirandom [ABHU15, DKTZ20a, ZSA20, DKTZ20b, DKK+20, DKK+21b, DKK+22] and adversarial noise [KKMS08, KOS08, KLS09, ABL17, DKS18, DKTZ20c, DKK+21a, DKTZ22]. An arguably surprising conceptual implication of our results is that the complexity of reliable learning for Gaussian halfspaces is qualitatively closer to the semi-random case.

1.3 Preliminaries

We use lowercase boldface letters for vectors and capitalized boldface letters for matrices and tensors. We use x, y for the inner product between x, y Rd. For x, s Rd, we use projs(x) def = x,s s

s 2 2 for the projection of x on the s direction. Similarly, we use proj s(x) = x projs(x) for the projection of x on the orthogonal complement of s. Additionally, let x V Rdim(V ) be the projection of x on the subspace V and reparameterized on Rdim(V ). More precisely, let BV Rd dim(V ) be the matrix whose columns form an (arbitrary) orthonormal basis for the subspace V , and let x V def = (BV ) x.

For p 1 and x Rd, we use x p def = (Pn i=1 |xi|p)1/p to denote the ℓp-norm of x. For a matrix or tensor T, we denote by T F the Frobenius norm of T.

We use Nd to denote the standard normal distribution N(0, I), where 0 is the d-dimensional zero vector and I is the d d identity matrix. We use x D to denote a random variable with distribution D. For a random variable x (resp. a distribution D), we use Px (resp. PD) to denote the probability density function or probability mass function of the random variable x (resp. distribution D). We also use Φ : R 7 [0, 1] to denote the cdf function of N1. We use 1 to denote the indicator function.

For a boolean function h : Rd { 1} and a distribution D supported on Rd { 1}, we use

R+(h; D) def = Pr(x,y) D[h(x) = 1 y = 1] (resp. R (h; D) def = Pr(x,y) D[h(x) = 1 y = 1]) to denote the false positive (resp. false negative) 0-1 error.

2 Algorithm for Reliably Learning Gaussian Halfspaces

In this section, we describe and analyze our algorithm establishing Theorem 1.3. Due to space limitations, some proofs have been deferred to Appendix B. For convenience, we will assume that α (the bias of the optimal halfspace) is known to the algorithm and that the excess error satisfies ϵ α/3. These assumptions are without loss of generality for the following reasons: First, one can efficiently reduce the unknown α case to the case that α is known, by guessing the value of α. Second, if ϵ is large, there is a straightforward algorithm for the problem (simply output the best constant hypothesis).

Notation: We use the notation Hα d for the set of all LTFs on Rd whose bias is equal to α. Given the above assumptions, it suffices for us to give a reliable learning algorithm for Hα d instead of Hd.

The high-level idea of the algorithm is as follows. Without loss of generality, we assume that there exists an α-biased halfspace that correctly classifies all the points with label y = 1 since otherwise the algorithm can just return the hypothesis h(x) 1.

Let f(x) = sign( w , x t ) be the optimal halfspace and let w be our current guess of w . Assuming that w is not sufficiently close to w and the hypothesis that classifies all the points as negative is not optimal, we show that there exists a low-degree polynomial of the form p( w, x ) with correlation at least 2 O(t 2)ϵ with the negative labels. By leveraging this structural result, we use a spectral algorithm to find a direction v that is non-trivially correlated with proj w(w ) with at least some constant probability.

Unfortunately, it is not easy to verify whether v, proj w(w ) is non-trivial. However, conditioned on the algorithm always getting a v that has good correlation, we show that it only takes at most

log(1/ϵ)O(t 2) steps to get sufficiently close to w . Therefore, repeating this process 2log(1/ϵ)O(t 2)

many times will eventually find an accurate approximation to w .

The structure of this section is as follows: In Section 2.1, we give our algorithm for finding a good direction. Section 2.2 describes and analyzes our random walk procedure.

2.1 Finding a Non-Trivial Direction

Here we present an algorithm that finds a direction that correlates non-trivially with the unknown optimal vector. We first show that there exists a zero-mean O(log(1/α))-degree polynomial that sign-matches the optimal hypothesis. Furthermore, using the fact that the optimal hypothesis always

correctly guesses the sign of the negative points, this gives us that the polynomial correlates with the negative (clean) points. Using this intuition, our algorithm estimates the first O(log(1/α)) moments of the distribution Dx conditioned on y = 1. This guarantees that at least one moment correlates with the optimal hypothesis, as a linear combination of the moments generates the sign-matching polynomial. Then, by taking a random vector that lies in the high-influence directions (which form a low-dimensional subspace), we guarantee that with constant probability, this vector correlates well with the unknown optimal vector. The main result of the section is the following. Proposition 2.1. Let D be a joint distribution of (x, y) supported on Rd { 1} with marginal Dx = Nd and ϵ (0, 1). Suppose D satisfies the reliability condition with respect to f(x) =

sign( w , x t ) with t = O p

log(1/ϵ) and Pr(x,y) D[y = 1] ϵ. Then there is an

algorithm that draws N = d O(t 2)/ϵ2 samples, has poly(N) runtime, and with probability at least Ω(1) returns a unit vector v such that v, w max(log(1/ϵ) O(t 2), ϵO(1)).

We start by showing that for any distribution satisfying the reliability condition with respect to some f(x) = sign( w , x t ), there exists a degree-O(t 2) zero-mean polynomial of the form p( w , x ) that has correlation at least 2 O(t 2) Pr(x,y) D[y = 1] with the labels.

Lemma 2.2 (Correlation with an Orthonormal Polynomial). Let D be a joint distribution of (x, y) supported on Rd { 1} with marginal Dx = Nd. Suppose D satisfies the reliability condition with respect to f(x) = sign( w , x t ). Then there exists a polynomial p : R 7 R of degree at most k = O(t 2 + 1) such that Ez N1[p(z)] = 0, Ez N1[p2(z)] = 1 and E(x,y) D[y p( w , x )] = 2 O(t 2) Pr(x,y) D[y = 1].

Proof Sketch of Lemma 2.2. Note that since p is a zero-mean polynomial with respect to Dx, we have that E(x,y) D[y p( w , x )] = 2 E(x,y) D[1(y = 1) p( w , x )] . Then, using the fact that the distribution D satisfies the reliability condition, the statement boils down to showing that for any ϵ-mass inside the interval [t , ], the expectation of p on this ϵ mass is at least 2 O(t 2). To prove this statement, it suffices to construct a polynomial p that is non-negative on [t , ] and such that the ϵ/2-tail of p in that interval is at least 2 O(t 2). To achieve this, we show that the sign-matching polynomial used in [DKK+22] meets our purpose.

Our algorithm uses the following normalized Hermite tensor. Definition 2.3 (Hermite Tensor). For k N and x Rd, we define the k-th Hermite tensor as

(Hk(x))i1,i2,...,ik = 1

Partitions P of [k] into sets of size 1 and 2

{a,b} P ( Iia,ib) O

{c} P xic .

Given that there is such a polynomial, we show that if we take a flattened version of the Chowparameter tensors (which turns them into matrices) and look at the space spanned by their top singular vectors, then a non-trivial fraction of w must lie inside this subspace. We prove the following lemma, which is similar to Lemma 5.10 in [DKK+22]. See Appendix B for the proof. Lemma 2.4. Let D be the joint distribution of (x, y) supported on Rd { 1} with marginal Dx = Nd. Let p : R 7 R be a univariate, mean zero and unit variance polynomial of degree k such that for some unit vector v Rd it holds E(x,y) D[1(y = 1)p( v , x )] τ for some τ (0, 1]. Let T m be an approximation of the order-m Chow-parameter tensor Tm = E(x,y) D[1(y = 1)Hm(x)] such that T m Tm F τ/(4

k). Denote by Vm the subspace spanned by the left singular vectors of flattened T m whose singular values are greater than τ/(4

k). Moreover, denote by V the union of V1, , Vk. Then, for γ = Pr(x,y) D[y = 1], we have that

1. dim(V ) = O(γ2 log(1/γ)kk2/τ 2 + 1), and

2. proj V (v ) 2 = Ω τ/

kγ log(1/γ)k/2 .

By Lemma 2.4, taking a random unit vector v in V will give us v, v proj V (v ) / p

dim(V ). We are ready to prove Proposition 2.1. For the algorithm pseudocode, see Appendix B.

Proof Sketch of Proposition 2.1 . The idea is that the empirical estimate T m obtained using d O(max{t 2,1}) log(1/δ)/ϵ2 many samples will satisfy T m Tm F τ/(4

k) with high probability. Then, by Lemma 2.2 and Lemma 2.4, if we take v to be a random unit vector in V , with constant probability we will have v, w = log(1/ϵ) O(t 2). For v, w = ϵO(1), the proof relies on a different version of Lemma 2.4.

2.2 Random Walk to Update Current Guess

We now describe how we use Proposition 2.1 to construct an algorithm for our learning problem. Let w be the current guess for w . For convenience, we assume that w, w = Ω(1/

d). Let D be the distribution of x w conditioned on x B, where B = {x Rd : w, x t 0}. We next show that the x-marginals of D are standard Gaussian and that D satisfies the reliability condition with respect to the halfspace h(x) = sign( w , x t ) with w = w w and |t | |t |. Lemma 2.5. Let D be the joint distribution of (x, y) supported on Rd { 1} with marginal Dx = Nd and ϵ (0, 1). Suppose D satisfies the reliability condition with respect to f(x) = sign( w , x t ). Suppose that h(x) = sign( w, x t) with w, w > 0 and t t [0, ϵ/100] does not satisfy R+ h (D) ϵ/2. Let B = {x Rd : w, x t 0} and D be the distribution of (x , y) = (x w, y) given x B, then

1. D has marginal distribution Dx = Nd 1,

2. D satisfies the reliability condition with respect to h (x) = sign( w , x t ), where w = w w/ w w 2 and |t | |t |, and

3. Pr(x ,y) D [y = 1] ϵ/2.

Proof Sketch of Lemma 2.5. Item 1 follows by definition. For Item 2, we consider two cases: t > 0; and t 0. For the case t 0, we prove that the distribution D satisfies the reliability condition with respect to h (x) = sign( w , x ), where w = w w/ w w 2. For the case t > 0, we prove that the distribution D satisfies the reliability condition with respect to h (x) = sign( w , x t ), where w = w w/ w w 2 and t = t . Then Item 3 follows from the fact that h does not satisfy R+ h (D) ϵ/2.

By Lemma 2.5, given any current guess w such that h(x) = sign( w, x t ) does not satisfy R+(h; D) ϵ/2, the corresponding distribution D satisfies the reliability condition with respect to h and Pr(x,y) D [y = 1] ϵ/2. Therefore, D satisfies the assumptions of Proposition 2.1. So, if we apply the algorithm in Proposition 2.1, we will with probability at least Ω(1) get a unit vector v such that v, w = 0 and v, w = log(1/ϵ) O(t 2).

The following fact shows that by updating our current guess in the direction of v with appropriate step size, we can get an updated guess with increased correlation with w . Fact 2.6 (Correlation Improvement, Lemma 5.13 in [DKK+22]). Fix unit vectors v , v Rd. Let u Rd such that u, v c, u, v = 0 and u 2 1 with c > 0. Then, for v = v+λu v+λu 2 , with λ = c/2, we have that v , v v, v + λ2/2.

Notice that because proj w(w ) 2 is unknown, we cannot always choose the optimal step size λ. Instead, we will use the same λ to do sufficiently many update steps such that after that many updates, we are certain that proj w(w ) 2 3λ. We then take the new step size λupdate = λ/2 and repeat this process, until w and w are sufficiently close to each other.

We are now ready to describe our algorithm (Algorithm 1) and prove its correctness.

Notice that given that we know the bias α, t must be either Φ 1(α) or Φ 1(α). For convenience, we assume that t = Φ 1(α). To account for the case that t = Φ 1(α), we can simply run Algorithm 1 twice and pick the output halfspace with the smallest t value (or even run a different efficient algorithm, since t 0 as explained in Appendix B). For convenience, we also assume min 2log(1/ϵ)O(log(1/α)), 2poly(1/ϵ) = 2log(1/ϵ)O(log(1/α)). To account for the other case, we can

simply initialize the step size ζ = ϵc for a sufficiently large constant c instead of ζ = log(1/ϵ) ct 2.

A more detailed version of Algorithm 1 is deferred to Appendix B.

Input: ϵ (0, 1), α (0, 1/2) and samples access to a joint distribution D of (x, y) supported on Rd { 1} with x-marginal Dx = Nd. Output: h(x) = sign( w, x t) that is ϵ-reliable with respect to the class Hα d .

1. Check if Pr(x,y) D[y = 1] ϵ/2 (with sufficiently small constant failure probability). If so, return the +1 constant hypothesis. Set the initial step size ζ = log(1/ϵ) ct 2, where c is a sufficiently large universal constant and t = Φ 1(α).

2. Initialize w to be a random unit vector in Rd. Let the update step size λ = ζ and repeat the following process until λ ϵ/100.

(a) Use samples from D to check if the hypothesis h(x) = sign( w, x t) satisfies R+(h; D) ϵ/2. If so, go to Step (3). (b) With 1/2 probability, let w = w. Let B = {x Rd : w, x t 0}, and let D be the distribution of (x w, y) for (x, y) D given x B. Use the algorithm of Proposi-

tion 2.1 on D to find a unit vector v such that v, w = 0 and D v, proj w(w ) proj w(w ) 2

Then update w as follows: wupdate = w+λv w+λv 2 .

(c) Repeat Steps (2a) and (2b) c/ζ2 times, where c is a sufficiently large universal constant, with the same step size λ. After that, update the new step size as λupdate = λ/2. 3. Check if h(x) = sign( w, x t) satisfies R+(h; D) ϵ/2. If so, return h and terminate. Repeat Step (2) 21/ζc many times where c is a sufficiently large constant.

Algorithm 1: Reliably Learning General Halfspaces with Gaussian Marginals.

Proof Sketch of the algorithmic part of Theorem 1.3. Let f(x) = sign( w , x t ) be the optimal halfspace with α bias. We need to show that with high probability Algorithm 1 returns a hypothesis h(x) = sign( w, x t) such that R+(h; D) ϵ and R (h; D) R (f; D) + ϵ.

To do so, it suffices to show that R+(h; D) ϵ; given R+(h; D) ϵ, R (h; D) R (f; D) + ϵ follows from our choice of t. For convenience, we can assume h never satisfies R+(h; D) ϵ/2 in Step (2a) (otherwise, we are done). We can also assume that the subroutine in Proposition 2.1 always succeeds since the algorithm repeats Step (2) sufficiently many times. Given the above conditions, using Fact 2.6, one can show that each time after c/ζ2 many updates in Step (2b), we must have proj ww 2 3λ. Therefore, when we have λ ϵ/100, then proj ww 2 3ϵ/100, which implies R+(h; D) ϵ/2.

3 Nearly Matching SQ Lower Bound

In this section, we establish the SQ hardness result of Theorem 1.3. Due to space limitations, some proofs have been deferred to Appendix C.

Proof Overview To establish our SQ lower bound for reliable learning, we first prove an SQ lower bound for a natural decision version of reliably learning α-biased LTFs. We define the following decision problem over distributions.

Definition 3.1 (Decision Problem over Distributions). Let D be a fixed distribution and D be a distribution family. We denote by B(D, D) the decision problem in which the input distribution D is promised to satisfy either (a) D = D or (b) D D, and the goal is to distinguish the two cases with high probability.

We show that given SQ access to a joint distribution D of (x, y) supported on Rd { 1} with marginal Dx = N(0, I), it is hard to solve the problem B(D, D) with the following distributions.

(a) Null hypothesis: D is the distribution so that y = 1 with probability 1/2 independent of x.

(b) Alternative hypothesis: D D, where D is a family of distributions such that for any distribution D D, there exists an α-biased LTF f such that R+(f; D) = 0.

In order to construct such a family of distributions D, we start by constructing a joint distribution D of (z, y) over R { 1} such that the marginal distribution of z is N1 and the conditional distributions z | y = 1 and z | y = 1 both match many moments with the standard Gaussian N1. Moreover, there is α probability mass on the positive side of the marginal distribution of z that is purely associated with y = 1 (i.e., E(z,y) D[y | z c] = 1 where c = Φ 1(1 α)). We then embed this distribution D along a hidden direction inside the joint distribution of (x, y) on Rd { 1} to construct a family of hard-to-distinguish distributions using the hidden-direction framework developed in [DKS17] and enhanced in [DKPZ21, DKRS23].

We can now proceed with the details of the proof. We start by defining the pairwise correlation between distributions.

Definition 3.2 (Pairwise Correlation). The pairwise correlation of two distributions with pdfs D1, D2 : Rd 7 R+ with respect to a distribution with density D : Rd 7 R+, where the support of D contains the support of D1 and D2, is defined as χD(D1, D2) def = R

Rd D1(x)D2(x)/D(x)dx 1.

Furthermore, the χ-squared divergence of D1 to D is defined as χ2(D1, D) def = χD(D1, D1).

In particular, the framework in [DKS17] allows us to construct a family of 2dΩ(1) distributions on Rd { 1} whose pairwise correlation is d Ω(n) where n is the number of matching moments D has with the standard Gaussian. Then, using standard SQ dimension techniques, this gives an SQ lower bound for the distinguishing problem. After that, we reduce the distinguishing problem to the problem of reliably learning α-biased LTFs under Gaussian marginals with additive error ϵ < α/3.

In order to construct such a distribution D of (z, y) supported on R { 1}, we reparameterize E(z,y) D [y|z] as g(z). For a function g : R R, we use g p = Et N1[|g(t)|p]1/p for its Lp norm. We let L1(R) denote the set of all functions g : R R that have finite L1-norm.

We use linear programming over one-dimensional functions in L1(R) space to establish the existence of such a function. Specifically, we show the following: Lemma 3.3. For any sufficiently large n N, there exists a function g : R 7 [ 1, +1] such that g satisfies the following properties:

(i) g(z) = 1 for all z c, where c = Φ 1(1 3 2n/4), and

(ii) Et N1[g(z)zk] = 0 for all k [n].

Proof Sketch of Lemma 3.3. We let Pn denote the set of all polynomials p : R R of degree at most n and let L1 +(R) denote the set of all nonnegative functions in L1(R). Then, using linear programming, we will get the following primal:

find g L1(R) such that E z N1[p(z)g(z)] = 0 , p Pn

E z N1[g(z)h(z)1{t c}] h(z)1{z c} 1 , h L1 +(R)

E z N1[g(z)H(z)] H 1 , H L1(R)

Then, using (infinite-dimensional) LP duality, we get that the above primal is feasible if and only if there is no polynomial of degree n such that Et N1[|p(t)|1(t c)] < Et N1[p(t)1(t c)].

Using Gaussian hypercontractivity (see [Bog98, Nel73]), one can show that for every polynomial p and c R, it holds

E z N1[|p(z)|] < 2 3n E z N1[|p(z)|] Pr z N1[z c] 1/2 .

From our choice of the parameter c, we have that Prz N1[z c] 3 2n/4; thus, Ez N1[|p(z)|] < Ez N1[|p(z)|], which is a contradiction. Therefore, such a polynomial p cannot exist.

We have proven the existence of the function g in Lemma 3.3. Now, we construct a joint distribution of (z, y) on R { 1} such that E[y|z] is exactly g, as we discussed in the proof outline. For a joint distribution D of (x, y) supported on X { 1}, we will use D+ to denote the conditional distribution of x given y = 1; and D for the distribution of x given y = 1. Lemma 3.4. For any sufficiently large n N, there exists a distribution D on R { 1} such that for (z, y) D:

(i) the marginal distribution Dz = N1;

(ii) E(z,y) D[y|z = z ] = 1 for all z Φ 1(1 3 2n/4);

(iii) E(z,y) D[y] = 0 and E(z,y) D[zk] = E(z,y) D[zk | y = 1] = E(z,y) D[zk | y = 1] for all k [n];

(iv) χ2(D+, N1), χ2(D , N1) = O(1).

Proof Sketch for Lemma 3.4. The properties here directly follow from the properties of g.

Using the framework introduced in [DKS17, DKPZ21], we can construct a set of alternative hypothesis distributions D = {Dv : v V } on Rd { 1}, where V is a set of exponentially many pairwise nearly-orthogonal vectors and the marginal distribution of each Dv on direction v is the distribution D in Lemma 3.4. This effectively embeds the distribution D in a hidden direction v. The size of the family D is exponential in d, and the distributions in it have small pairwise correlations. The details of D are deferred to Appendix C. Now, we are ready to give a proof sketch for the SQ hardness part of our main theorem Theorem 1.3.

Proof Sketch of the SQ hardness part of Theorem 1.3. Let D be the set of distributions discussed above. We also let Dnull be the joint distribution of (x, y) such that x Nd and y Bern(1/2) independent of x. Then, using standard SQ dimension techniques, one can show that any SQ algorithm that solves B(D, Dnull) requires either queries of tolerance at most d Ω(log 1

α ) or makes at least 2dΩ(1) queries. By reducing the decision problem B(D, Dnull) to reliably learning α-biased LTFs with ϵ < α/3 accuracy, we get the lower bound part of the statement in Theorem 1.3.

4 Conclusions and Open Problems

In this paper, we study the problem of learning halfspaces under Gaussian marginals in the reliable learning model. Our main contribution is the design of the first efficient learner for this task whose complexity beats the complexity of agnostically learning this concept class. Moreover, we provide rigorous evidence, via an SQ lower bound, that no fully-polynomial time algorithm exists for general halfspaces. The obvious open question is whether the dependence on ϵ in the complexity of our algorithm can be improved. Specifically, is it possible to design a reliable learner with complexity d O(log(1/α))poly(1/ϵ)? Is it possible to obtain similarly efficient reliable learners under more general marginal distributions (e.g., strongly log-concave or discrete distributions)? More broadly, it would be interesting to characterize the computational separation between (distribution-specific) reliable and agnostic learning for other natural concept classes.

5 Acknowledgements

ID was supported in part by NSF Medium Award CCF-2107079 and an H.I. Romnes Faculty Fellowship. LR was supported in part by NSF Medium Award CCF-2107079. NZ was supported in part by NSF Medium Award CCF-2107079.

[ABHU15] P. Awasthi, M. F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 167 190, 2015.

[ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1 50:27, 2017.

[Bog98] V. Bogachev. Gaussian measures. Mathematical surveys and monographs, vol. 62, 1998.

[DGJ+09] I. Diakonikolas, P. Gopalan, R. Jaiswal, R. Servedio, and E. Viola. Bounded independence fools halfspaces. In Proc. 50th IEEE Symposium on Foundations of Computer Science (FOCS), pages 171 180, 2009.

[DJ19] A. Durgin and B. Juba. Hardness of improper one-sided learning of conjunctions for all uniformly falsifiable csps. In Algorithmic Learning Theory, ALT, volume 98 of Proceedings of Machine Learning Research, pages 369 382, 2019.

[DKK+20] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. A polynomial time algorithm for learning halfspaces with Tsybakov noise. ar Xiv, 2020.

[DKK+21a] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Agnostic proper learning of halfspaces under gaussian marginals. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.

[DKK+21b] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Efficiently learning halfspaces with Tsybakov noise. STOC, 2021.

[DKK+22] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Learning general halfspaces with general Massart noise under the gaussian distribution. In STOC 22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 874 885. ACM, 2022.

[DKN10] I. Diakonikolas, D. M. Kane, and J. Nelson. Bounded independence fools degree-2 threshold functions. In FOCS, pages 11 20, 2010.

[DKPZ21] I. Diakonikolas, D. M. Kane, T. Pittas, and N. Zarifis. The optimality of polynomial regression for agnostic learning under gaussian marginals. In Proceedings of The 34th Conference on Learning Theory, COLT, 2021.

[DKR23] I. Diakonikolas, D. M. Kane, and L. Ren. Near-optimal cryptographic hardness of agnostically learning halfspaces and relu regression under gaussian marginals. In International Conference on Machine Learning, ICML, volume 202 of Proceedings of Machine Learning Research, pages 7922 7938, 2023.

[DKRS23] I. Diakonikolas, D. Kane, L. Ren, and Y. Sun. SQ lower bounds for non-gaussian component analysis with weaker assumptions. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Neur IPS 2023, 2023.

[DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73 84, 2017.

[DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061 1073, 2018.

[DKTZ20a] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning halfspaces with Massart noise under structured distributions. In Conference on Learning Theory, COLT, 2020.

[DKTZ20b] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning halfspaces with Tsybakov noise. ar Xiv, 2020.

[DKTZ20c] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Non-convex SGD learns halfspaces with adversarial label noise. In Advances in Neural Information Processing Systems, Neur IPS, 2020.

[DKTZ22] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning general halfspaces with adversarial label noise via online gradient descent. In International Conference on Machine Learning, ICML 2022, volume 162 of Proceedings of Machine Learning Research, pages 5118 5141. PMLR, 2022.

[DKZ20] I. Diakonikolas, D. M. Kane, and N. Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and Re LUs under Gaussian marginals. In Advances in Neural Information Processing Systems, Neur IPS, 2020.

[Dom99] P. M. Domingos. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155 164, 1999.

[Elk01] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI, pages 973 978, 2001.

[Fan68] K. Fan. On infinite systems of linear inequalities. Journal of Mathematical Analysis and Applications, 21(3):475 478, 1968.

[FGR+17] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. J. ACM, 64(2):8:1 8:37, 2017.

[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119 139, 1997.

[GGK20] S. Goel, A. Gollakota, and A. R. Klivans. Statistical-query lower bounds via functional gradients. In Advances in Neural Information Processing Systems, Neur IPS, 2020.

[GKKM20] S. Goldwasser, A. T. Kalai, Y. Kalai, and O. Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. In Advances in Neural Information Processing Systems, Neur IPS, 2020.

[GKKT17] S. Goel, V. Kanade, A. R. Klivans, and J. Thaler. Reliably learning the relu in polynomial time. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, volume 65 of Proceedings of Machine Learning Research, pages 1004 1042, 2017.

[Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78 150, 1992.

[KK21] A. T. Kalai and V. Kanade. Efficient learning with arbitrary covariate shift. In Algorithmic Learning Theory, 2021, volume 132 of Proceedings of Machine Learning Research, pages 850 864, 2021.

[KKM12] A. T. Kalai, V. Kanade, and Y. Mansour. Reliable agnostic learning. J. Comput. Syst. Sci., 78(5):1481 1495, 2012.

[KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777 1805, 2008. Special issue for FOCS 2005.

[KLS09] A. Klivans, P. Long, and R. Servedio. Learning Halfspaces with Malicious Noise. Journal of Machine Learning Research, 10:2715 2740, 2009.

[KOS08] A. Klivans, R. O Donnell, and R. Servedio. Learning geometric concepts via Gaussian surface area. In Proc. 49th IEEE Symposium on Foundations of Computer Science (FOCS), pages 541 550, Philadelphia, Pennsylvania, 2008.

[KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115 141, 1994.

[KT14] V. Kanade and J. Thaler. Distribution-independent reliable learning. In Proceedings of The 27th Conference on Learning Theory, COLT, volume 35 of JMLR Workshop and Conference Proceedings, pages 3 24, 2014.

[MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381 414. MIT Press, 1994.

[Nel73] E. Nelson. The free markoff field. Journal of Functional Analysis, 12(2):211 227, 1973.

[NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests for statistical hypotheses. Trans. R. So. Lond. Ser. A Contain. Pap. Math. Phys. Character, 231:281 337, 1933.

[Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386 407, 1958.

[Tie23] S. Tiegel. Hardness of agnostically learning halfspaces from worst-case lattice problems. In The Thirty Sixth Annual Conference on Learning Theory, COLT, volume 195 of Proceedings of Machine Learning Research, pages 3029 3064, 2023.

[Vap98] V. Vapnik. Statistical learning theory. Wiley, 1998.

[ZSA20] C. Zhang, J. Shen, and P. Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Advances in Neural Information Processing Systems, Neur IPS, 2020.

Organization The supplementary material is structured as follows: Appendix A includes additional preliminaries. Appendix B includes the omitted proofs from Section 2. Finally, Appendix C includes the omitted proofs from Section 3.

A Additional Preliminaries

We use Sd 1 to denote the d-dimensional unit sphere, i.e., Sd 1 def = {x Rd : x 2 = 1}.

Basics of the SQ Model In order to give an SQ lower bound for reliable learning, we first present the basics of the SQ model. We define the pairwise correlation, which is used for the SQ lower bound.

Definition A.1 (Pairwise Correlation). The pairwise correlation of two distributions with probability density function D1, D2 : Rd 7 R+ with respect to a distribution with density D : Rd 7 R+, where the support of D contains the support of D1 and D2, is defined as χD(D1, D2) def = R

Rd D1(x)D2(x)/D(x)dx 1. Furthermore, the χ-squared divergence of D1 to D is defined

as χ2(D1, D) def = χD(D1, D1).

We require the following definitions and lemmata from [FGR+17] connecting pairwise correlation and SQ lower bounds.

Definition A.2. We say that a set of s distribution D = {D1, , Ds} over Rd is (γ, β)-correlated relative to a distribution D if χD(Di, Dj) γ for all i = j, and χD(Di, Dj) β for i = j.

Definition A.3 (Statistical Query Dimension). For β, γ > 0, a decision problem B(D, D), where D is a fixed distribution and D is a family of distribution, let s be the maximum integer such that there exists a finite set of distributions DD D such that DD is (γ, β)-correlated relative to D and |DD| s. The Statistical Query dimension with pairwise correlations (γ, β) of B is defined to be s, and denoted by s = SD(B, γ, β).

Lemma A.4. Let B(D, D) be a decision problem, where D is the reference distribution and D is a class of distribution. For γ, β > 0, let s = SD(B, γ, β). For any γ > 0, any SQ algorithm for B requires queries of tolerance at most γ + γ or makes at least sγ /(β γ) queries.

Basics of Hermite Polynomials We require the following definitions used in our work.

Definition A.5 (Normalized Hermite Polynomial). For k N, we define the k-th probabilist s Hermite polynomials Hek : R R as Hek(t) = ( 1)ket2/2 dk

dtk e t2/2. We define the k-th normalized Hermite polynomial hk : R R as hk(t) = Hek(t)/

Furthermore, we use multivariate Hermite polynomials in the form of Hermite tensors (as the entries in the Hermite tensors are re-scaled multivariate Hermite polynomials). We define the Hermite tensor as follows.

Definition A.6 (Hermite Tensor). For k N and x Rd, we define the k-th Hermite tensor as

(Hk(x))i1,i2,...,ik = 1

Partitions P of [k] into sets of size 1 and 2

{a,b} P ( Iia,ib) O

{c} P xic .

We also point out that fully reliable learning (see [KKM12] for the definition) reduces to one-sided reliable learning (Definition 1.1). The proof is implicit in the proof of Theorem 3 in [KKM12] and we include it here for completeness.

Fact A.7. For a joint distribution D of (x, y) supported on X { 1}, let h+ : X { 1} (resp. h : X { 1}) be a hypothesis that is ϵ/4-reliable with respect to the concept class C on distribution D for positive labels (resp. for negative labels). Let hypothesis h be defined as h(x) = 1 if h+(x) = h (x) = 1, h(x) = 1 if h+(x) = h (x) = 1 and h(x) =? otherwise. Then h is ϵ close to the best fully reliable sandwich hypothesis for C on distribution D.

Proof Sketch. We first show that R+(h; D), R (h; D) ϵ. Notice that

R+(h; D) = Pr (x,y) D[h(x) = 1 y = 1] Pr (x,y) D[h+(x) = 1 y = 1] ϵ .

The same holds for negative labels.

To show that Pr(x,y) D[h(x) =?] is ϵ-suboptimal, assume (for the purpose of contradiction) that there is an h such that R+(h ; D), R (h ; D) ϵ and Pr(x,y) D[h (x) =?] < Pr(x,y) D[h(x) = ?] ϵ. Notice that

Pr (x,y) D[h(x) =?]

= Pr (x,y) D[h+(x) = 1 h (x) = 1] + Pr (x,y) D[h+(x) = 1 h (x) = 1]

1 Pr (x,y) D[h+(x) = 1] Pr (x,y) D[h (x) = 1] + Pr (x,y) D[h+(x) = 1 h (x) = 1]

1 Pr (x,y) D[h+(x) = 1] Pr (x,y) D[h (x) = 1] + ϵ/2 .

Pr (x,y) D[h+(x) = 1] + Pr (x,y) D[h (x) = 1] 1 Pr (x,y) D[h(x) =?] + ϵ/2 . (1)

Then, define h + : X { 1} as h +(x) = 1 if h (x) = 1 and h +(x) = 1 otherwise (h is defined similarly). We get Pr(x,y) D[h +(x) = 1] + Pr(x,y) D[h (x) = 1] = 1 Pr(x,y) D[h (x) =?]. Using Equation (1) and Pr(x,y) D[h (x) =?] < Pr(x,y) D[h(x) =?] ϵ, we get

Pr (x,y) D[h+(x) = 1] Pr (x,y) D[h (x) = 1] < Pr (x,y) D[h +(x) = 1] Pr (x,y) D[h (x) = 1] ϵ/2 .

Therefore, either Pr(x,y) D[h+(x) = 1] < Pr(x,y) D[h +(x) = 1] ϵ/4 or

Pr (x,y) D[h (x) = 1] < Pr (x,y) D[h (x) = 1] ϵ/4 .

Thus, either h+ or h is not ϵ/4-suboptimal. This gives a contradiction.

B Omitted Proofs from Section 2

We define the bias of a function h : Rd { 1} as min(Prx Nd[h(x) = 1], Prx Nd[h(x) = 1]), and define Hα d to be the set of all LTFs whose bias is α. Let f = argminc Hα d R+(f;D)=0 R (f; D) be the optimal reliable hypothesis and α be the bias of f.

B.1 Reliably Learning Halfspaces with Gaussian Marginals when ϵ 2α

We show that for the problem in Theorem 1.3, in the case of ϵ 2α, the algorithm can just return one of the constant hypotheses. Let h+ (resp. h ) be the +1 (resp. 1) constant hypothesis.

Fact B.1. For the problem defined in Theorem 1.3, the case where ϵ 2α can be efficiently solved by returning h+ if R+(h+; D) 3ϵ/4 and returning h otherwise.

Proof. When the algorithm returns h+, it is easy to see that h+ satisfies the reliable learning requirement in Definition 1.1. So we only consider the case that the algorithm returns h .

Given the algorithm returns h , either there is an α-biased LTF f such that R+(f; D) = 0 or none of the α-biased LTFs f satisfies R+(f; D) = 0. For the case that none of the α-biased LTFs f satisfies R+(f; D) = 0, it is easy to see h is a valid answer from the definition of Definition 1.1. For the case that there is an α-biased LTF f such that R+(f, D) = 0, since the algorithm did not return h+, it must be the case that Pr(x,y) D[y = 1] > 3ϵ/4 > α. Therefore, we have Prx Nd[f(x) = 1] = 1 α. Thus, we get R+(h , D) = 0 and R (h , D) R (f, D) + α. This shows h is ϵ reliable with respect to all α biased LTFs. This completes the proof.

B.2 Reliably Learning Halfspaces with Gaussian Marginals: Case of Unknown α

Here we establish our main result:

Theorem B.2 (Main Algorithmic Result). Let ϵ (0, 1/2) and let D be a joint distribution of (x, y) supported on Rd { 1} with marginal Dx = Nd. Let f = argminf Hd R+(f;D)=0 R (f; D) be the optimal halfspace and α be its bias which is unknown. Then there is an algorithm that uses N = d O(log(min{1/α,1/ϵ})) min 2log(1/ϵ)O(log(1/α)), 2poly(1/ϵ) many samples from D, runs in

poly(N, d, 1/ϵ) time and with high probability returns a hypothesis h(x) = sign( w, x t) that is ϵ-reliable with respect to the class of LTFs.

To prove Theorem B.2, we need to use the algorithm in the following lemma as a black box. The lemma shows that if the optimal halfspace f satisfies Prx Nd[f(x) = 1] 1/2, then the problem can be solved efficiently.

Lemma B.3. Let D be the joint distribution of (x, y) supported on Rd { 1} with marginal Dx = Nd and ϵ (0, 1). Suppose the optimal halfspace f = argminf Hd R+(f;D)=0 R (f; D) satisfies Prx Nd[f(x) = 1] 1/2. Then there is an algorithm that reliably learns LTFs on D using N = O(d/ϵ2) many samples and poly(N, d) running time.

Proof. The algorithm is the following:

1. First check if Pr(x,y) D)[y = 1] ϵ with sufficiently small constant failure probability. If so, return the +1 constant hypothesis.

2. Otherwise, draw m = (d/ϵ)c many samples S = {(x1, y1), , (xm, ym)} conditioned on y = 1. By Step 1, the sampling efficiency here is Ω(ϵ) with high probability.

3. Solve the following semidefinite program for w and t :

such that w , x t 0 , (x, y) S

Return the hypothesis h(x) = sign( w, x t), where w = w / w 2 and t = t / w 2.

Let f(x) = sign( w , x t ) be the optimal hypothesis. We prove that h(x) = sign( w, x t) is such that R+(h; D) ϵ and R (h; D) R (f; D) + ϵ. Since h is consistent with all the negative samples, we have R+(h; D) ϵ. From the assumption that Prx Nd[f(x) = 1] 1/2, we have t 0. Notice that w and t is a feasible solution to the above program; therefore, t t 0 and we must have t = t / w 2 t t , which implies that Pr(x,y) D[h(x) = 1] Pr(x,y) D[f(x) = 1]. Therefore,

R (h; D) = Pr (x,y) D[y = 1] Pr (x,y) D[y = 1 h(x) = 1]

= Pr (x,y) D[y = 1] Pr (x,y) D[h(x) = 1] + R+(h; D)

Pr (x,y) D[y = 1] Pr (x,y) D[f(x) = 1] + ϵ

= R (f; D) + ϵ .

This completes the proof.

We now prove Theorem B.2.

Proof of Theorem B.2. The algorithm is the following:

1. Run the algorithm in Lemma B.3 with ϵ = ϵ/2. If the output hypothesis h satisfies R+(h; D) ϵ with a sufficiently small constant failure probability, then output h and terminate.

2. Set α = 1/2 ϵ/100 and run the algorithm in Theorem B.4 with ϵ = ϵ/2. If the output hypothesis h satisfies R+(h; D) ϵ with a sufficiently small constant failure probability, then output h and terminate. Otherwise, update α as α ϵ/100. Repeat Step 2 until the algorithm terminates.

Let f = argminf Hd R+(f;D)=0 R (f; D) be the optimal halfspace and αf be its bias which is unknown. Suppose the algorithm terminates. Let α be the bias of the output hypothesis h. It is easy to see that R+(h; D) ϵ with high probability; therefore, it only remains to show R (h; D) R (f; D)+ϵ. Notice that if Prx Nd[f(x) = 1] 1/2, the algorithm will terminate in Step 1. Therefore, without loss of generality, we assume Prx Nd[f(x) = 1] 1/2. Then, by Theorem B.4, it is easy to see that α αf ϵ/100 since given any α αf, Step 2 is guaranteed to terminate. Therefore, we have

Pr x Nd[h(x) = 1] 1 α 1 αf + ϵ/100 = Pr x Nd[f(x) = 1] + ϵ/100 .

R (h; D) = Pr (x,y) D[y = 1] Pr (x,y) D[y = 1 h(x) = 1]

= Pr (x,y) D[y = 1] Pr (x,y) D[h(x) = 1] + R+(h; D)

Pr (x,y) D[y = 1] Pr (x,y) D[f(x) = 1] + ϵ

= R (f; D) + ϵ .

This completes the proof.

B.3 Reliably Learning Halfspaces with Gaussian Marginals for the Case ϵ α/2

For convenience, we assume that α is known. This can be assumed without loss of generality as we discussed in Theorem B.2.

We show the following: Theorem B.4. Let ϵ (0, 1/2), α (0, 1/2] and let D be a joint distribution of (x, y) supported on Rd { 1} with marginal Dx = Nd. Assume that there is a halfspace in Hα d that is reliable with respect to D. Then, there is an algorithm that uses

N = d O(log(min{1/α,1/ϵ})) min 2log(1/ϵ)O(log(1/α)), 2poly(1/ϵ) many samples from D, runs in

poly(N, d, 1/ϵ) time and with high probability returns a hypothesis h(x) = sign( w, x t) that is ϵ-reliable with respect to the class of Hα d .

Below we provide the omitted proofs from Section 2.

B.4 Proof of Lemma 2.2

Notice that for any polynomial p such that Ez N1[p(z)] = 0, we have

E (x,y) D[y p( w , x )] = 2 E (x,y) D[1(y = 1) p( w , x )] .

Therefore, it suffices for us to show that there is a zero mean and unit variance polynomial p such that

E (x,y) D[1(y = 1) p( w , x )] = 2 O(t 2) Pr (x,y) D[y = 1] .

Here we will consider the sign-matching polynomial from [DKK+22], which satisfies the following properties:

Claim B.5. Let b R and b 4. There exists a zero mean and unit variance polynomial p : R R of degree k = Θ(b2) such that

1. The sign of p(z) matches sign(z b), i.e. sign(p(z)) = sign(z b).

2. For any z b/2, we have |p(z)| = 2 O(k).

Proof. We consider the polynomial p defined as:

p(z) = q(z) q(b)

where q(z) = z3k, r(z) = z2k (2k 1)!! and k is a sufficiently large odd integer such that 2|b|

k 4 max(|b|, 1). We then take p(z) = p(z)/ p

Eu N1[ p2(u)].

For convenience, we first note that from Stirling s approximation for m Z+,

(m/2)m (2m 1)!! (2m)m .

To prove that sign(p(z)) = sign(z b), we first show that q(b)

r(b) 0. Since q(b) 0, we just need

to show r(b) 0. Notice r(b) = b2k (2k 1)!!, since 2|b|

k and (2k 1)!! (k/2)k, thus r(b) (

k/2)2k (k/2)k 0. Therefore, considering q(z) = z3k and r(z) = z2k (2k 1)!!,

p(z) = q(z) q(b)

must be monotone increasing for z b and p(z) must also be monotone increasing for z b. Notice that p(b) = 0; therefore, p(z) > 0 for z > b. To show p(z) < 0 for z < b, we prove it for the cases z [ b, +b) and z < b. For z [ b, +b), notice that due to q(b)

r(b) 0 and the definition of q(z) and r(z),

p(z) = q(z) q(b)

r(b)r(z) q(b) q(b)

r(b)r(b) < p(b) = 0 .

Then, for z b, we show that p(z) is monotone increasing. Notice that the derivative p (z) = kz2k 1(3zk 2 q(b)

r(b)), for z b < 0, this is positive if and only if 3zk 2 q(b)

r(b) < 0. If we can

3q(b)/r(b), then it is immediate that for any z b, 3zk 2 q(b)

r(b) < 0. The condition

3q(b)/r(b) can be further simplified to (2k 1)!!/b2k 5/3. Then considering 2b <

k and (2k 1)!! (k/2)k, we have

(2k 1)!!/b2k (k/2)k/(

k/2)2k 2k ,

thus the condition holds. Therefore, p (z) > 0 for z < b and p(z) must be monotone increasing for z < b. Then combined with the condition p( b) < 0, which is implied from the previous case (z [ b, b]), we have p(z) < 0 for z < b.

For the Property 2, we show that p(z) = 2 Θ(k) for z [0, b/2], z [ b/2, 0], z [ b, b/2] and z [ , b] separately. For z [0, b/2] notice that p(z) is convex for z [0, b]; therefore, it suffices to show p(0) = 2 Θ(k). Note that p(0) = q(b) r(b)(2k 1)!!/ p

Eu N1 p2(u), and we

r(b) b3k/ max(b2k, (2k 1)!!) Ω(k)k/2. Therefore, since Eu N1[ p2(u)] (O(k))3k,

we have p(0) 2 O(k). This completes the proof of the case z [0, b/2]. For the case z [ b/2, 0], it immediately follows from the fact that |p( z)| |p(z)| by the definition. For the case z [ b, b/2], we have p(z) (p(b) (b/2)3k)/ p

Eu N1 p2(u) 2 O(k). Then, the case z [ , b] immediately follows from the fact that p(z) is monotone increasing in this interval. This completes the proof of Property 2.

Given Claim B.5, we let p be the polynomial in Claim B.5 with b = max(2|t |, 4). Then, given that D satisfies the reliability condition with respect to f, it is easy to see the polynomial p( w , x ) satisfies the requirements. This completes the proof of Lemma 2.2.

B.5 Proof of Lemma 2.4

We start by bounding the Frobenius norm of Tm, i.e., Tm F . Notice that by taking p : R R to be any degree-m polynomial such that Ex Nd[p(x)] = 0 and Ex Nd[p(x)2] = 1, then Tm F = maxp E(x,y) D[1(y = 1)p(x)]. We leverage the following standard fact about the concentration of polynomials lemma known as Bonami-Beckner inequality or simply Gaussian hypercontractivity.

Fact B.6 (Hypercontractivity Concentration Inequality [Bog98, Nel73]). Consider any m-degree, unit variance polynomial p : Rd R with respect to Nd. Then, for any λ 0

h p(x) E u[p(u)] λ i e2e (cλ2) 1/m ,

where c is an absolute constant.

Using Fact B.6, we have that for any zero mean and unit variance polynomial p,

E (x,y) D[1(y = 1)p(x)] E (x,y) D[1(y = 1)|p(x)|] Z

0 min γ, e2e (cλ2) 1/m dλ

= O γ log(1/γ)m/2 .

Therefore, Tm F = O γ log(1/γ)m/2 .

To prove the Property 1, since T m F Tm F + τ/(4

k) = O γ log(1/γ)m/2 + τ/(4

there can be at most Tm 2 F / τ/

k 2 = O(γ2 log(1/γ)mk/τ 2 + 1) many singular vectors with

singular values greater than τ/(4

k). Therefore, dim(V ) is at most O(γ2 log(1/γ)mk/τ 2 + 1).

For the Property 2, suppose proj V (v ) 2 = o τ/

kγ log(1/γ)k/2 , then we have for any m,

v T m 2 v Tm 2 + τ/(4

Tm F proj V (v ) 2 + τ/(4

k) proj V (v ) 2 + τ/(4

However, since E(x,y) D[1(y = 1)p( v , x )] = Pk m=1 aiv Tmv m 1 τ for some a2 1 + + a2 k = 1, we know there must be an m such that v Tmv m 1 τ/

Therefore, we can write

v T m 2 v Tm 2 τ/(4

k) = (3/4)τ/

which contradicts v T m 2 (5/8)τ/

k. This completes the proof.

B.6 Proof of Proposition 2.1

The pseudocode for the algorithm establishing Proposition 2.1 is given in Algorithm 2.

We can now prove Proposition 2.1.

Proof of Proposition 2.1. We first introduce the following fact about learning Chow tensors.

Fact B.7. Fix m Z+, and ϵ, δ (0, 1). Let D be a distribution on Rd [ 1] with standard normal marginals. There is an algorithm that with N = d O(m) log(1/δ)/ϵ2 samples and poly(d, N) runtime, outputs an approximation T m of the order-m Chow-parameter tensor Tm = E(x,y) D[y Hm(x)] such that with probability 1 δ, it holds T m Tm F ϵ.

Proof. Let T m be the empirical estimation of Tm using N = dcm log(1/δ)/ϵ2 for sufficiently large universal constant c. We start by showing E(x,y) D y Hm(x) 2 F dm. Notice that,

y Hm(x) 2 F = E (x,y) D

y2 Hm(x) 2 F E (x,y) D

Hm(x) 2 F = E x Rd Hm(x) 2 F .

Notice that each entry of Hm(x) must be some αh(x) where α [0, 1] and h(x) is a unit variance Hermite polynomial. Therefore, E(x,y) D Hm(x) 2 F dm and thus E(x,y) D y Hm(x) 2 F dm.

Using the above, we have E (x,y) D

y Hm(x) Tm 2 F = E (x,y) D

y Hm(x) 2 F Tm 2 F dm .

Then, applying Chebyshev s inequality gives T m Tm F ϵ.

Input: Sample access to a joint distribution D of (x, y) supported on Rd { 1} with the marginal Dx = Nd. Let ϵ, δ (0, 1) and suppose D satisfies the reliability condition with respect to an LTF

f(x) = sign( w , x t ) with t = O p

log(1/ϵ) and Pr(x,y) D[y = 1] ϵ.

Output: With probability at least Ω(1), the algorithm outputs a unit vector v such that v, w =

max log(1/ϵ) O(t 2), ϵO(1) .

1. Let c1 be a sufficiently large universal constant and c2 be a sufficiently large universal constant depending on c1. Let S be a set of N = dc2max(t 2,1) log(1/δ)/ϵ2 many samples from D. For m = 1, , c1t 2, with 1/2 probability, take

T m = E (x,y) u S[1(y = 1)Hm(x)] ,

to be the empirical Chow-tensor on negative samples. Otherwise, take

T m = E (x,y) u S[y Hm(x)] ,

to be the empirical Chow-tensor on all samples. Let γ be the empirical estimation of Pr(x,y) D[y = 1] with error at most ϵ/100 with a sufficiently small constant failure probability.

2. Take T m Rd dm 1 to be the flattened version of T m, and let Vm be the subspace spanned by the left singular vectors of T m whose singular values are greater than 2 ct 2γ where c is a sufficiently large constant. Let V be the union of all Vm and output v to be a random unit vector chosen from V .

Algorithm 2: Finding a Direction with High Correlation.

We prove Proposition 2.1 for two cases: (a) max(log(1/ϵ) O(t 2), ϵO(1)) = log(1/ϵ) O(t 2); and (b) max(log(1/ϵ) O(t 2), ϵO(1)) = ϵO(1). We first prove Case (a) as the other case is similar. For Case (a), it suffices for us to prove that v, w = log(1/ϵ) O(t 2) happens with Ω(1) probability given the algorithm chooses T m = E(x,y) u S[1(y = 1)Hm(x)] (which happens with 1/2 probability). Let k = c1t 2 and τ/(4

k) = 2 ct 2γ. By Fact B.7, we have T m Tm F τ/(4

k). Then from Lemma 2.2 and Lemma 2.4, if we take v to be a random vector in V , we will with constant probability have,

v, w = Ω τ/

kγ log(1/γ)k/2

dim(V )1/2 = Ω τ 2/ log(1/γ)kk3/2γ2

= Ω log(1/γ) O(t 2)k 3/2 .

Since Pr(x,y) D[y = 1] ϵ, we have γ = Ω(ϵ). Also, considering k = O(max{t 2, 1}) = O (log(1/ϵ)), we get v, w = log(1/ϵ) O(t 2).

For Case (b), the proof remains the same except we take T m = E(x,y) u S[y Hm(x)] and use fact below instead of Lemma 2.4.

Fact B.8 (Lemma 5.10 in [DKK+22]). Let D be the joint distribution of (x, y) supported on Rd { 1} with the marginal Dx = Nd. Let p : R 7 R be a univariate, mean zero, unit variance polynomial of degree k such that for some unit vector v Rd it holds E(x,y) D[yp( v , x )] τ for some τ (0, 1]. Let T m be an approximation of the order-m Chow-parameter tensor Tm = E(x,y) D[y Hm(x)] such that T m Tm F τ/(4

k). Denote by Vm the subspace spanned by the left singular vectors of flattened T m whose singular values are greater than τ/(4

k). Moreover, denote by V the union of V1, , Vk. Then we have that

1. dim(V ) 4k/τ 2, and

2. proj V (v ) 2 τ/ 4

Then the same argument will give v, w = ϵO(1). The above described algorithm uses at most N = d O(t 2)/ϵ2 samples and poly (N, 1/ϵ) runtime. This completes the proof of Proposition 2.1.

B.7 Proof of Lemma 2.5

Item 1 follows from the definition of D and the fact that D has marginal distribution Dx = Nd.

For proving Item 2, we consider two cases: The first case is when t > 0 and the second one when t 0. For the case t 0, we prove that the distribution D satisfies the reliability condition with respect to h (x) = sign( w , x ) where w = w w/ w w 2. Notice that for any (x, y) such that x B and h (x w) = 1, we have

f(x) = sign( w , x t ) = sign( proj w(w ), x + projw(w ), x t )

= sign( w w, x w + projw(w ), x t ) ,

where w w, x w 0 since h (x) = sign( w w, x w / w w 2) 0. Then we have

f(x) sign( projw(w ), x t ) = sign( projw(w )/ projw(w ) 2, x t / projw(w ) 2 sign( w, x t ) (2) sign( w, x t) = h(x) = +1 ,

where Inequality (2) follows from the fact that w, w > 0, projw(w ) 2 (0, 1] and t 0. Since D satisfies the reliability condition with respect to f, we have that it must be the case y = +1. Therefore, D satisfies the reliability condition with respect to h .

For the case t > 0, we prove that the distribution D satisfies the reliability condition with respect to h (x) = sign( w , x t ) where w = w w/ w w 2 and t = t . Notice that for any (x, y) such that x B and h (x w) = 1, we have

f(x) = sign( w , x t ) = sign( proj ww , x + projww , x t )

= sign w w 2 w , x w w 2t + projww , x (1 w w 2)t ,

where w w 2 w , x w w 2t 0 since h (x) = +1. Then we have

f(x) sign( projww , x (1 w w 2)t )

1 w w 2 2 w, x (1 w w 2)t

1 w w 2 2 ( w, x t ) = h(x) = 1 ,

where the second equation follows from w, w > 0 and the third one follows from t > 0. Since D satisfies the reliability condition with respect to f, we have that it must be the case y = +1. Therefore, D satisfies the reliability condition with respect to h .

For proving Item 3, notice that R+(h; D) ϵ/2 implies

Pr (x ,y) D [y = 1] = R+(h; D)/ Pr (x,y) D(x B) ϵ/2 .

This completes the proof.

B.8 Proof of Theorem B.4

We first give the algorithm in Theorem B.4, which is a more detailed version of Algorithm 1.

Input: ϵ (0, 1), α (0, 1/2) and sample access to a joint distribution D of (x, y) supported on Rd { 1} with x-marginal Dx = Nd. Output: h(x) = sign( w, x t) that is ϵ-reliable with respect to the class Hα d . 1. Check if Pr(x,y) D[y = 1] ϵ/2 (with sufficiently small constant failure probability). If so, return the +1 constant hypothesis. If the parameters satisfy

min 2log(1/ϵ)O(log(1/α)), 2poly(1/ϵ) = 2log(1/ϵ)O(log(1/α)) ,

then set the correlation parameter ζ = log(1/ϵ) ct 2, where c is a sufficiently large universal constant. Otherwise, set ζ = ϵc for a sufficiently large universal constant c. For t = Φ 1(α) and t = Φ 1(α), do Steps (2) and (3).

2. Initialize w to be a random unit vector in Rd. Let the update step size λ = ζ and repeat the following process until λ ϵ/100.

(a) Use samples from D to check if the hypothesis h(x) = sign( w, x t) satisfies R+(h; D) ϵ/2. If so, go to Step (3). (b) With 1/2 probability, let w = w. Let B = {x Rd : w, x t 0}, and let D be the distribution of (x w, y) for (x, y) D given x B. Use the algorithm of Proposi-

tion 2.1 on D to find a unit vector v such that v, w = 0 and D v, proj w(w ) proj w(w ) 2

Then, update w as follows: wupdate = w+λv w+λv 2 .

(c) Repeat Steps (2a) and (2b) c/ζ2 times, where c is a sufficiently large universal constant, with the same step size λ. After that, update the new step size as λupdate = λ/2. 3. Check if h(x) = sign( w, x t) satisfies R+(h; D) ϵ/2. If so, add it to the set S. For each choice of value t in Step (1), repeat Step (2) 21/ζc many times where c is a sufficiently large constant. 4. Let S be the set in Step (3). Return the hypothesis h(x) = sign( w, x t), where (w, t) = argmin(w,t) S t. Return the 1 constant hypothesis if S is empty.

Algorithm 3: Reliably Learning General Halfspaces with Gaussian Marginals (detailed version of Algorithm 1).

Proof of Theorem B.4. Without loss of generality, we assume ϵ > α for the following reason. Suppose α ϵ, then we can simply return one of the constant hypotheses that is closest to f. Let f(x) = sign( w , x t ) be the optimal LTF with α bias, namely,

f = argmin f Hα d R+(h;D)=0 R (f; D) .

We prove that with high probability, Algorithm 3 returns a hypothesis h(x) = sign( w, x t) such that R+(h; D) ϵ and R (h; D) R (f; D) + ϵ. Notice that at some point, we will run Step (2) with t = t . We show that given t = t , Step (2) will with probability 2 poly(1/ζ) returns a h with R+(h; D) ϵ/2. For convenience, we assume h never satisfies R+(h; D) ϵ/2 in Step (2a) (otherwise, we are done). Furthermore, we assume the subroutine algorithm in Proposition 2.1 used in Step (2b) always succeed, and we always have w, w 0, since both happen with constant probability and we are running the update at most O(log(1/ϵ)/ζ2) = poly(1/ζ) many times.

Let η = proj ww 2. Now, we will prove by induction that each time after c/ζ2 many updates in step (2b), we always have proj ww 2 3λ (without loss of generality, we assume λ is always at most a sufficiently small constant). Notice that by Lemma 2.5 and Proposition 2.1, we have that at each update, given the subroutine algorithm in Proposition 2.1 succeeds, the update direction v always satisfies D v, proj w(w )

η E ζ, which implies v, proj w(w ) ηζ. Then we have

1 η2 1 η2. When we update w, there are two possibilities. Either λ > ηζ/2 or

λ ηζ/2. If λ ηζ/2, using Fact 2.6, we will have wupdate, w w, w + λ2/2. If λ > ηζ/2, we have proj wupdate(w ) 2 = proj w (wupdate) 2 proj w (w + λv) 2

proj w (w) 2 + λ 3λ .

For the base case, when we do the first group of c/ζ2 many updates, combining the two cases above and the fact that λ = ζ ηζ gives that after c/ζ2 many updates, we must have proj ww 2 3λ. For the induction case, given η 6λ (which follows from the previous group of c/ζ2 many updates), combining the two cases above gives that after c/ζ2 many updates, we must have proj ww 2 3λ. This proves that after each group of c/ζ2 many updates, we have proj ww 2 3λ.

Therefore, when we have λ ϵ/100, we get proj ww 2 3ϵ/100, which implies R+(h; D) ϵ/2 since D satisfies the reliability condition with respect to f. Given R+(h; D) ϵ/2, using the fact that t t 0 (since we always output the hypothesis with the smallest t) implies Pr(x,y) D[h(x) = 1] Pr(x,y) D[f(x) = 1], we have that

R (h; D) = Pr (x,y) D[y = 1] Pr (x,y) D[y = 1 h(x) = 1]

= Pr (x,y) D[y = 1] Pr (x,y) D[h(x) = 1] + R+(h; D)

Pr (x,y) D[y = 1] Pr (x,y) D[f(x) = 1] + ϵ = R (f; D) + ϵ .

This completes the proof.

C Omitted Proofs from Section 3

We first restate our main SQ hardness result.

Theorem C.1 (SQ Lower Bound for Reliable Learning). For any α > 0 that is at most a sufficiently small absolute constant and ϵ (0, α/3), any SQ algorithm that reliably learns α-biased LTFs (for positive labels) on Rd under Gaussian marginals to additive error ϵ either (i) requires at least one query of tolerance at most d Ω(log( 1

α)), or (ii) requires at least 2dΩ(1) queries.

C.1 Proof of Lemma 3.3

We let Pn denote the set of all polynomials p : R R of degree at most n and let L1 +(R) denote the set of all nonnegative functions in L1(R). First, we write the conditions as the primal linear program.

find g L1(R) such that E z N1[p(z)g(z)] = 0 , p Pn

g(z) 1 , z c g 1

By introducing variables H L1(R) and h L1 +(R), we rewrite the primal LP as

find g L1(R) such that E z N1[p(z)g(z)] = 0 , p Pn

E z N1[g(z)h(z)1{z c}] h(z)1{z c} 1 , h L1 +(R)

E z N1[g(z)H(z)] H 1 , H L1(R)

The following fact is an application of (an infinite generalization of) LP duality:

Claim C.2. The LP above is feasible if there exists no polynomial p of degree n such that

E z N1[|p(z)|1(z c)] < E z N1[p(z)1(z c)] .

Proof of Claim C.2. First, we introduce some notation. We use ( h, c) for the inequality Ez N [β(z) h(z)] + c 0, where h L1(R) and c R. Moreover, let S be the set that contains all such tuples that describe the target system. For the set S, the closed convex cone over L1(R) R is the smallest closed set S+ satisfying the following: (a) if A S+ and B S+ then A + B S+; and (b) if A S+ then λA S+ for all λ 0. Note that the S+ contains the same feasible solutions as S. In order to prove the statement, we need the following LP duality from [Fan68].

Fact C.3 (Theorem 1 of [Fan68]). If X is a locally convex, real separated vector space. Then, a linear system described by S is feasible (i.e., there exists a g X ) if and only if (0, 1) S+.

Our dual LP is defined by the following inequalities:

(a) (p, 0) for p Pn;

(b) ( h, h 1) for all h(z) = h(z)1{z c}, where h L1 +(R); and

(c) and (H, H 1) for all H L1(R).

Furthermore, the LP is also equivalent to breaking the last inequality into two, i.e., ( H1, H1 1) for all H1(z) = H1(z)1{z < c}, where H1 L1(R), and ( H2, H2 1) for all H2(z) = H2(z)1{z c} where H2 L1 +(R). By taking the convex cone defined from the above inequalities, we get the following set

S+ = n p h + H1 + H2, h 1 H1 + H2 1 p Pn,

h(z) = h(z)1{z c} for h L1 +(R), H1(z) = H1(z)1{z < c} for H1 L1(R),

H2(z) = H2(z)1{z c} for H2 L1 +(R) o .

We first show that S+ is closed. To do so, we prove that S+ is closed under limits. First, we assume, in order to reach a contradiction, there is a sequence (pi, hi, H1,i, H2,i) so that

(Fi, λi) = pi hi + H1,i + H2,i, hi 1 H1,i + H2,i 1

has Fi converges to Flim with respect to L1-norm and λi converges to λlim. We claim then that (Flim, λlim) S+. First, notice that pi 1 Fi 1 + hi 1 + H1,i 1 + H2,i 1 Fi 1 + 3. Therefore, there must be a k such that for any i k, pi 1 Flim 1 + 4.

Then, since an L1 ball in Pn with radius Flim 1 + 4 is compact, there must be a subsequence of (pi, hi, H1,i, H2,i) such that pi converges to plim for some plim Pn. Notice that due to the positivity of hi and H2,i, it must be hi = (Fi pi) 1(z c), H2,i = (Fi pi)+1(z c) and H1,i = (Fi pi)1(z < c). Therefore, hi, H1,i and H2,i converge to hlim = (Flim plim) 1(z c), H2,lim = (Flim plim)+1(z c) and H1,lim = (Flim plim)1(z < c) respectively. Then by taking (plim, hlim, H1,lim, H2,lim), one can see that (Flim, λlim) S+.

It only remains to verify that if there exists a polynomial p of degree n such that

E z N1[|p(z)|1(z c)] < E z N1[p(z)1(z c)] ,

then (0, 1) S+. Let p be the polynomial satisfying the above. We can simply take hi = (pi)+1(z c), H2,i = (pi) 1(z c) and H1,i = pi1(z < c). This gives pi hi + H1,i + H2,i, hi 1 H1,i + H2,i 1 = (0, k) S+ ,

k = (pi)+1(z c) 1 (pi) 1(z c) pi1(z < c) 1 = (pi)+1(z c) 1 (pi) 1(z c) 1 pi1(z < c) 1 = E z N1[p(z)1(z c)] E z N1[|p(z)|1(z c)] > 0

Then, by properly rescaling (0, k), we get (0, 1) S+. This completes the proof.

Given Claim C.2, to prove Lemma 3.3, it only remains to show that there does not exist a polynomial p of degree at most n such that

E z N1[|p(z)|1(z c)] < E z N1[p(z)1(z c)] .

First, the above condition implies that Ez N1[|p(z)|] < 2 Ez N1[|p(z)|1(z c)]. Using the Cauchy Schwarz inequality, we get

E z N1[|p(z)|] < 2 E z N1[|p(z)|2]1/2 Pr z N1[z c] 1/2 . (3)

We then give the following claim. Claim C.4. Let p : R 7 R be any polynomial of degree at most n. Then, it holds that

E z N1[|p(z)|2]1/2 3n E z N1[|p(z)|] .

Proof for Claim C.4. The proof is based on the following well-known fact: for any function g : R 7 R, it holds that (from Holder s inequality)

E z N1[|g(z)|2] = E z N1[|g(z)|2/3|g(z)|4/3] E z N1[|g(z)|]2/3 E z N1[|g(z)|4]1/3 ,

thus E z N1[|g(z)|2]1/2 E z N1[|g(z)|]1/3 E z N1[|g(z)|4]1/6 . (4)

We then use Fact C.5 (Gaussian Hypercontractivity) to bound the term Ez N1[|g(z)|4]1/6.

Fact C.5 (Gaussian Hypercontractivity). Let p : R 7 R be any polynomial of degree at most l. Then, for q 2, it holds

E x N1[|p(x)|q] (q 1)ql/2 E x N1[p2(x)] q/2 .

Now we apply Fact C.5 and choose l = n and q = 4. This implies Ez N1[|p(z)|4] 32n Ez N1[|p(z)|2]2, which is Ez N1[|p(z)|4]1/6 3n/3 Ez N1[|p(z)|2]1/3. Plugging it into Equation (4), we get that for any polynomial p of at most degree-n,

E z N1[|p(z)|2]1/2 3n/3 E z N1[|p(z)|]1/3 E z N1[|p(z)|2]1/3 .

This implies E z N1[|p(z)|2]1/2 3n E z N1[|p(z)|] .

Applying Claim C.4 to Equation (3), we get

E z N1[|p(z)|] < 2 3n E z N1[|p(z)|] Pr z N1[z c] 1/2 .

From our choice of the parameter c, we have that Prz N1[z c] 3 2n/4, thus

E z N1[|p(z)|] < E z N1[|p(z)|] ,

contradiction. Therefore, such a polynomial p cannot exist. Thus, a function g satisfying the requirements in the body of the lemma exists.

C.2 Proof of Lemma 3.4

We let D be the unique distribution such that z N1 and E(z,y) D[y|z = z ] = g(z ) where g is the function from Lemma 3.3. We claim that D satisfies the conditions in this lemma s statement. Condition (i) is immediate from the definition of D. Then Condition (ii) is immediately implied from Condition (a) in Lemma 3.3.

For Condition (iii), Condition (b) in Lemma 3.3 implies Ez N1[g(z)] = 0 which immediately implies E(z,y) D[y] = 0. To show that

E (z,y) D[zk] = E (z,y) D[zk | y = 1] = E (z,y) D[zk | y = 1]

for all k [n], notice that for all k [n],

E (z,y) D[zk] = Pr (z,y) D[y = 1] E (z,y) D[zk | y = 1] + Pr (z,y) D[y = 1] E (z,y) D[zk | y = 1] . (5)

Moreover, from Condition (ii) of Lemma 3.3, we have E (z,y) D[g(z)zk] = E (z,y) D[yzk]

= Pr (z,y) D[y = 1] E (z,y) D[zk | y = 1] Pr (z,y) D[y = 1] E (z,y) D[zk | y = 1]

=0 . (6) Taking the difference between Equation (5) and Equation (6) gives

E (z,y) D[zk] = 2 Pr (z,y) D[y = 1] E (z,y) D[zk | y = 1] .

Plugging in that Pr(z,y) D[y = 1] = 1/2, which follows from E(z,y) D[y] = 0 and y is supported on { 1}, we get E (z,y) D[zk] = E (z,y) D[zk | y = 1] .

Then, we have E (z,y) D[zk] = E (z,y) D[zk | y = 1] Pr (z,y) D[y = 1] + E (z,y) D[zk | y = 1] Pr (z,y) D[y = 1] ;

therefore, E (z,y) D[zk] = E (z,y) D[zk | y = 1] .

Thus, Condition (iii) holds.

Now, we show the fourth condition. We show the following claim. Claim C.6. It holds that χ2(D+, N1) = O(1) .

χ2(D+, N1) = Z

PN1(z) dz 1

R PD+(z)PD+(z)

PN1(z) dz 1

R PD+(z)/ Pr (z,y) Df[y = 1]dz 1

= Pr (z,y) D[y = 1] 1 Z

R PD+(z)dz 1

= Pr (z,y) D[y = 1] 1 1 = O(1) ,

where the inequality follows from the fact that PN1(z) = PD+(z) Pr (z,y) D[y = 1] + PD (z) Pr (z,y) D[y = 1] PD+(z) Pr (z,y) D[y = 1] ,

which implies PD+(z)/PN1(z) 1/ Pr(z,y) D[y = 1]. The same holds for χ2(D , N1). This completes the proof of Claim C.6.

The proof of χ2(D , N1) = O(1) is similar to Claim C.6. This completes the proof of Lemma 3.4.

C.3 Construction of the Alternative Hypothesis Distribution Set D

Lemma C.7. For any sufficiently small α > 0, there exists a distribution family D = {Dv : v V } supported on Rd { 1} where |D| = 2dΩ(1) satisfying the following:

(i) For any Dv D and (x, y) Dv, the marginal Dx = Nd;

(ii) For any Dv D and (x, y) Dv, E(x,y) Dv[y|x = x ] = 1 for all x Rd such that v, x Φ 1(1 α);

(iii) Let Dnull be the joint distribution of (x, y) such that x Nd and y = 1 with probability 1/2 independent of x. Then for any Du, Dv D, χDnull(Du, Dv) = O(1) if u = v and χDnull(Du, Dv) = d Ω(log 1/α) if u = v.

Here, we give two lemmas from [DKPZ21] to construct a large set of distributions on Rd { 1} with small correlations. For a matrix A Rm n, we use A 2 to denote the spectral norm. Lemma C.8 (Correlation Lemma: Lemma 2.3 from [DKPZ21]). Let g : Rm 7 R and U, V Rm d with m d be linear maps such that UUT = VVT = I where I is the m m identity matrix. Then, we have that

E x Nd[g(Ux)g(Vx)]

t=0 UVT t 2 E x Nm[(g[t](x))2] ,

where g[t] denote the degree-t Hermite part of g.

Note that we will use Lemma C.8 with m = 1 and n = d so that the linear maps U and V become vectors. We then introduce the following well-known fact, which states that there are exponentially many near-orthogonal vectors in Rd. Fact C.9 (Near-orthogonal Vectors: Fact 2.6 from [DKPZ21]). For any 0 < c < 1/2, there exists a set V Sd 1 such that |V | = 2dΩ(c) and for any u, v V , u, v dc 1/2 .

Proof of Lemma C.7. We define D in the following manner. We let D be the distribution on R { 1} in Lemma 3.4 for n = c log(1/α) where c is a sufficiently small universal constant such that 3 2n/4 α. We take V to be the set of vectors in Fact C.9 with c = 1/4. Then we consider the set of distributions D = {Dv : v V } where Dv is the unique distribution on Rd { 1} generated in the following way. We first sample (z, y) D and x Nd 1. Then let B Rd (d 1) be a matrix whose columns form an (arbitrary) orthonormal basis for the orthogonal complement of v. Let x = B x + zv and we define the distribution Dv = D(x,y). This effectively embeds the distribution D along the hidden direction v in Dv.

To establish Condition (i) in this lemma, the definition of x combined with the fact that Dz = N1 (from Condition (i) in Lemma 3.4) implies the marginal distribution Dx = Nd.

Then, for Condition (ii) in this lemma, we have that for any x Rd such that v, x Φ 1(1 α), E (x,y) Dv[y|x = x ] = E (z,y) D [y = 1|z = v, x ] ,

from the definition of Dv. Since 3 2n/4 α, we have v, x Φ 1(1 α) Φ 1(1 3 2n/4). Therefore, using the second condition in Lemma 3.4, we get E (x,y) Dv[y|x = x ] = E (z,y) D [y|z = v, x ] = 1 .

For Condition (iii) in this lemma, notice that Condition (iii) in Lemma 3.4 implies Pr (x,y) Dv[y = 1] = Pr (x,y) Dv[y = 1] = 1/2 .

Then, χDnull(Dv, Dv) = Pr (x,y) Dv[y = 1]χ2(D+ v , Nd) + Pr (x,y) Dv[y = 1]χ2(D v , Nd)

2χ2(D+ v , Nd) + 1

2χ2(D v , Nd) .

We show that the first term χ2(D+ v , Nd) = O(1). χ2(D v , Nd) = O(1) can be shown in the exact same way. Notice that

χ2(D+ v , Nd) = Z

PNd(u) du 1

PD+ v (B u + zv)2

PNd(B u + zv) dudz 1

Rd 1 PNd 1(u)2PD +(z)2

PNd 1(u)PN1(z) dudz 1

Rd 1 PNd 1(u)du Z

PN1(z) dz 1

=χ2(D +, N1) = O(1) ,

where the last equality follows from Condition (iv) of Lemma 3.4. Therefore

χDnull(Dv, Dv) = 1

2χ2(D+ v , Nd) + 1

2χ2(D v , Nd) = O(1) .

For Du, Dv D where u = v, we have

χDnull(Du, Dv) = Pr[y = 1]χNd(D+ u , D+ v ) + Pr[y = 1]χNd(D u , D v )

2χNd(D+ u , D+ v ) + 1

2χNd(D u , D v ) .

We show that χNd(D+ u , D+ v ) = d Ω(log 1

α ). χNd(D u , D v ) = d Ω(log 1

α ) can be shown in the exact same way. Notice that

χNd(D+ u , D+ v ) = Z

PD+ u (w)PD+ v (w)

PNd(w) dw 1

PD+ u (proj u(w) + proju(w))PD+ v (proj v(w) + projv(w))

PNd(w) dw 1

Rd PNd 1(w u)PD +(wu)PNd 1(w v)PD +(wv)

PNd(w) dw 1

Rd PNd(w) PD +(wu)

PN1(wu) PD +(wv)

PN1(wv) dw 1 .

Let g : R R be g(z) def = PD +(z)/PN1(z) and apply Lemma C.8 and Condition (iii) of Lemma 3.4, we get

χNd(D+ u , D+ v )

t=n+1 d Ω(t) E z N1

g[t](z) 2 !

d Ω(n)χ2 N1(D +, N1)

where the last equality follows form Condition (iv) of Lemma 3.4. This completes the proof of Lemma Lemma C.7.

C.4 Proof of Theorem C.1

Let D be the set of distributions in Lemma C.7. We also let Dnull be the joint distribution of (x, y) such that x Nd and y Bern(1/2) independent of x. We consider the decision problem of B(D, Dnull). Using Lemma A.4 with γ = d log( 1

α ), we can see that any SQ algorithm that solves B(D, Dnull) requires either queries of tolerance at most d Ω(log 1

α ) or makes at least 2dΩ(1) queries.

We then reduce the decision problem B(D, Dnull) to reliably learning α-biased LTFs with ϵ < α/3 accuracy. Let D be an instance of B(D, Dnull) which we need to decide either D = Dnull or

D D. Let A be an algorithm that reliably learns α-biased LTFs with ϵ < α/3 accuracy and succeeds with 2/3 probability. We can simply give the i.i.d. samples from the distribution D to algorithm A. Suppose that A succeeds and outputs a hypothesis function h. If D = Dnull, then since A promises R+(h, Dnull) ϵ. Since Dnull has y Bern(1/2) independent of x, this implies Pr(x,y) D[h(x) = 1] ϵ.

Then we argue that if we are in the alternative case, i.e., D D, then given that the algorithm succeeds, we must have Pr(x,y) D[h(x) = 1] α ϵ instead. We assume Pr(x,y) D[h(x) = 1] < α ϵ and prove a contradiction. If D = Dv D, then by the first property in Lemma C.7, there must be an α-biased LTF c such that Pr(x,y) D[c(x) = 1] α and Pr(x,y) D[c(x) = 1 y = 1] = 0. Combining this with the fact that Pr(x,y) D[y = 1] = 1/2, we have R (c; D) = Pr[c(x) = 1 y = 1] 1/2 α. Therefore, the output hypothesis h of A has to satisfy

R (h; D) 1/2 α + ϵ .

However, since Pr(x,y) D[h(x) = 1] < α ϵ, we have

R (h; D) = Pr (x,y) D[h(x) = 1 y = 1] Pr (x,y) D[y = 1] Pr (x,y) D[h(x) = 1] > 1/2 α + ϵ

which contradicts the above. Thus, it has to be Pr(x,y) D[h(x) = 1] > α ϵ.

If D = Dnull, then Pr(x,y) D[h(x) = 1] ϵ. Otherwise, we have Pr(x,y) D[h(x) = 1] α ϵ. Since the gap between them is at least a constant α 2ϵ α/3 and Pr(x,y) D[h(x) = 1] can be estimated to inverse exponential accuracy on samples with high probability. Thus, we can distinguish the two cases of B(D, Dnull) with high probability.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The abstract summarizes the result provided in Theorem 1.3. The introduction describes how this contribution resolves an open problem in the literature by summarizing the motivation for the model and describing prior work s contributions. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The limitations are clearly stated in the statement of each theorem and are discussed in the introduction of the paper. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: Each theorem statement provides all the assumptions and we provide a complete proof for all statements that is either in the main body of the paper or in the appendix.

Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [NA]

Justification: The paper is theoretical in nature and does not include experiments.

Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: The paper is theoretical in nature and does not include experiments. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: The paper is theoretical in nature and does not include experiments. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: The paper is theoretical in nature and does not include experiments. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [NA]

Justification: The paper is theoretical in nature and does not include experiments.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: Our research conforms in every respect with the Neur IPS Code of Ethics.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: The work is theoretical and we do not see any major or immediate implications.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The work is theoretical.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: This work does not use any assets.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: This work does not use any assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA]

Justification: This work does not involve any crowdsourcing or research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: This work does not involve research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.