# testably_learning_polynomial_threshold_functions__7338edb3.pdf Testably Learning Polynomial Threshold Functions Lucas Slot Department of Computer Science ETH Zurich lucas.slot@inf.ethz.ch Stefan Tiegel Department of Computer Science ETH Zurich stefan.tiegel@inf.ethz.ch Manuel Wiedmer Department of Computer Science ETH Zurich manuel.wiedmer@inf.ethz.ch Rubinfeld & Vasilyan recently introduced the framework of testable learning as an extension of the classical agnostic model. It relaxes distributional assumptions which are difficult to verify by conditions that can be checked efficiently by a tester. The tester has to accept whenever the data truly satisfies the original assumptions, and the learner has to succeed whenever the tester accepts. We focus on the setting where the tester has to accept standard Gaussian data. There, it is known that basic concept classes such as halfspaces can be learned testably with the same time complexity as in the (distribution-specific) agnostic model. In this work, we ask whether there is a price to pay for testably learning more complex concept classes. In particular, we consider polynomial threshold functions (PTFs), which naturally generalize halfspaces. We show that PTFs of arbitrary constant degree can be testably learned up to excess error ε > 0 in time npoly(1/ε). This qualitatively matches the best known guarantees in the agnostic model. Our results build on a connection between testable learning and fooling. In particular, we show that distributions that approximately match at least poly(1/ε) moments of the standard Gaussian fool constant-degree PTFs (up to error ε). As a secondary result, we prove that a direct approach to show testable learning (without fooling), which was successfully used for halfspaces, cannot work for PTFs. 1 Introduction The PAC learning model of Valiant [30] has long served as a test-bed to study which learning tasks can be performed efficiently and which might be computationally difficult. One drawback of this model is that it is inherently noiseless. In order to capture noisy learning tasks, the following extension, called the agnostic model, has been introduced [19, 24]: Let F be a class of boolean functions and let Djoint be an (unknown) distribution over example-label-pairs in X { 1}. Typically, X = {0, 1}n or Rn. As input, we receive iid samples from Djoint. For a small ε > 0, our task is to output a classifier ˆf (not necessarily in F) whose loss L( ˆf, Djoint) := P(x,z) Djoint( ˆf(x) = z) is at most opt + ε, where opt := inff F L(f, Djoint). The parameter opt thus indicates how "noisy" the instance is. We say that an algorithm agnostically learns F up to error ε if it outputs such an ˆf. This model is appealing since it makes assumptions neither on the distribution of the input, nor on the type and amount of noise. After running an agnostic learning algorithm, we can therefore be certain that the output ˆf achieves error close to that of the best function in F even without knowing what distribution the data came from. 38th Conference on Neural Information Processing Systems (Neur IPS 2024). Efficient learning and distributional assumptions. We are interested in understanding when agnostic learning can be performed efficiently. Unfortunately, efficient learning is likely impossible without making assumptions on the distribution Djoint, even for very simple function classes F. For instance, consider the class FHS of halfspaces, i.e., boolean functions of the form f(x) = sign( v, x θ). Then, even if there exists a halfspace achieving arbitrarily small error, it is widely believed that outputting an ˆf that performs better than a random guess in the agnostic model takes at least super-polynomial time if no assumptions are made on Djoint [8, 29]. To find efficient algorithms, one therefore has to make such assumptions. Typically, these take the form of assuming that the marginal DX of Djoint over the examples X belongs to a specific family of distributions. Definition 1 (Agnostic learning with distributional assumptions). Let ε > 0. A learner A agnostically learns F with respect to D up to error ε if, for any distribution Djoint on X { 1} whose marginal DX on X is equal to D, given sufficient samples from Djoint, it outputs with high probability a function f : X { 1} satisfying L(f, Djoint) opt(F, Djoint) + ε. For example, under the assumption that DX is standard Gaussian, we can find ˆf such that L( ˆf, Djoint) opt(FHS, Djoint) + ε in time n O(1/ε2) [20, 11]. This runtime is likely bestpossible [12, 29, 13].1 Efficient learning is still possible under weaker assumptions on DX , e.g., log-concavity [20]. Regardless, we cannot know whether a learning algorithm achieves its claimed error without a guarantee that the input actually satisfies our distributional assumptions. Such guarantees are inherently difficult to obtain from a finite (small) sample. Furthermore, approaches like cross-validation (i.e., computing the empirical error of ˆf on a hold-out data set) fail in the noisy agnostic model, since we do not know the noise level opt. This represents a severe limitation of the agnostic learning model with distributional assumptions. 1.1 Testable learning. To address this limitation, Rubinfeld & Vasilyan [28] recently introduced the following model, which they call testable learning: First, they run a tester on the input data, which attempts to verify a computationally tractable relaxation of the distributional assumptions. If the tester accepts, they then run a (standard) agnostic learning algorithm. The tester is required to accept whenever the data truly satisfies the distributional assumptions, and whenever the tester accepts, the output of the algorithm must achieve error close to opt. More formally, they define: Definition 2 (Testable learning [28]). Let ε > 0. A tester-learner pair (T , A) testably learns F with respect to a distribution D on X up to error ε if, for any distribution Djoint on X { 1}, the following hold 1. (Soundness). If samples drawn from Djoint are accepted by the tester T with high probability, then the learner A must agnostically learn F w.r.t. Djoint up to error ε. 2. (Completeness). If the marginal of Djoint on X is equal to D, then the tester must accept samples drawn from Djoint with high probability. Soundness tells us that whenever a testable learning algorithm outputs a function ˆf, this function achieves low error (regardless of whether Djoint satisfies any distributional assumption). On the other hand, completeness tells us testable learners are no weaker than (distribution-specific) agnostic ones, in the sense that they achieve the same error whenever Djoint actually satisfies our assumptions (i.e., whenever this error can in fact be guaranteed for the agnostic learner). The testable model is thus substantially stronger than the agnostic model with distributional assumptions. Which function classes can be learned testably? A natural question is whether testable learning comes at an additional computational cost compared to (distribution-specific) agnostic learning. We focus on the setting where D is the standard Gaussian on X = Rn. Following [28, 15], we consider the following simple tester: Accept if and only if the empirical moments up to degree k of the input distribution (approximately) match those of D. This tester satisfies completeness as the 1In particular, achieving a runtime in poly(n, 1/ε) is likely not possible. We note that such runtimes can be achieved in the weaker model where one accepts a loss of O(opt)+ε (vs. opt+ε in the model we consider) [1]. empirical moments of a Gaussian concentrate well. Using this tester, Rubinfeld & Vasilyan [28] show that halfspaces can be testably learned in time n O(1/ε4). Their runtime guarantee was improved to n O(1/ε2) in [15], (nearly) matching the best known non-testable algorithm. This shows that there is no separation between the two models for halfspaces. On the other hand, a separation does exist for more complex function classes. Namely, for fixed accuracy ε > 0, testably learning the class of indicator functions of convex sets requires at least 2Ω(n) samples (and hence also time) [28], whereas agnostically learning them only takes subexponential time 2O( n), see [25]. The relation between agnostic and testable learning is thus non-trivial, depending strongly on the concept class considered. 1.2 Our contributions In this work, we continue to explore testable learning and its relation to the agnostic model. We consider the concept class of polynomial threshold functions (short PTFs). A degree-d PTF is a function of the form f(x) = sign(p(x)), where p is a polynomial of degree at most d. PTFs naturally generalize halfspaces, which correspond to the case d = 1. They form an expressive function class with applications throughout (theoretical) computer science, and have been studied in the context of circuit complexity [5, 26, 27, 3], and learning [18, 14]. Despite their expressiveness, PTFs can be agnostically learned in time n O(d2/ε4) [22], which is polynomial in n for any fixed degree d N and error ε > 0. They are thus significantly easier to learn in the agnostic model than convex sets. Our main result is that PTFs can be learned efficiently in the testable model as well. Theorem 3 (Informal version of Theorem 19). Fix d N. Then, for any ε > 0, the concept class of degree-d polynomial threshold functions can be testably learned up to error ε w.r.t. the standard Gaussian in time and sample complexity npoly(1/ε). Theorem 3 is the first result achieving efficient testable learning for PTFs of any fixed degree d (up to constant error ε > 0). Previously, such a result was not even available for learning degree-2 PTFs with respect to the Gaussian distribution. It also sheds new light on the relation between agnostic and testable learning: there is no qualitative computational gap between the two models for the concept class of PTFs, whose complexity lies between that of halfspaces and convex sets in the agnostic model. In addition to Theorem 3, we also show an impossibility result ruling out a certain natural approach to prove testable learning guarantees for PTFs. In particular, we show in Section 2.4 that an approach which has been successful for testably learning halfspaces in [28] provably cannot work for PTFs. Limitations. The dependence of the running time on the degree parameter d and the error ε is (much) worse than in the agnostic model (see Theorem 19). Moreover, we do not have access to lower bounds on the complexity of testably learning PTFs which might indicate whether these dependencies are inherent to the problem, or an artifact of our analysis. The only lower bounds available apply already in the agnostic model, and show that the time complexity of agnostically (and thus also testably) learning degree-d PTFs is at least nΩ(d2/ε2) in the SQ-model [12], and at least neΩ(d2 β/ε2 β) for any β > 0 under a cryptographic hardness assumption [29]. 1.3 Previous work The two works most closely related to this paper are [28, 15]. Both rely on the following highlevel strategy. A standard result [20] shows that one can agnostically learn a concept class F w.r.t. a distribution D in time n O(k) if all elements of F are well-approximated w.r.t. D by degree-k polynomials. That is, if for all f F, there exists a degree-k polynomial h such that EX D [|h(X) f(X)|] ε. This result can be extended to the testable setting, but now one needs a good low-degree L1-approximation w.r.t. any distribution D accepted by the tester. Using the moment-matching tester outlined above, one thus needs to exhibit low-degree approximations to all functions in F w.r.t. any distribution which approximately matches the first few moments of D. A direct approach. In [28], the authors use a direct approach to show that if F = FHS is the class of halfspaces and D = N(0, In), these approximators exist for k = O(1/ε4), leading to an overall running time of n O(1/ε4) for their testable learner. Their approach consists of two steps. First, they construct a low-degree approximation q sign of the sign function in one dimension using standard techniques. Then, for any halfspace f(x) = sign( v, x θ), they set h(x) = q( v, x θ). By exploiting concentration and anti-concentration properties of the push-forward under linear functions of distributions that match the moments of a Gaussian, they show that h is a good approximation of f. Unfortunately, this kind of approach cannot work for PTFs: We formally rule it out in Theorem 16. This is the aforementioned secondary contribution of our paper, which extends earlier impossibility results for (agnostic) learning of Bun & Steinke [6]. See Section 2.4 for details. An indirect approach using fooling. In order to prove our main theorem we thus need a different approach. Gollakota, Klivans & Kothari [15] establish a connection between testable learning and the notion of fooling, which has played an important role in the study of pseudorandomness [4, 2, 9]. Its connection to learning theory had previously been observed in [23]. We say a distribution D fools a concept class F up to error ε > 0 with respect to D if, for all f F, it holds that |EX D [f(X)] EX D [f(X)]| ε. Roughly speaking, the work [15] shows that, if any distribution D which approximately matches the moments of D up to degree k fools F with respect to D, then F can be testably learned in time n O(k) (see Theorem 9 below). We remark that (approximately) moment-matching distributions have not been considered much in the existing literature on fooling. Rather, it has focused on distributions D whose marginals on any subset of k variables are equal to those of D, which is a stronger condition a priori. While it coincides with moment-matching in special cases (e.g., when D is the uniform distribution over the hypercube), it does not when D = N(0, In). Nevertheless, the authors of [15] show that (approximate) moment matching up to degree k = O(1/ε2) fools halfspaces with respect to N(0, In), allowing them to obtain the aforementioned result for testably learning FHS. In fact, they show that this continues to hold when F consists of arbitrary boolean functions applied to a constant number of halfspaces. They also use existing results in the fooling literature to show that degree-2 PTFs can be testably learned under the uniform distribution over {0, 1}n (but these do not extend to learning over Rn w.r.t. a standard Gaussian). Other previous work on testable learning. In weaker error models than the agnostic model or under less stringent requirements on the error of the learner it is known how to construct tester-learner pairs with runtime poly(n, 1/ε) [16, 10]. These results have been extended to allow for the following stronger completeness condition: The tester has to accept, whenever Djoint is an isotropic strongly log-concave distribution [17]. 2 Technical overview 2.1 Preliminaries From here, we restrict to the setting X = Rn. We let D be a well-behaved distribution on Rn; usually D = N(0, In) is the standard Gaussian. For x Rn and a multi-index α Nn, we write xα := Qn i=1 xαi i . For k N, we write Nn k := {α Nn, Pn i=1 αi k}. We say a statement holds with high probability if it holds with probability 0.99. The notation Od (resp. Ωd, Θd) hides factors that only depend on d. We now define the moment-matching tester introduced above. Definition 4 (Moment matching). Let k N and η 0. We say a distribution D on Rn approximately moment-matches D up to degree k and with slack η if |EX D [Xα] EX D [Xα]| η α Nn k. Definition 5. Let k N and η 0. The approximate moment-matching tester TAMM = TAMM(k, η) for a distribution D accepts the samples (x(1), z(1)), . . . , (x(m), z(m)) Rn { 1} if, and only if, EX D [Xα] 1 x(i) α η α Nn k. That is, TAMM(k, η) accepts if and only if the moments of the empirical distribution belonging to the samples {x(i)} match the moments of D up to degree k and slack η. Note that TAMM(k, η) requires time at most O(m nk) to decide whether to accept a set of m samples. The tester TAMM does not take the labels of the samples into account. In general, for testers T which depend only on the marginal D of Djoint on Rn, we say that T accepts a distribution D on Rn if it accepts samples drawn from D with high probability (regardless of the labels). 2.2 Review of existing techniques for testable learning In this section, we review in more detail the existing techniques to establish guarantees for agnostic and testable learning discussed in Section 1.3. Our goals are twofold. First, we wish to highlight the technical difficulties that arise from proving error guarantees in the testable model versus the agnostic model. Second, we want to introduce the necessary prerequisites for our proof of Theorem 3 in Section 2.3, namely testable learning via fooling (see Theorem 9). Learning and polynomial approximation. A standard result [20] shows that one can agnostically learn any concept class that is well-approximated by low-degree polynomials in the following sense. Theorem 6 ([20]). Let k N and ε > 0. Suppose that, for any f F, there exists a polynomial h of degree k such that EX D [|h(X) f(X)|] ε. Then, F can be agnostically learned up to error ε in time and sample complexity n O(k)/poly(ε). The underlying algorithm in the theorem above is polynomial regression w.r.t. the absolute loss function. In the testable setting, a similar result holds. The key difference is that one now needs good approximation w.r.t. all distributions accepted by the proposed tester. Theorem 7. Let k N and ε > 0. Let T be a tester which accepts D and which requires time and sample complexity τ. Suppose that, for any f F, and for any D accepted by T , there exists a polynomial h of degree k such that EX D [|h(X) f(X)|] ε. Then, F can be testably learned up to error ε in time and sample complexity τ + n O(k)/poly(ε). The takeaway is that, in order to devise efficient algorithms for agnostic or testable learning, it suffices to study low-degree polynomial approximations of elements of F. Under the assumption that D is a (standard) Gaussian, one has access to powerful techniques from Fourier analysis to show existence of good polynomial approximators w.r.t. D for various concept classes. Using Theorem 6, this leads to efficient agnostic learning algorithms for a variety of concept classes w.r.t. N(0, In) [20, 25, 22]. Testable learning via direct approximation. In the testable setting, it is not sufficient to approximate with respect to D alone, and so one cannot rely directly on any of its special structure. In [28], the authors overcome this obstacle to get testable learning guarantees for halfspaces w.r.t. D = N(0, In) by appealing to more basic properties of the distributions D accepted by their tester. Their approach is roughly as follows. First, they use standard results from polynomial approximation theory to find a (univariate) polynomial q which approximates the sign-function well on the interval [ 1, 1]. For a halfspace f(x) = sign( v, x θ), they consider the approximator h(x) = q( v, x θ), which satisfies EX D [|h(X) f(X)|] = EY D v,θ [|q(Y ) sign(Y )|] . Here, D v,θ is the (shifted) projection of D onto the line span(v) Rn. That is, Y = v, X θ. Then, for carefully chosen k N and η > 0, they show that for any D accepted by TAMM(k, η), the distribution D v,θ satisfies certain concentration and anti-concentration properties, meaning essentially that D v,θ is distributed somewhat uniformly on [ 1, 1]. As q approximates the sign-function on [ 1, 1], they may conclude that EY D v,θ [|q(Y ) sign(Y )|] is small, and invoke Theorem 7. Testable learning via fooling. It is natural to attempt a generalization of the approach above to PTFs. Indeed, for f(x) = sign(p(x)), one could consider the approximator h(x) = q(p(x)). However, as we show below in Section 2.4, this approach cannot work when deg(p) 6. Instead, we will rely on a more indirect technique, proposed in [15]. It connects the well-studied notion of fooling to low-degree polynomial approximation, and to testable learning. Definition 8 (Fooling). Let ε > 0. We say a distribution D on Rn fools F w.r.t. D up to error ε, if, for all f F, we have |EY D [f(Y )] EX D [f(X)]| ε. The main result of [15] shows that fooling implies testable learning when using approximate momentmatching to test the distributional assumptions. It forms the basis of our proof of Theorem 3. Theorem 9 ([15, Theorem 4.5]). Let k, m N and ε, η > 0. Suppose that the following hold: 1. Any distribution D whose moments up to degree k match those of D with slack η fools F w.r.t. D up to error ε/2. 2. With high probability over m samples from D the empirical distribution matches moments of degree at most k with D up to slack η. Then, using the moment-matching tester T = TAMM(k, η), we can learn F testably with respect to D up to error ε in time and sample complexity m + n O(k). Remark 10. When D = N(0, In) is the standard Gaussian, then the second condition in Theorem 9 is satisfied for m = Θ (2kn)k η 2 , see also Fact 36. The primary technical argument in the proof of Theorem 9 in [15] is an equivalence between fooling and a type of low-degree polynomial approximation called sandwiching. Compared to Theorem 7, the advantage of sandwiching is that one needs to approximate f only w.r.t. D (rather than any distribution accepted by the tester). However, one needs to find not one, but two low degree approximators h1, h2 that satisfy h1 f h2 pointwise (i.e., sandwich f). We refer to [15] for details. Fooling PTFs. In light of Theorem 9 and Remark 10, in order to prove our main result Theorem 3, it suffices to show that distributions D which approximately match the moments of N(0, In) fool the concept class of PTFs. This is our primary technical contribution (see Proposition 12 below). It can be viewed as a generalization of the following result due to Kane [21]. Theorem 11 (Informal version of [21, Theorem 1]). Let D be a k-independent standard Gaussian, meaning the restriction of D to any subset of k variables has distribution N(0, Ik). Then, D fools degree-d PTFs w.r.t. N(0, In) up to error ε > 0 as long as k = k(d, ε) is large enough. Theorem 11 applies to a class of distributions that is (far) more restrictive than what we need. First, note that k-independent Gaussians match the moments of N(0, In) up to degree k exactly, whereas we must allow D whose moments match only approximately. Second, even if D would match the moments of a Gaussian exactly up to degree k, its k-dimensional marginals need not be Gaussian. In fact, we have no information on its moments of high degree even if they depend on at most k variables. These two distinctions cause substantial technical difficulties in our proof of Proposition 12 below. 2.3 Overview of the proof of Theorem 3: testably learning PTFs As we have seen, in order to prove Theorem 3, it suffices to show that approximately momentmatching distributions fool PTFs. We obtain the following. Proposition 12. Let ε > 0. Suppose that D approximately matches the moments of N(0, In) up to degree k and slack η, where k Ωd ε 4d 7d , and η n Ωd(k)k Ωd(k). Then, D fools the class of degree-d PTFs w.r.t. N(0, In) up to error ε/2. That is, for any f FPTF,d, we then have EY N(0,In) [f(Y )] EX D [f(X)] ε/2. (1) In the rest of this section, we outline how to obtain Proposition 12. Full details can be found in Appendix A. Structurally, our proof is similar to the proof of Theorem 11 in [21]: First, in Section 2.3.1, we show fooling for the subclass of PTFs defined by multilinear polynomials. Then, in Section 2.3.2, we extend this result to general PTFs by relating arbitrary polynomials to multilinear polynomials in a larger number of variables. Our primary contribution is thus to show that the construction of [21] (which considers k-independent Gaussians) remains valid for the larger class of distributions that approximately match the moments of a Gaussian. 2.3.1 Fooling multilinear PTFs Let f(x) = sign(p(x)), where p is a multilinear polynomial. Our goal is to establish (1) for f under the assumptions of Proposition 12. We will follow the proof of Kane [21] for Theorem 11, which proceeds as follows. Let D be a k-independent Gaussian. First, Kane constructs a degree-k polynomial approximation h of f, satisfying EY N(0,In) [h(Y )] EY N(0,In) [f(Y )] , and, (2) EX D [h(X)] EX D [f(X)] . (3) Since the moments of D are exactly equal to those of N(0, In) up to degree k, we have EY N(0,In) [h(Y )] = EX D [h(X)]. We may then conclude the fooling property for D (cf. (1)): EY N(0,In) [f(Y )] EY N(0,In) [h(Y )] = EX D [h(X)] EX D [f(X)] . As we see below, Kane relies on a structure theorem (see Lemma 21) for multilinear polynomials to construct his low-degree approximation h. We wish to extend this proof to our setting, where D merely matches the moments of the standard Gaussian up to degree k and slack η. As we will see, the construction of the polynomial h remains valid (although some care is required in bounding the approximation error). A more serious concern is that, for us, EY N(0,In) [h(Y )] = EX D [h(X)] in general. Our main technical contribution in this section is dealing with the additional error terms that arise from this fact. Constructing a low-degree approximation. We now give details on the construction of the lowdegree approximation h that we use in our proof, which is the same as in [21]. The starting point of the construction is a structure theorem for multilinear polynomials p (see Lemma 21 below). It tells us that f = sign(p) can be decomposed as f(x) = F(P(x)), where F is again a PTF, and P = (Pi) is a vector of multilinear polynomials of degree d, whose moments EY N(0,In) Pi(Y )ℓ are all at most Od( ℓ)ℓ. Note that these bounds are much stronger than what we would get from standard arguments (which would only yield EY N(0,In) Pi(Y )ℓ Od( ℓ)dℓ). As in [21], we approximate F by a smooth function F via mollification (see Appendix A.1.1). That is, F is the convolution F ρ of F with a carefully chosen smooth function ρ. Then, we set h(x) = T(P(x)), where T is the Taylor approximation of F of appropriate degree (see Appendix A.1.2). Intuitively, taking the Taylor expansion yields a good approximation as the (Gaussian) moments of the Pi are not too large, yielding (2), (3). Error analysis of the approximation. Our goal is now to establish (2), (3) in our setting. Note that, since (2) is only concerned with the Gaussian distribution, there is no difference with [21]. For (3), we have to generalize the proof in [21]. For this, we first bound the probability under D that (at least) one of the Pi is large, which we do using Markov s inequality. Then, we need to show a bound on the moments of Pi under D (recall that the structure theorem only gives a bound on the Gaussian moments). Using bounds on the coefficients of the Pi, we are able to do this under a mild condition on η (see Appendices A.1.2 and A.1.3). Controlling the additional error terms. To conclude the argument, we need to show that, for our low-degree approximation h, we have EY N(0,In) [h(Y )] EX D [h(X)]. Recall that in [21], these expectations were simply equal. The main issue lies in the fact that, in our setting, the moment matching is only approximate; equality would still hold if D matched the moments of N(0, In) up to degree k exactly. Under η-approximate moment matching, we could say that EY N(0,In) [h(Y )] EX D [h(X)] η h 1, (4) where h 1 is the 1-norm of the coefficients of h. However, there is no way to control this norm directly. Instead, we rely on the fact that h = T P and argue as follows. On the one hand, we show a bound on the coefficients in the Taylor approximation T of F. On the other hand, we show bounds on all terms of the form EY N(0,In) [P(Y )α] EX D [P(X)α] . Combining these bounds yields an estimate on the difference EY N(0,In) [h(Y )] EX D [h(X)] , which lets us conclude (1). Going into more detail, the LHS of (4) is equal to the inner product | t, u | between the vector t = (tα) of coefficients of T and the vector u = (uα), where uα = EY N(0,In) [P(Y )α] EX D [P(X)α]. This can be viewed as a change of basis x P(x). Then, (4) can bounded by u t 1, where u = maxα |uα|. The coefficients tα of T are related directly to the partial derivatives of F, which in turn depend on the function ρ used in the mollification. After careful inspection of this function, we can bound t 1 k Od(k) (see Lemma 27). Finally, for any |α| k, it holds that EY N(0,In) [P(Y )α] EX D [P(X)α] η sup i Pi 1 |α| η n 2 η n Od(k), see Fact 22. Putting things together, we get that EY N(0,In) [h(Y )] EX D [h(X)] u t 1 η k Od(k)n Od(k) ε/2, using the fact that η n Ωd(k)k Ωd(k) and k 1/ε for the last inequality. 2.3.2 Fooling arbitrary PTFs Now, let f(x) = sign(p(x)) be an arbitrary PTF. As before, we want to establish (1). Following [21], the idea is to reduce this problem to the multilinear case as follows. Let Y N(0, In) and let X be a random variable that matches the moments of Y up to degree k and with slack η. For N N to be chosen later, we construct new random variables ˆX and ˆY , and a multilinear PTF ˆf = sign(ˆp), all in n N variables, such that ˆY N(0, In N), ˆX matches moments of ˆY up to degree k with slack ˆη, and |EY [f(Y )] EX [f(X)]| E ˆY [ ˆf( ˆY )] E ˆ X [ ˆf( ˆX)] . (5) Assuming ˆη is not much bigger than η, and the approximation above is sufficiently good, we may then apply the result of Section 2.3.1 to ˆf to conclude (1) for f. Our construction of ˆX, ˆY and ˆf will be the same as in [21]. However, with respect to his proof, we face two difficulties. First, we need to control the slack parameter ˆη in terms η. More seriously, Kane s proof of (5) breaks in our setting: He relies on the fact that X is k-independent Gaussian in his setting to bound high degree moments of ˆX which depend on at most k variables. In our setting, we have no information on such moments at all (even if X matched the moments of N(0, In) up to degree k exactly). Construction of ˆX and ˆY . For i [n], let Z(i) be an N-dimensional Gaussian random variable with mean 0, variances 1 1/N and covariances 1/N, independent from X and all other Z(i ). We define ˆXij := Xi/ N + Z(i) j , and set ˆX = ( ˆXij). We define ˆY analogously. This ensures that ˆY N(0, In N). Furthermore, given that X matches the moments of Y with slack η, it turns out that ˆX matches the moments of ˆY with slack ˆη = (2k)k/2 η. This follows by direct computation after expanding the moments of ˆX in terms of those of X and of the Z(i), see Lemma 32. Construction of the multilinear PTF. We want to construct a multilinear polynomial ˆp in n N variables so that p(X) ˆp( ˆX). For ˆx Rn N, write φ(ˆx) := (PN j=1 ˆxij/ N)i [n] Rn. Since φ(Z(i)) = 0 holds deterministically, φ( ˆX) = X. So, if we were to set ˆp = p φ, it would satisfy ˆp( ˆX) = p(X). However, it would clearly not be multilinear. To fix this, we write p(φ(ˆx)) = P α λαˆxα and replace each non-multilinear term λαˆxα by a multilinear one as follows: If the largest entry of α is at least three, we remove the term completely. If the largest entry of α is two, we replace the term by λαˆxα , where α ij = 1 if αij = 1 and 0 otherwise. This is identical to the construction in [21]. Now, to show that p(X) ˆp( ˆX) we need to bound the effect of these modifications. It turns out that it suffices to control the following expressions in terms of N: !ℓ (i [n], 3 ℓ d). For the bi and ci,ℓ, we can do so using a slight modification of the arguments in [21]. For the ai, however, Kane [21] exploits the fact that in his setting, Xi is standard Gaussian for each fixed i [n], meaning the ˆXi,j are jointly standard Gaussian over j. This gives him access to strong concentration bounds. To get such concentration bounds in our setting, we would need information on the moments of the Xi up to degree roughly log n. However, we only have access to moments up to degree k, which is not allowed to depend on n (as our tester uses time nΩ(k)). Instead, we use significantly weaker concentration bounds based on moments of constant degree. By imposing stronger conditions on the bi, ci,ℓ, we are able to show that the remainder of the argument in [21] still goes through in our setting, see Appendix A.2.2. Finally, for N sufficiently large, this allows us to conclude (5) for ˆf = sign(ˆp). 2.4 Impossibility result: learning PTFs via the push-forward In this section, we show that the approach of [28] to prove testable learning guarantees for halfspaces w.r.t. the standard Gaussian cannot be generalized to PTFs. Namely, we show that in general, PTFs f(x) = sign(p(x)) with deg(p) 3 cannot be approximated up to arbitrary error w.r.t. N(0, In) by a polynomial of the form h(x) = q(p(x)), regardless of the degree of q.2 Importantly, we show 2Note that this even excludes proving an agnostic learning guarantee w.r.t. N(0, In) using this approach. that this is the case even if one makes certain typical structural assumptions on p which only change the PTF f = sign(p) on a set of negligible Gaussian volume; namely that p is square-free and that {p 0} Rn is compact. Our main technical contribution is an extension of a well-known inapproximability result due to Bun & Steinke (Theorem 15 below) to distributions with a single heavy tail (see Theorem 18). Approximating the sign-function on the real line Let p#N(0, In) be the push-forward of the standard Gaussian distribution by p, which is defined by PY p#N(0,In) [Y A] := PX N(0,In) X p 1(A) (A R). (6) Note that, if h(x) = q(p(x)), we then have EX N(0,In) [|h(X) f(X)|] = EY p#N(0,In) [|q(Y ) sign(Y )|] . Finding a good approximator h f of the form h = q p is thus equivalent to finding a (univariate) polynomial q which approximates the sign-function on R well under the push-forward distribution p#N(0, In). In light of this observation, we are interested in the following question: Let D be a distribution on the real line. Is it possible to find for each ε > 0 a polynomial q such that EY D [|q(Y ) sign(Y )|] ε? This question is well-understood for distributions D whose density is of the form wγ(x) := Cγ exp( |x|γ), γ > 0. Namely, when γ 1, these distributions are log-concave, and the question can be answered in the affirmative. On the other hand, when γ < 1, they are log-superlinear, and polynomial approximation of the sign function is not possible. Theorem 13 (see, e.g. [20]). Let D be a log-concave distribution on R. Then, for any ε > 0 there exists a polynomial q such that EY D [|q(Y ) sign(Y )|] ε. Definition 14. Let D be a distribution on R whose density function w satisfies w(x) C wγ(x) x R for some γ < 1 and C > 0. Then we say D is log-superlinear (LSL). Theorem 15 (Bun-Steinke [6]). Let D be an LSL-distribution on R. Then there exists an ε > 0 such that, for any polynomial q, we have EY D [|q(Y ) sign(Y )|] > ε. When p is of degree 1 (i.e., when f = sign(p) defines a halfspace), the push-forward distribution D = p#N(0, In) defined in (6) is itself a (shifted) Gaussian. In particular, it is log-concave and by Theorem 13 approximation of the sign-function w.r.t. D is possible. On the other hand, when p is of higher degree, D could be an LSL-distribution, meaning approximation of the sign-function w.r.t. D is not possible by Theorem 15. For instance, consider p(x) = x3. The density w of p#N(0, 1) is given by w(x) = C |x| 2/3 exp |x|2/3 , and so p#N(0, 1) is log-superlinear. Choice of description. The example p(x) = x3 is artificial: We have sign(p(x)) = sign(x), and so the issue is not with the concept f = sign(p), but rather with our choice of description p. In general, one can (and should) assume that p is square-free, meaning it is not of the form p = p2 1 p2. Indeed, note that for such a polynomial, we have sign(p(x)) = sign(p2(x)) almost everywhere. Square-freeness plays an important role in the analysis of learning algorithms for PTFs, see, e.g., [22, Appendix A]. It turns out that even if p is square-free, the distribution p#N(0, In) can still be log-superlinear, e.g., when p(x) = x(x 1)(x 2). Note that, for this example, sign(p) describes a non-compact subset of R. This is crucial to find that p#N(0, 1) is LSL. Indeed, if {p 0} R were compact, then pmax = supx R p(x) < , and so the density w of p#N(0, 1) would satisfy w(x) = 0 for all x > pmax. In particular, w would not be log-superlinear. One could therefore hope that assuming {p 0} is compact might fix our issues. This assumption is reasonable as {p 0} can be approximated arbitrarily well by a compact set (in terms of Gaussian volume). On the contrary, we show the following. Theorem 16. There exists a square-free polynomial p, so that {p 0} R is compact, but for which there exists ε > 0 so that, for any polynomial q, EX N(0,1) [|q(p(X)) sign(p(X))|] > ε. To establish Theorem 16, we prove a one-sided analog of Theorem 15 in Appendix B.1, which we believe to be of independent interest. It shows impossibility of approximating the sign-function under a class of distributions related to, but distinct from, those considered by Bun & Steinke [6]. The key difference is that the densities in our result need only to have a single heavy tail. However, this tail must be twice as heavy (γ < 1/2 vs. γ < 1). We emphasize that [6] does not cover compact PTFs. Definition 17. Let D be a distribution on R whose density function w satisfies w(x) C wγ(x) x ( , 1] for some γ < 1/2 and C > 0. Then we say D is one-sided log-superlinear. Theorem 18. Let D be a one-sided LSL-distribution on R. Then there exists an ε > 0 such that, for any polynomial q, we have EY D [|q(Y ) sign(Y )|] > ε. Proof of Theorem 16. It suffices to find a square-free polynomial p for which {p 0} is compact, and the push-forward distribution p#N(0, 1) is one-sided LSL. A direct computation shows that p(x) := x(x 1)(x 2)(x 3)(x 4)(x 5) meets the criteria, see Appendix B.2 for details. Acknowledgments We thank the anonymous reviewers for their valuable comments and suggestions. We thank Arsen Vasilyan for helpful discussions. This work is supported by funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (grant agreement No 815464). [1] Pranjal Awasthi, Maria Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise . In: J. ACM 63.6 (2017), Art. 50, 27. ISSN: 0004-5411,1557-735X. [2] Louay M. J. Bazzi. Polylogarithmic independence can fool DNF formulas . In: SIAM J. Comput. 38.6 (2009), pp. 2220 2272. ISSN: 0097-5397,1095-7111. [3] Richard Beigel. The polynomial method in circuit complexity . In: Proceedings of the Eighth Annual Structure in Complexity Theory Conference (San Diego, CA, 1993). IEEE Comput. Soc. Press, Los Alamitos, CA, 1993, pp. 82 95. ISBN: 0-8186-4070-7. [4] Mark Braverman. Polylogarithmic independence fools AC0 circuits . In: J. ACM 57.5 (2008). ISSN: 0004-5411. [5] Jehoshua Bruck and Roman Smolensky. Polynomial threshold functions, AC0 functions, and spectral norms . In: SIAM J. Comput. 21.1 (1992), pp. 33 42. ISSN: 0097-5397. [6] Mark Bun and Thomas Steinke. Weighted polynomial approximations: limits for learning and pseudorandomness . In: Approximation, randomization, and combinatorial optimization. Algorithms and techniques. Vol. 40. LIPIcs. Leibniz Int. Proc. Inform. Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern, 2015, pp. 625 644. ISBN: 978-3-939897-89-7. [7] Anthony Carbery and James Wright. Distributional and Lq norm inequalities for polynomials over convex bodies in Rn . In: Math. Res. Lett. 8.3 (2001), pp. 233 248. ISSN: 1073-2780. [8] Amit Daniely. Complexity theoretic limitations on learning halfspaces . In: STOC 16 Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing. ACM, New York, 2016, pp. 105 117. ISBN: 978-1-4503-4132-5. [9] Ilias Diakonikolas, Parikshit Gopalan, Ragesh Jaiswal, Rocco A. Servedio, and Emanuele Viola. Bounded independence fools halfspaces . In: SIAM J. Comput. 39.8 (2010), pp. 3441 3462. ISSN: 0097-5397,1095-7111. [10] Ilias Diakonikolas, Daniel M. Kane, Vasilis Kontonis, Sihan Liu, and Nikos Zarifis. Efficient testable learning of halfspaces with adversarial label noise . In: Advances in Neural Information Processing Systems. Vol. 36. Curran Associates, Inc., 2023, pp. 39470 39490. [11] Ilias Diakonikolas, Daniel M. Kane, and Jelani Nelson. Bounded independence fools degree-2 threshold functions . In: 2010 IEEE 51st Annual Symposium on Foundations of Computer Science FOCS 2010. IEEE Computer Soc., Los Alamitos, CA, 2010, pp. 11 20. ISBN: 978-0-7695-4244-7. [12] Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, and Nikos Zarifis. The Optimality of Polynomial Regression for Agnostic Learning under Gaussian Marginals in the SQ Model . In: Proceedings of Thirty Fourth Conference on Learning Theory. Vol. 134. PMLR, 2021, pp. 1552 1584. [13] Ilias Diakonikolas, Daniel M. Kane, and Lisheng Ren. Near-Optimal Cryptographic Hardness of Agnostically Learning Halfspaces and Re LU Regression under Gaussian Marginals . In: Proceedings of the 40th International Conference on Machine Learning. Vol. 202. PMLR, 2023, pp. 7922 7938. [14] Ilias Diakonikolas, Ryan O Donnell, Rocco A. Servedio, and Yi Wu. Hardness results for agnostically learning low-degree polynomial threshold functions . In: Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, Philadelphia, PA, 2011, pp. 1590 1606. [15] Aravind Gollakota, Adam R. Klivans, and Pravesh K. Kothari. A moment-matching approach to testable learning and a new characterization of Rademacher complexity . In: STOC 23 Proceedings of the 55th Annual ACM Symposium on Theory of Computing. ACM, New York, 2023, pp. 1657 1670. ISBN: 978-1-4503-9913-5. [16] Aravind Gollakota, Adam R. Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. An Efficient Tester-Learner for Halfspaces . In: The Twelfth International Conference on Learning Representations. 2024. [17] Aravind Gollakota, Adam R. Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testerlearners for halfspaces: Universal algorithms . In: Advances in Neural Information Processing Systems. Vol. 36. Curran Associates, Inc., 2023, pp. 10145 10169. [18] Prahladh Harsha, Adam R. Klivans, and Raghu Meka. Bounding the sensitivity of polynomial threshold functions . In: Theory Comput. 10 (2014), pp. 1 26. ISSN: 1557-2862. [19] David Haussler. Decision-theoretic generalizations of the PAC model for neural net and other learning applications . In: Inform. and Comput. 100.1 (1992), pp. 78 150. ISSN: 08905401,1090-2651. [20] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces . In: SIAM J. Comput. 37.6 (2008), pp. 1777 1805. ISSN: 0097-5397,1095-7111. [21] Daniel M. Kane. k-independent Gaussians fool polynomial threshold functions . In: 26th Annual IEEE Conference on Computational Complexity. IEEE Computer Soc., Los Alamitos, CA, 2011, pp. 252 261. ISBN: 978-0-7695-4411-3. [22] Daniel M. Kane. The Gaussian surface area and noise sensitivity of degree-D polynomial threshold functions . In: Comput. Complexity 20.2 (2011), pp. 389 412. ISSN: 1016-3328,14208954. [23] Daniel M. Kane, Adam R. Klivans, and Raghu Meka. Learning Halfspaces Under Log Concave Densities: Polynomial Approximations and Moment Matching . In: Proceedings of the 26th Annual Conference on Learning Theory. Vol. 30. PMLR, 2013, pp. 522 545. [24] Michael J. Kearns, Robert E. Schapire, and Linda M. Sellie. Toward efficient agnostic learning . In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. Association for Computing Machinery, 1992, pp. 341 352. ISBN: 089791497X. [25] Adam R. Klivans, Ryan O Donnell, and Rocco A. Servedio. Learning Geometric Concepts via Gaussian Surface Area . In: 2008 49th Annual IEEE Symposium on Foundations of Computer Science. 2008, pp. 541 550. [26] Matthias Krause and Pavel Pudlák. Computing Boolean functions by polynomials and threshold circuits . In: Comput. Complexity 7.4 (1998), pp. 346 370. ISSN: 1016-3328,1420-8954. [27] Ryan O Donnell and Rocco A. Servedio. Extremal properties of polynomial threshold functions . In: J. Comput. System Sci. 74.3 (2008), pp. 298 312. ISSN: 0022-0000,1090-2724. [28] Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms . In: STOC 23 Proceedings of the 55th Annual ACM Symposium on Theory of Computing. ACM, New York, 2023, pp. 1643 1656. ISBN: 978-1-4503-9913-5. [29] Stefan Tiegel. Hardness of Agnostically Learning Halfspaces from Worst-Case Lattice Problems . In: Proceedings of Thirty Sixth Conference on Learning Theory. Vol. 195. PMLR, 2023, pp. 3029 3064. [30] Leslie G. Valiant. A theory of the learnable . In: Commun. ACM 27 (1984), pp. 1134 1142. ISSN: 0001-0782. A Testable learning of polynomial threshold functions In this section, we give a formal proof of our main result, Theorem 3, that we restate here. Theorem 19 (Formal version of Theorem 3). Let d N. For any ε > 0, the concept class of degree-d polynomial threshold functions in n variables can be testably learned up to error ε w.r.t. the standard Gaussian N(0, In) in time and sample complexity n Od ε 4d 7d ε Od ε 4d 7d , In particular, if d is constant, the time and sample complexity is npoly(1/ε). Our goal to show this result is to apply Theorem 9. We first focus on proving the fooling condition in this theorem. Recall that for this we need to show that there are k and η such that if D matches the moments of N(0, In) up to degree k and slack η, then we have EX D [f(X)] EY N(0,In) [f(Y )] ε/2. In order to show this, we follow the result of [21] (see Theorem 11). This paper shows this condition for any distribution D that is a k-independent Gaussian, i.e. for which the marginal of every subset of k coordinates is N(0, Ik). The reason why it is enough to only focus on satisfying the fooling condition is that for any η, we can find m large enough that the first condition of Theorem 9 is satisfied for N(0, In) (see Remark 10 and Fact 36). For k we choose the same value as in [21], namely k = Θd ε 4d 7d . (7) We first show how to choose η for the case of multilinear polynomial polynomials in Appendix A.1. In Appendix A.2, we generalize the fooling result to arbitrary PTFs. Finally, in Appendix A.3, we show how to apply Theorem 9 to get testable learning for PTFs. A.1 Fooling for multlinear PTFs Thus, let f C and let p be the multilinear polynomial (of degree d) such that f(x) = sign(p(x)). Note that this notation is different than the one used in [21]. There, the roles of f and F are interchanged. We use f for the PTF throughout to be consistent with the previous work on (testable) learning. Without loss of generality, we assume that the sum of the squares of the coefficients of p is 1. We can make this assumption since rescaling does not change the PTF. The main result of this section is the following proposition. Proposition 20 (Fooling for multilinear PTFs). Let ε > 0. Suppose that D approximately matches the moments of N(0, In) up to degree k and slack η, where k Ωd ε 4d 7d , and η n Ωd(k)k Ωd(k). Then, we have that, for any multilinear f FPTF,d, EX D [f(X)] EY N(0,In) [f(Y )] Od(ε). Note that we only show Od(ε) here instead of ε/2, which is needed to apply Theorem 9. This simplifies the notation in our proof of this proposition. We later apply this proposition to ε = ε/Ωd(1) to conclude the fooling result we need. As mentioned earlier, the general strategy to do this is based on [21]. We want to find a function f : Rn R that approximates f. We will define this function in Appendix A.1.1. In Appendix A.1.2, we show that the expectation of f is close under D and N(0, In). More precisely, in Lemma 25, we show that under the above assumption on η, we have that EY N(0,In) [ f(Y )] EX D [ f(X)] O(ε). In Appendix A.1.3, we then show that under both D and N(0, In) the expectation of f and f are close and complete the proof of Proposition 20. More precisely, from [21, Proposition 14] (restated in Lemma 28), we get that EY N(0,In) [f(Y )] EY N(0,In) [ f(Y )] O(ε). In Appendix A.1.3, we then also show that this also holds for X D instead of Y N(0, In). More precisely, we show in Lemma 29 that Lemmas 25 and 28 contain already enough information about the moment-matching distribution D to conclude Proposition 20. A.1.1 Set up and definition of the function f In this section, we want to define the function f that should be thought of as a smooth approximation of the PTF f. In order to define this function, we first restate the following structural theorem from [21]. Lemma 21 ([21, Proposition 4]). Let m1 m2 md be integers. Then there exists integers n1, n2, . . . , nd, where ni Od(m1m2 . . . mi 1) and (non-constant homogeneous multilinear) polynomials h1, . . . , hd, Pi,j (1 i d, 1 j ni) such that: 1. The sum of the squares of the coefficients of Pi,j is 1. 2. If Y N(0, In) and ℓ mi, then E |Pi,j(Y )|ℓ Od( i=1 hi(Pi,1(Y ), Pi,2(Y ), . . . , Pi,ni(Y )). The values of mi we want to choose are as in [21], i.e. we let mi = Θd ε 3 7id . Given this structure theorem, we introduce now the following notation that we use throughout the remainder of this section, analogous to [21]. As before, we use p : Rn R to be denote the multilinear polynomial and f : Rn [ 1, 1] to be the PTF we are interested in, i.e. f(x) = sign(p(x)). Furthermore, the Pi,j : Rn R for i [n] and j [ni] are the polynomials in the structure theorem. We denote by Pi : Rn Rni for i [n] the vector (Pi,1, . . . , Pi,ni) and by P : Rn Rn1 Rnd the vector (P1, . . . , Pd). Finally, we define F : Rn1 Rnd R as the function F(y1, . . . , yd) = Pd i=1 hi(yi), i.e. we get f(x) = F(P(x)). Similar to Condition 1 in the above lemma, we can also get a bound on the sum of the absolute values of the coefficients of the Pi,j. We need this later in the proof of Lemma 25. Fact 22. For any i [n] and j [ni], the sum of the absolute values of the coefficients of Pi,j is at most nd/2. The idea to prove this is to bound the number of coefficients we have and use an inequality between the 1and 2-norm. The detailed argument can be found in Appendix C.1. Furthermore, we have an analogous result to Item 2 for the moment-matching distributions, which is again proved in Appendix C.1. We need this result later for concluding that f is a good approximation under the moment matching distribution D . Fact 23. For any i [d], j [ni] and ℓ k/d, we have that EX D Pi,j(X)ℓ EY N(0,In) Pi,j(Y )ℓ + ηndℓ/2. As in [21], we now consider the function ρC : Rn R defined in the following lemma. Lemma 24 ([21, Lemma 5]). Let B(ξ) = 1 ξ 2 2 if ξ 2 1 0 else and ρ2(x) = | ˆB(x)|2 where where ˆB is the Fourier transform of B. Then, the function ρC defined by satisfies the following conditions Rn ρ(x) dx = 1, 3. for any unit vector v and any non-negative integer ℓwe have Z Rn |Dℓ vρ(C)(x)| dx Cℓ, 4. for D > 0, R x 2 D |ρC(x)| dx = O n We now define the following three functions ρ, F and in particular f, in the same way as in [21]. We let ρ : Rn1 Rnd R be defined as ρ(y1, . . . , yd) = ρC1(y1) ρCd(yd), where the Ci are the same as in [21], i.e. we let Ci = Θd ε 7id . Using this function, we define an approximation F : Rn1 Rnd R to F as the convolution F = F ρ and an approximation f : Rn R to f as f(x) = F(P(x)). The general strategy to prove Proposition 20 is show the following three steps, where as usual Y N(0, In) and X D , EY [f(Y )] Lem. 28 EY [ f(Y )] Lem. 25 EX [ f(X)] Lem. 29 EX [f(X)] . (8) A.1.2 Expectations of f under the Gaussian and the moment-matching distribution are close In this section, we want to prove the middle approximation of (8). More precisely, we show the following lemma. Lemma 25. Let ε > 0. Suppose that D approximately matches the moments of N(0, In) up to degree k and slack η, where k Ωd ε 4d 7d , and η n Ωd(k)k Ωd(k). Then we have that EY N(0,In) [ f(Y )] EX D [ f(X)] O(ε). We want to do this similar to [21, Section 6]. For this, let T be the Taylor approximation of F around 0. We use degree mi for the ith batch of coordinates (recall that F : Rn1 Rnd R). The strategy to show Lemma 25 is to proceed in the following three steps, where again Y N(0, In) and X D , EY [ f(Y )] [21] EY [T(P(Y ))] Pf. of Lem. 25 EX [T(P(X))] Lem. 26 EX [ f(X)] . (9) The first approximation above only involves the Gaussian Y , so we get directly from [21, Proof of Proposition 8] that EY N(0,In) [| f(Y ) T(P(Y ))|] O(ε). We now want to extend this also to moment-matching distribution, which we do in the following lemma, i.e. show the third approximation in (9). Lemma 26. Let ε > 0. Suppose that D approximately matches the moments of N(0, In) up to degree k and slack η, where k Ωd ε 4d 7d and η 1 kd. Then, we have that EX D [| f(X) T(P(X))|] O(ε) This proof follows closely [21, Proof of Proposition 8], which is why we defer the proof to Appendix C.2. Note that the condition on η in this lemma is also satisfied for the η from Lemma 25. Thus, it remains to prove EY N(0,In) [T(P(Y ))] EX D [T(P(X))] O(ε) to complete the proof of Lemma 25. Note that in contrast to [21], this quantity is not 0 in our case since we only have approximate moment matching. This is the main technical difficulty in our proof for the multilinear case. We need to give a different argument here and argue that this is small even if X is only approximately moment matching a Gaussian and not a k-independent Gaussian. We can write the Taylor expansion as follows α=(α1,...,αd): |αi| mi 1 α! α F(0)xα, where the αi are multi-indices in Nni. We want to prove the following lemma about the coefficients in the Taylor expansion. Lemma 27. For any multi-index α = (α1, . . . , αd) Nn1 Nnd, we have i=1 C|αi| i . The proof of this lemma is in Appendix C.2. The ideas of this proof are based on [21, Proof of Lemmas 5 and 7]. We can now prove Lemma 25. Proof of Lemma 25. As argued above, we already know that EY N(0,In) [| f(Y ) T(P(Y ))|] O(ε) and EX D [| f(X) T(P(X))|] O(ε). It thus remains to argue that EY N(0,In) [T(P(Y ))] EX D [T(P(X))] O(ε). Define the vectors t = (tα) and u = (uα) by α! α F(0) and uα := EY N(0,In) [P(Y )α] EX D [P(X)α] . We then have that EY N(0,In) [T(P(Y ))] EX D [T(P(X))] = | t, u | t 1 u . We now want to bound t 1 and u separately. For the bound on t 1, note that, by Lemma 27, we have i=1 C|αi| i . Plugging in the definition of Ci = Θd ε 7id Od(k) and using |α| = Pd i=1 |αi| k, we get that |tα| k Od(k). Since we have that ni k, we can conclude that the number of multi-indices α with |αi| k is at most (dk)k k Od(k) and thus, we have that α |tα| k Od(k). We now move on to bound u . We write Pi,j as follows Pi,j(x) = X β a(β) i,j xβ. Here, the β in the sum goes over all multi-indices with |β| being the degree of Pi,j, which is at most d. By Fact 22, we have that X β |a(β) i,j | nd/2. Let |α| = ℓbe such that |αi| mi (i.e. the term appears in the Taylor expansion) and let (i1, j1), . . . , (iℓ, jℓ) be such that P(x)α = Pi1,j1(x) . . . Piℓ,jℓ(x). Note that if some (αi)j > 1, we include the corresponding factor Pi,j multiple times. We can now expand P(x)α as follows β1 a(β1) i1,j1xβ1 βℓ a(βℓ) iℓ,jℓxβℓ βℓ a(β1) i1,j1 . . . a(βℓ) iℓ,jℓxβ1+ +βℓ. Now, note that k d(m1 + + md) (this condition is in addition to the conditions on k in [21] but it does not change the asymptotic value of k as stated in (7)). We get that the degree of the terms appearing in the sum is |β1| + + |βℓ| d|α| d(m1 + . . . md) k and thus that |EY N(0,In) [P(Y )α] EX D [P(X)α] | X βℓ |a(β1) i1,j1| . . . |a(βℓ) iℓ,jℓ|η. Here, we used the triangle inequality and the fact that |EY N(0,In) Y β1+ +βℓ EX D Xβ1+ +βℓ | η. Thus, we can compute |EY N(0,In) [P(Y )α] EX D [P(X)α] | X βℓ |a(β1) i1,j1| . . . |a(βℓ) iℓ,jℓ|η β1 |a(β1) i1,j1| βℓ |a(βℓ) iℓ,jℓ| = ai1,j1 1 . . . aiℓ,jℓ 1η = n|α|d/2η. Thus, we get u n Od(k)η. Finally, we can conclude that EY N(0,In) [T(P(Y ))] EX D [T(P(X))] t 1 u n Od(k)k Od(k)η O(ε) since we have the conditions η n Ωd(k)k Ωd(k) and k 1 O(ε). A.1.3 The functions f and f are close in expectation In this section, we want to complete the proof of Proposition 20 that shows EX D [f(X)] EY N(0,In) [f(Y )] Od(ε). So far, we already showed in Lemma 25 that EY N(0,In) h f(Y ) i EX D h f(X) i O(ε). Thus, it remains to show that under both X D and Y N(0, In), the expectation of f and f differ by at most Od(ε). For Y , we directly get the following. Note that the approximation f depends on ε via the numbers mi in the structure theorem Lemma 21 and the Ci in the definition of ρ. Lemma 28 ([21, Proposition 14]). Let ε > 0. Then, we have that EY N(0,In) [f(Y )] EY N(0,In) h f(Y ) i O(ε). The reason why we get this directly from [21] is that this lemma only concerns the Gaussian and not the moment matching distribution. Combining this with the above, we have now shown that EY N(0,In) [f(Y )] EX D h f(X) i O(ε). To conclude Proposition 20 we want to use the following lemma. It is analogous to [21, Proof of Proposition 2] and we use the same definition of Bi, i.e. we define Bi = Θd( p log(1/ε)). We prove this lemmas in Appendix C.2. Lemma 29 (analogous to [21, Proof of Proposition 2]). Let ε > 0. Suppose that D is a distribution such that the following holds EY N(0,In) [f(Y )] EX D h f(X) i O(ε), and PX D [ i, j : |Pi,j(X)| > Bi] O(ε). Then, we have that |EY N(0,In) [f(Y )] EX D [f(X)] | Od(ε). We are now ready to prove Proposition 20. Proof of Proposition 20. Using Lemmas 25, 28 and 29, it remains to argue that PX D [ i, j : |Pi,j(X)| > Bi] O(ε). This is true by Markov s inequality and looking at the log(dni/ε) = ℓ-th moment since this implies that (assuming without loss of generality that ℓis even) PX D [|Pi,j(Y )| Bi] EX D Pi,j(X)ℓ EY N(0,In) Pi,j(Y )ℓ + ηndℓ/2 EY N(0,In) Pi,j(Y )ℓ + 1 Bℓ i In the second step we used Fact 23. Note that ℓ= log(dni/ε) k/d is ensured by the choice of k as in [21]. In the third step we used that since dℓ/2 k and thus the condition on η in the statement of this proposition ensures η 1 ndℓ/2 . In the fourth step we used Lemma 21. In the fifth step we used the condition Bi Ωd( ℓ). This is true by the choice of Bi = Θd( p log(1/ε)) and the fact that log(ni) Pi 1 j=1 log(mj) Od(log(1/ε)). In the last step, we then used the definition of ℓ. Taking a union bound over j and then over i gives PX D [ i, j : |Pi,j(X)| > Bi] O(ε), which completes the proof. A.2 Fooling for arbitrary PTFs In this section, we want to prove a result similar to Proposition 20 for arbitrary PTFs and not just multilinear ones. Namely, we show Proposition 12, which we restate with a slight modification below. Namely, we only prove EY N(0,In) [f(Y )] EX D [f(X)] Od(ε) instead of ε/2. The reason for this is, as for Proposition 20, this simplifies the proof and we take care of this difference when we apply Theorem 9 in Appendix A.3. Proposition 30 (Restatement of Proposition 12). Let ε > 0. Suppose that D approximately matches the moments of N(0, In) up to degree k and slack η, where k Ωd ε 4d 7d , and η n Ωd(k)k Ωd(k). Then, we have that for any f FPTF,d EY N(0,In) [f(Y )] EX D [f(X)] Od(ε). The general strategy for this will be to, given a polynomial p, find another polynomial pδ and reduce to Proposition 20. This strategy and the construction described in what follows are the same as used in [21, Lemma 15]. The following lemma is an analog of this lemma for our case. However, there is one key part of the proof of [21] that breaks in our setting, as explained in Section 2.3.2. Specifically, in [21] all restrictions to coordinates are exactly Gaussian, and in particular we have access to moments of all orders. The proof in [21] exploits this since it considers a number of moments depending on the dimension, whereas we only have access to a constant number of moments. Lemma 31. Let δ > 0. Suppose X D approximately matches the moments of N(0, In) up to degree k and slack η, where η δOd(1)n Ωd(1)k k and k Ωd(1). Let Y N(0, In). Then there are a polynomial pδ and random variables ˆX and ˆY , all in more variables, such that ˆY is a Gaussian with mean 0 and covariance identity, ˆX is approximately moment-matching ˆY up to degree k and slack ˆη = η(2k)k/2, PY, ˆY h |p(Y ) pδ( ˆY )| > δ i < δ, and, PX, ˆ X h |p(X) pδ( ˆX)| > δ i < δ. Given this lemma, we can prove Proposition 30. Since this proof follows closely [21, Proof of Theorem 1], we defer it to Appendix C.2 It remains to prove Lemma 31 and to explain how we construct pδ as well as ˆX and ˆY . We do the latter in Appendix A.2.1. Namely, we show there how to construct the random variables ˆX and ˆY from X and Y . In this section, we also make precise in how many variables the polynomial pδ is (and thus also the random variable ˆX and ˆY ). The proof that they satisfy the condition required by Lemma 31 can be found in Appendix C.2. In Appendix A.2.2 we then state a lemma about how we want to replace a factor Xℓ i in p by a multilinear polynomial in ˆX (whose proof is in Appendix C.2) and use it construct pδ and prove Lemma 31. A.2.1 Construction of the random variables ˆX and ˆY To show Lemma 31, we want, given a polynomial p and δ > 0 as well as two random variables X and Y , where Y N(0, In) and X matches the moments of N(0, In) up to degree k and slack η, to construct a multilinear polynomial pδ in more variables and two random variables ˆX and ˆY such that ˆY is a again Gaussian with mean 0 and covariance identity and ˆX matches the moments of ˆY up to degree k and slack ˆη. The guarantee we then want to show in Lemma 31 is that PY, ˆY h |p(Y ) pδ( ˆY )| > δ i < δ and PX, ˆ X h |p(X) pδ( ˆX)| > δ i < δ, where the probability is over the joint distribution of Y and ˆY respectively X and ˆX. Let N be a (large) positive integer that will be chosen later. Then the number of variables of the new polynomial pδ is n N, i.e. we replace every variable of p by N variables for pδ. We make the following definition. For i {1, . . . , n} and j {1, . . . , N}, we define N Xi + Z(i) j , where Z(i) is are multivariate Gaussians with mean 0, variance 1 1 N and covariance 1 N , independent for different i and independent from X. In particular, the choice of the covariance matrix ensures that we deterministically have Zi,1 + . . . Zi,N = 0 and thus Xi = PN j=1 ˆXi,j. The construction for ˆY is the same, i.e. ˆYi,j := 1 N Yi + Z (i) j , where Z (i) are again multivariate Gaussians with mean 0, variance 1 1 N and covariance 1 N , independent for different i and also independent from Y (as well as X and Z). We now have the following two lemmas that relate ˆX to X and ˆY to Y . The proofs of these lemmas are in Appendix C.2. Lemma 32. Suppose X approximately matches the moments of N(0, In) up to degree k and slack η. Then, ˆX approximately matches the moments of N(0, In N) up to degree k and slack ˆη = (2k)k/2η. Lemma 33. If Y N(0, In), then ˆY N(0, In N). Lemma 33 is already proven in [21, Lemma 15], but for completeness we also make it explicit in Appendix C.2. A.2.2 Proof of Lemma 31 After constructing the random variables ˆX and ˆY , we now move on to construct the polynomial pδ. As in [21, Proof of Lemma 15], the goal is to replace every factor xℓ i of p by a multilinear polynomial in variables (ˆxi,j)j such that Xℓ i is close to this polynomial evaluated in ( ˆXi,j)j with large probability. Doing this for all factors xℓ i appearing in p and combining the new multilinear terms, this then gives a multilinear polynomial pδ of degree d. Note that the polynomial is in fact multilinear since for replacing xi we only use the variables ˆxi ,j where i = i. Note that it is enough to show P h |p(X) pδ( ˆX)| > δ i < δ. The reason for this is that, if Y N(0, In), then Y in particular matches the moments of N(0, In) (exactly) and the proof for X applies and we can conclude Lemma 31, using also Lemmas 32 and 33. In order to get the above, we let δ be a small positive number (depending on δ, n, d) to be chosen later. We need the following lemma. Lemma 34. Let δ > 0. Let i [n] and ℓ [d]. Assume that each of the following conditions holds with probability 1 δ !a δ d+1 3 a d. (12) Then, there is a multilinear polynomial in ˆXi,j that is within Od(δ ) of Xℓ i with probability 1 δ . The proof of this lemma is analogous to [21, Proof of Lemma 15] and is deferred to Appendix C.2. However, in contrast to [21], there is a key difference in this lemma. The bound on the RHS of (10) is much weaker than the one used by Kane. The reason for that is that the bounds used there do not hold in our case, since we only have that X (approximately) matches the moments of N(0, In). By using stronger bounds for (11) and (12), we are able to generalize the proof from Kane to our setting. For a more detailed explanation why we need to change the bounds, we refer to Section 2.3.2. We now define Why we make this choice will become clear in the proof of Lemma 31 below. In Appendix C.2, we prove the following lemma that states that this choice of δ and the condition on η from Lemma 31 ensure that we can apply Lemma 34. Lemma 35. Let δ as in (13). Assuming η δOd(1)n Ωd(1)k k, there is a choice of N (independent of i or ℓ) such that (10), (11) and (12) hold, each with probability 1 δ Proof of Lemma 31. Lemma 34 (together with Lemma 35) shows that we can replace Xℓ i by a multilinear polynomial in ˆXi,j that is within Od(δ ) of Xℓ i with probability 1 δ for our choice of δ as in (13). Since δ δ 2dn, we can union bound these events over all i [n] and ℓ [d] and get that with probability 1 δ 2, we have that for any i [n] and ℓ [d], we have that the replacement polynomial for Xℓ i is within Od(δ ) of Xℓ i . Furthermore, for any i [n], we have that with probability 1 δ 2n that |Xi| 3 n δ . This is true since we match k 2 moments (and η 1 4), and thus P h |Xi| 3n δ 2 (2 + η)δ2 Hence, we can again apply the union bound to show that with probability 1 δ 2, we have for any i [n] that |Xi| 3 n Thus, with probability 1 δ, all the above events holds. Conditioned on that, we have that the replacement polynomial pδ is off by at most 3 n δ d Od (δ ) multiplied by the sum of the coefficients of p. The later is at most nd/2 by Fact 22 applied to p instead of Pi,j. Hence, the replacement polynomial is off by at most Od n3d/2 δd δ . Thus, since δ δ Ωd n3d/2 δd , we get that with probability 1 δ, pδ is off by at most δ, which is what we wanted to show. A.3 Proof of testable learning of PTFs In this section, we prove Theorem 19. As already mentioned in the beginning of this section, we want to apply Theorem 9 for this. In Proposition 30, we have shown that if we have moment matching up to error η n Ωd(k)k Ωd(k), then we have the fooling condition of Theorem 9. Note that the fooling condition requires |EY N(0,In) [f(Y )] EX D [f(X)] | ε 2 but Proposition 30 only gives Od(ε). Thus, technically, we apply Proposition 30 for ε = ε Ωd(1). However, this does not change the asymptotic condition on η described above. In summary, if η satisfies the condition as described above, we get indeed the fooling condition as needed for Theorem 9. The remaining part to prove Theorem 19 is to find an m such that with high probability over m samples from N(0, In) we have that the empirical distribution matches the moments up to degree k with error at most η. Then, we get testable learning of PTFs with respect to Gaussian in time and sample complexity m + n O(k) by Theorem 9. To get m, we use the following fact, which we prove in Appendix C.1. Using this, we can then prove Theorem 19. Fact 36. Given m Ω((2kn)kη 2) samples of N(0, In), we have that with high probability the empirical distribution matches the moments of N(0, In) up to degree k and slack η. Proof of Theorem 19. Using Theorem 9, as noted before, by Proposition 30, we get testable learning of degree d PTFs with respect to Gaussian in time and sample complexity bounded by m + n O(k), where m is such that with high probability over m samples from N(0, In) we have that the empirical distribution matches the moments up to degree k with error at most η. It remains to determine m. By Fact 36, we get that the choice of m = Θ((2kn)kη 2) is enough. Now, in order to apply Proposition 30, we need to choose η = n Θd(k)k Θd(k). Plugging in the value k = Θd ε 4d 7d we get η = ε Θd ε 4d 7d n Θd ε 4d 7d and hence m = Θ (2kn)knΘd(ε 4d 7d)ε Θd(ε 4d 7d) . Thus, the time and sample complexity for testably learning PTFs is O (2kn)kn Od ε 4d 7d ε Ωd ε 4d 7d n O(k) . Again, by plugging in the value k = Θd ε 4d 7d , we can simplify this to get that the sample and time complexity for testably learning PTFs is n Od ε 4d 7d ε Ωd ε 4d 7d , which completes the proof of this theorem. B Impossibility of approximating PTFs via the push-forward In this section, we provide further details on the results claimed in Section 2.4. Specifically, we prove our impossibility result Theorem 16 below in Appendix B.2. For this, we first need to establish our one-sided analog Theorem 18 of the inapproximability result for LSL-distributions of Bun & Steinke (Theorem 15), which we do in Appendix B.1. B.1 Proof of Theorem 18 Let us begin by restating some important definitions and results from Section 2.4 for convenience. For γ > 0, we write wγ(x) := Cγ exp( |x|γ), where Cγ is a normalizing constant which ensures wγ is a probability density. A distribution D on R is called log-superlinear (LSL) if its density function3 satisfies w(x) C wγ(x) for all x R, for some γ (0, 1) and C > 0. Recall the following. Theorem (Restatement of Theorem 15). Let D be an LSL-distribution on R. Then there exists an ε > 0 such that, for any polynomial q, we have EY D [|q(Y ) sign(Y )|] > ε. We defined in Section 2.4 a one-sided analog of LSL-distributions as follows. Definition (Restatement of Definition 17). Let D be a distribution on R whose density function w satisfies w(x) C wγ(x) x ( , 1] for some γ < 1/2 and C > 0. Then we say D is one-sided log-superlinear. In this section, we prove Theorem 18, which is an analog of Theorem 15 for one-sided LSLdistributions, and the basis of our proof of Theorem 16 in Appendix B.2. Theorem (Restatement of Theorem 18). Let D be a one-sided LSL-distribution on R. Then there exists an ε > 0 such that, for any polynomial q, we have EY D [|q(Y ) sign(Y )|] > ε. For our proof, it is useful to first recall the main ingredient of the proof of Theorem 15 in [6], which is the following inequality. Proposition 37 ([6, Lemma 20]). Let q be a univariate polynomial, and let γ (0, 1). Then, there exists a constant Mγ > 0, depending only on γ, so that sup x R |q (x)wγ(x)| Mγ Z R |q(x)|wγ(x) dx. Given Proposition 37, the intuition for the proof of Theorem 15 is that, if q is a good approximation of the sign-function on R, it must have very large derivative near the origin. On the other hand, EY D [|q(Y )|] is bounded from above (as EY D [| sign(Y )|] = 1), leading to a contradiction. The key technical tool in our proof of Theorem 18 is a version of Proposition 37 that applies to onesided LSL distributions. That is, a bound on the derivative of a polynomial in terms of its L1-norm on ( , 1] w.r.t. the weight wγ(x) = Cγ exp( |x|γ), γ < 1/2. We will only be able to obtain such a bound in a small neighborhood of 0 (Proposition 39 below), but this will turn out to be sufficient. We first need the following lemma, which bounds the derivative near 1. One can think of it as a one-sided Bernstein-Nikolskii-type inequality (whereas Proposition 37 is a Markov-Nikolskii-type inequality). Lemma 38. Let q be a univariate polynomial, and let γ < 1/2. Then, there exists a constant M γ > 0, depending only on γ, so that sup |x 1| 1 4 |q (x)| M γ Z 0 |q(x)|wγ(x) dx. Proof. By a substitution u2 = x, and using the fact that wγ(x) := Cγ exp( |x|γ) is even, we find Z 0 |q(x)|wγ(x) dx = Z 0 |q(u2)|wγ(u2) 2u du = Cγ |q(u2)u|w2γ(u) du. 3Technically, density functions are defined only up to measure-zero sets. Therefore, one should read statements of the form w(x) . . . for all x . . . as only holding a.e. throughout. Now, since 2γ < 1, we may apply Proposition 37 to the polynomial ˆq(u) = q(u2)u, yielding an upper bound on its derivative for all u R; namely w2γ(u) d du ˆq(u) M2γ Z |q(u2)u|w2γ(u) du 0 |q(x)|wγ(x) dx. As w2γ is monotonically decreasing, we have w2γ(x) w2γ(5/4) for all x 5/4, and so we find sup |u2 1| 1 d du ˆq(u) ˆ Mγ Z 0 |q(x)|wγ(x) dx, ˆ Mγ := M2γC2γ Cγ w2γ( 5 We wish to transform this bound on the derivative of ˆq into a bound on the derivative of q. Note that d du ˆq(u) = d du[q(u2)u] = 2q (u2)u2 + q(u2). We can bound this as follows sup |u2 1| 1 4 |2q (u2)u2 + q(u2)| 3 2 sup |u2 1| 1 4 |q (u2)| sup |u2 1| 1 Now, switching our notation back to the variable x = u2, and using (14), we have 3 2 sup |x 1| 1 4 |q (x)| ˆ Mγ Z 0 |q(x)|wγ(x) dx + sup |x 1| 1 4 |q(x)|. (15) It remains to bound the rightmost term in this inequality. Write cq > 0 for the constant (depending on q) satisfying: sup |x 1| 1 4 |q(x)| = cq wγ( 5 0 |q(x)|wγ(x) dx. (16) If cq 2, we are done immediately by (15), as then 3 2 sup |x 1| 1 4 |q (x)| ˆ Mγ + 2 wγ( 5 0 |q(x)|wγ(x) dx. So assume that cq > 2. Note that Z 0 |q(x)|wγ(x) dx wγ( 5 3 4 |q(x)| dx wγ( 5 4) inf |x 1| 1 inf |x 1| 1 4 |q(x)| 1 wγ( 5 0 |q(x)|wγ(x) dx. (17) Applying the mean value theorem to (16) and (17) on the interval [ 3 4], we find that sup |x 1| 1 4 |q (x)| 2 cq 1 0 |q(x)|wγ(x) dx = 2(cq 1) cq sup |1 x| 1 As cq > 2 by assumption, this yields sup |x 1| 1 4 |q (x)| sup |x 1| 1 and we may conclude from (15) that 3 2 1 sup |x 1| 1 4 |q (x)| ˆ Mγ Z 0 |q(x)|wγ(x) dx. Combining the cases cq 2, cq > 2 finishes the proof with M γ = max{ ˆ Mγ + 2 wγ(5/4), 2 ˆ Mγ}. Proposition 39. Let D be a one-sided LSL distribution on R. Then there exists a constant CD > 0 such that, for any polynomial q, we have 4 |q (x)| CD EX D [|q(X)|] . Proof. We wish to use Lemma 38, which gives us a bound on the derivative q (x) of q when x 1. To transport this bound to the origin, we consider the shifted polynomial ˆq(x) := q(1 x). Let w be the density function of D. Since D is a one-side LSL-distribution, there exists a constant C > 0 and a γ (0, 1/2) such that w(x) C wγ(x) for all x ( , 1]. As wγ is even, bounded from above and below on [0, 1], and wγ(x 1) wγ(x) for x 0, we can find a constant C > 0 such that w(x) C wγ(1 x) x ( , 1]. Now, we find that Z 0 |ˆq(x)|wγ(x) dx = Z 1 |ˆq(1 x)|wγ(1 x) dx |q(x)|w(x) dx 1 C EX D [|q(X)|] . As γ < 1/2, we may apply Lemma 38 to find that sup |x 1| 1 4 |(ˆq) (x)| M γ Z 0 |ˆq(x)|wγ(x) dx M γ C EX D [|q(X)|] . To finish the proof (with CD = M γ/C ), it remains to note that sup |x 1| 1 4 |(ˆq) (x)| = sup |x| 1 We are now ready to prove Theorem 18. Our approach is similar to the proof of Theorem 15 in [6]. Proof of Theorem 18. Let D be a one-side LSL-distribution, and suppose that q is a univariate polynomial satisfying EX D [| sign(X) q(X)|] 1. Since EX D [| sign(X)|] = 1, we may use the triangle inequality to find that EX D [|q(X)|] 1 + 1 = 2. By Proposition 39, this means that, for some constant CD > 1 depending only on D, 4 |q (x)| 2 CD. Set η = (10CD) 1. It follows that sup |x| η |q(x) q(0)| (10CD) 1 2CD = 1 Assuming first that q(0) 0, this means that |q(x) sign(x)| 4 5 x [0, η]. Let w be the density function D. As D is one-sided LSL, and η < 1, we know there is a constant C > 0, independent of η, such that w(x) C for x [0, η]. But, this implies that EX D [| sign(X) q(X)|] C Z η 0 | sign(x) q(x)| dx 4 giving a uniform lower bound on EX D [| sign(X) q(X)|] for all polynomials q. If, on the other hand q(0) 0, the same argument works after replacing [0, η] by [ η, 0]. B.2 Proof details for Theorem 16 In this section, we complete the proof of Theorem 16 started in Section 2.4. Recall that, in light of Theorem 18, it remained to prove the following. Proposition 40. Let p(x) = x(x 1)(x 2)(x 3)(x 4)(x 5). Then, p is square-free, {p 0} R is compact, and p#N(0, 1) is one-sided log-superlinear. Proof. The first two properties follow directly. For the third, let w be the density function of p#N(0, 1). Applying the inverse function rule to w2(x) = C2 exp |x|2 , we have w(y) = C2 X x p 1(y) exp |x|2 1 |p (x)| y R. We need to show that, for some C > 0 and γ (0, 1/2), w(y) C exp( |y|γ) y ( , 1]. (18) Qualitatively, the behavior of w can be described as follows. When y supx R p(x) 17, then w(y) = 0. When y 0, then x p 1(y) implies |x| |y|1/6, and so w(y) exp |y|1/3 . Finally, for any K > 0, there is a uniform lower bound on w(y) for all y [ K, 1]. To show (18), it remains to make this description quantitative. As the leading term of p is x6, there is a K > 0 such that |p 1(y)| = 2 for all y K, and x p 1(y) = |x| 2|y|1/6 y K. This means that, for a (possibly larger) constant K > 0, and some γ (1/3, 1/2), we have w(y) 2 C2 exp 4|y|1/3 1 |p (2|y|1/6)| exp( |y|γ) y K . (19) Now, there is an M > 0 so that x p 1(y) = |x| M y [ K , 1]. It follows, as p 1(y) is non-empty for y 1, that w(y) inf |x| M C2 exp |x|2 1 |p (x)| > 0 y [ K , 1]. (20) Combining (19) and (20), and choosing C > 0 small enough, yields (18). C Technical details on the proof of testable learning C.1 Facts and computations In this section, we prove several facts that we used in the proofs in Appendix A about the moments and other properties of a Gaussian random variable Y and a random variable X that matches the moments of a Gaussian up to degree k and slack η. First, we want to prove Fact 22, which we restate below, about the sum of the absolute values of the coefficients of the polynomials Pi,j. Recall from Lemma 21 that the sum of the squares of the coefficients of the polynomials Pi,j is 1. Fact (Restatement of Fact 22). For any i [n] and j [ni], the sum of the absolute values of the coefficients of Pi,j is at most nd/2. Proof. Similar to before, we write the Pi,j as follows Pi,j(x) = X β a(β) i,j xβ. The β in the sum goes over all multi-indices with |β| being the degree of Pi,j, which is at most d. Thus, there are at most nd terms in the sum. By the structure theorem (Lemma 21), we know that β(a(β) i,j )2 = 1. We can now get a bound on the 1-norm as follows (since the number of coefficients a(β) i,j is at most nd) X β |a(β) i,j | = ai,j 1 nd/2 ai,j 2 nd/2. Next, we want to show two facts about the moments of X. On the one hand, we want to show that under very mild assumptions on η, we can bound the moments of X similar as the moments of a Gaussian Y . We also prove Fact 23, which we also restate below, about the expectation of Pi,j(X). For the Gaussian, we get a bound by Lemma 21 and we generalize this in this fact to the moment-matching distribution. Fact 41. Let η 1 and let D be a distribution that matches the moments of N(0, In) up to degree k and slack η. Then, we have that for any multi-index α with |α| k we have |EX D [Xα] | p Proof. We have that |EX D [Xα] | |EY N(0,In) [Y α] | + 1 |EZ N(0,1) h Z|α|i | + 1 1 + 0 if |α| is odd (|α| 1)!! otherwise which is what we wanted to proof. Fact (Restatement of Fact 23). For any i [d], j [ni] and ℓ k/d, we have that EX D Pi,j(X)ℓ EY N(0,In) Pi,j(Y )ℓ + ηndℓ/2. Proof. Writing Pi,j as in Fact 22 Pi,j(x) = X β a(β) i,j xβ, we can compute EX D Pi,j(X)ℓ as follows EX D Pi,j(X)ℓ = X β1,...,βℓ a(β1) i,j . . . a(βℓ) i,j EX D Xβ1+...βℓ . Now, since X is η-approximately moment matching, we have that (note that Pi,j has degree at most d and thus any term in the sum that degree at most dℓ k and thus the moment matching applies) EX D Xβ1+...βℓ = EY N(0,In) Y β1+...βℓ η. Combining this with the above, we get EX D Pi,j(X)ℓ EY N(0,In) Pi,j(Y )ℓ + η By Fact 22, we thus get that EX D Pi,j(X)ℓ EY N(0,In) Pi,j(Y )ℓ + ηndℓ/2 Finally, we show two facts about the Gaussian distribution. First, we want to give a bound on the moments of a Gaussian random vector that has not necessarily independent entries and for which we only know that the variances are at most 1. If Z were independent, then the bound we show would follow directly by the formulas for the moments of a standard Gaussian. Second, we show Fact 36 that shows how many samples m of N(0, In) we need such that the empirical moments up to degree k match the actual moments of N(0, In) up to slack η. Fact 42. Let Z be a (multivariate) Gaussian random variable with mean 0 and variances at most 1. Then, for any multi-index β, we have that E Zβ |β||β|/2. Proof. We want to show that for any random variables W1, . . . , Wℓwe have j=1 E W ℓ j . To prove this, we use induction and Hölder s inequality. The case ℓ= 1 follows directly and for ℓ 2 we can compute using Hölder s inequality, since 1 ℓ ℓ 1 + 1 Thus, we get, using the induction hypothesis, j=1 E W ℓ j . We now use this result with ℓ= |β| and the W1, . . . , Wℓbeing the Zj, where every entry Zj occurs βj times. Then we get Since E [Zj] = 0 and Var [Zj] 1, we have that E Zℓ j ℓℓ/2 and thus we get E Zβ ℓℓ/2 = |β||β|/2. Fact (Restatement of Fact 36). Given m Ω((2kn)kη 2) samples of N(0, In), we have that with high probability the empirical distribution matches the moments of N(0, In) up to degree k and slack η. Proof. Given m samples from N(0, In), we want to compute the probability that for some α with |α| k we have that the empirical moment is close to the moment with high probability. We can compute using Chebyshev s inequality, where cα is the α-moment of N(0, In) and ˆcα is the empirical moment, that P [|ˆcα cα| > η] Var [ˆcα] m Var [Y α] Var h Y |α| 1 i m (2|α|)|α| To be able to use a union-bound, we need this to be smaller than O(n k), i.e. we need C.2 Remaining proofs In this section, we prove several lemmas that follow closely [21]. Some of these lemmas are also proven in [21], but we include them here for completeness. We first prove Lemmas 26 and 27, which we need in order to show that the expectation of f is close under the moment-matching distribution and the Gaussian distribution. Lemma (Restatement of Lemma 26). Let ε > 0. Suppose that D approximately matches the moments of N(0, In) up to degree k and slack η, where k Ωd ε 4d 7d and η 1 kd. Then, we have that EX D [| f(X) T(P(X))|] O(ε) As mentioned earlier, this proof follows closely [21, Proof of Proposition 8]. In particular, we also need the following lemma from [21]. Lemma 43 ([21, Proposition 6]). We have that, for x Rn, |T(P(x)) F(P(x))| 1 + Cmi i Pi(x) mi 2 mi! The Ci and mi are again as in [21], i.e. we choose Ci = Θd ε 7id and mi = Θd ε 3 7id . For the proof, we furthermore need the bound on the expectation of |Pi,j(X)| from Fact 23. Proof of Lemma 26. We have by Lemma 43 that |T(P(x)) F(P(x))| 1 + Cmi i Pi(x) mi 2 mi! As in [21, Proof of Proposition 8], the RHS of this inequality is the sum over all non-empty subsets S {1, 2, . . . , d} of the following term Y Cmi i Pi(x) mi 2 mi! Continuing exactly as in [21, Proof of Proposition 8], we want to show that any of these 2d 1 terms is at most O(ε/2d). By the inequality of arithmetic and geometric means, we get that this term is at most 1 |S| Cmi i Pi(x) mi 2 mi! Thus, it remains to bound EX D h Pi(X) ℓi 2 i , where ℓi = mi|S|. We do this as follows EX D h Pi(X) ℓi 2 i EX D h Pi(X) 2ℓi 2 i v u u u u t EX D j=1 Pi,j(X)2 where we used the Cauchy-Schwarz inequality in the first step. Continuing further by applying Jensen s inequality to the (convex) function g(a) = aℓi, we get j=1 Pi,j(X)2 j=1 Pi,j(X)2 j=1 Pi,j(X)2ℓi. Combining these two, we get EX D h Pi(X) ℓi 2 i v u u tnℓi i 1 ni j=1 EX D [Pi,j(X)2ℓi]. The next step is now different to [21] since we only have an approximately moment-matching distribution. By Fact 23, we get that EX D Pi,j(X)2ℓi EY N(0,In) Pi,j(Y )2ℓi + ηndℓi assuming that k 2d2mi (this ensures that 2ℓi = 2mi|S| 2mid is at most k/d as required in Fact 23). This condition is slightly different to [21], but it does not change the (asymptotic) definition of k as in (7). By Lemma 21, we now have that EY N(0,In) Pi,j(Y )2ℓi Od p 2ℓi 2ℓi = Od p for any j. Combining this with the above, we can conclude that EX D h Pi(X) ℓi 2 i ni ℓi Od p nℓi i ηndℓi nimi|S| mi|S| , using that η 1 ndℓi . This is true since ℓi = mi|S| mid k and thus η 1 ndk implies η 1 ndℓi . The rest of this proof is again the same as in [21, Proof of Proposition 8]. We can compute EX D h |T(P(X)) F(P(X))| i 2d max =S [d] Cmi i Pi(X) mi 2 mi! 2d max =S [d] Cmi|S| i Od p nimi|S| mi|S| 2d max i [d],s [d] By choice of mi, exactly as in [21], we can conclude that EX D h |T(P(X)) f(P(X))| i 2d exp min i [d] mi which completes the proof. Lemma (Restatement of Lemma 27). For any multi-index α = (α1, . . . , αd) Nn1 Nnd, we have i=1 C|αi| i . Proof. As mentioned earlier, the ideas of the following proof are based on [21, Proof of Lemmas 5 and 7]. We have that | α(F ρ)| = |(F αρ)(0)| (property of convolution) F L αρ L1 αρ L1 (F maps to [ 1, 1]). Thus, it remains to bound αρ L1. Using the product structure of ρ, we have that i=1 αiρCi L1. (21) First, we compute a bound for αiρ2 L1. We generalize this afterwards to ρCi. Recall that B(ξ) = 1 ξ 2 2 if ξ 2 1 0 else ρ2(x) = | ˆB(x)|2 where ˆB is the Fourier transform of B. We first note that ˆB(x) ˆB(x) = | ˆB(x)|2. Thus, we can apply the product rule to get the following formula for the derivative αiρ2 = 1 B 2 L2 ( β ˆB)( αi β ˆB) ( β ˆB)( αi β ˆB), where we used that the conjugate of the derivative is the derivative of the conjugate. We thus get the following by triangle inequality ( αiρ2)(x) L1 1 B 2 L2 ( β ˆB)(x)( αi β ˆB)(x) L1 . We now want to analyze ( β ˆB)(x). We have ( β ˆB)(x) = F(xβB(x)), where F( ) = ˆ stands for the Fourier transform and we used a fact about the derivative of the Fourier transform. Thus, we get furthermore that ( αiρ2)(x) L1 1 B 2 L2 F(xβB(x))F(xαi βB(x)) L1 xβB(x)xαi βB(x) L1 B(x)B(x) L1 The second step uses the Parseval identity; the third step uses that B(x) = 0 implies that x x 2 1; the fourth step uses that B(x)B(x) L1 = B(x) 2 L2. For an arbitrary Ci, we can bound αiρCi L1 as follows. Recall that ρCi(x) = Ci We can compute, using the chain rule that ( αρCi)(x) = Ci n ( αρ2) Ci To illustrate how the chain rule implies this, we can compute 2 if j = 1 0 if j = 1 Doing this iteratively, we get the formula above. Now, we want to compute the L1-norm of that function. We get αiρCi 1 = Z Rn( αiρCi)(x) dx Rn( αiρ2) Ci Rn( αiρ2)(y) dy y = Ci |αi| αiρ2 1. Using the bound from above on αiρ2 1, we thus get αiρCi 1 C|αi| i . (22) Combining (21) and (22) completes the proof. Next, we prove how we can generalize from EY N(0,In) [f(Y )] EX D h f(X) i O(ε) to the fooling condition we need EY N(0,In) [f(Y )] EX D [f(X)] Od(ε). This proof is based on [21, Proof of Proposition 2]. It turns out that this part of [21] does not need that X is k-Gaussian but works on the following, weaker assumptions. Lemma (Restatement of Lemma 29). Let ε > 0. Suppose that D is a distribution such that the following holds EY N(0,In) [f(Y )] EX D h f(X) i O(ε), and PX D [ i, j : |Pi,j(X)| > Bi] O(ε). Then, we have that |EY N(0,In) [f(Y )] EX D [f(X)] | Od(ε). In the proof of this lemma, we use the following theorem about anti-concentration of a polynomial of a Gaussian random variable. Importantly, it is enough to have this result for the Gaussian random variable Y and we do not need it for the moment-matching distribution. Theorem 44 (Carbery and Wright, see [21, Theorem 13] or [7, Theorem 8]). Let p be a degree d polynomial. Suppose that EY N(0,In) p(Y )2 = 1. Then, for ε > 0, PY N(0,In) [|p(Y )| < ε] Od(dε1/d). Proof of Lemma 29. Note that [21, Lemma 12 and proof of Proposition 14] and the second condition in the lemma imply that E [f(X)] E h f(X) i O(ε) 2P Od(εd) < p(X) < 0 and E [f(X)] E h f(X) i + O(ε) + 2P 0 < p(X) < Od(εd) . Using these and the first assumption of the lemma we get that E [f(X)] E [f(Y )] O(ε) 2P Od(εd) < p(X) < 0 and E [f(X)] E [f(Y )] + O(ε) + 2P 0 < p(X) < Od(εd) . We have furthermore that E [sign(p(X))] + 2P Od(εd) < p(X) < 0 = E sign(p(X) + Od(εd) and E [sign(p(X))] 2P 0 < p(X) < Od(εd) = E sign(p(X) Od(εd) . The reason for this is that adding or subtracting the two probability terms can be interpreted as changing the sign for values of X in ( Od(εd), 0) respectively (0, Od(εd)), which the same as shifting the polynomial. Thus, when combining this with the above we get that E sign(p(X) + Od(εd)) E [sign(p(Y ))] O(ε) and E sign(p(X) Od(εd)) E [sign(p(Y ))] + O(ε). Now we apply this result not to the polynomial p, but to the polynomial p Od(εd). This shifts the additional factor from the X-side to the Y -side and we get E [sign(p(X))] E sign(p(Y ) Od(εd)) O(ε) as well as E [sign(p(X))] E sign(p(Y ) + Od(εd)) + O(ε). Combining these two inequalities we get that E sign(p(Y ) Od(εd)) O(ε) E [f(X)] E sign(p(Y ) + Od(εd)) + O(ε). For Y , we have that (since the inequality hold point-wise) E sign(p(Y ) Od(εd)) E [f(Y )] E sign(p(Y ) + Od(εd)) . Now, the two function sign(p(Y ) Od(εd)) and sign(p(Y ) + Od(εd)) differ by at most 2 and only when |p(Y )| Od(εd). We now use an anti-concentration result for Y (the standard Gaussian). Namely, we can use Theorem 44 to conclude that this happens with probability at most Od(ε). Note that we have EY N(0,In) p(Y )2 = 1 since we assumed that the sum of the squares of the coefficients of p, which is exactly EY N(0,In) p(Y )2 for multilinear p, is 1. Thus, E sign(p(Y ) + Od(εd)) E sign(p(Y ) Od(εd)) 2Od(ε) = Od(ε). Thus, since both E [f(X)] and E [f(Y )] are between the values E sign(p(Y ) Od(εd)) and E sign(p(Y ) + Od(εd)) (up to O(ε) for E [f(X)]), we can conclude that also |E [f(X)] E [f(Y )] | Od(ε), Next, we want to prove Proposition 30 that follows closely [21, Proof of Theorem 1]. This proposition shows the fooling condition for arbitrary PTFs. In the proof we need Proposition 20 about the fooling condition for multilinear PTFs and Lemma 31 that, given an arbitrary PTF p, constructs a multilinear PTF pδ that is close to p. Proposition (Restatement of Proposition 30). Let ε > 0. Suppose that D approximately matches the moments of N(0, In) up to degree k and slack η, where k Ωd ε 4d 7d , and η n Ωd(k)k Ωd(k). Then, we have that for any f FPTF,d EY N(0,In) [f(Y )] EX D [f(X)] Od(ε). Proof. As already before, we assume without loss of generality that the polynomial is normalized in the sense that the sum of the squares of the coefficients is 1 (this does not change the PTF). We set We first want to show that the condition on η in this lemma ensures that we can apply both Lemma 31 to construct the multilinear polynomial pδ and Proposition 20 to get the fooling condition for pδ. The condition for Lemma 31 is η δOd(1)n Ωd(1)k k. Note that by our choice of k, we have εOd(1) k Ωd(1). Thus, for our choice of δ, the condition on η needed for Lemma 31 is satisfied by our assumption on η in this lemma. For Proposition 20, we need ˆη = (2k)k/2η k Ωd(k)n Ωd(k), which is also satisfied by our condition on η. Note that we need this condition for ˆη (and not η) since the new random variable ˆX is only moment-matching up to slack ˆη. We now want to show that |P [p(X) > 0] P [p(Y ) > 0]| Od(ε). Note that E [sign(p(X))] = P [p(X) 0] P [p(X) < 0] = P [p(X) 0] (1 P [p(X) 0]) = 2P [p(X) 0] 1 and the same holds for Y . Thus, |P [p(X) 0] P [p(Y ) 0]| Od(ε) will be enough since then |E [sign(p(Y ))] E [sign(p(X))]| = 2 |P [p(Y ) 0] P [p(X) 0]| Od(ε). By Lemma 31, we have that P [p(X) 0] P h pδ( ˆX) δ i δ. Since pδ is multilinear, we can apply Proposition 20 to pδ δ to get that P [p(X) 0] P h pδ( ˆY ) δ i Od(ε) since δ O(ε). Note that we can apply Proposition 20 to P [p(X) 0] instead of E [sign(p(X))] by the relation E [sign(p(X))] = 2P [p(X) 0] 1 from above. By Carbery-Wright (Theorem 44), we get that P h |pδ( ˆY )| δ i O(ε), thus we further have that P [p(X) 0] P h pδ( ˆY ) δ i Od(ε). Applying again Lemma 31, we get that P [p(X) 0] P [p(Y ) 0] Od(ε). By an analogous calculation, we get that P [p(X) 0] P [p(Y ) 0] + Od(ε), which completes the proof. We now want to prove Lemmas 32 and 33 that show that the construction of ˆX respectively ˆY preserve the assumptions on the distribution. The latter lemma is also proved in [21, Lemma 15], but we include it here for completeness. Lemma (Restatement of Lemma 32). Suppose X approximately matches the moments of N(0, In) up to degree k and slack η. Then, ˆX approximately matches the moments of N(0, In N) up to degree k and slack ˆη = (2k)k/2η. Proof. For a multi-index α, let cα be the α-moment of a Gaussian. We want to show that E h ˆXαi cα ˆη for any α with |α| k. We can compute the following, by writing Zi,j = Z(i) j , E h ˆXαi cα = j=1 ˆXαi,j i,j N Xi + Zi,j r Zαi,j r i,j In the second step we used the definition of ˆX and in the last step, we used that Z and X are independent. Now, E since X matches the moments of N(0, In) up to degree k and slack η. Thus, we get E h ˆXαi cα = E Zα β + |E [Y α] cα| The second-to-last step is the same computation from above but backwards (note that the moments of Z used for the construction of ˆY are the same as those of Z used for the construction of ˆX) and the last step used that Y is Gaussian and thus E [Y α] = cα. We can furthermore compute E h ˆXαi cα = η X j=1 Zβi,j i,j where the third step uses that the Zi,j are independent for different i. We now get that, using Fact 42, E j=1 Zβi,j i,j |βi, ||βi, |/2 Thus, continuing with the above, we get E h ˆXαi cα = η X i=1 Zβi,j i,j (2k)k/2η = ˆη, where 1 is the all-ones vector. This completes the proof. Lemma (Restatement of Lemma 33). If Y N(0, In), then ˆY N(0, In N). Proof. Since ˆY is a linear transformation of the Gaussian vector (Y, Z ), it is jointly Gaussian and thus to show the lemma, it is enough to show that the expectation of ˆY is 0 and the covariance matrix is the identity. Writing as above Z i,j = Z (i) j , we have for any i [n], j [N] that E h ˆYi,j i = E 1 N Yi + Z ij N 0 + 0 = 0. Furthermore, for any i [n], j [N] we have that, by independence of Y and Z , Var h ˆYi,j i = 1 N Var [Yi] + Var Z i,j = 1 For any i [n], j1 = j2 [N] we get that Cov h ˆYi,j1, ˆYi,j2 i = E 1 N Yi + Z i,j1 N Yi + Z i,j2 N Yi Z i,j1 N Yi Z i,j2 + E Z i,j1Z i,j2 N + 0 + 0 1 Finally, for any i1 = i2 [n], j1, j2 [N] we have that Cov h ˆYi1,j1, ˆYi2,j2 i = 0 by independence of ˆYi1,j1 = Yi1 + Z i1,j1 and ˆYi2,j2 = Yi2 + Z i2,j2 for i1 = i2. Thus, as argued in the beginning, this shows that ˆY N(0, In N) and completes the proof. Finally, we want to provide the details of the missing parts in the proof of Lemma 31 about the existence of the polynomial pδ. For this, it remains to prove Lemmas 34 and 35, which we restate below. First, recall our definition of δ in (13). . (restatement of 13) In the proof of Lemma 31, we then used the following lemma that allows to replace a factor Xℓ i by a multilinear polynomial in the new variable ˆX. Lemma (Restatement of Lemma 34). Let δ > 0. Let i [n] and ℓ [d]. Assume that each of the following conditions holds with probability 1 δ δ (restatement of 10) δ d+1 (restatement of 11) !a δ d+1 3 a d. (restatement of 12) Then, there is a multilinear polynomial in ˆXi,j that is within Od(δ ) of Xℓ i with probability 1 δ . We prove this lemma below. Before that, we want to prove Lemma 35 that states that we can use the above lemma for our choice of δ . Lemma (Restatement of Lemma 35). Let δ as in (13). Assuming η δOd(1)n Ωd(1)k k, there is a choice of N (independent of i or ℓ) such that (10), (11) and (12) hold, each with probability 1 δ The proof of this lemma is a combination of the following lemmas. Lemma 45. Assuming 1 δ d 1 N(2k)k/2 , we have with probability 1 δ Lemma 46. Assuming N 3(2k)k/2 , we have with probability 1 δ Note that for the above condition on η to be meaningful (we need η > 0), we also need to ensure that N > 2d δ 2d+3 . Lemma 47. Assuming η 1 (2k)k/2 we have for any 3 a d with probability 1 δ Using these three lemmas, we are now able to prove Lemma 35. Proof of Lemma 35. It remains to argue that there is a choice of N such that the condition on η in the lemma statement ensures that the conditions on η in Lemmas 45, 46 and 47 are satisfied. We argue below that the following choice of N is enough, which will complete the proof, n3d(2d+3)/2 δ(d+1)(2d+3) For Lemma 45, we need η 1 δ d 1 N(2k)k/2 . Plugging in the value of N and δ , it is enough for η to satisfy (note that 1 δ d 1 1 since δ is small) η Od δOd(1) nΩd(1)(2k)k/2 , which holds since by assumption η δOd(1)n Ωd(1)k k. For Lemma 46, we need η N 3(2k)k/2 and N > 2d δ 2d+3 . The latter is clearly satisfied by the choice of N as in (23). For the former, plugging in again the values of N and δ , it is enough for η to satisfy η Od δOd(1) nΩd(1)(2k)k/2 , which again holds since η δOd(1)n Ωd(1)k k. For Lemma 47, we need N 100d2 δ 2d+3 and η 1 (2k)k/2 . The former is again directly satisfied by the choice of N. Furthermore, also the latter is true, again since we assume η δOd(1)n Ωd(1)k k. Next, we want to prove Lemmas 45, 46 and 47. Proof of Lemma 45. Using Markov s inequality, we get that E 1 N PN j=1 ˆXi,j 2 Since, ˆX approximates the moments of N(0, In N) up to degree k and error ˆη, we can compute the above expectation as follows E h ˆXα i, i ˆηN(N 1) + (1 + ˆη)N = ˆηN 2 + N. Here, we used that the expectation E h ˆXα i, i is at most ˆη if some αj = 1 (for such α the expectation of N(0, In N) is 0) and at most 1 + ˆη otherwise (since the expectation of N(0, In N) is 1 for such α). Thus, we get that (ˆηN + 1)δ 2 δ since by assumption we have that ˆη = (2k)k/2η Proof of Lemma 46. We again use Markov s inequality to get j=1 ˆX2 i,j E 1 N PN j=1 ˆX2 i,j 1 2 We are thus interested in E PN j=1 ˆX2 i,j 1 2 . We can expand this as follows E h ˆX2 i, 1 αi . If some αj = 1, then the expectation will be, for some j1 = j2 [N], E h ˆX2 i,j1 1 ˆX2 i,j2 1 i = E h ˆX2 i,j1 ˆX2 i,j2 i E h ˆX2 i,j1 i E h ˆX2 i,j2 i + 1 1 + ˆη (1 ˆη) (1 ˆη) + 1 = 3ˆη. As for the proof of Lemma 45, this is true since the corresponding Gaussian moments are all 1. If no αj = 1, we get E ˆX2 i,j 1 2 = E h ˆX4 i,j i 2E h ˆX2 i,j i + 1 3 + ˆη 2(1 ˆη) + 1 = 2 + 2ˆη, since the second and fourth moment of a Gaussian are 1 and 3 respectively. Summarized we get that N(2 + 2ˆη) + N(N 1)3ˆη = 2N N ˆη + 3N 2ˆη. Together with the above we get that j=1 ˆX2 i,j 2N N ˆη + 3N 2ˆη since by assumption we have that ˆη = (2k)k/2η Proof of Lemma 47. For any 3 a d, similar to before, we compute using Markov s inequality j=1 ˆXa i,j E 1 Na/2 PN j=1 ˆXa i,j 2 We thus need to bound E PN j=1 ˆXa i,j 2 . We get j=1 ˆXa i,j E h ˆXa α i, i N 2(2a)a since for any α we have E h ˆXa α i, i (2a)a by Fact 41, since by assumption we have η 1 (2k)k/2 or in other words ˆη 1. Combining this with the above, we get j=1 ˆXa i,j N aδ 2d+2 = (2a)a N a 2δ 2d+2 . We have for any a 3 that ((2a)a) 1 a 2 100a and thus we get that j=1 ˆXa i,j N a 2δ 2d+2 a 2 1 δ 2d+2 a 2 1 δ 2d+2 In the third step we used that by assumption we have that N 100d2 δ 2d+3 and d a. In the last step, we then used that a 2 1, together with δ 2d+3 Finally, it remains to prove Lemma 34. Proof of Lemma 34. By the assumptions, the conditions (10), (11) and (12) hold simultaneously with probability 1 d δ d = 1 δ . We want to construct a multilinear polynomial in ˆXi,j such that conditioned on (10), (11) and (12) it is within Od(δ ) of Xℓ i . This will complete the proof. The construction follows [21, Proof of Lemma 15]. First, note that by construction we have Xℓ i = N ℓ/2 Expanding the sum and grouping the terms according together according how the power of ℓis partitioned to different ˆXi,j, we get 1 a1 ar P as=ℓ c(a1, . . . , ar) X j1,...,jr [N] distinct where the c(a1, . . . , ar) are constants that capture how often the latter terms occur if we multiply the product out. More precisely, c(a1, . . . , ar) = ℓ a1, . . . , ar 1 |{s : as = t}|. The strategy is now as follows. We want to approximate the terms j1,...,jr [N] distinct separately by multilinear polynomials. If ar = 1, then the terms is already multilinear and there is nothing to do. Note that if ℓ= 1, then we only have this case, so from now on we assume ℓ 2. If ar = 2, then we want to show that j1,...,jr [N] distinct j1,...,jˆr [N] distinct Od(δ ), (24) where here and also later ˆr is the largest s [r] such that as = 1 (we assume here and throughout this proof that in case no as = 1, i.e. we have an empty sum and an empty product, the second term on the LHS is 1; the reason for this is that, intuitively, we want to make the term multilinear by removing all powers higher than 1, which leaves 1 in case no as = 1). If ar 3, then we want to show that j1,...,jr [N] distinct Od(δ ). (25) Once we have shown these, the idea is to remove all terms with ar 3 and replace the terms for ar = 2 by the multilinear term on the LHS of (24). We get that Xℓ i is within Od(δ ) of 1 a1 ar P as=ℓ ar 2 c(a1, . . . , ar) X j1,...,jˆr [N] distinct which is multilinear. The Od here directly also covers that we need to multiply the Od(δ ) from above with the constants c(a1, . . . , ar) and then sum over the choices of a1, . . . , ar and over r. This then completes the proof. Thus, it remains to prove (24) and (25). We do this using the assumptions of the lemma and by induction on r (and technically also over ℓ; the base case for ℓ= 1 was already covered above). We also inductively show that j1,...,jˆr [N] distinct This will be needed to prove (24) and (25). Base case r = 1. If r = 1, then j1,...,jr [N] distinct In this case, we have ar = ℓ. If ℓ= 2, we need to show (24) and this follows directly from (11) (since δ d+1 Od(δ )). If ℓ 3, then we need to show (25), which follows again directly from (12). Also note that (26) holds since 1 Od 1 Induction step. We now assume that r 2 and that we have proven the result for all values smaller than r. The goal is to now show the result for r. We compute j1,...,jr [N] distinct j1,...,jr 1 [N] distinct jr {j1,...,jr 1} j1,...,jr 1 [N] distinct j1,...,jr 1 [N] distinct jr {j1,...,jr 1} We first want to analyze the second term. Note that this term is equal to j1,...,jr 1 [N] distinct !as+1[s=t]ar . Now, note that all terms in this sum have been considered in the induction hypothesis. Also note that at + ar 3 (since ar 2, otherwise there is nothing to prove) and thus every term in the above sum over t is at most Od(δ ) in absolute value by (25). Thus, we also get that j1,...,jr 1 [N] distinct jr {j1,...,jr 1} Od(δ ). (28) In the following it thus remains to analyze the first term in order to show (24) respectively (25) j1,...,jr 1 [N] distinct We now need to distinguish the cases ar 3 (in which case we need to show (25)) and ar = 2 (in which case we need to show (24)). Case I: ar 3, proof of (25). In this case, we have, by (12), that To analyze the term X j1,...,jr 1 [N] distinct we want to apply the induction hypothesis. We need to again distinguish three cases, namely ar 1 3 (in which case we can use (25)), ar 1 = 2 (in which case we can use (24)) and ar 1 = 1 (in which case the term on the LHS of (24) are in fact equal). We do this in the following. If ar 1 3, then, by the induction hypothesis for (25), j1,...,jr 1 [N] distinct and thus we get, using the decomposition (27) and the bound (28) on the second term, j1,...,jr [N] distinct Od(δ )δ d+1 + Od(δ ) Od(δ ). If ar 1 = 2, then, by the induction hypothesis for (24), we have that j1,...,jr 1 [N] distinct j1,...,jˆr [N] distinct By the induction hypothesis on (26), we get j1,...,jˆr [N] distinct Combining these two, we get j1,...,jr 1 [N] distinct Od(δ ) + Od Thus, we get (note that r d), using again the decomposition (27) and the bound (28) on the second term, j1,...,jr [N] distinct Od(δ ) + Od δ d+1 + Od(δ ) Od(δ ). If ar 1 = 1, then ˆr = r 1 and thus j1,...,jr 1 [N] distinct j1,...,jˆr [N] distinct As above, we get that, by the induction hypothesis on (26), j1,...,jˆr [N] distinct Again by using the decomposition (27) and the bound (28) on the second term, we get j1,...,jr [N] distinct δ d+1 + Od(δ ) Od(δ ). Thus, for all three cases, we get that (25) still holds for r. Case II: ar = 2, proof of (24). In this case, we have, by (11), that For the term X j1,...,jr 1 [N] distinct we again need to distinguish two cases, namely ar 1 = 2 and ar 1 = 1. If ar 1 = 2, then we have, as above, by the induction hypothesis for (24), j1,...,jr 1 [N] distinct j1,...,jˆr [N] distinct Also as above, by the induction hypothesis on (26), we get j1,...,jˆr [N] distinct Combining these, we have that j1,...,jr 1 [N] distinct j1,...,jˆr [N] distinct j1,...,jr 1 [N] distinct j1,...,jˆr [N] distinct j1,...,jˆr [N] distinct Od(δ )(1 + δ d+1) + Od Thus, we get that, using again the decomposition (27) and the bound (28) on the second term, j1,...,jr [N] distinct j1,...,jˆr [N] distinct Od(δ )(1 + δ d+1) + Od δ d+1 + Od(δ ) = Od(δ ) + Od(δ )δ d+1 + Od δ d+1 + Od(δ ) If ar 1 = 1, then we get j1,...,jr 1 [N] distinct j1,...,jˆr [N] distinct Again, by the induction hypothesis on (26), we get j1,...,jˆr [N] distinct As above, we get that j1,...,jr 1 [N] distinct j1,...,jˆr [N] distinct j1,...,jˆr [N] distinct Thus, we can conclude that, exactly as above by the decomposition (27) and the bound (28) on the second term, j1,...,jr [N] distinct j1,...,jˆr [N] distinct δ d+1 + Od(δ ) Hence, for both cases, we get that (24) still holds for r. Proof of (26). We can also analogously show (26). First note that if ˆr < r 1, then (26) follows directly from the induction hypothesis since the term j1,...,jˆr [N] distinct already appeared for r 1 by using a s = as if s < r 1 ar 1 + ar if s = r 1 and we can directly apply the induction hypothesis. Thus, we only need to show (26) for ˆr r 1 and in particular ˆr 1. If ˆr = 1, then by (10), the term is at most 1 δ r . Otherwise, ˆr 2 and we can do a similar expansion as above to get j1,...,jˆr [N] distinct j2,...,jˆr [N] distinct j1 {j2,...,jˆr} j2,...,jˆr [N] distinct j2,...,jˆr [N] distinct j1 {j2,...,jˆr} The first term is then, by the induction hypothesis and (10), at most Od 1 δ r 1 1 δ in absolute value. The second term can be expanded as above and is thus equal to j2,...,jˆr [N] distinct By (24) for r 1, we get that this is within Od(δ ) of j2,...,jt 1,jt+1...,jˆr [N] distinct (again if ˆr = 2, then the above term should be interpreted as 1). This term now is, by (26) for r 2, Od 1 δ r 2 . Note that for r = 2 (and thus ˆr = 2) we cannot apply the induction hypothesis but the term is 1 and thus the bound still holds. Combing these result, we get that j1,...,jˆr [N] distinct r 1! 1 δ + Od(δ ) + Od This is exactly (26) for r. By induction, we have now shown (24) and (25) for all r and as argued above, this completes the proof of this lemma. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: For the main claims made in the abstract and introduction we give a proof overview in Section 2 and complete proofs in the appendices. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We discuss the limitations of our work in the corresponding part of Section 1.2. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: For all the theoretical results, we give a proof in the main part of the paper or in the appendices. For the main results, we give a proof overview in Section 2. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The research conducted in this papers conforms with the Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This paper discussed general theoretical results and algorithms and has as such no immediate societal impact. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] . Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] . Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] . Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Since this paper discusses theoretical results only and does not contain experiments, this question does not apply. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.