# active_classification_with_few_queries_under_misspecification__8e7da2fa.pdf

Active Classification with Few Queries under Misspecification

Vasilis Kontonis UT-Austin vasilis@cs.utexas.edu

Mingchen Ma University of Wisconsin-Madison mingchen@cs.wisc.edu

Christos Tzamos University of Athens and Archimedes AI ctzamos@gmail.com

We study pool-based active learning, where a learner has a large pool S of unlabeled examples and can adaptively ask a labeler questions to learn these labels. The goal of the learner is to output a labeling for S that can compete with the best hypothesis from a given hypothesis class H. We focus on halfspace learning, one of the most important problems in active learning. It is well known that in the standard active learning model, learning the labels of an arbitrary pool of examples labeled by some halfspace up to error ϵ requires at least Ω(1/ϵ) queries. To overcome this difficulty, previous work designs simple but powerful query languages to achieve O(log(1/ϵ)) query complexity, but only focuses on the realizable setting where data are perfectly labeled by some halfspace. However, when labels are noisy, such queries are too fragile and lead to high query complexity even under the simple random classification noise model. In this work, we propose a new query language called threshold statistical queries and study their power for learning under various noise models. Our main algorithmic result is the first query-efficient algorithm for learning halfspaces under the popular Massart noise model. With an arbitrary dataset corrupted with Massart noise at noise rate η, our algorithm uses only poly log(1/ϵ) threshold statistical queries and computes an (η + ϵ)-accurate labeling in polynomial time. For the harder case of agnostic noise, we show that it is impossible to beat O(1/ϵ) query complexity even for the much simpler problem of learning singletons (and thus for learning halfspaces) using a reduction from agnostic distributed learning.

1 Introduction

Obtaining labeled examples is often challenging in applications as querying either human annotators or powerful pre-trained models is time-consuming and/or expensive. Active learning aims to minimize the number of labeled examples required for a task by allowing the learner to adaptively select for which examples they want to obtain labels. More precisely, in pool-based active learning, the learner has to infer the labels of a pool S of n unlabeled examples, and can adaptively select an example x S and ask for its label.

Even though it is known that active learning can exponentially reduce the number of required labels, this is unfortunately only true in very idealized settings such as datasets labeled by one-dimensional

*Equal Contribution

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

thresholds or structured high-dimensional instances (e.g., Gaussian marginals) [17, 4, 6, 7, 3, 19, 23]. It is well-known that without such distributional assumptions, even for linear classification in 2 dimensions, active learning yields no improvement over passive learning [15, 16].

Active Learning with Queries To bypass the hardness results and establish learning without restrictive distributional assumptions [5, 33, 30, 31, 47, 10] introduce enriched queries, where the learner is allowed to make more complicated queries. Broadly speaking, there are two lines of work that study active learning with enriched queries. The first one designs queries based on the structure of the hypothesis class it aims to learn. For example, [33] design comparison queries for learning halfspaces in 2 dimensions, [31] design same-leaf queries for learning decision trees, and [8] design derivative queries to learn polynomial threshold functions. The success of these queries heavily depends on the relation between the hypothesis class and the properties of the queries and thus a completely new query language has to be designed if the learning problem gets changed. The other line of work such as [5, 10, 44] study mistake-based queries, asking questions like if a positive example exists in a given set. These works break down a complicated learning problem into a small number of simple statistical tasks that require only very few rounds of interactions with a labeler who knows the hidden labels and can easily solve these tasks. These learning models can be formally summarized as follows. Definition 1.1 (Active Learning with Enriched Queries). Given a (multi)set of n unlabeled examples S X over a domain X, a learner A wants to output a hypothesis ˆf : X { 1} by adaptively submitting binary queries to a labeler who knows the hidden labels of the examples S. Each query q : 2S { 1} {0, 1} is a function that takes a subset of examples in S together with their unknown labels as input and outputs a number in {0, 1}. If f(x) : S { 1} is the unknown labeling function, the learner aims to make the error of ˆf,

err(f) := 1

x S 1( ˆf(x) = f(x))

as small as possible compared to a target class of binary hypotheses H over X.

In the realizable setting where the unknown labeling function belongs to the target hypothesis class H, several non-trivial hypothesis classes including the class of halfspaces have been proven efficiently learnable using only a logarithmic number of rounds of interactions. For example, [10] shows for any set S of n examples satisfying γ-margin condition with respect to the underlying halfspace h , one can use O(d log(d/γ)) seed queries, which returns an example with a specified label in a given region, to perfectly learn labels of all examples in S efficiently. More recently, [44] shows that for an arbitrary set of n examples labeled by an arbitrary halfspace, efficiently learning all the labels only requires O(d3 log(n)) region queries, which ask if an example with a specified label exists in a given region.

While these results show that mistake-based queries are extremely powerful in the realizable case, they fail to capture most practical cases where there is typically model misspecification or errors in the data. In fact, it is not hard to see that even under tiny amounts of noise, these mistake-based queries become essentially unusable and can be simulated by label queries. For example, if the labels of the examples are flipped with probability η, say 10% (random classification noise [2]), then a region query over a region that contains more than 10 examples will return yes with extremely high probability, which provides no information to the learner. In fact, even for the more powerful seed query used in [10], where in addition an example with the specified label is returned, [5] shows that if an algorithm can learn the labels of a set of examples S with error η + ϵ, with ground truth in a given hypothesis class H corrupted by η-level random classification noise using M(ϵ) queries, then one can simulate such an algorithm using M(ϵ)/η label queries. Such a result implies even for simple hypothesis classes such as the class of intervals in real line or the class of halfspaces in 2 dimensions, one needs at least Ω(1/ϵ) seed/region queries to learn the labels of examples in a given dataset to error η + ϵ. The gap between the realizable setting and the noise setting motivates the following natural questions:

Can we design a simple noise-tolerant query language, that allows learning non-trivial hypothesis classes efficiently with few queries?

Similar to [10, 35], in this work, we focus on the class of halfspaces, one of the most important hypothesis classes in active learning.

1.1 Learning Model And Our Contribution

We propose a new query language called Threshold Statistical Queries (TSQ), which generalizes the region queries used in [10, 35] and study its power for learning halfspaces under noise. Definition 1.2 (Threshold Statistical Queries (TSQ)). Let S be a set of examples in a domain X with a corresponding labeling function f : S { 1}. A threshold SQ query q(ϕ, τ) takes as input a function ϕ(x, y) over S { 1} and a threshold τ R, and answers whether P x S ϕ(x, f(x)) τ.

TSQ is a simple generalization of region queries and vanilla label queries. For a region U S and a target label a { 1}, if ϕ(x, y) = 1(x U f(x) = a), then q(ϕ, 1) is exactly the region query, where it checks if at least one example in U has label a { 1}. Furthermore, if |U| = 1, then q(ϕ, 1) is exactly the classic label queries.

Our goal is to study the power of TSQ for active learning under different label noise models. We consider 3 progressively more challenging noise models commonly studied in the literature: Random, Massart and Adversarial. Definition 1.3 (Active Learning under Label Noise). Let H be a hypothesis class over domain X. Let S X be a (multi)set of n examples and h H be a ground truth function. For a parameter η [0, 1/2), the labeling function f(x) over S is generated in the following way under the three label noise models.

Random Classification Noise (RCN) [2]: For each x S, f(x) is h (x) with probability η and h (x) otherwise.

Massart Noise [41]: For each x S, f(x) = h (x) with some unknown probability η(x) η and h (x) otherwise.

Adversarial Label Noise: For an unknown subset S containing η fraction of examples from S, f(x) = h (x) for all x S and f(x) = h (x) for all x S \ S .

Given the unlabeled examples S, and an error parameter ϵ (0, 1), the goal of the learner is to output a labeling ˆf over S such that with high probability err( ˆf) η + ϵ

We remark that in our model, after the labeling f(x) is generated, the label of each example x S will be fixed throughout the learning process, also known as persistent noise. This means if an algorithm keeps querying the label of the same example, it will receive the same answer. Furthermore, under the Random classification noise/Massart noise model, we will assume the size n of the dataset S is large enough (poly(d, 1/ϵ)), because if n is small we have even no guarantee on the error of the ground truth hypothesis. Our main algorithmic result is the first distribution-free halfspace learning algorithm that achieves both computational efficiency and query efficiency under the (persistent) Massart noise model and the Random Classification Noise model. Theorem 1.4. Let H = {h(x) = sign(w x) | w Sd 1} be the class of halfspaces over Rd. Given parameters ϵ, δ (0, 1), a set S of n = poly(d, 1/ϵ, log(1/δ)) examples in X and TSQ query access to an unknown labeling corresponding to a ground truth hypothesis h H corrupted with Massart noise η [0, 1/2), we can compute in poly(n) time a labeling ˆf such that err( ˆf) η + ϵ, with probability at least 1 δ, making O(d3 log3(1/ϵ)) threshold SQ queries.

Importantly, unlike [10], we make no structure assumption over the dataset S, and the query complexity of our algorithm qualitatively matches the query complexity obtained by [35], which only holds in the realizable setting. Theorem 1.4 shows a sharp separation between standard active learning/region queries which require poly(1/ϵ) query complexity and threshold statistical queries where poly log(1/ϵ) query complexity suffices under the Massart noise and Random classification noise models. Furthermore, we will discuss in Appendix A that the TSQs we use in the algorithm have very simple strictures. A natural question is whether TSQ can tolerate more complex noise.

Our second main result is a negative result showing that even using TSQ, it is still hard to achieve query efficiency under the adversarial label noise even for simpler hypothesis classes such as the class of singleton and the class of intervals and thus for the class of halfspaces. Formally, we have the following theorem. Theorem 1.5. Let H be the class of singleton functions over the domain X = N. For every ϵ (0, 1) and m > Ω(1/ϵ), there is a set S of m examples over X and a labeling function f for S such that

any learning algorithm A that makes less than O(1/ϵ) TSQs must output, with probability at least 1/3, a labeling function ˆf with error err( ˆf) > opt + ϵ, where opt = minh H err(h).

As we can always embed an instance of learning singleton into an instance of learning a 2-dimensional halfspace, Theorem 1.5 also implies a Ω(1/ϵ) query complexity for agnostic learning halfspaces with TSQ. This shows a sharp separation of the performance of TSQ under different noise models and leaves designing more robust query languages as an important future direction. From a technical perspective, unlike usual approaches in the active learning literature which explicitly construct hard instances [16, 28], we obtain our result via reduction from the agnostic distributed learning problem studied by [32] for which a communication complexity lower-bound has been established. To the best of our knowledge, this is the first result that connects distributed learning and active learning, two seemingly unrelated learning models.

Though, Theorem 1.5 shows that agnostic learning up to error opt + ϵ cannot be achieved in a query efficient way, inspired by the work of [5], it is possible to use only O(d log(1/ϵ)) TSQ to learn the label of a dataset up to error O(opt) + ϵ for every hypothesis class with finite VC dimension, though the running time of the algorithm is exponential. Such a result might be of independent interest as how to efficiently learn a hypothesis up to error O(opt) + ϵ have already been studied in many agnostic learning literature such as [13, 14, 22]. We leave the proof of Theorem 1.6 to Appendix C due to the space limit. Theorem 1.6. Let X be the space of examples and H be a hypothesis class over X with VC-dimension d, there is an algorithm such that for every ϵ, δ (0, 1), for every set S of n examples, and for every labeling function f(x), it makes O(d log(1/ϵ)) TSQs and outputs a labeling ˆf such that with probability 1 δ, err( ˆf) O(opt) + ϵ, where opt = minh H err(h).

1.2 Related Works

Active Learning with Mistake-Based Queries Learning with mistake-based queries has a long history [1, 39, 5, 10]. A typical mistake-based query can be understood as follows. A learner selects a subset of examples T X and proposes a possible labeling for them to a labeler. The labeler will return an example x T labeled incorrectly by the learner or return nothing when every example in T is labeled correctly. Beyond being quite successful in theory, mistake-based queries also have wide applications in commercial systems [11, 25]. In the realizable setting, it has been well-known that such queries can be used to implement the Halving algorithm [38] and achieve O(d log(1/ϵ)) query complexity for hypothesis classes of VC dimension d. However, it is only until very recently [10, 35] that people know how to use these queries to design algorithms that achieve both computational efficiency and query efficiency. In the noisy setting, [5] shows that even under random classification noise, it is impossible to use such queries to do query-efficient learning even for very simple classes. In this work, we propose TSQ as a robust generalization of these queries.

Statistical Query Learning Model Close to our threshold statistical learning model (TSQ) is the classic statistical learning model (SQ) proposed by [34]. SQ was originally designed to overcome random classification noise but has numerous applications in learning theory literature as a refinement of the PAC learning model which captures most algorithms used in practice. It has been used as a tool for obtaining efficient learning algorithms robust to noise [9] and as an evidence of computational difficulty of a statistical problems [21]. In the SQ model, the learner has no direct access to any example but can evaluate the expectation E(x,y) D ϕ(x, y) for an arbitrary bounded function ϕ(x, y) within accuracy δ. This means in SQ model, a learning algorithm should consider both the time used for computing ϕ(x, y) but also have to consider the final accuracy. On the other hand, a TSQ is a boolean function of the unlabeled examples and their hidden labels. No matter the complexity, any TSQ, q(ϕ) can be computed by the labeler accurately in time at most O(n). Furthermore, as in SQ model, a learner has no access to individual examples, SQ learning does not naturally fit in the active learning model. One even cannot implement classic active learning algorithms such as CAL or Halving [27, 28] in the SQ model. As opposed to SQ, our TSQ model is more powerful as it can isolate individual examples and thus fills such a gap. We remark that this more powerful type of access is not needed for Theorem 1.4 and can be implemented with SQ queries of poly(ϵ) accuracy.

Learning Halfspaces with Massart Noise Active learning for halfspaces under Massart noise also has a long history. Many works [4, 46, 3, 48] design learning algorithms that achieve both

computational efficiency and query efficiency under structured distributions such as the uniform distribution over the unit sphere, the Gaussian distribution, and log-concave distributions. On the other hand, without distributional assumptions, learning under Massart noise is much more challenging. Computationally efficient learning algorithm for learning halfspaces under Massart noise [18, 12, 20] were only recently discovered for passive learning. Our algorithm is the first one that works in an active learning setting and achieves both computational efficiency and query efficiency.

2 Learning Halfspaces under Massart Noise

In this section, we present Theorem 1.4, our main algorithmic result. The full proof is left at Appendix A. To start with, we give a high-level overview of how our algorithm works. Similar to previous works on distribution-free learning halfspaces [9, 18, 35], our learning algorithms recursively run two subroutines over the dataset S. The first subroutine is a weak learning algorithm that works under structured datasets S . More specifically, we assume that all points in the dataset have unit norm and for every direction w Sd 1, there are at least Ω(1/d) fraction of the examples x in S

such that |w x| Ω(1/

d). Intuitively, the regions {x S | |w x| Ω(1/

d)} correspond to examples for which a halfspace with normal vector w is more confident about the label. If S contains a non-trivial fraction of examples in S and we can run a weak learning algorithm over S

to learn a vector w that has a classification error η + ϵ over {x S | |w x| Ω(1/

d)}, we are able to label a non-trivial fraction of examples in S with a low error. In Section 2.1, we will design such a learning algorithm that is robust to Massart noise and achieves query efficiency and computational efficiency simultaneously. However, in general, it is not always possible to find a large enough subset from S that is in an approximate radially isotropic position. Forster s transform [26], a powerful preprocessing technique can be used to solve this issue. Given any set of n examples in Rd, we can always use Forster s transform to find a subset of kn/d examples that lie in a k dimensional subspace such that after a non-linear transformation, the transformed examples are in an approximate radially isotropic position. This implies that if we can implement our weak learning algorithm over the transformed data, each round, we are able to label 1/d fraction of the whole dataset with small error and thus after d log(1/ϵ) rounds of weak learning, only ϵ fraction of the examples are unlabeled. In Section 2.2, we will show how to use Forster s transform to select a large fraction of the dataset for the weak learning algorithm and how to implement the weak learning algorithm over the transformed dataset using TSQ. Furthermore, we want to point out that the TSQs we use in our algorithms have very simple structures. We leave the discussion in detail in Appendix A.

2.1 A Weak Learning Oracle

In this section, we present our weak learning algorithm, Algorithm 1, which plays a central role in Theorem 1.4. Our main algorithmic result in this section is the following theorem, the proof of which can be found in Appendix A.

Theorem 2.1. Let V Rd be a subspace of dimension k and S V be a set of n = poly(k, 1/ϵ, log(1/δ)) examples with unit length. Let h (x) = sign(w x), w Bk 1 be the ground truth hypothesis. If for every unit vector w Bk 1, at least 1/4d fraction of examples x S satisfy |w x| 1/(2

k), and u w 1/(4

k), then under the Massart noise model, for every ground truth , with probability at least 1 δ, Algorithm 1 outputs (S , ˆf S ) such that |S | n/(4k) and ˆf S has error at most η + ϵ over S , using O(d2 log2(1/ϵ)) TSQ, in poly(n, k) time.

To understand why Algorithm 1 is robust to Massart noise, we need to understand why such a problem is difficult. Let S be a subset of n example in an approximate radially isotropic position. Take the algorithm in [35] as an example. Such an algorithm uses a modified perception algorithm to learn some w that can perfectly classify all examples that have a large margin with respect to it. Namely, in each round, either the current hypothesis wi perfectly classifies a large fraction of examples or seed/region queries are used to quickly find an example in that region that is misclassified by wi, which will be fed to the perception algorithm and improve wi. In the noisy setting, however, every example x has a constant probability of being misclassified by wi. This implies we need to use queries to find a point where w and wi disagree. To do this, we associate each example x in the region Si = {x S | |w x| Ω(1/

d)}, with a variable Yx {0, 1}, where Yx = 1 if sign(wi x) = y(x) and 0 otherwise. If the noise η(x) = η for every example x i.e. Massart noise

Algorithm 1 WEAKLY LEARNING HALFSPACES (Labeling 1/d fraction of examples via TSQ)

Input: ϵ, δ (0, 1), subspace V Rd of dimension k, S V of n examples with unit length, u, a unit vector in V Output: S S a subset of examples, ˆf S : S { 1} a labeling for S

Let P0 = {x Bk 1 | u x 1/(4

k)}, where Bk 1 is the unit ball in V . Compute x0 P0 using Vaidya s algorithm by Theorem 2.3. for i = 0, . . . , O(k) do

Let wi = xi/ xi Check if over Swi = {x S | |wi x| 1 2

k}, wi has error larger than η + ϵ via TSQ If wi has error less than η + ϵ over Swi, return (Swi, sign(wi x)) and stop the algorithm Draw a random set U from Swi of size m = O(k2/ϵ2). For each x Swi, define

ϕ(x, y) = (Yx(y) η)/(wi x),

where Yx = 1 if y = sign(wi x) and Yx = 0 otherwise.

Use O(d) TSQ to do binary searches along each coordinate and find some ci such that ci 1

x U ϕ(x, y)x ϵ/(8k2). Feed Vaidya s algorithm by ((ci ϵu/4)t, 0) and compute (Pi+1, xi+1). Report Fail if nothing has been returned

model degenerates to the random classification noise model, then consider the following point

x Si (Yx η)x = X

x S+ (Yx η)x + X

x S (Yx η)x,

where S+ is the subset of examples in Si where wi agrees with w and S = Si \ S+. For each x S+, E Yx = η, while for every x S , E Yx = 1 η. This implies that in expectation, E x = (1 2η) P

x S x. After properly scaling, this gives a point in Si where wi and w disagree due to the convexity of the problem and thus serves as a counter-example to run the perception algorithm. In particular, since the contribution of each example x only depends on its true label, we can draw random samples from Si and use TSQ along each coordinate to approximately find x up to high accuracy using very few queries via binary search.

However, for Massart noise, this is not the correct way to design a learning algorithm. This is because η(x) is non-uniform over each x. For simplicity, we assume wi x > 0 for each x Si. As η(x) η, a simple calculation shows that E x w 0, where the randomness only comes from the Massart noise. The hope is that if wi has an error η +ϵ over Si, then x wi is larger than some positive number so that we find a counter-example. This is unfortunately not true. Because wi x are different and η(x) are different, even if the error is large, some of the examples with large margins could force x points to the opposite direction, making x wi 0 as well. To overcome this issue, we consider using a slightly more complicated statistic here, where we define

x Si (Yx η) x wi x

instead. Such a point is still easy to approximate up to error ϵ with only d log(1/ϵ) TSQs, because it is each to compute wi x for each x Si. But more importantly, when wi has an error larger than η + ϵ, in expectation wi and w will always disagree on x because

1 |Si|wi x = 1 |Si|

x Si (Yx η) x wi x wi = 1 |Si|

x Si (Yx η) > ϵ. (1)

Furthermore, as wi x is large for every x Si, x has a bounded norm and thus can serve as a counter-example. A technical issue here is that the inequality E x w 0 is quite fragile, due to the randomness of the Massart noise, it is impossible to guarantee x w 0 actually holds after the labeling being fixed. This issue can be fixed using the following trick. Before run the learning algorithm, we will randomly sample a unit vector u. We know from [45] that with constant probability

d and thus by shifting x a little towards u, this will give us a counter example and guarantee the whole algorithm succeeds with a constant probability.

Though, we find a counter-example and can use it to run a perception algorithm in a similar way to [35], this cannot give us a good query complexity. This is because (1) can only guarantee x wi > ϵ, which requires to run the perception algorithm for O(1/ϵ2) rounds to converge to a good hypothesis. Thus, we will solve this problem using Vaidya s cutting plane method. We want to remind the reader of the following convex feasibility problem, which is closely related to our halfspace learning problem. Definition 2.2 (Convex Feasibility Problem). Let K Rd be a convex body. A separation oracle with respect to K is a function on Rd such that for any input x Rd, if x K, then it reports yes , otherwise it outputs some (ct, b) Rd+1 such that for every y K, c y b but c x b. Assuming K Bd 1, given a separation oracle with respect to K and ϵ (0, 1), the convex feasibility problem asks to either find some x K or prove that K does not contain a ball of radius ϵ.

There exists a long line of research for solving the convex feasibility problem for example, [43, 40, 37]. We will use these algorithms as a subroutine of our learning algorithm. Theorem 2.3 (Vaidya s Algorithm). Let K P0 Bd 1 be an unknown convex body. Vaidya s algorithm solves the convex feasibility problem for K as follows. In round i, it maintains a convex body K Pi P0 and a point xi Pi and sends xi to the separation oracle of K. If the oracle returns yes , then it claims xi K, otherwise it computes in poly(d, log(1/ϵ)) time a pair of (Pi+1, xi+1) based on (ct i, bi) the return of the separation oracle. In particular, after T = O(d log(1/ϵ)) rounds, PT does not contain a ball of radius ϵ.

Let the unknown convex body K that we want to solve for the convex feasibility problem be a ball of radius ϵ around w and we want to run Vadidya s algorithm to find some wi close to w . Consider the Pi maintained by Vadiya s algorithm. As with constant probability w u 1/

d as we mentioned earlier, we can guarantee that 0 Pi. Let xi be the point used by Vadidya s algorithm. Then we will check the error of wi = xi/ xi over Si is large, which can be done with a single TSQ. If the error is less than η + ϵ, we are done. Otherwise, we use O(d log(1/ϵ)) TSQ to approximately find a counter example x for wi. Importantly, the halfspace x w 0 separate xi and any w K. This will make it possible to run the next round of Vadiya s algorithm. Since we only care about examples that have margin Ω(1/

d) with respect to wi, when wi is within a ball of radius 1/poly(d) centered at w , every example in Si is agreed by wi and w and thus wi is guaranteed to have error at most η + ϵ. Furthermore, in each round, Vadiya s algorithm shrinks the volume of Pi by a constant factor, and after at most O(d log(1/ϵ)) rounds, we are guaranteed to find a good hypothesis. This gives a weak learning algorithm with a desired query complexity.

2.2 From Weak Learning to Strong Learning

We leave the formal analysis of the algorithm to Appendix A and discuss two technical issues raised in designing Algorithm 2. First, as required in Theorem 2.1, the dataset S should be large enough and satisfy the structured assumption. Thus, to run Algorithm 1, we need to recursively select a dataset of enough size that satisfies the structured assumption from the data we have not labeled. In fact, the structured assumption can be fulfilled by a dataset that is in approximate radially isotropic position. Definition 2.4 (Approximate Radially Isotropic Position). Let S be a multiset of non-zero points in Rd, we say S is in ϵ-approximate radially isotropic position, if for every x S, x = 1 and for every u Sd 1, P

x S(u x)2/|S| 1/d ϵ.

Lemma 2.5. Let S be a multiset of non-zero points in Rd that is in 1/2d-approximate radially

isotropic position. Then for every u Sd 1, we have Prx S |u x| 1/2

Recent results show that for any dataset S, one can efficiently find a non-trivial fraction of the data and a non-linear transformation such that after the transform, the data are in approximate radially isotropic position. Theorem 2.6 (Approximate Forster s Transform [24]). There is an algorithm such that given any set of n points S Rd \ {0} and ϵ > 0, it runs in time poly(d, n, log 1/ϵ) and returns a subspace V of Rd containing at least dim(V )/d fraction of points in S and an invertible matrix A Rd d such

Algorithm 2 STRONG LEARNING HALFSPACES (Label S with few queries up to η + ϵ error )

Input: ϵ, δ (0, 1), S Rd of n examples Output: ˆf : S { 1} a labeling for S L , n |S| while |S| > ϵn/2 do

Apply Theorem 2.6 to S with ϵ = 1/2d to obtain a matrix A and a k-dimensional subspace V Use a single TSQ to check if constant hypothesis +1( 1) has error η + ϵ over S V if constant hypothesis has error at most η + ϵ/2 over S V then

Define ˆf to be the constant over S = S V S S \ S else

Run Algorithm 1 over input parameter ϵ/2, δ/poly(d, log(1/ϵ)), V , f A(S V ) and a random unit vector u V until some (S , ˆf S ) is output. Though Algorithm 1 is run over the transformed dataset f A(S V ), each TSQ can be simulated over the original data as FA(x) preserves the ground truth label.

Define ˆf(x) = ˆf S (FA(x)), x, FA(x) S S S \ S

Define ˆf = 1 for the rest of ϵn/2 examples in S return ˆf

that FA(S V ) is in ϵ-approximate radially isotropic position up to isomorphic to Rdim(V ), where FA(S V ) = {FA(x) := Ax/ Ax | x S V }.

Combine Theorem 2.6 and Lemma 2.5, we know that given any set of n examples S Rd, we can find a subset of at least kn/d examples SV := S V S that lies in some k-dimensional subspace V and some invertible matrix A such that FA(SV ) is in 1/2k-approximate radially isotropic position (up to isomorphic to Rk). Now, for convenience, we assume our transformed data FA(SV ) is exactly our original dataset and we focus on the transformed data. Notice that for each x SV , we have

sign(w x) = sign(A T w Ax) = sign(A T w FA(x)) = sign(proj A(V )(A T w ) f A(x)),

which implies that each transformed example FA(x) is labeled by halfspace v = proj A(V )(A T w ) and has the same label as x. So, we can use Algorithm 1 to learn their labels. However, as Algorithm 1 is run over the transformed data, we have to simulate every TSQ used by the algorithm via a TSQ over the original data. This issue can be overcome using the following argument. Since FA is a bijection between x and FA(x) and the outcome of the function ϕ(x, y) used in a TSQ for each example x can be uniquely represented by two numbers, we can rewrite ϕ(FA(x), y) as a function of x for each FA(x) such that for a TSQ as long as y(FA(x)) = f(x), the result of the query will be the same. This gives us a way to simulate the TSQ over S.

3 Agnostic Learning with Threshold SQ

In this section, we study learning with TSQs under the more challenging adversarial label noise proving Theorem 1.5. In the previous section, we saw that using TSQ, learning halfspaces only requires polylog(1/ϵ) rounds of interactions. We show in this section that this is not the case for the adversarial label noise. We show that it is impossible to reduce the query complexity from poly(1/ϵ) to polylog(1/ϵ) even for very simple classes such as the class of singletons (and thus the class of the halfspace in high dimensions). The classic method of proving query complexity lower bound [15, 16, 28] is to construct a hard instance directly. However, as there are infinite types of TSQs to be considered, it is impossible to construct a single hard instance that defeats all types of TSQs. Instead, we will build a reduction from a hardness result on agnostic distributed learning [32] that we define as follows. Definition 3.1 (Agnostic Distributed Learning). Let X be the space of examples. Let a, b be two learners and S =< Sa, Sb > be a collection of labeled examples, where Sa is the (multi)set of labeled examples owned by a and Sb is the (multi)set of labeled examples owned by b. a, b only knows their own sample set. A learning protocol is a communication strategy, where in each round of

communication a sends information by bits to b and after reviving information sent from a, b sends information by bits back to a and finally the learning protocol outputs a hypothesis ˆf : X { 1} . The error of ˆf is

err( ˆf) := 1

x S 1( ˆf(x) = f(x)),

where f(x) is the true label of x. Let S be a collection of labeled examples, and H be a hypothesis class. Given an accurate parameter ϵ (0, 1), the goal of the agnostic distributed learning problem is to design a learning protocol that outputs some ˆf such that err( ˆf) minh H err(h) + ϵ while minimizing the number of bits communicated in the learning protocol.

In this paper, we will make use of the following slightly easier problem of agnostic distributed learning singleton functions, where the unlabeled examples owned by a, b are known to each other and they want to output a labeling with error at most opt.

Problem 3.2 (Distributed Learning Singleton). Consider the agnostic distributed problem. Let S =< Sa, Sb > be a collection of examples, where for u {a, b}, Su = {(i, yi u)}n i=1, where yu i { 1} for i [n]. Let H = {hi(x) = 21(x = i) 1 | i N} be the class of singleton functions. The goal is to design a (randomized) learning protocol that outputs a hypothesis ˆf such that err( ˆf) minh H err(h) + ϵ for ϵ = 1/4n with probability at least 2/3.

[32] shows the following hardness result for Problem 3.2.

Theorem 3.3 (Lemma 3 in [32]). If there is a (randomized) learning protocol that can solve Problem 3.2 using T(n) bits of communication, then there is a (randomized) protocol that can solve the set-disjointness problem with T(n) log(n) bits of communication.

According to [29], solving the set disjointness problem requires Ω(n) bits of communication, and thus solving Problem 3.2 requires Ω(n) = Ω(1/ϵ) bits of communication. The central result we use to prove Theorem 1.5 is the following technical lemma, which means if one can agnostically learn the class of singleton functions with error opt using T(n) queries, then one can design a learning protocol for Problem 3.2 with T(n)polylog(n) bits of communication. This is enough to prove Theorem 1.5, because given the hardness of Lemma 3.4, we can create a hard problem by making multiple copies of each example used in the proof of Lemma 3.4. This preserves the error of every hypothesis h : X { 1}. We leave more details to Appendix B and in the rest of this section, we give an overview of the proof of Lemma 3.4.

Lemma 3.4. Let S N be a multiset of 2n examples and f(x) be a hidden labeling function. Let H = {hi(x) = 21(x = i) 1 | i N} be the class of singleton functions. If there is an algorithm A that can make T(n) TSQ and outputs some ˆf such that err( ˆf) minh H err(h) + ϵ, with ϵ = 1/4n, then there is a learning protocol that can solve Problem 3.2 with O(T(n)polylog(n)) bits of communications.

Consider A to be a learning algorithm for singleton functions that can learn up to error opt with T(n) queries. Since both a, b know the unlabeled examples owned by each other and know the labels of examples owned by themselves, we will design a learning protocol for a, b to check the answer to each TSQ used by A together using only polylog(n) bits of communication. Recall that in the definition of TSQ, each qi answers if P

x S ϕ(x, y) τ, where ϕ(x, y) given every x is a two-value function. Thus, to check the answer of qi, it is sufficient to check if P

x Sa ϕ(x, y) τ P

x Sb ϕ(x, y). One possible way to check the answer is to let a send the number P x Sa ϕ(x, y) to b. However, if a TSQ is very complicated, communicating such a number would cost too many bits. Two arguments are made to address this problem. First, we claim that we can assume every outcome of P

x Sa ϕ(x, y) and τ P

x Sb ϕ(x, y) is an integer with bit complexity n. Intuitively, this is because there are at most 2n different outcomes for P

x Sa ϕ(x, y) and τ P

x Sb ϕ(x, y) and we can explicitly create a map from each outcome to such an integer. Second, we show that to compare a pair of integers with bit complexity n only polylog(n) bits of communication are required. To see why this is true, we can expand integers Ia = P

x Sa ϕ(x, y) and Ib = τ P

x Sb ϕ(x, y) into binary strings. Then Ia > Ib if and only if there exists some index i such that (Ia)j = (Ib)j for each j > i but (Ia)j > (Ib)j for each j = i. Thus, to compare Ia, Ib, we only need to binary search the first index j such that the

partial binary strings of Ia, Ib are different. Since checking whether two binary strings are equal only requires O(log n) bits of communication, we only need O(log2 n) bits of communication to compare the two integers.

Acknowledgments

This work was supported by the NSF Award CCF-2144298 (CAREER).

[1] D. Angluin. Queries and concept learning. Machine learning, 2:319 342, 1988.

[2] D. Angluin and P. Laird. Learning from noisy examples. Machine learning, 2:343 370, 1988.

[3] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM (JACM), 63(6):1 27, 2017.

[4] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35 50. Springer, 2007.

[5] M. F. Balcan and S. Hanneke. Robust interactive learning. In Conference on Learning Theory, pages 20 1. JMLR Workshop and Conference Proceedings, 2012.

[6] M.-F. Balcan and P. Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288 316. PMLR, 2013.

[7] M.-F. F. Balcan and H. Zhang. Sample and computationally efficient learning algorithms under s-concave distributions. Advances in Neural Information Processing Systems, 30, 2017.

[8] O. Ben-Eliezer, M. Hopkins, C. Yang, and H. Yu. Active learning polynomial threshold functions. Advances in Neural Information Processing Systems, 35:24199 24212, 2022.

[9] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22:35 52, 1998.

[10] M. Bressan, N. Cesa-Bianchi, S. Lattanzi, A. Paudice, and M. Thiessen. Active learning of classifiers with label and seed queries. Advances in Neural Information Processing Systems, 35:30911 30922, 2022.

[11] E. Y. Chang, S. Tong, K. Goh, and C. Chang. Support vector machine concept-dependent active learning for image retrieval. IEEE Transactions on Multimedia, 2:1 35, 2005.

[12] S. Chen, F. Koehler, A. Moitra, and M. Yau. Classification under misspecification: Halfspaces, generalized linear models, and connections to evolvability. ar Xiv preprint ar Xiv:2006.04787, 2020.

[13] A. Daniely. A ptas for agnostically learning halfspaces. In Conference on Learning Theory, pages 484 502. PMLR, 2015.

[14] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 105 117, 2016.

[15] S. Dasgupta. Analysis of a greedy active learning strategy. Advances in neural information processing systems, 17, 2004.

[16] S. Dasgupta. Coarse sample complexity bounds for active learning. Advances in neural information processing systems, 18, 2005.

[17] S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In Learning Theory: 18th Annual Conference on Learning Theory, COLT 2005, Bertinoro, Italy, June 27-30, 2005. Proceedings 18, pages 249 263. Springer, 2005.

[18] I. Diakonikolas, T. Gouleakis, and C. Tzamos. Distribution-independent pac learning of halfspaces with massart noise. Advances in Neural Information Processing Systems, 32, 2019.

[19] I. Diakonikolas, D. Kane, and M. Ma. Active learning of general halfspaces: Label queries vs membership queries. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[20] I. Diakonikolas, D. Kane, and C. Tzamos. Forster decomposition and learning halfspaces with noise. Advances in Neural Information Processing Systems, 34:7732 7744, 2021.

[21] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73 84. IEEE, 2017.

[22] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1061 1073, 2018.

[23] I. Diakonikolas, M. Ma, L. Ren, and C. Tzamos. Fast co-training under weak dependence via stream-based active learning. In Forty-first International Conference on Machine Learning, 2024.

[24] I. Diakonikolas, C. Tzamos, and D. M. Kane. A strongly polynomial algorithm for approximate forster transforms and its application to halfspace learning. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1741 1754, 2023.

[25] S. Doyle, J. Monaco, M. Feldman, J. Tomaszewski, and A. Madabhushi. A class balanced active learning scheme that accounts for minority class problems: Applications to histopathology. In OPTIMHis E Workshop (MICCAI), pages 19 30, 2009.

[26] J. Forster. A linear lower bound on the unbounded error probabilistic communication complexity. Journal of Computer and System Sciences, 65(4):612 625, 2002.

[27] S. Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131 309, 2014.

[28] S. Hanneke and L. Yang. Minimax analysis of active learning. J. Mach. Learn. Res., 16(1):3487 3602, 2015.

[29] J. Håstad and A. Wigderson. The randomized communication complexity of set disjointness. Theory of Computing, 3(1):211 219, 2007.

[30] M. Hopkins, D. Kane, S. Lovett, and G. Mahajan. Noise-tolerant, reliable active classification with comparison queries. In Conference on Learning Theory, pages 1957 2006. PMLR, 2020.

[31] M. Hopkins, D. Kane, S. Lovett, and M. Moshkovitz. Bounded memory active learning through enriched queries. In Conference on Learning Theory, pages 2358 2387. PMLR, 2021.

[32] D. Kane, R. Livni, S. Moran, and A. Yehudayoff. On communication complexity of classification problems. In Conference on Learning Theory, pages 1903 1943. PMLR, 2019.

[33] D. M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 355 366. IEEE, 2017.

[34] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983 1006, 1998.

[35] V. Kontonis, M. Ma, and C. Tzamos. Active learning with simple questions. ar Xiv preprint ar Xiv:2405.07937, 2024.

[36] T. Lee, A. Shraibman, et al. Lower bounds in communication complexity. Foundations and Trends in Theoretical Computer Science, 3(4):263 399, 2009.

[37] Y. T. Lee, A. Sidford, and S. C.-w. Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1049 1065. IEEE, 2015.

[38] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2:285 318, 1988.

[39] W. Maass and G. Turán. Lower bound methods and separation results for on-line learning models. Machine Learning, 9:107 145, 1992.

[40] W. Maass and G. Turán. How fast can a threshold gate learn? In Proceedings of a workshop on Computational learning theory and natural learning systems (vol. 1): constraints and prospects: constraints and prospects, pages 381 414, 1994.

[41] P. Massart and E. Nedelec. Risk bounds for statistical learning. Ann. Statist., 34(5):2326 2366, Oct. 2006.

[42] T. Roughgarden et al. Communication complexity (for algorithm designers). Foundations and Trends in Theoretical Computer Science, 11(3 4):217 404, 2016.

[43] P. M. Vaidya. A new algorithm for minimizing convex functions over convex sets. Mathematical programming, 73(3):291 341, 1996.

[44] K. Vasilis, M. Mingchen, and T. Christos. Active learning with simple questions. In S. Agrawal and A. Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 3064 3098. PMLR, 30 Jun 03 Jul 2024.

[45] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[46] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. Advances in Neural Information Processing Systems, 30, 2017.

[47] G. Yona, S. Moran, G. Elidan, and A. Globerson. Active learning with label comparisons. In Uncertainty in Artificial Intelligence, pages 2289 2298. PMLR, 2022.

[48] C. Zhang and Y. Li. Improved algorithms for efficient active learning halfspaces with massart and tsybakov noise. In Conference on Learning Theory, pages 4526 4527. PMLR, 2021.

Supplementary Material

A Omitted Proofs and Discussions in Section 2

A.1 Proof of Theorem 2.1

Proof of Theorem 2.1. Since u w 1/(4

k), we know that w P0, furthermore, K := Bk ϵ/poly(k)(w ) P0 contains a ball in V with radius at least ϵ/poly(k). In particular, 0 P0. We will first show that every time Algorithm 1 computes ((ci u/4)t, 0), it separates xi and K.

For a given round i in Algorithm 1, for every x S, Yx = 1 implies that wi misclassifies x. According to Algorithm 1, we know that when ((ci u/4)t, 0) is computed, it must be the case where Ex Swi Yx η + ϵ. We remark that we can use a single TSQ to check if Ex Swi Yx η + ϵ by querying if the number of mistakes made by wi over Swi is larger than (η + ϵ)|Swi|.

Since U is a random subset of Swi with size m = O(k2/ϵ2), by Hoeffding s inequality we know that with probability at least 1 poly(δ), 1

x U Yx η + ϵ/2. We first show that given this happens, ci := 1

x U ϕ(x, y)x must have large correlation with wi. We have

x U ϕ(x, y)x wi = 1

(wi x) (x wi) = 1

x U (Yx(y) η) ϵ/2,

where in the last inequality we use the fat that 1

x U Yx η + ϵ/2. On the other hand, we show that with high probability, 1

x U ϕ(x, y)x w ϵ/(20

k). To see this, we first consider any fixed x S. If sign(wi x) = sign(w x), then under the Massart noise model, in expectation we have

E y(x) ϕ(x, y)x w = E y(x) Yx(y) η

(wi x) (w x) = η(x) η

(wi x) (w x) 0.

Similarly, if sign(wi x) = sign(w x), then

E y(x) ϕ(x, y)x w = E y(x) Yx(y) η

(wi x) (w x) = 1 η(x) η

(wi x) (w x) 0.

Thus, for any possible subset U S, it always holds that 1 |U|ϵ P

x U Ey(x) ϕ(x, y)x w 0, where the randomness comes from the Massart noise model. In Algorithm 1, each x U satisfies |wi x| 1/(4

k). This implies that it always holds for each x U that ϕ(x, y)x w /ϵ ( 4

k/ϵ). Since U is a random subset of k2/ϵ2 examples from Swi, by Hoeffding s inequality, we have

x U ϕ(x, y)x w 1

x U E y(x) ϕ(x, y)x w 1 20

k2 ) 1 poly(δ).

Thus, with high probability ci w ϵ/(20

Notice that for each coordinate j, cij 4

k. Along each coordinate j, we are able to use TSQ of the type 1 |U|ϵ P

x U ϕ(x, y)(x)j τ to binary search cij up to error ϵ/(8k2) in O(log(k/ϵ)) O(log(d/ϵ)) rounds of interactions. Since we have found ci ci ϵ/(8k2), we know that

ci wi ci wi ϵ/(8k) ϵ/2 ϵ/(8k) 3ϵ/8

ci w ci w + ϵ/(8k) ϵ/(20

k) + ϵ/(8k) ϵ/(17

However, ci itself cannot separate xi from K as it could be the case that both ci w and ci wi are positive. However, since u w 1/(4

k) and u 2 = 1, ci ϵu/4 can separate wi from K. This can be viewed as follows. On the one hand,

(ci ϵu/4) wi ci wi ϵ/4 ϵ/8 > 0,

which means (ci uϵ/4) xi > 0. On the other hand, for every x K, we have

(ci ϵu/4) x (ci ϵu/4) w + ϵ/poly(k) ϵ/(17

k) + ϵ/poly(k) < 0.

As long as Algorithm 1 computes (ci ϵu/4), with high probability it will separate xi from K. In particular, by Theorem 2.3, we know that after T = O(k log(1/ϵ)) rounds, any point in PT must be at most ϵ/poly(log(1/ϵ)) close to w . This implies that over Sw T = {x S |w T | x 1/(2

k)}, w T and w agrees on every single example in Sw T . Thus, after O(k log(1/ϵ)) rounds, Algorithm 1 is guaranteed to output some wi such that wi has an error at most η + ϵ over the region Swi. By our assumption, Swi has a size at least n/(4k). This proves the correctness of the algorithm.

Finally, we compute the query complexity of the algorithm. In each round of Algorithm 1, we use 1 TSQ to check if the current hypothesis is good enough and use O(d log(1/ϵ)) TSQ to find a good approximation of the separation hyperplane. Since there are at most O(k log(1/ϵ)) rounds, the query complexity of Algorithm 1 is O(d2 log2(1/ϵ))

A.2 Proof of Theorem 1.4

Proof of Theorem 1.4. We start by showing the correctness of Algorithm 2. We will show that in each round of Algorithm 2, |S | |S|/d and ˆf over S has error at most η + ϵ/2. Given this to be correct, after at most O(d log(1/ϵ)) rounds, Algorithm 2 labels (1 ϵ/2) fraction of the examples with an error of η + ϵ/2, leaving at most ϵ/2 fraction of the examples unlabeled. This means ˆf has error at most η + ϵ.

By Lemma 2.5 and Theorem 2.6, we know that in each round of Algorithm 2, we can compute we find a subspace V that contains k/d fraction of the unlabeled data in S and a matrix A that can make FA(S V ) in approximate radially isotropic position. If w V , then the ground truth labels of examples in S V are the same and thus with high probability a constant hypothesis achieves an error of at most η + ϵ over S V .

Now we assume w is not orthogonal to V and we will show that by running Algorithm 1 O(log(1/δ)) times, we are able to label S S V , a subset of at least 1/k-fraction of examples in S V with error at most η + ϵ. To see this, we first argue that labeling S V is equivalent to labeling the transformed data FA(S V ). We notice that for every x V , we have

sign(w x) = sign(A T w Ax) = sign(A T w FA(x)) = sign(proj A(V )(A T w ) FA(x)),

which implies that we can view FA(V ) to be labeled by a halfspace v = proj A(V )(A T w ) furthermore, x and FA(x) have the same ground truth label. If we associate y(FA(x)) = f(x) for each x V , then the label y(FA(x)) of FA(x) can be seen as created by halfspace sign(v x) under the Massart noise model such that η(FA(x)) = η(x), x. Thus, if we are able to find a subset FA(S V ) FA(S V ) and label examples in FA(S V ) up to error η + ϵ/2, then we are able to label the labels of the corresponding examples in S up to η + ϵ/2. We will use Algorithm 1 to do this. Since FA(S V ) are in approximate radially isotropic position, we know from that Lemma 2.5 that for every unit vector w V , at least 1/(4k) fraction of examples in FA(S V ) satisfied |FA(x) w| 1/(2

k). Thus, once Algorithm 1 outputs a labeling ˆf S for S FA(S V ), the size of S is at least |FA(S V )|/(4k) |S|/(4d) and the error of ˆf S must be at most η + ϵ/2. By Theorem 2.1, we need some unit vector u that has a non-trivial correlation with the target halfspace A T w for the transformed data. By randomly select a unit vector in V , with constant probability (see [45]), we can guarantee that u proj A(V )(A T w ) 1/(4

k). Thus by repeating Algorithm 1 several times, with high probability, it will output some labeling function. However, to run Algorithm 1, we have to implement TSQ over the transformed data while we can only make TSQ over the original data. Such an issue is easy to address. Since FA is a bijection between x and FA(x) and the outcome of the function ϕ(x, y) for each example x can be uniquely represented by two numbers, we can rewrite ϕ(FA(x), y) as a function of x for each FA(x) such that for a TSQ as long as y(FA(x)) = f(x), the result of the query will be the same. As we have mentioned that y(FA(x)) = f(x) holds for every example, we conclude that we can simulate each TSQ over the transformed data via a TSQ over the original data. This finishes the proof of the correctness of Algorithm 2.

Finally, we calculate the query complexity of Algorithm 2. By Theorem 2.1, we know that every time we run Algorithm 1, we make O(d2 log2(1/ϵ)) queries. Since every round of Algorithm 2, we run

Algorithm 2 O(log(1/ϵ)) rounds and there are at most O(d log(1/ϵ)) rounds, we conclude the query complexity of Algorithm 2 is O(d3 log3(1/ϵ)). In particular, by Theorem 2.6 and Algorithm 1, each subroutine of the algorithm can be implemented in polynomial time, we conclude that Algorithm 2 can be run in polynomial time.

A.3 On the TSQs Used by Algorithm 2

In this section, we want to discuss the TSQs used by Algorithm 2 and argue that these TSQs have simple structures and are easy to communicate and implement. There are two types of TSQs used by Algorithm 2.

First, the algorithm needs to check whether a hypothesis h = sign(w x) has an error larger than τ over a given region U. In other words, we want to use TSQs to approximate the conditional expectation, Ex 1(sign(w x) = f(x) | x U). To express this using TSQ, for each x U, we define ϕ(x, y) = 1/|U| if sign(w x) = y and 0 otherwise. For each x S \ U, we define ϕ(x, y) = 0. In particular, in Algorithm 2 each U we use is just some random samples drawn from S {x V | |FA(x) w| Ω(1/

d)}. To communicate such a TSQ, a learner only needs to communicate such a query, a learner only needs to communicate to the labeler, (v, A), the parameters for a Forster s transformation, w, the hypothesis maintained by the algorithm, τ, the threshold used by the TSQ and a random seed to guide the labeler to do sampling. The labeler receives these parameters, computes the answer to the TSQ, and returns a binary answer to the learner.

The second type of TSQ can be seen as a weighted sum of the mistakes made by the current hypothesis h over a region U. Recall the notation used in Algorithm 1, (Yx(y) η)/(wi x), where Yx(y) = 1 if h makes a mistake at x. The algorithm wants to approximate the point Ex(FA(x)(Yx(y) η)/(wi FA(x)) | x U), which is equivalent to get an approximation of the point from each coordinate. Similarly, every U used by the algorithm is a random set sample from S {x V | |FA(x) w| Ω(1/

d)}. To communicate such a query, a learner will communicate (v, A), the parameters for a Forster s transformation, w, the hypothesis maintained by the algorithm, i, the coordinate the learner want to approximate, τ, the threshold used by the TSQ and a random seed to guide the labeler to do sampling.

This shows that the TSQs used by Algorithm 2 is simple from both a computational view and a communication complexity view.

B Omitted Proofs in Section 3

B.1 Proof of Lemma 3.4

Proof of Lemma 3.4. Notice that the learning algorithm A can be described as follows. In round i, A constructs a TSQ qi (possibly using randomness), submits qi to the labeler, and receives the answer to qi. Given A, we will design a learning protocol as follows. In round i, the learner a, b will check the answer of qi together by sending bits to each other and construct the next TSQ qi+1 based on the answer and using A. If they use only K(n) bits of communication to check qi in each round, then since A will output some ˆh such that err( ˆf) minh H err(h) + ϵ after T(n) rounds, the total bits of communication is at most T(n)K(n).

We can without loss of generality assume the randomness used to implement A is public so that in each round, both a and b know exactly the TSQ qi. Otherwise, by Newman s theorem [36], we only need to use another O(log n) bits of communication to simulate the randomness used by A. Recall that in the definition of TSQ, each qi answers if P

x S ϕ(x, y) τ, where ϕ(x, y) given every x is a two-value function. Thus, to check the answer of qi, it is sufficient to check if P

x Sa ϕ(x, y) τ P

x Sb ϕ(x, y). Although a can compute P

x Sa ϕ(x, y) and sends the number to b, communicating a single number P

x Sa ϕ(x, y) might need a lot of bits of communication. In the rest of the proof, we will design a protocol that use only O(log2 n) bits of communication to check the answer of qi.

First, we show that we can without loss of generality assume every outcome of P x Sa ϕ(x, y) and τ P

x Sb ϕ(x, y) is an integer with bits complexity n. We observe that both P

x Sa ϕ(x, y) and τ P

x Sb ϕ(x, y) have at most 2n outcomes. As a and b all know all possible outcomes, they can explicitly construct a maps Fa, which maps each of the possible outcomes of P x Sa ϕ(x, y) to an integer between 0 and 2n 1 and a map Fb, which maps each of the possible outcomes of τ P

x Sb ϕ(x, y) to an integer between 0 and 2n 1. Given these two explicit maps, if a and b can use communication to learn Fa(P

x Sa ϕ(x, y)) and Fb(τ P

x Sb ϕ(x, y)), then they can reconstruct P

x Sa ϕ(x, y) and τ P

x Sb ϕ(x, y). In the rest of the proof, we prove based on this assumption.

Next, we design a protocol that uses O(log2 n) bits of communication. After a determine P

x Sa ϕ(x, y), a know an integer Ia such that Ib = τ P

x Sb ϕ(x, y) > P

x Sa ϕ(x, y) if and only if Ib > Ia. So it remains to show that given two integers Ia, Ib with bit complexity n, we are able to compare these two integers with O(log n) bits of communication. To do this, we first represent Ia, Ib by binary strings of length n. Notice that Ia > Ib if and only of there exists some index i such that (Ia)j = (Ib)j for j > i but (Ia)j > (Ib)j for j = i. This implies that to compare Ia and Ib it is sufficient to find the largest index i such that (Ia)j = (Ib)j for each j > i and compare (Ia)i and (Ib)i . Such an index can be found via binary search. Specifically, given i we want to check if (Ia)j = (Ib)j for each j < i. If the two partial binary strings are equal, then we decrease i, otherwise we increase i. After O(log n) rounds, we successfully find such an index i . It is well-known that checking the equality of two binary strings of length n can be done via a simple randomized protocol by communicating O(log n) bits [36, 42]. Thus, in total, with O(log2 n) bits of communication, we are able to compare Ia, Ib and thus can check the answer of the TSQ.

This gives a randomized learning protocol that uses O(T(n) log2(n)) bits of communication.

B.2 Proof of Theorem 1.5

Proof of Theorem 1.5. Let ϵ (0, 1). For simplicity, we write ϵ = 1/4n and let S be a multiset of 2n examples over N. Given any labeling function f(x) over S and every m 2n such that m/(2n) is an integer, we create a multiset S of size m in the following way. For each x S we create m/(2n) copies x for x such that for each copy x it has a hidden label equal to f(x). Denote by f the labeling function over S . Notice that for every hypothesis h : N { 1}, the error of h over S and the error of h over S are the same. This implies that if we have a learning algorithm such that for every S N and every labeling function f , it can output a hypothesis ˆf using T(1/ϵ) = T(4n) TSQs such that with probability 2/3 ˆf has an error opt + ϵ, then ˆf has an error at most opt+ϵ over the original dataset S. By Lemma 3.4, we know that this implies a learning protocol that solve Problem 3.2 with O(T(1/ϵ) log2(1/ϵ)) = O(T(4n) log2(n)) bits of communication. By Theorem 3.3, this implies a communication protocol that solves the set disjointness problem of size n using O(T(1/ϵ) log3(1/ϵ)) = O(T(4n) log3(n))) bits of communication. By [29], we know that to solve a set disjointness problem of size n, any (randomized) protocol has a communication complexity of Ω(n). This implies that T(1/ϵ) = Ω(1/ϵ).

C Proof of Theorem 1.6

In this section, we prove Theorem 1.6 by presenting the following Algorithm 3. Our algorithm is inspired by [5], where they design an algorithm that learns a hypothesis class H with finite VC dimension up to error O(opt) + ϵ using O(d log(1/ϵ)) class-conditional queries, which returns an example with a specified label in a given region. Unlike their algorithm, our algorithm does not need such a strong query. Instead, our algorithm makes O(d log(1/ϵ)) TSQs to achieve the same guarantee. Furthermore, each TSQ used in Algorithm 3 only checks if a given hypothesis has an error larger than some threshold over a given region.

Proof of Theorem 1.6. If opt ϵ, the learning up to error O(opt) + ϵ is equivalent to learning up to error O(ϵ) and η = ϵ can be used as an upper bound for opt. So, we assume that η = opt ϵ, because we can always guess some η such that η/2 opt η via a doubling trick, which will only

Algorithm 3 APPROXIMATE AGNOSTIC LEARNING(Learning a labeling up to O(opt) error)

Input: Dataset S of size n, hypothesis class H with VC dimension d, η, an upper bound of opt Output: ˆf, a labeling of S Let H0 be an ϵ-cover of the hypothesis class H with respect to the uniform distribution over S Let f H0 be a labeling over S. For each x S, f H0(x) agrees with the majority of H0 at x. Use a single TSQ to check if the error η0 of f H0 over S is larger than O(η) If η0 O(η), output f H0 while the error ηi of f Hi larger than O(η) do This can be checked with a single TSQ for j = 1, . . . , T = O(log(1/δ)) do

Keep drawing random subsets Sj of size 1/(50η) from S until Sj gets accepted We accepted Si if we find at least 1 example in Sj are misclassified by f Hi using a single TSQ

if More than 1/6 fraction of the hypothesis in Hi agrees with f Hi over Sj then

Mark all the hypothesis in Hi that agrees with with f Hi over Sj else

Find a subset of ˆSj Sj such that ξ-fraction of the hypothesis in Hi agrees with f Hi over ˆSj, where ξ [1/6, 2/3].

Use a single TSQ to check if over ˆSj, f Hi makes no mistake. If so, mark all hypotheses that disagree with f Hi at any single example over ˆSj, otherwise mark the hypothesis in Hi agrees with f Hi over ˆSj

Remove all hypotheses in Hi that are marked more than 0.1T times and Hi+1 be the set of remaining hypothesis return f Hi

make the final guarantee worse up to a constant factor. Denote by h H0 the hypothesis that has the smallest error over S. By the definition of ϵ-cover, we know that err(h ) η + ϵ.

We show that with high probability, in each round of Algorithm 3, either err(h Hi) is at most 250η or a constant fraction of the hypothesis in Hi gets removed. In particular, we will show that h will always stay in Hi and thus after O(d log(1/ϵ)) rounds, we are guaranteed to output some hypothesis with small error.

Assume that err(h Hi) > 250η. We say a set Sj is good if it contains no example x such that h (x) = f(x). We first show that given a set Sj accepted by Appendix C, with a non-trivial probability it is good.

Pr (Sj is good | Sj is accepted) = Pr (Sj is good and Sj is accepted)

Pr (Sj is accepted) Pr (Sj is good and Sj is accepted) 1 Pr (Sj is not good) Pr (Sj is not accepted) .

Since the noise rate is η, we know from the definition of ϵ-cover that h has an error at most 2η. This implies that in expectation, a random SJ contains 1/25 example that is misclassified by h .

Pr (Sj is not good) = Pr (Sj contains one example misclassified byh ) 1/25 = 0.04.

On the other hand, since err(h Hi) > 250η, a random example has a probability at most 1 1/(250η) not misclassified by h Hi and this

Pr (Sj is not accepted) (1 1/(250η))1/(50η) e 5 0.01. (2)

Thus, with a probability of at least 95%, an accepted set Sj is good. In particular, h misclassified no example in SJ. This implies that h will not get marked when Sj is good. And thus in expectation, h will not get marked for more than T/20 times. By Hoeffding s inequality, this implies with high probability h will not get removed from Hi. On the other hand, for every Sj that gets accepted more than 1/6 of the hypothesized in Hj must get marked. To show this, we consider two cases. In the first case, more than 1/6 fraction of the hypothesis in Hi agrees with f Hi over Sj. In this case, according to Algorithm 3, all hypothesizes in Hi that agree with f Hi over Sj will get marked and more than

1/6 fraction of the hypothesis in Hi will be marked. In the second case, we show that there must be a subset of ˆSj Sj such that ξ-fraction of the hypothesis in Hi agrees with f Hi over ˆSj, where ξ [1/6, 2/3]. We order Sj in an arbitrary order x1, x2, . . . , xm, where m = |Sj|. For each t [m], we use H(t) to denote the set of hypothesises in Hi that agree with h Hi for x1, . . . , xt. From the above discussion, we know that |H(m)| |Hi|/6. On the other hand, we know by the definition of h Hi that |H(1)| |Hi|/2. If |H(1)| 2|Hi|/3, then we are done. Otherwise, there must be a largest t such that |H(t )| 2|Hi|/3. We claim that |Hi|/6 |H(t +1)| 2|Hi|/3. This is because at most |Hi|/2 hypothesises in Hi will disagree with h Hi over xt+1 and will get deleted from H(t ). This implies that we can choose ˆSj = {x1, . . . , xt+1}. Given this, whether h Hj makes a mistake over ˆSj or not, at least 1/6 fraction of the hypothesis will be marked.

We next use this fact to show that in each round, a constant fraction of the hypotheses in Hi will be removed. Assuming that c-fraction of the hypotheses in Hi gets removed from Hi. On the one hand, for each accepted Sj, at least |Hi|/6 hypotheses are marked. So the total number of marks we made is at least T|Hi|/6. On the other hand, since only c-fraction of the hypotheses are marked by more than 0.1T times. The total number of marks we made is at most c|Hi|T + 0.1(1 c)|Hi|T. As the following inequality always holds

c|Hi|T + 0.1(1 c)|Hi|T T|Hi|/6,

we conclude c 2/27. According to [28], the size of the ϵ-cover of H is at most O(d/ϵ)d. Since h

is also included in Hi, after at most k = O(d log(1/ϵ)) rounds, h will be the only hypothesis not removed. Since h has an error at most 2η, if Appendix C runs for k rounds, then h will be output. So, before the kth round, some h Hi must be output and has error O(η) = O(opt). This proves the correctness of Algorithm 3.

Finally, it remains to prove the query complexity of Algorithm 3. We notice by Equation (2) that when h Hi has an error larger than 250η, a random Sj has only probability less than 0.01 not getting accepted. This implies that to get an accepted Sj, we only need to make O(1) TSQs to check whether h Hi has zero error over Sj. Since checking the error of h Hj and marking hypothesizes after some Sj gets accepted will only take us O(1) TSQs. In each round of Algorithm 3, we will make at most O(1) TSQs. Since there are at most O(d log(1/ϵ)) rounds in Algorithm 3, we conclude the query complexity of Algorithm 3 is O(d log(1/ϵ)).

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The abstract summarizes the results provided in Theorem 1.4 and Theorem 1.5. The introduction summarizes the motivations of this paper and describes prior work s contributions. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The limitations of this paper are discussed in the introduction of the paper. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes] Justification: The statement of each theorem provides all the assumptions and we provide complete proofs for all statements that are either in the main body of the paper or in the appendix. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: This paper is theoretical and does not contain experiments. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: This paper is theoretical and does not contain experiments. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: This paper is theoretical and does not contain experiments. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: This paper is theoretical and does not contain experiments. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: This paper is theoretical and does not contain experiments. Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Our research conforms in every respect with the Neur IPS Code of Ethics. Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This work is theoretical and we do not see any immediate implications on society. Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: This paper is theoretical.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: This work does not use any assets.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: This work does not use any assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.