# Active Labeling: Streaming Stochastic Gradients

Vivien Cabannes (Meta), Francis Bach (INRIA / ENS / PSL), Vianney Perchet (ENSAE), Alessandro Rudi (INRIA / ENS / PSL)

**Abstract.** The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression.

1 Introduction

A large amount of the current hype around artificial intelligence was fueled by the recent successes of supervised learning. Supervised learning consists in designing an algorithm that maps inputs to outputs by learning from a set of input/output examples. With access to many samples and enough computation power, this framework is able to tackle complex tasks. Interestingly, many of the difficulties arising in practice do not emerge from choosing the right statistical model to solve the supervised learning problem, but from the problem of collecting and cleaning enough data (see Chapters 1 and 2 of Géron, 2017, for example). Those difficulties are not disjoint from the current trend toward data privacy regulations (Council of European Union, 2016). This fact motivates this work, where we focus on how to efficiently collect information to carry out the learning process. In this paper, we formalize the active labeling problem for weak supervision, where the goal is to learn a target function by acquiring the most informative dataset given a restricted budget for annotation.
We focus explicitly on weak supervision that comes as a set of label candidates for each input, aiming to partially supervise input data in the most efficient way to guide a learning algorithm. We also restrict our study to the streaming variant where, for each input, only a single piece of partial information can be collected about its corresponding output. The crux of this work is to leverage the fact that full supervision is not needed to acquire unbiased stochastic gradients and perform stochastic gradient descent. The following summarizes our contributions.

1. First, we introduce the active labeling problem, a theoretical framework that encompasses many useful problems encountered by practitioners trying to annotate their data in the most efficient fashion, as well as its streaming variant, which addresses privacy-preserving concerns. This is the focus of Section 2.
2. Then, in Section 3, we give a high-level framework to access unbiased stochastic gradients with weak information only. This provides a simple solution to the streaming active labeling problem.
3. Finally, we detail this framework for a robust regression task in Section 4, and provide an algorithm whose optimality is proved in Section 5.

Work done while at INRIA / ENS / PSL. Contact the first author at vivien.cabannes@gmail.com. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

As a proof of concept, we provide numerical simulations in Section 6. We conclude with a high-level discussion of our methods in Section 7.

Related work. Active query of information is relevant to many settings. The most straightforward applications are searching games, such as Bar Kokhba or twenty questions (Walsorth, 1882). We refer to Pelc (2002) for an in-depth survey of such games, especially when liars introduce uncertainty, and of their relations with coding on noisy channels. But applications are much more diverse, e.g.,
for numerical simulation (Chevalier et al., 2014), database search (Qarabaqi and Riedewald, 2014), or shape recognition (Geman and Jedynak, 1993), to name a few. In terms of motivations, many streams of research can be related to this problem, such as experimental design (Chernoff, 1959), statistical queries (Kearns, 1998; Fotakis et al., 2021), crowdsourcing (Doan et al., 2011), or aggregation methods in weak supervision (Ratner et al., 2020). More precisely, active labeling² consists in having several inputs and querying partial information on the labels. It is close to active learning (Settles, 2010; Dasgupta, 2011; Hanneke, 2014), where there are several inputs, but exact outputs are queried; and to active ranking (Valiant, 1975; Ailon, 2011; Braverman et al., 2019), where partial information is queried, but there is only one input. The streaming variant introduces privacy-preserving constraints, a problem that is usually tackled through the notion of differential privacy (Dwork et al., 2006). In terms of formalization, we build on the partial supervision formalization of Cabannes et al. (2020), which casts weak supervision as sets of label candidates and generalizes semi-supervised learning (Chapelle et al., 2006). Finally, our sequential setting with a unique final reward is similar to combinatorial bandits in a pure-exploration setting (Garivier and Kaufmann, 2016; Fiez et al., 2019).

2 The active labeling problem

Supervised learning is traditionally modeled in the following manner. Consider an input space $\mathcal{X}$, an output space $\mathcal{Y}$, a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, and a joint probability distribution $\rho \in \Delta_{\mathcal{X} \times \mathcal{Y}}$. The goal is to recover the function

$$f^* \in \operatorname*{arg\,min}_{f : \mathcal{X} \to \mathcal{Y}} \; \mathcal{R}(f) := \mathbb{E}_{(X,Y) \sim \rho}[\ell(f(X), Y)], \qquad (1)$$

yet without accessing $\rho$, but only a dataset of independent samples distributed according to $\rho$, $\mathcal{D}_n = (X_i, Y_i)_{i \leq n} \sim \rho^{\otimes n}$. In practice, accessing data comes at a cost, and it is valuable to understand the cheapest way to collect a dataset that allows discriminating $f^*$.
We shall suppose that the input data $(X_i)_{i \leq n}$ are easy to collect, yet that labeling those inputs to get outputs $(Y_i)_{i \leq n}$ demands a high amount of work. For example, it is relatively easy to scrape the web or medical databases to access radiography images, but labeling them by asking radiologists to recognize tumors on zillions of radiographs will be both time-consuming and expensive. As a consequence, we assume the $(X_i)_{i \leq n}$ given but the $(Y_i)_{i \leq n}$ unknown. As getting information on the labels comes at a cost (e.g., paying a pool of label workers, or spending your own time), given a budget constraint, what information should we query on the labels? To quantify this problem, we will assume that we can sequentially and adaptively query $T$ pieces of information of the type $1_{Y_{i_t} \in S_t}$, for any index $i_t \in \{1, \ldots, n\}$ and any set of labels $S_t \subset \mathcal{Y}$ (belonging to a specified set of subsets of $\mathcal{Y}$). Here, $t \in \{1, \ldots, T\}$ indexes the query sequence, and $T \in \mathbb{N}$ is a fixed budget. The goal is to optimize the design of the sequence $(i_t, S_t)$ in order to get the best estimate of $f^*$ in terms of risk minimization (1). In the following, we give some examples to make this setting more concrete.

Example 1 (Classification with attributes). Suppose that a labeler is asked to provide fine-grained classes on images (Krause et al., 2016; Zheng et al., 2019), such as the label "caracal" in Figure 6. This would be difficult for many people. Yet, it is relatively easy to recognize that the image depicts a feline with tufted ears and sandy color. As such, a labeler can give the weak information that $Y$ belongs to the set "feline", $S_1 = \{$cat, lion, tiger, $\ldots\}$, and the set "tufted ears", $S_2 = \{$great horned owl, Araucana chicken, $\ldots\}$. This is enough to recognize that $Y \in S_1 \cap S_2 = \{$caracal$\}$. The question $1_{Y \in S_1}$ corresponds to asking if the image depicts a feline.

²Note that the wording "active labeling" has been more or less used as a synonym of active learning (e.g., Wang and Shang, 2014).
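Example 1 can be rendered as a tiny sketch: each weak answer is a set of label candidates, and set intersection pins down the label. The class lists below are toy fragments borrowed from the example, not an actual taxonomy.

```python
# Toy version of Example 1: two weak answers, each a set of label candidates,
# are enough to identify the exact label via intersection.
feline = {"cat", "lion", "tiger", "caracal"}          # answer to "is it a feline?"
tufted_ears = {"great horned owl", "araucana chicken", "caracal"}  # "tufted ears?"

candidates = feline & tufted_ears                      # Y in S1 ∩ S2
# candidates == {"caracal"}
```

Each query only asks a coarse question, yet the conjunction of cheap answers can be as informative as an expert fine-grained label.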
In contrast, we use "active labeling" to designate active weakly supervised learning. The literature on hierarchical classification and automatic taxonomy construction provides interesting ideas for this problem (e.g., Cesa-Bianchi et al., 2006; Gangaputra and Geman, 2006).

Example 2 (Ranking with partial ordering). Consider a problem where, for a given input $x$ characterizing a user, we are asked to deduce their preferences over $m$ items. Collecting such a label requires knowing the exact ordering of the $m$ items induced by a user. This might be hard to ask for. Instead, one can easily ask the user which items they prefer in a collection of a few items. The user's answer will give weak information about the labels, which can be modeled as knowing $1_{Y_i \in S} = 1$, for $S$ the set of total orderings that satisfy this partial ordering. We refer the curious reader to active ranking and dueling bandits for additional content (Jamieson and Nowak, 2011; Bengs et al., 2021).

Example 3 (Pricing a product). Suppose that we want to sell a product to a consumer characterized by some features $x$; this consumer is ready to pay a price $y \in \mathbb{R}$ for this product. We price it at $f(x) \in \mathbb{R}$, and we observe $1_{f(x) < y}$, that is, whether the consumer is willing to buy this product at this price tag or not (Cesa-Bianchi et al., 2019; Liu et al., 2021). Note that in this setting, the goal is often to minimize the regret, which contrasts with our pure exploration setting.

As a counter-example, our assumptions are not set to deal with missing data, i.e., when some coordinates of some input feature vectors $X_i$ are missing (Rubin, 1976). Typically, this happens when input data come from different sources (e.g., when trying to predict economic growth from country information that is self-reported).

Streaming variation. The special case of the active labeling problem we shall consider consists in its variant without resampling.
This corresponds to the online setting where one can only ask one question per sample, formally $i_t = t$. This setting is particularly appealing for privacy concerns, in settings where the labels $(Y_i)$ contain sensitive information that should not be revealed in full. For example, some people might be more comfortable giving a range for a salary rather than the exact value; or in the context of polling, one might not call back a previous respondent characterized by some features $X_i$ to ask them again about their preferences captured by $Y_i$. Similarly, the streaming setting is relevant for web marketing, where inputs model new users visiting a website, queries model sets of advertisements chosen by an advertising company, and one observes potential clicks.

3 Weak information as stochastic gradients

In this section, we discuss how unbiased stochastic gradients can be accessed through weak information. Suppose that we model $f = f_\theta$ for some parameter $\theta$ in a Hilbert space $\Theta$. With some abuse of notation, let us denote $\ell(x, y, \theta) := \ell(f_\theta(x), y)$. We aim to minimize $\mathcal{R}(\theta) = \mathbb{E}_{(X,Y)}[\ell(X, Y, \theta)]$. Assume that $\mathcal{R}$ is differentiable (or sub-differentiable) and denote its gradients by $\nabla_\theta \mathcal{R}$.

Definition 1 (Stochastic gradient). A stochastic gradient of $\mathcal{R}$ is any random function $G : \Theta \to \Theta$ such that $\mathbb{E}[G(\theta)] = \nabla_\theta \mathcal{R}(\theta)$. Given some step-size function $\gamma : \mathbb{N} \to \mathbb{R}_+$, a stochastic gradient descent (SGD) is a procedure $(\theta_t) \in \Theta^{\mathbb{N}}$, initialized with some $\theta_0$ and updated as $\theta_{t+1} = \theta_t - \gamma(t)\, G(\theta_t)$, where the realization of $G(\theta_t)$ given $\theta_t$ is independent of the previous realizations of $G(\theta_s)$ given $\theta_s$.

In supervised learning, SGD is usually performed with the stochastic gradients $\nabla_\theta \ell(X, Y, \theta)$. More generally, stochastic gradients are given by

$$G(\theta) = 1_{\nabla_\theta \ell(X, Y, \theta) \in T}\; \tau(T), \qquad (2)$$

for $\tau : \mathcal{T} \to \Theta$ with $\mathcal{T} \subset 2^\Theta$ a set of subsets of $\Theta$, and $T$ a random variable on $\mathcal{T}$, such that

$$\forall\, \theta \in \Theta, \qquad \mathbb{E}_T\big[1_{\theta \in T}\, \tau(T)\big] = \theta. \qquad (3)$$

Stated otherwise, if you have a way to image a vector $\theta$ from partial measurements $1_{\theta \in T}$ such that you can reconstruct this vector in a linear fashion (3), then this provides a generic strategy to get an unbiased stochastic estimate of this vector from a single partial measurement (2). For $\psi : \mathcal{Y} \to \Theta$ a function from $\mathcal{Y}$ to $\Theta$ (e.g., $\psi = \nabla_\theta \ell(X, \cdot, \theta)$), a question $1_{\psi(Y) \in T}$ translates into a question $1_{Y \in S}$ for some set $S = \psi^{-1}(T) \subset \mathcal{Y}$, meaning that the stochastic gradient (2) can be evaluated from a single query. As a proof of concept, we derive a generic implementation for $T$ and $\tau$ in Appendix B. This provides a generic SGD scheme to learn functions from weak queries when there are no constraints on the sets to query.

Remark 2 (Cutting-plane methods). While we provide here a descent method, one could also develop cutting-plane/ellipsoid methods to localize $\theta$ according to weak information, which corresponds to the techniques developed for pricing by Cohen et al. (2020) and related literature.

4 Median regression

In this section, we focus on efficiently acquiring weak information providing stochastic gradients for regression problems. In particular, we motivate and detail our methods for the absolute deviation loss. Motivated by seminal works on censored data (Tobin, 1958), we shall suppose that we query half-spaces. For an output $y \in \mathcal{Y} = \mathbb{R}^m$, and any hyperplane passing through $z \in \mathbb{R}^m$ with unit normal $u \in \mathbb{S}^{m-1}$, we can ask a labeler to tell us which half-space $y$ belongs to. Formally, we access the quantity $\operatorname{sign}(\langle y - z, u \rangle)$ for a given unit cost. Such an imaging scheme, where one observes projections of a vector rather than the vector itself, bears similarity with compressed sensing.
To provide further illustration, this setting could help to price products while selling bundles: the context $x$ characterizes some users, web pages and/or advertisement companies; the label $y \in \mathbb{R}^m$ corresponds to the value associated with $m$ different items, such as stocks composing an index, or advertisement spots; and the observation $\operatorname{sign}(\langle y, u \rangle - c)$ (with $c = \langle z, u \rangle$) captures whether the user $x$ buys the basket with weights $u \in \mathbb{S}^{m-1}$ when it is priced at $c$.

Least-squares. For regression problems, it is common to look at the mean square loss

$$\ell(X, Y, \theta) = \|f_\theta(X) - Y\|^2, \qquad \nabla_\theta \ell(X, Y, \theta) = 2\,(f_\theta(X) - Y)^\top D f_\theta(X),$$

where $D f_\theta(x) \in \mathcal{Y} \otimes \Theta$ denotes the Jacobian of $\theta \mapsto f_\theta(x)$. In rich parametric models, it is preferable to ask questions on $Y \in \mathcal{Y}$ rather than on gradients in $\Theta$, which is a potentially much bigger space. If we assume that $Y$ and $f_\theta(X)$ are bounded in $\ell_2$-norm by $M \in \mathbb{R}_+$, we can adapt (2) and (3) through the fact that for any $z \in \mathcal{Y}$ such that $\|z\| \leq 2M$, as proven in Appendix B,

$$\mathbb{E}_{U,V}\big[1_{\langle z, U \rangle \geq V}\, U\big] = c_1 z, \quad \text{where} \quad c_1 = \mathbb{E}_{U,V}\big[1_{\langle e_1, U \rangle \geq V}\, \langle e_1, U \rangle\big] = \frac{\pi^{3/2}}{2M(m^2 + 4m + 3)},$$

for $U$ uniform on the sphere $\mathbb{S}^{m-1}$ and $V$ uniform on $[0, 2M]$. Applied to $z = f_\theta(X) - Y$, this yields an SGD procedure that queries information of the type $1_{\langle Y, U \rangle < \langle f_\theta(X), U \rangle - V}$.

A case for median regression. Motivated by robustness purposes, we will rather expand on median regression. In general, we would like to learn a function that, given an input, replicates the output of I/O samples generated by the joint probability $\rho$. In many instances, $X$ does not characterize all the sources of variation of $Y$, i.e., input features are not rich enough to characterize a unique output, leading to randomness in the conditional distributions $(Y \mid X)$. When many targets can be linked to a vector $x \in \mathcal{X}$, how should we define a consensual $f(x)$?
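The unbiasedness identity behind the least-squares scheme can be sanity-checked numerically: averaging the one-bit answers $1_{\langle z, U\rangle \geq V}\, U$ over many random $(U, V)$ should produce a vector aligned with $z$. The sketch below (function name ours; the closed-form constant is not checked, only the direction) does exactly that.

```python
import numpy as np

def weak_ls_gradient_direction(z, M, n_samples, rng):
    """Monte Carlo check that E[1_{<z,U> >= V} U] is proportional to z,
    for U uniform on the sphere S^{m-1} and V uniform on [0, 2M]."""
    m = z.shape[0]
    U = rng.standard_normal((n_samples, m))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # uniform on the sphere
    V = rng.uniform(0.0, 2 * M, size=n_samples)
    answers = (U @ z >= V).astype(float)            # one-bit queries 1_{<z,U> >= V}
    return (answers[:, None] * U).mean(axis=0)      # averaged one-bit measurements

rng = np.random.default_rng(0)
z = np.array([0.6, -0.3, 0.5])                      # ||z|| <= 2M with M = 1
g = weak_ls_gradient_direction(z, M=1.0, n_samples=200_000, rng=rng)
cosine = g @ z / (np.linalg.norm(g) * np.linalg.norm(z))
# cosine should be close to 1: the averaged one-bit answers align with z
```

Each term in the average is a single binary question about $z$, yet the empirical mean recovers the direction of $z$, which is all that SGD needs up to the known scaling $c_1$.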
For analytical reasons, statisticians tend to use the least-squares error, which corresponds to asking $f(x)$ to be the mean of the distribution $(Y \mid X = x)$. Yet, means are known to be too sensitive to rare but large outputs (see, e.g., Huber, 1981), and cannot serve as a good and robust consensus in a world of heavy-tailed distributions. This contrasts with the median, which, as a consequence, is often much more valuable to summarize a range of values. For instance, median income is preferred over mean income as a population indicator (see, e.g., US Census Bureau, 2021).

Median regression. The geometric median is variationally defined through the absolute deviation loss, leading to

$$\ell(X, Y, \theta) = \|f_\theta(X) - Y\|, \qquad \nabla_\theta \ell(X, Y, \theta) = \frac{(f_\theta(X) - Y)^\top}{\|f_\theta(X) - Y\|}\, D f_\theta(X). \qquad (4)$$

Similarly to the least-squares case, we can access weakly supervised stochastic gradients through the fact that for $z \in \mathbb{S}^{m-1}$, as shown in Appendix B,

$$\mathbb{E}_U\big[\operatorname{sign}(\langle z, U \rangle)\, U\big] = c_2 z, \quad \text{where} \quad c_2 = \mathbb{E}_U\big[\operatorname{sign}(\langle e_1, U \rangle)\, \langle e_1, U \rangle\big] = \frac{\Gamma(m/2)}{\sqrt{\pi}\,\Gamma((m+1)/2)},$$

where $U$ is uniformly drawn on the sphere $\mathbb{S}^{m-1}$, and $\Gamma$ is the gamma function. This suggests Algorithm 1.

Algorithm 1: Median regression with SGD.
Data: a model $f_\theta$ for $\theta \in \Theta$, some data $(X_i)_{i \leq n}$, a labeling budget $T$, a step-size rule $\gamma : \mathbb{N} \to \mathbb{R}_+$.
Result: a learned parameter $\hat{\theta}$ and the predictive function $\hat{f} = f_{\hat{\theta}}$.
Initialize $\theta_0$.
for $t \leftarrow 1$ to $T$ do
    Sample $U_t$ uniformly on $\mathbb{S}^{m-1}$.
    Query $\varepsilon = \operatorname{sign}(\langle Y_t - z, U_t \rangle)$ for $z = f_{\theta_{t-1}}(X_t)$.
    Update the parameter $\theta_t = \theta_{t-1} + \gamma(t)\, \varepsilon\, U_t^\top D f_{\theta_{t-1}}(X_t)$.
Output $\hat{\theta} = \theta_T$, or some average, e.g., $\hat{\theta} = T^{-1} \sum_{t=1}^{T} \theta_t$.

5 Statistical analysis

In this section, we quantify the performance of Algorithm 1 by proving optimal rates of convergence when the median regression problem is approached with (reproducing) kernels. For simplicity, we will assume that $f^*$ can be parametrized by a linear model (potentially of infinite dimension).
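A minimal sketch of Algorithm 1 for a linear model $f_\theta(x) = \theta\, \varphi(x)$ could look as follows. The function and oracle names are ours, not the paper's code: `answer_query` plays the labeler, returning the single bit $\operatorname{sign}(\langle Y_t - z, U_t\rangle)$ per sample.

```python
import numpy as np

def median_sgd(phi, X, answer_query, T, gamma, m):
    """Sketch of Algorithm 1 for a linear model f_theta(x) = Theta @ phi(x).

    `answer_query(t, z, u)` returns sign(<Y_t - z, u>) for the hidden label Y_t,
    i.e., one bit of weak supervision per streamed sample."""
    d = phi(X[0]).shape[0]
    rng = np.random.default_rng(0)
    Theta = np.zeros((m, d))                     # theta_0 = 0
    avg = np.zeros_like(Theta)
    for t in range(T):
        x = phi(X[t])
        u = rng.standard_normal(m)
        u /= np.linalg.norm(u)                   # U_t uniform on S^{m-1}
        z = Theta @ x                            # current prediction f_theta(X_t)
        eps = answer_query(t, z, u)              # one-bit query sign(<Y_t - z, U_t>)
        Theta += gamma * eps * np.outer(u, x)    # theta_t = theta_{t-1} + gamma*eps*U_t phi(X_t)^T
        avg += Theta
    return avg / T                               # averaged iterate

# Demo on synthetic data: hidden linear labels Y_t = Theta_star @ X_t (noiseless).
rng = np.random.default_rng(1)
Theta_star = np.array([[1.0, -0.5], [0.3, 0.8]])
X = rng.standard_normal((5000, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)
oracle = lambda t, z, u: np.sign((Theta_star @ X[t] - z) @ u)
Theta_hat = median_sgd(lambda x: x, X, oracle, T=5000, gamma=0.05, m=2)
```

Note that each iteration touches each sample once and reveals only one bit about its label, matching the streaming constraint $i_t = t$.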
Assumption 1. Assume that the solution $f^* : \mathcal{X} \to \mathbb{R}^m$ of the median regression problem (1) and (4) can be parametrized by some separable Hilbert space $\mathcal{H}$ and a bounded feature map $\varphi : \mathcal{X} \to \mathcal{H}$, such that, for any $i \in [m]$, there exists some $\theta_i^* \in \mathcal{H}$ such that $\langle f^*(\cdot), e_i \rangle_{\mathcal{Y}} = \langle \theta_i^*, \varphi(\cdot) \rangle_{\mathcal{H}}$, where $(e_i)$ is the canonical basis of $\mathbb{R}^m$. Written in matrix form, there exists $\theta^* \in \mathcal{Y} \otimes \mathcal{H}$ such that $f^*(\cdot) = \theta^* \varphi(\cdot)$.

The curious reader can easily relax this assumption in the realm of reproducing kernel Hilbert spaces following the work of Pillaud-Vivien et al. (2018a). Under the linear model of Assumption 1, Algorithm 1 is specified with $u^\top D f_\theta(x) = u \otimes \varphi(x)$. Note that rather than working with $\Theta = \mathcal{Y} \otimes \mathcal{H}$, which is potentially infinite-dimensional, empirical estimates can be represented in the finite-dimensional space $\mathcal{Y} \otimes \operatorname{Span}\{\varphi(X_i)\}_{i \leq n}$, and well approximated by small-dimensional spaces to ensure efficient computations (Williams and Seeger, 2000; Meanti et al., 2020).

One of the key points of SGD is that gradient descent is so gradual that one can use noisy or stochastic gradients without losing statistical guarantees while speeding up computations. This is especially true when minimizing convex functions that are neither strongly convex, i.e., bounded below by a quadratic, nor smooth, i.e., with Lipschitz-continuous gradient (see, e.g., Bubeck, 2015). In particular, the following theorem, proven in Appendix A.1, states that Algorithm 1 minimizes the population risk at a speed at least proportional to $O(T^{-1/2})$.

Theorem 1 (Convergence rates).
Under Assumption 1, and under the knowledge of two real values $\kappa$ and $M$ such that $\mathbb{E}[\|\varphi(X)\|^2] \leq \kappa^2$ and $\|\theta^*\| \leq M$, with a budget $T \in \mathbb{N}$, a constant step size $\gamma = \frac{M}{\kappa\sqrt{T}}$ and the average estimate $\hat{\theta} = \frac{1}{T}\sum_{t=0}^{T-1} \theta_t$, Algorithm 1 leads to an estimate $f_{\hat{\theta}}$ that suffers from an excess of risk

$$\mathbb{E}\big[\mathcal{R}(f_{\hat{\theta}})\big] - \mathcal{R}(f^*) \leq \frac{2\kappa M}{c_2\, T^{1/2}} \leq \kappa M m^{3/2}\, T^{-1/2}, \qquad (6)$$

where the expectation is taken with respect to the randomness of $\hat{\theta}$, which depends on the dataset $(X_i, Y_i)$ as well as the questions $(i_t, S_t)_{t \leq T}$.

While we give here a result for a fixed step size, one could reuse the extensive literature on SGD to prove similar results for decaying step sizes that do not require knowing the labeling budget in advance (e.g., setting $\gamma(t) \propto t^{-1/2}$ at the expense of an extra $\log(T)$ factor in front of the rates), as well as for different averaging strategies (see, e.g., Bach, 2023). In practice, one might not know the parameter $M$ a priori, but could nonetheless find the right scaling for $\gamma$ based on cross-validation. The rate in $O(T^{-1/2})$ applies more broadly to all the strategies described in Section 3, as long as the loss $\ell$ and the parametric model $f_\theta$ ensure that $\mathcal{R}(\theta)$ is convex and Lipschitz-continuous. However, the constants appearing in front of the rates depend on the complexity of reconstructing the full gradient $\nabla_\theta \ell(f_\theta(X_i), Y_i)$ from the reconstruction scheme (3). Those constants correspond to the second moment of the stochastic gradient. For example, for the least-squares technique described earlier, one would have to replace $c_2$ by $c_1$ in (6).

Theorem 2, proven in Appendix A.3, states that any algorithm that accesses a fully supervised learning dataset of size $T$ cannot beat the rate in $O(T^{-1/2})$; hence any algorithm that collects weaker information on $(Y_i)_{i \leq T}$ cannot display better rates than the ones verified by Algorithm 1. This proves minimax optimality of our algorithm up to constants.

Theorem 2 (Minimax optimality).
Under Assumption 1 and the knowledge of an upper bound $M$ on $\|\theta^*\|$, assuming that $\varphi$ is bounded by $\kappa$, there exists a universal constant $c_3$ such that, for any algorithm $\mathcal{A}$ that takes as input $\mathcal{D}_T = (X_i, Y_i)_{i \leq T} \sim \rho^{\otimes T}$ for any $T \in \mathbb{N}$ and outputs a parameter $\theta$,

$$\sup_{\rho \in \mathcal{M}_M} \; \mathbb{E}_{\mathcal{D}_T \sim \rho^{\otimes T}}\big[\mathcal{R}(f_{\mathcal{A}(\mathcal{D}_T)}; \rho)\big] - \mathcal{R}(f_\rho^*; \rho) \; \geq \; c_3\, M \kappa\, T^{-1/2}. \qquad (7)$$

The supremum over $\rho \in \mathcal{M}_M$ has to be understood as the supremum over all distributions $\rho \in \Delta_{\mathcal{X} \times \mathcal{Y}}$ such that the problem defined through the risk $\mathcal{R}(f; \rho) := \mathbb{E}_\rho[\ell(f(X), Y)]$ is minimized by an $f_\rho^*$ that verifies Assumption 1 with $\|\theta^*\|$ bounded by the constant $M$. The same theorem applies to least-squares with a different universal constant.

It should be noted that minimax lower bounds in essence quantify worst cases within a given class of problems. In particular, to prove Theorem 2, we consider distributions that lead to hard problems; more specifically, we assume the variance of the conditional distribution $(Y \mid X)$ to be high. The practitioner should keep in mind that it is possible to add additional structure on the solution, to leverage active learning or semi-supervised strategies such as uncertainty sampling (Nguyen et al., 2021) or Laplacian regularization (Zhu et al., 2003; Cabannes et al., 2021a), and thereby improve on the optimal rates of convergence.

To conclude this section, let us remark that most of our derivations could easily be refined for practitioners facing a slightly different cost model for annotation. In particular, they might prefer to perform batches of annotations before updating $\theta$ rather than modifying the question strategy after each input annotation. This would be similar to mini-batching in gradient descent. Indeed, the dependency of our result on the annotation cost model and on Assumption 1 should not be seen as a limitation but rather as a proof of concept.

6 Numerical analysis

In this section, we illustrate the differences between our active method and a classical passive method, for regression and classification problems.
Further discussions are provided in Appendix E. Our code is available online at https://github.com/VivienCabannes/active-labeling.

Let us begin with the regression problem that consists in estimating the function $f^*$ that maps $x \in [0, 1]$ to $\sin(2\pi x) \in \mathbb{R}$. Such a regular function, which belongs to any Hölder or Sobolev class of functions, can be estimated with the Gaussian kernel, which ensures Assumption 1, and which corresponds to a feature map $\varphi$ such that $k(x, x') := \langle \varphi(x), \varphi(x') \rangle = \exp(-|x - x'|^2 / (2\sigma^2))$ for any bandwidth parameter $\sigma > 0$.³ In Figure 1, we focus on estimating $f^*$ given data $(X_i)_{i \in [T]}$ that are uniform on $[0, 1]$, in the noiseless setting where $Y_i = f^*(X_i)$, based on the minimization of the absolute deviation loss. The passive baseline consists in randomly choosing a threshold $U_i \sim \mathcal{N}(0, 1)$ and acquiring the observations $(1_{Y_i > U_i})_{i \in [T]}$, which can be cast as the observation of the half-space $S_i = \{y \in \mathcal{Y} \mid 1_{y > U_i} = 1_{Y_i > U_i}\} =: s(Y_i, U_i)$. In this noiseless setting, a good baseline to learn $f^*$ from the data $(X_i, S_i)$ is provided by the infimum loss characterization (see Cabannes et al., 2020)

$$f^* = \operatorname*{arg\,min}_{f : \mathcal{X} \to \mathcal{Y}} \; \mathbb{E}_{(X,S)}\Big[\inf_{y \in S} \ell(f(X), y)\Big],$$

where the distribution over $X$ corresponds to the marginal of $\rho$ over $\mathcal{X}$, and the distribution over $(S \mid X = x)$ is the pushforward of $U \sim \mathcal{N}(0, 1)$ under $s(f^*(x), \cdot)$. The left plot of Figure 1 corresponds to an instance of SGD on such an objective based on the data $(X_i, S_i)$, while the right plot corresponds to Algorithm 1. We take the same hyperparameters for both plots, a bandwidth $\sigma = 0.2$ and an SGD step size $\gamma = 0.3$. We refer the curious reader to Figure 7 in Appendix E for plots illustrating the streaming history, and to Figure 10 for real-world experiments.

To illustrate the versatility of our method, we approach a classification problem through the median surrogate technique presented in Proposition 3.
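The one-dimensional experiment above can be sketched compactly: with $m = 1$, the half-space query of Algorithm 1 reduces to "is $Y_t$ above or below the current prediction $f_\theta(X_t)$?", and the RKHS update $\theta_t = \theta_{t-1} + \gamma\, \varepsilon\, \varphi(X_t)$ amounts to adding the function $\gamma\, \varepsilon\, k(X_t, \cdot)$ to the estimate. The hyperparameters ($\sigma = 0.2$, $\gamma = 0.3$) follow the text; the rest of the setup is our own simplification, not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma, gamma = 1000, 0.2, 0.3
k = lambda x, xp: np.exp(-np.subtract.outer(x, xp) ** 2 / (2 * sigma ** 2))

f_star = lambda x: np.sin(2 * np.pi * x)           # target signal
X = rng.uniform(0, 1, size=T)                      # streamed inputs
coefs = np.zeros(T)                                # f_t(x) = sum_s coefs[s] * k(X[s], x)
for t in range(T):
    pred_t = coefs[:t] @ k(X[:t], X[t]) if t else 0.0
    eps = np.sign(f_star(X[t]) - pred_t)           # noiseless one-bit annotation
    coefs[t] = gamma * eps                         # kernel SGD step of Algorithm 1

grid = np.linspace(0, 1, 200)
err = np.mean(np.abs(coefs @ k(X, grid) - f_star(grid)))  # mean absolute error
```

With a constant step size the last iterate hovers around the signal at the scale of $\gamma$; averaging the iterates, as in Theorem 1, would smooth this residual oscillation further.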
To do so, we consider the classification problem with $m \in \mathbb{N}$ classes, $\mathcal{X} = [0, 1]$, and the conditional distribution $(Y \mid X)$ linearly interpolating between Dirac masses at $y_1$, $y_2$ and $y_3$ for $x = 0$, $x = 1/2$ and $x = 1$ respectively, and the uniform distribution for $x = 1/4$ and $x = 3/4$; and $X$ uniform on $\mathcal{X} \setminus ([1/4 - \varepsilon, 1/4 + \varepsilon] \cup [3/4 - \varepsilon, 3/4 + \varepsilon])$.

³A noteworthy computational aspect of linear models, often referred to as the "kernel trick", is that the feature map $\varphi$ does not need to be explicit, the knowledge of $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ being sufficient to compute all quantities of interest (Scholkopf and Smola, 2001). This trick can be applied to our algorithms.

Figure 1: Visual comparison of active and passive strategies. Estimation (orange) of the original signal $f^*$ (dashed blue) based on median regression in a noiseless setting. Any orange point $(x, u) \in \mathbb{R}^2$ corresponds to an observation that $u$ is below $f^*(x)$, while a blue point corresponds to $u$ above $f^*(x)$. The passive strategy corresponds to acquiring information based on $(U \mid x)$ following a normal distribution, while the active strategy corresponds to $(u \mid x) = f_\theta(x)$. The active strategy reconstructs the signal much better given the budget of $T = 30$ observations.

Figure 2: Comparison of generalization errors of passive and active strategies as a function of the annotation budget $T$. This error is computed by averaging over 100 trials. The solid line represents the average error, while the height of the dark area represents one standard deviation on each side.

In order to consider the streaming setting where $T$ is not known in advance, we consider the decreasing step size $\gamma(t) = \gamma_0 / \sqrt{t}$; and to smooth out the stochasticity due to random gradients, we consider the average estimate $\bar{\theta}_t = (\theta_1 + \cdots + \theta_t)/t$.
The left plot of Figure 2 corresponds to the noiseless regression setting of Figure 1, with $\gamma_0 = 1$. We observe the convergence behavior in $O(T^{-1/2})$ of our active strategy. The right plot corresponds to the classification problem described in the main text with $m = 100$ and $\varepsilon = 1/20$, approached with the median surrogate. We observe the exponential convergence phenomenon described by Pillaud-Vivien et al. (2018b) and Cabannes et al. (2021b); it kicks in earlier for the active strategy. The two plots are displayed with logarithmic scales on both axes.

7 Discussion

7.1 Discrete output problems

In this section, we discuss casting Algorithm 1 into a procedure to tackle discrete-output problems, by leveraging surrogate regression tasks. Learning problems with discrete output spaces are not as well understood as regression problems. This is a consequence of the complexity of dealing with combinatorial structures, in contrast with continuous metric spaces. In particular, gradients are not defined for discrete output models. The current state-of-the-art framework to deal with discrete output problems is to introduce a continuous surrogate problem whose solution can be decoded into a solution of the original problem (Bartlett et al., 2006). For example, one could solve a classification task with a median regression surrogate problem, which is the object of the next proposition, proven in Appendix C.

Proposition 3 (Consistency of the median surrogate). The classification setting, where $\mathcal{Y}$ is a finite space and $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is the zero-one loss $\ell(y, z) = 1_{y \neq z}$, can be solved as a regression task through the simplex embedding of $\mathcal{Y}$ in $\mathbb{R}^{\mathcal{Y}}$ with the orthonormal basis $(e_y)_{y \in \mathcal{Y}}$.
More precisely, if $g^* : \mathcal{X} \to \mathbb{R}^{\mathcal{Y}}$ is the minimizer of the median surrogate risk $\mathcal{R}_S(g) = \mathbb{E}[\|g(X) - e_Y\|]$, then $f^* : \mathcal{X} \to \mathcal{Y}$ defined as $f^*(x) = \operatorname{arg\,max}_{y \in \mathcal{Y}} g^*_y(x)$ minimizes the original risk $\mathcal{R}(f) = \mathbb{E}[\ell(f(X), Y)]$.⁴

More generally, any discrete output problem can be solved by reusing the consistent least-squares surrogate of Ciliberto et al. (2020). Algorithm 1 can be adapted to the least-squares problem based on the specifications at the beginning of Section 4. This allows using our method in an off-the-shelf fashion for all discrete output problems. For example, a problem consisting in ranking preferences over $m$ items can be approached with the Kendall correlation loss $\ell(y, z) = -\langle \varphi(y), \varphi(z) \rangle$ with $\varphi(y) = (1_{y(i) > y(j)})_{i < j \leq m}$, where $y$ and $z$ are permutations over $[m]$ that encode the rank of each element in terms of user preferences. In this setting, the surrogate task introduced by Ciliberto et al. (2020) consists in learning $g(x) = \mathbb{E}[\varphi(Y) \mid X = x]$ as a least-squares problem. The half-space surrogate queries translate directly into the questions $\sum_{i < j \leq m} w(i, j)\, 1_{y(i) > y(j)} > c$ for some $(w(i, j))$, $c$ in $\mathbb{R}$. In particular, if $U$ is chosen to be uniform on the canonical basis (rather than on the sphere), those questions translate into pairwise orderings (e.g., does user $x$ prefer movie $i$ or movie $j$?). In terms of guarantees akin to Theorem 1, reusing the calibration inequality of Ciliberto et al. (2020), we get convergence rates of the form $m^{3/2} T^{-1/4}$. In terms of guarantees akin to Theorem 2, since we need at least $\log_2(m!) \simeq m \log(m)$ binary queries to discriminate between the $m!$ permutations, we can expect a lower bound in $m^{1/2} \log(m)^{1/2}\, T^{-1/2}$. More generally, many ranking problems can be approached with correlation losses and tackled through surrogate regression problems on the convex hulls of some well-known polytopes, such as the Birkhoff polytope or the permutohedron (e.g., Ailon, 2014).
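The Kendall embedding and its pairwise queries can be made concrete in a few lines. The helper names below are ours; `y[k]` holds the rank of item `k`, and picking the canonical-basis direction $(i, j)$ turns the half-space query into the one-bit pairwise question $1_{y(i) > y(j)}$.

```python
import numpy as np
from itertools import combinations

def kendall_embedding(y):
    """Kendall embedding phi(y) = (1_{y(i) > y(j)}) for i < j <= m,
    where y[k] is the rank of item k (higher rank = more preferred)."""
    return np.array([float(y[i] > y[j]) for i, j in combinations(range(len(y)), 2)])

def pairwise_query(y, i, j):
    # One-bit weak supervision on the label: 1_{y(i) > y(j)},
    # i.e., "does the user prefer item i over item j?"
    return float(y[i] > y[j])

y = np.array([2, 0, 1])        # ranks of 3 items for one user
phi = kendall_embedding(y)     # lives in {0, 1}^{m(m-1)/2}, here length 3
# phi == [1., 1., 0.]: item 0 beats items 1 and 2, item 1 loses to item 2
```

Learning $g(x) = \mathbb{E}[\varphi(Y) \mid X = x]$ by least-squares then only ever consumes such single pairwise bits.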
Although their description is out of the scope of this paper, linear cuts of those polytopes form well-structured query sets; e.g., the faces of all dimensions of the permutohedron correspond, in a one-to-one fashion, to strict weak orderings (Ziegler, 1995).

In those discrete settings, Theorem 1 can be refined under low-noise conditions. In particular, under a generalization of the Massart noise condition, our approach could even exhibit exponential convergence rates, as illustrated in Figure 2. For classification problems, this condition can be expressed as the existence of a threshold $\delta > 0$ such that, for almost all $x \in \mathcal{X}$ and all $z \in \mathcal{Y}$, we have $\mathbb{P}(Y = f^*(x) \mid X = x) - \mathbb{P}(Y = z \mid X = x) \notin (0, \delta)$. Arguably, this assumption is met on well-curated image datasets such as ImageNet or CIFAR10, where for each input $X$ the most probable class always has more than, e.g., 60% chance of being the target $Y$. When this assumption holds together with Assumption 1 (when the surrogate target $g^*$ belongs to the RKHS and the kernel is bounded), the right-hand side of equation (6) can be replaced by $\exp(-cT)$ for some constant $c$. The proof would be a simple adaptation of Pillaud-Vivien et al. (2018a) and Cabannes et al. (2021b) to our case.

7.2 Supervised learning baseline with resampling

In this section, we discuss simple supervised learning baselines that compete with Algorithm 1 when resampling is allowed. When resampling is allowed, a simple baseline for the active labeling problem is provided by supervised learning. In regression problems with the query of any half-space, a method that consists in annotating each $(Y_i)_{i \leq n(T, \varepsilon)}$ up to precision $\varepsilon$, before using any supervised learning method to learn $f$ from $(X_i, Y_i)_{i \leq n(T, \varepsilon)}$, could acquire $n(T, \varepsilon) \simeq T / (m \log_2(\varepsilon^{-1}))$ data points with a dichotomic search along all directions, assuming the $Y_i$ bounded or sub-Gaussian.
In terms of minimax rates, such a procedure cannot perform better than n(T, Ξ΅)^{βˆ’1/2} + Ξ΅, the first term being due to the statistical limit in Theorem 2, the second to the uncertainty Ξ΅ on each Y_i, which transfers to the same level of uncertainty on f. Optimizing with respect to Ξ΅ yields a bound in O(T^{βˆ’1/2} log(T)^{1/2}). Therefore, this not-so-naive baseline is only suboptimal by a factor log(T)^{1/2}. Meanwhile, Algorithm 1 can be rewritten with resampling, as can Theorem 1, which we prove in Appendix A.2. Hence, our technique still achieves minimax optimality for the problem with resampling. In other terms, by deciding to acquire more imprecise information, our algorithm reduces the annotation cost for a given level of generalization error (or, equivalently, reduces the generalization error for a given annotation budget) by a factor log(T)^{1/2} when compared to this baseline. ⁴As a side note, while we are not aware of any generic theory encompassing the absolute-deviation surrogate of Proposition 3, we showcase its superiority over least-squares on at least two types of problems on Figures 3 and 4 in Appendix C. The picture is slightly different for discrete-output problems. If one can ask any question s ∈ 2^Y, then with a dichotomic search one can retrieve any label with logβ‚‚(m) questions. Hence, to theoretically beat the fully supervised baseline with the SGD method described in Section 3, one would have to derive a gradient strategy (2) with a small enough second moment (e.g., for convex losses that are neither smooth nor strongly convex, the increase in the second moment compared to the usual stochastic gradients should be no greater than logβ‚‚(m)^{1/2}). How to best refine our technique to better take into account the discrete structure of the output space is an open question. Introducing bias that does not modify convergence properties while reducing variance, possibly thanks to importance sampling, is a potential way to approach this problem.
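The dichotomic retrieval of a discrete label with logβ‚‚(m) subset questions can be sketched as follows; this is a minimal illustration with hypothetical class names, not the paper's code.

```python
def retrieve_label(candidates, member_query):
    """Retrieve the hidden label among `candidates` by dichotomic search,
    asking only subset questions "is Y in s?". Uses ceil(log2(m)) questions."""
    pool = list(candidates)
    n_questions = 0
    while len(pool) > 1:
        half = pool[:len(pool) // 2]
        n_questions += 1
        # One bit of weak supervision: does the hidden label lie in `half`?
        pool = half if member_query(set(half)) else pool[len(pool) // 2:]
    return pool[0], n_questions

labels = [f"class_{i}" for i in range(8)]     # m = 8 classes (hypothetical names)
hidden = "class_5"                             # never revealed directly
guess, cost = retrieve_label(labels, lambda s: hidden in s)
# guess recovers the hidden label after log2(8) = 3 questions
```

This is the fully supervised baseline that a gradient strategy with small enough second moment would have to beat.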
A simpler idea would be to remember information of the type Y_i ∈ s in order to restrict the questions asked to locate f_{ΞΈ_t}(X_i) βˆ’ Y_i when performing stochastic gradient descent with resampling. Combinatorial bandits might also provide helpful insights on the matter. Ultimately, we would like to build an understanding of the whole distribution (Y | X), and not only of f⋆(X), as we explore labels in order to refine this exploration. 7.3 Min-max approaches In this section, we discuss potential extensions of our SGD procedure, based on min-max variational objectives. Min-max approaches have been popularized for searching games and active learning, where one searches for the question that minimizes the size of the space where a potential guess could lie under the worst possible answer to that question. A particularly illustrative example is the solution of the Mastermind game proposed by Knuth (1977). While our work leverages plain SGD, one could build on the vector-field point of view of gradient descent (see, e.g., Bubeck, 2015) to tackle min-max convex-concave problems with similar guarantees. In particular, we could design weakly supervised losses L(f(x), s; 1_{y ∈ s}) and min-max games where a prediction player aims at minimizing such a loss with respect to the prediction f, while the query player aims at maximizing it with respect to the question s, that is, querying information that best elicits mistakes made by the prediction player. For example, the dual characterization of the norm leads to the following min-max approach to median regression: arg min_{f: X β†’ Y} R(f) = arg min_{f: X β†’ Y} max_{U ∈ (S^{mβˆ’1})^{XΓ—Y}} E_{(X,Y) ∼ ρ}[⟨U(X, Y), f(X) βˆ’ Y⟩]. Such min-max formulations would be of interest if they led to improvements in computational and statistical efficiency, similar to the work of Babichev et al. (2019). For classification problems, the following proposition introduces such a game and suggests its suitability.
Its proof can be found in Appendix D. Proposition 4. Consider the classification problem of learning f: X β†’ Y where Y is of finite cardinality, with the 0-1 loss β„“(z, y) = 1_{z β‰  y}, minimizing the risk (1) under a distribution ρ on X Γ— Y. Introduce the surrogate score functions g: X β†’ Ξ”_Y; x ↦ v, where v = (v_y)_{y ∈ Y} is a family of non-negative weights that sum to one, as well as the surrogate loss function L: Ξ”_Y Γ— S Γ— {βˆ’1, 1} β†’ R; (v, S, Ξ΅) ↦ Ξ΅(1 βˆ’ 2 Ξ£_{y ∈ S} v_y), and the min-max game min_{g: X β†’ Ξ”_Y} max_{ΞΌ: X β†’ Ξ”_S} E_{(X,Y) ∼ ρ} E_{S ∼ ΞΌ(X)} [L(g(X), S; 1_{Y ∈ S} βˆ’ 1_{Y βˆ‰ S})]. (8) When S contains the singletons, and under the low-noise condition that P(Y β‰  f⋆(x) | X = x) < 1/2 almost everywhere, f⋆ can be learned through the relation f⋆(x) = arg min_{y ∈ Y} g⋆(x)_y for the unique minimizer g⋆ of (8). Moreover, the minimization of the empirical version of this objective with the stochastic gradient updates for saddle-point problems provides a natural active labeling scheme to find this g⋆. On the one hand, this min-max formulation could help to easily incorporate restrictions on the sets to query. On the other hand, it is not completely clear how to best update (or derive an unbiased stochastic gradient strategy for) the adversarial query strategy ΞΌ based on partial information. 8 Conclusion We have introduced the active labeling problem, which corresponds to active partially supervised learning. We provided a solution to this problem based on stochastic gradient descent. Although our method can be used for any discrete output problem, we detailed how it works for median regression, where we showed that it optimizes the generalization error for a given annotation budget. In the near future, we would like to focus on better exploiting the discrete structure of classification problems, possibly with resampling strategies.
Understanding more precisely the key issues in applications concerned with privacy, and studying how weak gradients might provide a good trade-off between learning efficiently and revealing too much information, also provide interesting follow-ups. Finally, regarding dataset annotation, exploring different paradigms of weakly supervised learning would lead to different active weakly supervised learning frameworks. While this work is based on partial labeling, similar formalizations could be made based on other weak supervision models, such as aggregation (e.g., Ratner et al., 2020), or group statistics (Dietterich et al., 1997). In particular, annotating a huge dataset is often done by bagging inputs according to predicted labels and correcting errors that can be spotted on those bags of inputs (Deng et al., 2009). We leave for future work the study of variants of the active labeling problem that model those settings. Acknowledgments and Disclosure of Funding While at INRIA / ENS / PSL, VC was funded in part by the French government under the management of the Agence Nationale de la Recherche as part of the Investissements d'avenir program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). FB and AR also acknowledge the support of the European Research Council (grants SEQUOIA 724063 and REAL 947908). Nir Ailon. Active learning ranking from pairwise preferences with almost optimal query complexity. In Advances in Neural Information Processing Systems, 2011. Nir Ailon. Improved bounds for online learning over the permutahedron and other ranking polytopes. In International Conference on Artificial Intelligence and Statistics, 2014. Martin Anthony and Peter Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. Dmitry Babichev, Dmitrii Ostrovskii, and Francis Bach. Efficient primal-dual algorithms for large-scale multiclass classification. Technical Report 1902.03755, arXiv, 2019. Francis Bach. Learning Theory from First Principles.
To appear at MIT Press, 2023. Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, 2013. Peter Bartlett, Michael Jordan, and Jon McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. Viktor Bengs, RΓ³bert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke HΓΌllermeier. Preference-based online learning with dueling bandits: A survey. Journal of Machine Learning Research, 22(7):1–108, 2021. Lucien BirgΓ©. Approximation dans les espaces mΓ©triques et thΓ©orie de l'estimation. Zeitschrift fΓΌr Wahrscheinlichkeitstheorie und Verwandte Gebiete, 65(2):181–237, 1983. Salomon Bochner. Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Mathematische Annalen, 108(1):378–410, 1933. Mark Braverman, Jieming Mao, and Yuval Peres. Sorted top-k in rounds. In Conference on Learning Theory, 2019. SΓ©bastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015. Vivien Cabannes, Alessandro Rudi, and Francis Bach. Structured prediction with partial labelling through the infimum loss. In International Conference on Machine Learning, 2020. Vivien Cabannes, Loucas Pillaud-Vivien, Francis Bach, and Alessandro Rudi. Overcoming the curse of dimensionality with Laplacian regularization in semi-supervised learning. In Advances in Neural Information Processing Systems, 2021a. Vivien Cabannes, Alessandro Rudi, and Francis Bach. Fast rates in structured prediction. In Conference on Learning Theory, 2021b. Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2006. NicolΓ² Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7(2):31–54, 2006.
NicolΓ² Cesa-Bianchi, Tommaso Cesari, and Vianney Perchet. Dynamic pricing with finitely many unknown valuations. In International Conference on Algorithmic Learning Theory, 2019. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, 2011. Olivier Chapelle, Bernhard SchΓΆlkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, 2006. Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959. ClΓ©ment Chevalier, Julien Bect, David Ginsbourger, Emmanuel VΓ‘zquez, Victor Picheny, and Yann Richet. Fast parallel kriging-based stepwise uncertainty reduction with application to the identification of an excursion set. Technometrics, 56(4):455–465, 2014. Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A general framework for consistent structured prediction with implicit loss embeddings. Journal of Machine Learning Research, 21(98):1–67, 2020. Maxime Cohen, Ilan Lobel, and Renato Paes Leme. Feature-based dynamic pricing. Management Science, 66(11):4921–4943, 2020. Council of European Union. Regulation (EU) 2016/679 of the European parliament (General Data Protection Regulation), 2016. TimothΓ©e Cour, Benjamin Sapp, and Ben Taskar. Learning from partial labels. Journal of Machine Learning Research, 12(42):1501–1535, 2011. Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley, 1991. Sanjoy Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009. Thomas Dietterich, Richard Lathrop, and TomΓ‘s Lozano-PΓ©rez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997. AnHai Doan, Raghu Ramakrishnan, and Alon Halevy.
Crowdsourcing systems on the world-wide web. Communications of the ACM, 54(4):86–96, 2011. Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, 2006. Robert Fano. Transmission of Information: A Statistical Theory of Communications. MIT Press, 1968. Tanner Fiez, Lalit Jain, Kevin Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, 2019. Dimitris Fotakis, Alkis Kalavasis, Vasilis Kontonis, and Christos Tzamos. Efficient algorithms for learning from coarse labels. In Conference on Learning Theory, 2021. Sachin Gangaputra and Donald Geman. A design principle for coarse-to-fine classification. In Conference on Computer Vision and Pattern Recognition, 2006. AurΓ©lien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, 2016. Donald Geman and Bruno Jedynak. Shape recognition and twenty questions. Technical report, INRIA, 1993. AurΓ©lien GΓ©ron. Hands-On Machine Learning with Scikit-Learn & TensorFlow. O'Reilly, 2017. Edgar Gilbert. A comparison of signalling alphabets. Bell System Technical Journal, 31(3):504–522, 1952. Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014. Charles Harris, Jarrod Millman, StΓ©fan van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten van Kerkwijk, Matthew Brett, Allan Haldane, Jaime FernΓ‘ndez del RΓ­o, Mark Wiebe, Pearu Peterson, Pierre GΓ©rard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, 2020. Wassily Hoeffding.
Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. Peter Huber. Robust Statistics. Wiley, 1981. John Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. Il'dar Ibragimov and Rafail Khas'minskii. On the estimation of an infinite-dimensional parameter in Gaussian white noise. Doklady Akademii Nauk SSSR, 236(5):1053–1055, 1977. Kevin Jamieson and Robert Nowak. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, 2011. Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the Association for Computing Machinery, 45(6):983–1006, 1998. Donald Knuth. The computer as master mind. Journal of Recreational Mathematics, 9(1):1–6, 1977. Andrey Kolmogorov and Vladimir Tikhomirov. Ξ΅-entropy and Ξ΅-capacity of sets in functional spaces. Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959. Jonathan Krause, Benjamin Sapp, Andrew Howard, Howard Zhou, Alexander Toshev, Tom Duerig, James Philbin, and Li Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. In European Conference on Computer Vision, 2016. Allen Liu, Renato Paes Leme, and Jon Schneider. Optimal contextual pricing and extensions. In Symposium on Discrete Algorithms, 2021. Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, 2016. Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, and Alessandro Rudi. Kernel methods through the roof: Handling billions of points efficiently. In Advances in Neural Information Processing Systems, 2020. Vu-Linh Nguyen, Mohammad Hossein Shaker, and Eyke HΓΌllermeier. How to measure uncertainty in uncertainty sampling for active learning. Machine Learning, 111(1):89–122, 2021. Alex Nowak-Vila. Structured prediction with theoretical guarantees.
PhD thesis, Γ‰cole Normale SupΓ©rieure, 2021. Andrzej Pelc. Searching games with errors - fifty years of coping with liars. Theoretical Computer Science, 270(1):71–109, 2002. Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, 2018a. Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Exponential convergence of testing error for stochastic gradient methods. In Conference on Learning Theory, 2018b. Bahar Qarabaqi and Mirek Riedewald. User-driven refinement of imprecise queries. In International Conference on Data Engineering, 2014. Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher RΓ©. Snorkel: rapid training data creation with weak supervision. The VLDB Journal, 29(2):709–730, 2020. Donald Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976. Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: NystrΓΆm computational regularization. In Advances in Neural Information Processing Systems, 2015. Bernhard SchΓΆlkopf and Alexander Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001. Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2010. Karthik Sridharan, Shai Shalev-Shwartz, and Nathan Srebro. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems, 2008. James Tobin. Estimation of relationships for limited dependent variables. Econometrica, 26(1):24–36, 1958. US Census Bureau. Income and poverty in the United States: 2020, 2021. Leslie Valiant. Parallelism in comparison problems. SIAM Journal on Computing, 4(3):348–355, 1975. Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995. Rom Varshamov. Estimate of the number of signals in error correcting codes.
Doklady Akademii Nauk SSSR, 117:739–741, 1957. Anatoliy Vitushkin. On Hilbert's thirteenth problem. Proceedings of the USSR Academy of Sciences, 95(4):701–704, 1954. John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944. Mansfield Tracy Walsorth. Twenty Questions: A Short Treatise on the Game. Holt, 1882. Dan Wang and Yi Shang. A new active labeling method for deep learning. In International Joint Conference on Neural Networks, 2014. Harold Widom. Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109(2), 1963. Christopher Williams and Matthias Seeger. Using the NystrΓΆm method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2000. Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, and Jiebo Luo. Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition. In Conference on Computer Vision and Pattern Recognition, 2019. Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In International Conference on Machine Learning, 2003. GΓΌnter Ziegler. Lectures on Polytopes. Springer-Verlag, 1995. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] See the discussion section. (c) Did you discuss any potential negative societal impacts of your work? [N/A] This work aims at developing advanced techniques to learn without too much supervision. Such a quest to increase AI systems' capabilities at a reduced human labor cost is associated with broad societal issues. Those questions being quite generic, we did not mention them in the main text. (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2.
If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] The experiments were run on a personal laptop and did not require much compute. Indeed, the amount of compute for the experiments was similar to the amount used to write this paper. 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] Although we have not cited the creators of some LaTeX libraries we used, such as Michael Sharpe and the newtx package, which we used for fonts in our text. (b) Did you mention the license of the assets? [N/A] NumPy and LIBSVM are under Berkeley Software Distribution licenses (respectively the liberal and revised ones); Python and Matplotlib are under the Python Software Foundation license. (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]