# Corruption Robust Active Learning

Yifang Chen, Simon S. Du, Kevin Jamieson
Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
{yifangc, ssdu, jamieson}@cs.washington.edu

We conduct theoretical studies on streaming-based active learning for binary classification under unknown adversarial label corruptions. In this setting, every time before the learner observes a sample, the adversary decides whether to corrupt the label or not. First, we show that, in a benign corruption setting (which includes the misspecification setting as a special case), with a slight enlargement of the hypothesis elimination threshold, the classical Robust CAL framework can (surprisingly) achieve nearly the same label complexity guarantee as in the non-corrupted setting. However, this algorithm can fail in the general corruption setting. To resolve this drawback, we propose a new algorithm which is provably correct without any assumptions on the presence of corruptions. Furthermore, this algorithm enjoys the minimax label complexity in the non-corrupted setting (which is achieved by Robust CAL) and only requires $O(C_{\text{total}})$ additional labels in the corrupted setting to achieve accuracy $O(\varepsilon + C_{\text{total}}/n)$, where $\varepsilon$ is the target accuracy, $C_{\text{total}}$ is the total number of corruptions and $n$ is the total number of unlabeled samples.

## 1 Introduction

An active learning algorithm for binary classification aims to obtain the best hypothesis (classifier) from some given hypothesis set while requesting as few labels as possible. Under some favorable conditions, active learning algorithms can require exponentially fewer labels than passive, random sampling [Hanneke, 2014]. Active learning is ideally suited for applications where large datasets are required for accurate inference, but the cost of paying human annotators to label a dataset is prohibitively large [Joshi et al., 2009, Yang et al., 2015, Beluch et al., 2018].
A bit more formally, for an example space $X$ (such as a set of images) and label space $\{0,1\}$ (like whether the image contains a human-made object or not), let $H$ be a hypothesis class such that each $h \in H$ is a map $h : X \to \{0,1\}$. After a certain number of labels are requested, the learner will output a target hypothesis $h_{\text{out}} \in H$. In this paper, we consider the streaming setting where at each time $t$ nature reveals $x_t \sim D_X$ and an active learning algorithm must make the real-time decision on whether to request the corresponding label $y_t$ or not. Such a streaming setting of active learning is frequently encountered in online environments such as learning a spam filter or fraud detection (i.e., mark as spam/fraudulent and do not request the label, or send to inbox/expert to obtain a label).

This paper is interested in a setting where the requested label $y_t$ is potentially corrupted by an adversary. That is, when requesting the label for some example $x_t \in X$, if uncorrupted the learner will receive a label drawn according to the true conditional label distribution, but if corrupted, the learner will receive a label drawn from an arbitrary distribution decided by an adversary. This setting is challenging because the learner has no a priori knowledge of when or how many corruptions will occur. Moreover, if the learner is collecting data adaptively, it may easily be misled into becoming confident in an incorrect belief, collect data based on that belief, and never recover to output an accurate classifier even if the adversary eventually stops serving corrupted labels. This greatly contrasts with the passive setting (where all labels are observed): there, as long as the number of corruptions grows sub-linearly over time, the effect of the corruptions will fade and the empirical risk minimizer will converge to an accurate classifier with respect to the uncorrupted labels.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
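The streaming protocol just described can be sketched as a generic loop. This is an illustrative sketch only: the function names and the explicit label budget are ours, not part of the paper's formal setting.

```python
def stream_active_learning(stream, should_query, fit, budget):
    """Generic streaming active-learning loop (illustrative sketch).
    `stream` yields (x_t, get_label) pairs, where calling get_label()
    simulates requesting y_t from the annotator; `should_query` is the
    learner's real-time decision rule; `fit` maps the labeled set to h_out."""
    labeled = []
    for x, get_label in stream:
        if budget > 0 and should_query(x, labeled):
            labeled.append((x, get_label()))  # request y_t, pay one label
            budget -= 1
    return fit(labeled)                       # output h_out from labels seen
```

For instance, `should_query` could test membership of $x_t$ in the current region of disagreement, which is exactly the decision rule of the CAL family discussed below.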
The source of corruptions can come from automatic labeling, non-expert labeling, and, most severely, adaptive data-poisoning adversaries. In particular, with the rise of crowdsourcing, it is increasingly feasible for such malicious labelers to enter the system [Miller et al., 2014, Awasthi et al., 2014]. There have been many prior works that consider robust offline training using corrupted labels (i.e., the passive setting) [Hendrycks et al., 2018, Yu et al., 2019]. Correspondingly, related corruption settings have also been considered in online learning [Gupta et al., 2019, Zimmert and Seldin, 2019, Wei et al., 2020] and reinforcement learning [Lykouris et al., 2020, Chen et al., 2021a, Wei et al., 2021]. However, there is a striking lack of such literature in the active learning area. Existing disagreement-based active learning algorithms nearly achieve the minimax label complexity for a given target accuracy when labels are trusted [Hanneke, 2014], but they fail to deal with the case where the labels are potentially corrupted.

Our contributions: In this paper, we study active learning in the agnostic, streaming setting where an unknown number of labels are potentially corrupted by an adversary. We begin with the performance of existing baseline algorithms. Firstly, we analyze the performance of empirical risk minimization (ERM) in the passive setting where all labels are observed, which will output an $O(\varepsilon + R^* C_{\text{total}}/n)$-optimal hypothesis as long as $n \gtrsim 1/\varepsilon^2$, where $R^*$ is the risk of the best hypothesis. This result serves as a benchmark for the following active learning results (Section 3). If we assume that the disagreement coefficient, a quantity that characterizes the sample complexity of active learning algorithms [Hanneke, 2014], is some constant, then we obtain the following results for active learning.
Secondly, we analyze the performance of a standard active learning algorithm called Robust CAL [Balcan et al., 2009, Dasgupta et al., 2007, Hanneke, 2014] under a benign assumption on the corruptions (the misspecification model is a special case of that assumption). We show that, by slightly enlarging the hypothesis elimination threshold, this algorithm can achieve almost the same label complexity as in the non-corrupted setting. That is, the algorithm will output an $O(\varepsilon + R^* C_{\text{total}}/n)$-optimal hypothesis as long as $n \gtrsim R^*/\varepsilon$, with at most $\tilde O(R^* n + \log(n))$ labels (Section 4). Finally and most importantly, in the general corruption case without any assumptions on how the corruptions are allocated, we propose a new algorithm that matches Robust CAL in the non-corrupted case and only requires $\tilde O(C_{\text{total}})$ additional labels in the corrupted setting. That is, the algorithm will output an $(\varepsilon + C_{\text{total}}/n)$-optimal hypothesis as long as $n \gtrsim 1/\varepsilon^2$, with at most $\tilde O((R^*)^2 n + \log(n) + C_{\text{total}})$ labels. Moreover, this algorithm also enjoys an improved bound under a benign assumption on the corruptions: it will output an $(\varepsilon + R^* C_{\text{total}}/n)$-optimal hypothesis with at most $\tilde O((R^*)^2 n + \log(n) + R^* C_{\text{total}})$ labels (Section 5). Note that $C_{\text{total}}$ can be regarded as a fixed budget or can increase with incoming samples. In the latter case, $C_{\text{total}}$ in the second and third results can differ since they may require different $n$. A detailed comparison between these two results is given in the corresponding sections.

Related work: For nearly as long as researchers have studied how well classifiers generalize beyond their performance on a finite labelled dataset, they have also been trying to understand how to minimize the potentially expensive labeling burden. Consequently, the field of active learning, which aims to learn a classifier using as few annotated labels as possible by selecting examples sequentially, is also somewhat mature [Settles, 2011, Hanneke, 2014].
Here, we focus on just the agnostic, streaming setting where there is no relationship assumed a priori between the hypothesis class and the example-label pairs provided by Nature. More than a decade ago, a landmark algorithm we call Robust CAL was developed for the agnostic, streaming setting and analyzed by a number of authors, obtaining nearly minimax performance [Balcan et al., 2009, Dasgupta et al., 2007, Hanneke, 2014]. The performance of Robust CAL is characterized by a quantity known as the disagreement coefficient that can be large, but in many favorable situations can be bounded by a constant $\theta$, which we assume is the case here. In particular, for any $\epsilon > 0$, once Nature has offered Robust CAL $n$ unlabelled samples, Robust CAL promises to return a classifier with error at most $\sqrt{R^*\log(|H|)/n} + \log(|H|)/n$ and requests just $nR^* + \theta\big(\sqrt{nR^*\log(|H|)} + \log(|H|)\big)$ labels with high probability. Said another way, Robust CAL returns an $\epsilon$-good classifier after requesting just $\theta\big((R^*)^2/\epsilon^2 + \log(1/\epsilon)\big)$ labels. If $\theta$ is treated as an absolute constant, this label complexity is minimax optimal [Hanneke, 2014]. While there exist algorithms with other favorable properties and superior performance under special distributional assumptions (c.f., Zhang and Chaudhuri [2014], Koltchinskii [2010], Balcan et al. [2007], Balcan and Long [2013], Huang et al. [2015]), we use Robust CAL as our benchmark in the uncorrupted setting. We note that since Robust CAL is computationally inefficient for many classifier classes of interest, a number of works have addressed this issue at the cost of a higher label complexity [Beygelzimer et al., 2009, 2010, Hsu, 2010, Krishnamurthy et al., 2017] or a higher unlabeled sample complexity [Huang et al., 2015]. Our own work, like Robust CAL, is also not computationally efficient but could benefit from the ideas in these works as well.
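For intuition about the disagreement coefficient, it can be computed exactly for simple classes. The Monte-Carlo sketch below is a toy example of ours (not from the paper): for 1-D threshold classifiers under a uniform marginal, the disagreement region of a radius-$r$ ball around $h^*$ has mass exactly $2r$, so the coefficient equals 2.

```python
import random

def disagreement_mass(r, n_samples=200_000, seed=0):
    # 1-D thresholds h_c(x) = 1{x >= c} under uniform nu on [0, 1]:
    # rho(h_c, h*) = |c - 0.5| for h* = h_{0.5}, so the ball B(h*, r) is
    # {h_c : |c - 0.5| <= r} and its disagreement region is (0.5-r, 0.5+r].
    rng = random.Random(seed)
    return sum(0.5 - r < rng.random() <= 0.5 + r for _ in range(n_samples)) / n_samples

def disagreement_coefficient(r0, grid):
    # theta(r0) = sup_{r >= r0} P_{x ~ nu}(x in Dis(B(h*, r))) / r,
    # with the sup approximated over a finite grid of radii
    return max(disagreement_mass(r) / r for r in grid if r >= r0)
```

Running `disagreement_coefficient(0.05, [0.05, 0.1, 0.2, 0.4])` should return a value close to the exact coefficient 2 for this class.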
To the best of our knowledge, there are few works that address the non-IID active learning setting, such as the corrupted setting of this paper. Nevertheless, Miller et al. [2014] describe the need for robust active learning algorithms and the many potential attack models. While some applied works have proposed heuristics for active learning algorithms that are robust to an adversary [Deng et al., 2018, Pi et al., 2016], we are not aware of any that are provably robust in the sense defined in this paper. In crowd-sourcing settings, where labels are provided by a pool of annotators of varying quality, some active learning algorithms have attempted to avoid or down-weight poorly performing annotators, but these models are more stochastic than adversarial [Khetan and Oh, 2016]. The problem of selective sampling or online domain adaptation studies the setting where $P(Y_t = 1 \mid X_t = x)$ remains fixed but $P(X_t = x)$ drifts, and the active learner aims to compete with the best online predictor that observes all labels [Yang, 2011, Dekel et al., 2012, Hanneke and Yang, 2021, Chen et al., 2021b]. Another relevant line of work considers the case where the distribution of the examples drifts over time (i.e., $P(X_t = x)$) [Rai et al., 2010] or the label proportions change (i.e., $P(Y_t = 1)$) [Zhao et al., 2021], but there the learner is aware of the time when the change occurs and needs to adapt. These settings are incomparable to our own. Despite the limited literature in active learning, there have been many corruption-related works in the related problem areas of multi-armed bandits (MAB), linear bandits and episodic reinforcement learning. To be specific, for MAB, Gupta et al. [2019] achieve a regret of $O\big(\sum_{a\neq a^*}\frac{1}{\Delta_a} + KC\big)$ (up to log factors) by adopting a sampling strategy based on estimated gaps instead of eliminating arms permanently. Our proposed algorithm is inspired by this "soft elimination" technique and requests labels based on the estimated gap of each hypothesis.
Later, Zimmert and Seldin [2019] achieved a near-optimal result $O\big(\sum_{a\neq a^*}\frac{1}{\Delta_a} + \sqrt{C\sum_{a\neq a^*}\frac{1}{\Delta_a}}\big)$ in MAB by using Follow-the-Regularized-Leader (FTRL) with Tsallis entropy. How to apply the FTRL technique in active learning, however, remains an open problem. Besides MAB, Lee et al. [2021] achieve $\tilde O(\text{GapComplexity} + C)$ in stochastic linear bandits. We note that the linear bandit papers of Lee et al. [2021] and Camilleri et al. [2021] both leverage the Catoni estimator that we have repurposed for robust gap estimation in our algorithm. Finally, in episodic reinforcement learning, Lykouris et al. [2020] achieve $\tilde O(C \cdot \text{GapComplexity} + C^2)$ in non-tabular RL and Chen et al. [2021a] achieve $\tilde O(\text{PolicyGapComplexity} + C^2)$ in tabular RL. Very recently, Wei et al. [2021] obtained a $\text{GapComplexity} + C$ bound for linear MDPs.

## 2 Preliminaries

General protocol: A hypothesis class $H$ is given to the learner such that each $h \in H$ is a map $h : X \to \{0,1\}$. Before the start of the game, Nature will draw $n$ unlabeled samples in total. At each time $t \in \{1,\ldots,n\}$, Nature draws $(x_t, y_t) \in X \times \{0,1\}$ independently from a joint distribution $D_t$; the learner observes just $x_t$ and chooses whether to request $y_t$ or not. Note that in this paper we assume $X$ is countable, but the results can be directly extended to the uncountable case. Next, we denote the expected risk of a classifier $h \in H$ under any distribution $D$ as $R_D(h) = \mathbb E_{(x,y)\sim D}\,\mathbf 1\{h(x) \neq y\}$, the marginal distribution of $x$ as $\nu$, and the probability of $y = 1$ given $x$ under $D$ as $\eta^x$. Finally, we define $\rho_D(h, h') = \mathbb E_{x\sim\nu}\,\mathbf 1\{h(x) \neq h'(x)\}$.

Uncorrupted model: In the traditional uncorrupted setting, there exists a fixed underlying distribution $D^*$ from which each $(x_t, y_t)$ is drawn i.i.d. Correspondingly, we define the marginal distribution of $x$ as $\nu^*$ and the probability of $y = 1$ given $x$ under $D^*$ as $\eta^{*x}$.
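The empirical counterparts of the risk $R_D(h)$ and the disagreement probability $\rho_D(h, h')$ are straightforward to compute from samples; a small illustrative helper (the function names are ours):

```python
def empirical_risk(h, labeled_sample):
    # R_hat(h) = (1/n) * sum_t 1{h(x_t) != y_t} over labeled pairs (x_t, y_t)
    return sum(h(x) != y for x, y in labeled_sample) / len(labeled_sample)

def empirical_disagreement(h1, h2, xs):
    # rho_hat(h1, h2) = (1/n) * sum_t 1{h1(x_t) != h2(x_t)}; needs no labels
    return sum(h1(x) != h2(x) for x in xs) / len(xs)
```

Note that the disagreement estimate uses only unlabeled points, which is why disagreement-based algorithms can maintain it without spending label budget.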
Oblivious and non-oblivious adversary model: In the corrupted setting, the label at time $t$ is corrupted if $(x_t, y_t)$ is drawn from some corrupted distribution $D_t$ that differs from the base $D^*$. At the start of the game, an oblivious adversary chooses a sequence of functions $\eta^x_t : X \to [0,1]$ for all $t \in \{1,\ldots,n\}$. The corruption level at time $t$ is measured as $c_t = \max_{x\in X} |\eta^{*x} - \eta^x_t|$, and the amount of corruption during any time interval $I$ as $C_I = \sum_{t\in I} c_t$. Correspondingly, we define $C_{\text{total}} = C_{[0,n]}$. Then, Nature draws $x_t \sim \nu^*$ for each $t \in \{1,\ldots,n\}$, so that each $x_t$ is independent of whether $y_t$ was potentially corrupted or not. One notable case of the oblivious model is the $\gamma$-misspecification model. In the binary classification setting, it is equivalent to $\eta^x_t = (1-\gamma)\eta^{*x} + \gamma\tilde\eta^x_t$ for all $x, t$, where $\tilde\eta^x_t$ can be an arbitrary probability. Such a label contamination model can be regarded as a special case of corruption where, for each $t$, $c_t = \max_x |\eta^x_t - \eta^{*x}| = \gamma\max_x |\tilde\eta^x_t - \eta^{*x}| \le \gamma$. Moreover, our main algorithm actually works for a non-oblivious adversary. In this more challenging case, at each time $t$ the adversary adaptively decides $\eta^x_t$ before seeing the actual $x_t$, based on all of the previous history.

Other notations: For convenience, we write $R_t(h)$ for $R_{D_t}(h)$, $R^*(h)$ for $R_{D^*}(h)$, $\rho_t(h,h')$ for $\rho_{D_t}(h,h')$ and $\rho^*(h,h')$ for $\rho_{D^*}(h,h')$. We also define an average expected risk that will be used frequently in our analysis, $R_I(h) = \frac{1}{|I|}\sum_{t\in I} R_t(h)$. In addition, we define $h^* = \arg\min_h R^*(h)$, $R^* = R^*(h^*)$, and the gap of a suboptimal classifier $h$ as $\Delta_h = R^*(h) - R^*$.

Disagreement coefficient: For some hypothesis class $H$ and subset $V \subseteq H$, the region of disagreement is defined as $\text{Dis}(V) = \{x \in X : \exists\, h, h' \in V \text{ s.t. } h(x) \neq h'(x)\}$, which is the set of unlabeled examples $x$ for which there are hypotheses in $V$ that disagree on how to label $x$. Correspondingly, the disagreement coefficient of $h^* \in H$ with respect to a hypothesis class $H$ and
distribution $\nu^*$ is defined as $\theta^*(r_0) = \sup_{r \ge r_0} \frac{P_{x\sim\nu^*}(x \in \text{Dis}(B(h^*, r)))}{r}$, where $B(h^*, r) = \{h \in H : \rho^*(h, h^*) \le r\}$.

## 3 Passive Learning in the Corrupted Setting

We first analyze the performance of empirical risk minimization (ERM) for passive learning in the corrupted setting as a benchmark.

Theorem 3.1 (Passive Learning). After $n$ labeled samples, if $h_{\text{out}} = \arg\min_h \sum_{t=1}^n \mathbf 1\{h(x_t) \neq y_t\}$ is the empirical risk minimizer, then with probability at least $1-\delta$ we have
$$R^*(h_{\text{out}}) - R^* \le \left(\sqrt{\frac{8R^*\log(|H|/\delta)}{n}} + \frac{8C_{\text{total}}}{n}R^* + \frac{5\log(|H|/\delta)}{n}\right)\left(1 - \frac{4C_{\text{total}}}{n}\right)^{-1}.$$
This implies that, as long as $C_{\text{total}}$ is smaller than some fraction of $n$, e.g., $C_{\text{total}} \le \frac{n}{8}$, we can obtain $R^*(h_{\text{out}}) - R^* \le \varepsilon + \frac{C_{\text{total}}}{n}R^*$ whenever $n \ge \frac{2\log(|H|/\delta)}{\varepsilon} + \frac{8R^*\log(|H|/\delta)}{\varepsilon^2}$.

Algorithm 1 Robust CAL (modified elimination condition)
1: Input: confidence parameter $\delta$
2: for $t = 1, 2, \ldots, n$ do
3: Nature reveals unlabeled data point $x_t$
4: if $x_t \in \text{Dis}(V_t)$ then
5: Query $y_t$ and set $\hat l_t(h) = \mathbf 1\{h(x_t) \neq y_t\}$ for all $h \in H$
6: end if
7: if $\log_2(t) \in \mathbb N$ then
8: Set $\hat L_t(h) = \frac{1}{t}\sum_{s\le t}\hat l_s(h)$ and $\hat h_t = \arg\min_{h\in V_t}\hat L_t(h)$
9: Set $\hat\rho_t(h,h') = \frac{1}{t}\sum_{s\le t}\mathbf 1\{h(x_s)\neq h'(x_s)\}$ and $\beta_t = \log(3\log(t)|H|^2/\delta)$
10: Set $V_{t+1} = \Big\{h \in V_t : \hat L_t(h) - \hat L_t(\hat h_t) \le \sqrt{\frac{2\beta_t\,\hat\rho_t(h,\hat h_t)}{t}} + \frac{1}{2}\hat\rho_t(h,\hat h_t)\Big\}$
11: else
12: $V_{t+1} = V_t$, $\beta_{t+1} = \beta_t$
13: end if
14: end for
15: Output: $\arg\min_{h\in V_t}\hat L_t(h)$

Proof Sketch. By Bernstein's inequality and the definition of corruptions, we can get
$$R^*(h_{\text{out}}) - R^* \le \frac{4C_{\text{total}}}{n}\max\{R^*(h_{\text{out}}) - R^*,\, 2R^*\} + 4\sqrt{\frac{\log(|H|/\delta)\max\{R^*(h_{\text{out}}) - R^*,\, 2R^*\}}{n}} + \frac{\log(|H|/\delta)}{n}.$$
Then we can directly get the result by solving this inequality. We postpone the details to Appendix B. In addition to providing a benchmark, this passive learning result also inspires our analysis of Robust CAL in the corrupted setting, as we show in the next section.

## 4 Robust CAL in the Corrupted Setting

We restate the classical Robust CAL [Balcan et al., 2009, Dasgupta et al., 2007, Hanneke, 2014] in Algorithm 1 with a slight enlargement of the confidence threshold used in the elimination condition (Line 10).
The additional term $\frac{1}{2}\hat\rho_t(h,\hat h_t)$ ensures robustness because each $R_t(h) - R_t(h')$ can be perturbed by at most $2\rho^*(h,h')c_t$. In the theorem below we show that the algorithm can achieve a label complexity similar to that of the non-corrupted setting as long as the growth rate of the corruptions is at most a certain fraction of the number of unlabeled samples.

Theorem 4.1. Suppose that $C_{[0,t]} \le \frac{t}{8}$ for all $t$ with $\log_2(t) \in \mathbb N$; for example, this holds in the $(1/8)$-misspecification model. Then with probability at least $1-\delta$, for any $n \ge \frac{8R^*}{\varepsilon}\log(\log(n)|H|^2/\delta)$, we have $R^*(h_{\text{out}}) - R^* \le \varepsilon + O\big(\frac{R^* C_{\text{total}}}{n}\big)$ with label complexity at most
$$O\left(\theta^*\Big(14R^* + \frac{120\log(\log(n)|H|/\delta)}{n}\Big)\,\log(\log(n)|H|^2/\delta)\,\big(R^* n + \log(n)\big)\right).$$

Remark 4.1. In Appendix C.2, we show the necessity of enlarging the threshold in Line 10 from the original $\big\{h \in V_t : \hat L_t(h) - \hat L_t(\hat h_t) \le \sqrt{2\beta_t\hat\rho_t(h,\hat h_t)/t}\big\}$ by giving a counter-example. The counter-example shows that, when $R^* \to 0$, the best hypothesis can be eliminated under the original condition even when the "$C_{[0,t]} \le \frac{t}{8}$ for all $t$ with $\log_2(t)\in\mathbb N$" assumption is satisfied.

Proof Sketch. Correctness follows readily from Bernstein's inequality. For the sample complexity, Theorem 3.1 implies that, for any interval $[0,t]$ with $C_{[0,t]} \le \frac{t}{8}$, the learner can always identify hypotheses which are $O\big(R^* + \frac{\beta_t}{t}\big)$-optimal. Therefore, the probability of a query satisfies
$$P\big(x_{t+1} \in \text{Dis}(V_{t+1})\big) \le P\Big(\exists\, h \in V_{t+1} : h(x_{t+1}) \neq h^*(x_{t+1}),\ \Delta_h \le O\big(R^* + \tfrac{\beta_t}{t}\big)\Big).$$
Then by a standard analysis we can connect this disagreement probability with the disagreement coefficient to get the final bound. One thing to note is that, at first glance, $\hat\rho_t(h,\hat h_t)$ might be much larger than the other two terms since it can approach 1, which could render a worse label complexity. Here we give an intuitive explanation of why this threshold is fine: if $\hat\rho_t(h,\hat h_t)$ is close to $|R(h) - R(\hat h_t)|$, then we can achieve the inequality above by using some self-bounding techniques; if $\hat\rho_t(h,\hat h_t)$ is close to $R^*$, then we directly get an $R^*$-dependent term in the target bound.
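One elimination round of Algorithm 1 can be sketched as follows. Caveat: the exact $\sqrt{2\beta_t\hat\rho/t} + \frac{1}{2}\hat\rho$ form of the enlarged threshold and all constants are our reading of the pseudo-code, so treat this as illustrative rather than the paper's exact rule.

```python
import math

def eliminate(V, L_hat, rho_hat, t, beta_t):
    """One elimination round in the spirit of Algorithm 1, Line 10 (sketch).
    V: current version space (hypothesis ids); L_hat[h]: empirical risk;
    rho_hat[(h, h2)]: empirical disagreement.  The extra (1/2)*rho_hat slack
    is the enlargement meant to tolerate a C_[0,t] <= t/8 corruption budget."""
    h_best = min(V, key=lambda h: L_hat[h])
    keep = []
    for h in V:
        rho = rho_hat[(h, h_best)]
        threshold = math.sqrt(2 * beta_t * rho / t) + 0.5 * rho  # enlarged
        if L_hat[h] - L_hat[h_best] <= threshold:
            keep.append(h)
    return keep
```

Without the `0.5 * rho` slack, a hypothesis whose excess empirical risk is due purely to corrupted labels (at most $2\rho\cdot C_{[0,t]}/t \le \rho/4$ per-round on average under the budget) could be wrongly eliminated.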
The full proof is deferred to Appendix C.1.

Comparison between the modified Robust CAL and passive learning: Assume the disagreement coefficient is a constant. In the non-corrupted case, the algorithm achieves the same performance guarantee as the vanilla Robust CAL. In the corrupted case, we still get the same accuracy as in Theorem 3.1 with at most $\tilde O(R^* n + \log(n))$ labels, which is the same as in the non-corrupted case.

Discussion on the "$C_{[0,t]} \le \frac{t}{8}$ for all $t$ with $\log_2(t)\in\mathbb N$" condition: This condition covers the $(1/8)$-misspecification model as defined in Section 2, since there $C_I \le \frac{|I|}{8}$ for any $I$. But the condition does not cover the case where an adaptive poisoning adversary corrupts all the labels at an early stage and stops corrupting later; this still keeps the total amount of corruption small, but will clearly mislead the algorithm into eliminating the true best hypothesis $h^*$. In Section 5, we show a more general result that applies to scenarios beyond the $C_{[0,t]} \le \frac{t}{8}$ condition.

## 5 Main algorithm - CALruption

### 5.1 Algorithm

In this section we describe our new algorithm, CALruption. The pseudo-code is listed in Algorithm 2. Our previous analysis showed that in the agnostic setting the classical Robust CAL may permanently eliminate the best hypothesis due to the presence of corruptions. To fix this problem, in our CALruption algorithm the learner never makes a "hard" decision to eliminate any hypothesis. Instead, it assigns a different query probability $q^x_l$ to each $x$ based on the estimated gap of each hypothesis, as shown in lines 4 and 5, which can be regarded as "soft elimination". With this step, the key question becomes how to connect the estimated gaps with the query probability $q^x_l$. We adopt the idea from the BARBAR algorithm proposed by Gupta et al. [2019], which was originally designed for multi-armed bandits (MAB). Instead of permanently eliminating a hypothesis, the learner continues pulling each arm with a certain probability defined by its estimated gap.
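The gap-to-query-probability idea borrowed from BARBAR can be sketched as follows. This is a deliberate simplification of the layered construction in Algorithm 2 (lines 12-13): the per-hypothesis normalization `rho_hat[h] / (N * gap**2)` and all names are ours, for illustration only.

```python
def query_probability(x, H, predict, gap_hat, rho_hat, h_best, eps_floor, N):
    """Soft elimination (sketch): a large estimated gap shrinks, but never
    zeroes, the probability of querying x in h's disagreement region, so a
    gap estimate corrupted by the adversary can later be revised.
    predict[h] is the classifier for id h; gap_hat[h] its estimated gap;
    rho_hat[h] its estimated disagreement with the empirical best h_best."""
    q = 0.0
    for h in H:
        if predict[h](x) != predict[h_best](x):   # x informative for h vs h_best
            gap = max(gap_hat[h], eps_floor)      # clip at the current scale
            q = max(q, min(1.0, rho_hat[h] / (N * gap ** 2)))
    return q
```

The `1 / gap**2` decay mirrors BARBAR's arm-sampling rate: a hypothesis believed to be bad is revisited rarely, but often enough that a corruption-induced overestimate of its gap is eventually corrected.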
However, the original BARBAR algorithm focuses on estimating the reward of each individual arm. This aligns with its MAB feedback structure, where only information about the pulled arm is gained at each time step. In the active learning setting, we instead focus on the differences between the risks of different hypotheses, because each time we request a label, the values of all hypotheses are updated. Therefore, we implement a more involved strategy to calculate the query probability at the end of each epoch $l$, as shown in lines 7 to 13. In line 7, we estimate the disagreement probability for each hypothesis pair $(h, h')$ with an empirical quantity that upper bounds the expectation. In line 8, instead of estimating the value of each hypothesis, we estimate the gap between each hypothesis pair $(h, h')$, denoted $W^{h,h'}_l$, by any $\delta$-robust estimator that satisfies eq. (1). One example of a $\delta$-robust estimator is the Catoni estimator [Lugosi and Mendelson, 2019]. Note that the simple empirical estimator can suffer from rare but large deviations, as discussed in the "stochastic rounding" section of Camilleri et al. [2021]. But what we truly care about is the gap between any hypothesis $h$ and the best hypothesis $h^*$. Therefore, inspired by Camilleri et al. [2021], we construct such an estimate using $W^{h,h'}_l$, as shown in lines 9 to 11. Finally, we divide the hypothesis set into several layers based on the estimated gaps and set the query probability for each $x$ based on the hypothesis layers, as shown in lines 12 and 13. For a more detailed explanation of lines 9-13, please refer to Appendix D.

Algorithm 2 CALruption
1: Initialize: $\beta_3 = 2\log(\frac{3}{2}\log(n)|H|^2/\delta)$, $\beta_1 = 32\,640\,\beta_3$, $\beta_2 = \frac{5}{32}$, $\epsilon_i = 2^{-i}$, $N_l = \beta_1\epsilon_l^{-2}$, $\hat\Delta^0_h = 0$, $V^0_1 = H$, $\tau_1 = 1$, and $q^x_l = 1$ for all $x \in X$
2: for $t = 1, 2, \ldots, n$ do
3: Nature reveals unlabeled data point $x_t$
4: Draw $Q_t \sim \text{Ber}(q^{x_t}_l)$ and request $y_t$ if $Q_t = 1$.
5: Set the estimated loss for all $h \in H$ as $\hat\ell_t(h) = \frac{\mathbf 1\{h(x_t) \neq y_t\}}{q^{x_t}_l}Q_t$
6: if $t = \tau_l + N_l - 1$ then
7: Set $\hat\rho_l(h,h') = \frac{1}{N_l}\sum_{t\in I_l}\mathbf 1\{h(x_t)\neq h'(x_t)\}$ for all $h, h' \in H$
8: For each $(h,h')$, set $W^{h,h'}_l = \text{RobustEstimator}\big(\{\hat\ell_t(h) - \hat\ell_t(h')\}_{t\in I_l}\big)$, which satisfies, with probability at least $1-\delta$,
$$\big|(\hat R_l(h) - \hat R_l(h')) - W^{h,h'}_l\big| \le \sqrt{\frac{10\beta_3\,\hat\rho_l(h,h')}{N_l \min_{x\in\text{Dis}(h,h')} q^x_l}}, \qquad (1)$$
where $\hat R_l(h) = \frac{1}{|I_l|}\sum_{t\in I_l}\mathbb E_{y\sim\text{Ber}(\eta^{x_t}_t)}\big[\mathbf 1\{h(x_t)\neq y\}\big]$.
9: Set $\hat D_l = \arg\min_D \max_{h,h'\in H}\big|R_D(h) - R_D(h') - W^{h,h'}_l\big|\sqrt{\frac{\min_{x\in\text{Dis}(h,h')} q^x_l}{\hat\rho_l(h,h')}}$
10: Set $\hat h_l = \arg\min_{h\in H} R_{\hat D_l}(h) + \beta_2\hat\Delta^{l-1}_h$
11: Set $\hat\Delta^l_h = \max\big\{\epsilon_l,\ R_{\hat D_l}(h) - R_{\hat D_l}(\hat h_l) + \beta_2\hat\Delta^{l-1}_{\hat h_l}\big\}$
12: Construct $V^i_{l+1}$ for all $i = 0, 1, 2, \ldots, l$, such that $h \in V^i_{l+1}$ if $\hat\Delta^l_h \le \epsilon_i$ and $h \notin V^i_{l+1}$ if $\hat\Delta^l_h > \epsilon_i$. Therefore, $V^l_{l+1} \subseteq V^{l-1}_{l+1} \subseteq \cdots \subseteq V^0_{l+1}$
13: Calculate the query probability $q^x_{l+1}$ for each $x$ as follows:
$$Z(x) = \{(h,h') \in H \times H \mid x \in \text{Dis}(\{h,h'\})\}, \qquad k(h,h',l+1) = \max\{i \mid h, h' \in V^i_{l+1}\},$$
$$q^x_{l+1} = \max_{(h,h')\in Z(x)} \frac{\beta_1\hat\rho_l(h,h')}{N_{l+1}}\,\epsilon^{-2}_{k(h,h',l+1)}$$
14: Set $\tau_{l+1} = \tau_l + N_l$ and denote epoch $l$ as $I_l = [\tau_l, \tau_{l+1}-1]$. Set $l \leftarrow l+1$ and go to the next epoch
15: end if
16: end for
17: Output: any $h \in V^{l-1}_l$

Remark 5.1. In line 9, instead of optimizing over all possible distributions $D$, we actually just need to estimate $\eta^x$ for all $x \in \{x_t\}_{t\in I_l}$ and set the $x$-marginal of $D$ to the empirical distribution of $x$ inside $I_l$.

Theorem 5.1 (CALruption). With $n \ge 72\varepsilon^{-2}\beta_1$ unlabeled samples, with probability at least $1-\delta$ we obtain an $h_{\text{out}}$ satisfying
$$R^*(h_{\text{out}}) - R^* \le \varepsilon + \frac{24\,C_{\text{total}}}{n}$$
with label complexity at most
$$O\left(\theta^*\Big(R^* + \frac{64\,C_{\text{total}}}{n}\Big)\,\log(\log(n)|H|^2/\delta)\,\big((R^*)^2 n + \log(n)(1 + \bar C_{\text{total}})\big)\right),$$
where
$$\bar C_{\text{total}} = \sum_{l=1}^{\log_4(n/\beta_1)} C_{\text{epoch } l}\left(R^*\,\mathbf 1\Big\{\frac{C_{\text{epoch } l}}{N_l} \le \frac{1}{32}\Big\} + \mathbf 1\Big\{\frac{C_{\text{epoch } l}}{N_l} > \frac{1}{32}\Big\}\right)$$
and $\beta_1$ is the constant defined in Algorithm 2. Note that the epochs are prescheduled and not algorithm-dependent.

Corollary 5.1.
Suppose the corruptions satisfy $\frac{C_{\text{epoch } l}}{N_l} \le \frac{1}{32}$ for all epochs, as in, for example, the $(1/32)$-misspecification case. Then for any $n \ge 72\varepsilon^{-2}\beta_1$ unlabeled samples, with probability at least $1-\delta$ we obtain an $h_{\text{out}}$ satisfying
$$R^*(h_{\text{out}}) - R^* \le \varepsilon + \frac{24\,R^* C_{\text{total}}}{n}$$
with label complexity at most
$$O\left(\theta^*\Big(R^* + \frac{64\,R^* C_{\text{total}}}{n}\Big)\,\log(\log(n)|H|/\delta)\,\big((R^*)^2 n + (R^* C_{\text{total}} + 1)\log(n)\big)\right).$$

Comparison of CALruption with passive learning: Consider the case where $\theta^*(\cdot)$ is of lower order, e.g., a constant. Corollary 5.1 shows that, when $\frac{C_{\text{epoch } l}}{N_l} \le \frac{1}{32}$ for all epochs, our algorithm achieves a similar accuracy $O\big(\varepsilon + \frac{R^* C_{\text{total}}}{n}\big)$ as in the passive learning case, while only requiring $\tilde O\big((R^*)^2 n + \log(n)(1 + R^* C_{\text{total}})\big)$ labels, for $n \gtrsim \frac{1}{\varepsilon^2}$. So if we set $n = \tilde O\big(\frac{1}{\varepsilon^2}\big)$, the label complexity becomes $\tilde O\big(\frac{(R^*)^2}{\varepsilon^2} + \log(1/\varepsilon)(1 + R^* C_{\text{total}})\big)$, which matches the minimax label complexity in the non-corrupted case. Going beyond the $\frac{C_{\text{epoch } l}}{N_l} \le \frac{1}{32}$ constraint, the general Theorem 5.1 shows that, for $n \gtrsim \frac{1}{\varepsilon^2}$, our algorithm achieves accuracy $O\big(\varepsilon + \frac{C_{\text{total}}}{n}\big)$ while only requiring $\tilde O\big((R^*)^2 n + \log(n) + C_{\text{total}}\big)$ labels, no matter how the corruptions are allocated. When $R^*$ is a constant, this result is similar to Corollary 5.1. Moreover, we argue that upper-bounding $\bar C_{\text{total}}$ by $C_{\text{total}}$ is loose, and in many cases $\bar C_{\text{total}}$ will be close to $R^* C_{\text{total}}$ rather than $C_{\text{total}}$. We show one example in the paragraph below.

When is CALruption better than modified Robust CAL? Consider the case where the adversary fully corrupts some early epoch and then performs corruptions satisfying $\frac{C_{\text{epoch } l}}{N_l} \le \frac{1}{32}$ for the remaining epochs. Then the modified Robust CAL will mistakenly eliminate $h^*$, so it can never achieve the target guarantee when $\varepsilon < \min_{h\in H}\Delta_h$, while CALruption is guaranteed to output the correct hypothesis. Moreover, according to Theorem 5.1, since the total amount of early-stage corruption is small, $\bar C_{\text{total}}$ here is close to $R^* C_{\text{total}}$, which implies a result similar to Corollary 5.1.
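As noted in Section 5.1, the $\delta$-robust estimator in line 8 of Algorithm 2 can be instantiated with a Catoni-style mean estimator. Below is a minimal bisection-based sketch of the standard Catoni construction; the parameterization and bracket choices are ours, not the paper's exact instantiation.

```python
import math

def catoni_psi(x):
    # Catoni's influence function: psi(x) = sgn(x) * log(1 + |x| + x^2 / 2)
    return math.copysign(math.log(1 + abs(x) + x * x / 2), x)

def catoni_mean(samples, alpha, lo=-1e6, hi=1e6, iters=100):
    """Catoni mean estimate: the theta solving sum_i psi(alpha*(z_i - theta)) = 0,
    found by bisection (the score below is decreasing in theta).  alpha trades
    off bias against tail-robustness; lo/hi are arbitrary bracketing bounds."""
    def score(theta):
        return sum(catoni_psi(alpha * (z - theta)) for z in samples)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if score(mid) > 0:   # score still positive, so the root lies above mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

On 99 clean samples at 1.0 plus one outlier at 1000.0, the naive mean is about 11, while the Catoni estimate stays near the clean value; this resistance to rare large deviations is the behavior eq. (1) needs from the gap estimator.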
When is CALruption worse than modified Robust CAL? Consider the case where the total amount of corruption, instead of being fixed, increases with the number of incoming unlabeled samples, as in the misspecification case. Then $C_{\text{total}}$ for the modified Robust CAL can be $O\big(\frac{R^*}{\varepsilon}\big)$ while $C_{\text{total}}$ for CALruption can grow to $O\big(\frac{1}{\varepsilon^2}\big)$. This gap comes from the extra unlabeled sample complexity, which we discuss in the paragraph below.

Discussion on the extra unlabeled sample complexity: We note that we require a larger number of unlabeled samples than ERM in the passive learning setting. Here we explain the reason. Consider the version spaces $V^{l-1}_l$ for any fixed epoch $l$. In the non-corrupted setting, this version space serves a purpose similar to the active hypothesis set in Robust CAL. In Robust CAL, the elimination threshold is about $\tilde O\big(\sqrt{\rho^*(h,h')/t}\big)$ (or $\tilde O\big(\sqrt{\rho^*(h,h')/t} + \rho^*(h,h')\big)$ in our modified version), while in our CALruption the threshold is about $\tilde O\big(\sqrt{1/t}\big)$, which is more conservative than Robust CAL and leads to the extra unlabeled sample complexity. The reason for being conservative here is that we need more samples to weaken the effect of the corruptions on our estimates. Whether this extra unlabeled sample complexity is unavoidable remains an open problem.

### 5.2 Proof sketch for Theorem 5.1

Here we provide the main steps of the proof and postpone the details to Appendix E. First we show a key lemma which guarantees the closeness between $\hat\Delta^l_h$ and $\Delta_h$ for all $l$ and $h$.

Lemma 5.1 (Upper and lower bounds for all estimates). With probability at least $1-\delta$, for all epochs $l$ and all $h \in H$,
$$\hat\Delta^l_h \le 2\,(\Delta_h + \epsilon_l + g_l), \qquad \Delta_h \le \frac{3}{2}\hat\Delta^l_h + 3g_l,$$
where
$$g_l = \frac{2}{\beta_1}\epsilon_l^2\sum_{s=1}^{l} C_{I_s}\left(2R^*\,\mathbf 1\Big\{\frac{2C_{I_s}}{N_s} \le \frac{1}{16}\Big\} + \mathbf 1\Big\{\frac{2C_{I_s}}{N_s} > \frac{1}{16}\Big\}\right).$$
Here the $g_l$ term implies that, as long as the total corruption is sublinear in $n$, the misleading effect on the gap estimates fades as the number of unlabeled samples increases. Based on this lemma, we directly get another useful lemma.

Lemma 5.2.
For all epochs $l$ and layers $j$, we have $\max_{h\in V^j_l}\rho^*(h,h^*) \le 2R^* + 3\epsilon_j + 3g_{l-1}$.

In the following we first deal with correctness and then with the sample complexity.

Correctness. By Lemma 5.1, we have
$$\Delta_{h_{\text{out}}} \le \frac{3}{2}\hat\Delta^{L-1}_{h_{\text{out}}} + 3g_{L-1} \le \frac{3}{2}\epsilon_{L-1} + 3g_{L-1} \le 6\sqrt{\frac{\beta_1}{n}} + \frac{24\,C_{\text{total}}}{n}.$$

Sample complexity. For any $t \in I_l$, recalling that $q^x_l = \max_{(h,h')\in Z(x)}\frac{\beta_1\hat\rho_{l-1}(h,h')}{N_l}\epsilon^{-2}_{k(h,h',l)}$, the probability of $x_t$ being queried ($Q_t = 1$) satisfies
$$\mathbb E[Q_t] \le 10\,\frac{\beta_1}{N_l}\sum_{x\in X} P(x_t = x)\,\max_{h\in V^{j^x_l}_l}\rho^*(h,h^*)\,\epsilon^{-2}_{j^x_l} + \frac{8}{\beta_1}$$
$$\le 10\,\frac{\beta_1}{N_l}\sum_{x\in X} P(x_t = x)\Big(2R^*\epsilon^{-2}_{j^x_l} + 3\epsilon^{-1}_{j^x_l} + 3g_{l-1}\epsilon^{-2}_{j^x_l}\Big) + \frac{8}{\beta_1}$$
$$\le 10\,\frac{\beta_1}{N_l}\sum_{i=0}^{l}\Big(2R^*\epsilon^{-2}_i + 3\epsilon^{-1}_i + 3g_{l-1}\epsilon^{-2}_i\Big)P\big(x \in \text{Dis}(V^i_l)\big) + \frac{8}{\beta_1}.$$
Here $j^x_l$ is an arbitrary mapping from $X$ to $[l]$, which is formally defined in the detailed version in Appendix E.6. The first inequality comes from the closeness of the estimate $\hat\rho_l(h,h')$ to the true $\rho^*(h,h')$, together with some careful relaxations. The second inequality comes from Lemma 5.2. Now we can use standard techniques to upper bound $P(x \in \text{Dis}(V^i_l))$ as follows:
$$P\big(\exists\, h \in V^i_l : h(x) \neq h^*(x)\big) \le P\big(\exists\, h \in H : h(x) \neq h^*(x),\ \rho^*(h,h^*) \le 2R^* + 3\epsilon_i + 3g_{l-1}\big) \le \theta^*(2R^* + 3\epsilon_i + 3g_{l-1})\,(2R^* + 3\epsilon_i + 3g_{l-1}),$$
where again the first inequality comes from Lemma 5.2. Again, we postpone the full version to Appendix E.6. Combining the above results with the facts that $g_l \le \frac{2}{\beta_1}\epsilon_l^2\bar C_{l-1}$ and $\bar C_{l-1} \le \sum_{s=1}^{l-1} C_{I_s} \le 2\beta_1\epsilon^{-2}_{l-1}$, we get the expected number of queries inside a complete epoch $l$:
$$\sum_{t\in I_l}\mathbb E[Q_t] \le 20\beta_1\,\theta^*(2R^* + 3\epsilon_{l-1} + g_{l-1})\Big(4(R^*)^2\epsilon^{-2}_l + 12R^*\epsilon^{-1}_l + \frac{132}{\beta_1}\bar C_{l-1} + 10\Big).$$
Finally, summing over all $L = \frac{1}{2}\log_2(n/\beta_1)$ epochs, for any $n$ we get the target label complexity.

## 6 Conclusion and future works

In this work we analyzed an existing active learning algorithm in the corrupted setting, showed when it fails, and designed a new algorithm that resolves this drawback. Relative to Robust CAL, our algorithm requires a larger number of unlabeled samples. One natural question is to design a corruption-robust algorithm which requires the same number of unlabeled samples as Robust CAL in the non-corrupted setting.
Another open question is the following: the $O\big(\varepsilon + \frac{C_{\text{total}}}{n}\big)$ accuracy of Algorithm 2 in the general corruption case is generally worse than the $O\big(\varepsilon + \frac{R^* C_{\text{total}}}{n}\big)$ accuracy of passive learning. Although we have argued that these two bounds are close in many cases, it is natural to ask whether there exists an alternative algorithm or analysis yielding a smoother final bound that interpolates between the different corruption regimes. Finally, we believe it is possible to replace the Catoni estimator with a naive importance-sampling estimator, and to further simplify the layered active hypothesis set construction by adopting the idea of Beygelzimer et al. [2010]. We plan to work in this direction in the future.

Acknowledgements

SSD gratefully acknowledges funding from NSF Awards IIS-2110170 and DMS-2134106.

References

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014. ISSN 1935-8237, 1935-8245. doi: 10.1561/2200000037. URL http://www.nowpublishers.com/articles/foundations-and-trends-in-machine-learning/MAL-037.

Ajay J. Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2372–2379, 2009. doi: 10.1109/CVPR.2009.5206627.

Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G. Hauptmann. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113–127, 2015.

William H. Beluch, Tim Genewein, Andreas Nürnberger, and Jan M. Köhler. The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9368–9377, 2018.

Brad Miller, Alex Kantchelian, Sadia Afroz, Rekha Bachwani, Edwin Dauber, Ling Huang, Michael Carl Tschantz, Anthony D. Joseph, and J. Doug Tygar. Adversarial active learning.
In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop, pages 3–14, 2014.

Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 449–458, 2014.

Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 10477–10486, 2018.

Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama. How does disagreement help generalization against label corruption? In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7164–7173. PMLR, 2019.

Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 1562–1578. PMLR, 2019.

Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 467–475. PMLR, 2019.

Chen-Yu Wei, Haipeng Luo, and Alekh Agarwal. Taking a hint: How to leverage loss predictors in contextual bandits?, 2020.

Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Corruption robust exploration in episodic reinforcement learning, 2020.

Yifang Chen, Simon S. Du, and Kevin Jamieson. Improved corruption robust algorithms for episodic reinforcement learning, 2021a.

Chen-Yu Wei, Christoph Dann, and Julian Zimmert. A model selection approach for corruption robust reinforcement learning, 2021.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford.
Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.

Sanjoy Dasgupta, Daniel J. Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. Advances in Neural Information Processing Systems, 20:353–360, 2007.

Burr Settles. From theories to queries: Active learning in practice. In Active Learning and Experimental Design Workshop (in conjunction with AISTATS 2010), pages 1–18, 2011.

Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. Advances in Neural Information Processing Systems, 27:442–450, 2014.

Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. The Journal of Machine Learning Research, 11:2457–2485, 2010.

Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35–50. Springer, 2007.

Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288–316, 2013.

Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, pages 2755–2763, 2015.

Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 49–56, 2009.

Alina Beygelzimer, Daniel J. Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, pages 199–207, 2010.

Daniel Joseph Hsu. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.

Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daumé III, and John Langford. Active learning for cost-sensitive classification. In International Conference on Machine Learning, pages 1915–1924. PMLR, 2017.
Yue Deng, Ka Wai Chen, Yilin Shen, and Hongxia Jin. Adversarial active learning for sequences labeling and generation. In IJCAI, pages 4012–4018, 2018.

Lei Pi, Zhuo Lu, Yalin Sagduyu, and Su Chen. Defending active learning against adversarial inputs in automated document classification. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 257–261. IEEE, 2016.

Ashish Khetan and Sewoong Oh. Achieving budget-optimality with adaptive schemes in crowdsourcing. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4851–4859, 2016.

Liu Yang. Active learning with a drifting distribution. In NIPS, pages 2079–2087, 2011.

Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. The Journal of Machine Learning Research, 13(1):2655–2697, 2012.

Steve Hanneke and Liu Yang. Toward a general theory of online selective sampling: Trading off mistakes and queries. In International Conference on Artificial Intelligence and Statistics, pages 3997–4005. PMLR, 2021.

Yining Chen, Haipeng Luo, Tengyu Ma, and Chicheng Zhang. Active online learning with hidden shifting domains. In International Conference on Artificial Intelligence and Statistics, pages 2053–2061. PMLR, 2021b.

Piyush Rai, Avishek Saha, Hal Daumé III, and Suresh Venkatasubramanian. Domain adaptation meets active learning. In Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, pages 27–32, 2010.

Eric Zhao, Anqi Liu, Animashree Anandkumar, and Yisong Yue. Active learning under label shift. In International Conference on Artificial Intelligence and Statistics, pages 3412–3420. PMLR, 2021.

Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang, and Xiaojin Zhang. Achieving near instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously, 2021.
Romain Camilleri, Julian Katz-Samuels, and Kevin Jamieson. High-dimensional experimental design and kernel bandits, 2021.

Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.