Active Hedge: Hedge meets Active Learning

Bhuvesh Kumar (Georgia Institute of Technology), Jacob Abernethy (Georgia Institute of Technology), Venkatesh Saligrama (Department of Electrical and Computer Engineering, Boston University). Correspondence to: Bhuvesh Kumar.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We consider the classical problem of multiclass prediction with expert advice, but with an active learning twist. In this new setting the learner will only query the labels of a small number of examples, but still aims to minimize regret to the best expert as usual; the learner is also allowed a very short burn-in phase where it can fast-forward and query certain highly-informative examples. We design an algorithm that utilizes Hedge (aka Exponential Weights) as a subroutine, and we show that under a very particular combinatorial constraint on the matrix of expert predictions we can obtain a very strong regret guarantee while querying very few labels. This constraint, which we refer to as ζ-compactness, or just compactness, can be viewed as a non-stochastic variant of the disagreement coefficient, another popular parameter used to reason about the sample complexity of active learning in the IID setting. We also give a polynomial time algorithm to calculate the ζ-compactness of a matrix up to an approximation factor of 3.

1. Introduction

The problem of multiclass prediction with expert advice has emerged as a simple yet powerful framework for reasoning about sequential decision tasks. We imagine we have a set of N experts and K possible outcomes at each round, where each expert j makes a prediction X_{t,j} ∈ [K] at time t about an unknown label y_t ∈ [K]. Our learning task is to emit our own estimate ŷ_t ∈ Δ_K of y_t that takes into account the advice of each expert along with their historical performance up to this time point. The simple goal is: can we predict well, in the long run, relative to the expert who performs optimally over the full sequence of predictions, despite not knowing in advance which expert is best? Moreover, what can we guarantee even when some of these experts may be predicting in an arbitrary or perhaps adversarial fashion? These questions have received a great deal of attention over the past two decades.

The classical algorithm for this problem is commonly known as Hedge (Freund & Schapire, 1995), although variants are often referred to as exponential weights or weighted majority. While we give a precise description in Algorithm 1, Hedge is quite simple to explain in words: the algorithm combines the predictions of all the experts on a given round by taking their weighted average, where the weight of an expert decays exponentially in the number of previous mistakes. Important details must be addressed, such as the exponential decay factor and what to do with fractional predictions, but a great deal of research has made one point very clear: Hedge is essentially the minimax optimal algorithm for the problem of prediction with expert advice.

One of the downsides of Hedge, as with many online learning algorithms, is that it is not label efficient: the learning process requires that we observe the target y_t on each round. Obtaining individual labels can, quite often, be very expensive for the learner; indeed this is central to why we design prediction algorithms in the first place.
Active learning, which refers broadly to a family of frameworks in which the learning algorithm can make selective label queries, is designed precisely with the goal of minimizing the number of needed labels while achieving suitable learning performance. The key idea is that we do not necessarily need a batch of labelled examples prior to training; in many natural scenarios the algorithm may be able to actively engage with the labelling process to query labels on a set of unlabelled examples. The classical binary search algorithm is, in some sense, an active learning algorithm for finding an element in a sorted list. It would be hard to argue against the wealth of empirical results showing the benefits of active learning (Settles, 2011; Nguyen & Smeulders, 2004; Wang & Hua, 2011; Kapoor et al., 2007; Li & Guo, 2013). At the same time, while the label-efficiency gains achieved using this learning model have been studied theoretically in a range of scenarios (Hanneke, 2007; Zhang, 2018; Hanneke & Yang, 2012; Hanneke, 2011; Kulkarni et al., 1993; Koltchinskii, 2006; Freund et al., 1997; Dasgupta et al., 2008), our progress towards a full-fledged mathematical foundation for active learning has been relatively slow. A persistent challenge is that precisely identifying scenarios in which active label querying can provide provable benefits, versus those where it necessarily can not, has proven quite difficult (Zhang, 2018; Hanneke, 2011). The one notable exception is disagreement-based active learning (Hanneke, 2014): it has been shown that, as long as the binary hypothesis class possesses a particular property with respect to the underlying probability distribution, known as the disagreement coefficient, a recursive algorithm can zoom in on the optimal hypothesis and achieve faster learning with lower label complexity. While the disagreement coefficient is somewhat difficult to define, the theoretical work associated with this framework has been perhaps the crowning achievement of the area.

In the following section we give a longer outline of the existing work in this area. But it is worth noting up front that nearly all work on active learning has imagined a batch setting, where the algorithm is evaluated only at the end of the learning process, in expectation, on new samples. This is surprising, in particular, given that active learning methods are by their nature online, as they seek to iteratively refine their learning process and selection of samples. But thus far there has been no work on putting active learning algorithms to the test in a no-regret setting of prediction with expert advice, where the algorithm's decision is evaluated at each round of the sequence, and where the experts' predictions as well as the labels can be non-stochastic and potentially chosen by an adversary.

In the present paper we aim to remedy this gap, and show that there is a natural framework for active learning in the no-regret setting of prediction with expert advice with strong learning guarantees as well as bounded label complexity. First, we define a notion of complexity of the experts' predictions, somewhat akin to the disagreement coefficient, that provides a key tool in obtaining a provable guarantee; we refer to this as ζ-compactness for a parameter ζ ≥ 1. Quite notably, this quantity can be efficiently estimated up to a constant factor!
Theorem 1.1 (Informal). There is a polynomial time algorithm to calculate the compactness ζ of a matrix up to an approximation factor of 3.

Second, we define no-regret active learning by laying out what we believe is the appropriate analogue of the batch setting. To put it briefly, we imagine a scenario in which the learner must still make sequential predictions on an M-length list of examples, but with the following modifications: (a) the learner is given the sequence of all experts' predictions in advance, (b) the learner can only query the true label y_t on a small number of examples, and (c) the learner is given a very short burn-in period where it can fast-forward to future rounds in order to query particularly-informative examples. It is this last feature that makes our setting truly active, as this term is used in the batch setting, since the learner can recursively seek out useful datapoints. After the short burn-in, however, the learner must play the remainder of the sequence in its original order while querying only a small fraction of the labels.

Third, we propose an online learning algorithm for this setting, Active Hedge, that leans heavily on Hedge as a subroutine yet uses dramatically fewer label queries. We are able to show the following:

Theorem 1.2 (Informal). Assume we must predict a sequence of labels in [K], we have N experts who have provided predictions (in [K]) on all M examples, and the prediction matrix X ∈ [K]^(M×N) is ζ-compact for some ζ ≥ 1. If some expert makes only ϵM mistakes, for some ϵ > 0, then with probability 1 − ρ the algorithm Active Hedge guarantees that:
1. the burn-in period lasts only O(ζ log N log(1/ϵ)) rounds;
2. no more than O(ζϵM · polylog(N/(ϵζρ))) labels are queried;
3. the regret is at most O(√(ϵM ln N) + ln N).

Assuming the prediction matrix X is ζ-compact for a reasonably-sized constant ζ, this theorem states that the regret of Active Hedge is indeed no worse than that of Hedge, yet a dramatically lower label complexity is required: roughly O(ζϵM) queries are needed. The only extra power we give the learner is a very brief burn-in period, of roughly O(ζ) rounds up to logarithmic factors, where it can do active exploration of future examples. We now give an illustrative example to view this setting in comparison with more classical batch active learning.

Batch vs Online Active Learning

Before we dive into the related work and our results, let us lay out an intriguing scenario. Imagine that a worldwide viral pandemic has recently emerged, and a drug company has been working furiously for months to develop a vaccine to provide immunity to the novel virus. The company has been able to design two candidate vaccines, A and B, and has proven to federal regulators that both drugs are safe enough to study in humans, but there's a challenge: some people have a mild allergic reaction to vaccine A but not B, everyone else has a similar allergic reaction to vaccine B but not A, and the reaction only occurs months after exposure. The company knows that the allergic reaction is based on one of thousands of possible genetic variants, yet must determine quickly which is the relevant gene. Unfortunately there are only two ways to determine if the allergic reaction will occur: (a) wait months to inquire with the patient, or (b) run an expensive test after administering the vaccine that determines immediately whether the allergic reaction will occur.
In this scenario, the experts (hypotheses) correspond to candidate genes, a recipient of the vaccine is an example, the true label is their sensitivity to A or B, and the label query cost is incurred by the expensive test needed to detect a future allergic reaction. We introduce this challenge because it helps to highlight the distinction between the two modes of active learning, the classical batch framework and our online setting.

1. If the company decides to take a batch active learning approach, they would begin by asking random members of the population to submit their genetic profile and sign up for a vaccine study, but with only a small chance of being selected. The company would then adaptively filter applicants, zero in on particularly-suitable individuals with the relevant genetic information, administer one of the two vaccines, and then immediately give the expensive test to detect future allergic reactions. A population-wide vaccine administration protocol can then be developed once the key gene in question is determined.

2. The online approach is more aggressive: the company announces that anyone who would like to be vaccinated will have the opportunity, but they must submit a certified genetic profile in advance, arrive at the local mall on a Saturday by 11am, and then wait in a line. All are promised to receive one of the two vaccines, with the goal of minimizing potential allergic reactions; some recipients will be given the expensive test to quickly determine this. Also, all participants are told that a small number may be brought to the front of the line so that more medically-informative candidates are treated first; this is the burn-in phase which we'll discuss more in Section 2.

The typical way that medical procedures are tested and refined is using the first protocol, but we would argue¹ that the second is superior in how it accounts for and manages the costs and benefits of both vaccine recipients and developers. The batch active learning framework has generally been focused on simply minimizing the number of label queries (expensive tests) in order to achieve ϵ accuracy on future examples, but prediction errors that occur in the study phase are not accounted for in the loss objective. The online active learning framework, on the other hand, does not distinguish between study participants and regular vaccine recipients: the goal is simply to induce the least number of allergic reactions at the smallest possible testing cost over the long term.

¹We want to emphasize that we are not proposing to change the drug design and trial framework, as this involves a host of ethical and legal issues not considered here. Rather, drug development provides a useful hypothetical to consider the relative costs of testing and accuracy in an adaptive experimentation problem.

It is important to note that batch active learning methods, including the disagreement-based learning we describe below, can not immediately be applied in the online setting. Batch active learning only considers label query costs in the training phase and prediction error costs in the testing phase. Another relevant distinction is that our results do not rely on any IID assumption; indeed, since the algorithm is allowed to move certain examples ahead in the queue adaptively, new examples are almost certain to be non-independent.

Related Work

We briefly survey prior work in the general area of active learning.
We will describe salient aspects of these works, and outline how our paper differs from these existing approaches in terms of framework, method, and theory. At a fundamental level, active learning deals with label efficient learning, namely, identifying a good predictor, h*, from within a hypothesis class, H, by selectively choosing examples on which to query labels. Within this context, a number of methods under a variety of scenarios and assumptions have been studied. There has been a great deal of work in this area, yet we limit our survey here to a few important themes, in order to draw contrasts and parallels to our setting. Label efficient learning has been considered in pool-based (Settles, 2012; Hanneke, 2014), streaming (Cohn et al., 1994; Balcan et al., 2006; Beygelzimer et al., 2008) and online scenarios (Cesa-Bianchi et al., 2006; 2009; Dekel et al., 2012). Pool and stream-based scenarios have been considered largely within the setting of IID examples and/or labels, whereas online methods have been considered under probabilistic (Dekel et al., 2012) as well as adversarial (Cesa-Bianchi et al., 2006) label noise assumptions. Approaches include disagreement-based (Beygelzimer et al., 2008; 2010; Hanneke, 2007; Dasgupta et al., 2008; Hanneke, 2009; Hanneke & Yang, 2012), margin-based (Dasgupta et al., 2005; Balcan et al., 2007; Balcan & Long, 2013; Awasthi et al., 2014; 2015; Zhang, 2018), importance-sampling-based (Beygelzimer et al., 2008; Cortes et al., 2019), multiplicative-weight-update-based (Cesa-Bianchi et al., 2006), and other online methods (Yang, 2011; Dekel et al., 2012).

In much of the pool and streaming based work, the underlying assumption is that the examples and labels are, or can be, drawn IID from some fixed unknown distribution, with labels hidden from the learner. The learner, after making a number of label requests not exceeding, say, U, outputs a predictor ĥ. In this line of work, the active-learning protocol is evaluated by comparing ĥ against the Bayes optimal predictor on an independent labeled sequence. While there is a rich history of methods explored under a variety of label noise assumptions, the setting of our work is quite different, in that we make no probabilistic assumptions on the data generation process or label noise; and our active learning protocol, in contrast to these works, does not require independence between training and test scenarios. In particular, our protocol follows the online regret setting, and incorrect predictions are penalized on the dataset available to the learner during the training process.

On the other hand, our proposed method and theoretical results are fundamentally related to the so-called disagreement-based methods, and leverage key insights of Hanneke's disagreement coefficient (Hanneke, 2014). In particular, we develop the notion of ζ-compactness, which can be interpreted, in some sense, as a deterministic and combinatorial version of the disagreement coefficient. Nevertheless, since we make no probabilistic assumptions, unlike all previous disagreement-based methods, we cannot leverage classical empirical risk minimization bounds in our context. For this reason, we draw upon insights from the Hedge algorithm and its associated regret bounds, which are agnostic to such probabilistic assumptions.
Our work is also closely related to label efficient online learning methods, which have been analyzed both under unbiased probabilistic noise as well as adversarial noise assumptions. Cesa-Bianchi et al. (2005) describe a selective sampling method within the framework of online regret minimization for bounded loss functions. The learner plays M rounds and at time t gets an input x_t, and can decide to seek a label, while being aware of the overall label budget U. Within this setting, leveraging a variant of the Hedge algorithm, and with no additional assumptions on the data process, Cesa-Bianchi et al. (2005) provide regret guarantees which scale as M√(log(N)/U) for N experts (the number of hypotheses). A number of online variants of this selective sampling approach have been proposed. Cesa-Bianchi et al. (2009) and Dekel et al. (2012) introduce probabilistic noise assumptions, and in particular assume that the regression function is linear, and that the label noise is unbiased and independent of other examples or queries. The linearity of the regression function together with independent label noise allows them to leverage recursive least-squares techniques. Similar to these works, we also consider a regret-minimization technique. Different from Cesa-Bianchi et al. (2009) and Dekel et al. (2012), we make no probabilistic assumptions on label noise. Zhao et al. (2013) and Hao et al. (2018) consider the same selective sampling setting, where the learner can request the label after making its prediction in each round, but give no theoretical guarantees on the label complexity. In contrast to Cesa-Bianchi et al. (2005), we assume the data from all M rounds are available to the learner a priori. In addition, we impose the notion of ζ-compactness on the dataset of experts' predictions via a concept closely related to the disagreement coefficient, which allows for dramatic improvements in label efficiency. As a matter of comparison, say the optimal expert makes ϵM errors; then the existing selective sampling results with budget U = O(ϵM) would lead to a regret of order M√(log(N)/(ϵM)) = √(M log(N)/ϵ), in comparison to our result suggesting √(ϵM log N). The improvement in our result can be attributed to the additional imposition of ζ-compactness.

2. Notation, Setting, and Background

For the remainder of the paper, we will consider a matrix X ∈ [K]^(M×N) that represents the predictions of a set of N experts on a sequence of M rounds. We will use the notation X_t to refer to the t-th row of X, although we will often index rows using the letter i or I. We write X_{i,j} to denote the (i, j)-th entry of X. Alongside this matrix will be an (unknown) sequence of labels y_1, . . . , y_M ∈ [K]. We require a loss function ℓ : Δ_K × [K] → R, and for simplicity we restrict our attention to the absolute loss ℓ(ŷ, y) := (1/2)‖ŷ − δ_y‖_1. Here δ_y ∈ {0, 1}^K is the indicator vector, with all zeros except a 1 in the y-th coordinate.

2.1. Basics: Prediction with Expert Advice, and Hedge

In the classical setting of prediction with expert advice, the learner receives a prediction vector X_t at round t, makes a prediction ŷ_t ∈ Δ_K, observes the true label y_t, and suffers the loss ℓ(ŷ_t, y_t). Each expert j suffers a loss as well, ℓ(X_{t,j}, y_t), and note that this loss is conveniently the 0-1 loss, 1[X_{t,j} ≠ y_t]. The algorithm wants to choose the predictions ŷ_1, . . . , ŷ_M in order to minimize the regret:

    Regret := Σ_{t=1}^{M} ℓ(ŷ_t, y_t) − min_{j∈[N]} Σ_{t=1}^{M} ℓ(X_{t,j}, y_t).

At times it will be convenient to refer to the cumulative loss of expert j as L^M_j = Σ_{i=1}^{M} ℓ(X_{i,j}, y_i); similarly, the loss of the algorithm is L^M_Hedge = Σ_{t=1}^{M} ℓ(ŷ_t, y_t).
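To make the loss and regret bookkeeping concrete, the following minimal Python sketch implements the absolute loss and the regret computation above (the function names absolute_loss and regret are ours, not the paper's):

```python
import numpy as np

def absolute_loss(y_hat, y):
    """Absolute loss l(y_hat, y) = 0.5 * ||y_hat - delta_y||_1.

    y_hat: probability vector over K classes; y: true label in {0, ..., K-1}.
    When y_hat sums to 1 this equals 1 - y_hat[y], the probability of a mistake."""
    delta = np.zeros_like(y_hat)
    delta[y] = 1.0
    return 0.5 * np.abs(y_hat - delta).sum()

def regret(preds, X, ys):
    """Regret of the learner against the best single expert.

    preds: (M, K) array of learner prediction vectors,
    X: (M, N) expert prediction matrix, ys: length-M label sequence."""
    learner_loss = sum(absolute_loss(p, y) for p, y in zip(preds, ys))
    expert_losses = (X != np.asarray(ys)[:, None]).sum(axis=0)  # 0-1 losses
    return learner_loss - expert_losses.min()
```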
We have already discussed Hedge, the most well-known algorithm for the problem of prediction with expert advice. We lay this out in full detail in Algorithm 1, with two important subroutines, Hedge_Predict and Hedge_Update, that will be needed later.

Theorem 2.1. Assume we know a quantity L* such that min_{j=1,...,N} L^M_j ≤ L*. Then, choosing η = log(1 + √(2 ln N / L*)), Algorithm 1 guarantees

    L^M_Hedge − min_{j=1,...,N} L^M_j ≤ √(2 L* ln N) + ln N.    (1)

This is, in many respects, a fundamental bound. We know, for example, that it can not be made any tighter, even up to constants (Cesa-Bianchi & Lugosi, 2006).

Algorithm 1: Hedge
    Input: η > 0                          /* learning rate parameter */
    Init: w ← [1, . . . , 1]              /* N initial weights */
    for t = 1, . . . , M do
        X_t ← Preds(t)                    /* receive expert predictions */
        ŷ_t ← Hedge_Predict(X_t, w)
        y_t ← Query_Label(t)
        w ← Hedge_Update(w, X_t, y_t, η)

    Procedure Hedge_Predict(x, w):
        p ← [w_1 / Σ_j w_j, . . . , w_N / Σ_j w_j]
        ŷ ← p · OneHot(x)                 /* weighted multiclass prediction; OneHot
                                             converts x ∈ [K]^N to a one-hot matrix in (Δ_K)^N */
        return ŷ                          /* ŷ is a probability vector in Δ_K */

    Procedure Hedge_Update(w, x, y, η):
        for j = 1, . . . , N do           /* decrease weight of incorrect experts */
            w⁺_j ← w_j · exp(−η · 1[x_j ≠ y])
        return w⁺
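For reference, here is a short Python rendering of Algorithm 1 under the absolute loss. It mirrors the pseudocode above, but the implementation details (array layout, helper names) are our own:

```python
import numpy as np

def hedge_predict(x, w, K):
    """Weighted-average prediction: a probability vector over K classes."""
    p = w / w.sum()                      # normalized expert weights
    y_hat = np.zeros(K)
    for j, pred in enumerate(x):         # computes p . OneHot(x)
        y_hat[pred] += p[j]
    return y_hat

def hedge_update(w, x, y, eta):
    """Exponentially down-weight experts that predicted incorrectly."""
    return w * np.exp(-eta * (x != y))

def hedge(X, labels, eta, K):
    """Run Hedge over the M x N prediction matrix X, querying every label."""
    M, N = X.shape
    w = np.ones(N)
    total_loss = 0.0
    for t in range(M):
        y_hat = hedge_predict(X[t], w, K)
        y = labels[t]                    # Hedge queries the label every round
        total_loss += 1.0 - y_hat[y]     # absolute loss
        w = hedge_update(w, X[t], y, eta)
    return total_loss, w
```

With η chosen as in Theorem 2.1, total_loss exceeds the best expert's cumulative loss by at most √(2 L* ln N) + ln N.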
2.2. Prediction Matrix Compactness

In the typical adversarial learning setting we assume that the experts' predictions and labels are chosen in some arbitrary fashion. On the other hand, it is well understood that to obtain any reasonable learning result in an active, label-efficient mode one requires stronger assumptions on the input data. In our framework of prediction with expert advice this means we must constrain the matrix X in an appropriate fashion. Let us now describe a particular condition on X, which we call compactness, that measures a purely combinatorial property of the space of predictions.

Definition 2.2. Given X ∈ [K]^(M×N), and for any subset V ⊆ [N] of experts, the points of contention of V is the set

    POC_X(V) := {i ∈ [M] | ∃ j, j′ ∈ V : X_{i,j} ≠ X_{i,j′}}.

For any set of experts, the points of contention are the collection of examples where at least two of the experts in the set disagree.

Definition 2.3 (ζ-Compactness). For some ζ ≥ 1, we say that an expert prediction matrix X is ζ-compact if it satisfies

    |POC_X(V)| / max_{j,j′∈V} |POC_X({j, j′})| ≤ ζ    (2)

for each V ⊆ [N] with |V| ≥ 2. We refer to the compactness of X as the smallest ζ for which inequality (2) holds.

Given a prediction matrix X, the compactness of X controls the divergence between two key quantities of a group of experts V: the total number of points of contention of all of V versus the largest number of points of contention over any pair in the group. In one sentence, the matrix X is ζ-compact if the size of the contentious set for any subset of experts is never ζ larger than that of the most contentious pair of experts in it.

Here are two illuminating examples that illustrate matrix compactness:

1. Let K = 2, M = N and let X be the identity matrix, with all 0 entries except 1s on the diagonal. The compactness of this matrix is M/2, unfortunately, which is very large. That's because if you take V = [N] we see that POC_X(V) = [M], the whole set of examples. But for any pair j, j′ we have POC_X({j, j′}) = {j, j′}. In other words, any group of experts has as many points of contention as members in the group, but any pair of experts will disagree on only two points. This is indeed a very hard case for active learning, as individual examples are not very informative.

2. Continue to let M = N and now let X be the upper triangular matrix with all 1s on and above the diagonal, and 0s below. This is a very compact matrix, with ζ = 1! That's because for any subset V we have POC_X(V) = POC_X({min(V), max(V)}), i.e. the points of contention in V are exactly the points of contention for the largest-index and smallest-index experts in the set.

Following point 1 above, we can give a simple bound on the compactness of any expert prediction matrix X, whose proof is in Appendix C. But this bound is mostly useless from the perspective of our main results, as we need ζ ≪ M for a non-trivial guarantee on label complexity.

Theorem 2.4. For any matrix X ∈ [K]^(M×N), with M ≥ 2, the compactness of X is at most min{M, N}.

Comparison to the Disagreement Coefficient. As we mentioned early in the paper, one of the major theoretical accomplishments in the literature on label-efficient statistical learning is the work on disagreement-based active learning, first introduced by Hanneke (2007) with several followup works (Hanneke, 2009; 2011; Balcan et al., 2006; Hanneke, 2014; Hanneke & Yang, 2015). The key quantity of interest in this work is known as the disagreement coefficient, a scalar that measures the difficulty of active learning with respect to a particular hypothesis class and data distribution. It was shown, going all the way back to Hanneke (2007), that this coefficient controls the label complexity of learning on the given task, with several examples where the disagreement coefficient is of reasonable size. While we developed our notion of compactness independently, and with a different model in mind, we later realized that in the case of binary classification our definition can in some sense be viewed as a derandomization of Hanneke's disagreement coefficient; we make this more precise in the proposition below. The compactness ζ of a prediction matrix X does not depend on any notion of IID sampling from an underlying data distribution, as ζ is purely a combinatorial property of the experts' predictions, which could have been adversarially chosen. And, while there is some resemblance between the burn-in procedure in Phase I of Active Hedge and the A² algorithm of Hanneke (2007), our results are not at all comparable: the goal of our work is to produce an algorithm that suffers low regret, as it is forced to make a prediction and suffer loss on each example, and to be robust against non-stochastic sequences of data.

Proposition 2.5. Consider a binary expert prediction matrix X with compactness ζ. Construct a data distribution D which generates an (x, y) pair by uniformly sampling a row x of X, with y the corresponding label, and consider the set of experts as an N-sized hypothesis class H. Then the disagreement coefficient of (D, H), as defined by Hanneke (2007), is 2ζ.
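To make Definitions 2.2 and 2.3 concrete, here is a brute-force sketch that computes POC_X(V) and the exact compactness by enumerating all subsets; it is exponential in N, so it is only meant for toy matrices like the two examples above (the function names are ours):

```python
from itertools import combinations
import numpy as np

def poc(X, V):
    """Points of contention of expert subset V: rows where some pair in V disagrees."""
    cols = X[:, list(V)]
    return {i for i in range(X.shape[0]) if len(set(cols[i])) > 1}

def compactness(X):
    """Exact zeta-compactness by enumerating all subsets of >= 2 experts."""
    N = X.shape[1]
    zeta = 1.0
    for size in range(2, N + 1):
        for V in combinations(range(N), size):
            diam = max(len(poc(X, [j, jp])) for j, jp in combinations(V, 2))
            if diam > 0:                 # skip subsets of identical experts
                zeta = max(zeta, len(poc(X, V)) / diam)
    return zeta

# Example 1 above: the identity matrix has compactness M/2.
X = np.eye(4, dtype=int)
assert compactness(X) == 2.0             # M = 4, so M/2 = 2
```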
2.3. Online active learning with experts

Let us now specify the details of our framework for active learning with expert advice. It can be described in terms of the vanilla Hedge setting, but with three key modifications:

1. The sequence of expert predictions, specified by X, can be precomputed and is given to the learner in advance of the prediction task.

2. The learner aims to make only a small number of label queries, limiting the number of times y_t is observed.

3. We allow a very brief burn-in period, which we call Phase I, where the learner can fast-forward to act on particular examples, and query their labels, out of turn. In Phase II the learner then plays the remaining points, which are the vast majority, in the order they are given, with the occasional label query if needed.

Modification 1 above is not unusual and arises naturally in settings where the experts are a set of pre-selected deterministic hypotheses, the rounds/examples are given by a queue of contexts/input vectors, and we can pre-evaluate each hypothesis on each context (the vaccine development scenario given in the introduction is another such example). Modification 2 captures the underlying goal that we want to skip the potentially-expensive step of obtaining the correct multiclass label in all but a small fraction of rounds; adding this modification alone is often referred to as label efficient online learning, e.g. (Sculley, 2007). Modification 3 is perhaps the most unusual in the context of adversarial online learning, where one assumes that the sequence of examples and labels is chosen in an adversarial fashion. But we would argue that this is actually necessary to achieve any kind of non-trivial guarantee: without a small number of fast-forward rounds, the adversary can simply postpone all informative examples to the end of the sequence, at which point querying their labels would provide no benefit to the learner. Indeed we show that the burn-in period can be extremely short, no more than roughly O(ζ log N log(1/ϵ)) rounds where ζ is the compactness of X, in order to obtain the same regret as Hedge with vastly fewer label queries (roughly O(ζϵM)).

Note that if we don't allow a burn-in phase, the lower bounds of Cesa-Bianchi et al. (2005, Theorem 13) apply to the online active learning setting as well. This implies that if we don't allow a burn-in phase, then to guarantee the same √(2ϵM ln N) regret as Hedge, any algorithm would require at least C·M/ϵ labels for some constant C. Since ϵ ≤ 1, C·M/ϵ = Ω(M). Thus, without a burn-in period, any algorithm would require Ω(M) labels to get the same regret guarantee as Hedge. Since Hedge also requests O(M) labels, there would be no advantage in using anything other than Hedge; the short derivation below spells out the arithmetic.
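In a bit more detail, and assuming the selective-sampling lower bound takes the form Ω(M√(ln N / U)) for a label budget U (this is a sketch of the calculation, not a verbatim statement of Cesa-Bianchi et al.'s Theorem 13), the arithmetic is:

```latex
% Matching the lower bound to Hedge's regret forces a large budget U:
\[
  c\, M \sqrt{\tfrac{\ln N}{U}} \;\le\; \sqrt{2\epsilon M \ln N}
  \quad\Longleftrightarrow\quad
  c^2\, \frac{M^2 \ln N}{U} \;\le\; 2 \epsilon M \ln N
  \quad\Longleftrightarrow\quad
  U \;\ge\; \frac{c^2}{2} \cdot \frac{M}{\epsilon} \;=\; \Omega(M).
\]
```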
3. Algorithm And Performance Guarantee

Henceforth we will let b denote the index of the best expert, i.e. b = argmin_{j∈[N]} L^M_j, and assume that its number of mistakes satisfies L^M_b ≤ ϵM.

3.1. An Overview of Active Hedge

We present a multiplicative-weights style algorithm, Active Hedge, described precisely in Algorithm 2. First let us give a high-level intuitive description of the procedure. Active Hedge is divided into two phases.

1. Phase I. This is the so-called burn-in period, where the algorithm can fast-forward to future examples out of turn. On each such example, the algorithm must still make a prediction, and can then query the label. This phase, while short, is done in small epochs of length k = O(ζ log(N/ρ)), with a total of T = O(log(1/ϵ)) epochs. In a given epoch τ the algorithm has a set of candidate experts V^τ who have predicted reasonably well thus far. To reduce the number of candidate experts, the algorithm samples future rounds from the points of contention of V^τ, makes a Hedge prediction on each, and then queries the label. At the end of the epoch the algorithm discards any experts in V^τ whose average error was above a given threshold. On the next epoch we shrink the threshold and consider the new set of candidate experts V^{τ+1}, sample examples from the new set POC_X(V^{τ+1}), and so on.

2. Phase II. At the start of this phase the algorithm has a relatively small set of candidate best experts, V^T, that were selected in Phase I; with high probability b remains in V^T and every expert in V^T agrees with b on all but O(ϵM) examples. With the burn-in segment over, the algorithm now plays the remaining examples, which make up the vast majority, in their original (adversarial) order; rounds played in Phase I are skipped. It uses a very simple prediction strategy: (a) if the example i is in POC_X(V^T), we use Hedge to make a prediction on this example, we query the label y_i, and we do a Hedge update on the weights; (b) if i ∉ POC_X(V^T), we simply pick an arbitrary expert j* ∈ V^T and use X_{i,j*} as our prediction.

Algorithm 2: Active Hedge
    Parameters: ϵ, η, k, T, ζ
    Input: X ∈ [K]^(M×N)
    Initialize: V^0 ← [N], w_0 ← [1, . . . , 1], t ← 0, DONE ← ∅

    /* //// PHASE I //// Recursively shrink candidate experts */
    for τ = 0, . . . , T − 1 do
        Z^τ_j ← 0 for all j ∈ [N]           /* #errors of expert j in epoch τ */
        for c = 0, . . . , k − 1 do
            I ∼ POC_X(V^τ)                   /* sample with replacement */
            if I ∉ DONE then
                ŷ_I ← Hedge_Predict(X_I, w_t)
                y_I ← Query_Label(I)
                w_{t+1} ← Hedge_Update(w_t, X_I, y_I, η)
                t ← t + 1                    /* increment hedge update count */
                DONE ← DONE ∪ {I}
            Z^τ_j ← Z^τ_j + 1[X_{I,j} ≠ y_I] for all j ∈ V^τ
        δ^τ ← (M/(2^{τ+1}ζ) − ϵM) / (2|POC_X(V^τ)|)
        V^{τ+1} ← {j ∈ V^τ : Z^τ_j / k ≤ δ^τ}    /* shrink V */

    /* //// PHASE II //// Play all remaining rounds */
    Select j* ∈ V^T arbitrarily
    for i = 1, . . . , M do
        if i ∈ DONE then
            continue                         /* skip if example already done */
        else if i ∈ POC_X(V^T) then
            ŷ_i ← Hedge_Predict(X_i, w_t)
            y_i ← Query_Label(i)
            w_{t+1} ← Hedge_Update(w_t, X_i, y_i, η)
            t ← t + 1                        /* increment hedge update count */
        else
            ŷ_i ← OneHot(X_{i,j*})           /* use default expert j*; one-hot
                                                encoding required so that ŷ_i ∈ Δ_K */

The choice in condition (b) might seem unusual, but recall that all experts in V^T agree on examples i ∉ POC_X(V^T). As long as we did not accidentally evict b from our candidate experts in Phase I, the prediction X_{i,j*} will match that of X_{i,b}. Therefore on these rounds we should suffer no regret.
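The Python sketch below mirrors Algorithm 2, reusing hedge_predict, hedge_update and poc from the earlier sketches. The epoch threshold delta follows our reading of the pseudocode and the analysis in Appendix A, so treat the whole block as illustrative rather than as the authors' implementation:

```python
import numpy as np

def active_hedge(X, labels, eps, zeta, k, T, eta, K, rng=None):
    """Sketch of Algorithm 2. `labels` is only read through query(), which
    counts label queries; loss on unqueried rounds is computed for reporting."""
    rng = rng or np.random.default_rng()
    M, N = X.shape
    w = np.ones(N)
    V = list(range(N))
    done, known, n_queries = set(), {}, [0]

    def query(i):                        # the expensive label oracle
        if i not in known:
            known[i] = labels[i]
            n_queries[0] += 1
        return known[i]

    total_loss = 0.0
    # //// PHASE I //// recursively shrink the candidate expert set
    for tau in range(T):
        S = sorted(poc(X, V))            # points of contention of V
        if not S:
            break
        Z = np.zeros(N)
        for _ in range(k):
            i = int(rng.choice(S))       # sample with replacement
            if i not in done:
                y_hat = hedge_predict(X[i], w, K)
                y = query(i)
                total_loss += 1.0 - y_hat[y]
                w = hedge_update(w, X[i], y, eta)
                done.add(i)
            else:
                y = query(i)             # label already cached; not recounted
            Z += (X[i] != y)
        delta = (M / (2 ** (tau + 1) * zeta) - eps * M) / (2 * len(S))
        V = [j for j in V if Z[j] / k <= delta]   # evict high-error experts
        assert V, "all experts evicted; parameters likely mis-set"

    # //// PHASE II //// play the remaining rounds in their original order
    S_T = poc(X, V)
    j_star = V[0]                        # arbitrary surviving expert
    for i in range(M):
        if i in done:
            continue
        if i in S_T:                     # contentious: predict, query, update
            y_hat = hedge_predict(X[i], w, K)
            y = query(i)
            total_loss += 1.0 - y_hat[y]
            w = hedge_update(w, X[i], y, eta)
        else:                            # all experts in V agree with j_star here
            total_loss += float(X[i, j_star] != labels[i])   # no query made
    return total_loss, n_queries[0]
```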
3.2. Regret and Label Guarantees

We now present the regret and label complexity guarantee for Active Hedge (Algorithm 2).

Theorem 3.1. Assume we have ϵ, ρ > 0, a label sequence y, and a ζ-compact matrix X such that 10ϵζ ≤ 1 and, for some b ∈ [N], Σ_{i∈[M]} 1[X_{i,b} ≠ y_i] ≤ ϵM. We set the Active Hedge parameters

    k := ⌈192ζ log((N/ρ) log(1/(10ϵζ)))⌉,    T := ⌈log(1/(10ϵζ))⌉    and    η := log(1 + √(2 ln N/(ϵM))).    (3)

Then with probability at least 1 − ρ:
1. the number of calls to Query_Label is no more than O(ζ log((N/ρ) log(1/(10ϵζ))) · log(1/(10ϵζ)) + ϵζM);
2. the length of Phase I is no more than Tk, which, up to logarithmic terms, is O(ζ) rounds;
3. and finally we have that REG_ActiveHedge ≤ √(2ϵM ln N) + ln N.

Corollary 3.2. If the burn-in phase in Active Hedge is limited to only B rounds, then we can achieve the same regret as Hedge with label complexity O(B + M/2^(B/ζ)).

Theorem 3.1 states that Active Hedge achieves the same regret guarantee as Hedge with high probability while using considerably fewer labels. Hedge requires a label complexity of M, whereas for small ϵ and ζ, the label complexity of Active Hedge is closer to O(ζϵM). The proof of Theorem 3.1 can be found in Appendix A. The basic proof sketch is that we divide the regret analysis and the label complexity analysis into the regret and label complexity of the two phases. In Phase I, using induction, we show that with high probability the size of the candidate expert set V^τ shrinks in every epoch and the best expert is always present in V^τ. After the end of Phase I, we have narrowed down to a set of candidate experts V^T such that, with high probability, |POC_X(V^T)| = O(ζϵM) (using compactness), yet still b ∈ V^T. In Phase II we only request the labels for the examples that are in POC_X(V^T); thus the label complexity of Phase II is bounded by O(ϵζM). Bounding the regret of Active Hedge is surprisingly easy, since for all examples played in Phase I, as well as those played in Phase II from POC_X(V^T), we appeal directly to Hedge, where we have an optimal bound. On the many examples in Phase II where i ∉ POC_X(V^T), we make a prediction that (with high probability) agrees with expert b, and thus we suffer no regret on these rounds.

It should be noted that even though the guarantees in Theorem 3.1 depend on knowledge of ϵ and ζ for initializing the parameters k and T of Algorithm 2, for our proofs to follow through we just need an upper bound on the error rate ϵ of the best expert, and similarly for the compactness ζ. In Theorem 4.1, we give a polynomial time algorithm to approximate ζ; this can be used to initialize Algorithm 2. Using an upper bound ϵ′ ≥ ϵ in Theorem 3.1, we still get a regret guarantee of √(2ϵ′M ln N) + ln N, but the label complexity becomes O(ζ log((N/ρ) log(1/(10ϵ′ζ))) · log(1/(10ϵ′ζ)) + ϵ′ζM). If no estimate for ϵ is available, a standard halving trick can be applied to obtain similar regret and sample complexity guarantees. See Appendix B for more details.

4. Calculating compactness

The compactness of an expert prediction matrix is a combinatorial quantity which is easy to compute for some concept classes, but in the worst case it might be hard to compute exactly, as it involves a supremum over all subsets of experts. We present an algorithm that gives a 3-approximation of the compactness in polynomial time. For the remainder of this section and the appendix, for any V ⊆ [N] let DIAM(V) := max_{j,j′∈V} |POC_X({j, j′})|, and for any experts j, j′, let dist(j, j′) = |POC_X({j, j′})|.

Algorithm 3: Calculate compactness
    Input: X ∈ [K]^(M×N)                  /* expert prediction matrix */
    Init: ζ̃ ← 0
    for all pairs j, j′ ∈ [N] do
        V_{j,j′} ← {j, j′}                /* initialize V_{j,j′} */
        /* add experts with distance from j at most dist(j, j′) */
        V_{j,j′} ← V_{j,j′} ∪ {h : dist(h, j) ≤ dist(j, j′)}
        /* add experts with distance from j′ at most dist(j, j′) */
        V_{j,j′} ← V_{j,j′} ∪ {h : dist(h, j′) ≤ dist(j, j′)}
        ζ_{j,j′} ← |POC_X(V_{j,j′})| / DIAM(V_{j,j′})
        if ζ_{j,j′} > ζ̃ then              /* update ζ̃ if a bigger ratio is found */
            ζ̃ ← ζ_{j,j′}
    Return: ζ̃

Theorem 4.1. If the input matrix X to Algorithm 3 is ζ-compact, then Algorithm 3 returns ζ̃ such that ζ/3 ≤ ζ̃ ≤ ζ, in runtime O(N⁴M).

As stated earlier, to initialize Algorithm 2 for the results in Theorem 3.1, we just need an upper bound on the ζ-compactness. Using Algorithm 3, we can obtain an estimate ζ̂ = 3ζ̃ such that ζ ≤ ζ̂ ≤ 3ζ.
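A direct rendering of Algorithm 3 in Python (again reusing the poc helper from the sketch in Section 2.2; the names dist and approx_compactness are ours):

```python
from itertools import combinations

def dist(X, j, jp):
    """Number of rows where experts j and j' disagree."""
    return int((X[:, j] != X[:, jp]).sum())

def approx_compactness(X):
    """3-approximation of zeta-compactness, following Algorithm 3:
    returns zeta_tilde with zeta/3 <= zeta_tilde <= zeta."""
    N = X.shape[1]
    zeta_tilde = 0.0
    for j, jp in combinations(range(N), 2):
        d = dist(X, j, jp)
        if d == 0:
            continue                         # identical experts carry no ratio
        V = {h for h in range(N)
             if dist(X, h, j) <= d or dist(X, h, jp) <= d}
        diam = max(dist(X, a, b) for a, b in combinations(V, 2))
        zeta_tilde = max(zeta_tilde, len(poc(X, V)) / diam)
    return zeta_tilde
```

An upper bound suitable for initializing Algorithm 2 is then 3 * approx_compactness(X), per the discussion above.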
5. Experiments

We provide preliminary experiments comparing Active Hedge (Algorithm 2) with standard Hedge (Algorithm 1) and the label efficient algorithm of Cesa-Bianchi et al. (2005).

[Figure 1: Labels queried (left column) and cumulative mistakes (right column) over 10000 rounds for Active Hedge, Hedge, and Cesa-Bianchi et al. (2005) (CL05), in three settings: a) experts are linear classifiers, b) experts are thresholds, c) experts form an identity-like matrix. Hedge queries the label in every round and is not shown in the labels-queried plots to maintain readability.]

We consider three different classes of experts for our experiments. In Figure 1 a) we consider linear classifiers passing through the origin as experts. We sample N linear classifiers uniformly from the unit sphere centred at the origin. We then sample M points from the unit sphere and classify each point using the N experts to create the expert prediction matrix X. Similarly, in Figure 1 b) we consider multi-dimensional thresholds as experts, where a point x ∈ R^d is labeled 1 by an expert h ∈ R^d if x_i ≥ h_i for all i ∈ [d]. The experts are sampled by drawing thresholds uniformly between 0 and 1. In both cases, Active Hedge is able to achieve accuracy similar to Hedge, and achieves better performance than Cesa-Bianchi et al. (2005) in terms of both regret and label complexity.

We also consider the more adversarial case in Figure 1 c), where the expert prediction matrix has an identity-matrix-like structure with ζ = O(N). Here the expert prediction matrix is designed such that only one of the experts labels each point as 1, and every expert classifies approximately the same number of points as 1. Even in this adversarial case, where the ζ-compactness is very high, Active Hedge outperforms the competition. This happens because even though the ζ-compactness is high, removing an expert from consideration also removes a significant fraction of the points we are confused on. This allows us to quickly converge to the optimal expert.

All experiments are repeated 100 times, with M = 10000, N = 100 and d = 10. We use upper bounds for ζ and ϵ, and the other parameters are set optimally. For all these experiments, Active Hedge required less than 10% of the labels, with the burn-in phase being less than 2% of the points.
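A minimal sketch of how the prediction matrix for the first experiment can be generated; this is our reconstruction of the described setup (uniform classifiers and points on the unit sphere), not the authors' code:

```python
import numpy as np

def linear_expert_matrix(M=10000, N=100, d=10, seed=0):
    """Experts are homogeneous linear classifiers w drawn uniformly from the unit
    sphere; examples are points x on the unit sphere; X[i, j] = 1[w_j . x_i >= 0]."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(N, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)    # uniform on the sphere
    P = rng.normal(size=(M, d))
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    X = (P @ W.T >= 0).astype(int)                   # binary predictions in {0, 1}
    return X, P

X, points = linear_expert_matrix()
labels = X[:, 0]    # e.g. treat expert 0 as ground truth; flip a fraction for noise
```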
Acknowledgements

This research was supported by the Army Research Office Grant W911NF2110246, the National Science Foundation grants CCF-2007350 and CCF-1955981, the Hariri Data Science Faculty Fellowship Grants, and a gift from the ARM corporation.

References

Awasthi, P., Balcan, M. F., and Long, P. M. The power of localization for efficiently learning linear separators with noise. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pp. 449-458. ACM, 2014.

Awasthi, P., Balcan, M.-F., Haghtalab, N., and Urner, R. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pp. 167-190, 2015.

Balcan, M.-F. and Long, P. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pp. 288-316, 2013.

Balcan, M.-F., Beygelzimer, A., and Langford, J. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pp. 65-72. ACM, 2006.

Balcan, M.-F., Broder, A., and Zhang, T. Margin based active learning. In International Conference on Computational Learning Theory, pp. 35-50, 2007.

Beygelzimer, A., Dasgupta, S., and Langford, J. Importance weighted active learning. arXiv, abs/0812.4952, 2008.

Beygelzimer, A., Hsu, D. J., Langford, J., and Zhang, T. Agnostic active learning without constraints. Advances in Neural Information Processing Systems, pp. 199-207, 2010.

Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, 2006.

Cesa-Bianchi, N., Lugosi, G., and Stoltz, G. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51(6):2152-2162, 2005.

Cesa-Bianchi, N., Gentile, C., and Zaniboni, L. Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7:1205-1230, 2006.

Cesa-Bianchi, N., Gentile, C., and Orabona, F. Robust bounds for classification via selective sampling. In Proceedings of the 26th annual international conference on machine learning, pp. 121-128. ACM, 2009.

Cohn, D., Atlas, L., and Ladner, R. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.

Cortes, C., DeSalvo, G., Gentile, C., Mohri, M., and Zhang, N. Region-based active learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2801-2809, 2019.

Dasgupta, S., Kalai, A. T., and Monteleoni, C. Analysis of perceptron-based active learning. In International Conference on Computational Learning Theory, pp. 249-263, 2005.

Dasgupta, S., Hsu, D. J., and Monteleoni, C. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pp. 353-360, 2008.

Dekel, O., Gentile, C., and Sridharan, K. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655-2697, 2012.

Freund, Y. and Schapire, R. E. A desicion-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pp. 23-37. Springer, 1995.

Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.

Hanneke, S. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th international conference on Machine learning, pp. 353-360. ACM, 2007.

Hanneke, S. Adaptive rates of convergence in active learning. In COLT, 2009.

Hanneke, S. Rates of convergence in active learning. The Annals of Statistics, pp. 333-361, 2011.

Hanneke, S. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.

Hanneke, S. and Yang, L. Surrogate losses in passive and active learning. arXiv preprint, 2012.

Hanneke, S. and Yang, L. Minimax analysis of active learning. The Journal of Machine Learning Research, 16(1):3487-3602, 2015.

Hao, S., Hu, P., Zhao, P., Hoi, S. C., and Miao, C. Online active learning with expert advice. ACM Transactions on Knowledge Discovery from Data (TKDD), 12(5):1-22, 2018.

Kapoor, A., Grauman, K., Urtasun, R., and Darrell, T. Active learning with gaussian processes for object categorization. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1-8. IEEE, 2007.

Koltchinskii, V. Local rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593-2656, 2006.

Kulkarni, S. R., Mitter, S. K., and Tsitsiklis, J. N. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23-35, 1993.

Li, X. and Guo, Y. Adaptive active learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 859-866, 2013.

Mitzenmacher, M. and Upfal, E. Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge University Press, 2017.
Nguyen, H. T. and Smeulders, A. Active learning using pre-clustering. In Proceedings of the twenty-first international conference on Machine learning, pp. 79, 2004.

Sculley, D. Online active learning methods for fast label-efficient spam filtering. In CEAS, volume 7, pp. 143, 2007.

Settles, B. From theories to queries: Active learning in practice. In Active Learning and Experimental Design workshop, in conjunction with AISTATS 2010, pp. 1-18, 2011.

Settles, B. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1-114, 2012.

Wang, M. and Hua, X.-S. Active learning in multimedia annotation and retrieval: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 2(2):1-21, 2011.

Yang, L. Active learning with a drifting distribution. In Advances in Neural Information Processing Systems, pp. 2079-2087, 2011.

Zhang, C. Efficient active learning of sparse halfspaces. arXiv preprint, 2018.

Zhao, P., Hoi, S., and Zhuang, J. Active learning with expert advice. arXiv preprint arXiv:1309.6875, 2013.

A. Proof of Theorem 3.1

To prove Theorem 3.1, we need a few preliminary lemmas.

Lemma A.1. If a set of experts H_1 is a subset of another set of experts H_2, then POC_X(H_1) ⊆ POC_X(H_2).

Proof. If i ∈ POC_X(H_1), then there exist two experts j, j′ ∈ H_1 such that X_{i,j} ≠ X_{i,j′}. Since H_1 ⊆ H_2, we have j, j′ ∈ H_2, hence i ∈ POC_X(H_2).

In each epoch τ of Phase I, we maintain a set of candidate experts V^τ and a set of candidate points POC_X(V^τ) whose labels we might query. For ease of notation, let S^τ = POC_X(V^τ), DIAM(V) := max_{j,j′∈V} |POC_X({j, j′})|, and for any experts j, j′, let dist(j, j′) = |POC_X({j, j′})|. For the purpose of analysis, we partition the set V^τ into two sets. Let

    B^τ = { j ∈ V^τ | dist(b, j) > M/(2^{τ+1}ζ) }    and    B̄^τ = V^τ \ B^τ.

Intuitively, B^τ contains the experts which are far from the best expert; they make more mistakes and we want to remove them. Using an inductive analysis, we will show that in each epoch, with high probability, we shrink the set of candidate experts, i.e. for all τ, V^{τ+1} ⊆ B̄^τ, and that we never remove the best expert b, i.e. b ∈ V^{τ+1}. For the rest of the section, we set k = ⌈192ζ log((N/ρ) log(1/(10ϵζ)))⌉, T = ⌈log(1/(10ϵζ))⌉ and η = log(1 + √(2 ln N/(ϵM))).

In the following lemma, we show that the size of the set of candidate points sampled from in epoch τ is bounded.

Lemma A.2. If V^τ ⊆ B̄^{τ−1}, then |S^τ| ≤ M/2^{τ−1}.

Proof. By definition, S^τ = POC_X(V^τ). Since V^τ ⊆ B̄^{τ−1}, using Lemma A.1, S^τ ⊆ POC_X(B̄^{τ−1}). By the definition of B̄^{τ−1}, these experts are at a distance of at most M/(2^τ ζ) from the best expert, so the diameter of this set is at most M/(2^{τ−1}ζ). Using the definition of ζ-compactness, |POC_X(B̄^{τ−1})| ≤ ζ · M/(2^{τ−1}ζ) = M/2^{τ−1}. Hence |S^τ| ≤ M/2^{τ−1}.

Now we show that, in expectation, any expert in B^τ makes a large number of mistakes in epoch τ, which we will use to obtain a high probability bound.

Lemma A.3. If b ∈ V^τ then for any j in B^τ, if Z^τ_j is the number of mistakes made by j in epoch τ, then E[Z^τ_j] ≥ (k/|S^τ|)(M/(2^{τ+1}ζ) − ϵM).

Proof. Since j ∈ V^τ and b ∈ V^τ, by the definition of S^τ = POC_X(V^τ), if X_{i,j} ≠ X_{i,b} for some i, then i ∈ S^τ. The best expert b makes at most ϵM mistakes, so in the worst case j can disagree with b on these points and be correct, but it has to be wrong on at least M/(2^{τ+1}ζ) − ϵM points in S^τ, as it disagrees with b on at least M/(2^{τ+1}ζ) points in S^τ. We sample k points from S^τ.
Let the examples sampled in epoch τ be (I_1, . . . , I_k), so that Z^τ_j = Σ_{c=1}^{k} 1[X_{I_c,j} ≠ y_{I_c}]. Then

    E[Z^τ_j] = Σ_{c=1}^{k} P[X_{I_c,j} ≠ y_{I_c}] ≥ Σ_{c=1}^{k} (1/|S^τ|)(M/(2^{τ+1}ζ) − ϵM) = (k/|S^τ|)(M/(2^{τ+1}ζ) − ϵM).

Lemma A.4. If b ∈ V^τ and V^τ ⊆ B̄^{τ−1}, then with probability at least 1 − ρ|B^τ|/(N log(1/(10ϵζ))), we have V^{τ+1} ⊆ B̄^τ.

Proof. For a fixed j ∈ B^τ, by definition the number of mistakes is Z^τ_j = Σ_{c=1}^{k} 1[X_{I_c,j} ≠ y_{I_c}]. The probability that we keep j in V^{τ+1} is

    P[Z^τ_j/k ≤ (1/(2|S^τ|))(M/(2^{τ+1}ζ) − ϵM)]
        ≤ P[Z^τ_j ≤ E[Z^τ_j]/2]                          (by Lemma A.3)
        ≤ exp(−E[Z^τ_j]/8)                               (Chernoff lower tail with δ = 1/2)
        ≤ exp(−(k/(32ζ))(1 − 2^{τ+1}ϵζ))                 (as |S^τ| ≤ M/2^{τ−1})
        ≤ exp(−k/(36ζ))                                  (as τ < log₂(1/(10ϵζ)), so 2^{τ+1}ϵζ ≤ 1/10)
        ≤ ρ/(N log(1/(10ϵζ)))                            (as k = 192ζ log((N/ρ) log(1/(10ϵζ)))).

Thus, with probability at least 1 − ρ/(N log(1/(10ϵζ))), we have Z^τ_j/k > δ^τ and hence j ∉ V^{τ+1}. A union bound over j ∈ B^τ gives the proof.

So far in the inductive process we have shown that we shrink V^τ so as to only keep experts from B̄^τ. Now we show that, with high probability, we never remove the best expert b.

Lemma A.5. If Z^τ_b is the number of mistakes made in epoch τ by the best expert b, then E[Z^τ_b] ≤ kϵM/|S^τ|.

Proof. Since the best expert makes at most ϵM mistakes, in the worst case all of these ϵM examples are present in S^τ. Since we sample k points from S^τ, Z^τ_b = Σ_{c=1}^{k} 1[X_{I_c,b} ≠ y_{I_c}], and so E[Z^τ_b] = Σ_{c=1}^{k} P[X_{I_c,b} ≠ y_{I_c}] ≤ Σ_{c=1}^{k} ϵM/|S^τ| = kϵM/|S^τ|.

Lemma A.6. If b ∈ V^τ and V^τ ⊆ B̄^{τ−1}, then with probability at least 1 − ρ/(N log(1/(10ϵζ))), b ∈ V^{τ+1}.

Proof. The probability that b is not present in V^{τ+1} is

    P[Z^τ_b/k ≥ (1/(2|S^τ|))(M/(2^{τ+1}ζ) − ϵM)] = P[Z^τ_b ≥ (kϵM/(2|S^τ|))(1/(2^{τ+1}ϵζ) − 1)].

Since τ < log₂(1/(10ϵζ)), we have 2^{τ+1}ϵζ ≤ 1/10, so the eviction threshold is at least (9/2)·kϵM/|S^τ| ≥ (9/2)·E[Z^τ_b] by Lemma A.5. Applying the Chernoff upper tail (with δ ≥ 7/2 > 1) together with |S^τ| ≤ M/2^{τ−1}, the probability above is at most exp(−k/(36ζ)) ≤ ρ/(N log(1/(10ϵζ))), where the last step uses k = 192ζ log((N/ρ) log(1/(10ϵζ))).

Combining the two results, we can prove the inductive step.

Lemma A.7. If b ∈ V^τ and V^τ ⊆ B̄^{τ−1}, then with probability at least 1 − ρ/log(1/(10ϵζ)), b ∈ V^{τ+1} and V^{τ+1} ⊆ B̄^τ.

Proof. Union bound over Lemmas A.4 and A.6.

We now consider the base case and show that already in the first epoch we shrink V^0 to get V^1, and that we do not remove b.

Lemma A.8. With probability at least 1 − ρ/log(1/(10ϵζ)), V^1 ⊆ B̄^0 and b ∈ V^1.

Proof. Here δ^0 = (1/(2|S^0|))(M/(2ζ) − ϵM) and |S^0| ≤ M. For any fixed j ∈ B^0, Lemma A.3 gives E[Z^0_j] ≥ k(1/(2ζ) − ϵ), while Lemma A.5 gives E[Z^0_b] ≤ kϵM/|S^0|. The same Chernoff lower- and upper-tail arguments as in Lemmas A.4 and A.6 (using 10ϵζ ≤ 1) show that each fixed j ∈ B^0 survives to V^1 with probability at most ρ/(N log(1/(10ϵζ))), and that b is removed with probability at most ρ/(N log(1/(10ϵζ))). A union bound over j ∈ B^0 and over b proves the statement of the lemma.

Now that we have proved the inductive step and the base case, we can use these results to state the result for Phase I.

Lemma A.9. In Active Hedge (Algorithm 2), when Phase I ends after T = ⌈log(1/(10ϵζ))⌉ epochs, with probability at least 1 − ρ, b ∈ V^T and for all j ∈ V^T, dist(b, j) ≤ 10ϵM.

Proof. Using induction and a union bound over τ = 1, . . . , T for Lemmas A.8 and A.7, we get that with probability at least 1 − ρ, b ∈ V^T and

    V^T ⊆ B̄^{T−1} = {j ∈ [N] | dist(b, j) ≤ M/(2^T ζ)},    where M/(2^T ζ) = M/(2^{log(1/(10ϵζ))} ζ) = 10ϵM.

We have now shown that at the end of Phase I, i.e. the burn-in period, we have considerably shrunk our set of candidate experts, and thus the set of confusing points. We can now prove Theorem 3.1.
Since Active Hedge (Algorithm 2) is divided into two phases, a portion of the regret is incurred in each phase. The examples we predict on and request labels for in Phase I are given by the set DONE at the end of Phase I, so the portion of regret incurred in Phase I is R_I = Σ_{i∈DONE} (ℓ(ŷ_i, y_i) − ℓ(X_{i,b}, y_i)). In Phase II, the points are either in S^T = POC_X(V^T), where we make Hedge updates and request labels, or not in POC_X(V^T), where we use an arbitrary expert j* ∈ V^T to make predictions. Let the regret on the points of contention of V^T played in Phase II be R_con = Σ_{i∈([M]\DONE)∩S^T} (ℓ(ŷ_i, y_i) − ℓ(X_{i,b}, y_i)), and let the total regret on the points in Phase II not in POC_X(V^T) be R_agree = Σ_{i∈([M]\DONE)\S^T} (ℓ(ŷ_i, y_i) − ℓ(X_{i,b}, y_i)).

Proof of Theorem 3.1. First, let us show the regret bound.

Regret bound: Since REG_ActiveHedge = R_I + R_con + R_agree, let us consider the terms individually.

R_I and R_con: We use Hedge (Algorithm 1) to make predictions and updates. If we re-sample a point for which we have already made a prediction, we do not incur loss on it again. We know that L^M_b ≤ ϵM, hence L* = ϵM is an upper bound on the loss of the best expert within R_I + R_con as well. Setting η = log(1 + √(2 ln N/(ϵM))), we can directly use the regret bound of Theorem 2.1 to show that

    R_I + R_con = Σ_{i∈DONE∪S^T} (ℓ(ŷ_i, y_i) − ℓ(X_{i,b}, y_i))
                ≤ Σ_{i∈DONE∪S^T} ℓ(ŷ_i, y_i) − min_{j∈[N]} Σ_{i∈DONE∪S^T} ℓ(X_{i,j}, y_i)
                ≤ √(2ϵM ln N) + ln N.

R_agree: Using Lemma A.9, with probability at least 1 − ρ, the best expert b ∈ V^T. Since S^T = POC_X(V^T), all the experts present in V^T agree on [M] \ S^T. Since ([M] \ DONE) \ S^T ⊆ [M] \ S^T, all the experts in V^T agree on all examples in ([M] \ DONE) \ S^T. Thus for all i ∈ ([M] \ DONE) \ S^T and any j ∈ V^T, X_{i,j} = X_{i,b}. This is also true for the expert j* selected before the start of Phase II. We get

    R_agree = Σ_{i∈([M]\DONE)\S^T} (ℓ(ŷ_i, y_i) − ℓ(X_{i,b}, y_i))
            = Σ_{i∈([M]\DONE)\S^T} (ℓ(X_{i,j*}, y_i) − ℓ(X_{i,b}, y_i))
            = Σ_{i∈([M]\DONE)\S^T} (ℓ(X_{i,b}, y_i) − ℓ(X_{i,b}, y_i)) = 0.

Thus with probability at least 1 − ρ, REG_ActiveHedge ≤ √(2ϵM ln N) + ln N.

Label complexity: Let us count the number of labels requested in each phase. Since the number of epochs is T = ⌈log(1/(10ϵζ))⌉ and in each epoch we request the labels of at most k = ⌈192ζ log((N/ρ) log(1/(10ϵζ)))⌉ examples, the number of labels requested in Phase I is at most 192ζ log((N/ρ) log(1/(10ϵζ))) · log(1/(10ϵζ)). This is also the size of the burn-in period. Using Lemma A.9, with probability at least 1 − ρ, every j ∈ V^T has dist(b, j) ≤ 10ϵM, thus DIAM(V^T) ≤ 20ϵM. Using the definition of ζ-compactness, |S^T| = |POC_X(V^T)| ≤ ζ · DIAM(V^T) ≤ 20ϵζM. Since we only request labels for the examples in POC_X(V^T), the number of labels requested in Phase II is bounded by |POC_X(V^T)| ≤ 20ϵζM. Hence, with probability at least 1 − ρ, the number of labels requested in Phase II is at most 20ϵζM.

Combining the label complexity of each phase, with probability at least 1 − ρ, the number of labels requested by Algorithm 2 is at most

    192ζ log((N/ρ) log(1/(10ϵζ))) · log(1/(10ϵζ)) + 20ϵζM = O(ζ log((N/ρ) log(1/(10ϵζ))) · log(1/(10ϵζ)) + ϵζM).

Note that the regret bound and the label complexity result hold simultaneously with probability at least 1 − ρ.
B. Halving trick for unknown ϵ

As the algorithm already works by zooming in on tighter error levels, a standard halving technique can be used to adapt to an unknown ϵ as well. Note that k's dependence on ϵ is only a log log(1/ϵ) factor coming from a union bound, which can be upper bounded by log log M. To run the algorithm adaptively, instead of fixing T, we keep the τ loop running until the error rate of some expert is less than 0.5, i.e. min_j Z^τ_j/k < 0.5, setting δ^τ ← M/(2^{τ+2}ζ|POC_X(V^τ)|). Using steps similar to Theorem 3.1, this ensures that with high probability the best expert b is never removed from V^τ. When τ < log(1/(4ϵζ)), we can show that with high probability the best expert b has Z^τ_b/k ≤ 0.5, and for τ > log(1/(2ϵζ)), min_j Z^τ_j/k > 0.5. Thus, without knowing ϵ, we stop at the correct τ, leading to the optimal regret of √(2ϵM ln N) + ln N with label complexity O(ζ log((N/ρ) log M) · log(1/(10ϵζ)) + ϵζM).

C. Proof of Theorem 2.4

Proof of Theorem 2.4. For a set of experts V with |V| = 2 we have |POC_X(V)| = DIAM(V), so inequality (2) holds trivially for such sets. Assume V consists of distinct experts. For any set V ⊆ [N], |POC_X(V)| ≤ M, thus ζ ≤ M.

For any V, we show that |POC_X(V)| ≤ |V| · DIAM(V), by induction over the size of V. For |V| ≤ 2 the base cases are direct. Assume the claim is true for some V, i.e. |POC_X(V)| ≤ |V| · DIAM(V). If we add one more expert h to this set, then two cases are possible: a) DIAM(V + h) = DIAM(V), or b) DIAM(V + h) > DIAM(V).

a) DIAM(V + h) = DIAM(V). We can show that |POC_X(V + h)| ≤ |POC_X(V)| + DIAM(V). If this were not true, i.e. if |POC_X(V + h)| > |POC_X(V)| + DIAM(V), then h would disagree with all j ∈ V on at least DIAM(V) + 1 points which are not in POC_X(V). Thus |POC_X({h, j})| ≥ DIAM(V) + 1 > DIAM(V), which would imply DIAM(V + h) > DIAM(V), a contradiction. Thus |POC_X(V + h)| ≤ |V + h| · DIAM(V + h).

b) DIAM(V + h) > DIAM(V). The number of extra points added to POC_X(V) by adding h is bounded by DIAM(V + h). We get |POC_X(V + h)| ≤ |POC_X(V)| + DIAM(V + h) ≤ |V| · DIAM(V) + DIAM(V + h) ≤ |V + h| · DIAM(V + h).

This implies that for any V, |POC_X(V)| ≤ |V| · DIAM(V). Since |V| ≤ N, we conclude ζ ≤ N.

D. Proof of Theorem 4.1

Proof. Consider the subset V* = argmax_{V : DIAM(V)>0} |POC_X(V)|/DIAM(V). Let h_1, h_2 ∈ V* be experts such that dist(h_1, h_2) = DIAM(V*). For any h′ ∈ V*, dist(h′, h_1) ≤ DIAM(V*) and dist(h′, h_2) ≤ DIAM(V*); hence h′ ∈ V_{h_1,h_2}, i.e. V* ⊆ V_{h_1,h_2} in Algorithm 3. This gives us

    |POC_X(V_{h_1,h_2})| ≥ |POC_X(V*)|.

Since we include all experts that are at a distance of at most dist(h_1, h_2) from h_1 or h_2, the diameter satisfies

    DIAM(V_{h_1,h_2}) ≤ 3 · dist(h_1, h_2) = 3 · DIAM(V*).

Using these two facts, we get

    |POC_X(V_{h_1,h_2})| / DIAM(V_{h_1,h_2}) ≥ |POC_X(V*)| / (3 · DIAM(V*)) = ζ/3.

We consider all pairs of experts in Algorithm 3, hence the returned ζ̃ satisfies ζ̃ ≥ |POC_X(V_{h_1,h_2})| / DIAM(V_{h_1,h_2}) ≥ ζ/3. For the upper bound, since the returned ζ̃ is |POC_X(V_{j,j′})| / DIAM(V_{j,j′}) for some pair j, j′, it is immediate that ζ̃ ≤ max_{V : DIAM(V)>0} |POC_X(V)|/DIAM(V) = ζ.

The runtime comes from the fact that we consider all O(N²) pairs of experts, and for any subset V ⊆ [N], |POC_X(V)| can be computed in O(|V|M) time and DIAM(V) in O(|V|²M) time.

D.1. Proof of Corollary 3.2

Proof. In the results of Theorem 3.1 for Active Hedge (Algorithm 2), the learner is allowed to set the length of the burn-in period itself, i.e. it can decide how many examples actually need to be actively selected and moved ahead in the queue. The burn-in phase in Theorem 3.1 is set so as to minimize the overall label complexity of the algorithm required to obtain the same regret bound as Hedge. If, instead of giving the learner the freedom to set its own length of Phase I, the learner is only given a budget of B examples that it can move ahead in the queue, then by setting k = O(ζ) and T = B/k, the size of the burn-in phase becomes B.
At the end of Phase I, in this case, the size of the set of points of contention, |POC_X(V^T)|, is O(M/2^(B/ζ)) (Lemma A.2). Thus, the total number of labels queried would be O(B + M/2^(B/ζ)). As in Theorem 3.1, since we make no mistakes on the points outside POC_X(V^T) in Phase II, the number of mistakes is bounded by the mistakes made by Hedge, resulting in the same regret guarantee.

If we were to ignore the mistakes made during the learning part, then to solve this problem using an off-the-shelf active learning algorithm (e.g. (Balcan et al., 2006)), i) we would need bounded VC dimension d, and ii) we would require a O(ζd log(1/ϵ) + ϵζMd)-long burn-in to ensure an excess error rate of the same order as the O(√(ϵM)) regret on the remainder of the examples. This brings out a key benefit of our formulation: in pool-based batch active learning, there is no way to separate the number of targeted queries (i.e. burn-in) from the label complexity; in our online setting the former can be dramatically smaller than the latter.

E. Auxiliary lemmas

Lemma E.1 (Chernoff Bounds). Let X_1, . . . , X_n be independent random variables, with each X_i lying in the interval [0, 1]. Define X = Σ_{i=1}^{n} X_i and denote E[X] = µ. For any δ ∈ [0, 1], we have the Chernoff lower tail:

    Pr{X < (1 − δ)µ} ≤ exp(−µδ²/2),

and we have the Chernoff upper tail:

    Pr{X > (1 + δ)µ} ≤ exp(−µδ/3) for δ > 1,    and    Pr{X > (1 + δ)µ} ≤ exp(−µδ²/3) for δ ∈ [0, 1].

The proofs of the inequalities in Lemma E.1 can be found in Theorems 4.4 and 4.5 of (Mitzenmacher & Upfal, 2017).
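As a quick numerical sanity check of the lower tail in Lemma E.1 (our illustration, not part of the paper), one can compare the empirical tail probability of a binomial sum against the bound:

```python
import numpy as np

# Empirical check of Pr{X < (1 - delta) mu} <= exp(-mu delta^2 / 2)
rng = np.random.default_rng(0)
n, p, delta, trials = 1000, 0.3, 0.2, 200_000
mu = n * p
X = rng.binomial(n, p, size=trials)      # sums of n Bernoulli(p) variables
empirical = (X < (1 - delta) * mu).mean()
bound = np.exp(-mu * delta**2 / 2)
print(empirical, "<=", bound)            # roughly 2e-5 vs 0.0025
```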