# Online Learning with Abstention

Corinna Cortes¹, Giulia De Salvo¹, Claudio Gentile¹,², Mehryar Mohri³,¹, Scott Yang⁴

¹Google Research, New York, NY. ²INRIA Lille Nord Europe. ³Courant Institute of Mathematical Sciences, New York, NY. ⁴D. E. Shaw & Co., New York, NY (work done at the Courant Institute of Mathematical Sciences). Correspondence to: Giulia De Salvo.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## Abstract

We present an extensive study of a key problem in online learning where the learner can opt to abstain from making a prediction, at a certain cost. In the adversarial setting, we show how existing online algorithms and guarantees can be adapted to this problem. In the stochastic setting, we first point out a bias problem that limits the straightforward extension of algorithms such as UCB-N to this context. Next, we give a new algorithm, UCB-GT, that exploits historical data and time-varying feedback graphs. We show that this algorithm benefits from more favorable regret guarantees than a natural extension of UCB-N. We further report the results of a series of experiments demonstrating that UCB-GT largely outperforms that extension of UCB-N, as well as other standard baselines.

## 1. Introduction

We consider an online learning scenario, prevalent in many applications, where the learner is granted the option of abstaining from making a prediction, at a certain cost. For example, in the classification setting, at each round, the learner can choose to make a prediction and incur a standard zero-one misclassification cost, or elect to abstain, in which case she incurs an abstention cost, typically less than one. Abstention can thus represent an attractive option to avoid a higher cost of misclassification. Note, however, that when the learner abstains, she does not receive the true label (correct class), which results in a loss of information.

This scenario of online learning with abstention is relevant to many real-life problems. As an example, consider the scenario where a doctor can choose to make a diagnosis based on the current information available about a patient, or abstain and request further laboratory tests, which can represent both a time delay and a financial cost. In this case, the abstention cost is usually substantially lower than that of a wrong diagnosis. The online model is appropriate since it captures the gradual experience a doctor gains by testing, examining, and following new patients. Another instance of this problem appears in the design of spoken-dialog applications such as those in modern personal assistants. Each time the user asks a question, the assistant can either offer a direct response to the question, at the risk of providing an inaccurate response, or choose to say "I am sorry, I do not understand", which results in a longer and thereby more costly dialog requesting the user to reformulate the question. Similar online learning problems arise in the context of self-driving cars where, at each instant, the assistant must determine whether to continue steering the car or return control to the driver.
Online learning with abstention also naturally models many problems arising in electronic commerce platforms such as an Ad Exchange, an online platform set up by a publisher where several advertisers bid in order to compete for an ad slot, the abstention cost being the opportunity loss of not bidding for a specific ad slot.

In the batch setting, the problem of learning with abstention has been studied in a number of publications, starting with (Chow, 1957; 1970). Its theoretical aspects have been analyzed by several authors in the last decade. El-Yaniv & Wiener (2010; 2011) studied the trade-off between the coverage and accuracy of classifiers. Bartlett & Wegkamp (2008) introduced a loss function including explicitly an abstention cost and gave a consistency analysis of a surrogate loss that they used to derive an algorithm. More recently, Cortes et al. (2016a;b) presented a comprehensive study of the problem, including an analysis of the properties of a corresponding abstention (or rejection) loss with a series of theoretical guarantees and algorithmic results, both for learning with kernel-based hypotheses and for boosting.

This paper presents an extensive study of the problem of online learning with abstention, in both the adversarial and the stochastic settings. We consider the common scenario of prediction with expert advice (Littlestone & Warmuth, 1994) and adopt the same general abstention loss function as in (Cortes et al., 2016a), with each expert formed by a pair made of a predictor and an abstention function. A key aspect of the problem we investigate, which makes it distinct from both batch learning with abstention, where labels are known for all training points, and standard online learning (in the full information setting), is the following: if the algorithm abstains from making a prediction for the input point received at a given round, the true label of that point is not revealed. As a result, the loss of the experts that would have instead made a prediction on that point cannot be determined at that round. Thus, we are dealing with an online learning scenario with partial feedback. If the algorithm chooses to predict, then the true label is revealed and the losses of all experts, including abstaining ones, are known. But, if the algorithm elects to abstain, then only the losses of the abstaining experts are known, all of them being equal to the same abstention cost.

As we shall see, our learning problem can be cast as a specific instance of online learning with a feedback graph, a framework introduced by Mannor & Shamir (2011) and later extensively analyzed by several authors (Caron et al., 2012; Alon et al., 2013; 2014; 2015; Kocák et al., 2014; Neu, 2015; Cohen et al., 2016). In our context, the feedback graph varies over time, a scenario for which most of the existing algorithms and analyses (specifically, in the stochastic setting) do not readily apply. Our setting is distinct from the KWIK (knows what it knows) framework of Li et al. (2008) and its later extensions, though there are some connections, as discussed in Appendix A.

Our contribution can be summarized as follows. In Section 3, we analyze an adversarial setting both in the case of a finite family of experts and that of an infinite family. We show that the problem of learning with abstention can be cast as that of online learning with a time-varying feedback graph tailored to the problem.
In the finite case, we show how ideas from (Alon et al., 2014; 2015) can be extended and combined with this time-varying feedback graph to devise an algorithm, EXP3-ABS, that benefits from favorable guarantees. In turn, EXP3-ABS is used as a subroutine for the infinite case, where we show how a surrogate loss function can be carefully designed for the abstention loss while maintaining the same partial observability. We use the structure of this loss function to extend CONTEXTUALEXP3 (Cesa-Bianchi et al., 2017) to the abstention scenario and prove regret guarantees for its performance. In Section 4, we shift our attention to the stochastic setting. Stochastic bandits with a fixed feedback graph have been previously studied by Caron et al. (2012) and Cohen et al. (2016). We first show that an immediate extension of these algorithms to the time-varying graphs of the abstention scenario faces a technical bias problem in the estimation of the expert losses. Next, we characterize a set of feedback graphs that circumvent this bias problem in the general setting of online learning with feedback graphs. We further design a new algorithm, UCB-GT, whose feedback graph is estimated based on past observations. We prove that the algorithm admits more favorable regret guarantees than the UCB-N algorithm (Caron et al., 2012). Finally, in Section 5, we report the results of several experiments with both artificial and real-world datasets demonstrating that UCB-GT in practice significantly outperforms an unbiased, but limited, extension of UCB-N, as well as a standard bandit baseline, UCB (Auer et al., 2002a).

## 2. Learning Problem

Let $X$ denote the input space (e.g., $X$ is a bounded subset of $\mathbb{R}^d$). We denote by $H$ a family of predictors $h \colon X \to \mathbb{R}$, and consider the familiar binary classification problem where the loss $\ell(y, h(x))$ of $h \in H$ on a labeled pair $(x, y) \in X \times \{\pm 1\}$ is defined by either the 0/1-loss $1_{y h(x) \le 0}$, or some Lipschitz variant thereof (see Section 3). In all cases, we assume $\ell(\cdot, \cdot) \in [0, 1]$. We also denote by $R$ a family of abstention functions $r \colon X \to \mathbb{R}$, with $r(x) \le 0$ indicating an abstention on $x \in X$ (or that $x$ is rejected), and $r(x) > 0$ that $x$ is predicted upon (or that $x$ is accepted).

We consider a specific online learning scenario whose regime lies between bandit and full information, sometimes referred to as bandit with side information (e.g., Mannor & Shamir (2011); Caron et al. (2012); Alon et al. (2013; 2014; 2015); Kocák et al. (2014); Neu (2015); Cohen et al. (2016)). In our case, the arms are pairs made of a predictor function $h$ and an abstention function $r$ in a given family $E \subseteq H \times R$. We will denote by $\psi_j = (h_j, r_j)$, $j \in [K]$, the elements of $E$. Depending on the setting, $K$ may be finite or (uncountably) infinite. Given $h_j$, one natural choice for the associated abstention function $r_j$ is a confidence-based abstention function of the form $r_j(x) = |h_j(x)| - \theta$, for some threshold $\theta > 0$. Yet, more general pairs $(h_j, r_j)$ can be considered here. This provides an important degree of flexibility in the design of algorithms where abstentions are allowed, as shown in (Cortes et al., 2016a;b). Appendix A presents a concrete example illustrating the benefits of learning with these pairs of functions.
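To make the arm structure concrete, here is a minimal Python sketch (ours, not the paper's code) of an expert pair with a confidence-based abstention function; the class name `Expert`, the linear scorer, and the threshold symbol `theta` are illustrative choices consistent with the definitions above.

```python
import numpy as np

class Expert:
    """An arm psi = (h, r): a predictor h paired with an abstention function r.

    Here h is a linear scorer and r the confidence-based abstention
    function r(x) = |h(x)| - theta, so the expert abstains (r(x) <= 0)
    exactly when the score falls in a low-confidence band.
    """

    def __init__(self, w, theta):
        self.w = np.asarray(w, dtype=float)  # normal vector of the hyperplane
        self.theta = float(theta)            # abstention threshold theta > 0

    def h(self, x):
        return float(self.w @ x)

    def r(self, x):
        return abs(self.h(x)) - self.theta

    def abstains(self, x):
        return self.r(x) <= 0


# Example: a 2-dimensional expert that abstains near its decision boundary.
expert = Expert(w=[1.0, -0.5], theta=0.2)
print(expert.abstains(np.array([0.1, 0.1])))   # low confidence -> abstain
print(expert.abstains(np.array([1.0, -1.0])))  # high confidence -> predict
```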
The online learning protocol is described as follows. The set $E$ is known to the learning algorithm beforehand. At each round $t \in [T]$, the online algorithm receives an input $x_t \in X$ and chooses (possibly at random) an arm (henceforth also called expert or pair) $\psi_{I_t} = (h_{I_t}, r_{I_t}) \in E$. If the inequality $r_{I_t}(x_t) \le 0$ holds, then the algorithm abstains and incurs as loss an abstention cost $c(x_t) \in [0, 1]$. Otherwise, it predicts based on the sign of $h_{I_t}(x_t)$, receives the true label $y_t \in \{\pm 1\}$, and incurs the loss $\ell(y_t, h_{I_t}(x_t))$. Thus, the overall abstention loss $L$ of expert $\psi = (h, r) \in E$ on the labeled pair $z = (x, y) \in X \times \{\pm 1\}$ is defined as

$$L(\psi, z) = \ell(y, h(x))\, 1_{r(x) > 0} + c(x)\, 1_{r(x) \le 0}. \tag{1}$$

For simplicity, we will assume throughout that the abstention cost $c(x)$ is a (known) constant $c \in [0, 1]$, independent of $x$, though all our results can be straightforwardly extended to the case where $c$ is a (Lipschitz) function of $x$, which is indeed desirable in some applications.

Our problem can be naturally cast as an online learning problem with side information in the form of a feedback graph. Online learning with a feedback graph is a general framework that covers a variety of problems with partial information, including the full information scenario, where the graph is fully connected, and the bandit scenario, where all vertices admit only self-loops and are disconnected (Alon et al., 2013; 2014). In our case, we have a directed graph $G^{\mathrm{ABS}}_t = (V, E_t)$ that depends on the instance $x_t$ received by the algorithm at round $t \in [T]$. Here, $V$ denotes the finite set of vertices of this graph, which, in the case of a finite set of arms, coincides with the set of experts $E$, while $E_t$ denotes the set of directed edges at round $t$. The directed edge $i \to j$ is in $E_t$ if the loss of expert $\psi_j \in V$ is observed when expert $\psi_i$ is selected by the algorithm at round $t$. In our problem, if the learner chooses to predict at round $t$ (i.e., if $r_{I_t}(x_t) > 0$), then she observes the loss $L(\psi_j, z_t)$ of all experts $\psi_j$, since the label $y_t$ is revealed to her. If instead she abstains at round $t$ (i.e., if $r_{I_t}(x_t) \le 0$), then she only observes $L(\psi_j, z_t)$ for those experts $\psi_j$ that are abstaining in that round, that is, the set of $\psi_j$ such that $r_j(x_t) \le 0$, since for all such $\psi_j$ we have $L(\psi_j, z_t) = c$. Notice that in both cases the learner can observe the loss of her own action. Thus, the feedback graph we are operating with is a nearly fully connected directed graph with self-loops, except that it admits only one-way edges from predicting to abstaining vertices (see Figure 1 for an example). Observe also that the feedback graph $G^{\mathrm{ABS}}_t$ is fully determined by $x_t$.

Figure 1: Feedback graph $G^{\mathrm{ABS}}_t$ for the scenario of online learning with abstention, with $K = 5$.

We will consider both an adversarial setting (Section 3), where no distributional assumption is made about the sequence $z_t = (x_t, y_t)$, $t \in [T]$, and a stochastic setting (Section 4), where $z_t$ is assumed to be drawn i.i.d. from some unknown distribution $D$ over $X \times \{\pm 1\}$. For both settings, we measure the performance of an algorithm $\mathcal{A}$ by its (pseudo-)regret $R_T(\mathcal{A})$, defined as

$$R_T(\mathcal{A}) = \sup_{\psi \in E} \mathbb{E}\Big[ \sum_{t=1}^{T} L(\psi_{I_t}, z_t) - \sum_{t=1}^{T} L(\psi, z_t) \Big],$$

where the expectation is taken both with respect to the algorithm's choice of actions $I_t$ and, in the stochastic setting, the random draw of the $z_t$'s. In the stochastic setting, we will be mainly concerned with the case where $E$ is a finite set of experts $E = \{\psi_1, \ldots, \psi_K\}$. We then denote by $\mu_j$ the expected loss of expert $\psi_j \in E$, $\mu_j = \mathbb{E}_{z \sim D}[L(\psi_j, z)]$, by $\mu^*$ the expected loss of the best expert, $\mu^* = \min_{j \in [K]} \mu_j$, and by $\Delta_j$ the loss gap to the best, $\Delta_j = \mu_j - \mu^*$. In the adversarial setting, we will analyze both the finite and infinite expert scenarios. In the infinite case, since $L$ is non-convex in the relevant parameters (Eq. (1)), further care is needed.
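The loss (1) and the feedback structure of $G^{\mathrm{ABS}}_t$ translate directly into code. The following sketch, which assumes the illustrative `Expert` class above, returns exactly the set of expert losses revealed at a round, depending on whether the chosen expert predicts or abstains.

```python
def abstention_loss(expert, x, y, c=0.2):
    """Abstention loss of Eq. (1) with the 0/1 prediction loss."""
    if expert.abstains(x):
        return c
    return float(y * expert.h(x) <= 0)  # zero-one misclassification cost


def observed_losses(experts, chosen, x, y, c=0.2):
    """Losses revealed at round t, following the feedback graph G_t^ABS.

    If the chosen expert predicts, y is revealed and every expert's loss
    is observed; if it abstains, only the losses of the abstaining experts
    (all equal to c) are observed.
    """
    if not experts[chosen].abstains(x):
        return {j: abstention_loss(e, x, y, c) for j, e in enumerate(experts)}
    return {j: c for j, e in enumerate(experts) if e.abstains(x)}
```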
## 3. Adversarial setting

As a warm-up, we start with the adversarial setting with finitely many experts. Following ideas from Alon et al. (2014; 2015), we design an online algorithm for the abstention scenario by combining standard finite-arm bandit algorithms, like EXP3 (Auer et al., 2003), with the feedback graph $G^{\mathrm{ABS}}_t$ of Section 2. We call the resulting algorithm EXP3-ABS (EXP3 with abstention). The algorithm is a variant of EXP3 in which the importance-weighting scheme used to achieve unbiased loss estimates is based on the probability of an expert's loss being observed, as opposed to the probability of the expert being selected; see Appendix B (Algorithm 3). The following guarantee holds for this algorithm.

Theorem 1. Let EXP3-ABS be run with learning rate $\eta$ over a set of $K$ experts $\psi_1, \ldots, \psi_K$. Then, the algorithm admits the following regret guarantee after $T$ rounds:

$$R_T(\text{EXP3-ABS}) \le \frac{\log K}{\eta} + \frac{\eta\, T (c^2 + 1)}{2}.$$

In particular, if EXP3-ABS is run with $\eta = \sqrt{\frac{2 \log K}{(c^2 + 1)\, T}}$, then $R_T(\text{EXP3-ABS}) \le \sqrt{2 (c^2 + 1)\, T \log K}$.

The proof of this result, as well as all other proofs, is given in the appendix. The dependency of the bound on the number of experts is clearly more favorable than the standard bound for EXP3 ($\sqrt{\log K}$ instead of $\sqrt{K}$). Theorem 1 is in fact reminiscent of what one can achieve using the contextual-bandit algorithm EXP4 (Auer et al., 2002b) run on $K$ experts, each one having two actions.

We now turn our attention to the case of an uncountably infinite $E$. To model this more general framework, one might be tempted to focus on parametric classes of functions $h$ and $r$, e.g., the family $E$ of linear functions

$$\big\{ \psi = (h, r) : h(x) = w^\top x,\ r(x) = |w^\top x| - \theta,\ w \in \mathbb{R}^d,\ \theta > 0 \big\},$$

introduce some convex surrogate of the abstention loss (1), and work in the parametric space of $(w, \theta)$ through some Bandit Convex Optimization technique (e.g., (Hazan, 2016)). Unfortunately, this approach is not easy to put in place, since the surrogate loss not only needs to ensure convexity and some form of calibration, but also the ability for the algorithm to observe the loss of its own action (the self-loops in the graph of Figure 1). We have been unable to get around this problem by just resorting to convex surrogate losses (and we strongly suspect that it is not possible); in what follows, we instead introduce a surrogate abstention loss which is Lipschitz but not convex. Moreover, we take the more general viewpoint of competing with pairs $(h, r)$ of Lipschitz functions with bounded Lipschitz constant.

Let us then consider the version of the abstention loss (1) with $\ell(y, h(x)) = f_\gamma(-y h(x))$, where $f_\gamma$ is the 0/1-loss with slope $1/(2\gamma)$ at the origin,

$$f_\gamma(a) = \Big( \frac{1}{2} + \frac{a}{2\gamma} \Big) 1_{|a| \le \gamma} + 1_{a > 0}\, 1_{|a| > \gamma}$$

(see Figure 2(a)), and the class of experts $E = \{ \psi = (h, r) \mid h, r \colon X \subseteq \mathbb{R}^d \to [-1, 1] \}$. Here, the functions $h$ and $r$ in the definition of $E$ are assumed to be $L_E$-Lipschitz with respect to an appropriate distance on $\mathbb{R}^d$, for some constant $L_E$ which determines the size of the family $E$.
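For concreteness, here is a minimal sketch of the slanted loss $f_\gamma$ as reconstructed above; the linear middle piece (value $1/2$ and slope $1/(2\gamma)$ at the origin) follows the description in the text and Figure 2(a), so treat the exact closed form as an assumption.

```python
def f_gamma(a, gamma):
    """Slanted 0/1-loss: slope 1/(2*gamma) at the origin, equal to the
    plain 0/1-loss outside [-gamma, gamma]; applied to a = -y * h(x)."""
    if a < -gamma:
        return 0.0   # confidently correct
    if a > gamma:
        return 1.0   # confidently wrong
    return 0.5 + a / (2.0 * gamma)  # linear ramp through (0, 1/2)
```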
Using ideas from (Cesa-Bianchi et al., 2017), we present an algorithm that approximates the action space by a finite cover while exploiting the structure of the abstention setting. The crux of the problem is to define a Lipschitz function $\widetilde{L}$ that upper bounds the abstention loss while maintaining the same feedback assumptions, namely the feedback graph given in Figure 1. One Lipschitz function $\widetilde{L}$ that precisely solves this problem is the following:

$$\widetilde{L}(\psi, z) = \begin{cases} c & \text{if } r(x) \le -\gamma, \\ 1 + (1 - c)\, \frac{r(x)}{\gamma} & \text{if } r(x) \in (-\gamma, 0), \\ 1 - \big(1 - f_\gamma(-y h(x))\big)\, \frac{r(x)}{\gamma} & \text{if } r(x) \in [0, \gamma), \\ f_\gamma(-y h(x)) & \text{if } r(x) \ge \gamma, \end{cases}$$

for $\gamma \in (0, 1)$. $\widetilde{L}(\psi, z)$ is plotted in Figure 2(b). Notice that this function is consistent with the feedback requirements of Section 2: $r_{I_t}(x_t) \le 0$ implies that $\widetilde{L}((h(x_t), r(x_t)), z_t)$ is known to the algorithm (i.e., is independent of $y_t$) for all $(h, r) \in E$ such that $r(x_t) \le 0$, while $r_{I_t}(x_t) > 0$ gives complete knowledge of $\widetilde{L}((h(x_t), r(x_t)), z_t)$ for all $(h, r) \in E$, since $y_t$ is observed. We can then adapt the machinery from (Cesa-Bianchi et al., 2017) so as to apply a contextual version of EXP3-ABS to the sequence of losses $\widetilde{L}(\cdot, z_t)$, $t \in [T]$. The algorithm adaptively covers $X$ with balls of a fixed radius $\varepsilon$, each ball hosting an instance of EXP3-ABS. We call this algorithm CONTEXP3-ABS; see Appendix B.2 for details.

Theorem 2. Consider the abstention loss

$$L(\psi, z) = f_\gamma(-y h(x))\, 1_{r(x) > 0} + c\, 1_{r(x) \le 0},$$

and let $\psi^* = (h^*, r^*) = \operatorname{argmin}_{\psi \in E} \sum_{t=1}^{T} L(\psi, z_t)$, with $E = \{(h, r)\}$ made of pairs of Lipschitz functions as described above. If CONTEXP3-ABS is run with parameter $\varepsilon \simeq T^{-\frac{1}{2+d}} \gamma^{\frac{2}{2+d}}$ and an appropriate learning rate (see Appendix B), then it admits the following regret guarantee:

$$R_T(\text{CONTEXP3-ABS}) \le \widetilde{O}\Big( T^{\frac{d+1}{d+2}}\, \gamma^{-\frac{d}{d+2}} + T_{\psi^*}(\gamma) \Big),$$

where $T_{\psi^*}(\gamma)$ is the number of $x_t$ such that $|r^*(x_t)| \le \gamma$. In the above, $\widetilde{O}$ hides constant and $\ln(T)$ factors, while $\simeq$ disregards constants like $L_E$ and various log factors.

Figure 2: (a) The 0/1-loss function $f_\gamma$ with slope $1/(2\gamma)$ at the origin. (b) For a given value of $x$ and margin $a = -y h(x)$ (which in turn sets the value of $f = f_\gamma(a) \in [0, 1]$), plots of the abstention loss function $L(a, r)$ (dotted blue curve) and the surrogate abstention loss $\widetilde{L}(a, r)$ (red curve), both as a function of $r = r(x) \in [-1, 1]$.

CONTEXP3-ABS is also computationally efficient, thereby providing a compelling solution to the infinite-armed case of online learning with abstention.
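The following sketch instantiates the surrogate as reconstructed above, reusing `f_gamma` from the earlier sketch; the interpolation coefficients on $(-\gamma, \gamma)$ follow our reconstruction of the garbled display, so treat them as an assumption rather than the paper's verbatim formula. The structural property that matters is visible in the code: the branches with $r(x) \le 0$ never touch $y$.

```python
def surrogate_loss(r_x, margin, c, gamma):
    """Sketch of the Lipschitz surrogate L~(psi, z) for r_x = r(x) and
    margin = -y * h(x). For r_x <= 0 the value is independent of y,
    matching the feedback graph of Figure 1."""
    if r_x <= -gamma:
        return c                                   # deep in the abstention region
    if r_x < 0:
        return 1.0 + (1.0 - c) * r_x / gamma       # y-independent ramp up to 1
    f = f_gamma(margin, gamma)                     # prediction loss, needs y
    if r_x < gamma:
        return 1.0 - (1.0 - f) * r_x / gamma       # ramp down from 1 to f
    return f                                       # deep in the prediction region
```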
## 4. Stochastic setting

We now turn to studying the stochastic setting. As pointed out in Section 2, the problem can be cast as an instance of online learning with time-varying feedback graphs $G^{\mathrm{ABS}}_t$. Thus, a natural method for tackling the problem would be to extend existing algorithms designed for the stochastic setting with feedback graphs to our abstention scenario (Cohen et al., 2016; Caron et al., 2012). We cannot benefit from the algorithm of Cohen et al. (2016) in our scenario. This is because at the heart of its design and theoretical guarantees lies the assumption that the graphs and losses are independent. The dependency of the feedback graphs on the observations $z_t$, which also define the losses, is precisely a property that we wish to exploit in our scenario. An alternative is to extend the UCB-N algorithm of Caron et al. (2012), for which the authors provide gap-based regret guarantees. This algorithm is defined for a stochastic setting with an undirected feedback graph that is fixed over time. The algorithm can be straightforwardly extended to the case of directed time-varying feedback graphs (see Algorithm 1). We will denote that extension by UCB-NT to explicitly differentiate it from UCB-N. Let $N_t(j)$ denote the set of out-neighbors of vertex $j$ in the directed graph at time $t$, i.e., the set of vertices $k$ that are destinations of an edge from $j$. Then, as with UCB-N, the algorithm updates, at each round $t$, the upper-confidence bound of every expert for which a feedback is received (those in $N_t(I_t)$), as opposed to updating only the upper-confidence bound of the selected expert, as in the standard UCB of Auer et al. (2002a).

ALGORITHM 1: UCB-NT
for $t \ge 1$ do
  RECEIVE($x_t$);
  $I_t \leftarrow \operatorname{argmin}_{j \in E} \big( \widehat{\mu}_{j,t-1} - S_{j,t-1} \big)$;
  for $j \in E$ do
    $Q_{j,t} \leftarrow \sum_{s=1}^{t} 1_{j \in N_s(I_s)}$;
    $\widehat{\mu}_{j,t} \leftarrow \frac{1}{Q_{j,t}} \sum_{s=1}^{t} L(\psi_j, z_s)\, 1_{j \in N_s(I_s)}$.
  end for
end for

In the context of learning with abstention, the natural feedback graph $G^{\mathrm{ABS}}_t$ at time $t$ depends on the observation $x_t$ and varies over time. Can we extend the regret guarantees of Caron et al. (2012) to UCB-NT with such graphs? We will show in Section 4.1 that vanishing regret guarantees do not hold for UCB-NT run with the graphs $G^{\mathrm{ABS}}_t$. This is because of a fundamental estimation-bias problem that arises when the graph at time $t$ depends on the observation $x_t$. This issue affects, more generally, any natural method using the $G^{\mathrm{ABS}}_t$ graphs. Nevertheless, we will show in Section 4.2 that UCB-NT does benefit from favorable guarantees, provided the feedback graph $G^{\mathrm{ABS}}_t$ it uses at round $t$ is replaced by one that only depends on events up to time $t - 1$.

### 4.1. Bias problem

Figure 3: Illustration of the bias problem.

Assume there are two experts, $\psi_1$ (red) and $\psi_2$ (blue), with $\mu_2 < \mu_1$, and $X = [0, 1]$ (see Figure 3). For $x > \frac{1}{2}$, the red expert $\psi_1$ is abstaining and incurring a loss $c$, whereas the blue expert is never abstaining. Assume that the probability mass is quasi-uniform over the interval $[0, 1]$, but with slightly more mass over the region $x < \frac{1}{2}$. The algorithm may then start out by observing points in this region. Here, both experts accept, and the algorithm obtains error estimates corresponding to the solid red and blue lines for $x < \frac{1}{2}$. When the algorithm observes a point $x > \frac{1}{2}$, it naturally selects the red abstaining expert, since it admits a better current estimated loss. However, for $x > \frac{1}{2}$, the red expert is worse than the blue expert $\psi_2$. Furthermore, it is abstaining and thus providing no updates for expert $\psi_2$ (which is instead predicting). Hence, the algorithm continues to maintain an estimate of $\psi_2$'s loss at the level of the blue solid line indicated for $x < \frac{1}{2}$; it then continues to select the red expert for all $x$'s and incurs a high regret.¹

This simple example shows that, unlike in the adversarial scenario (Section 3), $G^{\mathrm{ABS}}_t$ here cannot depend on the input $x_t$, and that, in general, the indiscriminate use of feedback graphs may result in biased loss observations. On the other hand, we know that if we were to avoid using feedback graphs at all (which is always possible using UCB), we would always be able to define unbiased loss estimates. A natural question is then: can we construct time-varying feedback graphs that lead to unbiased loss observations? In the next section, we show how to design such a sequence of auxiliary feedback graphs, which in turn allows us to extend UCB-NT to the setting of time-varying feedback graphs for general loss functions. Under this assumption, we can achieve unbiased empirical estimates of the average losses $\mu_j$ of the experts, which allows us to apply standard concentration bounds in the proof of this algorithm.

¹For the sake of clarity, we did not introduce specific real values for the expected loss of each expert on each of the half intervals, but that can be done straightforwardly. We have also verified experimentally with such values that the bias problem just pointed out indeed leads to poor regret for UCB-NT.
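Before moving on, here is a Python sketch of Algorithm 1 (UCB-NT) under illustrative assumptions: the callbacks `neighborhood` and `loss` stand in for the feedback graph and observed losses, and the confidence width $S_{j,t-1}$, whose exact form the extracted text defers elsewhere, is instantiated as a standard UCB-style term.

```python
import math
import numpy as np

def ucb_nt(n_experts, rounds, neighborhood, loss, confidence=1.5):
    """Sketch of UCB-NT (Algorithm 1).

    `neighborhood(t, j)` returns the out-neighbors N_t(j) of the selected
    expert j in the feedback graph used at round t, and `loss(i, t)`
    returns the observed loss L(psi_i, z_t); both are assumed callbacks.
    S_{j,t-1} is instantiated here as sqrt(confidence * log t / Q_j),
    a standard UCB-style choice, not the paper's exact definition.
    """
    Q = np.zeros(n_experts)        # number of loss observations per expert
    mu_hat = np.zeros(n_experts)   # empirical mean loss per expert
    for t in range(1, rounds + 1):
        # Optimistic selection for loss minimization: lower confidence bound.
        width = np.sqrt(confidence * math.log(t + 1) / np.maximum(Q, 1.0))
        I_t = int(np.argmin(mu_hat - width))
        # Update every expert whose loss is revealed by the feedback graph.
        for i in neighborhood(t, I_t):
            Q[i] += 1.0
            mu_hat[i] += (loss(i, t) - mu_hat[i]) / Q[i]
    return mu_hat, Q
```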
### 4.2. Time-varying graphs for UCB-NT

We now show that UCB-NT benefits from favorable guarantees, so long as the feedback graph it uses at time $t$ depends only on events up to time $t - 1$. This extension works for general bounded losses and does not only apply to our specific abstention loss $L$. So, let us assume that the feedback graph in round $t$ (and the associated out-neighborhoods $N_t(\cdot)$) in Algorithm 1 only depends on the observed losses $L(\psi_i, z_s)$ and inputs $x_s$, for $s = 1, \ldots, t - 1$ and $i \in [K]$, and let us denote this feedback graph by $G_t$, so as not to confuse it with $G^{\mathrm{ABS}}_t$. Under this assumption, we can derive strong regret guarantees for UCB-NT with time-varying graphs.

Our guarantees are expressed in terms of the best sequence of admissible $p$-partitionings of the graphs $G_t$. For $p \in [K]$, we say that $(C_{t,k})_{k \in [p]}$ is an admissible $p$-partitioning of $G_t = (V, E_t)$ if $V = \{\psi_j : j \in [K]\}$ is the union of $p$ (disjoint) components $C_{t,k}$, that is, $V = \bigcup_{k \in [p]} C_{t,k}$, and all vertices within a component $C_{t,k}$ are neighbors in $G_t$ with reciprocal edges: if $\psi_i$ and $\psi_j$ are in $C_{t,k}$, then we have $\psi_i \in N_t(\psi_j)$ and $\psi_j \in N_t(\psi_i)$. Note that admissible $p$-partitionings are typically not unique, since two neighboring vertices $\psi_i$ and $\psi_j$ may or may not be placed in the same component. We denote by $\mathcal{S}_p$ the set of all sequences $((C_{1,k})_{k \in [p]}, \ldots, (C_{T,k})_{k \in [p]})$ of admissible $p$-partitionings $(C_{t,k})_{k \in [p]}$ of the graphs $G_t$. Moreover, for any sequence $((C_{1,k})_{k \in [p]}, \ldots, (C_{T,k})_{k \in [p]})$ in $\mathcal{S}_p$, we denote by $C_k$ the union of the components indexed by $k$ over all rounds: $C_k = \bigcup_{t \in [T]} C_{t,k}$.

Theorem 3. Assume that, for all $t \in [T]$, the feedback graph $G_t$ depends only on information up to time $t - 1$. Then, the regret of UCB-NT is bounded as follows:

$$R_T(\text{UCB-NT}) \le \min_{p \in [K]} \min_{S_p \in \mathcal{S}_p} \sum_{k=1}^{p} \frac{8 \max_{j \in C_k} \Delta_j}{\big( \min_{j \in C_k} \Delta_j \big)^2} \log T + O(K).$$

For a sequence $S_p$ made up of the same partition $(C_{1,k})_{k \in [p]}$ repeated $T$ times, the theorem gives a bound on the regret based on this fixed partition, as it is the sum of $p$ components, one per cluster $C_{1,k}$ in the partition. The minimum over $(p, S_p)$ then simply chooses the number of clusters $p$ and the partitioning of $V$ into $p$ clusters having the smallest regret. Theorem 3 can be interpreted as an extension of Theorem 2 in Caron et al. (2012) to time-varying feedback graphs. Its proof involves showing that the use of feedback graphs $G_t$ that depend only on information up to $t - 1$ can result in unbiased loss estimates, and it also uses the newly defined notion of admissible $p$-partitionings to derive a time-varying bound that leverages the shared updates from the graph. Moreover, the bound illustrates that if the feedback graphs in a problem admit a $p$-partitioning for some small $p \ll K$ (e.g., if the feedback graphs can be decomposed into a small number of components that are approximately fixed across time) for which $\max_{j \in C_k} \Delta_j \approx \min_{j \in C_k} \Delta_j$, then this bound can be up to a factor $p/K$ tighter than the bound guaranteed by the standard UCB algorithm. Moreover, this regret guarantee is always more favorable than that of the standard UCB, since the (trivial) $K$-partitioning that splits $V$ into $K$ singletons for all $t$ is an admissible $K$-partitioning for all $G_t$'s.
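The admissibility condition is easy to state operationally; the following sketch checks it for a single graph $G_t$ given as an out-neighbor map (an illustration of the definition above, not code from the paper).

```python
def is_admissible_partitioning(partition, out_neighbors):
    """Check that a partition of the vertex set is an admissible
    p-partitioning of one graph G_t: any two distinct vertices in the
    same component must be mutual out-neighbors (reciprocal edges).

    `partition` is a list of sets of vertices; `out_neighbors[v]` is the
    set N_t(v).
    """
    for component in partition:
        for i in component:
            for j in component:
                if i != j and (j not in out_neighbors[i] or
                               i not in out_neighbors[j]):
                    return False
    return True


# Vertices 0 and 1 observe each other; vertex 2 only observes itself.
graph = {0: {0, 1}, 1: {0, 1}, 2: {2}}
print(is_admissible_partitioning([{0, 1}, {2}], graph))  # True  (p = 2)
print(is_admissible_partitioning([{0, 1, 2}], graph))    # False (p = 1)
```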
Furthermore, note that by construction, all vertices within the same component of an admissible $p$-partitioning are connected to one another. Thus, if the feedback graph is fixed throughout all rounds and we interpret the doubly-directed edges as edges of an undirected graph $G^U$, we straightforwardly obtain the following result, which is comparable to Theorem 2 in (Caron et al., 2012).

Corollary 1. If the feedback graph $G_t = G$ is fixed over time, then the guarantee of Theorem 3 is upper-bounded by

$$\min_{\mathcal{C}} \sum_{C \in \mathcal{C}} \frac{8 \max_{i \in C} \Delta_i}{\big( \min_{i \in C} \Delta_i \big)^2} \log T + O(K),$$

the outer minimum being over all clique coverings $\mathcal{C}$ of $G^U$.

Caron et al. (2012) present matching lower bounds for the case of stochastic bandits with a fixed feedback graph. Since we can again design abstention scenarios with fixed feedback graphs, these bounds carry over to our setting.

Now, how can we use the results of this section to design an algorithm for the abstention scenario? The natural feedback graphs we discussed in Section 3 are no longer applicable, since $G^{\mathrm{ABS}}_t$ depends on $x_t$. Nevertheless, we will present two solutions to this problem. In Section 4.3, we present a solution with a fixed graph $G$ that closely captures the problem of learning with abstention. Next, in Section 4.4, we will show how to define and leverage a time-varying graph $G_t$ that is estimated based on past observations.

### 4.3. UCB-N with the subset feedback graph

In this section, we define a subset feedback graph, $G^{\mathrm{SUB}}$, that captures the most informative feedback in the problem of learning with abstention and yet is safe, in the sense that it does not depend on $x_t$. The definition of the graph is based on the following simple observation: if the abstention region associated with $\psi_i$ is a subset of that of $\psi_j$, then, whenever $\psi_i$ is selected at some round $t$ and is abstaining, so is $\psi_j$. For an example, see $\psi_i$ and $\psi_j$ in Figure 4 (top). Crucially, this implication holds regardless of the particular input point $x_t$ received in the region of abstention of $\psi_i$. Thus, the set of vertices of $G^{\mathrm{SUB}}$ is $E$, and $G^{\mathrm{SUB}}$ admits an edge from $\psi_i$ to $\psi_j$ iff $\{x \in X : r_i(x) \le 0\} \subseteq \{x \in X : r_j(x) \le 0\}$. Since $G^{\mathrm{SUB}}$ does not vary with time, it trivially satisfies the condition of the previous section. Thus, UCB-NT run with $G^{\mathrm{SUB}}$ admits the regret guarantees of Theorem 3, where the admissible $p$-partitionings are those of the fixed graph $G^{\mathrm{SUB}}$.

The example of Section 4.1 illustrated a bias problem in a special case where the feedback graphs $G_t$ were not subgraphs of $G^{\mathrm{SUB}}$. The following result shows, more generally, that feedback graphs not included in $G^{\mathrm{SUB}}$ may result in catastrophic regret behavior.

Proposition 1. Assume that UCB-NT is run with feedback graphs $G_t$ that are not subsets of $G^{\mathrm{SUB}}$. Then, there exists a family of predictors $H$, a Lipschitz loss function $\ell$ in (1), and a distribution $D$ over the $z_t$'s for which UCB-NT incurs linear regret with arbitrarily high probability.

The proof of the proposition is given in Appendix C.3. In view of this result, no fixed feedback graph for UCB-NT can be more informative than $G^{\mathrm{SUB}}$. But how can we leverage past observations (up to time $t - 1$) to derive a feedback graph that is more informative than the simple subset graph $G^{\mathrm{SUB}}$? The next section provides a solution based on feedback graphs estimated from past observations, along with a new algorithm.
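The subset graph can be approximated directly from the abstention functions. The sketch below certifies containment of abstention regions on a finite set of probe points (reusing the illustrative `Expert` interface from earlier); this is a heuristic stand-in for the exact geometric containment test, which for nested regions such as concentric thresholds can be done in closed form.

```python
def subset_graph(experts, probe_points):
    """Sketch: approximate the subset feedback graph G^SUB.

    G^SUB has an edge i -> j iff the abstention region of expert i is
    contained in that of expert j. Here containment is only checked on
    the probe points, so the result is an approximation.
    """
    K = len(experts)
    edges = set()
    for i in range(K):
        for j in range(K):
            contained = all(experts[j].abstains(x)
                            for x in probe_points if experts[i].abstains(x))
            if contained:
                edges.add((i, j))  # whenever i abstains, j appears to abstain too
    return edges
```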
### 4.4. UCB-GT algorithm

We seek graphs $G_t$ that admit $G^{\mathrm{SUB}}$ as a subgraph. We will show how certain types of edges can be safely added to $G^{\mathrm{SUB}}$ based on past observations. This leads to a new algorithm, UCB-GT (UCB with estimated time-varying graph), whose pseudocode is given in Algorithm 2. As illustrated by Figure 4, the key idea of UCB-GT is to augment $G^{\mathrm{SUB}}$ with edges from $\psi_j$ to $\psi_i$ where the subset property $\{x : r_j(x) \le 0\} \subseteq \{x : r_i(x) \le 0\}$ may not hold, but where the implication $(r_j(x) \le 0 \Rightarrow r_i(x) \le 0)$ holds with high probability over the choice of $x \in X$, that is, the region $\{x : r_j(x) \le 0 \wedge r_i(x) > 0\}$ admits low probability. Of course, adding such an edge $\psi_j \to \psi_i$ can cause the estimation bias of Section 4.1. But, if we restrict ourselves to cases where $p_{j,i} = \mathbb{P}[r_j(x) \le 0 \wedge r_i(x) > 0]$ is upper bounded by some carefully chosen quantity that changes over rounds, the effect of this bias will be limited. Conversely, as illustrated in Figure 4, the resulting feedback graph can be substantially more beneficial, since it may have many more edges than $G^{\mathrm{SUB}}$, hence leading to more frequent updates of the experts' losses and more favorable regret guarantees. This benefit is further corroborated by our experimental results (Section 5).

Since we do not have access to $p_{j,i}$, we use instead the empirical estimates $\widehat{p}^{\,t-1}_{j,i} := \frac{1}{t-1} \sum_{s=1}^{t-1} 1_{r_j(x_s) \le 0,\, r_i(x_s) > 0}$. At time $t$, if expert $\psi_j$ is selected, we update expert $\psi_i$ if the condition $\widehat{p}^{\,t-1}_{j,i} \le \gamma_{i,t-1}$ holds, with $\gamma_{i,t-1} = \sqrt{5\, Q_{i,t-1} \log(t) / ((K-1)(t-1))}$. If the chosen expert $\psi_{I_t}$ abstains while expert $\psi_j$ predicts and satisfies $\widehat{p}^{\,t-1}_{I_t,j} \le \gamma_{j,t-1}$, then we do not have access to the true label $y_t$. In that case, we optimistically update our empirical estimate as if the expert had loss 0 at that round (Step (*) in Algorithm 2).

ALGORITHM 2: UCB-GT
for $t \ge 1$ do
  RECEIVE($x_t$);
  $I_t \leftarrow \operatorname{argmin}_{i \in E} \big( \widehat{\mu}_{i,t-1} - S_{i,t-1} \big)$, where $S_{i,t-1}$ is as in Algorithm 1;
  for $i \in E$ do
    if $\widehat{p}^{\,t-1}_{I_t,i} \le \gamma_{i,t-1}$ then
      $Q_{i,t} \leftarrow Q_{i,t-1} + 1$;
      if $r_{I_t}(x_t) \le 0 \wedge r_i(x_t) > 0$ then $\widehat{\mu}_{i,t} \leftarrow \big(1 - \frac{1}{Q_{i,t}}\big)\, \widehat{\mu}_{i,t-1}$;  (*)
      else $\widehat{\mu}_{i,t} \leftarrow \frac{L(\psi_i, z_t)}{Q_{i,t}} + \big(1 - \frac{1}{Q_{i,t}}\big)\, \widehat{\mu}_{i,t-1}$;
    else $Q_{i,t} \leftarrow Q_{i,t-1}$, $\widehat{\mu}_{i,t} \leftarrow \widehat{\mu}_{i,t-1}$.
  end for
end for

The feedback graph $G_t$ just described can be defined via the out-neighborhood of vertex $\psi_j$: $N_t(j) = \{\psi_i \in E : \widehat{p}^{\,t-1}_{j,i} \le \gamma_{i,t-1}\}$. Let $S_p = (C_{t,k})_{t \in [T], k \in [p]}$ denote any sequence of admissible $p$-partitionings of these feedback graphs; then the following regret guarantee holds for UCB-GT.

Theorem 4. For any $t \in [T]$, let the feedback graph $G_t$ be defined by the out-neighborhood $N_t(j) = \{\psi_i \in E : \widehat{p}^{\,t-1}_{j,i} \le \gamma_{i,t-1}\}$. Then, the regret of UCB-GT admits a bound of the same form as that of Theorem 3, now expressed in terms of the admissible $p$-partitionings of these estimated graphs $G_t$.

Figure 4: The top row shows three experts $\psi_i$, $\psi_j$, and $\psi_k$ on a one-dimensional input space, marked by their prediction and abstention regions. Below each region, the time-varying graph $G^{\mathrm{ABS}}_t$ is shown. To avoid the bias problem affecting the graphs $G^{\mathrm{ABS}}_t$, one option is to use $G^{\mathrm{SUB}}$. Yet, as illustrated, $G^{\mathrm{SUB}}$ is minimal and in this example admits only one edge (excluding self-loops). Thus, a better option is to use the time-varying graphs of UCB-GT, since they are richer and more informative. For these graphs, an edge is added from $\psi_j$ to $\psi_i$ if the probability of the region where $\psi_j$ is abstaining but $\psi_i$ is predicting is (estimated to be) small.

Since the graph $G_t$ of UCB-GT has more edges than $G^{\mathrm{SUB}}$, it admits more admissible partitionings than $G^{\mathrm{SUB}}$, which leads to a more favorable guarantee than that of UCB-NT run with $G^{\mathrm{SUB}}$. The proof of this result differs from the standard UCB analysis and that of Theorem 3 in that it involves showing that the UCB-GT algorithm can adequately control the amount of bias introduced by the skewed loss estimates. The experiments in the next section provide an empirical validation of this theoretical comparison.
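Putting the pieces together, here is a Python sketch of UCB-GT (Algorithm 2). It reuses the illustrative `Expert` and `abstention_loss` sketches from earlier, and the confidence width is the same illustrative choice as in the UCB-NT sketch; only the update-eligibility test $\widehat{p}^{\,t-1}_{j,i} \le \gamma_{i,t-1}$ and the optimistic step (*) follow the algorithm as described above.

```python
import math
import numpy as np

def ucb_gt(experts, xs, ys, c, confidence=1.5):
    """Sketch of UCB-GT (Algorithm 2) under illustrative assumptions."""
    K, T = len(experts), len(xs)
    Q = np.zeros(K)            # number of (possibly optimistic) updates per expert
    mu_hat = np.zeros(K)       # empirical mean loss per expert
    counts = np.zeros((K, K))  # counts[j, i] = #{s < t : psi_j abstains, psi_i predicts}
    for t in range(1, T + 1):
        x, y = xs[t - 1], ys[t - 1]
        width = np.sqrt(confidence * math.log(t + 1) / np.maximum(Q, 1.0))
        I = int(np.argmin(mu_hat - width))
        p_hat = counts / max(t - 1, 1)              # estimates p-hat^{t-1}_{j,i}
        gamma = np.sqrt(5.0 * np.maximum(Q, 1.0) * math.log(t + 1)
                        / (max(K - 1, 1) * max(t - 1, 1)))
        chosen_abstains = experts[I].abstains(x)
        for i in range(K):
            if p_hat[I, i] <= gamma[i]:             # i is in N_t(I)
                Q[i] += 1.0
                if chosen_abstains and not experts[i].abstains(x):
                    mu_hat[i] += (0.0 - mu_hat[i]) / Q[i]  # optimistic step (*)
                else:
                    mu_hat[i] += (abstention_loss(experts[i], x, y, c)
                                  - mu_hat[i]) / Q[i]
        for j in range(K):                          # refresh co-abstention stats
            if experts[j].abstains(x):
                for i in range(K):
                    if not experts[i].abstains(x):
                        counts[j, i] += 1.0
    return mu_hat
```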
## 5. Experiments

In this section, we report the results of several experiments on ten datasets comparing UCB-GT, UCB-NT with feedback graph $G^{\mathrm{SUB}}$, vanilla UCB (with no sharing of information across experts), as well as Full-Supervision (FS). FS is an algorithm that at each round chooses the expert $\psi_j$ with the smallest empirical abstention loss so far, $\widehat{\mu}_{j,t-1}$, and, even if this expert abstains, receives the true label and can update the empirical abstention loss estimates of all experts. FS reflects an unrealistic and overly optimistic scenario that clearly falls outside the abstention setting, but it provides an upper bound on the best performance we may hope for.

We used the following eight datasets from the UCI data repository: HIGGS, phishing, ijcnn, covtype, eye, skin, cod-rna, and guide. We also used the CIFAR dataset from (Krizhevsky et al., 2009), where we extracted the first twenty-five principal components and used their projections as features, and a synthetic dataset of points drawn according to the uniform distribution in $[-1, 1]^2$. For each dataset, we generated a total of $K = 2{,}100$ experts, and all the algorithms were tested for a total of $T = 10{,}000$ rounds. The experts $\psi = (h, r)$ were chosen in the following way. The predictors $h$ are hyperplanes centered at the origin whose normal vector in $\mathbb{R}^d$ is drawn randomly from the Gaussian distribution $N(0, 1)^d$, where $d$ is the dimension of the feature space of the dataset. The abstention functions $r$ are concentric annuli around the origin with radii in $\{0, \frac{d}{20}, \frac{2d}{20}, \ldots, d\}$. For each dataset, we generated 100 predictors, and each predictor $h$ is paired with the 21 abstention functions $r$.

For a fixed set of experts, we first calculated the regret by averaging over five random draws of the data, where the best-in-class expert was determined in hindsight as the one with the minimum average cumulative abstention loss. We then repeated this experiment five times over different sets of experts and averaged the results. We report these results for $c \in \{0.1, 0.2, 0.3\}$.

Figure 5: From the left: the average regret $R_t(\cdot)/t$, the fraction of points the chosen expert abstained on, and the number of edges of the feedback graph, each as a function of $t$ (log scale), for UCB-GT, UCB-NT, UCB, and FS. The top row shows results for cod-rna with cost $c = 0.2$ and the bottom row for guide with cost $c = 0.1$. More results are in Appendix D.

Figure 5 shows the averaged regret $R_t(\cdot)/t$ with standard deviations across the five repetitions for the different algorithms as a function of $t \in [T]$ for two datasets. In Appendix D, we present plots of the regret for all ten datasets. These results show that UCB-GT outperforms both UCB-NT and UCB on all datasets for all abstention cost values. Remarkably, UCB-GT's performance is close to that of FS for most datasets, thereby implying that UCB-GT attains almost the best regret that we could hope for. We also find that UCB-NT performs better than vanilla UCB. Figure 5 also illustrates the fraction of points on which the chosen expert abstains, as well as the number of edges in the feedback graph as a function of the number of rounds. We only plot the number of edges of UCB-GT, since that is the only graph that varies with time. For both experiments depicted, and in general for the rest of the datasets, the number of edges for UCB-GT is between 1 million and 3 million, which is at least a factor of 5 more than for UCB-NT, where the number of edges we observed is of the order of 200,000. FS enjoys the full information property, and its number of edges is fixed at 4 million (complete graph). The increased information sharing of UCB-GT is clearly a strong contributing factor to the algorithm's improvement in regret relative to UCB-NT.
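The expert-generation protocol described above can be reproduced with a short sketch. Note one simplification: the paper pairs each predictor with concentric annuli, whereas the stand-in below uses centered balls of the corresponding radii, so the abstention geometry is an assumption.

```python
import numpy as np

class BallAbstainExpert:
    """Stand-in expert for the experiments: a random hyperplane predictor
    paired with a centered abstention region (a simplified stand-in for
    the paper's concentric annuli)."""

    def __init__(self, w, radius):
        self.w = np.asarray(w, dtype=float)
        self.radius = float(radius)

    def h(self, x):
        return float(self.w @ x)

    def abstains(self, x):
        return float(np.linalg.norm(x)) <= self.radius


def make_experts(d, n_predictors=100, n_radii=21, seed=0):
    """Generate the pool of K = n_predictors * n_radii experts
    (100 * 21 = 2,100, as in the experiments)."""
    rng = np.random.default_rng(seed)
    experts = []
    for _ in range(n_predictors):
        w = rng.normal(size=d)  # normal vector drawn from N(0, 1)^d
        for k in range(n_radii):
            # radii 0, d/20, 2d/20, ..., d
            experts.append(BallAbstainExpert(w, radius=k * d / (n_radii - 1)))
    return experts
```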
In general, we find that, provided the estimation bias is controlled, the larger the number of edges, the smaller the regret. Regarding the value of the cost $c$, as expected, we observe that the fraction of points the chosen expert abstains on always decreases as $c$ increases, but also that this fraction depends on the dataset and the experts used. Finally, Appendix D includes more experiments on different aspects of the problem. In particular, we tested how the number of experts or a different choice of experts (confidence-based experts) affected the results. We also experimented with some extreme abstention costs and, as expected, found the fraction of abstained points to be large for $c = 0.001$ and small for $c = 0.9$. In all of these additional experiments, UCB-GT outperformed UCB-NT.

## 6. Conclusion

We presented a comprehensive analysis of the novel setting of online learning with abstention, including algorithms with favorable guarantees in both the stochastic and adversarial scenarios, and extensive experiments demonstrating the performance of UCB-GT in practice. Our algorithms and analysis can be straightforwardly extended to similar problems, including the multi-class and regression settings, as well as other related scenarios, such as online learning with budget constraints. A key idea behind the design of our algorithms in the stochastic setting is to leverage the stochastic sequence of feedback graphs. This idea can perhaps be generalized and applied to other problems where time-varying feedback graphs naturally appear.

## References

Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. From bandits to experts: A tale of domination and independence. In NIPS, 2013.

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. Nonstochastic multi-armed bandits with graph-structured feedback. CoRR, 2014.

Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. JMLR, 2015.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003.

Bartlett, P. and Wegkamp, M. Classification with a reject option using a hinge loss. JMLR, pp. 291–307, 2008.

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. Leveraging side observations in stochastic bandits. In UAI, 2012.

Cesa-Bianchi, N., Gaillard, P., Gentile, C., and Gerchinovitz, S. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In COLT, 2017.

Chow, C. An optimum character recognition system using decision functions. IEEE T. C., 1957.

Chow, C. On optimum recognition error and reject trade-off. IEEE T. C., 1970.

Clarkson, K. L. Nearest-neighbor searching and metric space dimensions. In Nearest-Neighbor Methods for Learning and Vision: Theory and Practice. MIT Press, 2006.

Cohen, A., Hazan, T., and Koren, T. Online learning with feedback graphs without the graphs. In ICML, 2016.

Cortes, C., De Salvo, G., and Mohri, M. Learning with rejection. In ALT, pp. 67–82. Springer, Heidelberg, Germany, 2016a.
Cortes, C., De Salvo, G., and Mohri, M. Boosting with abstention. In NIPS. MIT Press, 2016b.

El-Yaniv, R. and Wiener, Y. On the foundations of noise-free selective classification. JMLR, 2010.

El-Yaniv, R. and Wiener, Y. Agnostic selective classification. In NIPS, 2011.

Hazan, E. Introduction to Online Convex Optimization. Foundations and Trends in Optimization. Now Publishers Inc., 2016.

Hazan, E. and Megiddo, N. Online learning with prior knowledge. In COLT, pp. 499–513, 2007.

Kocák, T., Neu, G., Valko, M., and Munos, R. Efficient learning by implicit exploration in bandit problems with side observations. In NIPS, pp. 613–621, 2014.

Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research), 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.

Li, L., Littman, M., and Walsh, T. Knows What It Knows: A framework for self-aware learning. In ICML, 2008.

Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

Mannor, S. and Shamir, O. From bandits to experts: On the value of side-observations. In NIPS, 2011.

Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In NIPS, pp. 3168–3176, 2015.

Sayedi, A., Zadimoghaddam, M., and Blum, A. Trading off mistakes and don't-know predictions. In NIPS, 2010.

Zhang, C. and Chaudhuri, K. The extended Littlestone's dimension for learning with mistakes and abstentions. In COLT, 2016.