Online Agnostic Boosting via Regret Minimization

Nataly Brukhim
Google AI Princeton
Princeton University, Department of Computer Science
nbrukhim@princeton.edu

Xinyi Chen
Google AI Princeton
Princeton University, Department of Computer Science
xinyic@princeton.edu

Elad Hazan
Google AI Princeton
Princeton University, Department of Computer Science
ehazan@princeton.edu

Shay Moran
Department of Mathematics
Technion - Israel Institute of Technology
smoran@technion.ac.il

The research was done while the author was co-affiliated with Google AI Princeton.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Boosting is a widely used machine learning approach based on the idea of aggregating weak learning rules. While in statistical learning numerous boosting methods exist both in the realizable and agnostic settings, in online learning they exist only in the realizable case. In this work we provide the first agnostic online boosting algorithm; that is, given a weak learner with only marginally-better-than-trivial regret guarantees, our algorithm boosts it to a strong learner with sublinear regret. Our algorithm is based on an abstract (and simple) reduction to online convex optimization, which efficiently converts an arbitrary online convex optimizer to a boosting algorithm. Moreover, this reduction extends to the statistical as well as the online realizable settings, thus unifying the four cases of statistical/online and agnostic/realizable boosting.

1 Introduction

Boosting is a fundamental methodology in machine learning that amplifies the accuracy of weak learning rules into a strong one. Boosting was first studied in the context of (realizable) PAC learning in a line of seminal works, which include the celebrated AdaBoost algorithm as well as many other algorithms with various applications (see e.g. [29, 33, 17, 19]). It was later adapted to the agnostic PAC setting and was extensively studied in this context as well [7, 31, 21, 27, 30, 26, 28, 16, 13, 18]. More recently, [14] and [9] studied boosting in the context of online prediction and derived boosting algorithms in the realizable setting (a.k.a. the mistake-bound model).

In this work we study agnostic boosting in the online setting: let H be a class of experts and assume we have oracle access to a weak online learner for H with a non-trivial (yet far from desired) regret guarantee. The goal is to convert it into a strong online learner for H that exhibits vanishing regret.

Why online agnostic boosting? The setting of online realizable boosting poses a restriction on the possible input sequences: there must be an expert that attains a near-zero mistake bound on the input sequence. This is a non-standard assumption in online learning. In contrast, in the online agnostic setting we consider, there is no restriction on the input sequence, and it can be chosen adversarially.

Applications of online agnostic boosting. Apart from being a fundamental question in a well-studied learning setting, let us mention a few concrete incentives to study online agnostic boosting:

Differential privacy and online learning: A recent line of work revealed deep connections between online learning and differentially private learning [5, 1, 6, 10, 32, 25, 22, 11]. In fact, these two notions are equivalent in the sense that a class H can be PAC learned by a differentially private algorithm if and only if it can be learned in the online setting with vanishing regret [6, 11].
However, the above equivalence is only known to hold from an information-theoretic perspective, and deriving efficient reductions between online and private learning is an open problem [32]. The only case where an efficient reduction is known to exist is in converting a pure private learner to an online learner in the realizable setting [22]. This reduction relies heavily on the realizable online boosting algorithm by [9]. Moreover, the derivation of an agnostic online boosting algorithm is posed by [22] as an open problem towards extending their reduction to the agnostic setting.

Time series prediction and online control: Recent machine learning literature considered the problem of controlling a dynamical system through the lens of online learning and regret minimization; see e.g. [3, 4, 24] and the work referenced therein. The online learning approach also gave rise to the first boosting methods in this context [2], and demonstrates the potential impact of boosting in the online setting. Thus, the current work aims at continuing the development of the boosting methodology in online learning, starting from the basic setting of learning from expert advice.

1.1 Main results

The weak learning assumption. In this paper we follow the same formulation as [28] used for boosting in the agnostic statistical setting. Towards this end, it is convenient to measure the performance of online learners using gain rather than loss: let $(x_1, y_1), \ldots, (x_T, y_T) \in X \times \{\pm 1\}$ be an (adversarial and adaptive) input sequence of examples presented to an online learning algorithm A; that is, in each iteration $t = 1, \ldots, T$, the adversary picks an example $(x_t, y_t)$, then the learner A first gets to observe $x_t$ and predicts (possibly in a randomized fashion) $\hat y_t \in \{\pm 1\}$, and lastly it observes $y_t$ and gains a reward of $y_t \hat y_t$. The goal of the learner is to maximize the total gain (or correlation), given by $\sum_t y_t \hat y_t$. Note that this is equivalent to the often-used notion of loss, where in each iteration the learner suffers a loss of $\mathbb{1}[y_t \neq \hat y_t]$ and its goal is to minimize the accumulated loss $\sum_t \mathbb{1}[y_t \neq \hat y_t]$. (Indeed, $y_t \hat y_t = 1 - 2 \cdot \mathbb{1}[y_t \neq \hat y_t]$ since $y_t, \hat y_t \in \{\pm 1\}$; therefore, the accumulated loss and correlation are affinely related by $\sum_t y_t \hat y_t = T - 2 \sum_t \mathbb{1}[y_t \neq \hat y_t]$.)

Definition 1 (Agnostic Weak Online Learning). Let $H \subseteq \{\pm 1\}^X$ be a class of experts, let T denote the horizon length, and let $\gamma > 0$ denote the advantage. An online learning algorithm W is a $(\gamma, T)$-agnostic weak online learner (AWOL) for H if for any sequence $(x_1, y_1), \ldots, (x_T, y_T) \in X \times \{\pm 1\}$, at every iteration $t \in [T]$, the algorithm outputs $W(x_t) \in \{\pm 1\}$ such that

$$\mathbb{E}\left[\sum_{t=1}^{T} W(x_t)\, y_t\right] \;\geq\; \gamma \cdot \max_{h \in H} \mathbb{E}\left[\sum_{t=1}^{T} h(x_t)\, y_t\right] - R_W(T),$$

where the expectation is taken w.r.t. the randomness of the learner W and that of the possibly adaptive adversary, and $R_W : \mathbb{N} \to \mathbb{R}_+$ is the additive regret: a non-decreasing, sublinear function of T.

Note the slight abuse of notation in the last definition: an online learner W is not an $X \to \{\pm 1\}$ function; rather, it is an algorithm with an internal state that is updated as it is fed new examples. Thus, the prediction $W(x_t)$ depends on the internal state of W, and for notational convenience we avoid referencing the internal state.

Our agnostic online boosting algorithm has oracle access to N weak learners and predicts each label by combining their predictions. The number of weak learners N is a meta-parameter which can be tuned by the user according to the following trade-off: on the one hand, the regret bound improves as N increases; on the other hand, a larger number of weak learners is more costly in terms of computational resources.
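To make the interaction protocol above concrete, the following is a minimal Python sketch of the interface a weak online learner is assumed to expose, together with the gain accounting; the names (OnlineWeakLearner, predict, update, total_gain) are our own illustrative choices and not part of the paper.

```python
from abc import ABC, abstractmethod

class OnlineWeakLearner(ABC):
    """Hypothetical interface for a (gamma, T)-AWOL (Definition 1).

    The learner is stateful: its predictions depend on all examples
    previously passed to `update`.
    """

    @abstractmethod
    def predict(self, x) -> int:
        """Return a (possibly randomized) label in {-1, +1} for x."""

    @abstractmethod
    def update(self, x, y: int) -> None:
        """Observe the labeled example (x, y), with y in {-1, +1}."""

def total_gain(learner: OnlineWeakLearner, stream) -> int:
    """Run the online protocol and accumulate the correlation sum_t y_t * yhat_t."""
    gain = 0
    for x, y in stream:
        yhat = learner.predict(x)  # the learner commits before seeing y_t
        gain += y * yhat           # per-round reward y_t * yhat_t
        learner.update(x, y)
    return gain
```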
Theorem 2 (Agnostic Online Boosting). Let H be a class of experts, let $T \in \mathbb{N}$ denote the horizon length, and let $W_1, \ldots, W_N$ be $(\gamma, T)$-AWOL for H with advantage $\gamma$ and regret $R_W(T) = o(T)$ (see Definition 1). Then, there exists an online learning algorithm which has oracle access to each $W_i$, and has expected regret of at most

$$\frac{R_W(T)}{\gamma} + O\left(\frac{T}{\gamma \sqrt{N}}\right).$$

To exemplify the interplay between $R_W(\cdot)$ and N, imagine a scenario where $R_W(T) \approx \sqrt{T}$ (as is often the case for regret bounds). Then, setting $N \approx T$ gives that the overall regret remains $O(\sqrt{T}/\gamma)$. By setting both T and N to be $O(\frac{1}{\gamma^2 \epsilon^2})$ for any $\epsilon > 0$, an average regret of $\epsilon$ is obtained.

An abstract framework for boosting. Boosting and regret minimization algorithms are intimately related. This tight connection is exhibited in statistical boosting (see [20, 19, 34]). For example, AdaBoost can be interpreted as applying the Hedge algorithm to iteratively update probability weights associated with the training examples [18]. Our algorithm is inspired by this fruitful connection and utilizes it. We derive a general framework which reduces boosting to online convex optimization. Moreover, this reduction applies to 4 learning settings: realizable-statistical, realizable-online, agnostic-statistical, and agnostic-online.

We note that in the agnostic boosting settings, both the assumption and the goal are stronger compared to the realizable case: an agnostic weak learner is assumed (which is stronger than a realizable weak learner), but the aim is to learn arbitrary distributions (which is a more challenging task than only learning realizable distributions). Therefore, a boosting algorithm for the realizable setting does not trivially follow from an agnostic boosting algorithm. A similar argument holds for the statistical vs. online boosting settings. The general framework we derive in this work does apply to all 4 of the aforementioned learning settings, with only a slight modification to the algorithm's update rule. On a high level, these modifications are based on the following observations:

1. Agnostic/Realizable: in the realizable setting, the weights of instances (i.e., the x's) are updated (e.g., in AdaBoost [34]). In the agnostic setting, the weights of labels (i.e., the y's) are typically updated instead [16, 28]. Therefore, instances and labels are exchanged.

2. Statistical/Online: in the statistical setting, in each iteration a weak hypothesis is produced and the weights of all examples are updated. In the online setting, in each iteration a new example is processed and the weights corresponding to all weak learners are updated [9]. Therefore, examples and weak learners are exchanged.

Thus, there is an interesting duality which converts between the 4 settings, replacing reweighting (realizable) with re-labeling (agnostic), and updating all learners per example (online) with updating all examples per learner (statistical). Our main contribution is showing that the fashion in which reweighting/re-labeling is performed, which was previously given by ad hoc update rules, can be abstracted to an application of an arbitrary online convex optimization algorithm. Our general framework result may not come as a surprise given the well-known connections between boosting and online convex optimization. However, we stress that the general reduction established here does introduce technical challenges.
For instance, the final output of the algorithm is not obtained via a standard weighted majority vote, but rather via a different, projected aggregation of the weak learners' predictions. Thus, although our unified framework is simple and intuitive, the analysis is not straightforward.

Paper structure. In the next subsection we discuss related work. The main result of our agnostic online boosting algorithm, and the proof of Theorem 2, are given in Section 2. In Section 3 we demonstrate a general reduction, in the statistical setting, from both the agnostic (Subsection 3.1) and realizable (Subsection 3.2) boosting settings to online convex optimization. Then, in Section 4, we give a similar result for the online realizable boosting setting.

1.2 Related work

Several previous works derived agnostic boosting algorithms in the statistical setting [26, 27, 28, 16]. However, previous works on online boosting have focused only on the realizable (mistake-bound) setting [14, 9]. The work by [14] provides a boosting algorithm inspired by classical algorithms in the realizable-statistical setting. Their work was later extended to an optimal and adaptive online boosting algorithm [9]. Although the work by [9] does not explicitly assume realizability, we remark that Equation (1) in [9] amounts to realizability: indeed, it assumes that the weak learner makes at most $(0.5 - \gamma)T + o(T)$ mistakes on every sequence of input examples $(x_1, y_1), \ldots, (x_T, y_T)$. This clearly cannot apply in an agnostic setting, since for a fixed $\gamma > 0$, if the input sequence is drawn randomly then with high probability the number of mistakes will concentrate around $0.5T$ and Equation (1) will fail. Thus, in order for Equation (1) to hold, one must restrict the set of allowable sequences, which essentially amounts to realizability with respect to some class. In contrast, our work focuses on the agnostic (regret-bound) setting, where the input sequence is not restricted and can be adversarial.

Another contribution of this work is the above-mentioned general framework which reduces boosting to online convex optimization. We note however that this generality comes with a cost: in particular, in the 3 settings that were previously studied, the boosting algorithms derived from the general framework exhibit, in some aspects, inferior guarantees compared to the state of the art (e.g., AdaBoost [34] in the realizable-statistical case, [9] in the realizable-online case, and [28] in the agnostic-statistical case). We compare them in more detail in the following sections. It is therefore plausible that our online agnostic boosting algorithm can also be improved, e.g., by improving the oracle/regret bounds or adding adaptive capabilities. We leave these goals for future work.

Several other works [8, 2] studied online boosting under real-valued loss functions. The main difference from our work is in the weak learning assumption: they consider weak learners that are in fact strong online learners for a base class of regression functions. The boosting procedure produces an online learner for a bigger class which consists of the linear span of the base class. This is different from the setting considered here, where the class is fixed but the regret bound is being boosted.

A main motivation of this work is the connection between boosting and regret minimization. It builds on and is inspired by previous works demonstrating this relationship.
We refer the reader to the book by [34] (Chapter 6) for an excellent presentation of this relationship in the context of AdaBoost.

2 Online agnostic boosting

In this section, we formally present our main result, which enables converting an online convex optimizer into an online agnostic booster. We begin by providing an overview of online convex optimization, and then proceed to describe our main algorithm.

Online Convex Optimization (see e.g. [23]). In the Online Convex Optimization (OCO) framework, an online player iteratively makes decisions from a compact convex set $K \subseteq \mathbb{R}^d$. At iteration $i = 1, \ldots, N$, the online player chooses an action $p_i$ in the decision set K, and at the same time the adversary reveals a cost function $\ell_i : K \to \mathbb{R}$, chosen from a family F of bounded convex functions over K. We denote an algorithm A in this setting as a (K, N)-Online Convex Optimizer (OCO) if the regret $R_A(N)$ is a non-decreasing, sublinear function of N, where the regret is defined as the excess loss over the best single action in hindsight over the decision set K:

$$R_A(N) = \sum_{i=1}^{N} \ell_i(p_i) - \min_{p \in K} \sum_{i=1}^{N} \ell_i(p). \quad (1)$$

The OCO framework generalizes the statistical learning framework, as it permits the adversary to adapt to the actions of the player. Under this setting, the best action in hindsight is a competitive benchmark, and a sublinear regret guarantees vanishing average excess loss over time.

Main Algorithm. Our main algorithm is formally described in Algorithm 1. The booster has black-box oracle access to two types of auxiliary algorithms: N instances of an online weak learning algorithm, denoted $W_1, \ldots, W_N$, and an online convex optimizer, denoted by A. The algorithm proceeds in rounds, where in each round t it observes an example $(x_t, y_t)$ and sequentially updates each weak learner $W_i$ by feeding it with $(x_t, y_t^i)$, where $y_t^i$ is a re-labeling. The role of the OCO is to determine the $y_t^i$'s (via the parameters $p_t^i$; see Line 4). Intuitively, it guides each weak learner $W_i$ to correct for the mistakes of the preceding learners. We stress that the OCO is restarted in each round.

Randomized Majority-Vote/Projection. At the beginning of each iteration, the boosting algorithm aggregates the weak learners' predictions using a randomized projection $\Pi$. For any $z \in \mathbb{R}$, denote by $\Pi(z)$ the following random label:

$$\Pi(z) = \begin{cases} \mathrm{sign}(z) & \text{if } |z| \geq 1, \\ +1 \text{ w.p. } \frac{1+z}{2}, \;\; -1 \text{ w.p. } \frac{1-z}{2} & \text{if } |z| < 1. \end{cases} \quad (2)$$

Algorithm 1 Online Agnostic Boosting with OCO
1: for t = 1, . . . , T do
2:   Get $x_t$, predict $\hat y_t = \Pi\left(\frac{1}{\gamma N} \sum_{i=1}^{N} W_i(x_t)\right)$.
3:   for i = 1, . . . , N do
4:     If i > 1, set $p_t^i = A(\ell_t^1, \ldots, \ell_t^{i-1})$. \\ Note that A is restarted at each time step t.
5:     Else, set $p_t^1 = 0$.
6:     Set next loss: $\ell_t^i(p) = p\left(\frac{1}{\gamma} W_i(x_t)\, y_t - 1\right)$.
7:     Pass $(x_t, y_t^i)$ to $W_i$, where $y_t^i$ is a random label such that $\mathbb{P}[y_t^i = y_t] = \frac{1 + p_t^i}{2}$.
8:   end for
9: end for

Figure 1: The algorithm is given oracle access to N instances of a $(\gamma, T)$-AWOL algorithm, $W_1, \ldots, W_N$ (see Definition 1), and to a $([-1, 1], N)$-OCO algorithm A (see Equation 1). The prediction $\Pi\left(\frac{1}{\gamma N} \sum_{i=1}^{N} W_i(x_t)\right)$ in Line 2 is a projected majority vote (see Equation 2).
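For concreteness, the following is a minimal Python sketch of one round of Algorithm 1 under stated assumptions: the weak learners follow the hypothetical predict/update interface sketched in Section 1.1, and the OCO oracle is abstracted as an object with a next_action method returning the next point of $K = [-1, 1]$ given the past losses. All names are illustrative, not part of the paper.

```python
import random

def project_pi(z: float) -> int:
    """Randomized projection (Equation 2); E[project_pi(z)] = clip(z, -1, 1)."""
    if abs(z) >= 1:
        return 1 if z > 0 else -1
    return 1 if random.random() < (1 + z) / 2 else -1

def boost_round(weak_learners, oco_factory, x_t, y_t, gamma):
    """One round (fixed t) of Algorithm 1; returns the booster's prediction."""
    N = len(weak_learners)
    preds = [W.predict(x_t) for W in weak_learners]
    # Line 2: projected majority vote over the weak predictions.
    y_hat = project_pi(sum(preds) / (gamma * N))
    oco = oco_factory()   # the OCO is restarted at each time step t
    losses, p = [], 0.0   # Line 5: p_t^1 = 0
    for W, w_pred in zip(weak_learners, preds):
        if losses:        # Line 4: p_t^i = A(l_t^1, ..., l_t^{i-1})
            p = oco.next_action(losses)
        # Line 6: linear loss l_t^i(q) = q * (W_i(x_t) * y_t / gamma - 1).
        losses.append(lambda q, c=(w_pred * y_t / gamma - 1): q * c)
        # Line 7: random re-labeling with P[y_t^i = y_t] = (1 + p) / 2.
        y_i = y_t if random.random() < (1 + p) / 2 else -y_t
        W.update(x_t, y_i)
    return y_hat
```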
We now state and prove the regret bound for Algorithm 1.

Proposition 3 (Regret Bound). The accumulated gain of Algorithm 1 satisfies

$$\mathbb{E}\left[\sum_{t=1}^{T} \hat y_t\, y_t\right] \;\geq\; \max_{h^\star \in H} \sum_{t=1}^{T} h^\star(x_t)\, y_t - \frac{R_W(T)}{\gamma} - \frac{T \cdot R_A(N)}{N},$$

where the $(x_t, y_t)$'s are the observed examples, the $\hat y_t$'s are the predictions, the expectation is with respect to the randomness of the algorithm and the learners, and $R_W$, $R_A$ are the AWOL and OCO regret terms, respectively.

Proof. The proof follows by combining upper and lower bounds on the expected sum of losses incurred by the OCO algorithm. The bounds follow directly from the weak learning assumption (lower bound) and the OCO guarantee (upper bound), and involve some simple algebraic manipulations. It is convenient to abstract out some of these calculations into lemmas, which are described later in this section.

Before delving into the analysis, we first clarify several assumptions used below. For simplicity of presentation we assume an oblivious adversary; however, using a standard reduction, our results can be generalized to an adaptive one (see the discussion in [12], Pg. 69, as well as Exercise 4.1 there formulating the reduction). Let $(x_1, y_1), \ldots, (x_T, y_T)$ be any sequence of observed examples. Observe that there are several sources of randomness at play: the internal randomness of the weak learners $W_i$, the random re-labeling (Line 7, Algorithm 1), and the randomized prediction (Line 2, Algorithm 1). The analysis below is given in expectation w.r.t. all these random variables. Note the following fact used in the analysis: for all $i \in [N]$, $t \in [T]$, the random variables $W_i(x_t)$ and $y_t^i$ are conditionally independent given $p_t^i$ and $y_t$. Since $\mathbb{E}[y_t^i \mid p_t^i, y_t] = p_t^i\, y_t$, using the conditional independence it follows that $\mathbb{E}[W_i(x_t)\, y_t^i] = \mathbb{E}[W_i(x_t)\, p_t^i\, y_t]$ (see Lemma 13 in the Appendix).

We can now begin the analysis, starting with lower bounding the expected sum of losses using the weak learning guarantee:

$$\frac{1}{\gamma}\, \mathbb{E}\left[\sum_{i=1}^{N} \sum_{t=1}^{T} W_i(x_t)\, y_t\, p_t^i\right] = \frac{1}{\gamma} \sum_{i=1}^{N} \mathbb{E}\left[\sum_{t=1}^{T} W_i(x_t)\, y_t^i\right] \quad \text{(see Lemma 13)}$$

$$\geq \frac{1}{\gamma} \sum_{i=1}^{N} \left(\gamma\, \max_{h \in H}\, \mathbb{E}\left[\sum_{t=1}^{T} h(x_t)\, y_t^i\right] - R_W(T)\right) \quad \text{(weak learning, Definition 1)}$$

$$\geq \sum_{i=1}^{N} \sum_{t=1}^{T} h^\star(x_t)\, \mathbb{E}[y_t^i] - \frac{N}{\gamma} R_W(T) = \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{E}\left[h^\star(x_t)\, y_t\, p_t^i\right] - \frac{N}{\gamma} R_W(T),$$

where $h^\star$ is an optimal expert in hindsight for the observed sequence of examples. Thus, we obtain the following lower bound on the expected sum of losses $\sum_{t,i} \ell_t^i(p_t^i)$ (see Line 6 in Algorithm 1 for the definition of the $\ell_t^i$'s):

$$\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \ell_t^i(p_t^i)\right] \geq \sum_{t=1}^{T} \sum_{i=1}^{N} \mathbb{E}\left[p_t^i \left(h^\star(x_t)\, y_t - 1\right)\right] - \frac{N}{\gamma} R_W(T) \geq N \sum_{t=1}^{T} \left(h^\star(x_t)\, y_t - 1\right) - \frac{N}{\gamma} R_W(T), \quad \text{(see Lemma 4 below)}$$

For the upper bound, observe that the OCO regret guarantee implies that for any $t \in [T]$ and any $p_t^\star \in [-1, 1]$,

$$\mathbb{E}\left[\sum_{i=1}^{N} \ell_t^i(p_t^i)\right] \leq \mathbb{E}\left[\sum_{i=1}^{N} \ell_t^i(p_t^\star)\right] + R_A(N) = p_t^\star \left(\sum_{i=1}^{N} \mathbb{E}\left[\frac{1}{\gamma} W_i(x_t)\, y_t\right] - N\right) + R_A(N).$$

Thus, by setting $p_t^\star$ according to Lemma 5 (see below, with $\hat h(x) := \frac{1}{\gamma N} \sum_{i=1}^{N} \mathbb{E}[W_i(x)]$), and summing over $t \in [T]$, we get

$$\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \ell_t^i(p_t^i)\right] \leq N \sum_{t=1}^{T} \left(\mathbb{E}[\hat y_t]\, y_t - 1\right) + T \cdot R_A(N).$$

By combining the lower and upper bounds for $\mathbb{E}\left[\frac{1}{NT} \sum_{t,i} \ell_t^i(p_t^i)\right]$, we get

$$\sum_{t=1}^{T} \mathbb{E}[\hat y_t]\, y_t \;\geq\; \sum_{t=1}^{T} h^\star(x_t)\, y_t - \frac{R_W(T)}{\gamma} - \frac{T \cdot R_A(N)}{N},$$

which concludes the proof.

It remains to prove two short lemmas that are used in the proof of the proposition above, as well as in the more general settings in the following sections.

Lemma 4. For any $p \in [-1, 1]$, an example pair $(x, y)$, and $h : X \to \{-1, 1\}$, we have $p\,(h(x)\, y - 1) \geq h(x)\, y - 1$.

Proof. Let $z = h(x)\, y - 1$. Observe that $z \in \{-2, 0\}$. Thus, since $p \in [-1, 1]$, $pz \geq z$.

Lemma 5. Given an example pair $(x, y)$ and $\hat h : X \to \mathbb{R}$, there exists $p^\star \in \{0, 1\}$ such that $p^\star (\hat h(x)\, y - 1) \leq \hat y\, y - 1$, where $\hat y = \mathbb{E}[\Pi(\hat h(x))]$, with the expectation taken only w.r.t. the randomness of $\Pi$ (see Equation 2).

Proof. If $|\hat h(x)| \leq 1$, then $\hat y = \hat h(x)$, and by setting $p^\star = 1$ the claim holds with equality. Thus, assume $|\hat h(x)| > 1$, in which case $\hat y = \mathrm{sign}(\hat h(x))$, and consider the following cases:
- If $\hat h(x)\, y - 1 > 0$, then $\mathrm{sign}(\hat h(x))\, y = 1$ and $\hat y\, y - 1 = 0$. Hence, by setting $p^\star = 0$, the claim holds with equality.
- If $\hat h(x)\, y - 1 < 0$, then since $|\hat h(x)| > 1$ it must be that $\mathrm{sign}(\hat h(x))\, y = -1$, and $\hat y\, y - 1 = -2$. Since $|\hat h(x)| > 1$, we also have $\hat h(x)\, y - 1 \leq -2$. Hence, by setting $p^\star = 1$, the inequality holds.

2.1 Proof of Theorem 2

The proof of Theorem 2 is a direct corollary of Proposition 3, obtained by plugging in Online Gradient Descent (OGD) as the OCO algorithm A (e.g., see [23], Chapter 3.1): the OGD regret is $O(GD\sqrt{N})$, where N is the number of iterations, G is an upper bound on the norm of the gradients of the losses, and D is the diameter of the set $K = [-1, 1]$. In our setting, $G \leq \frac{2}{\gamma}$ and $D = 2$. Hence, $R_A(N) = O(\sqrt{N}/\gamma)$, and the overall bound on the regret follows.
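For concreteness, a minimal sketch of OGD over an interval in the role of A is given below; it is one valid instantiation rather than the only choice, and the next_action interface matches the hypothetical one used in the Algorithm 1 sketch.

```python
import math

class OGDInterval:
    """Online Gradient Descent over K = [lo, hi]; a sketch of one possible
    OCO instantiation, where any sublinear-regret OCO over K would do."""

    def __init__(self, G: float, lo: float = -1.0, hi: float = 1.0, p0: float = 0.0):
        self.G = G                  # upper bound on the loss gradients
        self.lo, self.hi = lo, hi
        self.D = hi - lo            # diameter of the decision set
        self.p = p0                 # current action
        self.i = 0                  # number of losses seen so far

    def next_action(self, losses) -> float:
        # The losses in Algorithm 1 are linear, l(p) = c * p, so the
        # gradient is the slope c, recovered here by a finite difference.
        c = losses[-1](1.0) - losses[-1](0.0)
        self.i += 1
        eta = self.D / (self.G * math.sqrt(self.i))  # standard OGD step size
        self.p = min(self.hi, max(self.lo, self.p - eta * c))  # project onto K
        return self.p
```

In the Algorithm 1 sketch above, one would pass oco_factory = lambda: OGDInterval(G=2 / gamma), matching $G \leq 2/\gamma$ and $D = 2$.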
3 Statistical realizable and agnostic boosting

In this section we give an overview of our results in the statistical setting. The formal definitions and results are stated below, in Sections 3.1 and 3.2. The full analysis is deferred to the Appendix. The algorithms and analysis in the statistical setting follow the same structure as in the online setting (see the discussion of the abstract framework for boosting in Section 1). To allow the reader to assess the simplicity of the abstraction, we include the pseudocode for our algorithms in the statistical setting below in Algorithm 2. Notice that, as in the online setting, the difference between the agnostic- and realizable-case boosting algorithms below boils down to a single line.

Note that a caveat of this unified framework is that the oracle and sample complexity bounds we obtain in the realizable setting are inferior compared to the state-of-the-art bounds (see e.g. [34]); in the agnostic setting our bounds match the state-of-the-art bounds [28, 16], however our algorithm lacks adaptivity (whereas the algorithms by [28, 16] are adaptive). Albeit not achieving a quantitative improvement, the applications given below demonstrate the generality of the proposed framework.

Finally, let us remark on a curious connection between our boosting algorithms in the statistical setting and repeated zero-sum game-play with access to an approximately-best-response oracle. Given an input strategy of the opponent, such an oracle returns a strategy whose payoff competes with the best response up to a multiplicative factor. Interestingly, our algorithm can be seen as an improper policy in this repeated-game setting. We refer the reader to the Appendix for the formal details.

Algorithm 2 Boosting with OCO
1: for t = 1, . . . , T do
2:   Pass $m_0$ examples to W, drawn from the following distribution:
3:     Realizable: draw $(x_i, y_i)$ w.p. $p_t(i)$.
4:     Agnostic: draw $x_i$ w.p. $\frac{1}{m}$, and re-label according to $y_i\, p_t(i)$ (i.e., a random label with expectation $y_i\, p_t(i)$).
5:   Let $h_t$ be the weak hypothesis returned by W.
6:   Set loss: $\ell_t(p) = \sum_{i=1}^{m} p(i)\left(\frac{1}{\gamma} h_t(x_i)\, y_i - 1\right)$.
7:   Update: $p_{t+1} = A(\ell_1, \ldots, \ell_t)$.
8: end for
9: return $\bar h(x) = \Pi\left(\frac{1}{\gamma T} \sum_{t=1}^{T} h_t(x)\right)$.

Figure 2: The booster has oracle access to either a $(\gamma, \epsilon_0, m_0)$-AWL (see Definition 6) or a $(\gamma, m_0)$-WL (see Definition 8), both denoted by W, and to a (K, T)-OCO algorithm A (see Equation 1), where $K = [0, 1]^m$ in the realizable case and $K = [-1, 1]^m$ in the agnostic case. The final predictor (Line 9) applies a projected majority vote (see Equation 2).
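The following is a minimal Python sketch of Algorithm 2 under stated assumptions: weak_learner consumes a batch of labeled examples and returns a hypothesis mapping x to $\{\pm 1\}$, oco follows the same hypothetical next_action interface as the earlier sketches (now over $K \subseteq \mathbb{R}^m$), and project_pi is reused from the Algorithm 1 sketch; all names are illustrative.

```python
import random

def statistical_boost(sample, weak_learner, oco, gamma, T, m0, agnostic=True):
    """A sketch of Algorithm 2 over a sample [(x_1, y_1), ..., (x_m, y_m)]."""
    m = len(sample)
    # Initial weights: a valid point of K ([-1,1]^m agnostic, [0,1]^m realizable).
    p = [0.0] * m if agnostic else [1.0 / m] * m
    hypotheses, losses = [], []
    for _ in range(T):
        batch = []
        for _ in range(m0):
            if agnostic:
                # Line 4: draw x_i uniformly; random label with mean y_i * p(i).
                i = random.randrange(m)
                x, y = sample[i]
                label = y if random.random() < (1 + p[i]) / 2 else -y
            else:
                # Line 3: draw (x_i, y_i) w.p. p(i) (weights normalized here).
                (x, label), = random.choices(sample, weights=p)
            batch.append((x, label))
        h = weak_learner(batch)  # Line 5: weak hypothesis for this round
        hypotheses.append(h)
        # Line 6: linear loss over the weight vector p.
        losses.append(lambda q, h=h: sum(
            q[i] * (h(x) / gamma * y - 1) for i, (x, y) in enumerate(sample)))
        p = oco.next_action(losses)  # Line 7
    # Line 9: projected majority vote over the T weak hypotheses.
    return lambda x: project_pi(sum(h(x) for h in hypotheses) / (gamma * T))
```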
3.1 Statistical agnostic boosting

Let $H \subseteq \{\pm 1\}^X$ be a hypothesis class, and let D be any distribution over $X \times \{\pm 1\}$. Define the correlation of any $h \in H$ with respect to D by $\mathrm{cor}_D(h) = \mathbb{E}_{(x,y) \sim D}[h(x)\, y]$.

Definition 6 (Empirical Agnostic Weak Learning Assumption). Let $\bar x = (x_1, \ldots, x_m) \in X^m$ denote an unlabeled sample. A learning algorithm W is a $(\gamma, \epsilon_0, m_0)$-agnostic weak learner (AWL) for H with respect to $\bar x$ if for any labels $\bar y = (y_1, \ldots, y_m)$,

$$\mathbb{E}_{S'}\left[\mathrm{cor}_{\mu_{\bar y}}(W(S'))\right] \;\geq\; \gamma\, \max_{h^\star \in H} \mathrm{cor}_{\mu_{\bar y}}(h^\star) - \epsilon_0,$$

where $\mu_{\bar y}$ is the distribution which uniformly assigns to each example $(x_i, y_i)$ probability $1/m$, and $S'$ is an independent sample of size $m_0$ drawn from $\mu_{\bar y}$.

In accordance with previous works, we focus on the setting where $\gamma$ is a small constant (say $\gamma = 0.1$) and $\epsilon_0 \approx \sqrt{d/m}$, where d is the VC dimension of H (see [28] for a detailed discussion). We stress, however, that our results apply for any setting of $\gamma, \epsilon_0 \in [0, 1]$. The above weak learning assumption can be seen as an empirical variant of the assumption in [28], where $\mu$ is replaced with the population distribution over X and the labels $y_i$ are replaced with an arbitrary classifier $c : X \to \{\pm 1\}$. Both of these assumptions are weaker than the standard agnostic weak learning assumption, for which the guarantee holds with respect to every distribution D over $X \times \{\pm 1\}$. It would be interesting to investigate the relationship between the assumption of [28] and our empirical variant; however, this is beyond the scope of this work.

We now state the empirical error bound for Algorithm 2.

Theorem 7 (Empirical Agnostic Boosting). The correlation of $\bar h$, the output of Algorithm 2, satisfies

$$\mathbb{E}\left[\mathrm{cor}_S(\bar h)\right] \;\geq\; \max_{h^\star \in H} \mathrm{cor}_S(h^\star) - \frac{\epsilon_0}{\gamma} - O\left(\frac{1}{\gamma \sqrt{T}}\right).$$

Observe that by setting $T = O(\frac{1}{\gamma^2 \epsilon^2})$ for any $\epsilon > 0$, an error of $\epsilon$ is obtained. Note that this in fact matches the bound on T in the state-of-the-art agnostic boosting method [28]. However, we remark that the method given in [28] is adaptive.

3.2 Statistical realizable boosting

Definition 8 (Empirical Weak Learning Assumption [34]). Let $H \subseteq \{\pm 1\}^X$ be a hypothesis class, and let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subseteq X \times \{\pm 1\}$ be a sample. A learning algorithm W is a $(\gamma, m_0)$-weak learner (WL) for H with respect to S if for any distribution $p = (p_1, \ldots, p_m)$ which assigns each example $(x_i, y_i)$ probability $p_i$,

$$\mathbb{E}_{S'}\left[\mathrm{cor}_p(W(S'))\right] \;\geq\; \gamma,$$

where $S'$ is an independent sample of size $m_0$ drawn from p.

Theorem 9 (Empirical Realizable Boosting). The correlation of $\bar h$, the output of Algorithm 2, satisfies

$$\mathbb{E}\left[\mathrm{cor}_S(\bar h)\right] \;\geq\; 1 - O\left(\frac{1}{\gamma \sqrt{T}}\right).$$

Observe that by setting $T = O(\frac{1}{\gamma^2 \epsilon^2})$ for any $\epsilon > 0$, at most $\epsilon$ error is obtained. We remark that classical boosting results are able to achieve $\epsilon$ error by setting $T = O\left(\frac{1}{\gamma^2} \log\left(\frac{1}{\epsilon}\right)\right)$.

3.3 Generalization

Theorems 7 and 9 imply that the correlation of the output hypothesis is competitive with the best hypothesis in H with respect to the empirical distribution. A similar guarantee with respect to the population distribution can be obtained using a standard sample compression argument (see, e.g., [34, 15]): indeed, the final hypothesis $\bar h$ is obtained by aggregating the T weak hypotheses $h_t$, each of which is determined by the $m_0$ examples fed to the weak learner. Thus, $\bar h$ can be encoded by $T \cdot m_0$ input examples, and hence the entire algorithm forms a sample compression scheme of this size. Consequently, setting the input sample size $m = O(T m_0 / \varepsilon^2)$ yields the same guarantees of Theorems 7 and 9 up to an additive error of $\varepsilon$.
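As a sanity check on the parameter accounting (our own arithmetic, under the bounds stated above), plugging $T = O(1/(\gamma^2 \epsilon^2))$ from Theorem 7 into the compression bound $m = O(T m_0 / \varepsilon^2)$ gives:

```latex
\mathbb{E}\big[\mathrm{cor}_S(\bar h)\big]
  \;\ge\; \max_{h^\star \in H} \mathrm{cor}_S(h^\star)
          \;-\; \frac{\epsilon_0}{\gamma} \;-\; O(\epsilon),
\qquad
m \;=\; O\!\left(\frac{T\, m_0}{\epsilon^2}\right)
  \;=\; O\!\left(\frac{m_0}{\gamma^2 \epsilon^4}\right).
```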
4 Online realizable boosting

In this section, we give an online realizable boosting algorithm and state its regret bound. The result follows along similar lines as our main result given in Section 2. We first state the weak learning assumption for the online realizable setting.

Definition 10 (Online Weak Learning). Let $H \subseteq \{\pm 1\}^X$ be a class of experts, let T denote the horizon length, and let $\gamma > 0$ denote the advantage. An online learning algorithm W is a $(\gamma, T)$-weak online learner (WOL) for H if for any sequence $(x_1, y_1), \ldots, (x_T, y_T) \in X \times \{\pm 1\}$ that is realizable by H, at every iteration $t \in [T]$, the algorithm outputs $W(x_t) \in \{\pm 1\}$ such that

$$\mathbb{E}\left[\sum_{t=1}^{T} W(x_t)\, y_t\right] \;\geq\; \gamma\, T - R_W(T),$$

where the expectation is taken over the randomness of the weak learner W, and $R_W : \mathbb{N} \to \mathbb{R}_+$ is the additive regret: a non-decreasing, sublinear function of T. Recall that the restriction of the sequence $\{(x_t, y_t)\}_{t=1}^{T}$ to being realizable by H corresponds to $\max_{h \in H} \frac{1}{T} \sum_{t=1}^{T} h(x_t)\, y_t = 1$.

Similar to the online agnostic case, the booster maintains N instances $W_1, \ldots, W_N$ of an online weak learning algorithm, and a single instance of the OCO algorithm A. However, in the previous section the OCO's decision set is $[-1, 1]$, corresponding to its role of re-labeling examples in $\{\pm 1\}$. In this setting, the decision set of the OCO is $[0, 1]$, corresponding to its role of choosing probability weights for each example. Indeed, observe that Algorithm 1 and Algorithm 3 differ on Line 7 (and in the initialization on Line 5).

Algorithm 3 Online Boosting with OCO
1: for t = 1, . . . , T do
2:   Get $x_t$, predict $\hat y_t = \Pi\left(\frac{1}{\gamma N} \sum_{i=1}^{N} W_i(x_t)\right)$.
3:   for i = 1, . . . , N do
4:     If i > 1, set $p_t^i = A(\ell_t^1, \ldots, \ell_t^{i-1})$. \\ Note that A is restarted at each time step t.
5:     Else, set $p_t^1 = 1/2$.
6:     Set next loss: $\ell_t^i(p) = p\left(\frac{1}{\gamma} W_i(x_t)\, y_t - 1\right)$.
7:     Pass $(x_t, y_t)$ to $W_i$ w.p. $p_t^i$.
8:   end for
9: end for

Figure 3: The algorithm is given oracle access to N instances of a $(\gamma, T)$-WOL algorithm, $W_1, \ldots, W_N$ (see Definition 10), and to a $([0, 1], N)$-OCO algorithm A (see Equation 1). The prediction $\Pi\left(\frac{1}{\gamma N} \sum_{i=1}^{N} W_i(x_t)\right)$ in Line 2 is a projected majority vote (see Equation 2).

The following theorem formally states the guarantees of the online realizable boosting algorithm:

Theorem 11. The accumulated gain of Algorithm 3 satisfies

$$\mathbb{E}\left[\sum_{t=1}^{T} \hat y_t\, y_t\right] \;\geq\; T - \frac{\tilde R_W(T)}{\gamma} - \frac{T \cdot R_A(N)}{N},$$

where the $(x_t, y_t)$'s are the observed examples, the $\hat y_t$'s are the predictions, the expectation is with respect to the randomness of the algorithm and the weak learners, $\tilde R_W(T) := 2 R_W(T) + O(\sqrt{T})$, and $R_W$ and $R_A$ are the regret terms of the weak learner and the OCO, respectively.

The proof follows similarly to that of Proposition 3, and is deferred to the Appendix. By plugging in Online Gradient Descent (OGD) as the OCO algorithm A, we get that $R_A(N) = O(\frac{\sqrt{N}}{\gamma})$. Thus, by Theorem 11, we obtain that the overall average error is $O(\frac{1}{\gamma \sqrt{N}})$. Note that by setting $N = O(\frac{1}{\gamma^2 \epsilon^2})$ and $T = O(\frac{1}{\gamma^2 \epsilon^2})$, the mistake bound is at most $\epsilon T$. This is worse by a factor of $1/\epsilon^2$ than the bound given by [9], which yields $T = O(\frac{1}{\gamma^2 \epsilon})$ and $N = O(\frac{1}{\epsilon})$, and is optimal.
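Relative to the Algorithm 1 sketch of Section 2, the realizable variant changes only the initialization and the update step: instead of re-labeling, each weak learner receives the true example with probability $p_t^i$. A hedged sketch of one round, with the same illustrative interfaces as before (project_pi included):

```python
import random

def boost_round_realizable(weak_learners, oco_factory, x_t, y_t, gamma):
    """One round of Algorithm 3: p now weights examples instead of labels."""
    N = len(weak_learners)
    preds = [W.predict(x_t) for W in weak_learners]
    y_hat = project_pi(sum(preds) / (gamma * N))  # Line 2, as in Algorithm 1
    oco = oco_factory()      # OCO over K = [0, 1], restarted each round
    losses, p = [], 0.5      # Line 5: p_t^1 = 1/2
    for W, w_pred in zip(weak_learners, preds):
        if losses:
            p = oco.next_action(losses)  # Line 4
        losses.append(lambda q, c=(w_pred * y_t / gamma - 1): q * c)  # Line 6
        if random.random() < p:          # Line 7: pass (x_t, y_t) w.p. p_t^i
            W.update(x_t, y_t)
    return y_hat
```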
5 Discussion

We have presented the first boosting algorithm for agnostic online learning. In contrast to the realizable setting, we do not place any restrictions on the online sequence of examples. It remains open to prove lower bounds on online agnostic boosting as a function of the natural parameters of the problem and/or to improve our upper bounds.

Broader Impact

There are no foreseen ethical or societal consequences for the research presented herein.

Acknowledgments and Disclosure of Funding

NB, XC, and EH acknowledge the support of NSF grant 1704860 and Google. In addition, XC acknowledges the support of the NSF Graduate Research Fellowships Program. This work was done when SM was a visiting scholar at Google.

References

[1] Jacob D. Abernethy, Chansoo Lee, Audra McMillan, and Ambuj Tewari. Online learning via differential privacy. CoRR, abs/1711.10019, 2017.

[2] Naman Agarwal, Nataly Brukhim, Elad Hazan, and Zhou Lu. Boosting for dynamical systems. arXiv preprint arXiv:1906.08720, 2019.

[3] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In International Conference on Machine Learning, pages 111-119, 2019.

[4] Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. In Advances in Neural Information Processing Systems 32, pages 10175-10184, 2019.

[5] Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 32-40. PMLR, 2017.

[6] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM Symposium on the Theory of Computing, STOC '19, New York, NY, USA, 2019. ACM.

[7] Shai Ben-David, Philip M. Long, and Yishay Mansour. Agnostic boosting. In International Conference on Computational Learning Theory, pages 507-516. Springer, 2001.

[8] Alina Beygelzimer, Elad Hazan, Satyen Kale, and Haipeng Luo. Online gradient boosting. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2458-2466. Curran Associates, Inc., 2015.

[9] Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pages 2323-2331, 2015.

[10] Olivier Bousquet, Roi Livni, and Shay Moran. Passing tests without memorizing: Two models for fooling discriminators, 2019.

[11] Mark Bun, Roi Livni, and Shay Moran. An equivalence between private classification and online prediction. In FOCS, 2020.

[12] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[13] Shang-Tse Chen, Maria-Florina Balcan, and Duen Horng Chau. Communication efficient distributed agnostic boosting. In Artificial Intelligence and Statistics, pages 1299-1307, 2016.

[14] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. An online boosting algorithm with theoretical justifications, 2012.

[15] Ofir David, Shay Moran, and Amir Yehudayoff. Supervised learning through the lens of compression. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2784-2792, 2016.

[16] Vitaly Feldman. Distribution-specific agnostic boosting. arXiv preprint arXiv:0909.2927, 2009.

[17] Yoav Freund. Boosting a weak learning algorithm by majority. In Mark A. Fulk and John Case, editors, Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT 1990, University of Rochester, Rochester, NY, USA, August 6-8, 1990, pages 202-216. Morgan Kaufmann, 1990.

[18] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325-332, 1996.

[19] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119-139, 1997.

[20] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79-103, 1999.

[21] Dmitry Gavinsky. Optimally-smooth adaptive boosting and application to agnostic learning. Journal of Machine Learning Research, 4(May):101-117, 2003.
[22] Alon Gonen, Elad Hazan, and Shay Moran. Private learning implies online learning: An efficient reduction. CoRR, abs/1905.11311, 2019.

[23] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

[24] Elad Hazan, Sham M. Kakade, and Karan Singh. The nonstochastic control problem. arXiv preprint arXiv:1911.12178, 2019.

[25] Matthew Joseph, Jieming Mao, Seth Neel, and Aaron Roth. The role of interactivity in local differential privacy. In FOCS, 2019.

[26] Adam Tauman Kalai, Yishay Mansour, and Elad Verbin. On agnostic boosting and parity learning. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 629-638, 2008.

[27] Adam Tauman Kalai and Rocco A. Servedio. Boosting in the presence of noise. J. Comput. Syst. Sci., 71(3):266-290, 2005.

[28] Varun Kanade and Adam Kalai. Potential-based agnostic boosting. In Advances in Neural Information Processing Systems, 2009.

[29] M. Kearns. Thoughts on hypothesis boosting. Unpublished, December 1988.

[30] Philip M. Long and Rocco A. Servedio. Adaptive martingale boosting. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 977-984. Curran Associates, Inc., 2008.

[31] Yishay Mansour and David A. McAllester. Boosting using branching programs. J. Comput. Syst. Sci., 64(1):103-112, 2002.

[32] Seth Neel, Aaron Roth, and Zhiwei Steven Wu. How to use heuristics for differential privacy. In FOCS, 2019.

[33] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.

[34] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. Cambridge University Press, 2012.

[35] Maurice Sion. On general minimax theorems. Pacific J. Math., 8(1):171-176, 1958.