# online_learning_via_sequential_complexities__6f8d6d9f.pdf Journal of Machine Learning Research 16 (2015) 155-186 Submitted 7/13; Revised 7/14; Published 2/15 Online Learning via Sequential Complexities Alexander Rakhlin rakhlin@wharton.upenn.edu Department of Statistics University of Pennsylvania Philadelphia, PA 19104 Karthik Sridharan skarthik@wharton.upenn.edu Department of Computer Science Cornell University Ithaca, NY 14853 Ambuj Tewari tewaria@umich.edu Department of Statistics University of Michigan Ann Arbor, MI 48109 Editor: Mehryar Mohri We consider the problem of sequential prediction and provide tools to study the minimax value of the associated game. Classical statistical learning theory provides several useful complexity measures to study learning with i.i.d. data. Our proposed sequential complexities can be seen as extensions of these measures to the sequential setting. The developed theory is shown to yield precise learning guarantees for the problem of sequential prediction. In particular, we show necessary and sufficient conditions for online learnability in the setting of supervised learning. Several examples show the utility of our framework: we can establish learnability without having to exhibit an explicit online learning algorithm. Keywords: online learning, sequential complexities, regret minimization 1. Introduction This paper is concerned with sequential prediction problems where no probabilistic assumptions are made regarding the data generating mechanism. Our viewpoint is expressed well by the following quotation from Cover and Shenhar (1977): We are interested in sequential prediction procedures that exploit any apparent order in the sequence. We do not assume the existence of any underlying distributions, but assume that the sequence is an outcome of a game against a malevolent intelligent nature. We will, in fact, take the game theoretic viewpoint seriously. All our investigations will proceed by analyzing the minimax value of a repeated game between a player or learner and a malevolent intelligent nature , or the adversary. Even though we have the setting of prediction problems in mind, it will be useful to develop the theory in a somewhat abstract setting. Towards this end, fix the sets F and 2015 Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari. Rakhlin, Sridharan and Tewari Z, as well as a loss function ℓ F Z R, and consider the following T-round repeated two-player game, which we term the online learning or sequential prediction model. On round t {1,...,T}, the learner chooses ft F, the adversary picks zt Z, and the learner suffers loss ℓ(ft,zt). At the end of T rounds we define regret R(f1 T ,z1 T ) T t=1 ℓ(ft,zt) inf f F T t=1 ℓ(f,zt) as the difference between the cumulative loss of the player and the cumulative loss of the best fixed decision. For the given pair (F,Z), the problem is said to be online learnable if there exists an algorithm for the learner such that regret grows sublinearly in the time horizon T, no matter what strategy the adversary employs. The origin of the online learning (or sequential prediction) model can be traced back to the work of Robbins (1950) on compound statistical decision problems. Some of the earliest sequential prediction algorithms were proposed by Blackwell (1956a,b) and Hannan (1957). Blackwell s method was based on his celebrated approachability theorem whereas Hannan s was based on minimizing a randomly perturbed sum of previous losses. Hannan s ideas were to later resurface in the influential Follow-the-Perturbed-Leader family (Kalai and Vempala, 2005) of online learning algorithms. The seminal ideas in the work of Robbins, Blackwell and Hannan led to further developments in many different fields. Cover (1967), Davisson (1973), Ziv and Lempel (1977), Rissanen (1984), Feder et al. (1992), and others laid the foundation of universal coding, compression and prediction in the Information Theory literature. Within Computer Science, Littlestone and Warmuth (1994), Cesa-Bianchi et al. (1997), Vovk (1998), and others studied the online learning model and the prediction with expert advice framework. The connections between regret minimization and convergence to equilibria was studied in Economics by Foster and Vohra (1997), Hart and Mas-Colell (2000) and others. We have no doubt left out many interesting works above. But even our partial list will convince the reader that research in online learning and sequential prediction has benefited from contributions by researchers from a variety of fields including Computer Science, Economics, Information Theory, and Statistics. For an excellent synthesis and presentation of results from these different fields we refer the reader to the book by Cesa-Bianchi and Lugosi (2006). Many of the ideas in the field are constructive, resulting in beautiful algorithms, or algorithmic techniques, associated with names such as Follow-the-Regularized-Leader, Follow-the-Perturbed-Leader, Weighted Majority, Hedge, and Online Gradient Descent. However, analyzing specific algorithms has obvious disadvantages. The algorithm may not be optimal for the task at hand. Even if it is optimal, one cannot prove that fact unless one develops tools for analyzing the inherent complexity of the online learning problem. Our goal is precisely to provide such tools. We will begin by defining the minimax value of the game underlying the abstract online learning model. Then we will develop tools for controlling the minimax value resulting in a theory that parallels statistical learning theory. In particular, we develop analogues of combinatorial dimensions, covering numbers, and Rademacher complexities. We will also provide results relating these complexities. Note that our approach is non-constructive: controlling the sequential complexities mentioned above will only guarantee the existence of a good online learning algorithm but Online Learning via Sequential Complexities will not explicitly create one. However, it turns out that that the minimax point of view can indeed lead to constructive algorithms as shown by Rakhlin et al. (2012). 2. Minimax Value and Online Learnability To proceed further in our analysis of the minimax value of the repeated game between the learner and the adversary, we need to make a few technical assumptions. We assume that F is a subset of a separable metric space. Let Q be the set of probability measures on F and assume that Q is weakly compact. In order to allow randomized prediction, we allow the learner to choose a distribution qt Q on every round. The minimax value of the game is then defined as VT (F,Z) inf q1 Q sup z1 Z E f1 q1 inf q T Q sup z T Z E f T q T [ T t=1 ℓ(ft,zt) inf f F T t=1 ℓ(f,zt)] . (1) Henceforth, the notation Ef q stands for the expectation operator integrating out the random variable f with distribution q. We consider here the adaptive adversary who gets to choose each zt based on the history of moves f1 t 1 and z1 t 1. The first key step in the study of the value of the game is to appeal to the minimax theorem and exchange the pairs of infima and suprema in (1). This dual formulation is easier to analyze because the choice of the player comes after the choice of the mixed strategy of the adversary. We remark that the minimax theorem holds under a very general assumption of weak compactness of Q and lower semi-continuity of the loss function.1 Under these conditions, we can appeal to Theorem 1 stated below, which is adapted for our needs from the work of Abernethy et al. (2009). Theorem 1 Let F and Z be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Denote by Q and P the sets of probability measures (mixed strategies) on F and Z, respectively. Then VT (F,Z) = sup p1 E z1 p1 sup p T E z T p T [ T t=1 inf ft F E zt pt [ℓ(ft,zt)] inf f F T t=1 ℓ(f,zt)], (2) where suprema over pt range over all distributions in P. The question of learnability in the online learning model is now reduced to the study of VT (F,Z), taking (2) as the starting point. Definition 2 A class F is said to be online learnable with respect to the given Z and ℓif Note that our notion of learnability is related to, but distinct from, Hannan consistency (Hannan, 1957; Cesa-Bianchi and Lugosi, 2006). The latter notion requires the iterated game to go on for an infinite number of rounds and is formulated in terms of almost sure 1. We refer to Appendix A for a precise statement of the minimax theorem, as well as sufficient conditions. Rakhlin, Sridharan and Tewari convergence. In contrast, we consider a distinct game for each T and look at expected regret. Nevertheless, it is possible to obtain Hannan consistency using the techniques developed in this paper by considering a slightly different game (Rakhlin et al., 2011). We also remark that the statements in this paper extend to the case when the learner is allowed to make decisions in a larger set G, while the best-in-hindsight term in the regret definition is computed with respect to F G. Such a setting interesting especially with regard to computational concerns is termed improper learning. For example, prediction with side information (or, the supervised learning problem) is one such case, where we choose Y R, Z = X Y, F YX = G and ℓ(f,(x,y)) = f(x) y . This setting will be studied later in the paper. Note that in the proper learning scenario, VT (F,Z) 0 (e.g., since all zt s can be chosen to be the same), and thus the limsup in Definition 2 can be simply replaced with the limit being equal to zero. This paper is aimed at understanding the value of the game VT (F,Z) for various function classes F. Since our focus is on the complexity of F, we shall often write VT (F) keeping the dependence on Z (and ℓ) implicit. As we show, the sequential complexity notions that were shown by Rakhlin et al. (2014) to characterize uniform martingale Laws of Large Numbers also give us a handle on the value VT (F). In the next section, we briefly define these sequential complexity notions and mention some of the key relations between them. A more detailed account of the relationships between sequential complexity measures along with complete proofs can be found in (Rakhlin et al., 2014). 3. Sequential Complexities Unlike the well-studied statistical learning scenario with i.i.d. data, the online learning problem possesses a certain sequential dependence. Such dependence cannot be captured by classical notions of complexity that are based on a batch of data given as a tuple of T examples. A basic unit that does capture temporal dependence is a binary tree. Surprisingly, for the sequential prediction problems considered in this paper, one need not look further than binary trees to capture the relevant complexity. A Z-valued tree z of depth T is a complete rooted binary tree with nodes labeled by elements of Z. Such a tree z is identified with the sequence (z1,...,z T ) of labeling functions zi { 1}i 1 Z which provide the labels for each node. Therefore, z1 Z is the label for the root of the tree, while zi for i > 1 is the label of the node obtained by following the path of length i 1 from the root, with +1 indicating right and 1 indicating left . A path of length T is given by the sequence ϵ = (ϵ1,...,ϵT ) { 1}T . For brevity, we shall often write zt(ϵ), where ϵ = (ϵ1,...,ϵT ), but it is understood that zt depends only on the prefix (ϵ1,...,ϵt 1). Now, let ϵ1,...,ϵT be independent Rademacher random variables. Given a Z-valued tree z of depth T, we define the sequential Rademacher complexity of a function class G RZ on a Z-valued tree z as RT (G,z) E[sup g G T t=1 ϵtg(zt(ϵ))], and we denote by RT (G) = supz RT (G,z) its supremum over all Z-valued trees of depth T. The importance of the introduced notion stems from the following result (Rakhlin et al., Online Learning via Sequential Complexities 2014, Theorem 2): for any distribution over a sequence (Z1,...,ZT ), we have T t=1 (E[g(Zt) Zt 1] g(Zt))] 2RT (G) , (3) where Zt 1 = (Z1,...,Zt 1). In other words, the martingale version of the uniform deviations of means from expectations is controlled by the worst-case sequential Rademacher complexity. A matching lower bound also holds for the supremum over distributions on sequences in ZT . It then follows that a uniform martingale Law of Large Numbers holds for G if and only if RT (G) 0. For i.i.d. random variables, a similar statement can be made in terms of the classical Rademacher complexity, and so one might hope that many other complexity notions from empirical process theory have martingale (or we may say, sequential) analogues. Luckily, this is indeed the case (Rakhlin et al., 2014). As we show in this paper, these generalizations of the classical notions also give a handle on (as well as necessary and sufficient conditions for) online learnability, thus painting a picture that completely parallels statistical learning theory. But before we present our main results, let us recall some key definitions and results in (Rakhlin et al., 2014). In providing further upper bounds on sequential Rademacher complexity, the following definitions of an effective size of a function class generalize the classical notions of a covering number. A set V of R-valued trees of depth T is a (sequential) α-cover (with respect to ℓp norm) of G RZ on a tree z of depth T if g G, ϵ { 1}T , v V s.t. ( 1 T t=1 vt(ϵ) g(zt(ϵ)) p) The (sequential) covering number of a function class G on a given tree z is defined as Np(α,G,z) min{ V V is an α-cover w.r.t. ℓp norm of G on z}. It is straightforward to check that Np(α,G,z) Nq(α,G,z) whenever 1 p q . Further define Np(α,G,T) = supz Np(α,G,z), the maximal ℓp covering number of G over depth T trees. For a class G of binary-valued functions, we also define a so-called 0-cover (or, cover at scale 0), denoted by N(0,G,z), as equal to any Np(0,G,z). The definition of a 0-cover can be seen as the correct analogue of the size of a projection of G onto a tuple of points in the i.i.d. case. The size of this projection in the i.i.d. case was the starting point of the work of Vapnik and Chervonenkis. When G [ 1,1]Z is a finite class of bounded functions, one can show (Rakhlin et al., 2014, Lemma 1) that a bound that should (correctly) remind the reader of the Exponential Weights regret bound. With the definition of an α-cover with respect to ℓ1 norm, one can easily extend (4) beyond the finite case. Immediately from the definition of ℓ1 covering number, it follows that for any G [ 1,1]Z, for any α > 0, RT (G,z) α + 2log N1(α,G,z) Rakhlin, Sridharan and Tewari (Rakhlin et al., 2014, Eq. 9). A tighter control is obtained by integrating the covering numbers at different scales. To this end, consider the following analogue of the Dudley entropy integral bound. For p 1, the integrated complexity of a function class G [ 1,1]Z on a Z-valued tree of depth T is defined as Dp T (G,z) inf α 0 {4α + 12 log Np(δ,G,z) dδ} (6) and Dp T (G) = supz Dp T (G,z), with D2 T (G,z) denoted simply by DT (G,z). We have previously shown (Rakhlin et al., 2014, Theorem 3) that, for any function class G [ 1,1]Z and any Z-valued tree z of depth T, RT (G,z) DT (G,z). (7) We next turn to the description of sequential combinatorial parameters. A Z-valued tree z of depth d is shattered by a function class G { 1}Z if for all ϵ { 1}d, there exists g G such that g(zt(ϵ)) = ϵt for all t [d]. The Littlestone dimension Ldim(G,Z) is the largest positive integer d such that G shatters a Z-valued tree of depth d (Littlestone, 1988; Ben-David et al., 2009). The scale-sensitive version of Littlestone dimension is defined as follows. A Z-valued tree z of depth d is α-shattered by a function class G RZ if there exists an R-valued tree s of depth d such that ϵ { 1}d, g G s.t. t [d], ϵt(g(zt(ϵ)) st(ϵ)) α/2. The tree s will be called a witness to shattering. The (sequential) fat-shattering dimension fatα(G,Z) at scale α is the largest d such that G α-shatters a Z-valued tree of depth d. The notions introduced above can be viewed as sequential generalizations of the VC dimension and the fat-shattering dimension where tuples of points get replaced by complete binary trees. In fact, one recovers the classical notions if the tree z in the above definitions is restricted to have the same values within a level (hence, no temporal dependence). Crucially, the sequential combinatorial analogues provide control for the growth of sequential covering numbers, justifying the definitions. First, let G {0,...,k}Z be a class of functions with fat2(G) = d. Then, it can be shown (Rakhlin et al., 2014, Theorem 4) that for any T 1, N (1/2,G,T) d i=0 (T i )ki (ek T)d . For the second result (Rakhlin et al., 2014, Corollary 1), suppose G is a class of [ 1,1]-valued functions on Z. Then, for any α > 0, and any T 1, N (α,G,T) (2e T α ) fatα(G) . (8) Finally, we recall a bound on the size of the 0-cover in terms of the fat1 combinatorial parameter (Rakhlin et al., 2014, Theorem 5). For a class G {0,...,k}Z with fat1(G) = d, we have N(0,G,T) d i=0 (T i )ki (ek T)d . (9) Online Learning via Sequential Complexities In particular, for k = 1 (that is, binary classification) we have fat1(G) = Ldim(G). The inequality (9) is therefore a sequential analogue of the celebrated Vapnik-Chervonenkis Sauer-Shelah lemma. 4. Structural Properties For the examples developed in this paper, it will be crucial to exploit a number of useful properties that RT (G) satisfies. These properties allow one to establish online learnability for complex function classes even if no explicit learning algorithms are available. We first state some properties that are easily proved but are nevertheless very useful. Lemma 3 Let F,G RZ and let conv(G) denote the convex hull of G. Let z be any Zvalued tree of depth T. Then the following properties hold. 1. If F G, then RT (F,z) RT (G,z). 2. RT (conv(G),z) = RT (G,z) 3. RT (c G,z) = c RT (G,z) for all c R. 4. For any h Z R, RT (G + h,z) = RT (G,z) where G + h = {g + h g G}. These properties match those of the classical Rademacher complexity (Bartlett and Mendelson, 2003) and can be proved in essentially the same way (we therefore skip the straightforward proofs). The next property is a key tool for many of the applications: it allows us to bound the sequential Rademacher complexity for the Cartesian product of function classes composed with a Lipschitz mapping in terms of complexities of the individual classes. Lemma 4 Let G = G1 ... Gk where each Gj [ 1,1]Z. Further, let φ Rk Z R be such that φ( ,z) is L-Lipschitz with respect to for all z Z, and let φ G = {z φ((g1(z),...,gk(z)),z) gj Gj}. Then we have RT (φ G) 8L (1 + 4 2log3/2(e T 2)) k j=1 RT (Gj) as long as RT (Gj) 1/T for each j. Let us explicitly state the more familiar contraction property, an immediate corollary of the above result. Corollary 5 Fix a class G [ 1,1]Z with RT (G) 1/T and a function φ R Z R. Assume φ( ,z) is L-Lipschitz for all z Z. Then RT (φ G) 8L (1 + 4 2log3/2(e T 2)) RT (G), where φ G = {z φ(g(z),z) g G}. We state another useful corollary of Lemma 4. Rakhlin, Sridharan and Tewari Corollary 6 For a fixed binary function b { 1}k { 1} and classes G1,...,Gk of { 1}- valued functions, RT (b(G1,...,Gk)) O (log3/2(T)) k j=1 RT (Gj). Note that, in the classical case, the Lipschitz contraction property holds without any extra poly-logarithmic factors in T (Ledoux and Talagrand, 1991). It is an open question whether the poly-logarithmic factors can be removed in the results above. It is worth pointing out ahead of time that Theorem 8 below in the setting of supervised learning with convex Lipschitz loss does allow us to avoid the extraneous factor that would otherwise appear from a combination of Theorem 7 and Corollary 5. 5. Main Results We now relate the value of the game to the worst case expected value of the supremum of an empirical process. However, unlike empirical processes that involve i.i.d. sums, our process involves a sum of martingale differences. In view of (3), the expected supremum can be further upper-bounded by the sequential Rademacher complexity. Theorem 7 The minimax value is bounded as 1 T VT (F) sup P E sup g ℓ(F) [ 1 T t=1 (E[g(Zt) Z1,...,Zt 1] g(Zt))] 2RT (ℓ(F)), where ℓ(F) = {ℓ(f, ) f F} and the supremum is taken over all distributions P over (Z1,...,ZT ). We can now employ the tools developed earlier in the paper to upper bound the value of the game. Interestingly, any non-trivial upper bound guarantees existence of a prediction strategy that has sublinear regret irrespective of the sequence of the moves of the adversary. This complexity-based approach of establishing learnability should be contrasted with the purely algorithm-based approaches found in the literature. 5.1 Supervised Learning In this subsection we study the supervised learning problem mentioned earlier in the paper. In this improper learning scenario, the learner at time t picks a function ft X R and the adversary provides the input target pair zt = (xt,yt) X Y where Y R. In particular, the binary classification problem corresponds to the case Y = { 1}. Let F YX and let us fix the absolute value loss function ℓ(ˆy,y) = ˆy y . While we focus on the absolute loss, it is easy to see that all the results hold (with modified rates) for any loss ℓ(ˆy,y) such that for all ˆy and y, φ(ℓ(ˆy,y)) ˆy y Φ(ℓ(ˆy,y)) where Φ and φ are monotonically increasing functions. For instance, the squared loss (ˆy y)2 is a classic example. We now observe that the value of the improper supervised learning game can be equivalently written as VS T (F) = sup x1 inf q1 Q sup y1 E ˆy1 q1 sup x T inf q T Q sup y T E ˆy T q T [ T t=1 ℓ(ˆyt,yt) inf f F T t=1 ℓ(f(xt),yt)], (10) Online Learning via Sequential Complexities where Q denotes the set of probability distributions over Y and ˆyt has distribution qt. This equivalence is easy to verify: we may view the choice ft X Y as pre-specifying predictions ft(x) for all the possible x X, while alternatively we can simply make the choice ˆyt Y having observed the particular move xt X. The advantage of rewriting the game in the form (10) is that the minimax theorem only needs to be applied to the pair ˆyt and yt, given the fixed choice xt. The minimax theorem then holds even if weak compactness cannot be shown for the set of distributions on the original space of functions of the type X Y. An examination of the proof of Theorem 7 reveals that the value (10) is upper bounded in exactly the same way, and the side information simply appears as an additional tree x in sequential Rademacher complexity, giving us: 1 T VS T (F) 2sup x,y E[sup f F T t=1 ϵtℓ(f(xt(ϵ)),yt(ϵ))] . (11) However, for the supervised learning setting, we can strengthen Theorem 7. The following theorem allows us to remove any convex Lipschitz loss (including the absolute loss) before passing to the sequential Rademacher complexity. Theorem 8 Let Y = [ 1,1] and suppose, for any y Y, ℓ( ,y) is convex and L-Lipschitz. Then the minimax value of a supervised learning problem is upper bounded as 1 T VS T (F) 2LRT (F). We remark that the contraction property for sequential Rademacher complexity, stated in Section 4, yields an extraneous logarithmic factor when applied to (11); here, we achieve the desired bound by removing the Lipschitz function directly during the symmetrization step. Armed with the theorem, we now prove the following result. Proposition 9 Consider the supervised learning problem with a function class F [ 1,1]X and absolute loss ℓ(ˆy,y) = ˆy y . Then, for any T 1, we have min{fatα,T} T VS T (F) 2RT (F) 2DT (F) fatβ log (2e T β ) dβ , (12) where fatα = fatα(F). The proposition above implies that finiteness of the fat-shattering dimension at all scales is necessary and sufficient for online learnability of the supervised learning problem. Further, all the complexity notions introduced so far are within a poly-logarithmic factor from each other whenever the problem is learnable. These results are summarized in the next theorem: Theorem 10 For any function class F [ 1,1]X , the following statements are equivalent Rakhlin, Sridharan and Tewari 1. Function class F is online learnable in the supervised setting with absolute loss. 2. Sequential Rademacher complexity satisfies lim T RT (F) = 0. 3. For any α > 0, the scale-sensitive dimension fatα(F) is finite. Moreover, if the function class is online learnable, then the value of the supervised game VS T (F), the sequential Rademacher complexity RT (F), and the integrated complexity DT (F) are within a multiplicative factor of O(log3/2 T) of each other. Remark 11 Additionally, the three statements of Theorem 10 are equivalent to F satisfying a martingale version of the uniform Law of Large Numbers. This property is termed Sequential Uniform Convergence by Rakhlin et al. (2014), and we refer to their paper for more details. For binary classification, we write VBinary T for VS T . This case has been investigated thoroughly by Ben-David et al. (2009) and indeed served as a key motivation for this paper. As a consequence of Proposition 9 and (9), we have a tight control on the value of the game for the binary classification problem. Note that the absolute loss in the binary classification setting is simply the 0-1 loss ℓ(ˆy,y) = 1{ˆy y}, where 1{U} is 1 if U is true and 0 otherwise. Corollary 12 For the binary classification problem with function class F and the 0-1 loss, we have K1 T min{Ldim(F),T} VBinary T (F) K2 T Ldim(F)log T for some universal constants K1,K2 > 0. Both the upper and the lower bound in the above result were originally derived in Ben David et al. (2009). Notably, we achieved the same bounds non-constructively through purely combinatorial and covering number arguments. It is natural to ask whether being able to learn in the online model is different from learning in the i.i.d. model (in the distribution-free supervised setting). The standard example that exhibits a gap between the two frameworks (e.g., Littlestone, 1988; Ben-David et al., 2009) is binary classification using the class of step functions F = {fθ(x) = 1{x θ} θ [0,1]} on [0,1]. This class has VC dimension 1, but is not learnable in the online setting. Indeed, it is possible to verify that the Littlestone dimension is infinite. Interestingly, the closelyrelated class of ramp functions with slope L > 0 FL = {fθ(x) = 1{x θ} + (1 L(x θ))1{θ < x θ + 1/L} θ [0,1]} is learnable (say for supervised learning using absolute loss) in the online setting (and hence also in the i.i.d. case). Furthermore, the larger class of all bounded L-Lipschitz functions on a bounded interval is also online learnable (see Eq. 14 and proof of Proposition 18). Once again, we are able to make these statements from purely complexity-based considerations, without exhibiting an algorithm. Further examples where we can demonstrate online learnability are explored in Section 6. Online Learning via Sequential Complexities 5.2 Online Convex Optimization Over the past decade, Online Convex Optimization (OCO) has emerged as a unified online learning framework (Zinkevich, 2003; Shalev-Shwartz, 2011). Various methods, such as Exponential Weights, can be viewed as instances of online mirror descent, solving the associated OCO problem. Much research effort has been devoted to understanding this abstract and simplified setting. It is tempting to say that any problem of online learning, as defined in the Introduction, can be viewed as OCO (in fact, online linear optimization) over the set of probability distributions; however, one should also recognize that by linearizing the problem, any interesting structure is lost and one instead suffers from the possibly unnecessary dependence on the number of functions in the class F. Nevertheless, OCO is a central part of the recent literature, and we will study this scenario using techniques developed in this paper. The standard setting of online convex optimization is as follows. The set of moves of the learner F is a bounded closed convex subset of a Banach space (B, ) with f D for all f F (the reader can think of Rd equipped with an ℓp norm for simplicity). Let be the dual norm. The adversary s set Z consists of convex G-Lipschitz (with respect to ) functions over F: Z = Zcvx = {g F R g convex and G-Lipschitz w.r.t. } . Let the loss function be ℓ(f,g) = g(f), the evaluation of the adversarially chosen function at f. For the particular case of online linear optimization, we instead take Z = Zlin = {f f,z z G} with Z now a subset of the dual space. It is well-known (e.g., Abernethy et al., 2008) that the online convex optimization problem (without further assumptions on the functions in Zcvx) is as hard as the corresponding linear optimization problem with Zlin if one considers deterministic algorithms. The same trivially extends to randomized methods: Lemma 13 Suppose F,Zcvx,Zlin be defined as above. Then we have VT (F,Zcvx) = VT (F,Zlin) . We will now show how to use the above result to derive minimax regret guarantees for OCO. The reader may wonder why we do not directly try to bound the value VT (F,Zcvx) by RT (F,Zcvx). In fact, this proof strategy cannot give a non-trivial bound if F is a subset of a high-dimensional (or infinite-dimensional) space (Shalev-Shwartz et al., 2009, Sec. 4.1). Instead, we use the lemma above to bound the value of the game where adversary plays convex functions with that of the game where adversary plays linear functions. A function Ψ F R is (σ,q)-uniformly convex (for q [2, )) on F with respect to a norm if, for all θ [0,1] and f1,f2 F, Ψ(θf1 + (1 θ)f2) θΨ(f1) + (1 θ)Ψ(f2) σ θ (1 θ) q f1 f2 q . A (σ,2)-uniformly convex function will be called σ-strongly convex. Rakhlin, Sridharan and Tewari We will give examples shortly but we first state a proposition that is useful to bound the sequential Rademacher complexity of linear function classes. The crucial duality fact exploited in its proof is that Ψ is (σ,q)-uniformly convex with respect to if and only if Ψ is (1/σ,p)-uniformly smooth with respect to where 1/p + 1/q = 1. Proposition 14 (Rakhlin et al., 2014) Let F be a subset of some Banach space B with norm and let Z be a subset of the dual space B equipped with norm . Suppose that Ψ F R is (σ,q)-uniformly convex with respect to and 0 Ψ(f) Ψmax for all f F. Then we have RT (F) Cp Z ( Ψp 1 max σ T p 1 ) where Z = supz Z z , p is such that 1/p + 1/q = 1, and Cp = (p/(p 1)) Using the above Proposition in conjunction with Lemma 13 and Theorem 7, we can immediately conclude that VT (F,Zcvx) 2T RT (F) 2G for any non-negative function Ψ F R that is σ-strongly convex w.r.t. . Note that, typically, Ψmax will depend on D. For example, in the particular case when = = 2, we can take Ψ(u) = 1 2 u 2 2 and the above bound becomes 2GD T and recovers the guarantee for the online gradient descent algorithm. In general, for = p and = q, we can use Ψ(u) = 1 2 u 2 p to get a bound of 2GD T/(p 1) since Ψ is (p 1)-strongly convex w.r.t. p. These O( T) regret rates are not new but we re-derive them to illustrate the usefulness of the tools we developed. 6. Further Examples Now we present some further applications of the tools we have developed in this paper for some specific learning problems. To begin, we show how to bound the sequential Rademacher complexity of functions computed by neural networks. Then, we derive margin based regret bounds in a fairly general setting. The classical analogues of these margin bounds have played a big role in the modern theory of supervised learning where they help explain the success of linear classifiers in high dimensional spaces (e.g., Schapire et al., 1997; Koltchinskii and Panchenko, 2002). We then study the complexity of classes formed by decision trees, analyze the setting of transductive learning, and consider an online version of the Isotron problem. Finally, we make a connection to the seminal work of Cesa-Bianchi and Lugosi (1999) by re-deriving their bound on the minimax regret in a static experts game in terms of the classical Rademacher averages. 6.1 Neural Networks We provide below a bound on the sequential Rademacher complexity for classic multi-layer neural networks thus showing they are learnable in the online setting. The model of neural Online Learning via Sequential Complexities networks we consider below and the bounds we provide are analogous to the ones considered in the i.i.d. setting by Bartlett and Mendelson (2003). Consider a k-layer 1-norm neural network, defined by a base function class F1 and, recursively, for each 2 i k, Fi = x j wi jσ (fj(x)) j fj Fi 1, wi 1 Bi , where σ is a Lipschitz transfer function, such as the sigmoid function. Proposition 15 Suppose σ R [ 1,1] is L-Lipschitz with σ(0) = 0. Then it holds that RT (Fk) ( k i=2 16Bi)Lk 1 (1 + 4 2log3/2(e T 2)) k RT (F1). In particular, for the case of F1 = {x j w1 jxj w 1 B1} and X Rd we have the bound RT (Fk) ( k i=1 16Bi)Lk 1 (1 + 4 2log3/2(e T 2)) k X where X is such that x X, x X . Our result is a non-constructive guarantee, and, to the best of our knowledge, no algorithms for learning neural networks within the online learning model exist. It is not clear if the above bounds could be obtained via computationally efficient methods. 6.2 Margin Based Regret In the classical statistical setting, margin bounds provide guarantees on the expected zeroone loss of a classifier based on the empirical margin zero-one error. These results form the basis of the theory of large margin classifiers (see Schapire et al., 1997; Koltchinskii and Panchenko, 2002). Recently, in the online setting, bounds of a similar flavor have been shown through the concept of margin via the Littlestone dimension (Ben-David et al., 2009). We show that our machinery can easily lead to margin bounds for binary classification problems for general function classes F based on their sequential Rademacher complexity. We use ideas from (Koltchinskii and Panchenko, 2002) to do this. Proposition 16 For any function class F [ 1,1]X , there exists a randomized prediction strategy given by τ such that for any sequence z1,...,z T where each zt = (xt,yt) X { 1}, T t=1 Eˆyt τt(z1 t 1) [1{ˆytyt < 0}] inf γ>0 {inf f F T t=1 1{f(xt)yt < 2γ} + 16 2log3/2(e T 2))TRT (F) + 2 T (1 + log log (1 Rakhlin, Sridharan and Tewari To interpret the above bound, suppose that the sequence of yt s is predicted with a margin 2γ by some function f F. The upper bound guarantees that there exists a strategy (that does not need to know the value of γ) with cumulative loss given by the sequential Rademacher complexity of F divided by the margin, up to poly-logarithmic factors. Crucially, the bound does not directly depend on the dimensionality of the input space X. 6.3 Decision Trees We consider here the binary classification problem where the learner competes with a set of decision trees of depth no more than d. The function class F for this problem is defined as follows. Each f F is defined by choosing a rooted binary tree of depth no more than d and associating to each node a binary valued decision function from a set H { 1}X . A binary value for a given x can be obtained by traversing the tree from the root according to the value of the decision function at each node and then reading offthe label of the leaf. Importantly, x reaches only one leaf of the tree. Alternatively, for any leaf l, the membership of x is given by the conjunction i 1{hl,i(x) = 1} where hl,i is either the decision function at node i along the path to the leaf l, or its negation. To complete the definition of f, we choose weights wl > 0, l wl = 1, along with the value σl { 1} of the function on each leaf l. The resulting function f can be written as f(x) = l wlσl i 1{hl,i(x) = 1} where the sum runs over all the leaves of the tree. The following proposition is the online analogue of a result about decision tree learning that Bartlett and Mendelson (2003) proved in the i.i.d. setting. Proposition 17 Denote by F the class of decision trees of depth at most d with decision functions in H. There exists a randomized strategy τ for the learner such that for any sequence of instances z1,...,z T , with zt = (xt,yt) X { 1}, T t=1 Eˆyt τt(z1 t 1) [1{ˆyt yt}] inf f F T t=1 1{f(xt) yt} + O ( l min(C(l),dlog3(T) T R(H)) + where C(l) denotes the number of instances that reach the leaf l and are correctly classified in the decision tree f that minimizes T t=1 1{ytf(xt) 0}, with N > 2 being the number of leaves in this tree. It is not clear whether computationally feasible online methods exist for learning decision trees, and this represents an interesting avenue of further research. Online Learning via Sequential Complexities 6.4 Transductive Learning Let F be a class of functions from X to R. Let N (α,F) = min{ G G RX s.t. f F g G satisfying f g α} (13) be the ℓ covering number at scale α, where the cover is pointwise on all of X. It is easy to verify that T, N (α,F,T) N (α,F) . (14) Indeed, let G be a minimal cover of F at scale α. We claim that for any X-valued tree of depth T, the set V = {vg = g x g G} of R-valued trees is an ℓ cover of F on x. Fix any ϵ { 1}T and f F, and let g G be such that f g α. Clearly vg t (ϵ) f(xt(ϵ)) α for any 1 t T, concluding the proof. This simple observation can be applied in several situations. First, consider the problem of transductive learning, where the set X = {x1,...,xn} is a finite set. To ensure online learnability, it is sufficient to consider an assumption on the dependence of N (α,F) on α. An obvious example of such a class is a VC-type class with N (α,F) (c/α)d for some c which can depend on n. Assume that F [ 1,1]X . Substituting this bound on the covering number into (6) and choosing α = 0, we observe that the value of the supervised game is upper bounded by 2DT (F) 48 d T log c by Proposition 9. It is easy to see that if n is fixed and the problem is learnable in the batch (i.e., i.i.d.) setting, then the problem is learnable in the online transductive model. In the transductive setting considered by Kakade and Kalai (2006), it is assumed that n T and F consists of binary-valued functions. If F is a class with VC dimension d, the Sauer-Shelah lemma ensures that the ℓ cover is smaller than (en/d)d (e T/d)d. Using the previous argument with c = e T, we obtain a bound of 4 d T log(e T) for the value of the game, matching the bound of Kakade and Kalai (2006) up to a constant factor. 6.5 Isotron Kalai and Sastry (2009) introduced a method called Isotron for learning Single Index Models (SIM). These models generalize linear and logistic regression, generalized linear models, and classification by linear threshold functions. For brevity, we only describe the Idealized SIM problem considered by the authors. In its batch version, we assume that the data are revealed at once as a set {(xt,yt)}T t=1 Rd R where yt = u( w,xt ) for some unknown w Rd of bounded norm and an unknown non-decreasing u R R with a bounded Lipschitz constant. Given this data, the goal is to iteratively find the function u and the direction w, making as few mistakes as possible. The error is measured as 1 T T t=1(fi(xt) yt)2, where fi(x) = ui( wi,x ) is the iterative approximation found by the algorithm on the ith round. The elegant computationally efficient method presented by Kalai and Sastry (2009) is motivated by Perceptron, and a natural open question posed by the authors is whether there is an online variant of Isotron. Before even attempting a quest for such an algorithm, we can ask a more basic question: is the (Idealized) SIM problem even learnable in the online framework? After all, most online methods deal with convex functions, but u is only assumed to be Lipschitz and non-decreasing. We answer the question easily with the tools we have developed. Rakhlin, Sridharan and Tewari We are interested in online learnability of H = {f(x,y) = (y u( w,x ))2 u [ 1,1] [ 1,1] 1-Lipschitz , w 2 1} (15) in the supervised setting, over X = B2 (the unit Euclidean ball in Rd) and Y = [ 1,1]. In particular, we prove the result for Lipschitz, but not necessarily non-decreasing functions. It is evident that H is a composition with three levels: the squared loss, the Lipschitz nondecreasing function, and the linear function. The proof of the following proposition shows that the covering number of the class does not increase much under these compositions. Proposition 18 The class H defined in (15) is online learnable in the (improper) supervised learning setting. Moreover, the minimax regret is T log3/2(T)). Once again, it is not clear whether a computationally efficient method attaining the above guarantee exists. 6.6 Prediction of Individual Sequences with Static Experts We also consider the problem of prediction of individual sequences, which has been studied both in information theory and in learning theory. In particular, in the case of binary prediction, Cesa-Bianchi and Lugosi (1999) proved upper bounds on the minimax value in terms of the (classical) Rademacher complexity and the (classical) Dudley integral. One of the assumptions made by Cesa-Bianchi and Lugosi (1999) is that experts are static. That is, their prediction only depends on the current round, not on the past information. Formally, we define static experts as vectors f = (f1,...,f T ) [0,1]T , and let F denote a class of such experts. Let Y = {0,1}, putting us in the scenario of binary classification with no side information. Then regret on a particular sequence y1,...,y T can be written as T t=1 ℓt( ft,yt) inf f F t=1 ℓt( f,yt), where ft is the expert chosen by the learning algorithm at time t. Observe that the proof of Theorem 7 does not require the loss to be time independent. In the case of absolute loss, the Rademacher complexity appearing on the right hand side in Theorem 7 becomes sup y Eϵ sup f F T t=1 ϵtℓt( f,yt(ϵ)) = sup y Eϵ sup f F T t=1 ϵt ft yt(ϵ) . where the supremum is over all Y-valued trees of depth T. Noting that for f [0,1],y {0,1}, f y can be written as (1 2y)f + y, the above equals T t=1 ϵt(1 2yt(ϵ))ft + T t=1 ϵtyt(ϵ) = sup y Eϵ sup f F T t=1 ϵt(1 2yt(ϵ))ft . Online Learning via Sequential Complexities It can be easily verified that the joint distribution of {ϵt(1 2yt(ϵ))}T t=1 is still i.i.d. Rademacher and hence the value of the game is upper bounded by 2Eϵ sup f F T t=1 ϵtft , recovering the upper bound of Theorem 3 in (Cesa-Bianchi and Lugosi, 1999). We note that for this particular scenario, the factor of 2 (that appears because of symmetrization) is not needed. This factor is the price we pay for deducing the result from the general statement of Theorem 7. 7. Discussion The tools provided in this paper allow us to establish existence of regret minimization algorithms by working directly with the minimax value. The non-constructive nature of our results is due to the application of the minimax theorem: the dual strategy does not give a handle on the primal strategy. Furthermore, by passing to upper bounds on the dual formulation (2) of the value of the game, we remove the dependence on the dual strategy altogether. After the original paper (Rakhlin et al., 2010) appeared, the algorithmic approach has been developed by Rakhlin et al. (2012) who showed that the prediction for round t can be obtained by appealing to the minimax theorem for rounds t + 1 to T, yet keeping the minimax expression for round t as is. The notion of a relaxation (in the spirit of approximate dynamic programming) then allowed the authors to develop a general recipe for deriving computationally feasible prediction methods. The techniques of the present paper form the basis for the algorithmic developments of Rakhlin et al. (2012). We refer the reader to (Rakhlin and Sridharan, 2014; Rakhlin et al., 2012) for details. Acknowledgments We would like to thank J. Michael Steele and Dean Foster for helpful discussions. We gratefully acknowledge the support of NSF under grants CAREER DMS-0954737 and CCF1116928. Appendix A. A Minimax Theorem The minimax theorem is one of this paper s main workhorses. For completeness, we state a general version of this theorem the von Neumann-Fan minimax theorem due to Borwein (2014) (see also Borwein and Zhuang, 1986). Theorem 19 (Borwein, 2014) Let A and B be Banach spaces. Let A A be nonempty, weakly compact, and convex, and let B B be nonempty and convex. Let g A B R be concave with respect to b B and convex and lower-semicontinuous with respect to a A, and weakly continuous in a when restricted to A. Then sup b B inf a A g(a,b) = inf a A sup b B g(a,b). (16) Rakhlin, Sridharan and Tewari In the proof of Theorem 1, the minimax theorem is invoked to assure that inf qt Q sup pt P E[ℓ(ft,zt) + ξ(zt)] = sup pt P inf qt Q E[ℓ(ft,zt) + ξ(zt)], (17) where ξ(zt) is a rather complicated function that includes the repeated infima and suprema from steps t + 1 to T of regret expression that includes the variable zt (but not ft). The expectation in (17) is with respect to ft qt and zt pt. To apply (16), we take g to be the bilinear form in qt and pt, with A = Q and B = P. Equipped with the total variation distance, Q and P can be seen as subsets of a Banach space of measures on F and Z, respectively. In terms of conditions, it is enough to check weak compactness of Q and assume continuity of the loss function (lower semi-continuity can be used as well). Weak compactness of the set of probability measures on a complete separable metric space is equivalent to uniform tightness by the fundamental result of Prohorov (see, e.g., Bogachev 2007, Theorem 8.6.2., and van der Vaart and Wellner 1996). If F itself is compact, then the set (F) of probability measures on F is tight, and hence (under the continuity of the loss) the minimax theorem holds. If F is not compact, tightness can be established under the following general condition. According to Example 8.6.5 (ii) in Bogachev (2007), a family (F) of Borel probability measures on a separable reflexive Banach space E is uniformly tight (under the weak topology) precisely when there exists a function V E [0, ) continuous in the norm topology such that lim f V (f) = and sup q (F) Ef q V (f) < . As an example, if F is a subset of a ball in E, it is enough to take V (f) = f . Finally, we remark that in the supervised learning case by considering the improper learning scenario we allow xt to be observed before the choice ˆyt is made. Therefore, we do not need to invoke the minimax theorem on the space of functions F, but rather (see the proof of Theorem 8) for two real-valued decisions in a bounded interval. This makes the application of the minimax theorem straightforward. Appendix B. Proofs Proof [of Theorem 1] For brevity, denote ψ(z1 T ) = inff F T t=1 ℓ(f,zt). The first step in the proof is to appeal to the minimax theorem for every couple of inf and sup: VT (F) = inf q1 sup p1 Ef1 q1 z1 p1 ...inf q T sup p T Ef T q T z T p T { T t=1 ℓ(ft,zt) ψ(z1 T )} = sup p1 inf q1 Ef1 q1 z1 p1 ...sup p T inf q T Ef T q T z T p T { T t=1 ℓ(ft,zt) ψ(z1 T )} = sup p1 inf f1 Ez1 p1 ...sup p T inf f T Ez T p T { T t=1 ℓ(ft,zt) ψ(z1 T )}, where qt and pt range over Q and P, the sets of distributions on F and Z, respectively. From now on, it will be understood that zt has distribution pt. By moving the expectation Online Learning via Sequential Complexities with respect to z T and then the infimum with respect to f T inside the expression, we arrive at sup p1 inf f1 E z1 ... sup p T 1 inf f T 1 E z T 1 sup p T { T 1 t=1 ℓ(ft,zt) + [inf f T E z T ℓ(f T ,z T )] E z T ψ(z1 T )} = sup p1 inf f1 E z1 ... sup p T 1 inf f T 1 E z T 1 sup p T E z T { T 1 t=1 ℓ(ft,zt) + [inf f T E z T ℓ(f T ,z T )] ψ(z1 T )}. (18) Let us now repeat the procedure for step T 1. The above expression is equal to sup p1 inf f1 E z1 ... sup p T 1 inf f T 1 E z T 1 { T 1 t=1 ℓ(ft,zt) + sup p T E z T [inf f T E z T ℓ(f T ,z T ) ψ(z1 T )]} which, in turn, is equal to sup p1 inf f1 E z1 ... sup p T 1 { T 2 t=1 ℓ(ft,zt) + [ inf f T 1 E z T 1 ℓ(f T 1,z T 1)] + E z T 1 sup p T E z T [inf f T E z T ℓ(f T ,z T ) ψ(z1 T )]} = sup p1 inf f1 E z1 ... sup p T 1 E z T 1 sup p T E z T { T 2 t=1 ℓ(ft,zt) + [ inf f T 1 E z T 1 ℓ(f T 1,z T 1)] +[inf f T E z T ℓ(f T ,z T )] ψ(z1 T )}. Continuing in this fashion for T 2 and all the way down to t = 1 proves the theorem. Proof [of Lemma 4] Without loss of generality assume that the Lipschitz constant L = 1, as the general case follows by scaling φ. Fix a Z-valued tree z of depth T. We first claim that log N2(β,φ G,z) k j=1 log N (β,Gj,z) . Suppose V1,...,Vk are minimal β-covers with respect to ℓ for G1,...,Gk on the tree z. Consider the set V φ = {vφ v V1 ... Vk}, where vφ is the tree such that vφ t (ϵ) = φ(vt(ϵ),zt(ϵ)). Then, for any g = (g1,...,gk) G and any ϵ { 1}T , with representatives (v1,...,vk) V1 ... Vk, we have, T t=1 (φ(g(zt(ϵ)),zt(ϵ)) vφ t (ϵ)) 2 max t [T] φ(g(zt(ϵ)),zt(ϵ)) vφ t (ϵ) = max t [T] φ(g(zt(ϵ)),zt(ϵ)) φ(vt(ϵ),zt(ϵ)) max j [k] max t [T] gj(zt(ϵ))) vj t(ϵ) β. Rakhlin, Sridharan and Tewari Thus we see that V φ is an β-cover with respect to ℓ for φ G on z. Hence log N2(β,φ G,z) log( V φ ) = k j=1 log( Vj ) = k j=1 log N (β,Gj,z). (19) For any g G and z Z, the value φ(g(z),z) is contained in the interval [ 1 + φ(0,z),+1 + φ(0,z)] by the Lipschitz property. Consider the R-valued tree φ(0, ) z. We now center by this tree and consider the set of trees {φ(g( ), ) z φ(0, ) z g G}. The centering does not change the size of the cover calculated in (19), but allows us to invoke (7) since the function values are now in [ 1,1]: RT (φ G,z) inf α k j=1 log N (β,Gj,z) dβ log N (β,Gj,z) dβ . (20) We substitute the upper bound on covering numbers in (8) for each Gj and arrive at an upper bound of fatβ(Gj)log(2e T/β)dβ . (21) Lemma 2 of Rakhlin et al. (2014) implies that for any β > 2RT (Gj), fatβ(Gj) 32T RT (Gj)2 Let j = argmax j RT (Gj). Substituting this together with the value of α = 2RT (Gj ) into (21) yields an upper bound 8 RT (Gj ) + 48 2 k j=1 RT (Gj) 2RT (Gj ) 1 β log(2e T/β)dβ. Using the fact that for any b > 1 and α (0,1) log(b/β)dβ = log xdx = 2 3 log3/2(x) b/α 3 log3/2(b/α) (22) we obtain a further upper bound of 8 RT (Gj ) + 32 2 k j=1 RT (Gj) log3/2 ( e T RT (Gj )) . Online Learning via Sequential Complexities Replacing the first term by 8 j RT (Gj), we conclude that RT (φ G,z) 8(1 + 4 2log3/2(e T 2)) k j=1 RT (Gj) as long as RT (Gj) 1/T for each j. The statement is concluded by observing that z was chosen arbitrarily. Proof [of Corollary 6] We first extend the binary function b to a function b to any x Rk as follows : b(x) = { (1 x a )b(a) if x a < 1 for some a { 1}k 0 otherwise First note that b is well-defined since all points in the k-cube are separated by L distance 2. Further note that b is 1-Lipschitz w.r.t. the L norm and so applying Lemma 4 we conclude the statement of the corollary. Proof [of Theorem 7] Let Et 1[ ] = E[ Z1,...,Zt 1] denote the conditional expectation. Using Theorem 1 we have, VT (F) = sup p1 E Z1 p1 ...sup p T E ZT p T [ T t=1 inf ft F Et 1ℓ(ft, ) inf f F T t=1 ℓ(f,Zt)] = sup p1 E Z1 p1 ...sup p T E ZT p T [sup f F { T t=1 inf ft F Et 1ℓ(ft, ) T t=1 ℓ(f,Zt)}] sup p1 E Z1 p1 ...sup p T E ZT p T [sup f F { T t=1 Et 1ℓ(f, ) T t=1 ℓ(f,Zt)}]. (23) The upper bound is obtained by replacing each infimum by a particular choice f. This step also holds if the choice ft of the learner comes from a larger set G, as long as F G. The proof is concluded by appealing to (3). Proof [of Theorem 8] Let Q denote the set of distributions on Y = [ 1,1]. By convexity, T t=1 ℓ(ˆyt,yt) inf f F T t=1 ℓ(f(xt),yt) sup f F T t=1 ℓ (ˆyt,yt)(ˆyt f(xt)), where ℓ (ˆyt,yt) is a subgradient of the function y ℓ( ,yt) at ˆyt. Then the minimax value (10) can be upper bounded as VS T (F) sup x1 inf q1 Q sup y1 E ˆy1 q1 ...sup x T inf q T Q sup y T Eˆy T q T [sup f F T t=1 ℓ (ˆyt,yt)(ˆyt f(xt))]. Rakhlin, Sridharan and Tewari By the Lipschitz property of ℓ, we can replace each subgradient ℓ (ˆyt,yt) with a number st [ L,L] to obtain the upper bound sup x1 inf q1 Q sup y1 E ˆy1 q1 sup s1 [ L,L] ...sup x T inf q T Q sup y T E ˆy T q T sup s T [ L,L] {sup f F T t=1 st (ˆyt f(xt))}. Since yt s no longer appear in the optimization objective, we can simply write the above as sup x1 inf q1 Q E ˆy1 q1 sup s1 [ L,L] ...sup x T inf q T Q E ˆy T q T sup s T [ L,L] {sup f F T t=1 st (ˆyt f(xt))} = sup x1 inf ˆy1 [ 1,1] sup s1 [ L,L] ...sup x T inf ˆy T [ 1,1] sup s T [ L,L] {sup f F T t=1 st (ˆyt f(xt))}, where the equality follows because infima are obtained at point distributions. By the same reasoning, we now pass to distributions over st s: sup x1 inf ˆy1 [ 1,1] sup p1 E s1 p1 ...sup x T inf ˆy T [ 1,1] sup p T Es T p T [ T t=1 st ˆyt inf f F T t=1 stf(xt)]. (24) From now on, it will be understood that the supremum over pt ranges over all distributions supported on [ L,L], for any t, and st has distribution pt. Now note that Es T [ T t=1 st ˆyt inf f F T t=1 st f(xt)] is concave (linear) in p T and is convex in ˆy T and hence by the minimax theorem, inf ˆy T [ 1,1] sup p T Es T [ T t=1 st ˆyt inf f F T t=1 stf(xt)] = sup p T inf ˆy T [ 1,1] Es T [ T t=1 st ˆyt inf f F T t=1 stf(xt)] = T 1 t=1 st ˆyt + sup p T Es T inf ˆy T [ 1,1] Es T [s T ] ˆy T inf f F T t=1 stf(xt) , where the last step is similar to the one in the proof of Theorem 1, specifically (18). Similarly note that the term T 1 t=1 st ˆyt + sup p T ,x T Es T inf ˆy T [ 1,1] Es T [s T ] ˆy T inf f F T t=1 stf(xt) is concave (linear) in p T 1 and is convex in ˆy T 1 and hence again by the minimax theorem, inf ˆy T 1 [ 1,1] sup p T 1 E s T 1 T 1 t=1 st ˆyt + sup p T ,x T E s T inf ˆy T [ 1,1] Es T [s T ] ˆy T inf f F T t=1 stf(xt) = sup p T 1 inf ˆy T 1 [ 1,1] E s T 1 T 1 t=1 st ˆyt + sup p T ,x T Es T inf ˆy T [ 1,1] Es T [s T ] ˆy T inf f F T t=1 stf(xt) = T 2 t=1 st ˆyt + sup p T 1 E s T 1 sup p T ,x T Es T T t=T 1 inf ˆyt [ 1,1] Est [st] ˆyt inf f F T t=1 stf(xt) . Online Learning via Sequential Complexities Proceeding in similar fashion and using this in (24) we conclude that, VS T (F) sup x1 inf ˆy1 [ 1,1] sup p1 E s1 p1 ...sup x T inf ˆy T [ 1,1] sup p T Es T p T [ T t=1 st ˆyt inf f F T t=1 stf(xt)] = sup x1 sup p1 E s1 p1 ...sup x T sup p T E s T p T T t=1 inf ˆyt [ 1,1] Est pt [st] ˆyt inf f F T t=1 stf(xt) sup x1 sup p1 E s1 p1 ...sup x T sup p T Es T p T [sup f F T t=1 (Est pt [st] st)f(xt)], where we replaced each ˆyt with a potentially suboptimal choice f(xt). Passing the expectation past the suprema we obtain an upper bound sup x1 sup p1 E s1,s 1 p1 ...sup x T sup p T Es T ,s T p T [sup f F T t=1 (s t st)f(xt)] (25) = sup x1 sup p1 E s1,s 1 p1 E ϵ1 ...sup x T sup p T E s T ,s T p T EϵT [sup f F T t=1 ϵt (s t st)f(xt)] sup x1 sup s1 [ 2L,2L] E ϵ1 ...sup x T sup s T [ 2L,2L] EϵT [sup f F T t=1 ϵtstf(xt)] = sup x1 sup s1 { 2L,2L} E ϵ1 ...sup x T sup s T { 2L,2L} EϵT [sup f F T t=1 ϵtstf(xt)] (26) = 2L sup x1 sup s1 { 1,1} E ϵ1 ...sup x T sup s T { 1,1} EϵT [sup f F T t=1 ϵtstf(xt)], (27) where the last inequality is because, for every t [T], we have convexity in st and so supremum is achieved at either 2L or 2L. Notice that after using convexity to go to gradients, the proof technique above basically mimics the proofs of Theorems 1 and 7 to get to a symmetrized term as we did in those theorems. Now consider any arbitrary function ψ { 1} R, we have that sup s { 1} Eϵ [ψ(s ϵ)] = sup s { 1} 1 2 (ψ(+s) + ψ( s)) = 1 2 (ψ(+1) + ψ( 1)) = Eϵ [ψ(ϵ)]. Since in (27), for each t, st and ϵt appear together as ϵt st using the above equation repeatedly, we conclude that VS T (F) 2L sup x1 sup s1 { 1,1} Eϵ1 ...sup x T sup s T { 1,1} EϵT [sup f F T t=1 ϵtstf(xt)] = 2L sup x1 Eϵ1 ...sup x T EϵT [sup f F T t=1 ϵtf(xt)]. (28) We now claim that the above supremum can be written in terms of an X-valued tree. Briefly, the solution for x1 in (28) is attained (for simplicity, assume the supremum is attained) at Rakhlin, Sridharan and Tewari an optimal value x 1. The optimal value x 2 can be calculated for ϵ1 = 1 and ϵ1 = 1. Arguing in this manner leads to a tree x. We conclude VS T (F) 2L sup x Eϵ1 T [sup f F T t=1 ϵtf(xt(ϵ))] = 2LT RT (F). Proof [of Proposition 9] For the upper bound, we start by using Theorem 8 for absolute loss, which has a Lipschitz constant of 1, to bound the value of the game by sequential Rademacher complexity, 1 T VS T (F) 2RT (F) . We combine the above inequality with (7) and (8) to obtain the upper bound. Observe that a lower bound on the value can be obtained by choosing any particular joint distribution on sequences (x1,y1),...,(xt,yt) in (2): VS T (F) E[ T t=1 inf ft F E(xt,yt) [ yt ft(xt) (x,y)1 t 1] inf f F T t=1 yt f(xt) ]. To this end, choose any X-valued tree x of depth T. Let y1,...,y T be i.i.d. Rademacher random variables and define xt = x(y1 t 1) deterministically (that is, the conditional distribution of xt is a point distribution on x(y1 t 1)). It is easy to see that this distribution makes the choice ft irrelevant, yielding VS T (F) E[ T t=1 1 inf f F T t=1 yt f(xt) ] = Ey1,...,y T sup f F T t=1 ytf(xt). Since this holds for any tree x, we obtain the desired lower bound VS T (F) RT (F). The final lower bound on RT (F) (in terms of the fat-shattering dimensions) is proved by Rakhlin et al. (2014, Lemma 2). Proof [of Theorem 10] The equivalence of 1 and 2 follows directly from Proposition 9. First, suppose that fatα is infinite for some α > 0. Then, the lower bound says that VS T (F) αT/(4 2) and hence limsup T VS T (F)/T α/(4 2). Thus, the class F is not online learnable in the supervised setting. Now, assume that fatα is finite for all α. Fix an ϵ > 0 and choose α = ϵ/16. Using the upper bound, we have VS T (F) 8Tα + 24 fatβ log (2e T fatα log (2e T ϵT/2 + ϵT/2 Online Learning via Sequential Complexities for T large enough. Thus, limsup T VS T (F)/T ϵ. Since ϵ > 0 was arbitrary, this proves that F is online learnable in the supervised setting. The statement that VS T (F), RT (F), and DT (F) are within a multiplicative factor of O(log3/2 T) of each other whenever the problem is online learnable follows immediately from Eq. (10) in (Rakhlin et al., 2014) and Proposition 9. Proof [of Lemma 13] Consider the game (F,Zcvx) and fix a randomized strategy π of the player. Then, the expected regret of a randomized strategy π against any adversary playing g1,...,g T can be lower-bounded via Jensen s inequality as T t=1 Eut πt(g1 t 1) [gt(ut)] inf u F T t=1 gt(u) T t=1 gt (Eut πt(g1 t 1) [ut]) inf u F T t=1 gt(u), which is simply regret of a deterministic strategy obtained from π by playing Eut πt(g1 t 1) [ut] on round t. Thus, to any randomized strategy corresponds a deterministic one that is no worse. On the other hand, the set of randomized strategies contains the set of deterministic ones. Hence, VT (F,Zcvx) = Vdet T (F,Zcvx) where Vdet T is defined as the minimax regret obtainable only using deterministic player strategies. Now, we appeal to Theorem 14 of Abernethy et al. (2008) that says Vdet T (F,Zcvx) = Vdet T (F,Zlin). Note that Abernethy et al. (2008) deal with convex sets in finite dimensional spaces only. However, their proof relies on fundamental properties of convex functions that are true in any general vector space (such as the fact that the first order Taylor expansion of a convex function globally lower bounds the convex function). Since Zlin also consists of convex (in fact, linear) functions, the above argument again gives Vdet T (F,Zlin) = VT (F,Zlin). This finishes the proof of the lemma. Proof [of Proposition 15] We shall prove that for any i {2,...,k}, RT (Fi) 16LBi (1 + 4 2log3/2(e T 2))RT (Fi 1). To see this note that for any x, RT (Fi,x) is equal to sup wi wi 1 Bi j fj Fi 1 T t=1 ϵt j wi jσ (fj(xt(ϵ))) sup wi wi 1 Bi j fj Fi 1 wi 1 max j T t=1 ϵtσ (fj(xt(ϵ))) by H older s inequality. Then RT (Fi) is upper bounded as sup x Eϵ [Bi sup f Fi 1 max{ T t=1 ϵtσ (f(xt(ϵ))), T t=1 ϵtσ (f(xt(ϵ)))}] sup x Eϵ [Bi max{ sup f Fi 1 T t=1 ϵtσ (f(xt(ϵ))), sup f Fi 1 T t=1 ϵtσ (f(xt(ϵ)))}] . Rakhlin, Sridharan and Tewari Since 0 Fi together with the assumption of σ(0) = 0, both terms are non-negative, and thus the maximum above can be upper bounded by the sum sup x Eϵ [Bi sup f Fi 1 T t=1 ϵtσ (f(xt(ϵ)))] + sup x Eϵ [Bi sup f Fi 1 T t=1 ϵtσ (f(xt(ϵ)))] . We now claim that the two terms are equal. Indeed, let x be the tree achieving the supremum in the first term (a modified analysis can be carried out if the supremum is not achieved). Then the mirror tree x defined via xt(ϵ) = x t ( ϵ) yields the same value for the second term. Since the argument can be carried out in the reverse direction, the two terms are equal, and the upper bound of 2Bi sup x Eϵ [ sup f Fi 1 T t=1 ϵtσ (f(xt(ϵ)))] follows. In view of contraction in Corollary 5, we obtain a further upper bound of 16Bi L(1 + 4 2log3/2(e T 2))RT (Fi 1). (29) To finish the proof we note that for the base case of i = 1, RT (F1) is equal to sup x Eϵ sup w Rd w 1 B1 T t=1 ϵtw xt(ϵ) which is upper bounded by sup x Eϵ sup w Rd w 1 B1 w 1 T t=1 ϵtxt(ϵ) B1 sup x Eϵ [max i [d] { T t=1 ϵtxt(ϵ)[i]}]. Note that the instances x X are vectors in Rd and so for a given instance tree x, for any i [d], x[i] given by only taking the ith co-ordinate is a valid real valued tree. By (4), T RT (F1) B1 sup x Eϵ [max i [d] { T t=1 ϵtxt(ϵ)[i]}] B1 2TX2 log d. Using the above and (29) repeatedly we conclude the proof. Proof [of Proposition 16] Fix a γ > 0 and use loss 1 ˆyy 0 1 ˆyy/γ 0 < ˆyy < γ 0 ˆyy γ Since this loss is 1/γ-Lipschitz, we can use (11) and the Rademacher contraction Corollary 5 to show that for each γ > 0 there exists a randomized strategy τ γ such that for any data sequence T t=1 Eˆyt τ γ t (z1 t 1) [ℓ(ˆyt,yt)] inf f F T t=1 ℓ(f(xt),yt) + γ 1ρT TRT (F), Online Learning via Sequential Complexities where ρT = 16(1 + 4 2log3/2(e T 2)) throughout the proof. Further, observe that the loss function is lower bounded by the zero-one loss 1{ˆyy < 0} and is upper bounded by the margin zero-one loss 1{ˆyy < γ}. Hence, T t=1 Eˆyt τ γ t (z1 t 1) [1{ˆytyt < 0}] inf f F T t=1 1{ytf(xt) < γ} + γ 1ρT TRT (F). (30) The above bound holds for randomized each strategy given by τ γ, for any given γ. Now we discretize the set of γ s as γi = 1/2i and use the output of the randomized strategies τ γ1,τ γ2,..., that attain the regret bounds given in (30), as experts. We then run a countable experts algorithm (Algorithm 1) with initial weight for expert i as pi = 6 π2i2 . Such an algorithm achieves O( T log(1/pi)) regret w.r.t. expert i. In view of Proposition 20, for this randomized strategy τ, for any i, T t=1 Eˆyt τt(z1 t 1) [1{ˆytyt < 0}] inf f F T t=1 1{ytf(xt) < γi} + γ 1 i ρT TRT (F) + T (1 + 2log ( iπ For any γ > 0, let iγ 0,1,..., be such that 2 (iγ+1) < γ 2 iγ. Then above right-hand side is upper bounded by T t=1 1{ytf(xt) < 2γ} + γ 1ρT TRT (F) + T (1 + 2log (iγπ The proof is concluded using the inequality iγ log(1/γ) and upper bounding constants. Proof [of Proposition 17] Fix some L > 0. The loss 1 if α 0 1 Lα if 0 < α 1/L 0 otherwise is L-Lipschitz and so by Theorem 7 and Corollary 5 we have that for every L > 0, there exists a randomized strategy τ L for the player, such that for any sequence z1 = (x1,y1),...,z T = (x T ,y T ), T t=1 Eˆyt τ L t (z1 t 1) [φL(ytˆyt)] inf f F T t=1 φL(ytf(xt)) + LρT TRT (F), (31) where ρT = 16(1 + 4 2log3/2(e T 2)) throughout this proof. Since φL dominates the step function, the left hand side of (31) also upper-bounds the expected indicator loss T t=1 Eˆyt τ L t (z1 t 1) [1{ˆyt yt}]. For any f F, we can relate the φL-loss to the indicator loss by T t=1 φL(ytf(xt)) = T t=1 1{ytf(xt) 0} + l C(l)φL(wl). Rakhlin, Sridharan and Tewari Let us now use the above decomposition in (31). Crucially, the sign of f(x) does not depend on wl, but only on the label σl of the unique leaf l reached by x. Thus, the infimum in (31) can be split into two infima: T t=1 φL(ytf(xt)) = inf f F T t=1 1{ytf(xt) 0} + inf wl l C(l)φL(wl), where it is understood that the C(l) term on the right hand side is computed using the function f minimizing the first sum on the right hand side. We can further write l C(l)φL(wl) l C(l)max(0,1 Lwl) = l max(0,(1 Lwl)C(l)). So far, we have derived a regret bound for a given L. Let us now remove the requirement to know L a priori by running the experts Algorithm 1 with τ 1,τ 2,... as a countable set of experts corresponding to the values L N. The prior on expert L is taken to be p L = 6 π2 L 2 so that p L = 1. For the randomized strategy τ obtained in this manner, from Proposition 20, for any sequence of instances and any L N, T t=1 Eˆyt τt(z1 t 1) [1{ˆy yt}] inf f F T t=1 1{ytf(xt) 0} + inf f F l max(0,(1 Lwl)C(l)) + LρT TRT (F) + Now we pick L = {l C(l) > ρT TRT (F)} N and upper bound the second infimum by choosing wl = 0 if C(l) ρT TRT (F) and wl = 1/L otherwise: inf wl l max(0,(1 Lwl)C(l)) + LρT TRT (F) l C(l)1{C(l) ρT TRT (F)} + ρT TRT (F) l 1{C(l) > ρT TRT (F)} which can be written succinctly as l min{C(l),ρT TRT (F)}. We conclude that T t=1 Eˆyt τt(z1 t 1) [1{ˆyt yt}] inf f F T t=1 1{ytf(xt) 0} + l min(C(l),ρT TRT (F)) + T (1 + 2log(Nπ/ Finally, we apply Corollary 6 and Lemma 3(2) to bound RT (F) d O(log3/2 T) RT (H) and thus conclude the proof. Proof [of Proposition 18] First, by the classical result of Kolmogorov and Tikhomirov (1959), the class G of all bounded Lipschitz functions on a bounded interval has small metric Online Learning via Sequential Complexities entropy: log N (α,G) = Θ(1/α). For the particular class of non-decreasing 1-Lipschitz functions, it is trivial to verify that the entropy is in fact bounded by 2/α. Considering all 1-Lipschitz functions increases this to c0/α for some universal constant c0. Next, consider the class F = { w,x w 2 1} over the Euclidean ball. By Proposition 14, RT (F) 1/ T. Using the lower bound of Proposition 9, fatα 32/α2 whenever α > 4 T. This implies that N (α,F,T) (2e T/α)32/α2 whenever α > 4 T. Note that this bound does not depend on the ambient dimension of X. Next, we show that a composition of G with any small class F [ 1,1]X also has a small cover. To this end, suppose N (α,F,T) is the covering number for F. Fix a particular tree x and let V = {v1,...,v N} be an ℓ cover of F on x at scale α. Analogously, let W = {g1,...,g M} be an ℓ cover of G with M = N (α,G). Consider the class G F = {g f g G,f F}. The claim is that {g(v) v V,g W} provides an ℓ cover for G F on x. Fix any f F,g G and ϵ { 1}T . Let v V be such that maxt [T] f(xt(ϵ)) vt(ϵ) α, and let g W be such that g g α. Then, using the fact that functions in G are 1-Lipschitz, for any t [T], g(f(xt(ϵ))) g (vt(ϵ)) g(f(xt(ϵ))) g (f(xt(ϵ)) + g (f(xt(ϵ)) g (vt(ϵ)) 2α . Hence, N (2α,G F,T) N (α,G) N (α,F,T). Finally, we put all the pieces together. By Theorem 8, the minimax value is bounded by 8T times the sequential Rademacher complexity of the class G F = {u( w,x ) u [ 1,1] [ 1,1] is 1-Lipschitz , w 2 1} since the squared loss is 4-Lipschitz on the space of possible values. The latter complexity is then bounded by TDT (G F) 32 T log N(δ,G F,T) dδ δ2 log(2e T)dδ . We therefore conclude that the value of the game for the supervised learning problem is bounded by O( T log3/2(T)). Appendix C. Exponentially Weighted Average (EWA) Algorithm on Countable Experts We consider here a version of the exponentially weighted experts algorithm for a countable (possibly infinite) number of experts and provide a bound on the expected regret of the randomized algorithm. The proof of the result closely follows the finite case (e.g., Cesa Bianchi and Lugosi, 2006, Theorem 2.2). This result is well known and we include it here for completeness, as it is needed in the proofs of Proposition 16 and Proposition 17. Suppose we are provided with countable experts E1,E2,..., where each expert can herself be thought of as a randomized/deterministic player strategy which, given history, produces an element of F at round t. Here we also assume that F [0,1]X . Denote by fi t the function output by expert i at round t given the history. The EWA algorithm we consider needs access to the countable set of experts and also needs an initial weighting on each expert p1,p2,... such that i pi = 1. Rakhlin, Sridharan and Tewari Algorithm 1 EWA (E1,E2,..., p1,p2,...) Initialize each w1 i pi for t = 1 to T do Pick randomly an expert i with probability wt i Play ft = ft i Receive xt Update for each i, wt+1 i = wt ie ηft i (xt) i wt ie ηft i (xt) end for Proposition 20 The exponentially weighted average forecaster (Algorithm 1) with η = T 1/2 enjoys the regret bound T t=1 E[ft(xt)] T t=1 ft i (xt) + T log (1/pi) for any i N. J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, pages 414 424. Omnipress, 2008. J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009. P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463 482, 2003. S. Ben-David, D. Pal, and S. Shalev-Shwartz. Agnostic online learning. In Proceedings of the 22th Annual Conference on Learning Theory, 2009. D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1 8, 1956a. D. Blackwell. Controlled random walks. In Proceedings of the International Congress of Mathematicians, 1954, volume 3, pages 336 338. North Holland, 1956b. V.I. Bogachev. Measure Theory, volume 2. Springer, 2007. ISBN 3540345132. J.M. Borwein. A very complicated proof of the minimax theorem. Minimax Theory and Its Applications, 1(1), 2014. J.M. Borwein and D Zhuang. On Fan s minimax theorem. Mathematical programming, 34 (2):232 234, 1986. Online Learning via Sequential Complexities N. Cesa-Bianchi and G. Lugosi. On prediction of individual sequences. Annals of Statistics, pages 1865 1895, 1999. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427 485, 1997. T. Cover. Behavior of sequential predictors of binary sequences. In Transactions of the Fourth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, 1965, pages 263 272. Publishing House of the Czechoslovak Academy of Sciences, 1967. T. M. Cover and A. Shenhar. Compound Bayes predictors for sequences with apparent Markov structure. IEEE Transactions on Systems, Man and Cybernetics, 7(6):421 424, 1977. L. Davisson. Universal noiseless coding. Information Theory, IEEE Transactions on, 19 (6):783 795, 1973. M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. Information Theory, IEEE Transactions on, 38(4):1258 1270, 1992. D. P. Foster and R. V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1):40 55, 1997. J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97 139, 1957. S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127 1150, 2000. S. M. Kakade and A. T. Kalai. From batch to transductive online learning. In Y. Weiss, B. Sch olkopf, and J.C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 611 618. MIT Press, 2006. A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291 307, 2005. A. T. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In Proceedings of the 22th Annual Conference on Learning Theory, 2009. A.N. Kolmogorov and V.M. Tikhomirov. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3 86, 1959. V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1):1 50, 2002. M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer-Verlag, New York, 1991. Rakhlin, Sridharan and Tewari N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285 318, 04 1988. N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212 261, 1994. A. Rakhlin and K. Sridharan. Statistical learning and sequential prediction, 2014. Available at http://stat.wharton.upenn.edu/~rakhlin/courses/stat928/stat928_notes.pdf. A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems 23, pages 1984 1992, 2010. A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Beyond regret. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of JMLR Workshop and Conference Proceedings, pages 559 594, 2011. A. Rakhlin, O. Shamir, and K. Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems 25, pages 2150 2158, 2012. A. Rakhlin, K. Sridharan, and A. Tewari. Sequential complexities and uniform laws of large numbers. Probability Theory and Related Fields, 2014. J. Rissanen. Universal coding, information, prediction, and estimation. Information Theory, IEEE Transactions on, 30(4):629 636, 1984. H. Robbins. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pages 131 149. University of California Press, 1950. R. E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, pages 322 330, 1997. S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107 194, 2011. S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Conference on Learning Theory, 2009. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes with Applications to Statistics. Springer-Verlag, New York, 1996. V. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153 173, 1998. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, pages 928 936, 2003. J. Ziv and A. Lempel. A universal algorithm for sequential data compression. Information Theory, IEEE Transactions on, 23(3):337 343, 1977.