Efficient Online Portfolio with Logarithmic Regret

Haipeng Luo
Department of Computer Science, University of Southern California
haipengl@usc.edu

Chen-Yu Wei
Department of Computer Science, University of Southern California
chenyu.wei@usc.edu

Kai Zheng
Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
Center for Data Science, Peking University, Beijing Institute of Big Data Research
zhengk92@pku.edu.cn

Abstract

We study the decades-old problem of online portfolio management and propose the first algorithm with logarithmic regret that is not based on Cover's Universal Portfolio algorithm and admits much faster implementation. Specifically, Universal Portfolio enjoys optimal regret $O(N\ln T)$ for $N$ financial instruments over $T$ rounds, but requires log-concave sampling and has a large polynomial running time. Our algorithm, on the other hand, ensures a slightly larger but still logarithmic regret of $O(N^2(\ln T)^4)$, and is based on the well-studied Online Mirror Descent framework with a novel regularizer that can be implemented via standard optimization methods in time $O(TN^{2.5})$ per round. The regret of all other existing works is either polynomial in $T$ or has a potentially unbounded factor such as the inverse of the smallest price relative.

1 Introduction

We consider the well-known online portfolio management problem [8], where a learner has to sequentially decide how to allocate her wealth over a set of $N$ financial instruments in order to maximize her return, importantly under no assumptions at all on how the market behaves. Specifically, for each trading period $t = 1, \ldots, T$, the learner first decides the proportion of her wealth to invest in each stock, and then by the end of the period observes the return of each stock and continues to invest with her total wealth. The goal of the learner is to maximize the ratio between her total wealth after $T$ rounds and the total wealth of the best constant-rebalanced portfolio (CRP), which always rebalances the wealth to ensure a fixed proportion of investment for each stock. Equivalently, the learner aims to minimize her regret, which is the negative logarithm of the aforementioned ratio.

The minimax optimal regret for this problem is $O(N\ln T)$, achieved by Cover's Universal Portfolio algorithm [8]. This algorithm requires sampling from a log-concave distribution, and all known efficient implementations have large polynomial (in $N$ and $T$) running time, such as $O(T^{14}N^4)$ [15].^1 Online Newton Step (ONS) [12], on the other hand, follows the well-studied framework of Online Mirror Descent (OMD) with a simple time-varying regularizer and admits much faster implementation via standard optimization methods. The regret of ONS is $O(GN\ln T)$, where $G$ is the largest gradient $\ell_\infty$-norm encountered over the $T$ rounds (formally defined in Section 1.1) and can be arbitrarily large, making the bound meaningless.

^1 Recent improvements on log-concave sampling such as [7, 16, 17] may lead to improved running time, but it is still a large polynomial.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Table 1: Comparisons of regret and running time of different algorithms. Note that $G$ is potentially unbounded. For running time, we assume the Interior Point Method is used to solve the involved optimization problems, and omit all logarithmic (in $N$ and $T$) factors.

| Algorithm | Regret | Time (per round) |
|---|---|---|
| Universal Portfolio [8, 15] | $N\ln T$ | $T^{14}N^4$ |
| ONS [12] | $GN\ln T$ | $N^{3.5}$ |
| FTRL [3] | $G^2N\ln(NT)$ | $TN^{2.5}$ |
| Exponentiated Gradient [14] | $G\sqrt{T\ln N}$ | $N$ |
| Online Gradient Descent [22] | $G\sqrt{T}$ | $N$ |
| Soft-Bayes [19] | $\sqrt{NT\ln N}$ | $N$ |
| ADA-BARRONS (this work) | $N^2(\ln T)^4$ | $TN^{2.5}$ |
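To make the CRP benchmark from the opening paragraph concrete, here is a small numerical sketch (ours, not from the paper); the price relatives below are made up purely for illustration.

```python
import numpy as np

# Toy illustration (not from the paper) of a constant-rebalanced portfolio.
# r[t, i] is the price relative of stock i in period t: one dollar held in
# stock i at the start of period t is worth r[t, i] dollars at its end.
r = np.array([[1.0, 2.0],
              [1.0, 0.5]] * 3)          # a volatile stock 2 over 6 periods

def crp_wealth(u, r):
    """Final wealth per initial dollar of the CRP with fixed weights u."""
    return np.prod(r @ u)               # product over t of <u, r_t>

print(crp_wealth(np.array([1.0, 0.0]), r))  # 1.0   : hold stock 1 only
print(crp_wealth(np.array([0.0, 1.0]), r))  # 1.0   : hold stock 2 only
print(crp_wealth(np.array([0.5, 0.5]), r))  # ~1.42 : rebalancing wins
```

This is the classic reason the best CRP is a strong benchmark: by rebalancing every period it can outperform the best single stock, and the learner's regret is the negative logarithm of the ratio between her final wealth and that of the best CRP in hindsight.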
A typical way to prevent unbounded gradients is to mix the output of ONS with a small amount of the uniform distribution, which after an optimal trade-off can at best lead to a regret bound of $O(N\sqrt{T\ln T})$. An earlier work [3] achieves a worse regret bound of $O(G^2N\ln(NT))$ via an efficient algorithm based on another well-known framework, Follow-the-Regularized-Leader (FTRL). There are also extremely efficient approaches with time complexity $O(N)$ or $O(N\ln N)$ per round, such as exponentiated gradient [14], online gradient descent [22], and Soft-Bayes from the recent work of [19]. The first two achieve regret of order $O(G\sqrt{T\ln N})$ and $O(G\sqrt{T})$ respectively,^2 while the last one achieves $O(\sqrt{NT\ln N})$ without the dependence on $G$. Despite being highly efficient, all of these approaches fail to achieve the optimal logarithmic dependence on $T$ for the regret.

^2 For online gradient descent, $G$ is the largest gradient $\ell_2$-norm.

As one can see, earlier works all exhibit a trade-off between regret and time complexity. A longstanding open question is how fast an algorithm with optimal regret can be. Specifically, are there algorithms with optimal regret and similar or even better time complexity compared to ONS? In this work, we make a significant step toward answering this question by proposing a simple algorithm with regret $O(N^2(\ln T)^4)$ and time complexity $O(TN^{2.5})$ per round. To the best of our knowledge, this is the first algorithm with logarithmic regret (and no dependence on $G$) that is not based on Cover's algorithm and admits fast implementation comparable to ONS and [3]. As a comparison, we show in Table 1 the regret and time complexity of existing works and ours, where for OMD/FTRL-type algorithms we do a naive calculation of the running time based on the Interior Point Method [18] (to solve the key optimization problems involved), despite the possibility of even faster implementations.

Our algorithm is parameter-free and deterministic. It follows the OMD framework with a novel regularizer that is a mixture of the one used in ONS and the so-called log-barrier (a.k.a. Burg entropy) [10, 2, 21].^3 Critically, our algorithm also relies on an increasing learning rate schedule for the log-barrier similar to recent works on bandit problems [2, 21], as well as another sophisticated adaptive tuning method for the learning rate of the ONS regularizer, which resembles the standard doubling trick but requires new analysis since monotonicity does not hold for our problem.

^3 A recent work [6] uses a mixture of the Shannon entropy and log-barrier as the regularizer for a different problem.

1.1 Notation and Setup

The online portfolio problem fits into the well-studied online learning framework (see for example [13]). Formally, the problem proceeds for $T$ rounds (for some $T > N$). On each round $t = 1, \ldots, T$, the learner first decides a distribution $x_t \in \Delta_N$, where $\Delta_N$ is the $(N-1)$-dimensional simplex. After that, the learner observes the price relative vector $r_t \in \mathbb{R}^N_+$, so that her total wealth changes by a factor of $\langle x_t, r_t\rangle$. Taking the negative logarithm, this corresponds to observing a loss function $f_t(x) = -\ln\langle x, r_t\rangle$ for $x \in \Delta_N$, chosen arbitrarily by an adversary. The regret of the learner against a CRP parameterized by $u \in \Delta_N$ is then defined as

$$\mathrm{Reg}(u) \triangleq \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u) = -\ln\frac{\prod_{t=1}^T\langle x_t, r_t\rangle}{\prod_{t=1}^T\langle u, r_t\rangle},$$

which is exactly the negative logarithm of the ratio of total wealth per dollar invested between the learner and the CRP $u$. Our goal is to minimize the regret against the best CRP, that is, to minimize $\max_{u\in\Delta_N}\mathrm{Reg}(u)$. This setup is also useful for other non-financial applications such as data compression [9, 19].

Note that the regret is invariant to the scaling of each $r_t$, and thus without loss of generality we assume $\max_{i\in[N]} r_{t,i} = 1$ for all $t$, where we use the notation $[n]$ to represent the set $\{1, \ldots, n\}$. It is now clear what the aforementioned largest gradient norm $G$ formally is:

$$G = \max_{t\in[T]}\|\nabla f_t(x_t)\|_\infty = \max_{t\in[T],\,i\in[N]}\frac{r_{t,i}}{\langle x_t, r_t\rangle} \le \min\Big\{\frac{1}{\min_{t,i} r_{t,i}},\ \frac{1}{\min_{t,i} x_{t,i}}\Big\},$$

which in general can be unbounded. To control its magnitude, previous works [3, 4, 12, 14] either explicitly force $x_{t,i}$ to be lower bounded, which leads to worse regret, or make the so-called no-junk-bonds assumption (that is, $\min_{t,i} r_{t,i}$ is not too small), which might make sense for the portfolio problem but not for other applications [19]. Our main technical contribution is to show how this term can be automatically canceled by a negative regret term obtained from the log-barrier regularizer with increasing learning rates.
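The following small numerical sketch (ours, not from the paper) illustrates how $G$ blows up in exactly the situation just described: a stock the learner has down-weighted suddenly becomes the only one paying off.

```python
import numpy as np

# Illustration (not from the paper): the gradient of f_t(x) = -ln<x, r_t>
# is -r_t / <x_t, r_t>, whose infinity-norm explodes when the learner puts
# almost no weight on the only stock with a large price relative.
x_t = np.array([0.999, 0.001])   # learner has nearly abandoned stock 2
r_t = np.array([0.001, 1.0])     # ...but only stock 2 pays off this round
grad = -r_t / np.dot(x_t, r_t)
print(np.linalg.norm(grad, np.inf))  # ~500.2, and it grows as x_t[1] -> 0
```

The no-junk-bonds assumption rules out exactly the tiny $r_{t,1}$ above, while the approach of this paper instead cancels such rounds against accumulated negative regret.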
2 Barrier-Regularized Online Newton Step

Recall that for a sequence of convex regularizers $\psi_t$, the outputs of Online Mirror Descent are defined by

$$x_{t+1} = \arg\min_{x\in\Delta_N}\ \langle x, \nabla_t\rangle + D_{\psi_t}(x, x_t),$$

where $\nabla_t$ is a shorthand for $\nabla f_t(x_t)$, $D_{\psi_t}(x, y) = \psi_t(x) - \psi_t(y) - \langle\nabla\psi_t(y), x - y\rangle$ is the Bregman divergence associated with $\psi_t$, and we start with $x_1$ being the uniform distribution. The intuition is that we would like $x_{t+1}$ to have small loss with respect to a linear approximation of $f_t$, and at the same time to be close to the previous decision $x_t$ to ensure stability of the algorithm.

Although not presented in this form originally, Online Newton Step [12] is an instance of OMD with $\psi_t(x) = \frac{\beta}{2}\|x\|^2_{A_t} = \frac{\beta}{2}x^\top A_t x$, where $A_t = A_{t-1} + \nabla_t\nabla_t^\top$ (for some $A_0$) is the gradient covariance matrix and $\beta$ is a parameter. The analysis of [12] shows that the regret of ONS for the portfolio problem is $\mathrm{Reg}(u) = O\big(\frac{N\ln T}{\beta}\big)$ as long as $\beta \le \min\big\{\frac{1}{2},\ \min_{s\in[T]}\frac{1}{8|(u - x_s)^\top\nabla_s|}\big\}$. Even assuming oracle tuning, by Hölder's inequality this gives the $O(GN\ln T)$ bound mentioned earlier.

To get rid of the dependence on $G$, we observe the following. Since $\nabla_t = -\frac{r_t}{\langle x_t, r_t\rangle}$, its $\ell_\infty$-norm is large only when there is a stock with high reward $r_{t,i}$ while the learner puts a small weight $x_{t,i}$ on it. However, the reason that the weight $x_{t,i}$ is small is that the learner found it performing poorly prior to round $t$, which means that the learner had better choices and actually should have performed better than stock $i$ previously (that is, negative regret against stock $i$). Now as stock $i$ becomes good at time $t$ and potentially in the future, as long as the learner can pick up this change quickly, the overall regret should not be too large. Similar observations were made in previous work [2, 21] for different problems in the bandit setting, where the authors introduced the log-barrier regularizer with increasing learning rates to explicitly ensure a large negative regret term based on the intuition above.

This motivates us to add an extra log-barrier regularizer to ONS for our problem. Specifically, we define our regularizer to be the following mixture:

$$\psi_t(x) \triangleq \frac{\beta}{2}\|x\|^2_{A_t} + \sum_{i=1}^N\frac{1}{\eta_{t,i}}\ln\frac{1}{x_i},$$

where $\eta_{t,i}$ is an individual and time-varying learning rate. Different from previous work, we propose a more adaptive tuning schedule for these learning rates based on Eq. (1) (instead of a doubling schedule [2, 21]), but the key idea is the same: increase the learning rate for a stock when its weight is small, so that the algorithm learns faster in case the stock becomes better in the future. Another modification is that we force the decision set to be $\Delta'_N = \{x \in \Delta_N : x_i \ge \frac{1}{NT},\ \forall i\}$ instead of $\Delta_N$, to ensure an explicit lower bound on $x_{t,i}$. We call this algorithm BARrier-Regularized Online Newton Step (BARRONS); see Algorithm 1.

Algorithm 1: BARrier-Regularized Online Newton Step (BARRONS)
1 Input: $0 < \beta \le \frac{1}{2}$ and $\eta > 0$.
2 Define: $\Delta'_N = \{x \in \Delta_N : x_i \ge \frac{1}{NT},\ \forall i\}$.
3 Initialize: $x_1 = \frac{1}{N}\mathbf{1}$ and $A_0 = \frac{1}{N}I_N$, where $\mathbf{1}$ is the all-one vector and $I_N$ is the $N\times N$ identity matrix.
4 for $t = 1, 2, \ldots$ do
5 &nbsp;&nbsp;Predict $x_t$ and observe the loss function $f_t(x) = -\ln\langle x, r_t\rangle$.
6 &nbsp;&nbsp;Make updates
$$A_t = A_{t-1} + \nabla_t\nabla_t^\top, \qquad \eta_{t,i} = \eta\exp\Big(\log_T\max_{s\in[t]}\frac{1}{Nx_{s,i}}\Big), \quad (1)$$
$$x_{t+1} = \arg\min_{x\in\Delta'_N}\ \langle x, \nabla_t\rangle + D_{\psi_t}(x, x_t), \quad (2)$$
where $\nabla_t = \nabla f_t(x_t)$ and $\psi_t(x) = \frac{\beta}{2}\|x\|^2_{A_t} + \sum_{i=1}^N\frac{1}{\eta_{t,i}}\ln\frac{1}{x_i}$.
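Below is a minimal sketch (ours, not the authors' implementation) of one round of Algorithm 1, solving the update (2) with a generic constrained solver rather than the interior-point method assumed in the running-time analysis; all function and variable names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def eta_schedule(eta, running_min_x, N, T):
    """Learning rates of Eq. (1): eta * exp(log_T max_s 1/(N x_{s,i}))."""
    return eta * np.exp(np.log(1.0 / (N * running_min_x)) / np.log(T))

def barrons_step(x_t, r_t, A_prev, eta_vec, beta, N, T):
    """One round of BARRONS: returns (x_{t+1}, A_t)."""
    grad = -r_t / np.dot(x_t, r_t)      # nabla f_t(x_t), f_t(x) = -ln<x, r_t>
    A_t = A_prev + np.outer(grad, grad)

    def psi(x):                          # mixed regularizer psi_t
        return 0.5 * beta * x @ A_t @ x - np.sum(np.log(x) / eta_vec)

    def grad_psi(x):
        return beta * (A_t @ x) - 1.0 / (eta_vec * x)

    def objective(x):                    # <x, grad> + D_{psi_t}(x, x_t)
        return x @ grad + psi(x) - psi(x_t) - grad_psi(x_t) @ (x - x_t)

    res = minimize(objective, x_t, method="SLSQP",
                   bounds=[(1.0 / (N * T), 1.0)] * N,  # decision set Delta'_N
                   constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}])
    return res.x, A_t
```

Here `running_min_x` is the coordinate-wise minimum of $x_{1,i}, \ldots, x_{t,i}$, which the caller maintains; the interior-point solver assumed by the paper would replace `minimize` in the running-time accounting below.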
Under the same condition on $\beta$ as for ONS, we prove the following key theorem for BARRONS, which highlights the important negative regret term obtained from the extra log-barrier regularizer. Note that it is enough to provide a regret bound only against smoothed CRPs $u \in \Delta'_N$, since one can verify that the total loss of any CRP $u \in \Delta_N$ can be approximated by a smoothed CRP in $\Delta'_N$ up to an additive constant of 2 (Lemma 10 in Appendix B).

Theorem 1. For any $u \in \Delta'_N$, if $\beta \le \alpha_T(u)$ for $\alpha_t(u) \triangleq \min\big\{\frac{1}{2},\ \min_{s\in[t]}\frac{1}{8|(u-x_s)^\top\nabla_s|}\big\}$, then BARRONS ensures

$$\mathrm{Reg}(u) \le O\Big(\frac{N\ln T}{\eta} + \frac{N\ln T}{\beta}\Big) - \frac{1}{8(\ln T)\eta}\sum_{i=1}^N\max_{t\in[T]}\frac{u_i}{x_{t,i}}. \quad (3)$$

The second term of Eq. (3) comes from ONS, while the rest comes from the log-barrier. To see why the negative term is useful, for a moment assume that we were able to pick $\beta$ such that $\frac{1}{2}\alpha_T(u^\star) \le \beta \le \alpha_T(u^\star)$, where $u^\star$ is the best (smoothed) CRP. Then by setting $\eta = \frac{1}{1024N(\ln T)^2}$, the regret against $u^\star$ can be upper bounded by

$$\begin{aligned}
&O\Big(\frac{N\ln T}{\alpha_T(u^\star)}\Big) + O\big(N^2(\ln T)^3\big) - \frac{1}{8(\ln T)\eta}\sum_{i=1}^N\max_{t\in[T]}\frac{u^\star_i}{x_{t,i}}\\
&\le 128N(\ln T)\Big(\max_{t\in[T],\,i\in[N]}\frac{u^\star_i}{x_{t,i}} + 1\Big) + 32N\ln T + O\big(N^2(\ln T)^3\big) - 128N(\ln T)\sum_{i=1}^N\max_{t\in[T]}\frac{u^\star_i}{x_{t,i}}\\
&\le O\big(N^2(\ln T)^3\big),
\end{aligned}$$

which completely eliminates the dependence on the largest gradient norm $G$!

The problem is, of course, that it is not at all clear how to tune $\beta$ in this way ahead of time. On a closer look, it is in fact not even clear whether such a $\beta$ exists, since $\alpha_T(u^\star)$ depends on the sequence $x_1, \ldots, x_T$ and thus also on $\beta$ itself (see Appendix A for more discussion). Assuming its existence, a natural idea would be to run many copies of BARRONS with different $\beta$ and to choose among them adaptively via another online learning algorithm such as Hedge [11]. We are unable to analyze this method due to some technical challenges (discussed in Appendix A), let alone the fact that it leads to much higher time complexity, making the algorithm impractical. In the next section, however, we completely resolve this issue via an adaptive tuning scheme for $\beta$.

3 ADA-BARRONS

Our main idea to resolve the parameter tuning issue is based on a more involved doubling trick. As discussed, we would like to set $\beta$ to be roughly $\alpha_T(u^\star_T)$, where $u^\star_t = \arg\min_{u\in\Delta_N}\sum_{s\le t} f_s(u)$. Our variant is shown as Algorithm 2 below.

Algorithm 2: ADA-BARRONS
1 Initialize: $\beta = \frac{1}{2}$, $\eta = \frac{1}{2048N(\ln T)^2}$, $\gamma = \frac{1}{25}$.
2 Run BARRONS (Algorithm 1) with parameters $\beta$ and $\eta$, where after each round $t$, if the following holds:
$$\beta > \alpha_t(u_t), \quad (4)$$
with $\alpha_t$ defined in Theorem 1 and
$$u_t = \arg\min_{u\in\Delta_N}\ \sum_{s=1}^t f_s(u) + \frac{1}{\gamma}\sum_{i=1}^N\ln\frac{1}{u_i}, \quad (5)$$
then set $\beta \leftarrow \frac{\beta}{2}$ and rerun BARRONS from Line 2 with the time index reset to 1.
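A compact sketch (ours, not the authors' code) of the outer loop of Algorithm 2 follows; `barrons` stands in for Algorithm 1 (yielding the play $x_t$ and gradient per round) and `regularized_leader` for Eq. (5), both hypothetical helpers, e.g. built from the earlier sketches.

```python
import numpy as np

def alpha(u, xs, grads):
    """alpha_t(u) = min{1/2, min_s 1/(8 |(u - x_s)^T grad_s|)} (Theorem 1)."""
    return min(0.5, min(1.0 / (8.0 * abs((u - x) @ g) + 1e-12)  # guard /0
                        for x, g in zip(xs, grads)))

def ada_barrons(T, N, barrons, regularized_leader):
    beta = 0.5
    eta = 1.0 / (2048 * N * np.log(T) ** 2)
    while True:
        xs, grads, restarted = [], [], False
        for x_t, g_t in barrons(beta, eta, T):
            xs.append(x_t)
            grads.append(g_t)
            u_t = regularized_leader(len(xs))
            if beta > alpha(u_t, xs, grads):   # condition (4)
                beta /= 2                      # halve beta and restart;
                restarted = True               # the time index resets to 1
                break
        if not restarted:
            return                             # reached the horizon T
```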
A standard doubling trick would suggest halving $\beta$ whenever it is larger than $\alpha_t(u^\star_t)$ and then restarting the algorithm. However, since $\alpha_t(u^\star_t)$ is not monotone in $t$, the standard analysis of the doubling trick does not work. Fortunately, due to the special structure of our problem, we are able to analyze a slight variant of this proposal, where we halve the value of $\beta$ whenever it is larger than $\alpha_t(u_t)$ for the regularized leader $u_t$ (defined in Eq. (5)) instead of the actual leader $u^\star_t$. The regularization used to compute $u_t$ is again the log-barrier, but here its purpose is simply to ensure the stability of $u_t$, as discussed later. In fact, $u_t$ is exactly the prediction of the FTRL approach of [3], up to a different value of the parameter $\gamma$; here we only use $u_t$ to assist the tuning of $\beta$. We call the final algorithm ADA-BARRONS (see Algorithm 2). Note that for notational simplicity, we reset the time index back to 1 at the beginning of each rerun; that is, the algorithm forgets all the previous data.

To see why this works, suppose condition (4) holds at time $t$ and triggers the restart. Then we know $\alpha_t(u_t) < \beta \le \alpha_{t-1}(u_{t-1})$. On one hand, this implies that the condition of Theorem 1 holds at time $t-1$ for $u_{t-1}$, so the regret bound (3) holds for $u_{t-1}$; on the other hand, this also implies that the term $\frac{N\ln T}{\beta}$ in Eq. (3) is bounded by $\frac{N\ln T}{\alpha_t(u_t)}$, which further admits an upper bound in terms of $\max_{s\in[t],\,i\in[N]}\frac{u_{t,i}}{x_{s,i}}$ by the same calculation shown after Theorem 1. Therefore, if we can show $\max_{s\in[t],\,i\in[N]}\frac{u_{t,i}}{x_{s,i}} = O\big(\max_{s\in[t-1],\,i\in[N]}\frac{u_{t-1,i}}{x_{s,i}}\big)$, then the same cancellation will happen, which leads to small regret against $u_{t-1}$ for this period. It is also not hard to see that $u_{t-1}$ has total loss similar to that of the actual best CRP $u^\star_{t-1}$, leading to the desired regret bound overall. Indeed, we show in Appendix B that both $x_t$ and $u_t$ enjoy a certain kind of stability, which then implies the following lemma.

Lemma 2. If condition (4) holds at time $t$, then $\max_{s\in[t-1],\,i\in[N]}\frac{u_{t-1,i}}{x_{s,i}} \ge \frac{1}{2}\max_{s\in[t],\,i\in[N]}\frac{u_{t,i}}{x_{s,i}}$.

Call the period between two restarts an epoch, and use the notation $\mathrm{epoch}(\beta)$ to indicate the epoch that runs with parameter $\beta$. We then prove the following key lemma based on the discussions above.

Lemma 3. For any $u \in \Delta'_N$, if $\mathrm{epoch}(\beta)$ is not the last epoch, then we have
$$\sum_{s\in\mathrm{epoch}(\beta)}\big(f_s(x_s) - f_s(u)\big) \le O\big(N^2(\ln T)^3\big) - \frac{8N\ln T}{\beta};$$
otherwise,
$$\sum_{s\in\mathrm{epoch}(\beta)}\big(f_s(x_s) - f_s(u)\big) \le O\big(N^2(\ln T)^3\big) + \frac{8N\ln T}{\beta}.$$

With this key lemma, we finally prove the claimed regret bound of ADA-BARRONS.

Theorem 4. ADA-BARRONS ensures $\mathrm{Reg}(u) \le O\big(N^2(\ln T)^4\big)$ for any $u \in \Delta_N$.

Proof. Again, by Lemma 10 it suffices to consider $u \in \Delta'_N$. Let the number of epochs be $B$, and note that the value of $\beta$ in epoch $b$ is $2^{-b}$. When $B = 1$, the bound holds trivially by Lemma 3. Otherwise, the regret is upper bounded by
$$\sum_{b=1}^{B-1}\Big(O\big(N^2(\ln T)^3\big) - 8\cdot2^bN\ln T\Big) + O\big(N^2(\ln T)^3\big) + 8\cdot2^BN\ln T = B\cdot O\big(N^2(\ln T)^3\big) + 16N\ln T.$$
Since for any $t\in[T]$ and $u\in\Delta'_N$ we have $\frac{1}{\alpha_t(u)} \le O\big(\max_{s\in[t],\,i}\frac{u_i}{x_{s,i}}\big) \le O(NT)$, which means $\alpha_t(u) \ge \Omega\big(\frac{1}{NT}\big)$, condition (4) cannot hold after $O(\ln(NT))$ epochs. Therefore $B = O(\ln(NT)) = O(\ln T)$ (since $T > N$), and the regret bound follows. ∎

Computational complexity. It is clear that the computational bottleneck of our algorithm is solving the optimization problems defined by Eq. (2) and Eq. (5). Suppose we use the Interior Point Method to solve these two problems. It takes time $O(M\sqrt{N})$ (up to logarithmic factors in the accuracy $\frac{1}{\epsilon}$) to reach accuracy $\epsilon$, where $M$ is the time complexity of computing the gradient and the inverse Hessian of the objective [5], which in our case is $O(N^3)$ for solving $x_t$ and $O(TN^2 + N^3)$ for solving $u_t$. As $T > N$, the complexity per round is therefore $O(TN^{2.5})$, ignoring logarithmic factors.
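For concreteness, here is a sketch (ours) of how Eq. (5) could be solved with a generic off-the-shelf solver in place of an interior-point method; names and tolerances are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def regularized_leader(R, gamma):
    """Solve Eq. (5): argmin_u sum_s -ln<u, r_s> + (1/gamma) sum_i ln(1/u_i).

    R is the t x N matrix stacking the observed price relatives r_1..r_t.
    """
    t, N = R.shape
    def objective(u):
        return -np.sum(np.log(R @ u)) - np.sum(np.log(u)) / gamma
    res = minimize(objective, np.full(N, 1.0 / N), method="SLSQP",
                   bounds=[(1e-12, 1.0)] * N,   # keep log arguments positive
                   constraints=[{"type": "eq", "fun": lambda u: u.sum() - 1.0}])
    return res.x
```

Each evaluation of this objective costs $O(tN)$, and its gradient and Hessian inverse cost $O(tN^2)$ and $O(N^3)$ respectively, matching the count $M = O(TN^2 + N^3)$ above.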
We note that this is only a pessimistic estimate, and faster implementations are highly possible, especially for solving $u_t$ (given $u_{t-1}$), in light of the efficient implementation discussed in [1] for similar problems.

4 Detailed Analysis

In this section we provide the key proofs of our results.

Analysis of BARRONS. The proof of Theorem 1 is a direct combination of the following three lemmas. The first follows from the standard OMD analysis together with the analysis of [12], and its proof is deferred to Appendix B.

Lemma 5. Under the condition of Theorem 1, BARRONS ensures for any $u \in \Delta'_N$
$$\mathrm{Reg}(u) \le \sum_{t=1}^T\Big(\langle\nabla_t, x_t - x_{t+1}\rangle + D_{\psi_t}(u, x_t) - D_{\psi_t}(u, x_{t+1}) - \frac{\beta}{2}\langle\nabla_t, x_t - u\rangle^2\Big).$$

Lemma 6. BARRONS with parameters $\beta \le \frac{1}{2}$ and $\eta \le 1$ guarantees for any $u \in \Delta'_N$
$$\sum_{t=1}^T\big(D_{\psi_t}(u, x_t) - D_{\psi_t}(u, x_{t+1})\big) - \frac{\beta}{2}\sum_{t=1}^T\langle\nabla_t, x_t - u\rangle^2 \le O\Big(\frac{N\ln T}{\eta}\Big) - \frac{1}{8(\ln T)\eta}\sum_{i=1}^N\max_{t\in[T]}\frac{u_i}{x_{t,i}}.$$

Proof. Define $\phi_t(x) = \frac{\beta}{2}\|x\|^2_{A_t}$ and $\varphi_t(x) = \sum_{i=1}^N\frac{1}{\eta_{t,i}}\ln\frac{1}{x_i}$, so that $\psi_t = \phi_t + \varphi_t$ and $D_{\psi_t} = D_{\phi_t} + D_{\varphi_t}$. Note that $D_{\phi_t}(x, y) = \frac{\beta}{2}\|x - y\|^2_{A_t}$ and $D_{\varphi_t}(x, y) = \sum_{i=1}^N\frac{1}{\eta_{t,i}}h\big(\frac{x_i}{y_i}\big)$, where $h(z) = z - 1 - \ln z$. For notational simplicity, we also define $\eta_{0,i} = \eta_{1,i}$ for all $i$ (and $\psi_0$ accordingly with $A_0$). Now we have
$$\begin{aligned}
\sum_{t=1}^T\big(D_{\psi_t}(u, x_t) - D_{\psi_t}(u, x_{t+1})\big) &\le D_{\psi_0}(u, x_1) + \sum_{t=2}^T\big(D_{\psi_t}(u, x_t) - D_{\psi_{t-1}}(u, x_t)\big) + \frac{\beta}{2}\langle\nabla_1, u - x_1\rangle^2\\
&\le D_{\psi_0}(u, x_1) + \frac{\beta}{2}\sum_{t=1}^T\langle\nabla_t, u - x_t\rangle^2 + \sum_{t=2}^T\sum_{i=1}^N\Big(\frac{1}{\eta_{t,i}} - \frac{1}{\eta_{t-1,i}}\Big)h\Big(\frac{u_i}{x_{t,i}}\Big)\\
&= O\Big(\beta N + \frac{N\ln T}{\eta}\Big) + \frac{\beta}{2}\sum_{t=1}^T\langle\nabla_t, u - x_t\rangle^2 + \sum_{t=2}^T\sum_{i=1}^N\Big(\frac{1}{\eta_{t,i}} - \frac{1}{\eta_{t-1,i}}\Big)h\Big(\frac{u_i}{x_{t,i}}\Big), \quad (6)
\end{aligned}$$
where we use $\|u - x_t\|^2_{A_t} - \|u - x_t\|^2_{A_{t-1}} = \langle\nabla_t, u - x_t\rangle^2$. The quadratic terms are canceled by the $-\frac{\beta}{2}\sum_t\langle\nabla_t, x_t - u\rangle^2$ term on the left-hand side, so it remains to deal with $\sum_{t=2}^T\sum_{i=1}^N\big(\frac{1}{\eta_{t,i}} - \frac{1}{\eta_{t-1,i}}\big)h\big(\frac{u_i}{x_{t,i}}\big)$.

Fix $i$ and let $s_1 < s_2 < \cdots < s_M \in [2, T]$ be the rounds $t$ where $\eta_{t,i} \ne \eta_{t-1,i}$. Define $s_0 = 1$, and let $\eta^{(m)} = \eta_{s_m,i}$ and $x^{(m)} = x_{s_m,i}$ for notational convenience; note that $x^{(m)}$ is decreasing in $m$ and $x^{(M)} = \min_t x_{t,i}$. By definition, $\eta^{(m)} = \eta\exp\big(\log_T\frac{1}{Nx^{(m)}}\big) \le \eta\exp(\log_T T) = \eta e$. We thus have
$$\sum_{t=2}^T\Big(\frac{1}{\eta_{t,i}} - \frac{1}{\eta_{t-1,i}}\Big)h\Big(\frac{u_i}{x_{t,i}}\Big) = -\sum_{m=1}^M\Big(\frac{1}{\eta^{(m-1)}} - \frac{1}{\eta^{(m)}}\Big)h\Big(\frac{u_i}{x^{(m)}}\Big) \le -\sum_{m=1}^M\frac{\log_2\frac{x^{(m-1)}}{x^{(m)}}}{4(\ln T)\eta}\,h\Big(\frac{u_i}{x^{(m)}}\Big),$$
because
$$\frac{1}{\eta^{(m-1)}} - \frac{1}{\eta^{(m)}} = \frac{1}{\eta^{(m)}}\Big(\exp\Big(\log_T\frac{x^{(m-1)}}{x^{(m)}}\Big) - 1\Big) \ge \frac{1}{e\eta}\log_T\frac{x^{(m-1)}}{x^{(m)}} \ge \frac{1}{4(\ln T)\eta}\log_2\frac{x^{(m-1)}}{x^{(m)}}.$$

We first consider the case $x^{(M)} \le \min\big\{\frac{1}{2N}, \frac{u_i}{2}\big\}$. Because $x^{(M)} \le \frac{1}{2N} = \frac{x^{(0)}}{2}$ and $x^{(m)}$ is decreasing in $m$, there must exist an $m^* \in \{1, 2, \ldots, M\}$ such that $x^{(m^*-1)} \ge 2x^{(M)}$ and $x^{(m^*)} \le 2x^{(M)}$. The last expression can thus be further bounded by
$$-\sum_{m=m^*}^M\frac{\log_2\frac{x^{(m-1)}}{x^{(m)}}}{4(\ln T)\eta}h\Big(\frac{u_i}{x^{(m)}}\Big) \le -\frac{\log_2\frac{x^{(m^*-1)}}{x^{(M)}}}{4(\ln T)\eta}h\Big(\frac{u_i}{2x^{(M)}}\Big) \le -\frac{1}{4(\ln T)\eta}h\Big(\frac{u_i}{2x^{(M)}}\Big) \le -\frac{1}{8(\ln T)\eta}\max_{t\in[T]}\frac{u_i}{x_{t,i}} + O\Big(\frac{1 + Nu_i}{\eta}\Big),$$
where we drop the (nonnegative) terms with $m < m^*$; the first inequality uses the fact that for $m \ge m^*$ we have $\frac{u_i}{x^{(m)}} \ge \frac{u_i}{2x^{(M)}} \ge 1$, and that $h(y)$ is positive and increasing for $y \ge 1$; the second uses $\frac{x^{(m^*-1)}}{x^{(M)}} \ge 2$; and the last expands $h\big(\frac{u_i}{2x^{(M)}}\big) \ge \frac{u_i}{2x^{(M)}} - 1 - \ln(NTu_i)$ together with $\max_{t\in[T]}\frac{u_i}{x_{t,i}} = \frac{u_i}{x^{(M)}}$.

On the other hand, if $x^{(M)} \ge \frac{1}{2N}$ or $x^{(M)} \ge \frac{u_i}{2}$, we have $\max_{t\in[T]}\frac{u_i}{x_{t,i}} = \frac{u_i}{x^{(M)}} \le 2Nu_i + 2$, and thus
$$\sum_{t=2}^T\Big(\frac{1}{\eta_{t,i}} - \frac{1}{\eta_{t-1,i}}\Big)h\Big(\frac{u_i}{x_{t,i}}\Big) \le 0 \le -\frac{1}{8(\ln T)\eta}\max_{t\in[T]}\frac{u_i}{x_{t,i}} + \frac{2Nu_i + 2}{8(\ln T)\eta}.$$

Considering both cases and Eq. (6), and summing over $i$ (using $\sum_i u_i = 1$), we get
$$\sum_{t=1}^T\big(D_{\psi_t}(u, x_t) - D_{\psi_t}(u, x_{t+1})\big) - \frac{\beta}{2}\sum_{t=1}^T\langle\nabla_t, x_t - u\rangle^2 \le O\Big(\frac{N\ln T}{\eta}\Big) - \frac{1}{8(\ln T)\eta}\sum_{i=1}^N\max_{t\in[T]}\frac{u_i}{x_{t,i}},$$
finishing the proof. ∎

Lemma 7. BARRONS guarantees $\sum_{t=1}^T\langle\nabla_t, x_t - x_{t+1}\rangle \le \frac{8N\ln T}{\beta}$.

Proof. Define $F_t(x) \triangleq \langle x, \nabla_t\rangle + D_{\psi_t}(x, x_t)$. Using Taylor's expansion and the first-order optimality of $x_{t+1}$, we have
$$F_t(x_t) - F_t(x_{t+1}) = \nabla F_t(x_{t+1})^\top(x_t - x_{t+1}) + \frac{1}{2}(x_t - x_{t+1})^\top\nabla^2F_t(\xi_t)(x_t - x_{t+1}) \ge \frac{1}{2}\|x_t - x_{t+1}\|^2_{\nabla^2F_t(\xi_t)},$$
where $\xi_t$ is some point on the line segment joining $x_t$ and $x_{t+1}$. On the other hand, by the definition of $F_t$, the nonnegativity of the Bregman divergence, and Hölder's inequality, we have
$$F_t(x_t) - F_t(x_{t+1}) = \langle x_t - x_{t+1}, \nabla_t\rangle - D_{\psi_t}(x_{t+1}, x_t) \le \|x_t - x_{t+1}\|_{\nabla^2F_t(\xi_t)}\,\|\nabla_t\|_{\nabla^2F_t(\xi_t)^{-1}}.$$
Combining the above two inequalities, we get $\|x_t - x_{t+1}\|_{\nabla^2F_t(\xi_t)} \le 2\|\nabla_t\|_{\nabla^2F_t(\xi_t)^{-1}}$, and thus
$$\langle\nabla_t, x_t - x_{t+1}\rangle \le \|\nabla_t\|_{\nabla^2F_t(\xi_t)^{-1}}\|x_t - x_{t+1}\|_{\nabla^2F_t(\xi_t)} \le 2\|\nabla_t\|^2_{\nabla^2F_t(\xi_t)^{-1}} = 2\nabla_t^\top\big(\beta A_t + \nabla^2\varphi_t(\xi_t)\big)^{-1}\nabla_t \le \frac{2}{\beta}\nabla_t^\top A_t^{-1}\nabla_t,$$
where $\varphi_t$ is the log-barrier regularizer defined in the proof of Lemma 6 (whose Hessian is clearly positive semi-definite). Using Lemma 11 in [12], we continue with
$$\sum_{t=1}^T\nabla_t^\top A_t^{-1}\nabla_t \le \ln\frac{|A_T|}{|A_0|} \le 2N\ln(N + NT^3) = O(N\ln T),$$
where the second inequality uses the fact that $\ln|A_0| = -N\ln N$ and, by the AM-GM inequality, $\ln|A_T| \le N\ln\frac{\mathrm{Tr}(A_T)}{N} \le N\ln(N + NT^3)$, since $\|\nabla_t\|_2^2 \le (NT)^2\big(\sum_i r_{t,i}^2\big)\big/\big(\sum_i r_{t,i}\big)^2 \le N^2T^2$ over $\Delta'_N$. This finishes the proof. ∎
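The matrix telescoping inequality from Lemma 11 of [12] used in the last display is easy to sanity-check numerically; the following short sketch (ours) does so for random gradients.

```python
import numpy as np

# Numerical sanity check (ours) of the bound from Lemma 11 of [12]:
# for A_t = A_{t-1} + g_t g_t^T,  sum_t g_t^T A_t^{-1} g_t <= ln(|A_T|/|A_0|).
rng = np.random.default_rng(0)
N, T = 5, 200
A = np.eye(N) / N                       # A_0 = (1/N) I_N as in Algorithm 1
logdet0 = np.linalg.slogdet(A)[1]
lhs = 0.0
for _ in range(T):
    g = rng.normal(size=N)
    A += np.outer(g, g)
    lhs += g @ np.linalg.solve(A, g)
print(lhs, "<=", np.linalg.slogdet(A)[1] - logdet0)
```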
Analysis of ADA-BARRONS. To prove Lemma 2, we make use of the following stability lemmas, whose proofs are deferred to Appendix B.

Lemma 8. In ADA-BARRONS, if $\gamma \le \frac{1}{25}$, then $\frac{1}{2} \le \frac{u_{t,i}}{u_{t-1,i}} \le 2$ for all $t$ and $i$.

Lemma 9. In ADA-BARRONS, if $\eta \le \frac{1}{300}$, then $1 - 3\eta \le \frac{x_{t,i}}{x_{t-1,i}} \le 1 + 3\eta \le 2$ for all $t$ and $i$.

Proof of Lemma 2. Denote $a_t \triangleq \max_{s\in[t],\,i\in[N]}\frac{u_{t,i}}{x_{s,i}}$, and suppose $a_t$ attains its max at $s = s^*$ and $i = i^*$ (i.e., $a_t = \frac{u_{t,i^*}}{x_{s^*,i^*}}$). When $s^* \le t-1$, we have by Lemma 8
$$a_{t-1} \ge \frac{u_{t-1,i^*}}{x_{s^*,i^*}} = \frac{u_{t-1,i^*}}{u_{t,i^*}}\cdot\frac{u_{t,i^*}}{x_{s^*,i^*}} \ge \frac{a_t}{2};$$
when $s^* = t$, we have by Lemmas 8 and 9
$$a_{t-1} \ge \frac{u_{t-1,i^*}}{x_{t-1,i^*}} = \frac{u_{t-1,i^*}}{u_{t,i^*}}\cdot\frac{x_{t,i^*}}{x_{t-1,i^*}}\cdot\frac{u_{t,i^*}}{x_{t,i^*}} \ge \frac{a_t}{2}.$$
This concludes the proof. ∎

Proof of Lemma 3. Define $\Gamma_t(u) = \sum_{s=1}^t f_s(x_s) - \sum_{s=1}^t f_s(u) - \frac{1}{\gamma}\sum_{i=1}^N\ln\frac{1}{u_i}$. Suppose condition (4) holds at some time $t$ at the end of $\mathrm{epoch}(\beta)$ and causes the algorithm to restart. Then we know that $\beta \le \alpha_{t-1}(u_{t-1})$ and $\beta > \alpha_t(u_t)$. The first condition guarantees that Eq. (3) holds for $u_{t-1}$ at time $t-1$. Also note that $u_{t-1}$ is the maximizer of $\Gamma_{t-1}$. Together they imply, for any $u \in \Delta'_N$,
$$\Gamma_{t-1}(u) \le \Gamma_{t-1}(u_{t-1}) \le \sum_{s=1}^{t-1}f_s(x_s) - \sum_{s=1}^{t-1}f_s(u_{t-1}) \le O\Big(\frac{N\ln T}{\beta} + N^2(\ln T)^3\Big) - \frac{a_{t-1}}{8(\ln T)\eta}, \quad (7)$$
where we recall the notation $a_t \triangleq \max_{s\in[t],\,i\in[N]}\frac{u_{t,i}}{x_{s,i}}$. The second condition implies
$$\frac{1}{\beta} < \frac{1}{\alpha_t(u_t)} = \max\Big\{2,\ \max_{s\in[t]}8\big|\nabla_s^\top(u_t - x_s)\big|\Big\} \le \max\Big\{2,\ \max_{s\in[t]}\frac{8\langle u_t, r_s\rangle}{\langle x_s, r_s\rangle} + 8\Big\} \le 8\max_{s\in[t],\,i\in[N]}\frac{u_{t,i}}{x_{s,i}} + 8 = 8a_t + 8 \le 16a_{t-1} + 8, \quad (8)$$
where we apply Lemma 2 in the last step. Further combining this with Eq. (7), and noting that $f_t(x) - f_t(u) \le \max_i\ln\frac{u_i}{x_i} \le \ln(NT)$ for any $x \in \Delta'_N$, we have for any $u \in \Delta'_N$,
$$\begin{aligned}
\sum_{s=1}^t f_s(x_s) - \sum_{s=1}^t f_s(u) &\le \ln(NT) + \sum_{s=1}^{t-1}f_s(x_s) - \sum_{s=1}^{t-1}f_s(u)\\
&\le \ln(NT) + \Gamma_{t-1}(u) + \frac{N\ln(NT)}{\gamma}\\
&\le O\Big(\frac{N\ln T}{\beta} + N^2(\ln T)^3\Big) - \frac{a_{t-1}}{8(\ln T)\eta} \qquad\text{(by (7))}\\
&\le O\big(N^2(\ln T)^3\big) - \frac{8N\ln T}{\beta}, \qquad\text{(by (8) and the choice of $\eta$)}
\end{aligned}$$
For the last epoch, we can apply Theorem 1 over the entire epoch and simply discard the negative term to obtain the claimed bound. ∎

5 Conclusions and Open Problems

We have shown that our new algorithm ADA-BARRONS achieves logarithmic regret of $O(N^2(\ln T)^4)$ for online portfolio management, with much faster running time compared to Universal Portfolio, the only previous algorithm with truly logarithmic regret. A natural open problem is whether it is possible to further improve either the regret (from $N^2$ to $N$) or the computational efficiency without hurting the other. It is conjectured in [20] that FTRL with log-barrier [3] (i.e., our $u_t$'s) might also achieve logarithmic regret without dependence on $G$. On the pessimistic side, it is also conjectured that it might be impossible to achieve the optimal regret with $O(N)$ computation per round [19].

Acknowledgements. The authors would like to thank Tim van Erven for introducing the problem, and Tim van Erven, Dirk van der Hoeven, and Wouter Koolen for helpful discussions throughout the project, especially on the FTRL approach. The work was done while KZ visited the University of Southern California. KZ gratefully acknowledges financial support from the China Scholarship Council. HL and CYW are grateful for the support of NSF Grant #1755781.

References

[1] Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164-4175, 2012.

[2] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12-38, 2017.
[3] Amit Agarwal and Elad Hazan. Efficient algorithms for online game playing and universal portfolio management. Electronic Colloquium on Computational Complexity, TR06-033, 2005.

[4] Amit Agarwal, Elad Hazan, Satyen Kale, and Robert E. Schapire. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning, pages 9-16, 2006.

[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[6] Sébastien Bubeck, Michael B. Cohen, and Yuanzhi Li. Sparsity, variance and curvature in multi-armed bandits. In International Conference on Algorithmic Learning Theory, 2018.

[7] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. arXiv preprint arXiv:1507.02564, 2015.

[8] Thomas M. Cover. Universal portfolios. Mathematical Finance, 1(1):1-29, 1991.

[9] Thomas M. Cover. Universal data compression and portfolio selection. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science, pages 534-538. IEEE, 1996.

[10] Dylan J. Foster, Zhiyuan Li, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4734-4742, 2016.

[11] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.

[12] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

[13] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

[14] David P. Helmbold, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. On-line portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325-347, 1998.

[15] Adam Kalai and Santosh Vempala. Efficient algorithms for universal portfolios. Journal of Machine Learning Research, 3(Nov):423-440, 2002.

[16] László Lovász and Santosh Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 57-68. IEEE, 2006.

[17] Hariharan Narayanan and Alexander Rakhlin. Random walk approach to regret minimization. In Advances in Neural Information Processing Systems, pages 1777-1785, 2010.

[18] Yurii Nesterov and Arkadii Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming, volume 13. SIAM, 1994.

[19] Laurent Orseau, Tor Lattimore, and Shane Legg. Soft-Bayes: Prod for mixtures of experts with log-loss. In International Conference on Algorithmic Learning Theory, pages 372-399, 2017.

[20] Tim van Erven, Dirk van der Hoeven, and Wouter Koolen. Personal communication, 2018.

[21] Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Conference on Learning Theory, 2018.

[22] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928-936, 2003.