# Stochastic and Adversarial Online Learning without Hyperparameters

Ashok Cutkosky (Department of Computer Science, Stanford University, ashokc@cs.stanford.edu) and Kwabena Boahen (Department of Bioengineering, Stanford University, boahen@stanford.edu)

## Abstract

Most online optimization algorithms focus on one of two things: performing well in adversarial settings by adapting to unknown data parameters (such as Lipschitz constants), typically achieving $O(\sqrt{T})$ regret, or performing well in stochastic settings where they can leverage some structure in the losses (such as strong convexity), typically achieving $O(\log T)$ regret. Algorithms that focus on the former problem have hitherto achieved only $O(\sqrt{T})$ regret in the stochastic setting, rather than $O(\log T)$. Here we introduce an online optimization algorithm that achieves $O(\log^4 T)$ regret in a wide class of stochastic settings while gracefully degrading to the optimal $O(\sqrt{T})$ regret in adversarial settings (up to logarithmic factors). Our algorithm does not require any prior knowledge about the data or tuning of parameters to achieve superior performance.

## 1 Extending Adversarial Algorithms to Stochastic Settings

The online convex optimization (OCO) paradigm [1, 2] can be used to model a large number of scenarios of interest, such as streaming problems, adversarial environments, or stochastic optimization. In brief, an OCO algorithm plays $T$ rounds of a game in which on each round the algorithm outputs a vector $w_t$ in some convex space $W$, and then receives a convex loss function $\ell_t : W \to \mathbb{R}$. The algorithm's objective is to minimize regret, the total loss of all rounds relative to $w^\star$, the minimizer of $\sum_{t=1}^T \ell_t$ in $W$:

$$R_T(w^\star) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(w^\star).$$

OCO algorithms typically either make as few assumptions as possible about the $\ell_t$ while attempting to perform well (adversarial settings), or assume that the $\ell_t$ have some particular structure that can be leveraged to perform much better (stochastic settings). In the adversarial setting, the minimax optimal regret is $O(BL_{\max}\sqrt{T})$, where $B$ is the diameter of $W$ and $L_{\max}$ is the maximum Lipschitz constant of the losses [3]. A wide variety of algorithms achieve this bound without prior knowledge of one or both of $B$ and $L_{\max}$ [4, 5, 6, 7], resulting in hyperparameter-free algorithms. In the stochastic setting, it was recently shown that for a class of problems (those satisfying the so-called Bernstein condition), one can achieve regret $O(dBL_{\max}\log T)$, where $W \subset \mathbb{R}^d$, using the METAGRAD algorithm [8, 9]. This approach requires knowledge of the parameter $L_{\max}$.

In this paper, we extend a parameter-free algorithm for the adversarial setting [7] to the stochastic setting, achieving both optimal regret in adversarial settings and logarithmic regret in a wide class of stochastic settings, without needing to tune parameters. Our class of stochastic settings comprises those for which the expected subgradient $\mathbb{E}[\nabla\ell_t(w_t)]$ is aligned with $w_t - w^\star$, quantified by a value $\alpha$ that increases with increasing alignment. We call losses in this class $\alpha$-acutely convex, and show that a single quadratic lower bound on the average loss is sufficient to ensure high $\alpha$.

This paper is organized as follows. In Section 2, we provide an overview of our approach. In Section 3, we give explicit pseudo-code and prove our regret bounds for the adversarial setting.
In Section 4, we formally define $\alpha$-acute convexity and prove regret bounds for the acutely convex stochastic setting. Finally, in Section 5, we give some motivating examples of acutely convex stochastic losses. Section 6 concludes the paper.

## 2 Overview of Approach

Before giving the overview, we fix some notation. We assume our domain $W$ is a closed convex subset of a Hilbert space with $0 \in W$. We write $g_t$ for an arbitrary subgradient of $\ell_t$ at $w_t$, denoted $g_t \in \partial\ell_t(w_t)$. $L_{\max}$ is the maximum Lipschitz constant of the $\ell_t$, and $B$ is the diameter of the space $W$. The norm we use is the 2-norm, $\|w\| = \sqrt{w\cdot w}$. We observe that since each $\ell_t$ is convex, we have $R_T(w^\star) \le \sum_{t=1}^T g_t\cdot(w_t - w^\star)$. We will make heavy use of this inequality; every regret bound we state will in fact be an upper bound on $\sum_{t=1}^T g_t\cdot(w_t - w^\star)$. Finally, we use the compressed sum notations $g_{1:t} = \sum_{t'=1}^t g_{t'}$ and $\|g\|_{1:t} = \sum_{t'=1}^t \|g_{t'}\|$, and we use $\tilde O$ to suppress logarithmic terms in big-O notation. All proofs omitted from the main text appear in the appendix.

Our algorithm works by trading off some performance in order to avoid knowledge of problem parameters. Prior analysis of the METAGRAD algorithm [9] showed that any algorithm guaranteeing $R_T(w^\star) = \tilde O\big(\sqrt{\sum_{t=1}^T (g_t\cdot(w_t - w^\star))^2}\big)$ will obtain logarithmic regret for stochastic settings satisfying the Bernstein condition. We will instead guarantee the weaker regret bound

$$R_T(w^\star) \le \tilde O\Bigg(\sqrt{L_{\max}\sum_{t=1}^T \|g_t\|\,\|w_t - w^\star\|^2}\Bigg), \tag{1}$$

which we will show in turn implies $\sqrt{T}$ regret in adversarial settings and logarithmic regret for acutely convex stochastic settings. Although (1) is weaker than the METAGRAD regret bound, we can obtain it without prior knowledge.

In order to come up with an algorithm that achieves the bound (1), we interpret it as the square root of $\mathbb{E}[\|w - w^\star\|^2]$, where $w$ takes on value $w_t$ with probability proportional to $\|g_t\|$. This allows us to use the bias-variance decomposition to write (1) as

$$\tilde O\Bigg(\sqrt{L_{\max}\|g\|_{1:T}}\,\|\bar w - w^\star\| + \sqrt{L_{\max}\sum_{t=1}^T \|g_t\|\,\|w_t - \bar w\|^2}\Bigg), \qquad \bar w = \frac{\sum_{t=1}^T \|g_t\|\,w_t}{\|g\|_{1:T}}. \tag{2}$$

Certain algorithms for unconstrained OCO can achieve $R_T(u) = \tilde O\big(\|u\|\sqrt{L_{\max}\|g\|_{1:T}}\big)$ simultaneously for all $u \in W$ [10, 6, 11, 7]. Thus if we knew $\bar w$ ahead of time, we could translate the predictions of one such algorithm by $\bar w$ to obtain $R_T(w^\star) \le \tilde O\big(\|\bar w - w^\star\|\sqrt{L_{\max}\|g\|_{1:T}}\big)$, the bias term of (2). We do not know $\bar w$, but we can estimate it over time. Errors in the estimation procedure will cause us to incur the variance term of (2).

We implement this strategy by modifying FREEREX [7], an unconstrained OCO algorithm that does not require prior knowledge of any parameters. Our modification to FREEREX is very simple: we set $w_t = \hat w_t + \bar w_{t-1}$, where $\hat w_t$ is the $t$th output of FREEREX and $\bar w_{t-1}$ is (approximately) a weighted average of the previous vectors $w_1,\dots,w_{t-1}$, with the weight of $w_{t'}$ equal to $\|g_{t'}\|$. This offset $\bar w_{t-1}$ can be viewed as a kind of momentum term that accelerates us towards optimal points when the losses are stochastic (which tends to produce correlated $w_t$ and therefore large offsets), but has very little effect when the losses are adversarial (which tends to produce uncorrelated $w_t$ and therefore small offsets).

## 3 FREEREXMOMENTUM

In this section, we explicitly describe and analyze our algorithm, FREEREXMOMENTUM, a modification of FREEREX. FREEREX is a Follow-the-Regularized-Leader (FTRL) algorithm, which means that for all $t$ there is some regularizer function $\psi_t$ such that $w_{t+1} = \operatorname{argmin}_{w \in W}\, \psi_t(w) + g_{1:t}\cdot w$. Specifically, FREEREX uses $\psi_t(w) = \frac{5}{a_t\eta_t}\phi(a_t w)$, where $\phi(w) = (\|w\|+1)\log(\|w\|+1) - \|w\|$, and $\eta_t$ and $a_t$ are specific numbers that grow over time as specified in Algorithm 1.
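To see where the exponential form of FREEREX's predictions comes from (cf. update (3) below), consider the unconstrained FTRL step. The following derivation is our own sketch, under the assumption that the argmin is attained in the interior of $W$; it is consistent with the regularizer $\psi_t$ defined above.

```latex
% Unconstrained FTRL step for FreeRex's regularizer:
%   minimize  f(w) = \frac{5}{a_t \eta_t}\,\phi(a_t w) + g_{1:t}\cdot w,
% where \phi(w) = (\|w\|+1)\log(\|w\|+1) - \|w\|, so that
% \nabla\phi(u) = \log(\|u\|+1)\,u/\|u\|. Setting the gradient to zero:
\begin{align*}
  \nabla f(w) &= \frac{5}{\eta_t}\log(a_t\|w\|+1)\,\frac{w}{\|w\|} + g_{1:t} = 0 ,
\end{align*}
% so w points along -g_{1:t}, and its magnitude solves
% (5/\eta_t)\log(a_t\|w\|+1) = \|g_{1:t}\|, giving
\begin{align*}
  w_{t+1} &= -\frac{g_{1:t}}{a_t\|g_{1:t}\|}
             \left[\exp\!\left(\frac{\eta_t\|g_{1:t}\|}{5}\right) - 1\right].
\end{align*}
```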
FREEREXMOMENTUM's predictions are given by offsetting FREEREX's predictions $w_{t+1}$ by a momentum term $\bar w_t = \frac{\sum_{t'=1}^t \|g_{t'}\|\,w_{t'}}{1 + \|g\|_{1:t}}$. We accomplish this by shifting the regularizers $\psi_t$ by $\bar w_t$, so that FREEREXMOMENTUM is FTRL with regularizers $\psi_t(w - \bar w_t)$.

**Algorithm 1** FREEREXMOMENTUM

1. **Initialize:** $1/\eta_0^2 \leftarrow 0$, $a_0 \leftarrow 0$, $w_1 \leftarrow 0$, $\bar w_0 \leftarrow 0$, $L_0 \leftarrow 0$, $\phi(w) = (\|w\|+1)\log(\|w\|+1) - \|w\|$
2. **for** $t = 1$ **to** $T$ **do**
3. &nbsp;&nbsp;Play $w_t$
4. &nbsp;&nbsp;Receive subgradient $g_t \in \partial\ell_t(w_t)$
5. &nbsp;&nbsp;$L_t \leftarrow \max(L_{t-1}, \|g_t\|)$ &nbsp;&nbsp;// $L_t = \max_{t' \le t}\|g_{t'}\|$
6. &nbsp;&nbsp;$\frac{1}{\eta_t^2} \leftarrow \max\!\left(\frac{1}{\eta_{t-1}^2} + 2\|g_t\|^2,\ L_t\|g_{1:t}\|\right)$
7. &nbsp;&nbsp;$a_t \leftarrow \max\!\left(a_{t-1},\ 1/(L_t\eta_t)^2\right)$
8. &nbsp;&nbsp;$\bar w_t \leftarrow \frac{\sum_{t'=1}^t \|g_{t'}\|\,w_{t'}}{1 + \|g\|_{1:t}}$
9. &nbsp;&nbsp;$w_{t+1} \leftarrow \operatorname{argmin}_{w \in W}\left[\frac{5}{a_t\eta_t}\phi\big(a_t(w - \bar w_t)\big) + g_{1:t}\cdot w\right]$
10. **end for**

### 3.1 Regret Analysis

We leverage the description of FREEREXMOMENTUM in terms of shifted regularizers to prove a regret bound of the same form as (1) in four steps:

1. From [7], Theorem 13, we bound the regret by
$$\begin{aligned}\sum_{t=1}^T g_t\cdot(w_t - w^\star) \le{}& \psi_T(w^\star)\\ &+ \sum_{t=1}^T\Big[\psi_{t-1}(w^+_{t+1}) - \psi^+_t(w^+_{t+1}) + g_t\cdot(w_t - w^+_{t+1})\Big]\\ &+ \sum_{t=1}^T\Big[\psi^+_t(w^+_{t+2}) - \psi_t(w^+_{t+2})\Big],\end{aligned}$$
where $\psi^+_t(w) = \frac{5}{a_t\eta_t}\phi\big(a_t(w - \bar w_{t-1})\big)$ is a version of $\psi_t$ shifted by $\bar w_{t-1}$ instead of $\bar w_t$, and $w^+_{t+1} = \operatorname{argmin}_{w\in W}\, \psi^+_t(w) + g_{1:t}\cdot w$. This breaks the regret out into two sums: one with the term $\psi_{t-1}(w^+_{t+1}) - \psi^+_t(w^+_{t+1})$, in which the two different functions are shifted by the same amount, and one with the term $\psi^+_t(w^+_{t+2}) - \psi_t(w^+_{t+2})$, in which the functions are shifted differently but the arguments are the same.
2. Because $\psi_{t-1}$ and $\psi^+_t$ are shifted by the same amount, the regret analysis for FREEREX in [7] applies to the second line of the regret bound, yielding a quantity similar to $\|w^\star - \bar w_T\|\sqrt{L_{\max}\|g\|_{1:T}}$.
3. Next, we analyze the third line. We show that $\|\bar w_t - \bar w_{t-1}\|$ cannot be too big, and use this observation to bound the third line by a quantity similar to $\sqrt{L_{\max}\sum_{t=1}^T \|g_t\|\,\|w_t - \bar w_T\|^2}$. At this point we have enough results to prove a bound of the form (2) (see Theorem 1).
4. Finally, we perform some algebraic manipulation on the bound from the first three steps to obtain a bound of the form (1) (see Corollary 2).

The details of Steps 1-3 are in the appendix, resulting in Theorem 1, stated below. Step 4 is carried out in Corollary 2, which follows.

**Theorem 1.** Let $\phi(w) = (\|w\|+1)\log(\|w\|+1) - \|w\|$. Set $L_t = \max_{t' \le t}\|g_{t'}\|$ and $Q_T = \frac{2\|g\|_{1:T}}{L_{\max}}$. Define $\eta_t$ and $a_t$ as in the pseudo-code for FREEREXMOMENTUM (Algorithm 1). Then the regret of FREEREXMOMENTUM is bounded by:
$$\begin{aligned}\sum_{t=1}^T g_t\cdot(w_t - w^\star) \le{}& \frac{5}{Q_T\eta_T}\phi\big(Q_T(w^\star - \bar w_T)\big) + 405L_{\max} + 2L_{\max}B + 3L_{\max}\sqrt{2L_{\max}(1 + L_1B)}\,\log(Ba_T + 1)\\ &+ \sqrt{2L_{\max}\sum_{t=1}^T \|g_t\|\,\|w_t - \bar w_T\|^2}\,\Big(2 + \log\big(1 + \|g\|_{1:T}\big)\Big)\log(Ba_T + 1).\end{aligned}$$

**Corollary 2.** Under the assumptions and notation of Theorem 1, the regret of FREEREXMOMENTUM is bounded by:
$$\sum_{t=1}^T g_t\cdot(w_t - w^\star) \le 2\sqrt{L_{\max}\sum_{t=1}^T \|g_t\|\,\|w^\star - w_t\|^2}\,\log(2BT+1)\,(2 + \log T) + C_T,$$
where for brevity we write $C_T = 405L_{\max} + 2L_{\max}B + 3L_{\max}\sqrt{2L_{\max}(1 + L_1B)}\,\log(2BT+1)$.

Observe that since $w_t$ and $w^\star$ are both in $W$, we have $\|w^\star - w_t\| \le B$ for all $t$, so that Corollary 2 implies that FREEREXMOMENTUM achieves $\tilde O(BL_{\max}\sqrt{T})$ regret in the worst case, which is optimal up to logarithmic factors.

### 3.2 Efficient Implementation for $L^\infty$ Balls

A careful reader may notice that the procedure for FREEREXMOMENTUM involves computing $\operatorname{argmin}_{w\in W}\left[\frac{5}{a_t\eta_t}\phi\big(a_t(w - \bar w_t)\big) + g_{1:t}\cdot w\right]$, which may not be easy if the solution $w_{t+1}$ is on the boundary of $W$. When $w_{t+1}$ is not on the boundary of $W$, we have a closed-form update:
$$w_{t+1} = \bar w_t - \frac{g_{1:t}}{a_t\|g_{1:t}\|}\left[\exp\!\left(\frac{\eta_t\|g_{1:t}\|}{5}\right) - 1\right]. \tag{3}$$
However, when $w_{t+1}$ lies on the boundary of $W$, it is not clear how to compute it for general $W$.
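To make Algorithm 1 concrete, here is a minimal Python sketch (our reconstruction, not the authors' reference code) that plays the closed-form step (3). The class name and structure are our own choices, and boundary handling is deferred to the coordinate-wise strategy described next.

```python
import numpy as np

class FreeRexMomentum:
    """Sketch of Algorithm 1 (FreeRexMomentum) using the closed-form
    update (3), valid when the minimizer never hits the boundary of W.
    Illustrative reconstruction only."""

    def __init__(self, dim):
        self.inv_eta_sq = 0.0                 # 1 / eta_t^2
        self.a = 0.0                          # a_t
        self.L = 0.0                          # running Lipschitz estimate L_t
        self.g_sum = np.zeros(dim)            # g_{1:t}
        self.g_norm_sum = 0.0                 # ||g||_{1:t}
        self.weighted_w_sum = np.zeros(dim)   # sum_t ||g_t|| w_t
        self.w = np.zeros(dim)                # current prediction w_t

    def update(self, g):
        """Consume the subgradient g_t at the current w_t; return w_{t+1}."""
        g_norm = np.linalg.norm(g)
        self.L = max(self.L, g_norm)
        self.g_sum += g
        self.g_norm_sum += g_norm
        self.weighted_w_sum += g_norm * self.w

        g_sum_norm = np.linalg.norm(self.g_sum)
        self.inv_eta_sq = max(self.inv_eta_sq + 2.0 * g_norm**2,
                              self.L * g_sum_norm)
        if self.L > 0:
            # a_t = max(a_{t-1}, 1 / (L_t eta_t)^2)
            self.a = max(self.a, self.inv_eta_sq / self.L**2)

        # Momentum offset: weighted average of the points played so far.
        w_bar = self.weighted_w_sum / (1.0 + self.g_norm_sum)

        # Closed-form FTRL step (3), centered at the momentum term.
        if g_sum_norm > 0 and self.a > 0:
            eta = 1.0 / np.sqrt(self.inv_eta_sq)
            scale = (np.exp(eta * g_sum_norm / 5.0) - 1.0) / (self.a * g_sum_norm)
            self.w = w_bar - scale * self.g_sum
        else:
            self.w = w_bar
        return self.w
```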
In this subsection we offer a simple strategy for the case where $W$ is an $L^\infty$ ball, $W = \prod_{i=1}^d [-b, b]$. In this setting, we can use the standard trick (e.g. see [12]) of running a separate copy of FREEREXMOMENTUM for each coordinate. That is, we observe that
$$\sum_{t=1}^T g_t\cdot(w_t - u) = \sum_{i=1}^d \sum_{t=1}^T g_{t,i}(w_{t,i} - u_i), \tag{4}$$
so that if we run an independent online learning algorithm on each coordinate, using the coordinates $g_{t,i}$ of the subgradients as losses, then the total regret is at most the sum of the individual regrets. More detailed pseudocode is given in Algorithm 2, and a sketch follows below.

Coordinate-wise FREEREXMOMENTUM is implementable in $O(d)$ time per update because the FREEREXMOMENTUM update is easy to perform in one dimension: if the update (3) falls outside the domain $[-b, b]$, simply set $w_{t+1}$ to $-b$ or $b$, whichever is closer to the unconstrained update. We bound the regret of coordinate-wise FREEREXMOMENTUM using Corollary 2 and Equation (4), resulting in the following corollary.

**Algorithm 2** Coordinate-Wise FREEREXMOMENTUM

1. **Initialize:** $w_1 = 0$, and $d$ copies of FREEREXMOMENTUM, $F_1,\dots,F_d$, where each $F_i$ uses domain $[-b, b]$.
2. **for** $t = 1$ **to** $T$ **do**
3. &nbsp;&nbsp;Play $w_t$, receive subgradient $g_t$.
4. &nbsp;&nbsp;**for** $i = 1$ **to** $d$ **do**
5. &nbsp;&nbsp;&nbsp;&nbsp;Give $g_{t,i}$ to $F_i$.
6. &nbsp;&nbsp;&nbsp;&nbsp;Get $w_{t+1,i} \in [-b, b]$ from $F_i$.
7. &nbsp;&nbsp;**end for**
8. **end for**

**Corollary 3.** The regret of coordinate-wise FREEREXMOMENTUM is bounded by:
$$\sum_{t=1}^T g_t\cdot(w_t - w^\star) \le 2\sqrt{dL_{\max}\sum_{t=1}^T \|g_t\|\,\|w^\star - w_t\|^2}\,\log(2bT+1)\,(2 + \log T) + 405\,dL_{\max} + 2dL_{\max}b + 3dL_{\max}\sqrt{2L_{\max}(1 + L_1b)}\,\log(2bT+1).$$
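A minimal sketch of Algorithm 2, assuming the hypothetical `FreeRexMomentum` class from the previous sketch: each coordinate runs its own one-dimensional copy, and predictions are clipped to $[-b, b]$ as described above. Writing the clipped point back into the learner reflects the fact that the momentum average is taken over the points actually played; this write-back is our design choice, not something specified in the paper's pseudocode.

```python
import numpy as np

class CoordinateWiseFreeRexMomentum:
    """Sketch of Algorithm 2: one FreeRexMomentum learner per coordinate
    of the L-infinity ball [-b, b]^d, with clipping to the domain."""

    def __init__(self, dim, b):
        self.b = b
        self.learners = [FreeRexMomentum(dim=1) for _ in range(dim)]
        self.w = np.zeros(dim)

    def update(self, g):
        """Feed coordinate g[i] to learner F_i; clip its prediction to
        [-b, b], taking whichever endpoint is closer (Section 3.2)."""
        for i, learner in enumerate(self.learners):
            w_i = learner.update(np.array([g[i]]))[0]
            self.w[i] = np.clip(w_i, -self.b, self.b)
            learner.w[:] = self.w[i]  # momentum averages the played point
        return self.w
```

On each round one would call `update` with the observed subgradient, e.g. `g = np.sign(w - x_t)` for the coordinate-wise absolute loss.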
## 4 Logarithmic Regret in Stochastic Problems

In this section we formally define $\alpha$-acute convexity and show that FREEREXMOMENTUM achieves logarithmic regret for $\alpha$-acutely convex losses. As a warm-up, we first consider the simplest case, in which the loss functions are fixed: $\ell_t = \ell$ for all $t$. After showing logarithmic regret for this case, we will then generalize to more complicated stochastic settings.

Intuitively, an acutely convex loss function $\ell$ is one for which the gradient $g_t$ is aligned with the vector $w_t - w^\star$, where $w^\star = \operatorname{argmin}\ell$, as defined below.

**Definition 4.** A convex function $\ell$ is $\alpha$-acutely convex on a set $W$ if $\ell$ has a global minimum at some $w^\star \in W$ and, for all $w \in W$ and all subgradients $g \in \partial\ell(w)$, we have
$$g\cdot(w - w^\star) \ge \alpha\,\|g\|\,\|w - w^\star\|^2.$$

With this definition in hand, we can show logarithmic regret in the case where $\ell_t = \ell$ for all $t$ for some $\alpha$-acutely convex function $\ell$. From Corollary 2, with $w^\star = \operatorname{argmin}\ell$, we have
$$\sum_{t=1}^T g_t\cdot(w_t - w^\star) \le \tilde O\Bigg(\sqrt{L_{\max}\sum_{t=1}^T \|g_t\|\,\|w^\star - w_t\|^2}\Bigg) \le \tilde O\Bigg(\sqrt{\frac{L_{\max}}{\alpha}\sum_{t=1}^T g_t\cdot(w_t - w^\star)}\Bigg), \tag{5}$$
where the $\tilde O$ notation suppresses terms whose dependence on $T$ is at most $O(\log^2 T)$. Now we need a small proposition:

**Proposition 5.** If $a$, $b$, $c$ and $d$ are non-negative constants such that $x \le a\sqrt{bx + c} + d$, then $x \le 4a^2b + 2a\sqrt{c} + 2d$.

Applying Proposition 5 to Equation (5) with $x = \sum_{t=1}^T g_t\cdot(w_t - w^\star)$ yields
$$R_T(w^\star) \le \tilde O\!\left(\frac{L_{\max}}{\alpha}\right),$$
where the $\tilde O$ again suppresses logarithmic terms, now with dependence on $T$ at most $O(\log^4 T)$.

Having shown that FREEREXMOMENTUM achieves logarithmic regret on fixed $\alpha$-acutely convex losses, we now generalize to stochastic losses. In order to do this we will necessarily have to make some assumptions about the process generating the stochastic losses. We encapsulate these assumptions in a stochastic version of $\alpha$-acute convexity, given below.

**Definition 6.** Suppose for all $t$, $g_t$ is such that $\mathbb{E}[g_t \mid g_1,\dots,g_{t-1}] \in \partial\ell(w_t)$ for some convex function $\ell$ with minimum at $w^\star$. Then we say $g_t$ is $\alpha$-acutely convex in expectation if
$$\mathbb{E}[g_t]\cdot(w_t - w^\star) \ge \alpha\,\mathbb{E}\big[\|g_t\|\,\|w_t - w^\star\|^2\big],$$
where all expectations are conditioned on $g_1,\dots,g_{t-1}$.

Using this definition, a fairly straightforward calculation gives us the following result.

**Theorem 7.** Suppose $g_t$ is $\alpha$-acutely convex in expectation, and $\|g_t\| \le L_{\max}$ with probability 1. Then FREEREXMOMENTUM achieves expected regret
$$\mathbb{E}[R_T(w^\star)] \le \tilde O\!\left(\frac{L_{\max}}{\alpha}\right).$$

*Proof.* Throughout this proof, all expectations are conditioned on prior subgradients. By Corollary 2 and Jensen's inequality we have
$$\mathbb{E}\left[\sum_{t=1}^T g_t\cdot(w_t - w^\star)\right] \le 2\sqrt{L_{\max}\sum_{t=1}^T \mathbb{E}\big[\|g_t\|\,\|w^\star - w_t\|^2\big]}\,\log(2BT+1)\,(2 + \log T) + C_T$$
$$\le 2\sqrt{\frac{L_{\max}}{\alpha}\sum_{t=1}^T \mathbb{E}\big[g_t\cdot(w_t - w^\star)\big]}\,\log(2BT+1)\,(2 + \log T) + C_T,$$
with $C_T$ as in Corollary 2. Set $R = \mathbb{E}\big[\sum_{t=1}^T g_t\cdot(w_t - w^\star)\big]$. Then we have shown
$$R \le 2\sqrt{\frac{L_{\max}}{\alpha}R}\,\log(2BT+1)\,(2 + \log T) + C_T,$$
and now we use Proposition 5 to conclude
$$\sum_{t=1}^T \mathbb{E}\big[g_t\cdot(w_t - w^\star)\big] = \tilde O\!\left(\frac{L_{\max}}{\alpha}\right),$$
as desired, where again $\tilde O$ hides at most an $O(\log^4 T)$ dependence on $T$. $\square$

Exactly the same argument, with an extra factor of $d$, applies to the regret of FREEREXMOMENTUM with coordinate-wise updates.

## 5 Examples of α-acute convexity in expectation

In this section, we show that $\alpha$-acute convexity in expectation is a condition that arises in practice, justifying the relevance of our logarithmic regret bounds. To do this, we show that a quadratic lower bound on the expected loss implies $\alpha$-acute convexity, demonstrating that acute convexity is a weaker condition than strong convexity.

**Proposition 8.** Suppose $\mathbb{E}[g_t \mid g_1,\dots,g_{t-1}] \in \partial\ell(w_t)$ for some convex $\ell$ such that, for some $\mu > 0$ and $w^\star = \operatorname{argmin}\ell$, $\ell(w) - \ell(w^\star) \ge \frac{\mu}{2}\|w - w^\star\|^2$ for all $w \in W$. Suppose $\|g_t\| \le L_{\max}$ with probability 1. Then $g_t$ is $\frac{\mu}{2L_{\max}}$-acutely convex in expectation.

*Proof.* By convexity and the hypothesis of the proposition,
$$\mathbb{E}[g_t]\cdot(w_t - w^\star) \ge \ell(w_t) - \ell(w^\star) \ge \frac{\mu}{2}\|w_t - w^\star\|^2 \ge \frac{\mu}{2L_{\max}}\,\mathbb{E}\big[\|g_t\|\,\|w_t - w^\star\|^2\big]. \qquad\square$$

With Proposition 8, we see that FREEREXMOMENTUM obtains logarithmic regret for any loss that is lower-bounded by a quadratic, without requiring knowledge of the parameter $\mu$ or the Lipschitz bound $L_{\max}$. Further, this result requires only the expected loss $\ell = \mathbb{E}[\ell_t]$ to have a quadratic lower bound; the individual losses $\ell_t$ themselves need not.

The boundedness of $W$ makes it surprisingly easy to have a quadratic lower bound. Although a quadratic lower bound for a function $\ell$ is easily implied by strong convexity, the quadratic lower bound is a significantly weaker condition. For example, since $W$ has diameter $B$, we have $|w| \ge \frac{1}{B}w^2$ for all $w \in W \subset \mathbb{R}$, and so the absolute value is $\frac{1}{B}$-acutely convex, but not strongly convex. The following proposition shows that the existence of a quadratic lower bound is actually a local condition: so long as the expected loss $\ell$ has a quadratic lower bound in a neighborhood of $w^\star$, it has one over the entire space $W$.

**Proposition 9.** Suppose $\ell : W \to \mathbb{R}$ is a convex function such that $\ell(w) - \ell(w^\star) \ge \frac{\mu}{2}\|w - w^\star\|^2$ for all $w$ with $\|w - w^\star\| \le r$. Then $\ell(w) - \ell(w^\star) \ge \min\!\left(\frac{\mu}{2}, \frac{\mu r}{2B}\right)\|w - w^\star\|^2$ for all $w \in W$.

*Proof.* We translate by $w^\star$ to assume without loss of generality that $w^\star = 0$. The statement is then clear for $\|w\| \le r$. For $\|w\| > r$, by convexity,
$$\ell(w) - \ell(0) \ge \frac{\|w\|}{r}\left[\ell\!\left(\frac{rw}{\|w\|}\right) - \ell(0)\right] \ge \frac{\|w\|}{r}\cdot\frac{\mu r^2}{2} = \frac{\mu r}{2}\|w\| \ge \frac{\mu r}{2B}\|w\|^2. \qquad\square$$
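As a concrete check of Definition 4, the following snippet (our illustration; the domain $W = [-B/2, B/2]$ and the grid are arbitrary choices) verifies numerically that the absolute value is $\frac{1}{B}$-acutely convex on a domain of diameter $B$, and that it satisfies the quadratic lower bound of Proposition 8 with $\mu = 2/B$, even though it is piecewise linear and hence not strongly convex for any fixed $\mu$.

```python
import numpy as np

B = 6.0                         # diameter of W = [-B/2, B/2]
ws = np.linspace(-B / 2, B / 2, 10001)
ws = ws[ws != 0.0]              # at w = 0, any g in [-1, 1] works trivially

g = np.sign(ws)                 # subgradient of loss(w) = |w| away from 0
alpha = 1.0 / B

# Definition 4 with w_star = 0:  g (w - w*) >= alpha |g| (w - w*)^2.
print(np.all(g * ws >= alpha * np.abs(g) * ws**2))   # True

# Quadratic lower bound of Proposition 8 with mu = 2/B also holds,
# consistent with alpha = mu / (2 L_max) = 1/B since L_max = 1 here.
print(np.all(np.abs(ws) >= (1.0 / B) * ws**2))       # True
```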
Finally, we provide a simple motivating example of an interesting problem we can solve with an $\alpha$-acutely convex loss that is not strongly convex: computing the median.

**Proposition 10.** Let $W = [a, b]$ and $\ell_t(w) = |w - x_t|$, where each $x_t$ is drawn i.i.d. from some fixed distribution with a continuous cumulative distribution function $D$, and assume $D(x^\star) = \frac{1}{2}$. Further, suppose $|2D(w) - 1| \ge F|w - x^\star|$ for all $|w - x^\star| \le G$. Suppose $g_t = \ell_t'(w_t)$ for $w_t \ne x_t$, and $g_t = \pm 1$ with equal probability if $w_t = x_t$. Then $g_t$ is $\min\!\left(\frac{FG}{b-a}, F\right)$-acutely convex in expectation.

*Proof.* By a little calculation, $\mathbb{E}[g_t] = \ell'(w_t) = 2D(w_t) - 1$ and $\mathbb{E}[|g_t|] = 1$, where $\ell(w) = \mathbb{E}[|w - x_t|]$. Since $\ell'(x^\star) = 0$, we have $w^\star = x^\star$ (the median). For $|w_t - x^\star| \ge G$, we have $|2D(w_t) - 1| \ge FG$, which gives
$$\mathbb{E}[g_t](w_t - w^\star) \ge FG\,|w_t - w^\star| \ge \frac{FG}{b-a}\,\mathbb{E}[|g_t|](w_t - w^\star)^2.$$
For $|w_t - x^\star| \le G$, we have
$$\mathbb{E}[g_t](w_t - w^\star) \ge F\,\mathbb{E}[|g_t|](w_t - w^\star)^2,$$
so that $g_t$ is $\min\!\left(\frac{FG}{b-a}, F\right)$-acutely convex in expectation. $\square$

Proposition 10 shows that we can obtain low regret for an interesting stochastic problem without curvature. The condition on the cumulative distribution function $D$ asks only that there be positive density in a neighborhood of the median; it would be satisfied if $D'(w) \ge F$ for $|w - x^\star| \le G$.

If the expected loss $\ell$ is $\mu$-strongly convex, we can apply Proposition 8 to see that $\ell$ is $\frac{\mu}{2L_{\max}}$-acutely convex, and then use Theorem 7 to obtain a regret of $\tilde O(L_{\max}^2/\mu)$ with an $O(\log^4 T)$ dependence on $T$. This differs from the usual $O\big((L_{\max}^2/\mu)\log T\big)$ regret bound obtained by Online Newton Step [13], an inefficiency due to using the weaker $\alpha$-acute convexity condition. Instead, arguing from the regret bound of Corollary 2 directly, we can recover the optimal regret bound:

**Corollary 11.** Suppose each $\ell_t$ is an independent random variable with $\mathbb{E}[\ell_t] = \ell$ for some $\mu$-strongly convex $\ell$ with minimum at $w^\star$. Then the expected regret of FREEREXMOMENTUM satisfies
$$\mathbb{E}\left[\sum_{t=1}^T \ell(w_t) - \ell(w^\star)\right] \le \tilde O(L_{\max}^2/\mu),$$
where the $\tilde O$ hides terms that are logarithmic in $TB$.

*Proof.* From strong convexity, we have
$$\|w_t - w^\star\|^2 \le \frac{2}{\mu}\big(\ell(w_t) - \ell(w^\star)\big).$$
Therefore, applying Corollary 2 (with $\|g_t\| \le L_{\max}$) and Jensen's inequality, we have
$$\mathbb{E}[R_T(w^\star)] = \mathbb{E}\left[\sum_{t=1}^T \ell(w_t) - \ell(w^\star)\right] \le \tilde O\left(\sqrt{L_{\max}^2\,\mathbb{E}\left[\sum_{t=1}^T \|w_t - w^\star\|^2\right]}\right) \le \tilde O\left(\sqrt{\frac{2L_{\max}^2}{\mu}\,\mathbb{E}[R_T(w^\star)]}\right),$$
so that applying Proposition 5 we obtain the desired result. $\square$

As a result of Corollary 11, we see that FREEREXMOMENTUM obtains logarithmic regret for $\alpha$-acutely convex problems and also obtains the optimal (up to log factors) regret bound for $\mu$-strongly convex problems, all without requiring any knowledge of the parameters $\alpha$ or $\mu$. This stands in contrast to prior algorithms that adapt to user-supplied curvature information, such as Adaptive Gradient Descent [14] or (A, B)-prod [15].
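For a usage example tying the pieces together, the following small simulation (ours; it relies on the hypothetical sketches from Section 3) runs one-dimensional FREEREXMOMENTUM on the stochastic median problem of Proposition 10 and reports the average regret against the true median.

```python
import numpy as np

rng = np.random.default_rng(0)
T, b, median = 20000, 3.0, 1.0
learner = FreeRexMomentum(dim=1)  # hypothetical 1-D sketch from Section 3
w, regret = 0.0, 0.0

for t in range(T):
    x = median + rng.standard_normal()        # losses |w - x_t|, median 1.0
    regret += abs(w - x) - abs(median - x)    # regret vs. w* = median
    g = np.sign(w - x) if w != x else rng.choice([-1.0, 1.0])
    w = float(np.clip(learner.update(np.array([g]))[0], -b, b))
    learner.w[:] = w                          # momentum averages played points

print(f"final w = {w:.3f}, average regret = {regret / T:.4f}")
# Proposition 10 and Theorem 7 suggest total regret growing only
# polylogarithmically in T, so the average regret should shrink quickly.
```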
## 6 Conclusions and Open Problems

We have presented an algorithm, FREEREXMOMENTUM, that achieves both $\tilde O(BL_{\max}\sqrt{T})$ regret in adversarial settings and $\tilde O(L_{\max}/\alpha)$ regret in $\alpha$-acutely convex stochastic settings, without requiring any prior information about any parameters. We further showed that a quadratic lower bound on the expected loss implies acute convexity, so that while strong convexity is sufficient for acute convexity, other important loss families, such as the absolute loss, may also be acutely convex.

Since FREEREXMOMENTUM does not require prior information about any problem parameters, it does not require any hyperparameter tuning to be assured of good convergence. Therefore, the user need not actually know whether a particular problem is adversarial or acutely convex and stochastic, or really much of anything at all about the problem, in order to use FREEREXMOMENTUM.

There are still many interesting open questions in this area. First, we would like to find an efficient way to implement the FREEREXMOMENTUM algorithm or some variant directly, without appealing to coordinate-wise updates; this would enable us to remove the factor of $d$ we incur by using coordinate-wise updates. Second, our modification to FREEREX is extremely simple and intuitive, but our analysis makes use of some of the internal logic of FREEREX. It is possible, however, that any algorithm with sufficiently low regret can be modified in a similar way to achieve our results. Finally, we observe that while $\log^4(T)$ is much better than $\sqrt{T}$ asymptotically, it turns out that $\log^4(T) > \sqrt{T}$ for all $T < 10^{11}$, which casts the practical relevance of our logarithmic bounds in doubt. Therefore, we hope that this work serves as a starting point for either new analysis or algorithm design that further simplifies and improves regret bounds.

## References

[1] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928-936, 2003.

[2] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

[3] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008.

[4] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.

[5] H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.

[6] Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems 29, pages 577-585. Curran Associates, Inc., 2016.

[7] Ashok Cutkosky and Kwabena Boahen. Online learning without prior information. arXiv preprint arXiv:1703.02629, 2017.

[8] Tim van Erven and Wouter M. Koolen. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, pages 3666-3674. Curran Associates, Inc., 2016.

[9] Wouter M. Koolen, Peter Grünwald, and Tim van Erven. Combining adversarial guarantees and stochastic fast rates in online learning. In Advances in Neural Information Processing Systems, pages 4457-4465, 2016.

[10] Francesco Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems, pages 1806-1814, 2013.

[11] Ashok Cutkosky and Kwabena A. Boahen. Online convex optimization with unconstrained domains and losses. In Advances in Neural Information Processing Systems 29, pages 748-756. Curran Associates, Inc., 2016.

[12] Brendan McMahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, pages 2402-2410, 2012.

[13] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169-192, 2007.

[14] Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In NIPS, volume 20, pages 65-72, 2007.

[15] Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. In Advances in Neural Information Processing Systems, pages 810-818, 2014.