# Coin Betting and Parameter-Free Online Learning

Francesco Orabona (Stony Brook University, Stony Brook, NY), francesco@orabona.com
Dávid Pál (Yahoo Research, New York, NY), dpal@yahoo-inc.com

In recent years, a number of parameter-free algorithms have been developed for online linear optimization over Hilbert spaces and for learning with expert advice. These algorithms achieve optimal regret bounds that depend on the unknown competitors, without having to tune the learning rates with oracle choices. We present a new intuitive framework to design parameter-free algorithms for both online linear optimization over Hilbert spaces and for learning with expert advice, based on reductions to betting on outcomes of adversarial coins. We instantiate it using a betting algorithm based on the Krichevsky-Trofimov estimator. The resulting algorithms are simple, with no parameters to be tuned, and they improve or match previous results in terms of regret guarantee and per-round complexity.

## 1 Introduction

We consider the Online Linear Optimization (OLO) [4, 25] setting. In each round $t$, an algorithm chooses a point $w_t$ from a convex decision set $K$ and then receives a reward vector $g_t$. The algorithm's goal is to keep its regret small, defined as the difference between its cumulative reward and the cumulative reward of a fixed strategy $u \in K$, that is

$$\mathrm{Regret}_T(u) = \sum_{t=1}^T \langle g_t, u\rangle - \sum_{t=1}^T \langle g_t, w_t\rangle .$$

We focus on two particular decision sets, the $N$-dimensional probability simplex $\Delta_N = \{x \in \mathbb{R}^N : x \ge 0, \|x\|_1 = 1\}$ and a Hilbert space $\mathcal{H}$. OLO over $\Delta_N$ is referred to as the problem of Learning with Expert Advice (LEA). We assume bounds on the norms of the reward vectors: for OLO over $\mathcal{H}$ we assume that $\|g_t\| \le 1$, and for LEA we assume that $g_t \in [0, 1]^N$.

OLO is a basic building block of many machine learning problems. For example, Online Convex Optimization (OCO), the problem analogous to OLO where $\langle g_t, u\rangle$ is generalized to an arbitrary convex function $\ell_t(u)$, is solved through a reduction to OLO [25]. LEA [17, 27, 5] provides a way of combining classifiers and it is at the heart of boosting [12]. Batch and stochastic convex optimization can also be solved through a reduction to OLO [25].

To achieve optimal regret, most of the existing online algorithms require the user to set the learning rate (step size) $\eta$ to an unknown/oracle value. For example, to obtain the optimal bound for Online Gradient Descent (OGD), the learning rate has to be set with knowledge of the norm of the competitor $u$, $\|u\|$; see the second entry in Table 1. Likewise, the optimal learning rate for Hedge depends on the KL divergence between the prior weighting $\pi$ and the unknown competitor $u$, $D(u\|\pi)$; see the seventh entry in Table 1.

Recently, new parameter-free algorithms have been proposed, both for LEA [6, 8, 18, 19, 15, 11] and for OLO/OCO over Hilbert spaces [26, 23, 21, 22, 24]. These algorithms adapt to the number of experts and to the norm of the optimal predictor, respectively, without the need to tune parameters. However, their design and underlying intuition is still a challenge. Foster et al. [11] proposed a unified framework, but it is not constructive. Furthermore, all existing algorithms for LEA have either a sub-optimal regret bound (e.g., an extra $O(\log \log T)$ factor) or a sub-optimal running time (e.g., requiring the solution of a numerical problem in every round, or with extra factors); see Table 1.
| Algorithm | Worst-case regret guarantee | Per-round time complexity |
| --- | --- | --- |
| OGD, $\eta = \frac{1}{\sqrt{T}}$ [25] | $O((1 + \lVert u\rVert^2)\sqrt{T})$, $\forall u \in \mathcal{H}$ | $O(1)$ |
| OGD, $\eta = \frac{U}{\sqrt{T}}$ | $O(U\sqrt{T})$, for any $u \in \mathcal{H}$ s.t. $\lVert u\rVert \le U$ | $O(1)$ |
| [23] | $O(\lVert u\rVert \ln(1 + \lVert u\rVert T)\sqrt{T})$, $\forall u \in \mathcal{H}$ | $O(1)$ |
| [22, 24] | $O(\lVert u\rVert \sqrt{T \ln(1 + \lVert u\rVert T)})$, $\forall u \in \mathcal{H}$ | $O(1)$ |
| This paper, Sec. 7.1 | $O(\lVert u\rVert \sqrt{T \ln(1 + \lVert u\rVert T)})$, $\forall u \in \mathcal{H}$ | $O(1)$ |
| Hedge, $\eta = \sqrt{\frac{\ln N}{T}}$ | $O(\sqrt{T \ln N})$, $\forall u \in \Delta_N$ | $O(N)$ |
| Hedge, $\eta = \frac{U}{\sqrt{T}}$ | $O(U\sqrt{T})$, for any $u \in \Delta_N$ s.t. $\sqrt{D(u \Vert \pi)} \le U$ | $O(N)$ |
| [6] | $O(\sqrt{T(1 + D(u \Vert \pi))} + \ln^2 N)$, $\forall u \in \Delta_N$ | $O(N K)$¹ |
| [8] | $O(\sqrt{T(1 + D(u \Vert \pi))})$, $\forall u \in \Delta_N$ | $O(N K)$¹ |
| [8, 19, 15]² | $O(\sqrt{T(\ln\ln T + D(u \Vert \pi))})$, $\forall u \in \Delta_N$ | $O(N)$ |
| [11] | $O(\sqrt{T(1 + D(u \Vert \pi))})$, $\forall u \in \Delta_N$ | $O(N \ln \max_{u \in \Delta_N} D(u \Vert \pi))$³ |
| This paper, Sec. 7.2 | $O(\sqrt{T(1 + D(u \Vert \pi))})$, $\forall u \in \Delta_N$ | $O(N)$ |

Table 1: Algorithms for OLO over Hilbert space and LEA.

¹ These algorithms require solving a numerical problem at each step. The number $K$ is the number of steps needed to reach the required precision. Neither the precision nor $K$ is calculated in these papers.
² The proof in [15] can be modified to prove a KL bound; see http://blog.wouterkoolen.info.
³ A variant of the algorithm in [11] can be implemented with the stated time complexity [10].

**Contributions.** We show that a more fundamental notion subsumes both OLO and LEA parameter-free algorithms. We prove that the ability to maximize the wealth in bets on the outcomes of coin flips implies OLO and LEA parameter-free algorithms. We develop a novel potential-based framework for betting algorithms. It gives intuition to previous constructions and, instantiated with the Krichevsky-Trofimov estimator, provides new and elegant algorithms for OLO and LEA. The new algorithms also have optimal worst-case guarantees on regret and time complexity; see Table 1.

## 2 Preliminaries

We begin by providing some definitions. The Kullback-Leibler (KL) divergence between two discrete distributions $p$ and $q$ is $D(p\|q) = \sum_i p_i \ln(p_i/q_i)$. If $p, q$ are real numbers in $[0, 1]$, we denote by $D(p\|q) = p\ln(p/q) + (1-p)\ln((1-p)/(1-q))$ the KL divergence between two Bernoulli distributions with parameters $p$ and $q$. We denote by $\mathcal{H}$ a Hilbert space, by $\langle\cdot, \cdot\rangle$ its inner product, and by $\|\cdot\|$ the induced norm. We denote by $\|\cdot\|_1$ the 1-norm in $\mathbb{R}^N$. A function $F : I \to \mathbb{R}_+$ is called logarithmically convex iff $f(x) = \ln(F(x))$ is convex. Let $f : V \to \mathbb{R} \cup \{+\infty\}$; the Fenchel conjugate of $f$ is $f^* : V^* \to \mathbb{R} \cup \{+\infty\}$, defined on the dual vector space $V^*$ by $f^*(\theta) = \sup_{x \in V} \langle\theta, x\rangle - f(x)$. A function $f : V \to \mathbb{R} \cup \{+\infty\}$ is said to be proper if there exists $x \in V$ such that $f(x)$ is finite. If $f$ is a proper lower semi-continuous convex function, then $f^*$ is also proper lower semi-continuous convex and $f^{**} = f$.

**Coin Betting.** We consider a gambler making repeated bets on the outcomes of adversarial coin flips. The gambler starts with an initial endowment $\epsilon > 0$. In each round $t$, he bets on the outcome of a coin flip $g_t \in \{-1, 1\}$, where $+1$ denotes heads and $-1$ denotes tails. We do not make any assumption on how $g_t$ is generated; that is, it can be chosen by an adversary.

The gambler can bet any amount on either heads or tails. However, he is not allowed to borrow any additional money. If he loses, he loses the betted amount; if he wins, he gets the betted amount back and, in addition to that, he gets the same amount as a reward. We encode the gambler's bet in round $t$ by a single number $w_t$. The sign of $w_t$ encodes whether he is betting on heads or tails. The absolute value encodes the betted amount. We define $\mathrm{Wealth}_t$ as the gambler's wealth at the end of round $t$ and $\mathrm{Reward}_t$ as the gambler's net reward (the difference between wealth and initial endowment), that is

$$\mathrm{Wealth}_t = \epsilon + \sum_{i=1}^{t} w_i g_i \qquad \text{and} \qquad \mathrm{Reward}_t = \mathrm{Wealth}_t - \epsilon . \qquad (1)$$

In the following, we will also refer to a bet with $\beta_t$, where $\beta_t$ is such that

$$w_t = \beta_t\, \mathrm{Wealth}_{t-1} . \qquad (2)$$

The absolute value of $\beta_t$ is the fraction of the current wealth to bet, and the sign of $\beta_t$ encodes whether he is betting on heads or tails. The constraint that the gambler cannot borrow money implies that $\beta_t \in [-1, 1]$. We also generalize the problem slightly by allowing the outcome of the coin flip $g_t$ to be any real number in the interval $[-1, 1]$; wealth and reward in (1) remain exactly the same.
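To make the bookkeeping in (1) and (2) concrete, here is a minimal Python sketch of the betting protocol. It is ours, not from the paper: the names (`CoinBettor`, `bet`, `update`) are illustrative, and the rule choosing $\beta_t$ is left abstract, to be supplied by a strategy such as those of Section 4.

```python
class CoinBettor:
    """Bookkeeping for the coin-betting protocol of Section 2.

    `strategy` is any callable mapping the past outcomes g_1, ..., g_{t-1}
    to a betting fraction beta_t in [-1, 1]; see equation (2).
    """

    def __init__(self, strategy, endowment=1.0):
        self.strategy = strategy
        self.endowment = endowment
        self.wealth = endowment   # Wealth_0 = epsilon
        self.past = []            # past coin flips g_1, ..., g_{t-1}

    def bet(self):
        """Signed bet w_t = beta_t * Wealth_{t-1}."""
        beta = self.strategy(self.past)
        assert -1.0 <= beta <= 1.0, "the gambler cannot borrow money"
        return beta * self.wealth

    def update(self, w, g):
        """Observe the outcome g_t in [-1, 1]: Wealth_t = Wealth_{t-1} + w_t * g_t."""
        self.wealth += w * g
        self.past.append(g)

    def reward(self):
        """Net reward of equation (1): Reward_t = Wealth_t - epsilon."""
        return self.wealth - self.endowment
```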
## 3 Warm-Up: From Betting to One-Dimensional Online Linear Optimization

In this section, we sketch how to reduce one-dimensional OLO to betting on a coin. The reasoning for generic Hilbert spaces (Section 5) and for LEA (Section 6) will be similar. We will show that the betting view provides a natural way for the analysis and design of online learning algorithms, where the only design choice is the potential function of the betting algorithm (Section 4). A specific example of a coin betting potential and the resulting algorithms are in Section 7.

As a warm-up, let us consider an algorithm for OLO over the one-dimensional Hilbert space $\mathbb{R}$. Let $\{w_t\}_{t=1}^{\infty}$ be its sequence of predictions on a sequence of rewards $\{g_t\}_{t=1}^{\infty}$, $g_t \in [-1, 1]$. The total reward of the algorithm after $t$ rounds is $\mathrm{Reward}_t = \sum_{i=1}^{t} g_i w_i$. Also, even though in OLO there is no concept of "wealth", define the wealth of the OLO algorithm as $\mathrm{Wealth}_t = \epsilon + \mathrm{Reward}_t$, as in (1). We now restrict our attention to algorithms whose predictions $w_t$ are of the form of a bet, that is $w_t = \beta_t\, \mathrm{Wealth}_{t-1}$, where $\beta_t \in [-1, 1]$. We will see that the restriction on $\beta_t$ does not prevent us from obtaining parameter-free algorithms with optimal bounds.

Given the above, it is immediate to see that any coin betting algorithm that, on a sequence of coin flips $\{g_t\}_{t=1}^{\infty}$, $g_t \in [-1, 1]$, bets the amounts $w_t$ can be used as an OLO algorithm in the one-dimensional Hilbert space $\mathbb{R}$. But what would be the regret of such OLO algorithms? Assume that the betting algorithm at hand guarantees that its wealth is at least $F(\sum_{t=1}^T g_t)$ starting from an endowment $\epsilon$, for a given potential function $F$. Then

$$\mathrm{Reward}_T = \sum_{t=1}^T g_t w_t = \mathrm{Wealth}_T - \epsilon \ \ge\ F\Big(\sum_{t=1}^T g_t\Big) - \epsilon . \qquad (3)$$

Intuitively, if the reward is big we can expect the regret to be small. Indeed, the following lemma converts the lower bound on the reward to an upper bound on the regret.

**Lemma 1** (Reward-Regret relationship [22]). Let $V, V^*$ be a pair of dual vector spaces. Let $F : V \to \mathbb{R} \cup \{+\infty\}$ be a proper convex lower semi-continuous function and let $F^* : V^* \to \mathbb{R} \cup \{+\infty\}$ be its Fenchel conjugate. Let $w_1, w_2, \ldots, w_T \in V^*$ and $g_1, g_2, \ldots, g_T \in V$. Let $\epsilon \in \mathbb{R}$. Then

$$\underbrace{\sum_{t=1}^T \langle g_t, w_t\rangle}_{\mathrm{Reward}_T} \ \ge\ F\Big(\sum_{t=1}^T g_t\Big) - \epsilon$$

if and only if

$$\forall u \in V^*, \qquad \underbrace{\sum_{t=1}^T \langle g_t, u\rangle - \sum_{t=1}^T \langle g_t, w_t\rangle}_{\mathrm{Regret}_T(u)} \ \le\ F^*(u) + \epsilon .$$

Applying the lemma, we get a regret upper bound: $\mathrm{Regret}_T(u) \le F^*(u) + \epsilon$ for all $u \in \mathcal{H}$.

To summarize, if we have a betting algorithm that guarantees a minimum wealth of $F(\sum_{t=1}^T g_t)$, it can be used to design and analyze a one-dimensional OLO algorithm. The faster the growth of the wealth, the smaller the regret will be. Moreover, the lemma also shows that trying to design an algorithm that is adaptive to $u$ is equivalent to designing an algorithm that is adaptive to $\sum_{t=1}^T g_t$. Also, most importantly, methods that guarantee optimal wealth for the betting scenario are already known, see, e.g., [4, Chapter 9]. We can just re-use them to get optimal online algorithms!
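The reduction just described is mechanical. The following minimal sketch (ours, not from the paper) uses an arbitrary betting-fraction rule as a one-dimensional OLO learner and measures its regret against a fixed competitor $u$; all names are illustrative.

```python
def run_1d_olo(betting_fraction, rewards, u, endowment=1.0):
    """Use a coin-betting strategy as a one-dimensional OLO algorithm (Section 3).

    `betting_fraction(past)` maps the past rewards g_1, ..., g_{t-1} to a
    fraction beta_t in [-1, 1]; `rewards` is the sequence g_t in [-1, 1];
    `u` is a fixed competitor. Returns the total reward and the regret to u.
    """
    wealth = endowment                          # Wealth_0 = epsilon
    past, total_reward, competitor_reward = [], 0.0, 0.0
    for g in rewards:
        w = betting_fraction(past) * wealth     # prediction w_t = beta_t * Wealth_{t-1}
        wealth += w * g                         # Wealth_t = Wealth_{t-1} + g_t * w_t
        total_reward += g * w
        competitor_reward += g * u
        past.append(g)
    return total_reward, competitor_reward - total_reward

# Example with the adaptive Kelly / KT fraction of Section 4: beta_t = (g_1 + ... + g_{t-1}) / t.
reward, regret = run_1d_olo(lambda past: sum(past) / (len(past) + 1), [0.5, -1.0, 1.0, 0.3], u=2.0)
```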
## 4 Designing a Betting Algorithm: Coin Betting Potentials

For sequential betting on i.i.d. coin flips, an optimal strategy has been proposed by Kelly [14]. The strategy assumes that the coin flips $\{g_t\}_{t=1}^{\infty}$, $g_t \in \{+1, -1\}$, are generated i.i.d. with known probability of heads. If $p \in [0, 1]$ is the probability of heads, the Kelly bet is to bet $\beta_t = 2p - 1$ at each round. He showed that, in the long run, this strategy will provide more wealth than betting any other fixed fraction of the current wealth [14].

For adversarial coins, Kelly betting does not make sense. With perfect knowledge of the future, the gambler could always bet everything on the right outcome. Hence, after $T$ rounds from an initial endowment $\epsilon$, the maximum wealth he could get is $\epsilon 2^T$. Instead, assume he bets the same fraction $\beta$ of his wealth at each round. Let $\mathrm{Wealth}_t(\beta)$ be the wealth of such a strategy after $t$ rounds. As observed in [21], the optimal fixed fraction to bet is $\beta^* = (\sum_{t=1}^T g_t)/T$ and it gives the wealth

$$\mathrm{Wealth}_T(\beta^*) = \epsilon \exp\left(T \cdot D\Big(\tfrac{1}{2} + \tfrac{\sum_{t=1}^T g_t}{2T} \,\Big\|\, \tfrac{1}{2}\Big)\right) \ \ge\ \epsilon \exp\left(\frac{\big(\sum_{t=1}^T g_t\big)^2}{2T}\right), \qquad (4)$$

where the inequality follows from Pinsker's inequality [9, Lemma 11.6.1].

However, even without knowledge of the future, it is possible to get very close to the wealth in (4). This problem was studied by Krichevsky and Trofimov [16], who proposed that after seeing the coin flips $g_1, g_2, \ldots, g_{t-1}$ the empirical estimate $k_t = \frac{1/2 + \sum_{i=1}^{t-1} \mathbf{1}[g_i = +1]}{t}$ should be used instead of $p$. Their estimate is commonly called the KT estimator.¹ The KT estimator results in the betting

$$\beta_t = 2k_t - 1 = \frac{\sum_{i=1}^{t-1} g_i}{t}, \qquad (5)$$

which we call adaptive Kelly betting based on the KT estimator. It looks like an online and slightly biased version of the oracle choice of $\beta^*$. This strategy guarantees²

$$\mathrm{Wealth}_T \ \ge\ \frac{\mathrm{Wealth}_T(\beta^*)}{2\sqrt{T}} \ =\ \frac{\epsilon}{2\sqrt{T}}\, \exp\left(T \cdot D\Big(\tfrac{1}{2} + \tfrac{\sum_{t=1}^T g_t}{2T} \,\Big\|\, \tfrac{1}{2}\Big)\right).$$

This guarantee is optimal up to constant factors [4] and mirrors the guarantee of the Kelly bet.

¹ Compared to the maximum likelihood estimate $\frac{\sum_{i=1}^{t-1} \mathbf{1}[g_i=+1]}{t-1}$, the KT estimator shrinks slightly towards $1/2$.
² See Appendix A for a proof. For lack of space, all the appendices are in the supplementary material.

Here, we propose a new set of definitions that allows us to generalize the strategy of adaptive Kelly betting based on the KT estimator. For these strategies it will be possible to prove that, for any $g_1, g_2, \ldots, g_t \in [-1, 1]$,

$$\mathrm{Wealth}_t \ \ge\ F_t\Big(\sum_{i=1}^{t} g_i\Big), \qquad (6)$$

where $F_t(x)$ is a certain function. We call such functions potentials. The betting strategy will be determined uniquely by the potential (see (c) in Definition 2), and we restrict our attention to potentials for which (6) holds. These constraints are specified in the definition below.

**Definition 2** (Coin Betting Potential). Let $\epsilon > 0$. Let $\{F_t\}_{t=0}^{\infty}$ be a sequence of functions $F_t : (-a_t, a_t) \to \mathbb{R}_+$ where $a_t > t$. The sequence $\{F_t\}_{t=0}^{\infty}$ is called a sequence of coin betting potentials for initial endowment $\epsilon$ if it satisfies the following three conditions:

(a) $F_0(0) = \epsilon$.

(b) For every $t \ge 0$, $F_t(x)$ is even, logarithmically convex, strictly increasing on $[0, a_t)$, and $\lim_{x \to a_t} F_t(x) = +\infty$.

(c) For every $t \ge 1$, every $x \in [-(t-1), t-1]$ and every $g \in [-1, 1]$,

$$(1 + g\beta_t)\, F_{t-1}(x) \ \ge\ F_t(x + g), \qquad \text{where} \qquad \beta_t = \frac{F_t(x+1) - F_t(x-1)}{F_t(x+1) + F_t(x-1)} . \qquad (7)$$

The sequence $\{F_t\}_{t=0}^{\infty}$ is called a sequence of excellent coin betting potentials for initial endowment $\epsilon$ if it satisfies conditions (a)-(c) and condition (d) below.

(d) For every $t \ge 0$, $F_t$ is twice-differentiable and satisfies $x\, F_t''(x) \ge F_t'(x)$ for every $x \in [0, a_t)$.

Let us give some intuition on this definition. First, let us show by induction on $t$ that (b) and (c) of the definition, together with (2), give a betting strategy that satisfies (6). The base case $t = 0$ is trivial. At time $t \ge 1$, bet $w_t = \beta_t\, \mathrm{Wealth}_{t-1}$ where $\beta_t$ is defined in (7); then

$$\mathrm{Wealth}_t = \mathrm{Wealth}_{t-1} + w_t g_t = (1 + g_t\beta_t)\, \mathrm{Wealth}_{t-1} \ \ge\ (1 + g_t\beta_t)\, F_{t-1}\Big(\sum_{i=1}^{t-1} g_i\Big) \ \ge\ F_t\Big(\sum_{i=1}^{t-1} g_i + g_t\Big) .$$
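As an illustration (ours, not part of the paper), the following sketch runs the adaptive Kelly strategy (5) on an arbitrary sequence of $\pm 1$ flips and compares its final wealth with the best-fixed-fraction wealth (4) and with the guarantee $\mathrm{Wealth}_T(\beta^*)/(2\sqrt{T})$ stated above; the helper names are ours.

```python
import math

def kt_fraction(past):
    """Adaptive Kelly betting fraction from the KT estimator, equation (5)."""
    return sum(past) / (len(past) + 1)

def kl_bernoulli(p, q):
    """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

def compare_with_best_fixed_fraction(flips, endowment=1.0):
    # Wealth of the adaptive strategy (5), built multiplicatively from (1) and (2).
    wealth, past = endowment, []
    for g in flips:
        wealth *= 1.0 + kt_fraction(past) * g
        past.append(g)
    # Best fixed-fraction wealth (4) and the guarantee Wealth_T(beta*) / (2 sqrt(T)).
    T, s = len(flips), sum(flips)
    best_fixed = endowment * math.exp(T * kl_bernoulli(0.5 + s / (2 * T), 0.5))
    return wealth, best_fixed, best_fixed / (2 * math.sqrt(T))

print(compare_with_best_fixed_fraction([1, 1, -1, 1, 1, 1, -1, 1]))
```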
The formula for the potential-based strategy (7) might seem strange. However, it is derived (see Theorem 8 in Appendix B) by minimizing over $\beta_t$ the worst-case value, with respect to $g_t$, of the right-hand side of the inequality used in the induction proof above, rewritten as $F_{t-1}(x) \ge \frac{F_t(x + g_t)}{1 + g_t \beta_t}$. The last point, (d), is a technical condition that allows us to seamlessly reduce OLO over a Hilbert space to the one-dimensional problem, characterizing the worst-case direction for the reward vectors.

Regarding the design of coin betting potentials, we expect any potential that approximates the best possible wealth in (4) to be a good candidate. In fact, $F_t(x) = \epsilon \exp\big(x^2/(2t)\big)/\sqrt{t}$, essentially the potential used in the parameter-free algorithms in [22, 24] for OLO and in [6, 18, 19] for LEA, approximates (4) and is an excellent coin betting potential; see Theorem 9 in Appendix B. Hence, our framework provides intuition for previous constructions, and in Section 7 we show new examples of coin betting potentials. In the next two sections, we present the reductions that effortlessly solve both the generic OLO case and LEA with a betting potential.

## 5 From Coin Betting to OLO over Hilbert Space

In this section, generalizing the one-dimensional construction in Section 3, we show how to use a sequence of excellent coin betting potentials $\{F_t\}_{t=0}^{\infty}$ to construct an algorithm for OLO over a Hilbert space and how to prove a regret bound for it. We define reward and wealth analogously to the one-dimensional case: $\mathrm{Reward}_t = \sum_{i=1}^{t} \langle g_i, w_i\rangle$ and $\mathrm{Wealth}_t = \epsilon + \mathrm{Reward}_t$. Given a sequence of coin betting potentials $\{F_t\}_{t=0}^{\infty}$, using (7) we define the fraction

$$\beta_t = \frac{F_t\big(\big\|\sum_{i=1}^{t-1} g_i\big\| + 1\big) - F_t\big(\big\|\sum_{i=1}^{t-1} g_i\big\| - 1\big)}{F_t\big(\big\|\sum_{i=1}^{t-1} g_i\big\| + 1\big) + F_t\big(\big\|\sum_{i=1}^{t-1} g_i\big\| - 1\big)} . \qquad (8)$$

The prediction of the OLO algorithm is defined similarly to the one-dimensional case, but now we also need a direction in the Hilbert space:

$$w_t = \beta_t\, \mathrm{Wealth}_{t-1}\, \frac{\sum_{i=1}^{t-1} g_i}{\big\|\sum_{i=1}^{t-1} g_i\big\|} . \qquad (9)$$

If $\sum_{i=1}^{t-1} g_i$ is the zero vector, we define $w_t$ to be the zero vector as well. For this prediction strategy we can prove the following regret guarantee, proved in Appendix C. The proof reduces the general Hilbert case to the one-dimensional case, thanks to (d) in Definition 2, and then follows the reasoning of Section 3.

**Theorem 3** (Regret Bound for OLO in Hilbert Spaces). Let $\{F_t\}_{t=0}^{\infty}$ be a sequence of excellent coin betting potentials. Let $\{g_t\}_{t=1}^{\infty}$ be any sequence of reward vectors in a Hilbert space $\mathcal{H}$ such that $\|g_t\| \le 1$ for all $t$. Then the algorithm that makes the prediction $w_t$ defined by (9) and (8) satisfies

$$\forall T \ge 0, \quad \forall u \in \mathcal{H}, \qquad \mathrm{Regret}_T(u) \ \le\ F_T^*(\|u\|) + \epsilon .$$
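A minimal NumPy sketch (ours, not from the paper) of the prediction rule (8)-(9); `potential(t, x)` stands for any implementation of an excellent coin betting potential $F_t(x)$, and all names are illustrative.

```python
import numpy as np

def hilbert_olo_prediction(potential, t, past_g, wealth, dim):
    """Compute the prediction w_t of equations (8) and (9).

    `potential(t, x)` evaluates F_t(x); `past_g` is the list of past reward
    vectors g_1, ..., g_{t-1}; `wealth` is Wealth_{t-1}; `dim` is the dimension.
    """
    s = np.sum(past_g, axis=0) if past_g else np.zeros(dim)
    norm = float(np.linalg.norm(s))
    if norm == 0.0:
        return np.zeros(dim)                           # w_t = 0 when the past rewards sum to zero
    f_plus, f_minus = potential(t, norm + 1.0), potential(t, norm - 1.0)
    beta = (f_plus - f_minus) / (f_plus + f_minus)     # equation (8)
    return beta * wealth * s / norm                    # equation (9)
```

As described in Section 7.1, instantiating the potential with the KT potential makes $\beta_t$ collapse to a closed form and yields Algorithm 1.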
## 6 From Coin Betting to Learning with Expert Advice

In this section, we show how to use the algorithm for OLO over the one-dimensional Hilbert space $\mathbb{R}$ from Section 3, which is itself based on a coin betting strategy, to construct an algorithm for LEA. Let $N \ge 2$ be the number of experts and $\Delta_N$ be the $N$-dimensional probability simplex. Let $\pi = (\pi_1, \pi_2, \ldots, \pi_N) \in \Delta_N$ be any prior distribution. Let $\mathcal{A}$ be an algorithm for OLO over the one-dimensional Hilbert space $\mathbb{R}$, based on a sequence of coin betting potentials $\{F_t\}_{t=0}^{\infty}$ with initial endowment³ $1$. We instantiate $N$ copies of $\mathcal{A}$.

Consider any round $t$. Let $w_{t,i} \in \mathbb{R}$ be the prediction of the $i$-th copy of $\mathcal{A}$. The LEA algorithm computes $\widehat{p}_t = (\widehat{p}_{t,1}, \widehat{p}_{t,2}, \ldots, \widehat{p}_{t,N}) \in \mathbb{R}_{\ge 0}^N$ as

$$\widehat{p}_{t,i} = \pi_i\, [w_{t,i}]_+ , \qquad (10)$$

where $[x]_+ = \max\{0, x\}$ is the positive part of $x$. Then the LEA algorithm predicts $p_t = (p_{t,1}, p_{t,2}, \ldots, p_{t,N}) \in \Delta_N$ as

$$p_t = \frac{\widehat{p}_t}{\|\widehat{p}_t\|_1} . \qquad (11)$$

If $\|\widehat{p}_t\|_1 = 0$, the algorithm predicts the prior $\pi$. Then the algorithm receives the reward vector $g_t = (g_{t,1}, g_{t,2}, \ldots, g_{t,N}) \in [0, 1]^N$. Finally, it feeds the reward to each copy of $\mathcal{A}$. The reward for the $i$-th copy of $\mathcal{A}$ is $\widetilde{g}_{t,i} \in [-1, 1]$, defined as

$$\widetilde{g}_{t,i} = \begin{cases} g_{t,i} - \langle g_t, p_t\rangle & \text{if } w_{t,i} > 0 , \\ \big[g_{t,i} - \langle g_t, p_t\rangle\big]_+ & \text{if } w_{t,i} \le 0 . \end{cases} \qquad (12)$$

³ Any initial endowment $\epsilon > 0$ can be rescaled to $1$. Instead of $F_t(x)$ we would use $F_t(x)/\epsilon$. The $w_t$ would become $w_t/\epsilon$, but $p_t$ is invariant to scaling of $w_t$. Hence, the LEA algorithm is the same regardless of $\epsilon$.

The construction above defines a LEA algorithm, given by the predictions $p_t$, based on the algorithm $\mathcal{A}$. We can prove the following regret bound for it.

**Theorem 4** (Regret Bound for Experts). Let $\mathcal{A}$ be an algorithm for OLO over the one-dimensional Hilbert space $\mathbb{R}$, based on the coin betting potentials $\{F_t\}_{t=0}^{\infty}$ for an initial endowment of $1$. Let $f_t^{-1}$ be the inverse of $f_t(x) = \ln(F_t(x))$ restricted to $[0, \infty)$. Then the regret of the LEA algorithm with prior $\pi \in \Delta_N$ that predicts at each round with $p_t$ in (11) satisfies

$$\forall T \ge 0, \quad \forall u \in \Delta_N, \qquad \mathrm{Regret}_T(u) \ \le\ f_T^{-1}\big(D(u\|\pi)\big) .$$

The proof, in Appendix D, is based on the fact that (10)-(12) guarantee that $\sum_{i=1}^N \pi_i\, \widetilde{g}_{t,i}\, w_{t,i} \le 0$, and on a variation of the change of measure lemma used in the PAC-Bayes literature, e.g. [20].

## 7 Applications of the Krichevsky-Trofimov Estimator to OLO and LEA

In the previous sections, we have shown that a coin betting potential with a guaranteed rapid growth of the wealth will give good regret guarantees for OLO and LEA. Here, we show that the KT estimator has an associated excellent coin betting potential, which we call the KT potential. Then, the optimal wealth guarantee of the KT potentials will translate to optimal parameter-free regret bounds.

The sequence of excellent coin betting potentials for an initial endowment $\epsilon$, corresponding to the adaptive Kelly betting strategy $\beta_t$ defined by (5) based on the KT estimator, is

$$F_t(x) = \epsilon\, \frac{2^t\, \Gamma\big(\frac{t+x+1}{2}\big)\, \Gamma\big(\frac{t-x+1}{2}\big)}{\pi\, t!} , \qquad t \ge 0, \quad x \in (-t-1, t+1) , \qquad (13)$$

where $\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt$ is Euler's gamma function; see Theorem 13 in Appendix E. This potential was used to prove regret bounds for online prediction with the logarithmic loss [16][4, Chapter 9.7]. Theorem 13 also shows that the KT betting strategy $\beta_t$ as defined by (5) satisfies (7). This potential has the nice property that it satisfies the inequality in (c) of Definition 2 with equality when $g_t \in \{-1, 1\}$, i.e., $F_t(x + g_t) = (1 + g_t\beta_t)\, F_{t-1}(x)$.

We also generalize the KT potentials to δ-shifted KT potentials, where $\delta \ge 0$, defined as

$$F_t(x) = \frac{2^t\, \Gamma(\delta + 1)\, \Gamma\big(\frac{t+\delta+x+1}{2}\big)\, \Gamma\big(\frac{t+\delta-x+1}{2}\big)}{\Gamma\big(\frac{\delta+1}{2}\big)^2\, \Gamma(t+\delta+1)} .$$

The reason for the name is that, up to a multiplicative constant, $F_t$ is equal to the KT potential shifted in time by $\delta$. Theorem 13 also proves that the δ-shifted KT potentials are excellent coin betting potentials with initial endowment $1$, and the corresponding betting fraction is $\beta_t = \frac{\sum_{j=1}^{t-1} g_j}{\delta + t}$.
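As a quick sanity check (ours, not part of the paper), the following sketch evaluates the KT potential (13) through the log-gamma function and verifies numerically that, on $\pm 1$ outcomes, the wealth of the adaptive Kelly/KT strategy (5) coincides with $F_t\big(\sum_i g_i\big)$, i.e., that condition (c) holds with equality, as stated above.

```python
import math

def kt_potential(t, x, endowment=1.0):
    """KT potential F_t(x) of equation (13), computed via log-gamma for numerical stability."""
    log_f = (t * math.log(2.0)
             + math.lgamma((t + x + 1) / 2.0)
             + math.lgamma((t - x + 1) / 2.0)
             - math.log(math.pi)
             - math.lgamma(t + 1))
    return endowment * math.exp(log_f)

def kt_wealth(flips, endowment=1.0):
    """Wealth of the adaptive Kelly strategy (5) on a sequence of +/-1 coin flips."""
    wealth, s = endowment, 0
    for t, g in enumerate(flips, start=1):
        beta = s / t              # beta_t = (g_1 + ... + g_{t-1}) / t
        wealth *= 1.0 + beta * g
        s += g
    return wealth

flips = [1, -1, 1, 1, -1, 1, 1, 1]
print(kt_wealth(flips), kt_potential(len(flips), sum(flips)))  # the two numbers coincide
```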
### 7.1 OLO in Hilbert Space

We apply the KT potential to the construction of an OLO algorithm over a Hilbert space $\mathcal{H}$. We will use (9), and we just need to calculate $\beta_t$. According to Theorem 13 in Appendix E, the formula for $\beta_t$ simplifies to $\beta_t = \frac{\left\|\sum_{i=1}^{t-1} g_i\right\|}{t}$, so that

$$w_t = \frac{1}{t}\left(\epsilon + \sum_{i=1}^{t-1} \langle g_i, w_i\rangle\right) \sum_{i=1}^{t-1} g_i .$$

The resulting algorithm is stated as Algorithm 1. We derive a regret bound for it as a very simple corollary of Theorem 3 applied to the KT potential (13). The only technical part of the proof, in Appendix F, is an upper bound on $F_T^*$, since it cannot be expressed as an elementary function.

**Corollary 5** (Regret Bound for Algorithm 1). Let $\epsilon > 0$. Let $\{g_t\}_{t=1}^{\infty}$ be any sequence of reward vectors in a Hilbert space $\mathcal{H}$ such that $\|g_t\| \le 1$. Then Algorithm 1 satisfies

$$\forall T \ge 0, \quad \forall u \in \mathcal{H}, \qquad \mathrm{Regret}_T(u) \ \le\ \|u\| \sqrt{T \ln\left(1 + \frac{24 T^2 \|u\|^2}{\epsilon^2}\right)} + \epsilon\left(1 - \frac{1}{e}\right) .$$

**Algorithm 1** Algorithm for OLO over a Hilbert space $\mathcal{H}$ based on the KT potential

Require: initial endowment $\epsilon > 0$
1: for $t = 1, 2, \ldots$ do
2: Predict with $w_t \leftarrow \frac{1}{t}\left(\epsilon + \sum_{i=1}^{t-1} \langle g_i, w_i\rangle\right) \sum_{i=1}^{t-1} g_i$
3: Receive reward vector $g_t \in \mathcal{H}$ such that $\|g_t\| \le 1$
4: end for

**Algorithm 2** Algorithm for Learning with Expert Advice based on the δ-shifted KT potential

Require: number of experts $N$, prior distribution $\pi \in \Delta_N$, number of rounds $T$
1: for $t = 1, 2, \ldots, T$ do
2: For each $i \in [N]$, set $w_{t,i} \leftarrow \frac{\sum_{j=1}^{t-1} \widetilde{g}_{j,i}}{t + T/2}\left(1 + \sum_{j=1}^{t-1} \widetilde{g}_{j,i}\, w_{j,i}\right)$
3: For each $i \in [N]$, set $\widehat{p}_{t,i} \leftarrow \pi_i\, [w_{t,i}]_+$
4: Predict with $p_t \leftarrow \widehat{p}_t / \|\widehat{p}_t\|_1$ if $\|\widehat{p}_t\|_1 > 0$, otherwise $p_t \leftarrow \pi$
5: Receive reward vector $g_t \in [0, 1]^N$
6: For each $i \in [N]$, set $\widetilde{g}_{t,i} \leftarrow g_{t,i} - \langle g_t, p_t\rangle$ if $w_{t,i} > 0$, and $\widetilde{g}_{t,i} \leftarrow [g_{t,i} - \langle g_t, p_t\rangle]_+$ if $w_{t,i} \le 0$
7: end for

It is worth noting the elegance and extreme simplicity of Algorithm 1; contrast it with the algorithms in [26, 22, 23, 24]. Also, the regret bound is optimal [26, 23]. The parameter $\epsilon$ can be safely set to any constant, e.g. $1$. Its role is equivalent to the initial guess used in doubling tricks [25].

### 7.2 Learning with Expert Advice

We now construct an algorithm for LEA based on the δ-shifted KT potential. We set $\delta$ to $T/2$, requiring the algorithm to know the number of rounds $T$ in advance; we will fix this later with the standard doubling trick. To use the construction in Section 6, we need an OLO algorithm for the one-dimensional Hilbert space $\mathbb{R}$. Using the δ-shifted KT potentials, the algorithm predicts, for any sequence $\{\widetilde{g}_t\}_{t=1}^{\infty}$ of rewards,

$$w_t = \beta_t\, \mathrm{Wealth}_{t-1}, \qquad \text{where} \qquad \beta_t = \frac{\sum_{i=1}^{t-1} \widetilde{g}_i}{T/2 + t} .$$

Then, following the construction in Section 6, we arrive at the final algorithm, Algorithm 2. We can derive a regret bound for Algorithm 2 by applying Theorem 4 to the δ-shifted KT potential.

**Corollary 6** (Regret Bound for Algorithm 2). Let $N \ge 2$ and $T \ge 0$ be integers. Let $\pi \in \Delta_N$ be a prior. Then Algorithm 2, run with input $N, \pi, T$, for any reward vectors $g_1, g_2, \ldots, g_T \in [0, 1]^N$ satisfies

$$\forall u \in \Delta_N, \qquad \mathrm{Regret}_T(u) \ \le\ \sqrt{3T\big(3 + D(u\|\pi)\big)} .$$

Hence, Algorithm 2 has both the best known guarantee on worst-case regret and per-round time complexity; see Table 1. Also, it has the advantage of being very simple. The proof of the corollary is in Appendix F. The only technical part of the proof is an upper bound on $f_T^{-1}(x)$, which we conveniently obtain by lower bounding $F_T(x)$.

The reason for using the shifted potential comes from the analysis of $f_T^{-1}(x)$. The unshifted algorithm would have an $O\big(\sqrt{T(\log T + D(u\|\pi))}\big)$ regret bound; the shifting improves the bound to $O\big(\sqrt{T(1 + D(u\|\pi))}\big)$. By changing $T/2$ in Algorithm 2 to another constant fraction of $T$, it is possible to trade off between the two constants $3$ present in the square root of the regret upper bound.

The requirement of knowing the number of rounds $T$ in advance can be lifted by the standard doubling trick [25, Section 2.3.1], obtaining, for all $T \ge 0$ and $u \in \Delta_N$, an anytime guarantee of the form $\mathrm{Regret}_T(u) \le C\sqrt{3T\big(3 + D(u\|\pi)\big)}$ with a bigger leading constant $C$.
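A minimal sketch (ours) of the standard doubling trick just mentioned: restart a horizon-dependent learner on epochs of doubling length. The constructor `make_algorithm_2(N, prior, T)` and its `predict`/`update` methods are hypothetical names standing in for any implementation of Algorithm 2.

```python
def doubling_trick(make_algorithm_2, N, prior, rounds):
    """Run a horizon-dependent LEA algorithm without knowing T in advance.

    Epoch m lasts 2**m rounds; a fresh copy of the algorithm is started at the
    beginning of each epoch with the epoch length as its horizon T.
    `rounds` yields the reward vectors g_t in [0, 1]^N (oblivious adversary).
    """
    epoch, t_in_epoch, horizon = 0, 0, 1
    algo = make_algorithm_2(N, prior, horizon)
    for g in rounds:
        if t_in_epoch == horizon:            # epoch finished: double the horizon, restart
            epoch, t_in_epoch = epoch + 1, 0
            horizon = 2 ** epoch
            algo = make_algorithm_2(N, prior, horizon)
        p = algo.predict()                   # prediction p_t in the simplex
        yield p
        algo.update(g)                       # feed the reward vector g_t
        t_in_epoch += 1
```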
[Figure 1: Total loss versus the learning rate parameter of OGD (in log scale) on the Year Prediction MSD, cpusmall, and cadata datasets with the absolute loss, compared with the parameter-free algorithms DFEG [23], Adaptive Normal [22], PiSTOL [24], and the KT-based Algorithm 1.]

[Figure 2: Regret to the best expert after T = 32768 rounds, versus the learning rate parameter of Hedge (in log scale), on replicated Hadamard matrices with N = 126 experts and k ∈ {2, 8, 32} good experts. The good experts are ε = 0.025 better than the others. The competitor algorithms are NormalHedge [6], AdaNormalHedge [19], Squint [15], and the KT-based Algorithm 2. π_i = 1/N for all algorithms.]

## 8 Discussion of the Results

We have presented a new interpretation of parameter-free algorithms as coin betting algorithms. This interpretation, far from being just a mathematical gimmick, reveals the common hidden structure of previous parameter-free algorithms for both OLO and LEA and also allows the design of new algorithms. For example, we show that the characteristic of parameter-freeness is just a consequence of having an algorithm that guarantees the maximum reward possible. The reductions in Sections 5 and 6 are also novel and they are, in a certain sense, optimal. In fact, the obtained Algorithms 1 and 2 achieve the optimal worst-case upper bounds on the regret, see [26, 23] and [4] respectively.

We have also run an empirical evaluation to show that the difference between classic online learning algorithms and parameter-free ones is real and not just theoretical. In Figure 1, we have used three regression datasets⁴ and solved the OCO problem through OLO. In all three cases, we have used the absolute loss and normalized the input vectors to have L2 norm equal to 1. From the empirical results, it is clear that the optimal learning rate is completely data-dependent, yet parameter-free algorithms have performance very close to the unknown optimal tuning of the learning rate. Moreover, the KT-based Algorithm 1 seems to dominate all the other similar algorithms.

For LEA, we have used the synthetic setting in [6]. The dataset is composed of Hadamard matrices of size 64, where the row with constant values is removed, the rows are duplicated to 126 by inverting their signs, 0.025 is subtracted from k rows, and the matrix is replicated in order to generate T = 32768 samples. For more details, see [6]. Here, the KT-based algorithm is the one in Algorithm 2, where the term T/2 is removed, so that the final regret bound has an additional ln T term. Again, we see that the parameter-free algorithms have performance close to or even better than Hedge with an oracle tuning of the learning rate, with no clear winner among the parameter-free algorithms.

Notice that, since the adaptive Kelly strategy based on the KT estimator is very close to optimal, the only possible improvement is to have a data-dependent bound, for example like the ones in [24, 15, 19]. In future work, we will extend our definitions and reductions to the data-dependent case.

⁴ Datasets available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

**Acknowledgments.** The authors thank Jacob Abernethy, Nicolò Cesa-Bianchi, Satyen Kale, Chansoo Lee, Giuseppe Molteni, and Manfred Warmuth for useful discussions on this work.

## References

[1] E. Artin. The Gamma Function. Holt, Rinehart and Winston, Inc., 1964.
[2] N. Batir. Inequalities for the gamma function. Archiv der Mathematik, 91(6):554–563, 2008.
[3] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011.
[4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[5] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. J. ACM, 44(3):427–485, 1997.
[6] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems 22, pages 297–305, 2009.
[7] C.-P. Chen. Inequalities for the polygamma functions with application. General Mathematics, 13(3):65–72, 2005.
[8] A. Chernov and V. Vovk. Prediction with advice of unknown number of experts. In Proc. of the 26th Conf. on Uncertainty in Artificial Intelligence. AUAI Press, 2010.
[9] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2nd edition, 2006.
[10] D. J. Foster. Personal communication, 2016.
[11] D. J. Foster, A. Rakhlin, and K. Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems 28, pages 3375–3383. Curran Associates, Inc., 2015.
[12] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 55(1):119–139, 1997.
[13] A. Hoorfar and M. Hassani. Inequalities on the Lambert W function and hyperpower function. J. Inequal. Pure and Appl. Math., 9(2), 2008.
[14] J. L. Kelly. A new interpretation of information rate. Information Theory, IRE Trans. on, 2(3):185–189, September 1956.
[15] W. M. Koolen and T. van Erven. Second-order quantile methods for experts and combinatorial games. In Proc. of the 28th Conf. on Learning Theory, pages 1155–1175, 2015.
[16] R. E. Krichevsky and V. K. Trofimov. The performance of universal encoding. IEEE Trans. on Information Theory, 27(2):199–206, 1981.
[17] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[18] H. Luo and R. E. Schapire. A drifting-games analysis for online learning and applications to boosting. In Advances in Neural Information Processing Systems 27, pages 1368–1376, 2014.
[19] H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Proc. of the 28th Conf. on Learning Theory, pages 1286–1304, 2015.
[20] D. McAllester. A PAC-Bayesian tutorial with a dropout bound, 2013. arXiv:1307.2118.
[21] H. B. McMahan and J. Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems 26, pages 2724–2732, 2013.
[22] H. B. McMahan and F. Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proc. of the 27th Conf. on Learning Theory, pages 1020–1039, 2014.
[23] F. Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 1806–1814. Curran Associates, Inc., 2013.
[24] F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 1116–1124, 2014.
[25] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
[26] M. Streeter and B. McMahan. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 2402–2410, 2012.
[27] V. Vovk. A game of prediction with expert advice. J. Computer and System Sciences, 56:153–173, 1998.
[28] E. T. Whittaker and G. N. Watson. A Course of Modern Analysis. Cambridge University Press, fourth edition, 1962. Reprinted.
[29] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context tree weighting method: Basic properties. IEEE Trans. on Information Theory, 41:653–664, 1995.