Unconstrained Online Learning with Unbounded Losses

Andrew Jacobsen 1 2   Ashok Cutkosky 3

Abstract

Algorithms for online learning typically require one or more boundedness assumptions: that the domain is bounded, that the losses are Lipschitz, or both. In this paper, we develop a new setting for online learning with unbounded domains and non-Lipschitz losses. For this setting we provide an algorithm which guarantees $R_T(u)\le \tilde O(G\|u\|\sqrt{T}+L\|u\|^2\sqrt{T})$ regret on any problem where the subgradients satisfy $\|g_t\|\le G+L\|w_t\|$, and show that this bound is unimprovable without further assumptions. We leverage this algorithm to develop new saddle-point optimization algorithms that converge in duality gap in unbounded domains, even in the absence of meaningful curvature. Finally, we provide the first algorithm achieving non-trivial dynamic regret in an unbounded domain for non-Lipschitz losses, as well as a matching lower bound. The regret of our dynamic regret algorithm automatically improves to a novel $L^*$ bound when the losses are smooth.

1Department of Computing Science, University of Alberta, Edmonton, Canada. 2Alberta Machine Intelligence Institute (Amii), Edmonton, Canada. 3Department of Electrical and Computer Engineering, Boston University, Boston, Massachusetts. Correspondence to: Andrew Jacobsen. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Online Learning

This paper introduces new techniques for online convex optimization (OCO), a standard framework used to model learning from a stream of data (Cesa-Bianchi & Lugosi, 2006; Shalev-Shwartz, 2011; Hazan, 2016; Orabona, 2019). Formally, consider $T$ rounds of interaction between an algorithm and an environment. In each round, the algorithm chooses a point $w_t$ in some convex subset $W$ of a Hilbert space, after which the environment chooses a convex loss function $\ell_t:W\to\mathbb{R}$. The standard performance metric in this setting is the regret $R_T(u)$, the cumulative loss relative to an unknown benchmark point $u\in W$:

$$R_T(u)=\sum_{t=1}^T \ell_t(w_t)-\ell_t(u).$$

In many applications of interest the appropriate baseline is not any fixed comparator, but rather a trajectory of points. This is often the case in true streaming settings, wherein the losses are generated from a data distribution that may be slowly shifting over time. To better model settings such as these, dynamic regret measures the total loss relative to that of a benchmark sequence of points $u=(u_1,\dots,u_T)$ in $W$:

$$R_T(u)=\sum_{t=1}^T \ell_t(w_t)-\ell_t(u_t).$$

Our goal in this work is to develop algorithms that achieve favorable regret and dynamic regret guarantees when both the domain $W$ and the range of $\ell_t$ may be unbounded. To illustrate the difficulty of our goal, let us consider the special case where the loss functions are linear, $\ell_t(w)=\langle g_t,w\rangle$. Clearly, if both $\|g_t\|$ and $\|w\|$ are allowed to be arbitrarily large, then the adversary can always ensure that the learner takes an arbitrarily large loss on each round. To alleviate this difficulty, prior works assume that one has access to a bound $D\ge\|u\|$ (usually by assuming that the domain is bounded with $D\ge\sup_{x,y\in W}\|x-y\|$), that the subgradients are bounded $G_T\ge\max_t\|g_t\|$, that the losses map to a bounded range $\ell_t:W\to[a,b]$, or some combination thereof.
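To fix ideas, the following is a minimal simulation of the interaction protocol above, using a vanilla online gradient step as a stand-in learner and linear losses. The random loss distribution, the fixed comparator, and the step size are illustrative assumptions of ours, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim, eta = 100, 3, 0.1
w = np.zeros(dim)
u = np.ones(dim)                      # an arbitrary fixed comparator
learner_loss = comparator_loss = 0.0

for t in range(T):
    g = rng.normal(size=dim)          # environment picks a linear loss <g, .>
    learner_loss += g @ w             # loss of the play w_t
    comparator_loss += g @ u          # loss of the benchmark u
    w = w - eta * g                   # online (sub)gradient step

print(f"R_T(u) = {learner_loss - comparator_loss:.3f}")
```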
In the simplest case, when one has access to both a bound $D\ge\|u\|$ and a bound $G_T\ge\|g_t\|$ on the subgradients for all $t$, classic methods based on Mirror Descent and Follow the Regularized Leader achieve the minimax optimal regret $R_T(u)\le O(DG_T\sqrt{T})$ using a strongly convex regularizer (Hazan et al., 2007; Duchi et al., 2010; McMahan & Streeter, 2010). When $D$ is available but not the Lipschitz bound $G_T$, it is still possible to match this guarantee up to constant factors, in which case the algorithm is said to be Lipschitz adaptive (Orabona & Pál, 2018; Mayo et al., 2022; Cutkosky, 2019). When the losses are $L$-smooth, these bounds can be improved to $R_T(u)\le O\big(LD^2+D\sqrt{L\sum_{t=1}^T\ell_t(u)}\big)$, referred to as an $L^*$ bound, though prior works still require one or more of the following assumptions: prior knowledge of $G_T$; that $\ell_t$ has a bounded range (known in advance); prior knowledge of a lower bound $\ell_t^*\le\ell_t(w)$ for all $w\in W$; additional structural assumptions such as strong convexity or exp-concavity; or that the losses take some specific form such as the square loss (Cesa-Bianchi et al., 1996; Kivinen & Warmuth, 1997; Srebro et al., 2010; Orabona et al., 2012).

If a bound $G_T\ge\max_t\|g_t\|$ is known but not the bound $D\ge\|u\|$, the situation gets significantly trickier. The essential difficulty is that without prior knowledge of how large the comparator might be, the predictions $w_t$ could at any point be arbitrarily far away from the benchmark, leading to high regret. As such, the learner must take great care to control $\|w_t\|$ in such a way that it is adaptive to the unknown comparator norm $\|u\|$. A standard result in this setting is

$$R_T(u)\le O\Big(\|u\|G_T\sqrt{T\log(T\|u\|+1)}\Big),\qquad(1)$$

which holds for all $u\in W$ and is known to be optimal up to constants (Orabona, 2013). Bounds of this form are commonly referred to as comparator adaptive or parameter-free (Foster et al., 2015; Orabona & Pál, 2016; van der Hoeven, 2019; Cutkosky & Orabona, 2018; Mhammedi & Koolen, 2020; Jacobsen & Cutkosky, 2022).

The first results to avoid both the bounded domain and bounded gradient assumptions have only been achieved in recent years. Cutkosky (2019) develops an algorithm which achieves $R_T(u)\le O\big(\|u\|G_T\sqrt{T\log(\|u\|T+1)}+G_T\|u\|^3\big)$, and Mhammedi & Koolen (2020) show that the additional cubic penalty is unavoidable while maintaining the $\tilde O(\|u\|G_T\sqrt{T})$ dependence. Alternatively, Orabona & Pál (2018) show that $R_T(u)\le O(\|u\|^2G_T\sqrt{T})$ can be attained without prior knowledge of $G_T$ in an unbounded domain, avoiding the cubic penalty in exchange for a horizon-dependent quadratic penalty. Works such as Mayo et al. (2022) and Kempka et al. (2019) show that Equation (1) can be achieved with essentially no extra penalty in certain special cases such as regression-type losses.

When it comes to dynamic regret, much less progress has been made in alleviating boundedness assumptions, with nearly all existing results assuming both a bounded domain and Lipschitz losses. Under both boundedness assumptions, prior works have achieved the minimax optimal dynamic regret $R_T(u)\le O\big(G_T\sqrt{(D^2+DP_T)T}\big)$, where $P_T=\sum_{t=2}^T\|u_t-u_{t-1}\|$ is the path-length of the comparator sequence (Zhang et al., 2018; Jadbabaie et al., 2015; Cutkosky, 2020). In an unbounded domain with Lipschitz losses, recent works have achieved an analogous guarantee of $R_T(u)\le\tilde O\big(G_T\sqrt{(M^2+MP_T)T}\big)$, where $M=\max_t\|u_t\|$ (Jacobsen & Cutkosky, 2022; Luo et al., 2022).
We are unaware of any existing works that explicitly investigate Lipschitz-adaptive dynamic regret, though existing results can likely be made Lipschitz-adaptive in a black-box manner using the gradient-clipping approach of Cutkosky (2019) in exchange for an additional $G_T\max_t\|u_t\|^3$ penalty. Importantly, note that in all of these prior works there is an implicit assumption that there exists a uniform bound $G\ge\|\nabla\ell_t(w)\|$ for any $w\in W$ and $\nabla\ell_t(w)\in\partial\ell_t(w)$, even if it is not known in advance. Otherwise, the terms $G_T=\max_t\|g_t\|$ can easily render any of the aforementioned regret guarantees vacuous.

In this work, we study unconstrained online convex optimization under a more general boundedness assumption on the gradients, allowing the gradient norms to grow arbitrarily large away from a given reference point $w_0\in W$. In Section 2 we provide an algorithm for this more general problem setting which achieves a strict generalization of the usual comparator-adaptive bound in Equation (1), as well as a lower bound showing that our result is unimprovable in general. In Section 3 we leverage this algorithm to develop a new saddle-point optimization algorithm which converges in duality gap in an unbounded domain without requiring additional curvature assumptions such as strong convexity/concavity. In Section 4, we turn to the problem of dynamic regret minimization and develop an algorithm which achieves dynamic regret $R_T(u)\le\tilde O\big(M^{3/2}\sqrt{(M+P_T)T}\big)$, and we provide a matching lower bound. This is the first algorithm to significantly alleviate both the bounded domain and bounded subgradient assumptions for dynamic regret. Moreover, when the losses are $L_t$-smooth, the same algorithm automatically improves to an $L^*$ bound of the form $R_T(u)\le\tilde O\Big(\sqrt{(M^2+MP_T)\sum_{t=1}^TL_t[\ell_t(u_t)-\ell_t^*]}\Big)$. To the best of our knowledge, this is in fact the first $L^*$ bound to be achieved for general smooth losses without making either a uniformly-bounded subgradient or a bounded-range assumption in an unbounded domain.

Notations. For brevity, we occasionally abuse notation by letting $\nabla f(x)$ denote an element of $\partial f(x)$. The Bregman divergence w.r.t. a differentiable function $\psi$ is $D_\psi(x|y)=\psi(x)-\psi(y)-\langle\nabla\psi(y),x-y\rangle$. We use the compressed sum notations $g_{i:j}=\sum_{t=i}^jg_t$ and $\|g\|_{a:b}^2=\sum_{t=a}^b\|g_t\|^2$. We denote $a\vee b=\max\{a,b\}$ and $a\wedge b=\min\{a,b\}$. $\Delta_N$ denotes the $N$-dimensional simplex. The notation $O(\cdot)$ hides constants, $\hat O(\cdot)$ hides constants and $\log(\log)$ terms, and $\tilde O(\cdot)$ hides up to and including $\log$ factors.

2. Online Learning with Quadratically Bounded Losses

In an unbounded domain with unbounded losses, it will generally be impossible to avoid linear regret without some additional assumptions. Intuitively, what's missing in this problem is a frame-of-reference for the magnitude of a given loss. In the Lipschitz or bounded-range settings, the learner always has a frame-of-reference for the worst-case loss they might encounter. In contrast, without these assumptions, hindsight becomes the only frame-of-reference, and the adversary can exploit this to trick the learner into playing too greedily or too conservatively. To make the problem tractable, while still allowing the losses to have unbounded range and subgradients, we assume that the subgradients are bounded for all $t$ at some reference point $w_0$, but may become arbitrarily large away from $w_0$.
This effectively gives the learner access to an a priori frame-of-reference for loss magnitudes, yet still captures many problem settings where the losses can become arbitrarily large in an unbounded domain.

Definition 2.1. Let $(W,\|\cdot\|)$ be a normed space. A function $\ell:W\to\mathbb{R}$ is $(G,L)$-quadratically bounded w.r.t. $\|\cdot\|$ at $w_0$ if for any $w\in W$ and $\nabla\ell(w)\in\partial\ell(w)$ it holds that

$$\|\nabla\ell(w)\|\le G+L\|w-w_0\|.\qquad(2)$$

Note that Definition 2.1 is a strict generalization of the standard Lipschitz condition: any $G$-Lipschitz function is $(G,0)$-quadratically bounded. The definition also captures $L$-smooth functions as a special case, since any $L$-smooth function is $(\|\nabla\ell(w_0)\|,L)$-quadratically bounded at $w_0$. However, in general a function satisfying the quadratically bounded property need not be smooth.¹ For the remainder of the paper we assume without loss of generality that $w_0=0$ and $\|\cdot\|$ is the Euclidean norm. This assumption was initially studied in the context of stochastic optimization by Telgarsky (2022), where it was sufficient to attain convergence in several settings of practical relevance. In this work, we show that it is also sufficient to achieve sublinear regret even in adversarial problem settings.

¹As a simple illustration, note that if $f(w)$ is an $L$-smooth and $(G,L)$-quadratically bounded function, then $f(w)+c\|w\|$ will be $(G+c,L)$-quadratically bounded but non-smooth.

We will in fact take it one step further and consider a stronger Online Linear Optimization (OLO) version of the problem. We say that a sequence $\{g_t\}$ is $(G_t,L_t)$-quadratically bounded w.r.t. $\{w_t\}$ if for every $t$ we have $\|g_t\|\le G_t+L_t\|w_t\|$. Then using the standard reduction from OCO to OLO (see e.g. Zinkevich (2003)), for any sequence of $(G_t,L_t)$-quadratically bounded convex functions we have the following regret upper bound:

$$\sum_{t=1}^T\ell_t(w_t)-\ell_t(u)\le\sum_{t=1}^T\langle g_t,w_t-u\rangle,$$

where $g_t\in\partial\ell_t(w_t)$ and $\{g_t\}$ is a $(G_t,L_t)$-quadratically bounded sequence w.r.t. $\{w_t\}$. Hence, one can solve OCO problems involving quadratically bounded losses using any OLO algorithm that achieves sublinear regret against sequences $\{g_t\}$ that are quadratically bounded w.r.t. its outputs $\{w_t\}$. Note that this is potentially a more difficult problem, as it gives the adversary freedom to impose severe penalties whenever the learner plays large $\|w_t\|$, yet this effect is experienced asymmetrically by the comparator: the comparator can have large norm and not necessarily experience large losses, unless $u$ is aligned with $g_t$ and the learner plays a point with $\|w_t\|$ comparable to $\|u\|$. For brevity we refer to this harder problem setting as the QB-OLO setting, and QB-OCO for the setting where the adversary must play $\ell_t$ satisfying Definition 2.1.

Surprisingly, it turns out that it is possible to achieve sublinear regret even in the QB-OLO setting. The following theorem provides an algorithm which achieves sublinear regret and requires no instance-specific hyperparameter tuning. The proof can be found in Appendix B.

Theorem 2.2. Let A be an online learning algorithm and let $w_t\in W$ be its output on round $t$. Let $\{g_t\}$ be a $(G_t,L_t)$-quadratically bounded sequence w.r.t. $\{w_t\}$, where $G_t\in[0,G_{\max}]$ and $L_t\in[0,L_{\max}]$ for all $t$. Let $\epsilon>0$, $V_{t+1}=4G_{\max}^2+G_{1:t}^2$, $\rho_{t+1}=\frac{1}{\sqrt{L_{\max}^2+L_{1:t}^2}}$, and $\alpha_{t+1}=\frac{\epsilon G_{\max}}{\sqrt{V_{t+1}}\log^2(V_{t+1}/G_{\max}^2)}$. Denote $\Psi_t(w)=3\int_0^{\|w\|}\min_{\eta\le\frac{1}{G_{\max}}}\Big[\frac{\log(x/\alpha_t+1)}{\eta}+\eta V_t\Big]dx$ and set $\psi_t(w)=\Psi_t(w)+\frac{2}{\rho_t}\|w\|^2$ and $\phi_t(w)=\frac{L_t^2}{2\sqrt{L_{1:t}^2}}\|w\|^2$. Then for any $u\in W$, Algorithm 1 guarantees

$$R_T(u)\le O\bigg(\epsilon G_{\max}+\|u\|\sqrt{G_{1:T}^2\,F_T(\|u\|)}+\|u\|^2\sqrt{L_{1:T}^2}\bigg),$$

where $F_T(\|u\|)\le\log(\|u\|/\alpha_{T+1}+1)$.
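Before developing intuition for this result, here is a small numerical check of Definition 2.1, assuming the square-loss example that the paper uses later: for $\ell(w)=\frac12(y-\langle x,w\rangle)^2$ the pair $(G,L)=(|y|\,\|x\|,\|x\|^2)$ works at $w_0=0$. All names here are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), 2.0
G, L = abs(y) * np.linalg.norm(x), np.linalg.norm(x) ** 2  # candidate (G, L)

for _ in range(1000):
    w = rng.normal(size=5) * rng.exponential()   # test points of varying norm
    grad = -(y - x @ w) * x                      # gradient of the square loss
    # check the quadratic-boundedness inequality (2) at w_0 = 0
    assert np.linalg.norm(grad) <= G + L * np.linalg.norm(w) + 1e-6
```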
Let us briefly develop some intuition for how the above result is constructed.

Algorithm 1: Algorithm for Quadratically Bounded Losses
  Input: $\psi_1:W\to\mathbb{R}_{\ge0}$ with $\min_{w\in W}\psi_1(w)=0$, $G_{\max}$ and $L_{\max}$
  Initialize: $w_1=\arg\min_{w\in W}\psi_1(w)$
  for $t=1:T$ do
    Play $w_t$, observe $g_t\in\partial\ell_t(w_t)$
    Choose $G_t$ and $L_t$ satisfying $\|g_t\|\le G_t+L_t\|w_t\|$
    Choose functions $\psi_{t+1}$, $\phi_t$
    Set $\nabla\phi_t(w_t)\in\partial\phi_t(w_t)$ and $\tilde g_t=g_t+\nabla\phi_t(w_t)$
    Set $\Delta_t(w)=\psi_{t+1}(w)-\psi_t(w)$
    Update $w_{t+1}=\arg\min_{w\in W}\langle\tilde g_t,w\rangle+D_{\psi_t}(w|w_t)+\Delta_t(w)$
  end for

Algorithm 1 can be interpreted as an instance of the Centered Mirror Descent algorithm recently developed by Jacobsen & Cutkosky (2022), which admits a generic regret guarantee of the form $R_T(u)\le\psi_{T+1}(u)+\sum_{t=1}^T\phi_t(u)+\sum_{t=1}^T\delta_t$, where the $\delta_t$ are similar to the stability terms encountered in vanilla Mirror Descent, but with certain additional negative terms $\Delta_t$ and $\phi_t$:

$$\delta_t\le O\Big(\langle g_t,w_t-w_{t+1}\rangle-D_{\psi_t}(w_{t+1}|w_t)-\Delta_t(w_{t+1})-\phi_t(w_t)\Big).$$

It is easily verified that $\psi_{T+1}(u)+\sum_{t=1}^T\phi_t(u)$ matches the terms in the upper bound, so the main difficulty is making sure that the stability terms $\sum_{t=1}^T\delta_t$ disappear. Crucially, because $\{g_t\}$ is a $(G_t,L_t)$-quadratically bounded sequence w.r.t. $\{w_t\}$, we have $\|g_t\|\le G_t+L_t\|w_t\|$. The utility of this is that we can design separate regularizers to control the Lipschitz part $G_t$ and the non-Lipschitz part $L_t\|w_t\|$. In particular, using a similar argument to Jacobsen & Cutkosky (2022), by setting $\Psi_t$ as in Theorem 2.2 we can ensure that the Lipschitz part of the bound is well-controlled:

$$\sum_{t=1}^TG_t\|w_t-w_{t+1}\|-D_{\Psi_t}(w_{t+1}|w_t)-\Delta_t(w_{t+1})\le O(1).$$

However, in general this $\Psi_t$ is not strong enough to control the non-Lipschitz part $L_t\|w_t\|$. Instead, for this term we use $\Phi_t(w)=O\big(L_{\max}\sqrt{T}\|w\|^2\big)$, and then using standard arguments for Mirror Descent with a strongly convex regularizer, it can be shown that $L_t\|w_t\|\|w_t-w_{t+1}\|-D_{\Phi_t}(w_{t+1}|w_t)\le\phi_t(w_t)$ by choosing $\phi_t(w_t)=O\big(L_t\|w_t\|^2\big)$.

Note that in the setting of $G$-Lipschitz losses we have $L_{\max}=0$ and hence can set $G_t=\|g_t\|$, so the bound reduces to the comparator-adaptive rate

$$R_T(u)\le\hat O\bigg(\|u\|\sqrt{\sum_{t=1}^T\|g_t\|^2\,F_T(\|u\|)}\bigg),$$

which is known to be optimal up to constant and $\log(\log)$ terms (McMahan & Streeter, 2012; Orabona, 2013). On the other hand, if $L_{\max}>0$ the algorithm can choose any $G_t\le G_{\max}$ and $L_t\le L_{\max}$ such that $G_t+L_t\|w_t\|\ge\|g_t\|$. Ideally these factors should be chosen to be tight, i.e., to minimize $G+L\|w_t\|$ subject to the constraints $\{G\le G_{\max},\ L\le L_{\max},\ G+L\|w_t\|\ge\|g_t\|\}$. However, there may be many such $(G,L)$ satisfying these conditions, and in general it is unclear whether there exists a general-purpose strategy to choose among them without further assumptions. Indeed, Theorem 2.2 suggests that when $\|u\|$ is very large, we'd prefer to set the $L_t$'s smaller at the expense of larger $G_t$'s, and vice-versa when $\|u\|$ is sufficiently small, so optimally trading off $G_t$ and $L_t$ would require some prior knowledge about $\|u\|$. Nevertheless, there are many situations in which one can choose $(G_t,L_t)$ pairs along some Pareto frontier.
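Algorithm 1's update is easy to implement on $W=\mathbb{R}^d$ because $\psi_t$ depends on $w$ only through $\|w\|$, so the argmin reduces to a one-dimensional root-find along a fixed direction. The following sketch uses the regularizer of Theorem 2.2 ($k=3$, $\kappa=4$ as in the appendix version); the numerical solver and all names are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import brentq

def psi_prime(r, V, rho, alpha, G_max, k=3.0, kappa=4.0):
    # derivative in the radius r of psi_t(w) = k*int_0^||w|| min_eta[...]dx + kappa/(2 rho)*||w||^2
    if r <= 0.0:
        return 0.0
    F = np.log(r / alpha + 1.0)
    if G_max * np.sqrt(F) <= np.sqrt(V):   # unconstrained minimizer eta* = sqrt(F/V) is feasible
        inner = 2.0 * np.sqrt(F * V)
    else:                                  # the constraint eta <= 1/G_max binds
        inner = G_max * F + V / G_max
    return k * inner + (kappa / rho) * r

def qb_update(w, g, L_t, sumL2, par_t, par_next):
    # par_t, par_next = (V, rho, alpha, G_max) for psi_t and psi_{t+1}
    g_tilde = g + (L_t**2 / np.sqrt(sumL2)) * w if sumL2 > 0 else g  # g~ = g + grad phi_t(w_t)
    r = np.linalg.norm(w)
    grad_psi = psi_prime(r, *par_t) * (w / r) if r > 0 else np.zeros_like(w)
    theta = g_tilde - grad_psi             # update objective is <theta, w> + psi_{t+1}(w)
    n = np.linalg.norm(theta)
    if n == 0.0:
        return np.zeros_like(w)
    hi = 1.0
    while psi_prime(hi, *par_next) < n:    # bracket the root of psi'_{t+1}(r) = ||theta||
        hi *= 2.0
    r_new = brentq(lambda s: psi_prime(s, *par_next) - n, 0.0, hi)
    return (-r_new / n) * theta            # minimizer lies along -theta at radius r_new
```

The design choice exploited here is that $\langle\tilde g_t,w\rangle+D_{\psi_t}(w|w_t)+\Delta_t(w)$ equals $\langle\tilde g_t-\nabla\psi_t(w_t),w\rangle+\psi_{t+1}(w)$ up to constants, so the minimizer points opposite $\theta_t=\tilde g_t-\nabla\psi_t(w_t)$ with radius solving $\psi_{t+1}'(r)=\|\theta_t\|$.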
As an illustrative example, consider an online regression setting in which $\ell_t(w)=\frac12(y_t-\langle x_t,w\rangle)^2$ for some target variable $y_t\in\mathbb{R}$ and feature vector $x_t\in\mathbb{R}^d$. Observe that $\nabla\ell_t(w_t)=-(y_t-\langle x_t,w_t\rangle)x_t$, so setting $G_t=|y_t|\,\|x_t\|$ and $L_t=|\langle x_t,w_t/\|w_t\|\rangle|\,\|x_t\|$, we have $\|\nabla\ell_t(w_t)\|=\|(y_t-\langle x_t,w_t\rangle)x_t\|\le G_t+L_t\|w_t\|$, so $\{\nabla\ell_t(w_t)\}$ is a $(G_t,L_t)$-quadratically bounded sequence w.r.t. $\{w_t\}$, and Theorem 2.2 guarantees regret scaling as

$$R_T(u)\le\tilde O\bigg(\|u\|\sqrt{\sum_{t=1}^Ty_t^2\|x_t\|^2}+\|u\|^2\sqrt{\sum_{t=1}^TL_t^2}\bigg),$$

which is more adaptive to the sequence of observed feature vectors $x_t$ and targets $y_t$ than the worst-case bound of $R_T(u)\le\tilde O\big(\|u\|\,|y_{\max}|\,\|x_{\max}\|\sqrt{T}+\|u\|^2\|x_{\max}\|^2\sqrt{T}\big)$.

Finally, notice that for $L_{\max}>0$ Algorithm 1 suffers an additional $O(L_{\max}\|u\|^2\sqrt{T})$ penalty which is not present in the Lipschitz losses setting. The following theorem demonstrates that this penalty is in fact unavoidable in our problem setting. The proof can be found in Appendix B.2.

Theorem 2.3. Let A be an algorithm defined over $\mathbb{R}^2$ and let $w_t$ denote the output of A on round $t$. Let $\epsilon>0$ and suppose A guarantees $R_T(0)\le\epsilon$ against any quadratically bounded sequence $\{g_t\}$. Then for any $T\ge1$, $G>0$ and $L\ge0$ there exists a sequence $g_1,\dots,g_T$ satisfying $\|g_t\|\le G+L\|w_t\|$ and a comparator $u\in\mathbb{R}^2$ such that $R_T(u)\ge\Omega\big(G\|u\|\sqrt{T}\vee L\|u\|^2\sqrt{T}\big)$.

Remark 2.4. An alternative way to approach online learning in our problem setting would be to apply an algorithm which is both comparator-adaptive and Lipschitz-adaptive, since these algorithms do not require an a priori upper bound on $\|u\|$ nor on $\|g_t\|$. Theorem 2.3 demonstrates that this approach would be sub-optimal in our setting. Indeed, Mhammedi & Koolen (2020) show that without prior knowledge of a Lipschitz bound, there is an unavoidable $O(\|u\|^3\max_{t\le T}\|g_t\|)$ penalty associated with comparator-norm adaptivity, which in our problem setting can lead to a suboptimal $O(\|u\|^3L\max_t\|w_t\|)$ dependence, as large as $O(L\|u\|^3\sqrt{T})$.

3. Unconstrained Saddle-Point Optimization

As a result of the algorithm in the previous section, we are immediately able to produce a novel algorithm for saddle-point optimization in unbounded domains. Consider the following convex-concave saddle-point problem:

$$\min_{x\in X}\max_{y\in Y}L(x,y),$$

where $X$ and $Y$ are potentially unbounded closed convex sets, $x\mapsto L(x,y)$ is convex for all $y\in Y$, and $y\mapsto L(x,y)$ is concave for all $x\in X$. The saddle-point $(x^*,y^*)$ of $L$ is the point $(x^*,y^*)\in X\times Y$ satisfying $L(x^*,y^*)=\min_{x\in X}\max_{y\in Y}L(x,y)=\max_{y\in Y}\min_{x\in X}L(x,y)$, where the last equality follows from Sion's minimax theorem. Then for any $(x,y)\in X\times Y$, we have

$$|L(x,y)-L(x^*,y^*)|\le\underbrace{L(x,y^*)-L(x^*,y)}_{=:G(x,y)}.$$

Hence, the sub-optimality of a point $(x,y)\in X\times Y$ can be controlled so long as we can control the quantity $G(x,y)$, which we refer to as the duality gap. Fortunately, the duality gap is easily controlled using an online learning algorithm via the well-known reduction to Online Linear Optimization (OLO) shown in Algorithm 2.

Algorithm 2: Saddle-Point Reduction
  Input: Domain $W=X\times Y$, OLO algorithm A
  for $t=1:T$ do
    Get $w_t=(x_t,y_t)\in W$ from A
    Receive $g_t^x\in\partial_xL(x_t,y_t)$ and $g_t^y\in\partial_y[-L(x_t,y_t)]$
    Send $g_t=(g_t^x,g_t^y)$ to A as the $t$-th subgradient
  end for
  Return $\bar w_T=(\bar x_T,\bar y_T)=\frac1T\sum_{t=1}^Tw_t$

Lemma 3.1. For any $\tilde w=(\tilde x,\tilde y)\in X\times Y$, Algorithm 2 guarantees $L(\bar x_T,\tilde y)-L(\tilde x,\bar y_T)\le\frac1T\sum_{t=1}^T\langle g_t,w_t-\tilde w\rangle=\frac{R_T^A(\tilde w)}{T}$.
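Algorithm 2 is a thin wrapper around any OLO learner. The sketch below transcribes it in Python with a plain online-gradient-descent stand-in (the paper plugs in Algorithm 1); the interface names and the toy coupled objective are our assumptions.

```python
import numpy as np

class OGD:                                     # stand-in OLO learner, not the paper's
    def __init__(self, dim, eta):
        self.w, self.eta = np.zeros(dim), eta
    def play(self):
        return self.w
    def update(self, g):
        self.w = self.w - self.eta * g

def saddle_point_reduction(learner, grad_x, grad_y_of_negL, T, dx):
    iterates = []
    for _ in range(T):
        w = learner.play()                     # w_t = (x_t, y_t)
        iterates.append(w.copy())
        x, y = w[:dx], w[dx:]
        g = np.concatenate([grad_x(x, y), grad_y_of_negL(x, y)])
        learner.update(g)                      # send g_t = (g^x_t, g^y_t)
    return np.mean(iterates, axis=0)           # averaged play (x-bar_T, y-bar_T)

# toy instance: L(x,y) = 0.5||x||^2 + <x, By> - <u_x, x> - 0.5||y||^2 + <u_y, y>
B = np.array([[0.0, 1.0], [-1.0, 0.0]])
u_x, u_y = np.array([1.0, 0.0]), np.array([0.0, 0.5])
avg = saddle_point_reduction(
    OGD(4, 0.05),
    grad_x=lambda x, y: x + B @ y - u_x,            # gradient of x -> L(x, y)
    grad_y_of_negL=lambda x, y: y - B.T @ x - u_y,  # gradient of y -> -L(x, y)
    T=5000, dx=2)
print(avg)  # the averaged play approximates the saddle point of L
```

By Lemma 3.1, whenever the plugged-in learner has sublinear regret, the duality gap at the averaged iterate is at most $R_T^A/T$.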
Proof. To see why this is true, observe that by convexity of $x\mapsto L(x,y)$ and concavity of $y\mapsto L(x,y)$, we can apply Jensen's inequality in both arguments to get

$$L(\bar x_T,\tilde y)-L(\tilde x,\bar y_T)\le\frac1T\sum_{t=1}^TL(x_t,\tilde y)-L(\tilde x,y_t);$$

now add and subtract $L(x_t,y_t)$:

$$=\frac1T\sum_{t=1}^TL(x_t,y_t)-L(\tilde x,y_t)-L(x_t,y_t)+L(x_t,\tilde y);$$

let $g_t^x\in\partial_xL(x_t,y_t)$ and $g_t^y\in\partial_y[-L(x_t,y_t)]$ and again use convexity to upper bound both difference terms:

$$\le\frac1T\sum_{t=1}^T\langle g_t^x,x_t-\tilde x\rangle+\langle g_t^y,y_t-\tilde y\rangle;$$

and now define $w_t=(x_t,y_t)$, $\tilde w=(\tilde x,\tilde y)$, and $g_t=(g_t^x,g_t^y)$ to complete the proof:

$$=\frac1T\sum_{t=1}^T\langle g_t,w_t-\tilde w\rangle=\frac{R_T^A(\tilde w)}{T}.\qquad\square$$

Thus in order to control the duality gap $G(x,y)$, it suffices to provide any OLO algorithm that achieves sublinear regret under the given assumptions. To the best of our knowledge, the only existing work to achieve a comparator-adaptive convergence guarantee for the duality gap in general saddle-point problems is Liu & Orabona (2022). Their approach does indeed guarantee a rate of the form $G(\bar x_T,\bar y_T)\le O\big(\frac{R_T^A(w^*)}{T}\big)$ under the assumption that $L(\cdot,\cdot)$ is $G$-Lipschitz in both arguments, which is justified by assuming that $X$ and $Y$ are bounded domains. However, generally saddle-point problems can have some coupling between $x\in X$ and $y\in Y$, leading to factors of $\|x\|$ and $\|y\|$ showing up in both $\|\partial_xL(x,y)\|$ and $\|\partial_yL(x,y)\|$. Thus, even in a bounded domain a bound of the form $R_T^A(w^*)\le\tilde O(\|w^*\|G\sqrt{T})$ actually still falls short of being fully comparator-adaptive, because the Lipschitz constant $G$ is subtly hiding factors of $D_X=\max_{x,x'\in X}\|x-x'\|$ and $D_Y=\max_{y,y'\in Y}\|y-y'\|$. See Section 3.1 for a more explicit example of this issue.

On the other hand, for many interesting problems $L(\cdot,\cdot)$ is quadratically bounded in both arguments, which will enable us to immediately apply Algorithm 1 to the linear losses $g_t=(g_t^x,g_t^y)$ as described above. In particular, we have the following:

Proposition 3.2. Suppose that for all $\tilde y\in Y$, the function $x\mapsto L(x,\tilde y)$ is $(G_x+L_{xy}\|\tilde y\|,L_{xx})$-quadratically bounded, and for all $\tilde x\in X$ the function $y\mapsto L(\tilde x,y)$ is $(G_y+L_{yx}\|\tilde x\|,L_{yy})$-quadratically bounded. Let $g_t^x\in\partial_xL(x_t,y_t)$ and $g_t^y\in\partial_y[-L(x_t,y_t)]$, and set $g_t=(g_t^x,g_t^y)$. Then $\{g_t\}$ is a $(G_w,L_w)$-quadratically bounded sequence w.r.t. the norm $\|(x,y)\|=\sqrt{\|x\|^2+\|y\|^2}$, where $G_w\le O\big(\sqrt{G_x^2+G_y^2}\big)$ and $L_w\le O\big(\sqrt{L_{xx}^2+L_{yy}^2+L_{xy}^2+L_{yx}^2}\big)$.

Proof. Let $(x,y)\in W$. For $g^x\in\partial_xL(x,y)$ observe that

$$\|g^x\|^2\le(G_x+L_{xy}\|y\|+L_{xx}\|x\|)^2\le5\big(G_x^2+L_{xy}^2\|y\|^2+L_{xx}^2\|x\|^2\big),$$

where the first inequality uses the assumption that $x\mapsto L(x,y)$ is $(G_x+L_{xy}\|y\|,L_{xx})$-quadratically bounded for any $y\in Y$, and the second uses $(a+b+c)^2\le5a^2+5b^2+5c^2$. Likewise, $\|g^y\|^2\le5\big(G_y^2+L_{yx}^2\|x\|^2+L_{yy}^2\|y\|^2\big)$, and so overall, letting $g^w=(g^x,g^y)$ we have

$$\|g^w\|=\sqrt{\|g^x\|^2+\|g^y\|^2}\overset{(*)}{\le}\underbrace{\sqrt5\sqrt{G_x^2+G_y^2}}_{=:G_w}+\underbrace{\sqrt5\sqrt{L_{xx}^2+L_{yy}^2+L_{xy}^2+L_{yx}^2}}_{=:L_w}\|w\|=G_w+L_w\|w\|,$$

where $(*)$ uses $\sqrt{x+y}\le\sqrt{x}+\sqrt{y}$ for $x,y\ge0$. $\square$

Hence, with this in hand we can use Algorithm 1 to guarantee that for any $\tilde w=(\tilde x,\tilde y)\in W$,

$$L(\bar x_T,\tilde y)-L(\tilde x,\bar y_T)\overset{\text{Lemma 3.1}}{\le}\frac{R_T^A(\tilde w)}{T}\overset{\text{Theorem 2.2}}{\le}\tilde O\bigg(\frac{G_w\|\tilde w\|+L_w\|\tilde w\|^2}{\sqrt{T}}\bigg),$$

which is indeed fully adaptive to the comparator $\tilde w$.

It is important to note that our results in this section are made possible because our algorithm works even in the more difficult QB-OLO setting. It may be possible to get a similar result by using two QB-OCO algorithms designed for quadratically bounded functions $\ell_t$, though it seems more challenging. In particular, letting $\ell_t^x(\cdot)=L(\cdot,y_t)$ and $\ell_t^y(\cdot)=-L(x_t,\cdot)$, one might instead run separate algorithms against the quadratically bounded loss sequences $\ell_t^x$ and $\ell_t^y$. However, now both algorithms need to very carefully regularize their iterates such that the gradients received by the other algorithm are never too large, since $\|\nabla\ell_t^x(x_t)\|$ can contain factors of $\|y_t\|$ and $\|\nabla\ell_t^y(y_t)\|$ can contain factors of $\|x_t\|$. Hence, careful coordination between the two algorithms would be required. The upshot is that by using un-linearized losses $\ell_t^x$ and $\ell_t^y$ it may be possible to get faster rates in some settings by better accounting for local curvature. We leave this as an exciting direction for future investigation.
3.1. Example: Bilinearly-Coupled Saddle-Points

Before moving on, let us make things less abstract with a simple example. Consider a bilinearly-coupled saddle-point problem of the form

$$L(x,y)=F_x(x)+H(x,y)-F_y(y),\qquad(3)$$

where $F_x$ and $F_y$ are convex and $(\tilde G_x,\tilde L_x)$- and $(\tilde G_y,\tilde L_y)$-quadratically bounded respectively, and $H(x,y)=\langle x,By\rangle-\langle u_x,x\rangle+\langle u_y,y\rangle$ for some coupling matrix $B$ and vectors $u_x,u_y$. This problem captures several notable problem settings, such as minimizing the mean-squared projected Bellman error for off-policy policy evaluation in reinforcement learning, quadratic games, and regularized empirical risk minimization (Du et al., 2022). The following proposition demonstrates that these problems do indeed satisfy the conditions of Proposition 3.2.

Proposition 3.3. Equation (3) satisfies the assumptions of Proposition 3.2 with $G_x=\tilde G_x+\|u_x\|$, $L_{xx}=\tilde L_x$, $L_{xy}=\|B\|_{op}$, $G_y=\tilde G_y+\|u_y\|$, $L_{yy}=\tilde L_y$, and $L_{yx}=\|B\|_{op}$.

Proof. Observe that for any $(x,y)\in X\times Y$ and $g^x\in\partial_xL(x,y)$, we have

$$\|g^x\|=\|\nabla F_x(x)+By-u_x\|\le\|\nabla F_x(x)\|+\|B\|_{op}\|y\|+\|u_x\|\le\tilde G_x+\|u_x\|+\tilde L_x\|x\|+\|B\|_{op}\|y\|,$$

where $\nabla F_x(x)\in\partial F_x(x)$ and $\|B\|_{op}$ denotes the operator norm $\|B\|_{op}=\sup_{x:\|x\|=1}\|Bx\|$. Likewise, $\|g^y\|\le\tilde G_y+\|u_y\|+\tilde L_y\|y\|+\|B\|_{op}\|x\|$. Hence, $L(\cdot,\cdot)$ satisfies the assumptions of Proposition 3.2 with $G_x=\tilde G_x+\|u_x\|$, $L_{xx}=\tilde L_x$, $L_{xy}=\|B\|_{op}$, $G_y=\tilde G_y+\|u_y\|$, $L_{yy}=\tilde L_y$, and $L_{yx}=\|B\|_{op}$. $\square$

We note that this specific example is mainly for illustrative purposes: in many instances of Equation (3) the functions $F_x$ and $F_y$ satisfy stronger curvature assumptions than used here, and our approach would be improved by more explicitly leveraging these assumptions when they hold. Nevertheless, our approach here does have a few key benefits. First, we naturally attain convergence in duality gap with an explicit dependence on the comparator, whereas prior works generally only attain a bound of this form by making stronger assumptions such as strong convexity or one of the boundedness assumptions we're seeking to avoid (Liu & Orabona, 2022; Du et al., 2022; Ibrahim et al., 2020; Azizian et al., 2020). Second, our approach can be applied under fairly weak assumptions: $L(\cdot,\cdot)$ need not be Lipschitz, strongly-convex, nor smooth in either argument, and we do not require $X\times Y$ to be a bounded domain.
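The constants in Proposition 3.3 are directly computable. A short sketch (names ours) for the case $F_x(v)=F_y(v)=\frac12\|v\|^2$, which is $(0,1)$-quadratically bounded; the factor $\sqrt5$ follows the proof of Proposition 3.2 as reconstructed above.

```python
import numpy as np

B = np.array([[2.0, -1.0], [0.5, 1.0]])
u_x, u_y = np.array([1.0, 0.0]), np.array([0.0, -1.0])
G_tilde_x = G_tilde_y = 0.0          # F(v) = 0.5*||v||^2 has gradient v
L_tilde_x = L_tilde_y = 1.0

B_op = np.linalg.norm(B, 2)          # operator (spectral) norm ||B||_op
G_x, L_xx, L_xy = G_tilde_x + np.linalg.norm(u_x), L_tilde_x, B_op
G_y, L_yy, L_yx = G_tilde_y + np.linalg.norm(u_y), L_tilde_y, B_op

# Proposition 3.2 then yields joint constants for g_t = (g^x_t, g^y_t):
G_w = np.sqrt(5 * (G_x**2 + G_y**2))
L_w = np.sqrt(5 * (L_xx**2 + L_yy**2 + L_xy**2 + L_yx**2))
print(G_w, L_w)
```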
4. Dynamic Regret

Next, we turn our attention to dynamic regret. In the static regret setting, we saw in Section 2 that to control the stability of the algorithm it was necessary to add an additional term $\Phi_t(w)=O\big(L_{\max}\sqrt{T}\|w\|^2\big)$ to the regularizer to help control the non-Lipschitz part of the loss. We will likewise need a stronger regularizer to control the gradients for dynamic regret, but now it will lead to difficulties. To see why, consider the dynamic regret of gradient descent with a fixed step-size $\eta$. Using existing analyses one can arrive at

$$R_T(u)\le O\bigg(\frac{\|u_T\|^2+\max_t\|w_t\|P_T}{\eta}+\eta\sum_{t=1}^T\|g_t\|^2\bigg),$$

where $g_t\in\partial\ell_t(w_t)$ and $P_T=\sum_{t=2}^T\|u_t-u_{t-1}\|$. In a bounded domain of diameter $D$, we can bound $\|u_T\|^2\le D^2$ and $\max_t\|w_t\|\le D$, and then by optimally tuning $\eta$ we get $R_T(u)\le O\Big(\sqrt{(D^2+DP_T)\sum_{t=1}^T\|g_t\|^2}\Big)$, which is optimal in the Lipschitz loss setting (Zhang et al., 2018). More generally, using Mirror Descent with regularizer $\psi(w)$, one can derive an analogous bound:

$$R_T(u)\le O\bigg(\psi(u_T)+\max_t\|\nabla\psi(w_t)\|P_T+\sum_{t=1}^T\delta_t\bigg),\qquad(4)$$

where the $\delta_t$ are the stability terms discussed in Section 2. In an unbounded domain, Jacobsen & Cutkosky (2022) use a regularizer of the form $\psi(w)=O(\|w\|\log(\|w\|T)/\eta)$, which enables them to bound $\sum_{t=1}^T\delta_t\le O\big(\eta\sum_{t=1}^T\|g_t\|^2\big)$, and moreover, they show that $\max_t\|\nabla\psi_t(w_t)\|$ can be bounded from above by $O(\log(MT/\epsilon)/\eta)$ after adding a composite penalty $\phi_t(w)=\eta\|g_t\|^2\|w\|$ to the update. Then optimally tuning $\eta$ leads to regret scaling as

$$R_T(u)\le\tilde O\bigg(\sqrt{(M^2+MP_T)\sum_{t=1}^T\|g_t\|^2}\bigg),\qquad(5)$$

where $M=\max_t\|u_t\|$, which matches the bound from the bounded-domain setting up to logarithmic terms.

In our setting, the situation gets significantly more challenging. As in Section 2, we will need to include an $O(\|w\|^2/\eta)$ term in the regularizer $\psi_t$ in order to control the non-Lipschitz part of the loss function. However, as above this leads to a coupling $\max_t\|\nabla\psi_t(w_t)\|P_T\approx\max_t\|w_t\|P_T/\eta$ in the dynamic regret, and the term $\max_t\|w_t\|$ is generally too large to cancel out with additional regularization as done by Jacobsen & Cutkosky (2022). Even more troubling is that our lower bound in Theorem 4.2 suggests that the ideal dependence would be $O(MP_T/\eta)$, which we can only hope to achieve by constraining $w_t$ to a ball of diameter proportional to $M=\max_t\|u_t\|$. Yet $M$ is unknown to the learner!

Luckily, hope is not all lost. Taking inspiration from Luo et al. (2022), we can still attain a bound similar to Equation (5) by tuning the diameter of an artificial domain constraint. The approach is as follows: for each $(\eta,D)$ in some set $S$, we run an instance of gradient descent $A(\eta,D)$ which uses step-size $\eta$ and projects to the set $W_D=\{w\in W:\|w\|\le D\}$. Then, using a carefully designed experts algorithm, it is possible to ensure that the overall regret of the algorithm scales roughly as $R_T(u)\le\tilde O\big(R_T^{A(\eta,D)}(u)\big)$ for any $(\eta,D)\in S$. Thus if we can ensure that there is some $(\eta,D)\in S$ for which $D\approx M$ and $\eta$ is near-optimal, then we'll be able to achieve dynamic regret with the desired $MP_T$ dependence. The overall procedure is given in Algorithm 3 below.

Algorithm 3: Dynamic Regret Algorithm
  Input: $G_{\max}$, $L_{\max}$, weights $\beta_1,\dots,\beta_T$ in $[0,1]$, hyperparameter set $S\subseteq\{(\eta,D):\eta\le\frac{1}{8L_{\max}},D>0\}$, $p_1\in\Delta_{|S|}$
  for $\tau=(\eta,D)\in S$ do
    Initialize: $w_1^{(\tau)}=0$, $q_1(\tau)=p_1(\tau)$
    Define $\mu_\tau=\frac{1}{2D(G_{\max}+D/\eta)}$
    Define $\psi_\tau(x)=\frac{9}{2\mu_\tau}\int_0^x\log(v)\,dv$
  end for
  for $t=1:T$ do
    Play $w_t=\sum_{\tau\in S}p_t(\tau)w_t^{(\tau)}$
    Query $\ell_t(\tilde w_t)$ for any $\tilde w_t\in W$ s.t. $\|\tilde w_t\|\le D_{\min}$
    for $\tau=(\eta,D)\in S$ do
      Query $g_t^{(\tau)}\in\partial\ell_t(w_t^{(\tau)})$
      Set $w_{t+1}^{(\tau)}=\Pi_{\{w\in W:\|w\|\le D\}}\big(w_t^{(\tau)}-\eta(1+8\eta L_t)g_t^{(\tau)}\big)$
      Define $\tilde\ell_{t,\tau}=\ell_t(w_t^{(\tau)})-\ell_t(\tilde w_t)$
    end for
    Set $q_{t+1}=\arg\min_{q\in\Delta_{|S|}}\sum_{\tau\in S}(\tilde\ell_{t\tau}+\mu_\tau\tilde\ell_{t\tau}^2)q_\tau+D_{\psi_\tau}(q_\tau|p_{t\tau})$
    Set $p_{t+1}=(1-\beta_t)q_{t+1}+\beta_tp_1$
  end for
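The structure of Algorithm 3 is easy to see in code. In the sketch below, every interface name is ours; for readability the inner weight update is a simple exponential step with the per-expert scale $\mu_\tau$ and a uniform fixed-share prior, standing in for the multi-scale update of Algorithm 5, so this sketch does not inherit the paper's guarantee.

```python
import numpy as np

def project(w, D):
    n = np.linalg.norm(w)
    return w if n <= D else w * (D / n)

def algorithm3_sketch(loss, grad, T, dim, etas, Ds, G_max, L_ts):
    """loss(t, w), grad(t, w): loss value / subgradient oracles; L_ts: per-round L_t."""
    S = [(eta, D) for eta in etas for D in Ds]
    mus = np.array([1.0 / (2 * D * (G_max + D / eta)) for eta, D in S])
    W = [np.zeros(dim) for _ in S]               # one projected-OGD expert per (eta, D)
    p = np.full(len(S), 1.0 / len(S))
    beta = 1.0 - np.exp(-1.0 / T)
    plays = []
    for t in range(T):
        w_t = sum(pi * wi for pi, wi in zip(p, W))   # combined play w_t
        plays.append(w_t)
        ref = np.zeros(dim)                          # any point with ||ref|| <= D_min
        ell = np.array([loss(t, wi) - loss(t, ref) for wi in W])
        q = p * np.exp(-mus * (ell + mus * ell**2))  # simplified stand-in for the argmin step
        q /= q.sum()
        p = (1.0 - beta) * q + beta * np.full(len(S), 1.0 / len(S))  # fixed share
        for i, (eta, D) in enumerate(S):
            g = grad(t, W[i])
            W[i] = project(W[i] - eta * (1.0 + 8.0 * eta * L_ts[t]) * g, D)
    return plays
```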
The following theorem, proven in Appendix C, characterizes an algorithm which achieves dynamic regret analogous to the above bounds, and in Theorem 4.2 we show that this is indeed unimprovable. Notably, our result also automatically improves to a novel $L^*$ bound when the losses are smooth.

Theorem 4.1. For all $t$ let $\ell_t:W\to\mathbb{R}$ be a $(G_t,L_t)$-quadratically bounded convex function with $G_t\in[0,G_{\max}]$ and $L_t\in[0,L_{\max}]$. Let $\epsilon>0$, $\beta_t\le1-\exp(-1/T)$ for all $t$, and for any $i,j\ge0$ let $D_j=\frac{\epsilon2^j}{\sqrt{T}}\wedge2T$ and $\eta_i=\frac{\epsilon2^i}{8(G_{\max}+\epsilon L_{\max})T}\wedge\frac{1}{8L_{\max}}$, and let $S=\{(\eta_i,D_j):i,j\ge0\}$. For each $\tau=(\eta,D)\in S$ let $\mu_\tau=\frac{1}{2D(G_{\max}+D/\eta)}$ and set $p_1(\tau)=\frac{\mu_\tau^2}{\sum_{\tilde\tau\in S}\mu_{\tilde\tau}^2}$. Then for any $u=(u_1,\dots,u_T)$ in $W$, Algorithm 3 guarantees

$$R_T(u)\le O\bigg(G_{\max}(M+\epsilon)\Lambda_T+L_{\max}(M+\epsilon)^2\Lambda_T+G_{\max}P_T+L_{\max}(M+\epsilon)P_T+\sqrt{(M^2\Lambda_T+MP_T)\Omega_T}\bigg),$$

where $\Lambda_T\le O\Big(\log\big(\frac{MT\log(\sqrt{T})}{\epsilon}\big)+\log\log\frac{G_{\max}}{\epsilon L_{\max}}\Big)$, $\Omega_T\le\sum_{t=1}^TG_t^2+L_t^2M^2$, $P_T=\sum_{t=2}^T\|u_t-u_{t-1}\|$, and $M=\max_t\|u_t\|$. Moreover, when the losses are $L_t$-smooth, $\Omega_T$ automatically improves to $\Omega_T\le\min\Big\{\sum_{t=1}^TL_t[\ell_t(u_t)-\ell_t^*],\ \sum_{t=1}^TG_t^2+L_t^2M^2\Big\}$.

Hiding constants and logarithmic terms, the bound is effectively

$$R_T(u)\le\tilde O\bigg(\sqrt{(M^2+MP_T)\sum_{t=1}^TG_t^2+L_t^2M^2}\bigg).$$

Notice that our result again generalizes the bounds established in prior works. Unfortunately, the result is not a strict generalization, as Theorem 4.1 requires $L_{\max}>0$ for the hyperparameter set $S$ to be finite. To achieve a strict generalization, one can simply define a procedure which runs the algorithm of Jacobsen & Cutkosky (2022) when $L_{\max}=0$ and Algorithm 3 otherwise; this is possible because $L_{\max}$ must be provided as input to the algorithm. Notably, the algorithm of Jacobsen & Cutkosky (2022) does not use the aforementioned domain-tuning trick and requires significantly less per-round computation as a result ($O(d\log(T))$ vs. $O(d\sqrt{T}\log(\sqrt{T}))$). We leave open the question of whether there exists a unifying analysis for $L_{\max}=0$ and $L_{\max}>0$, and whether the per-round computation can be improved.

As in Section 2, we again observe an additional penalty associated with non-Lipschitzness, this time on the order of $\tilde O\Big(M^{3/2}\sqrt{(M+P_T)\sum_{t=1}^TL_t^2}\Big)$. The following theorem shows that these penalties are unavoidable in general (proof in Appendix C.3).

Theorem 4.2. For any $M>0$ there is a sequence of $(G,L)$-quadratically bounded functions with $G/L\ge M$ such that for any $\gamma\in[0,\frac12]$,

$$R_T(u)\ge\Omega\Big(GM^{1-\gamma}[P_TT]^\gamma+LM^{2-\gamma}[P_TT]^\gamma\Big),$$

where $P_T=\sum_{t=2}^T\|u_t-u_{t-1}\|$ and $M\ge\max_t\|u_t\|$.

Notice that with $\gamma=\frac12$, we have $R_T(u)\ge\Omega\big(G\sqrt{MP_TT}+LM^{3/2}\sqrt{P_TT}\big)$, matching our upper bound in Theorem 4.1 up to logarithmic terms. On the other hand, for $\gamma=0$ we have $R_T(u)\ge\Omega(GM+LM^2)$, suggesting that the lower-order terms of our upper bound are also necessary. We also note that the assumption $G/L\ge M$ is without loss of generality: when $G/L\le M$ one can construct a sequence of $(G+LM)$-Lipschitz losses according to existing lower bounds to show that $R_T(u)\ge\Omega\big((G+LM)\sqrt{MP_TT}\big)=\Omega\big(G\sqrt{MP_TT}+LM^{3/2}\sqrt{P_TT}\big)$.

Interestingly, when the losses are smooth, the bound in Theorem 4.1 has the appealing property that it automatically improves to an $L^*$ bound of the form

$$R_T(u)\le\tilde O\bigg(\sqrt{(M^2+MP_T)\sum_{t=1}^TL_t[\ell_t(u_t)-\ell_t^*]}\bigg),$$

which matches bounds established in the Lipschitz and bounded-domain setting up to logarithmic penalties (Zhao et al., 2020). This is the first $L^*$ bound that we are aware of to be achieved in an unbounded domain for general smooth losses without a Lipschitz or bounded-range assumption. Moreover, our bound features improved adaptivity to the individual $L_t$'s, scaling as $\sum_{t=1}^TL_t[\ell_t(u_t)-\ell_t^*]$ instead of the usual $L_{\max}\sum_{t=1}^T\ell_t(u_t)-\ell_t^*$ achieved by prior works (Srebro et al., 2010; Orabona et al., 2012; Zhao et al., 2020).

On the other hand, our upper bound contains terms of the form $\frac{G_{\max}}{L_{\max}\epsilon}$. Such ratios are unappealing in general: because $G_{\max}$ and $L_{\max}$ are not under our control, it's possible for this ratio to be arbitrarily large. Fortunately, this ratio shows up only in doubly-logarithmic terms, and hence these penalties can be regarded as effectively constant as far as the regret bound is concerned.
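The hyperparameter grid of Theorem 4.1 is straightforward to construct; the sketch below builds it with geometric spacing in both coordinates. The exact endpoints follow our reading of the (garbled) source above and should be treated as an illustrative assumption.

```python
import numpy as np

def build_grid(T, eps, G_max, L_max):
    """Hypothetical construction of S = {(eta_i, D_j)} per Theorem 4.1 (our reading)."""
    eta_min = eps / (8 * (G_max + eps * L_max) * T)
    eta_max = 1.0 / (8 * L_max)
    n_i = int(np.ceil(np.log2(eta_max / eta_min))) + 1
    etas = [min(eta_min * 2**i, eta_max) for i in range(n_i)]
    D_min, D_max = eps / np.sqrt(T), 2.0 * T
    n_j = int(np.ceil(np.log2(D_max / D_min))) + 1
    Ds = [min(D_min * 2**j, D_max) for j in range(n_j)]
    return [(eta, D) for eta in etas for D in Ds]

S = build_grid(T=10_000, eps=1.0, G_max=10.0, L_max=1.0)
print(len(S))   # number of experts maintained by Algorithm 3
```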
A more pressing issue is that the ratio $\frac{G_{\max}}{\epsilon L_{\max}}$ shows up in the number of experts. That is, setting $S$ as in Theorem 4.1 requires a collection of $O\big(\sqrt{T}\log_2(\sqrt{T})+\sqrt{T}\log_2(G_{\max}/(L_{\max}\epsilon))\big)$ experts, so in practice we can only tolerate $G_{\max}/(L_{\max}\epsilon)\le\mathrm{poly}(T)$ without increasing the (already quite high!) order of computation. We note that any algorithm that guarantees $R_T(0)\le G_{\max}\epsilon$ can't hope to ensure non-vacuous regret when $G_{\max}/(L_{\max}\epsilon)>T$ anyways, so this seems to be a fundamental restriction in this setting. Nevertheless, the following result shows that if we know a priori that the losses will be smooth, then we can avoid this $\log\log\frac{G_{\max}}{L_{\max}\epsilon}$ penalty entirely and reduce the number of experts to $\sqrt{T}\log_2(\sqrt{T})$ by instead setting $\eta_{\min}\approx\frac{1}{L_{\max}\sqrt{T}}$. The proof can be found in Appendix C.2.

Theorem 4.3. For all $t$ let $\ell_t:W\to\mathbb{R}$ be a $(G_t,L_t)$-quadratically bounded and $L_t$-smooth convex function with $G_t\in[0,G_{\max}]$ and $L_t\in[0,L_{\max}]$. Let $\epsilon>0$ and for any $i,j\ge0$ let $D_j=\frac{\epsilon2^j}{\sqrt{T}}\wedge2T$ and $\eta_i=\frac{2^i}{L_{\max}\sqrt{T}}\wedge\frac{1}{8L_{\max}}$, and let $S=\{(\eta_i,D_j):i,j\ge0\}$. Then for any $u=(u_1,\dots,u_T)$ in $W$, Algorithm 3 guarantees

$$R_T(u)\le O\bigg(G_{\max}(M+\epsilon)\Lambda_T+L_{\max}(M+\epsilon)^2\Lambda_T+L_{\max}(M+\epsilon)P_T+\sqrt{(M^2\Lambda_T+MP_T)\sum_{t=1}^TL_t[\ell_t(u_t)-\ell_t^*]}\bigg),$$

where $\Lambda_T\le O\big(\log\big(\frac{MT}{\epsilon}\big)\big)$, $M=\max_t\|u_t\|$, and $P_T=\sum_{t=2}^T\|u_t-u_{t-1}\|$.

5. Conclusion

In this paper, we developed new algorithms for online learning in unbounded domains with potentially unbounded losses. We achieve several regret guarantees that have previously only been attained by assuming Lipschitz losses, losses with bounded range, a bounded domain, or a combination thereof. We provide algorithms for both static and dynamic regret, as well as an application in saddle-point optimization leading to new results for unbounded decision sets. Our lower bounds show that our results are optimal, and moreover, our algorithms achieve these results without appealing to any instance-specific hyperparameter tuning. There are a few natural directions for future work. It is still unclear whether the dynamic regret achieved in Theorem 4.1 can be achieved in the more general QB-OLO setting. Moreover, while our dynamic regret algorithm attains the optimal bound, it requires $O(d\sqrt{T}\log(\sqrt{T}))$ computation per round, whereas the optimal bounds in the Lipschitz loss setting are attained using $O(d\log(T))$ per-round computation. Yet it is unclear how to achieve the lower bound in Theorem 4.2 without the artificial domain trick discussed in Section 4. We leave these questions as exciting directions for future work.

Acknowledgements

AJ is supported by an NSERC CGS-D Scholarship. AC is supported by NSF grant CCF-2211718, a Google gift and an Amazon Research Award.

References

Azizian, W., Scieur, D., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. Accelerating smooth games by manipulating spectral shapes. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 1705–1715. PMLR, 2020.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Cesa-Bianchi, N., Long, P. M., and Warmuth, M. K. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996.

Cesa-Bianchi, N., Gaillard, P., Lugosi, G., and Stoltz, G. Mirror descent meets fixed share (and feels no regret). Advances in Neural Information Processing Systems, 25, 2012.
Chen, L., Luo, H., and Wei, C.-Y. Impossible tuning made possible: A new expert algorithm and its applications. In Proceedings of the Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pp. 1216–1259. PMLR, 2021.

Cutkosky, A. Artificial constraints and hints for unbounded online learning. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 874–894. PMLR, 2019.

Cutkosky, A. Parameter-free, dynamic, and strongly-adaptive online learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2250–2259. PMLR, 2020.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 1493–1529. PMLR, 2018.

Du, S. S., Gidel, G., Jordan, M. I., and Li, C. J. Optimal extragradient-based bilinearly-coupled saddle-point optimization, 2022.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory, 2010.

Foster, D. J., Rakhlin, A., and Sridharan, K. Adaptive online learning. In Advances in Neural Information Processing Systems 28, 2015.

Gyorgy, A. and Szepesvari, C. Shifting regret, mirror descent, and matrices. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2943–2951. PMLR, 2016.

Hall, E. C. and Willett, R. M. Online optimization in dynamic environments, 2016.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Hazan, E., Rakhlin, A., and Bartlett, P. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.

Ibrahim, A., Azizian, W., Gidel, G., and Mitliagkas, I. Linear lower bounds and conditioning of differentiable games. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 4583–4593. PMLR, 2020.

Jacobsen, A. and Cutkosky, A. Parameter-free mirror descent, 2022.

Jadbabaie, A., Rakhlin, A., Shahrampour, S., and Sridharan, K. Online optimization: Competing with dynamic comparators. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 398–406. PMLR, 2015.

Kempka, M., Kotlowski, W., and Warmuth, M. K. Adaptive scale-invariant online algorithms for learning linear models. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3321–3330. PMLR, 2019.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
Liu, M. and Orabona, F. On the initialization for convex-concave min-max problems. In Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pp. 743–767. PMLR, 2022.

Luo, H., Zhang, M., Zhao, P., and Zhou, Z.-H. Corralling a larger band of bandits: A case study on switching regret for linear bandits, 2022.

Mayo, J. J., Hadiji, H., and van Erven, T. Scale-free unconstrained online learning for curved losses, 2022.

McMahan, B. and Streeter, M. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.

McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In Conference on Learning Theory, 2010.

Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 2858–2887. PMLR, 2020.

Orabona, F. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.

Orabona, F. A modern introduction to online learning. CoRR, abs/1912.13213, 2019.

Orabona, F. and Pál, D. Coin betting and parameter-free online learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 577–585. Curran Associates Inc., 2016.

Orabona, F. and Pál, D. Scale-free online learning. Theoretical Computer Science, 716:50–69, 2018.

Orabona, F., Cesa-Bianchi, N., and Gentile, C. Beyond logarithmic bounds in online learning. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pp. 823–831. PMLR, 2012.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2011.

Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. Advances in Neural Information Processing Systems, 23, 2010.

Telgarsky, M. Stochastic linear optimization never overfits with quadratically-bounded losses on general data, 2022.

van der Hoeven, D. User-specified local differential privacy in unconstrained adaptive online learning. In NeurIPS, pp. 14080–14089, 2019.

Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 1330–1340, 2018.

Zhao, P., Zhang, Y.-J., Zhang, L., and Zhou, Z.-H. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems, volume 33, pp. 12510–12520. Curran Associates, Inc., 2020.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, 2003.

A. Centered Mirror Descent with Adjustment

Algorithm 4: Centered Mirror Descent with Adjustment
  Input: $w_1\in W$, $\psi_1:W\to\mathbb{R}_{\ge0}$
  for $t=1:T$ do
    Play $w_t\in W$, observe $g_t$
    Choose functions $\psi_{t+1}$, $\varphi_t$
    Define $\Delta_t(w)=D_{\psi_{t+1}}(w|w_1)-D_{\psi_t}(w|w_1)$
    Update $\tilde w_{t+1}=\arg\min_{\tilde w\in W}\langle g_t,\tilde w\rangle+D_{\psi_t}(\tilde w|w_t)+\Delta_t(\tilde w)+\varphi_t(\tilde w)$
    Choose mapping $M_{t+1}:W\to W$
    Update $w_{t+1}=M_{t+1}(\tilde w_{t+1})$
  end for

The key tool we'll use to build our algorithms is a slight generalization of the Centered Mirror Descent algorithm of Jacobsen & Cutkosky (2022), which accounts for an additional post-hoc adjustment of $w_t$ through the use of an arbitrary mapping $M_t:W\to W$. Algorithms of this form have been studied in prior works such as Gyorgy & Szepesvari (2016); Hall & Willett (2016), wherein $M_t$ is interpreted as a dynamical model. In contrast, we will use $M_t$ as a convenient way to formulate a multi-scale version of the fixed-share update, similar to the generalized share algorithm of Cesa-Bianchi et al. (2012). Note that when $M_t$ is the identity mapping, Algorithm 4 is identical to the Centered Mirror Descent algorithm of Jacobsen & Cutkosky (2022). The following lemma provides a regret template for Algorithm 4. Observe that several of the key terms related to the algorithm's stability replace the mirror descent iterates $\tilde w_t$ with the adjusted iterates $w_t=M_t(\tilde w_t)$. The trade-off is that we must be careful to ensure that the new penalty terms $\xi_t=D_{\psi_{t+1}}(u_t|w_{t+1})-D_{\psi_{t+1}}(u_t|\tilde w_{t+1})$ are not too large, which places an implicit restriction on how much we can adjust the iterates via $M_t$.

Lemma A.1. For all $t$ let $\psi_t:W\to\mathbb{R}_{\ge0}$ be differentiable convex functions, $\varphi_t:W\to\mathbb{R}_{\ge0}$ be subdifferentiable convex functions, and let $M_t:W\to W$ be arbitrary mappings. Then for any sequence $u=(u_1,\dots,u_T)$ in $W$, Algorithm 4 guarantees

$$R_T(u)\le D_{\psi_{T+1}}(u_T|w_1)-D_{\psi_{T+1}}(u_T|w_{T+1})+\sum_{t=1}^T\varphi_t(u_t)+\sum_{t=2}^T\underbrace{\langle\nabla\psi_t(w_t)-\nabla\psi_t(w_1),u_{t-1}-u_t\rangle}_{=:P_t}$$
$$+\sum_{t=1}^T\underbrace{D_{\psi_{t+1}}(u_t|w_{t+1})-D_{\psi_{t+1}}(u_t|\tilde w_{t+1})}_{=:\xi_t}+\sum_{t=1}^T\underbrace{\langle g_t,w_t-\tilde w_{t+1}\rangle-D_{\psi_t}(\tilde w_{t+1}|w_t)-\Delta_t(\tilde w_{t+1})-\varphi_t(\tilde w_{t+1})}_{=:\delta_t}.$$

Proof. Letting $g_t\in\partial\ell_t(w_t)$, we have

$$R_T(u)\le\sum_{t=1}^T\langle g_t,w_t-u_t\rangle=\sum_{t=1}^T\langle g_t,\tilde w_{t+1}-u_t\rangle+\sum_{t=1}^T\langle g_t,w_t-\tilde w_{t+1}\rangle.$$

From the first-order optimality condition for $\tilde w_{t+1}=\arg\min_{\tilde w\in W}\langle g_t,\tilde w\rangle+D_{\psi_t}(\tilde w|w_t)+\Delta_t(\tilde w)+\varphi_t(\tilde w)$, for any $u_t\in W$ we have

$$0\le\langle g_t+\nabla\psi_t(\tilde w_{t+1})-\nabla\psi_t(w_t)+\nabla\Delta_t(\tilde w_{t+1})+\nabla\varphi_t(\tilde w_{t+1}),\,u_t-\tilde w_{t+1}\rangle,$$
Plugging this back into the previous display, we have t=1 ψt(wt) ψt( ewt+1), ewt+1 ut t=1 [ t(ut) + φt(ut)] + t=1 D t(ut| ewt+1) t=1 gt, wt ewt+1 t( ewt+1) φt( ewt+1), and using the three-point relation for Bregman divergences f(y) f(x), x z = Df(z|y) Df(z|x) Df(x|y): t=1 Dψt(ut|wt) Dψt(ut| ewt+1) + t=1 [ t(ut) + φt(ut)] + t=1 D t(ut| ewt+1) t=1 gt, wt ewt+1 Dψt( ewt+1|wt) t( ewt+1) φt( ewt+1) | {z } =:δt t=1 Dψt(ut|wt) Dψt+1(ut| ewt+1) + t=1 [ t(ut) + φt(ut)] + δ1:T t=1 Dψt(ut|wt) Dψt+1(ut|wt+1) + t=1 Dψt+1(ut|wt+1) Dψt+1(ut| ewt+1) | {z } =:ξt t=1 [ t(ut) + φt(ut)] + δ1:T = Dψ1(u1|w1) DψT +1(u T |w T +1) + t=2 [Dψt(ut|wt) Dψt(ut 1|wt)] + | {z } =:( ) t=1 φt(ut) + ξ1:T + δ1:T , Unconstrained Online Learning with Unbounded Losses where (a) observes D t(ut| ewt+1) = Dψt+1(ut| ewt+1) Dψt(ut| ewt+1). Observe that the term ( ) simplifies as t=1 t(ut) + t=2 Dψt(ut|wt) Dψt(ut 1|wt) t=1 Dψt+1(ut|w1) Dψt(ut|w1) + t=2 Dψt(ut|wt) Dψt(ut 1|wt) = DψT +1(u T |w1) Dψ1(u1|w1) t=1 Dψt+1(ut|w1) Dψt+1(ut+1|w1) t=2 Dψt(ut|wt) Dψt(ut 1|wt) = DψT +1(u T |w1) Dψ1(u1|w1) t=2 Dψt(ut 1|w1) Dψt(ut|w1) t=2 Dψt(ut|wt) Dψt(ut 1|wt) = DψT +1(u T |w1) Dψ1(u1|w1) t=2 ψt(ut 1) ψt(ut) ψt(w1), ut 1 ut t=2 ψt(ut) ψt(ut 1) ψt(wt), ut ut 1 = DψT +1(u T |w1) Dψ1(u1|w1) + t=2 ψt(wt) ψt(w1), ut 1 ut | {z } =:Pt so plugging this back in above yields RT (u) Dψ1(u1|w1) DψT +1(u T |w T +1) + ( ) + t=1 φt(ut) + ξ1:T + δ1:T Dψ1(u1|w1) DψT +1(u T |w T +1) + DψT +1(u T |w1) Dψ1(u1|w1) + P2:T t=1 φt(ut) + ξ1:T + δ1:T = DψT +1(u T |w1) DψT +1(u T |w T +1) + t=1 φt(ut) + P2:T + ξ1:T + δ1:T . In this paper, we will frequently use composite penalties φt which are a linearization of some other function ϕt. The next lemma shows how this changes the bound. Unconstrained Online Learning with Unbounded Losses Lemma A.2. Under the same conditions as Lemma A.1, let ϕt : W R+ be subdifferentiable convex functions and suppose we set φt(w) = ϕt(wt), w for some ϕt(wt) ϕt(wt). Then Algorithm 4 guarantees RT (u) DψT +1(u T |w1) DψT +1(u T |w T +1) + t=1 ϕt(ut) + P2:T + ξ1:T t=1 egt, wt ewt+1 Dψt( ewt+1|wt) t( ewt+1) ϕt(wt) | {z } =:δt where egt = gt + ϕt(wt). Proof. The proof is immediate by convexity of ϕt. From Lemma A.1 we have RT (u) DψT +1(u T |w1) DψT +1(u T |w T +1) + t=1 φt(ut) + P2:T + ξ1:T t=1 gt, wt ewt+1 Dψt( ewt+1|wt) t( ewt+1) φt( ewt+1). Observe that we can write t=1 φt(ut) φt( ewt+1) = t=1 ϕt(wt), ut ewt+1 t=1 ϕt(wt), ut wt + ϕt(wt), wt ewt+1 t=1 ϕt(ut) ϕt(wt) + ϕt(wt), wt ewt+1 , so plugging this back in above gives the stated result: RT (u) DψT +1(u T |w1) DψT +1(u T |w T +1) + t=1 ϕt(ut) + P2:T + ξ1:T t=1 egt, wt ewt+1 Dψt( ewt+1|wt) t( ewt+1) ϕt(wt), where egt = gt + ϕt(wt). A.1. Multi-scale Experts Algorithm Algorithm 5 Multi-scale Fixed-share Input: p1 N (0, 1]N, µ1, . . . , µN in R>0, k > 0, weights β1, . . . , βT in [0, 1] Initialize: q1 = p1 Define ψi(x) = k µi R x 0 log (v) dv for i [N] for t = 1 : T do Play pt N, receive loss eℓt RN Update qt+1 = arg minq N PN i=1(eℓti + µieℓ2 ti)qi + Dψi(qi|pti) Set pt+1 = (1 βt)qt+1 + βtp1 end for For completeness, in this section we provide a multi-scale experts algorithm which achieves the bound required for our dynamic regret algorithm in Section 4. Our approach is inspired by the Multi-scale Multiplicative-weight with Correction Unconstrained Online Learning with Unbounded Losses (Ms Mw C) algorithm of Chen et al. (2021), but formulated as a fixed-share update instead of an update on a clipped simplex e N = N [β, 1]N. 
The MsMwC algorithm provides a guarantee analogous to the following theorem, but formulating the method as a fixed-share update will allow us a bit more modularity when constructing our dynamic regret algorithm in Appendix C, which requires several rather delicate conditions to come together in the right way.

Theorem A.3. Let $k\ge\frac92$ and assume $\mu_1,\dots,\mu_N$ satisfy $\mu_i|\tilde\ell_{ti}|\le1$ for all $t\in[T]$ and $i\in[N]$. Then for any $u\in\Delta_N$, Algorithm 5 guarantees

$$\sum_{t=1}^T\langle\tilde\ell_t,p_t-u\rangle\le\sum_{i=1}^N\frac{k\Big[u_i\log(u_i/p_{1i})+u_i\sum_{t=1}^T\log\frac{1}{1-\beta_t}+(1+\beta_{1:T})p_{1i}\Big]}{\mu_i}+\sum_{t=1}^T\sum_{i=1}^N\mu_i\tilde\ell_{ti}^2u_i.$$

Moreover, for $\beta_t\le1-\exp(-1/T)$,

$$\sum_{t=1}^T\langle\tilde\ell_t,p_t-u\rangle\le\sum_{i=1}^N\frac{k\big[u_i(\log(u_i/p_{1i})+1)+2p_{1i}\big]}{\mu_i}+\sum_{t=1}^T\sum_{i=1}^N\mu_i\tilde\ell_{ti}^2u_i.$$

Proof. The described algorithm is an instance of Algorithm 4 applied to the simplex $\Delta_N$ with $\phi_t(p)=\sum_{i=1}^N\mu_i\tilde\ell_{ti}^2p_i$ and $M_{t+1}(p)=(1-\beta_t)p+\beta_tp_1$. Applying Lemma A.2:

$$\sum_{t=1}^T\langle\tilde\ell_t,p_t-u\rangle\le D_\psi(u|p_1)-D_\psi(u|p_{T+1})+\phi_{1:T}(u)+\xi_{1:T}+\delta_{1:T},$$

where $\xi_t=D_\psi(u|p_{t+1})-D_\psi(u|q_{t+1})$ and $\delta_t=\langle\tilde\ell_t+\nabla\phi_t(p_t),p_t-q_{t+1}\rangle-D_\psi(q_{t+1}|p_t)-\phi_t(p_t)$. Observe that for any $u$, $p$, and $q$ in $\Delta_N$ we can write

$$D_\psi(u|p)-D_\psi(u|q)=\sum_{i=1}^N\frac{k}{\mu_i}[u_i\log(u_i/p_i)-u_i+p_i]-\sum_{i=1}^N\frac{k}{\mu_i}[u_i\log(u_i/q_i)-u_i+q_i]=\sum_{i=1}^N\frac{k}{\mu_i}[u_i\log(q_i/p_i)+p_i-q_i],$$

so

$$D_\psi(u|p_1)-D_\psi(u|p_{T+1})=\sum_{i=1}^N\frac{k}{\mu_i}\big[u_i\log(p_{T+1,i}/p_{1i})+p_{1i}-p_{T+1,i}\big]\le\sum_{i=1}^N\frac{k}{\mu_i}\sup_{p\ge0}\big[u_i\log(p/p_{1i})+p_{1i}-p\big]\le\sum_{i=1}^N\frac{k}{\mu_i}\big[u_i\log(u_i/p_{1i})+p_{1i}\big],$$

and similarly

$$\xi_{1:T}=\sum_{t=1}^T\sum_{i=1}^N\frac{k}{\mu_i}\Big[u_i\log\frac{q_{t+1,i}}{(1-\beta_t)q_{t+1,i}+\beta_tp_{1i}}+\beta_t(p_{1i}-q_{t+1,i})\Big]\le\sum_{t=1}^T\sum_{i=1}^N\frac{k}{\mu_i}\Big[u_i\log\frac{1}{1-\beta_t}+\beta_tp_{1i}\Big],$$

where the last step recalls $p_1=q_1$, bounds $\frac{q_{t+1,i}}{(1-\beta_t)q_{t+1,i}+\beta_tp_{1i}}\le\frac{1}{1-\beta_t}$, and drops $-\beta_tq_{t+1,i}\le0$. Plugging these bounds back into the regret bound above yields

$$\sum_{t=1}^T\langle\tilde\ell_t,p_t-u\rangle\le\sum_{i=1}^N\frac{k\Big[u_i\log(u_i/p_{1i})+u_i\sum_{t=1}^T\log\frac{1}{1-\beta_t}+(1+\beta_{1:T})p_{1i}\Big]}{\mu_i}+\phi_{1:T}(u)+\sum_{t=1}^T\sum_{i=1}^N\underbrace{(\tilde\ell_{ti}+\mu_i\tilde\ell_{ti}^2)(p_{ti}-q_{t+1,i})-D_{\psi_i}(q_{t+1,i}|p_{ti})-\mu_i\tilde\ell_{ti}^2p_{ti}}_{=:\delta_{ti}},\qquad(6)$$

where the last line recalls $\delta_t=\langle\tilde\ell_t+\nabla\phi_t(p_t),p_t-q_{t+1}\rangle-D_\psi(q_{t+1}|p_t)-\phi_t(p_t)$, $\phi_t(p)=\sum_{i=1}^N\mu_i\tilde\ell_{ti}^2p_i$, and writes $\psi(p)=\sum_{i=1}^N\psi_i(p_i)$ with $\psi_i(p)=\frac{k}{\mu_i}\int_0^p\log(x)\,dx$.

We next focus our attention on the terms in the last line, $\delta_{ti}$. Note that by construction we have $p_{ti}=(1-\beta_t)q_{ti}+\beta_tq_{1i}\ge\beta_tq_{1i}>0$ for all $i$. Thus, $\psi_i(p)=\frac{k}{\mu_i}\int_0^p\log(v)\,dv$ is twice differentiable everywhere on the line connecting $p_{ti}$ and $q_{t+1,i}$ for any $i$ with $q_{t+1,i}>0$. For any such $i$, we have via Taylor's theorem that there exists a $\tilde p_i$ on the line connecting $p_{ti}$ and $q_{t+1,i}$ such that

$$D_{\psi_i}(q_{t+1,i}|p_{ti})=\frac12(p_{ti}-q_{t+1,i})^2\psi_i''(\tilde p_i)=\frac{(p_{ti}-q_{t+1,i})^2k}{2\mu_i\tilde p_i},$$

so using this with the assumption that $\mu_i|\tilde\ell_{ti}|\le1$, we have

$$\delta_{ti}\le|\tilde\ell_{ti}+\mu_i\tilde\ell_{ti}^2|\,|p_{ti}-q_{t+1,i}|-\frac{(p_{ti}-q_{t+1,i})^2k}{2\mu_i\tilde p_i}-\mu_i\tilde\ell_{ti}^2p_{ti}\le2|\tilde\ell_{ti}|\,|p_{ti}-q_{t+1,i}|-\frac{(p_{ti}-q_{t+1,i})^2k}{2\mu_i\tilde p_i}-\mu_i\tilde\ell_{ti}^2\tilde p_i+\mu_i\tilde\ell_{ti}^2|p_{ti}-\tilde p_i|$$
$$\overset{(a)}{\le}3|\tilde\ell_{ti}|\,|p_{ti}-q_{t+1,i}|-\frac{(p_{ti}-q_{t+1,i})^2k}{2\mu_i\tilde p_i}-\mu_i\tilde\ell_{ti}^2\tilde p_i\overset{(b)}{\le}\frac{9\mu_i\tilde\ell_{ti}^2\tilde p_i}{2k}-\mu_i\tilde\ell_{ti}^2\tilde p_i\le0,$$

where $(a)$ uses $|\tilde p_i-p_{ti}|\le|q_{t+1,i}-p_{ti}|$ for any $\tilde p_i$ on the line connecting $q_{t+1,i}$ and $p_{ti}$, and $(b)$ applies the Fenchel–Young inequality and chooses $k\ge\frac92$. Similarly, for any $i$ for which $q_{t+1,i}=0$ we have

$$\delta_{ti}=(\tilde\ell_{ti}+\mu_i\tilde\ell_{ti}^2)p_{ti}-D_{\psi_i}(0|p_{ti})-\mu_i\tilde\ell_{ti}^2p_{ti}=\tilde\ell_{ti}p_{ti}-\frac{kp_{ti}}{\mu_i}\le0,$$

where the last line again uses $\mu_i|\tilde\ell_{ti}|\le1$. Thus, in either case we have $\delta_{ti}\le0$. Plugging this into Equation (6) reveals the first statement of the theorem. For the second statement, observe that $\beta_t\le1-\exp(-1/T)\le\frac1T$, so $\beta_{1:T}\le1$, and likewise $\log\frac{1}{1-\beta_t}\le\log(\exp(1/T))=\frac1T$, so $\sum_{t=1}^T\log\frac{1}{1-\beta_t}\le1$. Hence, the previous display is bounded as

$$\sum_{t=1}^T\langle\tilde\ell_t,p_t-u\rangle\le\sum_{i=1}^N\frac{k\big[u_i(\log(u_i/p_{1i})+1)+2p_{1i}\big]}{\mu_i}+\sum_{t=1}^T\sum_{i=1}^N\mu_i\tilde\ell_{ti}^2u_i.\qquad\square$$

B. Proofs for Section 2 (Online Learning with Quadratically Bounded Losses)

B.1. Proof of Theorem 2.2

Theorem 2.2.
Let A be an online learning algorithm and let $w_t\in W$ be its output on round $t$. Let $\{g_t\}$ be a $(G_t,L_t)$-quadratically bounded sequence w.r.t. $\{w_t\}$, where $G_t\in[0,G_{\max}]$ and $L_t\in[0,L_{\max}]$ for all $t$. Let $\epsilon>0$, $k\ge3$, $\kappa\ge4$, $c\ge4$, $V_{t+1}=cG_{\max}^2+G_{1:t}^2$, $\rho_{t+1}=\frac{1}{\sqrt{L_{\max}^2+L_{1:t}^2}}$, $\alpha_{t+1}=\frac{\epsilon G_{\max}}{\sqrt{V_{t+1}}\log^2(V_{t+1}/G_{\max}^2)}$, and set

$$\psi_t(w)=k\int_0^{\|w\|}\min_{\eta\le\frac{1}{G_{\max}}}\bigg[\frac{\log(x/\alpha_t+1)}{\eta}+\eta V_t\bigg]dx+\frac{\kappa}{2\rho_t}\|w\|^2\quad\text{and}\quad\phi_t(w)=\frac{L_t^2}{2\sqrt{L_{1:t}^2}}\|w\|^2.$$

Then for any $u\in W$, Algorithm 1 guarantees

$$R_T(u)\le2\epsilon G_{\max}+\kappa\|u\|^2\sqrt{L_{\max}^2+L_{1:T}^2}+2k\|u\|\max\Big\{\sqrt{V_{T+1}F_{T+1}(\|u\|)},\,G_{\max}F_{T+1}(\|u\|)\Big\},$$

where $F_{T+1}(\|u\|)=\log(\|u\|/\alpha_{T+1}+1)$.

Proof. We can assume without loss of generality that $0\in W$, since we could otherwise just perform a coordinate translation. Hence, we have $w_1=\arg\min_{w\in W}\psi_1(w)=0$, and it is easily seen that for any $w\in W$ we'll have $D_{\psi_t}(w|w_1)=D_{\psi_t}(w|0)=\psi_t(w)$. First apply Lemma A.2 with $M_t(w)=w$ and $\phi_t(w)=\frac{L_t^2}{2\sqrt{L_{1:t}^2}}\|w\|^2$ to get

$$\sum_{t=1}^T\langle g_t,w_t-u\rangle\le D_{\psi_{T+1}}(u|w_1)+\phi_{1:T}(u)+\sum_{t=1}^T\underbrace{\langle g_t+\nabla\phi_t(w_t),w_t-w_{t+1}\rangle-D_{\psi_t}(w_{t+1}|w_t)-\Delta_t(w_{t+1})-\phi_t(w_t)}_{=:\delta_t}=\psi_{T+1}(u)+\phi_{1:T}(u)+\delta_{1:T}.$$

Let us first bound the leading term $\psi_{T+1}(u)$. For brevity, denote $F_t(x)=\log(x/\alpha_t+1)$ and let $\Psi_t(\|w\|)=k\int_0^{\|w\|}\min_{\eta\le1/G_{\max}}\big[\frac{F_t(x)}{\eta}+\eta V_t\big]dx$ and $\Phi_t(\|w\|)=\frac{\kappa}{2\rho_t}\|w\|^2$, so that $\psi_t(w)=\Psi_t(\|w\|)+\Phi_t(\|w\|)$. Then, since $\Psi_{T+1}'$ is non-decreasing,

$$\psi_{T+1}(u)=\int_0^{\|u\|}\Psi_{T+1}'(x)\,dx+\frac{\kappa}{2\rho_{T+1}}\|u\|^2\le\|u\|\Psi_{T+1}'(\|u\|)+\frac{\kappa}{2}\|u\|^2\sqrt{L_{\max}^2+L_{1:T}^2},$$

where

$$\Psi_t'(\|u\|)=k\min_{\eta\le\frac{1}{G_{\max}}}\bigg[\frac{F_t(\|u\|)}{\eta}+\eta V_t\bigg]=\begin{cases}2k\sqrt{V_tF_t(\|u\|)}&\text{if }G_{\max}\sqrt{F_t(\|u\|)}\le\sqrt{V_t}\\kG_{\max}F_t(\|u\|)+\frac{kV_t}{G_{\max}}&\text{otherwise}\end{cases}$$
$$\overset{(*)}{\le}\begin{cases}2k\sqrt{V_tF_t(\|u\|)}&\text{if }G_{\max}\sqrt{F_t(\|u\|)}\le\sqrt{V_t}\\2kG_{\max}F_t(\|u\|)&\text{otherwise}\end{cases}=2k\max\Big\{\sqrt{V_tF_t(\|u\|)},\,G_{\max}F_t(\|u\|)\Big\},$$

where $(*)$ observes that $V_t/G_{\max}\le G_{\max}F_t(x)$ whenever $\Psi_t'(x)=kG_{\max}F_t(x)+kV_t/G_{\max}$. Next, using Lemma D.1 we have

$$\phi_{1:T}(u)=\sum_{t=1}^T\frac{L_t^2}{2\sqrt{L_{1:t}^2}}\|u\|^2\le\|u\|^2\sqrt{L_{1:T}^2},$$

so overall we have

$$\sum_{t=1}^T\langle g_t,w_t-u\rangle\le2k\|u\|\max\Big\{\sqrt{V_{T+1}F_{T+1}(\|u\|)},G_{\max}F_{T+1}(\|u\|)\Big\}+\frac{\kappa}{2}\|u\|^2\sqrt{L_{\max}^2+L_{1:T}^2}+\|u\|^2\sqrt{L_{1:T}^2}+\delta_{1:T}.\qquad(7)$$

We conclude by bounding the stability terms $\delta_{1:T}$. Recall that $\delta_t=\langle g_t+\nabla\phi_t(w_t),w_t-w_{t+1}\rangle-D_{\psi_t}(w_{t+1}|w_t)-\Delta_t(w_{t+1})-\phi_t(w_t)$, where $\Delta_t(w)=\psi_{t+1}(w)-\psi_t(w)$. We first separate into terms related to the $G_t$'s and terms related to the $L_t$'s:

$$\delta_t\le(\|g_t\|+\|\nabla\phi_t(w_t)\|)\|w_t-w_{t+1}\|-D_{\psi_t}(w_{t+1}|w_t)-\Delta_t(w_{t+1})-\phi_t(w_t)$$
$$\le\Big[G_t\|w_t-w_{t+1}\|-D_{\Psi_t}(w_{t+1}|w_t)-\Delta_t^\Psi(w_{t+1})\Big]+\Big[2L_t\|w_t\|\|w_t-w_{t+1}\|-D_{\Phi_t}(w_{t+1}|w_t)-\phi_t(w_t)\Big],$$

where we slightly abuse the notations $D_{\Psi_t}$ and $D_{\Phi_t}$ to denote the Bregman divergences w.r.t. the functions $w\mapsto\Psi_t(\|w\|)$ and $w\mapsto\Phi_t(\|w\|)$, write $\Delta_t=\Delta_t^\Psi+\Delta_t^\Phi$ with $\Delta_t^\Psi(w)=\Psi_{t+1}(\|w\|)-\Psi_t(\|w\|)$ and $\Delta_t^\Phi(w)=\Phi_{t+1}(\|w\|)-\Phi_t(\|w\|)\ge0$, and drop the term $-\Delta_t^\Phi(w_{t+1})\le0$. In the second bracket, observe that $\Phi_t(\|w\|)=\frac{\kappa}{2\rho_t}\|w\|^2$ is $\frac{\kappa}{\rho_t}$-strongly convex, so $D_{\Phi_t}(w_{t+1}|w_t)\ge\frac{\kappa}{2\rho_t}\|w_{t+1}-w_t\|^2$ and an application of the Fenchel–Young inequality yields

$$2L_t\|w_t\|\|w_t-w_{t+1}\|-D_{\Phi_t}(w_{t+1}|w_t)-\phi_t(w_t)\le\frac{2\rho_tL_t^2\|w_t\|^2}{\kappa}-\phi_t(w_t)=\frac{2L_t^2\|w_t\|^2}{\kappa\sqrt{L_{\max}^2+L_{1:t-1}^2}}-\frac{L_t^2}{2\sqrt{L_{1:t}^2}}\|w_t\|^2\le0$$

for $\kappa\ge4$, using $L_{\max}^2+L_{1:t-1}^2\ge L_{1:t}^2$. Hence, $\delta_t\le G_t\|w_t-w_{t+1}\|-D_{\Psi_t}(w_{t+1}|w_t)-\Delta_t^\Psi(w_{t+1})$, which we will bound by showing that $\Delta_t^\Psi(w)\le\eta_t(\|w\|)G_t^2$ for some suitable $\frac{1}{G_{\max}}$-Lipschitz convex function $\eta_t$, and then invoking Lemma D.3. To this end, write $\Delta_t^\Psi(w)=\int_0^{\|w\|}\Psi_{t+1}'(x)-\Psi_t'(x)\,dx$, where

$$\Psi_{t+1}'(x)-\Psi_t'(x)=k\min_{\eta\le\frac{1}{G_{\max}}}\bigg[\frac{F_{t+1}(x)}{\eta}+\eta V_{t+1}\bigg]-k\min_{\eta\le\frac{1}{G_{\max}}}\bigg[\frac{F_t(x)}{\eta}+\eta V_t\bigg].$$

Using the fact that for any $\eta\le1/G_{\max}$ we have $\frac{F_t(x)}{\eta}+\eta V_{t+1}=\frac{F_t(x)}{\eta}+\eta V_t+\eta G_t^2$, evaluating the first minimum at the minimizer of the round-$t$ problem, and the observation that $\frac{1}{\sqrt{V_t}}=\frac{1}{\sqrt{V_{t+1}}}\sqrt{\frac{V_{t+1}}{V_t}}=\frac{1}{\sqrt{V_{t+1}}}\sqrt{1+\frac{G_t^2}{V_t}}\le\frac{\sqrt2}{\sqrt{V_{t+1}}}$ for $V_t\ge G_t^2$, a short computation (recalling $k\ge3$) shows

$$\Psi_{t+1}'(x)-\Psi_t'(x)\le G_t^2\min\bigg\{\sqrt{\frac{F_t(x)}{V_t}},\frac{1}{G_{\max}}\bigg\}.$$
Proof. We can assume without loss of generality that $0\in W$, since we could otherwise just perform a coordinate translation. Hence $w_1 = \arg\min_{w\in W}\psi_1(w) = 0$, and it is easily seen that for any $w\in W$ we will have $D_{\psi_t}(w|w_1) = D_{\psi_t}(w|0) = \psi_t(w)$. First apply Lemma A.2 with $M_t(w) = w$ and $\phi_t(w) = \frac{L_t^2}{2\sqrt{L_{1:t}^2}}\|w\|^2$ to get
$$\sum_{t=1}^T\langle g_t, w_t-u\rangle \le D_{\psi_{T+1}}(u|w_1) + \phi_{1:T}(u) + \sum_{t=1}^T\underbrace{\langle g_t+\nabla\phi_t(w_t), w_t-w_{t+1}\rangle - D_{\psi_t}(w_{t+1}|w_t) - \Delta_t(w_{t+1}) - \phi_t(w_t)}_{=:\delta_t} = \psi_{T+1}(u) + \phi_{1:T}(u) + \delta_{1:T},$$
where $\Delta_t(w) = \psi_{t+1}(w)-\psi_t(w)$.

Let us first bound the leading term $\psi_{T+1}(u)$. For brevity, denote $F_t(x) = \log(x/\alpha_t+1)$ and let $\Psi_t(\|w\|) = k\int_0^{\|w\|}\min_{\eta\le 1/G_{\max}}\big[F_t(x)/\eta + \eta V_t\big]dx$ and $\Phi_t(\|w\|) = \frac{\kappa}{2\rho_t}\|w\|^2$, so that $\psi_t(w) = \Psi_t(\|w\|) + \Phi_t(\|w\|)$. Then
$$\psi_{T+1}(u) \le \|u\|\,\Psi'_{T+1}(\|u\|) + \frac{\kappa}{2}\|u\|^2\sqrt{L_{\max}^2+L_{1:T}^2},$$
and solving the minimization over $\eta$ explicitly,
$$\Psi'_{T+1}(\|u\|) = \begin{cases} 2k\sqrt{V_{T+1}F_{T+1}(\|u\|)} & \text{if } G_{\max}\sqrt{F_{T+1}(\|u\|)}\le\sqrt{V_{T+1}},\\[1mm] kG_{\max}F_{T+1}(\|u\|) + \frac{kV_{T+1}}{G_{\max}} & \text{otherwise,}\end{cases} \;\le\; 2k\max\Big\{\sqrt{V_{T+1}F_{T+1}(\|u\|)},\; G_{\max}F_{T+1}(\|u\|)\Big\},$$
where the inequality observes that $V_{T+1}/G_{\max}\le G_{\max}F_{T+1}(\|u\|)$ whenever the constraint $\eta\le 1/G_{\max}$ binds. Next, using Lemma D.1 we have
$$\phi_{1:T}(u) = \sum_{t=1}^T\frac{L_t^2}{2\sqrt{L_{1:t}^2}}\|u\|^2 \le \|u\|^2\sqrt{L_{1:T}^2},$$
so overall
$$\sum_{t=1}^T\langle g_t, w_t-u\rangle \le 2k\|u\|\max\Big\{\sqrt{V_{T+1}F_{T+1}(\|u\|)}, G_{\max}F_{T+1}(\|u\|)\Big\} + \frac{\kappa}{2}\|u\|^2\sqrt{L_{\max}^2+L_{1:T}^2} + \|u\|^2\sqrt{L_{1:T}^2} + \delta_{1:T}. \tag{7}$$

We conclude by bounding the stability terms $\delta_{1:T}$. Using $\|g_t\|\le G_t+L_t\|w_t\|$ and $\|\nabla\phi_t(w_t)\|\le L_t\|w_t\|$, and dropping $\Phi_{t+1}(\|w_{t+1}\|)-\Phi_t(\|w_{t+1}\|)\ge 0$, we first separate $\delta_t$ into terms related to the $G_t$'s and terms related to the $L_t$'s:
$$\delta_t \le \underbrace{G_t\|w_t-w_{t+1}\| - D_{\Psi_t}(w_{t+1}|w_t) - \Delta^\Psi_t(w_{t+1})}_{\text{$G$-part}} \;+\; \underbrace{2L_t\|w_t\|\|w_t-w_{t+1}\| - D_{\Phi_t}(w_{t+1}|w_t) - \phi_t(w_t)}_{\text{$L$-part}},$$
where $\Delta^\Psi_t(w) = \Psi_{t+1}(\|w\|)-\Psi_t(\|w\|)$ and we slightly abuse notation, using $D_{\Psi_t}$ and $D_{\Phi_t}$ to denote the Bregman divergences w.r.t. the functions $w\mapsto\Psi_t(\|w\|)$ and $w\mapsto\Phi_t(\|w\|)$. For the $L$-part, observe that $\Phi_t(\|w\|) = \frac{\kappa}{2\rho_t}\|w\|^2$ is $\frac{\kappa}{\rho_t}$-strongly convex, so $D_{\Phi_t}(w_{t+1}|w_t)\ge\frac{\kappa}{2\rho_t}\|w_{t+1}-w_t\|^2$, and an application of the Fenchel-Young inequality yields
$$2L_t\|w_t\|\|w_t-w_{t+1}\| - \frac{\kappa}{2\rho_t}\|w_{t+1}-w_t\|^2 - \phi_t(w_t) \le \frac{2\rho_tL_t^2\|w_t\|^2}{\kappa} - \frac{L_t^2\|w_t\|^2}{2\sqrt{L_{1:t}^2}} = L_t^2\|w_t\|^2\Big(\frac{2}{\kappa\sqrt{L_{\max}^2+L_{1:t-1}^2}} - \frac{1}{2\sqrt{L_{1:t}^2}}\Big) \le 0$$
for $\kappa\ge 4$, since $L_{\max}^2+L_{1:t-1}^2\ge L_{1:t}^2$. Hence $\delta_t \le G_t\|w_t-w_{t+1}\| - D_{\Psi_t}(w_{t+1}|w_t) - \Delta^\Psi_t(w_{t+1})$, which we will bound by showing that $\Delta^\Psi_t(w)\le\eta_t(\|w\|)G_t^2$ for a suitable $\frac{1}{G_{\max}}$-Lipschitz convex function $\eta_t$ and then invoking Lemma D.3. Writing $\Delta^\Psi_t(w) = \int_0^{\|w\|}\Psi'_{t+1}(x)-\Psi'_t(x)\,dx$, a short computation comparing the two minimizations defining $\Psi'_{t+1}$ and $\Psi'_t$ (using $V_{t+1} = V_t+G_t^2$ so that for any $\eta\le1/G_{\max}$ the $V_{t+1}$-objective exceeds the $V_t$-objective by at most $\eta G_t^2$, using $\frac{1}{V_t}\le\frac{2}{V_{t+1}}$ for $V_t\ge G_t^2$, and recalling $k\ge 3$) shows
$$\Psi'_{t+1}(x) - \Psi'_t(x) \le G_t^2\min\Big\{\frac{1}{\sqrt{V_t}},\frac{1}{G_{\max}}\Big\}.$$
Defining $\eta_t(\|w\|) = \int_0^{\|w\|}\min\big\{\frac{1}{\sqrt{V_t}},\frac{1}{G_{\max}}\big\}dx$, we then immediately have $\Delta^\Psi_t(w) \le \eta_t(\|w\|)G_t^2$, and hence
$$\delta_t \le G_t\|w_t-w_{t+1}\| - D_{\Psi_t}(w_{t+1}|w_t) - \eta_t(\|w_{t+1}\|)G_t^2. \tag{8}$$
Finally, we conclude by showing that $\psi_t$ satisfies the assumptions of Lemma D.3 w.r.t. this function $\eta_t$. For any $x>0$,
$$\Psi'_t(x) = \begin{cases}2k\sqrt{V_tF_t(x)} & \text{if } G_{\max}\sqrt{F_t(x)}\le\sqrt{V_t}\\ kG_{\max}F_t(x)+\frac{kV_t}{G_{\max}} & \text{otherwise,}\end{cases} \quad \Psi''_t(x) = \begin{cases}\frac{k\sqrt{V_t}}{(x+\alpha_t)\sqrt{F_t(x)}} & \text{if } G_{\max}\sqrt{F_t(x)}\le\sqrt{V_t}\\ \frac{kG_{\max}}{x+\alpha_t} & \text{otherwise,}\end{cases} \quad \Psi'''_t(x) = \begin{cases}-\frac{k\sqrt{V_t}(1+2F_t(x))}{2(x+\alpha_t)^2F_t(x)^{3/2}} & \text{if } G_{\max}\sqrt{F_t(x)}\le\sqrt{V_t}\\ -\frac{kG_{\max}}{(x+\alpha_t)^2} & \text{otherwise.}\end{cases}$$
Clearly $\Psi_t(x)\ge0$, $\Psi'_t(x)\ge0$, $\Psi''_t(x)\ge0$, and $\Psi'''_t(x)\le0$ for all $x>0$. Moreover, for any $x\ge\alpha_t(e-1)=:\bar x_t$ we have $F_t(x)\ge1$ and hence
$$\frac{|\Psi'''_t(x)|}{\Psi''_t(x)^2} = \begin{cases}\frac{1+2F_t(x)}{2k\sqrt{V_tF_t(x)}} & \text{if } G_{\max}\sqrt{F_t(x)}\le\sqrt{V_t}\\ \frac{1}{kG_{\max}} & \text{otherwise,}\end{cases}$$
and both cases are bounded by $\frac{1}{2}\min\big\{\frac{1}{\sqrt{V_t}},\frac{1}{G_{\max}}\big\} = \frac{\eta'_t(x)}{2}$ for $k\ge3$ (recalling that the first case can only occur when $G_{\max}\sqrt{F_t(x)}\le\sqrt{V_t}$). Further, $\eta_t(x)$ is convex and $\eta'_t(x)\le\frac{1}{G_{\max}}$, hence $\frac{1}{G_{\max}}$-Lipschitz. Thus $\Psi_t$ satisfies the conditions of Lemma D.3 with this $\eta_t$ and $\bar x_t = \alpha_t(e-1)$, so summing Equation (8) over all $t$,
$$\delta_{1:T} \le \sum_{t=1}^T\frac{2G_t^2}{\Psi''_t(\bar x_t)} = \sum_{t=1}^T\frac{2G_t^2(\bar x_t+\alpha_t)}{k\sqrt{V_t}} = \sum_{t=1}^T\frac{2e\alpha_tG_t^2}{k\sqrt{V_t}} \le \sum_{t=1}^T\frac{2\alpha_tG_t^2}{\sqrt{V_t}},$$
where the last step bounds $e/k\le e/3\le1$ for $k\ge3$. Next, substitute $\alpha_t = \epsilon G_{\max}/(\sqrt{V_t}\log^2(V_t/G_{\max}^2))$ to bound
$$\delta_{1:T} \le 2\epsilon G_{\max}\sum_{t=1}^T\frac{G_t^2}{V_t\log^2(V_t/G_{\max}^2)} \le 2\epsilon G_{\max}\sum_{t=1}^T\frac{G_t^2}{\big((c-1)G_{\max}^2+G_{1:t}^2\big)\log^2\big(\frac{(c-1)G_{\max}^2+G_{1:t}^2}{G_{\max}^2}\big)} \le 2\epsilon G_{\max}\int_{(c-1)G_{\max}^2}^{(c-1)G_{\max}^2+G_{1:T}^2}\frac{dx}{x\log^2(x/G_{\max}^2)} = 2\epsilon G_{\max}\Big[-\frac{1}{\log(x/G_{\max}^2)}\Big]_{(c-1)G_{\max}^2}^{(c-1)G_{\max}^2+G_{1:T}^2} \le \frac{2\epsilon G_{\max}}{\log(c-1)} \le 2\epsilon G_{\max}$$
for $c\ge4$. Finally, plugging this back into Equation (7) yields
$$\sum_{t=1}^T\langle g_t,w_t-u\rangle \le 2\epsilon G_{\max} + \kappa\|u\|^2\sqrt{L_{\max}^2+L_{1:T}^2} + 2k\|u\|\max\Big\{\sqrt{V_{T+1}F_{T+1}(\|u\|)},\,G_{\max}F_{T+1}(\|u\|)\Big\},$$
where we collected the quadratic terms using $\frac{\kappa}{2}+1\le\kappa$ for $\kappa\ge4$. □
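The integral-comparison step at the end of this proof is easy to sanity-check numerically (all values below are our own illustrative choices):

```python
import numpy as np

# Check: with V_t = (c-1)*Gmax^2 + G^2_{1:t}, the sum
#   sum_t G_t^2 / (V_t * log^2(V_t / Gmax^2))
# is bounded by 1/log(c-1) <= 1 for c >= 4, which yields delta_{1:T} <= 2*eps*Gmax.
rng = np.random.default_rng(0)
Gmax, c = 1.0, 4.0
G = rng.uniform(0, Gmax, size=10_000)
V = (c - 1) * Gmax**2 + np.cumsum(G**2)
lhs = np.sum(G**2 / (V * np.log(V / Gmax**2) ** 2))
rhs = 1.0 / np.log(c - 1)
print(lhs, "<=", rhs, lhs <= rhs)  # expected: True
```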
B.2. Proof of Theorem 2.3

Theorem 2.3. Let $\mathcal A$ be an algorithm defined over $\mathbb R^2$ and let $w_t$ denote the output of $\mathcal A$ on round $t$. Let $\epsilon>0$ and suppose $\mathcal A$ guarantees $R_T(0)\le\epsilon$ against any quadratically bounded sequence $\{g_t\}$. Then for any $T\ge1$, $G>0$ and $L\ge0$ there exists a sequence $g_1,\dots,g_T$ satisfying $\|g_t\|\le G+L\|w_t\|$ and a comparator $u\in\mathbb R^2$ such that $R_T(u)\ge\Omega\big(\max\{G\|u\|\sqrt T,\,L\|u\|^2\sqrt T\}\big)$.

Proof. Let $w_t\in\mathbb R^2$ be the output of algorithm $\mathcal A$ at time $t$. We consider sequences $g_1,\dots,g_T$ with $\|g_t\|\le G+L\|w_t\|$, and define the randomized sequence $\tilde g_t = \big(G\varepsilon_t,\,-L\|w_t\|\big)$, where the $\varepsilon_t$ are independent random signs. Consider the worst-case regret against a comparator constrained to an $\ell_\infty$ ball of radius $U$, lower bounded by the expected regret under the random signs:
$$\sup_{g_1,\dots,g_T}R_T \ge \mathbb E_{\varepsilon_1,\dots,\varepsilon_T}\Big[\sum_{t=1}^T\langle\tilde g_t,w_t\rangle - \min_{u:\|u\|_\infty\le U}\sum_{t=1}^T\langle\tilde g_t,u\rangle\Big] = \mathbb E_\varepsilon\Big[GU\Big|\sum_{t=1}^T\varepsilon_t\Big|\Big] + UL\,\mathbb E_\varepsilon\Big[\sum_{t=1}^T\|w_t\|\Big] + \mathbb E_\varepsilon\Big[\sum_{t=1}^T\langle\tilde g_t,w_t\rangle\Big],$$
since the minimizing $u$ sets $u_1 = -U\,\mathrm{sign}(\sum_t\varepsilon_t)$ and $u_2 = U$, and $\mathbb E[\varepsilon_tw_{t,1}] = 0$ because $w_t$ may depend only on $\varepsilon_1,\dots,\varepsilon_{t-1}$. Applying $(a)$ the Khintchine inequality to the first term and $(b)$ the Cauchy-Schwarz inequality to control the remaining terms, and then choosing $U$ appropriately, this yields a sequence of $g_t$ which incurs at least $\Omega\big(L\|u\|^2\sqrt T\big)$ regret, where the comparator satisfies $\|u\|^2 = u_1^2+u_2^2\le 2U^2$. Moreover, for any algorithm which guarantees $R_T(0)\le\epsilon$, there exists a sequence $g_1,\dots,g_T$ with $\|g_t\|\le G$ for all $t$ such that for any $T$ and $u$, $R_T(u)\ge\frac{G\|u\|\sqrt T}{3} - 2\epsilon$ (McMahan & Streeter, 2012, Theorem 8). Thus, taking the worst of these two sequences yields
$$\sup_{g_1,\dots,g_T}R_T \ge \max\Big\{\frac{G\|u\|\sqrt T}{3}-2\epsilon,\ \Omega\big(L\|u\|^2\sqrt T\big)\Big\}. \;\square$$

C. Proofs for Section 4 (Dynamic Regret)

The main objective of this section is to prove Theorems 4.1 and 4.3. At a high level, the strategy is simple: we run several instances of projected gradient descent, each with a different restricted domain $W_D = \{w\in W:\|w\|\le D\}$ and stepsize $\eta$, and then use a particular experts algorithm to combine them. We first assemble a collection of core lemmas that provide the regret of the base algorithm (Lemma C.1), the regret of Algorithm 3 in terms of the regret of any of the base algorithms (Lemma C.2), as well as some utility lemmas (Lemmas C.3 to C.6) to help tame some unwieldy algebraic expressions and case work. We then prove the main results Theorems 4.1 and 4.3 in Appendices C.1 and C.2 respectively. Finally, we prove our lower bound Theorem 4.2 in Appendix C.3.

The base algorithms that we combine are instances of (projected) online gradient descent with an additional bias term added to the update. The following lemma provides the regret template for this algorithm.

Lemma C.1. For all $t$ let $\ell_t: W\to\mathbb R$ be convex. Let $K\ge1$, $L_t\ge0$, and $K\eta L_t\le1$ for all $t$. Let $W_D = \{w\in W:\|w\|\le D\}$, $w_1 = 0$, and on each round update $w_{t+1} = \Pi_{W_D}\big(w_t-\eta(1+K\eta L_t)g_t\big)$, where $g_t\in\partial\ell_t(w_t)$. Then for any $u = (u_1,\dots,u_T)$ in $W_D$,
$$R_T(u) \le \frac{\|u_T\|^2+2DP_T}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w_t)\big] + 2\eta\sum_{t=1}^T\|g_t\|^2,$$
where $P_T = \sum_{t=2}^T\|u_t-u_{t-1}\|$.

Proof. The result follows easily using existing analyses. For instance, the update can be seen as an instance of Algorithm 4 with $\psi_t(w) = \frac{1}{2\eta}\|w\|^2$, $\varphi_t(w) = K\eta L_t\langle g_t,w\rangle$ for $g_t\in\partial\ell_t(w_t)$, domain $W_D$, and $M_t(w) = w$ for all $t$. Letting $w_1 = 0$ and applying Lemma A.2:
$$R_T(u) \le \psi_{T+1}(u_T) + \sum_{t=2}^T\langle\nabla\psi_t(w_t),u_{t-1}-u_t\rangle + K\eta\sum_{t=1}^TL_t\ell_t(u_t) + \sum_{t=1}^T\big[\langle g_t+K\eta L_tg_t,\,w_t-w_{t+1}\rangle - D_{\psi_t}(w_{t+1}|w_t)\big] - K\eta\sum_{t=1}^TL_t\ell_t(w_t)$$
$$\overset{(a)}{\le} \frac{\|u_T\|^2+2DP_T}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w_t)\big] + \sum_{t=1}^T\Big[(1+K\eta L_t)\|g_t\|\|w_t-w_{t+1}\| - \frac{1}{2\eta}\|w_{t+1}-w_t\|^2\Big]$$
$$\overset{(b)}{\le} \frac{\|u_T\|^2+2DP_T}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w_t)\big] + \frac{\eta}{2}\sum_{t=1}^T(1+K\eta L_t)^2\|g_t\|^2 \overset{(c)}{\le} \frac{\|u_T\|^2+2DP_T}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w_t)\big] + 2\eta\sum_{t=1}^T\|g_t\|^2,$$
where $(a)$ observes $\|\nabla\psi_t(w_t)\| = \|w_t\|/\eta\le D/\eta$ and $D_{\psi_t}(w_{t+1}|w_t)\ge\frac{1}{2\eta}\|w_{t+1}-w_t\|^2$ by $\frac1\eta$-strong convexity of $\psi$, $(b)$ is the Fenchel-Young inequality, and $(c)$ uses $K\eta L_t\le1$. □
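For concreteness, the base update analyzed in Lemma C.1 can be sketched as follows; the interface (a callable returning $(g_t, L_t)$) is our own illustrative choice, not the paper's pseudocode.

```python
import numpy as np

def biased_pgd(oracle, D, eta, K=8.0, dim=2, T=100):
    """Sketch of the Lemma C.1 base learner: projected gradient descent on
    W_D = {w : ||w|| <= D} with bias factor (1 + K*eta*L_t) on the gradient.
    `oracle(t, w) -> (g_t, L_t)`; the lemma requires K*eta*L_t <= 1."""
    w = np.zeros(dim)
    iterates = []
    for t in range(T):
        iterates.append(w.copy())
        g, L = oracle(t, w)
        w = w - eta * (1.0 + K * eta * L) * np.asarray(g)
        nrm = np.linalg.norm(w)
        if nrm > D:
            w *= D / nrm        # Euclidean projection onto the ball W_D
    return np.array(iterates)
```

For example, `biased_pgd(lambda t, w: (w - 1.0, 1.0), D=1.0, eta=0.01, dim=1)` runs the update on the illustrative smooth loss $\ell_t(w) = \frac12\|w-1\|^2$.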
The following lemma provides a generic regret bound for Algorithm 3. The take-away is that the regret will scale with the regret of any of the experts up to two extra terms, $C_S$ and $\Lambda_T(\eta,D)$, which we will later ensure are small.

Lemma C.2. For any $\tau = (\eta,D)\in S$ with $\eta\le\frac{1}{KL_{\max}}$ and sequence $u = (u_1,\dots,u_T)$ in $W$ satisfying $\|u_t\|\le D$ for all $t$, Algorithm 3 guarantees
$$R_T(u) \le 2kC_S + 2kDG_{\max}\Lambda_T(\tau) + \frac{\|u_T\|^2+2DP_T+4kD^2\Lambda_T(\tau)}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big] + 4\eta\sum_{t=1}^T\big\|g^{(\tau)}_t\big\|^2,$$
where $k = \frac92$, $C_S\overset{\text{def}}{=}\frac{\sum_{\tilde\tau\in S}\mu_{\tilde\tau}}{\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}}$, and $\Lambda_T(\tau)\overset{\text{def}}{=}\log\Big(\frac{\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}}{\mu^2_\tau}\Big)+1$.

Proof. Let $\tau = (\eta,D)\in S$ and let $\mathcal A_\tau$ denote an algorithm playing $w^{(\tau)}_{t+1} = \Pi_{\{w\in W:\|w\|\le D\}}\big(w^{(\tau)}_t-\eta(1+K\eta L_t)g^{(\tau)}_t\big)$ for $g^{(\tau)}_t\in\partial\ell_t(w^{(\tau)}_t)$. Algorithm 3 is constructed as a collection of algorithms $\mathcal A_\tau$, with a multi-scale experts algorithm (Algorithm 5) used to combine their predictions. First, observe that the regret decomposes into the regret of any expert $\mathcal A_\tau$ plus the regret of the experts algorithm relative to expert $\mathcal A_\tau$:
$$\sum_{t=1}^T\ell_t(w_t)-\ell_t(u_t) = \underbrace{\sum_{t=1}^T\ell_t\big(w^{(\tau)}_t\big)-\ell_t(u_t)}_{=:R^{\mathcal A_\tau}_T(u)} + \sum_{t=1}^T\ell_t(w_t)-\ell_t\big(w^{(\tau)}_t\big),$$
and by convexity of $\ell_t$ and Jensen's inequality,
$$\sum_{t=1}^T\ell_t(w_t)-\ell_t\big(w^{(\tau)}_t\big) \le \sum_{t=1}^T\sum_{\tilde\tau\in S}\ell_t\big(w^{(\tilde\tau)}_t\big)\big[p_t(\tilde\tau)-\mathbb 1\{\tau=\tilde\tau\}\big] \overset{(a)}{=} \sum_{t=1}^T\sum_{\tilde\tau\in S}\big[\ell_t\big(w^{(\tilde\tau)}_t\big)-\ell_t(\tilde w_t)\big]\big[p_t(\tilde\tau)-p^\tau(\tilde\tau)\big] + \sum_{t=1}^T\sum_{\tilde\tau\in S}\ell_t(\tilde w_t)\big(p_t(\tilde\tau)-p^\tau(\tilde\tau)\big) \overset{(b)}{=} \sum_{t=1}^T\sum_{\tilde\tau\in S}\big[\ell_t\big(w^{(\tilde\tau)}_t\big)-\ell_t(\tilde w_t)\big]\big[p_t(\tilde\tau)-p^\tau(\tilde\tau)\big] \overset{(c)}{=} \underbrace{\sum_{t=1}^T\big\langle\tilde\ell_t,\,p_t-p^\tau\big\rangle}_{=:R^{\text{Meta}}_T(p^\tau)}, \tag{9}$$
where $\tilde w_t$ is an arbitrary reference point with $\|\tilde w_t\|\le D_{\min}$ (and hence is in the domain of all of the experts $\mathcal A_{\tilde\tau}$), $(a)$ defines $p^\tau(\tilde\tau) = \mathbb 1\{\tilde\tau=\tau\}$, $(b)$ observes that $\sum_{\tilde\tau}\ell_t(\tilde w_t)(p_t(\tilde\tau)-p^\tau(\tilde\tau)) = \ell_t(\tilde w_t)\big[\sum_{\tilde\tau}p_t(\tilde\tau)-\sum_{\tilde\tau}p^\tau(\tilde\tau)\big] = 0$, and $(c)$ defines $\tilde\ell_t\in\mathbb R^{|S|}$ with $\tilde\ell_{t,\tilde\tau} = \ell_t(w^{(\tilde\tau)}_t)-\ell_t(\tilde w_t)$.

Now for any $\tau = (\eta,D)\in S$ with $D\ge\max_t\|u_t\|$, we have via Lemma C.1 that
$$R^{\mathcal A_\tau}_T(u) \le \frac{\|u_T\|^2+2DP_T}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big] + 2\eta\sum_{t=1}^T\big\|g^{(\tau)}_t\big\|^2. \tag{10}$$
To bound $R^{\text{Meta}}_T(p^\tau)$, observe that for any $\tilde\tau = (\tilde\eta,\tilde D)$ we have
$$|\tilde\ell_{t,\tilde\tau}| = \big|\ell_t\big(w^{(\tilde\tau)}_t\big)-\ell_t(\tilde w_t)\big| \le \big\|w^{(\tilde\tau)}_t-\tilde w_t\big\|\big(G_{\max}+L_{\max}\tilde D\big) \le 2\tilde D\big(G_{\max}+L_{\max}\tilde D\big),$$
and so with $\mu_{\tilde\tau} = \frac{1}{2\tilde D(G_{\max}+\tilde D/\tilde\eta)}$ and $\tilde\eta\le\frac{1}{KL_{\max}}\le\frac{1}{L_{\max}}$ we have
$$\mu_{\tilde\tau}|\tilde\ell_{t,\tilde\tau}| \le \frac{2\tilde D(G_{\max}+L_{\max}\tilde D)}{2\tilde D(G_{\max}+\tilde D/\tilde\eta)} \le 1,$$
so these choices meet the assumptions of Theorem A.3 and we have
$$R^{\text{Meta}}_T(p^\tau) \le \sum_{\tilde\tau\in S}p^\tau(\tilde\tau)\Big[\frac{k\big[\log(p^\tau(\tilde\tau)/p_{1\tilde\tau})+1\big]}{\mu_{\tilde\tau}} + \mu_{\tilde\tau}\sum_{t=1}^T\tilde\ell^2_{t\tilde\tau}\Big] + 2k\sum_{\tilde\tau\in S}\frac{p_{1\tilde\tau}}{\mu_{\tilde\tau}}$$
for $k = \frac92$. Recalling that $p^\tau(\tilde\tau) = 1$ when $\tilde\tau = \tau$ and $0$ otherwise, the first sum is bounded as
$$\frac{k[\log(1/p_{1\tau})+1]}{\mu_\tau} + \mu_\tau\sum_{t=1}^T\tilde\ell^2_{t,\tau} = 2kD\Big(G_{\max}+\frac D\eta\Big)\big[\log(1/p_{1\tau})+1\big] + \frac{\eta\sum_{t=1}^T\tilde\ell^2_{t,\tau}}{2D(G_{\max}\eta+D)} \le 2kD\Big(G_{\max}+\frac D\eta\Big)\big[\log(1/p_{1\tau})+1\big] + 2\eta\sum_{t=1}^T\big\|g^{(\tau)}_t\big\|^2,$$
using $\tilde\ell^2_{t,\tau}\le4D^2\|g^{(\tau)}_t\|^2$, and so with $p_{1,\tau} = \mu^2_\tau/\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}$ we have
$$R^{\text{Meta}}_T(p^\tau) \le 2kD\Big(G_{\max}+\frac D\eta\Big)\Lambda_T(\tau) + 2\eta\sum_{t=1}^T\big\|g^{(\tau)}_t\big\|^2 + 2k\sum_{\tilde\tau\in S}\frac{\mu_{\tilde\tau}}{\sum_{\tilde\tau'\in S}\mu^2_{\tilde\tau'}}.$$
Combining this with Equations (9) and (10) yields the stated result:
$$R_T(u) \le 2kC_S + 2kDG_{\max}\Lambda_T(\tau) + \frac{\|u_T\|^2+2DP_T+4kD^2\Lambda_T(\tau)}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big] + 4\eta\sum_{t=1}^T\big\|g^{(\tau)}_t\big\|^2,$$
where the last line defines the shorthand notations $C_S\overset{\text{def}}{=}\frac{\sum_{\tilde\tau\in S}\mu_{\tilde\tau}}{\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}}$ and $\Lambda_T(\tau)\overset{\text{def}}{=}\log\Big(\frac{\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}}{\mu^2_\tau}\Big)+1$. □
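For concreteness, the hyperparameter grid and prior weights that instantiate this lemma (and which Lemma C.3 below analyzes) can be sketched as follows; the doubling ranges and all names are our own illustrative choices.

```python
import numpy as np
from itertools import product

def make_grid(eta_min, eta_max, D_min, D_max, Gmax):
    """Sketch of the doubling grid S = S_eta x S_D with multi-scale weights
    mu_tau = 1/(2D(Gmax + D/eta)) and prior p1(tau) = mu_tau^2 / sum mu^2."""
    etas = [min(eta_min * 2**i, eta_max)
            for i in range(int(np.ceil(np.log2(eta_max / eta_min))) + 1)]
    Ds = [min(D_min * 2**j, D_max)
          for j in range(int(np.ceil(np.log2(D_max / D_min))) + 1)]
    S = list(product(etas, Ds))
    mu = np.array([1.0 / (2 * D * (Gmax + D / eta)) for eta, D in S])
    p1 = mu**2 / np.sum(mu**2)      # the prior used by the meta-algorithm
    return S, mu, p1
```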
Next, we provide bounds on the terms $C_S$ and $\Lambda_T$ in terms of the hyperparameter ranges $[\eta_{\min},\eta_{\max}]$ and $[D_{\min},D_{\max}]$ that the meta-algorithm tunes the hyperparameters over.

Lemma C.3. Let $0<\eta_{\min}\le\eta_{\max}$, $0<D_{\min}\le D_{\max}$, and define the hyperparameter set $S = S_\eta\times S_D$ for $S_\eta = \{\eta_i = \eta_{\min}2^i\wedge\eta_{\max} : i\ge0\}$ and $S_D = \{D_j = D_{\min}2^j\wedge D_{\max} : j\ge0\}$. For each $\tau = (\eta,D)\in S$, let $\mu_\tau = \frac{1}{2D(G_{\max}+D/\eta)}$. Then
$$C_S \le 2\sqrt TD_{\min}\Big(G_{\max}+\frac{D_{\min}}{\eta_{\max}}\Big),$$
and for any $\tau = (\eta,D)\in S$,
$$\Lambda_T(\tau) \overset{\text{def}}{=} \log\Big(\frac{\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}}{\mu^2_\tau}\Big)+1 \le \log\Big(\max\Big\{\frac{24\eta^2_{\max}D^4}{\eta^2_{\min}D^4_{\min}},\ \frac{6|S_\eta|D^2}{D^2_{\min}}\Big\}\Big)+1.$$

Proof. For the first statement, we have
$$C_S = \frac{\sum_{\tilde\tau\in S}\mu_{\tilde\tau}}{\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}} \le \sqrt{\frac{|S|}{\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}}} \le \frac{\sqrt{|S|}}{\mu_{(\eta_{\max},D_{\min})}} = 2\sqrt{|S|}\,D_{\min}\Big(G_{\max}+\frac{D_{\min}}{\eta_{\max}}\Big) \le 2\sqrt TD_{\min}\Big(G_{\max}+\frac{D_{\min}}{\eta_{\max}}\Big),$$
where the first inequality applies the Cauchy-Schwarz inequality, the second keeps only the largest weight $\mu_{(\eta_{\max},D_{\min})} = \max_{\tilde\tau}\mu_{\tilde\tau}$, and the last uses $|S|\le T$ (the grids are logarithmic in size). Moreover, for any $\tau = (\eta,D)\in S$ we have
$$\frac{\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}}{\mu^2_\tau} = \frac{1}{\mu^2_{(\eta,D)}}\sum_{i,j}\frac{1}{(2D_j)^2[G_{\max}+D_j/\eta_i]^2}.$$
Bounding $G_{\max}+D_j/\eta_i\ge D_j/\eta_i$, the double sum is at most
$$\frac14\sum_{i=0}^{\lceil\log_2(\eta_{\max}/\eta_{\min})\rceil}\sum_{j=0}^{\lceil\log_2(D_{\max}/D_{\min})\rceil}\frac{2^{2i}\eta^2_{\min}}{2^{4j}D^4_{\min}} \le \frac14\cdot\frac{16\eta^2_{\max}}{3}\cdot\frac{16}{15}\cdot\frac{1}{D^4_{\min}} \le \frac{3\eta^2_{\max}}{2D^4_{\min}},$$
so that $\frac{\sum_{\tilde\tau}\mu^2_{\tilde\tau}}{\mu^2_\tau}\le\frac{3\eta^2_{\max}}{2\mu^2_{(\eta,D)}D^4_{\min}}$. At the same time, bounding $G_{\max}+D_j/\eta_i\ge G_{\max}$ instead,
$$\frac{\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}}{\mu^2_\tau} \le \frac{|S_\eta|}{4\mu^2_{(\eta,D)}G^2_{\max}}\sum_{j\ge0}\frac{1}{D^2_{\min}2^{2j}} \le \frac{|S_\eta|}{3\mu^2_{(\eta,D)}G^2_{\max}D^2_{\min}},$$
using $\sum_{j\ge0}2^{-2j}\le\frac43$. Hence
$$\Lambda_T(\eta,D)-1 \le \log\Big(\min\Big\{\frac{3\eta^2_{\max}}{2D^4_{\min}},\ \frac{|S_\eta|}{3G^2_{\max}D^2_{\min}}\Big\}\cdot(2D)^2\big[G_{\max}+D/\eta\big]^2\Big).$$
Now if $G_{\max}\le D/\eta$, then $[G_{\max}+D/\eta]^2\le4D^2/\eta^2\le4D^2/\eta^2_{\min}$, so
$$\Lambda_T(\eta,D)-1 \le \log\Big(\frac{3\eta^2_{\max}}{2D^4_{\min}}\cdot\frac{16D^4}{\eta^2_{\min}}\Big) = \log\Big(\frac{24\eta^2_{\max}D^4}{\eta^2_{\min}D^4_{\min}}\Big),$$
and otherwise $[G_{\max}+D/\eta]^2\le4G^2_{\max}$, so
$$\Lambda_T(\eta,D)-1 \le \log\Big(\frac{|S_\eta|}{3G^2_{\max}D^2_{\min}}\cdot16D^2G^2_{\max}\Big) \le \log\Big(\frac{6|S_\eta|D^2}{D^2_{\min}}\Big).$$
Thus, in either case we can bound $\Lambda_T(\eta,D)\le\log\big(\max\big\{\frac{24\eta^2_{\max}D^4}{\eta^2_{\min}D^4_{\min}},\frac{6|S_\eta|D^2}{D^2_{\min}}\big\}\big)+1$. □

Lemma C.4 provides a simple but tedious calculation which we will use a few times in the proof of Theorem 4.1.

Lemma C.4. Let $\ell_t$ be $(G_t,L_t)$-quadratically bounded, $c_1,c_2\ge0$, $u,w\in W$, and $g_t\in\partial\ell_t(w)$. Assume $\|w\|\le D$ and $\|u\|\le D$. Then
$$c_1L_t\big[\ell_t(u)-\ell_t(w)\big] + c_2\|g_t\|^2 \le 3(c_1+c_2)\big(G_t^2+L_t^2D^2\big).$$

Proof. Since $\ell_t$ is $(G_t,L_t)$-quadratically bounded and $g_t\in\partial\ell_t(w)$ with $\|w\|\le D$, we have $\|g_t\|^2\le(G_t+L_t\|w\|)^2\le2G_t^2+2L_t^2\|w\|^2\le2G_t^2+2L_t^2D^2$. Moreover, letting $\tilde g_t\in\partial\ell_t(u)$ and using $\|u\|\le D$,
$$L_t\big(\ell_t(u)-\ell_t(w)\big) \le L_t\|\tilde g_t\|\|u-w\| \le 2DL_t(G_t+L_tD) = 2DL_tG_t+2L_t^2D^2 \le G_t^2+L_t^2D^2+2L_t^2D^2 = G_t^2+3L_t^2D^2,$$
where the second-to-last step uses $2DL_tG_t\le G_t^2+L_t^2D^2$. Hence
$$c_1L_t\big(\ell_t(u)-\ell_t(w)\big) + c_2\|g_t\|^2 \le (c_1+2c_2)G_t^2 + (3c_1+2c_2)L_t^2D^2 \le 3(c_1+c_2)\big(G_t^2+L_t^2D^2\big). \;\square$$

Lastly, we provide two lemmas which let us assume that there is a $\tau = (\eta,D)\in S$ for which $\frac12D\le M = \max_t\|u_t\|\le D$, by showing that the regret is trivially well-controlled whenever $M$ is too big (Lemma C.5) or too small (Lemma C.6).

Lemma C.5. For all $t$ let $\ell_t$ be a $(G_t,L_t)$-quadratically bounded convex function for $G_t\in[0,G_{\max}]$ and $L_t\in[0,L_{\max}]$. Let $\varepsilon>0$, $D_{\max} = \varepsilon2^T$, and let $u = (u_1,\dots,u_T)$ be an arbitrary sequence in $W$ such that $M := \max_t\|u_t\|\ge D_{\max}$. Then for any $w_1,\dots,w_T$ with $\|w_t\|\le D_{\max}$,
$$\sum_{t=1}^T\ell_t(w_t)-\ell_t(u_t) \le 2\big(G_{\max}M+L_{\max}M^2\big)\log_2\Big(\frac M\varepsilon\Big).$$

Proof. Let $g_t\in\partial\ell_t(w_t)$ and observe that
$$\sum_{t=1}^T\ell_t(w_t)-\ell_t(u_t) \le \sum_{t=1}^T\|g_t\|\|w_t-u_t\| \le \sum_{t=1}^T\|g_t\|\big(D_{\max}+\|u_t\|\big) \le 2M\sum_{t=1}^T\big(G_{\max}+L_{\max}D_{\max}\big) \le 2M\big(G_{\max}+L_{\max}M\big)T \le 2\big(G_{\max}M+L_{\max}M^2\big)\log_2\Big(\frac M\varepsilon\Big),$$
where the last line uses $M\ge\varepsilon2^T$, i.e. $T\le\log_2(M/\varepsilon)$. □
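The estimate in Lemma C.4 is easy to sanity-check numerically. The sketch below uses the illustrative $(G,L)$-quadratically bounded loss $\ell(w) = G|w|+\frac L2w^2$, which is our own choice and not from the paper.

```python
import numpy as np

# Check Lemma C.4:  c1*L*(l(u)-l(w)) + c2*||g||^2 <= 3(c1+c2)(G^2 + L^2 D^2)
rng = np.random.default_rng(2)
G, L, D, c1, c2 = 1.0, 0.5, 2.0, 1.0, 4.0
loss = lambda w: G * np.abs(w) + 0.5 * L * w**2
grad = lambda w: G * np.sign(w) + L * w     # |grad(w)| <= G + L|w|
for _ in range(10_000):
    u, w = rng.uniform(-D, D, size=2)
    lhs = c1 * L * (loss(u) - loss(w)) + c2 * grad(w) ** 2
    assert lhs <= 3 * (c1 + c2) * (G**2 + L**2 * D**2) + 1e-9
print("Lemma C.4 bound holds on all samples")
```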
Lemma C.6. For all $t$ let $\ell_t$ be a $(G_t,L_t)$-quadratically bounded convex function for $G_t\in[0,G_{\max}]$ and $L_t\in[0,L_{\max}]$. Let $\epsilon>0$, $D_{\min} = \frac{\epsilon}{\sqrt T}$, $\eta_{\max} = \frac{1}{KL_{\max}}$, and $\eta_{\min} = \frac{\epsilon}{K(G_{\max}+\epsilon L_{\max})\sqrt T}$. Let $w_t\in W$ be the outputs of the algorithm characterized in Lemma C.2 with $\eta = \eta_{\min}$ and $D = D_{\min}$, and let $u = (u_1,\dots,u_T)$ be an arbitrary sequence in $W$ with $M = \max_t\|u_t\|\le D_{\min}$. Then
$$R_T(u) \le (G_{\max}+\epsilon L_{\max})\big[K(M+P_T)+\epsilon C_T\big], \qquad\text{where } C_T \le O\Big(\sqrt T\log\log\Big(\frac{G_{\max}}{\epsilon L_{\max}}\Big)\Big).$$

Proof. For $M\le D_{\min}$, we can apply Lemma C.2 with $\tau = (\eta_{\min},D_{\min})$ to get
$$R_T(u) \le 2kC_S + 2kD_{\min}G_{\max}\Lambda_T(\tau) + \frac{\|u_T\|^2+2D_{\min}P_T+4kD^2_{\min}\Lambda_T(\tau)}{2\eta_{\min}} + K\eta_{\min}\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big] + 4\eta_{\min}\sum_{t=1}^T\big\|g^{(\tau)}_t\big\|^2,$$
where $\mu_{\tilde\tau} = \frac{1}{2\tilde D(G_{\max}+\tilde D/\tilde\eta)}$ for any $\tilde\tau = (\tilde\eta,\tilde D)\in S$, $k = \frac92$, $C_S = \frac{\sum_{\tilde\tau}\mu_{\tilde\tau}}{\sum_{\tilde\tau}\mu^2_{\tilde\tau}}$, and $\Lambda_T(\tau) = \Lambda_T(\eta_{\min},D_{\min})\overset{\text{def}}{=}\log\big(\sum_{\tilde\tau}\mu^2_{\tilde\tau}/\mu^2_{(\eta_{\min},D_{\min})}\big)+1$. Observe that with $M = \max_t\|u_t\|\le D_{\min}$ and $\frac{D_{\min}}{\eta_{\min}} = K(G_{\max}+\epsilon L_{\max})$, we have
$$\frac{\|u_T\|^2+2D_{\min}P_T+4kD^2_{\min}\Lambda_T(\tau)}{2\eta_{\min}} \le \frac{D_{\min}}{2\eta_{\min}}\big(\|u_T\|+2P_T+4kD_{\min}\Lambda_T(\tau)\big) \le \frac{K(G_{\max}+\epsilon L_{\max})}{2}\Big(\|u_T\|+2P_T+\frac{4k\epsilon\Lambda_T(\tau)}{\sqrt T}\Big).$$
Moreover, by Lemma C.4 we have
$$K\eta_{\min}\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big] + 4\eta_{\min}\sum_{t=1}^T\big\|g^{(\tau)}_t\big\|^2 \le \eta_{\min}\sum_{t=1}^T3(K+4)\big(G_t^2+L_t^2D^2_{\min}\big) \le 3(K+4)\eta_{\min}\big(G^2_{\max}+L^2_{\max}D^2_{\min}\big)T \le \frac{3(K+4)}{K}\Big(\epsilon G_{\max}\sqrt T+\frac{\epsilon^2L_{\max}}{\sqrt T}\Big).$$
Plugging the previous two displays back into the full regret bound, using $\|u_T\|\le M$, and bounding the remaining terms via Lemma C.3 as
$$2kC_S \le 4k\epsilon\Big(G_{\max}+\frac{K\epsilon L_{\max}}{\sqrt T}\Big), \qquad \Lambda_T(\eta_{\min},D_{\min}) \le \log(6|S_\eta|)+1 \le \log(|S_\eta|)+3 \le O\Big(\log\log\Big(\frac{G_{\max}}{\epsilon L_{\max}}\Big)\Big),$$
we obtain
$$R_T(u) \le K(G_{\max}+\epsilon L_{\max})[M+P_T] + \epsilon(G_{\max}+\epsilon L_{\max})C_T = (G_{\max}+\epsilon L_{\max})\big[K(M+P_T)+\epsilon C_T\big],$$
where $C_T\le O\big(\sqrt T\log\log\big(\frac{G_{\max}}{\epsilon L_{\max}}\big)\big)$. □

C.1. Proof of Theorem 4.1

Theorem 4.1. For all $t$ let $\ell_t: W\to\mathbb R$ be a $(G_t,L_t)$-quadratically bounded convex function with $G_t\in[0,G_{\max}]$ and $L_t\in[0,L_{\max}]$. Let $\epsilon>0$, $K\ge8$, $\beta_t = 1-\exp(-1/T)$ for all $t$, and for any $i,j\ge0$ let $D_j = \frac{\epsilon2^j}{\sqrt T}\wedge\frac{\epsilon2^T}{\sqrt T}$ and $\eta_i = \frac{\epsilon2^i}{K(G_{\max}+\epsilon L_{\max})T}\wedge\frac{1}{KL_{\max}}$, and let $S = \{(\eta_i,D_j): i,j\ge0\}$. For each $\tau = (\eta,D)\in S$ let $\mu_\tau = \frac{1}{2D(G_{\max}+D/\eta)}$, and set $p_1(\tau) = \mu^2_\tau/\sum_{\tilde\tau\in S}\mu^2_{\tilde\tau}$. Then for any $u = (u_1,\dots,u_T)$ in $W$, Algorithm 3 guarantees
$$R_T(u) \le O\Bigg(\big(G_{\max}+(M+\epsilon)L_{\max}\big)\big[(M+\epsilon)\Lambda^*_T+P_T\big] + \sqrt{\big(M^2\Lambda^*_T+MP_T\big)\sum_{t=1}^T\big(G_t^2+L_t^2M^2\big)}\Bigg),$$
where $P_T = \sum_{t=2}^T\|u_t-u_{t-1}\|$, $M = \max_t\|u_t\|$, and $\Lambda^*_T\le O\big(\log\big(\frac{MT\log(T)}{\epsilon}\big)+\log\log\big(\frac{G_{\max}}{\epsilon L_{\max}}\big)\big)$. Moreover, when the losses are $L_t$-smooth, the bound automatically improves to
$$R_T(u) \le O\Bigg(\big(G_{\max}+(M+\epsilon)L_{\max}\big)\big[(M+\epsilon)\Lambda^*_T+P_T\big] + \sqrt{\big(M^2\Lambda^*_T+MP_T\big)\min\Big\{\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell^*_t\big],\ \sum_{t=1}^T\big(G_t^2+L_t^2M^2\big)\Big\}}\Bigg).$$
Proof. First observe that we can assume that there is a $\tau = (\eta,D)\in S$ for which $D\ge\max_t\|u_t\| = M$, since otherwise using Lemma C.5 with $\varepsilon = \epsilon/\sqrt T$ the regret is bounded as
$$R_T(u) \le 2M\big(G_{\max}+ML_{\max}\big)\log_2\Big(\frac{M\sqrt T}{\epsilon}\Big). \tag{11}$$
Likewise, if $M\le D_{\min}$ then by Lemma C.6 we have
$$R_T(u) \le (G_{\max}+L_{\max}\epsilon)\big[K(M+P_T)+\epsilon C_T\big], \tag{12}$$
where $C_T\le O\big(\sqrt T\log\log(\frac{G_{\max}}{\epsilon L_{\max}})\big)$. Otherwise, we have $M\in[D_{\min},D_{\max}]$, in which case there is a $D_j = \frac{\epsilon2^j}{\sqrt T}$ for which $D_j\ge M\ge D_{j-1} = \frac12D_j$, so for any $\tau = (\eta,D_j)\in S$ we can apply Lemma C.2 to get
$$R_T(u) \le 2kC_S + 2kD_jG_{\max}\Lambda_T(\tau) + \frac{\|u_T\|^2+2D_jP_T+4kD_j^2\Lambda_T(\tau)}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big] + 4\eta\sum_{t=1}^T\big\|g^{(\tau)}_t\big\|^2,$$
where $g^{(\tau)}_t\in\partial\ell_t(w^{(\tau)}_t)$, $P_T = \sum_{t=2}^T\|u_t-u_{t-1}\|$, and, using $D_j\le2M$,
$$\Lambda_T(\eta,D_j) = \log\Big(\frac{\sum_{\tilde\tau}\mu^2_{\tilde\tau}}{\mu^2_{(\eta,D_j)}}\Big)+1 \le \log\Big(\sum_{\tilde\tau}\mu^2_{\tilde\tau}\,(2\cdot2M)^2\Big[G_{\max}+\frac{2M}{\eta}\Big]^2\Big)+1 = \Lambda_T(\eta,2M) \le \Lambda_T(\eta_{\min},2M).$$
Thus, bounding $D_j\le2M$ and denoting $\Omega_T := \sum_{t=1}^T\big[KL_t[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)]+4\|g^{(\tau)}_t\|^2\big]$, we have
$$R_T(u) \le 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + \frac{M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T}{2\eta} + \eta\,\Omega_T. \tag{13}$$
Next, we show that there is an $\eta\in S_\eta$ for which the above expression is well-controlled. Observe that choosing $\eta$ optimally in Equation (13) would yield
$$\eta^* = \sqrt{\frac{M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T}{2\Omega_T}}.$$
If $\eta^*\ge\eta_{\max}$, then choosing $\eta = \eta_{\max} = \frac{1}{KL_{\max}}$ yields
$$R_T(u) \le 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + \frac{KL_{\max}}{2}\big[M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big] + \sqrt{\tfrac12\big[M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big]\Omega_T}, \tag{14}$$
using $\eta_{\max}\Omega_T\le\eta^*\Omega_T$. Similarly, if $\eta^*\le\eta_{\min}$, then choosing $\eta = \eta_{\min}$ yields
$$R_T(u) \le 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + \sqrt{\tfrac12\big[M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big]\Omega_T} + \eta_{\min}\Omega_T.$$
Observe that by Lemma C.4, we have
$$\Omega_T = \sum_{t=1}^TKL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big]+4\big\|g^{(\tau)}_t\big\|^2 \le \sum_{t=1}^T3(K+4)\big(G^2_{\max}+L^2_{\max}D_j^2\big) \le 3(K+4)\big(G^2_{\max}+4M^2L^2_{\max}\big)T,$$
so that
$$\eta_{\min}\Omega_T \le \frac{\epsilon\cdot3(K+4)\big(G^2_{\max}+4M^2L^2_{\max}\big)T}{K(G_{\max}+\epsilon L_{\max})T} \le \frac{3(K+4)}{K}\big(\epsilon G_{\max}+4M^2L_{\max}\big) \le (K+4)\big(\epsilon G_{\max}+4M^2L_{\max}\big)$$
for $K\ge3$, so overall when $\eta^*\le\eta_{\min}$ the regret can be bounded as
$$R_T(u) \le 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + \sqrt{\tfrac12\big[M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big]\Omega_T} + (K+4)\epsilon G_{\max}+4(K+4)M^2L_{\max}. \tag{15}$$
Finally, if $\eta^*\in[\eta_{\min},\eta_{\max}]$, then there is an $\eta_i = \frac{2^i\epsilon}{K(G_{\max}+\epsilon L_{\max})T}$ such that $\eta_i\le\eta^*\le\eta_{i+1} = 2\eta_i$, so choosing $\eta = \eta_i$, Equation (13) is bounded by
$$R_T(u) \le 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + 3\sqrt{\tfrac12\big[M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big]\Omega_T}. \tag{16}$$
Now combining Equations (11), (12) and (14) to (16), we have
$$R_T(u) \le 2M\big(G_{\max}+ML_{\max}\big)\log_2\Big(\frac{M\sqrt T}{\epsilon}\Big) + (G_{\max}+L_{\max}\epsilon)\big[K(M+P_T)+\epsilon C_T\big] + 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + \frac{KL_{\max}}{2}\big[M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big] + 3\sqrt{\tfrac12\big[M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big]\Omega_T} + (K+4)\epsilon G_{\max}+4(K+4)M^2L_{\max}.$$
From Lemma C.3 we have
$$2kC_S \le 4k\epsilon\Big(G_{\max}+\frac{K\epsilon L_{\max}}{\sqrt T}\Big), \qquad \Lambda_T(\eta_{\min},2M) \le O\Big(\log\Big(\frac{MT\log(T)}{\epsilon}\Big)+\log\log\Big(\frac{G_{\max}}{\epsilon L_{\max}}\Big)\Big) =: \Lambda^*_T.$$
Hence, hiding constants, we may write
$$R_T(u) \le O\Big(G_{\max}\big[(M+\epsilon)\Lambda^*_T+P_T\big] + L_{\max}\big[(M+\epsilon)^2\Lambda^*_T+(M+\epsilon)P_T\big] + \sqrt{\big(M^2\Lambda^*_T+MP_T\big)\Omega_T}\Big).$$
Finally, the proof is completed by observing that if the $\ell_t$ are $L_t$-smooth, then using the self-bounding property we have $\|g^{(\tau)}_t\|^2\le2L_t\big(\ell_t(w^{(\tau)}_t)-\ell^*_t\big)$ for $\ell^*_t = \min_{w\in W}\ell_t(w)$, and thus
$$\Omega_T \le \sum_{t=1}^TKL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big] + 8\sum_{t=1}^TL_t\big[\ell_t(w^{(\tau)}_t)-\ell^*_t\big] \le \sum_{t=1}^TKL_t\big[\ell_t(u_t)-\ell^*_t\big],$$
where the last step chooses $K\ge8$; simultaneously, we have using Lemma C.4 that $\Omega_T\le\sum_t3(K+4)\big(G_t^2+L_t^2D_j^2\big)\le3(K+4)\sum_t\big(G_t^2+4L_t^2M^2\big)$, and so $\Omega_T\le O\big(\min\big\{\sum_tL_t[\ell_t(u_t)-\ell^*_t],\ \sum_t(G_t^2+L_t^2M^2)\big\}\big)$. □
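The self-bounding property invoked at the end of this proof is easy to verify numerically; the quadratic loss below is our own illustrative choice.

```python
import numpy as np

# Check self-bounding for smooth losses: ||grad l(w)||^2 <= 2L (l(w) - l*),
# here for the L-smooth loss l(w) = (L/2)(w - b)^2, whose minimum is l* = 0.
rng = np.random.default_rng(1)
L, b = 2.0, 0.7
w = rng.normal(size=1000)
loss = 0.5 * L * (w - b) ** 2
grad = L * (w - b)
assert np.all(grad**2 <= 2 * L * loss + 1e-9)
print("self-bounding holds on all samples")
```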
C.2. Proof of Theorem 4.3

Theorem 4.3. For all $t$ let $\ell_t: W\to\mathbb R$ be a $(G_t,L_t)$-quadratically bounded and $L_t$-smooth convex function with $G_t\in[0,G_{\max}]$ and $L_t\in[0,L_{\max}]$. Let $\epsilon>0$, $K\ge8$, and for any $i,j\ge0$ let $D_j = \frac{\epsilon2^j}{\sqrt T}\wedge\frac{\epsilon2^T}{\sqrt T}$ and $\eta_i = \frac{2^i}{KL_{\max}\sqrt T}\wedge\frac{1}{KL_{\max}}$, and let $S = \{(\eta_i,D_j): i,j\ge0\}$. Then for any $u = (u_1,\dots,u_T)$ in $W$, Algorithm 3 guarantees
$$R_T(u) \le O\Bigg(G_{\max}(M+\epsilon)\Lambda^*_T + L_{\max}(M+\epsilon)^2\Lambda^*_T + L_{\max}(M+\epsilon)P_T + \sqrt{\sum_{t=1}^T\big[\ell_t(u_t)-\ell^*_t\big]^2} + \sqrt{\big(M^2\Lambda^*_T+MP_T\big)\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell^*_t\big]}\Bigg),$$
where $M = \max_t\|u_t\|$, $P_T = \sum_{t=2}^T\|u_t-u_{t-1}\|$, and $\Lambda^*_T\le O\big(\log\big(\frac{MT}{\epsilon}\big)\big)$.

Proof. By Lemma C.5, we can assume that there is a $\tau = (\eta,D)\in S$ for which $D\ge\max_t\|u_t\| = M$, since otherwise the regret is bounded as
$$R_T(u) \le 2M\big(G_{\max}+ML_{\max}\big)\log_2\Big(\frac{M\sqrt T}{\epsilon}\Big). \tag{17}$$
Hence, we can assume there is a $(\eta,D)\in S$ which has $M\le D$. For any such $(\eta,D)\in S$, we can apply Lemma C.2 to get
$$R_T(u) \le 2kC_S + 2kDG_{\max}\Lambda_T(\tau) + \frac{\|u_T\|^2+2DP_T+4kD^2\Lambda_T(\tau)}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big] + 4\eta\sum_{t=1}^T\big\|g^{(\tau)}_t\big\|^2,$$
where $g^{(\tau)}_t\in\partial\ell_t(w^{(\tau)}_t)$, $P_T = \sum_{t=2}^T\|u_t-u_{t-1}\|$, $C_S = \frac{\sum_{\tilde\tau}\mu_{\tilde\tau}}{\sum_{\tilde\tau}\mu^2_{\tilde\tau}}$, and $\Lambda_T(\tau) = \log\big(\sum_{\tilde\tau}\mu^2_{\tilde\tau}/\mu^2_\tau\big)+1$, with $\mu_{\tilde\tau} = \frac{1}{2\tilde D(G_{\max}+\tilde D/\tilde\eta)}$ for any $\tilde\tau = (\tilde\eta,\tilde D)\in S$. Using the self-bounding property of smooth functions, for any $g^{(\tau)}_t\in\partial\ell_t(w^{(\tau)}_t)$ we have $\|g^{(\tau)}_t\|^2\le2L_t\big[\ell_t(w^{(\tau)}_t)-\ell^*_t\big]$ for $\ell^*_t = \min_{w\in W}\ell_t(w)$, so the last two terms are bounded as
$$K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell_t(w^{(\tau)}_t)\big] + 8\eta\sum_{t=1}^TL_t\big[\ell_t(w^{(\tau)}_t)-\ell^*_t\big] \le K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell^*_t\big]$$
for $K\ge8$. Hence,
$$R_T(u) \le 2kC_S + 2kDG_{\max}\Lambda_T(\tau) + \frac{\|u_T\|^2+2DP_T+4kD^2\Lambda_T(\tau)}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell^*_t\big]. \tag{18}$$
Now suppose that $M\le D_{\min}$; then choosing $\tau = \tau_{\min} = (\eta_{\min},D_{\min})$ we would have
$$R_T(u) \le 2kC_S + 2kD_{\min}G_{\max}\Lambda_T(\tau_{\min}) + \frac{D_{\min}}{2\eta_{\min}}\big(M+2P_T+4kD_{\min}\Lambda_T(\tau_{\min})\big) + K\eta_{\min}\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell^*_t\big]$$
$$\le 2kC_S + 2k\epsilon G_{\max}\Lambda_T(\tau_{\min}) + \frac{K\epsilon L_{\max}}{2}\Big(M+2P_T+\frac{4k\epsilon\Lambda_T(\tau_{\min})}{\sqrt T}\Big) + \sqrt{\sum_{t=1}^T\big[\ell_t(u_t)-\ell^*_t\big]^2}, \tag{19}$$
where the last line applies the Cauchy-Schwarz inequality to bound $K\eta_{\min}\sum_tL_t[\ell_t(u_t)-\ell^*_t]\le\sqrt{\sum_t[\ell_t(u_t)-\ell^*_t]^2}$, observes that $D_{\min}/\eta_{\min} = K\epsilon L_{\max}$, and recalls $D_{\min} = \frac{\epsilon}{\sqrt T}$. Finally, assume that $M\in[D_{\min},D_{\max}]$; then there is a $D_j = \frac{\epsilon2^j}{\sqrt T}$ for which $D_j\ge M\ge D_{j-1} = \frac12D_j$. Then, choosing $\tau = (\eta,D_j)$, Equation (18) yields
$$R_T(u) \le 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + \frac{M^2+4MP_T+16kM^2\Lambda_T(\eta_{\min},2M)}{2\eta} + K\eta\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell^*_t\big], \tag{20}$$
where we've observed that
$$\Lambda_T(\eta,D_j) = \log\Big(\frac{\sum_{\tilde\tau}\mu^2_{\tilde\tau}}{\mu^2_{(\eta,D_j)}}\Big)+1 \le \log\Big(\sum_{\tilde\tau}\mu^2_{\tilde\tau}\,(2\cdot2M)^2\Big[G_{\max}+\frac{2M}{\eta}\Big]^2\Big)+1 = \Lambda_T(\eta,2M) \le \Lambda_T(\eta_{\min},2M),$$
so it remains to show that there is an $\eta$ that favorably balances the last two terms of Equation (20). Observe that the optimal choice for $\eta$ would be
$$\eta^* = \sqrt{\frac{M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T}{2K\sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell^*_t\big]}}.$$
If $\eta^*\le\eta_{\min}$, then choosing $\eta = \eta_{\min}$ we have
$$R_T(u) \le 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + \sqrt{2\big(M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big)K\Omega_T} + \sqrt{\sum_{t=1}^T\big[\ell_t(u_t)-\ell^*_t\big]^2}, \tag{21}$$
where the last line defines the shorthand notation $\Omega_T = \sum_{t=1}^TL_t\big[\ell_t(u_t)-\ell^*_t\big]$ and again uses the Cauchy-Schwarz inequality to bound $K\eta_{\min}\sum_tL_t[\ell_t(u_t)-\ell^*_t]\le\sqrt{\sum_t[\ell_t(u_t)-\ell^*_t]^2}$. Likewise, if $\eta^*\ge\eta_{\max}$ then by choosing $\eta = \eta_{\max}$ we have via Equation (20) that
$$R_T(u) \le 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + KL_{\max}\big(M^2\big(1+8k\Lambda_T(\eta_{\min},2M)\big)+2MP_T\big) + \sqrt{2\big(M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big)K\Omega_T}. \tag{22}$$
Finally, if $\eta^*\in[\eta_{\min},\eta_{\max}]$ then there is an $\eta_i = \frac{2^i}{KL_{\max}\sqrt T}$ for which $\eta_i\le\eta^*\le\eta_{i+1} = 2\eta_i$, so Equation (20) gives us
$$R_T(u) \le 2kC_S + 4kMG_{\max}\Lambda_T(\eta_{\min},2M) + 3\sqrt{\tfrac12\big(M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big)K\Omega_T}. \tag{23}$$
Finally, combining Equations (17), (19) and (21) to (23), we have
$$R_T(u) \le 2kC_S + 4kG_{\max}\big(M\Lambda_T(\eta_{\min},2M)+\epsilon\Lambda_T(\tau_{\min})\big) + 2M\big(G_{\max}+ML_{\max}\big)\log_2\Big(\frac{M\sqrt T}{\epsilon}\Big) + \frac{K\epsilon L_{\max}}{2}\Big(M+2P_T+\frac{4k\epsilon\Lambda_T(\tau_{\min})}{\sqrt T}\Big) + KL_{\max}\big(M^2\big(1+8k\Lambda_T(\eta_{\min},2M)\big)+2MP_T\big) + \sqrt{\sum_{t=1}^T\big[\ell_t(u_t)-\ell^*_t\big]^2} + 3\sqrt{2\big(M^2\big(1+16k\Lambda_T(\eta_{\min},2M)\big)+4MP_T\big)K\Omega_T}.$$
Lastly, note that by Lemma C.3 we have
$$2kC_S \le 4k\epsilon\Big(G_{\max}+\frac{K\epsilon L_{\max}}{\sqrt T}\Big), \qquad \Lambda_T(\tau_{\min}) \le \log(6|S_\eta|)+1 \le O\big(\log\log T\big), \qquad \Lambda_T(\eta_{\min},2M) \le O\Big(\log\Big(\frac{MT}{\epsilon}\Big)\Big).$$
Overall, we have
$$R_T(u) \le O\Bigg(G_{\max}(M+\epsilon)\Lambda^*_T + L_{\max}(M+\epsilon)^2\Lambda^*_T + L_{\max}(M+\epsilon)P_T + \sqrt{\sum_{t=1}^T\big[\ell_t(u_t)-\ell^*_t\big]^2} + \sqrt{\big(M^2\Lambda^*_T+MP_T\big)\Omega_T}\Bigg),$$
where $\Lambda^*_T\le O\big(\log\big(\frac{MT}{\epsilon}\big)\big)$. □
C.3. Proof of Theorem 4.2

We focus on the case where $G/L\ge M$, since otherwise when $G/L\le M$ the loss function $\ell_t(w) = \frac12(G+LM)\xi_tw$ for $\xi_t\in\{-1,1\}$ satisfies $|\ell'_t(w)| = \frac12(G+LM)$, so $\ell_t$ is $\frac12(G+LM)$-Lipschitz. Hence, existing lower bounds tell us that there exists a sequence $\xi_t\in[-1,1]$ such that
$$R_T(u) \ge \Omega\Big(\tfrac12(G+LM)\sqrt{MP_TT}\Big) = \Omega\Big(\tfrac12G\sqrt{MP_TT}+\tfrac12LM^{3/2}\sqrt{P_TT}\Big),$$
where $M = \max_t\|u_t\|$ and $P_T = \sum_{t=2}^T\|u_t-u_{t-1}\|$ (Zhang et al., 2018).

Theorem 4.2. For any $M>0$ there is a sequence of $(G,L)$-quadratically bounded functions with $G/L\ge M$ such that for any $\gamma\in[0,\frac12]$,
$$R_T(u) \ge \frac G4M^{1-\gamma}\big[P_TT\big]^\gamma + \frac L8M^{2-\gamma}\big[P_TT\big]^\gamma,$$
where $P_T = \sum_{t=2}^T\|u_t-u_{t-1}\|$ and $M\ge\max_t\|u_t\|$.

Proof. On each round $t$, we can always find a $u_t$ with $\langle u_t,w_t\rangle\le0$. Let $\|u_t\| := \sigma\le M$ for some $\sigma$ to be decided. Let $G>0$, $L\ge0$ such that $G/L\ge\sigma$, let $\xi_t = u_t/\|u_t\|$, and on each round set
$$\ell_t(w) = -\frac G2\langle\xi_t,w\rangle + \frac L4\big(\sigma-\langle\xi_t,w\rangle\big)^2.$$
Observe that these losses are $(\tilde G,\tilde L)$-quadratically bounded with $\tilde G = \frac12(G+\sigma L)$ and $\tilde L = L$, and $\tilde G/\tilde L\ge\sigma$ as required. Since $\langle\xi_t,w_t\rangle\le0$ and $\langle\xi_t,u_t\rangle = \|u_t\| = \sigma$, we have
$$\sum_{t=1}^T\ell_t(w_t)-\ell_t(u_t) \ge \frac12G\sigma T + \frac L4\sigma^2T.$$
Note also that the path-length of this comparator sequence is bounded as $\sum_{t=2}^T\|u_t-u_{t-1}\|\le2\sigma T$. Now for $\mu\in[0,1/2]$ set $\sigma = MT^{-\mu}$; then the path-length is bounded as $P_T\le2MT^{1-\mu}$ and the regret is bounded below by
$$R_T \ge \frac12GMT^{1-\mu} + \frac L4T^{1-2\mu}M^2.$$
Now set $\gamma = \frac{1-2\mu}{2-\mu}\in[0,\frac12]$ and consider the second term:
$$\frac L4T^{1-2\mu}M^2 = \frac L4\big(MT^{1-\mu}\big)^\gamma\big(MT^{1-\mu}\big)^{1-\gamma}T^{-\mu}M \ge \frac{L}{4\cdot2^\gamma}P_T^\gamma M^{2-\gamma}T^{(1-\mu)(1-\gamma)-\mu} = \frac{L}{4\cdot2^\gamma}M^{2-\gamma}\big[P_TT\big]^\gamma \ge \frac L8M^{2-\gamma}\big[P_TT\big]^\gamma,$$
where we used $P_T\le2MT^{1-\mu}$ and the last step observes $(1-\mu)(1-\gamma)-\mu = \gamma$ for $\gamma = \frac{1-2\mu}{2-\mu}$. Similarly,
$$\frac12GMT^{1-\mu} = \frac12G\big(MT^{1-\mu}\big)^\gamma\big(MT^{1-\mu}\big)^{1-\gamma} \ge \frac{G}{2\cdot2^\gamma}M^{1-\gamma}P_T^\gamma T^{(1-\mu)(1-\gamma)} \ge \frac G4M^{1-\gamma}\big[P_TT\big]^\gamma.$$
Combining the two displays completes the proof. □

D. Supporting Lemmas

We collect here various miscellaneous supporting lemmas that we use throughout the paper. The following lemma is standard but shown here for completeness.

Lemma D.1. Let $a_1,\dots,a_T$ be arbitrary non-negative numbers in $\mathbb R$. Then
$$\sqrt{a_{1:T}} \le \sum_{t=1}^T\frac{a_t}{\sqrt{\sum_{s=1}^ta_s}} \le 2\sqrt{a_{1:T}}.$$

Proof. By concavity of $x\mapsto\sqrt x$, we have $\sqrt{a_{1:t}}-\sqrt{a_{1:t-1}}\ge\frac{a_t}{2\sqrt{a_{1:t}}}$, so summing over $t$ and observing the resulting telescoping sum yields
$$\sum_{t=1}^T\frac{a_t}{\sqrt{a_{1:t}}} \le 2\sum_{t=1}^T\big(\sqrt{a_{1:t}}-\sqrt{a_{1:t-1}}\big) = 2\sqrt{a_{1:T}}.$$
For the lower bound, observe that
$$\sum_{t=1}^T\frac{a_t}{\sqrt{a_{1:t}}} \ge \sum_{t=1}^T\frac{a_t}{\sqrt{a_{1:T}}} = \frac{a_{1:T}}{\sqrt{a_{1:T}}} = \sqrt{a_{1:T}}. \;\square$$
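Lemma D.1 is likewise easy to sanity-check numerically:

```python
import numpy as np

# Check Lemma D.1: sqrt(a_{1:T}) <= sum_t a_t/sqrt(a_{1:t}) <= 2*sqrt(a_{1:T}).
rng = np.random.default_rng(3)
a = rng.uniform(0, 5, size=10_000)
cums = np.cumsum(a)
s = np.sum(a / np.sqrt(cums))
print(np.sqrt(cums[-1]) <= s <= 2 * np.sqrt(cums[-1]))  # expected: True
```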
We borrow the following lemma from Jacobsen & Cutkosky (2022).

Lemma D.2 (Jacobsen & Cutkosky, 2022, Lemma 7). Let $f:\mathbb R\to\mathbb R$ and let $g: W\to\mathbb R$ be defined as $g(x) = f(\|x\|)$. Suppose that $f'(x)$ is concave and non-negative. If $f$ is twice-differentiable at $\|x\|$ and $\|x\|>0$, then $\nabla^2g(x)\succeq f''(\|x\|)I$.

The following is a simple modification of the stability lemma used in Jacobsen & Cutkosky (2022), reported here with slight modification to handle a leading constant.

Lemma D.3. For all $t$, set $\psi_t(w) = \Psi_t(\|w\|)$ where $\Psi_t:\mathbb R_{\ge0}\to\mathbb R_{\ge0}$ is a convex function satisfying $\Psi'_t(x)\ge0$, $\Psi''_t(x)\ge0$, and $\Psi'''_t(x)\le0$ for all $x\ge0$. Let $c>0$ and assume that there exists an $\bar x_t>0$ and a $\frac{1}{G_t}$-Lipschitz convex function $\eta_t:\mathbb R_{\ge0}\to\mathbb R_{\ge0}$ such that $|\Psi'''_t(x)|\le\frac{2\eta'_t(x)}{(c+1)^2}\Psi''_t(x)^2$ for all $x\ge\bar x_t$. Then for any $w_{t+1}, w_t\in W$,
$$\hat\delta_t \overset{\text{def}}{=} cG_t\|w_t-w_{t+1}\| - D_{\psi_t}(w_{t+1}|w_t) - \eta_t(\|w_{t+1}\|)G_t^2 \le \frac{(c+1)^2G_t^2}{2\Psi''_t(\bar x_t)}.$$

Proof. The proof follows using similar arguments to Jacobsen & Cutkosky (2022), with a few minor adjustments to account for the leading constant $c$. First, consider the case that the origin is contained in the line segment connecting $w_t$ and $w_{t+1}$. Then there exist sequences $\hat w^1_t,\hat w^2_t,\dots$ and $\hat w^1_{t+1},\hat w^2_{t+1},\dots$ such that $\lim_n\hat w^n_t = w_t$, $\lim_n\hat w^n_{t+1} = w_{t+1}$, and $0$ is not contained in the line segment connecting $\hat w^n_t$ and $\hat w^n_{t+1}$ for any $n$. Since $\psi_t$ is twice differentiable everywhere except the origin, if we define $\hat\delta^n_t = cG_t\|\hat w^n_t-\hat w^n_{t+1}\| - D_{\psi_t}(\hat w^n_{t+1}|\hat w^n_t) - \eta_t(\|\hat w^n_{t+1}\|)G_t^2$, then $\lim_n\hat\delta^n_t = \hat\delta_t$. Thus, it suffices to prove the result for the case that the origin is not contained in the line segment connecting $w_t$ and $w_{t+1}$; the rest of the proof considers exclusively this case.

Since the origin is not in the line segment connecting $w_t$ and $w_{t+1}$, $\psi_t$ is twice differentiable on this line segment. Thus, by Taylor's theorem, there is a $\tilde w$ on the line connecting $w_t$ and $w_{t+1}$ such that
$$D_{\psi_t}(w_{t+1}|w_t) = \frac12\|w_t-w_{t+1}\|^2_{\nabla^2\psi_t(\tilde w)} \ge \frac12\|w_t-w_{t+1}\|^2\,\Psi''_t(\|\tilde w\|),$$
where the last step observes $\psi_t(w) = \Psi_t(\|w\|)$ and uses the regularity assumptions $\Psi''_t(x)\ge0$ and $\Psi'_t(x)\ge0$ for $x\ge0$ to apply Lemma D.2. Hence,
$$\hat\delta_t \le cG_t\|w_t-w_{t+1}\| - \frac12\|w_t-w_{t+1}\|^2\Psi''_t(\|\tilde w\|) - \eta_t(\|w_{t+1}\|)G_t^2 \overset{(a)}{\le} cG_t\|w_t-w_{t+1}\| - \frac12\|w_t-w_{t+1}\|^2\Psi''_t(\|\tilde w\|) - \eta_t(\|\tilde w\|)G_t^2 + \eta'_t(\|\tilde w\|)G_t^2\|w_{t+1}-\tilde w\| \overset{(b)}{\le} (c+1)G_t\|w_t-w_{t+1}\| - \frac12\|w_t-w_{t+1}\|^2\Psi''_t(\|\tilde w\|) - \eta_t(\|\tilde w\|)G_t^2 \overset{(c)}{\le} \frac{(c+1)^2G_t^2}{2\Psi''_t(\|\tilde w\|)} - \eta_t(\|\tilde w\|)G_t^2,$$
where $(a)$ uses convexity of $\eta_t(x)$, $(b)$ uses the Lipschitz assumption $\eta'_t(\|\tilde w\|)\le1/G_t$ and the fact that $\|\tilde w-w_{t+1}\|\le\|w_t-w_{t+1}\|$ for any $\tilde w$ on the line connecting $w_t$ and $w_{t+1}$, and $(c)$ uses the Fenchel-Young inequality.

If $\|\tilde w\|\le\bar x_t$, then we have
$$\frac{(c+1)^2G_t^2}{2\Psi''_t(\|\tilde w\|)} - \eta_t(\|\tilde w\|)G_t^2 \le \frac{(c+1)^2G_t^2}{2\Psi''_t(\bar x_t)},$$
which follows from the fact that $\Psi'''_t(x)\le0$ implies $\Psi''_t(x)$ is non-increasing in $x$, and hence $\Psi''_t(\|\tilde w\|)\ge\Psi''_t(\bar x_t)$. Otherwise, if $\|\tilde w\|\ge\bar x_t$, we have by assumption that
$$\frac{|\Psi'''_t(x)|}{\Psi''_t(x)^2} = \frac{-\Psi'''_t(x)}{\Psi''_t(x)^2} = \frac{d}{dx}\frac{1}{\Psi''_t(x)} \le \frac{2\eta'_t(x)}{(c+1)^2}$$
for any $x\ge\bar x_t$, so integrating from $\bar x_t$ to $\|\tilde w\|$ we have
$$\frac{1}{\Psi''_t(\|\tilde w\|)} \le \frac{1}{\Psi''_t(\bar x_t)} + \frac{2}{(c+1)^2}\int_{\bar x_t}^{\|\tilde w\|}\eta'_t(x)\,dx \le \frac{1}{\Psi''_t(\bar x_t)} + \frac{2\eta_t(\|\tilde w\|)}{(c+1)^2},$$
and hence
$$\frac{(c+1)^2G_t^2}{2\Psi''_t(\|\tilde w\|)} - \eta_t(\|\tilde w\|)G_t^2 \le \frac{(c+1)^2G_t^2}{2\Psi''_t(\bar x_t)} + \frac{(c+1)^2G_t^2}{2}\cdot\frac{2\eta_t(\|\tilde w\|)}{(c+1)^2} - \eta_t(\|\tilde w\|)G_t^2 = \frac{(c+1)^2G_t^2}{2\Psi''_t(\bar x_t)},$$
so in either case we have
$$\hat\delta_t = cG_t\|w_t-w_{t+1}\| - D_{\psi_t}(w_{t+1}|w_t) - \eta_t(\|w_{t+1}\|)G_t^2 \le \frac{(c+1)^2G_t^2}{2\Psi''_t(\bar x_t)}. \;\square$$