Fully Unconstrained Online Learning

Ashok Cutkosky (Boston University, ashok@cutkosky.com) and Zakaria Mhammedi (Google Research, mhammedi@google.com)

Abstract. We provide a technique for online convex optimization that obtains regret $G\|w_\star\|\sqrt{T} + \|w_\star\|^2 + G^2$ on $G$-Lipschitz losses for any comparison point $w_\star$ without knowing either $G$ or $\|w_\star\|$. Importantly, this matches the optimal bound $G\|w_\star\|\sqrt{T}$ available with such knowledge (up to logarithmic factors), unless either $\|w_\star\|$ or $G$ is so large that even $G\|w_\star\|\sqrt{T}$ is roughly linear in $T$. Thus, at a high level it matches the optimal bound in all cases in which one can achieve sublinear regret.

1 Unconstrained Online Learning

This paper provides new algorithms for online learning, which is a standard framework for the design and analysis of iterative first-order optimization algorithms used throughout machine learning. Specifically, we consider a variant of online learning often called online convex optimization [1, 2]. Formally, an online learning algorithm is designed to play a kind of game between the learning algorithm and the environment, which we can describe using the following protocol:

Protocol 1. Online Learning / Online Convex Optimization.
Input: convex domain $W \subseteq \mathbb{R}^d$, number of rounds $T$.
For $t = 1, \dots, T$:
1. Learner outputs $w_t \in W$.
2. Nature reveals loss vector $g_t \in \partial \ell_t(w_t)$ for some convex function $\ell_t : W \to \mathbb{R}$ to the learner.
3. Learner suffers loss $\langle g_t, w_t \rangle$.

The learner is evaluated with the regret $\sum_{t=1}^T (\ell_t(w_t) - \ell_t(w_\star))$ against comparators $w_\star \in W$. By convexity, the regret is bounded by the linearized regret $\sum_{t=1}^T \langle g_t, w_t - w_\star \rangle$. Our goal is to ensure that for all $w_\star \in W$ simultaneously:

$$\mathrm{Regret}_T(w_\star) = \sum_{t=1}^T \langle g_t, w_t - w_\star \rangle \overset{\text{goal}}{\le} \tilde O\left( \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2} \right). \quad (1)$$

Qualitatively, we consider a learner to be performing well if $\frac{1}{T}\sum_{t=1}^T (\ell_t(w_t) - \ell_t(w_\star))$ is very small, usually going to zero as $T \to \infty$. This indicates that the average loss of the learner is close to the average loss of any chosen comparison point $w_\star \in W$. This property is called "sublinear regret".
The bound (1) is unimprovable in general [3, 4, 5], and clearly implies sublinear regret. Algorithms that achieve low regret are used in a variety of machine learning applications. Perhaps the most famous such application is in the analysis of stochastic gradient descent, which achieves (1) for an appropriately tuned learning rate [6]. More generally, stochastic convex optimization can be reduced to online learning via the online-to-batch conversion [7]. Roughly speaking, this result says that an online learning algorithm that guarantees low regret can be immediately converted into a stochastic convex optimization algorithm that converges at a rate of $\mathbb{E}[\mathrm{Regret}_T(w_\star)]/T$, where $w_\star$ is the minimizer of the objective. Online learning can also be used to solve non-convex optimization problems [8] and can even be used to prove concentration inequalities [9, 10, 11, 12]. In all of these cases, achieving the bound (1) produces methods that are optimal for their respective tasks. Thus, it is desirable to be able to achieve (1) in as robust a manner as possible.

[38th Conference on Neural Information Processing Systems (NeurIPS 2024).]

Our goal is to come as close as possible to achieving the bound (1) while requiring minimal prior user knowledge about the loss sequence $g_1, \dots, g_T$ and $w_\star$. In the past, several prior works have achieved the bound (1) when given prior knowledge of either $\|w_\star\|$ or $\max_t \|g_t\|$ [13, 14, 15, 16, 17, 18, 19, 20, 21]. However, such knowledge is frequently unavailable. Instead, many problems are "fully unconstrained" in the sense that we do not have any reasonable upper bounds on either $\|w_\star\|$ or $\max_t \|g_t\|$. In particular, when considering the application to stochastic convex optimization, knowledge of the values $\|w_\star\|$ and $\max_t \|g_t\|$ can be interpreted as knowledge of the correct learning rate for stochastic gradient descent [6].
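As a concrete illustration of the online-to-batch conversion mentioned above, the following sketch feeds stochastic gradients to a simple online learner and returns the average iterate, whose expected suboptimality is bounded by the average regret. The learner here is plain online gradient descent with a fixed step size, and the class/function names are our own illustrative choices, not the algorithm of this paper.

```python
import numpy as np

class GradientDescentLearner:
    """A minimal online learner: online gradient descent with a fixed step size."""
    def __init__(self, dim, lr):
        self.w = np.zeros(dim)
        self.lr = lr
    def predict(self):
        return self.w
    def update(self, g):
        # New array each step, so previously recorded iterates are unchanged.
        self.w = self.w - self.lr * g

def online_to_batch(learner, stochastic_grad, T):
    """Average iterate; satisfies E[F(avg) - F(w*)] <= E[Regret_T(w*)] / T."""
    iterates = []
    for _ in range(T):
        w = learner.predict()
        iterates.append(w)
        learner.update(stochastic_grad(w))
    return np.mean(iterates, axis=0)

# Toy example: F(w) = 0.5 * ||w - 1||^2 with additive gradient noise.
rng = np.random.default_rng(0)
target = np.ones(2)
grad = lambda w: (w - target) + 0.1 * rng.standard_normal(2)
w_avg = online_to_batch(GradientDescentLearner(2, lr=0.1), grad, T=2000)
```

With enough rounds the averaged iterate lands close to the minimizer, reflecting the $\mathbb{E}[\mathrm{Regret}_T(w_\star)]/T$ convergence rate described above.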
Thus, achieving the bound (1) with less prior knowledge roughly corresponds to building algorithms that are able to achieve optimal convergence guarantees without requiring manual hyperparameter tuning. For this reason, it is common to refer to such algorithms as "parameter-free". This paper focuses on this difficult but realistic setting.

Our new upper bound. Unfortunately, the bound (1) is actually unobtainable in general without prior knowledge of either the magnitude $\|w_\star\|$ or the value of $\max_t \|g_t\|$ [18, 22]. Nevertheless, we will obtain a new compromise bound. For any user-specified $\gamma > 0$, our method will achieve:

$$\sum_{t=1}^T \langle g_t, w_t - w_\star \rangle \le \tilde O\left( \frac{\max_{t \in [T]} \|g_t\|^2}{\gamma} + \gamma \|w_\star\|^2 + \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2} \right). \quad (2)$$

To dissect this compromise, let us consider the case $\|g_t\| = G$ for all $t$ and $\gamma = 1$. In this situation, our bound (2) is roughly $G^2 + \|w_\star\|^2 + \|w_\star\| G \sqrt{T}$, while the ideal bound (1) is merely $\|w_\star\| G \sqrt{T}$. However, for our bound to be significantly worse than (1), we must have either $G \ge \|w_\star\| \sqrt{T}$ or $\|w_\star\| \ge G \sqrt{T}$. In either case, we might expect that $\|w_\star\| G \sqrt{T}$ is roughly $\Omega(T)$ (assuming that neither $G$ nor $\|w_\star\|$ is very small). So, intuitively, the only cases in which our bound is worse than the ideal bound are those for which the ideal bound is already rather large: the problem is in some sense "too hard".

Comparison with previous bounds. Our bound (2) is not the first attempted compromise in our fully unconstrained setting. Prior work [18, 23] instead provides the bound:

$$\sum_{t=1}^T \langle g_t, w_t - w_\star \rangle \le \tilde O\left( \frac{1}{\gamma}\sqrt{\max_{t \in [T]} \|g_t\| \sum_{t=1}^T \|g_t\|} + \gamma^2 \max_{t \in [T]} \|g_t\| \, \|w_\star\|^3 + \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2} \right). \quad (3)$$

In fact, readers familiar with this literature may be surprised that our bound is even possible; [18] show that the bound (3) is optimal for the fully unconstrained case. However, the lower bound provided by [18] actually has a small loophole: it only applies to algorithms that insist on a linear dependence on $\max_t \|g_t\|$. Our method avoids this lower bound by instead suffering a quadratic dependence on $\max_t \|g_t\|$.
While our new bound (2) does not uniformly improve the prior bound (3), it has several qualitative differences that may make it more appealing.

1. The bound (3) does not have the desirable property outlined above for our new bound; for $\|g_t\| = G$, it is possible for the bound (3) to be much greater than the ideal bound (1) even when (1) is small.

2. The dependency on the user-specified value $\gamma$ is arguably more sensitive; in (3), decreasing $\gamma$ comes at a $\sqrt{T}$ cost, while increasing $\gamma$ comes at a $\gamma^2 \|w_\star\|^3$ cost. In contrast, in our bound (2), the $\gamma$-dependencies are milder: $G^2/\gamma$ for decreasing $\gamma$ (which does not depend on $T$) and $\gamma \|w_\star\|^2$ for increasing $\gamma$.

3. The previous compromise bound (3) has a term that depends on $\max_t \|g_t\| \left( \sum_{t=1}^T \|g_t\| \right)$ rather than $\sum_{t=1}^T \|g_t\|^2$. The dependence on the second power of $\|g_t\|$ is sometimes referred to as a "second-order bound" and is known to imply constant regret in certain settings [1, 24] (so-called "fast rates").

4. Consider the case that both bounds are tuned with their respective optimal values of $\gamma$. Our new bound would then reduce to $\tilde O\left( \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2} + \|w_\star\| G \right)$, while the previous bound would instead become $\tilde O\left( \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2} + \|w_\star\| G^{2/3} \left( \sum_{t=1}^T \|g_t\| \right)^{1/3} \right)$. Thus, our new bound appears more desirable even with individually optimal tuning.

5. Our bound ensures that when $w_\star = 0$, the dependence on $T$ is $O(1)$. This has a number of useful consequences. For example, by running a separate instance of our algorithm for each dimension of a $d$-dimensional problem, we can achieve:

$$\sum_{t=1}^T \langle g_t, w_t - w_\star \rangle \le \tilde O\left( \frac{1}{\gamma} \sum_{i=1}^d \max_{t \in [T]} g_{t,i}^2 + \gamma \|w_\star\|_2^2 + \sum_{i=1}^d |w_{\star,i}| \sqrt{\sum_{t=1}^T g_{t,i}^2} \right).$$

Attempting this with the bound (3) would incur a more significant dependence on the dimension $d$. More generally, this property means that our bound fits into the framework for combining regret bounds of [25].

2 Notation

Throughout this paper, we use $W$ to refer to a convex domain contained in $\mathbb{R}^d$.
Our results can in fact be extended to Banach spaces relatively easily using the reduction techniques of [16], but we focus on $\mathbb{R}^d$ here to keep things more familiar. We use $\|\cdot\|$ to indicate the Euclidean norm. Occasionally we also make use of other norms; these will always be indicated by some subscript (e.g., $\|\cdot\|_t$). We use $\mathbb{R}_{\ge 0}$ to indicate the set of non-negative reals. For a convex function $F$ over $\mathbb{R}^d$, the Fenchel conjugate of $F$ is $F^\star(\theta) = \sup_{x \in \mathbb{R}^d} \langle \theta, x \rangle - F(x)$. We occasionally make use of a compressed sum notation: $g_{a:b} = \sum_{t=a}^b g_t$. We use $O$ to hide constant factors and $\tilde O$ to hide both constant and logarithmic factors. All proofs not present in the main paper may be found in the appendix. We will refer to the values $g_t$ provided to an online learning algorithm interchangeably as "gradients", "feedback", and "loss values". We will refer to online learning algorithms occasionally as either "learners" or just "algorithms".

3 Overview of Approach

Our overall approach to achieving (2) is a sequence of reductions. As a first step, we observe that it suffices to achieve our goal in the special case $W = \mathbb{R}$. Specifically, Theorems 2 and 3 of [16] reduce the general-$W$ case to the $W = \mathbb{R}$ case. We provide an explicit description of how to apply these reductions in Appendix C. So, we focus our analysis on the case $W = \mathbb{R}$. Next, we reduce the problem to a variant of the online learning protocol in which we also must contend with some potentially non-Lipschitz regularization function (Section 3.1). Finally, we show how to achieve low regret in this special regularized setting (Section 3.3).

3.1 Hints and Regularization

Our bound is achieved via a reduction to a variant of Protocol 1 with two changes. First, the learner is provided with prior access to "magnitude hints" $h_t \in \mathbb{R}$ that satisfy $\|g_t\| \le h_t$. This notion of magnitude hints is also a key ingredient in the previous bound (3).
Our second change is that the loss is not only the linear loss $\langle g_t, w \rangle$, but a regularized non-linear loss $\langle g_t, w \rangle + a_t \psi(w)$ for some fixed function $\psi : W \to \mathbb{R}_{\ge 0}$ that we call a "regularizer". Formally, this protocol variant is specified in Protocol 2.

Protocol 2. Regularized Online Learning with Magnitude Hints.
Input: convex function $\psi : W \to \mathbb{R}_{\ge 0}$.
For $t = 1, \dots, T$:
1. Nature reveals magnitude hint $h_t \ge h_{t-1} \ge 0$ to the learner.
2. Learner outputs $w_t \in W$.
3. Nature reveals loss $g_t$ with $\|g_t\| \le h_t$ and $a_t \in [0, \gamma]$ to the learner.
4. Learner suffers loss $\langle g_t, w_t \rangle + a_t \psi(w_t)$.

The learner is evaluated with the regularized regret $\sum_{t=1}^T \left( \langle g_t, w_t - w_\star \rangle + a_t \psi(w_t) - a_t \psi(w_\star) \right)$. The goal is to obtain:

$$\sum_{t=1}^T \left( \langle g_t, w_t - w_\star \rangle + a_t \psi(w_t) - a_t \psi(w_\star) \right) \overset{\text{goal}}{\le} \tilde O\left( \|w_\star\| \sqrt{h_T^2 + \sum_{t=1}^T \|g_t\|^2} + \psi(w_\star) \sqrt{\gamma^2 + \sum_{t=1}^T a_t^2} \right). \quad (4)$$

In the special case that $\psi(w) = 0$ (i.e., the $a_t$ are irrelevant, or all 0), various algorithms achieving the desired bound (4) are available in the literature [18, 21, 23, 26]. We provide in Algorithm 3 a new algorithm for this situation that achieves the optimal logarithmic factors; there is in fact a Pareto frontier of incomparable bounds that differ in the logarithmic factors. [26] provides the first algorithm to reach this frontier, while our method can achieve all points on the frontier.¹ We include this result because it is of some independent interest, but it is not the major focus of our contributions. Any of the prior work in this area would roughly suffice for our broader purposes; the difference is only in the logarithmic terms.

Challenge of achieving (4). Achieving the bound (4) is challenging when $w_\star$ is not known ahead of time. To see why, let us briefly consider two potential solutions. The most immediate approach might be to reduce Protocol 2 to the case in which $a_t = 0$ for all $t$ by replacing $g_t$ with $g_t + a_t \nabla \psi(w_t)$, and then possibly modifying the magnitude hint $h_t$ in some way to now be a bound on $\|g_t + a_t \nabla \psi(w_t)\|$.
However, this approach is problematic because the resulting bound would now depend on $\sum_{t=1}^T \|g_t + a_t \nabla \psi(w_t)\|^2$ rather than on $\sum_{t=1}^T \|g_t\|^2$ and $\sum_{t=1}^T a_t^2$. This means that the naive regret bound would be very hard to interpret, as $w_t$ would appear on both the left- and right-hand sides of the inequality.

Another possibility is a follow-the-regularized-leader / potential-based algorithm, making updates:

$$w_{t+1} = \operatorname*{argmin}_{w \in W} \; P_t(w) + \sum_{i=1}^t \left( \langle g_i, w \rangle + a_i \psi(w) \right), \quad (5)$$

for some sequence of potential functions $P_t : W \to \mathbb{R}$. In fact, this approach can be very effective; this is roughly the method employed by [27] for a similar problem. However, deriving the correct potential $P_t$ and proving the desired regret bound can be very difficult, and could easily require separate analysis for each different possible $\psi$ function. For example, the analysis of [27] specifically applies to $\psi(w) = \|w\|^2$. There is other work on similar protocols using approximately this method, such as [28, 29], that also requires particular analysis for each setting. Finally, even if the bound can be achieved in general using this scheme, solving the optimization problem (5) may incur some undesirable computational overhead, even for intuitively simple regularizers such as $\psi(w) = \|w\|^2$. In fact, the method of [27] suffers from exactly this issue, which is why we provide an alternative approach in Section 3.3 for the special case of interest that $W = \mathbb{R}$.

Re-parametrizing to achieve (4). In order to achieve the bound (4) in the special case $W = \mathbb{R}$, we will employ a standard trick in convex optimization: re-parametrizing the objective as a convex constraint using the fact that the epigraph of a convex function is convex. Instead of having our learner output $w_t \in W$, we will output $(x_t, y_t) \in W \times \mathbb{R}$, subject to the constraint that $y_t \ge \psi(x_t)$. We provide details of this approach in Section 3.3.
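Returning briefly to the FTRL template (5): in the simplest scalar instance, with $\psi(w) = w^2$ and a fixed quadratic potential $P_t(w) = w^2/(2\eta)$, the update (5) has a closed form. The sketch below is our own illustration under these assumed choices; the paper's point is precisely that finding the *correct* potential $P_t$ in general is hard, and these simple choices do not achieve the bound (4).

```python
def ftrl_step(grad_sum, a_sum, eta):
    """Closed-form minimizer of  w^2/(2*eta) + grad_sum*w + a_sum*w^2.

    Setting the derivative w/eta + grad_sum + 2*a_sum*w to zero gives:
    """
    return -grad_sum / (1.0 / eta + 2.0 * a_sum)

def ftrl_run(gs, as_, eta=0.5):
    """Play update (5) on scalar losses g_t*w + a_t*w^2; returns the iterates."""
    ws, grad_sum, a_sum = [], 0.0, 0.0
    for g, a in zip(gs, as_):
        ws.append(ftrl_step(grad_sum, a_sum, eta))  # w_t uses sums up to t-1
        grad_sum += g
        a_sum += a
    return ws
```

Note how the accumulated regularization coefficients `a_sum` simply strengthen the quadratic term, pulling the iterates toward zero.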
¹It is likely that the approach of [26] in concert with the varying potentials of [20] would achieve all points on the frontier as well, although our analysis takes a different direction using the centered mirror descent method of [21].

With all of these technicalities introduced, we are ready to provide an outline of our method. The key idea is to show that for a very peculiar choice of coefficients $a_1, \dots, a_T$ and some simple clipping of the gradients $g_t$, we are able to achieve the following result.

Theorem 1. There exists an online learning algorithm that requires $O(d)$ space and takes $O(d)$ time per update, takes as input scalar values $\gamma$, $h_1$, and $\epsilon$, and ensures that for any sequence $g_1, g_2, \ldots \in \mathbb{R}^d$, the outputs $w_1, w_2, \ldots \in \mathbb{R}^d$ satisfy, for all $w_\star$ and $T$:

$$\sum_{t=1}^T \langle g_t, w_t - w_\star \rangle \le O\Bigg( \epsilon G + \epsilon^2 \gamma + \frac{G^2}{\gamma} \log\Big(e + \frac{G}{h_1}\Big) + \|w_\star\| \sqrt{V \log\Big(e + \frac{\|w_\star\| \sqrt{V} \log^2 T}{h_1 \epsilon}\Big)} + \|w_\star\| G \log\Big(e + \frac{\|w_\star\| \sqrt{V} \log^2 T}{h_1 \epsilon}\Big) + \gamma \|w_\star\|^2 \log\Big(e + \frac{\|w_\star\|^2}{\epsilon^2} \log\Big(e + \frac{G}{h_1}\Big)\Big) \Bigg),$$

where $G = \max(h_1, \max_{t \in [T]} \|g_t\|)$ and $V = G^2 + \sum_{t=1}^T \|g_t\|^2$.

Before proving this result, let us briefly unpack the algebra in the statement to see how it relates to our originally stated bound (2). Notice that if we drop all the logarithmic terms, the bound becomes:

$$\sum_{t=1}^T \langle g_t, w_t - w_\star \rangle \le O\left( \epsilon G + \epsilon^2 \gamma + \frac{G^2}{\gamma} + \|w_\star\| \sqrt{\sum_{t=1}^T \|g_t\|^2} + \|w_\star\| G + \gamma \|w_\star\|^2 \right).$$

Here we should think of $h_1$ and $\epsilon$ as conservative under-estimates of $\max_t \|g_t\|$ and $\|w_\star\|$. Notice that decreasing $h_1$ and $\epsilon$ only increases the terms inside the logarithms, so that in some sense the algorithm is very robust to even extremely conservative under-estimation. When it holds that $h_1 \le \max_t \|g_t\|$ and $\epsilon \le \|w_\star\|$, the above bound is exactly the previously stated equation (2).

3.2 Proof Sketch of Theorem 1

Let us suppose for now that we have access to an algorithm that achieves the bound (4) under Protocol 2; let us call it REG. In this section, we will detail how to use REG to achieve our desired goal (2) under Protocol 1 with $W = \mathbb{R}$; in this sketch, we treat all values as scalars, never vectors. Recall that it suffices to consider $W = \mathbb{R}$ to achieve the result in general.
Given an output $w_t$ from REG, we play $w_t$ and observe the gradient $g_t$. We will then produce a modified gradient $\bar g_t$, a scalar $a_t$, and a magnitude hint $h_{t+1}$ to provide to REG, such that $\bar g_t$ and $a_t$ satisfy the constraints of Protocol 2. We will set $\psi(w) = w^2$, and then by careful choice of $\bar g_t$, $a_t$, and $h_{t+1}$, we will be able to establish Theorem 1.

There are two key steps in our reduction. The first step is a now-standard trick originally used by [18, 23, 30] to reduce the original Protocol 1 to Protocol 2. The idea is as follows: let us set $h_t = \max(h_1, |g_1|, \dots, |g_{t-1}|)$ for some given initial value $h_1 \ge 0$. Notice that $h_t$ may be computed before $g_t$ is revealed, and that the value $G$ specified in the theorem satisfies $G = h_{T+1}$. Then, upon receiving a gradient $g_t$, we replace $g_t$ with the clipped gradient $\bar g_t = \min\left(1, \frac{h_t}{|g_t|}\right) g_t$. The clipped gradient satisfies $|\bar g_t| \le h_t$ by definition. We then pass $\bar g_t$ in place of $g_t$ to an algorithm that interacts with Protocol 2. It is then relatively straightforward to see that for all $w_\star \in W$:

$$\sum_{t=1}^T g_t (w_t - w_\star) \le \sum_{t=1}^T \bar g_t (w_t - w_\star) + \sum_{t=1}^T |g_t - \bar g_t| |w_\star| + \sum_{t=1}^T |g_t - \bar g_t| |w_t| \le \sum_{t=1}^T \bar g_t (w_t - w_\star) + h_{T+1} |w_\star| + \sum_{t=1}^T (h_{t+1} - h_t) |w_t|.$$

At this point, prior work [18, 23] observed that if we could constrain $|w_t|$ to some chosen maximum value $D$, then the final summation above is at most $h_{T+1} D$. By carefully choosing $D$ in tandem with an algorithm that achieves (4) in the case $\psi(w) = 0$, one can achieve the previous compromise bound (3).

This is where our second key step (which is our main technical innovation) comes in. Instead of explicitly enforcing $|w_t| \le D$, we will apply a "soft" constraint by adding a regularizer. Surprisingly, we will add a very tiny amount of regularization and yet still achieve meaningful regret bounds. Recall that we are assuming access to an algorithm that achieves the bound (4) when interacting with Protocol 2. Let us set $\psi(w) = w^2$.
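Before continuing, the clipping step just described can be written down directly. The sketch below (function name is ours) computes the running hints $h_t$ and the clipped gradients $\bar g_t = \min(1, h_t/|g_t|)\, g_t$ for a scalar gradient stream.

```python
def clip_with_hints(gs, h1):
    """Return the list of pairs (h_t, clipped gradient gbar_t).

    h_t is the running maximum of h_1 and all *past* gradient magnitudes,
    so it is available before g_t is revealed, and |gbar_t| <= h_t always.
    """
    h = h1
    out = []
    for g in gs:
        scale = min(1.0, h / abs(g)) if g != 0 else 1.0
        out.append((h, scale * g))   # (hint h_t, clipped gradient gbar_t)
        h = max(h, abs(g))           # h_{t+1}, available after g_t is revealed
    return out
```

For instance, with $h_1 = 1$ and gradients $(0.5, 2, 1)$, the hints are $(1, 1, 2)$ and the clipped gradients are $(0.5, 1, 1)$: the oversized second gradient is scaled down to the current hint.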
Then, observe that for any choices of $a_1, \dots, a_T$:

$$\sum_{t=1}^T g_t (w_t - w_\star) \le \sum_{t=1}^T \left( \bar g_t (w_t - w_\star) + a_t \psi(w_t) - a_t \psi(w_\star) \right) + \sum_{t=1}^T |g_t - \bar g_t| |w_\star| + \left( \sum_{t=1}^T a_t \right) w_\star^2 + \sum_{t=1}^T \left( |g_t - \bar g_t| |w_t| - a_t w_t^2 \right)$$
$$\le \underbrace{\sum_{t=1}^T \left( \bar g_t (w_t - w_\star) + a_t \psi(w_t) - a_t \psi(w_\star) \right)}_{\text{controlled by (4)}} + \underbrace{h_{T+1} |w_\star| + \left( \sum_{t=1}^T a_t \right) w_\star^2}_{\text{needs small } a_t} + \underbrace{\sum_{t=1}^T \left( (h_{t+1} - h_t) |w_t| - a_t w_t^2 \right)}_{\text{needs big } a_t}.$$

From the above decomposition, we see that to make the overall regret small, we would like to choose $a_t$ such that $\sum_{t=1}^T a_t$ is small, but also such that $a_t$ is large enough that $\sum_{t=1}^T \left( (h_{t+1} - h_t) |w_t| - a_t w_t^2 \right)$ is also small. It turns out that this is accomplished by the following choice of $a_t$:

$$a_t = \gamma \cdot \frac{(h_{t+1} - h_t)/h_{t+1}}{1 + \sum_{i=1}^t (h_{i+1} - h_i)/h_{i+1}}.$$

Here, $\gamma$ is an arbitrary user-specified constant. Notice that the value of $h_{t+1}$ is available immediately after $g_t$ is revealed, so that it is possible to set this value of $a_t$. Moreover, it is clear that $a_t \in [0, \gamma]$ for all $t$.

Let us see how this value of $a_t$ satisfies our desired properties. First, recall the bound $\log(p + q) - \log(q) \ge \frac{p}{p+q}$ for any $p, q > 0$, which implies $\sum_{t=1}^T \frac{p_t}{\sum_{i=0}^t p_i} \le \log\left( \sum_{t=0}^T p_t / p_0 \right)$ for any sequence of positive numbers $p_0, \dots, p_T$. From this, we have:

$$\sum_{t=1}^T a_t = \gamma \sum_{t=1}^T \frac{(h_{t+1} - h_t)/h_{t+1}}{1 + \sum_{i=1}^t (h_{i+1} - h_i)/h_{i+1}} \le \gamma \log\left( 1 + \sum_{t=1}^T \frac{h_{t+1} - h_t}{h_{t+1}} \right) \le \gamma \log\left( 1 + \log(G/h_1) \right).$$

Thus, $\sum_{t=1}^T a_t$ is in fact doubly logarithmic in the ratio between $h_1$ and $h_{T+1} = \max(h_1, \max_t |g_t|) = G$. Next, let us check that $a_t$ is large enough to make $\sum_{t=1}^T \left( (h_{t+1} - h_t) |w_t| - a_t w_t^2 \right)$ small. To this end, observe that:

$$(h_{t+1} - h_t) |w_t| - a_t w_t^2 \le \sup_X \left( (h_{t+1} - h_t) X - a_t X^2 \right) = \frac{(h_{t+1} - h_t)^2}{4 a_t} = \frac{h_{t+1} (h_{t+1} - h_t)}{4 \gamma} \left( 1 + \sum_{i=1}^t \frac{h_{i+1} - h_i}{h_{i+1}} \right) \le \frac{h_{T+1} (h_{t+1} - h_t)}{4 \gamma} \left( 1 + \sum_{i=1}^T \frac{h_{i+1} - h_i}{h_{i+1}} \right) \le \frac{G (h_{t+1} - h_t)}{4 \gamma} \left( 1 + \log(G/h_1) \right),$$

where we used that $G = h_{T+1}$. Thus, summing over $t$ and using $\sum_{t=1}^T (h_{t+1} - h_t) \le G$, we have:

$$\sum_{t=1}^T \left( (h_{t+1} - h_t) |w_t| - a_t w_t^2 \right) \le \frac{G^2}{4 \gamma} \left( 1 + \log(G/h_1) \right).$$
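Both properties of this choice of $a_t$ can be sanity-checked numerically. The snippet below (our own illustration, not a proof) computes the coefficients for a hint sequence that doubles each round and verifies that every $a_t \in [0, \gamma]$ while $\sum_t a_t$ stays below $\gamma \log(1 + \log(G/h_1))$.

```python
import math

def regularizer_coeffs(hints, gamma):
    """Given hints = [h_1, ..., h_{T+1}], return [a_1, ..., a_T] with
    a_t = gamma * r_t / (1 + sum_{i<=t} r_i), r_t = (h_{t+1}-h_t)/h_{t+1}."""
    coeffs, run = [], 0.0
    for h, h_next in zip(hints, hints[1:]):
        r = (h_next - h) / h_next
        run += r                       # running sum includes the current r_t
        coeffs.append(gamma * r / (1.0 + run))
    return coeffs

gamma, h1 = 2.0, 1.0
hints = [h1 * 2.0 ** k for k in range(11)]   # doubling hints, so G = 1024 * h1
a = regularizer_coeffs(hints, gamma)
total = sum(a)
bound = gamma * math.log(1.0 + math.log(hints[-1] / h1))
assert all(0.0 <= x <= gamma for x in a)
assert total <= bound
```

Even though the gradient magnitude grows by a factor of about a thousand here, the total regularization budget $\sum_t a_t$ remains a small constant multiple of $\gamma$, consistent with the doubly logarithmic bound above.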
This shows that $a_t$ is large enough to counteract the effect of $\sum_{t=1}^T (h_{t+1} - h_t) |w_t|$ (which makes the regret large if $|w_t|$ is large). It is tempting to conclude that the regularizer is somehow implicitly constraining $w_t$ to be small enough that the regret is bounded. However, it is difficult to envision exactly what constraint is being enforced; notice that to make $\sum_{t=1}^T (h_{t+1} - h_t) |w_t| = O(G^2/\gamma)$ by applying some constraint $|w_t| \le D$, we would need to set $D = O(G/\gamma)$. However, such an aggressive constraint would surely prevent us from achieving low regret for even relatively moderate $|w_\star| \ge G/\gamma$. So, our regularization seems to be doing something more subtle than simply applying a global constraint to the $w_t$'s. Indeed, notice that in the case $|g_t| \le h_1$ for all $t$, we actually have $a_t = 0$, and so no constraint effect at all is enforced!

The final step we need to check is bounding $\sum_{t=1}^T \left( \bar g_t (w_t - w_\star) + a_t \psi(w_t) - a_t \psi(w_\star) \right)$. To this end, we provide in Section 3.3 an algorithm that achieves the following bound, which is slightly weaker than (4):

$$\sum_{t=1}^T \left( \bar g_t (w_t - w_\star) + a_t \psi(w_t) - a_t \psi(w_\star) \right) \le \tilde O\Bigg( |w_\star| \sqrt{V \log\Big(e + \frac{|w_\star| \sqrt{V} \log^2 T}{h_1 \epsilon}\Big)} + |w_\star| h_T \log\Big(e + \frac{|w_\star| \sqrt{V} \log^2 T}{h_1 \epsilon}\Big) + \psi(w_\star) \sqrt{S \log\Big(e + \frac{|w_\star|^2 \sqrt{S} \log^2 T}{\gamma \epsilon^2}\Big)} + |w_\star|^2 \gamma \log\Big(e + \frac{|w_\star|^2 \sqrt{S} \log^2 T}{\gamma \epsilon^2}\Big) \Bigg),$$

where $S = \gamma^2 + \gamma \sum_{t=1}^T a_t$. This bound is weaker than (4) due to the presence of $S$ rather than $\gamma^2 + \sum_{t=1}^T a_t^2$. Nevertheless, by our bound on $\sum_{t=1}^T a_t$, we have:

$$S \le \gamma^2 + \gamma^2 \log\left( 1 + \log(G/h_1) \right),$$

so that, combining all of the above calculations, we establish Theorem 1. Thus, it remains to establish how we can achieve (4), or the slightly weaker (but sufficient) statement above. We accomplish this next in Section 3.3.

3.3 Regularized Online Learning via Epigraph Constraints

Recall that our approach to obtaining (4) is to replace the regularization terms in the loss with constraints. Formally, consider the following protocol:

Protocol 3. Epigraph-based Regularized Online Learning for $W = \mathbb{R}$.
Input: convex function $\psi : \mathbb{R} \to \mathbb{R}$.
For $t = 1, \dots, T$:
1. Nature reveals magnitude hint $h_t \ge h_{t-1}$ to the learner.
2.
Learner outputs $(x_t, y_t) \in \mathbb{R} \times \mathbb{R}$ with $y_t \ge \psi(x_t)$.
3. Nature reveals $g_t \in [-h_t, h_t]$ and $a_t \in [0, \gamma]$ to the learner.
4. Learner suffers loss $g_t x_t + a_t y_t$.

The learner is evaluated with the linear regret $\sum_{t=1}^T \left( g_t (x_t - w_\star) + a_t (y_t - \psi(w_\star)) \right)$. The goal is to obtain:

$$\sum_{t=1}^T \left( g_t (x_t - w_\star) + a_t (y_t - \psi(w_\star)) \right) \overset{\text{goal}}{\le} \tilde O\left( |w_\star| \sqrt{h_T^2 + \sum_{t=1}^T g_t^2} + \psi(w_\star) \sqrt{\gamma^2 + \sum_{t=1}^T a_t^2} \right). \quad (6)$$

The key fact about this protocol is the observation that by setting $w_t = x_t$, the bound (6) immediately implies (4). To see this, recall that $a_t \ge 0$ and $y_t \ge \psi(x_t) = \psi(w_t)$, so that:

$$\sum_{t=1}^T \left( g_t (w_t - w_\star) + a_t \psi(w_t) - a_t \psi(w_\star) \right) \le \sum_{t=1}^T \left( g_t (x_t - w_\star) + a_t y_t - a_t \psi(w_\star) \right).$$

So, to achieve (4) under Protocol 2, it suffices to achieve the bound (6) under Protocol 3.

There is one tempting approach that almost, but not quite, achieves this goal. One could employ the constraint-set reduction developed in [16], which converts an algorithm that operates on the unconstrained domain $\mathbb{R}^d \times \mathbb{R}$ into one respecting the constraint $y \ge \psi(x)$. In particular, it is relatively straightforward to build an algorithm that achieves (6) without requiring $y_t \ge \psi(x_t)$. This unconstrained setting can be handled by the classic coordinate-wise updates trick, in which we run two instances of an algorithm achieving (4) in the special case that $\psi(x) = 0$: one instance outputs $x_t$ and receives feedback $g_t$, and the other outputs $y_t$ and receives feedback $a_t$. Then, by the individual regret bounds on both coordinates, we would have:

$$\sum_{t=1}^T \left( g_t (x_t - w_\star) + a_t (y_t - \psi(w_\star)) \right) = \sum_{t=1}^T g_t (x_t - w_\star) + \sum_{t=1}^T a_t (y_t - \psi(w_\star)) \le \tilde O\left( |w_\star| \sqrt{h_T^2 + \sum_{t=1}^T g_t^2} + \psi(w_\star) \sqrt{\gamma^2 + \sum_{t=1}^T a_t^2} \right).$$

Then, one might hope that applying the constraint-set reduction of [16] would allow us to enforce the constraint $y \ge \psi(x)$ without damaging the regret bound. Unfortunately, this reduction will modify the feedback $g_t$ and $a_t$ in such a way that $\sum_{t=1}^T a_t^2$ could become much larger, which makes this approach untenable in general. Fortunately, it turns out that our particular usage will enforce some favorable conditions on $a_t$ that make the above strategy viable.
Specifically, the choices of $\bar g_t$, $h_t$, and $a_t$ described in Section 3.2 satisfy the condition that $a_t = 0$ unless $|\bar g_t| = h_t$. By careful inspection of the constraint-set reduction, it is possible to show that the above strategy achieves a slightly weaker version of (6):

$$\sum_{t=1}^T \left( g_t (x_t - w_\star) + a_t (y_t - \psi(w_\star)) \right) \le \tilde O\left( |w_\star| \sqrt{h_T^2 + \sum_{t=1}^T g_t^2} + \psi(w_\star) \sqrt{\gamma^2 + \gamma \sum_{t=1}^T a_t} \right). \quad (7)$$

As detailed in Section 3.2, this weaker bound suffices for our eventual purposes. Nevertheless, for the reader interested in a fully general solution, we provide in Appendix H a method for achieving (6) without restrictions. We do not employ it in our main development because it involves solving a convex subproblem at each iteration and so may be less efficient in some settings. This technique does, however, involve a small improvement to so-called "full-matrix" regret bounds [31], and so may be of some independent interest.

4 Lower Bounds

In this section, we show that the result of Theorem 1 is tight. In fact, we show a stronger result that generalizes our extra penalty term from $G^2/\gamma + \gamma \|w_\star\|^2$ to $\gamma \psi(w_\star) + \gamma \psi^\star(G/\gamma)$ for any symmetric convex function $\psi$, where $\psi^\star(x) = \sup_z \left( xz - \psi(z) \right)$ is the Fenchel conjugate of $\psi$. That is, we provide a Pareto frontier of different lower bounds, and Theorem 1 is but one point on this frontier. In Appendix A we extend our upper-bound results to match any desired point on this frontier (up to a logarithmic factor); we also provide matching upper bounds (up to a logarithmic factor) in Theorem 16, as well as a more simplified version.

Theorem 2. Suppose $\psi : \mathbb{R} \to \mathbb{R}$ is convex, symmetric, differentiable, non-negative, achieves its minimum at $\psi(0) = 0$, and $\psi'(x)$ is strictly increasing for non-negative $x$. Further suppose that for any $X, Y, Z > 0$, there is some $\tau$ such that for all $T \ge \tau$, $X \psi^\star(Y \sqrt{T}) \le TZ$, where $\psi^\star(z) = \sup_x \left( zx - \psi(x) \right)$ is the Fenchel conjugate of $\psi$. Let $h_1 > 0$, $\gamma > 0$, and $\epsilon > 0$ be given.
For any online learning algorithm $\mathcal{A}$ interacting with Protocol 1 with $W = \mathbb{R}$, there is a $T_0$ such that for any $T \ge T_0$, there is a sequence of gradients $g_1, \dots, g_T$ and a $w_\star$ such that the outputs $w_1, \dots, w_T$ of $\mathcal{A}$ satisfy:

$$\sum_{t=1}^T g_t (w_t - w_\star) \ge \epsilon G + \frac{\gamma}{8} \psi^\star(G/\gamma) + \frac{\gamma}{4} \psi(w_\star) + G |w_\star| \sqrt{T \log\left( 1 + \frac{G |w_\star| \sqrt{T}}{h_1 \epsilon} \right)},$$

where $G = \max(h_1, |g_1|, \dots, |g_T|)$. In particular, with $\psi(x) = |x|^{1+q}$ for any $q > 0$, we can ensure:

$$\sum_{t=1}^T g_t (w_t - w_\star) \ge \Omega\left( \epsilon G + \frac{G^{1+1/q}}{\gamma^{1/q}} + \gamma |w_\star|^{1+q} + G |w_\star| \sqrt{T \log\left( 1 + \frac{G |w_\star| \sqrt{T}}{h_1 \epsilon} \right)} \right).$$

The conditions on $\psi$ in this bound are relatively mild. The first condition says that the gradient $\psi'$ should not grow exponentially fast. The second condition says that $\psi$ should grow faster than some linear function. So, any polynomial of degree greater than 1 satisfies these conditions.

We note that this lower bound leaves something to be desired in terms of the quantification of the terms. Here, the values of $G$ and $w_\star$ depend on the algorithm $\mathcal{A}$. This is a critical factor in the proof; roughly speaking, the proof operates by providing the algorithm with a constant gradient $g_t = h_1$ at every round. Then, if the iterates $w_t$ grow "quickly" in some sense, we punish the algorithm with a very large negative gradient, which causes high regret if $w_\star = 0$. Alternatively, if the iterates do not grow quickly, then we show that the regret is large for some $|w_\star| \ge 1$. This approach is a common idiom for lower bounds in the fully unconstrained setting [18, 22]. However, a much better bound might be possible; ideally, it would hold that for any $G$ and $w_\star$ and any algorithm $\mathcal{A}$, we can find a sequence of gradients $g_t$ that enforces our desired regret. Indeed, when either $\|w_\star\|$ or $\max_t \|g_t\|$ is provided to the algorithm, the lower bounds available do take this form [3, 5]. We leave as an open question whether it is possible to do so in our setting.

5 Discussion

We have provided a new online learning algorithm that achieves a near-optimal regret bound (2).
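To see where the $G^{1+1/q}/\gamma^{1/q}$ term comes from, one can compute the Fenchel conjugate of $\psi(x) = |x|^{1+q}$ explicitly:

$$\psi^\star(z) = \sup_x \big( zx - |x|^{1+q} \big) = \frac{q}{1+q} \left( \frac{|z|}{1+q} \right)^{1/q} |z| = \Theta\big( |z|^{1+1/q} \big),$$

where the supremum is attained at $|x| = (|z|/(1+q))^{1/q}$ (set the derivative $z - (1+q)x^q$ to zero for $z \ge 0$ and use symmetry). Consequently $\gamma \psi^\star(G/\gamma) = \Theta\big( G^{1+1/q} / \gamma^{1/q} \big)$, which is exactly the shape of the penalty term in the specialization above.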
Our algorithm is "fully unconstrained", or "fully parameter-free", in the sense that we achieve a near-optimal regret bound without requiring bounds on the gradients $\|g_t\|$ or the comparison point $\|w_\star\|$. Prior work in this setting [18, 21, 22, 23, 26] achieves bounds that are technically incomparable, but may be aesthetically less desirable, as detailed in the discussion following (3). Nevertheless, ideally we would have a unified algorithmic framework capturing both the old and new bounds. It is an open question whether a more careful choice of regularization in our approach could achieve this goal.

Our algorithm takes as input parameters $\epsilon$, $h_1$, and $\gamma$. All of these have a pleasingly small impact on the regret bound. $\epsilon$ and $h_1$ can be interpreted as very rough estimates of $\|w_\star\|$ and $G$. As these quantities go to zero, the regret bound increases only logarithmically. Moreover, these estimates can be too high by a factor of $\sqrt{T}$ while still maintaining $\tilde O(\|w_\star\| G \sqrt{T})$ regret. The quantity $\gamma$ represents an estimate of $G / \|w_\star\|$. As discussed in Section 1, this value does not appear in any term that has a $T$-dependence in the regret bound, and so it also has a very mild impact on the regret.

While our bound has several intuitively desirable characteristics, it is missing one important property: our bound suffers from an issue highlighted by [18] called the "range-ratio problem". That is, the bound depends on the ratio $G / h_1$, which could be very large if the losses are rescaled by some arbitrarily large number without rescaling $h_1$. This issue is at the heart of how we are able to sidestep the lower bound of [18], which appears to apply to all algorithms that do not suffer from the range-ratio problem.

5.1 Other Forms of Unconstrained Online Learning

Our results focus on the case in which we have no prior bounds on the value of $\|w_\star\|$ or $\|g_t\|$, and our bounds eventually depend on $\max_t \|g_t\|$. One might worry that this is too conservative in some settings.
For example, it might be that $g_t$ is known to be a random variable with bounded mean $\|\mathbb{E}[g_t]\| \le G$ and variance $\mathrm{Var}(g_t) \le \sigma^2$ for some known $G$ and $\sigma$. In this case, $\max_t \|g_t\|$ might become large even though intuitively our regret should still depend only on $G + \sigma$. This is the setting considered by several prior works on online learning with unconstrained domains [9, 17, 32]. Under various assumptions, these results all achieve an in-expectation regret bound of $\mathbb{E}[\mathrm{Regret}_T(w_\star)] \le \tilde O(\|w_\star\| (G + \sigma) \sqrt{T})$.

In fact, our results come close to this ideal even without knowledge of $G$. For example, [9, 17] study the case of sub-exponential $g_t$ that satisfy $\sup_{\|a\| \le 1} \mathbb{E}[\exp(\beta \langle g_t - \mathbb{E}[g_t], a \rangle)] \le \exp(\beta^2 \sigma^2 / 2)$ for all $\beta \le 1/b$, for some $b > 0$. In this case, for 1-dimensional $g_t$, we have $\mathbb{E}[\max_t g_t^2] \le O(G^2 + \sigma^2)$, and so in expectation we achieve $\tilde O(\|w_\star\| (G + \sigma) \sqrt{T} + G^2 + \sigma^2 + \|w_\star\|^2)$ (the extension from one dimension to arbitrary dimensions can then be achieved via the black-box reduction of [16]). However, in the case that $g_t$ has some heavy-tailed distribution, such as studied by [32], it is less clear that our bounds achieve the desired result out of the box. Discovering how to achieve this is an interesting direction for future study.

5.2 Parameter-free Algorithms and Stochastic Convex Optimization

As discussed in the introduction, a common motivation for the study of online learning is its immediate application to stochastic convex optimization through various online-to-batch conversions. The classic conversion of [7], as well as a few more recent results [33, 34, 35, 36], all show that if $g_t$ is the output of a stochastic gradient oracle for a convex function $F$, then for any $w_\star \in \operatorname{argmin} F$:²

$$\mathbb{E}\left[ F\left( \frac{1}{T} \sum_{t=1}^T w_t \right) - F(w_\star) \right] \le \frac{\mathbb{E}[\mathrm{Regret}_T(w_\star)]}{T}.$$

If $\|g_t\| \le G$ with probability 1 (for an unknown $G$), our Theorem 1 immediately implies:

$$\mathbb{E}\left[ F\left( \frac{1}{T} \sum_{t=1}^T w_t \right) - F(w_\star) \right] \le \tilde O\left( \frac{\|w_\star\| G}{\sqrt{T}} + \frac{G^2/\gamma + \gamma \|w_\star\|^2}{T} \right).$$
The first term is the optimal rate for stochastic convex optimization, which can be achieved via SGD with learning rate $\eta = \frac{\|w_\star\|}{G \sqrt{T}}$ if $G$ and $\|w_\star\|$ are known ahead of time, and the second term is a lower-order penalty for not having up-front knowledge of these quantities.

Convergence results that match that of optimally tuned SGD are often called "parameter-free" (the parameter in question is the learning rate). As mentioned in the introduction, there has been a long line of work that attempts to achieve this goal by matching the regret bound (1), which can then be applied to the stochastic setting via an online-to-batch conversion. More recent work on parameter-free optimization has considered the stochastic case [37, 38] or the deterministic case [39] directly, without passing through a general regret bound. Many of these algorithms have shown significant empirical promise, even for non-convex deep learning tasks [38, 39, 40, 41, 42]. Almost all of these results require a priori knowledge of the value of $G$.³

To place our results in this context, let us focus on the case of a known $G$ value. In this case, [37] show that by eschewing regret analysis and focusing specifically on the stochastic setting, it is possible to achieve a high-probability guarantee that improves upon the logarithmic factors achieved by our result, and so there seems to be something lost by focusing on regret bounds. However, in a surprising counterpoint, [44] shows that if one is interested in an in-expectation result, then there is actually no way to improve upon the logarithmic factors achieved via online-to-batch conversion when applied to parameter-free regret bounds. Thus, our in-expectation stochastic convergence rate is optimal even up to logarithmic factors, while we also do not require prior knowledge of $G$.

Finally, let us evaluate the optimality of our bound in the stochastic setting while accounting for the fact that our methods do not get to know either $G$ or $\|w_\star\|$.
Here, we can again make use of the lower bounds developed by [44]. Consider the class of stochastic convex optimization objectives with Lipschitz constant $G$ between 1 and $L$ and $\|w^\star\| \in [1, R]$. The price of adaptivity as defined by [44] is the maximum over this class of the ratio between the convergence guarantee of an algorithm that does not know $\|w^\star\|$ and $G$ and the minimax optimal convergence guarantee for an algorithm that does know these values (which is $RG/\sqrt{T}$). We achieve a price of adaptivity of $O(1 + \max(L, R)/\sqrt{T})$. The best-known lower bound for this class is $\Omega(1 + \min(L, R)/\sqrt{T})$ [44]. Thus, there is a gap here: although we provide matching lower bounds for the online setting, it is possible that in the stochastic setting one can improve our bounds. That said, the stochastic lower bound is derived for algorithms that are given the ranges $[1, L]$ and $[1, R]$. Our algorithm does not use this information, and it is also plausible that without such knowledge the lower bound itself would improve.

Acknowledgements

AC is supported by NSF grant number CCF-2211718.

²The difference between these conversions lies in where the stochastic gradients $g_t$ are computed.
³A few exceptions achieve the prior bound (3) [18, 21, 23, 43].

[1] Francesco Orabona. A modern introduction to online learning. In: arXiv preprint arXiv:1912.13213 (2019).
[2] Elad Hazan. Introduction to online convex optimization. In: arXiv preprint arXiv:1909.05207 (2019).
[3] Jacob Abernethy, Peter L Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In: Proceedings of the Nineteenth Annual Conference on Computational Learning Theory. 2008, pp. 415–424.
[4] Brendan McMahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization. In: Advances in Neural Information Processing Systems. 2012, pp. 2402–2410.
[5] Francesco Orabona. Dimension-free exponentiated gradient.
In: Advances in Neural Information Processing Systems. 2013, pp. 1806–1814.
[6] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03). 2003, pp. 928–936.
[7] Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. In: Information Theory, IEEE Transactions on 50.9 (2004), pp. 2050–2057.
[8] Ashok Cutkosky, Harsh Mehta, and Francesco Orabona. Optimal Stochastic Non-smooth Nonconvex Optimization through Online-to-Non-convex Conversion. In: International Conference on Machine Learning (ICML). 2023.
[9] Kwang-Sung Jun and Francesco Orabona. Parameter-free online convex optimization with sub-exponential noise. In: Conference on Learning Theory. PMLR. 2019, pp. 1802–1823.
[10] Zakaria Mhammedi. Risk monotonicity in statistical learning. In: Advances in Neural Information Processing Systems 34 (2021), pp. 10732–10744.
[11] Francesco Orabona and Kwang-Sung Jun. Tight concentrations and confidence sequences from the regret of universal portfolio. In: IEEE Transactions on Information Theory (2023).
[12] Gábor Lugosi and Gergely Neu. Online-to-PAC conversions: Generalization bounds via regret analysis. In: arXiv preprint arXiv:2305.19674 (2023).
[13] Elad Hazan, Alexander Rakhlin, and Peter L Bartlett. Adaptive online gradient descent. In: Advances in Neural Information Processing Systems. 2008, pp. 65–72.
[14] J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. In: Conference on Learning Theory (COLT). 2010, pp. 257–269.
[15] H. Brendan McMahan and Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. In: Proceedings of the 23rd Annual Conference on Learning Theory (COLT). 2010, pp. 244–256.
[16] Ashok Cutkosky and Francesco Orabona.
Black-Box Reductions for Parameter-free Online Learning in Banach Spaces. In: Conference On Learning Theory. 2018, pp. 1493–1529.
[17] Dirk van der Hoeven. User-specified local differential privacy in unconstrained adaptive online learning. In: Advances in Neural Information Processing Systems. 2019, pp. 14103–14112.
[18] Zakaria Mhammedi and Wouter M Koolen. Lipschitz and Comparator-Norm Adaptivity in Online Learning. In: Conference on Learning Theory (2020), pp. 2858–2887.
[19] Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Impossible tuning made possible: A new expert algorithm and its applications. In: Conference on Learning Theory. PMLR. 2021, pp. 1216–1259.
[20] Zhiyu Zhang, Ashok Cutkosky, and Ioannis Paschalidis. PDE-based optimal strategy for unconstrained online learning. In: International Conference on Machine Learning. PMLR. 2022, pp. 26085–26115.
[21] Andrew Jacobsen and Ashok Cutkosky. Parameter-free Mirror Descent. In: Proceedings of Thirty Fifth Conference on Learning Theory. Ed. by Po-Ling Loh and Maxim Raginsky. Vol. 178. Proceedings of Machine Learning Research. PMLR, 2022, pp. 4160–4211.
[22] Ashok Cutkosky and Kwabena Boahen. Online Learning Without Prior Information. In: Conference on Learning Theory. 2017, pp. 643–677.
[23] Ashok Cutkosky. Artificial Constraints and Hints for Unbounded Online Learning. In: Proceedings of the Thirty-Second Conference on Learning Theory. 2019, pp. 874–894.
[24] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In: Advances in Neural Information Processing Systems. 2010, pp. 2199–2207.
[25] Ashok Cutkosky. Combining Online Learning Guarantees. In: Proceedings of the Thirty Second Conference on Learning Theory. 2019, pp. 895–913.
[26] Zhiyu Zhang, Heng Yang, Ashok Cutkosky, and Ioannis Ch Paschalidis. Improving Adaptive Online Learning Using Refined Discretization. In: arXiv preprint arXiv:2309.16044 (2023).
[27] Andrew Jacobsen and Ashok Cutkosky.
Unconstrained online learning with unbounded losses. In: International Conference on Machine Learning. PMLR. 2023, pp. 14590–14630.
[28] Jack J Mayo, Hédi Hadiji, and Tim van Erven. Scale-free unconstrained online learning for curved losses. In: Conference on Learning Theory. PMLR. 2022, pp. 4464–4497.
[29] Gergely Neu and Nneka Okolo. Dealing With Unbounded Gradients in Stochastic Saddle-point Optimization. In: Proceedings of the 41st International Conference on Machine Learning. PMLR. 2024, pp. 37508–37530.
[30] Zakaria Mhammedi, Wouter M Koolen, and Tim Van Erven. Lipschitz adaptivity with multiple learning rates in online learning. In: Conference on Learning Theory. PMLR. 2019, pp. 2490–2511.
[31] Ashok Cutkosky. Better full-matrix regret via parameter-free online learning. In: Advances in Neural Information Processing Systems 33 (2020), pp. 8836–8846.
[32] Jiujia Zhang and Ashok Cutkosky. Parameter-free regret in high probability with heavy tails. In: Advances in Neural Information Processing Systems 35 (2022), pp. 8000–8012.
[33] Ashok Cutkosky. Anytime Online-to-Batch, Optimism and Acceleration. In: International Conference on Machine Learning. 2019, pp. 1446–1454.
[34] Ali Kavis, Kfir Y Levy, Francis Bach, and Volkan Cevher. UniXGrad: A Universal, Adaptive Algorithm with Optimal Guarantees for Constrained Optimization. In: NeurIPS. 2019, pp. 6257–6266.
[35] Aaron Defazio, Ashok Cutkosky, Harsh Mehta, and Konstantin Mishchenko. When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement. In: arXiv preprint arXiv:2310.07831 (2023).
[36] Aaron Defazio, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky, et al. The Road Less Scheduled. In: arXiv preprint arXiv:2405.15682 (2024).
[37] Yair Carmon and Oliver Hinder. Making SGD Parameter-Free. In: Conference on Learning Theory (2022).
[38] Maor Ivgi, Oliver Hinder, and Yair Carmon. DoG is SGD's best friend: A parameter-free dynamic step size schedule.
In: International Conference on Machine Learning. PMLR. 2023, pp. 14465–14499.
[39] Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by D-adaptation. In: International Conference on Machine Learning. PMLR. 2023, pp. 7449–7479.
[40] Francesco Orabona and Tatiana Tommasi. Training Deep Networks without Learning Rates Through Coin Betting. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. 2017, pp. 2157–2167.
[41] Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. In: arXiv preprint arXiv:2306.06101 (2023).
[42] Ashok Cutkosky, Aaron Defazio, and Harsh Mehta. Mechanic: A Learning Rate Tuner. In: Advances in Neural Information Processing Systems 36 (2023).
[43] Zhiyu Zhang, Heng Yang, Ashok Cutkosky, and Ioannis C Paschalidis. Improving adaptive online learning using refined discretization. In: International Conference on Algorithmic Learning Theory. PMLR. 2024, pp. 1208–1233.
[44] Yair Carmon and Oliver Hinder. The Price of Adaptivity in Stochastic Convex Optimization. In: arXiv preprint arXiv:2402.10898 (2024).
[45] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer Science & Business Media, 2004.
[46] Ashok Cutkosky and Tamas Sarlos. Matrix-Free Preconditioning in Online Learning. In: International Conference on Machine Learning. 2019, pp. 1455–1464.
[47] Zakaria Mhammedi and Alexander Rakhlin. Damped online Newton step for portfolio selection. In: Conference on Learning Theory. PMLR. 2022, pp. 5561–5595.
[48] Zakaria Mhammedi and Khashayar Gatmiry. Quasi-Newton steps for efficient online exp-concave optimization. In: The Thirty Sixth Annual Conference on Learning Theory. PMLR. 2023, pp. 4473–4503.
[49] Khashayar Gatmiry and Zak Mhammedi. Projection-Free Online Convex Optimization via Efficient Newton Iterations.
In: Advances in Neural Information Processing Systems 36 (2024).
[50] Francesco Orabona and Dávid Pál. Coin Betting and Parameter-Free Online Learning. In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett. Curran Associates, Inc., 2016, pp. 577–585.

A Generalized Upper Bounds

In Theorem 1, we provide a bound that achieves the ideal bound of (1) with an extra penalty term of roughly $G^2/\gamma + \gamma\|w^\star\|^2$. It turns out that this penalty is but one point on a frontier of potential choices that are all immediately accessible by simply changing $\psi(w)$ from $\|w\|^2$ to any other symmetric convex function, as outlined in the lower bounds of Section 4. Here we state formally that each of these bounds is achievable. In particular, by setting $\psi(w) = \|w\|^{1+q}$ for any $q > 0$, we have:

Theorem 3. There is an online learning algorithm that takes as input positive scalar values $q$, $\gamma$, $h_1$, and $\epsilon$ and ensures that for any sequence $g_1, g_2, \dots \in \mathbb{R}^d$, the outputs $w_1, w_2, \dots \in \mathbb{R}^d$ satisfy for all $w^\star$ and $T$:
$$\sum_{t=1}^T \langle g_t, w_t - w^\star\rangle \le O\Bigg[\epsilon G + \|w^\star\|\sqrt{V\log\Big(e + \frac{\|w^\star\|\sqrt{V}\log^2(T)}{h_1\epsilon}\Big)} + \|w^\star\| G\log\Big(e + \frac{\|w^\star\|\sqrt{V}\log^2(T)}{h_1\epsilon}\Big) + \epsilon^{1+q}\gamma + \gamma\|w^\star\|^{1+q}\log\Big(e + \frac{\|w^\star\|^{1+q}}{\epsilon^{1+q}}\log\Big(e + \frac{G}{h_1}\Big)\Big) + \frac{G^{1+1/q}}{\gamma^{1/q}}\log\Big(1 + \log\Big(\frac{G}{h_1}\Big)\Big)\Bigg],$$
where $G = \max(h_1, \max_{t\in[T]}\|g_t\|)$ and $V = G^2 + \sum_{t=1}^T \|g_t\|^2$.

Finally, it is also the case that the logarithmic terms in our bounds can be adjusted to remove the $T$ dependencies, at the cost of increasing the regret in the case $w^\star = 0$. This is achieved simply by adjusting the logarithmic factors achieved by the algorithm for regularized online learning (Protocol 2) in a manner similar to other recent works in unconstrained online optimization [20, 21, 26]. Formally, we can achieve:

Theorem 4.
There is an online learning algorithm that takes as input positive scalar values $q$, $\gamma$, $h_1$, and $\epsilon$ and ensures that for any sequence $g_1, g_2, \dots \in \mathbb{R}^d$, the outputs $w_1, w_2, \dots \in \mathbb{R}^d$ satisfy for all $w^\star$ and $T$:
$$\sum_{t=1}^T \langle g_t, w_t - w^\star\rangle \le O\Bigg[\epsilon\sqrt{V\log\Big(1 + \log\frac{G}{h_1}\Big)} + \|w^\star\|\sqrt{V\log\Big(e + \frac{\|w^\star\|}{\epsilon}\Big)} + \|w^\star\| G\log\Big(e + \frac{\|w^\star\|}{\epsilon}\Big) + \epsilon^{1+q}\gamma\log\Big(1 + \log\frac{G}{h_1}\Big) + \gamma\|w^\star\|^{1+q}\log\Big(e + \frac{\|w^\star\|^{1+q}}{\epsilon^{1+q}}\log\Big(e + \frac{G}{h_1}\Big)\Big) + \frac{G^{1+1/q}}{\gamma^{1/q}}\log\Big(1 + \log\frac{G}{h_1}\Big)\Bigg],$$
where $G = \max(h_1, \max_t\|g_t\|)$ and $V = G^2 + \sum_{t=1}^T\|g_t\|^2$. Both of these results are immediate corollaries of Theorem 16.

B Proof of Lower Bound

We restate and prove Theorem 2.

Theorem 2. Suppose $\psi: \mathbb{R} \to \mathbb{R}$ is convex, symmetric, differentiable, non-negative, achieves its minimum at $\psi(0) = 0$, and $\psi(x)$ is strictly increasing for non-negative $x$. Further suppose that for any $X, Y, Z > 0$, there is some $\tau$ such that for all $T \ge \tau$, $X\psi^\star(Y\sqrt{T}/Z) \ge \sqrt{T}Z$, where $\psi^\star(z) = \sup_x\, zx - \psi(x)$ is the Fenchel conjugate of $\psi$. Let $h_1 > 0$, $\gamma > 0$ and $\epsilon > 0$ be given. For any online learning algorithm $\mathcal{A}$ interacting with Protocol 1 with $W = \mathbb{R}$, there is a $T_0$ such that for any $T \ge T_0$, there is a sequence of gradients $g_1, \dots, g_T$ and a $w^\star$ such that the outputs $w_1, \dots, w_T$ of $\mathcal{A}$ satisfy:
$$\sum_{t=1}^T g_t(w_t - w^\star) \ge \epsilon G + \frac{\gamma}{8}\psi^\star(G/\gamma) + \frac{\gamma}{4}\psi(w^\star) + G|w^\star|\sqrt{T\log\Big(1 + \frac{G|w^\star|\sqrt{T}}{h_1\epsilon}\Big)},$$
where $G = \max(h_1, |g_1|, \dots, |g_T|)$. In particular, with $\psi(x) = |x|^{1+q}$ for any $q > 0$, we can ensure:
$$\sum_{t=1}^T g_t(w_t - w^\star) \ge \Omega\Bigg(\epsilon G + \frac{G^{1+1/q}}{\gamma^{1/q}} + \gamma|w^\star|^{1+q} + G|w^\star|\sqrt{T\log\Big(1 + \frac{G|w^\star|\sqrt{T}}{h_1\epsilon}\Big)}\Bigg).$$

Proof. First, let us define $\psi_\gamma(x) = \gamma\psi(x)$. Let $\psi'_\gamma(x)$ and $\psi^{\star\prime}_\gamma(\theta)$ indicate the derivatives of $\psi_\gamma$ and $\psi^\star_\gamma$. The following properties are standard facts about the Fenchel conjugate (see e.g. [45]): $\psi^\star_\gamma(\theta) = \gamma\psi^\star(\theta/\gamma)$, $\psi^\star_\gamma(0) = 0$, $\psi^\star_\gamma(x) \ge 0$ for all $x$, $\psi_\gamma(x) = \psi'_\gamma(x)\,x - \psi^\star_\gamma(\psi'_\gamma(x))$, and $\psi^\star_\gamma(\theta) = \psi^{\star\prime}_\gamma(\theta)\,\theta - \psi_\gamma(\psi^{\star\prime}_\gamma(\theta))$. Moreover, $\psi'_\gamma$ and $\psi^{\star\prime}_\gamma$ are inverses of each other and are odd functions, and $\psi^{\star\prime}_\gamma(x)$ is strictly increasing for non-negative $x$. Next, observe that for any $X, Y, X', Y', Z'$, there is a $\tau$ such that for all $T \ge \tau$:
$$X\psi^\star(Y\sqrt{T}) \ge \sqrt{T}, \quad (8) \qquad X'\psi^\star(Y'\sqrt{T}/Z') \ge \sqrt{T}Z'.$$
(9) To see this, observe that by assumption, there is a τ1 such that (8) holds for all T τ1, and also a τ2 such that (9) holds for all T τ2, so we may take τ = max(τ1,τ2) to achieve both simultaneously. From this, we see that there is some T0 such that for all T T0: T ϵ ψ (2Th1/γ) = 2 T ϵ ψ γ(2Th1) (10) γ ) = 1 32ϵψ γ(2Th1) (11) We now construct the algorithm-dependent sequence g1,g2,... that satisfies the claim of the theorem. 1. Define g0 0 and set t 1. 2. Algorithm A outputs wt. 3. If wt < 2ϵ ψ γ(2h1(t 1)), set gt 2(t 1)h1 and for all k 1 set gt+k 0. 4. Else gt h1. 5. Set t t + 1 and go to Item 2. Suppose that the condition in Item 3 has not been triggered for the first τ iterations. Then, we have: τ t=1 wt gt 2ϵτh1 1 τ t=1 ψ γ(2(t 1)h1) 2h1, τ t=1 (ψ γ(2th1) ψ γ(2(t 1)h1)), (by convexity of ψ γ) 2ψ γ(2τh1). (12) Now, suppose that the condition in Item 3 is triggered at some iteration τ + 1 1. Then G = 2τh1 and with w = 0, we have: T t=1 gt (wt w ) = τ+1 t=1 gt wt, = gτ+1 wτ+1 + τ t=1 wt gt, 4ϵτh1 + ψ γ(2τh1) 2τh1 2ϵτh1 1 2ψ γ(2τh1), (by (12)) = 2ϵτh1 + ψ γ(2τh1) 2τh1 ψγ( ψ γ(2τh1)) + ψγ( ψ γ(2τh1)) 1 2ψ γ(2τh1), = 2ϵτh1 + ψ γ(2τh1) + ψγ( ψ γ(2τh1)) 1 2ψ γ(2τh1), 2ψ γ(2τh1), (13) Therefore, overall we have for w = 0: T t=1 gt (wt w ) ϵG + γ 2 ψ (G/γ) + γψ( w ) + w G Á Á ÀT log (1 + w G Alternatively, suppose the condition in Item 3 is never triggered. In this case, let us set w = 2 ψ γ(2Th1) . Then, G = h1 and by (12) we have: T t=1 gt (wt w ) ψ γ(2Th1) 2Th1 1 2ψ γ(2Th1) 2ϵTh1. Using that ψ γ(x) x ψ γ(x) by convexity of ψ γ and ϕ γ(0) = 0, the right-hand side of the previous display can be bounded below by (): ψ γ(2Th1) 2Th1 1 2ψ γ(2Th1) 2ϵTh1 1 2ψ γ(2Th1), 4ψ γ(G) + 1 4ψ γ(2Th1) 2ϵTh1. Applying Eq. (11): 4ψ γ(G) + 8ϵTh1 2ϵTh1, 4 ψ (G/γ) + 6ϵTh1. 
(14) Further, using the identity $\psi^\star_\gamma(\theta) = \psi^{\star\prime}_\gamma(\theta)\,\theta - \psi_\gamma(\psi^{\star\prime}_\gamma(\theta))$, we have:
$$\psi^{\star\prime}_\gamma(2Th_1)\cdot 2Th_1 - \tfrac{1}{2}\psi^\star_\gamma(2Th_1) = \psi^{\star\prime}_\gamma(2Th_1)\,Th_1 + \tfrac{1}{2}\big(\psi^{\star\prime}_\gamma(2Th_1)\cdot 2Th_1 - \psi^\star_\gamma(2Th_1)\big) = \psi^{\star\prime}_\gamma(2Th_1)\,Th_1 + \tfrac{1}{2}\psi_\gamma(\psi^{\star\prime}_\gamma(2Th_1)) = \tfrac{1}{2}|w^\star|\,Th_1 + \tfrac{\gamma}{2}\psi(w^\star/2).$$
Finally, let us bound $\tfrac{1}{2}|w^\star| h_1 T$ using our choice of $w^\star$. By definition, $T = \log(1 + (\exp(T) - 1))$, so applying Eq. (10):
$$\tfrac{1}{2}|w^\star| h_1 T = \tfrac{1}{2}|w^\star| h_1\sqrt{T\log(1 + (\exp(T) - 1))} \ge \tfrac{1}{2}|w^\star| h_1\sqrt{T\log\Big(1 + \frac{2\sqrt{T}}{\epsilon}\psi^{\star\prime}_\gamma(2Th_1)\Big)} = \tfrac{1}{2}|w^\star| h_1\sqrt{T\log\Big(1 + \frac{|w^\star|\sqrt{T}}{\epsilon}\Big)} = \tfrac{1}{2}G|w^\star|\sqrt{T\log\Big(1 + \frac{G|w^\star|\sqrt{T}}{h_1\epsilon}\Big)}, \quad (15)$$
using $w^\star = 2\psi^{\star\prime}_\gamma(2Th_1)$ and $G = h_1$ in this case. Therefore, combining Eq. (14) with Eq. (15), we conclude that in this case as well:
$$\sum_{t=1}^T g_t(w_t - w^\star) \ge \epsilon G + \frac{\gamma}{8}\psi^\star(G/\gamma) + \frac{\gamma}{4}\psi(w^\star) + G|w^\star|\sqrt{T\log\Big(1 + \frac{G|w^\star|\sqrt{T}}{h_1\epsilon}\Big)},$$
which completes the proof.

C Reduction to W = R

As a first step in our algorithm design, we observe that the application of some known reductions from [16] can significantly simplify our task. [16] show that, to build an algorithm whose regret bound depends on the $g_t$ only through the norms $\|g_t\|$, it suffices to consider exclusively the case $W = \mathbb{R}$. We provide the formal reduction in Algorithm 1, which ensures the following regret bound.

Theorem 5 ([16]). Algorithm 1 ensures that $|g^{\text{magnitude}}_t| \le 2\|g_t\|$, and also for all $w^\star \in W$:
$$\sum_{t=1}^T\langle g_t, w_t - w^\star\rangle \le 4\|w^\star\|\sqrt{2\sum_{t=1}^T\|g_t\|^2} + \sum_{t=1}^T g^{\text{magnitude}}_t\big(w^{\text{magnitude}}_t - \|w^\star\|\big).$$

Algorithm 1: Reduction From General W to R
Input: Convex domain $W \subseteq \mathbb{R}^d$, online learning algorithm A1D with domain $\mathbb{R}$.
Initialize $w^{\text{direction}}_1 = 0 \in \mathbb{R}^d$
for t = 1...T do
  Receive $w^{\text{magnitude}}_t \in \mathbb{R}$ from A1D.
  Set $\hat w_t = w^{\text{magnitude}}_t\, w^{\text{direction}}_t \in \mathbb{R}^d$.
  Set $w_t = \Pi_W(\hat w_t) = \operatorname{argmin}_{w\in W}\|w - \hat w_t\|$.
  Output $w_t$, receive feedback $g_t$.
  Set $g^{\text{unconstrained}}_t = g_t + \|g_t\|\frac{w_t - \hat w_t}{\|w_t - \hat w_t\|}$.
  Set $w^{\text{direction}}_{t+1} = \Pi_{\|w\|\le 1}\Big[w^{\text{direction}}_t - \frac{g^{\text{unconstrained}}_t}{\sqrt{2\sum_{i=1}^t\|g^{\text{unconstrained}}_i\|^2}}\Big]$.
  Set $g^{\text{magnitude}}_t = \langle g^{\text{unconstrained}}_t, w^{\text{direction}}_t\rangle \in \mathbb{R}$.
  Send $g^{\text{magnitude}}_t$ to A1D as the t-th feedback.
end for

From Theorem 5, it is clear that to achieve low regret on $W$, we need only bound $\sum_{t=1}^T g^{\text{magnitude}}_t(w^{\text{magnitude}}_t - \|w^\star\|)$, which is exactly the regret of a 1-dimensional learner.
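To make the magnitude/direction decomposition concrete, here is a minimal Python sketch in the spirit of this reduction. It is a simplified illustration under assumptions of our own: it omits the constraint-projection and gradient-correction step of Algorithm 1 (so it handles only $W = \mathbb{R}^d$), uses a plain projected-gradient direction learner, and plugs in a toy 1-D magnitude learner; the class names `UnitBallOGD` and `GD1D` are ours, not the paper's.

```python
import numpy as np

class UnitBallOGD:
    """Direction learner: online gradient descent projected to the unit ball
    (a hypothetical stand-in for the direction learner used by the reduction)."""
    def __init__(self, dim):
        self.u = np.zeros(dim)
        self.sq_sum = 0.0

    def predict(self):
        return self.u

    def update(self, g):
        self.sq_sum += float(g @ g)
        step = self.u - g / (2.0 * np.sqrt(self.sq_sum) + 1e-12)
        # Project back onto the unit ball.
        self.u = step / max(1.0, float(np.linalg.norm(step)))

class GD1D:
    """Toy 1-D magnitude learner (plain gradient descent). Any 1-D
    parameter-free learner could be substituted here."""
    def __init__(self, lr=0.1):
        self.w = 0.0
        self.lr = lr

    def predict(self):
        return self.w

    def update(self, g):
        self.w -= self.lr * g

def unconstrained_step(direction, magnitude, g):
    """One round of the reduction: play w = magnitude * direction, then feed
    the scalar <g, u> to the 1-D learner and g itself to the direction learner."""
    u = direction.predict()
    m = magnitude.predict()
    w = m * u                       # the played point for this round
    magnitude.update(float(g @ u))  # scalar feedback for the 1-D learner
    direction.update(g)             # vector feedback for the direction learner
    return w
```

Replacing `GD1D` with a parameter-free 1-D learner (such as the BASE algorithm of Section E) recovers the spirit of the full construction: the combined regret splits into a direction regret and the regret of the 1-D magnitude learner.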
So, our final results will be established by considering the case of $W = \mathbb{R}$, although we will define many intermediate problems for general $W$, as they may have other applications for which the general setting is of interest.

D An Efficient Algorithm for Protocol 2 With Restricted (But Sufficient) Assumptions

In this section, we describe our algorithm for Protocol 2 in the special case that $W = \mathbb{R}$ and $a_t = 0$ whenever $|\bar g_t| < h_t$, where $\bar g_t = \big(1 \wedge \frac{h_t}{|g_t|}\big) g_t$. Our algorithm is in fact a reduction to the special case that $a_t = 0$ for all $t$. This is an important special case that has also been previously considered in the literature (see e.g. the discussion in Section 3.1), so we provide it as a separate protocol below:

Protocol 4. Online Learning with Magnitude Hints.
Input: Convex domain $W$ (recall that we focus on $W = \mathbb{R}$).
For t = 1, ..., T:
1. Nature reveals magnitude hint $h_t \ge h_{t-1}$ to the learner.
2. Learner outputs $w_t \in W$.
3. Nature reveals loss scalar $g_t$ with $|g_t| \le h_t$ to the learner.
4. Learner suffers loss $g_t w_t$.
The learner is evaluated with the regret $\sum_{t=1}^T g_t(w_t - w^\star)$. The goal is to obtain:
$$\sum_{t=1}^T g_t(w_t - w^\star) \overset{\text{goal}}{\lesssim} |w^\star|\sqrt{h_T^2 + \sum_{t=1}^T g_t^2}. \quad (16)$$

In Section E, we provide an explicit algorithm (Algorithm 3) for Protocol 4 that suffices for our purposes and achieves the bound (16). In the rest of this section, we take the existence of such an algorithm as given and use it to build our method for Protocol 2. Our algorithm for Protocol 2 is given in Algorithm 2. The full regret bound is provided by Theorem 10. However, before providing the general bound, which is somewhat technical, we first give two more interpretable corollaries in order to provide a preview of what the method is capable of.

Corollary 6.
For any $\epsilon$ with $\psi(\epsilon) > 0$, there exists an algorithm for Protocol 2 such that for all $T$, the outputs $x_1, \dots, x_T$ satisfy:
$$\sum_{t=1}^T g_t(x_t - x^\star) + a_t(\psi(x_t) - \psi(x^\star)) \le O\Bigg[\epsilon h_T + \psi(\epsilon)\gamma + |x^\star|\sqrt{V_g\log\Big(e + \frac{|x^\star|\sqrt{V_g}\log^2(T)}{h_1\epsilon}\Big)} + |x^\star| h_T\log\Big(e + \frac{|x^\star|\sqrt{V_g}\log^2(T)}{h_1\epsilon}\Big) + \psi(x^\star)\sqrt{S_a\log\Big(e + \frac{\psi(x^\star)\sqrt{S_a}\log^2(T)}{\gamma\psi(\epsilon)}\Big)} + \psi(x^\star)\gamma\log\Big(e + \frac{\psi(x^\star)\sqrt{S_a}\log^2(T)}{\gamma\psi(\epsilon)}\Big)\Bigg],$$
where $V_g = h_T^2 + \sum_{t=1}^T g_t^2$ and $S_a = \gamma^2 + \gamma\sum_{t=1}^T a_t$.

Proof. Apply Algorithm 2 with the BASE set to Algorithm 3 using $p = 1/2$. Then, in the notation of Theorem 10, the regret bound of Theorem 11 shows that $A, B, C$ are all $O(1)$ while $D$ is $O(\log^2(T))$ and $p = 1/2$. Set $\epsilon_x = \epsilon$ and $\epsilon_\psi = \psi(\epsilon)$. The result immediately follows.

Corollary 7. For any $\epsilon$ with $\psi(\epsilon) > 0$, there exists an algorithm for Protocol 2 such that for all $T$, the outputs $x_1, \dots, x_T$ satisfy:
$$\sum_{t=1}^T g_t(x_t - x^\star) + a_t(\psi(x_t) - \psi(x^\star)) \le O\Bigg[\epsilon\sqrt{V_g} + |x^\star|\sqrt{V_g\log\Big(e + \frac{|x^\star|}{\epsilon}\Big)} + |x^\star| h_T\log\Big(e + \frac{|x^\star|}{\epsilon}\Big) + \psi(\epsilon)\sqrt{S_a} + \psi(x^\star)\sqrt{S_a\log\Big(e + \frac{\psi(x^\star)}{\psi(\epsilon)}\Big)} + \psi(x^\star)\gamma\log\Big(e + \frac{\psi(x^\star)}{\psi(\epsilon)}\Big)\Bigg],$$
where $V_g = h_T^2 + \sum_{t=1}^T g_t^2$ and $S_a = \gamma^2 + \gamma\sum_{t=1}^T a_t$.

Proof. Apply Algorithm 2 with the BASE set to Algorithm 3 using $p = 0$. Then, in the notation of Theorem 10, the regret bound of Theorem 11 shows that $A, B, C$ and $D$ are all $O(1)$ while $p = 0$. Set $\epsilon_x = \epsilon$ and $\epsilon_\psi = \psi(\epsilon)$. The result immediately follows.

Lemma 8. Suppose $\psi: \mathbb{R} \to \mathbb{R}$ is a convex function that achieves its minimum at 0. Let $h > 0$ and $\gamma > 0$ be given and define the norm $\|(x,y)\| = \sqrt{h^2x^2 + \gamma^2y^2}$ and the distance function $S(\hat x, \hat y) = \inf_{y \ge \psi(x)}\|(x,y) - (\hat x, \hat y)\|$. For any $(\hat x, \hat y)$, let $(\delta^x, \delta^y)$ be an arbitrary subgradient of $S$ at $(\hat x, \hat y)$. Then, $\delta^y \le 0$.

Proof. Throughout this proof, we will assume $\hat x > 0$; the proof is completely symmetric in the sign of $\hat x$. First, we dispense with the case in which there is no projection: suppose $\hat y \ge \psi(\hat x)$. Then we must have $\bar y = \hat y$ and $\bar x = \hat x$ and $S(\hat x, \hat y) = 0$. Further, for any $\tilde y > \hat y$, $S(\hat x, \tilde y) = 0$. However, if $\delta^y > 0$, then by the definition of subgradient we must have $0 = S(\hat x, \tilde y) \ge S(\hat x, \hat y) + \delta^y(\tilde y - \hat y) > 0$, which cannot be. Therefore $\delta^y \le 0$. So, it remains to consider the case $\hat y < \psi(\hat x)$. Define $(\bar x, \bar y) = \operatorname{argmin}_{y \ge \psi(x)}\|(x,y) - (\hat x, \hat y)\|$.
Further, by [16, Theorem 4], we have:
$$(\delta^x, \delta^y) = \Bigg(\frac{h^2(\hat x - \bar x)}{\sqrt{h^2(\bar x - \hat x)^2 + \gamma^2(\hat y - \bar y)^2}},\ \frac{\gamma^2(\hat y - \bar y)}{\sqrt{h^2(\bar x - \hat x)^2 + \gamma^2(\hat y - \bar y)^2}}\Bigg).$$
Therefore, it suffices to show that $\hat y \le \bar y$. To start, consider the case $\psi(0) > \hat y$. Then, we have $\hat y < \psi(0) \le \psi(\bar x) \le \bar y$ as desired. So, in the following we consider the remaining case $\psi(0) \le \hat y < \psi(\hat x)$.

Algorithm 2: Algorithm for Protocol 2 (REG)
Input: Online learning algorithm BASE for Protocol 4 with domain $\mathbb{R}$, taking initialization parameter $\epsilon_{\text{BASE}}$; non-negative convex function $\psi$; parameters $\gamma > 0$, $\epsilon_x > 0$ and $\epsilon_\psi > 0$
Initialize two copies of BASE: BASE$_x$ with $\epsilon_{\text{BASE}} = \epsilon_x$ and BASE$_y$ with $\epsilon_{\text{BASE}} = \epsilon_\psi$
for t = 1...T do
  Receive $h_t \ge h_{t-1} \in \mathbb{R}$
  Send $3h_t$ to BASE$_x$ as the t-th magnitude hint.
  Send $3\gamma$ to BASE$_y$ as the t-th magnitude hint.
  Get $\hat x_t \in \mathbb{R}$ from BASE$_x$ and $\hat y_t \in \mathbb{R}$ from BASE$_y$.
  Define the norm $\|(x,y)\|_t^2 = h_t^2 x^2 + \gamma^2 y^2$, with dual norm $\|(g,a)\|_{\star,t}^2 = \frac{g^2}{h_t^2} + \frac{a^2}{\gamma^2}$.
  Define $S_t(\hat x, \hat y) = \inf_{y \ge \psi(x)}\|(x,y) - (\hat x, \hat y)\|_t$.
  Compute $(\bar x_t, \bar y_t) = \operatorname{argmin}_{y \ge \psi(x)}\|(x,y) - (\hat x_t, \hat y_t)\|_t$.
  Receive feedback $g_t \in [-h_t, h_t]$, $a_t \in [0, \gamma]$, such that $a_t = 0$ unless $|g_t| = h_t$.
  Compute $(\delta^x_t, \delta^y_t) = \|(g_t, a_t)\|_{\star,t}\,\nabla S_t(\hat x_t, \hat y_t)$.
  Send $g_t + \delta^x_t$ to BASE$_x$ as the t-th feedback.
  Send $a_t + \delta^y_t$ to BASE$_y$ as the t-th feedback.
end for

Observe that since $\psi$ is convex, it must be continuous. Therefore, by the intermediate value theorem there must be some $\tilde x \ge 0$ with $\psi(\tilde x) = \hat y$. Further, we have $\psi(\tilde x) = \hat y < \psi(\hat x)$, so that $\tilde x < \hat x$. Now, by convexity, if $\bar x \ge \tilde x$, we must have $\psi(\bar x) \ge \psi(\tilde x)$, because $\psi$ must be non-decreasing for positive arguments since it achieves its minimum at 0. Therefore, $\bar y \ge \psi(\bar x) \ge \psi(\tilde x) = \hat y$ and so we are done. So, let us suppose $\bar x < \tilde x$. Further, suppose that $\hat y > \bar y$. Then, observe that:
$$\sqrt{h^2(\tilde x - \hat x)^2 + \gamma^2(\max(\bar y, \psi(\tilde x)) - \hat y)^2} < \sqrt{h^2(\bar x - \hat x)^2 + \gamma^2(\bar y - \hat y)^2},$$
so that the point $(\tilde x, \max(\bar y, \psi(\tilde x)))$ would contradict the optimality of $(\bar x, \bar y)$. Thus, it also cannot be that $\hat y > \bar y$, and so we are done.

Lemma 9. Let $h > 0$ and $\gamma > 0$ be given and define the norm $\|(x,y)\| = \sqrt{h^2x^2 + \gamma^2y^2}$ with corresponding dual norm $\|\cdot\|_\star$. Let $(g, a)$ be any point satisfying $|g| \le h$, $a \in [0, \gamma]$, and $a = 0$ unless $|g| = h$.
Let (δx,δy) be any points satisfying (δx,δy) = (g,a) . Then, Proof. The dual norm is (g,a) = g2 γ2 . So, we have: This immediately implies δy γ 2. We also have (δx)2 g2 + h2a2 Now, since a = 0 unless g2 = h2, this yields either δx g if g < h or δx g2 + h2a2/γ2 = h 2 if g = h, so either way δx g Theorem 10. Let A,B,CD,ϵ > 0, and p 1, be given. Suppose that for any sequence z1,...,z T and magnitude hints m1 m T satisfying zt mt, BASE outputs w1,...,w T and guarantees regret: T t=1 zt(wt u) ϵBASECm2p T Z1/2 p + A u Á Á ÀZ log (e + D u Zp m2p 1 ϵBASE ) + B u h T log (e + D u Zp m2p 1 ϵBASE ) for any u R, where Z = m2 T + T t=1 z2 t . Let ϵx, ϵψ and γ be given non-negative inputs to Algorithm 2. Then, for any T, with Vg = h2 T + T t=1 g2 t and Sa = γ2 + γ T t=1 at, Algorithm 2 s output sequence ˆx1,..., ˆx T guarantees: T t=1 gt(ˆxt x ) + at(ψ(ˆxt) ψ(x )) Cxϵxh2p T V 1/2 p g + Ax x Á Á ÀVg log (e + Dx x V p g ϵxh2p 1 ) + Bx x log (e + Dx x V p g ϵxh2p 1 ) + Cψϵψγ2p S1/2 p a + Aψψ(x ) Á Á ÀSa log (e + Dψψ(x )Sp a ϵψγ2p ) + Bψψ(x )log (e + Dψψ(x )Sp a ϵψγ2p ), for any x R, where the constants in the above expression are given by: Ax = 3A, Bx = 3B, Cx = 3C, Dx = D, Bψ = 144A2 + 24B, Cψ = 3C (144A2 + 24B)log (e + 12CD(1152p A2 + 48pb)p) + 1 2 + (2p + 1)(2 4p) Dψ = 4D [1152A2p + 48p B] p . Proof. First, observe that since (xt,yt) is the result of a projection to the domain y ψ(x), it must hold that yt ψ(xt) for all t. Thus, since at > 0 and ψ is non-negative, we have for any x : gt(xt x ) + at(ψ(x) ψ(x )) gt(xt x ) + at(yt x ). Therefore, it suffices to bound T t=1 gt(xt x ) + at(yt x ), which we will now accomplish. By [16, Theorem 3], we have for any x R T t=1 gt(xt x ) + at(yt ψ(x )) T t=1 (gt + δx t )(ˆxt x ) + T t=1 (at + δy t )(ˆyt ψ(x )), and also (δx t ,δy t ) t, = (gt,at) t, by [16, Proposition 1]. Therefore, by Lemma 9, we have gt + δx t 3 gt 3ht. 
Defining Vg = h2 T + T t=1(gt + δx t )2, and by the guarantee of BASE, we have for any x : T t=1 (gt + δx t )(ˆxt x ) C(3h T )2p [9h2 T + T t=1 (gt + δx t )2] Á Á Á À[9h2 T + T t=1 (gt + δx t )2]log e + D x [9h2 T + T t=1(gt + δx t )2] p + 3Bh T x log e + D x [9h2 T + T t=1(gt + δx t )2] p 3Ch2p T V 1/2 p g ϵx + 3A x Á Á ÀVg log (e + D x V p g h2p 1 ϵx ) + 3Bh T x log (e + D x V p g h2p 1 ϵx ) Next, observe that by Lemma 9, at + δy t 3γ. We also have for any x R: T t=1 (at + δy t )(ˆyt ψ(x )) = 1 T t=1 (at + δy t )(yt 2ψ(x )) + 1 T t=1 (at + δy t )yt T t=1 (at + δy t )(yt 2ψ(x )) + 1 T t=1 (at + δy t )(yt 2y ) + y T t=1 (at + δy t ) Now, define Va = γ2 + T t=1(at + δy t )2. By the guarantee of BASE applied twice, we have that for any x R (ψ(x ) 0 below represents the comparator for the regret of BASE): T t=1 (at + δy t )(ˆyt ψ(x )) C(3γ)2p [9γ2 + T t=1 (at + δy t )2] Á Á Á À(9γ2 + T t=1 (at + δy t )2)log e + 2Dψ(x )[9γ2 + T t=1(at + δy t )2] p + 3γBψ(x )log e + 2Dψ(x )[9γ2 + T t=1(at + δy t )2] p Á Á Á À(9γ2 + T t=1 (at + δy t )2)log e + 2Dy [9γ2 + T t=1(at + δy t )2] p + 3γB y log e + 2Dy [9γ2 + T t=1(at + δy t )2] p T t=1 (at + δy t ) 3Cγ2p V 1/2 p a ϵψ + 3Aψ(x ) Á Á ÀVa log (e + 2Dψ(x )V p a γ2pϵψ ) + 3Bψ(x )log (e + 2Dψ(x )V p a γ2pϵψ ) Á Á ÀVa log (e + 2Dy V p a ϵψ ) + 3γBy log (e + 2Dy V p a γ2pϵψ ) T t=1 (at + δy t ) 3Cγ2p V 1/2 p a ϵψ + 3A(ψ(x ) + y ) Á Á ÀVa log (e + 2D(ψ(x ) + y )V p a γ2pϵψ ) + 3B(ψ(x ) + y )log (e + 2Dψ(x )V p a γ2pϵψ ) T t=1 (at + δy t ) Now, we observe: T t=1 (at + δy t ) = 1 T t=1 [(at + δy t + γ)2 γ2] 1 T t=1 (at + δy t )2, T t=1 [(at + δy t + γ)2 γ2] + γ Next, we bound (at + δy t + γ)2 γ2: (at + δy t + γ)2 γ = a2 t + (δy t )2 + 2atδy t + 2γat + 2γδy t , using δy t 0 (from Lemma 8) amd δy t γ 2 2γ (from Lemma 9): a2 t + 2atδy t + 2γat using 0 at γ: so that we have (recalling that Sa = γ2 + γ T t=1 at: T t=1 (at + δy t ) y So, overall we have: T t=1 (at + δy t )(ˆyt ψ(x )) 3Cγ2p V 1/2 p a ϵψ + 3A(ψ(x ) + y ) Á Á ÀVa log (e + 2D(ψ(x ) 
+ y )V p a γ2pϵψ ) + 3B(ψ(x ) + y )log (e + 2Dψ(x )V p a γ2pϵψ ) y Since the above holds for any y 0, we may write: T t=1 (at + δy t )(ˆyt ψ(x )) inf y sup Va 3Cγ2p V 1/2 p a ϵψ + 3A(ψ(x ) + y ) Á Á ÀVa log (e + 2D(ψ(x ) + y )V p a γ2pϵψ ) +3B(ψ(x ) + y )log (e + 2Dψ(x )V p a γ2pϵψ ) y Now, applying Lemma 20 to bound the minimax expression above, we have: T t=1 (at + δy t )(ˆyt ψ(x )) (1 2 + 144A2)ψ(x ) Á Á ÀSa log (e + 4Dψ(x )Sp a ϵψγ2p [1152A2p + 48p B]p) + γψ(x )(144A2 + 24B)log (e + 4Dψ(x )Sp a ϵψγ2p [1152A2p + 48p B] p) + 3Cγ2p S1/2 pϵψ (144A2 + 24B)log (e + 12CD(1152p A2 + 48pb)p) + 1 2 + (2p + 1)(2 4p) So, overall we achieve: T t=1 gt(xt x ) + at(yt ψ(x )) 3Ch2p T V 1/2 p g ϵx + 3A x Á Á ÀVg log (e + D x V p g h2p 1 ϵx ) + 3Bh T x log (e + D x V p g h2p 1 ϵx ) 2 + 144A2)ψ(x ) Á Á ÀSa log (e + 4Dψ(x )Sp a ϵψγ2p [1152A2p + 48p B]p) + γψ(x )(144A2 + 24B)log (e + 4Dψ(x )Sp a ϵψγ2p [1152A2p + 48p B] p) + 3Cγ2p S1/2 pϵψ (144A2 + 24B)log (e + 12CD(1152p A2 + 48pb)p) + 1 2 + (2p + 1)(2 4p) Ax = 3A, Bx = 3B, Cx = 3C, Dx = D, Bψ = 144A2 + 24B, Cψ = 3C (144A2 + 24B)log (e + 12CD(1152p A2 + 48pb)p) + 1 2 + (2p + 1)(2 4p) Dψ = 4D [1152A2p + 48p B] p , T t=1 gt(xt x ) + at(yt ψ(x )) Cxϵxh2p T V 1/2 p g + Ax x Á Á ÀVg log (e + Dx x V p g ϵxh2p 1 ) + Bx x log (e + Dx x V p g ϵxh2p 1 ), + Cψϵψγ2p S1/2 p a + Aψψ(x ) Á Á ÀSa log (e + Dψψ(x )Sp a ϵψγ2p ) + Bψψ(x )log (e + Dψψ(x )Sp a ϵψγ2p ). from which the conclusion follows. E A Parameter-Free Algorithm With Optimal Log Factors for Protocol 4 In this section we quote an algorithm that obtains a performance guarantee suitable for use as BASE in Theorem 10. We emphasize that the development in this section is only a very mild improvement (affecting only logarithmic factors) on previous work: our key contribution is how to use this algorithm to obtain better adaptivity to unknown Lipschitz constants. 
In fact, algorithms satisfying the requirements of Theorem 10 up to logarithmic factors have been described by several previous authors: see [18, 21, 23, 26]. Here, we provide a slightly improved analysis of the algorithm of [21] which achieves tighter (and in fact optimal) logarithmic terms.

Theorem 11. Suppose $g_1, \dots, g_T$ is any sequence of real numbers and $0 < h_1 \le \dots \le h_T$ is another sequence of real numbers satisfying $|g_t| \le h_t$. Then, if $p = 1/2$, Algorithm 3 guarantees for all $u$:
$$\sum_{t=1}^T g_t(w_t - u) \le 8h_T\epsilon + 6|u|\sqrt{\Big(h_T^2 + \sum_{t=1}^T g_t^2\Big)\log\Big(1 + \frac{|u|}{\epsilon}\Big(3 + \sum_{t=1}^T \frac{g_t^2}{h_t^2}\Big)\log^2\Big(3 + \sum_{t=1}^T \frac{g_t^2}{h_t^2}\Big)\Big)} + 6|u| h_T\log\Big(1 + \frac{|u|}{\epsilon}\Big(3 + \sum_{t=1}^T \frac{g_t^2}{h_t^2}\Big)\log^2\Big(3 + \sum_{t=1}^T \frac{g_t^2}{h_t^2}\Big)\Big),$$

Algorithm 3: 1-Dimensional Learner for Protocol 4 (BASE)
Input: $\epsilon > 0$, $p \in [0, 1/2]$
Initialize $h_0 = 0$, $k = 3$
if $p = 1/2$ then define the constant $c = 3$, else define the constant $c = 1$
for t = 1...T do
  Receive $h_t \ge h_{t-1} \in \mathbb{R}$
  Define $V_t = h_t^2 + \sum_{i=1}^{t-1} g_i^2$
  if $p = 1/2$ then define $\alpha_t = \frac{\epsilon}{\big(c + \sum_{i=1}^{t-1} g_i^2/h_i^2\big)\log^2\big(c + \sum_{i=1}^{t-1} g_i^2/h_i^2\big)}$, else define $\alpha_t = \frac{\epsilon}{\big(c + \sum_{i=1}^{t-1} g_i^2/h_i^2\big)^p}$
  Define $\Theta_t = \frac{\big(\sum_{i=1}^{t-1} g_i\big)^2}{4k^2 V_t}$ if $\big|\sum_{i=1}^{t-1} g_i\big| \le \frac{2k V_t}{h_t}$, and $\Theta_t = \frac{\big|\sum_{i=1}^{t-1} g_i\big|}{k h_t} - \frac{V_t}{h_t^2}$ otherwise
  Output $w_t = -\mathrm{sign}\big(\sum_{i=1}^{t-1} g_i\big)\,\alpha_t\big(\exp(\Theta_t) - 1\big)$
  Receive $g_t$ with $|g_t| \le h_t$
end for

while if $p < 1/2$, Algorithm 3 instead guarantees:
$$\sum_{t=1}^T g_t(w_t - u) \le 4h_T^{2p}\epsilon\Big(\sum_{t=1}^T g_t^2\Big)^{1/2-p} + 6|u|\sqrt{\Big(h_T^2 + \sum_{t=1}^T g_t^2\Big)\log\Big(1 + \frac{|u|}{\epsilon}\Big(1 + \sum_{t=1}^T \frac{g_t^2}{h_t^2}\Big)^{p}\Big)} + 6|u| h_T\log\Big(1 + \frac{|u|}{\epsilon}\Big(1 + \sum_{t=1}^T \frac{g_t^2}{h_t^2}\Big)^{p}\Big).$$

Notice that the term $\log^2\big(3 + \sum_{t=1}^T g_t^2/h_t^2\big) \le \log^2(3 + T)$, and so we upper bound this term with a constant for the purposes of use in Theorem 10. Further, $\sum_{t=1}^T g_t^2/h_t^2 \le \sum_{t=1}^T g_t^2/h_1^2$, and so the logarithmic terms always fit into the framework of Theorem 10.

Proof. Observe that Algorithm 3 is an instance of FTRL with regularizer:
$$\psi_t(w) = k\int_0^{|w|}\min_{\eta \le 1/h_t}\Big[\frac{\log(x/\alpha_t + 1)}{\eta} + \eta V_t\Big]\,dx, \qquad\text{i.e.}\qquad w_t = \operatorname{argmin}_w\ \psi_t(w) + \sum_{i=1}^{t-1} g_i w.$$
In the centered mirror descent framework of [21] (their Algorithm 1), this corresponds to setting $\varphi(w) = 0$.
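For intuition, the per-round computation of Algorithm 3 in the $p = 1/2$ case can be sketched in a few lines of Python. This is our own simplified rendering and should be treated as a sketch rather than a verified implementation; in particular, the minus sign (betting against the accumulated gradient) is our reading of the FTRL derivation, since the extracted pseudocode drops signs.

```python
import math

def base_step(eps, grads, hints, h_t, k=3.0, c=3.0):
    """One prediction of the 1-D base learner, p = 1/2 case (a sketch of
    Algorithm 3, not a verified implementation).

    grads: past gradients g_1..g_{t-1}; hints: the matching magnitude hints
    h_1..h_{t-1} (so |g_i| <= h_i); h_t: the current hint, h_t >= h_{t-1}.
    """
    S = sum(grads)                                # sum of past gradients
    V = h_t * h_t + sum(g * g for g in grads)     # V_t = h_t^2 + sum g_i^2
    r = c + sum((g / h) ** 2 for g, h in zip(grads, hints))
    alpha = eps / (r * math.log(r) ** 2)          # alpha_t for p = 1/2
    if abs(S) <= 2.0 * k * V / h_t:               # two branches of Theta_t
        theta = S * S / (4.0 * k * k * V)         # (they agree at the threshold)
    else:
        theta = abs(S) / (k * h_t) - V / (h_t * h_t)
    # Bet against the accumulated gradient (minus sign restored, see above).
    return -math.copysign(alpha * math.expm1(theta), S)
```

One sanity check on the branch condition: at $|S| = 2kV_t/h_t$ both formulas for $\Theta_t$ evaluate to $V_t/h_t^2$, so the output is continuous in the past gradients.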
Further, [21] provides an analysis of this update for the particular family of regularizer functions $\psi_t$ we consider above in their Theorem 6. Although, formally speaking, their Theorem 6 specifies a particular equation for $\alpha_t$, inspection of the proof shows that most of their argument applies so long as $\alpha_t$ is non-increasing. We reproduce this verification in Lemma 12, which yields:
$$\sum_{t=1}^T g_t(w_t - u) \le \psi_T(u) + \sum_{t=1}^T \frac{2\alpha_t g_t^2}{\sqrt{V_t}}.$$
Next, define $h_{T+1} = h_T$ and $g_{T+1} = 0$ in order to define $\alpha_{T+1}$ and $\psi_{T+1} \ge \psi_T$. So, we can replace $\psi_T(u)$ with $\psi_{T+1}(u)$ in the above expression. Next, to bound $\psi_{T+1}(u)$, we observe that:
$$\psi_{T+1}(u) = k\int_0^{|u|}\min_{\eta \le 1/h_{T+1}}\Big[\frac{\log(x/\alpha_{T+1} + 1)}{\eta} + \eta V_{T+1}\Big]dx \le k|u|\min_{\eta \le 1/h_{T+1}}\Big[\frac{\log(|u|/\alpha_{T+1} + 1)}{\eta} + \eta V_{T+1}\Big].$$
Now, notice that if the minimizing $\eta$ of $\min_{\eta \le 1/h_{T+1}}\big[\frac{\log(|u|/\alpha_{T+1}+1)}{\eta} + \eta V_{T+1}\big]$ occurs on the boundary $\eta = 1/h_{T+1}$, then it must be that $\frac{\log(|u|/\alpha_{T+1}+1)}{\eta} > \eta V_{T+1}$, since $\frac{\log(|u|/\alpha_{T+1}+1)}{\eta}$ is decreasing in $\eta$ and $\eta V_{T+1}$ is increasing in $\eta$. Thus in this case $\min_{\eta \le 1/h_{T+1}}\big[\frac{\log(|u|/\alpha_{T+1}+1)}{\eta} + \eta V_{T+1}\big] \le 2h_T\log(|u|/\alpha_{T+1} + 1)$. Alternatively, when the minimizing $\eta$ is not on the boundary, we have $\min_{\eta \le 1/h_{T+1}}\big[\frac{\log(|u|/\alpha_{T+1}+1)}{\eta} + \eta V_{T+1}\big] = 2\sqrt{V_{T+1}\log(|u|/\alpha_{T+1} + 1)}$. So, in general we have:
$$\psi_{T+1}(u) \le 2k|u|\sqrt{V_{T+1}\log(|u|/\alpha_{T+1} + 1)} + 2k|u| h_T\log(|u|/\alpha_{T+1} + 1).$$
So far this analysis is identical to that of [21], and has been agnostic to the value of $\alpha_t$, so long as $\alpha_t$ is non-increasing. Now, however, we come to the place at which our analysis diverges: our choice of $\alpha_t$ is slightly larger, and so results in better logarithmic factors in $\psi$. The trade-off is that we need to provide a fresh analysis of $\sum_{t=1}^T \frac{2\alpha_t g_t^2}{\sqrt{V_t}}$ to show that this term is still controlled. We accomplish this in Lemma 21 (for $p = 1/2$) and Lemma 22 (for $p < 1/2$).
For $p = 1/2$, we then obtain:
$$\sum_{t=1}^Tg_t(w_t-u) \le 8h_T\epsilon + 2k|u|\sqrt{\Big(h_T^2+\sum_{t=1}^Tg_t^2\Big)\log\Big(1+\frac{|u|}{\epsilon}\Big(3+\sum_{t=1}^T\frac{g_t^2}{h_t^2}\Big)\log^2\Big(3+\sum_{t=1}^T\frac{g_t^2}{h_t^2}\Big)\Big)} + 2k|u|h_T\log\Big(1+\frac{|u|}{\epsilon}\Big(3+\sum_{t=1}^T\frac{g_t^2}{h_t^2}\Big)\log^2\Big(3+\sum_{t=1}^T\frac{g_t^2}{h_t^2}\Big)\Big),$$
while for $p<1/2$ we obtain:
$$\sum_{t=1}^Tg_t(w_t-u) \le \frac{4h_T^{2p}\epsilon}{1-2p}\Big(\sum_{t=1}^Tg_t^2\Big)^{1/2-p} + 2k|u|\sqrt{\Big(h_T^2+\sum_{t=1}^Tg_t^2\Big)\log\Big(1+\frac{|u|}{\epsilon}\Big(1+\sum_{t=1}^T\frac{g_t^2}{h_t^2}\Big)^p\Big)} + 2k|u|h_T\log\Big(1+\frac{|u|}{\epsilon}\Big(1+\sum_{t=1}^T\frac{g_t^2}{h_t^2}\Big)^p\Big).$$
The conclusion now follows by substituting in $k = 3$.
The following technical lemma is lifted almost entirely from [21]. Unfortunately, this result was not explicitly stated as a separate lemma in the prior literature, and is instead merely a subset of the proof of a larger theorem (specifically, Theorem 6 of [21]). So, we include the argument here for completeness. The steps are nearly identical to the prior literature, with only very mild improvements to some constants.
Lemma 12. Let $g_1,\dots,g_T$ be an arbitrary sequence of scalars. Suppose $0 < h_1 \le \dots \le h_T$ is a non-decreasing sequence with $|g_t| \le h_t$ for all $t$, and let $\alpha_1 \ge \dots \ge \alpha_T$ be a non-increasing sequence. Let $k \ge 3$. Set $V_t = h_t^2 + \sum_{i=1}^{t-1}g_i^2$ and define
$$\psi_t(w) = k\int_0^{|w|}\min_{\eta\le1/h_t}\Big[\frac{\log(x/\alpha_t+1)}{\eta}+\eta V_t\Big]dx, \qquad w_t = \operatorname*{argmin}_w\ \psi_t(w)+\sum_{i=1}^{t-1}g_iw.$$
Then for all $u\in\mathbb{R}$:
$$\sum_{t=1}^Tg_t(w_t-u) \le \psi_T(u) + \sum_{t=1}^T\frac{2\alpha_tg_t^2}{\sqrt{V_t}}.$$
Proof. Define $\psi_{T+1} = \psi_T$ and let $D_f(a\,\|\,b)$ indicate the Bregman divergence $D_f(a\,\|\,b) = f(a)-f(b)-f'(b)(a-b)$. Define $\Delta_t(w) = D_{\psi_{t+1}}(w\,\|\,w_1) - D_{\psi_t}(w\,\|\,w_1)$. Then, by [21, Lemma 1], we have:
$$\sum_{t=1}^Tg_t(w_t-u) \le \psi_T(u) + \sum_{t=1}^T\big[g_t(w_t-w_{t+1}) - D_{\psi_t}(w_{t+1}\,\|\,w_t) - \Delta_t(w_{t+1})\big].$$
So, it suffices to establish that:
$$g_t(w_t-w_{t+1}) - D_{\psi_t}(w_{t+1}\,\|\,w_t) - \Delta_t(w_{t+1}) \le \frac{2\alpha_tg_t^2}{\sqrt{V_t}}. \tag{17}$$
Following the notation and argument of [21], define $F_t(w) = \log(w/\alpha_t+1)$ and
$$\Psi_t(x) = k\int_0^x\min_{\eta\le1/h_t}\Big[\frac{F_t(x')}{\eta}+\eta V_t\Big]dx'.$$
Then we have $\psi_t(w) = \Psi_t(|w|)$, and elementary calculation yields:
$$\Psi_t'(x) = \begin{cases}2k\sqrt{V_tF_t(x)} & \text{if } h_t\sqrt{F_t(x)}\le\sqrt{V_t}\\ kh_tF_t(x)+\frac{kV_t}{h_t} & \text{otherwise,}\end{cases} \qquad \Psi_t''(x) = \begin{cases}\frac{k\sqrt{V_t}}{(x+\alpha_t)\sqrt{F_t(x)}} & \text{if } h_t\sqrt{F_t(x)}\le\sqrt{V_t}\\ \frac{kh_t}{x+\alpha_t} & \text{otherwise,}\end{cases} \qquad \Psi_t'''(x) = \begin{cases}-\frac{k\sqrt{V_t}(1+2F_t(x))}{2(x+\alpha_t)^2F_t(x)^{3/2}} & \text{if } h_t\sqrt{F_t(x)}\le\sqrt{V_t}\\ -\frac{kh_t}{(x+\alpha_t)^2} & \text{otherwise.}\end{cases}$$
Therefore, $\Psi_t(x)\ge0$, $\Psi_t'(x)\ge0$, $\Psi_t''(x)\ge0$ and $\Psi_t'''(x)\le0$.
Further, if we define $x_0 = \alpha_t(e-1)$, then for any $x > x_0$ we have $F_t(x) \ge 1$, and hence:
$$\frac{\Psi_t''(x)}{\Psi_t'(x)^2} \le \begin{cases}\frac{\sqrt{F_t(x)}}{2k\sqrt{V_t}} & \text{if } h_t\sqrt{F_t(x)} \le \sqrt{V_t}\\ \frac{1}{kh_t} & \text{otherwise.}\end{cases}$$
Now, if we define $Z_t(x) = \int_0^x \min\Big(\frac{\sqrt{F_t(x')}}{2k\sqrt{V_t}},\frac{1}{kh_t}\Big)dx'$, then we have $\frac{\Psi_t''(x)}{\Psi_t'(x)^2} \le Z_t'(x)$ for all $x > x_0$. Clearly $Z_t$ is convex, $1/h_t$-Lipschitz, and achieves its minimum value of $0$ at $0$. Therefore, by [21, Lemma 2], we have:
$$g_t(w_t-w_{t+1}) - D_{\psi_t}(w_{t+1}\,\|\,w_t) - Z_t(|w_{t+1}|)g_t^2 \le \frac{2g_t^2}{\Psi_t''(x_0)} = \frac{2g_t^2(x_0+\alpha_t)}{k\sqrt{V_t}} = \frac{2g_t^2\alpha_te}{k\sqrt{V_t}} \le \frac{2g_t^2\alpha_t}{\sqrt{V_t}}.$$
So, now if we could show that $\Delta_t(w) \ge Z_t(|w|)g_t^2$, this would establish (17). In turn, since $\Delta_t(w) = \Psi_{t+1}(|w|)-\Psi_t(|w|)$, it suffices to establish:
$$\Psi_{t+1}'(x) - \Psi_t'(x) \ge Z_t'(x)g_t^2 = g_t^2\min\Big(\frac{\sqrt{F_t(x)}}{2k\sqrt{V_t}},\frac{1}{kh_t}\Big).$$
To this end, we compute:
$$\Psi_{t+1}'(x)-\Psi_t'(x) = k\min_{\eta\le1/h_{t+1}}\Big[\frac{F_{t+1}(x)}{\eta}+\eta V_{t+1}\Big] - k\min_{\eta\le1/h_t}\Big[\frac{F_t(x)}{\eta}+\eta V_t\Big] \ge k\min_{\eta\le1/h_t}\Big[\frac{F_{t+1}(x)}{\eta}+\eta V_{t+1}\Big] - k\min_{\eta\le1/h_t}\Big[\frac{F_t(x)}{\eta}+\eta V_t\Big].$$
Next, let us define $\delta_m = h_{t+1}^2 - h_t^2$, so that $V_{t+1} = V_t+\delta_m+g_t^2$. Then, letting $\hat\eta$ denote the minimizer of $\min_{\eta\le1/h_t}\big[\frac{F_{t+1}(x)}{\eta}+\eta V_{t+1}\big]$, we have $\min_{\eta\le1/h_t}\big[\frac{F_{t+1}(x)}{\eta}+\eta V_{t+1}\big] \ge \min_{\eta\le1/h_t}\big[\frac{F_{t+1}(x)}{\eta}+\eta V_t\big] + \hat\eta(\delta_m+g_t^2)$. Armed with this calculation, we proceed:
$$\Psi_{t+1}'(x)-\Psi_t'(x) \ge k(\delta_m+g_t^2)\hat\eta + k\min_{\eta\le1/h_t}\Big[\frac{F_{t+1}(x)}{\eta}+\eta V_t\Big] - k\min_{\eta\le1/h_t}\Big[\frac{F_t(x)}{\eta}+\eta V_t\Big] \ge k(\delta_m+g_t^2)\hat\eta,$$
where the last step holds since $\alpha_{t+1}\le\alpha_t$ implies $F_{t+1}\ge F_t$. Next, observe that $\hat\eta = \min\big(\sqrt{F_{t+1}(x)/V_{t+1}},\,1/h_t\big)$, and that
$$\frac{\partial}{\partial\delta_m}\,\frac{\delta_m+g_t^2}{\sqrt{V_t+\delta_m+g_t^2}} = \frac{\delta_m+2V_t+g_t^2}{2(V_t+\delta_m+g_t^2)^{3/2}} \ge 0.$$
Therefore:
$$\frac{\delta_m+g_t^2}{\sqrt{V_{t+1}}} = \frac{\delta_m+g_t^2}{\sqrt{V_t+\delta_m+g_t^2}} \ge \frac{g_t^2}{\sqrt{V_t+g_t^2}} \ge \frac{g_t^2}{\sqrt{V_t}}\sqrt{\frac{h_t^2}{h_t^2+g_t^2}} \ge \frac{g_t^2}{\sqrt{2V_t}}.$$
This implies that
$$k(\delta_m+g_t^2)\hat\eta \ge kg_t^2\min\Big(\frac{\sqrt{F_{t+1}(x)}}{\sqrt{2V_t}},\frac{1}{h_t}\Big) \ge g_t^2\min\Big(\frac{\sqrt{F_t(x)}}{2k\sqrt{V_t}},\frac{1}{kh_t}\Big) = Z_t'(x)g_t^2,$$
where the final inequality uses $F_{t+1}\ge F_t$ and $k\ge1$. So, altogether we have $\Psi_{t+1}'(x)-\Psi_t'(x) \ge Z_t'(x)g_t^2$, as desired.
F Fully Unconstrained Learning via Regularization
In this section, we provide a formal description of how to achieve a fully unconstrained bound via application of some peculiar regularization terms, as sketched in Section 3.2. The goal is to ensure regret given by (4), restated below:
$$\sum_{t=1}^T g_t(w_t-u) + a_t(\psi(w_t)-\psi(u)) \le O\Big(|u|\sqrt{h_T^2+\sum_{t=1}^Tg_t^2} + \psi(u)\sqrt{\gamma^2+\sum_{t=1}^Ta_t^2}\Big). \tag{4}$$
In Section H, we will see how to obtain the bound (4) via a general technique for obtaining constrained full-matrix regret bounds (which is of independent interest).
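The regularization coefficients $a_t$ used by the constructions of this section (see Algorithm 4 below) are built from relative hint increments so that their sum telescopes to $O(\gamma\log\log(h_T/h_1))$. One step of that bookkeeping might be sketched as follows (the function and variable names are ours, under our reading of the pseudocode):

```python
def magnitude_feedback_step(g, h, ratio_sum, gamma):
    # One step of the hint/clip/regularization bookkeeping:
    # grow the hint, clip the gradient to the old hint, and set the
    # regularization coefficient a_t from the relative hint increase.
    # ratio_sum carries sum_{i<t} (h_{i+1} - h_i) / h_{i+1}.
    h_next = max(h, abs(g))
    g_clipped = max(-h, min(h, g))             # clip_{[-h, h]}(g)
    inc = (h_next - h) / h_next                # in [0, 1); zero unless the hint grew
    a = gamma * inc / (1.0 + ratio_sum + inc)  # a_t in [0, gamma], sum telescopes
    return h_next, g_clipped, a, ratio_sum + inc
```

Note that `a` is nonzero only on rounds where the hint increases, i.e., exactly the rounds on which the clipped gradient sits at the boundary.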
However, this approach comes with a mild computational overhead. To counteract this, in Section E, we provide an alternative approach that has the same computational complexity as gradient descent, but achieves the slightly weaker bound:
$$\sum_{t=1}^Tg_t(w_t-u) + a_t(\psi(w_t)-\psi(u)) \le O\Big(|u|\sqrt{h_T^2+\sum_{t=1}^Tg_t^2} + \psi(u)\sqrt{\gamma^2+\gamma\sum_{t=1}^Ta_t}\Big). \tag{18}$$
Fortunately, (18) will also be sufficient for our purposes. Armed with an algorithm that achieves (18), we are ready to describe our approach for fully unconstrained learning.
Corollary 13. There exists an online learning algorithm that requires $O(d)$ space and takes $O(d)$ time per update, takes as input scalar values $\gamma$, $h_1$, and $\epsilon$, and ensures that for any sequence $g_1,g_2,\dots\in\mathbb{R}^d$, the outputs $w_1,w_2,\dots\in\mathbb{R}^d$ satisfy, for all $w_\star$ and $T$:
$$\sum_{t=1}^T\langle g_t,w_t-w_\star\rangle \le O\Big(\epsilon\sqrt{V} + \|w_\star\|\sqrt{V\log\Big(e+\frac{\|w_\star\|}{\epsilon}\Big)} + \|w_\star\|G\log\Big(e+\frac{\|w_\star\|}{\epsilon}\Big) + \frac{G^2}{\gamma}\log\Big(e+\frac{G}{h_1}\Big) + \gamma\|w_\star\|^2\log\Big(e+\frac{\|w_\star\|^2}{\epsilon^2}\log\Big(e+\frac{G}{h_1}\Big)\Big)\Big),$$
where $G = h_1+\max_t\|g_t\|$ and $V = G^2+\sum_{t=1}^T\|g_t\|^2$.
Proof. Apply Algorithm 5 with $q=1$, and REG set to Algorithm 2 using Algorithm 3 with $p=0$ as BASE. The result in one dimension then follows from Theorem 16 and Corollary 6. Then, by the reduction from $d$-dimensional online learning to 1-dimensional online learning ([16, Theorem 2]), the result in high dimensions also follows.
Theorem 1. There exists an online learning algorithm that requires $O(d)$ space and takes $O(d)$ time per update, takes as input scalar values $\gamma$, $h_1$, and $\epsilon$, and ensures that for any sequence $g_1,g_2,\dots\in\mathbb{R}^d$, the outputs $w_1,w_2,\dots\in\mathbb{R}^d$ satisfy, for all $w_\star$ and $T$:
$$\sum_{t=1}^T\langle g_t,w_t-w_\star\rangle \le O\Big(\epsilon G + \epsilon^2\gamma + \frac{G^2}{\gamma}\log\Big(e+\frac{G}{h_1}\Big) + \|w_\star\|\sqrt{V\log\Big(e+\frac{\|w_\star\|\sqrt{V}\log^2T}{h_1\epsilon}\Big)} + \|w_\star\|G\log\Big(e+\frac{\|w_\star\|\sqrt{V}\log^2T}{h_1\epsilon}\Big) + \gamma\|w_\star\|^2\log\Big(e+\frac{\|w_\star\|^2}{\epsilon^2}\log\Big(e+\frac{G}{h_1}\Big)\Big)\Big),$$
where $G = \max(h_1,\max_{t\in[T]}\|g_t\|)$ and $V = G^2+\sum_{t=1}^T\|g_t\|^2$.
Proof. Apply Algorithm 5 with $q=1$, and REG set to Algorithm 2 using Algorithm 3 with $p=1/2$ as BASE. The result in one dimension then follows from Theorem 16 and Corollary 6.
Then, by the reduction from $d$-dimensional online learning to 1-dimensional online learning ([16, Theorem 2]), the result in high dimensions also follows.

Algorithm 4: Fully Unconstrained Learning in One Dimension
Input: Regularized learning algorithm REG with domain $\mathbb{R}$. Parameters $\gamma > 0$, $h_1 > 0$, $\epsilon > 0$.
Initialize REG with parameters $\epsilon$ and $\gamma$.
for $t = 1,\dots,T$ do
    Send $h_t$ to REG as the $t$-th magnitude hint.
    Get $w_t$ from REG. Play $w_t$, see feedback $g_t$.
    Set $h_{t+1} = \max(h_t, |g_t|)$.
    Set $\hat g_t = \operatorname{clip}_{[-h_t,h_t]}(g_t)$.
    Set $a_t = \gamma\,\frac{(h_{t+1}-h_t)/h_{t+1}}{1+\sum_{i=1}^t(h_{i+1}-h_i)/h_{i+1}}$.
    Send $\hat g_t, a_t$ to REG as the $t$-th loss and regularization coefficient.
end for

Theorem 14. There exists an online learning algorithm that requires $O(d)$ space and takes $O(d)$ time per update, takes as input scalar values $\gamma$, $h_1$, and $\epsilon$ and a symmetric convex function $\psi$, and ensures that for any sequence $g_1,g_2,\dots\in\mathbb{R}^d$, the outputs $w_1,w_2,\dots\in\mathbb{R}^d$ satisfy, for all $w_\star$ and $T$:
$$\sum_{t=1}^T\langle g_t,w_t-w_\star\rangle \le O\Big(\epsilon G + \|w_\star\|\sqrt{V\log\Big(e+\frac{\|w_\star\|\sqrt{V}\log^2T}{h_1\epsilon}\Big)} + \|w_\star\|G\log\Big(e+\frac{\|w_\star\|\sqrt{V}\log^2T}{h_1\epsilon}\Big) + \psi(\epsilon)\gamma + \gamma\psi(\|w_\star\|)\log\Big(e+\frac{\psi(\|w_\star\|)}{\psi(\epsilon)}\log\Big(e+\frac{G}{h_1}\Big)\Big) + \gamma\log\Big(1+\log\frac{G}{h_1}\Big)\psi^\star\Big(\frac{G}{\gamma}\Big[1+\log\frac{G}{h_1}\Big]\Big)\Big),$$
where $\psi^\star(\theta) = \sup_w\theta w-\psi(w)$ is the Fenchel conjugate of $\psi$, $G = \max(h_1,\max_t\|g_t\|)$, and $V = G^2+\sum_{t=1}^T\|g_t\|^2$.
Proof. Apply Algorithm 5 with REG set to Algorithm 2 using Algorithm 3 with $p=1/2$ as BASE. The result in one dimension then follows from Theorem 16 and Corollary 6. Then, by the reduction from $d$-dimensional online learning to 1-dimensional online learning ([16, Theorem 2]), the result in high dimensions also follows.
Theorem 15. Suppose $\psi$ is a symmetric convex function. Suppose that, so long as $h_t \ge |\hat g_t|$, REG ensures, for some $A,B,C,D,p,\epsilon$:
$$\sum_{t=1}^T\hat g_t(w_t-w_\star) + a_t(\psi(w_t)-\psi(w_\star)) \le C\epsilon h_T^{2p}V_g^{1/2-p} + C\psi(\epsilon)\gamma^{2p}S_a^{1/2-p} + A|w_\star|\sqrt{V_g\log\Big(e+\frac{D|w_\star|V_g^p}{h_1^{2p}\epsilon}\Big)} + Bh_T|w_\star|\log\Big(e+\frac{D|w_\star|V_g^p}{h_1^{2p}\epsilon}\Big) + A\psi(w_\star)\sqrt{S_a\log\Big[e+\frac{D\psi(w_\star)}{\gamma^{2p}\psi(\epsilon)}S_a^p\Big]} + \gamma B\psi(w_\star)\log\Big[e+\frac{D\psi(w_\star)}{\gamma^{2p}\psi(\epsilon)}S_a^p\Big],$$
where $V_g = h_T^2+\sum_{t=1}^T\hat g_t^2$ and $S_a = \gamma^2+\gamma\sum_{t=1}^Ta_t$. Then Algorithm 4 ensures:
$$S_a \le \gamma^2+\gamma^2\log\Big(1+\min\Big[\log\Big(\frac{h_T}{h_1}\Big),T\Big]\Big),$$
and:
$$\sum_{t=1}^Tg_t(w_t-w_\star) \le C\epsilon h_T^{2p}V_g^{1/2-p} + C\psi(\epsilon)\gamma^{2p}S_a^{1/2-p} + A|w_\star|\sqrt{V_g\log\Big(e+\frac{D|w_\star|V_g^p}{h_1^{2p}\epsilon}\Big)} + Bh_T|w_\star|\log\Big(e+\frac{D|w_\star|V_g^p}{h_1^{2p}\epsilon}\Big) + A\psi(w_\star)\sqrt{S_a\log\Big[e+\frac{D\psi(w_\star)}{\gamma^{2p}\psi(\epsilon)}S_a^p\Big]} + \gamma B\psi(w_\star)\log\Big[e+\frac{D\psi(w_\star)}{\gamma^{2p}\psi(\epsilon)}S_a^p\Big] + h_T|w_\star| + \frac{\psi(w_\star)S_a}{\gamma} + \gamma\log\Big(1+\min\Big[\log\Big(\frac{h_T}{h_1}\Big),T\Big]\Big)\psi^\star\Big(\frac{h_T}{\gamma}\Big[1+\log\Big(\frac{h_T}{h_1}\Big)\Big]\Big).$$
In the special case that $\psi(x) = \frac{|x|^{1+q}}{1+q}$, we can replace the final term $\gamma\log(1+\min[\log(\frac{h_T}{h_1}),T])\,\psi^\star(\frac{h_T}{\gamma}[1+\log(\frac{h_T}{h_1})])$ in the above expression by:
$$\frac{h_T^{1+1/q}\big[1+\log\big(\frac{h_T}{h_1}\big)\big]^{1/q}}{(1+1/q)\gamma^{1/q}}.$$
Proof. We have:
$$\sum_{t=1}^Tg_t(w_t-u) = \sum_{t=1}^T\big[\hat g_t(w_t-u) + a_t(\psi(w_t)-\psi(u))\big] + \sum_{t=1}^Ta_t\psi(u) + \sum_{t=1}^T(g_t-\hat g_t)(w_t-u) - \sum_{t=1}^Ta_t\psi(w_t)$$
$$\le \psi(u)\sum_{t=1}^Ta_t + |u|\sum_{t=1}^T|g_t-\hat g_t| + \sum_{t=1}^T\big[\hat g_t(w_t-u)+a_t(\psi(w_t)-\psi(u))\big] + \sum_{t=1}^T\big[|g_t-\hat g_t||w_t| - a_t\psi(w_t)\big].$$
Observing that $|g_t-\hat g_t| = h_{t+1}-h_t$, this equals
$$\psi(u)\sum_{t=1}^Ta_t + |u|\sum_{t=1}^T[h_{t+1}-h_t] + \sum_{t=1}^T\big[(h_{t+1}-h_t)|w_t| - a_t\psi(w_t)\big] + \sum_{t=1}^T\big[\hat g_t(w_t-u)+a_t(\psi(w_t)-\psi(u))\big].$$
Next, we will bound the terms $\sum_ta_t\psi(u)$ and $|u|\sum_t[|g_t|-h_t]_+$. First, $h_{t+1} = h_t+[|g_t|-h_t]_+$, so that $|u|\sum_t|g_t-\hat g_t| \le |u|h_T$. Further, notice that for any non-negative $s_0,s_1,\dots,s_T$, $\sum_{t=1}^T\frac{s_t}{\sum_{i=0}^ts_i} \le \log\big(\frac{\sum_{i=0}^Ts_i}{s_0}\big)$, so that:
$$\sum_{t=1}^Ta_t \le \gamma\log\Big(1+\sum_{t=1}^T\frac{h_{t+1}-h_t}{h_{t+1}}\Big).$$
Notice that $\frac{[|g_t|-h_t]_+}{h_{t+1}} \le 1$ and $\frac{h_{t+1}-h_t}{h_{t+1}} \le \log\frac{h_{t+1}}{h_t}$, so we also have:
$$\sum_{t=1}^T\frac{h_{t+1}-h_t}{h_{t+1}} \le \min\Big[\log\Big(\frac{h_T}{h_1}\Big),T\Big],$$
so that overall:
$$S_a = \gamma^2+\gamma\sum_{t=1}^Ta_t \le \gamma^2+\gamma^2\log\Big(1+\min\Big[\log\Big(\frac{h_T}{h_1}\Big),T\Big]\Big).$$
Next, we bound the terms $(h_{t+1}-h_t)|w_t| - a_t\psi(w_t)$. Let $\psi^\star$ be the Fenchel conjugate of $\psi$. Recall that $\psi$ is symmetric, so that $\psi(w_t) = \psi(|w_t|)$. This also implies that $\psi^\star$ is symmetric and is minimized at zero.
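The next step of the proof bounds $(h_{t+1}-h_t)|w_t| - a_t\psi(w_t)$ via the Fenchel–Young inequality $b|w| - a\psi(w) \le a\,\psi^\star(b/a)$. For the illustrative case $\psi(x)=x^2$, where $\psi^\star(\theta)=\theta^2/4$, this reads $b|w| - aw^2 \le b^2/(4a)$ and can be checked numerically (a sketch of our own, not part of the paper's algorithms):

```python
def fenchel_young_gap(b, a, ws):
    # For psi(x) = x^2 the conjugate is psi*(theta) = theta^2 / 4, so
    # Fenchel-Young gives b*|w| - a*w^2 <= a * psi*(b/a) = b^2/(4a) for all w.
    # Return the largest violation of the inequality over the grid ws.
    bound = b * b / (4 * a)
    return max(b * abs(w) - a * w * w - bound for w in ws)
```

The maximum over $w$ is attained at $|w| = b/(2a)$, where the inequality is tight.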
Thus:
$$(h_{t+1}-h_t)|w_t| - a_t\psi(w_t) = a_t\Big(\frac{(h_{t+1}-h_t)|w_t|}{a_t}-\psi(|w_t|)\Big) \le a_t\psi^\star\Big(\frac{h_{t+1}-h_t}{a_t}\Big) = a_t\psi^\star\Big(\frac{h_{t+1}}{\gamma}\Big[1+\sum_{i=1}^t\frac{h_{i+1}-h_i}{h_{i+1}}\Big]\Big).$$
So, in general we have:
$$\sum_{t=1}^T\big[(h_{t+1}-h_t)|w_t| - a_t\psi(w_t)\big] \le \sum_{t=1}^Ta_t\psi^\star\Big(\frac{h_{t+1}}{\gamma}\Big[1+\sum_{i=1}^t\frac{h_{i+1}-h_i}{h_{i+1}}\Big]\Big) \le \sum_{t=1}^Ta_t\,\psi^\star\Big(\frac{h_T}{\gamma}\Big[1+\log\Big(\frac{h_T}{h_1}\Big)\Big]\Big) \le \gamma\log\Big(1+\min\Big[\log\Big(\frac{h_T}{h_1}\Big),T\Big]\Big)\psi^\star\Big(\frac{h_T}{\gamma}\Big[1+\log\Big(\frac{h_T}{h_1}\Big)\Big]\Big).$$
In the special case that $\psi(w) = \frac{|w|^{1+q}}{1+q}$, we have $\psi^\star(h) = \frac{h^{1+1/q}}{1+1/q}$, so that we can improve the logarithmic factors and simplify the calculation:
$$a_t\psi^\star\Big(\frac{h_{t+1}}{\gamma}\Big[1+\sum_{i=1}^t\frac{h_{i+1}-h_i}{h_{i+1}}\Big]\Big) = \frac{a_th_{t+1}^{1+1/q}\big[1+\sum_{i=1}^t\frac{h_{i+1}-h_i}{h_{i+1}}\big]^{1+1/q}}{(1+1/q)\gamma^{1+1/q}} = \frac{(h_{t+1}-h_t)h_{t+1}^{1/q}\big[1+\sum_{i=1}^t\frac{h_{i+1}-h_i}{h_{i+1}}\big]^{1/q}}{(1+1/q)\gamma^{1/q}} \le \frac{(h_{t+1}-h_t)h_T^{1/q}\big[1+\log\big(\frac{h_T}{h_1}\big)\big]^{1/q}}{(1+1/q)\gamma^{1/q}},$$
so that:
$$\sum_{t=1}^T\big[(h_{t+1}-h_t)|w_t| - a_t\psi(w_t)\big] \le \frac{h_T^{1+1/q}\big[1+\log\big(\frac{h_T}{h_1}\big)\big]^{1/q}}{(1+1/q)\gamma^{1/q}}.$$
Finally, it is clear that $|\hat g_t| \le h_t$, so the summation $\sum_t\hat g_t(w_t-u)+a_t(\psi(w_t)-\psi(u))$ is controlled by the regret bound of REG:
$$\sum_{t=1}^T\hat g_t(w_t-u)+a_t(\psi(w_t)-\psi(u)) \le C\epsilon h_T^{2p}V_g^{1/2-p} + C\psi(\epsilon)\gamma^{2p}S_a^{1/2-p} + A|u|\sqrt{V_g\log\Big(e+\frac{D|u|V_g^p}{h_1^{2p}\epsilon}\Big)} + Bh_T|u|\log\Big(e+\frac{D|u|V_g^p}{h_1^{2p}\epsilon}\Big) + A\psi(u)\sqrt{S_a\log\Big[e+\frac{D\psi(u)}{\gamma^{2p}\psi(\epsilon)}S_a^p\Big]} + \gamma B\psi(u)\log\Big[e+\frac{D\psi(u)}{\gamma^{2p}\psi(\epsilon)}S_a^p\Big].$$

Algorithm 5: Fully Unconstrained Learning
Input: Symmetric convex function $\psi:\mathbb{R}\to\mathbb{R}$ with $\psi(0)=0$. Scalars $\epsilon>0$, $h_1>0$, $\gamma>0$, $p\in[0,1/2]$.
Let REG be an instance of Algorithm 6 with input $\psi$, $\gamma$, $p$, $\epsilon_x=\epsilon$ and $\epsilon_\psi=\psi(\epsilon)$.
Set the vector $w^{\text{direction}}_1 = 0$. Send $h_1$ to REG as the first magnitude hint.
for $t=1,\dots,T$ do
    // Apply the reduction to 1-dimensional learning from [16], using adaptive gradient descent as the direction learner.
    Let $w^{\text{magnitude}}_t\in\mathbb{R}$ be the $t$-th output of REG. Set $w_t = w^{\text{magnitude}}_t\,w^{\text{direction}}_t$.
    Play $w_t$, see feedback $g_t$.
    Set $w^{\text{direction}}_{t+1} = \Pi_{\|w\|\le1}\Big[w^{\text{direction}}_t - \frac{g_t}{\sqrt{2\sum_{i=1}^t\|g_i\|^2}}\Big]$.
    // Compute feedback for the magnitude learner.
    Set $g^{\text{1d}}_t = \langle g_t,w^{\text{direction}}_t\rangle$.
    // Apply our new fully unconstrained magnitude learner.
    Set $h_{t+1} = \max(h_t,|g^{\text{1d}}_t|)$. Set $\hat g_t = \operatorname{clip}_{[-h_t,h_t]}(g^{\text{1d}}_t)$.
    Set $a_t = \gamma\,\frac{(h_{t+1}-h_t)/h_{t+1}}{1+\sum_{i=1}^t(h_{i+1}-h_i)/h_{i+1}}$.
    Send $\hat g_t,a_t$ to REG as the $t$-th loss and regularization coefficient.
    Send $h_{t+1}$ to REG as the $(t+1)$-st magnitude hint.
end for

Finally, we also have:
$$V_g = h_T^2+\sum_{t=1}^T\hat g_t^2 \le h_T^2+\sum_{t=1}^T\|g_t\|^2.$$
F.1 Full Statement of Main Result in High Dimensions
Throughout this paper, we have considered the special case $W = \mathbb{R}$. This suffices due to the reductions of [16], as discussed in Section C. However, here we provide a more complete theorem and algorithm for the case $W = \mathbb{R}^d$. The pseudocode is provided in Algorithm 5, and the regret bound is stated in Theorem 16. Note that the regret bound follows essentially immediately from Theorem 15.
Theorem 16. There exist universal constants $A$, $B$, $C$, $D$ such that Algorithm 5 guarantees for all $T$:
$$\sum_{t=1}^T\langle g_t,w_t-w_\star\rangle \le C\epsilon h_T^{2p}V_g^{1/2-p}+C\psi(\epsilon)\gamma^{2p}S_a^{1/2-p}+A\|w_\star\|\sqrt{V_g\log\Big(e+\frac{D\|w_\star\|V_g^p}{h_1^{2p}\epsilon}\Big)}+Bh_T\|w_\star\|\log\Big(e+\frac{D\|w_\star\|V_g^p}{h_1^{2p}\epsilon}\Big)+A\psi(\|w_\star\|)\sqrt{S_a\log\Big[e+\frac{D\psi(\|w_\star\|)}{\gamma^{2p}\psi(\epsilon)}S_a^p\Big]}+\gamma B\psi(\|w_\star\|)\log\Big[e+\frac{D\psi(\|w_\star\|)}{\gamma^{2p}\psi(\epsilon)}S_a^p\Big]+h_T\|w_\star\|+\frac{\psi(\|w_\star\|)S_a}{\gamma}+\gamma\log\Big(1+\min\Big[\log\Big(\frac{h_T}{h_1}\Big),T\Big]\Big)\psi^\star\Big(\frac{h_T}{\gamma}\Big[1+\log\Big(\frac{h_T}{h_1}\Big)\Big]\Big),$$
where
$$S_a \le \gamma^2+\gamma^2\log\Big(1+\min\Big[\log\Big(\frac{h_T}{h_1}\Big),T\Big]\Big), \qquad V_g \le h_T^2+\sum_{t=1}^T\|g_t\|^2.$$
In the special case that $\psi(x) = \frac{|x|^{1+q}}{1+q}$, we can replace the final term $\gamma\log(1+\min[\log(\frac{h_T}{h_1}),T])\,\psi^\star(\frac{h_T}{\gamma}[1+\log(\frac{h_T}{h_1})])$ in the above expression by:
$$\frac{h_T^{1+1/q}\big[1+\log\big(\frac{h_T}{h_1}\big)\big]^{1/q}}{(1+1/q)\gamma^{1/q}}.$$
Proof. Algorithm 5 applies the dimension-free-to-one-dimension reduction provided by [16, Theorem 2]. Overall, the reduction tells us that the regret is bounded by:
$$\sum_{t=1}^T\langle g_t,w_t-w_\star\rangle \le \sum_{t=1}^Tg^{\text{1d}}_t(w^{\text{magnitude}}_t-\|w_\star\|) + \|w_\star\|\sum_{t=1}^T\Big\langle g_t,w^{\text{direction}}_t-\frac{w_\star}{\|w_\star\|}\Big\rangle.$$
In this case, the direction learner's iterates $w^{\text{direction}}_t$ are generated by standard adaptive gradient descent [13], which guarantees the regret bound:
$$\sum_{t=1}^T\Big\langle g_t,w^{\text{direction}}_t-\frac{w_\star}{\|w_\star\|}\Big\rangle \le 2\sqrt{2\sum_{t=1}^T\|g_t\|^2}.$$
For the first sum $\sum_tg^{\text{1d}}_t(w^{\text{magnitude}}_t-\|w_\star\|)$, notice that $w^{\text{magnitude}}$ is simply an application of Algorithm 4 using an instance of Algorithm 6. The first sum is therefore bounded by an application of Theorem 15, noticing that $|g^{\text{1d}}_t| \le \|g_t\|$. So, putting the two bounds together, we have the stated result.
G Technical Lemmas
Lemma 17.
Let $A$, $B$, $C$, $D$, $E$ be positive numbers and let $e$ be the base of the natural logarithm. Then:
$$\sup_M\ A\sqrt{M\log(e+DM^C)} + B\log(e+DM^C) - EM \le \Big(\frac{A^2}{E}+B\Big)\log\Big(e+D\Big(\frac{2C(A^2+BE)}{E^2}\Big)^C\Big).$$
Proof. First, by Young's inequality $xy \le \inf_{\lambda>0}\frac{x^2}{2\lambda}+\frac{\lambda y^2}{2}$, we have for all $M$:
$$A\sqrt{M\log(e+DM^C)} \le \frac{EM}{2} + \frac{A^2\log(e+DM^C)}{2E}.$$
Then, using the identity $\sqrt{x+y}\le\sqrt{x}+\sqrt{y}$:
$$\sup_M\ A\sqrt{M\log(e+DM^C)} + B\log(e+DM^C) - EM \le \sup_M\ \Big(\frac{A^2}{E}+B\Big)\log(e+DM^C) - \frac{EM}{2}.$$
Now, from first-order optimality conditions we are looking for a solution to:
$$\Big(\frac{A^2}{E}+B\Big)\frac{DCM^{C-1}}{e+DM^C} = \frac{E}{2}.$$
Notice that for any $M \ge \frac{2C(A^2+BE)}{E^2}$ we have $\big(\frac{A^2}{E}+B\big)\frac{DCM^{C-1}}{e+DM^C} \le \big(\frac{A^2}{E}+B\big)\frac{C}{M} \le \frac{E}{2}$. Therefore, the optimal value for $M$ can be at most $\frac{2C(A^2+BE)}{E^2}$. Now, notice that $\big(\frac{A^2}{E}+B\big)\log(e+DM^C)$ is strictly increasing in $M$. Thus, our quantity of interest is upper bounded by substituting $M = \frac{2C(A^2+BE)}{E^2}$ into this increasing term:
$$\sup_M\ \Big(\frac{A^2}{E}+B\Big)\log(e+DM^C)-\frac{EM}{2} \le \Big(\frac{A^2}{E}+B\Big)\log\Big(e+D\Big(\frac{2C(A^2+BE)}{E^2}\Big)^C\Big).$$
Lemma 18. Let $A$, $C$ be positive numbers and $0 < B < 1$. Then:
$$\sup_M\ AM^B - CM = C\Big(\frac{AB}{C}\Big)^{1/(1-B)}\Big(\frac{1}{B}-1\Big).$$
Proof. We differentiate with respect to $M$: $ABM^{B-1} = C$, so that $M = \big(\frac{AB}{C}\big)^{1/(1-B)}$. So, plugging in this optimal $M$ value, we have:
$$\sup_M\ AM^B - CM = CM\Big(\frac{1}{B}-1\Big) = C\Big(\frac{AB}{C}\Big)^{1/(1-B)}\Big(\frac{1}{B}-1\Big).$$
Lemma 19. Let $A$, $B$, $C$, $D$, $E$, $F$ be positive numbers and $G \in [0,1)$, and let $e$ be the base of the natural logarithm. Then:
$$\sup_M\ A\sqrt{M\log(e+DM^C)} + B\log(e+DM^C) + FM^G - EM \le \Big(\frac{4A^2}{E}+B\Big)\log\Big(e+D\Big(\frac{32C(A^2+BE)}{E^2}\Big)^C\Big) + \Big(\frac{4^GGF}{E^G}\Big)^{1/(1-G)}\Big(\frac{1}{G}-1\Big).$$
When $G = 0$, the last term $\big(\frac{4^GGF}{E^G}\big)^{1/(1-G)}\big(\frac{1}{G}-1\big)$ should be replaced with the limiting value $F$.
Proof. Notice that
$$\sup_M\ A\sqrt{M\log(e+DM^C)} + B\log(e+DM^C) + FM^G - EM \le \sup_M\Big[A\sqrt{M\log(e+DM^C)} + B\log(e+DM^C) - \frac{3EM}{4}\Big] + \sup_M\Big[FM^G - \frac{EM}{4}\Big].$$
The result now follows from Lemmas 17 and 18. Alternatively, if $G = 0$, clearly $\sup_M FM^G - \frac{EM}{4} \le F$.
Lemma 20. Let $\psi$, $A$, $B$, $C$, $D$, $F$, $S$, $\gamma$, and $p \le 1/2$ be positive numbers with $S \ge \gamma^2$, and let $e$ be the base of the natural logarithm.
Then: inf E sup V A(ψ + E) Á Á ÀV log (e + D(ψ + E) γ2p V p) + Bγ(ψ + E)log (e + D(ψ + E) γ2p V p) + Fγ2p V 1/2 p EV (1/2 + 16A2)ψ S log (e + 2DψSp γ2p (128p A2ψ + 16p B)p) + γψ (16A2 + 2B)log (e + 2DψSp γ2p (128p A2ψ + 16p B)p) + (16A2 + 2B)log (e + 2DF (128p A2 + 16p B) p) + 1 2 + (2p + 1)(2 4p) 2 γ2p FS1/2 p Proof. By Lemma 19, we have: sup V A(ψ + E) Á Á ÀV log (e + D(ψ + E) γ2p M p) + Bγ(ψ + E)log (e + D(ψ + E) γ2p M p) + Fγ2p V 1/2 p EV (4A2(ψ + E)2γ E + Bγ(ψ + E))log (e + D(ψ + E) γ2p (32pγ2A2(ψ + E)2 E2 + 8pγ2B(ψ + E) + (41/2 p(1/2 p)Fγ2p (E/γ)1/2 p ) 2 1+2p 2p + 1 = (4A2(ψ + E)2γ E + Bγ(ψ + E))log (e + D(ψ + E)(32p A2(ψ + E)2 E2 + 8p B(ψ + E) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ( ) + γ (2p + 1)(2 4p) 2 ( F E1/2 p ) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ( ) E = max min ψ, γψ log (e + 2DψSp γ2p (128p A2 + 16p B)p) , γ1+2p F We will bound the above expression by first considering ( ) and then ( ). 
Now, if E = ψ, we have: ( ) = γψ (16A2 + 2B)log (e + 2Dψ (128p A2 + 16p B) p) recalling that S γ2: γψ (16A2 + 2B)log (e + 2DψSp γ2p (128p A2 + 16p B)p) Alternatively, if E = γψ log (e + 2DψSp γ2p (128p A2 + 16p B)p), then we have E + ψ 2ψ and so: ( ) (16A2ψ2γ E + 2Bγψ)log (e + 2Dψ (128p A2ψ2 E2 + 16p Bψ Before we bound this expression, let us consider just the value inside the logarithm: E2 + 16p Bψ E 128p A2ψ S now, since S γ2: (128p A2ψ + 16p B) S So, putting this back in the previous expression: log (e + 2Dψ (128p A2ψ2 E2 + 16p Bψ p ) log (e + 2DψSp γ2p (128p A2ψ + 16p B)p) from which we conclude: S log (e + 2DψSp γ2p (128p A2ψ + 16p B)p) + 2Bγψ log (e + 2DψSp γ2p (128p A2ψ + 16p B)p) Finally, let us consider the case E = γ1+2p F 2 . To handle this situation, we will work with two more subcases: either E ψ or not. If E ψ, then E + ψ 2ψ. Therefore: ( ) (16A2ψ2γ E + 2Bγψ)log (e + 2Dψ (128p A2ψ2 E2 + 16p Bψ However, if E ψ, then it must be that E γψ log (e + 2DψSp γ2p (128p A2 + 16p B)p). Thus by the exact same analysis following equation (19), we again have S log (e + 2DψSp γ2p (128p A2ψ + 16p B)p) + 2Bγψ log (e + 2DψSp γ2p (128p A2ψ + 16p B)p) So, for our final subcase we consider E = γ1+2p F 2 and also E ψ. Then E + ψ 2E, which yields: ( ) γE(16A2 + 2B)log (e + 2DE (128p A2 + 16p B) p) Since S γ2, E F and so: γF(16A2 + 2B)log (e + 2DF (128p A2 + 16p B) p) So, in all cases we have: S log (e + 2DψSp γ2p (128p A2ψ + 16p B)p) + γψ (16A2 + 2B)log (e + 2DψSp γ2p (128p A2ψ + 16p B)p) + γF(16A2 + 2B)log (e + 2DF (128p A2 + 16p B) p) Where the last term γF(16A2 + 2B)log (e + 2DF (128p A2 + 16p B) p) is only present if p 1/2. Notice that we must have E γ1+2p F 2 . 
Therefore:
$$(**) \le \frac{(2p+1)(2-4p)}{2}\,\gamma^{2p}FS^{1/2-p}.$$
So, overall it holds that:
$$(*) + (**) \le 16A^2\psi\sqrt{S\log\Big(e+\frac{2D\psi S^p}{\gamma^{2p}}(128^pA^2\psi+16^pB)^p\Big)} + \gamma\psi(16A^2+2B)\log\Big(e+\frac{2D\psi S^p}{\gamma^{2p}}(128^pA^2\psi+16^pB)^p\Big) + \gamma F(16A^2+2B)\log\big(e+2DF(128^pA^2+16^pB)^p\big) + \frac{(2p+1)(2-4p)}{2}\gamma^{2p}FS^{1/2-p}.$$
To conclude, let us bound $\frac{ES}{2\gamma}$. If $E \ne \frac{\gamma^{1+2p}F}{2}$, then it must be that $E \le \gamma\psi\sqrt{\log\big(e+\frac{2D\psi S^p}{\gamma^{2p}}(128^pA^2+16^pB)^p\big)/S}$, and therefore:
$$\frac{ES}{2\gamma} \le \frac{\psi}{2}\sqrt{S\log\Big(e+\frac{2D\psi S^p}{\gamma^{2p}}(128^pA^2+16^pB)^p\Big)}.$$
Alternatively, if $E = \frac{\gamma^{1+2p}F}{2}$, then in this case $\frac{ES}{2\gamma} \le \gamma^{2p}FS^{1/2-p}$. So, combining all these facts, we have when $p < 1/2$:
$$\inf_E\sup_V\ A(\psi+E)\sqrt{V\log\Big(e+\frac{D(\psi+E)}{\gamma^{2p}}V^p\Big)} + B\gamma(\psi+E)\log\Big(e+\frac{D(\psi+E)}{\gamma^{2p}}V^p\Big) + F\gamma^{2p}V^{1/2-p} - EV \le \inf_E\ (*)+(**)+\frac{ES}{2\gamma},$$
and, grouping terms and using $\gamma \le \gamma^{2p}S^{1/2-p}$:
$$\le \Big(\frac12+16A^2\Big)\psi\sqrt{S\log\Big(e+\frac{2D\psi S^p}{\gamma^{2p}}(128^pA^2\psi+16^pB)^p\Big)} + \gamma\psi(16A^2+2B)\log\Big(e+\frac{2D\psi S^p}{\gamma^{2p}}(128^pA^2\psi+16^pB)^p\Big) + \Big[(16A^2+2B)\log\big(e+2DF(128^pA^2+16^pB)^p\big) + \frac12 + \frac{(2p+1)(2-4p)}{2}\Big]\gamma^{2p}FS^{1/2-p}.$$
Lemma 21. Suppose $g_1,\dots,g_T$ and $0<h_1\le h_2\le\dots\le h_T$ are such that $|g_t|\le h_t$ for all $t$. Define $V_t = ch_t^2+\sum_{i=1}^{t-1}g_i^2$. Define $\alpha_t = \frac{\epsilon}{(c+\sum_{i=1}^{t-1}g_i^2/h_i^2)\log^2(c+\sum_{i=1}^{t-1}g_i^2/h_i^2)}$ for some $c\ge3$. Then:
$$\sum_{t=1}^T\frac{\alpha_tg_t^2}{\sqrt{V_t}} \le 4\epsilon h_T.$$
Proof. Let $1 = \tau_1 < \tau_2 < \dots < \tau_k \le T$ be the set of indices such that $h_{\tau_{i+1}} > 2h_{\tau_i}$ and $h_{\tau_{i+1}-1} \le 2h_{\tau_i}$, with $\tau_{k+1}$ defined equal to $T+1$ for convenience. Note that this implies $h_{\tau_{k-i}} < h_{\tau_k}/2^i$. Further, $h_{\tau_k} \le h_T$, so overall we have for all $i$: $h_{\tau_{k-i}} \le h_T/2^i$.
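The epoch decomposition just defined (indices where the hint more than doubles) is easy to compute; a small sketch of our own, using 0-based indices rather than the paper's 1-based $\tau_1 = 1$:

```python
def doubling_epochs(h):
    # Return the 0-based epoch starts tau_1, ..., tau_k: tau_1 = 0, and each
    # subsequent tau is the first index whose hint exceeds twice the hint at
    # the previous epoch start.
    taus = [0]
    for t in range(1, len(h)):
        if h[t] > 2 * h[taus[-1]]:
            taus.append(t)
    return taus
```

The key geometric-decay property, $h_{\tau_{k-i}} \le h_T/2^i$, can then be verified directly on examples.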
We will show that
$$\sum_{t=\tau_i}^{\tau_{i+1}-1}\frac{\alpha_tg_t^2}{\sqrt{V_t}} \le 2\epsilon h_{\tau_i}. \tag{20}$$
Once established, this implies:
$$\sum_{t=1}^T\frac{\alpha_tg_t^2}{\sqrt{V_t}} = \sum_{i=1}^k\sum_{t=\tau_i}^{\tau_{i+1}-1}\frac{\alpha_tg_t^2}{\sqrt{V_t}} \le 2\epsilon\sum_{i=1}^kh_{\tau_i} = 2\epsilon\sum_{i=0}^{k-1}h_{\tau_{k-i}} \le 2\epsilon h_T\sum_{i=0}^{k-1}2^{-i} \le 4\epsilon h_T.$$
So, to establish (20), we observe that for any $t\in[\tau_i,\tau_{i+1}-1]$ we have $h_t \ge h_{\tau_i} \ge h_{\tau_{i+1}-1}/2$, so that
$$V_t \ge (c-1)h_t^2 + \sum_{j=\tau_i}^{t-1}g_j^2 \ge \frac{c-1}{4}h_{\tau_{i+1}-1}^2 + \sum_{j=\tau_i}^{t-1}g_j^2, \qquad \alpha_t \le \frac{\epsilon}{\big(\frac{c-1}{2}+\sum_{j=\tau_i}^{t-1}\frac{g_j^2}{h^2_{\tau_{i+1}-1}}\big)\log^2\big(\frac{c-1}{2}+\sum_{j=\tau_i}^{t-1}\frac{g_j^2}{h^2_{\tau_{i+1}-1}}\big)}.$$
Combining these yields:
$$\frac{\alpha_tg_t^2}{\sqrt{V_t}} \le \frac{2\epsilon h_{\tau_i}\; g_t^2/h^2_{\tau_{i+1}-1}}{\big(1+\sum_{j=\tau_i}^{t-1}g_j^2/h^2_{\tau_{i+1}-1}\big)\log^2\big(1+\sum_{j=\tau_i}^{t-1}g_j^2/h^2_{\tau_{i+1}-1}\big)},$$
using $\frac{c-1}{2}\ge1$ and $h_{\tau_{i+1}-1}\le2h_{\tau_i}$. So, now if we define $x_s = g_{s+\tau_i-1}^2/h_{\tau_{i+1}-1}^2$, then we have:
$$\sum_{t=\tau_i}^{\tau_{i+1}-1}\frac{\alpha_tg_t^2}{\sqrt{V_t}} \le 2\epsilon h_{\tau_i}\sum_{s=1}^{\tau_{i+1}-\tau_i}\frac{x_s}{\big(1+\sum_{s'=1}^{s}x_{s'}\big)\log^2\big(1+\sum_{s'=1}^{s}x_{s'}\big)}.$$
And, by [1, Lemma 4.13]:
$$\sum_{s=1}^{\tau_{i+1}-\tau_i}\frac{x_s}{\big(1+\sum_{s'=1}^sx_{s'}\big)\log^2\big(1+\sum_{s'=1}^sx_{s'}\big)} \le \int_0^{\sum_sx_s}\frac{dx}{(1+x)\log^2(1+x)} = \Big[-\frac{1}{\log(1+x)}\Big]_0^{\sum_sx_s} \le 1.$$
So, in the end we have $\sum_{t=\tau_i}^{\tau_{i+1}-1}\frac{\alpha_tg_t^2}{\sqrt{V_t}} \le 2\epsilon h_{\tau_i}$, as desired.
Lemma 22. Suppose $g_1,\dots,g_T$ and $0<h_1\le h_2\le\dots\le h_T$ are such that $|g_t|\le h_t$ for all $t$. Define $V_t = ch_t^2+\sum_{i=1}^{t-1}g_i^2$. Define $\alpha_t = \frac{\epsilon}{(c+\sum_{i=1}^{t-1}g_i^2/h_i^2)^p}$ for some $c\ge1$ and $p\in[0,1/2)$. Then:
$$\sum_{t=1}^T\frac{\alpha_tg_t^2}{\sqrt{V_t}} \le \frac{2\epsilon h_T^{2p}}{1-2p}\Big(\sum_{t=1}^Tg_t^2\Big)^{1/2-p}.$$
Proof. Similar to the proof of Lemma 21, we have $V_t \ge (c-1)h_t^2+g_t^2+\sum_{j=1}^{t-1}g_j^2 \ge \sum_{j=1}^tg_j^2$, and, since $h_j\le h_t\le h_T$,
$$\alpha_t \le \frac{\epsilon h_t^{2p}}{\big(\sum_{j=1}^tg_j^2\big)^p} \le \frac{\epsilon h_T^{2p}}{\big(\sum_{j=1}^tg_j^2\big)^p}.$$
Combining these yields:
$$\frac{\alpha_tg_t^2}{\sqrt{V_t}} \le \frac{\epsilon h_T^{2p}\,g_t^2}{\big(\sum_{j=1}^tg_j^2\big)^{1/2+p}}.$$
Further, by [1, Lemma 4.13] we have:
$$\sum_{t=1}^T\frac{g_t^2}{\big(\sum_{j=1}^tg_j^2\big)^{1/2+p}} \le \int_0^{\sum_tg_t^2}\frac{dx}{x^{1/2+p}} = \frac{\big(\sum_{t=1}^Tg_t^2\big)^{1/2-p}}{1/2-p},$$
from which the conclusion immediately follows.
H Regularized Regret via Full-Matrix Bound With Constraints
In this section, we provide an alternative approach to solving the epigraph-based regularized regret game specified by Protocol 3.
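The analysis in this section relies on the function $X(\theta) = W(\theta)^{1/2} - W(\theta)^{-1/2}$ built from the Lambert $W$ function (see Lemma 27 below). $W$ has no elementary closed form, but a short Newton iteration evaluates it; a sketch of our own (not the authors' implementation):

```python
import math

def lambert_w(x, iters=50):
    # Principal branch W(x) for x >= 0, via Newton's method on w * e^w = x.
    w = math.log(1.0 + x)  # log(1+x) >= W(x) for x >= 0, a good starting point
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (1.0 + w))
    return w

def X_fn(theta):
    # X(theta) = W(theta)^{1/2} - W(theta)^{-1/2}, as in Lemma 27 (theta > 0).
    w = lambert_w(theta)
    return math.sqrt(w) - 1.0 / math.sqrt(w)
```

Since $W(e) = 1$, we have $X(e) = 0$, and $X$ is increasing, growing like $\sqrt{\ln\theta}$.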
Our approach actually involves a generic improvement to the class of so-called full-matrix regret bounds, and so may be of independent interest. Specifically, we will provide an algorithm for online learning with magnitude hints (Protocol 4) that ensures the regret bound:
$$\sum_{t=1}^T\langle g_t,w_t-w\rangle \le O\Big(\epsilon h_{T+1} + \sqrt{d\sum_{t=1}^T\langle g_t,w\rangle^2\,\log(\|w\|T/\epsilon)}\Big). \tag{21}$$
This type of bound is sometimes called a full-matrix bound, as the term inside the square root can be expressed as $w^\top\Sigma w$, where $\Sigma$ is the matrix of gradient outer products $\Sigma = \sum_{t=1}^Tg_tg_t^\top$. Bounds of this form have appeared before in the literature. For the case that $W$ is an entire vector space, [16, 18] both provide full-matrix bounds. For the case in which $W$ is not an entire vector space, [31] provides, to our knowledge, the only full-matrix bound. However, their algorithm suffers a suboptimal logarithmic factor: the $\log(T)$ term appears outside rather than inside the square root. We provide a method that fixes this issue.
However, before delving into the technical details of our approach, let us explain how achieving a full-matrix bound allows us to solve Protocol 3. The argument is nearly immediate: observe that in the 2-dimensional game, we would have $w \mapsto (w_\star,\psi(w_\star))$ and $g_t \mapsto (g_t,a_t)$. Then the bound (6) is immediate from (21). So, without further ado, let us provide our bound and analysis.
H.1 Full Matrix Algorithm and Analysis
Assume that $W \subseteq \mathbb{R}^d$ is a closed convex set that contains the origin within its interior. Further, let $\Phi_{\mathrm{bar}}$ be a self-concordant barrier for $W$ with parameter $\mu > 0$. In this section, we present an algorithm that achieves (21). The algorithm is Follow-The-Regularized-Leader (FTRL) with a specific regularizer we define next.
Regularizers. For $\Sigma\in\mathbb{R}^{d\times d}$ and $Z,\sigma,\varepsilon>0$, define the regularizer:
$$\Phi(w;\Sigma,Z,\sigma,\varepsilon) = \sup_{\lambda\ge0}\ \sqrt{w^\top(\Sigma+\lambda I)w}\;X\Big(\frac{w^\top(\Sigma+\lambda I)w\,e^{-2\lambda Z}\det(\sigma^{-2}\Sigma)}{\varepsilon^2}\Big), \tag{22}$$
where $X(\theta) = W(\theta)^{1/2}-W(\theta)^{-1/2}$ and $W$ is the Lambert function: $W(x)$ is defined as the principal solution to $W(x)e^{W(x)} = x$.
Lemma 23.
For any $\Sigma\in\mathbb{R}^{d\times d}$ and $Z,\varepsilon,\sigma>0$, the Fenchel dual of the function $\Phi(\cdot\,;\Sigma,Z,\sigma,\varepsilon)$ in (22) satisfies:
$$\forall G\in\mathbb{R}^d,\quad \Phi^\star(G;\Sigma,Z,\sigma,\varepsilon) = \inf_{\lambda\ge0}\ \frac{\varepsilon\exp\big(\frac12G^\top(\Sigma+\lambda I)^{-1}G+\lambda Z\big)}{\sqrt{\det(\sigma^{-2}\Sigma)}}.$$
Proof. See [18].
FTRL. We will consider the FTRL algorithm with regularizer $\Phi(\cdot\,;\Sigma,Z,\sigma,\varepsilon)+\Phi_{\mathrm{bar}}(\cdot)$, for some choices of $\Sigma$, $Z$, $\sigma$, and $\varepsilon$. To specify these choices, let $\rho(\gamma) = e^{\frac{1}{2\gamma}}-1$ for $\gamma>1$. With this, and given the history of gradients $g_1,\dots,g_{t-1}$ up to round $t-1$ and parameters $\gamma,\sigma,\varepsilon>0$ and hint $h_t>0$, the algorithm outputs:
$$w_t \in \operatorname*{argmin}_{w\in\mathbb{R}^d}\ \langle G_{t-1},w\rangle + \Psi(w;V_{t-1},h_t,\sigma,\varepsilon), \tag{24}$$
where
$$\Psi(w;V,h,\sigma,\varepsilon) = \Phi(w;\sigma^2I+\gamma V,\rho(\gamma)^2/h^2,\sigma,\varepsilon) + \Phi_{\mathrm{bar}}(w), \tag{25}$$
and
$$G_\tau = \sum_{s=1}^\tau g_s, \qquad V_\tau = \sum_{s=1}^\tau g_sg_s^\top. \tag{26}$$
Remark 1 (Connection to Matrix-FreeGrad). We note that without the barrier term $\Phi_{\mathrm{bar}}$ in (25), the iterates in (24) can be computed in closed form; in this case, the iterates exactly match those of the Matrix-FreeGrad algorithm by [18] for unconstrained Online Convex Optimization (the connection to FTRL was not made explicit in [18]). The advantage of adding a barrier $\Phi_{\mathrm{bar}}$ is that it ensures that the iterates $(w_t)$ are always in the feasible set without requiring any sophisticated constrained-to-unconstrained reductions that may lead to sub-optimal logarithmic terms in the regret [46] (see Remark 2 in the sequel).
Lemma 24 (Monotonicity of potential). Let $\sigma,\varepsilon>0$ and $\gamma>1$ be given. For all $g_t\in\mathbb{R}^d$ and $h_t>0$ such that $\|g_t\|\le h_t$, we have
$$\langle g_t,w_t\rangle \le \Psi^\star(G_{t-1};V_{t-1},h_t,\sigma,\varepsilon) - \Psi^\star(G_t;V_t,h_t,\sigma,\varepsilon), \tag{27}$$
where $G\mapsto\Psi^\star(G;V,h,\sigma,\varepsilon)$ denotes the Fenchel dual of $w\mapsto\Psi(w;V,h,\sigma,\varepsilon)$. The proof of the lemma is in Appendix H.3. By summing (27) over $t$ and using Fenchel duality, we obtain the following regret bound for the FTRL iterates in (24).
Theorem 25 (Regret with valid hints). Let $\sigma,\varepsilon>0$ and $\gamma>1$ be given.
The FTRL iterates $(w_t)$ in (24), in response to any sequence $(g_t)$ such that $\|g_t\|\le h_t$ for all $t\ge1$, satisfy, for all $T\in\mathbb{N}$ and $w\in\operatorname{int}W$:
$$\sum_{t=1}^T\langle g_t,w_t-w\rangle \le \varepsilon + \Phi_{\mathrm{bar}}^\star(0) + \Phi_{\mathrm{bar}}(w) + \sqrt{Q_T^w\,\ln_+\Big(\frac{\det(\sigma^{-2}\Sigma_T)}{\varepsilon^2}Q_T^w\Big)}, \tag{28}$$
where $\ln_+(\cdot) = 0\vee\ln(\cdot)$, $\Sigma_T = \sigma^2I+\gamma V_T$, and
$$Q_T^w = \max\Big\{w^\top\Sigma_Tw,\ \frac{h_T^2}{\rho(\gamma)^2}\ln\Big(\frac{\det(\sigma^{-2}\Sigma_T)\,h_T^2\,\|w\|^2}{\varepsilon^2\rho(\gamma)^2}\Big) + w^\top\Sigma_Tw\Big\}. \tag{29}$$
Remark 2 (Comparison to previous "full-matrix" bounds in the constrained setting). We note that, by having the $O(\log T)$ factor in (28) inside the square root, the bound in (28) improves on previous "full-matrix" bounds in the constrained setting [46], which have the log factor outside.
H.2 Implementation Considerations
As stated in Remark 1, if we remove $\Phi_{\mathrm{bar}}$ from the regularizer, then the iterates in (24) match those of Matrix-FreeGrad, which are available in closed form. Unfortunately, in the presence of $\Phi_{\mathrm{bar}}$ (which ensures that the iterates are always in the feasible set $W$), the iterate $w_t$ in (24) no longer admits a closed-form expression, and computing $w_t$, for $t\in[T]$, now requires solving a convex optimization problem. This is not ideal from a computational perspective; most first-order OCO algorithms require only $O(d)$ operations per round. It might be possible (at least in the case where $W$ is bounded) to efficiently approximate $(w_t)$ without solving an optimization problem at each step, and without sacrificing the regret by much, using Newton steps such as in the recent works of [47, 48, 49]. We leave this investigation for future work.
H.3 Proof of Lemma 24
Proof. By Lemma 23, we have that for all $V$ and $h$, $\Psi^\star(\cdot\,;V,h,\sigma,\varepsilon)$ satisfies
$$\Psi^\star(G;V,h,\sigma,\varepsilon) = \inf_{u\in\mathbb{R}^d}\ \Phi^\star(G-u;\sigma^2I+\gamma V,\rho(\gamma)^2/h^2,\sigma,\varepsilon) + \Phi_{\mathrm{bar}}^\star(u) = \inf_{\lambda\ge0,\,u\in\mathbb{R}^d}\ \frac{\varepsilon\exp\big(\frac12(G-u)^\top(\sigma^2I+\gamma V+\lambda I)^{-1}(G-u)+\frac{\lambda\rho(\gamma)^2}{h^2}\big)}{\sqrt{\det(I+\sigma^{-2}\gamma V)}} + \Phi_{\mathrm{bar}}^\star(u). \tag{30}$$
We will use this to prove (27). Let $(\lambda^\star,u^\star)\in\mathbb{R}_{\ge0}\times\mathbb{R}^d$ be the minimizer in the problem $\Psi^\star(G_{t-1};V_{t-1},h_t,\sigma,\varepsilon)$.
With this notation, we have wt = argmin w Rd Gt 1,w + Ψ( w;Vt 1,ht), = argmax w Rd Gt 1, w Ψ( w;Vt 1,ht), = argmax v Rd { Gt 1,v Ψ(v;Vt 1,ht)}, = Ψ (Gt 1;Vt 1,ht,σ,ε), and so by Lemma 26, = (σ2I + γVt 1 + λ I) 1(Gt 1 u ) Φ (Gt 1 u ;σ2I + γVt 1,ρ(γ)2/h2 t,σ,ε). (31) Moving forward, we define Gt 1, = Gt 1 u and Gt, = Gt u . To prove the lemmsa, it suffices to prove the stronger statement obtained by picking the sub-optimal choice (λ,u) = (λ ,u ) for the problem Ψ (Gt,Vt,ht,σ,ε); that is, 2G t 1, (σ2I + γVt 1 + λ I) 1Gt 1, + λ ρ(γ)2 det(I + σ 2γVt 1) + Φ bar( u ) 2G t, (σ2I + γVt + λ I) 1Gt, + λ ρ(γ)2 det(I + σ 2γVt) Φ bar( u ), = Φ (Gt 1, ;σ2I + γVt 1,ρ(γ)2/h2 t,σ,ε) Φ (Gt, ;σ2I + γVt,ρ(γ)2/h2 t,σ,ε), and so dividing by Φ (Gt 1, ;σ2I + γVt 1,ρ(γ)2/h2 t,σ,ε) and using (31), this becomes gt (σ2I + γVt 1 + λ I) 1Gt 1, 2G t, (σ2I + γVt + λ I) 1Gt, + λ ρ(γ)2 2 lndet(I + σ 2γVt)) 2G t 1, (σ2I + γVt 1 + λ I) 1Gt 1, + λ ρ(γ)2 2 lndet(I + σ 2γVt 1)) . Let us abbreviate Σ = σ2I + γVt 1 + λ I. The matrix determinant lemma and monotonicity of matrix inverse give ln det(I + σ 2γVt) det(I + σ 2γVt 1) = ln(1 + γg t (σ2I + γVt 1) 1gt) ln(1 + γg t Σ 1gt). Then Sherman-Morrison gives G t, (σ2I + γVt + λ I) 1Gt, = G t, Σ 1Gt, γ (g t Σ 1Gt, )2 1 + γg t Σ 1gt and splitting off the last round Gt, = Gt 1, + gt gives G t, (σ2I + γVt + λ I) 1Gt, = G t 1, Σ 1Gt 1, + 2G t 1, Σ 1gt + g t Σ 1gt γ(g t Σ 1Gt 1, )2 1 + γg t Σ 1gt . All in all, it suffices to show g t Σ 1Gt 1, 1 exp 2G t 1, Σ 1gt + g t Σ 1gt γ(g t Σ 1Gt 1, )2 2(1 + γg t Σ 1gt) 1 2 ln(1 + γg t Σ 1gt) . Introducing scalars r = g t Σ 1Gt 1, and z = g t Σ 1gt, this simplifies to r 1 exp(2r + z γr2 2(1 + γz) 1 2 ln(1 + γz)) Being a square, z 0 is positive. In addition, optimality of λ ensures that Σ 1Gt 1, = ρ(γ) follows from the fact that d dλ G t 1, (σ2I + γV + λI) 1Gt 1, λ=λ = Σ 1Gt 1, 2. In combination with gt ht, we find 2 = 1 e 1 2γ 1 2 < 1. 
(32) The above requirement may hence be further reorganized to 2r γr2 z + (1 + γz)(ln(1 + γz) + 2ln(1 + r)). The convex right hand side is minimized subject to z 0 at z = max 0, e 1 γ 1 2 ln(1+r) 1 so it remains to show 1 γ (1 + r) 2e 1 γ 1, if 1 γ 1 2ln(1 + r); 2ln(1 + r), otherwise. (33) Note that by (32), we have 2log(1 + r) 1 γ 1, and so the condition in the previous display reduces to the second case; that is, 2r γr2 2log(1 + r), r 1 e 1 2γ 1 which is satisfied for the hardest case, where r = e 1 2γ 1 H.4 Proof of Theorem 25 Proof. Fix w Rd. Using that Ψ (G;V,h,σ,ε) is decreasing in h, we can telescope (27) in Lemma 24 to obtain T t=1 g t wt Ψ (0;0,h1,σ,ε) Ψ (GT ;VT ,h T ,σ,ε) By (30), we have Ψ (0;0,h1,σ,ε) ε + Φ bar(0), yielding: T t=1 g t wt ε + Φ bar(0) Ψ (GT ;VT ,h T ,σ,ε), ε + Φ bar(0) + inf u Rd GT ,u + Ψ( u;VT ,h T ,σ,ε), = ε + Φ bar(0) + inf u Rd GT ,u + Φ( u;σ2I + γVT ,ρ(γ)2/h2 T ,σ,ε) + Φbar(u), ε + Φ bar(0) + GT ,w + Φ( w;σ2I + γVT ,ρ(γ)2/h2 T ,σ,ε) + Φbar(w), (setting u = w) = ε + Φ bar(0) + GT ,w + sup λ 0 w (ΣT + λI)w X(w (ΣT + λI)we λZT det(σ 2ΣT ) + Φbar(w), (35) where ΣT = σ2I + γVT and ZT = ρ(γ)2/h2 T . Zero derivative of the above objective for λ occurs at ZT 2ZT w ΣT w and hence the optimum for λ is either at that point or at zero, whichever is higher, with the crossover point at w 2 ZT = w ΣT w. Plugging that in, we find that for C = w 2 ZT , we have w (ΣT + λI)w X(w (ΣT + λI)we λZT det(σ 2ΣT ) 1 2(C + w ΣT w) X 1 2(C + w ΣT w)e ln w 2 ZT 2 + ZT w ΣT w 2 w 2 det(σ 2ΣT ) ε2 , if C w ΣT w; w ΣT w X(w ΣT w det(σ 2ΣT ) ε2 ), otherwise. Qw T X(det(σ 2ΣT ) ε2 Qw T ), (36) where Qw T = max{w ΣT w, 1 ZT + w ΣT w)}; in the last inequality, we used that X(θ) is increasing to drop the exponential in its argument. Combining (36) with (35) and using that X(θ) ln+(θ) (see Lemma 27), we obtain the desired bound. H.5 Helper Lemmas for Full-Matrix Analysis Lemma 26. Let W Rd and Y R. 
Further, let $f:X\times Y\to\mathbb{R}$ be a differentiable function such that for all $x\in X$, the problem $\inf_{y\in Y}f(x,y)$ has a unique minimizer $y(x)$. Then:
$$\nabla_x\Big(\inf_{y\in Y}f(x,y)\Big) = \nabla_xf(x,y(x)). \tag{37}$$
Lemma 27. For $\theta\ge0$, define $X(\theta) = \sup_\alpha\ \alpha - e^{\frac{\alpha^2-\ln\theta}{2}}$. Then $X(\theta) = (W(\theta))^{1/2}-(W(\theta))^{-1/2} = \sqrt{\ln\theta}\,(1+o(1))$.
Proof. The fact that $X(\theta) = (W(\theta))^{1/2}-(W(\theta))^{-1/2}$ follows from [50, Lemma 18]. Recall that $\sup_x\ yx-e^x = y\ln y-y$, so that
$$X(\theta) = \sup_\alpha\inf_\eta\ \alpha - \eta\Big(\frac{\alpha^2-\ln\theta}{2}\Big) + \eta\ln\eta - \eta = \inf_\eta\ \frac{1}{2\eta} + \frac{\eta}{2}\ln\theta + \eta\ln\eta - \eta,$$
where we used $\sup_\alpha\ \alpha-\frac{\eta\alpha^2}{2} = \frac{1}{2\eta}$. Plugging in the sub-optimal choice $\eta = 1/\sqrt{\ln(e^{e^{-2}}+\theta)}$, we find
$$X(\theta) \le \frac{\sqrt{\ln(e^{e^{-2}}+\theta)}}{2} + \frac{\ln\theta - \ln\ln(e^{e^{-2}}+\theta) - 2}{2\sqrt{\ln(e^{e^{-2}}+\theta)}}.$$
Note that $e^{e^{-2}} = 1.14492\ldots$, which is less than $2$.
I Complete Pseudocode for Regularized 1-Dimensional Learning
In Algorithm 6, we provide a self-contained implementation of an algorithm for regularized online learning (Protocol 2). The algorithm is obtained by combining Algorithm 3 with Algorithm 2.
I.1 Efficient Projections for $\psi(z) = z^2$
Our algorithms for regularized online learning via epigraphs (Protocol 3) require projections onto the set $\{(x,y):y\ge\psi(x)\}$. While in general this projection may be expensive, for simple functions $\psi$ of interest, such as $\psi(z)=z^2$, this projection is relatively straightforward. In the following, we provide a formula for this projection that is easy to compute (if a little ungainly to look at).
Proposition 28. Let $\psi:\mathbb{R}\to\mathbb{R}$ be given by $\psi(x)=x^2$. Define the norm $\|(x,y)\|^2 = G^2x^2+\gamma^2y^2$, the function $S(\hat x,\hat y) = \inf_{y\ge\psi(x)}\|(x,y)-(\hat x,\hat y)\|$, and the projection $P(\hat x,\hat y) = \operatorname{argmin}_{y\ge\psi(x)}\|(x,y)-(\hat x,\hat y)\|$. Then for any $\hat y<\psi(\hat x)$, we have $P(\hat x,\hat y)=(x,y)$ with $y=x^2$ and:
$$x = -\frac{2^{1/3}(G^2-2\gamma^2\hat y)}{Z^{1/3}} + \frac{Z^{1/3}}{3\cdot2^{4/3}\gamma^2}, \qquad Z = 108G^2\gamma^4\hat x + 2\sqrt{2916\,G^4\gamma^8\hat x^2 + (6G^2\gamma^2-12\gamma^4\hat y)^3}.$$
Moreover,
$$\nabla S(\hat x,\hat y) = \Big(\frac{G^2(\hat x-x)}{\sqrt{G^2(x-\hat x)^2+\gamma^2(y-\hat y)^2}},\ \frac{\gamma^2(\hat y-y)}{\sqrt{G^2(x-\hat x)^2+\gamma^2(y-\hat y)^2}}\Big).$$
Proof. Since $(x,y)$ is on the boundary of the constraint, we clearly have $y=x^2$. Note that $(x,y) = \operatorname{argmin}_{y\ge\psi(x)}\|(x,y)-(\hat x,\hat y)\|^2$.
Thus, by Lagrange multipliers, for some $\lambda$ we have
$$2G^2(x - \hat x) = \lambda \psi'(x) = 2\lambda x, \qquad 2\gamma^2(y - \hat y) = -\lambda.$$
This implies:
$$x = \frac{G^2 \hat x}{G^2 - \lambda}.$$
Moreover, we must also have $y = x^2$, so that:
$$\frac{G^4\hat x^2}{(G^2 - \lambda)^2} = \hat y - \frac{\lambda}{2\gamma^2}, \quad\text{i.e.,}\quad \Bigl(\hat y - \frac{\lambda}{2\gamma^2}\Bigr)(G^2 - \lambda)^2 - G^4\hat x^2 = 0.$$
This is clearly a cubic equation in $\lambda$, and so we can apply the cubic formula (via Mathematica) and substitute back into $x = \frac{G^2\hat x}{G^2 - \lambda}$ to obtain:
$$x = \frac{Z^{1/3}}{6 \cdot 2^{1/3}\,\gamma^2} - \frac{2^{1/3}\,(G^2 - 2\gamma^2\hat y)}{Z^{1/3}}, \qquad Z = 108\,G^2\gamma^4\hat x + 2\sqrt{2916\,G^4\gamma^8\hat x^2 + \bigl(6G^2\gamma^2 - 12\gamma^4\hat y\bigr)^3},$$
and $y = x^2$. The expression for $\nabla S(\hat x, \hat y)$ follows directly from [16, Theorem 4].

Algorithm 6: Regularized 1-dimensional learner (REG) for Protocol 2

Input: non-negative convex function $\psi : \mathbb{R} \to \mathbb{R}$; parameters $\gamma > 0$, $p \in [0, 1/2]$, $\epsilon_x > 0$ and $\epsilon_\psi > 0$.
Initialize $k = 3$.
If $p = 1/2$ then define the constant $c = 3$; else define the constant $c = 1$.
For $t = 1, \dots, T$:
- Receive $h_t \ge h_{t-1} \in \mathbb{R}$. Set $h^x_t = 3h_t$ and $h^y_t = 3\gamma$.
- Define $V^x_t = 9h_t^2 + \sum_{i=1}^{t-1}(g^x_i)^2$ and $V^y_t = 9\gamma^2 + \sum_{i=1}^{t-1}(g^y_i)^2$.
- If $p = 1/2$, set
  $$\alpha^x_t = \frac{\epsilon_x}{\sqrt{c + \sum_{i=1}^{t-1}(g^x_i)^2/(h^x_i)^2}\;\log^2\bigl(c + \sum_{i=1}^{t-1}(g^x_i)^2/(h^x_i)^2\bigr)}, \qquad \alpha^y_t = \frac{\epsilon_\psi}{\sqrt{c + \sum_{i=1}^{t-1}(g^y_i)^2/(h^y_i)^2}\;\log^2\bigl(c + \sum_{i=1}^{t-1}(g^y_i)^2/(h^y_i)^2\bigr)};$$
  otherwise, define
  $$\alpha^x_t = \frac{\epsilon_x}{\bigl(c + \sum_{i=1}^{t-1}(g^x_i)^2/(h^x_i)^2\bigr)^p}, \qquad \alpha^y_t = \frac{\epsilon_\psi}{\bigl(c + \sum_{i=1}^{t-1}(g^y_i)^2/(h^y_i)^2\bigr)^p}.$$
- Define
  $$\Theta^x_t = \begin{cases} \frac{\bigl(\sum_{i=1}^{t-1} g^x_i\bigr)^2}{4k^2 V^x_t}, & \text{if } \bigl|\sum_{i=1}^{t-1} g^x_i\bigr| \le \frac{2k V^x_t}{h^x_t};\\[2pt] \frac{\bigl|\sum_{i=1}^{t-1} g^x_i\bigr|}{k h^x_t} - \frac{V^x_t}{(h^x_t)^2}, & \text{otherwise,}\end{cases} \qquad \Theta^y_t = \begin{cases} \frac{\bigl(\sum_{i=1}^{t-1} g^y_i\bigr)^2}{4k^2 V^y_t}, & \text{if } \bigl|\sum_{i=1}^{t-1} g^y_i\bigr| \le \frac{2k V^y_t}{h^y_t};\\[2pt] \frac{\bigl|\sum_{i=1}^{t-1} g^y_i\bigr|}{k h^y_t} - \frac{V^y_t}{(h^y_t)^2}, & \text{otherwise.}\end{cases}$$
- Set $\hat x_t = \operatorname{sign}\bigl(\sum_{i=1}^{t-1} g^x_i\bigr)\,\alpha^x_t\bigl(\exp(\Theta^x_t) - 1\bigr)$ and $\hat y_t = \operatorname{sign}\bigl(\sum_{i=1}^{t-1} g^y_i\bigr)\,\alpha^y_t\bigl(\exp(\Theta^y_t) - 1\bigr)$.
- Define the norm $\|(x, y)\|_t^2 = h_t^2 x^2 + \gamma^2 y^2$, with dual norm $\|(g, a)\|_{\star,t}^2 = \frac{g^2}{h_t^2} + \frac{a^2}{\gamma^2}$.
- Define $S_t(\hat x, \hat y) = \inf_{(x,y):\, y \ge \psi(x)} \|(x, y) - (\hat x, \hat y)\|_t$.
- Compute $(x_t, y_t) = \operatorname{argmin}_{(x,y):\, y \ge \psi(x)} \|(x, y) - (\hat x_t, \hat y_t)\|_t$.
- Output $w_t = x_t$; receive feedback $g_t \in [-h_t, h_t]$ and $a_t \in [0, \gamma]$, such that $a_t = 0$ unless $g_t = h_t$.
- Compute $(\delta^x_t, \delta^y_t) = \|(g_t, a_t)\|_{\star,t}\, \nabla S_t(\hat x_t, \hat y_t)$.
- Set $g^x_t = g_t + \delta^x_t$ and $g^y_t = a_t + \delta^y_t$.
End for.
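Proposition 28's closed-form expression is somewhat unwieldy, so it can be useful to cross-check it numerically. The sketch below is an illustration, not the paper's method: it assumes the weighted norm $\|(x,y)\|^2 = G^2x^2 + \gamma^2y^2$ and the epigraph constraint $y \ge x^2$ as stated above, and replaces the cubic formula with a grid search plus Newton refinement on the one-dimensional boundary objective. The function name `project_to_epigraph` is our own.

```python
import math

def project_to_epigraph(xh, yh, G=1.0, gamma=1.0):
    """Project (xh, yh) with yh < xh**2 onto the epigraph {(x, y): y >= x**2}
    under the weighted norm ||(x, y)||^2 = G^2 x^2 + gamma^2 y^2.

    The projection lies on the boundary y = x^2, so it suffices to minimize
    phi(x) = G^2 (x - xh)^2 + gamma^2 (x^2 - yh)^2 over x; the stationary
    points solve the cubic 2 gamma^2 x^3 + (G^2 - 2 gamma^2 yh) x - G^2 xh = 0.
    """
    phi = lambda x: G**2 * (x - xh)**2 + gamma**2 * (x**2 - yh)**2
    # Coarse grid search to locate the basin of the global minimizer.
    B = 1.0 + abs(xh) + math.sqrt(abs(yh))
    x = min((-B + 2 * B * i / 2000 for i in range(2001)), key=phi)
    # Newton refinement on phi'(x) = 0.
    for _ in range(100):
        d1 = 2 * G**2 * (x - xh) + 4 * gamma**2 * x * (x**2 - yh)
        d2 = 2 * G**2 + 4 * gamma**2 * (3 * x**2 - yh)
        if d2 <= 0:
            break
        step = d1 / d2
        x -= step
        if abs(step) < 1e-14:
            break
    return x, x * x
```

For instance, projecting $(\hat x, \hat y) = (2, 1)$ with $G = \gamma = 1$ returns a point on the parabola whose first coordinate solves $2x^3 - x - 2 = 0$ (about $1.165$), matching the closed form in Proposition 28.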
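Lemma 27's Lambert-W identity can likewise be sanity-checked numerically. The sketch below assumes the definition $X(\theta) = \sup_{\alpha \ge 0}\{\alpha - e^{\alpha^2/2}/\theta\}$ (our reading of the lemma; the paper's exact normalization may differ) and compares a direct grid maximization against the expression $W(\theta^2)^{1/2} - W(\theta^2)^{-1/2}$, with $W$ computed by Newton's method to avoid a SciPy dependency.

```python
import math

def lambert_w(v, iters=80):
    """Principal branch W(v) for v > 0, solving w * e^w = v by Newton's method."""
    w = math.log(1.0 + v)  # reasonable starting point for v > 0
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - v) / (ew * (1.0 + w))
    return w

def X_direct(theta, n=200_000, hi=6.0):
    """Grid maximization of alpha - exp(alpha^2 / 2) / theta over [0, hi]."""
    return max(a - math.exp(a * a / 2.0) / theta
               for a in (hi * i / n for i in range(n + 1)))

def X_closed(theta):
    """Closed form W(theta^2)^(1/2) - W(theta^2)^(-1/2), as in Lemma 27."""
    w = lambert_w(theta * theta)
    return math.sqrt(w) - 1.0 / math.sqrt(w)
```

Under this definition the two evaluations agree to several decimal places for moderate $\theta$, and $X(\theta)$ crosses zero at $\theta = \sqrt{e}$, where $W(\theta^2) = 1$.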
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The main claims do accurately reflect the paper's contribution.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Yes, see the discussion section.
Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach.
- For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: We provide a proof-sketch in the main paper and detailed proofs in the appendix.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: [NA]
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: This paper has only mathematical content. There are no experiments in this paper.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: This paper has only mathematical content. There are no experiments in this paper.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: [NA]
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: This paper has only mathematical content. There are no experiments in this paper.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: This paper does conform to the code of ethics.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [No]
Justification: This paper provides a purely mathematical contribution. As such, it is subject to the standard ethical concerns present for all mathematical papers, but no further ones.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This paper has only mathematical content. There are no experiments in this paper.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: This paper has only mathematical content.
There are no experiments in this paper.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: This paper has only mathematical content. There are no experiments in this paper.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14.
Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper has only mathematical content. There are no experiments in this paper.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper has only mathematical content. There are no experiments in this paper.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.