Optimal Multiclass U-Calibration Error and Beyond

Haipeng Luo (University of Southern California, haipengl@usc.edu), Spandan Senapati (University of Southern California, ssenapat@usc.edu), Vatsal Sharan (University of Southern California, vsharan@usc.edu)*

*Author ordering is alphabetical.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

We consider the problem of online multiclass U-calibration, where a forecaster aims to make sequential distributional predictions over $K$ classes with low U-calibration error, that is, low regret with respect to all bounded proper losses simultaneously. Kleinberg et al. (2023) developed an algorithm with U-calibration error $O(K\sqrt{T})$ after $T$ rounds and raised the open question of what the optimal bound is. We resolve this question by showing that the optimal U-calibration error is $\Theta(\sqrt{KT})$: we start with a simple observation that the Follow-the-Perturbed-Leader algorithm of Daskalakis and Syrgkanis (2016) achieves this upper bound, followed by a matching lower bound constructed with a specific proper loss (which, as a side result, also proves the optimality of the algorithm of Daskalakis and Syrgkanis (2016) in the context of online learning against an adversary with finite choices). We also strengthen our results under natural assumptions on the loss functions, including $\Theta(\log T)$ U-calibration error for Lipschitz proper losses, $O(\log T)$ U-calibration error for a certain class of decomposable proper losses, U-calibration error bounds for proper losses with a low covering number, and others.

1 Introduction

We consider the fundamental problem of making sequential probabilistic predictions over an outcome (e.g., predicting the probability of tomorrow's weather being sunny, cloudy, or rainy). Specifically, at each time $t = 1, \dots, T$, a forecaster/learner predicts $p_t \in \Delta_K$, where $\Delta_K$ denotes the probability simplex over $K$ outcomes. At the same time, an adversary decides the true outcome, encoded by a one-hot vector $y_t \in \mathcal{E} := \{e_1, \dots, e_K\}$, where $e_i$ denotes the $i$-th standard basis vector of $\mathbb{R}^K$. The forecaster observes $y_t$ at the end of time $t$. A popular approach to measure the performance of a forecaster is to measure her regret against the best fixed prediction in hindsight. Fixing some loss function $\ell: \Delta_K \times \mathcal{E} \to \mathbb{R}$, the regret of the forecaster's predictions with respect to $\ell$ is defined as $\mathrm{REG}_\ell := \sum_{t=1}^T \ell(p_t, y_t) - \inf_{p \in \Delta_K} \sum_{t=1}^T \ell(p, y_t)$.

Perhaps the most common class of loss functions to evaluate a forecaster is the class of proper loss functions. A loss function is proper if $\mathbb{E}_{y\sim p}[\ell(p, y)] \le \mathbb{E}_{y\sim p}[\ell(p', y)]$ for all $p, p' \in \Delta_K$. Hence proper loss functions incentivize the forecaster to predict the true probability of the outcome (to the best of her knowledge). We focus on proper loss functions in this work.

Note, however, that regret is measured with respect to a specific loss function $\ell$. It is unclear which proper loss one should minimize for the specific application at hand, and there could even be multiple applications with different loss functions that use the forecaster's prediction. Could it be possible for a forecaster to simultaneously enjoy low regret with respect to all proper loss functions? This question was raised in the interesting recent work of Kleinberg et al. (2023). They propose the notion of U-calibration error $\mathrm{UCal}_{\mathcal{L}} := \mathbb{E}[\sup_{\ell \in \mathcal{L}} \mathrm{REG}_\ell]$ (and a weaker version, pseudo U-calibration error $\mathrm{PUCal}_{\mathcal{L}} := \sup_{\ell \in \mathcal{L}} \mathbb{E}[\mathrm{REG}_\ell]$) for a family of proper loss functions $\mathcal{L}$.
A forecaster with low U-calibration error thus enjoys good performance with respect to all loss functions in $\mathcal{L}$ simultaneously. Unless explicitly mentioned, we let $\mathcal{L}$ denote the set of all bounded (in $[-1, 1]$) proper losses (as in Kleinberg et al. (2023)), and drop the subscript in $\mathrm{UCal}_{\mathcal{L}}$ and $\mathrm{PUCal}_{\mathcal{L}}$ for convenience.

The simplest way to get low U-calibration error is via the classical notion of low calibration error (Dawid, 1982), defined as $\mathrm{Cal} := \sum_{p} \big\|\sum_{t: p_t = p} (p - y_t)\big\|_1$. Intuitively, a forecaster with low calibration error guarantees that whenever she makes a prediction $p$, the empirical distribution of the true outcome is indeed close to $p$. Kleinberg et al. (2023) prove that $\mathrm{PUCal} \le \mathrm{UCal} = O(\mathrm{Cal})$, and thus a well-calibrated forecaster must have small U-calibration error. However, getting low calibration error is difficult and faces known barriers: the best existing upper bound on $\mathrm{Cal}$ is $O(T^{K/(K+1)})$ (Blum et al., 2008), and there is an $\Omega(T^{0.528})$ lower bound (for $K = 2$) (Qiao and Valiant, 2021). Therefore, a natural question to ask is whether it is possible to side-step calibration and directly get low U-calibration error. Kleinberg et al. (2023) answer this in the affirmative, and show that there exist simple and efficient algorithms with $\mathrm{UCal} = O(\sqrt{T})$ for $K = 2$ and $\mathrm{PUCal} = O(K\sqrt{T})$ for general $K$. This provides a strong decision-theoretic motivation for considering U-calibration error as opposed to calibration error; we refer the reader to Kleinberg et al. (2023) for further discussion.

Following up on Kleinberg et al. (2023), this paper addresses the following question that was left open in their work: What is the minimax optimal multiclass U-calibration error? We give a complete answer to this question (regarding PUCal) by showing matching upper and lower bounds. Moreover, we identify several broad sub-classes of proper losses for which much smaller U-calibration error is possible. Concretely, our contributions are as follows.

1.1 Contributions and Technical Overview

First, we show that the minimax optimal value of PUCal is $\Theta(\sqrt{KT})$.

In Section 3.1, we start by showing that a simple modification to the noise distribution of the Follow-the-Perturbed-Leader (FTPL) algorithm of Kleinberg et al. (2023) improves their $\mathrm{PUCal} = O(K\sqrt{T})$ bound to $O(\sqrt{KT})$. In fact, our algorithm coincides with that of Daskalakis and Syrgkanis (2016), designed for an online learning setting with a fixed loss function and an adversary with only finitely many choices. The reason it works for all proper losses simultaneously in our problem is that, for any set of outcomes, the empirical risk minimizer with respect to any proper loss is always the average of the outcomes (cf. property (1)).

We then show in Section 3.2 that there exists one particular proper loss $\ell$ such that any algorithm has to suffer $\mathrm{REG}_\ell = \Omega(\sqrt{KT})$ in the worst case, hence implying $\mathrm{PUCal} = \Omega(\sqrt{KT})$. While our proof follows a standard randomized argument, the novelty lies in the construction of the proper loss and the use of an anti-concentration inequality to bound the expected loss of the benchmark. We remark that, as a side result, our lower bound also implies the optimality of the FTPL algorithm of Daskalakis and Syrgkanis (2016) in their setting, which to our knowledge was not known before.

While Kleinberg et al. (2023) only consider PUCal for general $K$, we take a step forward and further study the stronger measure UCal (recall $\mathrm{PUCal} \le \mathrm{UCal}$). We start by showing an upper bound on $\mathrm{UCal}_{\mathcal{L}'}$ for the same FTPL algorithm and for any loss class $\mathcal{L}'$ with a finite covering number.
Then, we consider an even simpler algorithm, Follow-the-Leader (FTL), which is deterministic and makes UCal and PUCal trivially equal, and identify two broad classes of loss functions where FTL achieves logarithmic U-calibration error, an exponential improvement over the worst case:

In Section 4.1, we show that for the class $\mathcal{L}_G$ of $G$-Lipschitz bounded proper losses (which includes standard losses such as the squared loss and the spherical loss), FTL ensures $\mathrm{PUCal}_{\mathcal{L}_G} = \mathrm{UCal}_{\mathcal{L}_G} = O(G \log T)$. We further show that all algorithms must suffer $\mathrm{PUCal}_{\mathcal{L}_G} = \Omega(\log T)$. While we prove this lower bound using the standard squared loss, which is known to admit $\Theta(\log T)$ regret in many online learning settings (e.g., Abernethy et al. (2008)), to our knowledge it has not been studied in our setting where the learner's decision set is a simplex and the adversary has finitely many choices. Indeed, our proof is also substantially different from Abernethy et al. (2008) and is one of the most technical contributions of our work.

Next, in Section 4.2, we identify a class $\mathcal{L}_{\mathrm{dec}}$ of losses that are decomposable over the $K$ outcomes and additionally satisfy some mild regularity conditions, and show that FTL again achieves $\mathrm{PUCal} = \mathrm{UCal} = O(\log T)$ (ignoring other dependence). This class includes losses induced by a certain family of Tsallis entropy that are not Lipschitz. The key idea of our proof is to show that even though the loss might not be Lipschitz, its gradient grows at a controlled rate.

Given these positive results on FTL, one might wonder whether FTL is generally a good algorithm for all proper losses. We answer this question in the negative in Section 4.3 by showing that there exists a bounded proper loss for which the regret of FTL is $\Omega(T)$. This highlights the need to use FTPL if one cares about all proper losses (or at least losses not in $\mathcal{L}_G$ or $\mathcal{L}_{\mathrm{dec}}$).

1.2 Related Work

For calibration, Foster and Vohra (1998) proposed the first algorithm for the binary setting with an (expected) $O(T^{2/3})$ calibration error (see also Blum and Mansour (2007) and Hart (2022) for a different proof of the result). In the multiclass setting, Blum et al. (2008) have shown an $O(T^{K/(K+1)})$ calibration error. Several works have studied other variants of calibration error, such as the recently proposed Distance to Calibration (Błasiok et al., 2023; Qiao and Zheng, 2024; Arunachaleswaran et al., 2024); see the references therein for other, earlier variants.

A recent research trend, initiated by Gopalan et al. (2022), has centered around the concept of simultaneous loss minimization, also known as omniprediction. Garg et al. (2024) study an online adversarial version of it, and U-calibration can be seen as a special non-contextual case of their setting with only proper losses considered. Their results, however, are not applicable here for multiple reasons: for example, they consider only the binary case ($K = 2$), and their algorithm is either designed only for Lipschitz convex loss functions or computationally inefficient. We also note that omniprediction has been shown to have a surprising connection with multicalibration (Hébert-Johnson et al., 2018), a multi-group fairness notion, making it an increasingly important topic (Gopalan et al., 2022; Błasiok et al., 2024; Gopalan et al., 2023a,b).

2 Preliminaries

Notation: We use lowercase bold letters to denote vectors. $\mathbb{N}$ and $\mathbb{N}_0$ denote the sets of positive and non-negative integers respectively. For any $m \in \mathbb{N}$, $[m]$ denotes the index set $\{1, \dots, m\}$.
We use $\Delta_K$ to denote the $(K-1)$-dimensional simplex, i.e., $\Delta_K := \{p \in \mathbb{R}^K \mid p_i \ge 0, \sum_{i=1}^K p_i = 1\}$. The $i$-th standard basis vector (dimension inferred from the context) is denoted by $e_i$, and we use $\mathcal{E}$ to represent the set $\{e_1, \dots, e_K\}$ of all basis vectors of $\mathbb{R}^K$. By default, $\|\cdot\|$ denotes the $\ell_2$ norm.

Proper Losses: Throughout the paper, we consider the class of bounded proper losses $\mathcal{L} := \{\ell: \Delta_K \times \mathcal{E} \to [-1, 1] \mid \ell \text{ is proper}\}$ or a subset of it. We emphasize that convexity (in the first argument) is never needed in our results. As mentioned, a loss $\ell$ is proper if predicting the true distribution from which the outcome is sampled gives the smallest loss in expectation, that is, $\mathbb{E}_{y\sim p}[\ell(p, y)] \le \mathbb{E}_{y\sim p}[\ell(p', y)]$ for all $p, p' \in \Delta_K$. For a proper loss $\ell$, we refer to $\ell(p, y)$ as its bivariate form. The univariate form of $\ell$ is defined as $\bar\ell(p) := \mathbb{E}_{y\sim p}[\ell(p, y)]$. It turns out that a loss is proper only if its univariate form is concave. Moreover, one can construct a proper loss from a concave univariate form based on the following characterization lemma.

Lemma 1 (Theorem 2 of Gneiting and Raftery (2007)). A loss $\ell: \Delta_K \times \mathcal{E} \to \mathbb{R}$ is proper if and only if there exists a concave function $f$ such that $\ell(p, y) = f(p) + \langle g_p, y - p \rangle$ for all $p \in \Delta_K, y \in \mathcal{E}$, where $g_p$ denotes a subgradient of $f$ at $p$. Also, $f$ is the univariate form of $\ell$.

We provide several examples of proper losses below:

The spherical loss is $\ell(p, y) = -\frac{\langle p, y\rangle}{\|p\|}$, which is $\sqrt{K}$-Lipschitz (Proposition B.1) but non-convex in $p$. Its univariate form is $\bar\ell(p) = -\|p\|$.

The squared loss (also known as the Brier score) is $\ell(p, y) = \frac{1}{2}\|p - y\|^2$, which is clearly 2-Lipschitz and convex in $p$. Its univariate form is $\bar\ell(p) = \frac{1}{2}(1 - \|p\|^2)$.

Generalizing the squared loss, we consider the univariate form $\bar\ell(p) = -\bar c_K \sum_{i=1}^K p_i^\alpha$ for $\alpha > 1$ and some constant $\bar c_K > 0$ (the scaling constant $\bar c_K$ is such that $\ell(p, y) \in [-1, 1]$), which is the Tsallis entropy and is concave. The induced proper loss is $\ell(p, y) = \bar c_K(\alpha - 1)\sum_{i=1}^K p_i^\alpha - \bar c_K \alpha \sum_{i=1}^K p_i^{\alpha-1} y_i$, which is not Lipschitz for $\alpha \in (1, 2)$.

The following fact is critical for U-calibration: for any $n \in \mathbb{N}$ and any sequence of outcomes $y_1, \dots, y_n \in \mathcal{E}$, the mean forecaster is always an empirical risk minimizer for any proper loss $\ell$ (the proof is by definition and included in Appendix B for completeness):

$$\frac{1}{n}\sum_{j=1}^n y_j \in \operatorname*{argmin}_{p \in \Delta_K} \sum_{j=1}^n \ell(p, y_j). \quad (1)$$

Problem Setting: As mentioned, the problem we study follows this protocol: at each time $t = 1, \dots, T$, a forecaster predicts a distribution $p_t \in \Delta_K$ over $K$ possible outcomes, and at the same time, an adversary decides the true outcome, encoded by a one-hot vector $y_t \in \mathcal{E}$, which is revealed to the forecaster at the end of time $t$. For a fixed proper loss function $\ell$, the regret of the forecaster is defined as $\mathrm{REG}_\ell := \sum_{t=1}^T \ell(p_t, y_t) - \inf_{p \in \Delta_K} \sum_{t=1}^T \ell(p, y_t)$, which, according to property (1), can be written as $\sum_{t=1}^T \ell(p_t, y_t) - \sum_{t=1}^T \ell(\beta, y_t)$, where $\beta := \frac{1}{T}\sum_{t=1}^T y_t$ is simply the empirical average of all outcomes. Our goal is to ensure low regret against a class of proper losses simultaneously. We define U-calibration error as $\mathrm{UCal}_{\mathcal{L}} = \mathbb{E}[\sup_{\ell \in \mathcal{L}} \mathrm{REG}_\ell]$ and pseudo U-calibration error as $\mathrm{PUCal}_{\mathcal{L}} = \sup_{\ell \in \mathcal{L}} \mathbb{E}[\mathrm{REG}_\ell]$ for a family of loss functions $\mathcal{L}$. Unless explicitly mentioned, $\mathcal{L}$ denotes the set of all bounded (in $[-1, 1]$) proper losses and is dropped from the subscripts for convenience.
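For illustration, here is a minimal Python sketch of property (1): it checks numerically that the empirical mean of the outcomes minimizes the empirical risk for two of the proper losses above. The sampled candidate grid, seed, and problem sizes are arbitrary choices made for this sketch.

```python
import numpy as np

# A minimal numerical check of property (1): for any proper loss, the
# empirical average of the outcomes minimizes the empirical risk.
# The two losses below follow the examples in the text; the random
# candidate grid over the simplex is purely illustrative.
rng = np.random.default_rng(0)
K, n = 3, 40
Y = np.eye(K)[rng.integers(0, K, size=n)]             # one-hot outcomes

sq = lambda p, y: 0.5 * np.sum((p - y) ** 2)          # squared loss
sph = lambda p, y: -np.dot(p, y) / np.linalg.norm(p)  # spherical loss

def emp_risk(loss, p):
    return sum(loss(p, y) for y in Y)

cands = rng.dirichlet(np.ones(K), size=20000)         # candidate points
beta = Y.mean(axis=0)                                 # empirical mean

for loss in (sq, sph):
    best = min(emp_risk(loss, p) for p in cands)
    assert emp_risk(loss, beta) <= best + 1e-9        # mean is (near-)optimal
print("empirical mean minimizes both losses; beta =", beta)
```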
Oblivious Adversary versus Adaptive Adversary: As is standard in online learning, an oblivious adversary decides all outcomes $y_1, \dots, y_T$ ahead of time with the knowledge of the forecaster's algorithm (but not her random seeds), while an adaptive adversary decides each $y_t$ with the knowledge of the past forecasts $p_1, \dots, p_{t-1}$. Except for one result (the upper bound on UCal for a class with a low covering number), all our upper bounds hold for the stronger adaptive adversary, and all our lower bounds hold for the weaker oblivious adversary (which makes the lower bounds stronger).

3 Optimal U-Calibration Error

In this section, we prove that the minimax optimal pseudo U-calibration error is $\Theta(\sqrt{KT})$.

3.1 Algorithm

As mentioned, our algorithm makes a simple change to the noise distribution of the FTPL algorithm of Kleinberg et al. (2023) and in fact coincides with the algorithm of Daskalakis and Syrgkanis (2016), designed for a different setting. To this end, we start by reviewing their setting and algorithm. Specifically, Daskalakis and Syrgkanis (2016) consider the following online learning problem: at each time $t \in [T]$, a learner chooses an action $a_t \in \mathcal{A}$ for some action set $\mathcal{A}$; at the same time, an adversary selects an outcome $\theta_t$ from a finite set $\Theta := \{\hat\theta_1, \dots, \hat\theta_K\}$ of size $K$; finally, the learner observes $\theta_t$ and incurs loss $h(a_t, \theta_t)$ for some arbitrary loss function $h: \mathcal{A} \times \Theta \to [-1, 1]$ that is fixed and known to the learner. Daskalakis and Syrgkanis (2016) propose the following FTPL algorithm: at each time $t$, randomly generate a set of hallucinated outcomes, where the number of each possible outcome $\hat\theta_i$ for $i \in [K]$ independently follows a geometric distribution with parameter $\sqrt{K/T}$, and then output as the final action $a_t$ the empirical risk minimizer over both the true outcomes and the hallucinated outcomes. Formally, $a_t \in \operatorname{argmin}_{a \in \mathcal{A}} \sum_{i=1}^K Y_{t,i}\, h(a, \hat\theta_i)$, where $Y_{t,i} = |\{s < t \mid \theta_s = \hat\theta_i\}| + m_{t,i}$ and $m_{t,i}$ is an i.i.d. sample of a geometric distribution with parameter $\sqrt{K/T}$.

Algorithm 1: FTPL with geometric noise for U-calibration
1: for $t = 1, \dots, T$:
2:   For each $i \in [K]$, sample $m_{t,i}$ independently from a geometric distribution with parameter $\sqrt{K/T}$, and compute $Y_{t,i} = |\{s < t \mid y_s = e_i\}| + m_{t,i}$.
3:   Predict $p_t \in \Delta_K$ such that $p_{t,i} = Y_{t,i}/\sum_{k=1}^K Y_{t,k}$, and observe the true outcome $y_t \in \mathcal{E}$.
4: end for

This simple algorithm enjoys the following regret guarantee.

Theorem 1 (Appendix F.3 of Daskalakis and Syrgkanis (2016)). The FTPL algorithm described above satisfies the regret bound $\mathbb{E}\big[\sum_{t=1}^T h(a_t, \theta_t) - \inf_{a \in \mathcal{A}} \sum_{t=1}^T h(a, \theta_t)\big] \le 4\sqrt{KT}$, where the expectation is taken over the randomness of both the algorithm and the adversary.

Now we are ready to discuss how to apply their algorithm to our multiclass U-calibration problem. Naturally, we take $\mathcal{A} = \Delta_K$ and $\Theta = \mathcal{E}$. What $h$ should we use when we care about all proper losses? It turns out that this does not matter (an observation already made by Kleinberg et al. (2023)): according to property (1), the mean forecaster taking into account both the true outcomes and the hallucinated ones (that is, $p_{t,i} = Y_{t,i}/\sum_{k=1}^K Y_{t,k}$) is a solution of $\operatorname{argmin}_{p \in \Delta_K} \sum_{i=1}^K Y_{t,i}\, h(p, e_i)$ for any proper loss $h \in \mathcal{L}$! This immediately leads to the following result.

Corollary 1. Algorithm 1 ensures $\mathrm{PUCal} \le 4\sqrt{KT}$ against any adaptive adversary.

We remark that the only difference of Algorithm 1 compared to that of Kleinberg et al. (2023) is that $m_{t,i}$ is sampled from a geometric distribution instead of a uniform distribution over $\{0, 1, \dots, T\}$. Using such noise, skewed towards smaller values, leads to a better trade-off between the stability of the algorithm and the expected noise range, which is the key to improving the regret bound from $O(K\sqrt{T})$ to $O(\sqrt{KT})$.
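To close this subsection, here is a minimal Python sketch of Algorithm 1, assuming an oblivious adversary presented as an array of one-hot outcomes; since numpy's geometric sampler is supported on $\{1, 2, \dots\}$, the noise is shifted by one so that it lives on $\{0, 1, \dots\}$, as in the algorithm box.

```python
import numpy as np

# A sketch of Algorithm 1 (FTPL with geometric noise). The forecast is
# the mean of true plus hallucinated outcome counts, which by property
# (1) is an empirical risk minimizer for every proper loss at once.
def ftpl_forecasts(ys, rng=np.random.default_rng(0)):
    T, K = ys.shape                           # requires T >= K so q <= 1
    counts = np.zeros(K)                      # |{s < t : y_s = e_i}|
    forecasts = np.zeros((T, K))
    q = np.sqrt(K / T)                        # geometric-noise parameter
    for t in range(T):
        m = rng.geometric(q, size=K) - 1      # shift support to {0, 1, ...}
        Y = counts + m
        forecasts[t] = Y / Y.sum() if Y.sum() > 0 else np.full(K, 1 / K)
        counts += ys[t]                       # observe the true outcome
    return forecasts
```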
3.2 Lower Bound

We now complement the upper bound of the previous section with a matching lower bound. Similar to the multi-armed bandit problem, the regime of interest is $T = \Omega(K)$.

Theorem 2. There exists a proper loss $\ell$ with range $[-1, 1]$ such that the following holds: for any online algorithm ALG, there exists a choice of $y_1, \dots, y_T$ by an oblivious adversary such that the expected regret $\mathbb{E}[\mathrm{REG}_\ell]$ of ALG is $\Omega(\sqrt{KT})$ when $T \ge 12K$.

We defer the proof to Appendix C and highlight the key ideas and novelty here. First, the proper loss we use to prove the lower bound has the univariate form $\bar\ell(p) = -\frac{1}{2}\sum_{i=1}^K |p_i - \frac{1}{K}|$, which is in fact a direct generalization of the so-called V-shaped loss studied in Kleinberg et al. (2023) for the binary case. More specifically, they show that in the binary case, V-shaped losses are the hardest in the sense that low regret with respect to all V-shaped losses directly implies low regret with respect to all proper losses (that is, low U-calibration error). On the other hand, they also prove that this is not true for the general multiclass case. Despite this fact, here we show that the V-shaped loss is still the hardest in the multiclass case in a different sense: it alone forces any algorithm to suffer the worst-case $\Omega(\sqrt{KT})$ regret.

With this loss function, we then follow a standard probabilistic argument and consider a randomized oblivious adversary that samples $y_1, \dots, y_T$ i.i.d. from the uniform distribution over $\mathcal{E}$. For such an adversary, we argue the following: (a) the expected loss incurred by ALG is non-negative, i.e., $\mathbb{E}[\sum_{t=1}^T \ell(p_t, y_t)] \ge 0$, where the expectation is taken over $y_1, \dots, y_T$ and any internal randomness of ALG; (b) the expected loss incurred by the benchmark is bounded as $\mathbb{E}[\inf_{p \in \Delta_K} \sum_{t=1}^T \ell(p, y_t)] \le -c\sqrt{KT}$ for some universal positive constant $c$, where the expectation is over $y_1, \dots, y_T$. Together, this implies that the expected regret of ALG is at least $c\sqrt{KT}$ in this randomized environment, which further implies that there must exist one particular sequence $y_1, \dots, y_T$ for which the expected regret of ALG is at least $c\sqrt{KT}$, finishing the proof. We remark that our proof of (b) is novel and based on an anti-concentration inequality for Bernoulli random variables (Lemma A.4).

We discuss some immediate implications of Theorem 2 below. First, it implies that in the online learning setting of Daskalakis and Syrgkanis (2016) where the adversary has only $K$ choices (formally defined in Section 3.1), without further assumptions on the loss function, their FTPL algorithm is minimax optimal. To our knowledge this was not known before. Second, since $\mathrm{PUCal} = \sup_\ell \mathbb{E}[\mathrm{REG}_\ell] \ge \mathbb{E}[\mathrm{REG}_{\ell'}]$ for any $\ell' \in \mathcal{L}$, Theorem 2 immediately implies an $\Omega(\sqrt{KT})$ lower bound on the pseudo multiclass U-calibration error. In fact, since $\mathrm{UCal} \ge \mathrm{PUCal}$, the same lower bound holds for the actual U-calibration error.

Corollary 2. For any online forecasting algorithm, there exists an oblivious adversary such that $\mathrm{UCal} \ge \mathrm{PUCal} = \Omega(\sqrt{KT})$.
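For intuition about step (b), here is a small Monte Carlo sketch: under uniform i.i.d. outcomes, the benchmark's total V-shaped loss equals $-\frac{1}{2}\sum_i |n_i - T/K|$, whose expectation is indeed of order $-\sqrt{KT}$. The particular $K$, $T$, and trial count are arbitrary choices for this sketch, and the printed constant is empirical, not the $c$ of the proof.

```python
import numpy as np

# Monte Carlo sketch of step (b) in the lower-bound proof: under a
# uniform i.i.d. adversary, the benchmark's total V-shaped loss
# -0.5 * sum_i |n_i - T/K| concentrates around -const * sqrt(KT).
rng = np.random.default_rng(0)
K, T, trials = 4, 4096, 2000
bench = np.empty(trials)
for s in range(trials):
    counts = rng.multinomial(T, np.full(K, 1 / K))   # outcome counts n_i
    bench[s] = -0.5 * np.abs(counts - T / K).sum()   # = T * ulbar(beta)
print("E[benchmark] ~", bench.mean(), "vs -sqrt(KT) =", -np.sqrt(K * T))
```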
3.3 From PUCal to UCal

We now make an attempt to bound the U-calibration error UCal of Algorithm 1 for an oblivious adversary. Specifically, since the perturbations are sampled afresh every round and the adversary is oblivious, using Hoeffding's inequality it is straightforward to show that for a fixed $\ell$ and a fixed $\delta \in (0, 1)$, the regret of Algorithm 1 with respect to $\ell$ satisfies $\mathrm{REG}_\ell \le 4\sqrt{KT} + \sqrt{2T\log(1/\delta)}$ with probability at least $1 - \delta$ (see Hutter et al. (2005, Section 9) or Lemma D.1). Therefore, for a finite subset $\mathcal{L}'$ of $\mathcal{L}$, taking a union bound over all $\ell \in \mathcal{L}'$ gives $\sup_{\ell \in \mathcal{L}'} \mathrm{REG}_\ell \le 4\sqrt{KT} + \sqrt{2T\log(|\mathcal{L}'|/\delta)}$ with probability at least $1 - \delta$. Picking $\delta = 1/T$ and using the boundedness of the losses, we obtain $\mathrm{UCal}_{\mathcal{L}'} \le 2 + 4\sqrt{KT} + \sqrt{2T\log(T|\mathcal{L}'|)}$. In Appendix D, we generalize this simple argument to any infinite subset $\mathcal{L}'$ of $\mathcal{L}$ with a finite $\epsilon$-covering number $M(\mathcal{L}', \epsilon; \|\cdot\|_\infty)$ and prove that for any $\epsilon > 0$,

$$\mathrm{UCal}_{\mathcal{L}'} \le 2 + 4\epsilon T + 4\sqrt{KT} + \sqrt{2T\log(T\, M(\mathcal{L}', \epsilon; \|\cdot\|_\infty))}. \quad (2)$$

Using this bound, we now give a concrete example of a simple parameterized family $\mathcal{L}' \subseteq \mathcal{L}$ for which $\mathrm{UCal}_{\mathcal{L}'} = O(\sqrt{KT} + \sqrt{T\log T})$. Consider the parameterized class $\mathcal{L}' = \{\alpha\ell_1(p, y) + (1-\alpha)\ell_2(p, y) \mid \alpha \in [0, 1]\}$, where $\ell_1, \ell_2 \in \mathcal{L}$ are two fixed bounded proper losses. It is straightforward to verify that $\ell_\alpha(p, y) := \alpha\ell_1(p, y) + (1-\alpha)\ell_2(p, y) \in \mathcal{L}$, therefore $\mathcal{L}' \subseteq \mathcal{L}$. To obtain an $\epsilon \in (0, 1)$ cover of $\mathcal{L}'$, we consider the set $C := \{0, \epsilon, \dots, 1-\epsilon, 1\}$, which partitions the interval $[0, 1]$ into $\frac{1}{\epsilon}$ smaller intervals, each of length $\epsilon$ (assuming $\frac{1}{\epsilon}$ is an integer for simplicity). For each $\alpha \in [0, 1]$, let $c_\alpha \in C$ denote the closest point to $\alpha$ (breaking ties arbitrarily); clearly $|\alpha - c_\alpha| \le \epsilon$. Next, consider the function $g_\alpha(p, y) := c_\alpha\ell_1(p, y) + (1-c_\alpha)\ell_2(p, y)$. The class $\{g_\alpha \mid \alpha \in [0, 1]\}$ is clearly a $2\epsilon$-cover of $\mathcal{L}'$ of size $O(\frac{1}{\epsilon})$; thus $M(\mathcal{L}', \epsilon; \|\cdot\|_\infty) = O(\frac{1}{\epsilon})$. It then follows from (2), on choosing $\epsilon = \frac{1}{T}$, that $\mathrm{UCal}_{\mathcal{L}'} = O(\sqrt{KT} + \sqrt{T\log T})$. On the other hand, in Section 4.3 we shall argue that for this class with a specific choice of $\ell_2$, FTL suffers linear U-calibration error (that is, $\mathrm{UCal}_{\mathcal{L}'} = \Omega(T)$).

4 Improved Bounds for Important Sub-Classes

In this section, we show that it is possible to go beyond the $\Theta(\sqrt{KT})$ U-calibration error for several broad sub-classes of $\mathcal{L}$ that include important and common proper losses. These results are achieved by an extremely simple algorithm called Follow-the-Leader (FTL), which at time $t > 1$ forecasts

$$p_t = \frac{1}{t-1}\sum_{s=1}^{t-1} y_s \in \operatorname*{argmin}_{p \in \Delta_K} \sum_{s=1}^{t-1} \ell(p, y_s), \quad (3)$$

that is, the average of the past outcomes (the forecast at time $t = 1$ can be arbitrary). For notational convenience, we define $n_t = \sum_{s=1}^t y_s$, so that FTL predicts $p_t = \frac{n_{t-1}}{t-1}$, with $n_{t-1,i}$ being the count of outcome $i$ before time $t$. Importantly, since FTL is a deterministic algorithm, its PUCal and UCal are always trivially the same. Moreover, there is also no distinction between an oblivious adversary and an adaptive adversary because of this deterministic nature.

4.1 Proper Lipschitz Losses

In this section, we show that $\Theta(\log T)$ is the minimax optimal bound for PUCal and UCal for Lipschitz proper losses. Specifically, we consider the following class of $G$-Lipschitz proper losses: $\mathcal{L}_G := \{\ell \in \mathcal{L} \mid |\ell(p, y) - \ell(p', y)| \le G\|p - p'\|, \ \forall p, p' \in \Delta_K, y \in \mathcal{E}\}$. As discussed in Section 2, the two common proper losses, the squared loss and the spherical loss, are both in $\mathcal{L}_G$ for some $G$. Note that the class $\mathcal{L}_G$ is rich, since according to Lemma 1 it corresponds to the class of concave univariate forms that are Lipschitz and smooth (see Lemma B.2). We now show that FTL enjoys logarithmic U-calibration error with respect to $\mathcal{L}_G$.

Theorem 3. The regret of FTL for learning any $\ell \in \mathcal{L}_G$ is at most $2 + 2G\log T$. Consequently, FTL ensures $\mathrm{PUCal}_{\mathcal{L}_G} = \mathrm{UCal}_{\mathcal{L}_G} = O(G\log T)$.

Proof. Using the standard Be-the-Leader lemma (see e.g., Orabona (2019, Lemma 1.2)), which says $\sum_{t=1}^T \ell(p_{t+1}, y_t) \le \inf_{p \in \Delta_K} \sum_{t=1}^T \ell(p, y_t)$, the regret of FTL can be bounded as

$$\mathrm{REG}_\ell \le \sum_{t=1}^T \big(\ell(p_t, y_t) - \ell(p_{t+1}, y_t)\big) \le 2 + G\sum_{t=2}^T \|p_t - p_{t+1}\|,$$

where the second inequality bounds the $t = 1$ term by 2 and uses $\ell \in \mathcal{L}_G$ for the remaining terms. Next, since $p_t = \frac{n_{t-1}}{t-1}$ and $p_{t+1} = \frac{n_t}{t}$, we obtain

$$\|p_t - p_{t+1}\| = \left\|\frac{n_{t-1}}{t-1} - \frac{n_t}{t}\right\| = \frac{\|n_{t-1} - (t-1)y_t\|}{t(t-1)} \le \frac{2}{t},$$

where the second equality follows since $n_t = n_{t-1} + y_t$, and the last inequality follows from the triangle inequality and $\|n_{t-1}\| \le \|n_{t-1}\|_1 = t - 1$. Finally, since $\sum_{t=2}^T \frac{1}{t} \le \int_1^T \frac{1}{z}\,dz = \log T$, we obtain $\mathrm{REG}_\ell \le 2 + 2G\log T$, which completes the proof.
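As a quick empirical companion to Theorem 3, the following sketch runs FTL with the squared loss and prints regrets that grow roughly logarithmically in $T$; the alternating outcome sequence is just one illustrative choice, not a worst case.

```python
import numpy as np

# Empirical check of Theorem 3: FTL's regret under the squared loss
# grows logarithmically in T.
def ftl_regret_squared(T, K=2):
    ys = np.eye(K)[np.arange(T) % K]                  # alternating outcomes
    counts, loss_alg = np.zeros(K), 0.0
    for t in range(T):
        p = counts / t if t > 0 else np.full(K, 1 / K)  # FTL forecast
        loss_alg += 0.5 * np.sum((p - ys[t]) ** 2)
        counts += ys[t]
    beta = counts / T                                 # benchmark, property (1)
    loss_best = sum(0.5 * np.sum((beta - y) ** 2) for y in ys)
    return loss_alg - loss_best

for T in (100, 1000, 10000):
    print(T, round(ftl_regret_squared(T), 3))         # grows like log T
```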
A closer look at the proof reveals that global Lipschitzness over the entire simplex $\Delta_K$ is in fact not necessary. This is because, for example, in the term $\ell(p_t, e_i) - \ell(p_{t+1}, e_i)$ for some $i \in [K]$, by the definition of FTL the corresponding coordinates $p_{t,i}$ and $p_{t+1,i}$ are almost always at least $1/T$, the only exception being the first time $t$ with $y_t = e_i$, which we can ignore since the regret incurred there is at most a constant. This means that local Lipschitzness in a certain region is enough; see Lemma E.1 for details. Note that the loss induced by the Tsallis entropy (mentioned in Section 2) is exactly one such example where global Lipschitzness fails but local Lipschitzness holds. We defer the concrete discussion of the regret bounds of FTL on this example to Section 4.2 (where yet another analysis is introduced).

In the rest of this subsection, we argue that no algorithm can guarantee regret better than $\Omega(\log T)$ for one particular Lipschitz proper loss, making FTL minimax optimal for this class.

Theorem 4. There exists a proper Lipschitz loss $\ell$ such that: for any algorithm ALG, there exists a choice of $y_1, \dots, y_T$ by an oblivious adversary such that the expected regret of ALG is $\Omega(\log T)$.

The loss we use in this lower bound is simply the squared loss $\ell(p, y) = \|p - y\|^2$ with $K = 2$. While the squared loss is known to admit $\Theta(\log T)$ regret in other online learning problems such as that of Abernethy et al. (2008), as far as we know there is no study of our setting, where the decision set is the simplex and the adversary has only finitely many choices. It turns out that this variation brings significant technical challenges, and our proof is substantially different from that of Abernethy et al. (2008). We defer the details to Appendix F and discuss the key steps below.

Step 1: Since the squared loss is convex in $p$, by standard arguments it suffices to consider deterministic algorithms only (see Lemma F.1). Moreover, for deterministic algorithms there is no difference between an oblivious adversary and an adaptive adversary, so the minimax regret can be written as

$$\mathrm{VAL} = \inf_{p_1 \in \Delta_K}\sup_{y_1 \in \mathcal{E}}\cdots\inf_{p_T \in \Delta_K}\sup_{y_T \in \mathcal{E}}\left[\sum_{t=1}^T \ell(p_t, y_t) - \inf_{p \in \Delta_K}\sum_{t=1}^T \ell(p, y_t)\right].$$

To solve this, further define $V_{n,r}$ recursively as $V_{n,r} = \inf_{p \in \Delta_K}\sup_{y \in \mathcal{E}}[V_{n+y,r-1} + \ell(p, y)]$ with $V_{n,0} = -\inf_{p \in \Delta_K}\sum_{i=1}^K n_i\ell(p, e_i)$, so that VAL is simply $V_{0,T}$.

Step 2: Using the minimax theorem, we further show that $V_{n,r} = \sup_{q \in \Delta_K}\big[\sum_{i=1}^K q_i V_{n+e_i,r-1}\big] + \bar\ell(q)$, where the univariate form is $\bar\ell(q) = 1 - \|q\|^2$ (the squared loss here is unnormalized; cf. Section 2). Recall that we consider only the binary case $K = 2$, so it is straightforward to give an analytical form of the solution to the maximization over $q \in \Delta_K$. Specifically, writing $V_{(n_1,n_2),r}$ as $V_{n_1,n_2,r}$ to make the notation concise, we show

$$V_{n_1,n_2,r} = \begin{cases} V_2 & \text{if } V_1 - V_2 < -2, \\ \frac{(V_1 - V_2)^2}{8} + \frac{V_1 + V_2}{2} + \frac{1}{2} & \text{if } -2 \le V_1 - V_2 \le 2, \\ V_1 & \text{if } V_1 - V_2 > 2, \end{cases}$$

where $V_1$ and $V_2$ are shorthands for $V_{n_1+1,n_2,r-1}$ and $V_{n_1,n_2+1,r-1}$ respectively. Next, by an induction on $r$, we show that for all valid $n_1, n_2, r$ it holds that $-2 \le V_1 - V_2 \le 2$ (Lemma F.2); therefore $V_{n_1,n_2,r}$ is always given by the middle case.

Step 3: By an induction on $r$ again, we show that $V_{n_1,n_2,r}$ exhibits a special structure of the form $V_{n_1,n_2,r} = \frac{(n_1-n_2)^2}{2}\big(u_r + \frac{1}{T}\big) + v_r - \frac{T-r}{2}$, where $\{u_r\}_{r=0}^T$ and $\{v_r\}_{r=0}^T$ are recursively defined via $u_{r+1} = u_r + (u_r + \frac{1}{T})^2$ and $v_{r+1} = v_r + \frac{u_r}{2} + \frac{1}{2T}$ with $u_0 = v_0 = 0$ (Lemma F.3). Since $\mathrm{VAL} = V_{0,0,T} = v_T$, it remains to show $v_T = \Omega(\log T)$, which is done via two technical lemmas, F.4 and F.5.
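For intuition, the dynamic program of Steps 2 and 3 can be run directly. The sketch below relies on the recursion as reconstructed above, together with Lemma F.2's guarantee that the middle case always applies, and computes $V_{0,0,T}$ for $K = 2$ against $\log T$ for comparison.

```python
import numpy as np

# DP sketch behind Theorem 4's lower bound: for K = 2 and the squared
# loss, compute V_{n1,n2,r} via the middle case of the recursion
# (valid by Lemma F.2) and report VAL = V_{0,0,T}.
def minimax_value(T):
    # V[n1] holds V_{n1,n2,r} with n2 = (T - r) - n1 at the current level r.
    V = np.array([-2.0 * n * (T - n) / T for n in range(T + 1)])  # r = 0
    for r in range(1, T + 1):
        V1 = V[1:T - r + 2]        # V_{n1+1, n2, r-1}
        V2 = V[0:T - r + 1]        # V_{n1, n2+1, r-1}
        V = (V1 - V2) ** 2 / 8 + (V1 + V2) / 2 + 0.5
    return V[0]                    # V_{0,0,T}

for T in (64, 256, 1024, 4096):
    print(T, round(minimax_value(T), 4), round(np.log(T), 4))  # VAL = Θ(log T)
```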
4.2 Decomposable Losses

Next, we consider another sub-class of proper losses that are not necessarily Lipschitz. Instead, their univariate form is decomposable over the $K$ outcomes and additionally satisfies a mild regularity condition. Specifically, we define the following class:

$$\mathcal{L}_{\mathrm{dec}} := \Big\{\ell \in \mathcal{L} \;\Big|\; \bar\ell(p) = \bar c_K\sum_{i=1}^K \ell_i(p_i), \text{ where each } \ell_i \text{ is twice continuously differentiable on } (0, 1)\Big\},$$

where $\bar c_K > 0$ is a scaling constant. Both the squared loss and its generalization via the Tsallis entropy discussed in Section 2 are clearly in this class $\mathcal{L}_{\mathrm{dec}}$, with the latter being non-Lipschitz when $\alpha \in (1, 2)$. The spherical loss, however, is not decomposable and thus not in $\mathcal{L}_{\mathrm{dec}}$. We now show that FTL achieves logarithmic regret against any $\ell \in \mathcal{L}_{\mathrm{dec}}$ (see Appendix G for the full proof).

Theorem 5. The regret of FTL for learning any $\ell \in \mathcal{L}_{\mathrm{dec}}$ is at most $2K + (K+1)\beta_\ell(1 + \log T)$ for some constant $\beta_\ell$ which only depends on $\ell$ and $K$. Consequently, FTL ensures $\mathrm{PUCal}_{\mathcal{L}_{\mathrm{dec}}} = \mathrm{UCal}_{\mathcal{L}_{\mathrm{dec}}} = O((\sup_{\ell \in \mathcal{L}_{\mathrm{dec}}}\beta_\ell)K\log T)$.

Proof Sketch. We start by showing a certain controlled growth rate of the second derivative of the univariate form (see Appendix H for the proof).

Lemma 2. For a function $f$ that is concave, Lipschitz, and bounded over $[0, 1]$ and twice continuously differentiable over $(0, 1)$, there exists a constant $c > 0$ such that $|f''(p)| \le c\max\big(\frac{1}{p}, \frac{1}{1-p}\big)$ for all $p \in (0, 1)$.

Note that according to Lemma 1, each $\ell_i$ must be concave, Lipschitz, and bounded for the induced loss to be proper and bounded. Therefore, using Lemma 2, there exists a constant $c_i > 0$ such that $|\ell_i''(p)| \le c_i\max\big(\frac{1}{p}, \frac{1}{1-p}\big)$ for each $i$. The rest of the proof in fact relies only on this property; in other words, the regret bound holds even if one replaces the twice continuous differentiability condition with this (weaker) property. More specifically, for each $i \in [K]$, let $T_i := \{t_{i,1}, \dots, t_{i,k_i}\} \subseteq [T]$ be the subset of rounds where the true outcome is $i$ (which could be empty). Then, using the Be-the-Leader lemma again and trivially bounding the regret by its maximum value for the (at most $K$) rounds when an outcome appears for the first time, we obtain

$$\mathrm{REG}_\ell \le 2K + \sum_{i=1}^K\sum_{t \in T_i\setminus\{t_{i,1}\}}\underbrace{\big(\ell(p_t, e_i) - \ell(p_{t+1}, e_i)\big)}_{\delta_{t,i}}.$$

By using the characterization in Lemma 1, we then express $\ell(p_t, e_i)$ and $\ell(p_{t+1}, e_i)$ in terms of the univariate forms $\bar\ell(p_t), \bar\ell(p_{t+1})$ and their respective gradients $\nabla\bar\ell(p_t), \nabla\bar\ell(p_{t+1})$. Next, using the concavity of $\bar\ell$, the Mean Value Theorem, and Lemma 2, we argue that

$$\delta_{t,i} \le \sum_{j=1}^K \beta_{\ell,j}\,|p_{t+1,j} - p_{t,j}|\max\Big(\frac{1}{\xi_{t,j}}, \frac{1}{1-\xi_{t,j}}\Big) \quad (4)$$

for some $\xi_t$ that is a convex combination of $p_t$ and $p_{t+1}$, and constants $\beta_{\ell,j} = \bar c_K c_j$ ($\bar c_K$ is the scaling constant such that $\bar\ell(p) = \bar c_K\sum_{i=1}^K \ell_i(p_i)$). To bound (4), we consider the terms $\frac{|p_{t+1,j} - p_{t,j}|}{\xi_{t,j}}$ and $\frac{|p_{t+1,j} - p_{t,j}|}{1-\xi_{t,j}}$ individually and find that they are always bounded by either $\frac{1}{n_{t-1,i}}$ or $\frac{1}{t-1}$ according to the update rule of FTL. Thus, we obtain $\mathrm{REG}_\ell \le 2K + \beta_\ell(S_1 + S_2)$, where

$$S_1 = \sum_{i=1}^K\sum_{t \in T_i\setminus\{t_{i,1}\}}\frac{1}{n_{t-1,i}}, \qquad S_2 = \sum_{i=1}^K\sum_{t \in T_i\setminus\{t_{i,1}\}}\frac{1}{t-1},$$

and $\beta_\ell = \sum_{i=1}^K \beta_{\ell,i}$. Finally, a direct calculation shows $S_1 \le K(1 + \log T)$ and $S_2 \le 1 + \log T$, which finishes the proof.

To showcase the usefulness of this result, we go back to the Tsallis entropy example.

Corollary 3. For any loss $\ell$ with univariate form $\bar\ell(p) = -\bar c_K\sum_{i=1}^K p_i^\alpha$ for $\alpha \in (1, 2)$ (the constant $\bar c_K$ is such that the loss has range $[-1, 1]$), FTL ensures $\mathrm{REG}_\ell = O(\bar c_K\alpha(\alpha-1)K^2\log T)$.

Proof. As mentioned, our proof of Theorem 5 relies only on Lemma 2, and it is straightforward to verify that for the loss considered here, one can take the constant $c$ in Lemma 2 to be $\alpha(\alpha-1)$; thus the regret of FTL is $O(K\beta_\ell\log T)$ with $\beta_\ell = K\bar c_K\alpha(\alpha-1)$.

On the other hand, if one were to use the proof based on local Lipschitzness (mentioned in Section 4.1 and discussed in Appendix E), one would only obtain a regret bound of order $O(K + \bar c_K\alpha(\alpha-1)T^{2-\alpha}\log T)$, which is much worse (especially for small $\alpha$). Finally, we remark that for $\alpha \ge 2$, the bivariate form is Lipschitz, and thus FTL also ensures logarithmic regret according to Theorem 3.
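To illustrate Corollary 3, the following sketch runs FTL with the Tsallis-induced loss for $\alpha = 1.5$; the normalizer is set to 1 for simplicity rather than the $\bar c_K$ above (this only rescales the regret), and the clipping guard is an implementation detail of the sketch.

```python
import numpy as np

# Empirical check of Corollary 3: FTL with the (non-Lipschitz for
# alpha in (1,2)) Tsallis-induced proper loss still has regret that
# grows only logarithmically in T.
def tsallis_loss(p, y, alpha, c=1.0):
    p = np.clip(p, 1e-12, 1.0)       # guard p_i = 0 at early rounds
    return c * (alpha - 1) * np.sum(p ** alpha) \
           - c * alpha * np.dot(p ** (alpha - 1), y)

def ftl_regret_tsallis(T, K=3, alpha=1.5, rng=np.random.default_rng(0)):
    ys = np.eye(K)[rng.integers(0, K, size=T)]
    counts, loss_alg = np.zeros(K), 0.0
    for t in range(T):
        p = counts / t if t > 0 else np.full(K, 1 / K)
        loss_alg += tsallis_loss(p, ys[t], alpha)
        counts += ys[t]
    beta = counts / T
    loss_best = sum(tsallis_loss(beta, y, alpha) for y in ys)
    return loss_alg - loss_best

for T in (100, 1000, 10000):
    print(T, round(ftl_regret_tsallis(T), 3))
```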
4.3 FTL Cannot Handle General Proper Losses

Despite yielding improved regret for Lipschitz and other special classes of proper losses, FTL is unfortunately not a good algorithm in general when dealing with proper losses, as shown below.

Theorem 6. There exists a proper loss $\ell$ and a choice of $y_1, \dots, y_T$ by an oblivious adversary such that the regret $\mathrm{REG}_\ell$ of FTL is $\Omega(T)$.

Proof. The loss we consider is in fact the same V-shaped loss used in the proof of Theorem 2, which shows that all algorithms must suffer $\Omega(\sqrt{KT})$ regret; here, we show that FTL even suffers linear regret for this loss. Specifically, it suffices to consider the binary case $K = 2$, with univariate form $\bar\ell(p) = -|p_1 - \frac{1}{2}|$. Using Lemma 1, we obtain the following bivariate form:

$$\ell(p, e_1) = -\tfrac{1}{2}\,\mathrm{sign}\big(p_1 - \tfrac{1}{2}\big), \qquad \ell(p, e_2) = \tfrac{1}{2}\,\mathrm{sign}\big(p_1 - \tfrac{1}{2}\big),$$

where the sign function is defined as $\mathrm{sign}(x) = 1$ if $x > 0$; $-1$ if $x < 0$; $0$ if $x = 0$. Therefore, $\ell(p, e_1)$ equals $\frac{1}{2}$ if $p_1 < \frac{1}{2}$, $-\frac{1}{2}$ if $p_1 > \frac{1}{2}$, and $0$ if $p_1 = \frac{1}{2}$. Similarly, $\ell(p, e_2)$ equals $\frac{1}{2}$ if $p_1 > \frac{1}{2}$, $-\frac{1}{2}$ if $p_1 < \frac{1}{2}$, and $0$ if $p_1 = \frac{1}{2}$. Let $T$ be even and $y_t = e_1$ if $t$ is odd and $e_2$ otherwise. For such a sequence $y_1, \dots, y_T$, the benchmark selects $\beta = \frac{1}{T}\sum_{t=1}^T y_t = [\frac{1}{2}, \frac{1}{2}]$ and incurs 0 cost. On the other hand, FTL chooses $p_t = [\frac{1}{2}, \frac{1}{2}]$ when $t$ is odd, and $p_t = \big[\frac{t}{2(t-1)}, \frac{t-2}{2(t-1)}\big]$ otherwise; in the latter case $p_{t,1} > \frac{1}{2}$, so FTL pays $\frac{1}{2}$ on each even round. Thus, the regret of FTL is $\mathrm{REG}_\ell = \sum_{t \text{ even}}\ell(p_t, e_2) = \frac{T}{2}\cdot\frac{1}{2} = \frac{T}{4}$. This completes the proof.

Consider the parameterized class in Section 3.3, let $K = 2$, let $\ell_2(p, y)$ correspond to the V-shaped loss in Theorem 6, and consider any $\ell_1(p, y) \in \mathcal{L}$. It follows from Theorem 6 that $\mathrm{REG}_{\ell_\alpha} = \Omega(T)$ when $\alpha = 0$; therefore FTL suffers $\mathrm{UCal}_{\mathcal{L}'} = \sup_{\alpha \in [0,1]}\mathrm{REG}_{\ell_\alpha} = \Omega(T)$, whereas Algorithm 1 ensures $\mathrm{UCal}_{\mathcal{L}'} = O(\sqrt{T\log T})$.
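The construction in the proof of Theorem 6 is easy to simulate; the sketch below reproduces the $\frac{T}{4}$ regret of FTL on the alternating sequence (the benchmark $[\frac{1}{2}, \frac{1}{2}]$ pays exactly zero, so the algorithm's total loss is its regret).

```python
import numpy as np

# Simulation of Theorem 6's construction: FTL against the binary
# V-shaped loss on the alternating sequence e1, e2, e1, ... incurs
# regret T/4 while the benchmark (1/2, 1/2) pays zero.
def v_loss(p, y):                        # bivariate V-shaped loss, K = 2
    s = np.sign(p[0] - 0.5)
    return -0.5 * s if y[0] == 1 else 0.5 * s

def ftl_regret_vshaped(T):               # T even
    ys = np.eye(2)[np.arange(T) % 2]
    counts, loss_alg = np.zeros(2), 0.0
    for t in range(T):
        p = counts / t if t > 0 else np.full(2, 0.5)
        loss_alg += v_loss(p, ys[t])
        counts += ys[t]
    return loss_alg                      # benchmark loss is exactly 0 here

for T in (100, 1000, 10000):
    print(T, ftl_regret_vshaped(T), T / 4)
```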
5 Conclusion and Future Directions

In this paper, we give complete answers to several questions regarding the minimax optimal bounds on the multiclass U-calibration error, a notion of simultaneous loss minimization proposed by Kleinberg et al. (2023) for the fundamental problem of making online forecasts of unknown outcomes. We not only improve their $\mathrm{PUCal} = O(K\sqrt{T})$ upper bound and show that the minimax pseudo U-calibration error is $\Theta(\sqrt{KT})$, but also further show that logarithmic U-calibration error can be achieved by an extremely simple algorithm for several important classes of proper losses. There are many interesting future directions, including 1) understanding the optimal bound on the actual U-calibration error UCal, 2) generalizing the results to losses that are not necessarily proper, and 3) studying the contextual case and developing more efficient algorithms with better bounds compared to those in the recent work of Garg et al. (2024).

Acknowledgements

We thank mathoverflow user mathworker21 for their help in formulating and proving Lemma F.5. Haipeng Luo was supported by NSF award IIS-1943607 and a Google Research Scholar Award, and Vatsal Sharan was supported by NSF CAREER Award CCF-2239265 and an Amazon Research Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of Amazon.

References

Abernethy, J., Bartlett, P. L., Rakhlin, A., and Tewari, A. (2008). Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, pages 414-424.

Arunachaleswaran, E. R., Collina, N., Roth, A., and Shi, M. (2024). An elementary predictor obtaining $2\sqrt{T}$ distance to calibration. arXiv preprint arXiv:2402.11410.

Błasiok, J., Gopalan, P., Hu, L., Kalai, A. T., and Nakkiran, P. (2024). Loss minimization yields multicalibration for large neural networks. In 15th Innovations in Theoretical Computer Science Conference (ITCS 2024), volume 287 of LIPIcs, pages 17:1-17:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik.

Błasiok, J., Gopalan, P., Hu, L., and Nakkiran, P. (2023). A unifying theory of distance from calibration. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1727-1740.

Blum, A., Hajiaghayi, M., Ligett, K., and Roth, A. (2008). Regret minimization and the price of total anarchy. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 373-382.

Blum, A. and Mansour, Y. (2007). From external to internal regret. Journal of Machine Learning Research, 8(6).

Daskalakis, C. and Syrgkanis, V. (2016). Learning in auctions: Regret is hard, envy is easy. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 219-228. IEEE.

Dawid, A. P. (1982). The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605-610.

Foster, D. P. and Vohra, R. V. (1998). Asymptotic calibration. Biometrika, 85(2):379-390.

Garg, S., Jung, C., Reingold, O., and Roth, A. (2024). Oracle efficient online multicalibration and omniprediction. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2725-2792. SIAM.

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378.

Gopalan, P., Hu, L., Kim, M. P., Reingold, O., and Wieder, U. (2023a). Loss minimization through the lens of outcome indistinguishability. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023), volume 251 of LIPIcs, pages 60:1-60:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik.

Gopalan, P., Kalai, A. T., Reingold, O., Sharan, V., and Wieder, U. (2022). Omnipredictors. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), volume 215 of LIPIcs, pages 79:1-79:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik.

Gopalan, P., Kim, M. P., and Reingold, O. (2023b). Swap agnostic learning, or characterizing omniprediction via multicalibration. In Thirty-seventh Conference on Neural Information Processing Systems.

Hart, S. (2022). Calibrated forecasts: The minimax proof. arXiv preprint arXiv:2209.05863.

Hébert-Johnson, U., Kim, M., Reingold, O., and Rothblum, G. (2018). Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939-1948. PMLR.

Hutter, M., Poland, J., et al. (2005). Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6:639-660.

Klein, P. and Young, N. (1999). On the number of iterations for Dantzig-Wolfe optimization and packing-covering approximation algorithms. In International Conference on Integer Programming and Combinatorial Optimization, pages 320-327. Springer.
Kleinberg, B., Leme, R. P., Schneider, J., and Teng, Y. (2023). U-calibration: Forecasting for an unknown agent. In The Thirty Sixth Annual Conference on Learning Theory, pages 5143-5145. PMLR.

Orabona, F. (2019). A modern introduction to online learning. arXiv preprint arXiv:1912.13213.

Qiao, M. and Valiant, G. (2021). Stronger calibration lower bounds via sidestepping. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 456-466.

Qiao, M. and Zheng, L. (2024). On the distance from calibration in sequential prediction. arXiv preprint arXiv:2402.07458.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press.

A Concentration and Anti-Concentration Inequalities

Lemma A.1 (Markov's inequality). For a non-negative random variable $X$ and any $a > 0$, we have $\mathbb{P}(X \ge a) \le \frac{\mathbb{E}[X]}{a}$.

Lemma A.2 (Khintchine's inequality). Let $\epsilon_1, \dots, \epsilon_T$ be i.i.d. Rademacher random variables, i.e., $\mathbb{P}(\epsilon_i = +1) = \mathbb{P}(\epsilon_i = -1) = \frac{1}{2}$ for all $i \in [T]$. Then, for any $x_1, \dots, x_T \in \mathbb{R}$, $\mathbb{E}\big|\sum_{i=1}^T \epsilon_i x_i\big| \ge \frac{1}{\sqrt{2}}\big(\sum_{i=1}^T x_i^2\big)^{1/2}$.

Lemma A.3 (Hoeffding's inequality). Let $X_1, \dots, X_T$ be independent random variables satisfying $a_i \le X_i \le b_i$ for all $i \in [T]$. Then, for any $\epsilon > 0$, we have

$$\mathbb{P}\Big(\sum_{t=1}^T\big(X_t - \mathbb{E}[X_t]\big) \ge \epsilon\Big) \le \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^T(b_i - a_i)^2}\Big), \qquad \mathbb{P}\Big(\sum_{t=1}^T\big(X_t - \mathbb{E}[X_t]\big) \le -\epsilon\Big) \le \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^T(b_i - a_i)^2}\Big).$$

Lemma A.4 (Reverse Chernoff bounds; Klein and Young (1999, Lemma 5.2)). Let $\bar X := \frac{1}{T}\sum_{i=1}^T X_i$ be the average of $T$ i.i.d. Bernoulli random variables $X_1, \dots, X_T$ with $\mathbb{E}[X_i] = p$ for all $i \in [T]$. If $\epsilon \in (0, \frac{1}{2}]$ and $p \in (0, \frac{1}{2}]$ are such that $\epsilon^2 pT \ge 3$, then $\mathbb{P}(\bar X \le (1-\epsilon)p) \ge \exp(-9\epsilon^2 pT)$ and $\mathbb{P}(\bar X \ge (1+\epsilon)p) \ge \exp(-9\epsilon^2 pT)$.

B Deferred Proofs for Proper Losses

Lemma B.1. For a proper loss $\ell(p, y)$, the following holds for any $n \in \mathbb{N}$ and $y_1, \dots, y_n \in \mathcal{E}$: $\frac{1}{n}\sum_{i=1}^n y_i \in \operatorname{argmin}_{p \in \Delta_K}\sum_{i=1}^n \ell(p, y_i)$.

Proof. Let $\beta := \frac{1}{n}\sum_{i=1}^n y_i$. Since $\ell$ is proper, $\mathbb{E}_{y\sim\beta}[\ell(\beta, y)] \le \mathbb{E}_{y\sim\beta}[\ell(p', y)]$ for all $p' \in \Delta_K$. Notably, for any $p \in \Delta_K$, we have

$$\mathbb{E}_{y\sim\beta}[\ell(p, y)] = \sum_{i=1}^K \beta_i\ell(p, e_i) = \frac{1}{n}\sum_{i=1}^K\sum_{j=1}^n \mathbb{1}[y_j = e_i]\,\ell(p, e_i) = \frac{1}{n}\sum_{j=1}^n \ell(p, y_j).$$

Therefore, $\frac{1}{n}\sum_{j=1}^n \ell(\beta, y_j) \le \frac{1}{n}\sum_{j=1}^n \ell(p', y_j)$ for any $p' \in \Delta_K$. This completes the proof.

Lemma B.2. Let $\bar\ell(p)$ be a differentiable, concave, $\alpha$-Lipschitz function over $\Delta_K$ such that $\nabla\bar\ell(p)$ is $\beta$-Lipschitz over $\Delta_K$. Then the induced $\ell(p, y)$ is proper and $(2\alpha + 2\beta)$-Lipschitz in $p \in \Delta_K$ for each $y \in \mathcal{E}$.

Proof. Since $\bar\ell(p)$ is concave, it follows from Lemma 1 that $\ell(p, y)$ is proper. To prove the second part, assume without loss of generality that $y = e_1$. For any $p, p' \in \Delta_K$, we have

$$\ell(p, e_1) - \ell(p', e_1) = \bar\ell(p) + \langle\nabla\bar\ell(p), e_1 - p\rangle - \big(\bar\ell(p') + \langle\nabla\bar\ell(p'), e_1 - p'\rangle\big) \le \alpha\|p - p'\| + \big|\nabla_1\bar\ell(p) - \nabla_1\bar\ell(p')\big| + \big|\langle p, \nabla\bar\ell(p)\rangle - \langle p', \nabla\bar\ell(p')\rangle\big| \le (\alpha + \beta)\|p - p'\| + \big|\langle p - p', \nabla\bar\ell(p')\rangle\big| + \big|\langle p, \nabla\bar\ell(p') - \nabla\bar\ell(p)\rangle\big| \le (\alpha + \beta)\|p - p'\| + \|p - p'\|\,\|\nabla\bar\ell(p')\| + \|p\|\,\|\nabla\bar\ell(p') - \nabla\bar\ell(p)\| \le 2(\alpha + \beta)\|p - p'\|,$$

where the first equality follows from Lemma 1; the first inequality follows from the Lipschitzness of $\bar\ell$; the second inequality follows from the Lipschitzness of $\nabla\bar\ell$; the third inequality follows from the Cauchy-Schwarz inequality; and the final inequality follows since $\|p\| \le 1$ and $\bar\ell$ is $\alpha$-Lipschitz (so $\|\nabla\bar\ell(p')\| \le \alpha$). This completes the proof.

Proposition B.1. The spherical loss is $\sqrt{K}$-Lipschitz.

Proof. We show that $\|\nabla\ell(p, y)\| \le \sqrt{K}$ for all $y \in \mathcal{E}, p \in \Delta_K$, where the gradient is taken with respect to the first argument. Assume without loss of generality that $y = e_1$, so $\ell(p, y) = -\frac{p_1}{\|p\|}$. It is easy to obtain $\frac{\partial\ell}{\partial p_1} = -\frac{\|p\|^2 - p_1^2}{\|p\|^3}$ and $\frac{\partial\ell}{\partial p_i} = \frac{p_1 p_i}{\|p\|^3}$ for all $i \ge 2$. Thus,

$$\|\nabla\ell(p, e_1)\| = \frac{1}{\|p\|^3}\sqrt{(\|p\|^2 - p_1^2)^2 + p_1^2\sum_{i=2}^K p_i^2} = \frac{\sqrt{\|p\|^2 - p_1^2}}{\|p\|^2} \le \frac{1}{\|p\|} \le \sqrt{K},$$

where the last inequality follows from the Cauchy-Schwarz inequality, which gives $\|p\| \ge \|p\|_1/\sqrt{K} = 1/\sqrt{K}$.
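As a numerical spot-check of Proposition B.1, the following sketch evaluates the gradient formulas from the proof at random points of the simplex and verifies the $\sqrt{K}$ bound; the dimension and number of samples are arbitrary.

```python
import numpy as np

# Spot-check of Proposition B.1: the gradient of p -> -p_1/||p|| has
# norm at most 1/||p|| <= sqrt(K) on the simplex. The gradient formula
# follows the computation in the proof above.
rng = np.random.default_rng(0)
K = 5
for _ in range(5):
    p = rng.dirichlet(np.ones(K))
    n = np.linalg.norm(p)
    grad = p[0] * p / n ** 3                   # coordinates i >= 2
    grad[0] = -(n ** 2 - p[0] ** 2) / n ** 3   # coordinate 1
    print(np.linalg.norm(grad) <= np.sqrt(K) + 1e-12)
```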
C Proof of Theorem 2

Theorem 2. There exists a proper loss $\ell$ with range $[-1, 1]$ such that the following holds: for any online algorithm ALG, there exists a choice of $y_1, \dots, y_T$ by an oblivious adversary such that the expected regret $\mathbb{E}[\mathrm{REG}_\ell]$ of ALG is $\Omega(\sqrt{KT})$ when $T \ge 12K$.

Proof. Consider the function defined as $\bar\ell(p) = -\frac{1}{2}\sum_{i=1}^K|p_i - \frac{1}{K}|$, which is concave. It follows from Lemma 1 that the bivariate form of $\ell$ is $\ell(p, y) = \bar\ell(p) + \langle g_p, y - p\rangle$, where $g_p$ denotes a subgradient of $\bar\ell$ at $p$; for our choice of $\bar\ell$, we take $g_{p,i} = -\frac{1}{2}\,\mathrm{sign}(p_i - \frac{1}{K})$, where the sign function is defined as $\mathrm{sign}(x) = 1$ if $x > 0$; $-1$ if $x < 0$; $0$ if $x = 0$.

We first show that $\ell(p, y) \in [-1, 1]$. Assume without loss of generality that $y = e_1$. Then

$$\ell(p, e_1) = -\frac{1}{2}\sum_{i=1}^K\Big|p_i - \frac{1}{K}\Big| - \frac{1 - p_1}{2}\,\mathrm{sign}\Big(p_1 - \frac{1}{K}\Big) + \frac{1}{2}\sum_{i=2}^K p_i\,\mathrm{sign}\Big(p_i - \frac{1}{K}\Big) = -\frac{1}{2}\Big(1 - \frac{1}{K}\Big)\mathrm{sign}\Big(p_1 - \frac{1}{K}\Big) + \frac{1}{2K}\sum_{i=2}^K\mathrm{sign}\Big(p_i - \frac{1}{K}\Big).$$

It is then easy to see that $p = e_1$ corresponds to a minimum, with value $\ell(e_1, e_1) = -\frac{K-1}{K}$, and $p = [0, \frac{1}{K-1}, \dots, \frac{1}{K-1}]$ corresponds to a maximum, with value $\frac{K-1}{K}$; hence $\ell(p, y) \in [-1, 1]$.

Next, we consider a randomized oblivious adversary which samples $y_1, \dots, y_T$ i.i.d. from the uniform distribution over $\mathcal{E}$. For such an adversary, we shall show that $\mathbb{E}[\mathrm{REG}_\ell] = \Omega(\sqrt{KT})$. We overload the notation and use $\mathrm{ALG}_{1:t}$ to denote the internal randomness of the algorithm up to time $t$ (inclusive); this notation succinctly represents both deterministic and randomized algorithms ($p_t$ could be sampled from a distribution $D_t$). Similarly, $\mathrm{ALG}_t$ denotes the randomness at time $t$. With this notation, in the constructed randomized environment, the expected cost of ALG is

$$\mathbb{E}\Big[\sum_{t=1}^T\ell(p_t, y_t)\Big] = \sum_{t=1}^T\mathbb{E}_{\mathrm{ALG}_{1:t-1},\,y_1,\dots,y_{t-1}}\,\mathbb{E}_{\mathrm{ALG}_t}\Big[\frac{1}{K}\sum_{i=1}^K\ell(p_t, e_i)\Big] \ge 0,$$

where we have used the linearity of expectations, the law of iterated expectations, and the fact that $\mathrm{ALG}_t$ is independent of $y_t$ ($p_t$ is chosen without knowing $y_t$) and vice versa ($y_t$ is sampled uniformly at random from $\mathcal{E}$); the inequality follows since $\sum_{i=1}^K\ell(p, e_i) \ge 0$ for any $p \in \Delta_K$. To see this last fact, note that

$$\sum_{i=1}^K\ell(p, e_i) = K\bar\ell(p) + \sum_{i=1}^K\langle g_p, e_i - p\rangle = K\Big(\bar\ell(p) + \Big\langle g_p, \frac{1_K}{K} - p\Big\rangle\Big),$$

where $1_K$ denotes the $K$-dimensional all-ones vector; since $\bar\ell$ is concave over $\Delta_K$, the term above is lower bounded by $K\bar\ell(\frac{1_K}{K}) = 0$.

Next, the expected regret of ALG can be lower bounded in the following manner:

$$\mathbb{E}[\mathrm{REG}_\ell] = \mathbb{E}\Big[\sum_{t=1}^T\ell(p_t, y_t) - \inf_{p\in\Delta_K}\sum_{t=1}^T\ell(p, y_t)\Big] \ge -\mathbb{E}_{y_1,\dots,y_T}\Big[\inf_{p\in\Delta_K}\sum_{t=1}^T\ell(p, y_t)\Big] = -\mathbb{E}_{y_1,\dots,y_T}\big[T\,\bar\ell(\beta)\big], \quad (5)$$

where the inequality follows since the expected cost of ALG is non-negative and the benchmark is independent of ALG, and the last equality follows from property (1) with $\beta = \frac{1}{T}\sum_{t=1}^T y_t$.

In the next steps, we deal with the expectation in (5). Sample $y_1, \dots, y_T$ from $\mathcal{E}$ and let $n_1, \dots, n_K$ denote the counts of the $K$ basis vectors, i.e., $n_i = |\{j \in [T] \mid y_j = e_i\}|$. Clearly, $\sum_{i=1}^K n_i = T$, and $\beta = [\frac{n_1}{T}, \dots, \frac{n_K}{T}]$. Thus, the term in (5) equals $\frac{1}{2}\mathbb{E}_{n_1,\dots,n_K}\big[\sum_{i=1}^K|n_i - \frac{T}{K}|\big]$, where $n_1, \dots, n_K$ follow a multinomial distribution with uniform event probabilities $\frac{1}{K}$. Further, using the linearity of expectations, this equals

$$\frac{K}{2}\,\mathbb{E}\Big|\sum_{t=1}^T\big(X_t - \mathbb{E}[X_t]\big)\Big|,$$

where the $X_t$ are i.i.d. Bernoulli random variables with mean $\frac{1}{K}$. Next, we bound $\mathbb{E}|\sum_{t=1}^T(X_t - \mathbb{E}[X_t])|$ using an anti-concentration bound on Bernoulli random variables. In particular, applying Lemma A.4 with $p = \frac{1}{K}$ and $\epsilon = a\sqrt{K/T}$, we obtain

$$\mathbb{P}\Big(\Big|\sum_{t=1}^T\big(X_t - \mathbb{E}[X_t]\big)\Big| \ge a\sqrt{T/K}\Big) \ge 2\exp(-9a^2)$$

for any $a \in \big[\sqrt{3}, \frac{1}{2}\sqrt{T/K}\big]$.
When $T \ge 12K$, this interval is non-empty. Rearranging Markov's inequality (Lemma A.1), we have $\mathbb{E}|Z| \ge a\sqrt{T/K}\cdot\mathbb{P}\big(|Z| \ge a\sqrt{T/K}\big)$ for any $a > 0$, where $Z := \sum_{t=1}^T(X_t - \mathbb{E}[X_t])$. Setting $a = \sqrt{3}$, we arrive at $\mathbb{E}|Z| \ge 2\sqrt{3}\exp(-27)\sqrt{T/K}$. Thus, $\mathbb{E}[\mathrm{REG}_\ell] \ge \sqrt{3}\exp(-27)\sqrt{KT}$, which completes the proof.

Remark C.1. For $K = 2$, the use of Lemma A.4 can be sidestepped via Khintchine's inequality. Indeed, in this case the term in (5) equals $\mathbb{E}\big|\sum_{t=1}^T(X_t - \frac{1}{2})\big| = \frac{1}{2}\mathbb{E}_{\epsilon_1,\dots,\epsilon_T}\big|\sum_{t=1}^T\epsilon_t\big| \ge \frac{\sqrt{T}}{2\sqrt{2}}$, where $\epsilon_1, \dots, \epsilon_T$ are i.i.d. Rademacher random variables and the last inequality follows from Khintchine's inequality (Lemma A.2).

D Bounding the (Actual) Multiclass U-Calibration Error

In this section, we bound the U-calibration error $\mathrm{UCal}_{\mathcal{L}'}$ for subclasses $\mathcal{L}'$ of $\mathcal{L}$ with a finite covering number. We begin by deriving a high-probability bound on the regret of Algorithm 1.

Lemma D.1. Fix some $\ell \in \mathcal{L}$. Then, for any $\delta \in (0, 1)$, the regret of Algorithm 1 against an oblivious adversary satisfies $\mathrm{REG}_\ell \le 4\sqrt{KT} + \sqrt{2T\log(1/\delta)}$ with probability at least $1 - \delta$.

Proof. Define the random variable $X_t := \ell(p_t, y_t)$, so $X_t \in [-1, 1]$. Since the adversary is oblivious and $m_{t,1}, \dots, m_{t,K}$ are sampled afresh and independently every round, $X_1, \dots, X_T$ are independent. Applying Hoeffding's inequality (Lemma A.3), for any $\epsilon > 0$ we have

$$\mathbb{P}\Big(\sum_{t=1}^T\big(\ell(p_t, y_t) - \mathbb{E}[\ell(p_t, y_t)]\big) \ge \epsilon\Big) \le \exp\Big(-\frac{\epsilon^2}{2T}\Big),$$

which implies $\mathbb{P}(\mathrm{REG}_\ell \le \mathbb{E}[\mathrm{REG}_\ell] + \epsilon) \ge 1 - \exp(-\frac{\epsilon^2}{2T})$, since the benchmark term is deterministic under an oblivious adversary. Let $\delta := \exp(-\frac{\epsilon^2}{2T})$, i.e., $\epsilon = \sqrt{2T\log(1/\delta)}$, so that $\mathrm{REG}_\ell \le \mathbb{E}[\mathrm{REG}_\ell] + \sqrt{2T\log(1/\delta)}$ with probability at least $1 - \delta$. Finally, applying the result of Theorem 1 to bound $\mathbb{E}[\mathrm{REG}_\ell] \le 4\sqrt{KT}$ completes the proof.

Using this high-probability bound, we first bound $\mathrm{UCal}_{\mathcal{L}'}$ when $\mathcal{L}'$ is a finite subset of $\mathcal{L}$.

Lemma D.2. Fix a subset $\mathcal{L}' \subseteq \mathcal{L}$ with $|\mathcal{L}'| < \infty$. Algorithm 1 ensures $\mathrm{UCal}_{\mathcal{L}'} \le 2 + 4\sqrt{KT} + \sqrt{2T\log(T|\mathcal{L}'|)}$.

Proof. For any $\ell \in \mathcal{L}'$, let $E_\ell$ denote the event that $\mathrm{REG}_\ell \le 4\sqrt{KT} + \sqrt{2T\log(1/\delta)}$. From Lemma D.1, $\mathbb{P}(E_\ell) \ge 1 - \delta$. Let $S := \sup_{\ell\in\mathcal{L}'}\mathrm{REG}_\ell$. The probability that $S$ is bounded by the same quantity is at least $\mathbb{P}(\cap_{\ell\in\mathcal{L}'}E_\ell)$, which can be bounded as

$$\mathbb{P}\Big(\bigcap_{\ell\in\mathcal{L}'}E_\ell\Big) = 1 - \mathbb{P}\Big(\bigcup_{\ell\in\mathcal{L}'}E_\ell^c\Big) \ge 1 - \sum_{\ell\in\mathcal{L}'}\mathbb{P}(E_\ell^c) = \sum_{\ell\in\mathcal{L}'}\mathbb{P}(E_\ell) - (|\mathcal{L}'| - 1) \ge 1 - |\mathcal{L}'|\delta,$$

where the first equality follows from De Morgan's law, the first inequality from the union bound, and the last inequality from $\mathbb{P}(E_\ell) \ge 1 - \delta$. Setting $\delta = \frac{1}{T|\mathcal{L}'|}$, we obtain

$$\sup_{\ell\in\mathcal{L}'}\mathrm{REG}_\ell \le \underbrace{4\sqrt{KT} + \sqrt{2T\log(T|\mathcal{L}'|)}}_{=:\Delta} \quad \text{with probability at least } 1 - \tfrac{1}{T}.$$

Note that $\mathbb{E}[S] = \mathbb{P}(A)\mathbb{E}[S \mid A] + \mathbb{P}(A^c)\mathbb{E}[S \mid A^c]$, where $A$ denotes the event $\{S \le \Delta\}$. Using the facts $\mathbb{E}[S \mid A] \le \Delta$, $\mathbb{P}(A^c) \le \frac{1}{T}$, and $\mathbb{E}[S \mid A^c] \le 2T$ (since $\ell \in [-1, 1]$), we have $\mathbb{E}[S] \le 2 + \Delta$, which completes the proof.

Before proceeding further, we define the notions of cover and covering number.

Definition D.1 (Cover and Covering Number). An $\epsilon$-cover of a function class $\mathcal{F}$ defined over a domain $\mathcal{X}$ is a function class $C_\epsilon$ such that for any $f \in \mathcal{F}$ there exists $g \in C_\epsilon$ with $\sup_{x\in\mathcal{X}}|f(x) - g(x)| \le \epsilon$. The covering number is then defined as $M(\mathcal{F}, \epsilon; \|\cdot\|_\infty) := \min\{|C_\epsilon| \mid C_\epsilon \text{ is an } \epsilon\text{-cover of } \mathcal{F}\}$, i.e., the size of a minimal cover. The $\|\cdot\|_\infty$ in the notation indicates that the distance between two functions $f, g$ is measured as $\sup_{x\in\mathcal{X}}|f(x) - g(x)|$. The definition generalizes to other metrics/pseudo-metrics, but that is not required for our purposes.

Note that the cover $C_\epsilon$ in Definition D.1 is not necessarily a subset of $\mathcal{F}$. We refer to Wainwright (2019) for an exhaustive treatment of covers and covering numbers of various classes. Let $C_\epsilon$ be a minimal $\epsilon$-cover of $\mathcal{F}$. For each $g \in C_\epsilon$, let $S_{g,\epsilon}$ be the collection of functions $f \in \mathcal{F}$ for which $g$ is a representative, i.e.,

$$S_{g,\epsilon} := \Big\{f \in \mathcal{F} \;\Big|\; \sup_{x\in\mathcal{X}}|f(x) - g(x)| \le \epsilon\Big\}. \quad (6)$$

Clearly, $\cup_{g\in C_\epsilon}S_{g,\epsilon} = \mathcal{F}$. For each $S_{g,\epsilon}$, fix an $f_{\mathrm{lead}} \in S_{g,\epsilon}$ (chosen arbitrarily) as a "leader". Let $L_{\mathrm{lead},\mathcal{F}}$ denote the collection of these leaders. It is clear that $|L_{\mathrm{lead},\mathcal{F}}| \le |C_\epsilon| = M(\mathcal{F}, \epsilon; \|\cdot\|_\infty)$.
In the following lemma, we generalize the result of Lemma D.2 to the case when $\mathcal{L}'$ is a possibly infinite subset of $\mathcal{L}$. Our proof is based on applying Lemma D.2 to the leader set of $\mathcal{L}'$.

Lemma D.3. Fix a subset $\mathcal{L}' \subseteq \mathcal{L}$ with covering number $M(\mathcal{L}', \epsilon; \|\cdot\|_\infty)$. Then, the sequence of forecasts made by Algorithm 1 satisfies, for any $\epsilon > 0$, $\mathrm{UCal}_{\mathcal{L}'} \le 2 + 4\epsilon T + 4\sqrt{KT} + \sqrt{2T\log(T\,M(\mathcal{L}', \epsilon; \|\cdot\|_\infty))}$.

Proof. Let $C_\epsilon$ be an $\epsilon$-cover of $\mathcal{L}'$ of size $M(\mathcal{L}', \epsilon; \|\cdot\|_\infty)$ and let $L_{\mathrm{lead}} \subseteq \mathcal{L}'$ be the corresponding leader set. Applying Lemma D.2, we have

$$\mathbb{E}\Big[\sup_{\ell\in L_{\mathrm{lead}}}\mathrm{REG}_\ell\Big] \le 2 + 4\sqrt{KT} + \sqrt{2T\log(T|L_{\mathrm{lead}}|)} \le 2 + 4\sqrt{KT} + \sqrt{2T\log(T\,M(\mathcal{L}', \epsilon; \|\cdot\|_\infty))}.$$

Now, fix any $\ell \in \mathcal{L}'$, let $g \in C_\epsilon$ be the representative of $\ell$, and let $\ell^\star$ be the leader of the partition $S_{g,\epsilon}$ that $\ell$ belongs to. By definition, $|g(p, y) - \ell(p, y)| \le \epsilon$ and $|g(p, y) - \ell^\star(p, y)| \le \epsilon$ for all $p \in \Delta_K, y \in \mathcal{E}$; therefore, it follows from the triangle inequality that $|\ell(p, y) - \ell^\star(p, y)| \le 2\epsilon$. As usual, let $\beta = \frac{1}{T}\sum_{t=1}^T y_t$ denote the empirical average of the outcomes. Then,

$$\mathrm{REG}_\ell = \sum_{t=1}^T\big(\ell(p_t, y_t) - \ell(\beta, y_t)\big) \le \sum_{t=1}^T\big(\ell^\star(p_t, y_t) - \ell^\star(\beta, y_t) + 4\epsilon\big) = \mathrm{REG}_{\ell^\star} + 4\epsilon T.$$

Taking the supremum over $\ell \in \mathcal{L}'$ on both sides, followed by expectations, we obtain

$$\mathbb{E}\Big[\sup_{\ell\in\mathcal{L}'}\mathrm{REG}_\ell\Big] \le \mathbb{E}\Big[\sup_{\ell\in L_{\mathrm{lead}}}\mathrm{REG}_\ell\Big] + 4\epsilon T \le 2 + 4\epsilon T + 4\sqrt{KT} + \sqrt{2T\log(T\,M(\mathcal{L}', \epsilon; \|\cdot\|_\infty))},$$

which finishes the proof.

E Regret of FTL for Locally Lipschitz Functions

Lemma E.1. Suppose that for a loss function $\ell$ there exists a constant $G_{[\frac{1}{T},1]}$ such that for each $i \in [K]$, $\ell(p, e_i)$ is locally Lipschitz in the sense that $|\ell(p, e_i) - \ell(p', e_i)| \le G_{[\frac{1}{T},1]}\|p - p'\|$ for all $p, p' \in \Delta_K$ with $p_i, p'_i \in [\frac{1}{T}, 1]$. Then the regret of FTL with respect to this loss is at most $2K + 2G_{[\frac{1}{T},1]}(1 + \log T)$.

Proof. Using the Be-the-Leader lemma, the regret of FTL can be bounded as $\mathrm{REG}_\ell \le 2 + \sum_{i=1}^K\sum_{t\ge 2: y_t=e_i}(\ell(p_t, e_i) - \ell(p_{t+1}, e_i))$. Assume that $\mathcal{E}_m \subseteq \mathcal{E}$ of size $m \le K$ contains all the outcomes chosen by the adversary over the $T$ rounds, i.e., $y_t \in \mathcal{E}_m$ for all $t \in [T]$. For each $e_i \in \mathcal{E}_m$, let $k_i$ denote the total number of time instants $t$ with $y_t = e_i$ and let $T_i := \{t_{i,1}, \dots, t_{i,k_i}\}$ denote those instants. Then we have

$$\sum_{t\ge 2: y_t=e_i}\big(\ell(p_t, e_i) - \ell(p_{t+1}, e_i)\big) \le 2 + \sum_{t\in T_i\setminus\{t_{i,1}\}}\big(\ell(p_t, e_i) - \ell(p_{t+1}, e_i)\big),$$

where the inequality follows by bounding $\ell(p_t, e_i) - \ell(p_{t+1}, e_i)$ by 2 at the round where outcome $i$ appears for the first time. For each $e_i \in \mathcal{E}_m$ and all $t \in T_i$ with $t > t_{i,1}$, we have $p_{t,i} = \frac{n_{t-1,i}}{t-1} \ge \frac{1}{T}$ and $p_{t+1,i} = \frac{n_{t-1,i}+1}{t} \ge \frac{1}{T}$. Therefore,

$$\mathrm{REG}_\ell \le 2m + G_{[\frac{1}{T},1]}\sum_{i=1}^K\sum_{t\in T_i\setminus\{t_{i,1}\}}\|p_t - p_{t+1}\| = 2m + G_{[\frac{1}{T},1]}\sum_{i=1}^K\sum_{t\in T_i\setminus\{t_{i,1}\}}\Big\|\frac{n_{t-1}}{t-1} - \frac{n_t}{t}\Big\|.$$

Proceeding as in the proof of Theorem 3, $\|\frac{n_{t-1}}{t-1} - \frac{n_t}{t}\| \le \frac{2}{t}$, and hence $\mathrm{REG}_\ell \le 2m + 2G_{[\frac{1}{T},1]}(1 + \log T)$, where we used $m \le K$ and $\sum_{j=2}^T\frac{1}{j} \le \int_1^T\frac{1}{z}\,dz = \log T$. This completes the proof.

Example E.1. Consider, for instance, the loss whose univariate form is $\bar\ell(p) = -\bar c_K\sum_{i=1}^K p_i^\alpha$, where $\alpha \in (1, 2)$ and $\bar c_K > 0$ is a normalizing constant (which only depends on $K$) ensuring that $\ell(p, y)$ is bounded in $[-1, 1]$. Clearly, $\bar\ell(p)$ is concave, and thus the induced loss $\ell(p, y)$ is proper as per Lemma 1. For the chosen $\bar\ell$, the bivariate form at $y = e_1$ is

$$\ell(p, e_1) = \bar\ell(p) + (1 - p_1)\nabla_1\bar\ell(p) - \sum_{i=2}^K p_i\nabla_i\bar\ell(p) = \bar c_K(\alpha - 1)\sum_{i=1}^K p_i^\alpha - \bar c_K\alpha\,p_1^{\alpha-1}.$$

It is easy to verify that $\nabla_1\ell(p, e_1) = -\bar c_K\alpha(\alpha - 1)p_1^{\alpha-2}(1 - p_1)$ and $\nabla_i\ell(p, e_1) = \bar c_K\alpha(\alpha - 1)p_i^{\alpha-1}$ for all $i > 1$. Thus, for large $T$, $G_{[\frac{1}{T},1]}$ is of order $O(\alpha(\alpha-1)\bar c_K T^{2-\alpha})$. This yields a regret bound of $O(K + \alpha(\alpha-1)\bar c_K T^{2-\alpha}\log T)$, which for $\alpha \in (\frac{3}{2}, 2)$ is better (with respect to $T$) than the $O(\sqrt{KT})$ bound obtained in Theorem 1.

F Proof of Theorem 4

Theorem 4. There exists a proper Lipschitz loss $\ell$ such that: for any algorithm ALG, there exists a choice of $y_1, \dots, y_T$ by an oblivious adversary such that the expected regret of ALG is $\Omega(\log T)$.
As mentioned, the loss we use in this lower bound construction is the squared loss $\ell(p, y) := \|p - y\|^2$ (we ignore the constant $\frac{1}{2}$ here for simplicity, which clearly does not affect the proof). It is clearly convex in $p$ for any $y$. The univariate form of the loss is

$$\bar\ell(p) = \mathbb{E}_{y\sim p}[\ell(p, y)] = \sum_{i=1}^K p_i\|p - e_i\|^2 = \|p\|^2 + 1 - 2\sum_{i=1}^K p_i\langle p, e_i\rangle = 1 - \|p\|^2.$$

We now follow the three steps outlined in Section 4.1.

Step 1: First, it is well known that for convex losses, deterministic algorithms are as powerful as randomized ones. Formally, let $\mathcal{A}_{\mathrm{rand}}$ and $\mathcal{A}_{\mathrm{det}}$ be the classes of randomized and deterministic algorithms respectively for the forecasting problem. Then the following holds:

Lemma F.1. For any loss $\ell(p, y)$ that is convex in $p \in \Delta_K$ for any $y \in \mathcal{E}$, we have

$$\inf_{\mathrm{ALG}\in\mathcal{A}_{\mathrm{rand}}}\ \sup_{y_1,\dots,y_T\in\mathcal{E}}\mathbb{E}[\mathrm{REG}_\ell] = \inf_{\mathrm{ALG}\in\mathcal{A}_{\mathrm{det}}}\ \sup_{y_1,\dots,y_T\in\mathcal{E}}\mathrm{REG}_\ell.$$

Proof. The $\le$ direction is trivial since $\mathcal{A}_{\mathrm{det}} \subseteq \mathcal{A}_{\mathrm{rand}}$. For the other direction, it suffices to show that for any randomized algorithm $\mathrm{ALG} \in \mathcal{A}_{\mathrm{rand}}$, one can construct a deterministic algorithm $\mathrm{ALG}' \in \mathcal{A}_{\mathrm{det}}$ such that, for any fixed sequence $y_1, \dots, y_T \in \mathcal{E}$, the expected regret of ALG is lower bounded by the regret of $\mathrm{ALG}'$. To do so, it suffices to let $\mathrm{ALG}'$ output the expectation of the randomized output of ALG at each time $t$. Since the loss is convex, by Jensen's inequality the loss of $\mathrm{ALG}'$ is at most the expected loss of ALG. This finishes the proof.

Since there is no difference between an oblivious and an adaptive adversary for deterministic algorithms, $\inf_{\mathrm{ALG}\in\mathcal{A}_{\mathrm{det}}}\sup_{y_1,\dots,y_T}\mathrm{REG}_\ell$ can be written as

$$\mathrm{VAL} = \Big(\inf_{p_t\in\Delta_K}\sup_{y_t\in\mathcal{E}}\Big)_{t=1}^T\left[\sum_{t=1}^T\ell(p_t, y_t) - \inf_{p\in\Delta_K}\sum_{t=1}^T\ell(p, y_t)\right],$$

where $(\inf_{p_t\in\Delta_K}\sup_{y_t\in\mathcal{E}})_{t=1}^T$ is shorthand for the iterated expression $\inf_{p_1\in\Delta_K}\sup_{y_1\in\mathcal{E}}\inf_{p_2\in\Delta_K}\sup_{y_2\in\mathcal{E}}\cdots\inf_{p_T\in\Delta_K}\sup_{y_T\in\mathcal{E}}$. Let $n \in \mathbb{N}_0^K$ be a vector such that $n_i$ represents the cumulative count of outcome $i$, and let $r \in \{0, \dots, T\}$ represent the number of remaining rounds. For any $n$ and $r$ such that $\|n\|_1 + r = T$, define $V_{n,r}$ recursively as $V_{n,r} = \inf_{p\in\Delta_K}\sup_{y\in\mathcal{E}}[V_{n+y,r-1} + \ell(p, y)]$ with $V_{n,0} = -\inf_{p\in\Delta_K}\sum_{i=1}^K n_i\ell(p, e_i)$. It is then clear that VAL is simply $V_{0,T}$.

Step 2: We proceed to rewrite and simplify $V_{n,r}$ as follows:

$$V_{n,r} = \inf_{p\in\Delta_K}\sup_{y\in\mathcal{E}}\big[V_{n+y,r-1} + \ell(p, y)\big] = \inf_{p\in\Delta_K}\sup_{q\in\Delta_K}\mathbb{E}_{y\sim q}\big[V_{n+y,r-1} + \ell(p, y)\big] = \sup_{q\in\Delta_K}\inf_{p\in\Delta_K}\mathbb{E}_{y\sim q}\big[V_{n+y,r-1} + \ell(p, y)\big] = \sup_{q\in\Delta_K}\inf_{p\in\Delta_K}\left[\sum_{i=1}^K q_i V_{n+e_i,r-1} + \bar\ell(p) + \langle\nabla\bar\ell(p), q - p\rangle\right] = \sup_{q\in\Delta_K}\left[\sum_{i=1}^K q_i V_{n+e_i,r-1}\right] + \bar\ell(q), \quad (8)$$

where the second equality follows since $\mathbb{E}_{y\sim q}[V_{n+y,r-1} + \ell(p, y)]$ is linear in $q$ and the supremum of a linear function is attained at the boundary; the third equality follows from the minimax theorem, as the objective is convex in $p$ and concave in $q$; the fourth equality uses Lemma 1; and the final equality is because $\bar\ell(p) + \langle\nabla\bar\ell(p), q - p\rangle \ge \bar\ell(q)$, which follows from the concavity of the univariate form of $\ell$, with equality attained at $p = q$.

Throughout the subsequent discussion, we consider $K = 2$ and use the concrete form of the squared loss. Writing $V_{(n_1,n_2),r}$ as $V_{n_1,n_2,r}$ for simplicity, the recurrence becomes

$$V_{n_1,n_2,r} = \sup_{q\in\Delta_2}\big[q_1 V_{n_1+1,n_2,r-1} + q_2 V_{n_1,n_2+1,r-1} + 1 - q_1^2 - q_2^2\big] = \sup_{q\in[0,1]}\big[V_2 + (V_1 - V_2)q - 2(q^2 - q)\big],$$

where $V_1, V_2$ are shorthands for $V_{n_1+1,n_2,r-1}, V_{n_1,n_2+1,r-1}$ respectively. It is straightforward to show via the KKT conditions that

$$\sup_{q\in[0,1]}\big[V_2 + (V_1 - V_2)q - 2(q^2 - q)\big] = \begin{cases} V_2 & \text{if } V_1 - V_2 < -2, \\ \frac{(V_1 - V_2)^2}{8} + \frac{V_1 + V_2}{2} + \frac{1}{2} & \text{if } -2 \le V_1 - V_2 \le 2, \\ V_1 & \text{if } V_1 - V_2 > 2. \end{cases} \quad (\mathrm{R})$$
Thus, (R) (with $V_1, V_2$ replaced by $V_{n_1+1,n_2,r-1}, V_{n_1,n_2+1,r-1}$) is the recurrence we wish to solve to obtain $V_{0,0,T}$. The base case of (R) is

$$V_{n,T-n,0} = -T\,\bar\ell\Big(\Big[\frac{n}{T}, \frac{T-n}{T}\Big]\Big) = -\frac{2n(T-n)}{T}, \quad (9)$$

which holds for all $n$ with $0 \le n \le T$. In the next lemma we show that $-2 \le V_1 - V_2 \le 2$ always holds; therefore, it suffices to solve the recursion (R) corresponding to the middle case only.

Lemma F.2. The recurrence (R) is also equal to the following:

$$V_{n_1,n_2,r} = \frac{(V_{n_1+1,n_2,r-1} - V_{n_1,n_2+1,r-1})^2}{8} + \frac{V_{n_1+1,n_2,r-1} + V_{n_1,n_2+1,r-1}}{2} + \frac{1}{2}.$$

Proof. It suffices to show that the condition $|V_{n_1+1,n_2,r-1} - V_{n_1,n_2+1,r-1}| \le 2$ always holds. Rewriting $n_2 + 1$ as $n_2$ and $r - 1$ as $r$, this is equivalent to showing

$$|V_{n_1,n_2,r} - V_{n_1+1,n_2-1,r}| \le 2 \quad (10)$$

for all $n_1, n_2, r$ such that $n_1 + n_2 + r = T$ and the arguments in (10) are well defined, i.e., $0 \le n_1 \le T - r - 1$ and $1 \le n_2 \le T - r$. We prove this by induction on $r$.

Base case: This corresponds to $r = 0$. For $0 \le n \le T - 1$, using (9),

$$|V_{n,T-n,0} - V_{n+1,T-n-1,0}| = \frac{2}{T}\big|(n+1)(T-n-1) - n(T-n)\big| = 2\left|1 - \frac{2n+1}{T}\right| \le 2,$$

which verifies the base case.

Induction hypothesis: Fix a $k \in \{1, \dots, T\}$ and assume that (10) holds for $r = k - 1$.

Induction step: We show that (10) holds for $r = k$. It follows from (R) and the induction hypothesis that

$$V_{n_1,n_2,k} = \frac{(V_{n_1+1,n_2,k-1} - V_{n_1,n_2+1,k-1})^2}{8} + \frac{V_{n_1+1,n_2,k-1} + V_{n_1,n_2+1,k-1}}{2} + \frac{1}{2},$$
$$V_{n_1+1,n_2-1,k} = \frac{(V_{n_1+2,n_2-1,k-1} - V_{n_1+1,n_2,k-1})^2}{8} + \frac{V_{n_1+2,n_2-1,k-1} + V_{n_1+1,n_2,k-1}}{2} + \frac{1}{2}.$$

Let $\alpha_1 := V_{n_1+2,n_2-1,k-1}$, $\alpha_2 := V_{n_1+1,n_2,k-1}$, $\alpha_3 := V_{n_1,n_2+1,k-1}$, and $\Delta := V_{n_1+1,n_2-1,k} - V_{n_1,n_2,k}$. Subtracting the equations above and writing $x := \alpha_1 - \alpha_2$, $y := \alpha_2 - \alpha_3$, we obtain

$$\Delta = \frac{(\alpha_1 - \alpha_2)^2 - (\alpha_2 - \alpha_3)^2}{8} + \frac{\alpha_1 - \alpha_3}{2} = \frac{x^2 - y^2}{8} + \frac{x + y}{2} = \frac{(x+2)^2 - (y-2)^2}{8}.$$

It follows from the induction hypothesis that $|x| \le 2$ and $|y| \le 2$, therefore $(x+2)^2, (y-2)^2 \in [0, 16]$ and $|\Delta| \le 2$. Summarizing, we have shown $|V_{n_1+1,n_2-1,k} - V_{n_1,n_2,k}| \le 2$, which completes the proof by induction.

Step 3: Since $n_1 + n_2 + r = T$, we may express $V_{n_1,n_2,r}$ in terms of $n_1, n_2$, and $T$. In the next lemma, we show that $V_{n_1,n_2,r}$ exhibits a very special structure when expressed in this manner; this allows us to reduce computing $V_{0,0,T}$ to solving a one-dimensional recurrence.

Lemma F.3. For all $n_1, n_2 \ge 0$ with $n_1 + n_2 = T - r$, it holds that

$$V_{n_1,n_2,r} = \frac{(n_1 - n_2)^2}{2}\Big(u_r + \frac{1}{T}\Big) + v_r - \frac{T - r}{2}, \quad (11)$$

where $\{u_r\}_{r=0}^T$ and $\{v_r\}_{r=0}^T$ are sequences that depend only on $T$, defined by the recurrences

$$u_{r+1} = u_r + \Big(u_r + \frac{1}{T}\Big)^2, \qquad v_{r+1} = v_r + \frac{u_r}{2} + \frac{1}{2T} \quad (12)$$

for all $0 \le r \le T - 1$, with $u_0 = v_0 = 0$.

Proof. As in Lemma F.2, the proof follows by induction on $r$.

Base case: For $r = 0$, it follows from (9) that $V_{n_1,n_2,0} = -\frac{2n_1n_2}{T} = \frac{(n_1-n_2)^2}{2T} - \frac{T}{2}$ (using $n_1 + n_2 = T$), which is consistent with (11) and $u_0 = v_0 = 0$.

Induction hypothesis: Fix a $k \in \{0, \dots, T-1\}$ and assume that (11) holds for $r = k$.

Induction step: We show that (11) holds for $r = k + 1$. From Lemma F.2, we have

$$V_{n_1,n_2,k+1} = \frac{(V_{n_1+1,n_2,k} - V_{n_1,n_2+1,k})^2}{8} + \frac{V_{n_1+1,n_2,k} + V_{n_1,n_2+1,k}}{2} + \frac{1}{2}.$$

It follows from the induction hypothesis (the arguments on the right-hand side sum to $T - k$) that

$$V_{n_1+1,n_2,k} = \frac{(n_1+1-n_2)^2}{2}\Big(u_k + \frac{1}{T}\Big) + v_k - \frac{T-k}{2}, \qquad V_{n_1,n_2+1,k} = \frac{(n_1-n_2-1)^2}{2}\Big(u_k + \frac{1}{T}\Big) + v_k - \frac{T-k}{2}.$$

Define $\delta := V_{n_1+1,n_2,k} - V_{n_1,n_2+1,k}$ and $\sigma := V_{n_1+1,n_2,k} + V_{n_1,n_2+1,k}$. Subtracting the equations above, we obtain

$$\delta = \frac{(n_1+1-n_2)^2 - (n_1-n_2-1)^2}{2}\Big(u_k + \frac{1}{T}\Big) = 2(n_1 - n_2)\Big(u_k + \frac{1}{T}\Big).$$

Adding the equations above, we obtain

$$\sigma = \frac{(n_1+1-n_2)^2 + (n_1-n_2-1)^2}{2}\Big(u_k + \frac{1}{T}\Big) + 2v_k - (T - k) = \big((n_1-n_2)^2 + 1\big)\Big(u_k + \frac{1}{T}\Big) + 2v_k - (T - k).$$

Expressing $V_{n_1,n_2,k+1} = \frac{\delta^2}{8} + \frac{\sigma}{2} + \frac{1}{2}$ and substituting $\delta$ and $\sigma$, we obtain

$$V_{n_1,n_2,k+1} = \frac{(n_1-n_2)^2}{2}\left[\Big(u_k + \frac{1}{T}\Big)^2 + u_k + \frac{1}{T}\right] + \underbrace{v_k + \frac{u_k}{2} + \frac{1}{2T}}_{=\,v_{k+1}} - \frac{T - (k+1)}{2} = \frac{(n_1-n_2)^2}{2}\Big(u_{k+1} + \frac{1}{T}\Big) + v_{k+1} - \frac{T - (k+1)}{2},$$

which completes the induction step.
Since $\mathrm{VAL} = V_{0,0,T} = v_T$, it only remains to bound $v_T$. Note that the recursion describing $v$ is coupled with $u$. However, since we only want to bound $v_T$, we can sum the recursion describing $v$ to obtain
$$\sum_{r=0}^{T-1}(v_{r+1} - v_r) = \frac{1}{2T}\sum_{r=0}^{T-1} u_r + \sum_{r=0}^{T-1}\left(\frac{r+1}{T} - \frac{1}{2}\right) = \frac{1}{2T}\sum_{r=0}^{T-1} u_r + \frac{1}{2}.$$
Moreover, since the summation with respect to $v$ telescopes and $v_0 = 0$, we have $v_T = \frac{1}{2T}\sum_{r=0}^{T-1} u_r + \frac{1}{2}$. Therefore, it remains to bound $\sum_{r=0}^{T-1} u_r$. Define $a_r := \frac{u_r+1}{T}$, so that $v_T = \frac{1}{2}\sum_{r=0}^{T-1} a_r$. The recurrence (12) describing $u$ reduces to
$$a_{r+1} = a_r + a_r^2 \quad \text{for all } 0 \le r \le T-1, \text{ with } a_0 = \frac{1}{T}.$$
In the next result, we obtain bounds on $a_r$.

Lemma F.4. For all $0 \le r \le T-1$, it holds that $a_r \le \frac{1}{T-r}$.

Proof. As usual, the proof shall follow by an induction on $r$. Since $a_0 = \frac{1}{T}$, the base case is trivially satisfied. Fix a $k \in \{0, \ldots, T-2\}$, and assume that $a_k \le \frac{1}{T-k}$. Since $a_{k+1} = a_k + a_k^2$, we have
$$a_{k+1} \le \frac{1}{(T-k)^2} + \frac{1}{T-k} = \frac{T-k+1}{(T-k)^2} = \frac{(T-k)^2 - 1}{(T-k)^2}\cdot\frac{1}{T-k-1} \le \frac{1}{T-k-1}.$$
This completes the induction step.

Lemma F.5. For all $0 \le r \le T-1$, it holds that $a_r \ge \frac{1}{T-r+\log T}$. Furthermore,
$$\sum_{r=0}^{T-1} a_r \ge \log\left(\frac{T}{\log T + 1} + 1\right) = \Omega(\log T).$$

Proof. According to Lemma F.4, we can write $a_r$ as $\frac{1}{T-r+b_r}$ for some non-negative sequence $\{b_r\}$ with $b_0 = 0$. We next obtain the recurrence describing $\{b_r\}$. In particular, since $a_{r+1} = a_r(a_r+1)$, we have
$$\frac{1}{T-(r+1)+b_{r+1}} = \frac{1}{T-r+b_r}\cdot\frac{T-r+b_r+1}{T-r+b_r},$$
which on simplifying (by multiplying both sides with $(T-(r+1)+b_{r+1})(T-r+b_r)$) yields
$$b_{r+1} = b_r + \frac{1 + b_r - b_{r+1}}{T-r+b_r}. \quad (14)$$
Next, we shall show by an induction on $r$ that $b_r \le \sum_{i=0}^{r-1}\frac{1}{T-i}$ for all $0 \le r \le T-1$. Since $b_0 = 0$, the base case is trivially satisfied. Fix a $k \in \{0, \ldots, T-2\}$ and assume that $b_k \le \sum_{i=0}^{k-1}\frac{1}{T-i}$. We consider two cases depending on whether or not $b_{k+1} \ge b_k$. If $b_{k+1} < b_k$, the induction step holds trivially. If $b_{k+1} \ge b_k$, it follows from the recurrence (14) that
$$b_{k+1} \le b_k + \frac{1}{T-k+b_k} \le b_k + \frac{1}{T-k} \le \sum_{i=0}^{k}\frac{1}{T-i},$$
which completes the induction step. Therefore, we have established that $b_r \le \sum_{i=0}^{r-1}\frac{1}{T-i}$ for all $0 \le r \le T-1$. It then follows that
$$b_r \le \sum_{j=T-r+1}^{T}\frac{1}{j} \le \int_{T-r}^{T}\frac{dz}{z} = \log\frac{T}{T-r} \le \log T$$
for all $0 \le r \le T-1$, which translates to $a_r \ge \frac{1}{T-r+\log T}$. This completes the proof of the first part of the lemma. With this lower bound on $a_r$, we can lower bound $\sum_{r=0}^{T-1} a_r$ as
$$\sum_{r=0}^{T-1}\frac{1}{T-r+\log T} = \sum_{i=1}^{T}\frac{1}{i+\log T} \ge \int_{1}^{T+1}\frac{dz}{z+\log T} = \log\left(\frac{T}{\log T + 1} + 1\right),$$
which is $\Omega(\log T)$ for a large $T$. This completes the proof.

To conclude, we have shown
$$\mathrm{VAL} = V_{0,0,T} = v_T = \frac{1}{2}\sum_{r=0}^{T-1} a_r = \Omega(\log T),$$
proving Theorem 4.
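The sandwich established in Lemmas F.4 and F.5 is also easy to confirm numerically; the following check (ours, not from the paper) additionally illustrates that $v_T = \frac{1}{2}\sum_r a_r$ is indeed of order $\log T$.

```python
import math

# Numerical check (ours) of Lemmas F.4 and F.5:
# 1/(T - r + log T) <= a_r <= 1/(T - r), hence sum_r a_r = Theta(log T).
T = 10_000
a, total = 1.0 / T, 0.0
for r in range(T):
    assert 1 / (T - r + math.log(T)) <= a <= 1 / (T - r) + 1e-12
    total += a
    a += a * a  # a_{r+1} = a_r + a_r^2
print(total / 2, 0.5 * math.log(T))  # v_T vs. (log T)/2
```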
G Proof of Theorem 5

Theorem 5. The regret of FTL for learning any $\ell \in \mathcal{L}_{\mathrm{dec}}$ is at most $2K + (K+1)\beta_\ell(1+\log T)$ for a constant $\beta_\ell$ which depends only on $\ell$ and $K$. Consequently, FTL ensures $\mathrm{PUCal}_{\mathcal{L}_{\mathrm{dec}}} \le \mathrm{UCal}_{\mathcal{L}_{\mathrm{dec}}} = O\left(\left(\sup_{\ell\in\mathcal{L}_{\mathrm{dec}}}\beta_\ell\right)K\log T\right)$.

Proof. We work with the notation established in the proof of Lemma E.1: $T_i$ denotes the set of rounds in which the outcome is $e_i$, $t_{i,1}$ is the first such round, and $m \le K$ is the number of distinct outcomes observed. The regret of FTL can be bounded as
$$\mathrm{REG}_\ell \le 2m + \sum_{i}\sum_{t\in T_i\setminus\{t_{i,1}\}} \underbrace{\left(\ell(p_t, e_i) - \ell(p_{t+1}, e_i)\right)}_{=:\ \delta_{t,i}}.$$
In the subsequent steps, we shall bound $\delta_{t,i}$. We begin in the following manner:
$$\delta_{t,i} = \bar{\ell}(p_t) + \langle e_i - p_t, \nabla\bar{\ell}(p_t)\rangle - \bar{\ell}(p_{t+1}) - \langle e_i - p_{t+1}, \nabla\bar{\ell}(p_{t+1})\rangle$$
$$= \bar{\ell}(p_t) - \bar{\ell}(p_{t+1}) + [\nabla\bar{\ell}(p_t)]_i - [\nabla\bar{\ell}(p_{t+1})]_i + \langle p_{t+1}, \nabla\bar{\ell}(p_{t+1})\rangle - \langle p_t, \nabla\bar{\ell}(p_t)\rangle$$
$$\le [\nabla\bar{\ell}(p_t)]_i - [\nabla\bar{\ell}(p_{t+1})]_i + \langle p_t, \nabla\bar{\ell}(p_{t+1}) - \nabla\bar{\ell}(p_t)\rangle,$$
where the first equality follows from Lemma 1; the inequality follows since $\bar{\ell}(p_t) \le \bar{\ell}(p_{t+1}) + \langle\nabla\bar{\ell}(p_{t+1}), p_t - p_{t+1}\rangle$, which in turn follows from the concavity of $\bar{\ell}$. Next, by the Mean Value Theorem, $\nabla\bar{\ell}(p_{t+1}) - \nabla\bar{\ell}(p_t) = \nabla^2\bar{\ell}(p_t + v(p_{t+1}-p_t))\,(p_{t+1}-p_t)$ for some $v\in[0,1]$. Note that $p_{t,i} = \frac{n_{t-1,i}}{t-1}$ and $p_{t+1,i} = \frac{n_{t-1,i}+1}{t}$ (since $y_t = e_i$), therefore $p_{t+1,i} \ge p_{t,i}$. Let $\xi_t := p_t + v(p_{t+1}-p_t)$. Then,
$$[\nabla\bar{\ell}(p_t)]_i - [\nabla\bar{\ell}(p_{t+1})]_i = \left\langle[\nabla^2\bar{\ell}(\xi_t)]_i,\; p_t - p_{t+1}\right\rangle,$$
where $[\nabla^2\bar{\ell}(\xi_t)]_i$ denotes the $i$-th row of $\nabla^2\bar{\ell}(\xi_t)$; we arrive at
$$\delta_{t,i} \le \left\langle[\nabla^2\bar{\ell}(\xi_t)]_i,\; p_t - p_{t+1}\right\rangle + \left\langle p_{t+1} - p_t,\; \nabla^2\bar{\ell}(\xi_t)\,p_t\right\rangle = \left|\nabla^2_{i,i}\bar{\ell}(\xi_t)\right|(p_{t+1,i} - p_{t,i}) + \sum_{j=1}^{K} p_{t,j}\,\nabla^2_{j,j}\bar{\ell}(\xi_t)\,(p_{t+1,j} - p_{t,j})$$
$$\le \left|\nabla^2_{i,i}\bar{\ell}(\xi_t)\right|(p_{t+1,i} - p_{t,i}) + \sum_{j\ne i} p_{t,j}\,\nabla^2_{j,j}\bar{\ell}(\xi_t)\,(p_{t+1,j} - p_{t,j}) \le \left|\nabla^2_{i,i}\bar{\ell}(\xi_t)\right|(p_{t+1,i} - p_{t,i}) + \sum_{j\ne i}\left|\nabla^2_{j,j}\bar{\ell}(\xi_t)\right|\left|p_{t,j} - p_{t+1,j}\right|, \quad (15)$$
where the first equality follows since the Hessian of a decomposable loss is diagonal, $p_{t+1,i} \ge p_{t,i}$, and $\nabla^2_{i,i}\bar{\ell}(\xi_t) \le 0$; the second inequality follows by dropping the term $p_{t,i}\,\nabla^2_{i,i}\bar{\ell}(\xi_t)\,(p_{t+1,i} - p_{t,i})$, which is non-positive; the final inequality is because, for $j \ne i$, we have $p_{t+1,j} = \frac{n_{t-1,j}}{t}$ and $p_{t,j} = \frac{n_{t-1,j}}{t-1}$, therefore $p_{t+1,j} \le p_{t,j}$, and we bound $p_{t,j} \le 1$.

To proceed with the further bounding, we apply Lemma 2 and utilize the growth condition on the Hessian:
$$\left|\nabla^2_{j,j}\bar{\ell}(\xi_t)\right| \le c_K\, c_j \max\left(\frac{1}{\xi_{t,j}}, \frac{1}{1-\xi_{t,j}}\right)$$
for some constant $c_j > 0$ and all $j\in[K]$ (here $c_K$ is the scaling constant such that $\bar{\ell}(p) = c_K\sum_{i=1}^{K}\bar{\ell}_i(p_i)$). Let $\beta_{\ell,j} := c_K c_j$. Using the bound on $\nabla^2_{j,j}\bar{\ell}(\xi_t)$ to bound the terms in (15), we obtain
$$\delta_{t,i} \le \sum_{j=1}^{K}\beta_{\ell,j}\max\left(\frac{1}{\xi_{t,j}}, \frac{1}{1-\xi_{t,j}}\right)\left|p_{t,j} - p_{t+1,j}\right|. \quad (16)$$
Next, we shall bound the terms $T_1 := \frac{p_{t+1,i}-p_{t,i}}{\xi_{t,i}}$, $T_2 := \frac{p_{t+1,i}-p_{t,i}}{1-\xi_{t,i}}$, $T_3 := \frac{p_{t,j}-p_{t+1,j}}{\xi_{t,j}}$, and $T_4 := \frac{p_{t,j}-p_{t+1,j}}{1-\xi_{t,j}}$, where the index $j$ in $T_3$ and $T_4$ runs over $j \ne i$. Note that
$$T_1 = \frac{p_{t+1,i}-p_{t,i}}{p_{t,i}+v(p_{t+1,i}-p_{t,i})} \le \frac{p_{t+1,i}-p_{t,i}}{p_{t,i}} = \frac{(n_{t-1,i}+1)(t-1)}{t\,n_{t-1,i}} - 1 = \frac{t-1-n_{t-1,i}}{t\,n_{t-1,i}} \le \frac{1}{n_{t-1,i}},$$
where the first equality follows from the definition of $\xi_t$; the first inequality follows since $p_{t+1,i} \ge p_{t,i}$ and $v\in[0,1]$. Similarly,
$$T_2 = \frac{p_{t+1,i}-p_{t,i}}{1-p_{t,i}-v(p_{t+1,i}-p_{t,i})} \le \frac{p_{t+1,i}-p_{t,i}}{1-p_{t+1,i}} = \frac{t-1-n_{t-1,i}}{t(t-1)}\cdot\frac{t}{t-1-n_{t-1,i}} = \frac{1}{t-1},$$
where the first inequality follows since $\xi_{t,i} \le p_{t+1,i}$. For $T_3$, we have
$$T_3 = \frac{p_{t,j}-p_{t+1,j}}{p_{t,j}+v(p_{t+1,j}-p_{t,j})} \le \frac{p_{t,j}-p_{t+1,j}}{p_{t+1,j}} = \frac{t}{t-1} - 1 = \frac{1}{t-1},$$
where the inequality follows since $v\in[0,1]$ and $p_{t+1,j} \le p_{t,j}$ imply $\xi_{t,j} \ge p_{t+1,j}$ (when $n_{t-1,j} = 0$, the term vanishes). Finally, we bound $T_4$ as
$$T_4 = \frac{p_{t,j}-p_{t+1,j}}{1-p_{t,j}-v(p_{t+1,j}-p_{t,j})} \le \frac{p_{t,j}-p_{t+1,j}}{1-p_{t,j}} = \frac{n_{t-1,j}}{t(t-1)}\cdot\frac{t-1}{t-1-n_{t-1,j}} = \frac{n_{t-1,j}}{t\,(t-1-n_{t-1,j})} \le \frac{1}{n_{t-1,i}},$$
where the final inequality is because $\sum_{j=1}^{K} n_{t-1,j} = t-1$ implies $t-1-n_{t-1,j} \ge n_{t-1,i}$, and $n_{t-1,j} \le t$. Collecting the bounds on $T_1, T_2, T_3$, and $T_4$, and substituting them back in (16), we get
$$\delta_{t,i} \le \sum_{j=1}^{K}\beta_{\ell,j}\max\left(\frac{1}{n_{t-1,i}}, \frac{1}{t-1}\right) \le \sum_{j=1}^{K}\beta_{\ell,j}\left(\frac{1}{n_{t-1,i}} + \frac{1}{t-1}\right).$$
Summing over all $t$, we obtain $\mathrm{REG}_\ell \le 2m + \beta_\ell(S_1+S_2)$, where $S_1 := \sum_i\sum_{t\in T_i\setminus\{t_{i,1}\}}\frac{1}{n_{t-1,i}}$, $S_2 := \sum_i\sum_{t\in T_i\setminus\{t_{i,1}\}}\frac{1}{t-1}$, and $\beta_\ell := \sum_{i=1}^{K}\beta_{\ell,i}$. Note that the subscript in $\beta_\ell$ denotes that the constant depends only on the loss $\ell$ (and $K$). We bound $S_1$ as
$$S_1 \le \sum_{i=1}^{m}\sum_{j=1}^{|T_i|-1}\frac{1}{j} \le m(1+\log T) \le K(1+\log T).$$
Next, note that $S_2 \le \sum_{t=1}^{T}\frac{1}{t} \le 1+\log T$. Thus, $S_1+S_2 \le (K+1)(1+\log T)$, which yields $\mathrm{REG}_\ell \le 2K + (K+1)\beta_\ell(1+\log T)$. This completes the proof.
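For intuition, the following simulation (entirely ours; the i.i.d. adversary and all names are illustrative assumptions, not the paper's construction) runs FTL, which predicts the empirical average of past outcomes, under the squared loss, a decomposable proper loss, and prints regret values consistent with the logarithmic bound of Theorem 5.

```python
import random

# FTL simulation (ours) under the squared loss l(p, y) = ||p - y||^2.
def ftl_regret(T: int, K: int = 3, seed: int = 0) -> float:
    rng = random.Random(seed)
    counts = [0] * K
    alg_loss = 0.0
    for t in range(1, T + 1):
        # FTL: the ERM of any proper loss is the empirical average of past
        # outcomes (uniform at t = 1, when there is no data yet).
        p = [c / (t - 1) for c in counts] if t > 1 else [1.0 / K] * K
        i = rng.randrange(K)  # adversary here: i.i.d. uniform outcomes
        alg_loss += sum((p[j] - (1.0 if j == i else 0.0)) ** 2 for j in range(K))
        counts[i] += 1
    # Benchmark: best fixed prediction in hindsight = overall empirical average.
    q = [c / T for c in counts]
    best = sum(counts[i] * sum((q[j] - (1.0 if j == i else 0.0)) ** 2
                               for j in range(K)) for i in range(K))
    return alg_loss - best

for T in (100, 1000, 10000):
    print(T, round(ftl_regret(T), 3))  # grows at most logarithmically in T
```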
H Proof of Lemma 2

Lemma 2. For a function $f$ that is concave, Lipschitz, and bounded over $[0,1]$, and twice continuously differentiable over $(0,1)$, there exists a constant $c > 0$ such that $|f''(p)| \le c\max\left(\frac{1}{p}, \frac{1}{1-p}\right)$ for all $p\in(0,1)$.

Proof. Since $f$ is twice differentiable, if $|f''(p)|$ does not approach infinity at the boundary of $[0,1]$, then there is nothing to prove. In the rest of the proof, we assume that $|f''(p)|$ approaches infinity both when $p$ approaches $0$ and when $p$ approaches $1$ (the case when it only approaches infinity at one side is exactly the same). Using the technical result of Lemma H.1, there exists an $\epsilon_0\in(0,1)$ such that for any $p\in(0,\epsilon_0]$,
$$|f''(p)| = \frac{|f'(q)-f'(0)|}{q} \quad \text{for some } q\in[p,1],$$
which can be further bounded by
$$\frac{|f'(q)-f'(0)|}{p} \le \frac{|f'(q)|+|f'(0)|}{p} \le \frac{c_1}{p} \le c_1\max\left(\frac{1}{p}, \frac{1}{1-p}\right)$$
with $c_1 := 2\sup_{q\in[0,1]}|f'(q)|$ (finite due to $f$ being Lipschitz). Similarly, there exists an $\epsilon_1\in(0,1)$ such that for any $p\in[1-\epsilon_1,1)$,
$$|f''(p)| = \frac{|f'(1)-f'(q)|}{1-q} \quad \text{for some } q\in[0,p],$$
which is further bounded by
$$\frac{|f'(1)-f'(q)|}{1-p} \le \frac{|f'(1)|+|f'(q)|}{1-p} \le c_1\max\left(\frac{1}{p}, \frac{1}{1-p}\right).$$
Finally, for any $p\in(\epsilon_0, 1-\epsilon_1)$, we trivially bound $|f''(p)|$ as
$$|f''(p)| \le \sup_{q\in[\epsilon_0,\,1-\epsilon_1]}|f''(q)| =: c_2 \le c_2\max\left(\frac{1}{p}, \frac{1}{1-p}\right),$$
where $c_2$ is finite since $f$ is twice continuously differentiable in $(0,1)$. Setting $c = \max(c_1, c_2)$ finishes the proof.

Lemma H.1. Let $f$ satisfy the conditions of Lemma 2 and additionally $\lim_{p\to 0^+}|f''(p)| = \infty$ and $\lim_{p\to 1^-}|f''(p)| = \infty$. Then there exists $\epsilon_0\in(0,1)$ such that for any $p\in(0,\epsilon_0]$, we have $f'(q)-f'(0) = f''(p)\,q$ for some $q\in[p,1]$. Similarly, there exists $\epsilon_1\in(0,1)$ such that for any $p\in[1-\epsilon_1,1)$, we have $f'(1)-f'(q) = f''(p)\,(1-q)$ for some $q\in[0,p]$.

Proof. For simplicity, we only prove the first part of the lemma since the proof of the second part follows the same argument. Since $f$ is twice continuously differentiable in $(0,1)$ and $\lim_{p\to 0^+}|f''(p)| = \infty$, there exists an $\epsilon\in(0,1)$ such that $|f''|$ is decreasing in $(0,\epsilon]$. Since $f$ is concave, this is equivalent to $f''$ being increasing in $(0,\epsilon]$. Now, applying the Mean Value Theorem, we know that there exists $\epsilon_0\in(0,\epsilon]$ such that $f'(\epsilon)-f'(0) = f''(\epsilon_0)\,\epsilon$. It remains to prove that for any $p\in(0,\epsilon_0]$, the function $g(q) := f'(q)-f'(0)-f''(p)\,q$ has a root in $[p,1]$. This is true because
$$g(p) = f'(p)-f'(0)-f''(p)\,p = \left(f''(\xi)-f''(p)\right)p \le 0,$$
where the equality is by the Mean Value Theorem again for some point $\xi\in[0,p]$, and the inequality holds since $f''$ is increasing in $(0,\epsilon]$. On the other hand, we have
$$g(\epsilon) = f'(\epsilon)-f'(0)-f''(p)\,\epsilon = \left(f''(\epsilon_0)-f''(p)\right)\epsilon \ge 0,$$
where the equality holds by the definition of $\epsilon_0$ and the inequality holds since, again, $f''$ is increasing in $(0,\epsilon]$ and $p \le \epsilon_0$. Applying the Intermediate Value Theorem then shows that $g(q)$ has a root in $[p,\epsilon] \subseteq [p,1]$, which finishes the proof.
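As a concrete instance of Lemma 2 (our example, not the paper's): $f(p) = -\frac{4}{3}p^{3/2}$ is concave and bounded on $[0,1]$, and Lipschitz since $f'(p) = -2\sqrt{p}$ is bounded, yet $f''(p) = -1/\sqrt{p}$ is unbounded near $0$. The lemma's conclusion holds for it with $c = 1$, which the snippet below verifies on a grid.

```python
# Checking Lemma 2's envelope for f(p) = -(4/3) * p**1.5 (our example):
# f''(p) = -1/sqrt(p), and |f''(p)| <= 1 * max(1/p, 1/(1 - p)) on (0, 1).
for k in range(1, 1000):
    p = k / 1000
    assert abs(-1 / p ** 0.5) <= max(1 / p, 1 / (1 - p))
print("Lemma 2 holds with c = 1 for f(p) = -(4/3) p^(3/2)")
```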
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: All key contributions of the paper are summarized in the abstract, and more details can be found in Section 1.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Refer to Sections 1 and 5.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: In every section, the assumptions are clearly specified, and each lemma/theorem has a proof which can be found in the appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper does not have any experiments.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper does not have any experiments.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not have any experiments.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper does not have any experiments.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not have any experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The authors have carefully reviewed the NeurIPS Code of Ethics and thereby confirm that this work agrees with it.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The paper does not pose any societal impacts.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use any existing assets.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.