# Truthfulness of Calibration Measures

Nika Haghtalab, Mingda Qiao, Kunhe Yang, and Eric Zhao
University of California, Berkeley
{nika,mingda.qiao,kunheyang,eric.zh}@berkeley.edu

We study calibration measures in a sequential prediction setup. In addition to rewarding accurate predictions (completeness) and penalizing incorrect ones (soundness), an important desideratum of calibration measures is truthfulness, a minimal condition for the forecaster not to be incentivized to exploit the system. Formally, a calibration measure is truthful if the forecaster (approximately) minimizes the expected penalty by predicting the conditional expectation of the next outcome, given the prior distribution of outcomes. We conduct a taxonomy of existing calibration measures. Perhaps surprisingly, all of them are far from being truthful. We introduce a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE), which is complete and sound, and under which truthful prediction is optimal up to a constant multiplicative factor. In contrast, under existing calibration measures, there are simple distributions on which a polylogarithmic (or even zero) penalty is achievable, while truthful prediction leads to a polynomial penalty.

## 1 Introduction

Probability forecasting is a central prediction task for a wide range of domains and applications, such as finance, meteorology, and medicine [MW84, DF83, WM68, JOKOM12, KSB21, VCV15, BF+02, CAT16]. For forecasts to be useful, a common minimum requirement is that they are calibrated, i.e., the predictions are unbiased conditioned on the predicted value. Formally, for a sequence of $T$ binary events, a forecaster who predicts probabilities in $[0,1]$ is perfectly calibrated if, for every $\alpha \in [0,1]$, among the time steps on which $\alpha$ is predicted, an $\alpha$ fraction of the outcomes is indeed $1$. Since perfectly calibrated forecasts are often unachievable, calibration measures have been introduced to quantify some form of deviation from perfectly calibrated forecasts. Common examples of these measures include the expected calibration error (ECE) [FV98], the smooth calibration error [KF08], and the distance from calibration [BGHN23].

As these calibration measures are commonly used to evaluate the performance of forecasters, it is important that their use encourages forecasters to incorporate the highest-quality information available to them (e.g., via their expert knowledge or side information) about the next outcome. This desideratum, formally referred to as truthfulness, requires that a calibration measure incentivize forecasters to predict truthfully when the true distribution of the next outcome is known to them. A lack of truthfulness can have severe consequences: a non-truthful measure poorly reflects the quality of forecasts, tempts forecasters to make deliberately biased predictions in order to game the system, and erodes trust in predictions provided by third-party forecasters. Given the importance of truthfulness, we set out to identify calibration measures that demonstrate truthfulness.

While the truthfulness of calibration measures has not been systematically investigated to date, evidence of the lack of truthfulness of some calibration measures has emerged in recent literature. For example, [FH21, QV21] noted that a forecaster can lower their ECE by predicting according to the past. This observation was applied in the algorithm of [FH21] and motivated the sidestepping technique in the lower bound proof of [QV21].
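To make the ECE referenced above concrete, the following minimal sketch (our code, not the authors'; it implements the ECE formula stated formally in Appendix A) computes the per-value biases and checks calibration:

```python
# Minimal sketch: ECE(x, p) = sum over distinct predicted values alpha of
# |Delta_alpha|, where Delta_alpha = sum_t (x_t - p_t) * 1[p_t = alpha].
from collections import defaultdict

def ece(x, p):
    """Expected calibration error of predictions p on binary outcomes x."""
    bias = defaultdict(float)
    for xt, pt in zip(x, p):
        bias[pt] += xt - pt
    return sum(abs(b) for b in bias.values())

# Perfectly calibrated: 1/2 is predicted four times and exactly half of those
# outcomes are 1, so every bias Delta_alpha vanishes.
assert ece([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]) == 0.0
# Miscalibrated: all four outcomes are 1, giving |Delta_{1/2}| = 2.
assert ece([1, 1, 1, 1], [0.5, 0.5, 0.5, 0.5]) == 2.0
```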
More recently, [QZ24] highlighted a large gap in the truthfulness of a recently proposed calibration measure (the distance from calibration [BGHN23]) by showing that, in a simple setup of predicting i.i.d. outcomes, the truthful forecaster incurs a distance of $\Omega(\sqrt{T})$ from calibration, while there is a forecasting algorithm that achieves a $\mathrm{polylog}(T)$ distance from calibration. We call this a $\mathrm{polylog}(T)$-$\Omega(\sqrt{T})$ truthfulness gap. On the other hand, we say that a calibration measure is $(\alpha, \beta)$-truthful if predicting the next outcome according to its conditional distribution incurs a penalty of no more than $\alpha \cdot \mathrm{OPT} + \beta$, where $\mathrm{OPT}$ is the minimum value of the calibration measure achievable by any forecaster. Faced with evidence that some calibration measures suffer from large truthfulness gaps, we systematically examine the truthfulness (or a gap thereof) of a wide range of calibration measures.

For a truthful calibration measure to also be useful, it must distinguish accurate predictions from inaccurate ones. After all, a measure that is uniformly 0 regardless of the quality of predictions is perfectly truthful (formally, $(1, 0)$-truthful) but provides no insight into the quality of the predictions. We formalize the minimum requirement for a measure to be useful by its completeness and soundness when predicting i.i.d. Bernoulli outcomes. The former requires that predicting the outcomes according to the correct parameter of the generating Bernoulli distribution incurs no or $o(T)$ penalty, whereas the latter requires the penalty to be $\Omega(T)$ when predictions systematically deviate from the correct parameter.

An equally important feature of a calibration measure is that it defines an ideal that can be asymptotically achieved for all prediction tasks. This is formalized by the existence of forecasting algorithms with an $o(T)$ penalty in the adversarial sequential prediction setting [FV98], where the sequence of outcomes is produced by an adaptive adversary. With these desiderata in place (namely truthfulness, soundness, completeness, and asymptotic calibration), we ask: are there calibration measures that simultaneously satisfy all these criteria? We answer this question in three parts.

Part I: We show that existing calibration measures do not simultaneously meet these criteria. We conduct a taxonomy of several existing calibration measures in terms of their completeness, soundness, and truthfulness (formally defined in Section 2). We show that almost all of them have large truthfulness gaps: there are simple distributions on which an $O(1)$ (or even zero) penalty is achievable, while truthful predictions lead to a $\mathrm{poly}(T)$ penalty; see Table 1 for details. Indeed, this lack of truthfulness is not limited to specific or contrived distributions. In the next theorem, which we prove in Appendix B, we strengthen these findings by showing that a commonly used notion of calibration systematically suffers large truthfulness gaps in most forecasting instances.

Theorem 1.1 (Informal). For every product distribution with marginals bounded away from 0 and 1, the truthful forecaster incurs an $\Omega(\sqrt{T})$ smooth calibration error, but there exists a forecasting algorithm that incurs only a $\mathrm{polylog}(T)$ smooth calibration error.

A notable exception in Table 1 is the class of calibration measures induced by proper scoring rules, i.e., loss functions for probabilistic predictions that are optimized by truthful forecasts.
By definition, these calibration measures are $(1, 0)$-truthful. However, none of them is complete: as we show in Appendix A, even on i.i.d. Bernoulli trials, the optimal and truthful predictions incur an $\Omega(T)$ penalty.

Part II: We introduce a new calibration measure, called the SSCE, that is sound, complete, and approximately truthful. We do this via a simple adjustment to an existing calibration measure: we subsample a subset of the time steps and evaluate the smooth calibration error [KF08] on this sampled set only. We call this the Subsampled Smooth Calibration Error (SSCE) and formally define it in Section 2. Our main result is that the SSCE is $(O(1), 0)$-truthful.

Theorem 1.2 (Main Theorem). There exists a universal constant $c > 0$ such that the SSCE is $(c, 0)$-truthful. Furthermore, the SSCE is complete and sound.

As shown in Table 1, to the best of our knowledge, the SSCE is the first calibration measure that simultaneously achieves completeness, soundness, and non-trivial truthfulness. While our methodology for constructing this calibration measure is simple, the analytical steps required to establish the $(O(1), 0)$-truthfulness guarantee are far from simple. We dedicate most of the main body of this paper to illustrating the proof ideas in a series of warmups to Theorem 1.2.

Part III: There is a forecasting algorithm that achieves an $O(\sqrt{T})$ SSCE even in the adversarial setting. While our study of the truthfulness of calibration measures is necessarily focused on the case where the forecaster knows the conditional distribution of the next outcome, it is important to ensure that, even in the adversarial setting, a sublinear penalty can be achieved for this calibration measure. For this, we study the sequential calibration setting (e.g., [FV98]), where the outcome at time $t$ is chosen by an adaptive adversary who has observed the sequence of earlier outcomes and predictions. We show that an $O(\sqrt{T})$ SSCE is achievable.

Theorem 1.3. In the adversarial sequential calibration setting, there is a deterministic strategy for the forecaster that achieves an $O(\sqrt{T})$ SSCE.

An interesting and important feature of this result is that it achieves an $O(\sqrt{T})$ rate, whereas an $O(\sqrt{T})$ rate for the expected calibration error is known to be impossible [QV21]. Together, Theorems 1.2 and 1.3 establish that the SSCE is a truthful and useful calibration measure.

| Calibration Measure | Complete? | Sound? | Truthful? |
| --- | --- | --- | --- |
| Expected Calibration Error, Maximum Swap Regret | ✓ | ✓ | $0$-$\Omega(T)$ gap |
| Smooth Calibration, Distance from Calibration, Interval Calibration, Laplace-Kernel Calibration | ✓ | ✓ | $0$-$\Omega(\sqrt{T})$ gap |
| U-Calibration Error | ✓ | ✓ | $O(1)$-$\Omega(\sqrt{T})$ gap |
| Proper Scoring Rules | ✗ | ✓ | $(1, 0)$-truthful |
| Subsampled Smooth Calibration Error | ✓ | ✓ | $(O(1), 0)$-truthful |

Table 1: Evaluation of existing calibration measures along with the SSCE, in terms of completeness, soundness, and truthfulness (Definitions 2.2 and 2.5). An $\alpha$-$\beta$ truthfulness gap means that there is a prediction instance on which forecasting according to the true conditional distribution of the next outcome incurs more than $\beta$ penalty, but there is a forecasting strategy that incurs at most $\alpha$ penalty. See Appendix A for more details.

### 1.1 Related Work

There is a large body of work on calibration, a notion that dates back to the 1950s [Bri50, Daw82, Daw85] and has been applied to game theory [FV97, HPY23], machine learning [GPSW17], and algorithmic fairness [KMR17, PRW+17, HJKRR18, HJZ23]. We restrict our discussion to sequential calibration and the systematic study of calibration measures, which are the closest to this work.
Sequential calibration. Foster and Vohra [FV98] first proved that one can achieve asymptotic calibration on arbitrary and adaptive outcomes. Formally, they gave a forecasting algorithm with an $O(T^{2/3})$ ECE in expectation, when predicting $T$ binary outcomes chosen by an adaptive adversary. Subsequent work gave alternative and simpler proofs of the result [FL99, Fos99, Har22], extended the result to other calibration measures [KF08, FH18, FH21, QZ24], and proved lower bounds on the optimal ECE [QV21]. Most closely related to our approach is the work of [FRST11], who studied a stronger notion that requires calibration on a family of checking rules, where each checking rule specifies a subset of the time horizon. Despite the apparent similarity, their notion is qualitatively different from the SSCE, since we take an expectation over the subsampled horizon, whereas they take the maximum. In particular, no forecaster can be calibrated in their definition if the checking rule family contains all subsets of $[T]$, since there always exists a checking rule that strongly correlates with the outcomes.

Calibration measures. The recent work of Błasiok, Gopalan, Hu and Nakkiran [BGHN23] initiated the rigorous study of calibration measures. Their work focused on the offline setup, where there is a known marginal distribution over the feature space, and each predictor maps the feature space to $[0,1]$. They proposed to use the distance from calibration (the $\ell_1$ distance from the predictor to the closest perfectly calibrated predictor) as the ground truth, and studied whether existing calibration measures are consistent with it. Note that completeness and soundness are defined differently in [BGHN23]: a calibration measure is called complete (resp., sound) if it is upper (resp., lower) bounded by a polynomial of the distance from calibration. Since the distance from calibration is far from being truthful in the online setup (as shown by [QZ24]), our definitions of completeness and soundness set up minimal conditions for an error metric to be regarded as measuring calibration, rather than enforcing closeness to the distance from calibration.

Subsampling. Our new calibration measure is derived from subsampling the time horizon. This simple idea has been shown to be effective in various contexts, including privacy amplification in differential privacy (e.g., [Ste22, Section 6]), handling adversarial corruptions [BLMT22], and adaptive data analysis [Bla23].

Proper scoring rules. Proper scoring rules [WM68] are error metrics for probabilistic forecasts that are optimized when the forecaster predicts according to the true distribution. While the error metrics induced by proper scoring rules are (perfectly) truthful by definition, as we show in Appendix A, they are qualitatively different from the usual calibration measures and, in particular, do not meet the completeness criterion. We note that a recent line of work [CY21, NNW21, LHSW22, PW22, HSLW23] studied the optimization of scoring rules, namely, finding the proper scoring rule that maximally incentivizes the forecaster to exert effort to obtain additional information.

## 2 Preliminaries

Sequential prediction. We consider the following prediction setup: First, a sequence $x \in \{0,1\}^T$ is sampled from a distribution $\mathcal{D}$. At each step $t \in [T]$, the forecaster makes a prediction $p_t \in [0,1]$, after which $x_t$ is revealed. Formally, a deterministic forecaster is a function $\mathcal{A} : \bigcup_{t=1}^{T} \{0,1\}^{t-1} \to [0,1]$, where $\mathcal{A}(b_1, b_2, \ldots, b_{t-1})$ specifies the forecaster's prediction at step $t$ if the first $t-1$ observations match $b_{1:(t-1)}$.
Distribution $\mathcal{D}$ and forecaster $\mathcal{A}$ naturally induce a joint distribution of $(x, p) \in \{0,1\}^T \times [0,1]^T$ via sampling $x \sim \mathcal{D}$ and predicting $p_t = \mathcal{A}(x_1, x_2, \ldots, x_{t-1})$. Note that we could have defined the forecaster as a function of both the outcomes $x_{1:(t-1)}$ and the predictions $p_{1:(t-1)}$ in the past. This alternative definition is equivalent to ours, since $p_{1:(t-1)}$ would be uniquely determined by $x_{1:(t-1)}$. We could also have considered randomized forecasters, which are specified by distributions over deterministic forecasters. However, as we will see later, restricting our attention to deterministic forecasters does not affect the subsequent definitions.

Calibration measures. The quality of the forecaster's predictions in the setting above is quantified by calibration measures. Formally, a calibration measure $\mathsf{CM}$ is a family of functions $\{\mathsf{CM}_T : T \in \mathbb{N}\}$, where each $\mathsf{CM}_T$ maps $\{0,1\}^T \times [0,1]^T$ to $[0, T]$. We will frequently omit the subscript $T$, since it is usually clear from the context. With respect to calibration measure $\mathsf{CM}$, the expected penalty incurred by forecaster $\mathcal{A}$ on distribution $\mathcal{D}$ is defined as $\mathrm{err}_{\mathsf{CM}}(\mathcal{D}, \mathcal{A}) := \mathbb{E}_{(x,p) \sim (\mathcal{D}, \mathcal{A})}[\mathsf{CM}(x, p)]$, where $(x, p) \sim (\mathcal{D}, \mathcal{A})$ denotes sampling a sequence $x$ and predictions $p$ from the joint distribution induced by $\mathcal{D}$ and $\mathcal{A}$.

One example of a calibration measure is the smooth calibration error introduced by [KF08], defined as $\mathsf{smCE}(x, p) := \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} f(p_t)(x_t - p_t)$, where $\mathcal{F}$ is the family of 1-Lipschitz functions from $[0,1]$ to $[-1,1]$. In this work, we introduce a new calibration measure called the Subsampled Smooth Calibration Error (SSCE), defined by subsampling a subset of the time horizon and evaluating the smooth calibration error on it. In the following, $\mathrm{Unif}(S)$ denotes the uniform distribution over a finite set $S$. For a $T$-dimensional vector $x$ and $S \subseteq [T]$, $x|_S$ denotes the $|S|$-dimensional vector formed by the entries of $x$ indexed by $S$.

Definition 2.1 (Subsampled Smooth Calibration Error). For a sequence of outcomes $x \in \{0,1\}^T$ and predictions $p \in [0,1]^T$, the Subsampled Smooth Calibration Error (SSCE) is defined as

$$\mathsf{SSCE}(x, p) := \mathbb{E}_{S \sim \mathrm{Unif}(2^{[T]})}\left[\mathsf{smCE}(x|_S, p|_S)\right] = \mathbb{E}_{y \sim \mathrm{Unif}(\{0,1\}^T)}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} y_t \, f(p_t)(x_t - p_t)\right].$$
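Definition 2.1 suggests a direct Monte Carlo evaluation. The sketch below is our own illustration (not the authors' code; it assumes numpy and scipy are available): $\mathsf{smCE}$ on a subsequence is computed exactly as a small linear program over the values $f(\alpha)$ at the distinct predictions, since on the real line 1-Lipschitz constraints between consecutive sorted values imply all pairwise Lipschitz constraints.

```python
import numpy as np
from scipy.optimize import linprog

def smce(x, p):
    """Exact smooth calibration error via a linear program."""
    x, p = np.asarray(x, float), np.asarray(p, float)
    if len(p) == 0:
        return 0.0
    alphas = np.unique(p)
    # Bias Delta_alpha at each distinct predicted value alpha.
    delta = np.array([np.sum((x - p)[p == a]) for a in alphas])
    n = len(alphas)
    if n == 1:
        return abs(delta[0])  # optimal f is the constant sign(Delta)
    # Maximize delta . f  <=>  minimize -delta . f, subject to |f_i| <= 1
    # and |f_i - f_{i+1}| <= alphas_{i+1} - alphas_i (1-Lipschitz).
    A, b = [], []
    for i in range(n - 1):
        row = np.zeros(n)
        row[i], row[i + 1] = 1.0, -1.0
        A.append(row); b.append(alphas[i + 1] - alphas[i])
        A.append(-row); b.append(alphas[i + 1] - alphas[i])
    res = linprog(-delta, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(-1.0, 1.0)] * n)
    return -res.fun

def ssce(x, p, samples=2000, seed=0):
    """Monte Carlo estimate of E_S[smCE(x|_S, p|_S)] over uniform subsets S."""
    rng = np.random.default_rng(seed)
    x, p = np.asarray(x, float), np.asarray(p, float)
    return float(np.mean([smce(x[m], p[m])
                          for m in (rng.random((samples, len(x))) < 0.5)]))
```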
Completeness and soundness. We give minimal conditions for a calibration measure to be regarded as complete (intuitively, accurate predictions have a small penalty) and sound (intuitively, inaccurate predictions have a large penalty).

Definition 2.2 (Completeness and soundness). A calibration measure $\mathsf{CM}$ is complete if: (1) for any $x \in \{0,1\}^T$, $\mathsf{CM}_T(x, x) = 0$; (2) for any $\alpha \in [0,1]$, $\mathbb{E}_{x_1, \ldots, x_T \sim \mathrm{Bernoulli}(\alpha)}\left[\mathsf{CM}_T(x, \alpha \cdot \mathbf{1}_T)\right] = o_\alpha(T)$. The calibration measure is sound if: (1) for any $x \in \{0,1\}^T$, $\mathsf{CM}_T(x, \mathbf{1}_T - x) = \Omega(T)$; (2) for any $\alpha, \beta \in [0,1]$ such that $\alpha \ne \beta$, $\mathbb{E}_{x_1, \ldots, x_T \sim \mathrm{Bernoulli}(\alpha)}\left[\mathsf{CM}_T(x, \beta \cdot \mathbf{1}_T)\right] = \Omega_{\alpha,\beta}(T)$. Here, $o_\alpha(\cdot)$ and $\Omega_{\alpha,\beta}(\cdot)$ may hide constant factors that depend on the parameters in the subscript.

Truthfulness. To define the truthfulness of a calibration measure, we introduce the truthful forecaster and the optimal error for a distribution $\mathcal{D}$.

Definition 2.3 (Truthful forecaster). With respect to a distribution $\mathcal{D} \in \Delta(\{0,1\}^T)$, the truthful forecaster is defined as $\mathcal{A}_{\mathrm{truthful}}(\mathcal{D})(b_1, b_2, \ldots, b_{t-1}) := \Pr_{x \sim \mathcal{D}}\left[x_t = 1 \mid x_{1:(t-1)} = b_{1:(t-1)}\right]$.

Arguably, $\mathcal{A}_{\mathrm{truthful}}(\mathcal{D})$ is the only forecaster that makes the right predictions on distribution $\mathcal{D}$.

Definition 2.4 (Optimal error). The optimal error on distribution $\mathcal{D} \in \Delta(\{0,1\}^T)$ with respect to calibration measure $\mathsf{CM}$ is defined as $\mathrm{OPT}_{\mathsf{CM}}(\mathcal{D}) := \inf_{\mathcal{A}} \mathrm{err}_{\mathsf{CM}}(\mathcal{D}, \mathcal{A})$, where $\mathcal{A}$ ranges over all deterministic forecasters.

Note that, by an averaging argument, the definition of $\mathrm{OPT}_{\mathsf{CM}}(\mathcal{D})$ is unchanged if we take an infimum over randomized forecasters. A calibration measure is truthful if, on every distribution, the truthful forecaster is near-optimal.

Definition 2.5 (Truthfulness of calibration measures). A calibration measure $\mathsf{CM}$ is $(\alpha, \beta)$-truthful if, for every $\mathcal{D} \in \Delta(\{0,1\}^T)$, $\mathrm{err}_{\mathsf{CM}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) \le \alpha \cdot \mathrm{OPT}_{\mathsf{CM}}(\mathcal{D}) + \beta$. Conversely, $\mathsf{CM}$ is said to have an $\alpha$-$\beta$ truthfulness gap if, for some distribution $\mathcal{D}$, $\mathrm{OPT}_{\mathsf{CM}}(\mathcal{D}) \le \alpha$ and $\mathrm{err}_{\mathsf{CM}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) \ge \beta$.

## 3 Technical Overview

In this section, we briefly discuss the main technical ideas and challenges behind the proofs of Theorems 1.1, 1.2, and 1.3. We provide more details on our main result, i.e., that the SSCE is $(O(1), 0)$-truthful, in Sections 4 through 6. Theorem 1.3 follows from a recent result of [ACRS24] on minimizing the distance from calibration in the adversarial setup, along with a new result connecting the SSCE to the distance from calibration, and is proved in Section 7. We defer the proof of Theorem 1.1 to Appendix B.

A simple distribution that witnesses truthfulness gaps. Inspired by [QV21, Example 2], we consider the distribution $\mathcal{D}$ specified as follows: the time horizon is divided into $T/3$ blocks of length 3, each consisting of a uniformly random bit, followed by a zero and a one. Within each block, the truthful forecaster predicts $1/2$, $0$, and $1$ in order. Then, among the steps on which $1/2$ is predicted, the frequency of ones is typically $1/2 \pm \Theta(1/\sqrt{T})$. This deviation results in a $\Theta(\sqrt{T})$ penalty under most calibration measures (concretely, all calibration measures in the first two rows of Table 1). However, there is a different strategy that ensures perfect calibration, and thus a zero penalty under most calibration measures: within each block, the forecaster predicts $1/2$ on the first step. If the bit turns out to be $1$, the forecaster maintains perfect calibration by predicting $1/2$ on the second step, on which the outcome is known to be $0$; otherwise, the forecaster accomplishes the same by predicting $1/2$ on the third step. Therefore, the distribution $\mathcal{D}$ witnesses a $0$-$\Omega(\sqrt{T})$ truthfulness gap for every calibration measure in the first two rows of Table 1.

The importance of subsampling in the SSCE becomes apparent in light of the example above. On distribution $\mathcal{D}$, the truthful forecaster has to pay a $\Theta(\sqrt{T})$ cost for the mild deviation from the expectation, while a strategic forecaster avoids this deviation by correlating the predictions with the biases in the past. With subsampling, however, the forecaster is no longer sure about the biases that factor into the penalty. This ensures that, compared to truth-telling, the benefit of predicting strategically is marginal, which makes the truthfulness guarantee in Theorem 1.2 possible.
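The gap-witnessing distribution and both forecasters are easy to simulate; the following sketch (ours, not the authors' code) tracks the bias $\Delta_{1/2}$ accumulated by each strategy:

```python
# Simulation of the block distribution above: each block is a uniform bit
# followed by a deterministic 0 and a deterministic 1.
import numpy as np

rng = np.random.default_rng(1)
T = 30_000  # must be divisible by 3
bits = rng.integers(0, 2, size=T // 3)

# Truthful forecaster: predicts (1/2, 0, 1) in every block. Its only nonzero
# bias is Delta_{1/2} = sum(bits) - (T/3)/2, typically of order sqrt(T).
delta_half_truthful = bits.sum() - (T // 3) / 2

# Strategic forecaster: predicts 1/2 on the random bit; if the bit is 1, it
# predicts 1/2 again on the step known to be 0, otherwise it predicts 1/2 on
# the step known to be 1. Each block thus contributes
# (bit - 1/2) + ((1 - bit) - 1/2) = 0 to Delta_{1/2}, and the remaining
# deterministic step is predicted perfectly, so every bias is exactly zero.
delta_half_strategic = ((bits - 0.5) + ((1 - bits) - 0.5)).sum()

print(f"truthful  |Delta_1/2| ~ {abs(delta_half_truthful):.0f} "
      f"(sqrt(T) ~ {T ** 0.5:.0f})")
print(f"strategic |Delta_1/2| = {abs(delta_half_strategic):.0f}")
```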
Establishing truthfulness via martingale inequalities. We prove that the SSCE is $(O(1), 0)$-truthful in three steps: (1) define a complexity measure $\sigma(\mathcal{D})$ of the distribution $\mathcal{D}$; (2) show that $\mathrm{err}_{\mathsf{SSCE}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) = O(\sigma(\mathcal{D}))$; (3) show that $\mathrm{OPT}_{\mathsf{SSCE}}(\mathcal{D}) = \Omega(\sigma(\mathcal{D}))$. As we elaborate in Section 5, the crux of Step (2) is to control the expected deviation of a martingale $(M_t)_{0 \le t \le T}$ with respect to a filtration $(\mathcal{F}_t)$ by its realized variance $\mathrm{Var}_t := \sum_{s=1}^{t} \operatorname{Var}[M_s \mid \mathcal{F}_{s-1}]$, which is highly non-trivial as the two processes $(M_t)$ and $(\mathrm{Var}_t)$ are correlated. In more detail, the filtration $(\mathcal{F}_t)$ corresponds to the randomness in $x \sim \mathcal{D}$, while $(M_t)$ tracks the biases in the predictions (on a subset of the time horizon) tested by a Lipschitz function. We note that such a bound would easily follow from off-the-shelf concentration inequalities for martingales (e.g., Freedman's inequality [Fre75]) if the total realized variance $\mathrm{Var}_T$ were uniformly bounded. However, in general, $\mathrm{Var}_T$ may vary drastically, and directly applying these concentration inequalities would introduce an extra super-constant factor. Our workaround is a doubling trick that divides the time horizon into epochs, within which the realized variances grow exponentially. We then apply Freedman's inequality to each epoch separately. In Section 5, we formulate a toy random walk problem that highlights this challenge and demonstrates our solution to it, which is of independent interest.

Similarly, as we show in Section 6, the crux of Step (3) is to establish another martingale inequality. We first show that for fixed $x$ and $p$, we have $\mathsf{SSCE}(x, p) = \Omega(\sqrt{N_T})$, where $N_t := \sum_{s=1}^{t} \mathbb{1}[|x_s - p_s| \ge 1/2]$. Furthermore, over the randomness in $x \sim \mathcal{D}$, the realized variance process $(\mathrm{Var}_t)$ defined above is shown to lower bound $(N_t)$, i.e., $(N_t - \mathrm{Var}_t)$ is a sub-martingale. However, the desired result requires the lower bound $\mathbb{E}[\sqrt{N_T}] \ge \Omega(1) \cdot \mathbb{E}[\sqrt{\mathrm{Var}_T}]$, which does not follow from $\mathbb{E}[N_T - \mathrm{Var}_T] \ge 0$ in general. This challenge necessitates a more careful analysis tailored to the specific properties of the processes $(N_t)$ and $(\mathrm{Var}_t)$.

Deterministic forecasting strategy via reduction to smCE. We build on the result of [ACRS24] showing the existence of a deterministic forecasting strategy guaranteeing an $O(\sqrt{T})$ bound on $\mathsf{smCE}$. In particular, we show via a standard chaining argument that the SSCE is upper bounded by $\mathsf{smCE}$ plus a variance term that can be upper bounded by $O(\sqrt{T})$. The result of [ACRS24] then implies a deterministic forecasting algorithm achieving an $O(\sqrt{T})$ SSCE.

## 4 Warmup: The Product Distribution Case

As a warmup, in this section we show that the SSCE is $(O(1), O(\log T))$-truthful for product distributions. This is a weaker version of Theorem 1.2 in terms of both the truthfulness parameters of the SSCE and the restriction to product distributions. In Sections 5 and 6, we outline how we remove these restrictions and improve the analysis of truthfulness. For a distribution $\mathcal{D} = \prod_{t=1}^{T} \mathrm{Bernoulli}(p_t^*)$, take $\sigma^2 := \operatorname{Var}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T} x_t\right] = \sum_{t=1}^{T} p_t^*(1 - p_t^*)$ as a complexity measure of the distribution of outcomes. We will show that $\mathrm{err}_{\mathsf{SSCE}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) = O(\sigma + \log T)$ and $\mathrm{OPT}_{\mathsf{SSCE}}(\mathcal{D}) = \Omega(\sigma) - O(1)$.

### 4.1 Upper Bound the SSCE of the Truthful Forecaster

We first show that the truthful forecaster for $\mathcal{D}$, which predicts $p_t = p_t^*$ at every step $t$, gives $\mathbb{E}_{x \sim \mathcal{D}}[\mathsf{SSCE}(x, p^*)] = O(\sigma + \log T)$. For this purpose, it suffices to prove

$$\mathbb{E}_{x \sim \mathcal{D}}[\mathsf{smCE}(x, p^*)] = O(\sigma + \log T), \quad (1)$$

since for each fixed $S \subseteq [T]$, applying (1) to $x|_S$ and $p^*|_S$ gives $\mathbb{E}_{x \sim \mathcal{D}}[\mathsf{smCE}(x|_S, p^*|_S)] \le O(\sigma + \log T)$, and taking an expectation over $S \sim \mathrm{Unif}(2^{[T]})$ gives the desired bound on the SSCE.

Recall that $\mathbb{E}[\mathsf{smCE}(x, p^*)] = \mathbb{E}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} f(p_t^*)(x_t - p_t^*)\right]$. If we replaced $\mathcal{F}$ with the family of constant functions from $[0,1]$ to $[-1,1]$, the right-hand side would reduce to

$$\mathbb{E}_{x \sim \mathcal{D}}\left[\left|\sum_{t=1}^{T} (x_t - p_t^*)\right|\right] \le \sqrt{\mathbb{E}_{x \sim \mathcal{D}}\left[\left(\sum_{t=1}^{T} (x_t - p_t^*)\right)^2\right]} = \sqrt{\operatorname{Var}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T} x_t\right]} = \sigma.$$

Therefore, to prove the upper bound in (1), we need to show that the family of one-dimensional Lipschitz functions is not significantly richer than the constant functions.
At a high level, this is done by taking finite coverings of the Lipschitz functions and using Dudley's chaining technique [Dud87] to upper bound the value of this stochastic process. In more detail, let $\mathcal{F}_\delta$ be the smallest $\delta$-covering of $\mathcal{F}$ in the uniform norm, i.e., for each $f \in \mathcal{F}$, there exists $f_\delta \in \mathcal{F}_\delta$ such that $\|f - f_\delta\|_\infty \le \delta$. It is well known that $|\mathcal{F}_\delta| = e^{O(1/\delta)}$, and a chaining argument gives

$$\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} f(p_t^*)(x_t - p_t^*) \le O(1) + \sum_{k=0}^{O(\log T)} \max_{g \in \mathcal{G}_{2^{-k}}} \sum_{t=1}^{T} g(p_t^*)(x_t - p_t^*), \quad (2)$$

where $\mathcal{G}_\delta := \{f_\delta - f_{\delta/2} : f_\delta \in \mathcal{F}_\delta, f_{\delta/2} \in \mathcal{F}_{\delta/2}, \|f_\delta - f_{\delta/2}\|_\infty \le 3\delta/2\}$. It remains to bound the second term of (2). Note that for a fixed $g$, by the independence of the $x_t$'s, the terms $g(p_t^*)(x_t - p_t^*)$ are independent across $t \in [T]$. Therefore, we can control the tail probability of $\sum_{t=1}^{T} g(p_t^*)(x_t - p_t^*)$ by Bernstein inequalities. For each fixed $\delta$, using a Bernstein tail bound, taking a union bound over $g \in \mathcal{G}_\delta$, and noting that $|\mathcal{G}_\delta| \le |\mathcal{F}_\delta| \cdot |\mathcal{F}_{\delta/2}| = e^{O(1/\delta)}$ and $\|g\|_\infty = O(\delta)$, we have

$$\mathbb{E}\left[\max_{g \in \mathcal{G}_\delta} \sum_{t=1}^{T} g(p_t^*)(x_t - p_t^*)\right] \le O\left(\sqrt{\delta^2 \sigma^2 \log|\mathcal{G}_\delta|} + \delta \log|\mathcal{G}_\delta|\right) = O(\sigma \sqrt{\delta} + 1).$$

Plugging this into (2) and summing over $\delta = 2^{-k}$ proves (1), and thus the desired bound $\mathbb{E}_{x \sim \mathcal{D}}[\mathsf{SSCE}(x, p^*)] = O(\sigma + \log T)$.

### 4.2 Lower Bound the Optimal SSCE

Next, we lower bound $\mathrm{OPT}_{\mathsf{SSCE}}(\mathcal{D})$ by showing that every forecasting strategy must incur an $\Omega(\sigma)$ SSCE on $\mathcal{D}$. Recall that $\mathsf{SSCE}(x, p)$ is given by

$$\mathbb{E}_{y \sim \mathrm{Unif}(\{0,1\}^T)}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} y_t \, f(p_t)(x_t - p_t)\right] \ge \mathbb{E}_{y \sim \mathrm{Unif}(\{0,1\}^T)}\left[\left|\sum_{t=1}^{T} y_t (x_t - p_t)\right|\right],$$

where we use the fact that $\mathcal{F}$ contains the constant functions $1$ and $-1$. Fix $x \in \{0,1\}^T$ and $p \in [0,1]^T$, and let $N := \sum_{t=1}^{T} \mathbb{1}[|x_t - p_t| \ge 1/2]$. Over the randomness in $y \sim \mathrm{Unif}(\{0,1\}^T)$, the quantity $\sum_{t=1}^{T} y_t (x_t - p_t)$ is, by the central limit theorem, approximately distributed as a normal distribution with variance $\sum_{t=1}^{T} \frac{1}{4}(x_t - p_t)^2 \ge \sum_{t=1}^{T} \frac{1}{16}\mathbb{1}[|x_t - p_t| \ge 1/2] = \Omega(N)$, so its expected absolute value is $\Omega(\sqrt{N})$.

Now it remains to lower bound the expectation of $N$ induced by an arbitrary forecaster. Conditioning on $x_{1:(t-1)}$, $x_t$ always follows $\mathrm{Bernoulli}(p_t^*)$. Thus, regardless of the choice of $p_t \in [0,1]$, the condition $|x_t - p_t| \ge 1/2$ holds with probability at least $\min\{p_t^*, 1 - p_t^*\} \ge p_t^*(1 - p_t^*)$. Then, over the $T$ steps, we expect that $N \ge \Omega(\sum_{t=1}^{T} p_t^*(1 - p_t^*)) = \Omega(\sigma^2)$ holds with probability $\Omega(1)$, as long as $\sigma = \Omega(1)$. This gives the desired lower bound $\mathbb{E}[\mathsf{SSCE}(x, p)] \ge \Omega(\mathbb{E}[\sqrt{N}]) = \Omega(\sigma) - O(1)$.
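The variance computation behind this lower bound is easy to check numerically. The sketch below (ours, assuming numpy) estimates $\mathbb{E}_y\left[\left|\sum_t y_t (x_t - p_t)\right|\right]$ for a fixed pair $(x, p)$ and compares it to the $\sqrt{N}$ scale:

```python
# For fixed x and p, the subsampled bias sum_t y_t (x_t - p_t) over a uniform
# y has standard deviation Omega(sqrt(N)), where N counts the steps with
# |x_t - p_t| >= 1/2.
import numpy as np

rng = np.random.default_rng(2)
T = 10_000
x = rng.integers(0, 2, size=T).astype(float)
p = np.full(T, 0.5)                      # every step has |x_t - p_t| = 1/2
N = int(np.sum(np.abs(x - p) >= 0.5))    # here N = T

samples = [abs(np.sum((rng.random(T) < 0.5) * (x - p))) for _ in range(2000)]
print(f"E_y |sum_t y_t (x_t - p_t)| ~ {np.mean(samples):.1f}")
print(f"comparable scale sqrt(N)/4  = {np.sqrt(N) / 4:.1f}")
```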
## 5 Upper Bound the SSCE of the Truthful Forecaster

To extend the proof strategy sketched in Section 4 to non-product distributions, the first challenge is to define an appropriate complexity measure of a general distribution $\mathcal{D}$. Consider the stochastic process $(\mathrm{Var}_t)_{0 \le t \le T}$ defined as $\mathrm{Var}_t := \sum_{s=1}^{t} p_s^*(1 - p_s^*)$, where $x \sim \mathcal{D}$ and $p_t^* := \mathbb{E}_{x' \sim \mathcal{D}}\left[x'_t \mid x'_{1:(t-1)} = x_{1:(t-1)}\right]$ is now a random variable that denotes the conditional expectation of $x_t$ after observing $x_{1:(t-1)}$. The right definition turns out to be roughly $\sigma(\mathcal{D}) := \mathbb{E}[\sqrt{\mathrm{Var}_T}]$. In this section, we prove the following weaker upper bound on the SSCE incurred by the truthful forecaster. We provide a stronger bound (Theorem C.1) in Appendix C.

Theorem 5.1. For any $\mathcal{D} \in \Delta(\{0,1\}^T)$, $\mathrm{err}_{\mathsf{SSCE}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) = O(\mathbb{E}[\sqrt{\mathrm{Var}_T}] + \log^2 T)$.

Proof sketch. We begin by repeating the chaining argument from Section 4. Recall that, for any $\delta > 0$, there is a $\delta$-covering $\mathcal{F}_\delta$ of $\mathcal{F}$ in the $\infty$-norm that has size $e^{O(1/\delta)}$. Letting $\pi_\delta(f)$ denote the mapping of a function $f$ onto the covering $\mathcal{F}_\delta$ such that $\|f - \pi_\delta(f)\|_\infty \le \delta$, we can write, for any $M \in \mathbb{Z}^+$:

$$\mathsf{SSCE}(x, p) \le 2^{-M} T + \mathbb{E}_{y \sim \mathrm{Unif}(\{0,1\}^T)}\left[\sum_{k=0}^{M} \underbrace{\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} y_t \left(\pi_{2^{-k}}(f)(p_t) - \pi_{2^{1-k}}(f)(p_t)\right)(x_t - p_t)}_{=: W_k}\right].$$

To control the expectation of each $W_k$, we note that the set $\mathcal{G}_k := \{\pi_{2^{-k}}(f) - \pi_{2^{1-k}}(f) : f \in \mathcal{F}\}$ has size at most $|\mathcal{F}_{2^{-k}}| \cdot |\mathcal{F}_{2^{1-k}}|$. Furthermore, every function $g \in \mathcal{G}_k$ satisfies $\|g\|_\infty = \|\pi_{2^{-k}}(f) - \pi_{2^{1-k}}(f)\|_\infty \le \|\pi_{2^{-k}}(f) - f\|_\infty + \|f - \pi_{2^{1-k}}(f)\|_\infty = O(2^{-k})$ for some $f \in \mathcal{F}$. We apply the following technical lemma, which we prove in Appendix C.

Lemma 5.2. Given a function $f : [0,1] \to [-1,1]$ and $y \in \{0,1\}^T$, consider the martingale $M_t(f, y) := \sum_{s=1}^{t} y_s \, f(p_s^*)(x_s - p_s^*)$, where $x \sim \mathcal{D}$. Then, for any finite family $\mathcal{G}$ of functions from $[0,1]$ to $[-1,1]$ and any $y \in \{0,1\}^T$, we have

$$\mathbb{E}\left[\max_{f \in \mathcal{G}} M_T(f, y)\right] \le O\left(\log|\mathcal{G}| \cdot \log T + \sqrt{\log|\mathcal{G}|} \cdot \mathbb{E}_{x \sim \mathcal{D}}\left[\sqrt{\mathrm{Var}_T}\right]\right).$$

Applying Lemma 5.2 to each $\mathcal{G}_k$ scaled up by a $\Theta(2^k)$ factor, and noting that $\log|\mathcal{G}_k| \le \log|\mathcal{F}_{2^{-k}}| + \log|\mathcal{F}_{2^{1-k}}| = O(2^k)$, gives

$$\mathrm{err}_{\mathsf{SSCE}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) \le 2^{-M} T + \sum_{k=0}^{M} O(2^{-k}) \cdot O\left(2^k \log T + 2^{k/2} \, \mathbb{E}_{x \sim \mathcal{D}}\left[\sqrt{\mathrm{Var}_T}\right]\right) = 2^{-M} T + \sum_{k=0}^{M} O\left(\log T + 2^{-k/2} \, \mathbb{E}_{x \sim \mathcal{D}}\left[\sqrt{\mathrm{Var}_T}\right]\right) \le 2^{-M} T + O\left(M \log T + \mathbb{E}_{x \sim \mathcal{D}}\left[\sqrt{\mathrm{Var}_T}\right]\right).$$

Choosing $M = \Theta(\log T)$ proves the theorem.

We remark that the proof of Lemma 5.2 is highly non-trivial. As mentioned in Section 3, such an upper bound would follow from Freedman's inequality if $\mathrm{Var}_T$ were always bounded by $O\big((\mathbb{E}[\sqrt{\mathrm{Var}_T}])^2\big)$. However, in general, applying Freedman's inequality to each $M_T(f, y)$ necessarily requires an additional union bound over the possible values of $\mathrm{Var}_T$, which introduces a super-constant multiplicative factor. The challenge in dealing with the randomness in $\mathrm{Var}_T$ is captured by the following toy problem:

Random walk with early stopping: Let $(X_t)_{0 \le t \le T}$ be the random walk such that $X_0 = 0$ and each $X_t - X_{t-1}$ independently follows $\mathrm{Unif}(\{-1, +1\})$. Let $\tau$ be an arbitrary stopping time with respect to $(X_t)$. Prove that $\mathbb{E}[|X_\tau|] \le O(1) \cdot \mathbb{E}[\sqrt{\tau}]$.

Indeed, the above corresponds to a special case of Lemma 5.2 in which: (1) the sequence $p^*$ starts with entries equal to $1/2$, and may switch to entries equal to $0$ at any point, depending on the realization of the $x_t$'s; (2) the family $\mathcal{G}$ consists of the two constant functions $1$ and $-1$. One might be tempted to prove $\mathbb{E}[|X_\tau|] \le O(1) \cdot \mathbb{E}[\sqrt{\tau}]$ by first proving $\mathbb{E}[|X_\tau| \mid \tau = t] = O(\sqrt{t})$ for all $t \in [T]$ and then applying the law of total expectation. Such an approach is doomed to fail, because the stopping time $\tau$ might significantly bias the conditional expectation of $|X_\tau|$ on some event $\tau = t_0$, e.g., by stopping at time $t_0$ only if $|X_{t_0}| \ge \sqrt{t_0}$.

Our workaround is inspired by the standard doubling trick in online learning. We break the time horizon into epochs of geometrically increasing lengths: the $k$-th epoch contains $2^k$ steps. We break $|X_\tau|$ into the displacements accumulated in different epochs; their sum clearly upper bounds $|X_\tau|$. Furthermore, we can show that, conditioning on reaching epoch $k$, the displacement within the epoch is $O(\sqrt{2^k})$ in expectation. This allows us to establish the desired inequality via

$$\mathbb{E}[|X_\tau|] \le O(1) \cdot \sum_{k=1}^{O(\log T)} \Pr[\tau \text{ reaches epoch } k] \cdot \sqrt{2^k} \le O(1) \cdot \mathbb{E}[\sqrt{\tau}].$$
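The following simulation (ours) instantiates the toy problem with a stopping rule that deliberately biases $|X_\tau|$ upward, and confirms that $\mathbb{E}[|X_\tau|]$ stays within a constant factor of $\mathbb{E}[\sqrt{\tau}]$, as the doubling argument predicts:

```python
# Random walk with an adversarial early stop: tau = first t (after a short
# warmup) with |X_t| >= sqrt(t), else T. Despite the bias, E|X_tau| remains
# O(1) * E[sqrt(tau)].
import numpy as np

rng = np.random.default_rng(3)
T, runs = 4096, 4000
abs_x, sqrt_tau = [], []
for _ in range(runs):
    X = np.cumsum(rng.choice([-1, 1], size=T))
    t = np.arange(1, T + 1)
    hits = np.nonzero((np.abs(X) >= np.sqrt(t)) & (t >= 16))[0]
    tau = hits[0] + 1 if len(hits) else T
    abs_x.append(abs(X[tau - 1]))
    sqrt_tau.append(np.sqrt(tau))
print(f"E|X_tau|     ~ {np.mean(abs_x):.2f}")
print(f"E[sqrt(tau)] ~ {np.mean(sqrt_tau):.2f}")  # the ratio stays O(1)
```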
To prove Lemma 5.2, we extend this technique to a general martingale $M_T(f, y)$ by dividing the time horizon into epochs according to the doubling of $\mathrm{Var}_t$, and then applying Freedman's inequality to each epoch.

Towards a stronger upper bound. In our actual proof, we use a slightly different complexity measure $\sigma_\gamma(\mathcal{D}) := \mathbb{E}[\gamma(\mathrm{Var}_T)]$, where $\gamma(x) = x$ if $x < 1$ and $\gamma(x) = \sqrt{x}$ otherwise. Roughly speaking, this definition accounts for the fact that a sum of independent Bernoulli random variables behaves quite differently when its mean is close to 0. To remove the extra $\log^2 T$ term in Theorem 5.1, our actual proof also uses a variant of Lemma 5.2, Lemma C.9, which involves a more careful application of Freedman's inequality tailored to specific coverings of Lipschitz functions.

## 6 Lower Bound the Optimal SSCE

In this section, we outline a weaker lower bound on the optimal SSCE achievable on a distribution.

Theorem 6.1. For any $\mathcal{D} \in \Delta(\{0,1\}^T)$, $\mathrm{OPT}_{\mathsf{SSCE}}(\mathcal{D}) = \Omega(\mathbb{E}[\sqrt{\mathrm{Var}_T}]) - O(1)$.

Similar to the product distribution case (Section 4), the key quantity in the proof is the stochastic process $(N_t)_{0 \le t \le T}$ defined via $N_t := \sum_{s=1}^{t} n_s$ and $n_t := \mathbb{1}[|x_t - p_t| \ge 1/2]$. This is formalized by the following lemma, which applies to any realization of $x$, $p$, and $N_T = \sum_{t=1}^{T} \mathbb{1}[|x_t - p_t| \ge 1/2]$:

Lemma 6.2. For any $x \in \{0,1\}^T$ and $p \in [0,1]^T$, we have $\mathsf{SSCE}(x, p) \ge \Omega(\sqrt{N_T})$.

It remains to lower bound the quantity $\mathbb{E}[\sqrt{N_T}]$ induced by an arbitrary forecaster. As argued earlier, conditioning on $x_{1:(t-1)}$, we always have $\Pr[n_t = 1] \ge p_t^*(1 - p_t^*) = \mathrm{Var}_t - \mathrm{Var}_{t-1}$, where $p_t^*$ and $\mathrm{Var}_t$ are defined as in Section 5. Thus, $(N_t - \mathrm{Var}_t)$ is a sub-martingale, which implies $\mathbb{E}[N_T] \ge \mathbb{E}[\mathrm{Var}_T]$. However, this does not imply that $\mathbb{E}[\sqrt{N_T}] \ge \Omega(\mathbb{E}[\sqrt{\mathrm{Var}_T}])$. In fact, such an inequality does not hold in general: when $p_1^* = \varepsilon \ll 1$ and $p_t^* = 0$ for all $t \ge 2$, $\mathbb{E}[\sqrt{N_T}]$ could be $O(\varepsilon)$, yet $\mathbb{E}[\sqrt{\mathrm{Var}_T}] = \Omega(\sqrt{\varepsilon}) \gg O(\varepsilon)$. The following technical lemma circumvents this counterexample by subtracting a constant term from the right-hand side:

Lemma 6.3. The stochastic process $(N_t)_{t \in [T]}$ satisfies $\mathbb{E}[\sqrt{N_T}] \ge \Omega(\mathbb{E}[\sqrt{\mathrm{Var}_T}]) - O(1)$.

Note that Theorem 6.1 directly follows from Lemmas 6.2 and 6.3, which we prove in Appendix D.1. To avoid the extra $O(1)$ term in the lower bound, our actual proof (deferred to Appendix D.3) works with the slightly different complexity measure $\sigma_\gamma(\mathcal{D}) := \mathbb{E}[\gamma(\mathrm{Var}_T)]$ defined in Section 5.

## 7 Forecasting with $O(\sqrt{T})$ SSCE

In this section, we prove Theorem 1.3, which states the existence of a deterministic forecaster that incurs an $O(\sqrt{T})$ SSCE against all adaptive adversaries. Recall the definition of the smooth calibration error ($\mathsf{smCE}$) from Section 2. Using standard chaining arguments, we can show the following relation between the SSCE and $\mathsf{smCE}$, whose proof we defer to Appendix F.

Lemma 7.1. For any $x \in \{0,1\}^T$ and $p \in [0,1]^T$, $\mathsf{SSCE}(x, p) \le \frac{1}{2}\mathsf{smCE}(x, p) + O(\sqrt{T})$, where the $O(\cdot)$ notation hides a universal constant that does not depend on $T$, $x$, or $p$.

Theorem 1.3 follows from the lemma above and a recent result of [ACRS24].

Proof of Theorem 1.3. It was shown by [ACRS24] that there exists a deterministic forecaster with an $O(\sqrt{T})$ distance from calibration ($\mathsf{CalDist}(x, p)$) against every adaptive adversary in the adversarial sequential calibration setup. Lemma 7.1, together with the inequality $\frac{1}{2}\mathsf{smCE}(x, p) \le \mathsf{CalDist}(x, p)$ from [BGHN23, Lemma 5.4 and Theorem 7.3], implies that $\mathsf{SSCE}(x, p) \le \mathsf{CalDist}(x, p) + O(\sqrt{T})$, so the same forecaster incurs an SSCE of $O(\sqrt{T})$ as well.

## 8 Discussion

We formulate three natural desiderata for calibration measures that evaluate the quality of probabilistic forecasts: truthfulness, completeness, and soundness. They serve as minimal requirements for an error metric to be considered as measuring calibration and not to create a significant incentive for forecasters to predict untruthfully. While existing calibration measures fail to simultaneously meet all these criteria, we propose a new calibration measure, the SSCE, which is shown to be approximately truthful via a non-trivial analysis. In the following, we discuss two natural directions for future work.

Inherent trade-offs among different desiderata? As shown in Table 1, the SSCE and the error metrics induced by proper scoring rules exhibit a trade-off between truthfulness and completeness: the former is complete and approximately truthful, while the latter are perfectly truthful but not complete. Is there a calibration measure that achieves the best of both worlds? Taking a step back, while our definition of truthfulness seems natural, the completeness and soundness criteria, as defined, only serve as minimal requirements.
It still remains to explore ways to formally quantify the latter two, and to investigate the inherent quantitative trade-offs among truthfulness, completeness, and soundness.

Truthfulness against adaptive adversaries? One may wonder whether the truthfulness guarantee of the SSCE can be extended to handle adaptive adversaries as well. Assuming that the forecaster is given an adversary's (randomized) strategy for choosing $x_t$ based on $x_{1:(t-1)}$ and $p_{1:(t-1)}$, is it still approximately optimal to always predict the conditional probability? Here, "adaptive" emphasizes that $x_t$ may depend on both $x_{1:(t-1)}$ and $p_{1:(t-1)}$; the formulation in Section 2 is equivalent to letting $x_t$ depend only on $x_{1:(t-1)}$. Unfortunately, as we show in Appendix G, such a guarantee does not hold for the SSCE, and is unlikely to hold for any natural calibration measure: an adversary can force the forecaster to predict untruthfully by threatening to increase the variance of the subsequent bits. However, this adversary is highly contrived and unrealistic for practical scenarios. We may thus identify reasonable restrictions on the adaptive adversary to sidestep this counterexample.

## 9 Acknowledgements

This work is supported in part by the National Science Foundation under grant CCF-2145898 and the Graduate Research Fellowship Program under grant DGE-2146752, the Office of Naval Research under grant N00014-24-1-2159, an Alfred P. Sloan fellowship, a Schmidt Sciences AI2050 fellowship, and a Google Research Scholars award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

## References

[ACRS24] Eshwar Ram Arunachaleswaran, Natalie Collina, Aaron Roth, and Mirah Shi. An elementary predictor obtaining $2\sqrt{T}$ distance to calibration. arXiv preprint arXiv:2402.11410, 2024.

[BF+02] Henri Berestycki, Igor Florent, et al. Asymptotics and calibration of local volatility models. Quantitative Finance, 2(1):61, 2002.

[BGHN23] Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. A unifying theory of distance from calibration. In Symposium on Theory of Computing (STOC), pages 1727–1740, 2023.

[Bla23] Guy Blanc. Subsampling suffices for adaptive data analysis. In Symposium on Theory of Computing (STOC), pages 999–1012, 2023.

[BLMT22] Guy Blanc, Jane Lange, Ali Malik, and Li-Yang Tan. On the power of adaptivity in statistical adversaries. In Conference on Learning Theory (COLT), pages 5030–5061, 2022.

[Bri50] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

[CAT16] Cynthia S. Crowson, Elizabeth J. Atkinson, and Terry M. Therneau. Assessing calibration of prognostic risk scores. Statistical Methods in Medical Research, 25(4):1692–1706, 2016.

[CY21] Yiling Chen and Fang-Yi Yu. Optimal scoring rule design. arXiv preprint arXiv:2107.07420, 2021.

[Daw82] A. P. Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.

[Daw85] A. P. Dawid. Calibration-based empirical probability. The Annals of Statistics, 13(4):1251–1274, 1985.

[DF83] Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.

[Dud87] R. M. Dudley. Universal Donsker classes and metric entropy. The Annals of Probability, 15(4):1306–1326, 1987.

[FH18] Dean P. Foster and Sergiu Hart.
Smooth calibration, leaky forecasts, finite recall, and Nash dynamics. Games and Economic Behavior, 109:271–293, 2018.

[FH21] Dean P. Foster and Sergiu Hart. Forecast hedging and calibration. Journal of Political Economy, 129(12):3447–3490, 2021.

[FL99] Drew Fudenberg and David K. Levine. An easier way to calibrate. Games and Economic Behavior, 29(1-2):131–137, 1999.

[Fos99] Dean P. Foster. A proof of calibration via Blackwell's approachability theorem. Games and Economic Behavior, 29(1-2):73–78, 1999.

[Fre75] David A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.

[FRST11] Dean P. Foster, Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Complexity-based approach to calibration with checking rules. In Conference on Learning Theory (COLT), pages 293–314, 2011.

[FV97] Dean P. Foster and Rakesh V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1-2):40–55, 1997.

[FV98] Dean P. Foster and Rakesh V. Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.

[GPSW17] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), pages 1321–1330, 2017.

[Har22] Sergiu Hart. Calibrated forecasts: The minimax proof. arXiv preprint arXiv:2209.05863, 2022.

[HJKRR18] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning (ICML), pages 1939–1948, 2018.

[HJZ23] Nika Haghtalab, Michael Jordan, and Eric Zhao. A unifying perspective on multicalibration: Game dynamics for multi-objective learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 72464–72506, 2023.

[HPY23] Nika Haghtalab, Chara Podimata, and Kunhe Yang. Calibrated Stackelberg games: Learning optimal commitments against calibrated agents. In Advances in Neural Information Processing Systems (NeurIPS), pages 61645–61677, 2023.

[HSLW23] Jason D. Hartline, Liren Shan, Yingkai Li, and Yifan Wu. Optimal scoring rules for multi-dimensional effort. In Conference on Learning Theory (COLT), pages 2624–2650, 2023.

[HW24] Lunjia Hu and Yifan Wu. Predict to minimize swap regret for all payoff-bounded tasks. arXiv preprint arXiv:2404.13503, 2024.

[JOKOM12] Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.

[KF08] Sham M. Kakade and Dean P. Foster. Deterministic calibration and Nash equilibrium. Journal of Computer and System Sciences, 74(1):115–130, 2008.

[KLST23] Robert Kleinberg, Renato Paes Leme, Jon Schneider, and Yifeng Teng. U-calibration: Forecasting for an unknown agent. In Conference on Learning Theory (COLT), pages 5143–5145, 2023.

[KMR17] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In Innovations in Theoretical Computer Science (ITCS), pages 43:1–43:23, 2017.

[KSB21] Benjamin Kompa, Jasper Snoek, and Andrew L. Beam. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digital Medicine, 4(1):4, 2021.

[KSJ18] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning (ICML), pages 2805–2814, 2018.
[LHSW22] Yingkai Li, Jason D. Hartline, Liren Shan, and Yifan Wu. Optimization of scoring rules. In Economics and Computation (EC), pages 988–989, 2022.

[MW84] Allan H. Murphy and Robert L. Winkler. Probability forecasting in meteorology. Journal of the American Statistical Association, 79(387):489–500, 1984.

[NNW21] Eric Neyman, Georgy Noarov, and S. Matthew Weinberg. Binary scoring rules that incentivize precision. In Economics and Computation (EC), pages 718–733, 2021.

[NRRX23] Georgy Noarov, Ramya Ramalingam, Aaron Roth, and Stephan Xie. High-dimensional unbiased prediction for sequential decision making. In OPT 2023: Optimization for Machine Learning, 2023.

[PRW+17] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems (NIPS), pages 5680–5689, 2017.

[PW22] Maneesha Papireddygari and Bo Waggoner. Contracts with information acquisition, via scoring rules. In Economics and Computation (EC), pages 703–704, 2022.

[QV21] Mingda Qiao and Gregory Valiant. Stronger calibration lower bounds via sidestepping. In Symposium on Theory of Computing (STOC), pages 456–466, 2021.

[QZ24] Mingda Qiao and Letian Zheng. On the distance from calibration in sequential prediction. In Conference on Learning Theory (COLT), pages 4307–4357, 2024.

[RS24] Aaron Roth and Mirah Shi. Forecasting for swap regret for all downstream agents. arXiv preprint arXiv:2402.08753, 2024.

[She10] I. G. Shevtsova. An improvement of convergence rate estimates in the Lyapunov theorem. Doklady Mathematics, 82(3):862–864, 2010.

[Ste22] Thomas Steinke. Composition of differential privacy & privacy amplification by subsampling. arXiv preprint arXiv:2210.00597, 2022.

[VCV15] Ben Van Calster and Andrew J. Vickers. Calibration of risk prediction models: impact on decision-analytic performance. Medical Decision Making, 35(2):162–169, 2015.

[WM68] Robert L. Winkler and Allan H. Murphy. Good probability assessors. Journal of Applied Meteorology and Climatology, 7(5):751–758, 1968.

| Calibration Measure | Complete? | Sound? | Truthful? |
| --- | --- | --- | --- |
| Expected Calibration Error | ✓ | ✓ | $0$-$\Omega(T)$ gap |
| Maximum Swap Regret | ✓ | ✓ | $0$-$\Omega(T)$ gap |
| Smooth Calibration Error | ✓ | ✓ | $0$-$\Omega(\sqrt{T})$ gap |
| Distance from Calibration | ✓ | ✓ | $0$-$\Omega(\sqrt{T})$ gap |
| Interval Calibration Error | ✓ | ✓ | $0$-$\Omega(\sqrt{T})$ gap |
| Laplace-Kernel Calibration Error | ✓ | ✓ | $0$-$\Omega(\sqrt{T})$ gap |
| U-Calibration Error | ✓ | ✓ | $O(1)$-$\Omega(\sqrt{T})$ gap |
| Proper Scoring Rules | ✗ | ✓ | $(1, 0)$-truthful |
| $\mathsf{smCE} + \sqrt{T}$ | ✗ | ✓ | $(O(1), 0)$-truthful |
| Subsampled Smooth Calibration Error | ✓ | ✓ | $(O(1), 0)$-truthful |

Table 2: Evaluation of previous calibration measures along with the SSCE, in terms of completeness, soundness, and truthfulness (Definitions 2.2 and 2.5). Every calibration measure, except the SSCE, either lacks completeness or has a significant truthfulness gap.

## A Taxonomy of Existing Calibration Measures

In this section, we prove that the existing calibration measures in Table 2 either have a large truthfulness gap or lack completeness. In these proofs, we frequently use the biases induced by specific outcomes and predictions: with respect to outcomes $x \in \{0,1\}^T$ and predictions $p \in [0,1]^T$, the bias associated with value $\alpha \in [0,1]$ is defined as

$$\Delta_\alpha := \sum_{t=1}^{T} (x_t - p_t) \cdot \mathbb{1}[p_t = \alpha].$$

### A.1 Existing Calibration Measures

The expected calibration error. A common calibration measure is the sum of $\ell_1$ errors of the level sets, known as the $\ell_1$ calibration error or the Expected Calibration Error (ECE): on $x \in \{0,1\}^T$ and $p \in [0,1]^T$, the expected calibration error is defined as

$$\mathsf{ECE}(x, p) := \sum_{\alpha \in [0,1]} \left|\sum_{t=1}^{T} (x_t - p_t) \cdot \mathbb{1}[p_t = \alpha]\right| = \sum_{\alpha \in [0,1]} |\Delta_\alpha|.$$

Note that the summand $|\Delta_\alpha|$ is non-zero only if $\alpha \in \{p_1, p_2, \ldots, p_T\}$, so the summations above are essentially finite and well-defined.
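As a concrete illustration of these definitions (our own toy instance): for $T = 4$, predictions $p = (1/2, 1/2, 1/2, 1)$ and outcomes $x = (1, 1, 0, 1)$, the only nonzero bias is

$$\Delta_{1/2} = (1 - 1/2) + (1 - 1/2) + (0 - 1/2) = 1/2, \qquad \Delta_1 = 1 - 1 = 0,$$

so $\mathsf{ECE}(x, p) = |\Delta_{1/2}| + |\Delta_1| = 1/2$.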
The smooth calibration error. The smooth calibration error [KF08] is defined as

$$\mathsf{smCE}(x, p) := \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} f(p_t)(x_t - p_t) = \sup_{f \in \mathcal{F}} \sum_{\alpha \in [0,1]} f(\alpha) \cdot \Delta_\alpha,$$

where $\mathcal{F}$ is the family of 1-Lipschitz functions from $[0,1]$ to $[-1,1]$. Again, since $\Delta_\alpha \ne 0$ holds only if $\alpha \in \{p_1, p_2, \ldots, p_T\}$, the summation above is finite and well-defined.

The distance from calibration. The distance from calibration, introduced by [BGHN23] and extended to the sequential setup by [QZ24], is defined as

$$\mathsf{CalDist}(x, p) := \min_{q \in \mathcal{C}(x)} \|p - q\|_1, \quad \text{where} \quad \mathcal{C}(x) := \left\{q \in [0,1]^T : \forall \alpha \in [0,1],\ \sum_{t=1}^{T} (x_t - q_t) \cdot \mathbb{1}[q_t = \alpha] = 0\right\}$$

is the set of predictions that are perfectly calibrated for $x$.

Interval calibration. The interval calibration error of [BGHN23] relaxes the ECE to a binned version while penalizing the use of long intervals. Formally, an interval partition $\mathcal{I}$ is a finite collection of intervals $\{I_1, I_2, \ldots, I_{|\mathcal{I}|}\}$ that form a partition of $[0,1]$. The interval calibration error is defined as

$$\mathsf{intCE}(x, p) := \inf_{\mathcal{I}} \left[\sum_{i=1}^{|\mathcal{I}|} \left|\sum_{t=1}^{T} (x_t - p_t) \cdot \mathbb{1}[p_t \in I_i]\right| + \sum_{t=1}^{T} \sum_{i=1}^{|\mathcal{I}|} \mathrm{len}(I_i) \cdot \mathbb{1}[p_t \in I_i]\right],$$

where the infimum is over all interval partitions $\mathcal{I}$, and $\mathrm{len}(I)$ denotes the length of interval $I$. Note that the first summation inside the infimum is analogous to the ECE, except that the biases associated with all values within the same interval are added together. The second summation gives the total length of the intervals into which the $T$ predictions fall.

Laplace-kernel calibration. The Laplace-kernel calibration error [BGHN23] is a special case of the maximum mean calibration error introduced by [KSJ18]. It can be viewed as a variant of the smooth calibration error, in which the family $\mathcal{F}$ of Lipschitz functions is replaced by $\tilde{\mathcal{F}} := \{f : \mathbb{R} \to \mathbb{R} : \|f\|_2^2 + \|f'\|_2^2 \le 1\}$, where $\|\cdot\|_2$ denotes the $\ell_2$ norm of functions and $f'$ is the derivative of $f$. Namely,

$$\mathsf{kCE}_{\mathrm{Lap}}(x, p) := \sup_{f \in \tilde{\mathcal{F}}} \sum_{t=1}^{T} f(p_t)(x_t - p_t).$$

U-calibration. The definition of the U-calibration error [KLST23] is based on proper scoring rules. A (bounded) scoring rule is a function $S : \{0,1\} \times [0,1] \to [-1,1]$. A scoring rule is proper if, for every $\alpha \in [0,1]$,

$$\alpha \in \operatorname*{arg\,min}_{\beta \in [0,1]} \mathbb{E}_{x \sim \mathrm{Bernoulli}(\alpha)}[S(x, \beta)].$$

In other words, when the outcome $x$ is drawn from $\mathrm{Bernoulli}(\alpha)$, predicting the true parameter $\alpha$ minimizes the expected loss. The U-calibration error is then defined as

$$\mathsf{UCal}(x, p) := \sup_{S} \left[\sum_{t=1}^{T} S(x_t, p_t) - \inf_{\alpha \in [0,1]} \sum_{t=1}^{T} S(x_t, \alpha)\right],$$

where the supremum is over all proper scoring rules. Note that for each fixed $S$, the expression inside the supremum is exactly the external regret of the forecaster, i.e., the excess loss compared to the best fixed prediction in hindsight.

Maximum swap regret. A recent line of work [NRRX23, RS24, HW24] considers a strengthening of U-calibration in which the external regret is replaced with the swap regret. In particular, [HW24] showed that the resulting calibration measure, termed the Maximum Swap Regret (MSR), is polynomially related to the ECE after scaling by a factor of $1/T$:

$$\left(\frac{\mathsf{ECE}(x, p)}{T}\right)^2 \le \frac{\mathsf{MSR}(x, p)}{T} \le \frac{2\,\mathsf{ECE}(x, p)}{T}.$$
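Among the measures above, the interval calibration error has the least standard form. The sketch below (ours, assuming numpy) evaluates the bracketed objective for one fixed partition; $\mathsf{intCE}$ itself is the infimum over all partitions:

```python
import numpy as np

def int_ce_objective(x, p, intervals):
    """Binned-bias term plus interval-length term for one fixed partition.
    intervals: list of (lo, hi, lo_closed, hi_closed) covering [0, 1]."""
    x, p = np.asarray(x, float), np.asarray(p, float)
    total = 0.0
    for lo, hi, lo_closed, hi_closed in intervals:
        left = (p >= lo) if lo_closed else (p > lo)
        right = (p <= hi) if hi_closed else (p < hi)
        inside = left & right
        total += abs(np.sum((x - p)[inside]))          # binned bias term
        total += (hi - lo) * np.count_nonzero(inside)  # length penalty term
    return total

# The partition {0}, (0, 1/2), {1/2}, (1/2, 1), {1} used in Appendix A.3:
partition = [(0.0, 0.0, True, True), (0.0, 0.5, False, False),
             (0.5, 0.5, True, True), (0.5, 1.0, False, False),
             (1.0, 1.0, True, True)]
# Perfectly calibrated predictions: all biases vanish and every prediction
# lands in a length-zero singleton, so the objective is 0.
print(int_ce_objective([1, 0, 0, 1], [0.5, 0.5, 0.0, 1.0], partition))  # 0.0
```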
### A.2 0-Ω(T) Truthfulness Gaps

We first prove the $0$-$\Omega(T)$ truthfulness gaps of the ECE and the MSR.

Proposition A.1. Both the expected calibration error and the maximum swap regret have a $0$-$\Omega(T)$ truthfulness gap.

To establish Proposition A.1, we follow an argument similar to the one in Section 3: we divide the time horizon into $T/3$ triples, each containing a random bit followed by a zero and a one. The truthful forecaster would predict the true probabilities for the $T/3$ random bits, which are designed to be close to $1/2$ but distinct. This leads to a linear ECE. On the other hand, a strategic forecaster may always predict $1/2$ on the random bit. Then, based on the realization of the random bit, they use the subsequent deterministic bits to offset the bias. The resulting predictions are perfectly calibrated, and thus have a zero ECE. Finally, the relation between the ECE and the MSR gives the same truthfulness gap for the MSR.

Proof of Proposition A.1. Consider the distribution $\mathcal{D}$ defined as follows: let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_{T/3}$ be distinct values in $[-1/4, 1/4]$ chosen arbitrarily. For each $k \in [T/3]$, set $(p^*_{3k-2}, p^*_{3k-1}, p^*_{3k}) = (1/2 + \varepsilon_k, 0, 1)$, and let $\mathcal{D}$ be the product distribution $\prod_{t=1}^{T} \mathrm{Bernoulli}(p^*_t)$. By definition, the predictions made by the truthful forecaster are exactly given by $p^*$. Then, for each $k \in [T/3]$ and $\alpha = 1/2 + \varepsilon_k \in [1/4, 3/4]$, we have $|\Delta_\alpha| = |x_{3k-2} - \alpha| \ge 1/4$. This shows $\mathrm{err}_{\mathsf{ECE}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) \ge (T/3) \cdot (1/4) = \Omega(T)$. By the inequality $\frac{\mathsf{MSR}(x,p)}{T} \ge \left(\frac{\mathsf{ECE}(x,p)}{T}\right)^2$, we also have $\mathrm{err}_{\mathsf{MSR}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) = \Omega(T)$.

On the other hand, consider the following alternative forecaster for $\mathcal{D}$: for each $k \in [T/3]$, predict $p_{3k-2} = 1/2$. If $x_{3k-2} = 0$, predict $p_{3k-1} = 0$ and $p_{3k} = 1/2$; otherwise, predict $p_{3k-1} = 1/2$ and $p_{3k} = 1$. Clearly, for each $k \in [T/3]$, the steps $t \in \{3k-2, 3k-1, 3k\}$ have zero contribution to $\Delta_0$, $\Delta_1$, and $\Delta_{1/2}$. Therefore, this forecaster achieves a zero ECE on $\mathcal{D}$. This proves $\mathrm{OPT}_{\mathsf{ECE}}(\mathcal{D}) = 0$ and establishes the $0$-$\Omega(T)$ truthfulness gap for the ECE. Finally, the inequality $\frac{\mathsf{MSR}(x,p)}{T} \le \frac{2\,\mathsf{ECE}(x,p)}{T}$ implies that the same forecaster achieves a zero MSR, which establishes $\mathrm{OPT}_{\mathsf{MSR}}(\mathcal{D}) = 0$ and the $0$-$\Omega(T)$ truthfulness gap for the MSR.

### A.3 0-Ω(√T) Truthfulness Gaps

Next, we prove the $0$-$\Omega(\sqrt{T})$ truthfulness gap for several calibration measures. The proof follows the argument outlined in Section 3.

Proposition A.2. The smooth calibration error, the distance from calibration, the interval calibration error, and the Laplace-kernel calibration error all have a $0$-$\Omega(\sqrt{T})$ truthfulness gap.

Proof. The truthfulness gaps of the four calibration measures are witnessed by the same product distribution $\mathcal{D} = \prod_{t=1}^{T} \mathrm{Bernoulli}(p^*_t)$, where $(p^*_{3k-2}, p^*_{3k-1}, p^*_{3k}) = (1/2, 0, 1)$ for every $k \in [T/3]$.

The truthful forecaster has an $\Omega(\sqrt{T})$ penalty. The truthful forecaster makes predictions identical to $p^*$. As a result, we have $\Delta_0 = \Delta_1 = 0$, while $\Delta_{1/2}$ is distributed as the difference between a sample from $\mathrm{Binomial}(T/3, 1/2)$ and its mean $T/6$. It then follows that $|\Delta_{1/2}| \ge \Omega(\sqrt{T})$ holds with probability $\Omega(1)$. We will show that all four calibration measures evaluate to $\Omega(\sqrt{T})$ in expectation.

For the smooth calibration error, we have

$$\mathrm{err}_{\mathsf{smCE}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) = \mathbb{E}_{x \sim \mathcal{D}}[|\Delta_{1/2}|] = \mathbb{E}_{X \sim \mathrm{Binomial}(T/3, 1/2)}[|X - T/6|] = \Omega(\sqrt{T}).$$

For the distance from calibration, by [BGHN23, Lemma 5.4 and Theorem 7.3], the inequality $\frac{1}{2}\mathsf{smCE}(x, p) \le \mathsf{CalDist}(x, p)$ holds for any $x \in \{0,1\}^T$ and $p \in [0,1]^T$, so the truthful forecaster also gives $\mathrm{err}_{\mathsf{CalDist}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) = \Omega(\sqrt{T})$.

For interval calibration, let $\mathcal{I}$ be an arbitrary interval partition, and let $I \in \mathcal{I}$ be the interval that contains $1/2$. If $I$ contains either $0$ or $1$, we must have $\mathrm{len}(I) \ge 1/2$, and the term $\sum_{t=1}^{T} \sum_{i=1}^{|\mathcal{I}|} \mathrm{len}(I_i) \cdot \mathbb{1}[p_t \in I_i]$ will be at least $(2T/3) \cdot (1/2) = \Omega(T)$. If $I$ contains neither $0$ nor $1$, the summation $\sum_{t=1}^{T} (x_t - p_t) \cdot \mathbb{1}[p_t \in I]$ will be exactly $\Delta_{1/2}$, and the first term in the definition will be at least $|\Delta_{1/2}|$.
It follows that $\mathsf{intCE}(x, p) \ge \Omega(\sqrt{T})$ with probability $\Omega(1)$, so we have the lower bound $\mathrm{err}_{\mathsf{intCE}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) = \Omega(\sqrt{T})$.

For Laplace-kernel calibration, let $f_0$ be an arbitrary function in $\tilde{\mathcal{F}}$ such that $f_0(1/2) > 0$; e.g., we can take $f_0(x) = c e^{-x^2}$ for a sufficiently small constant $c > 0$. Then, we have

$$\mathsf{kCE}_{\mathrm{Lap}}(x, p) \ge \sup_{f \in \{f_0, -f_0\}} \sum_{\alpha \in [0,1]} f(\alpha) \cdot \Delta_\alpha \ge \Omega(1) \cdot |\Delta_{1/2}|.$$

It follows that $\mathrm{err}_{\mathsf{kCE}_{\mathrm{Lap}}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) \ge \Omega(1) \cdot \mathbb{E}[|\Delta_{1/2}|] = \Omega(\sqrt{T})$.

The strategic forecaster has a zero penalty. Consider the same strategic forecaster as in the proof of Proposition A.1: for each $k \in [T/3]$, predict $p_{3k-2} = 1/2$. If $x_{3k-2} = 0$, predict $(p_{3k-1}, p_{3k}) = (0, 1/2)$; otherwise, predict $(p_{3k-1}, p_{3k}) = (1/2, 1)$. Clearly, this guarantees that $\Delta_\alpha = 0$ holds for all $\alpha \in [0,1]$. By definition, we have $\mathrm{OPT}_{\mathsf{smCE}}(\mathcal{D}) = \mathrm{OPT}_{\mathsf{CalDist}}(\mathcal{D}) = 0$. It also easily follows that both $\mathsf{intCE}$ and $\mathsf{kCE}_{\mathrm{Lap}}$ evaluate to $0$. For $\mathsf{intCE}$, we consider the interval partition $\mathcal{I} = \{\{0\}, (0, 1/2), \{1/2\}, (1/2, 1), \{1\}\}$, which witnesses $\mathsf{intCE}(x, p) = 0$. For $\mathsf{kCE}_{\mathrm{Lap}}$, the summation $\sum_{t=1}^{T} f(p_t)(x_t - p_t) = \sum_{\alpha \in [0,1]} f(\alpha) \cdot \Delta_\alpha$ evaluates to $0$ for all $f \in \tilde{\mathcal{F}}$. This proves $\mathrm{OPT}_{\mathsf{intCE}}(\mathcal{D}) = \mathrm{OPT}_{\mathsf{kCE}_{\mathrm{Lap}}}(\mathcal{D}) = 0$.

### A.4 O(1)-Ω(√T) Truthfulness Gap of U-Calibration

For the U-calibration error, we prove a slightly smaller truthfulness gap of $O(1)$-$\Omega(\sqrt{T})$, via a more involved analysis.

Proposition A.3. The U-calibration error has an $O(1)$-$\Omega(\sqrt{T})$ truthfulness gap.

Proof. We use a slightly different construction: the product distribution $\mathcal{D} = \prod_{t=1}^{T} \mathrm{Bernoulli}(p^*_t)$, where $p^*_t = 1/2$ for $t \le T/2$ and $p^*_t = 1$ for $t > T/2$.

The truthful forecaster has an $\Omega(\sqrt{T})$ penalty. We first show that the truthful forecaster has an $\Omega(\sqrt{T})$ U-calibration error. Let the random variable $X := \sum_{t=1}^{T/2} x_t$ denote the number of ones among the first $T/2$ random bits. Note that $X$ follows $\mathrm{Binomial}(T/2, 1/2)$. Consider the scoring rule defined by $S(0, \alpha) = \mathrm{sgn}(\alpha - 1/2)$ and $S(1, \alpha) = \mathrm{sgn}(1/2 - \alpha)$. Note that $S$ is proper, since for any $\alpha \in [0,1]$, we have

$$\mathbb{E}_{x \sim \mathrm{Bernoulli}(\alpha)}[S(x, \beta)] = (1 - \alpha) \cdot \mathrm{sgn}(\beta - 1/2) + \alpha \cdot \mathrm{sgn}(1/2 - \beta) = (1 - 2\alpha) \cdot \mathrm{sgn}(\beta - 1/2),$$

which is always minimized at $\beta = \alpha$. The total loss (w.r.t. $S$) incurred by the truthful forecaster is then

$$\sum_{t=1}^{T} S(x_t, p_t) = X \cdot S(1, 1/2) + (T/2 - X) \cdot S(0, 1/2) + (T/2) \cdot S(1, 1) = X \cdot 0 + (T/2 - X) \cdot 0 + (T/2) \cdot (-1) = -T/2.$$

On the other hand, the total loss incurred by a fixed prediction $\beta \in [0,1]$ is given by

$$\sum_{t=1}^{T} S(x_t, \beta) = (T/2 + X) \cdot S(1, \beta) + (T/2 - X) \cdot S(0, \beta) = (T/2 + X) \cdot \mathrm{sgn}(1/2 - \beta) + (T/2 - X) \cdot \mathrm{sgn}(\beta - 1/2) = 2X \cdot \mathrm{sgn}(1/2 - \beta).$$

By choosing $\beta = 1$, a fixed prediction can obtain a total loss of $-2X$. Therefore, whenever $X \ge T/4$, we have $\mathsf{UCal}(x, p) \ge -T/2 - (-2X) = 2(X - T/4)$. When $X < T/4$, we always have $\mathsf{UCal}(x, p) \ge 0$, since the trivial scoring rule $S \equiv 0$ is proper. This shows that the truthful forecaster gives

$$\mathrm{err}_{\mathsf{UCal}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) \ge \mathbb{E}_{X \sim \mathrm{Binomial}(T/2, 1/2)}[\max\{2(X - T/4), 0\}] = \Omega(\sqrt{T}).$$

A strategic forecaster with an $O(1)$ penalty. We consider an alternative forecaster $\mathcal{A}$, which is slightly more involved: at every step $t \le T/2$, predict $p_t = 5/8$. For $t = T/2 + 1, T/2 + 2, \ldots, T$, keep predicting $p_t = 5/8$ until $|\Delta_{5/8}| \le 1$ at some step $t$; after that step, predict $p_t = 1$.

We first argue that the condition $|\Delta_{5/8}| \le 1$ must hold at some point. Recall that $X = \sum_{t=1}^{T/2} x_t$. By a Chernoff bound, $X$ falls into $[T/8, 5T/16]$ except with probability $e^{-\Omega(T)}$. Assuming this, we have $\Delta_{5/8} = X - (T/2) \cdot (5/8) \le 0$ at time $t = T/2$. Furthermore, if we hypothetically predicted $5/8$ on each of the last $T/2$ steps, we would have $\Delta_{5/8} = (X + T/2) - T \cdot (5/8) \ge T/8 + T/2 - 5T/8 = 0$ after all $T$ steps. Since $\Delta_{5/8}$ changes by at most $1$ at each step, we must hit the condition $|\Delta_{5/8}| \le 1$ at some point. Therefore, except with an $e^{-\Omega(T)}$ probability, we end up with $\Delta_{5/8} \in [-1, 1]$.
Furthermore, we predict at most two different values: $5/8$ and $1$. For every fixed proper scoring rule $S : \{0,1\} \times [0,1] \to [-1,1]$, we have

$$\sum_{t=1}^{T} S(x_t, p_t) - \inf_{\beta \in [0,1]} \sum_{t=1}^{T} S(x_t, \beta) \le \sum_{\alpha \in \{5/8,\, 1\}} \left[\sum_{t=1}^{T} S(x_t, p_t) \cdot \mathbb{1}[p_t = \alpha] - \inf_{\beta \in [0,1]} \sum_{t=1}^{T} S(x_t, \beta) \cdot \mathbb{1}[p_t = \alpha]\right].$$

In the above, we divide the time horizon $[T]$ into two parts, based on whether $5/8$ or $1$ is predicted. The inequality holds since the right-hand side allows different values of $\beta$ for the two parts. Clearly, the term corresponding to $\alpha = 1$ has zero contribution, since it reduces to $S(1, 1) - \inf_{\beta \in [0,1]} S(1, \beta)$ times the number of times $1$ is predicted, which evaluates to $0$ by the definition of proper scoring rules. The term corresponding to $\alpha = 5/8$, on the other hand, is given by

$$N_0 \cdot S(0, 5/8) + N_1 \cdot S(1, 5/8) - \inf_{\beta \in [0,1]}\left[N_0 \cdot S(0, \beta) + N_1 \cdot S(1, \beta)\right],$$

where $N_b$ denotes the number of steps on which $5/8$ is predicted and the outcome is $b \in \{0,1\}$. By the definition of proper scoring rules, the infimum is achieved at $\beta^* := \frac{N_1}{N_0 + N_1}$, and the above can be simplified to $(N_0 + N_1) \cdot [\bar{S}(\beta^*, 5/8) - \bar{S}(\beta^*, \beta^*)]$, where $\bar{S}(\alpha, \beta) := \alpha \cdot S(1, \beta) + (1 - \alpha) \cdot S(0, \beta)$ is the linear extension of $S$ to $[0,1]^2$. Let $\ell(\alpha) := \bar{S}(\alpha, \alpha)$ denote the univariate form of $\bar{S}$. The following is a standard fact about proper scoring rules (see, e.g., [KLST23, Lemma 1 and Corollary 2]).

Lemma A.4. For any proper scoring rule $\bar{S} : [0,1]^2 \to [-1,1]$ and its univariate form $\ell : [0,1] \to [-1,1]$, it holds for all $\alpha, \beta \in [0,1]$ that $\bar{S}(\alpha, \beta) = \ell(\beta) + (\alpha - \beta) \cdot \ell'(\beta)$, and $|\ell'(\alpha)| \le 2$ for all $\alpha \in [0,1]$.

In particular, we have $|\bar{S}(\beta^*, 5/8) - \bar{S}(5/8, 5/8)| = |\beta^* - 5/8| \cdot |\ell'(5/8)| \le 2|\beta^* - 5/8|$ and $|\bar{S}(5/8, 5/8) - \bar{S}(\beta^*, \beta^*)| = |\ell(5/8) - \ell(\beta^*)| \le 2|\beta^* - 5/8|$. It follows that, assuming $X \in [T/8, 5T/16]$,

$$\mathsf{UCal}(x, p) \le 4(N_0 + N_1) \cdot |\beta^* - 5/8| = 4\left|N_1 - \frac{5}{8}(N_0 + N_1)\right| = 4|\Delta_{5/8}| \le 4.$$

When $X \in [T/8, 5T/16]$ does not hold (which happens with probability $e^{-\Omega(T)}$), the U-calibration error is trivially upper bounded by $O(T)$. It follows that $\mathrm{OPT}_{\mathsf{UCal}}(\mathcal{D}) \le \mathrm{err}_{\mathsf{UCal}}(\mathcal{D}, \mathcal{A}) \le 4 + O(T) \cdot e^{-\Omega(T)} = O(1)$.

### A.5 Lack of Completeness

Every scoring rule $S : \{0,1\} \times [0,1] \to [0,1]$ induces a calibration measure $\mathsf{CM}^{(S)}(x, p) := \sum_{t=1}^{T} S(x_t, p_t)$.¹ When $S$ is proper, it is easy to show that the resulting $\mathsf{CM}^{(S)}$ is perfectly truthful, i.e., $(1, 0)$-truthful. A drawback of such calibration measures is that they all lack completeness. Concretely, consider the squared loss $S(x, p) := (x - p)^2$. When the outcomes $x_1, x_2, \ldots, x_T$ are independent and uniformly random bits, the right prediction $p_t \equiv 1/2$ gives a total penalty of $T/4$, which is only a constant factor away from the maximum possible penalty of $T$. This violates the completeness property in Definition 2.2. In contrast, as shown in Table 2, almost all the other calibration measures would evaluate to $O(\sqrt{T})$ in this case. Such an asymptotic gap better justifies the intuition that $p_t \equiv 1/2$ is a much better prediction than, say, $p_t \equiv 0$.

More generally, unless the proper scoring rule $S$ is trivial, we may find $(x_0, p_0) \in \{0,1\} \times (0,1)$ such that $S(x_0, p_0) > 0$. Then, on a sequence of independent samples from $\mathrm{Bernoulli}(p_0)$, we have

$$\mathbb{E}_{x_1, \ldots, x_T \sim \mathrm{Bernoulli}(p_0)}\left[\mathsf{CM}^{(S)}_T(x, p_0 \cdot \mathbf{1})\right] \ge T \cdot S(x_0, p_0) \cdot \Pr_{X \sim \mathrm{Bernoulli}(p_0)}[X = x_0] \ge T \cdot S(x_0, p_0) \cdot \min\{p_0, 1 - p_0\} = \Omega(T),$$

which violates the completeness condition in Definition 2.2.
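This completeness failure is immediate to verify numerically (our sketch, not from the paper):

```python
# Under the squared loss S(x, q) = (x - q)^2, even the truthful prediction
# q = 1/2 on i.i.d. fair coin flips accumulates a Theta(T) penalty.
import numpy as np

rng = np.random.default_rng(4)
T = 10_000
x = rng.integers(0, 2, size=T)
penalty = np.sum((x - 0.5) ** 2)   # exactly T/4, since |x_t - 1/2| = 1/2
print(penalty, T / 4)
```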
We also note that $\mathrm{smCE}(x, p) + \sqrt{T}$ gives a calibration measure that is trivially truthful: implicit in the proof of [QZ24, Theorem 3], the truthful forecaster achieves an $O(\sqrt{T})$ smooth calibration error on every distribution $\mathcal{D} \in \Delta(\{0,1\}^T)$, so it immediately gives a constant approximation of the optimal error, which is at least $\sqrt{T}$. However, this metric is not complete in the sense of Definition 2.2, since it evaluates to $\sqrt{T}$ instead of $0$ when $p = x$ (i.e., the predictions are binary and perfect). While SSCE also discourages the forecaster from over-optimizing the metric by introducing some additional noise, the subsampling procedure is arguably more organic and better-justified than adding a fixed $\sqrt{T}$ offset.

B Proof of Theorem 1.1

We prove Theorem 1.1, which we formally restate below.

Theorem B.1 (Formal version of Theorem 1.1). For every $p^\star \in [0,1]^T$, on the product distribution $\mathcal{D} = \prod_{t=1}^{T} \mathrm{Bernoulli}(p^\star_t)$, there is a forecaster that achieves an $O(\log^{3/2} T)$ smooth calibration error and distance from calibration. Moreover, assuming that $p^\star \in [\delta, 1-\delta]^T$ for a fixed constant $\delta \in (0, 1/2]$, both $\mathrm{OPT}_{\mathrm{smCE}}(\mathcal{D})$ and $\mathrm{OPT}_{\mathrm{CalDist}}(\mathcal{D})$ are $\Omega(\sqrt{T})$.

B.1 The Upper Bound Part

We start by proving the upper bound part of Theorem B.1 by designing a forecasting algorithm.

The forecasting algorithm. Our proof is based on an algorithm of [QZ24] that works for the special case that $p^\star_t \equiv 1/2$. Their algorithm starts by predicting $1/2$ on the first $T/2$ steps. Depending on the realization of these $T/2$ random bits, it predicts a slightly biased value for the next $T/2$ steps, until the total bias (i.e., the partial sum of $x_t - p_t$) becomes close to $0$ at some point. If there is still time left, the algorithm repeats the above strategy on the remainder of the time horizon. Roughly speaking, [QZ24] shows that a $\mathrm{polylog}(T)$ distance from calibration can be achieved by designing a sub-routine with the following three properties:

Small bias: With high probability, the total bias is $O(1)$ in magnitude at some time $t \in [T/2, T]$.

Proximity of predictions: During the sub-routine, the values being predicted lie in a short interval of length $\mathrm{polylog}(T)/\sqrt{T}$.

Sparsity of predictions: During the sub-routine, only $O(1)$ different values are predicted.

To handle the general case that $p^\star \in [0,1]^T$ is arbitrary, we design an alternative sub-routine, the behavior of which depends on whether the sequence $p^\star$ is sufficiently stationary in some sense. Let $\mu_{\mathrm{first}} := \frac{1}{T/2}\sum_{t=1}^{T/2} p^\star_t$ and $\mu_{\mathrm{second}} := \frac{1}{T/2}\sum_{t=T/2+1}^{T} p^\star_t$ be the averages of the first and the second halves of the sequence, respectively. Let $\mu = (\mu_{\mathrm{first}} + \mu_{\mathrm{second}})/2$ be the overall average.

Case 1: $|\mu_{\mathrm{first}} - \mu| > \mathrm{polylog}(T)/\sqrt{T}$. When $\mu_{\mathrm{first}}$ and $\mu$ are far away, we predict $\alpha := \frac{\mu_{\mathrm{first}} + \mu}{2}$ at every step. Without loss of generality, suppose that $\mu_{\mathrm{first}} < \mu$, in which case we have $\mu_{\mathrm{first}} < \alpha < \mu$, where both inequalities hold with a margin $> \mathrm{polylog}(T)/\sqrt{T}$. Then, with high probability the following two events happen: (1) The total bias is negative at time $T/2$, i.e., it holds that $\sum_{t=1}^{T/2} x_t < \alpha \cdot (T/2)$; (2) If we (hypothetically) predict the same value $\alpha$ for the second half, the bias will be positive in the end with high probability, i.e., $\sum_{t=1}^{T} x_t > \alpha T$. Therefore, with high probability, the bias must be close to $0$ at some point in $[T/2, T]$. In this case, this sub-routine has all the desired properties.

Case 2: $|\mu_{\mathrm{first}} - \mu| \le \mathrm{polylog}(T)/\sqrt{T}$.
When $\mu_{\mathrm{first}}$ and $\mu$ are close, we use a strategy that is more similar to the algorithm of [QZ24]. For the first half of the sequence, we predict $\alpha := \mu_{\mathrm{first}}$. Let $\Delta_{\mathrm{first}} := \sum_{t=1}^{T/2}(x_t - \alpha)$ denote the total bias at time $T/2$. Say that $\Delta_{\mathrm{first}} \ge 0$. Then, we will predict $\beta := \mu_{\mathrm{second}} + \frac{\Delta_{\mathrm{first}}}{T/2} + \frac{\mathrm{polylog}(T)}{\sqrt{T}}$ in the second half of the sequence. The value of $\beta$ is chosen such that we can offset the bias incurred in the first half (i.e., the $\Delta_{\mathrm{first}}/(T/2)$ term). We also introduce some additional bias (i.e., the $\mathrm{polylog}(T)/\sqrt{T}$ term), so that we can return to a zero bias with high probability. In this case, our sub-routine predicts two different values ($\alpha$ and $\beta$), and they only differ by $\mathrm{polylog}(T)/\sqrt{T}$ with high probability.

Formally, our algorithm is given in Algorithm 1. The actual algorithm is significantly more involved than the outline above. The complication is due to the constraint that all predictions must lie in $[0, 1]$, while our choice of $\beta$ in Case 2 above might be invalid. We circumvent this issue by noting that $\beta$ can be invalid only if $\mu_{\mathrm{first}}$ is too close to either $0$ or $1$. In that case, we will choose a different value of $\alpha$ (i.e., the prediction for the first half), so that the sign of the bias at time $T/2$ is more predictable, and the resulting choice of $\beta$ will likely be valid. (A runnable sketch of this procedure is given below, after the analysis overview.)

Algorithm 1: Forecaster for Product Distributions
Input: Parameters $p^\star_1, p^\star_2, \ldots, p^\star_T$. Outcomes $x_1, x_2, \ldots, x_T$ observed sequentially.
Output: Predictions $p_1, p_2, \ldots, p_T$.
1: $t \leftarrow 0$; $r \leftarrow 0$
2: while $t < T$ do
3:   $r \leftarrow r + 1$; $T^{(r)} \leftarrow T - t$; $H^{(r)} \leftarrow \lfloor T^{(r)}/2 \rfloor$
4:   if $T^{(r)} = 1$ then predict $p_T = 0$ and break
5:   $\mu^{(r)}_{\mathrm{first}} \leftarrow \frac{1}{H^{(r)}}\sum_{s=t+1}^{t+H^{(r)}} p^\star_s$; $\quad \mu^{(r)}_{\mathrm{second}} \leftarrow \frac{1}{H^{(r)}}\sum_{s=t+H^{(r)}+1}^{t+2H^{(r)}} p^\star_s$
6:   $\mu^{(r)} \leftarrow [\mu^{(r)}_{\mathrm{first}} + \mu^{(r)}_{\mathrm{second}}]/2$; $\quad \Delta^{(r)} \leftarrow 0$
7:   if $|\mu^{(r)}_{\mathrm{first}} - \mu^{(r)}| \ge \sqrt{2\ln T^{(r)}/H^{(r)}}$ then
8:     $\alpha^{(r)} \leftarrow [\mu^{(r)}_{\mathrm{first}} + \mu^{(r)}]/2$
9:     for $i = 1, 2, \ldots, 2H^{(r)}$ do
10:      $t \leftarrow t + 1$; predict $p_t \leftarrow \alpha^{(r)}$
11:      observe $x_t$; $\Delta^{(r)} \leftarrow \Delta^{(r)} + (x_t - p_t)$
12:      if $i > H^{(r)}$ and $|\Delta^{(r)}| \le 1$ then break
13:    end for
14:  else
15:    if $\mu^{(r)}_{\mathrm{first}} \le 1/2$ then
16:      if $\mu^{(r)}_{\mathrm{first}} \ge 10\sqrt{2\ln T^{(r)}/H^{(r)}}$ then $\alpha^{(r)} \leftarrow \mu^{(r)}_{\mathrm{first}}$
17:      else $\alpha^{(r)} \leftarrow \max\left\{\mu^{(r)}_{\mathrm{first}} - \sqrt{2\mu^{(r)}_{\mathrm{first}} \ln T^{(r)}/H^{(r)}},\ 0\right\}$
18:    else
19:      if $1 - \mu^{(r)}_{\mathrm{first}} \ge 10\sqrt{2\ln T^{(r)}/H^{(r)}}$ then $\alpha^{(r)} \leftarrow \mu^{(r)}_{\mathrm{first}}$
20:      else $\alpha^{(r)} \leftarrow \min\left\{\mu^{(r)}_{\mathrm{first}} + \sqrt{2[1 - \mu^{(r)}_{\mathrm{first}}] \ln T^{(r)}/H^{(r)}},\ 1\right\}$
21:    for $i = 1, 2, \ldots, H^{(r)}$ do
22:      $t \leftarrow t + 1$; predict $p_t \leftarrow \alpha^{(r)}$
23:      observe $x_t$; $\Delta^{(r)} \leftarrow \Delta^{(r)} + (x_t - p_t)$
24:    end for
25:    if $\Delta^{(r)} \ge 0$ then $\beta^{(r)} \leftarrow \min\left\{\mu^{(r)}_{\mathrm{second}} + \Delta^{(r)}/H^{(r)} + \sqrt{\ln T^{(r)}/(2H^{(r)})},\ 1\right\}$
26:    else $\beta^{(r)} \leftarrow \max\left\{\mu^{(r)}_{\mathrm{second}} + \Delta^{(r)}/H^{(r)} - \sqrt{\ln T^{(r)}/(2H^{(r)})},\ 0\right\}$
27:    for $i = 1, 2, \ldots, H^{(r)}$ do
28:      $t \leftarrow t + 1$; predict $p_t \leftarrow \beta^{(r)}$
29:      observe $x_t$; $\Delta^{(r)} \leftarrow \Delta^{(r)} + (x_t - p_t)$
30:      if $|\Delta^{(r)}| \le 1$ then break

The analysis. We analyze Algorithm 1 and prove the upper bound in Theorem B.1 in the following three steps: First, we break the execution of Algorithm 1 into different rounds of the while-loop, and show that each round brings a $\mathrm{polylog}(T)$ smooth calibration error in expectation. Then, using the simple observation that the smooth calibration error is sub-additive, we obtain an upper bound on the overall smooth calibration error. Finally, we use a relation between $\mathrm{smCE}(x, p)$ and $\mathrm{CalDist}(x, p)$ when $p$ only contains a few different values (shown by [QZ24]) to translate the upper bound into one on the distance from calibration.

The first step is the most technical. We fix $r$ and condition on the value of $t$ (equivalently, the value of $T^{(r)}$) at the beginning of the $r$-th round. Note that the event $t = t_0$ is solely determined by the realization of $x_1, x_2, \ldots, x_{t_0}$, so conditioning on the value of $t$, the subsequent bits $x_{t+1}$ through $x_T$ are still distributed according to $\mathcal{D}$.
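To make the control flow concrete, here is a small Python transcription of Algorithm 1. This sketch is ours: the constants inside the radicals follow the reconstruction of the pseudocode above and should be treated as indicative rather than authoritative.

```python
import math, random

def forecaster(p_star, rng):
    """Sketch of Algorithm 1 against the product distribution with means p_star."""
    T, preds, t = len(p_star), [], 0
    while t < T:
        rem = T - t                                  # T^(r)
        if rem == 1:
            preds.append(0.0)                        # Line 4
            break
        H = rem // 2                                 # H^(r)
        mu1 = sum(p_star[t:t + H]) / H               # mu_first
        mu2 = sum(p_star[t + H:t + 2 * H]) / H       # mu_second
        mu = (mu1 + mu2) / 2
        thr = math.sqrt(2 * math.log(rem) / H)
        delta = 0.0
        if abs(mu1 - mu) >= thr:                     # Type 1 round (Lines 7-13)
            alpha = (mu1 + mu) / 2
            for i in range(1, 2 * H + 1):
                x = rng.random() < p_star[t]; t += 1
                preds.append(alpha); delta += x - alpha
                if i > H and abs(delta) <= 1:
                    break
        else:                                        # Lines 14-30
            if mu1 <= 0.5:
                alpha = mu1 if mu1 >= 10 * thr else max(
                    mu1 - math.sqrt(2 * mu1 * math.log(rem) / H), 0.0)
            else:
                alpha = mu1 if 1 - mu1 >= 10 * thr else min(
                    mu1 + math.sqrt(2 * (1 - mu1) * math.log(rem) / H), 1.0)
            for _ in range(H):
                x = rng.random() < p_star[t]; t += 1
                preds.append(alpha); delta += x - alpha
            shift = math.sqrt(math.log(rem) / (2 * H))
            if delta >= 0:
                beta = min(mu2 + delta / H + shift, 1.0)
            else:
                beta = max(mu2 + delta / H - shift, 0.0)
            for _ in range(H):
                x = rng.random() < p_star[t]; t += 1
                preds.append(beta); delta += x - beta
                if abs(delta) <= 1:
                    break
    return preds

rng = random.Random(0)
p_star = [0.3 + 0.4 * rng.random() for _ in range(10_000)]
preds = forecaster(p_star, rng)
print(len(preds), "predictions,", len(set(preds)), "distinct values")  # O(log T) values
```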
Let the sequences $x^{(r)}$ and $p^{(r)}$ denote the outcomes and the predictions made in the $r$-th round; the two sequences are of the same length, though the length might vary. We classify the rounds into three different types as follows:

Type 1: The condition $|\mu^{(r)}_{\mathrm{first}} - \mu^{(r)}| \ge \sqrt{2\ln T^{(r)}/H^{(r)}}$ holds in the if-statement on Line 7.

Type 2: $|\mu^{(r)}_{\mathrm{first}} - \mu^{(r)}| < \sqrt{2\ln T^{(r)}/H^{(r)}}$, and $\alpha^{(r)}$ is set to $\mu^{(r)}_{\mathrm{first}}$ on either Line 16 or Line 19.

Type 3: $|\mu^{(r)}_{\mathrm{first}} - \mu^{(r)}| < \sqrt{2\ln T^{(r)}/H^{(r)}}$, and $\alpha^{(r)}$ is not set to $\mu^{(r)}_{\mathrm{first}}$.

Note that for fixed $p^\star$, the type of a round is deterministic given $r$ and $T^{(r)}$. The three lemmas below give high-probability bounds on the smooth calibration error incurred during each round.

Lemma B.2. Conditioning on the value of $T^{(r)}$, if the $r$-th round is Type 1, it holds with probability $1 - O(1/T^{(r)})$ that $\mathrm{smCE}(x^{(r)}, p^{(r)}) \le 1$.

Lemma B.3. Conditioning on the value of $T^{(r)}$, if the $r$-th round is Type 2, it holds with probability $1 - O(1/T^{(r)})$ that
$$\mathrm{smCE}(x^{(r)}, p^{(r)}) \le 1 + O\!\left(\frac{1}{H^{(r)}}\right)\left[\Delta^{(r)}_{\mathrm{first}}\right]^2 + O\!\left(\sqrt{\frac{\ln T^{(r)}}{H^{(r)}}}\right)\left|\Delta^{(r)}_{\mathrm{first}}\right|,$$
where $\Delta^{(r)}_{\mathrm{first}}$ denotes the value of $\Delta^{(r)}$ at the end of the first for-loop (i.e., the value used on Line 25).

Lemma B.4. Conditioning on the value of $T^{(r)}$, if the $r$-th round is Type 3, it holds with probability $1 - O(1/T^{(r)})$ that
$$\mathrm{smCE}(x^{(r)}, p^{(r)}) \le 1 + O\!\left(\frac{1}{H^{(r)}}\right)\left[\Delta^{(r)}_{\mathrm{first}}\right]^2 + O\!\left(\sqrt{\frac{\ln T^{(r)}}{H^{(r)}}}\right)\left|\Delta^{(r)}_{\mathrm{first}}\right|,$$
where $\Delta^{(r)}_{\mathrm{first}}$ denotes the value of $\Delta^{(r)}$ at the end of the first for-loop (i.e., the value used on Line 25).

We first prove the upper bound part of Theorem B.1 using the lemmas above.

Proof of Theorem B.1, the upper bound part. By Lemmas B.2 through B.4, regardless of the type of the $r$-th round, it holds with probability $1 - O(1/T^{(r)})$ that
$$\mathrm{smCE}(x^{(r)}, p^{(r)}) \le 1 + O\!\left(\frac{1}{H^{(r)}}\right)\left[\Delta^{(r)}_{\mathrm{first}}\right]^2 + O\!\left(\sqrt{\frac{\ln T^{(r)}}{H^{(r)}}}\right)\left|\Delta^{(r)}_{\mathrm{first}}\right|,$$
where $\Delta^{(r)}_{\mathrm{first}}$ is regarded as $0$ if the $r$-th round is Type 1. We say that the round fails if this upper bound on $\mathrm{smCE}$ does not hold. Conditioning on $T^{(r)} = L$, we always have $\mathrm{smCE}(x^{(r)}, p^{(r)}) \le L$, since there are at most $L$ steps in the $r$-th round. Therefore, we have the inequality
$$\mathrm{smCE}(x^{(r)}, p^{(r)}) \le 1 + O\!\left(\frac{1}{L}\right)\left[\Delta^{(r)}_{\mathrm{first}}\right]^2 + O\!\left(\sqrt{\frac{\ln L}{L}}\right)\left|\Delta^{(r)}_{\mathrm{first}}\right| + L \cdot \mathbb{1}[\text{round } r \text{ fails}].$$
We will upper bound $\mathbb{E}[\mathrm{smCE}(x^{(r)}, p^{(r)})]$ by taking an expectation over both sides of the above. Therefore, we examine the expectations of $|\Delta^{(r)}_{\mathrm{first}}|$ and $[\Delta^{(r)}_{\mathrm{first}}]^2$ conditioning on $T^{(r)} = L$. When the round is Type 1, there is nothing to upper bound. For Type 2 rounds, $\Delta^{(r)}_{\mathrm{first}}$ is the difference between $X_{\mathrm{first}} = \sum_{s=t+1}^{t+H} x_s$ and its mean $\mu_{\mathrm{first}} H$. Since the variance of $X_{\mathrm{first}}$ is $O(L)$, we have $\mathbb{E}|\Delta^{(r)}_{\mathrm{first}}| = O(\sqrt{L})$ and $\mathbb{E}[\Delta^{(r)}_{\mathrm{first}}]^2 = O(L)$.

Type 3 rounds are trickier. We assume that $\mu_{\mathrm{first}} \le 1/2$; this is without loss of generality since the $\mu_{\mathrm{first}} > 1/2$ case can be handled by a completely symmetric argument. Then, $\Delta^{(r)}_{\mathrm{first}}$ is the difference between $X_{\mathrm{first}} = \sum_{s=t+1}^{t+H} x_s$ and $\alpha H$, and $\alpha$ may differ from $\mu_{\mathrm{first}}$ by at most $\sqrt{2\mu_{\mathrm{first}} \ln L / H}$. This gives
$$\mathbb{E}\left[\Delta^{(r)}_{\mathrm{first}}\right]^2 = \mathbb{E}\left[(X_{\mathrm{first}} - \mu_{\mathrm{first}} H)^2\right] + (\mu_{\mathrm{first}} H - \alpha H)^2 \le O(L) + O(\mu_{\mathrm{first}} H \ln L).$$
Now we use the fact that when $\mu_{\mathrm{first}} \le 1/2$, the round is Type 3 only if $\mu_{\mathrm{first}} < 10\sqrt{2\ln L/H}$. This implies $O(\mu_{\mathrm{first}} H \ln L) \le O(\sqrt{L}\log^{3/2} L)$, which is dominated by the $O(L)$ term. It then follows from Jensen's inequality that $\mathbb{E}|\Delta^{(r)}_{\mathrm{first}}| \le \sqrt{\mathbb{E}[\Delta^{(r)}_{\mathrm{first}}]^2} = O(\sqrt{L})$.

Put everything together. We have the upper bound
$$\mathbb{E}\left[\mathrm{smCE}(x^{(r)}, p^{(r)}) \mid T^{(r)} = L\right] \le 1 + O\!\left(\tfrac{1}{L}\right)\mathbb{E}\left[\Delta^{(r)}_{\mathrm{first}}\right]^2 + O\!\left(\sqrt{\tfrac{\ln L}{L}}\right)\mathbb{E}\left|\Delta^{(r)}_{\mathrm{first}}\right| + L \cdot \Pr\left[\text{round } r \text{ fails} \mid T^{(r)} = L\right] \le O(1) + O(\sqrt{\log L}) + L \cdot O(1/L) = O(\sqrt{\log L}).$$
The second step applies our earlier conclusion that $\mathbb{E}|\Delta^{(r)}_{\mathrm{first}}| = O(\sqrt{L})$ and $\mathbb{E}[\Delta^{(r)}_{\mathrm{first}}]^2 = O(L)$ conditioning on $T^{(r)} = L$. Taking another expectation over the randomness in $T^{(r)}$ shows that $\mathbb{E}[\mathrm{smCE}(x^{(r)}, p^{(r)})] = O(\sqrt{\log T})$ for every $r$.
Note that we have
$$\mathrm{smCE}(x, p) = \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} f(p_t)(x_t - p_t) \le \sum_{r}\,\sup_{f \in \mathcal{F}} \sum_{t} f(p^{(r)}_t)(x^{(r)}_t - p^{(r)}_t) = \sum_{r} \mathrm{smCE}(x^{(r)}, p^{(r)}).$$
Furthermore, there are at most $O(\log T)$ rounds. It follows that $\mathbb{E}[\mathrm{smCE}(x, p)] = O(\log^{3/2} T)$. Finally, we note that in each round of the while-loop, the forecaster predicts at most $2$ different values (namely, $\alpha^{(r)}$ and $\beta^{(r)}$). Therefore, the predictions $p_1, p_2, \ldots, p_T$ contain at most $O(\log T)$ different values. By [QZ24, Theorem 2], we conclude that
$$\mathbb{E}[\mathrm{CalDist}(x, p)] \le O(1) \cdot \mathbb{E}\left[\mathrm{smCE}(x, p) + |\{p_1, p_2, \ldots, p_T\}|\right] = O(\log^{3/2} T).$$

Now we prove Lemmas B.2 through B.4. In the proofs below, we frequently drop the superscript $(r)$ since we only refer to the $r$-th round.

Proof of Lemma B.2. Recall that a Type 1 round is one in which the condition $|\mu_{\mathrm{first}} - \mu| \ge \sqrt{2\ln T^{(r)}/H}$ holds in the if-statement. We say that the round succeeds if we exit the for-loop using the break statement on Line 12, i.e., the condition "$i > H$ and $|\Delta^{(r)}| \le 1$" holds at some point (including in the last iteration where $i = 2H$); otherwise, the round fails. Note that only one value (namely, $\alpha^{(r)}$) is predicted within the round. Thus, if the round succeeds, we have $\mathrm{smCE}(x^{(r)}, p^{(r)}) = |\Delta_{\alpha^{(r)}}| = |\Delta^{(r)}| \le 1$.

It remains to control the probability for a Type 1 round to fail. Consider the random variables
$$X_{\mathrm{first}} := \sum_{s=t+1}^{t+H} x_s \quad \text{and} \quad X := \sum_{s=t+1}^{t+2H} x_s.$$
Note that both are sums of independent Bernoulli random variables, with $\mathbb{E}[X_{\mathrm{first}}] = \mu_{\mathrm{first}} H$ and $\mathbb{E}[X] = 2\mu H$. Also note that since $\alpha = (\mu_{\mathrm{first}} + \mu)/2$, we have $|\mu_{\mathrm{first}} - \alpha| = |\mu - \alpha| = \frac{1}{2}|\mu_{\mathrm{first}} - \mu| \ge \frac{1}{2}\sqrt{2\ln T^{(r)}/H}$. Without loss of generality, suppose that $\mu_{\mathrm{first}} < \mu$. By an additive Chernoff bound, we have
$$\Pr[X_{\mathrm{first}}/H \ge \alpha] \le \exp\left(-2H(\alpha - \mu_{\mathrm{first}})^2\right) \le \frac{1}{T^{(r)}}, \qquad \Pr[X/(2H) \le \alpha] \le \exp\left(-4H(\alpha - \mu)^2\right) \le \frac{1}{T^{(r)}}.$$
Therefore, except with probability $O(1/T^{(r)})$, we have both $X_{\mathrm{first}} < \alpha H$ and $X > 2\alpha H$. In other words, if the for-loop (hypothetically) ran all $2H$ iterations, we would have $\Delta^{(r)} < 0$ at the end of the $H$-th iteration, and $\Delta^{(r)} > 0$ at the end of the $2H$-th iteration. Since $\Delta^{(r)}$ changes by $|x_t - p_t| \le 1$ within each iteration, there must be an iteration $i \in \{H+1, H+2, \ldots, 2H\}$ at the end of which $\Delta^{(r)}$ falls into $[0, 1]$. By definition of Algorithm 1, we exit the for-loop at that time, and the $r$-th round succeeds.

Proof of Lemma B.3. Recall that in a Type 2 round, we have $|\mu_{\mathrm{first}} - \mu| < \sqrt{2\ln T^{(r)}/H}$ and $\alpha = \mu_{\mathrm{first}}$. Without loss of generality, suppose that $\mu_{\mathrm{first}} \le 1/2$; the case $\mu_{\mathrm{first}} > 1/2$ follows from a completely symmetric argument. We say that a Type 2 round succeeds if both conditions below are satisfied:

(i) When $\beta$ is chosen, the clipping (i.e., taking the minimum with $1$ or taking the maximum with $0$) is not effective.

(ii) We exit the second for-loop through the break statement on Line 30.

Otherwise, the round fails. Again, we first upper bound the smooth calibration error incurred within a successful round, and then control the probability for a round to fail. Since only $\alpha$ and $\beta$ are predicted in this round, we have $\mathrm{smCE}(x^{(r)}, p^{(r)}) = \sup_{f \in \mathcal{F}} [f(\alpha)\Delta_\alpha + f(\beta)\Delta_\beta]$, where $\Delta_\alpha$ and $\Delta_\beta$ are defined with respect to $x^{(r)}$ and $p^{(r)}$. The above is further bounded by
$$\sup_{f \in \mathcal{F}} \left[f(\beta)(\Delta_\alpha + \Delta_\beta) + (f(\alpha) - f(\beta))\Delta_\alpha\right] \le \sup_{f \in \mathcal{F}} \left[f(\beta)(\Delta_\alpha + \Delta_\beta)\right] + \sup_{f \in \mathcal{F}} \left[(f(\alpha) - f(\beta))\Delta_\alpha\right] \le |\Delta_\alpha + \Delta_\beta| + |\alpha - \beta| \cdot |\Delta_\alpha|,$$
using $\|f\|_\infty \le 1$ and the $1$-Lipschitzness of $f$. Note that $\Delta_\alpha + \Delta_\beta$ is exactly the value of $\Delta^{(r)}$ at the end of the second for-loop, while $\Delta_\alpha$ is its value after the first for-loop, i.e., $\Delta^{(r)}_{\mathrm{first}}$.
Then, assuming that the round succeeds, we have $|\Delta_\alpha + \Delta_\beta| \le 1$ and
$$|\alpha - \beta| = |\mu_{\mathrm{first}} - \beta| \le |\mu_{\mathrm{first}} - \mu_{\mathrm{second}}| + |\mu_{\mathrm{second}} - \beta| \le O\!\left(\sqrt{\frac{\ln T^{(r)}}{H}}\right) + \frac{|\Delta^{(r)}_{\mathrm{first}}|}{H}.$$
Plugging the above back into the upper bound on $\mathrm{smCE}(x^{(r)}, p^{(r)})$ shows that in a successful Type 2 round,
$$\mathrm{smCE}(x^{(r)}, p^{(r)}) \le 1 + O\!\left(\frac{1}{H}\right)\left[\Delta^{(r)}_{\mathrm{first}}\right]^2 + O\!\left(\sqrt{\frac{\ln T^{(r)}}{H}}\right)\left|\Delta^{(r)}_{\mathrm{first}}\right|.$$

In the following, we show that a Type 2 round succeeds with probability $1 - O(1/T^{(r)})$. Let $X_{\mathrm{first}} := \sum_{s=t+1}^{t+H} x_s$. Note that $X_{\mathrm{first}}$ is a sum of $H$ independent Bernoulli random variables and $\mathbb{E}[X_{\mathrm{first}}] = \mu_{\mathrm{first}} H$. Furthermore, we have $\Delta^{(r)}_{\mathrm{first}} = X_{\mathrm{first}} - \mu_{\mathrm{first}} H$. By an additive Chernoff bound,
$$\Pr\left[|\Delta^{(r)}_{\mathrm{first}}| = H \cdot |X_{\mathrm{first}}/H - \mu_{\mathrm{first}}| \ge \sqrt{\tfrac{1}{2}H\ln T^{(r)}}\right] \le \frac{2}{T^{(r)}}. \quad (3)$$

Recall that we need to argue that no clipping is applied when $\beta$ is chosen. We analyze the following two cases.

Case 1: $\Delta^{(r)}_{\mathrm{first}} \ge 0$. In this case, we need to show that $\mu_{\mathrm{second}} + \Delta^{(r)}_{\mathrm{first}}/H + \sqrt{\ln T^{(r)}/(2H)} \le 1$. Recall that we assumed $\mu_{\mathrm{first}} \le 1/2$ and $|\mu_{\mathrm{first}} - \mu| < \sqrt{2\ln T^{(r)}/H}$. The latter further implies $|\mu_{\mathrm{first}} - \mu_{\mathrm{second}}| = 2|\mu_{\mathrm{first}} - \mu| < 2\sqrt{2\ln T^{(r)}/H}$. Thus, it suffices to prove that
$$2\sqrt{\frac{2\ln T^{(r)}}{H}} + \frac{|\Delta^{(r)}_{\mathrm{first}}|}{H} + \sqrt{\frac{\ln T^{(r)}}{2H}} \le \frac{1}{2}.$$
When $|\Delta^{(r)}_{\mathrm{first}}| \le \sqrt{H\ln T^{(r)}/2}$ (i.e., the event in Equation (3) holds), the left-hand side above is $O(\sqrt{\ln T^{(r)}/H})$, which is below $1/2$ as long as $T^{(r)}$ exceeds some universal constant $T_0$. Therefore, the probability that a clipping is applied is at most $O(1/T^{(r)})$, where we absorb the constraint $T^{(r)} \ge T_0$ into the hidden constant in $O(\cdot)$.

Case 2: $\Delta^{(r)}_{\mathrm{first}} < 0$. In this case, we need to show that $\mu_{\mathrm{second}} + \Delta^{(r)}_{\mathrm{first}}/H - \sqrt{\ln T^{(r)}/(2H)} \ge 0$. Recall that the definition of Type 2 rounds implies $\mu_{\mathrm{first}} \ge 10\sqrt{2\ln T^{(r)}/H}$, so $\mu_{\mathrm{second}} \ge \mu_{\mathrm{first}} - 2\sqrt{2\ln T^{(r)}/H} \ge 8\sqrt{2\ln T^{(r)}/H}$. Thus, it suffices to prove that
$$\frac{|\Delta^{(r)}_{\mathrm{first}}|}{H} + \sqrt{\frac{\ln T^{(r)}}{2H}} \le 8\sqrt{\frac{2\ln T^{(r)}}{H}}.$$
The above holds whenever the event in Equation (3) happens, since then the left-hand side equals $2\sqrt{\ln T^{(r)}/(2H)} \le 2\sqrt{2\ln T^{(r)}/H}$.

Finally, we argue that, with high probability, we exit the second for-loop via the break statement. Let $X_{\mathrm{second}} := \sum_{s=t+H+1}^{t+2H} x_s$ denote the total outcome in the second half. By symmetry, we only deal with the case that $\Delta^{(r)}_{\mathrm{first}} \ge 0$, where we have $\beta = \mu_{\mathrm{second}} + \Delta^{(r)}_{\mathrm{first}}/H + \sqrt{\ln T^{(r)}/(2H)}$. If the second for-loop runs all the $H$ iterations in full, at the end of it, the value of $\Delta^{(r)}$ will be given by
$$\Delta^{(r)}_{\mathrm{first}} + X_{\mathrm{second}} - \beta H = X_{\mathrm{second}} - \mu_{\mathrm{second}} H - \sqrt{\tfrac{1}{2}H\ln T^{(r)}}.$$
Note that the above is non-negative only if $X_{\mathrm{second}} \ge \mu_{\mathrm{second}} H + \sqrt{H\ln T^{(r)}/2}$, which, by an additive Chernoff bound, holds with probability at most $1/T^{(r)}$. Therefore, with probability $1 - 1/T^{(r)}$, the value of $\Delta^{(r)}$ must fall into $[-1, 0]$ at some point during the second for-loop, and we will take the break statement accordingly.

Proof of Lemma B.4. Again, without loss of generality, suppose that $\mu_{\mathrm{first}} \le 1/2$; the other case follows from a completely symmetric argument. In contrast to Type 1 and Type 2 rounds, we say that a Type 3 round succeeds if all the following conditions hold simultaneously:

(i) $\Delta^{(r)}_{\mathrm{first}} \ge 0$, i.e., $\Delta^{(r)} \ge 0$ holds at the end of the first for-loop (so that the branch on Line 25 is taken).

(ii) When $\beta$ is chosen, the clipping (i.e., taking the minimum with $1$) is not effective.

(iii) We exit the second for-loop through the break statement on Line 30.

Otherwise, the round fails. By the same argument as in the proof of Lemma B.3, in a successful Type 3 round, we have
$$\mathrm{smCE}(x^{(r)}, p^{(r)}) \le 1 + O\!\left(\frac{1}{H}\right)\left[\Delta^{(r)}_{\mathrm{first}}\right]^2 + O\!\left(\sqrt{\frac{\ln T^{(r)}}{H}}\right)\left|\Delta^{(r)}_{\mathrm{first}}\right|.$$
The only change in the argument is the upper bound on $|\alpha - \beta|$, since $\alpha$ is no longer equal to $\mu_{\mathrm{first}}$. Nevertheless, we still have
$$|\alpha - \beta| \le |\alpha - \mu_{\mathrm{first}}| + |\mu_{\mathrm{first}} - \mu_{\mathrm{second}}| + |\mu_{\mathrm{second}} - \beta| \le \sqrt{\frac{2\mu_{\mathrm{first}}\ln T^{(r)}}{H}} + O\!\left(\sqrt{\frac{\ln T^{(r)}}{H}}\right) + \frac{|\Delta^{(r)}_{\mathrm{first}}|}{H} = O\!\left(\sqrt{\frac{\ln T^{(r)}}{H}}\right) + \frac{|\Delta^{(r)}_{\mathrm{first}}|}{H},$$
and the rest of the analysis goes through. Thus, it remains to show that a Type 3 round succeeds with probability $1 - O(1/T^{(r)})$. Let $X_{\mathrm{first}} := \sum_{s=t+1}^{t+H} x_s$. Note that $X_{\mathrm{first}}$ is a sum of independent Bernoulli random variables and $\mathbb{E}[X_{\mathrm{first}}] = \mu_{\mathrm{first}} H$.
By a multiplicative Chernoff bound, for any $\delta \ge 0$, we have $\Pr[X_{\mathrm{first}}/H \le (1-\delta)\mu_{\mathrm{first}}] \le \exp(-\delta^2 \mu_{\mathrm{first}} H / 2)$. In particular, plugging $\delta = \sqrt{2\ln T^{(r)}/(\mu_{\mathrm{first}} H)}$ into the above gives
$$\Pr\left[X_{\mathrm{first}}/H \le \mu_{\mathrm{first}} - \sqrt{2\mu_{\mathrm{first}}\ln T^{(r)}/H}\right] \le \frac{1}{T^{(r)}}.$$
Recall that $\alpha$ is chosen as the maximum between $\mu_{\mathrm{first}} - \sqrt{2\mu_{\mathrm{first}}\ln T^{(r)}/H}$ and $0$. Thus, with probability at least $1 - 1/T^{(r)}$, we have $X_{\mathrm{first}}/H \ge \alpha$, which is equivalent to $\Delta^{(r)} \ge 0$ at the end of the first for-loop.

Then, we need to argue that when $\beta$ is chosen, we have $\mu_{\mathrm{second}} + \Delta^{(r)}/H + \sqrt{\ln T^{(r)}/(2H)} \le 1$. We will show the equivalent inequality
$$(\mu_{\mathrm{second}} - 1/2) + \Delta^{(r)}/H + \sqrt{\ln T^{(r)}/(2H)} \le 1/2.$$
For the first term, we note that since $\mu = (\mu_{\mathrm{first}} + \mu_{\mathrm{second}})/2$, the assumption $|\mu_{\mathrm{first}} - \mu| < \sqrt{2\ln T^{(r)}/H}$ implies $|\mu_{\mathrm{first}} - \mu_{\mathrm{second}}| = O(\sqrt{\ln T^{(r)}/H})$. With the additional assumption that $\mu_{\mathrm{first}} \le 1/2$, we get
$$\mu_{\mathrm{second}} - 1/2 \le (\mu_{\mathrm{first}} - 1/2) + |\mu_{\mathrm{first}} - \mu_{\mathrm{second}}| \le O\!\left(\sqrt{\ln T^{(r)}/H}\right).$$
For the second term, we note that, at the end of the first for-loop, $\Delta^{(r)}/H$ is given by $(X_{\mathrm{first}}/H - \mu_{\mathrm{first}}) + (\mu_{\mathrm{first}} - \alpha)$. By an additive Chernoff bound, $X_{\mathrm{first}}/H - \mu_{\mathrm{first}} \le O(\sqrt{\ln T^{(r)}/H})$ holds with probability $1 - O(1/T^{(r)})$. By our choice of $\alpha$, $\mu_{\mathrm{first}} - \alpha$ is always $O(\sqrt{\ln T^{(r)}/H})$. Finally, the last term is clearly $O(\sqrt{\ln T^{(r)}/H})$. Therefore, as long as $T^{(r)}$ is larger than a universal constant $T_0$, the total is upper bounded by $1/2$. Again, we can absorb the condition $T^{(r)} \ge T_0$ into the big-$O$ notation, so the second condition (that $\beta$ is not clipped) is satisfied with probability $1 - O(1/T^{(r)})$.

Finally, we argue that we exit the second for-loop via the break statement with high probability. Let $X_{\mathrm{second}} := \sum_{s=t+H+1}^{t+2H} x_s$ denote the total outcome in the second half. Recall that we have $\Delta^{(r)} \ge 0$ at the end of the first for-loop, and that $\beta = \mu_{\mathrm{second}} + \Delta^{(r)}/H + \sqrt{\ln T^{(r)}/(2H)}$. If the second for-loop runs all the $H$ iterations in full, at the end of it, the value of $\Delta^{(r)}$ will be given by
$$\Delta^{(r)}_{\mathrm{first}} + X_{\mathrm{second}} - \beta H = X_{\mathrm{second}} - \mu_{\mathrm{second}} H - \sqrt{\tfrac{1}{2}H\ln T^{(r)}}.$$
Note that the above is non-negative only if $X_{\mathrm{second}} \ge \mu_{\mathrm{second}} H + \sqrt{H\ln T^{(r)}/2}$, which, by an additive Chernoff bound, holds with probability at most $1/T^{(r)}$. Therefore, with probability $1 - 1/T^{(r)}$, the value of $\Delta^{(r)}$ must fall into $[-1, 0]$ during the second for-loop, and we will take the break statement accordingly.

B.2 The Lower Bound Part

We prove the lower bound part of Theorem B.1 via a central limit theorem.

Proof of Theorem B.1, the lower bound part. On the product distribution $\mathcal{D} = \prod_{t=1}^{T}\mathrm{Bernoulli}(p^\star_t)$, the truthful forecaster predicts $p_t = p^\star_t$ at every step $t$. Then, we have
$$\mathbb{E}_{x \sim \mathcal{D}}[\mathrm{smCE}(x, p^\star)] = \mathbb{E}_{x \sim \mathcal{D}}\left[\sup_{f \in \mathcal{F}} \sum_{t=1}^{T} f(p^\star_t)(x_t - p^\star_t)\right] \ge \mathbb{E}_{x \sim \mathcal{D}}\left[\left|\sum_{t=1}^{T}(x_t - p^\star_t)\right|\right],$$
where we use the fact that $\mathcal{F}$ contains the constant functions $f \equiv 1$ and $f \equiv -1$. Applying the Berry-Esseen theorem [She10] to the random variable $X := \sum_{t=1}^{T}(x_t - p^\star_t)$ gives
$$\forall x \in \mathbb{R}, \quad |\Pr[X \le x\sigma_0] - \Phi(x)| \le C_0\,\sigma_0^{-1}\rho_0,$$
where $\Phi$ is the CDF of the standard normal distribution, $C_0 \le 0.56$ is a universal constant, and
$$\sigma_0^2 = \sum_{t=1}^{T}\mathbb{E}[(x_t - p^\star_t)^2] = \sum_{t=1}^{T} p^\star_t(1 - p^\star_t) \ge \delta(1-\delta)T, \qquad \rho_0 = \max_{t \in [T]} \frac{\mathbb{E}|x_t - p^\star_t|^3}{\mathbb{E}[|x_t - p^\star_t|^2]} = \max_{t \in [T]} \frac{p^\star_t(1-p^\star_t)[(p^\star_t)^2 + (1-p^\star_t)^2]}{p^\star_t(1-p^\star_t)} \le 1.$$
In particular, taking $x = -1$ gives $\Pr[X \le -\sigma_0] \ge \Phi(-1) - C_0\sigma_0^{-1}\rho_0 = \Omega(1) - O(1/\sqrt{T})$. For all sufficiently large $T$, the $O(1/\sqrt{T})$ term is dominated by the $\Omega(1)$ term, in which case we have
$$\mathbb{E}_{x \sim \mathcal{D}}[\mathrm{smCE}(x, p^\star)] \ge \mathbb{E}[|X|] \ge \sigma_0 \cdot \Pr[X \le -\sigma_0] = \Omega(\sqrt{T}).$$
Finally, by the inequality $\frac{1}{2}\mathrm{smCE}(x, p) \le \mathrm{CalDist}(x, p)$ [BGHN23, Lemma 5.4 and Theorem 7.3], the distance from calibration incurred by the truthful forecaster is also $\Omega(\sqrt{T})$.
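As a sanity check on this lower bound (ours, not part of the paper), one can estimate $\mathbb{E}|\sum_t (x_t - p^\star_t)|$ by simulation and compare it against the Gaussian prediction $\sigma_0\sqrt{2/\pi}$:

```python
import math, random

def truthful_bias(T, delta=0.2, trials=400, seed=1):
    rng = random.Random(seed)
    p_star = [delta + (1 - 2 * delta) * rng.random() for _ in range(T)]
    sigma0 = math.sqrt(sum(p * (1 - p) for p in p_star))
    est = 0.0
    for _ in range(trials):
        est += abs(sum((rng.random() < p) - p for p in p_star)) / trials
    return est, sigma0 * math.sqrt(2 / math.pi)  # Monte Carlo vs. E|N(0, sigma0^2)|

for T in [100, 1000, 10000]:
    est, gauss = truthful_bias(T)
    print(f"T={T:6d}  E|X| ~ {est:7.2f}   sigma0*sqrt(2/pi) = {gauss:7.2f}")
```

Both columns grow as $\Theta(\sqrt{T})$, matching the Berry-Esseen calculation above.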
C Supplemental Materials for Section 5

The following is a tighter version of Theorem 5.1.

Theorem C.1. For any $\mathcal{D} \in \Delta(\{0,1\}^T)$, $\mathrm{err}_{\mathrm{SSCE}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) = O(\mathbb{E}[\gamma(\mathrm{Var}_T)])$, where $\gamma(x) := x$ for $x < 1$ and $\gamma(x) := \sqrt{x}$ for $x \ge 1$.

Proof. Given a function $f: [0,1] \to [-1,1]$ and a binary vector $y \in \{0,1\}^T$, we define the martingale $M_t(f, y) := \sum_{s=1}^{t} y_s f(p^\star_s)(x_s - p^\star_s)$, where $x \sim \mathcal{D}$, $\mathcal{F}_t$ denotes the filtration describing the randomness of $M_T(f, y)$ up to time $t$, and $p^\star_t := \mathbb{E}[x_t \mid \mathcal{F}_{t-1}]$. Note that, conditioned on $\mathcal{F}_{t-1}$, $x_t$ is distributed as a Bernoulli with parameter $p^\star_t$. We can write the SSCE of a truthful forecaster in terms of $M_T(f, y)$ as
$$\mathrm{SSCE}(x, p^\star) = \mathbb{E}_{y \sim \mathrm{Unif}(\{0,1\}^T)}\left[\sup_{f \in \mathcal{F}} M_T(f, y)\right].$$
We now proceed via chaining and define the dyadic scale $\varepsilon_k = 2^{1-k}$ for $k = 0, 1, 2, \ldots$. To cover the set of Lipschitz functions $\mathcal{F}$, we will use the sets of piecewise constant functions $\{\mathcal{F}_\delta\}_{\delta > 0}$ described in Lemma C.2. For each function $f \in \mathcal{F}$, let $\pi_k(f)$ be a close function in $\mathcal{F}_{\varepsilon_k}$ such that $d(f, \pi_k(f)) \le 2\varepsilon_k$. Observe that the covering $\mathcal{F}_{\varepsilon_0}$ is a singleton and that $\pi_k(f)$ always exists as $\mathcal{F}_{\varepsilon_k}$ is a $2\varepsilon_k$-covering of $\mathcal{F}$. Telescoping then gives
$$f(x) = (f(x) - \pi_M(f)(x)) + \pi_0(f)(x) + \sum_{i=1}^{M}\left[\pi_i(f)(x) - \pi_{i-1}(f)(x)\right],$$
meaning that we have
$$\mathrm{SSCE}(x, p^\star) \le \underbrace{\mathbb{E}_y\left[\sup_f \sum_t y_t(f(p^\star_t) - \pi_M(f)(p^\star_t))(x_t - p^\star_t)\right]}_{\text{Term A}} + \underbrace{\mathbb{E}_y\left[\sup_f \sum_t y_t\,\pi_0(f)(p^\star_t)(x_t - p^\star_t)\right]}_{\text{Term B}} + \underbrace{\sum_{i=1}^{M}\mathbb{E}_y\left[\sup_f \sum_t y_t(\pi_i(f)(p^\star_t) - \pi_{i-1}(f)(p^\star_t))(x_t - p^\star_t)\right]}_{\text{Term C}}. \quad (4)$$

First, we can use $d(f, \pi_M(f)) \le 2^{2-M}$ to deterministically bound Term A by $2^{2-M}\,T$.

Second, we observe that the image of $\pi_0$ is a singleton: $|\{\pi_0(f) \mid f \in \mathcal{F}\}| = 1$; let this unique function be denoted by $f^\circ$. Then, Term B reduces to $\mathbb{E}_y[M_T(f^\circ, y)]$, which evaluates to $0$ after taking an expectation over $x \sim \mathcal{D}$, since for every $y \in \{0,1\}^T$, $(M_t(f^\circ, y))_{0 \le t \le T}$ forms a martingale.

Third, we observe that $\pi_i(f) - \pi_{i-1}(f)$ is a function from $[0,1]$ to $\{-2^{1-i}, 0, 2^{1-i}\}$ that takes a constant value along each segment $[j \cdot 2^{1-i}, (j+1)2^{1-i})$. Thus, we can bound each summand of Term C by
$$\mathbb{E}_y\left[\sup_f \sum_t y_t(\pi_i(f)(p^\star_t) - \pi_{i-1}(f)(p^\star_t))(x_t - p^\star_t)\right] \le \sum_{j=0}^{2^{i-1}} 2^{1-i}\,\mathbb{E}_y\left[\sup_{v \in \{\pm 1\}} \underbrace{\sum_t y_t\,v\,(x_t - p^\star_t)\,\mathbb{1}\left[j 2^{1-i} \le p^\star_t < (j+1)2^{1-i}\right]}_{=:\,M_T(v, y, i, j)}\right].$$
Invoking Lemma C.9 with $G = \{x \mapsto 1, x \mapsto -1\}$ and $I = [j 2^{1-i}, (j+1)2^{1-i})$, we have, for all $i \in [M]$, $j \in \{0, 1, \ldots, 2^{i-1}\}$, and $y \in \{0,1\}^T$:
$$\mathbb{E}_{x \sim \mathcal{D}}\left[\sup_{v \in \{\pm 1\}} M_T(v, y, i, j)\right] \le (48 + 8\ln 2)\,\mathbb{E}_{x \sim \mathcal{D}}\left[\gamma\left(\sum_t p^\star_t(1 - p^\star_t)\,\mathbb{1}\left[j2^{1-i} \le p^\star_t < (j+1)2^{1-i}\right]\right)\right].$$
Plugging this into Term C and using Lemma C.3 (applied to the at most $2^{i-1}+1$ intervals at scale $i$), we can simplify
$$\mathbb{E}_{x \sim \mathcal{D}}[\text{Term C}] \le (48 + 8\ln 2)\sum_{i=1}^{M} 2^{1-i}\sqrt{2^{i-1}+1}\;\mathbb{E}_{x \sim \mathcal{D}}[\gamma(\mathrm{Var}_T)] \le (48 + 8\ln 2)\sum_{i=1}^{M}\sqrt{2}\cdot 2^{(1-i)/2}\,\mathbb{E}_{x \sim \mathcal{D}}[\gamma(\mathrm{Var}_T)] \le (48 + 8\ln 2)(2 + 2\sqrt{2})\,\mathbb{E}_{x \sim \mathcal{D}}[\gamma(\mathrm{Var}_T)].$$
Plugging this into (4) and observing that we can choose $M$ to be arbitrarily large, we have, as desired,
$$\mathbb{E}_{x \sim \mathcal{D}}[\mathrm{SSCE}(x, p^\star)] \le \inf_{M \in \mathbb{N}}\left[2^{2-M}\,T + \mathbb{E}_{x \sim \mathcal{D}}[\text{Term C}]\right] \le (48 + 8\ln 2)(2 + 2\sqrt{2})\,\mathbb{E}_{x \sim \mathcal{D}}[\gamma(\mathrm{Var}_T)].$$

C.1 Auxiliary Lemmas

Covering Lipschitz functions. Let us first recall a standard covering of the class of Lipschitz functions $\mathcal{F} \subseteq [-1,1]^{[0,1]}$. We will work with the metric $d$ on $\mathcal{F}$ induced by the $\infty$-norm; that is, for any $f, g \in \mathcal{F}$, $d(f, g) := \sup_{x \in [0,1]} |f(x) - g(x)|$. In this section, for $\delta > 0$ and $b > a$ where $(b-a)/\delta \in \mathbb{Z}$, we will use the shorthand $[a, b]_\delta := \{a, a+\delta, \ldots, b\}$ to denote the endpoints of a partitioning of $[a, b]$ into segments of length $\delta$. We will also use the shorthand $\lfloor x \rfloor_\delta := \max\{i\delta \mid i\delta \le x, i \in \mathbb{Z}\}$ to denote rounding down to the nearest multiple of $\delta$.
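A minimal sketch (ours) of the projection used in the proof of Lemma C.2 below: round a $1$-Lipschitz $f$ down to the grid both in domain and range, then verify on a fine grid that the sup-distance is at most $2\delta$. The example functions are our own choices.

```python
import math

def project_to_cover(f, delta):
    """The piecewise-constant g with g(x) = floor_delta(f(floor_delta(x)))."""
    def floor_d(v):
        return math.floor(v / delta) * delta
    return lambda x: floor_d(f(floor_d(x)))

delta = 0.05
for name, f in [("sin", math.sin), ("hinge", lambda x: abs(x - 0.3) - 0.5)]:
    g = project_to_cover(f, delta)          # both f's are 1-Lipschitz on [0, 1]
    dist = max(abs(f(i / 10000) - g(i / 10000)) for i in range(10001))
    print(f"{name}: sup|f - g| = {dist:.4f}  (bound 2*delta = {2 * delta})")
```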
Lemma C.2. For $\delta > 0$ where $1/\delta \in \mathbb{Z}$, consider all functions $f: [0,1] \to [-1,1]$ that satisfy the conditions: (1) $\forall x \in [0,1]_\delta$: $f(x) \in [-1,1]_\delta$; (2) $\forall x \in [0,1]_\delta \setminus \{1\}$: $|f(x+\delta) - f(x)| \le \delta$; (3) $\forall x \in [0,1]$: $f(x) = f(\lfloor x \rfloor_\delta)$. This set of functions, which we will denote by $\mathcal{F}_\delta$, is a $2\delta$-covering of the set $\mathcal{F}$ of $1$-Lipschitz functions $[0,1] \to [-1,1]$ in the metric $d$.

Proof. Fix a $1$-Lipschitz function $f \in \mathcal{F}$. Let $f' \in \mathcal{F}_\delta$ be the function in our covering where, for all $x \in [0,1]_\delta$, $f'(x) = \lfloor f(x) \rfloor_\delta$. Note that $f'$ is unique because the elements of $\mathcal{F}_\delta$ can be identified by their image on $[0,1]_\delta$. For any $x \in [0,1]$, we have
$$|f(x) - f'(x)| \le |f(x) - f(\lfloor x \rfloor_\delta)| + |f'(x) - f'(\lfloor x \rfloor_\delta)| + |f(\lfloor x \rfloor_\delta) - f'(\lfloor x \rfloor_\delta)| \le |x - \lfloor x \rfloor_\delta| + 0 + |f(\lfloor x \rfloor_\delta) - f'(\lfloor x \rfloor_\delta)| \le 2\delta,$$
where the first inequality is the triangle inequality, the second uses the $1$-Lipschitzness of $f$ and that $f'(x) = f'(\lfloor x \rfloor_\delta)$, and the third uses the facts that $|f(z) - f'(z)| \le \delta$ and $|z - \lfloor z \rfloor_\delta| \le \delta$ for all $z \in [0,1]$.

Bounding sums of $\gamma$. Consider the piecewise function $\gamma(x) := x$ for $x < 1$, $\gamma(x) := \sqrt{x}$ for $x \ge 1$.

Lemma C.3. For all values $x_1, \ldots, x_n \ge 0$, we can upper bound $\sum_{i=1}^{n}\gamma(x_i) \le \sqrt{n}\cdot\gamma\left(\sum_{i=1}^{n} x_i\right)$.

Proof. First, suppose that $\sum_i x_i \le 1$. Then, $\gamma(\sum_i x_i) = \sum_i x_i$ and $x_i \le 1$ for all $i \in [n]$. The claim is therefore equivalent to the trivial statement $\sum_i x_i \le \sqrt{n}\sum_i x_i$. Now suppose that $\sum_i x_i > 1$. The Cauchy-Schwarz inequality gives $\sum_i \sqrt{x_i} \le \sqrt{n}\sqrt{\sum_i x_i}$. By our assumption that $\sum_i x_i > 1$, we have $\gamma(\sum_i x_i) = \sqrt{\sum_i x_i}$. We separately have $\gamma(x) \le \sqrt{x}$ for all $x \ge 0$, because $\gamma(x) = x \le \sqrt{x}$ for $x \in [0,1]$ and $\gamma(x) = \sqrt{x}$ for $x \ge 1$. Thus,
$$\sum_{i=1}^{n}\gamma(x_i) \le \sum_{i=1}^{n}\sqrt{x_i} \le \sqrt{n}\sqrt{\sum_{i=1}^{n} x_i} = \sqrt{n}\cdot\gamma\left(\sum_{i=1}^{n} x_i\right).$$

C.2 Epochs of Doubling Realized Variance

Definition C.4. For $I \subseteq [0,1]$, consider the stochastic process $(\mathrm{Var}_t(I))_{0 \le t \le T}$ defined as
$$\mathrm{Var}_t(I) := \sum_{s=1}^{t} p^\star_s(1 - p^\star_s)\,\mathbb{1}[p^\star_s \in I],$$
where $x \sim \mathcal{D}$ and $p^\star_t := \Pr_{x' \sim \mathcal{D}}[x'_t = 1 \mid x'_{1:(t-1)} = x_{1:(t-1)}]$. We define the epochs with respect to $I$ as the sequence $\tau_0, \tau_1, \ldots \in \mathbb{N} \cup \{\infty\}$ where $\tau_0 = 0$ and, for each $k \in [\lceil\log_2 T\rceil + 2]$,
$$\tau_k := \min\left(\{t \in [\tau_{k-1}+1, T] \mid \mathrm{Var}_t(I) - \mathrm{Var}_{\tau_{k-1}}(I) \ge 2^{k-1}\} \cup \{\infty\}\right). \quad (5)$$

The epochs $\tau_0, \tau_1, \ldots$ defined in Definition C.4 partition the $T$ time steps of a martingale into epochs such that the realized variance $\mathrm{Var}_t(I)$ increases by approximately $2^{k-1}$ within the $k$-th epoch. In particular, we can understand $\tau_k$ as pointing to the last time step of the $k$-th epoch. The definition of $\tau$ ensures that: Epoch 1 starts from time step 1 and ends at the earliest time step $t$ such that $\mathrm{Var}_t(I) \ge 1 = 2^0$; for $k \ge 2$, Epoch $k$ starts from the time step after the last step of Epoch $k-1$, and ends at the earliest time step at which the total variance within the epoch reaches $2^{k-1}$.

We have the following technical facts about the epochs $\tau$.

Fact C.5. The $(\lceil\log_2 T\rceil + 2)$-th epoch is never complete, i.e., $\tau_{\lceil\log_2 T\rceil + 2} = \infty$.

Proof. Our definition of $\mathrm{Var}_t(I)$ clearly guarantees $\mathrm{Var}_T(I) \le T$, which implies $\mathrm{Var}_T(I) - \mathrm{Var}_{\tau_{\lceil\log_2 T\rceil + 1}}(I) \le T < 2^{\lceil\log_2 T\rceil + 1}$, and therefore $\tau_{\lceil\log_2 T\rceil + 2} = \infty$.

Fact C.6. For every epoch $k \in [\lceil\log_2 T\rceil + 2]$, the change in realized variance in epoch $k$ is deterministically upper bounded by $\mathrm{Var}_{\tau_k}(I) - \mathrm{Var}_{\tau_{k-1}}(I) < 2^{k-1} + 1$.

Proof. By definition, $\mathrm{Var}_{\tau_k - 1}(I) - \mathrm{Var}_{\tau_{k-1}}(I) < 2^{k-1}$. Because $p^\star_t \in [0,1]$ for all $t \in [T]$, the realized variance increases by at most $p^\star_t(1 - p^\star_t) \le 1$ in each time step, i.e., $\mathrm{Var}_{\tau_k}(I) - \mathrm{Var}_{\tau_k - 1}(I) \le 1$. The fact follows by summing the two inequalities.
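The epoch construction is easy to compute explicitly. The sketch below (our own illustration) builds the $\tau_k$ for a concrete variance process and checks Facts C.5 and C.6 numerically.

```python
import math, random

def epochs(var_inc, T):
    """Epoch endpoints tau_0, tau_1, ... per Definition C.4 (None encodes infinity)."""
    var = [0.0]
    for v in var_inc:
        var.append(var[-1] + v)          # var[t] = Var_t(I), here with I = [0, 1]
    K = math.ceil(math.log2(T)) + 2
    taus = [0]
    for k in range(1, K + 1):
        prev = taus[-1]
        tau_k = next((t for t in range(prev + 1, T + 1)
                      if var[t] - var[prev] >= 2 ** (k - 1)), None)
        taus.append(tau_k)
        if tau_k is None:                # all later epochs are infinite as well
            taus += [None] * (K - k)
            break
    return taus, var

rng = random.Random(0)
T = 5000
taus, var = epochs([p * (1 - p) for p in (rng.random() for _ in range(T))], T)
print("finite epoch endpoints:", [t for t in taus if t is not None])
assert taus[-1] is None                  # Fact C.5: the last epoch never completes
for k in range(1, len(taus)):
    if taus[k] is not None:              # Fact C.6: per-epoch variance < 2^(k-1) + 1
        assert var[taus[k]] - var[taus[k - 1]] < 2 ** (k - 1) + 1
```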
Fact C.7. For any epoch $k \in [\lceil\log_2 T\rceil + 2]$, the probability that the $k$-th epoch ends is at most
$$\Pr[\tau_k < \infty] \le \min\left\{\mathbb{E}\left[\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)}\right] \cdot 2^{-(k-1)/2},\ 1\right\}.$$

Proof. The sequence of realized variances $\mathrm{Var}_1(I), \ldots, \mathrm{Var}_T(I)$ is deterministically non-decreasing. Thus, for every epoch $k \in [\lceil\log_2 T\rceil + 2]$,
$$\Pr[\tau_k < \infty] \le \Pr\left[\mathrm{Var}_T(I) - \mathrm{Var}_{\tau_{k-1}}(I) \ge 2^{k-1} \text{ and } \mathrm{Var}_T(I) \ge 1\right] \le \Pr\left[\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\cdot\mathrm{Var}_T(I) \ge 2^{k-1}\right] = \Pr\left[\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)} \ge 2^{(k-1)/2}\right],$$
with the second inequality following as $\tau_1 < \infty$ implies $\mathrm{Var}_T(I) \ge 1$. We can next invoke Markov's inequality $\Pr[X \ge a] \le \mathbb{E}[X]/a$ with $a = 2^{(k-1)/2}$ and $X = \mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)}$ to recover the claim.

Fact C.8. The exponentially weighted sum of probabilities that each epoch ends satisfies
$$\sum_{k=2}^{\lceil\log_2 T\rceil + 2}\sqrt{2^{k-1}}\,\Pr[\tau_{k-1} < \infty] \le (2\sqrt{2} + 2)\,\mathbb{E}\left[\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)}\right].$$

Proof. We will prove the deterministic inequality
$$\sum_{k=2}^{\lceil\log_2 T\rceil + 2}\sqrt{2^{k-1}}\,\mathbb{1}[\tau_{k-1} < \infty] \le (2\sqrt{2} + 2)\,\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)};$$
the fact follows from taking an expectation on both sides. Let $K = \max\{k \mid \tau_k < \infty\}$ be the number of completed epochs. When $K = 0$, we have $\mathrm{Var}_T(I) < 1$, and both sides of the above reduce to $0$. Now, suppose that $K \ge 1$, in which case we have $\mathrm{Var}_T(I) \ge 1$. By telescoping, we can lower bound the realized variance by $\mathrm{Var}_T(I) \ge \sum_{k=1}^{K} 2^{k-1} \ge 2^{K-1}$. Separately, by definition of $K$,
$$\sum_{k=2}^{\lceil\log_2 T\rceil + 2}\sqrt{2^{k-1}}\,\mathbb{1}[\tau_{k-1} < \infty] = \sum_{j=1}^{K} 2^{j/2} \le (2\sqrt{2} + 2)\cdot 2^{(K-1)/2} \le (2\sqrt{2} + 2)\,\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)},$$
with the last step following from $\mathrm{Var}_T(I) \ge \max\{1, 2^{K-1}\}$. Combining the previous inequalities gives the claim.

C.3 Random Walks with Early Stopping

We now prove a technical result: the magnitude of a random walk with random variance can be upper bounded by its (expected) standard deviation. Compared to Lemma 5.2, the lemma below gives a bound that depends on $\gamma(\mathrm{Var}_T(I))$ (rather than the square root), and avoids the extra $\log|G| \cdot \log T$ term. While the leading factor ($\propto \log|G|$) is larger than the one in Lemma 5.2 ($\propto \sqrt{\log|G|}$), we will only apply the bound to the case that $|G| = O(1)$, where the difference between the two is only a constant factor.

Lemma C.9. Given a function $f: [0,1] \to [-1,1]$, $y \in \{0,1\}^T$, and a set $I \subseteq [0,1]$, consider the martingale $M_t(f, y, I) := \sum_{s=1}^{t} y_s f(p^\star_s)(x_s - p^\star_s)\mathbb{1}[p^\star_s \in I]$, where $x \sim \mathcal{D}$ and $p^\star_t = \Pr_{x' \sim \mathcal{D}}[x'_t = 1 \mid x'_{1:(t-1)} = x_{1:(t-1)}]$. Then, for any finite family $G$ of functions from $[0,1]$ to $[-1,1]$, any $y \in \{0,1\}^T$, and any $I \subseteq [0,1]$, we have
$$\mathbb{E}_{x \sim \mathcal{D}}\left[\max_{f \in G} M_T(f, y, I)\right] \le 8(6 + \log|G|)\,\mathbb{E}_{x \sim \mathcal{D}}[\gamma(\mathrm{Var}_T(I))],$$
where $\mathrm{Var}_t(I) := \sum_{s=1}^{t} p^\star_s(1-p^\star_s)\mathbb{1}[p^\star_s \in I]$ is the realized variance restricted to the subset $I$, and $\gamma(x) := x$ for $x < 1$, $\gamma(x) := \sqrt{x}$ for $x \ge 1$.

Proof. Let us decompose the horizon into epochs of doubling realized variance with respect to the subset $I$ as per Definition C.4. Using $\tau$ as defined in (5), we will write $I_k := [\tau_{k-1}+1 : \min\{T, \tau_k\}]$ to denote the time steps composing epoch $k$ and write $K := \max\{k \mid \tau_k < \infty\}$ to denote the number of completed epochs. We will separately handle the contributions of epoch 1 and those of later epochs.

First epoch. Since $y_t \in \{0,1\}$ and $\|f\|_\infty \le 1$ holds for every $f \in G$, we can bound the expected contribution from the first epoch as follows:
$$\mathbb{E}\left[\max_{f \in G} M_{\tau_1}(f, y, I)\right] \le \mathbb{E}\left[\sum_{t \le \tau_1} |x_t - p^\star_t|\,\mathbb{1}[p^\star_t \in I]\right]. \quad (6)$$
Note that for any $p \in [0,1]$ and Bernoulli random variable $x \sim \mathrm{Bernoulli}(p)$,
$$\mathbb{E}[|x - p|] = \Pr[x = 0]\cdot|0 - p| + \Pr[x = 1]\cdot|1 - p| = 2p(1-p).$$
It thus follows that the process $(X_t)_{0 \le t \le T}$, where
$$X_t := \sum_{s=1}^{t}\left[|x_s - p^\star_s| - 2p^\star_s(1 - p^\star_s)\right]\mathbb{1}[p^\star_s \in I],$$
is a martingale, as conditioning on any realization of $x_{1:(t-1)}$, we have $\mathbb{E}[|x_t - p^\star_t| - 2p^\star_t(1-p^\star_t) \mid x_{1:t-1}] = \mathbb{E}_{x \sim \mathrm{Bernoulli}(p^\star_t)}[|x - p^\star_t|] - 2p^\star_t(1-p^\star_t) = 0$. By the optional stopping theorem, we have
$$\mathbb{E}\left[\sum_{t \le \tau_1}\left[|x_t - p^\star_t| - 2p^\star_t(1-p^\star_t)\right]\mathbb{1}[p^\star_t \in I]\right] = \mathbb{E}[X_{\tau_1}] = 0.$$
Plugging this identity into (6) gives
$$\mathbb{E}\left[\max_{f \in G} M_{\tau_1}(f, y, I)\right] \le \mathbb{E}\left[\sum_{t \le \tau_1} 2p^\star_t(1-p^\star_t)\mathbb{1}[p^\star_t \in I]\right] = 2\,\mathbb{E}[\mathrm{Var}_{\tau_1}(I)] \le 2\,\mathbb{E}[\min\{2, \mathrm{Var}_T(I)\}] \le 4\,\mathbb{E}[\min\{1, \mathrm{Var}_T(I)\}], \quad (7)$$
where the third step applies Fact C.6 with $k = 1$.

Later epochs. Applying a triangle inequality and the law of total expectation gives
$$\mathbb{E}\left[\max_{f \in G}\left(M_T(f, y, I) - M_{\tau_1}(f, y, I)\right)\right] \le \sum_{k=2}^{\lceil\log_2 T\rceil + 2}\mathbb{E}\left[\max_{f \in G}\sum_{t \in I_k} y_t f(p^\star_t)(x_t - p^\star_t)\mathbb{1}[p^\star_t \in I]\right] = \sum_{k=2}^{\lceil\log_2 T\rceil + 2}\Pr[\tau_{k-1} < \infty]\,\mathbb{E}\left[\max_{f \in G} M^{k,f}_T \,\Big|\, \tau_{k-1} < \infty\right], \quad (8)$$
where we define the process
$$M^{k,f}_t := \sum_{s=1}^{t} y_s f(p^\star_s)(x_s - p^\star_s)\,\mathbb{1}[p^\star_s \in I \wedge s \in I_k]. \quad (9)$$
In the above, the last step uses Fact C.5, namely that $\tau_{\lceil\log_2 T\rceil + 2} = \infty$, together with the observation that $M^{k,f}_T = 0$ unless $\tau_{k-1} < \infty$.

We can use Freedman's inequality to obtain a maximal inequality for each of these $M^{k,f}_T$ processes.

Fact C.10. For every $y \in \{0,1\}^T$ and $k \ge 2$, we can uniformly bound the process $M^{k,f}_T$ defined in (9) over a finite class $G$ of functions from $[0,1]$ to $[-1,1]$ by
$$\mathbb{E}\left[\max_{f \in G} M^{k,f}_T \,\Big|\, \tau_{k-1} < \infty\right] \le \sqrt{2^{k-1}}\left(2 + 2\sqrt{\log|G|}\right) + 2 + 2\log|G|.$$

Applying Fact C.10 to each of the martingales $M^{k,f}_T$ in (8) gives us
$$\mathbb{E}\left[\max_{f \in G}\left(M_T - M_{\tau_1}\right)\right] \le \sum_{k=2}^{\lceil\log_2 T\rceil + 2}\Pr[\tau_{k-1} < \infty]\left(\sqrt{2^{k-1}}\left(2 + 2\sqrt{\log|G|}\right) + 2 + 2\log|G|\right). \quad (10)$$
To upper bound the right-hand side above, we use Fact C.8 to bound
$$\sum_{k=2}^{\lceil\log_2 T\rceil + 2}\Pr[\tau_{k-1} < \infty]\sqrt{2^{k-1}} \le (2\sqrt{2} + 2)\,\mathbb{E}\left[\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)}\right],$$
and use Fact C.7 to bound
$$\sum_{k=2}^{\lceil\log_2 T\rceil + 2}\Pr[\tau_{k-1} < \infty] \le \sum_{k=2}^{\infty}\min\left\{1,\ \mathbb{E}\left[\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)}\right] 2^{-(k-2)/2}\right\} \le (2 + \sqrt{2})\,\mathbb{E}\left[\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)}\right].$$
Plugging these into (10) gives
$$\mathbb{E}\left[\max_{f \in G}\left(M_T(f, y, I) - M_{\tau_1}(f, y, I)\right)\right] \le 8\left(5 + \log|G|\right)\mathbb{E}\left[\mathbb{1}[\mathrm{Var}_T(I) \ge 1]\sqrt{\mathrm{Var}_T(I)}\right]. \quad (11)$$

Combine bounds. Combining (7) and (11) and recalling the definition of $\gamma$, we recover our main claim:
$$\mathbb{E}\left[\max_{f \in G} M_T(f, y, I)\right] \le \mathbb{E}\left[\max_{f \in G} M_{\tau_1}(f, y, I)\right] + \mathbb{E}\left[\max_{f \in G}\left(M_T(f, y, I) - M_{\tau_1}(f, y, I)\right)\right] \le 4\,\mathbb{E}[\gamma(\mathrm{Var}_T(I))] + 8(5 + \log|G|)\,\mathbb{E}[\gamma(\mathrm{Var}_T(I))] \le 8(6 + \log|G|)\,\mathbb{E}[\gamma(\mathrm{Var}_T(I))].$$
The second step above applies Inequalities (7) and (11). The third holds since $\min\{1, x\} \le \gamma(x)$ and $\mathbb{1}[x \ge 1]\sqrt{x} \le \gamma(x)$ hold for all $x \ge 0$.

Let us recall Freedman's inequality [Fre75].

Lemma C.11. Consider a martingale $(M_t)_{0 \le t \le n}$ with filtration $(\mathcal{F}_t)$ where $|M_t - M_{t-1}| \le 1$ for all $t \in [n]$. For all $x, y > 0$, we have the following high-probability bound on $M_n$:
$$\Pr\left[M_n \ge x \ \text{and}\ \sum_{t=1}^{n}\mathbb{E}\left[(M_t - M_{t-1})^2 \mid \mathcal{F}_{t-1}\right] \le y\right] \le \exp\left(-\frac{x^2}{2(x+y)}\right).$$

We now prove Fact C.10.

Proof of Fact C.10. Fix any $f \in G$. For $t \notin I_k$, we have trivially that for any $x_{1:t-1} \in \{0,1\}^{t-1}$:
$$\mathbb{E}_{x \sim \mathcal{D}}\left[y_t f(p^\star_t)(x_t - p^\star_t)\mathbb{1}[t \in I_k \wedge p^\star_t \in I] \mid x_{1:t-1}\right] = 0.$$
For $t \in I_k$, since $\mathbb{1}[\tau_{k-1} < \infty]$ and $p^\star_t$ are measurable with respect to $x_{1:t-1}$, we again have
$$\mathbb{E}_{x \sim \mathcal{D}}\left[y_t f(p^\star_t)(x_t - p^\star_t)\mathbb{1}[t \in I_k \wedge p^\star_t \in I] \mid x_{1:t-1}\right] = y_t f(p^\star_t)\mathbb{1}[t \in I_k \wedge p^\star_t \in I]\left(\mathbb{E}[x_t \mid x_{1:t-1}] - p^\star_t\right) = 0.$$
This means that $M^{k,f}_T$ is a martingale even conditioned on the event that $\tau_{k-1} < \infty$. Our construction of epoch $k$ in (5) further guarantees that the realized variance of $M^{k,f}$ is deterministically upper bounded by $\mathrm{Var}_{\tau_k}(I) - \mathrm{Var}_{\tau_{k-1}}(I) \le 2^{k-1} + 1$ (Fact C.6). Thus,
$$2^{k-1} + 1 \ge \sum_{t=1}^{T} p^\star_t(1-p^\star_t)\mathbb{1}[t \in I_k \wedge p^\star_t \in I] = \sum_{t=1}^{T}\mathbb{E}\left[(x_t - p^\star_t)^2 \mid x_{1:t-1}\right]\mathbb{1}[t \in I_k \wedge p^\star_t \in I]^2 \ge \sum_{t=1}^{T} y_t^2 f(p^\star_t)^2\,\mathbb{E}\left[(x_t - p^\star_t)^2 \mid x_{1:t-1}\right]\mathbb{1}[t \in I_k \wedge p^\star_t \in I]^2 = \sum_{t=1}^{T}\mathbb{E}\left[(M^{k,f}_t - M^{k,f}_{t-1})^2 \mid x_{1:t-1}\right].$$
(12)

In the above, the first equality uses the formula for a Bernoulli's variance (conditioned on $\mathcal{F}_{t-1}$, $x_t \sim \mathrm{Bernoulli}(p^\star_t)$), and the inequality uses that $y_t^2 \le 1$ and $|f(x)| \le 1$ for all $x \in [0,1]$.

We can thus use Freedman's inequality to bound the deviation of each martingale $M^{k,f}_T$. First, observe that the quadratic formula gives the implication
$$\exp\left(-\frac{x^2}{2(x+y)}\right) \le p \quad \text{if} \quad x \ge \log(1/p) + \sqrt{\log^2(1/p) + 2y\log(1/p)}.$$
We can therefore invoke Lemma C.11 with $y = 2^{k-1} + 1$ and $x = \sqrt{2y\log(1/p)} + 2\log(1/p) \ge \log(1/p) + \sqrt{\log^2(1/p) + 2y\log(1/p)}$ to show that
$$\Pr\left[M^{k,f}_T \ge \sqrt{2(2^{k-1}+1)\log(1/p)} + 2\log(1/p) \ \text{and}\ \sum_{t=1}^{T}\mathbb{E}\left[(M^{k,f}_t - M^{k,f}_{t-1})^2 \mid \mathcal{F}_{t-1}\right] \le 2^{k-1}+1 \,\Big|\, \tau_{k-1} < \infty\right] \le p.$$
Applying (12), which shows that the realized-variance condition always holds, we can simplify this to
$$\Pr\left[M^{k,f}_T \ge \sqrt{2^{k+1}\log(1/p)} + 2\log(1/p) \,\Big|\, \tau_{k-1} < \infty\right] \le p,$$
using $2(2^{k-1}+1) \le 2^{k+1}$. We can then take a union bound over $G$:
$$\Pr\left[\max_{f \in G} M^{k,f}_T \ge \sqrt{2^{k+1}\log(|G|/p)} + 2\log(|G|/p) \,\Big|\, \tau_{k-1} < \infty\right] \le p.$$
Using the layer cake representation of the expectation, we can convert this high-probability bound into an expectation bound through a change of variables:
$$\mathbb{E}\left[\max_{f \in G} M^{k,f}_T \,\Big|\, \tau_{k-1} < \infty\right] = \int_0^\infty \Pr\left[\max_{f \in G} M^{k,f}_T \ge t \,\Big|\, \tau_{k-1} < \infty\right] dt \le \int_0^1\left[\sqrt{2^{k+1}\log(|G|/p)} + 2\log(|G|/p)\right] dp = \sqrt{2^{k+1}}\left(\frac{|G|\sqrt{\pi}}{2}\,\mathrm{erfc}\!\left(\sqrt{\log|G|}\right) + \sqrt{\log|G|}\right) + 2 + 2\log|G|,$$
where the last equality follows by Fact C.12. When $|G| > 1$, using $\mathrm{erfc}(z) < \frac{\exp(-z^2)}{z\sqrt{\pi}}$, we compute
$$\mathbb{E}\left[\max_{f \in G} M^{k,f}_T \,\Big|\, \tau_{k-1} < \infty\right] \le \sqrt{2^{k+1}}\left(\frac{1}{2\sqrt{\log|G|}} + \sqrt{\log|G|}\right) + 2 + 2\log|G| \le \sqrt{2^{k-1}}\left(2 + 2\sqrt{\log|G|}\right) + 2 + 2\log|G|.$$
When $|G| = 1$, we again have
$$\mathbb{E}\left[\max_{f \in G} M^{k,f}_T \,\Big|\, \tau_{k-1} < \infty\right] \le \sqrt{\pi}\sqrt{2^{k-1}} + 2 \le \sqrt{2^{k-1}}\left(2 + 2\sqrt{\log|G|}\right) + 2 + 2\log|G|.$$

Fact C.12. For $k, n \in \mathbb{Z}_+$, the following integral identity holds:
$$\int_0^1\left[\sqrt{2^{k+1}\log(n/p)} + 2\log(n/p)\right] dp = \sqrt{2^{k+1}}\left(\frac{n\sqrt{\pi}}{2}\,\mathrm{erfc}\!\left(\sqrt{\log n}\right) + \sqrt{\log n}\right) + 2\log n + 2,$$
where $\mathrm{erfc}$ denotes the complementary error function (and $\log$ the natural logarithm).

Proof. Let us first separate the integral into two parts:
$$\int_0^1\left[\sqrt{2^{k+1}\log(n/p)} + 2\log(n/p)\right] dp = \int_0^1\sqrt{2^{k+1}\log(n/p)}\,dp + \int_0^1 2\log(n/p)\,dp.$$
We can bound the second integral easily. Since $\log(n/p) = \log n - \log p$,
$$\int_0^1 2\log(n/p)\,dp = 2\log n - 2\int_0^1 \log p\,dp = 2\log n + 2. \quad (13)$$
Now we consider the first integral. Let $u = \log(n/p)$; then $p = n e^{-u}$ and $dp = -n e^{-u}\,du$. When $p = 1$, $u = \log n$; as $p \to 0$, $u \to \infty$. Thus,
$$\int_0^1\sqrt{\log(n/p)}\,dp = n\int_{\log n}^{\infty}\sqrt{u}\,e^{-u}\,du.$$
Integrating by parts (equivalently, recognizing the upper incomplete gamma function $\Gamma(3/2, \cdot)$),
$$\int_{\log n}^{\infty}\sqrt{u}\,e^{-u}\,du = \sqrt{\log n}\cdot e^{-\log n} + \frac{1}{2}\int_{\log n}^{\infty}\frac{e^{-u}}{\sqrt{u}}\,du = \frac{\sqrt{\log n}}{n} + \frac{\sqrt{\pi}}{2}\,\mathrm{erfc}\!\left(\sqrt{\log n}\right).$$
Multiplying by $n\sqrt{2^{k+1}}$ gives
$$\int_0^1\sqrt{2^{k+1}\log(n/p)}\,dp = \sqrt{2^{k+1}}\left(\sqrt{\log n} + \frac{n\sqrt{\pi}}{2}\,\mathrm{erfc}\!\left(\sqrt{\log n}\right)\right). \quad (14)$$
Summing (13) and (14) gives the claim.
C.4 Proof of Lemma 5.2

Lemma 5.2. Given a function $f: [0,1] \to [-1,1]$ and $y \in \{0,1\}^T$, consider the martingale $M_t(f, y) := \sum_{s=1}^{t} y_s f(p^\star_s)(x_s - p^\star_s)$, where $x \sim \mathcal{D}$. Then, for any finite family $G$ of functions from $[0,1]$ to $[-1,1]$ and any $y \in \{0,1\}^T$, we have
$$\mathbb{E}\left[\max_{f \in G} M_T(f, y)\right] \le O\left(\log|G| \cdot \log T + \sqrt{\log|G|}\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sqrt{\mathrm{Var}_T}\right]\right).$$

Proof. Let us decompose the martingale $M_T(f, y)$ into epochs of doubling realized variance with respect to $I = [0,1]$ as per Definition C.4. Using $\tau$ as defined in (5), we will write $I_k := [\tau_{k-1}+1, \min\{T, \tau_k\}]$ to denote the time steps composing epoch $k$ and write $K := \max\{k \mid \tau_k < \infty\}$ to denote the number of completed epochs. Applying a triangle inequality and the law of total expectation gives
$$\mathbb{E}\left[\max_{f \in G}\sum_{t=1}^{T} y_t f(p^\star_t)(x_t - p^\star_t)\right] \le \sum_{k=1}^{\lceil\log_2 T\rceil + 2}\Pr[\tau_{k-1} < \infty]\,\mathbb{E}\left[\max_{f \in G} M^{k,f}_T \,\Big|\, \tau_{k-1} < \infty\right], \quad (15)$$
where we define the process $M^{k,f}_T := \sum_{t=1}^{T} y_t f(p^\star_t)(x_t - p^\star_t)\mathbb{1}[t \in I_k]$. In the above, we use Fact C.5, namely that $\tau_{\lceil\log_2 T\rceil + 2} = \infty$. Applying Fact C.10 to each of the martingales $M^{k,f}_T$ in (15) gives us
$$\mathbb{E}\left[\max_{f \in G}\sum_{t=1}^{T} y_t f(p^\star_t)(x_t - p^\star_t)\right] \le \sum_{k=1}^{\lceil\log_2 T\rceil + 2}\Pr[\tau_{k-1} < \infty]\left[\sqrt{2^{k-1}}\left(2 + 2\sqrt{\log|G|}\right) + 2 + 2\log|G|\right].$$
We can upper bound some of the summands on the right-hand side by using Fact C.8 to bound
$$\sum_{k=2}^{\lceil\log_2 T\rceil + 2}\Pr[\tau_{k-1} < \infty]\sqrt{2^{k-1}} \le (2\sqrt{2} + 2)\,\mathbb{E}\left[\sqrt{\mathrm{Var}_T}\right],$$
and by bounding each of the at most $\lceil\log_2 T\rceil + 2$ remaining terms by $2 + 2\log|G|$. This gives
$$\mathbb{E}\left[\max_{f \in G}\sum_{t=1}^{T} y_t f(p^\star_t)(x_t - p^\star_t)\right] \le (2 + 2\log|G|)(\lceil\log_2 T\rceil + 2) + (2\sqrt{2} + 2)\left(2 + 2\sqrt{\log|G|}\right)\mathbb{E}\left[\sqrt{\mathrm{Var}_T}\right] = O\left(\log|G|\log T + \sqrt{\log|G|}\,\mathbb{E}\left[\sqrt{\mathrm{Var}_T}\right]\right).$$

D Supplemental Materials for Section 6

Notation. For all stochastic processes $(X_t)$, we use $X_{t_1:t_2} := X_{\min\{t_2, T\}} - X_{t_1}$ to denote the increment within the time interval $(t_1, t_2]$ (with $X_0 = 0$ by default).

D.1 Proof of the Weaker Lower Bound

We restate and prove Lemmas 6.2 and 6.3.

Lemma 6.2. For any $x \in \{0,1\}^T$ and $p \in [0,1]^T$, we have $\mathrm{SSCE}(x, p) \ge \Omega(\sqrt{N_T})$.

Proof. Recall that SSCE is defined using smCE, which is in turn a supremum over the family $\mathcal{F}$ of Lipschitz functions. Since both $f \equiv 1$ and $f \equiv -1$ are included in $\mathcal{F}$, for any realized sequences $x$ and $p$, we can lower bound $\mathrm{SSCE}(x, p)$ as follows:
$$\mathrm{SSCE}(x, p) \ge \mathbb{E}_{y \sim \mathrm{Unif}(\{0,1\}^T)}\left[\left|\sum_{t=1}^{T} y_t(x_t - p_t)\right|\right] = \mathbb{E}\left[\left|\sum_{t=1}^{T} z_t + \mu\right|\right],$$
where we have defined $z_t := (y_t - 0.5)(x_t - p_t)$ to be zero-mean independent random variables, and $\mu := \sum_{t=1}^{T} 0.5(x_t - p_t)$. Now we partition $[T]$ into $T_1$ and $T_2$, where $T_1$ includes all time steps such that $|x_t - p_t| \ge 1/2$, and $T_2 = [T] \setminus T_1$ contains the remaining time steps. From the definition of $N_T$, it immediately follows that $N_T = |T_1|$. Letting $Z_1 := \sum_{t \in T_1} z_t$ and $Z_2 := \sum_{t \in T_2} z_t$, it remains to lower bound $\mathbb{E}[|Z_1 + Z_2 + \mu|]$ by $\Omega(\sqrt{N_T})$.

We will first prove that $\mathbb{E}[|Z_1|] \ge C\sqrt{N_T}$ for a universal constant $C > 0$. From the Berry-Esseen theorem (e.g., from [She10]), the CDF of $Z_1$ can be approximated by the CDF of the standard normal distribution as follows:
$$\forall x \in \mathbb{R}, \quad |\Pr[Z_1 \le x\sigma_0] - \Phi(x)| \le C_0\,\sigma_0^{-1}\rho_0,$$
where $\Phi$ is the standard Gaussian CDF, $C_0 \le 0.56$ is a universal constant, and
$$\sigma_0^2 = \sum_{t \in T_1}\mathbb{E}[z_t^2] = \sum_{t \in T_1}\frac{(x_t - p_t)^2}{4} \ge \frac{N_T}{16}, \qquad \rho_0 = \max_{t \in T_1}\frac{\mathbb{E}|z_t|^3}{\mathbb{E}[|z_t|^2]} = \max_{t \in T_1}\frac{|x_t - p_t|^3/8}{|x_t - p_t|^2/4} \le 1.$$
As a result, we can lower bound the probability of $|Z_1| \ge 0.05\sqrt{N_T}$ as follows:
$$\Pr\left[|Z_1| \ge 0.05\sqrt{N_T}\right] \ge \Pr[|Z_1| > 0.2\sigma_0] \quad (\sigma_0 \ge \sqrt{N_T}/4)$$
$$= 2\left(1 - \Pr[Z_1 \le 0.2\sigma_0]\right) \quad (Z_1 \text{ is symmetric})$$
$$\ge 2\left(1 - \Phi(0.2) - 4C_0/\sqrt{N_T}\right). \quad (\text{Berry-Esseen theorem})$$
Since $C_0 \le 0.56$ and $\Phi(0.2) \le 0.58$, we can guarantee $\Pr[|Z_1| \ge 0.05\sqrt{N_T}] = \Omega(1)$ whenever $N_T$ exceeds a universal constant $N_0$. When $N_T \le N_0$, we have $|Z_1| \ge N_T/4 \ge 0.05\sqrt{N_T}$ when all $\{z_t \mid t \in T_1\}$ are positive, which happens with probability $2^{-N_T} \ge 2^{-N_0} = \Omega(1)$. Therefore, we can always conclude that
$$\mathbb{E}[|Z_1|] \ge 0.05\sqrt{N_T}\cdot\Pr\left[|Z_1| \ge 0.05\sqrt{N_T}\right] \ge C\sqrt{N_T}$$
for some universal constant $C > 0$.

Finally, we consider the randomness of $Z_2$ and show that $\mathbb{E}[|Z_1 + Z_2 + \mu|] \ge \frac{C}{2}\sqrt{N_T}$. Applying the tower property of expectations, we have $\mathbb{E}_y[|Z_1 + Z_2 + \mu|] = \mathbb{E}\left[\mathbb{E}\left[|Z_1 + Z_2 + \mu| \mid Z_2\right]\right]$. Consider the following two cases for the inner conditional expectation:

When $|Z_2 + \mu| \ge \frac{C}{2}\sqrt{N_T}$: we use Jensen's inequality and $\mathbb{E}[Z_1] = 0$ to obtain $\mathbb{E}[|Z_1 + Z_2 + \mu| \mid Z_2] \ge |\mathbb{E}[Z_1 + Z_2 + \mu \mid Z_2]| = |Z_2 + \mu| \ge \frac{C}{2}\sqrt{N_T}$.

When $|Z_2 + \mu| < \frac{C}{2}\sqrt{N_T}$: we apply the triangle inequality and get $\mathbb{E}[|Z_1 + Z_2 + \mu| \mid Z_2] \ge \mathbb{E}[|Z_1|] - |Z_2 + \mu| > C\sqrt{N_T} - \frac{C}{2}\sqrt{N_T} = \frac{C}{2}\sqrt{N_T}$.

Therefore, regardless of the realization of $Z_2$, we always have $\mathbb{E}[|Z_1 + Z_2 + \mu| \mid Z_2] \ge \frac{C}{2}\sqrt{N_T}$. Taking an expectation over the randomness of $Z_2$ gives the desired bound $\mathrm{SSCE}(x, p) \ge \frac{C}{2}\sqrt{N_T}$.
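A quick empirical check of this $\Omega(\sqrt{N_T})$ bound (our sketch, not the paper's): fix $(x, p)$, estimate $\mathbb{E}_y|\sum_t y_t(x_t - p_t)|$ by sampling random subsets $y$, and compare with $\sqrt{N_T}$.

```python
import math, random

def subsampled_bias(x, p, trials=2000, seed=0):
    """Monte Carlo estimate of E_y | sum_t y_t (x_t - p_t) |."""
    rng = random.Random(seed)
    diffs = [xi - pi for xi, pi in zip(x, p)]
    return sum(abs(sum(d for d in diffs if rng.random() < 0.5))
               for _ in range(trials)) / trials

rng = random.Random(1)
T = 4000
x = [rng.randint(0, 1) for _ in range(T)]
p = [0.5] * T                       # every step has |x_t - p_t| = 1/2, so N_T = T
n_T = sum(1 for xi, pi in zip(x, p) if abs(xi - pi) >= 0.5)
print(f"estimate {subsampled_bias(x, p):.1f}  vs  sqrt(N_T) = {math.sqrt(n_T):.1f}")
```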
Lemma 6.3. The stochastic process $(N_t)_{t \in [T]}$ satisfies $\mathbb{E}[\sqrt{N_T}] \ge \Omega(\mathbb{E}[\sqrt{\mathrm{Var}_T}]) - O(1)$.

Proof. Since $N_T \ge \mathrm{Var}_T/16$ implies $\sqrt{N_T} \ge \sqrt{\mathrm{Var}_T}/4$, we have
$$\frac{1}{4}\sqrt{\mathrm{Var}_T} \le \sqrt{N_T} + \frac{1}{4}\sqrt{\mathrm{Var}_T}\,\mathbb{1}\left[N_T < \mathrm{Var}_T/16\right].$$
Therefore, to establish the inequality $\mathbb{E}[\sqrt{N_T}] \ge \Omega(\mathbb{E}[\sqrt{\mathrm{Var}_T}]) - O(1)$, it suffices to prove that the expectation of the second term, which we denote by $M$, is upper bounded by $O(1)$, i.e.,
$$M := \mathbb{E}\left[\sqrt{\mathrm{Var}_T}\,\mathbb{1}[N_T < \mathrm{Var}_T/16]\right] \le O(1). \quad (16)$$
We proceed by partitioning the range of $\mathrm{Var}_T$ into subintervals of geometrically increasing length and enumerating all possibilities for which subinterval $\mathrm{Var}_T$ falls into. If $\mathrm{Var}_T \le 1$, its contribution to $M$ is clearly $O(1)$. Otherwise, we must have $\mathrm{Var}_T \in [2^l, 2^{l+1})$ for some $l \in \mathbb{N}$, which implies that $N_T < \mathrm{Var}_T/16 < 2^{l-3}$. Therefore, we bound $M$ by taking a union bound over all such $l$'s:
$$M \le O(1) + \sum_{l \in \mathbb{N}}\mathbb{E}\left[\sqrt{\mathrm{Var}_T}\,\mathbb{1}\left[N_T < 2^{l-3} \wedge \mathrm{Var}_T \in [2^l, 2^{l+1})\right]\right] \le O(1) + \sum_{l \in \mathbb{N}}\sqrt{2^{l+1}}\,\Pr\left[N_T < 2^{l-3} \wedge \mathrm{Var}_T \ge 2^l\right]. \quad (17)$$
Now we bound $\Pr[N_T < k/8 \wedge \mathrm{Var}_T \ge k]$ for any fixed value of $k$ (that plays the role of $2^l$) by constructing a sub-martingale. We start by partitioning the time horizon $[T]$ into blocks based on the realized variance $\mathrm{Var}_t$, such that each block $B_j := (b_{j-1}, b_j]$ terminates once the realized variance $\mathrm{Var}_{B_j}$ first reaches $1$. Formally, using the notation $X_{t_1:t_2} := X_{\min\{t_2, T\}} - X_{t_1}$ to denote the increment of any process $(X_t)$ in $(t_1, t_2]$ (with $X_0 = 0$ by default), the endpoints $b_j$ are defined recursively as
$$b_0 := 0, \qquad b_j := \min\left(\{t \in [b_{j-1}+1, T] \mid \mathrm{Var}_{b_{j-1}:t} \ge 1\} \cup \{\infty\}\right), \quad j \ge 1.$$
We show in the following lemma that for each block $B_j$, the expected increment $N_{B_j}$ within $B_j$ is lower bounded by a constant as long as $B_j$ terminates before $T$.

Lemma D.1. For the constant $c = 1 - 1/e$,
$$\mathbb{E}\left[\mathbb{1}[N_{B_j} \ge 1] - c\,\mathbb{1}[b_j < \infty] \,\Big|\, \mathcal{F}_{b_{j-1}}\right] \ge 0.$$

We prove Lemma D.1 in Appendix D.2. This lemma justifies that if we define $A_j$ via
$$A_0 := 0, \qquad A_j - A_{j-1} := \mathbb{1}[N_{B_j} \ge 1] - c\,\mathbb{1}[b_j < \infty] \quad (j \ge 1),$$
then $(A_j)_{j \ge 0}$ forms a sub-martingale of bounded increments $|A_j - A_{j-1}| \le 1$, making it unlikely for any $A_j$ to deviate significantly below $0$. However, if $N_T < k/8$ and $\mathrm{Var}_T \ge k$, then $A_{k/2}$ must witness a large deviation: on the one hand, each of the blocks $B_1, \ldots, B_{k/2}$ terminates properly, because the variance within each block cannot exceed $2$; on the other hand, $N_T < k/8$ implies that at most $k/8$ of these blocks can have a nonzero increment $N_{B_j}$. As a result,
$$A_{k/2} = \sum_{j=1}^{k/2}\mathbb{1}[N_{B_j} \ge 1] - c\sum_{j=1}^{k/2}\mathbb{1}[b_j < \infty] \le N_T - c\cdot\frac{k}{2} < \frac{k}{8} - c\cdot\frac{k}{2} \le -\frac{k}{8}.$$
By applying the Azuma-Hoeffding inequality for sub-martingales, we can quantitatively bound the probability of such a large deviation by
$$\Pr\left[N_T < k/8 \wedge \mathrm{Var}_T \ge k\right] \le \Pr\left[A_{k/2} \le -k/8\right] \le e^{-k/64}.$$
Finally, plugging the above bound back into equation (17) gives us
$$M \le O(1) + \sum_{l \in \mathbb{N}}\sqrt{2^{l+1}}\,e^{-2^l/64} \le O(1).$$
We have thus established the inequality (16), which in turn proves the lemma.

D.2 Proof of Lemma D.1

Now we prove Lemma D.1, which we restate below.

Lemma D.1 (restated). For the constant $c = 1 - 1/e$, $\mathbb{E}\left[\mathbb{1}[N_{B_j} \ge 1] - c\,\mathbb{1}[b_j < \infty] \mid \mathcal{F}_{b_{j-1}}\right] \ge 0$.

Proof. We first show that for all $t \in [T]$, we have $\Pr[n_t = 1 \mid \mathcal{F}_{t-1}] \ge p^\star_t(1 - p^\star_t)$, where $\mathcal{F}_{t-1}$ denotes the filtration generated by all the randomness up to time $t-1$. Note that conditioning on $\mathcal{F}_{t-1}$, $x_t$ is distributed according to $\mathrm{Bernoulli}(p^\star_t)$. If the forecaster chooses $p_t \ge \frac{1}{2}$, the condition $|x_t - p_t| \ge \frac{1}{2}$ holds when $x_t = 0$, which happens with probability $1 - p^\star_t$; otherwise it holds when $x_t = 1$, which happens with probability $p^\star_t$. Therefore, regardless of the choice of $p_t$, we have
$$\Pr[n_t = 1 \mid \mathcal{F}_{t-1}] = \Pr_{x_t \sim \mathrm{Bernoulli}(p^\star_t)}\left[|x_t - p_t| \ge 1/2\right] \ge \min\{p^\star_t, 1 - p^\star_t\} \ge p^\star_t(1 - p^\star_t).$$
This allows us to invoke Lemma D.5 with $q_t := n_t$, $r_t := p^\star_t(1 - p^\star_t)$, and $\theta = 1$, where we only consider the random process inside block $B_j$. In this context, the stopping time $\tau_\theta$ corresponds to the end of the block, i.e., $b_j$.
Therefore, by applying Lemma D.5 at time step $b_{j-1}$, we obtain
$$A_{b_{j-1}} = \Pr\left[N_{B_j} \ge 1 \mid \mathcal{F}_{b_{j-1}}\right] - (1 - e^{-1})\Pr\left[b_j < \infty \mid \mathcal{F}_{b_{j-1}}\right] \ge 0 \quad \Longrightarrow \quad \mathbb{E}\left[\mathbb{1}[N_{B_j} \ge 1] - c\,\mathbb{1}[b_j < \infty] \mid \mathcal{F}_{b_{j-1}}\right] \ge 0,$$
where $c = 1 - \frac{1}{e}$.

D.3 A Stronger Lower Bound

In this section, we state and prove the stronger SSCE lower bound for all forecasters.

Theorem D.2. For any $\mathcal{D} \in \Delta(\{0,1\}^T)$, $\mathrm{OPT}_{\mathrm{SSCE}}(\mathcal{D}) = \Omega(\mathbb{E}[\gamma(\mathrm{Var}_T)])$, where the function $\gamma$ is defined as $\gamma(x) := x\,\mathbb{1}[0 \le x < 1] + \sqrt{x}\,\mathbb{1}[x \ge 1]$.

Proof of Theorem D.2. The theorem holds by combining Lemma 6.2, which lower bounds the SSCE by $\Omega(\sqrt{N_T})$, and the stronger lower bound on $\mathbb{E}[\sqrt{N_T}]$ shown in Lemma D.3.

Lemma D.3. There exists a universal constant $C > 0$ such that $\mathbb{E}[\sqrt{N_T}] \ge C\,\mathbb{E}[\gamma(\mathrm{Var}_T)]$, where $\gamma$ is defined as above.

Proof of Lemma D.3. The proof is also based on partitioning the time horizon into blocks $B_j = (b_{j-1}, b_j]$, each with approximately unit variance, similar to the approach used in proving Lemma 6.3. However, this proof involves a more careful analysis of the growth of $N_t$ by further grouping blocks into epochs and giving special treatment to the first epoch, where the cumulative variance is very small. Specifically, consider the blocks $B_j = (b_{j-1}, b_j]$ defined by
$$b_0 := 0, \qquad b_j := \min\left(\{t \in [b_{j-1}+1, T] \mid \mathrm{Var}_{b_{j-1}:t} \ge 1\} \cup \{\infty\}\right), \quad j \ge 1.$$
Recall that the increment of $\mathrm{Var}_t$ satisfies $\mathrm{Var}_t - \mathrm{Var}_{t-1} = p^\star_t(1 - p^\star_t) \le 1/4$. Thus, every block $j$ satisfies
$$\mathrm{Var}_{B_j} = \mathrm{Var}_{b_j} - \mathrm{Var}_{b_{j-1}} = (\mathrm{Var}_{b_j - 1} - \mathrm{Var}_{b_{j-1}}) + (\mathrm{Var}_{b_j} - \mathrm{Var}_{b_j - 1}) \le 1 + 1/4 = 5/4.$$
We further group blocks into epochs such that the $k$-th epoch $\mathcal{T}_k := (\tau_{k-1}, \tau_k]$ contains $2^{k-1}$ blocks:
$$\mathcal{T}_0 := B_1, \qquad \mathcal{T}_k := \bigcup_{j \in (2^{k-1}, 2^k]} B_j, \quad k \ge 1 \quad (\text{or equivalently, } \tau_k := b_{2^k}).$$
In addition, we define $\widetilde{N}_t$ as the sum of the $N_{B_j}$ capped at $1$ in each block:
$$\widetilde{N}_t := \sum_{j: b_j \le t}\min\{N_{B_j}, 1\} = \sum_{j: b_j \le t}\mathbb{1}[N_{B_j} \ge 1].$$
Clearly, for all realized sequences we have $N_T \ge \widetilde{N}_T$ and $\widetilde{N}_{\tau_k} \le 2^k$, where the latter is because each block contributes at most $1$ to $\widetilde{N}_t$. In the following, we will first analyze the growth of $\sqrt{\widetilde{N}_t}$ in epochs $k \ge 1$, then provide a different analysis for the zeroth epoch.

In each epoch $\mathcal{T}_k$ with $k \ge 1$. We start by establishing the following lemma, which extends the characterization of Lemma D.1 to epochs.

Lemma D.4 (Lower bound on $\widetilde{N}_{\mathcal{T}_k}$). For any $k \ge 1$, we have $\mathbb{E}[\widetilde{N}_{\mathcal{T}_k}] \ge 2^{k-2}\Pr[\tau_k < \infty]$.

Proof of Lemma D.4. According to Lemma D.1, we have that in each block $B_j = (b_{j-1}, b_j]$,
$$\mathbb{E}\left[\mathbb{1}[N_{B_j} \ge 1] - c\,\mathbb{1}[b_j < \infty]\right] = \mathbb{E}\left[\mathbb{E}\left[\mathbb{1}[N_{B_j} \ge 1] - c\,\mathbb{1}[b_j < \infty] \mid \mathcal{F}_{b_{j-1}}\right]\right] \ge 0,$$
where the first step uses the tower property of expectations, and $c = 1 - \frac{1}{e}$. Summing over all the blocks in epoch $\mathcal{T}_k$, we obtain
$$\mathbb{E}[\widetilde{N}_{\mathcal{T}_k}] = \sum_{j=2^{k-1}+1}^{2^k}\mathbb{E}\left[\mathbb{1}[N_{B_j} \ge 1]\right] \quad (\text{definition of } \mathcal{T}_k \text{ and } \widetilde{N}_t)$$
$$\ge c\sum_{j=2^{k-1}+1}^{2^k}\mathbb{E}\left[\mathbb{1}[b_j < \infty]\right] \quad (\text{Lemma D.1})$$
$$\ge c\sum_{j=2^{k-1}+1}^{2^k}\mathbb{E}\left[\mathbb{1}[\tau_k < \infty]\right] \quad (b_j \le b_{2^k} = \tau_k \text{ for all } j \le 2^k)$$
$$= c \cdot 2^{k-1}\Pr[\tau_k < \infty] \ge 2^{k-2}\Pr[\tau_k < \infty]. \quad (c \ge 1/2)$$
We have thus established Lemma D.4.

With Lemma D.4, we obtain a lower bound by linearizing the increment of $\sqrt{\widetilde{N}_t}$ in each epoch:
$$\mathbb{E}\left[\sqrt{\widetilde{N}_{\tau_k}} - \sqrt{\widetilde{N}_{\tau_{k-1}}}\right] \ge \mathbb{E}\left[\frac{\widetilde{N}_{\tau_k} - \widetilde{N}_{\tau_{k-1}}}{2\sqrt{\widetilde{N}_{\tau_k}}}\right] \quad (\text{concavity of } \sqrt{x})$$
$$\ge 2^{-k/2-1}\,\mathbb{E}[\widetilde{N}_{\mathcal{T}_k}] \quad (\widetilde{N}_{\tau_k} \le 2^k)$$
$$\ge 2^{k/2-3}\Pr[\tau_k < \infty]. \quad (\text{Lemma D.4})$$
The first step above can be alternatively justified by $\sqrt{b} - \sqrt{a} \ge \frac{b-a}{2\sqrt{b}}$, which holds for all $b \ge a \ge 0$.

In epoch $\mathcal{T}_0$. We now analyze $\sqrt{\widetilde{N}_{\mathcal{T}_0}}$ in epoch 0.
Note that $\mathcal{T}_0$ contains only the first block $B_1$, so this value is either $0$ or $1$, depending on whether there exists a $t \in B_1$ such that $n_t = 1 \Leftrightarrow |x_t - p_t| \ge \frac{1}{2}$. Recall that in the proof of Lemma D.1, we have shown that, regardless of the choice of $p_t$,
$$\Pr[n_t = 1 \mid \mathcal{F}_{t-1}] = \Pr_{x_t \sim \mathrm{Bernoulli}(p^\star_t)}\left[|x_t - p_t| \ge 1/2\right] \ge p^\star_t(1 - p^\star_t).$$
Therefore, in the special case of product distributions (i.e., the sequence $(p^\star_t)$ is deterministic and each outcome $x_t$ is independent of the other time steps), we can directly bound the probability that $\widetilde{N}_{\mathcal{T}_0} = 1$ as follows:
$$\Pr\left[\widetilde{N}_{\mathcal{T}_0} = 1\right] = 1 - \prod_{t \in B_1}\Pr[n_t = 0] \ge 1 - \prod_{t \in B_1}\left[1 - p^\star_t(1 - p^\star_t)\right] \ge 1 - \exp\left(-\sum_{t \in B_1} p^\star_t(1-p^\star_t)\right) = 1 - \exp(-\mathrm{Var}_{B_1}) \ge \frac{1}{2}\mathrm{Var}_{B_1},$$
where the last step follows from the inequality $1 - e^{-x} \ge x/2$ for $0 \le x \le 5/4$, and the fact that $\mathrm{Var}_{B_1} \le 5/4$.

However, in the general case where the sequence $(p^\star_t)$ is itself random and depends on the history of the $x_t$'s, such a direct argument fails. Instead, we use Lemma D.6, which extends the above analysis to this more general setting. Lemma D.6 is itself a similar but more general statement than Lemma D.5, as it is applicable even when the cumulative variance is smaller than the hard threshold $\theta$. Invoking Lemma D.6 with $q_t := n_t$, $r_t := p^\star_t(1 - p^\star_t)$, and the stopping time $\tau$ as the earlier time step between the end of block $B_1$ and the first time where $n_t = 1$, we have
$$\Pr[N_\tau \ge 1] \ge 1 - \mathbb{E}\left[e^{-\mathrm{Var}_\tau}\right] \ge \frac{1}{2}\mathbb{E}[\mathrm{Var}_\tau],$$
where the last step again uses $1 - e^{-x} \ge x/2$ for $x \in [0, 5/4]$. Moreover, since $\frac{5}{4}\mathbb{1}[N_\tau \ge 1] \ge \mathrm{Var}_{\tau:b_1}$, we also have $\Pr[N_\tau \ge 1] \ge \frac{4}{5}\mathbb{E}[\mathrm{Var}_{\tau:b_1}] \ge \frac{1}{2}\mathbb{E}[\mathrm{Var}_{\tau:b_1}]$. Combining the two inequalities, we obtain
$$\mathbb{E}\left[\sqrt{\widetilde{N}_{\mathcal{T}_0}}\right] = \Pr[N_{B_1} \ge 1] \ge \Pr[N_\tau \ge 1] \ge \frac{1}{4}\mathbb{E}\left[\mathrm{Var}_\tau + \mathrm{Var}_{\tau:b_1}\right] = \frac{1}{4}\mathbb{E}[\mathrm{Var}_{B_1}] \ge \frac{1}{4}\mathbb{E}\left[\mathrm{Var}_T\,\mathbb{1}[\tau_0 = \infty]\right]. \quad (\tau_0 = \infty \implies \mathrm{Var}_T = \mathrm{Var}_{B_1})$$

Putting everything together. Combining the lower bounds for epoch 0 and epochs $k \ge 1$, we obtain
$$\mathbb{E}\left[\sqrt{\widetilde{N}_T}\right] \ge \frac{1}{4}\mathbb{E}\left[\mathrm{Var}_T\,\mathbb{1}[\tau_0 = \infty]\right] + \sum_{k \ge 1} 2^{k/2-3}\Pr[\tau_k < \infty] \ge \frac{1}{16}\mathbb{E}\left[\mathrm{Var}_T\,\mathbb{1}[\tau_0 = \infty] + \sum_{k \ge 1}\mathbb{1}[\tau_{k-1} < \infty, \tau_k = \infty]\sqrt{\mathrm{Var}_T}\right],$$
where the last step follows from the observation that the cumulative variance in each block cannot exceed $2$, so $\tau_k = \infty$ implies that $\mathrm{Var}_T < 2^{k+1}$, i.e., $2^{k/2} \ge \sqrt{\mathrm{Var}_T/2}$. Finally, since $\tau_0 = \infty$ is equivalent to $\mathrm{Var}_T < 1$, we have established that
$$\mathbb{E}\left[\sqrt{\widetilde{N}_T}\right] \ge \frac{1}{16}\mathbb{E}\left[\mathrm{Var}_T\,\mathbb{1}[\mathrm{Var}_T < 1] + \mathbb{1}[\mathrm{Var}_T \ge 1]\sqrt{\mathrm{Var}_T}\right] = \frac{1}{16}\mathbb{E}[\gamma(\mathrm{Var}_T)].$$
The lemma follows from the fact that $N_T \ge \widetilde{N}_T$ always holds, which implies $\mathbb{E}[\sqrt{N_T}] \ge \frac{1}{16}\mathbb{E}[\gamma(\mathrm{Var}_T)]$.
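Lemma D.3 can also be probed by simulation. In the sketch below (ours), even a forecaster who minimizes large prediction errors against i.i.d. $\mathrm{Bernoulli}(q)$ bits by always predicting $0$ (so that $n_t = 1$ only on outcome $1$) comfortably satisfies $\mathbb{E}[\sqrt{N_T}] \ge \frac{1}{16}\mathbb{E}[\gamma(\mathrm{Var}_T)]$; the choice of forecaster is our own toy example.

```python
import math, random

def gamma(v):
    return v if v < 1 else math.sqrt(v)

def check(T, q, trials=500, seed=0):
    rng = random.Random(seed)
    var_T = T * q * (1 - q)          # realized variance (deterministic for products)
    mean_sqrt_nT = 0.0
    for _ in range(trials):
        n_T = sum(1 for _ in range(T) if rng.random() < q)  # steps with |x_t - 0| >= 1/2
        mean_sqrt_nT += math.sqrt(n_T) / trials
    return mean_sqrt_nT, gamma(var_T) / 16

for q in [0.01, 0.1, 0.5]:
    lhs, rhs = check(4000, q)
    print(f"q={q:4}: E[sqrt(N_T)] ~ {lhs:6.2f}  >=  gamma(Var_T)/16 = {rhs:5.2f}")
```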
D.4 Auxiliary Lemmas

Lemma D.5. Let $Q_t = \sum_{s \le t} q_s$ and $R_t = \sum_{s \le t} r_s$ be two (coupled) stochastic processes such that $q_t \in \{0,1\}$ and $r_t \in [0,1]$ for all $t \in [T]$. Let $\mathcal{F}_t$ denote the filtration generated by all the randomness up to time $t$. Suppose $r_t$ is a deterministic function of $\mathcal{F}_{t-1}$, and $s_t := \Pr[q_t = 1 \mid \mathcal{F}_{t-1}] \ge r_t$. For any constant $\theta > 0$, define $\tau_\theta$ to be a stopping time chosen as the first time that $R_t$ reaches $\theta$, i.e., $\tau_\theta := \min(\{t \in [T] \mid R_t \ge \theta\} \cup \{\infty\})$. Let $Q^+_t := Q_{t:\tau_\theta}$ be the sum of the $q_s$ in the future until the stopping time $\tau_\theta$ (if $t > \tau_\theta$, we let $Q^+_t := 0$). Consider the random variables $A_t$ defined on the filtration $\mathcal{F}_t$ as follows:
$$A_t := \Pr\left[Q^+_t \ge 1 \mid \mathcal{F}_t\right] - \left(1 - e^{-(\theta - R_t)}\right)\Pr[\tau_\theta < \infty \mid \mathcal{F}_t].$$
Then we have $A_t \ge 0$ for every $t \le T$ and every event in $\mathcal{F}_t$.

Proof of Lemma D.5. It suffices to prove the inequality conditioning on events in $\mathcal{F}_t$ that are atomic in the sense that they uniquely determine the values of $q_{1:t}$ and $r_{1:t}$; the general case follows from the law of total probability. In particular, in the following proof, we may view the value of $R_t$ as fixed when we analyze the quantity $A_t$. We perform a backwards induction from $t = T$ down to $t = 0$.

Consider the base case $t = T$. If $R_T \ge \theta$, we have $A_T = \Pr[Q^+_T \ge 1 \mid \mathcal{F}_T] - (1 - e^{-(\theta - R_T)})\Pr[\tau_\theta < \infty \mid \mathcal{F}_T] \ge 0$, since the factor $1 - e^{-(\theta - R_T)}$ is non-positive. Otherwise, when $R_T < \theta$, we have $\Pr[\tau_\theta < \infty \mid \mathcal{F}_T] = 0$, which also implies $A_T = 0 \ge 0$.

We then assume $A_t \ge 0$, and show that the same holds for $A_{t-1}$, where $t \le T$. If $R_{t-1} \ge \theta$, we clearly have $A_{t-1} \ge 0$, as the factor $1 - e^{-(\theta - R_{t-1})}$ is non-positive. Therefore, it suffices to consider the case that $R_{t-1} < \theta$. In this case, the stopping time $\tau_\theta$ must satisfy $\tau_\theta \ge t$, so we have $Q^+_{t-1} = q_t + Q^+_t$. We bound $A_{t-1}$ by breaking the event $Q^+_{t-1} \ge 1$ into two cases: either $q_t = 1$, or $q_t = 0$ but $Q^+_t \ge 1$. We have
$$\Pr\left[Q^+_{t-1} \ge 1 \mid \mathcal{F}_{t-1}\right] = \Pr[q_t = 1 \mid \mathcal{F}_{t-1}] + \Pr[q_t = 0 \mid \mathcal{F}_{t-1}]\Pr\left[Q^+_t \ge 1 \mid \mathcal{F}_{t-1}, q_t = 0\right] = s_t + (1 - s_t)\,\mathbb{E}\left[\Pr[Q^+_t \ge 1 \mid \mathcal{F}_t] \mid \mathcal{F}_{t-1}, q_t = 0\right].$$
For the second term, we apply the induction hypothesis $A_t \ge 0$ and get
$$\mathbb{E}\left[\Pr[Q^+_t \ge 1 \mid \mathcal{F}_t] \mid \mathcal{F}_{t-1}, q_t = 0\right] \ge \mathbb{E}\left[\left(1 - e^{-(\theta - R_t)}\right)\Pr[\tau_\theta < \infty \mid \mathcal{F}_t] \mid \mathcal{F}_{t-1}, q_t = 0\right] = \left(1 - e^{-(\theta - R_{t-1} - r_t)}\right)\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}, q_t = 0],$$
where the second step uses the fact that, conditioning on $\mathcal{F}_{t-1}$, $R_t = R_{t-1} + r_t$. As a result, we obtain
$$\Pr\left[Q^+_{t-1} \ge 1 \mid \mathcal{F}_{t-1}\right] \ge s_t + (1 - s_t)\left(1 - e^{-(\theta - R_{t-1} - r_t)}\right)\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}, q_t = 0]. \quad (18)$$
We also expand the conditional probability $\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}]$ as follows:
$$\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}] = s_t\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}, q_t = 1] + (1 - s_t)\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}, q_t = 0] \le s_t + (1 - s_t)\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}, q_t = 0]. \quad (19)$$
Combining the bounds in (18) and (19), we obtain
$$A_{t-1} \ge s_t + (1 - s_t)\left(1 - e^{-(\theta - R_{t-1} - r_t)}\right)\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}, q_t = 0] - \left(1 - e^{-(\theta - R_{t-1})}\right)\left(s_t + (1 - s_t)\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}, q_t = 0]\right)$$
$$= s_t\,e^{-(\theta - R_{t-1})} + (1 - s_t)\left(e^{-(\theta - R_{t-1})} - e^{-(\theta - R_{t-1} - r_t)}\right)\Pr[\tau_\theta < \infty \mid \mathcal{F}_{t-1}, q_t = 0]$$
$$\ge e^{-(\theta - R_{t-1})}\left(s_t\,e^{r_t} + 1 - e^{r_t}\right) \quad (\text{bounding the probability by } 1)$$
$$\ge e^{-(\theta - R_{t-1})}\left(r_t\,e^{r_t} + 1 - e^{r_t}\right) \quad (s_t \ge r_t \text{ by assumption})$$
$$= e^{-(\theta - R_{t-1}) + r_t}\left(r_t + e^{-r_t} - 1\right) \ge 0. \quad (\forall x: e^{-x} \ge 1 - x)$$
We have thus proved that the claim also holds for $t-1$. This completes the induction.

Lemma D.6. Let $Q_t = \sum_{s \le t} q_s$ and $R_t = \sum_{s \le t} r_s$ be two (coupled) stochastic processes such that $q_t \in \{0,1\}$ and $r_t \in [0,1]$ for all $t \in [T]$. Let $\mathcal{F}_t$ denote the filtration generated by all the randomness up to time $t$. Suppose $r_t$ is a deterministic function of $\mathcal{F}_{t-1}$, and $s_t := \Pr[q_t = 1 \mid \mathcal{F}_{t-1}] \ge r_t$. Define $\tau$ to be a stopping time chosen as the first time that either $R_t$ reaches $1$ or $q_t = 1$, i.e., $\tau := \min(\{t \in [T] \mid R_t \ge 1\} \cup \{t \in [T] \mid q_t = 1\} \cup \{\infty\})$. Let $Q^+_t := Q_{t:\tau}$ and $R^+_t := R_{t:\tau}$ be the sums of the $q_s$ and the $r_s$, respectively, in the future until the stopping time $\tau$ (we let $Q^+_t = R^+_t = 0$ when $t > \tau$). Consider the random variables $A_t$ defined on the filtration $\mathcal{F}_t$ as follows:
$$A_t := \Pr\left[Q^+_t \ge 1 \mid \mathcal{F}_t\right] - \mathbb{E}\left[1 - e^{-R^+_t} \mid \mathcal{F}_t\right].$$
Then we have $A_t \ge 0$ for every $t \le T$ and every event in $\mathcal{F}_t$.

Proof of Lemma D.6. Using a similar approach to that for Lemma D.5, we prove this claim via a backwards induction from $t = T$ to $t = 0$. Again, we only consider the atomic events in $\mathcal{F}_t$ that uniquely determine the values of $q_{1:t}$ and $r_{1:t}$, and thus whether $\tau \ge t$; the general case follows from the law of total probability. For the base case of $t = T$, we have
$$A_T = \underbrace{\Pr\left[Q^+_T \ge 1 \mid \mathcal{F}_T\right]}_{=0 \text{ as } Q^+_T \equiv 0} - \Big(1 - \underbrace{\mathbb{E}\left[e^{-R^+_T} \mid \mathcal{F}_T\right]}_{=1 \text{ as } R^+_T \equiv 0}\Big) = 0.$$
Now for $t \le T$, we assume the claim holds for $t$ and analyze $A_{t-1}$. If $\tau \le t-1$, we immediately obtain $A_{t-1} = 0 \ge 0$ since $Q^+_{t-1} = R^+_{t-1} = 0$. It remains to consider the case of $\tau \ge t$. For the first term of $A_{t-1}$ (the conditional probability), we have
$$\Pr\left[Q^+_{t-1} \ge 1 \mid \mathcal{F}_{t-1}\right] = \Pr[q_t = 1 \mid \mathcal{F}_{t-1}] + \Pr[q_t = 0 \mid \mathcal{F}_{t-1}]\Pr\left[Q^+_t \ge 1 \mid \mathcal{F}_{t-1}, q_t = 0\right] \ge s_t + (1 - s_t)\,\mathbb{E}\left[1 - e^{-R^+_t} \mid \mathcal{F}_{t-1}, q_t = 0\right] = 1 - (1 - s_t)\,\mathbb{E}\left[e^{-R^+_t} \mid \mathcal{F}_{t-1}, q_t = 0\right],$$
where the inequality step follows from the induction hypothesis $A_t \ge 0$.

On the other hand, the second term of $A_{t-1}$ (the conditional expectation) can be bounded as
$$\mathbb{E}\left[1 - e^{-R^+_{t-1}} \mid \mathcal{F}_{t-1}\right] = \Pr[q_t = 1 \mid \mathcal{F}_{t-1}]\left(1 - e^{-r_t}\right) + \Pr[q_t = 0 \mid \mathcal{F}_{t-1}]\,\mathbb{E}\left[1 - e^{-r_t - R^+_t} \mid \mathcal{F}_{t-1}, q_t = 0\right] \quad (q_t = 1 \text{ implies } \tau = t \text{ and } R^+_{t-1} = r_t)$$
$$= 1 - s_t\,e^{-r_t} - (1 - s_t)\,\mathbb{E}\left[e^{-r_t - R^+_t} \mid \mathcal{F}_{t-1}, q_t = 0\right] \le 1 - \mathbb{E}\left[e^{-r_t - (1 - s_t)R^+_t} \mid \mathcal{F}_{t-1}, q_t = 0\right], \quad (\text{Jensen's inequality for the convex function } e^{-x})$$

Finally, combining the bounds for both terms of $A_{t-1}$, we obtain
$$A_{t-1} = \Pr\left[Q^+_{t-1} \ge 1 \mid \mathcal{F}_{t-1}\right] - \mathbb{E}\left[1 - e^{-R^+_{t-1}} \mid \mathcal{F}_{t-1}\right] \ge \mathbb{E}\left[e^{-r_t - (1 - s_t)R^+_t} - (1 - s_t)e^{-R^+_t} \mid \mathcal{F}_{t-1}, q_t = 0\right]$$
$$\ge \mathbb{E}\left[e^{-R^+_t}\left(e^{-r_t} - (1 - s_t)\right) \mid \mathcal{F}_{t-1}, q_t = 0\right] \quad (e^{s_t R^+_t} \ge 1)$$
$$\ge \mathbb{E}\left[e^{-R^+_t}\left(e^{-s_t} - (1 - s_t)\right) \mid \mathcal{F}_{t-1}, q_t = 0\right] \quad (s_t \ge r_t \text{ by assumption})$$
$$\ge 0. \quad (e^{-x} \ge 1 - x)$$
We have proved that $A_{t-1} \ge 0$. As a result, $A_t \ge 0$ for all $t \le T$ and all events in $\mathcal{F}_t$.

E Proof of Theorem 1.2

In this section, we prove our main theorem (Theorem 1.2) by combining the theorems established in the previous sections, and then verifying the completeness and soundness of the SSCE.

Proof of Theorem 1.2. Let $\mathcal{D} \in \Delta(\{0,1\}^T)$ be an arbitrary distribution and define the random variable $\mathrm{Var}_T := \sum_{t=1}^{T} p^\star_t(1 - p^\star_t)$ over $x \sim \mathcal{D}$, where $p^\star_t := \Pr_{x' \sim \mathcal{D}}[x'_t = 1 \mid x'_{1:(t-1)} = x_{1:(t-1)}]$. By Theorems C.1 and D.2, the truthful forecaster gives
$$\mathrm{err}_{\mathrm{SSCE}}(\mathcal{D}, \mathcal{A}_{\mathrm{truthful}}(\mathcal{D})) = O\left(\mathbb{E}_{x \sim \mathcal{D}}[\gamma(\mathrm{Var}_T)]\right), \qquad \mathrm{OPT}_{\mathrm{SSCE}}(\mathcal{D}) = \Omega\left(\mathbb{E}_{x \sim \mathcal{D}}[\gamma(\mathrm{Var}_T)]\right).$$
In the above, the $O(\cdot)$ and $\Omega(\cdot)$ notations hide universal constants that do not depend on $\mathcal{D}$. Therefore, there exists a universal constant $c > 0$ such that the SSCE is $(c, 0)$-truthful.

Completeness. Now we verify that the SSCE is complete. For any $x \in \{0,1\}^T$, we have
$$\mathrm{SSCE}(x, x) = \mathbb{E}_{y \sim \mathrm{Unif}(\{0,1\}^T)}\left[\sup_{f \in \mathcal{F}}\sum_{t=1}^{T} y_t f(x_t)(x_t - x_t)\right] = 0.$$
For any $\alpha \in [0,1]$, the upper bound
$$\mathbb{E}_{x_1, \ldots, x_T \sim \mathrm{Bernoulli}(\alpha)}\left[\mathrm{SSCE}(x, \alpha\cdot\mathbb{1}_T)\right] = O\left(\sqrt{T\alpha(1-\alpha)}\right) = o_\alpha(T)$$
follows from applying Theorem C.1 to the product distribution $\mathcal{D} = \prod_{t=1}^{T}\mathrm{Bernoulli}(\alpha)$ and the fact that $\gamma(x) \le \sqrt{x}$ for all $x \ge 0$.

Soundness. To show that the SSCE is sound, we first consider the case that $x \in \{0,1\}^T$ is arbitrary and the predictions are $p = \mathbb{1}_T - x$. Noting that the function $x \mapsto 1/2 - x$ is in the family $\mathcal{F}$ of $1$-Lipschitz functions from $[0,1]$ to $[-1,1]$, we have
$$\mathrm{SSCE}(x, \mathbb{1}_T - x) = \mathbb{E}_{y \sim \mathrm{Unif}(\{0,1\}^T)}\left[\sup_{f \in \mathcal{F}}\sum_{t=1}^{T} y_t f(1 - x_t)(x_t - (1 - x_t))\right] \ge \mathbb{E}_{y}\left[\sum_{t=1}^{T} y_t(x_t - 1/2)(2x_t - 1)\right] = \mathbb{E}_{y}\left[\sum_{t=1}^{T}\frac{y_t}{2}\right] = \frac{T}{4},$$
where the third step holds since $(x - 1/2)(2x - 1) = 1/2$ for every $x \in \{0,1\}$.

Finally, we fix $\alpha, \beta \in [0,1]$ such that $\alpha \ne \beta$. For fixed $x, y \in \{0,1\}^T$, we have
$$\sup_{f \in \mathcal{F}}\sum_{t=1}^{T} y_t f(\beta)(x_t - \beta) \ge \left|\sum_{t=1}^{T} y_t(x_t - \beta)\right|,$$
again using the constant functions $f \equiv \pm 1$. Taking an expectation over $x_1, \ldots, x_T \sim \mathrm{Bernoulli}(\alpha)$ and $y \sim \mathrm{Unif}(\{0,1\}^T)$ gives
$$\mathbb{E}_{x_1, \ldots, x_T \sim \mathrm{Bernoulli}(\alpha)}\left[\mathrm{SSCE}(x, \beta\cdot\mathbb{1}_T)\right] \ge \mathbb{E}_{x, y}\left[\left|\sum_{t=1}^{T} y_t(x_t - \beta)\right|\right] \ge \left|\mathbb{E}_{x, y}\left[\sum_{t=1}^{T} y_t(x_t - \beta)\right]\right| = \frac{|\alpha - \beta|}{2}\,T = \Omega_{\alpha,\beta}(T),$$
where the third step follows from Jensen's inequality $\mathbb{E}[|X|] \ge |\mathbb{E}[X]|$.
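The completeness and soundness claims can be eyeballed with a small estimator (ours, not the paper's). The true SSCE takes a supremum over all $1$-Lipschitz test functions; since the soundness witnesses above are constant functions, the estimate below uses only $f \equiv \pm 1$, which already lower bounds the SSCE.

```python
import random

def ssce_lower_bound(x, p, trials=1000, seed=0):
    """E_y |sum_t y_t (x_t - p_t)|: the SSCE restricted to constant test functions."""
    rng = random.Random(seed)
    diffs = [xi - pi for xi, pi in zip(x, p)]
    return sum(abs(sum(d for d in diffs if rng.random() < 0.5))
               for _ in range(trials)) / trials

rng = random.Random(2)
T, alpha, beta = 2000, 0.3, 0.6
x = [int(rng.random() < alpha) for _ in range(T)]
print("perfect predictions  :", ssce_lower_bound(x, [float(xi) for xi in x]))  # exactly 0
print("constant beta != alpha:", ssce_lower_bound(x, [beta] * T))  # ~ |a - b| * T / 2
```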
F Proof of Lemma 7.1

Lemma 7.1. For any x ∈ {0,1}^T and p ∈ [0,1]^T,

    SSCE(x, p) ≤ (1/2) · smCE(x, p) + O(√T),

where the O(·) notation hides a universal constant that does not depend on T, x or p.

We prove Lemma 7.1 via a standard chaining argument.

Proof of Lemma 7.1. We decompose the SSCE as follows:

    SSCE(x, p) = E_{y∼Unif({0,1}^T)}[ sup_{f∈F} Σ_{t=1}^T y_t f(p_t) (x_t − p_t) ]
               ≤ E_{y∼Unif({0,1}^T)}[ sup_{f∈F} Σ_{t=1}^T (y_t − 1/2) f(p_t) (x_t − p_t) ] + sup_{f∈F} (1/2) Σ_{t=1}^T f(p_t) (x_t − p_t).

Note that the second term is exactly (1/2) · smCE(x, p), so it suffices to bound the first term by O(√T). For notational convenience, let M^(f)_T := Σ_{t=1}^T (y_t − 1/2) f(p_t) (x_t − p_t) for each function f ∈ F. We will establish the following bound for any N ≥ 1 and functions f_1, f_2, ..., f_N from [0, 1] to [−1, 1]:

    E_{y∼Unif({0,1}^T)}[ sup_{i∈[N]} M^(f_i)_T ] ≤ O(√(T log N)).    (20)

Assuming Inequality (20), applying Dudley's chaining technique [Dud87] to the δ-covering F_δ defined in Lemma C.2 would give

    E_{y∼Unif({0,1}^T)}[ sup_{f∈F} M^(f)_T ] ≤ O( ∫_0^1 √(T log |F_δ|) dδ )    (chaining)
                                             ≤ O( ∫_0^1 √(T/δ) dδ ) = O(√T),    (log |F_δ| ≤ O(1/δ) from Lemma C.2)

which implies the lemma. Therefore, it remains to establish Inequality (20). We prove this using Hoeffding's inequality and a union bound. For every ε > 0, we have

    Pr[ sup_{i∈[N]} M^(f_i)_T ≥ ε ] ≤ Σ_{i=1}^N Pr[ M^(f_i)_T ≥ ε ]    (union bound)
                                    ≤ Σ_{i=1}^N exp( −2ε² / Σ_{t=1}^T (x_t − p_t)² f_i(p_t)² )    (Hoeffding's inequality)
                                    ≤ N · e^{−2ε²/T}.    (|f_i| ≤ 1 for all i ∈ [N])

Finally, the bound (20) holds by taking an integral over ε > 0: shorthanding X := sup_{i∈[N]} M^(f_i)_T, we have

    E[X] ≤ ∫_0^{+∞} Pr[X ≥ τ] dτ ≤ ∫_0^{+∞} min{ N e^{−2τ²/T}, 1 } dτ = O(√(T log N)).

This completes the proof.

G Supplemental Materials for Section 8

We justify the claim in Section 8 that it is impossible for the SSCE (and most natural calibration measures) to incentivize truthful prediction against all adaptive adversaries. Suppose that the adversary draws x_1 from Bernoulli(1/2). If the forecaster predicts p_1 = 0, all the subsequent bits are zeros; otherwise, the adversary keeps producing independent samples from Bernoulli(1/2). Clearly, the truthful forecaster predicts p_t = 1/2 at every step t ∈ [T], and the resulting outcome sequence x is uniform over {0,1}^T; the resulting SSCE is then Θ(T^{1/2}) in expectation. If the forecaster keeps predicting p_t = 0 instead, the expectation of SSCE(x, p) is only O(1). Note that this impossibility holds for any calibration measure CM that satisfies

    E_{x_1,...,x_T ∼ Bernoulli(1/2)}[ CM_T(x, 1_T/2) ] = ω(1)  and  E_{x_1 ∼ Bernoulli(1/2)}[ CM_T(x_1 ∘ 0^{T−1}, 0_T) ] = O(1),

where ∘ denotes concatenation.
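This two-step adversary is easy to simulate. The sketch below is ours (names and parameters are illustrative); it reuses the constant-prediction collapse of the supremum noted in Appendix E, which applies here because both forecasting strategies predict a constant.

    import random

    def adversary(p1, T, rng):
        # x_1 ~ Bernoulli(1/2); if the forecaster opened with p_1 = 0, all later
        # bits are zeros; otherwise the bits remain i.i.d. Bernoulli(1/2).
        x = [int(rng.random() < 0.5)]
        if p1 == 0:
            x += [0] * (T - 1)
        else:
            x += [int(rng.random() < 0.5) for _ in range(T - 1)]
        return x

    def mean_ssce(pred, T=10_000, runs=200, seed=0):
        # constant predictions: SSCE(x, pred * 1_T) = E_y | sum_t y_t (x_t - pred) |
        rng = random.Random(seed)
        total = 0.0
        for _ in range(runs):
            x = adversary(pred, T, rng)
            total += abs(sum(xt - pred for xt in x if rng.random() < 0.5))
        return total / runs

    print("truthful p_t = 1/2:", mean_ssce(0.5))  # Theta(sqrt(T))
    print("gaming   p_t = 0  :", mean_ssce(0.0))  # O(1)

The truthful forecaster's estimate grows like √T (around 30 for T = 10,000), while the gaming forecaster's stays near 1/4 regardless of T.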
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The claims match the theoretical results that we prove in the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: This is a theoretical work, so the results hold for the specific problem setups and formulations, which we formally state in Section 2.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All the theorems and claims are formally proved, either in the main paper or in the appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper does not include experiments.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper does not include experiments requiring code.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not include experiments.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper does not include experiments.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not include experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: This is a theoretical work, and there is no societal impact of the work performed to the best of our knowledge.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.