# Online Platt Scaling with Calibeating

Chirag Gupta¹, Aaditya Ramdas¹

¹Carnegie Mellon University, Pittsburgh PA, USA. Correspondence to: Chirag Gupta. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract. We present an online post-hoc calibration method, called Online Platt Scaling (OPS), which combines the Platt scaling technique with online logistic regression. We demonstrate that OPS smoothly adapts between i.i.d. and non-i.i.d. settings with distribution drift. Further, in scenarios where the best Platt scaling model is itself miscalibrated, we enhance OPS by incorporating a recently developed technique called calibeating to make it more robust. Theoretically, our resulting OPS+calibeating method is guaranteed to be calibrated for adversarial outcome sequences. Empirically, it is effective on a range of synthetic and real-world datasets, with and without distribution drifts, achieving superior performance without hyperparameter tuning. Finally, we extend all OPS ideas to the beta scaling method.

1 Introduction

In the past two decades, there has been significant interest in the ML community on post-hoc calibration of ML classifiers (Zadrozny and Elkan, 2002; Niculescu-Mizil and Caruana, 2005; Guo et al., 2017). Consider a pretrained classifier $f : \mathcal{X} \to [0, 1]$ that produces scores in $[0, 1]$ for covariates in $\mathcal{X}$. Suppose $f$ is used to make probabilistic predictions for a sequence of points $(x_t, y_t)_{t=1}^{T}$ where $y_t \in \{0, 1\}$. Informally, $f$ is said to be calibrated (Dawid, 1982) if the predictions made by $f$ match the empirically observed frequencies when those predictions are made: for all $p \in [0, 1]$,

$$\text{Average}\{y_t : f(x_t) = p\} \approx p. \quad (1)$$

In practice, for well-trained $f$, larger scores $f(x)$ indicate higher likelihoods of $y = 1$, so that $f$ does well on accuracy or a ranking score like AUROC. Yet we often find that $f$ does not satisfy (some formalized version of) condition (1). The goal of post-hoc calibration, or recalibration, is to use held-out data to learn a low-complexity mapping $m : [0, 1] \to [0, 1]$ so that $m(f(\cdot))$ retains the good properties of $f$ (accuracy, AUROC, sharpness) as much as possible, but is better calibrated than $f$.

The focus of this paper is on a recalibration method proposed by Platt (1999), commonly known as Platt scaling (PS). The PS mapping $m$ is a sigmoid transform over $f$ parameterized by two scalars $(a, b) \in \mathbb{R}^2$:

$$m_{a,b}(f(x)) := \text{sigmoid}(a \cdot \text{logit}(f(x)) + b). \quad (2)$$

Here $\text{logit}(z) = \log(z/(1-z))$ and $\text{sigmoid}(z) = 1/(1 + e^{-z})$ are inverses of each other. Thus $m_{1,0}$ is the identity mapping. Figure 1a has additional illustrative $m_{a,b}$ plots; these are easily interpreted: if $f$ is overconfident, that is, if $f(x)$ values are skewed towards 0 or 1, we can pick $a \in (0, 1)$ to improve calibration; if $f$ is underconfident, we can pick $a > 1$; if $f$ is systematically biased towards 0 (or 1), we can pick $b > 0$ (or $b < 0$). The counter-intuitive choice $a < 0$ can also make sense if $f$'s predictions oppose reality (perhaps due to a distribution shift). The parameters $(a, b)$ are typically set to those that minimize log-loss over calibration data, or equivalently, maximize log-likelihood under the model $y_i \overset{\text{iid}}{\sim} \text{Bernoulli}(m_{a,b}(f(x_i)))$ on the held-out data. Although a myriad of recalibration methods now exist, PS remains an empirically strong baseline. In particular, PS is effective when few samples are available for recalibration (Niculescu-Mizil and Caruana, 2005; Gupta and Ramdas, 2021).
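To make the fitting procedure concrete, here is a minimal sketch (not from the paper's codebase) of how the batch PS parameters $(a, b)$ could be obtained: since (2) is a logistic model over the single pseudo-feature $\text{logit}(f(x))$, maximizing the log-likelihood reduces to one-dimensional logistic regression. The function names and the use of scikit-learn with a large `C` (to approximate an unregularized maximum-likelihood fit) are illustrative choices, not prescribed by the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)   # avoid +/- infinity at scores of exactly 0 or 1
    return np.log(p / (1 - p))

def fit_platt(scores_cal, y_cal):
    """Fit the Platt scaling parameters (a, b) of eq. (2) by maximizing
    log-likelihood: a one-dimensional logistic regression over logit(f(x))."""
    z = logit(np.asarray(scores_cal, dtype=float)).reshape(-1, 1)
    lr = LogisticRegression(C=1e6, solver="lbfgs")   # large C ~ unregularized MLE
    lr.fit(z, y_cal)
    return float(lr.coef_[0, 0]), float(lr.intercept_[0])

def apply_platt(scores, a, b):
    """Return m_{a,b}(f(x)) = sigmoid(a * logit(f(x)) + b) for an array of scores."""
    return 1.0 / (1.0 + np.exp(-(a * logit(np.asarray(scores, dtype=float)) + b)))
```

Fitting on a held-out calibration set and then applying `apply_platt` to test-time scores corresponds to the fixed-batch Platt scaling (FPS) baseline used later in the paper.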
Scaling before subsequent binning has emerged as a useful methodology (Kumar et al., 2019; Zhang et al., 2020). Multiclass adaptations of PS, called temperature, vector, and matrix scaling have become popular (Guo et al., 2017). Being a parametric method, however, PS comprises a limited family of post-hoc corrections for instance, since ma,b is always a monotonic transform, PS must fail even for i.i.d. data for some data-generating distributions (see Gupta et al. (2020) for a formal proof). Furthermore, we are interested in going beyond i.i.d. data to data with drifting/shifting distribution. This brings us to our first question, (Q1) Can PS be extended to handle shifting or drifting data distributions? A separate view of calibration that pre-dates the ML posthoc calibration literature is the online adversarial calibration Online Platt Scaling 0.0 0.2 0.4 0.6 0.8 1.0 Pre-hoc model f(x) Post-hoc Platt scaling model ma, b(f(x)) a = 1.0, b = 0.0 a = 2.0, b = 0.0 a = 0.5, b = 0.0 a = 1.0, b = -1.0 a = 1.0, b = 0.5 (a) Platt scaling Initialize weights w1 P Rd X At time t 1, 2, . . . , T Observe features xt P Rd Predict pt p1 e w t xtq 1 Observe yt P t0, 1u Compute updated weight wt 1 P Rd Goal: minimize regret řT t 1 lpyt, ptq, where lpy, pq y log p p1 yq logp1 pq. (b) Online logistic regression Expert (OPS) The probability of rain Hedging (HOPS) Tracking (TOPS) When you said 75% in the past, it has rained 85% of the time. So I will forecast 85%. Calibeating I will play a hedging game on the instances where you said 75% so an adversary cannot fool me. { (c) Calibeating applied to online Platt scaling Figure 1: The combination of Platt scaling and online logistic regression yields Online Platt Scaling (OPS). Calibeating is applied on top of OPS to achieve further empirical improvements and theoretical validity. framework (De Groot and Fienberg, 1981; Foster and Vohra, 1998). Through the latter work, we know that calibration can be achieved for arbitrary yt sequences without relying on a pretrained model f or doing any other modeling over available features. This is achieved by hedging or randomizing over multiple probabilities, so that the past track record can essentially only improve, no matter the future outcome (paraphrased from Foster and Hart (2021)). For interesting classification problems, however, the yt sequence is far from adversarial and informative covariates xt are available. In such settings, covariate-agnostic algorithms achieve calibration by predicting something akin to an average řt s 1 ys{t at time t 1 (see Appendix D). Such a prediction, while calibrated, is arguably not useful. A natural question is: (Q2) Can informative covariates (features) be used to make online adversarial calibration practical? We find an answer to (Q1) and (Q2) by leveraging the recently developed framework of calibeating (Foster and Hart, 2023), which is illustrated in Figure 1c. Calibeating shows how to perform certain corrections on top of pre-existing expert forecasts to improve (calibeat) them. The key calibeating idea relevant to our study is that adversarial calibration techniques can be used to simultaneously beat an expert and be provably calibrated. For the sequential problem of being calibrated for a stream pxt, ytqtě1, where is one to find an expert forecaster that is both covariate-based and time-adaptive? One possibility is a time-series model a sequence of predictors pft : X Ñ r0, 1sqtě1 that are updated as more data becomes available, potentially using autoregression. 
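For reference, the online logistic regression protocol of Figure 1b, which our proposal builds on, can be illustrated with the simplest regret-minimizing algorithm from Table 1, online gradient descent; this sketch is only meant to show the predict-then-update loop (the OPS implementation in Section 2.2 uses Online Newton Step instead, and the step size below is a placeholder, not a tuned value).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_regression_ogd(stream, d, eta=0.1):
    """Predict-then-update loop of Figure 1b, instantiated with online
    gradient descent (OGD).  `stream` yields (x_t, y_t) with x_t in R^d and
    y_t in {0, 1}; eta is an illustrative step size."""
    w = np.zeros(d)
    total_log_loss = 0.0
    for x, y in stream:
        p = sigmoid(w @ x)                       # forecast p_t before y_t is revealed
        p = np.clip(p, 1e-12, 1 - 1e-12)
        total_log_loss += -(y * np.log(p) + (1 - y) * np.log(1 - p))
        w -= eta * (p - y) * x                   # gradient of the log-loss at w_t
    return w, total_log_loss
```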
Our forthcoming proposal is simpler and arguably requires less (domain) expertise. 1.1 Online adversarial post-hoc calibration The proposal, summarized in Figure 2, is as follows. First, train any probabilistic classifier f on some part of the data. Then, perform online post-hoc calibration on top of f to get Learn on training data f : 𝒳 [0,1] Online learn post-hoc mapping mt : [0,1] [0,1] Pre-training Post-hoc calibration + calibeating on streaming test data (xt, yt)t 1 Figure 2: Online adversarial post-hoc calibration. online adaptivity. In effect, this amounts to viewing fpxtq as a scalar summary of xt, and the post-hoc mapping pmt : r0, 1s Ñ r0, 1sqtě1 becomes the time-series model over the scalar feature fpxtq. Finally, apply calibeating on the post-hoc predictions mtpfpxtqq to obtain adversarial validity. Figure 2 highlights our choice to do both post-hoc calibration and calibeating simultaneously on the streaming test data pxt, ytqtě1. Such an online version of post-hoc calibration has not been previously studied to the best of our knowledge. We show how one would make PS online, to obtain Online Platt Scaling (OPS). OPS relies on a simple but crucial observation: PS is a two-dimensional logistic regression problem over pseudo-features logitpfpxtqq. Thus the problem of learning OPS parameters is the problem of online logistic regression (OLR, see Figure 1b for a brief description). Several regret minimization algorithms have been developed for OLR (Hazan et al., 2007; Foster et al., 2018; J ez equel et al., 2020). We consider these and find an algorithm with optimal regret guarantees that runs in linear time. These regret guarantees imply that OPS is guaranteed to perform as well as the best fixed PS model in hindsight for an arbitrarily distributed online stream pxt, ytqtě1, which includes the entire range of distribution drifts i.i.d. data, data with covariate/label drift, and adversarial data. We next present illustrative experiments where this theory bears out impressively in practice. Then, Section 2 presents OPS, Section 3 discusses calibeating, Section 4 presents baseline experiments on synthetic and real-world datasets. Section 5 discusses the extension of all OPS ideas to a post-hoc technique called beta scaling. Online Platt Scaling 0 10 20 30 0.0 t = 1 to t = 1000 (training) 0 10 20 30 0.0 t = 1501 to t = 2000 0 10 20 30 0.0 t = 3501 to t = 4000 0 10 20 30 0.0 t = 5501 to t = 6000 0.0 0.2 0.4 0.6 0.8 1.0 Value of x True/predicted Pr(Y = 1 X = x) t Model Acc Ò CE Ó 1 1000 Base 88.00% 0.1 1501 2000 Base 81.20% 0.18 OPS 81.68% 0.19 3501 4000 Base 43.12% 0.55 OPS 62.16% 0.36 5501 6000 Base 23.60% 0.64 OPS 87.76% 0.13 (a) Accuracy and calibration error (CE) values of the base model and OPS for indicated values of t. Figure 3: The adaptive behavior of Online Platt scaling (OPS) for the covariate drift dataset described in Section 1.2. The title of each panel indicates the time-window that panel corresponds to. The histogram of Xt values in the corresponding time window is plotted with maximum height normalized to 1. Also plotted is the true curve for Prp Y 1 | X xq and two predictive curves: a base model trained on t 1 to t 1000, and OPS-calibrated models with parameter values fixed at the start of the time window. The base model is accurate for the training data which is mostly in r 5, 10s, but becomes inaccurate and miscalibrated with the covariate-shifted values for larger t (bottom two subplots). 
OPS adapts well, agreeing with the base model in the top-right subplot, but flipping the base model predictions in the bottom-right subplot. 1.2 Illustrative experiments with distribution drift Covariate drift. We generated data as follows. For t 1, 2, . . . , 6000, Xt Nppt 1q{250, 4q; Yt|Xt " Berp0.1q if modpt Xt{5u , 2q 0, Berp0.9q if modpt Xt{5u , 2q 1. (3) Thus the distribution of Yt given Xt is a fixed periodic function, but the distribution of Xt drifts over time. The solid yellow line in Figure 3 plots Prp Y 1 | X xq against x. We featurized x as a 48-dimensional vector with the components sin x freq translation , where freq P t1, 2, 3, 4, 5, 6u and translation P t0, π{4, π{2, . . . 7π{4u. A logistic regression base model f is trained over this 48dimensional representation using the points p Xt, Ytq1000 t 1 , randomly permuted and treated as a single batch of exchangeable points, which we will call training points. The points p Xt, Ytq6000 t 1001 form a supervised non-exchangeable test stream: we use this stream to evaluate f, recalibrate f using OPS, and evaluate the OPS-calibrated model. Figure 3 displays f and the recalibrated OPS models at four ranges of t (one per plot). The training data has most xt-values in the range r 5, 10s as shown by the (heightnormalized) histogram in the top-left plot. In this regime, f is visually accurate and calibrated the dotted light blue line is close to the solid yellow truth. We now make some observations at three test-time regimes of t: (a) t 1501 to t 2000 (the histogram shows the distribution of pxtq2000 t 1501). For these values of t, the test data is only slightly shifted from the training data, and f continues to perform well. The OPS model recognizes the good performance of f and does not modify it much. (b) t 3500 to t 4000. Here, f is out-of-phase with the true distribution, and Platt scaling is insufficient to improve f by a lot. OPS recognizes this, and it offers slightly better calibration and accuracy by making less confident predictions between 0.2 and 0.4. (c) t 5500 to t 6000. In this regime, f makes predictions opposing reality. Here, the OPS model flips the prediction, achieving high accuracy and calibration. These observations are quantitatively supported by the accuracy and ℓ1-calibration error (CE) values reported by the table in Figure 3a. Accuracy and CE values are estimated using the known true distribution of Yt | Xt and the observed Xt values, making them unbiased and avoiding some well-known issues with CE estimation. More details are provided in Appendix A.2. Label drift. For t 1, 2, . . . , 6000, data is generated as: Online Platt Scaling 2 0 2 4 0.0 t = 1 to t = 1000 (training) 2 0 2 4 0.0 t = 1501 to t = 2000 4 2 0 2 4 0.0 t = 3501 to t = 4000 2 0 2 4 0.0 t = 5501 to t = 6000 0.0 0.2 0.4 0.6 0.8 1.0 Value of x True/predicted Pr(Y = 1 X = x) (a) OPS with label drift. 10 0 10 0.0 t = 1 to t = 1000 (training) 10 0 10 0.0 t = 1501 to t = 2000 10 0 10 0.0 t = 3501 to t = 4000 10 0 10 0.0 t = 5501 to t = 6000 0.0 0.2 0.4 0.6 0.8 1.0 Value of x True/predicted Pr(Y = 1 X = x) (b) OPS with regression-function drift. Figure 4: The adaptive behavior of OPS for the simulated label shift and regression-function drift datasets described in Section 1.2. For more details on the contents of the figure, please refer to Figure 3. The improvement in calibration and accuracy of OPS over the base model is visually apparent, but for completeness, {Acc, CE} values are reported in the Appendix as part of Figures 10 and 11. 
Yt Bernoullip0.95p1 αtq 0.05αtq, where αt pt 1q{6000q; Xt|Yt 1 t Yt 0u Np0, 1q 1 t Yt 1u Np2, 1q. (4) Thus, Xt | Yt is fixed while the label distribution drifts. We follow the same training and test splits described in the covariate drift experiment, but without sinusoidal featurization of Xt; the base logistic regression model is trained directly on the scalar Xt s. The gap between f and the true model increases over time but OPS adapts well (Figure 4a). Regression-function drift. For t 1, 2, . . . , 6000, the data is generated as follows: αt pt 1q{5000, Xt Np0, 10q and Yt|Xt Bernoullipptq, where (5) pt " 0.1p1 αtq 0.5αt if modpt Xt{5u , 2q 0, 0.9p1 αtq 0.5αt if modpt Xt{5u , 2q 1. Thus the distribution of Xt is fixed, but the regression function Prp Yt 1 | Xtq drifts over time. We follow the same training and test splits described in the covariate drift experiment, as well as the 48-dimensional featurization and logistic regression modeling. The performance of the base model worsens over time, while OPS adapts (Figure 4b). 2 Online Platt scaling (OPS) In a batch post-hoc setting, the Platt scaling parameters are set to those that minimize log-loss over the calibration data. If we view the first t instances in our stream as the calibration data, the fixed-batch Platt scaling parameters are, ppat,pbtq arg min pa,bq PR2 s 1 lpma,bpfpxsqq, ysq, (6) where lpp, yq y log p p1 yq logp1 pq and ma,b is defined in (2). Observe that this is exactly logistic regression over the dataset plogitpfpxsqq, ysqt s 1. The thesis of OPS is that as more data is observed over time, we should use it to update the Platt scaling parameters. Define p OPS t : mat,btpfpxtqq, where pat, btq depends on tpfpx1q, y1q, . . . , pfpxt 1q, yt 1qu.1 One way to compare methods in this online setting is to consider regret RT with respect to a reference ℓ2-ball of radius B, B : tpa, bq P R2 : a2 b2 ď B2u: t 1 lpp OPS t , ytq min pa,bq PB t 1 lpma,bpfpxtqq, ytq. (7) RT is the difference between the total loss incurred when playing pat, btq at times t ď T and the total loss incurred when playing the single optimal pa, bq P B for all t ď T. Typically, we are interested in algorithms that have low RT irrespective of how pxt, ytq is generated. 2.1 Logarithmic worst-case regret bound for OPS OPS regret minimization is exactly online logistic regression (OLR) regret minimization over pseudo-features logitpfpxtqq. Thus our OPS problem is immediately solved 1A variant of this setup allows pat, btq to depend on fpxtq (Foster et al., 2018). Online Platt Scaling Algorithm Regret Running time Online Gradient Descent (OGD) (Zinkevich, 2003) B ? T T Online Newton Step (ONS) (Hazan et al., 2007) e B log T T AIOLI (J ez equel et al., 2020) B logp BTq T log T Aggregating Algorithm (AA) (Vovk, 1990; Foster et al., 2018) logp BTq B18T 24 Table 1: Asymptotic regret and running times of online logistic regression (OLR) algorithms for OPS as functions of the radius of reference class B and time-horizon T. For general OLR, regret and running times also depend on the dimension of X. However, OPS effectively reduces the dimensionality of X to 2, so that a second-order method like ONS runs almost as fast as a first-order method like OGD. Also note that B ? a2 b2 is small if the base model f is not highly miscalibrated. ONS with fixed hyperparameters was chosen for all OPS experiments; see Section 2.2 for implementation details. using OLR methods. 
A number of OLR methods have been proposed, and we consider their regret guarantees and running times for the OPS problem. These bounds typically depend on T and two problem-dependent parameters: the dimension (say d) and B, the radius of B. 1. In our case, d 2 since there is one feature logitpfpxqq and a bias term. Thus d is a constant. 2. B could technically be large, but in practice, if f is not highly miscalibrated, we expect small values of a and b which would in turn lead to small B. This was true in all our experiments. Regret bounds and running times for candidate OPS methods are presented in Table 1, which is an adaptation of Table 1 of J ez equel et al. (2020) with all polypdq terms removed. Based on this table, we identify AIOLI and Online Newton Step (ONS) as the best candidates for implementing OPS, since they both have Oplog Tq regret and r Op Tq running time. In the following theorem, we collect explicit regret guarantees for OPS based on ONS and AIOLI. Since the logloss can be unbounded if the predicted probability equals 0 or 1, we require some restriction on fpxtq. Theorem 2.1. Suppose @t, fpxtq P r0.01, 0.99s, B ě 1, and T ě 10. Then, for any sequence pxt, ytq T t 1, OPS based on ONS satisfies RT p ONSq ď 2pe B 10Bq log T 1, (8) and OPS based on AIOLI satisfies RT p AIOLIq ď 22B logp BTq. (9) The proof is in Appendix E. Since log-loss is a proper scoring rule (Gneiting and Raftery, 2007), minimizing it has implications for calibration (Br ocker, 2009). However, no absolute calibration bounds can be shown for OPS, as discussed shortly in Section 2.4. 2.2 Hyperparameter-free ONS implementation In our experiments, we found ONS to be significantly faster than AIOLI while also giving better calibration. Further, ONS worked without any hyperparameter tuning after an initial investigation was done to select a single set of hyperparameters. Thus we used ONS for experiments based on a verbatim implementation of Algorithm 12 in Hazan (2016), with γ 0.1, ρ 100, and K tpa, bq : }pa, bq}2 ď 100u. Algorithm 1 in the Appendix contains pseudocode for our final OPS implementation. 2.3 Follow-The-Leader as a baseline for OPS The Follow-The-Leader (FTL) algorithm sets pat, btq ppat 1,pbt 1q (defined in (6)) for t ě 1. This corresponds to solving a logistic regression optimization problem at every time step, making the overall complexity of FTL Ωp T 2q. Further, FTL has Ωp Tq worst-case regret. Since full FTL is intractably slow to implement even for an experimental comparison, we propose to use a computationally cheaper variant, called Windowed Platt Scaling (WPS). In WPS the optimal parameters given all current data, ppat,pbtq, are computed and updated every Op100q steps instead of at every time step. We call this a window and the exact size of the window can be data-dependent. The optimal parameters computed at the start of the window are used to make predictions until the end of that window, then they are updated for the next window. This heuristic version of FTL performs well in practice (Section 4). 2.4 Limitations of regret analysis Regret bounds are relative to the best in class, so Theorem 2.1 implies that OPS will do no worse than the best Platt scaling model in hindsight. However, even for i.i.d. data, the best Platt scaling model is itself miscalibrated on some distributions (Gupta et al., 2020, Theorem 3). This latter result shows that some form of binning must be deployed to be calibrated for arbitrarily distributed i.i.d. data. 
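For concreteness, the following is a sketch of how the ONS-based OPS update described above could look in code, with the stated hyperparameters γ = 0.1, ρ = 100, and K the ℓ2-ball of radius 100. The generalized projection in Hazan (2016, Algorithm 12) is onto K in the A_t-norm; the plain Euclidean rescaling below is a simplification on our part (it essentially never binds when f is not wildly miscalibrated), so this should be read as an approximation of Algorithm 1 in the Appendix rather than a verbatim reimplementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p, lo=0.01, hi=0.99):
    p = np.clip(p, lo, hi)   # mirrors the f(x_t) in [0.01, 0.99] assumption of Theorem 2.1
    return np.log(p / (1 - p))

def ops_ons(scores, labels, gamma=0.1, rho=100.0, radius=100.0):
    """Online Platt Scaling with an ONS update over theta = (a, b).
    At each step: forecast with the current theta, observe y_t, then take a
    Newton step preconditioned by A_t = rho*I + sum of gradient outer products."""
    theta = np.array([1.0, 0.0])          # start at the identity map m_{1,0}
    A = rho * np.eye(2)
    preds = []
    for s, y in zip(scores, labels):
        z = np.array([logit(s), 1.0])     # pseudo-feature logit(f(x_t)) plus bias
        p = sigmoid(theta @ z)
        preds.append(p)
        grad = (p - y) * z                # gradient of the log-loss at theta_t
        A += np.outer(grad, grad)
        theta = theta - (1.0 / gamma) * np.linalg.solve(A, grad)
        nrm = np.linalg.norm(theta)
        if nrm > radius:                  # simplified (Euclidean) projection onto K
            theta *= radius / nrm
    return np.array(preds), theta
```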
Further, if the data is adversarial, any deterministic predictor can be rendered highly miscalibrated (Oakes, 1985; Dawid, 1985); a simple strategy is to set yt 1 tpt ď 0.5u. In a surprising seminal result, Foster and Vohra (1998) showed that adversarial calibration is possible by randomizing/hedging between different bins. The following section shows how one can perform such binning and hedging on top of OPS, Online Platt Scaling based on a technique called calibeating. 3 Calibeating the OPS forecaster Calibeating (Foster and Hart, 2023) is a technique to improve or beat an expert forecaster. The idea is to first use the expert s forecasts to allocate data to representative bins. Then, the bins are treated nominally: they are just names or tags for groups of data-points that the expert suggests are similar . The final forecasts in the bins are computed using only the outcome (yt) values of the points in the bin (seen so far), with no more dependence on the expert s original forecast. The intuition is that forecasting inside each bin can be done in a theoretically valid sense, irrespective of the theoretical properties of the expert. We will use the following ϵ-bins to perform calibeating: B1 r0, ϵq, B2 rϵ, 2ϵq, . . . , Bm r1 ϵ, 1s. (10) Here ϵ ą 0 is the width of the bins, and for simplicity we assume that m 1{ϵ is an integer. For instance, one could set ϵ 0.1 or the number of bins m 10, as we do in the experiments in Section 4. Two types of calibeating tracking and hedging are described in the following subsections. We suggest recalling our illustration of calibeating in the introduction (Figure 1c). 3.1 Calibeating via tracking past outcomes in bins Say at some t, the expert forecasts pt P r0.7, 0.8q. We look at the instances s ă t when ps P r0.7, 0.8q and compute yb t 1 Averagetys : s ă t, ps P r0.7, 0.8qu. Suppose we find that yb t 1 0.85. That is, when the expert forecasted bin r0.7, 0.8q in the past, the average outcome was 0.85. A natural idea now is to forecast 0.85 instead of 0.75. We call this process Tracking , and it is the form of calibeating discussed in Section 4 of Foster and Hart (2023). In our case, we treat OPS as the expert and call the tracking version of OPS as TOPS. If p OPS t P Bb, then p TOPS t : Averagetys : s ă t, p OPS s P Bbu. (11) The average is defined as the mid-point of Bb if the set above is empty. Foster and Hart (2023) showed that the Brier-score of the TOPS forecasts p TOPS t , defined as 1 T řT t 1pyt p TOPS t q2, is better than the corresponding Brier-score of the OPS forecasts p OPS t , by roughly the squared calibration error of p OPS t (minus a log T term). In the forthcoming Theorem 3.1, we derive a result for a different object that is often of interest in post-hoc calibration, called sharpness. 3.2 Segue: defining sharpness of forecasters Recall the ϵ-bins introduced earlier (10). Define Nb |tt ď T : pt P Bbu| and pyb 1 Nb ř tďT,pt PBb yt if Nb ą 0, else pyb 0. Sharpness is defined as, SHPpp1:T q : 1 b 1 Nb py2 b.2 (12) If the forecaster is perfectly knowledgeable and forecasts pt yt, its SHP equals řT t 1 yt{T : y T . On the other hand, if the forecaster puts all points into a single bin b, its SHP equals přT t 1 yt{Tq2 y2 T . The former forecaster is precise or sharp, while the latter is not, and SHP captures this it can be shown that y2 T ď SHPpp1:T q ď y T . We point the reader to Br ocker (2009) for further background. One of the goals of effective forecasting is to ensure high sharpness (Gneiting et al., 2007). 
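Below is a small sketch of the tracking step (11) together with the binned sharpness metric (12) and the ℓ1 calibration error that is formalized as (15) in Section 3.3. This is illustrative code, not the paper's implementation; it uses the bin mid-point convention for unvisited bins described above.

```python
import numpy as np

def bin_index(p, eps=0.1):
    """Index b such that p lies in the eps-bin B_b of (10)."""
    m = int(round(1.0 / eps))
    return min(int(p / eps), m - 1)

def tracking_forecasts(expert_preds, labels, eps=0.1):
    """Tracking calibeating (TOPS when the expert is OPS, eq. 11): forecast the
    running average outcome of the expert's bin, or the bin mid-point if the
    bin has not been visited yet."""
    m = int(round(1.0 / eps))
    count, total = np.zeros(m), np.zeros(m)
    out = []
    for p, y in zip(expert_preds, labels):
        b = bin_index(p, eps)
        out.append(total[b] / count[b] if count[b] > 0 else (b + 0.5) * eps)
        count[b] += 1          # bin statistics are updated only after forecasting
        total[b] += y
    return np.array(out)

def shp_and_ce(preds, labels, eps=0.1):
    """Binned sharpness (12) and l1 calibration error (15) of a forecast sequence."""
    preds, labels = np.asarray(preds, float), np.asarray(labels, float)
    m, T = int(round(1.0 / eps)), len(preds)
    shp, ce = 0.0, 0.0
    for b in range(m):
        mask = np.array([bin_index(p, eps) == b for p in preds])
        Nb = mask.sum()
        if Nb == 0:
            continue
        ybar, pbar = labels[mask].mean(), preds[mask].mean()
        shp += (Nb / T) * ybar ** 2
        ce += (Nb / T) * abs(pbar - ybar)
    return shp, ce
```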
OPS achieves this goal by relying on the log-loss, a proper scoring rule. The following theorem shows that TOPS suffers a small loss in SHP compared to OPS. Theorem 3.1. The sharpness of TOPS forecasts satisfies SHPpp TOPS 1:T q ě SHPpp OPS 1:T q ϵ ϵ2 The proof (in Appendix E) uses Theorem 3 of Foster and Hart (2023) and relationships between sharpness, Brierscore, and a quantity called refinement. If T is fixed and known, setting ϵ a log T{T (including constant factors), or equivalently, the number of bins B a T{ log T gives a rate of r Op a 1{Tq for the SHP difference term. While we do not show a calibration guarantee, TOPS had the best calibration performance in most experiments (Section 4) 3.3 Calibeating via hedging or randomized prediction All forecasters introduced so far the base model f, OPS, and TOPS make forecasts pt that are deterministic given the past data until time t 1. If the yt sequence is being generated by an adversary that acts after seeing pt, then the adversary can ensure that each of these forecasters is miscalibrated by setting yt 1 tpt ď 0.5u. Suppose instead that the forecaster is allowed to hedge randomize and draw the forecast from a distribution instead of fixing it to a single value and the adversary only has access to the distribution and not the actual pt. Then there exist hedging strategies that allow the forecaster to be arbitrarily well-calibrated (Foster and Vohra, 1998). In fact, Foster (1999, henceforth F99) showed that this can be done while hedging between two arbitrarily close points in r0, 1s. In practice, outcomes are not adversarial, and covariates are available. A hedging algorithm that does not use covariates cannot be expected to give informative predictions. We verify this intuition through an experiment in Appendix on historical rain data D F99 s hedging algorithm simply predicts the average yt value in the long run. 2The original definition of sharpness (Murphy, 1973) was (essentially): T 1 řm b 1 Nbpybp1 pybq, which equals SHPpp1:T q y T . We add the forecast-independent term y T on both sides and define the (now non-negative) quantity as SHP. Online Platt Scaling A best-of-both-worlds result can be achieved by using the expert forecaster to bin data using xt values, just like we did in Section 3.1. Then, inside every bin, a separate hedging algorithm is instantiated. For the OPS predictor, this leads to HOPS (OPS + hedging). Specifically, in our experiments and the upcoming calibration error guarantee, we used F99: p HOPS t : F99pys : s ă t, ps P Bbq. (14) A standalone description of F99 is included in Appendix C. F99 hedges between consecutive mid-points of the ϵ-bins defined earlier (10). The only hyperparameter for F99 is ϵ. In the experiments in the main paper, we set ϵ 0.1. To be clear, pt is binned on the ϵ-bins, and the hedging inside each bin is again over the ϵ-bins. The upcoming theorem shows a SHP lower bound on HOPS. In addition, we show an assumption-free upper bound on the (ℓ1-)calibration error, defined as CEpp1:T q : 1 b 1 Nb |ppb pyb| , (15) where Nb, pyb were defined in Section 3.2, and ppb 1 Nb ř tďT,pt PBb pt, if Nb ą 0, else ppb mid-pointp Bbq. Achieving small CE is one formalization of (1). The following result is conditional on the y1:T , p OPS 1:T sequences. The expectation is over the randomization in F99. Theorem 3.2. 
For adversarially generated data, the expected sharpness of HOPS forecasts using the forecast hedging algorithm of Foster (1999) is lower bounded as E SHPpp HOPS 1:T q ě SHPpp OPS 1:T q ˆ ϵ log T 1 (16) and the expected calibration error of HOPS satisfies, E CEpp HOPS 1:T q ď ϵ{2 2 a 1{ϵ2T. (17) The proof in Appendix E is based on Theorem 5 of Foster and Hart (2023) and a CE bound for F99 based on Blackwell approachability (Blackwell, 1956). With ϵ rΘp T 1{3q, the difference term in the SHP bound is r Op T 1{3q and with ϵ rΘp T 1{4q, the CE bound is r Op T 1{4q. Compare this to the usual CE rate without calibeating of Op T 1{3q (Foster and Vohra, 1998). High-probability versions of (17) can be derived using probabilistic Blackwell approachability lemmas, such as those in Perchet (2014). 4 Experiments We perform experiments with synthetic and real-data, in i.i.d. and distribution drift setting. Code to reproduce the experiments can be found at https://github.com/aigen/ df-posthoc-calibration (see Appendix A.4 for more details). All baseline and proposed methods are described in Collection 1 on the following page. In each experiment, the base model f was a random forest (sklearn s Collection 1. Proposed and baseline methods for online post-hoc calibration. Final forecasts are identified in blue. Input: f : X Ñ r0, 1s, any pre-learnt model Input: px1, y1q, px2, y2q, . . . , px T , y T q P X ˆ t0, 1u Input: calibration-set-size Tcal ă T, window-size W Fixed Platt scaling: pa FPS, b FPSq Ð ppa Tcal,pb Tcalq (eq. 6) Windowed Platt scaling: pa WPS, b WPSq Ð pa FPS, b FPSq Online Platt scaling: pa OPS 1 , b OPS 1 q Ð p1, 0q for t 2 to T do pa OPS t , b OPS t q Ð ONSppx1, y1q, . . . , pxt 1, yt 1qq (ONS is Algorithm 1 in the Appendix) end for for t Tcal 1 to T do p BM t Ð fpxtq p FPS t Ð sigmoidpa FPS logitpfpxtqq b FPSq p WPS t Ð sigmoidpa WPS logitpfpxtqq b WPSq p OPS t Ð sigmoidpa OPS t logitpfpxtqq b OPS t q p TOPS t is set using past pys, p OPS s q values as in (11) p HOPS t is set using past pys, p OPS s q values as in (14) If mod pt Tcal, Wq 0, pa WPS, b WPSq Ð ppat,pbtq end for implementation). All default parameters were used, except n estimators was set to 1000. No hyperparameter tuning on individual datasets was performed for any of the recalibration methods. Metrics. We measured the SHP and CE metrics defined in (12) and (15) respectively. Although estimating population versions of SHP and CE in statistical (i.i.d.) settings is fraught with several issues (Kumar et al. (2019); Roelofs et al. (2022) and several other works), our definitions target actual observed quantities which are directly interpretable without reference to population quantities. Reading the plots. The plots we report show CE values at certain time-stamps starting from Tcal 2W and ending at T (see third line of Collection 1). Tcal and W are fixed separately for each dataset (Table 2 in Appendix). We also generated SHP plots, but these are not reported since the drop in SHP was always very small. 4.1 Experiments on real datasets We worked with four public datasets in two settings. Links to the datasets are in Appendix A.1. Distribution drift. We introduced synthetic drifts in the data based on covariate values, so this is an instance of covariate drift. For example, in the bank marketing dataset (leftmost plot in Figure 5), the problem is to predict which clients are likely to subscribe to a term deposit if they are targeted for marketing, using covariates like age, education, and bank-balance. 
We ordered the available 12000 rows roughly by age by adding a random num- Online Platt Scaling Base model (BM) Fixed-batch Platt scaling (FPS) Windowed Platt scaling (WPS) Online Platt scaling (OPS) OPS + tracking (TOPS) OPS + hedging (HOPS) 2000 4000 6000 8000 10000 Time t [Calibration-error (CE) at t] Bank marketing response 5000 15000 25000 Time t [Calibration-error (CE) at t] Credit default 2000 4000 6000 8000 Time t [Calibration-error (CE) at t] Customer churn modeling 600 800 1000 1200 1400 Time t [Calibration-error (CE) at t] Fetal Health Figure 5: Drifting data. CE (calibration error) values over time of considered models on four datasets with synthetically induced drifts. The plots have invisible error bars since variation across the 100 runs was small. OPS consistently performs better than BM, FPS, and WPS, while TOPS is the best-performing among all methods across datasets and time. All methods had roughly the same SHP values at a given time-step, so the SHP plots are delayed to Appendix A (Figure 8). ber uniformly from t 1, 0, 1u to age and sorting all the data. Training is done on the first 1000 points, Tcal 1000, and W 500. Similar drifts are induced for the other datasets, and Tcal, W values are set depending on the total number of points; further details are in Appendix A.1. All simulations were performed 100 times and the average CE and SHP values with std-deviation errorbars were evaluated at certain time-steps. Thus, our lines correspond to estimates of the expected values of CE and SHP, as indicated by the Y-axis labels. We find that across datasets, OPS has the least CE among non-calibeating methods, and both forms of calibeating typically improve OPS further (Figure 5). Specifically, TOPS performs the best by a margin compared to other methods. We also computed SHP values, which are reported in Appendix A (Figure 8). The drop in SHP is insignificant in each case (around 0.005). IID data. This is the usual batch setting formed by shuffling all available data. Part of the data is used for training and the rest forms the test-stream. We used the same values of Tcal and W as those used in the data drift experiments (see Appendix A.1). In our experiments, we find that the gap in CE between BM, FPS, OPS, and WPS is smaller (Figure 6). However, TOPS performs the best in all scenarios, typically by a margin. Here too, the change in SHP was small, so those plots were delayed to Appendix A (Figure 9). 4.2 Synthetic experiments In all experiments with real data, WPS performs almost as good as OPS. In this subsection, we consider some synthetic data drift experiments where OPS and TOPS continue performing well, but WPS performs much worse. Covariate drift. Once for the entire process, we draw random orthonormal vectors v1, v2 P R10 ( v1 2 v2 2 1, v 1v2 0), a random weight vector w P t 1, 1u10 p 10 2 q with each component set to 1 or 1 independently with probability 0.5, and set a drift parameter δ ě 0. The data is generated as follows: ut v1 cospδtq v2 sinpδtq, Xt Np0, I10 10utu t q, Yt|Xt Bernoullipsigmoidpw r Xtqq, where r Xt rx1, . . . , x10, x1x2, x1x3, . . . , x9x10s P R10 p 10 2 q. Thus the distribution of Yt given Xt is fixed as a logistic model over the expanded representation r Xt that includes all cross-terms (this is unknown to the forecaster who only sees Xt). The features Xt themselves are normally distributed with mean 0 and a time-varying covariance matrix. 
The principal component (PC) of the covariance matrix is a vector ut that is rotating on the two-dimensional plane containing the orthonormal vectors v1 and v2. The first 1000 points are used as training data, the remaining T 5000 form a test-stream, and W 500. We report results in two settings: one is i.i.d., that is δ 0, and the other is where the u for the first and last point are at a 180 angle (Figure 7a). Label drift. Given some δ ą 0, p Xt, Ytq is generated as: Yt Bernoullip0.5 δtq, Xt|Yt 1 t Yt 0u Np0, R10q 1 t Yt 1u Npe1, R10q. Thus Pp Y1 1q 0.5 δ and for the last test point, Pp Y6000 1q 0.5 6000δ. This final value can be set to control the extent of label drift; we show results with no drift (i.e., δ 0, Figure 7b left) and δ set so that final bias 0.5 6000δ 0.9 (Figure 7b right). The number of training points is 1000, T 5000, and W 500. 4.3 Changing ϵ, histogram binning, beta scaling In Appendix A.3, we report versions of Figures 5, 6 with ϵ 0.05, 0.2 (instead of ϵ 0.1) with similar conclusions (Figures 12, 13). We also perform comparisons with a windowed version of the popular histogram binning method (Zadrozny and Elkan, 2001) and online versions of the beta scaling method, as discussed in the forthcoming Section 5. Online Platt Scaling Base model (BM) Fixed-batch Platt scaling (FPS) Windowed Platt scaling (WPS) Online Platt scaling (OPS) OPS + tracking (TOPS) OPS + hedging (HOPS) 2000 4000 6000 8000 10000 Time t [Calibration-error (CE) at t] Bank marketing response 5000 15000 25000 Time t [Calibration-error (CE) at t] Credit default 2000 4000 6000 8000 Time t [Calibration-error (CE) at t] Customer churn modeling 600 800 1000 1200 1400 Time t [Calibration-error (CE) at t] Fetal Health Figure 6: IID data. CE values over time of considered models with four randomly shuffled (ie, nearly i.i.d.) datasets. The plots have invisible error bars since variation across runs was small. TOPS achieves the smallest values of CE throughout. 2000 3000 4000 5000 Time t [Calibration-error (CE) at t] Angle between intial and final PC = 0 2000 3000 4000 5000 Time t [Calibration-error (CE) at t] Angle between intial and final PC = 180 (a) Left plot: i.i.d. data, right plot: covariate drift. 2000 3000 4000 5000 Time t [Calibration-error (CE) at t] P(Y=1) does not drift 2000 3000 4000 5000 Time t [Calibration-error (CE) at t] P(Y=1) drifts from 0.5 to 0.9 (b) Left plot: i.i.d. data, right plot: label drift. Figure 7: Experiments with synthetic data. In all cases, TOPS has the lowest CE across time. 5 Online beta scaling with calibeating A recalibration method closely related to Platt scaling is beta scaling (Kull et al., 2017). The beta scaling mapping m has three parameters pa, b, cq P R3, ma,b,c beta pfpxqq : sigmoidpa log fpxq b logp1 fpxqq cq. Observe that enforcing b a recovers Platt scaling since logitpzq logpzq logp1 zq. The beta scaling parameters can be learnt following identical protocols as Platt scaling: (i) the traditional method of fixed batch post-hoc calibration akin to FPS, (ii) a natural benchmark of windowed updates akin to WPS, and (iii) regret minimization based method akin to OPS. This leads to the methods FBS, WBS, and OBS, replacing the P of Platt with the B of beta. Tracking + OBS (TOBS) and Hedging + OBS (HOBS) can be similarly derived. Further details on all beta scaling methods are in Appendix B, where we also report plots similar to Figures 5, 6 for beta scaling (Figure 14). 
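As an illustration, here is a sketch of the rotating-principal-component covariate-drift generator just described. Two details are assumptions on our part because the description above does not pin them down: the weight placed on $u_t u_t^\top$ in the covariance (parameter `c` below) and the per-step drift rate, which is set so that $u_t$ sweeps 180° over the stream.

```python
import numpy as np

def make_rotating_pc_stream(T=6000, d=10, c=10.0, seed=0):
    """Covariate drift with a rotating principal component (Section 4.2 sketch).
    u_t rotates 180 degrees over the stream in the plane spanned by random
    orthonormal v1, v2; P(Y=1|X) is a fixed logistic model over the expanded
    representation with all pairwise cross-terms.  The weight c on u_t u_t^T
    is an assumed value."""
    rng = np.random.default_rng(seed)
    v = np.linalg.qr(rng.standard_normal((d, 2)))[0]   # orthonormal v1, v2
    v1, v2 = v[:, 0], v[:, 1]
    w = rng.choice([-1.0, 1.0], size=d + d * (d - 1) // 2)
    delta = np.pi / T                                  # 180 degrees over T steps
    X, Y = np.zeros((T, d)), np.zeros(T, dtype=int)
    for t in range(T):
        u = v1 * np.cos(delta * t) + v2 * np.sin(delta * t)
        x = rng.multivariate_normal(np.zeros(d), np.eye(d) + c * np.outer(u, u))
        cross = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
        p = 1.0 / (1.0 + np.exp(-w @ np.concatenate([x, cross])))
        X[t], Y[t] = x, rng.binomial(1, p)
    return X, Y
```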
In a comparison between histogram binning, beta scaling, Platt scaling, and their tracking versions, TOPS and TOBS are the best-performing methods across experiments (Figure 15). We provided a way to bridge the gap between the online (typically covariate-agnostic) calibration literature, where data is assumed to be adversarial, and the (typically i.i.d.) post-hoc calibration literature, where the joint covariateoutcome distribution takes centerstage. The TOPS method we proposed has the lowest calibration error in all our experiments. The HOPS method provably controls miscalibration at any pre-defined level and could be a desirable choice in sensitive applications. The good performance of OPS+calibeating lends further empirical backing to the thesis that scaling+binning methods perform well in practice, as has also been noted in prior works (Zhang et al., 2020; Kumar et al., 2019). We note a few directions for future work. First, online algorithms that control regret on the most recent data have been proposed (Hazan and Seshadhri, 2009; Zhang et al., 2018). These approaches could give further improvements on ONS, particularly for drifting data. Second, while this paper entirely discusses calibration for binary classification, all binary routines can be lifted to achieve multiclass notions such as top-label or class-wise calibration (Gupta and Ramdas, 2022b). Alternatively, multiclass versions of Platt scaling (Guo et al., 2017) such as temperature and vector scaling can also be targeted directly using online multiclass logistic regression (J ez equel et al., 2021). Acknowledgements We thank Youngseog Chung and Dhruv Malik for fruitful discussions, and the ICML reviewers for valuable feedback. CG was supported by the Bloomberg Data Science Ph.D. Fellowship. For computation, we used allocation CIS220171 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, supported by NSF grants 2138259, 2138286, 2138307, 2137603, and 2138296. Specifically, we used the Bridges2 system (Towns et al., 2014), supported by NSF grant 1928147, at the Pittsburgh Supercomputing Center (PSC). Online Platt Scaling David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1 8, 1956. Jochen Br ocker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography, 135(643):1512 1519, 2009. A Philip Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605 610, 1982. A Philip Dawid. Comment: The impossibility of inductive inference. Journal of the American Statistical Association, 80(390):340 341, 1985. Morris H De Groot and Stephen E Fienberg. Assessing probability assessors: Calibration and refinement. Technical report, Carnegie Mellon University, 1981. Dean P Foster. A proof of calibration via Blackwell s approachability theorem. Games and Economic Behavior, 29(1-2):73 78, 1999. Dean P Foster and Sergiu Hart. Forecast hedging and calibration. Journal of Political Economy, 129(12):3447 3490, 2021. Dean P Foster and Sergiu Hart. Calibeating : beating forecasters at their own game. Theoretical Economics (to appear), 2023. Dean P Foster and Rakesh V Vohra. Asymptotic calibration. Biometrika, 85(2):379 390, 1998. Dylan J Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik Sridharan. Logistic regression: The importance of being improper. 
In Conference On Learning Theory, 2018. Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359 378, 2007. Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243 268, 2007. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. Chirag Gupta and Aaditya Ramdas. Distribution-free calibration guarantees for histogram binning without sample splitting. In International Conference on Machine Learning, 2021. Chirag Gupta and Aaditya Ramdas. Faster online calibration without randomization: interval forecasts and the power of two choices. In Conference On Learning Theory, 2022a. Chirag Gupta and Aaditya Ramdas. Top-label calibration and multiclass-to-binary reductions. In International Conference on Learning Representations, 2022b. Chirag Gupta, Aleksandr Podkopaev, and Aaditya Ramdas. Distribution-free binary classification: prediction sets, confidence intervals and calibration. In Advances in Neural Information Processing Systems, 2020. Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157 325, 2016. Elad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In International Conference on Machine Learning, 2009. Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169 192, 2007. R emi J ez equel, Pierre Gaillard, and Alessandro Rudi. Efficient improper learning for online logistic regression. In Conference on Learning Theory, 2020. R emi J ez equel, Pierre Gaillard, and Alessandro Rudi. Mixability made efficient: Fast online multiclass logistic regression. In Advances in Neural Information Processing Systems, 2021. Meelis Kull, Telmo M Silva Filho, and Peter Flach. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052 5080, 2017. Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, 2019. Allan H Murphy. A new vector partition of the probability score. Journal of applied Meteorology, 12(4):595 600, 1973. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In International Conference on Machine Learning, 2005. Online Platt Scaling David Oakes. Self-calibrating priors do not exist. Journal of the American Statistical Association, 80(390):339 339, 1985. Vianney Perchet. Approachability, regret and calibration: Implications and equivalences. Journal of Dynamics & Games, 1(2):181, 2014. John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61 74, 1999. Rebecca Roelofs, Nicholas Cain, Jonathon Shlens, and Michael C Mozer. Mitigating bias in calibration error estimation. In International Conference on Artificial Intelligence and Statistics, 2022. J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. Scott, and N. Wilkins-Diehr. XSEDE: Accelerating Scientific Discovery. 
Computing in Science & Engineering, 16(5):62 74, 2014. Volodimir G Vovk. Aggregating strategies. In Conference on Computational Learning Theory, 1990. David Widmann, Fredrik Lindsten, and Dave Zachariah. Calibration tests in multi-class classification: A unifying framework. In Advances in Neural Information Processing Systems, 2019. Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In International Conference on Machine Learning, 2001. Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. Jize Zhang, Bhavya Kailkhura, and T Yong-Jin Han. Mixn-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, 2020. Lijun Zhang, Tianbao Yang, and Zhi-Hua Zhou. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning, 2018. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, 2003. Online Platt Scaling Name Ttrain Tcal W Sort-by Link to dataset Bank marketing 1000 1000 500 Age https://www.kaggle.com/datasets/kukuroo3/ bank-marketing-response-predict Credit default 1000 1000 500 Sex https://www.kaggle.com/datasets/uciml/ default-of-credit-card-clients-dataset Customer churn 1000 1000 500 Location https://www.kaggle.com/datasets/ shrutimechlearn/churn-modelling Fetal health 626 300 100 Acceleration https://www.kaggle.com/datasets/andrewmvd/ fetal-health-classification Table 2: Metadata for datasets used in Section 4.1. The sort-by column indicates which covariate was used to order data points. All datasets are under the Creative Commons CC0 license. Base model (BM) Fixed-batch Platt scaling (FPS) Windowed Platt scaling (WPS) Online Platt scaling (OPS) OPS + tracking (TOPS) OPS + hedging (HOPS) 2000 4000 6000 8000 10000 Time t [Sharpness (SHP) at t] Bank marketing response 5000 15000 25000 Time t [Sharpness (SHP) at t] Credit default 2000 4000 6000 8000 Time t [Sharpness (SHP) at t] Customer churn modeling 600 800 1000 1200 1400 Time t [Sharpness (SHP) at t] Fetal Health Figure 8: Sharpness results with drifting data. SHP values over time of considered models on four datasets with synthetically induced drifts (Section 4.1). The plots have invisible error bars since variation across the 100 runs was small. The drop in expected sharpness is below 0.005 at all times except on the Fetal Health Dataset. Base model (BM) Fixed-batch Platt scaling (FPS) Windowed Platt scaling (WPS) Online Platt scaling (OPS) OPS + tracking (TOPS) OPS + hedging (HOPS) 2000 4000 6000 8000 10000 Time t [Sharpness (SHP) at t] Bank marketing response 5000 15000 25000 Time t [Sharpness (SHP) at t] Credit default 2000 4000 6000 8000 Time t [Sharpness (SHP) at t] Customer churn modeling 600 800 1000 1200 1400 Time t [Sharpness (SHP) at t] Fetal Health Figure 9: Sharpness results with i.i.d. data. SHP values over time of considered models on four shuffled (ie, nearly i.i.d.) datasets (Section 4.1). The drop in expected sharpness is less than 0.005 in all cases except for the HOPS forecaster on the Fetal Health dataset, where it is 0.01. A Experimental details and additional results Some implementation details, metadata, information on metrics, and additional results and figures are collected here. 
A.1 Metadata for datasets used in Section 4.1 Table 2 contains metadata for the datasets we used in Section 4.1. Ttrain refers to the number of training examples. The sort-by column indicates which covariate was used to order data points. In each case some Online Platt Scaling 2 0 2 4 0.0 t = 1 to t = 1000 (training) 2 0 2 4 0.0 t = 1501 to t = 2000 4 2 0 2 4 0.0 t = 3501 to t = 4000 2 0 2 4 0.0 t = 5501 to t = 6000 0.0 0.2 0.4 0.6 0.8 1.0 Value of x True/predicted Pr(Y = 1 X = x) t Model Acc Ò CE Ó 0 1000 Base 90.72% 0.025 1500 2000 Base 84.29% 0.088 OPS 86.08% 0.029 3500 4000 Base 68.80% 0.26 OPS 83.07% 0.053 5500 6000 Base 59.38% 0.42 OPS 92.82% 0.049 (a) Table reporting accuracy and CE values of the base model and OPS for indicated values of t. Figure 10: The adaptive behavior of OPS for the simulated label drift scenario described in Section 1.2. 10 0 10 0.0 t = 1 to t = 1000 (training) 10 0 10 0.0 t = 1501 to t = 2000 10 0 10 0.0 t = 3501 to t = 4000 10 0 10 0.0 t = 5501 to t = 6000 0.0 0.2 0.4 0.6 0.8 1.0 Value of x True/predicted Pr(Y = 1 X = x) t Model Acc Ò CE Ó 0 1000 Base 84.14% 0.1 1500 2000 Base 74.64% 0.12 OPS 75.40% 0.097 3500 4000 Base 59.51% 0.2 OPS 59.60% 0.067 5500 6000 Base 44.39% 0.34 OPS 51.43% 0.049 (a) Table reporting accuracy and CE values of the base model and OPS for indicated values of t. Figure 11: The adaptive behavior of OPS for the simulated regression-function drift scenario described in Section 1.2. Online Platt Scaling noise was added to the covariate in order to create variation for the experiments. The exact form of drift can be found in the python file sec 4 experiments core.py in the repository https://github.com/AIgen/ df-posthoc-calibration/tree/main/Online%20Platt%20Scaling%20with%20Calibeating. A.2 Additional plots and details for label drift and regression-function drift experiments from Section 1 Figures 3, 10, and 11 report accuracy (Acc) and calibration error (CE) values for the base model and the OPS model in the three dataset drift settings we considered. The Acc values are straightforward averages and can be computed without issues. However, estimation of CE on real datasets is tricky and requires sophisticated techniques such as adaptive binning, debiasing, heuristics for selecting numbers of bins, or kernel estimators (Kumar et al., 2019; Roelofs et al., 2022; Widmann et al., 2019). The issue typically boils down to the fact that Prp Y 1 | X xq cannot be estimated for every x P X without making smoothness assumptions or performing some kind of binning. However, in the synthetic experiments of Section 1, Prp Y 1 | Xq is known exactly, so such techniques are not required. For some subset of forecasts ps, p2, . . . , pt, we compute i s |pi Prp Yi 1 | Xi xiq| , on the instantiated values of Xs, Xs 1, . . . , Xt. Thus, what we report is the true CE given covariate values. A.3 Additional results with windowed histogram binning and changing bin width Comparison to histogram binning (HB). HB is a recalibration method that has been shown to have excellent empirical performance as well as theoretical guarantees (Zadrozny and Elkan, 2001; Gupta and Ramdas, 2021). There are no online versions of HB that we are aware of, so we use the same windowed approach as windowed Platt and beta scaling for benchmarking (see Section 2.3 and the second bullet in Section B). This leads to windowed histogram binning (WHB), the fixed-batch HB recalibrator that is updated every Op100q time-steps. We compare WHB to OPS and OBS (see Section 5). 
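To make the WHB baseline concrete, here is a sketch of fixed-batch histogram binning together with the windowed refitting loop; equal-width bins are used here purely for simplicity (the cited works consider more refined binning schemes), and this is illustrative code rather than the implementation behind Figure 15.

```python
import numpy as np

def fit_histogram_binning(scores, labels, n_bins=10):
    """Fixed-batch histogram binning: the recalibrated value of a bin is the
    average label of the calibration points falling in it; empty bins fall
    back to the bin mid-point."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    bin_pred = np.array([labels[idx == b].mean() if np.any(idx == b)
                         else (edges[b] + edges[b + 1]) / 2 for b in range(n_bins)])
    return edges, bin_pred

def apply_histogram_binning(scores, edges, bin_pred):
    idx = np.clip(np.digitize(np.asarray(scores, float), edges[1:-1]), 0, len(bin_pred) - 1)
    return bin_pred[idx]

def whb_forecasts(scores, labels, window=100, n_bins=10):
    """Windowed histogram binning (WHB): refit on all data seen so far every
    `window` steps; until the first refit, forecast the raw score."""
    scores = np.asarray(scores, float)
    out, fit = np.array(scores, copy=True), None
    for t in range(len(scores)):
        if t >= window and t % window == 0:
            fit = fit_histogram_binning(scores[:t], labels[:t], n_bins)
        if fit is not None:
            out[t] = apply_histogram_binning(scores[t:t + 1], *fit)[0]
    return out
```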
Since tracking improves both OPS and OBS, we also consider tracking WHB. Results are presented in Figure 15. We find that WHB often performs better than OPS and OBS in the i.i.d. case, and results are mixed in the drifting case. However, since WHB is a binning method, it inherently produces something akin to a running average, and so tracking does not improve it further. The best methods (TOPS, TOBS) are the ones that combine one of our proposed parametric online calibrators (OPS, OBS) with tracking. Changing the bin width ϵ. In the main paper, we used ϵ 0.1 and defined corresponding bins as in (10). This binning reflects in three ways on the experiments we performed. First, ϵ-binning is used to divide forecasts into representative bins before calibeating (equations (11), (14)). Second, ϵ-binning is used to define the sharpness and calibration error metrics. Third, the hedging procedure F99 requires specifying a binning scheme, and we used the same ϵ-bins. Here, we show that the empirical results reported in the main paper are not sensitive to the chosen representative value of ϵ 0.1. We run the same experiment used to produce Figures 5 and 6 but with ϵ 0.05 (Figure 12) and ϵ 0.2 (Figure 13). The qualitative results remain identical, with TOPS still the best performer and hardly affected by the changing epsilon. In fact, the plots for all methods except HOPS are indistinguishable from their ϵ 0.1 counterparts at first glance. HOPS is slightly sensitive to ϵ: the performance improves slightly with ϵ 0.05, and worsens slightly with ϵ 0.2. A.4 Reproducibility All results in this paper can be reproduced exactly, including the randomization, using the IPython notebooks that can be found at https://github.com/aigen/df-posthoc-calibration in the folder Online Platt scaling with Calibeating. The README page in the folder contains a table describing which notebook to run to reproduce individual figures from this paper. Online Platt Scaling Base model (BM) Fixed-batch Platt scaling (FPS) Windowed Platt scaling (WPS) Online Platt scaling (OPS) OPS + tracking (TOPS) OPS + hedging (HOPS) 2000 4000 6000 8000 10000 Time t [Calibration-error (CE) at t] Bank marketing response 5000 15000 25000 Time t [Calibration-error (CE) at t] Credit default 2000 4000 6000 8000 Time t [Calibration-error (CE) at t] Customer churn modeling 600 800 1000 1200 1400 Time t [Calibration-error (CE) at t] Fetal Health (a) Calibration error for i.i.d. data streams. 2000 4000 6000 8000 10000 Time t [Calibration-error (CE) at t] Bank marketing response 5000 15000 25000 Time t [Calibration-error (CE) at t] Credit default 2000 4000 6000 8000 Time t [Calibration-error (CE) at t] Customer churn modeling 600 800 1000 1200 1400 Time t [Calibration-error (CE) at t] Fetal Health (b) Calibration error for drifting data streams. Figure 12: Results for the same experimental setup as Figures 5 and 6, but with ϵ 0.05. Base model (BM) Fixed-batch Platt scaling (FPS) Windowed Platt scaling (WPS) Online Platt scaling (OPS) OPS + tracking (TOPS) OPS + hedging (HOPS) 2000 4000 6000 8000 10000 Time t [Calibration-error (CE) at t] Bank marketing response 5000 15000 25000 Time t [Calibration-error (CE) at t] Credit default 2000 4000 6000 8000 Time t [Calibration-error (CE) at t] Customer churn modeling 600 800 1000 1200 1400 Time t [Calibration-error (CE) at t] Fetal Health (a) Calibration error for i.i.d. data streams. 
[Figure 13, panel (b): calibration error for drifting data streams; panels show CE over time t on the bank marketing, credit default, customer churn, and fetal health datasets.]
Figure 13: Results for the same experimental setup as Figures 5 and 6, but with ϵ = 0.2.
[Figure 14: CE over time t for WBS, OBS, WPS, OPS, TOBS, and HOBS on the four real datasets; panel (a) i.i.d. data streams, panel (b) drifting data streams.]
Figure 14: Performance of online beta scaling (OBS) and its calibeating variants on real datasets with and without distribution drift. OBS further improves upon OPS in most cases. In each plot, TOBS is the best-performing method.
[Figure 15: CE over time t for WHB, TWHB, OPS, TOPS, OBS, and TOBS on the four real datasets; panel (a) i.i.d. data streams, panel (b) drifting data streams.]
Figure 15: Comparing the performance of windowed histogram binning (WHB), online Platt scaling (OPS), online beta scaling (OBS), and their tracking variants on real datasets with and without distribution drifts. Among non-tracking methods (dotted lines), WHB performs well with i.i.d. data, while OBS performs well for drifting data. Among tracking methods (solid lines), TOBS and TOPS are the best-performing methods in every plot. Tracking typically does not improve WHB much since WHB is already a binning method (so tracking is implicit).
Algorithm 1: Online Newton Step for OPS (based on Hazan (2016, Algorithm 12))
Input: $K = \{(x, y) : \|(x, y)\|_2 \le 100\}$, time horizon $T$, and initialization parameter $(a^{\mathrm{OPS}}_1, b^{\mathrm{OPS}}_1) = (1, 0) =: \theta_1 \in K$
Hyperparameters: $\gamma = 0.1$, $\rho = 100$
Set $A_0 = \rho I_2$
for $t = 1$ to $T$ do
  Play $\theta_t$, observe the log-loss $l(m_{\theta_t}(f(x_t)), y_t)$ and its gradient $\nabla_t := \nabla_{\theta_t} l(m_{\theta_t}(f(x_t)), y_t)$
  $A_t = A_{t-1} + \nabla_t \nabla_t^\top$
  Newton step: $\widetilde{\theta}_{t+1} = \theta_t - \frac{1}{\gamma} A_t^{-1} \nabla_t$
  Projection: $(a^{\mathrm{OPS}}_{t+1}, b^{\mathrm{OPS}}_{t+1}) = \theta_{t+1} = \arg\min_{\theta \in K} (\widetilde{\theta}_{t+1} - \theta)^\top A_t (\widetilde{\theta}_{t+1} - \theta)$
end for

B Online beta scaling

This is an extended version of Section 5, with some repetition but more details. A recalibration method closely related to Platt scaling is beta scaling (Kull et al., 2017). The beta scaling mapping m has three parameters $(a, b, c) \in \mathbb{R}^3$, and corresponds to a sigmoid transform over two pseudo-features derived from $f(x)$, namely $\log(f(x))$ and $\log(1 - f(x))$:

$m_{a,b,c}(f(x)) := \mathrm{sigmoid}(a \log(f(x)) - b \log(1 - f(x)) + c).$  (18)

Observe that enforcing $b = a$ recovers Platt scaling, since $\mathrm{logit}(z) = \log(z) - \log(1 - z)$. The beta scaling parameters can be learnt following the same protocols as for Platt scaling. The traditional method is to optimize the parameters by minimizing the log-loss (equivalently, maximizing the log-likelihood) over a fixed held-out batch of points. A natural benchmark for online settings is to update the parameters at some frequency (such as every 50 or 100 steps); at each update, the beta scaling parameters are set to the optimal value based on all data seen so far, and these parameters are used for prediction until the next update occurs. We call this benchmark windowed beta scaling (WBS); it is analogous to the windowed Platt scaling (WPS) benchmark considered in the main paper.

Our proposed method for online settings, called online beta scaling (OBS), is to use a log-loss regret minimization procedure, similar to OPS. Analogously to (7), the regret $R_T$ of the OBS predictions $p^{\mathrm{OBS}}_t = m_{a_t, b_t, c_t}(f(x_t))$ is defined as

$R_T := \sum_{t=1}^T l(p^{\mathrm{OBS}}_t, y_t) - \min_{(a, b, c) \in \mathcal{B}} \sum_{t=1}^T l(m_{a,b,c}(f(x_t)), y_t),$  (19)

where $\mathcal{B} := \{(a, b, c) \in \mathbb{R}^3 : a^2 + b^2 + c^2 \le B^2\}$ for some $B \in \mathbb{R}$, and $l$ is the log-loss. We use online Newton step (Algorithm 1) to learn $(a_t, b_t, c_t)$, with the following initialization and hyperparameter values: $K = \{(x, y, z) : \|(x, y, z)\|_2 \le 100\}$, $(a^{\mathrm{OBS}}_1, b^{\mathrm{OBS}}_1, c^{\mathrm{OBS}}_1) = (1, 1, 0)$; $\gamma = 0.1$, $\rho = 25$, $A_0 = \rho I_3$. These minor changes are needed simply because the dimensionality increases from two to three. The empirical results we present shortly are based on an implementation with exactly these fixed hyperparameter values, which do not change across the experiments (that is, we do not do any hyperparameter tuning).

Due to the additional degree of freedom, beta scaling is more expressive than Platt scaling. In the traditional batch setting, Kull et al. (2017) demonstrated that this expressiveness typically leads to better (out-of-sample) calibration performance. We expect this relationship between Platt scaling and beta scaling to also hold for their windowed and online versions. We confirm this intuition through an extension of the real-dataset experiments of Section 4.1 to include WBS and OBS (Figure 14). In the main paper we reported that the base model (BM) and fixed-batch Platt scaling (FPS) perform the worst by a margin, so these lines are not plotted again. We find that OBS performs better than both OPS and WBS, so we additionally report the performance of the calibeating versions of OBS instead of OPS. That is, we replace OPS + tracking (TOPS) with OBS + tracking (TOBS), and OPS + hedging (HOPS) with OBS + hedging (HOBS).
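As a concrete reference, here is a minimal Python sketch of the online Newton step of Algorithm 1, using the stated hyperparameters ($\gamma = 0.1$, $\rho = 100$) and assuming the pre-hoc scores $f(x_t)$ are clipped to $[0.01, 0.99]$. The $A_t$-norm projection onto $K$ is approximated by a simple Euclidean rescaling, and all function names are ours; this is an illustrative sketch, not the paper's implementation. The same loop runs OBS by swapping the two-dimensional pseudo-feature vector for the three-dimensional one implied by (18) and setting $\rho = 25$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def ops_online_newton_step(f_scores, outcomes, gamma=0.1, rho=100.0, radius=100.0):
    """Online Platt scaling via an online-Newton-step update (cf. Algorithm 1).
    Returns the sequence of OPS forecasts p_t = sigmoid(a_t * logit(f(x_t)) + b_t)."""
    theta = np.array([1.0, 0.0])              # (a_1, b_1) = (1, 0)
    A = rho * np.eye(2)                       # A_0 = rho * I_2
    forecasts = []
    for fx, y in zip(f_scores, outcomes):
        fx = np.clip(fx, 0.01, 0.99)          # assumption: scores bounded away from 0 and 1
        z = np.array([logit(fx), 1.0])        # pseudo-feature vector
        p = sigmoid(theta @ z)                # play theta_t: forecast m_{theta_t}(f(x_t))
        forecasts.append(p)
        grad = (p - y) * z                    # gradient of the log-loss at theta_t
        A += np.outer(grad, grad)             # A_t = A_{t-1} + grad grad^T
        theta = theta - (1.0 / gamma) * np.linalg.solve(A, grad)   # Newton step
        norm = np.linalg.norm(theta)
        if norm > radius:                     # crude stand-in for the A_t-norm projection onto K
            theta *= radius / norm
    return np.array(forecasts)
```

The per-step cost is dominated by the 2x2 (or 3x3 for OBS) linear solve, so the update is constant-time in the length of the stream.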
A regret bound similar to Theorem 2.1 can be derived for OBS by instantiating the ONS and AIOLI regret bounds with $d = 3$ (instead of $d = 2$ as done for OPS). The calibeating theorems (3.1 and 3.2) hold regardless of the underlying expert, and so they also hold for OBS.

C F99 online calibration method

We describe the F99 method proposed by Foster (1999) and used in our implementation of HOPS (Section 3.3). The description is borrowed, with some changes, from Gupta and Ramdas (2022a). Recall that the F99 forecasts are the mid-points of the ϵ-bins (10): $B_1 = [0, \epsilon), B_2 = [\epsilon, 2\epsilon), \ldots, B_m = [1 - \epsilon, 1]$. For $b \in [m] := \{1, 2, \ldots, m\}$ and $t \ge 1$, define:

(mid-point of $B_b$) $m_b = (b - 0.5)/m = b\epsilon - \epsilon/2$,
(left end-point of $B_b$) $l_b = (b - 1)/m = (b - 1)\epsilon$,
(right end-point of $B_b$) $r_b = b/m = b\epsilon$.

F99 maintains some quantities as more data is observed and forecasts are made. These are:

(frequency of forecasting $m_b$) $N^t_b = |\{s \le t : p_s = m_b\}|$,
(observed average when $m_b$ was forecasted) $\hat{p}^t_b = \sum_{s=1}^t y_s \mathbf{1}\{p_s = m_b\}/N^t_b$ if $N^t_b > 0$, and $\hat{p}^t_b = m_b$ if $N^t_b = 0$,
(deficit) $d^t_b = l_b - \hat{p}^t_b$,
(excess) $e^t_b = \hat{p}^t_b - r_b$.

The terminology "deficit" indicates that $\hat{p}^t_b$ is smaller than $l_b$; similarly, "excess" indicates that $\hat{p}^t_b$ is larger than $r_b$. The F99 algorithm is as follows; implicit in the description is the computation of the quantities defined above.

F99: the online adversarial calibration method of Foster (1999). At time $t = 1$, forecast $p_1 = m_1$. At time $t + 1$ (for $t \ge 1$), if condition A—there exists a $b \in [m]$ such that $d^t_b \le 0$ and $e^t_b \le 0$—is satisfied, forecast $p_{t+1} = m_b$ for any $b$ that verifies condition A. Otherwise, condition B—there exists a $b \in [m - 1]$ such that $e^t_b > 0$ and $d^t_{b+1} > 0$—must be satisfied (see Lemma 5 of Gupta and Ramdas (2022a)). For any index $b$ that satisfies condition B, forecast
$m_b$ with probability $\frac{d^t_{b+1}}{d^t_{b+1} + e^t_b}$, and $m_{b+1}$ with probability $\frac{e^t_b}{d^t_{b+1} + e^t_b}$.
These randomization probabilities are revealed before $y_{t+1}$ is set by the agent generating the outcomes, but the actual $p_{t+1}$ value is drawn after $y_{t+1}$ is revealed.

D Forecasting climatology to achieve calibration

Although Foster and Vohra's (1998) result guarantees that calibrated forecasting is possible against adversarial sequences, this does not immediately imply that the forecasts are useful in practice. To see this, consider the alternating outcome sequence $y_t = \mathbf{1}\{t \text{ is odd}\}$. The forecast $p_t = \mathbf{1}\{t \text{ is odd}\}$ is calibrated and perfectly accurate. The forecast $p_t = 0.5$ (for every $t$) is also calibrated, but not very useful. Thus we need to assess how a forecaster guaranteed to be calibrated for adversarial sequences performs on real-world sequences. To do so, we implemented the F99 forecaster (described above) on Pittsburgh's hourly rain data from January 1, 2008, to December 31, 2012. The data was obtained from ncdc.noaa.gov/cdo-web/. All days on which the hourly precipitation in inches (HPCP) was at least 0.01 were considered as instances of $y_t = 1$. There are many missing rows in the data, but no complex data cleaning was performed since we are mainly interested in a simple illustrative simulation.

Figure 16: Foster (1999)'s ϵ-calibrated forecaster on Pittsburgh's hourly rain data (2008–2012); the plot shows the forecast $p_t$ against time $t$. The forecaster makes predictions on the grid $(0.05, 0.15, \ldots, 0.95)$. In the long run, the forecaster starts predicting 0.35 for every instance, closely matching the average fraction of instances on which it rained (approximately 0.37).
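For reference, here is a minimal Python sketch (our own function names) of the F99 procedure described above, replayed on a fixed binary outcome sequence such as the rain indicators. Note that the draw order in this sketch (sampling the forecast before observing $y_t$) only matters against an adaptive adversary, not in such an offline replay.

```python
import numpy as np

def f99_forecaster(outcomes, eps=0.1, seed=0):
    """Replay Foster (1999)'s calibrated forecaster on a fixed binary sequence."""
    rng = np.random.default_rng(seed)
    m = int(round(1.0 / eps))
    mids = (np.arange(m) + 0.5) * eps              # bin mid-points m_b
    lefts, rights = np.arange(m) * eps, (np.arange(m) + 1) * eps
    counts = np.zeros(m)                            # N_b: times m_b was forecast
    sums = np.zeros(m)                              # sum of outcomes when m_b was forecast
    forecasts = []
    for y in outcomes:
        avgs = np.where(counts > 0, sums / np.maximum(counts, 1), mids)  # p_hat_b
        deficits = lefts - avgs                     # d_b = l_b - p_hat_b
        excesses = avgs - rights                    # e_b = p_hat_b - r_b
        ok = np.where((deficits <= 0) & (excesses <= 0))[0]
        if len(ok) > 0:                             # condition A: play that bin's mid-point
            b = ok[0]
        else:                                       # condition B: randomize between neighbours
            b = np.where((excesses[:-1] > 0) & (deficits[1:] > 0))[0][0]
            prob_lower = deficits[b + 1] / (deficits[b + 1] + excesses[b])
            if rng.random() >= prob_lower:
                b = b + 1
        forecasts.append(mids[b])
        counts[b] += 1
        sums[b] += y
    return np.array(forecasts)
```

Running such a replay on the rain indicators reproduces the qualitative behavior shown in Figure 16: the forecasts eventually settle near the long-run frequency of rain.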
F99 makes forecasts on an ϵ-grid with $\epsilon = 0.1$: that is, the grid corresponds to the points $(0.05, 0.15, \ldots, 0.95)$. We observe (Figure 16) that after around 2000 instances, the forecaster always predicts 0.35. This is close to the average fraction of instances on which it did rain, which is approximately 0.37 (this long-term average is also called climatology in the meteorology literature). Although forecasting climatology can make the forecaster appear calibrated, it is arguably not a useful prediction, given that there exist expert rain forecasters who make sharp predictions that change from day to day.

Proof of Theorem 2.1. The regret bounds for ONS and AIOLI depend on a few problem-dependent parameters: the dimension $d = 2$; the radius $B$ of the reference class; and a bound on the norm of the gradient, which for logistic regression is also the radius of the space of input vectors. Due to the assumption on $f(x_t)$, the norm of the input is at most $\sqrt{\mathrm{logit}(0.01)^2 + 1^2} = \sqrt{\mathrm{logit}(0.99)^2 + 1^2} \le 5$. The AIOLI bound (9) follows from Theorem 1, equation (4) of Jézéquel et al. (2020), setting $d = 2$ and $R = 10$. The ONS bound (8) follows from Theorem 4.5 of Hazan (2016), plugging in $G = 5$, $D = 2B$, and $\alpha = e^{-B}$, which is the known exp-concavity constant of the logistic loss over a ball of radius $B$ (Foster et al., 2018).

In writing the proofs of the results in Section 3, we will use an object closely connected to sharpness, called refinement. For a sequence of forecasts $p_{1:T}$ and an outcome sequence $y_{1:T}$, the refinement $R$ is defined as

$R(p_{1:T}) := \frac{1}{T} \sum_{b=1}^{m} N_b \, \hat{y}_b (1 - \hat{y}_b),$  (20)

where $\hat{y}_b$ is the average of the outcomes in the ϵ-bin $B_b$; see the beginning of Section 3.2 where sharpness is defined. The function $x \,(\in [0, 1]) \mapsto x(1 - x)$ is minimized at the boundary points $\{0, 1\}$ and maximized at $1/2$. Thus refinement is lower if the $\hat{y}_b$ are close to 0 or 1, or in other words if the bins discriminate points well. This is captured formally in the following (well-known) relationship between refinement and sharpness.

Lemma E.1 (Sharpness-refinement lemma). For any forecast sequence $p_{1:T}$, the refinement $R$ defined in (20) and the sharpness $\mathrm{SHP}$ defined in (12) are related as
$R(p_{1:T}) = \bar{y}_T - \mathrm{SHP}(p_{1:T})$, where $\bar{y}_T = \frac{1}{T} \sum_{t=1}^T y_t$.

Proof. Observe that
$\frac{1}{T}\sum_{b=1}^m N_b \hat{y}_b (1 - \hat{y}_b) = \frac{1}{T}\sum_{b=1}^m N_b \hat{y}_b - \frac{1}{T}\sum_{b=1}^m N_b \hat{y}_b^2 = \frac{1}{T}\sum_{b=1}^m N_b \hat{y}_b - \mathrm{SHP}(p_{1:T}).$
The final result follows simply by noting that $N_b \hat{y}_b = \sum_{t \le T, \, p_t \in B_b} y_t$, so that $\frac{1}{T}\sum_{b=1}^m N_b \hat{y}_b = \bar{y}_T$.

We now state a second lemma that relates $R$ to the Brier score $\mathrm{BS}$, defined as $\mathrm{BS}(p_{1:T}) := \frac{1}{T}\sum_{t=1}^T (y_t - p_t)^2$. Unlike $R$ and $\mathrm{SHP}$, $\mathrm{BS}$ is not defined after ϵ-binning. It is well known (see, for example, equation (1) of FH23) that if refinement is defined without ϵ-binning (or if the Brier score is defined with ϵ-binning), then refinement is at most the Brier score defined above. Since we define $R$ with binning, further work is required to relate the two.

Lemma E.2 (Brier-score-refinement lemma). For any forecast sequence $p_{1:T}$ and outcome sequence $y_{1:T}$, the refinement $R$ and the Brier score $\mathrm{BS}$ are related as
$R(p_{1:T}) \le \mathrm{BS}(p_{1:T}) + \epsilon^2/4 + \epsilon,$
where ϵ is the width of the bins used to define $R$ (10).

Proof. Define the discretization function $\mathrm{disc} : [0, 1] \to [0, 1]$ as $\mathrm{disc}(p) = \text{mid-point}(B_b) \iff p \in B_b$. Note that for all $p \in [0, 1]$, $|p - \mathrm{disc}(p)| \le \epsilon/2$. Based on standard decompositions (such as equation (1) of FH23), we know that
$R(p_{1:T}) \le \frac{1}{T}\sum_{t=1}^T (y_t - \mathrm{disc}(p_t))^2.$
We now relate the RHS of the above equation to $\mathrm{BS}$:
$\sum_{t=1}^T (y_t - \mathrm{disc}(p_t))^2 = \sum_{t=1}^T (y_t - p_t + p_t - \mathrm{disc}(p_t))^2 = T \cdot \mathrm{BS}(p_{1:T}) + \sum_{t=1}^T (p_t - \mathrm{disc}(p_t))^2 + 2\sum_{t=1}^T (y_t - p_t)(p_t - \mathrm{disc}(p_t))$
$\le T \cdot \mathrm{BS}(p_{1:T}) + T(\epsilon/2)^2 + 2\sum_{t=1}^T |y_t - p_t| \,(\epsilon/2) \le T \cdot \mathrm{BS}(p_{1:T}) + T(\epsilon/2)^2 + T\epsilon.$
The result of the lemma follows by dividing both sides by $T$.

Proof of Theorem 3.1. The calibeating paper (Foster and Hart, 2023) is referred to as FH23 in this proof for succinctness. We use Theorem 3 of FH23, specifically equation (13), which gives an upper bound on the Brier score of a tracking forecast ($B^c_t$ in their notation) relative to the refinement (20) of the base forecast. In our case, the tracking forecast is TOPS and the base forecast is OPS, and FH23's result gives
$\mathrm{BS}(p^{\mathrm{TOPS}}_{1:T}) = \frac{1}{T}\sum_{t=1}^T (y_t - p^{\mathrm{TOPS}}_t)^2 \le R(p^{\mathrm{OPS}}_{1:T}) + \frac{\log T + 1}{\epsilon T}.$
Using the Brier-score-refinement lemma (Lemma E.2) to lower bound $\mathrm{BS}(p^{\mathrm{TOPS}}_{1:T})$ gives
$R(p^{\mathrm{TOPS}}_{1:T}) - \frac{\epsilon^2}{4} - \epsilon \le R(p^{\mathrm{OPS}}_{1:T}) + \frac{\log T + 1}{\epsilon T}.$
Finally, using the sharpness-refinement lemma (Lemma E.1), we can replace each $R$ with $\bar{y}_T - \mathrm{SHP}$. Rearranging terms gives the final bound.

Proof of Theorem 3.2. The calibeating paper (Foster and Hart, 2023) is referred to as FH23 in this proof for succinctness.

Sharpness bound (16). Theorem 5 of FH23 showed that the expected Brier score of a different hedging scheme (instead of F99) is at most the expected refinement of the base forecast plus $\epsilon^2 + \frac{\log T + 1}{\epsilon^2 T}$. In our case, the second term remains unchanged, but because we use F99, the $\epsilon^2$ term needs to be replaced; we show next that it can be replaced by $\epsilon$. Let us call the combination of OPS and the FH23 hedging method FH23-HOPS, and the resulting calibeating forecast $p^{\text{FH23-HOPS}}_{1:T}$. The source of the $\epsilon^2$ term in Theorem 5 of FH23 is the following property of FH23-HOPS: for both values of $y_t \in \{0, 1\}$,
$\mathbb{E}_{t-1}\left[(y_t - p^{\text{FH23-HOPS}}_t)^2 - \left(y_t - \mathrm{Average}\{y_s : s < t, \, p^{\mathrm{OPS}}_s = p^{\mathrm{OPS}}_t, \, p^{\text{FH23-HOPS}}_s = p^{\text{FH23-HOPS}}_t\}\right)^2\right] \le \epsilon^2,$
where $\mathbb{E}_{t-1}[\cdot]$ is the expectation conditional on $(y_{1:t-1}, p^{\text{FH23-HOPS}}_{1:t-1}, p^{\mathrm{OPS}}_{1:t})$ (all that has happened in the past, and the current OPS forecast). For HOPS, we will show that
$Q_t := \mathbb{E}_{t-1}\left[(y_t - p^{\mathrm{HOPS}}_t)^2 - \left(y_t - \mathrm{Average}\{y_s : s < t, \, p^{\mathrm{OPS}}_s = p^{\mathrm{OPS}}_t, \, p^{\mathrm{HOPS}}_s = p^{\mathrm{HOPS}}_t\}\right)^2\right] \le \epsilon$, for $y_t \in \{0, 1\}$,
which would give the required result. At time $t$, the F99 forecast falls into one of two scenarios, which we analyze separately (see Appendix C for details of F99 that help follow the case-work).

Case 1. This corresponds to condition A in the description of F99 in Appendix C. There exists a bin index $b$ such that $q = \text{mid-point}(B_b)$ satisfies
$\left|\mathrm{Average}\{y_s : s < t, \, p^{\mathrm{OPS}}_s = p^{\mathrm{OPS}}_t, \, p^{\mathrm{HOPS}}_s = q\} - q\right| \le \epsilon/2.$
In this case, F99 sets $p^{\mathrm{HOPS}}_t = q$ (deterministically) for some $q$ satisfying the above. Thus,
$Q_t = (y_t - q)^2 - \left(y_t - \mathrm{Average}\{y_s : s < t, \, p^{\mathrm{OPS}}_s = p^{\mathrm{OPS}}_t, \, p^{\mathrm{HOPS}}_s = q\}\right)^2 \le \max\left((y_t - q)^2 - (y_t - q - \epsilon/2)^2, \; (y_t - q)^2 - (y_t - q + \epsilon/2)^2\right) \le (\epsilon/2)\left(2|y_t - q| + \epsilon/2\right) < \epsilon,$
irrespective of $y_t$, since $q \in [\epsilon/2, 1 - \epsilon/2]$.

Case 2. This corresponds to condition B in the description of F99 in Appendix C. If Case 1 does not hold, F99 randomizes between two consecutive bin mid-points $m - \epsilon/2$ and $m + \epsilon/2$, where $m$ is one of the edges of the ϵ-bins (10). Define $n_1 := \mathrm{Average}\{y_s : s < t, \, p^{\mathrm{OPS}}_s = p^{\mathrm{OPS}}_t, \, p^{\mathrm{HOPS}}_s = m - \epsilon/2\}$ and $n_2 := \mathrm{Average}\{y_s : s < t, \, p^{\mathrm{OPS}}_s = p^{\mathrm{OPS}}_t, \, p^{\mathrm{HOPS}}_s = m + \epsilon/2\}$. The choice of $m$ in F99 guarantees that $n_2 < m < n_1$, and the randomization probabilities are given by
$\mathbb{P}_{t-1}(p^{\mathrm{HOPS}}_t = m - \epsilon/2) = \frac{m - n_2}{n_1 - n_2}$, and $\mathbb{P}_{t-1}(p^{\mathrm{HOPS}}_t = m + \epsilon/2) = \frac{n_1 - m}{n_1 - n_2}$,
where $\mathbb{P}_{t-1}$ is the conditional probability in the same sense as $\mathbb{E}_{t-1}$. We now bound $Q_t$.
If $y_t = 1$,
$Q_t = \mathbb{E}_{t-1}\left[(y_t - p^{\mathrm{HOPS}}_t)^2 - \left(y_t - \mathrm{Average}\{y_s : s < t, \, p^{\mathrm{OPS}}_s = p^{\mathrm{OPS}}_t, \, p^{\mathrm{HOPS}}_s = p^{\mathrm{HOPS}}_t\}\right)^2\right]$
$= \frac{m - n_2}{n_1 - n_2}\left[(1 - (m - \epsilon/2))^2 - (1 - n_1)^2\right] + \frac{n_1 - m}{n_1 - n_2}\left[(1 - (m + \epsilon/2))^2 - (1 - n_2)^2\right]$
$= \underbrace{\frac{(m - n_2)\left[(1 - m)^2 - (1 - n_1)^2\right] + (n_1 - m)\left[(1 - m)^2 - (1 - n_2)^2\right]}{n_1 - n_2}}_{=: A_1} + \underbrace{2(\epsilon/2)\,\frac{(m - n_2)(1 - m) - (n_1 - m)(1 - m)}{n_1 - n_2}}_{=: A_2} + (\epsilon/2)^2.$
$A_1$ and $A_2$ simplify as follows:
$A_1 = \frac{(m - n_2)(n_1 - m)(2 - (n_1 + m)) + (n_1 - m)(n_2 - m)(2 - (n_2 + m))}{n_1 - n_2} = \frac{(m - n_2)(n_1 - m)(n_2 - n_1)}{n_1 - n_2} = -(m - n_2)(n_1 - m) \le 0$, since $n_2 < m < n_1$;
$A_2 = \epsilon\,\frac{(m - n_2)(1 - m) + (m - n_1)(1 - m)}{n_1 - n_2} < \epsilon\,\frac{(m - n_2)(1 - m)}{n_1 - n_2} \le \epsilon(1 - m)$ (since $m < n_1$ and $(m - n_2)/(n_1 - n_2) \le 1$).
Overall, we obtain that for $y_t = 1$, $Q_t < \epsilon(1 - m) + \epsilon^2/4 < \epsilon$, where the final inequality holds since $m$ is an end-point between two bins, and thus $m \ge \epsilon$.

We do the calculations for $y_t = 0$ less explicitly since they follow essentially the same steps:
$Q_t = \mathbb{E}_{t-1}\left[(0 - p^{\mathrm{HOPS}}_t)^2 - \left(0 - \mathrm{Average}\{y_s : s < t, \, p^{\mathrm{OPS}}_s = p^{\mathrm{OPS}}_t, \, p^{\mathrm{HOPS}}_s = p^{\mathrm{HOPS}}_t\}\right)^2\right]$
$= \frac{m - n_2}{n_1 - n_2}\left[(m - \epsilon/2)^2 - n_1^2\right] + \frac{n_1 - m}{n_1 - n_2}\left[(m + \epsilon/2)^2 - n_2^2\right]$
$= \frac{(m - n_2)(m - n_1)(m + n_1) + (n_1 - m)(m - n_2)(m + n_2)}{n_1 - n_2} + \epsilon\,\frac{(n_2 - m)m + (n_1 - m)m}{n_1 - n_2} + \frac{\epsilon^2}{4}$
$< 0 + \epsilon m + (\epsilon^2/4) < \epsilon$ (since $m \le 1 - \epsilon$).

Finally, by Proposition 1 of FH23 and the above bound on $Q_t$, we obtain
$\mathbb{E}\left[R(p^{\mathrm{HOPS}}_{1:T})\right] \le \mathbb{E}\left[\mathrm{BS}(p^{\mathrm{HOPS}}_{1:T})\right] \le \epsilon + R(p^{\mathrm{OPS}}_{1:T}) + \frac{\log T + 1}{\epsilon^2 T}.$
Using the sharpness-refinement lemma (Lemma E.1), we replace each $R$ with $\bar{y}_T - \mathrm{SHP}$. Rearranging terms gives the sharpness result.

Calibration bound (17). Recall that the number of bins is $m = 1/\epsilon$. For bin indices $b, b' \in \{1, 2, \ldots, m\}$, let $S_{b \to b'} = \{t \le T : p^{\mathrm{OPS}}_t \in B_b, \, p^{\mathrm{HOPS}}_t = \text{mid-point}(B_{b'})\}$ be the set of time instances at which the OPS forecast $p^{\mathrm{OPS}}_t$ belonged to bin $b$, but the HOPS forecast $p^{\mathrm{HOPS}}_t$ belonged to bin $b'$ (and equals the mid-point of bin $b'$). Also, let $S_b = \{t \le T : p^{\mathrm{OPS}}_t \in B_b\}$ be the set of time instances at which the OPS forecast belonged to bin $b$. Thus $S_b = \bigcup_{b'=1}^m S_{b \to b'}$. Also define $N^{\mathrm{OPS}}_b = |S_b|$ and $N^{\mathrm{HOPS}}_b = |\{t \le T : p^{\mathrm{HOPS}}_t = \text{mid-point}(B_b)\}|$.

Now, for any specific $b$, consider the sequence $(y_t)_{t \in S_b}$. On this sequence, the HOPS forecasts correspond to F99 run using just the outcomes (with no regard for covariate values once the bin of $p^{\mathrm{OPS}}_t$ is fixed). Thus, within this particular bin, we have the usual CE guarantee that the F99 algorithm has for any arbitrary sequence:
$\mathbb{E}\left[\frac{1}{N^{\mathrm{OPS}}_b}\sum_{b'=1}^m \left|\sum_{t \in S_{b \to b'}} (y_t - p^{\mathrm{HOPS}}_t)\right|\right] \le \frac{\epsilon}{2} + \frac{2}{\sqrt{N^{\mathrm{OPS}}_b}},$  (26)
where the quantity inside the expectation is the CE over the $S_b$ instances. This result is unavailable in exactly this form in Foster (1999), which only gives the reduction to Blackwell approachability, after which any finite-sample approachability bound can be used. The above version follows from Theorem 1.1 of Perchet (2014). The precise details of the Blackwell approachability set, the reward vectors, and how the distance to the set translates to CE can be found in Gupta and Ramdas (2022a, Section 4.1).

Jensen's inequality can be used to lift this CE guarantee to the entire sequence:
$\mathbb{E}\left[\mathrm{CE}(p^{\mathrm{HOPS}}_{1:T})\right] = \mathbb{E}\left[\frac{1}{T}\sum_{b=1}^m N^{\mathrm{HOPS}}_b \left|\hat{y}^{\mathrm{HOPS}}_b - \hat{p}^{\mathrm{HOPS}}_b\right|\right] = \mathbb{E}\left[\frac{1}{T}\sum_{b=1}^m \left|\sum_{t=1}^T (y_t - p^{\mathrm{HOPS}}_t)\mathbf{1}\{p^{\mathrm{HOPS}}_t \in B_b\}\right|\right] = \mathbb{E}\left[\frac{1}{T}\sum_{b=1}^m \left|\sum_{b'=1}^m \sum_{t \in S_{b' \to b}} (y_t - p^{\mathrm{HOPS}}_t)\right|\right]$
$\le \frac{1}{T}\sum_{b'=1}^m \sum_{b=1}^m \mathbb{E}\left[\left|\sum_{t \in S_{b' \to b}} (y_t - p^{\mathrm{HOPS}}_t)\right|\right]$ (Jensen's inequality)
$\le \frac{1}{T}\sum_{b'=1}^m N^{\mathrm{OPS}}_{b'}\left(\frac{\epsilon}{2} + \frac{2}{\sqrt{N^{\mathrm{OPS}}_{b'}}}\right)$ (by (26))
$= \frac{\epsilon}{2} + \frac{2\sum_{b'=1}^m \sqrt{N^{\mathrm{OPS}}_{b'}}}{\sum_{b'=1}^m N^{\mathrm{OPS}}_{b'}}$ (since $T = \sum_{b'=1}^m N^{\mathrm{OPS}}_{b'}$)
$\stackrel{(*)}{\le} \frac{\epsilon}{2} + 2\sqrt{\frac{m}{T}},$
as needed to be shown. The inequality $(*)$ holds because, by Jensen's inequality (or the AM-QM inequality),
$\sqrt{\frac{\sum_{b'=1}^m N^{\mathrm{OPS}}_{b'}}{m}} \ge \frac{1}{m}\sum_{b'=1}^m \sqrt{N^{\mathrm{OPS}}_{b'}}$, so that $\frac{\sum_{b'=1}^m \sqrt{N^{\mathrm{OPS}}_{b'}}}{\sum_{b'=1}^m N^{\mathrm{OPS}}_{b'}} \le \frac{\sqrt{m}}{\sqrt{\sum_{b'=1}^m N^{\mathrm{OPS}}_{b'}}} = \sqrt{\frac{m}{T}}.$