# Estimating Uncertainty Online Against an Adversary

Volodymyr Kuleshov, Stanford University, Stanford, CA 94305, tkuleshov@cs.stanford.edu
Stefano Ermon, Stanford University, Stanford, CA 94305, ermon@cs.stanford.edu

Assessing uncertainty is an important step towards ensuring the safety and reliability of machine learning systems. Existing uncertainty estimation techniques may fail when their modeling assumptions are not met, e.g. when the data distribution differs from the one seen at training time. Here, we propose techniques that assess a classification algorithm's uncertainty via calibrated probabilities (i.e. probabilities that match empirical outcome frequencies in the long run) and which are guaranteed to be reliable (i.e. accurate and calibrated) on out-of-distribution input, including input generated by an adversary. This represents an extension of classical online learning that handles uncertainty in addition to guaranteeing accuracy under adversarial assumptions. We establish formal guarantees for our methods, and we validate them on two real-world problems: question answering and medical diagnosis from genomic data.

## Introduction

Assessing uncertainty is an important step towards ensuring the safety and reliability of machine learning systems. In many applications of machine learning, including medical diagnosis (Jiang et al. 2012), natural language understanding (Nguyen and O'Connor 2015), and speech recognition (Yu, Li, and Deng 2011), assessing confidence can be as important as obtaining high accuracy.

This work explores confidence estimation for classification problems. An important limitation of existing methods is the assumption that data is sampled i.i.d. from a distribution $P(x, y)$; when test-time data is distributed according to a different $P'$, these methods may become overconfident and erroneous. Here, we introduce new, robust uncertainty estimation algorithms guaranteed to produce reliable confidence estimates on out-of-distribution input, including input generated by an adversary.

In the classification setting, the most natural way of measuring an algorithm's uncertainty is via calibrated probability estimates that match the true empirical frequencies of an outcome. For example, if an algorithm predicted a 60% chance of rain 100 times in a given year, its forecast would be calibrated if it rained on about 60 of those 100 days.

Background. Calibrated confidence estimates are typically constructed via recalibration, using methods such as Platt scaling (Platt 1999) or isotonic regression (Niculescu-Mizil and Caruana 2005). In the context of binary classification, these methods reduce recalibration to a one-dimensional regression problem that, given data $(x_i, y_i)_{i=1}^n$, trains a model $g(s)$ (e.g. logistic regression) to predict probabilities $p_i = g(s_i)$ from uncalibrated scores $s_i = h(x_i)$ produced by a classifier $h$ (e.g. SVM margins). Fitting $g$ is equivalent to performing density estimation targeting $P(Y = 1 \mid h(X) = s_i)$, and hence may fail on out-of-distribution testing data.
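To make this reduction concrete, the following is a minimal sketch of batch recalibration in the style of Platt scaling, written in Python with scikit-learn; the function `platt_scale`, the synthetic scores, and the train/calibration split are our own illustrative assumptions rather than code from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(scores_cal, labels_cal):
    """Fit a 1-D model g(s) that maps raw classifier scores to probabilities
    using a held-out calibration set (a sketch of Platt-style recalibration)."""
    g = LogisticRegression()
    g.fit(np.asarray(scores_cal).reshape(-1, 1), labels_cal)
    # Return a recalibration function s -> g(s) ~ P(Y = 1 | h(X) = s).
    return lambda s: g.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

# Illustrative usage on synthetic, uncalibrated SVM-like margins (not the paper's data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = 2.0 * labels - 1.0 + rng.normal(0.0, 1.5, size=1000)
recalibrate = platt_scale(scores[:500], labels[:500])
probs = recalibrate(scores[500:])  # calibrated only if the data is i.i.d.
```

As the surrounding text notes, the fitted $g$ is a density estimate of $P(Y = 1 \mid h(X) = s)$, so its guarantees hinge on the test data following the calibration distribution; the online methods developed next drop that assumption.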
The methods we introduce in this work are instead based on calibration techniques developed in the literature on online learning in mathematical games (Foster and Vohra 1998; Abernethy, Bartlett, and Hazan 2011). These classical methods are not suitable for standard prediction tasks in their current form. For one, they do not admit covariates $x_i$ that might be available to improve the prediction of $y_i$; hence, they also do not consider the predictive power of the forecasts. For example, a forecaster that constantly predicts 0.5 on the sequence 0101... formed by alternating 0s and 1s is considered calibrated. The algorithms we present here combine the advantages of online calibration (adversarial assumptions) and of batch probability recalibration (covariates and forecast sharpness).

Online learning with uncertainty. Whereas classical online optimization aims to accurately predict targets $y$ given $x$ (via a convex loss $\ell(x, y)$), our algorithms aim to accurately predict uncertainties $p(y = \hat{y})$. The $p$ here are defined as empirical frequencies over the data seen so far; it turns out that these probability-like quantities can be estimated under the standard adversarial assumptions of online learning. We thus see our work as extending classical online optimization to handle uncertainty in addition to guaranteeing accuracy.

Example. As a concrete motivating example, consider a medical system that diagnoses a long stream of patients indexed by $t = 1, 2, \dots$, outputting a disease risk $p_t \in [0, 1]$ for each patient based on their medical record $x_t$. Provably calibrated probabilities in this setting may be helpful for making informed policy decisions (e.g. by providing guaranteed upper bounds on the number of patients readmitted after a discharge) and may be used to communicate risks to patients in a more intuitive way. This setting is also inherently online, since patients are typically observed one at a time, and it may not be i.i.d. due to, e.g., seasonal disease outbreaks.

Contributions. More formally, our contributions are to:

- Formulate a new problem called online recalibration, which requires producing calibrated probabilities on potentially adversarial input while retaining the predictive power of a given baseline uncalibrated forecaster.
- Propose a meta-algorithm for online recalibration that uses classical online calibration as a black-box subroutine.
- Show that our technique can recalibrate the forecasts of any existing classifier at the cost of an $O(1/\sqrt{\epsilon})$ overhead in the convergence rate, where $\epsilon > 0$ is the desired level of accuracy.
- Show that, surprisingly, both online and standard batch recalibration (e.g., Platt scaling) may be performed only when accuracy is measured using specific loss functions; our work characterizes the losses that admit a recalibration procedure in both the online and batch settings.

## Background

Below, we use $\mathbb{I}_E$ to denote the indicator function of an event $E$, $[N]$ and $[N]_0$ to denote (respectively) the sets $\{1, 2, \dots, N\}$ and $\{0, 1, 2, \dots, N\}$, and $\Delta_d$ to denote the $d$-dimensional simplex.

### Learning with Expert Advice

Learning with expert advice (Cesa-Bianchi and Lugosi 2006) is a special case of the general online optimization framework (Shalev-Shwartz 2007) that underlies online calibration algorithms. At each time $t = 1, 2, \dots$, the forecaster $F$ receives advice from $N$ experts and chooses a distribution $w_t \in \Delta_{N-1}$ over their advice. Nature then reveals an outcome $y_t$, and $F$ incurs an expected loss of $\sum_{i=1}^N w_{ti}\,\ell(y_t, a_{it})$, where $\ell(y_t, a_{it})$ is the loss under expert $i$'s advice $a_{it}$.
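For readers less familiar with this protocol, here is a minimal sketch of one standard forecaster for this setting, an exponentially weighted average of expert advice; the class `ExponentialWeights`, its learning rate, and the squared loss it uses are illustrative assumptions, not constructions from the paper.

```python
import numpy as np

class ExponentialWeights:
    """Sketch of a forecaster for learning with expert advice: maintain a
    distribution w_t over N experts and reweight them multiplicatively by
    their losses."""

    def __init__(self, n_experts, eta=0.5):
        self.w = np.ones(n_experts) / n_experts  # w_t in the simplex
        self.eta = eta                           # learning rate (assumed)

    def predict(self, advice):
        # Aggregate prediction under the current distribution over experts.
        return float(self.w @ advice)

    def update(self, advice, y):
        losses = (np.asarray(advice) - y) ** 2   # per-expert losses l(y_t, a_it)
        self.w *= np.exp(-self.eta * losses)     # multiplicative update
        self.w /= self.w.sum()

# Illustrative usage: two constant "experts" always advising 0.3 and 0.7.
forecaster = ExponentialWeights(n_experts=2)
for y in [0, 1, 1, 0, 1]:
    advice = np.array([0.3, 0.7])
    p = forecaster.predict(advice)
    forecaster.update(advice, y)
```

Forecasters of this kind attain sublinear external regret; calibration will additionally require controlling the stronger internal regret, both defined next.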
Performance in this setting is measured using two notions of regret.

Definition 1. The external regret $R_T^{ext}$ and the internal regret $R_T^{int}$ are defined as
$$R_T^{ext} = \sum_{t=1}^T \ell(y_t, p_t) - \min_{i \in [N]} \sum_{t=1}^T \ell(y_t, a_{it}),$$
$$R_T^{int} = \max_{i, j \in [N]} \sum_{t=1}^T p_{t,i}\,\big(\ell(y_t, a_{it}) - \ell(y_t, a_{jt})\big),$$
where $\ell(y, p) = \sum_{i=1}^N p_i\,\ell(y, a_{it})$ is the expected loss. External regret measures loss with respect to the best fixed expert, while internal regret is a stronger notion that measures the gain from retrospectively switching all the plays of action $i$ to $j$. Both definitions admit algorithms with sublinear, uniformly bounded regret.

[Figure 1: Our method bins uncalibrated scores and runs an online calibration subroutine in each bin (not unlike the histogram recalibration method targeting $P(y = 1 \mid s)$).]

In this paper, we will be particularly interested in proper losses $\ell$, whose expectation over $y$ is minimized by the probability corresponding to the average $y$.

Definition 2. A loss $\ell(y, p) : \{0, 1\} \times [0, 1] \to \mathbb{R}^+$ is proper if $p \in \arg\min_q \mathbb{E}_{y \sim \mathrm{Ber}(p)}\,\ell(y, q)$ for all $p$.

Examples of proper losses include the L2 loss $\ell_2(y, p) = (y - p)^2$, the log-loss $\ell_{\log}(y, p) = -y \log p - (1 - y)\log(1 - p)$, and the misclassification loss $\ell_{mc}(y, p) = (1 - y)\,\mathbb{I}_{p \geq 0.5} + y\,\mathbb{I}_{p < 0.5}$. Counter-examples include the L1 and the hinge losses.

### Calibration in Online Learning

Intuitively, calibration means that the true and predicted frequencies of an event should match. For example, if an algorithm predicts a 60% chance of rain 100 times in a given year, then we should see rain on about 60 of those 100 days. More formally, let $F^{cal}$ be a forecaster making predictions in the set $\{\, i/N \mid i = 0, \dots, N \,\}$, where $1/N$ is called the resolution of $F^{cal}$, and consider the quantities
$$\rho_T(p) = \frac{\sum_{t=1}^T y_t\,\mathbb{I}_{p_t = p}}{\sum_{t=1}^T \mathbb{I}_{p_t = p}} \qquad \text{and} \qquad C_T^{p} = \sum_{i=0}^{N} \left(\frac{1}{T}\sum_{t=1}^{T} \mathbb{I}_{p_t = i/N}\right)\left|\rho_T(i/N) - \frac{i}{N}\right|^{p}.$$
The term $\rho_T(p)$ denotes the frequency at which the event $y = 1$ occurred over the times when we predicted $p$. Intuitively, $\rho_T(p)$ and $p$ should be close to each other; we capture this using the notion of calibration error $C_T^p$ for $p \geq 1$, which corresponds to a weighted $\ell_p$ distance between the $\rho_T(i/N)$ and the predicted probabilities $i/N$; typically one assumes that $p = 1$ or $p = 2$. To simplify notation, we write $C_T$ when the exact $p$ is unambiguous.

Definition 3. We say that $F^{cal}$ is an $(\epsilon, \ell_p)$-calibrated algorithm with resolution $1/N$ if $\limsup_{T \to \infty} C_T^p \leq \epsilon$ a.s.

There exists a vast literature on calibration in the online setting (Cesa-Bianchi and Lugosi 2006), which is primarily concerned with constructing calibrated predictions $p_t \in [0, 1]$ of a binary outcome $y_t \in \{0, 1\}$ based solely on the past sequence $y_1, \dots, y_{t-1}$. Surprisingly, this is possible even when the $y_t$ are chosen adversarially, by reducing the problem to internal regret minimization relative to $N + 1$ experts with losses $(y_t - i/N)^2$ and proposed predictions $i/N$ for $i \in [N]_0$. All such algorithms are randomized; hence our results will hold almost surely (a.s.). See Chapter 4 of Cesa-Bianchi and Lugosi (2006) for details.
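The calibration error above is straightforward to compute from a stream of grid-valued predictions; the snippet below is a small illustrative sketch (the function name and arguments are ours), and its usage example reproduces the point that constantly predicting 0.5 on an alternating sequence is calibrated yet uninformative.

```python
import numpy as np

def calibration_error(preds, outcomes, N, p=1):
    """Weighted l_p calibration error C_T^p for predictions on the grid
    {0, 1/N, ..., 1}, as in Definition 3 (function name and arguments ours)."""
    preds, outcomes = np.asarray(preds, float), np.asarray(outcomes, float)
    T = len(preds)
    err = 0.0
    for i in range(N + 1):
        mask = np.isclose(preds, i / N)     # rounds where we predicted i/N
        n_i = mask.sum()
        if n_i == 0:
            continue
        rho = outcomes[mask].mean()         # empirical frequency rho_T(i/N)
        err += (n_i / T) * abs(rho - i / N) ** p
    return err

# Constantly predicting 0.5 on an alternating sequence is calibrated (C_T ~ 0)
# yet carries no information about the next outcome, as discussed above.
y = np.tile([0, 1], 500)
print(calibration_error(np.full(1000, 0.5), y, N=10))  # ~0.0
```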
## Online Recalibration

Unfortunately, existing online calibration methods are not directly applicable in real-world settings. For one, they do not take into account covariates $x_t$ that might be available to improve the prediction of $y_t$. As a consequence, they cannot produce accurate forecasts: for example, they would constantly predict 0.5 on a sequence 0101... formed by alternating 0s and 1s. To address these shortcomings, we define here a new problem called online recalibration, in which the task is to transform a sequence of uncalibrated forecasts $p_t^F$ into predictions $p_t$ that are calibrated and almost as accurate as the original $p_t^F$. The forecasts $p_t^F$ may come from any existing machine learning system $F$; our methods treat it as a black box and preserve its favorable convergence properties.

Formally, we define the online recalibration task as a generalization of the classical online optimization framework (Shalev-Shwartz 2007; Cesa-Bianchi and Lugosi 2006). At every step $t = 1, 2, \dots$:

1. Nature reveals features $x_t \in \mathbb{R}^d$.
2. Forecaster $F$ predicts $p_t^F = \sigma(w_{t-1} \cdot x_t) \in [0, 1]$.
3. A recalibration algorithm $A$ produces a calibrated probability $p_t = A(p_t^F) \in [0, 1]$.
4. Nature reveals the label $y_t \in \{0, 1\}$; $F$ incurs a loss of $\ell(y_t, p_t)$, where $\ell : [0, 1] \times \{0, 1\} \to \mathbb{R}^+$ is convex in $p_t$ for every $y_t$.
5. $F$ chooses $w_{t+1}$; $A$ updates itself based on $y_t$.

Here, $\sigma$ is a transfer function chosen such that the task is convex in $w_t$. In the medical diagnosis example, $x_t$ represents medical or genomic features for patient $t$; we use feature weights $w_t$ to predict the probability $p_t^F$ that the patient is ill; the true outcome is encoded by $y_t$. We would like $A$ to produce $p_t$ that are accurate and well calibrated in the following sense.

Definition 4. We say that $A$ is an $(\epsilon, \ell_{cal})$-accurate online recalibration algorithm for the loss $\ell_{acc}$ if (a) the forecasts $p_t = A(p_t^F)$ are $(\epsilon, \ell_{cal})$-calibrated and (b) the regret of the $p_t$ with respect to the $p_t^F$ is a.s. small in terms of $\ell_{acc}$:
$$\limsup_{T \to \infty}\ \frac{1}{T}\sum_{t=1}^T \big(\ell_{acc}(y_t, p_t) - \ell_{acc}(y_t, p_t^F)\big) \leq \epsilon.$$

## Algorithms for Online Recalibration

Next, we propose an algorithm for performing online probability recalibration; we refer to our approach as a meta-algorithm because it repeatedly invokes a regular online calibration algorithm as a black-box subroutine. Algorithm 1 outlines this procedure. At a high level, Algorithm 1 partitions the uncalibrated forecasts $p_t^F$ into $M$ buckets/intervals $\mathcal{I} = \{[0, \frac{1}{M}), \dots, [\frac{M-1}{M}, 1]\}$; it trains an independent instance of $F^{cal}$ on the data $\{(p_t^F, y_t) \mid p_t^F \in I_j\}$ belonging to each bucket $I_j \in \mathcal{I}$; at prediction time, it calls the instance of $F^{cal}$ associated with the bucket of the uncalibrated forecast $p_t^F$.

Algorithm 1: Online Recalibration.
Require: an online calibration subroutine $F^{cal}$ and a number of buckets $M$.

1. Let $\mathcal{I} = \{[0, \frac{1}{M}), \dots, [\frac{M-1}{M}, 1]\}$ be a set of intervals that partition $[0, 1]$.
2. Let $\mathcal{F} = \{F^{cal}_j \mid j = 0, \dots, M-1\}$ be a set of $M$ independent instances of $F^{cal}$.
3. For $t = 1, 2, \dots$:
4. Observe the uncalibrated forecast $p_t^F$.
5. Let $I_j \in \mathcal{I}$ be the interval containing $p_t^F$.
6. Let $p_t$ be the forecast of $F^{cal}_j$.
7. Output $p_t$. Observe $y_t$ and pass it to $F^{cal}_j$.

Algorithm 1 works because a calibrated predictor is at least as accurate as any constant predictor; in particular, each subroutine $F^{cal}_j$ is at least as accurate as the constant prediction $j/M$, which is also approximately equal to $p_t^F$ whenever $F^{cal}_j$ is called. Thus, each $F^{cal}_j$ is as accurate as its input sequence of $p_t^F$. One can then show that if each $F^{cal}_j$ is accurate and calibrated, then so is their aggregate, Algorithm 1. The rest of this section provides a formal version of this argument; due to space limitations, we defer most of the full proofs to the appendix.
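Since Algorithm 1 only touches its subroutine through prediction and update calls, it is easy to sketch in code. The snippet below is a minimal, illustrative rendition: the class names and the predict()/update(y) interface are our assumptions, and the toy RunningMean subroutine is a placeholder rather than a calibrated forecaster (a sketch of one appears later).

```python
import numpy as np

class RunningMean:
    """Toy stand-in subroutine used only to exercise the interface below;
    it is NOT the adversarially calibrated F^cal assumed by the paper."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def predict(self):
        return self.total / self.count if self.count else 0.5
    def update(self, y):
        self.total += y
        self.count += 1

class OnlineRecalibrator:
    """Sketch of Algorithm 1: bucket the uncalibrated forecasts p_t^F into M
    intervals and delegate each bucket to an independent online calibration
    subroutine F^cal_j exposing predict() and update(y)."""
    def __init__(self, make_subroutine, M=10):
        self.M = M
        self.subroutines = [make_subroutine() for _ in range(M)]
        self._last_bucket = None

    def predict(self, p_uncal):
        j = min(int(p_uncal * self.M), self.M - 1)  # bucket of p_t^F
        self._last_bucket = j
        return self.subroutines[j].predict()        # calibrated forecast p_t

    def update(self, y):
        # Pass the observed label only to the subroutine that was just called.
        self.subroutines[self._last_bucket].update(y)

# Illustrative usage on a tiny stream of (label, uncalibrated forecast) pairs.
recal = OnlineRecalibrator(RunningMean, M=10)
for y, p_f in [(1, 0.72), (0, 0.31), (1, 0.68), (1, 0.74)]:
    p = recal.predict(p_f)
    recal.update(y)
```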
### Calibration and Accuracy of Online Recalibration

Notation. We define the calibration error of $F^{cal}_j$ and of Algorithm 1 at the value $i/N$ as, respectively,
$$C^{(j)}_{T,i} = \left|\rho^{(j)}_T(i/N) - \frac{i}{N}\right|\,\frac{1}{T_j}\sum_{t=1}^T \mathbb{I}^{(j)}_{t,i} \qquad \text{and} \qquad C_{T,i} = \left|\rho_T(i/N) - \frac{i}{N}\right|\,\frac{1}{T}\sum_{t=1}^T \mathbb{I}_{t,i},$$
where $\mathbb{I}_{t,i} = \mathbb{I}\{p_t = i/N\}$. Terms marked with a $(j)$ denote the restriction of the usual definition to the inputs of subroutine $F^{cal}_j$, and $T_j$ denotes the number of calls to instance $F^{cal}_j$ (see the appendix for details). We write the calibration losses of $F^{cal}_j$ and of Algorithm 1 as $C^{(j)}_T = \sum_{i=0}^N C^{(j)}_{T,i}$ and $C_T = \sum_{i=0}^N C_{T,i}$.

Assumptions. In this section, we assume that the subroutine $F^{cal}$ used in Algorithm 1 is $(\epsilon, \ell_1)$-calibrated and that $C^{(j)}_{T_j} \leq R_{T_j} + \epsilon$ uniformly, where $R_{T_j} = o(1)$ as $T_j \to \infty$. This also implies $\ell_p$-calibration (by continuity of $\ell_p$), albeit with different rates $R_{T_j}$ and a different $\epsilon$. Abernethy, Bartlett, and Hazan (2011) introduce such an $(\epsilon, \ell_1)$-calibrated subroutine; we also provide proofs for the $\ell_2$ loss in the appendix. Crucially, we assume that the loss $\ell$ used for measuring accuracy is proper and bounded, with $\ell(\cdot, i/N) < B$ for $i \in [N]_0$; since the set of predictions is finite, this is a mild requirement. Finally, we make additional continuity assumptions on $\ell$ in Lemma 2.

Recalibration with proper losses. Surprisingly, not every loss $\ell$ admits a recalibration procedure. Consider, for example, the continuously repeating sequence $001001001\dots$ of $y_t$'s. A calibrated forecaster must converge to the constant prediction $1/3$, which has an $\ell_1$ loss of about 0.44; however, predicting 0 for all $t$ has an $\ell_1$ loss of $1/3 < 0.44$. Thus we cannot recalibrate this sequence and also remain equally accurate under the $\ell_1$ loss. The same argument applies to batch recalibration (e.g. Platt scaling): we only need to assume that $y_t \sim \mathrm{Ber}(1/3)$ i.i.d. However, recalibration is possible for a very large class of proper losses.
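The $\ell_1$ counter-example above is easy to verify numerically; the snippet below is an illustrative check (not code from the paper) comparing the calibrated constant prediction 1/3 with the constant prediction 0 under the $\ell_1$ loss and under the proper squared loss.

```python
import numpy as np

# Repeating sequence 001001001...: a calibrated forecaster converges to the
# constant prediction 1/3, but under the l1 loss it is beaten by the constant
# prediction 0, so l1-recalibration cannot preserve accuracy.
y = np.tile([0, 0, 1], 1000).astype(float)

l1_calibrated = np.abs(y - 1 / 3).mean()   # ~0.444
l1_constant0 = np.abs(y - 0.0).mean()      # ~0.333  (better than calibrated)

# Under a proper loss such as the squared (Brier) loss, the calibrated
# prediction 1/3 is at least as good as any constant prediction.
l2_calibrated = ((y - 1 / 3) ** 2).mean()  # ~0.222  (best constant)
l2_constant0 = ((y - 0.0) ** 2).mean()     # ~0.333

print(l1_calibrated, l1_constant0, l2_calibrated, l2_constant0)
```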
Establishing this fact relies on the following key technical lemma.

Lemma 1. If $\ell$ is a proper loss bounded by $B > 0$, then an $(\epsilon, \ell_1)$-calibrated $F^{cal}$ a.s. has small internal regret w.r.t. $\ell$ and satisfies, uniformly over time $T$, the bound
$$R^{int}_T = \max_{i \neq j}\ \frac{1}{T}\sum_{t=1}^T \mathbb{I}_{p_t = i/N}\,\big(\ell(y_t, i/N) - \ell(y_t, j/N)\big) \leq 2B\,(R_T + \epsilon).$$

According to Lemma 1, if a set of predictions is calibrated, then we never want to retrospectively switch to predicting $p_2$ at the times when we predicted $p_1$. Intuitively, this makes sense: if predictions are calibrated, then $p_1$ should minimize the total (or average) loss $\sum_{t : p_t = p_1} \ell(y_t, p)$ over the times $t$ when $p_1$ was predicted (at least better so than $p_2$). However, our $\ell_1$ counter-example above shows that this intuition does not hold for every loss; we need to enforce it explicitly, which amounts to assuming that $\ell$ is proper, i.e. that $p \in \arg\min_q \mathbb{E}_{y \sim \mathrm{Ber}(p)}\,\ell(y, q)$.

Accuracy and calibration. An important consequence of Lemma 1 is that a calibrated algorithm has vanishing regret relative to any fixed prediction (since minimizing internal regret also minimizes external regret). Using this fact, it becomes straightforward to establish that Algorithm 1 is at least as accurate as the baseline forecaster $F$.

Lemma 2 (Recalibration preserves accuracy). Consider Algorithm 1 with parameters $M \geq N > 1/\epsilon$, and let $\ell$ be a bounded proper loss for which

1. $\ell(y_t, p) \leq \ell(y_t, j/M) + B/M$ for $p \in [j/M, (j+1)/M)$;
2. $\ell(y_t, p) \leq \ell(y_t, i/N) + B/N$ for $p \in [i/N, (i+1)/N)$.

Then the recalibrated $p_t$ a.s. have vanishing $\ell$-loss regret relative to the $p_t^F$, and uniformly over $T$ we have
$$\frac{1}{T}\sum_{t=1}^T \ell(y_t, p_t) - \frac{1}{T}\sum_{t=1}^T \ell(y_t, p_t^F) < \frac{NB}{T}\sum_{j=1}^M T_j R_{T_j} + 3B\epsilon.$$

Proof (sketch). When $p_t$ is the output of a given $F^{cal}_j$, we have $\ell(y_t, p_t^F) \approx \ell(y_t, j/M) \approx \ell(y_t, i_j/N)$ (since $p_t^F$ lies in the $j$-th bucket, and since $M \geq N$ is a sufficiently high resolution). Since $F^{cal}_j$ is calibrated, Lemma 1 implies that the $p_t$ have vanishing regret relative to the fixed prediction $i_j/N$; aggregating over $j$ yields the result.

The assumptions of Lemma 2 essentially require that $\ell$ be Lipschitz with constant $B$, which holds e.g. for the convex bounded losses studied in online learning. Our assumption is slightly more general, since $\ell$ may also be discontinuous (like the misclassification loss). When $\ell$ is unbounded (like the log-loss), its values at the baseline algorithm's predictions must be bounded away from infinity.

Next, we also establish that combining the predictions of the $F^{cal}_j$ preserves their calibration.

Lemma 3 (Preserving calibration). If each $F^{cal}_j$ is $(\epsilon, \ell_p)$-calibrated, then Algorithm 1 is also $(\epsilon, \ell_p)$-calibrated, and the bound
$$C_T \leq \sum_{j=1}^M \frac{T_j}{T}\,R_{T_j} + \epsilon$$
holds uniformly over $T$.

These two lemmas lead to our main claim: that Algorithm 1 solves the online recalibration problem.

Theorem 1. Let $F^{cal}$ be an $(\epsilon/3B, \ell_1)$-calibrated online subroutine with resolution $N \geq 3B/\epsilon$, and let $\ell$ be a proper loss satisfying the assumptions of Lemma 2. Then Algorithm 1 with parameters $F^{cal}$ and $M = N$ is an $\epsilon$-accurate online recalibration algorithm for the loss $\ell$.

Proof. By Lemma 3, Algorithm 1 is $(\epsilon/3B, \ell_1)$-calibrated, and by Lemma 2 its regret w.r.t. the raw $p_t^F$ tends to a limit smaller than $3B/N \leq \epsilon$. Hence, Theorem 1 follows.

In the appendix, we provide a detailed argument for how $\ell$ can be chosen to be the misclassification loss. Interestingly, it also turns out that if $\ell$ is not a proper loss, then recalibration is not possible for some $\epsilon > 0$.

Theorem 2. If $\ell$ is not proper, then no algorithm achieves recalibration w.r.t. $\ell$ for all $\epsilon > 0$.

The proof of this theorem is a slight generalization of the counter-example provided for the $\ell_1$ loss. Interestingly, it holds equally in the online and batch settings. To our knowledge, it is one of the first characterizations of the limitations of recalibration algorithms.

Convergence rates. Next, we are interested in the rate of convergence $R_T$ of the calibration error $C_T$ of Algorithm 1. For most online calibration subroutines $F^{cal}$, we have $R_T \leq f(\epsilon)/\sqrt{T}$ for some $f(\epsilon)$. In such cases, we can further bound the calibration error in Lemma 3 as
$$\sum_{j=1}^M \frac{T_j}{T}\,R_{T_j} \leq \sum_{j=1}^M \frac{\sqrt{T_j}}{T}\,f(\epsilon) \leq \frac{f(\epsilon)}{\sqrt{\epsilon T}},$$
where in the second inequality we set the $T_j$ to be equal (with $M = 1/\epsilon$ buckets). Thus, our recalibration procedure introduces an overhead of $1/\sqrt{\epsilon}$ in the convergence rate of the calibration error $C_T$ and of the regret in Lemma 2. In addition, Algorithm 1 requires $1/\epsilon$ times more memory (we run $1/\epsilon$ instances of $F^{cal}_j$), but it has the same per-iteration runtime (we activate one $F^{cal}_j$ per step). Table 1 summarizes the convergence rates of Algorithm 1 when the subroutine is either the method of Abernethy, Bartlett, and Hazan (2011) based on Blackwell approachability or the simpler but slower approach based on internal regret minimization (Mannor and Stoltz 2010).

Table 1: Time and space complexity and convergence rate of Algorithm 1 using different subroutines.

| Subroutine | Regret Minimization | Blackwell Approachability |
| --- | --- | --- |
| Time / step | $O(1/\epsilon)$ | $O(\log(1/\epsilon))$ |
| Space / step | $O(1/\epsilon^2)$ | $O(1/\epsilon^2)$ |
| Calibration | $O(1/\sqrt{\epsilon T})$ | $O(1/\sqrt{\epsilon T})$ |
| Advantage | Simplicity | Efficiency |
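For concreteness, the following is a simplified sketch of a regret-minimization calibration subroutine over the grid {0, 1/N, ..., 1}: it plays a fixed point of the positive-part internal-regret matrix and updates regrets with the squared loss. This is one standard construction in the spirit of the internal-regret approach cited above; the class name, the lazy power iteration, and the hyperparameters are our own simplifications, not the paper's exact subroutine.

```python
import numpy as np

class InternalRegretCalibrator:
    """Simplified sketch of a randomized calibrated forecaster over the grid
    {0, 1/N, ..., 1}: play a stationary distribution of the positive-part
    internal-regret matrix, then sample a prediction from it."""

    def __init__(self, N=10, seed=0):
        self.N = N
        self.R = np.zeros((N + 1, N + 1))      # cumulative regrets R[i, j]
        self.p = np.ones(N + 1) / (N + 1)      # distribution played this round
        self.rng = np.random.default_rng(seed)

    def predict(self):
        R_plus = np.maximum(self.R, 0.0)
        np.fill_diagonal(R_plus, 0.0)
        mu = R_plus.sum(axis=1).max()
        if mu <= 0.0:
            p = np.ones(self.N + 1) / (self.N + 1)
        else:
            # Row-stochastic matrix with off-diagonal entries R^+[i, j] / mu;
            # a stationary distribution p satisfies the fixed-point condition
            # p_i * sum_j R^+[i, j] = sum_j p_j * R^+[j, i] for every i.
            P = R_plus / mu
            np.fill_diagonal(P, 1.0 - P.sum(axis=1))
            p = np.ones(self.N + 1) / (self.N + 1)
            for _ in range(200):               # lazy power iteration
                p = 0.5 * p + 0.5 * (p @ P)
            p = p / p.sum()
        self.p = p
        i = self.rng.choice(self.N + 1, p=p)   # randomized play
        return i / self.N

    def update(self, y):
        grid = np.arange(self.N + 1) / self.N
        losses = (y - grid) ** 2               # squared losses of the grid points
        # R[i, j] accumulates p_t(i) * (loss of i - loss of j), as in Definition 1.
        self.R += self.p[:, None] * (losses[:, None] - losses[None, :])
```

An object like this exposes the same hypothetical predict()/update(y) interface assumed in the earlier Algorithm 1 sketch, so it can be dropped in as the per-bucket subroutine.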
Multiclass prediction. In the multiclass setting, we seek a recalibrator $A : \Delta_{K-1} \to \Delta_{K-1}$ producing calibrated probabilities $p_t \in \Delta_{K-1}$ that target class labels $y_t \in \{1, 2, \dots, K\}$. In analogy with binary recalibration, we may discretize the input space $\Delta_{K-1}$ into a $K$-dimensional grid and train a classical multiclass calibration algorithm $F^{cal}$ (Cesa-Bianchi and Lugosi 2006) on each subset of the $p_t^F$ associated with a cell. Just as in the binary setting, a classical calibration method $F^{cal}_j$ predicts calibrated $p_t \in \Delta_{K-1}$ based solely on the past multiclass labels $y_1, y_2, \dots, y_{t-1}$; it can serve as a subroutine within Algorithm 1. However, in the multiclass setting this construction requires $O(1/\epsilon^K)$ running time per iteration and $O(1/\epsilon^{2K})$ memory, and it has a convergence rate of $O(1/\sqrt{\epsilon^{2K} T})$. The exponential dependence on $K$ cannot be avoided, since the calibration problem is fundamentally PPAD-hard (Hazan and Kakade 2012). However, there may exist practical workarounds inspired by popular heuristics for the batch setting, such as one-vs-all classification (Zadrozny and Elkan 2002).

## Experiments

We now study Algorithm 1 empirically. Algorithm 1's subroutine is the standard internal regret minimization approach of Cesa-Bianchi and Lugosi ("REGMIN"). We measure calibration and accuracy in the $\ell_2$ norm.

[Figure 2: We compare predictions from an uncalibrated expert $F$ (blue), Algorithm 1 (green), and REGMIN (red) on sequences $y_t \sim \mathrm{Ber}(0.5)$ (plots a, b) and on adversarially chosen $y_t$ (plots c, d).]

Predicting a Bernoulli sequence. We start with a simple setting where we observe an i.i.d. sequence of $y_t \sim \mathrm{Ber}(p)$ as well as uncalibrated predictions $(p_t^F)_{t=1}^T$ that equal 0.3 whenever $y_t = 0$ and 0.7 whenever $y_t = 1$. The forecaster $F$ is essentially a perfect predictor, but it is not calibrated. In Figure 2, we compare the performance of REGMIN (which does not observe $p_t^F$) to Algorithm 1 and to the uncalibrated predictor $F$. Both methods achieve low calibration error after about 300 observations, while the expert is clearly uncalibrated (Figure 2b); however, REGMIN is a terrible predictor: it always forecasts $p_t = 0.5$ and therefore has high $\ell_2$ loss (Figure 2a). Algorithm 1, on the other hand, makes perfect predictions by recalibrating the input $p_t^F$.

Prediction against an adversary. Next, we test the ability of our method to achieve calibration on adversarial input. At each step $t$, we choose $y_t = 0$ if $p_t > 0.5$ and $y_t = 1$ otherwise; we sample $p_t^F \sim \mathrm{Ber}(0.5)$, which is essentially a form of noise. In Figure 2 (c, d), we see that Algorithm 1 successfully ignores the noisy forecaster $F$ and instead quickly converges to making calibrated (albeit not very accurate) predictions; it reduces to REGMIN.

Natural language understanding. We used Algorithm 1 to recalibrate a state-of-the-art question answering system (Berant and Liang 2014) on the popular Free917 dataset (641 training and 276 testing examples). We trained the system on the training set as described in Berant et al. (2013) and then calibrated probabilities using Algorithm 1 in one pass over first the training and then the testing examples. This setup emulates a pre-trained system that further improves itself from user feedback. Figure 3 (left) compares our predicted $p_t$ to the raw system probabilities $p_t^F$ via calibration curves. Given pairs of predictions and outcomes $(p_t, y_t)$, we compute for each of $N$ buckets $B \in \{[\frac{i}{N}, \frac{i+1}{N}) \mid 0 \leq i < N\}$ the averages $\bar{p}_B = \sum_{t : p_t \in B} p_t / N_B$ and $\bar{y}_B = \sum_{t : p_t \in B} y_t / N_B$, where $N_B = |\{t : p_t \in B\}|$. A calibration curve plots $\bar{y}_B$ as a function of $\bar{p}_B$; perfect calibration corresponds to a straight line. The calibration curves indicate that the $p_t^F$ are poorly calibrated in buckets below 0.9, while Algorithm 1 fares better. Figure 3a confirms that our accuracy (measured by the $\ell_2$ loss) tracks that of the baseline forecaster.
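The calibration curves used above are simple binned averages; a minimal sketch of their computation follows (the function name and return format are illustrative).

```python
import numpy as np

def calibration_curve(preds, outcomes, n_buckets=10):
    """Binned averages (p_bar_B, y_bar_B, N_B) used to draw a calibration
    curve; perfect calibration puts every point on the diagonal y = p."""
    preds, outcomes = np.asarray(preds, float), np.asarray(outcomes, float)
    points = []
    for i in range(n_buckets):
        lo, hi = i / n_buckets, (i + 1) / n_buckets
        if i == n_buckets - 1:
            mask = (preds >= lo) & (preds <= hi)  # include p = 1 in the last bucket
        else:
            mask = (preds >= lo) & (preds < hi)
        if mask.any():
            points.append((preds[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return points
```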
Medical diagnosis. Our last task is predicting the risk of type 1 diabetes from genomic data. We use the genotypes of 3,443 subjects (1,963 cases and 1,480 controls) over 447,221 SNPs (The Wellcome Trust Case Control Consortium 2007), with alleles encoded as 0, 1, 2 (major, heterozygous, and minor homozygous, respectively). We use an online $\ell_1$-regularized linear support vector machine (SVM) to predict outcomes one patient at a time, and report performance for each $t \in [T]$. Uncalibrated probabilities are normalized raw SVM scores $s_t$, i.e. $p_t^F = (s_t + m_t)/2m_t$, where $m_t = \max_{1 \leq r \leq t} |s_r|$. Figure 3 (right) measures calibration after observing all the data. Raw scores are not well calibrated outside of the interval [0.4, 0.6]; recalibration makes them almost perfectly calibrated. Figure 3 further shows that the calibration error of Algorithm 1 is consistently lower throughout the entire learning process, while its accuracy approaches to within 0.01 of that of the $p_t^F$.

[Figure 3: Algorithm 1 (green) is used to recalibrate probabilities from a question answering system (left) and a medical diagnosis system (right; both in blue). We track prediction error (a) and calibration error (b) over time; plot (c) displays calibration curves after seeing all the data; circle sizes are proportional to the number of predictions in the corresponding bucket.]

## Previous Work

Calibrated probabilities are widely used as confidence measures in the context of binary classification. Such probabilities are obtained via recalibration methods, of which Platt scaling (Platt 1999) and isotonic regression (Niculescu-Mizil and Caruana 2005) are by far the most popular. Recalibration methods also possess multiclass extensions, which typically involve training multiple one-vs-all predictors (Zadrozny and Elkan 2002), as well as extensions to ranking losses (Menon et al. 2012), combinations of estimators (Zhong and Kwok 2013), and structured prediction (Kuleshov and Liang 2015).

In the online setting, the calibration problem was formalized by Dawid (1982); online calibration techniques were first proposed by Foster and Vohra (1998). Existing algorithms are based on internal regret minimization (Cesa-Bianchi and Lugosi 2006) or on Blackwell approachability (Foster 1997); recently, these approaches were shown to be closely related (Abernethy, Bartlett, and Hazan 2011; Mannor and Stoltz 2010). Recent work has shown that online calibration is PPAD-hard (Hazan and Kakade 2012).

The concepts of calibration and sharpness were first formalized in the statistics literature (Murphy 1973; Gneiting, Balabdaoui, and Raftery 2007). These metrics are captured by a class of proper losses and can be used both for evaluating (Buja, Stuetzle, and Shen 2005; Brocker 2009) and for constructing (Kuleshov and Liang 2015) calibrated forecasts.

## Discussion and Conclusion

Online vs. batch. Algorithm 1 can be understood as a direct analogue of a simple density estimation technique called the histogram method. This technique divides the $p_t^F$ into $N$ bins and estimates the average $y$ in each bin. By the i.i.d. assumption, the output probabilities will be calibrated; sharpness will be determined by the bin width. Note that by Hoeffding's inequality, the average in a given bin will converge at a rate of $O(1/\sqrt{T_j})$ (Devroye, Györfi, and Lugosi 1996). This is faster than the $O(1/\sqrt{\epsilon T_j})$ rate of Abernethy, Bartlett, and Hazan (2011) and suggests that calibration is more challenging in the online setting.

Checking rules. An alternative way to avoid uninformative predictions (e.g. 0.5 on 0101...) is via the framework of checking rules (Cesa-Bianchi and Lugosi 2006). However, these rules must be specified in advance (e.g. the pattern 0101... must be known), and this framework does not explicitly admit covariates $x_t$. Our approach, on the other hand, recalibrates any $x_t, y_t$ in a black-box manner.
Defensive forecasting. Vovk, Takemura, and Shafer (2005) developed simultaneously calibrated and accurate online learning methods under the notion of weak calibration (Abernethy and Mannor 2011). We use strong calibration, which implies weak calibration, although it requires different (e.g. randomized) algorithms. Vovk et al. also use a different notion of precision: their algorithm ensures a small difference between the average predicted $p_t$ and the true $y_t$ at times $t$ when $p_t \approx p^*$ and $x_t \approx x^*$, for any $p^*, x^*$. The relation $\approx$ is determined by a user-specified kernel (over, e.g., sentences or genomes $x_t$). Our approach, on the other hand, does not require specifying a kernel, and it matches the accuracy of any given baseline forecaster; this may be simpler in some settings. Interestingly, we arrive at the same rates of convergence under different assumptions.

Conclusion. Current recalibration techniques implicitly require that the data be distributed i.i.d., which potentially makes them unreliable when this assumption does not hold. In this work, we introduced the first recalibration technique that provably recalibrates any existing forecaster with a vanishingly small degradation in accuracy. This method does not make i.i.d. assumptions and is provably calibrated even on adversarial input. We analyzed our method's theoretical properties and showed excellent empirical performance on several real-world benchmarks, where the method converges quickly and retains good accuracy.

Acknowledgements. This work is supported by the NSF (grant #1649208) and by the Future of Life Institute (grant 2016-158687).

## References

Abernethy, J. D., and Mannor, S. 2011. Does an efficient calibrated forecasting strategy exist? In COLT 2011 - The 24th Annual Conference on Learning Theory, 809-812.
Abernethy, J.; Bartlett, P. L.; and Hazan, E. 2011. Blackwell approachability and no-regret learning are equivalent. In COLT 2011 - The 24th Annual Conference on Learning Theory, 27-46.
Berant, J., and Liang, P. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1415-1425.
Berant, J.; Chou, A.; Frostig, R.; and Liang, P. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP 2013, 1533-1544.
Brocker, J. 2009. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society 135(643):1512-1519.
Buja, A.; Stuetzle, W.; and Shen, Y. 2005. Loss functions for binary class probability estimation and classification: Structure and applications.
Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. New York, NY, USA: Cambridge University Press.
Dawid, A. P. 1982. The well-calibrated Bayesian. Journal of the American Statistical Association 77(379):605-610.
Devroye, L.; Györfi, L.; and Lugosi, G. 1996. A Probabilistic Theory of Pattern Recognition. Applications of Mathematics. New York, Berlin, Heidelberg: Springer.
Foster, D. P., and Vohra, R. V. 1998. Asymptotic calibration.
Foster, D. P. 1997. A proof of calibration via Blackwell's approachability theorem. Discussion Papers 1182, Northwestern University.
Gneiting, T.; Balabdaoui, F.; and Raftery, A. E. 2007. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B 69(2):243-268.
Hazan, E., and Kakade, S. M. 2012. (Weak) calibration is computationally hard. In COLT 2012 - The 25th Annual Conference on Learning Theory, 3.1-3.10.
Jiang, X.; Osl, M.; Kim, J.; and Ohno-Machado, L. 2012. Calibrating predictive model estimates to support personalized medicine. JAMIA 19(2):263-274.
Kuleshov, V., and Liang, P. 2015. Calibrated structured prediction. In Advances in Neural Information Processing Systems (NIPS).
Mannor, S., and Stoltz, G. 2010. A geometric proof of calibration. Mathematics of Operations Research 35(4):721-727.
Menon, A. K.; Jiang, X.; Vembu, S.; Elkan, C.; and Ohno-Machado, L. 2012. Predicting accurate probabilities with a ranking loss. In Proceedings of the 29th International Conference on Machine Learning.
Murphy, A. H. 1973. A new vector partition of the probability score. Journal of Applied Meteorology 12(4):595-600.
Nguyen, K., and O'Connor, B. 2015. Posterior calibration and exploratory analysis for natural language processing models. CoRR abs/1508.05154.
Niculescu-Mizil, A., and Caruana, R. 2005. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05).
Platt, J. C. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, 61-74. MIT Press.
Shalev-Shwartz, S. 2007. Online Learning: Theory, Algorithms, and Applications. PhD thesis, Hebrew University.
The Wellcome Trust Case Control Consortium. 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145):661-678.
Vovk, V.; Takemura, A.; and Shafer, G. 2005. Defensive forecasting. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS 2005).
Yu, D.; Li, J.; and Deng, L. 2011. Calibration of confidence measures in speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 19(8):2461-2473.
Zadrozny, B., and Elkan, C. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Eighth ACM Conference on Knowledge Discovery and Data Mining, 694-699.
Zhong, L. W., and Kwok, J. T. 2013. Accurate probability calibration for multiple classifiers. In IJCAI 2013, 1939-1945. AAAI Press.