# testing_calibration_in_nearlylinear_time__12a7071d.pdf

Testing Calibration in Nearly-Linear Time

Lunjia Hu Harvard University lunjia@alumni.stanford.edu

Arun Jambulapati University of Michigan jmblpati@gmail.com

Kevin Tian University of Texas at Austin kjtian@cs.utexas.edu

Chutong Yang University of Texas at Austin cyang98@utexas.edu

In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23a], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given n draws from a distribution D on (predictions, binary outcomes), our goal is to distinguish between the cases where D is perfectly calibrated or ε-far from calibration. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly-structured graph, and design an exact dynamic programming-based solver for it which runs in time O(n log2(n)), and solves the calibration testing problem informationtheoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring Ω(nω) time, where ω > 2 is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem improving upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.

1 Introduction

Probabilistic predictions are at the heart of modern data science. In domains as wide-ranging as forecasting (e.g. predicting the chance of rain from meteorological data [MW84, Mur98]), medicine (e.g. assessing the likelihood of disease [Doi07]), computer vision (e.g. assigning confidence values for categorizing images [VDDP17]), and more (e.g. speech recognition [AAA+16] and recommender systems [RRSK11]), prediction models have by now become essential components of the decisionmaking pipeline. Particularly in the context of critical, high-risk use cases, the interpretability of prediction models is therefore paramount in downstream applications. That is, how do we assign meaning to the predictions our model gives us, especially when the model is uncertain?

We focus on perhaps the most ubiquitous form of prediction modeling: binary predictions, represented as tuples (v, y) in [0, 1] {0, 1} (where the v coordinate is our prediction of the likelihood of an event, and the y coordinate is the observed outcome). We model prediction-outcome pairs in the binary prediction setting by a joint distribution D over [0, 1] {0, 1}, fixed in the following discussion. In

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

this context, calibration of a predictor has emerged as a basic desideratum. A prediction-outcome distribution D is said to be calibrated if

E(v,y) D[y | v = t] = t for all t [0, 1]. (1)

That is, calibration asks that the outcome is 1 exactly 60% of the time, when the model returns a prediction v = 0.6. While calibration (or approximate variants thereof) is a relatively weak requirement on a meaningful predictor, as it can be achieved by simple models,1 it can still be significantly violated in practice. For example, interest in calibration in the machine learning community was spurred by [GPSW17], which observed that many modern deep learning models are far from calibrated. Moreover, variants of calibration have been shown to have strong postprocessing properties for fairness constraints and loss minimization [HKRR18, DKR+21, GKR+22], which has garnered renewed interest in calibration by the theoretical computer science and statistics communities.

The question of measuring the calibration of a distribution is subtle; even a calibrated distribution incurs measurement error due to sampling. For example, consider the expected calibration error, used in e.g. [NCH15, GPSW17, MDR+21a, RT21b] as a ground-truth measure of calibration:

ECE(D) := E(v,y) D E(v ,y) D [y | v = v] v .

Unfortunately, the empirical ECE is typically meaningless; if the marginal density of v is continuous, we will almost surely only observe a single sample with each v value. Further, [KF08] observed that ECE is discontinuous in v. In practice, binned variants of ECE are often used as a proxy, where a range of v is lumped together in the conditioning event. However, hyperparameter choices (e.g. the number of bins) can significantly affect the quality of binned ECE variants as a distance measure [KLM19, NDZ+19, MDR+21a].2 Moreover, as we explore in this paper, binned calibration measures inherently suffer from larger sample complexity-to-accuracy tradeoffs, and are less faithful to ground truth calibration notions in experiments than the calibration measures we consider.

Recently, [BGHN23a] undertook a systematic study of various measures of distance to calibration proposed in the literature. They proposed information-theoretic tractability in the prediction-only access (PA) model, where the calibration measure definition can only depend on the joint predictionoutcome distribution (rather than the features of training examples),3 as a desirable criterion for calibration measures. Correspondingly, [BGHN23a] introduced Definition 1 as a ground-truth notion for measuring calibration in the PA model, which we also adopt in this work.4

Definition 1 (Lower distance to calibration). Let D be a distribution over [0, 1] {0, 1}. The lower distance to calibration (LDTC) of D, denoted d CE(D), is defined by

d CE(D) := inf Π ext(D) E(u,v,y) Π |u v| ,

where ext(D) is all distributions Π over (u, v, y) [0, 1] [0, 1] {0, 1} satisfying the following.

The marginal distribution of (v, y) is D.

The marginal distribution (u, y) is perfectly calibrated, i.e. EΠ[y|u] = u.

Definition 1 has various beneficial aspects: it is convex in v, computable in the PA model, and (as shown by [BGHN23a]) polynomially-related to various other calibration measures, including some which require feature access, e.g. the distance to calibration (DTC, Eq. (1), [BGHN23a]). Roughly, the DTC of a distribution is the tightest lower bound on the ℓ1 distance between v and any calibrated function of the features, after taking features into account. The LDTC is the analog of this feature-aware measure of calibration when limited to the PA model, so it does not depend on features. We focus on Definition 1 as our ground-truth measure in the remainder of the paper.

1The predictor which ignores features and always return the population mean is calibrated, for example. 2For example, [NDZ+19] observed that, in their words, dramatic differences in bin sensitivity can occur depending on properties of the (distribution) at hand, a sentiment echoed by Section 5 of [MDR+21a]. 3This access model is practically desirable because it abstracts away the feature space, which can lead to significant memory savings when our goal is only to test the calibration of model predictions. Moreover, this matches conventions in the machine learning literature, e.g. loss functions are typically defined in the PA model. 4We note that [BGHN23a] introduced an upper distance to calibration, also defined in the PA model, which they showed is quadratically-related to the d CE in Definition 1. However, the upper distance does not satisfy basic properties such as continuity, making it less amenable to estimation and algorithm design.

1.1 Our results

We initiate the algorithmic study of the calibration testing problem, defined as follows. Definition 2 (Calibration testing). Let ε [0, 1]. We say algorithm A solves the ε-calibration testing problem with n samples, if given n i.i.d. draws from a distribution D over [0, 1] {0, 1}, A returns either yes or no and satisfies the following with probability 2

A returns no if d CE(D) ε.

A returns yes if d CE(D) = 0.

In this case, we also call A an ε-calibration tester.

To our knowledge, we are the first to formalize the calibration testing problem in Definition 2, which is natural from the perspective of property testing, an influential paradigm in statistical learning [Ron08, Ron09, Gol17]. In particular, there is an εn = Θ(n 1/2) so that it is informationtheoretically impossible to solve the εn-calibration testing problem from n samples (see Lemma 5), so a variant of Definition 2 with an exact distinguishing threshold between calibrated/uncalibrated is not tractable. Hence, Definition 2 only requires distinguishing distributions D which are clearly uncalibrated (parameterized by a threshold ε) from those which are perfectly calibrated.

We note that a variant of Definition 2 where d CE is replaced by variants of the ECE was recently proposed by [LHHD23]. However, due to the aforementioned discontinuity and binning choice issues which plague the ECE, [LHHD23] posed as an explicit open question whether an alternative calibration metric makes for a more appropriate calibration testing problem, motivating our Definition 2. Indeed, Proposition 9 of [LHHD23] shows that without smoothness assumptions on the data distribution, it is impossible to solve the ECE calibration testing problem from finite samples.6

Our first algorithmic contribution is a nearly-linear time algorithm for calibration testing.

Theorem 1. Let n N, and ε = Ω(εn), where εn = Θ(n 1/2) is minimal such that it is informationtheoretically possible to solve the εn-calibration testing problem (Definition 2) with n samples. There is an algorithm which solves the ε-calibration testing problem with n samples, running in time O(n log2(n)).

The lower bound on the acceptable range of εn in Theorem 1 is well-known, and recalled in Lemma 5 for completeness. Our main contribution is to prove the upper bound (i.e., achieving O(εn)-calibration testing) in Theorem 1 by designing a new algorithm for computing sm CE( b Dn), the smooth calibration error (Definition 3), an alternative calibration measure, of an empirical distribution b Dn. Definition 3 (Smooth calibration error). Let W be the set of Lipschitz functions w : [0, 1] [ 1, 1]. The smooth calibration error of distribution D over [0, 1] {0, 1}, denoted sm CE(D), is

sm CE(D) = sup w W

E(v,y) D[(y v)w(v)] .

It was shown in [BGHN23a] that sm CE(D) is a constant-factor approximation to d CE(D) for all D on [0, 1] {0, 1} (see Lemma 4). Additionally, the empirical sm CE admits a representation as a linear program with an O(n) O(n)-sized constraint matrix encoding Lipschitz constraints.7 Thus, [BGHN23a] proposed a simple procedure for estimating sm CE(D): draw n samples from D, and solve the associated linear program on the empirical distribution. While there have been significant recent runtime advances in the linear programming literature [LS14, CLS21, vd BLSS20, vd BLL+21], all state-of-the-art black-box linear programming algorithms solve linear systems involving the constraint matrix, which takes Ω(nω) time, where ω > 2.371 [WXXZ23] is the current exponent of

5As is standard in property testing problems, the success probability of either a calibration tester or a tolerant calibration tester can be boosted to 1 δ for any δ (0, 1) at a k = O(log( 1

δ )) overhead in the sample complexity. This is because we can independently call k copies of the tester and output the majority vote, which succeeds with probability 1 δ by Chernoff bounds, so we focus on δ = 1

3. 6The work [LHHD23] considered k-class prediction tasks, extending our focus on binary classification (k = 2), which we believe is an exciting future direction. However, their Proposition 9 holds even when k = 2. 7Formally, the number of constraints in the sm CE linear program is O(n2), but we show that in the hardconstrained setting, requiring that adjacent constraints are met suffices (see Lemma 1).

matrix multiplication. Even under the best-possible assumption that ω = 2, the strategy of exactly solving a linear program represents an Ω(n2) quadratic runtime barrier for calibration testing.

We bypass this barrier by noting that the sm CE linear program is highly-structured, and can be reformulated as minimum-cost flow on a planar graph. We believe this observation is already independently interesting, as it opens the door to using powerful software packages designed for efficiently solving flow problems to measure calibration in practice. Moreover, using recent theoretical breakthroughs in graph algorithms [DGG+22, CKL+22] as a black box, this observation readily implies an O(n polylog(n))-time algorithm for solving the smooth calibration linear program.

However, these aforementioned algorithms are quite complicated, and implementations in practice are not available, leaving their relevance to empirical calibration testing unclear at the moment. Motivated by this, in Section 2 we develop a custom solver for the minimum-cost flow reformulation of empirical smooth calibration, based on dynamic programming. Our theoretical runtime improvement upon [DGG+22, CKL+22] is by at least a large polylogarithmic factor, and moreover our algorithm is simple enough to implement in practice, where it attains faster runtimes than general-purpose commercial solvers on moderate or large dataset sizes, evaluated in Section 3.

We further define a tolerant variant of Definition 2 (see Definition 4), where we allow for error thresholds in both the yes and no cases; yes is the required answer when d CE(D) ε2, and no is required when d CE(D) ε1. Our algorithm in Theorem 1 continues to serve as an efficient tolerant calibration tester when ε1 4ε2, formally stated in Theorem 3. This constant-factor loss comes from a similar loss in the relationship between sm CE and d CE, see Lemma 4. We make the observation that a constant factor loss in the tolerant testing parameters is inherent following this strategy, via a lower bound in Lemma 13. Thus, even given infinite samples, computing the smooth calibration error cannot solve tolerant calibration testing all the way down to the information-theoretic threshold ε1 ε2. To develop an improved tolerant calibration tester, we directly show how to approximate the LDTC of an empirical distribution, our second main algorithmic contribution.

Theorem 2 (Informal, see Theorem 4, Corollary 3). Let n N, and let ε1 ε2 = Ω(εn), where εn = Θ(n 1/2) is minimal such that it is information-theoretically possible to solve the εn-calibration testing problem (Definition 4) with n samples. There is an algorithm which solves the (ε1, ε2)-tolerant calibration testing problem with n samples, running in time O( n log(n)

(ε1 ε2)2 ) = O(n2 log(n)).

While Theorem 2 is slower than Theorem 1, it directly approximates the LDTC, making it applicable to tolerant calibration testing. We mention that state-of-the-art black-box linear programming based solvers, while still applicable to (a discretizeation of) the empirical LDTC, require Ω(n2.5) time [vd BLL+21], even if ω = 2. This is because the constraint matrix for the ε-approximate empirical LDTC linear program has dimensions O( n

ε ) O(n), resulting in an 1

ε = Ω( n) overhead in the dimension of the decision variable. We prove Theorem 2 in Appendix C, where we use recent advances in minimax optimization [JT23] and a custom combinatorial rounding procedure to develop a faster algorithm, improving state-of-the-art linear programming runtimes by an Ω( n) factor.

In Appendix D, we complement our algorithmic results with lower bounds (Theorems 5, 6) on the sample complexity required to solve variants of the testing problem in Definition 2, when d CE is replaced with different calibration measures. For several widely-used distances in the machine learning literature, including binned and convolved variants of ECE [NCH15, BN23], we show that eΩ(ε 2.5) samples are required to the associated ε-calibration testing problem. This demonstrates a statistical advantage of our focus on d CE as our ground-truth notion for calibration testing.

We corroborate our theoretical findings with experimental evidence on real and synthetic data in Section 3. First, on a simple Bernoulli example, we show that d CE and sm CE testers are more reliable measures of calibration than a recently-proposed binned ECE variant. We then apply our sm CE tester to postprocessed neural network predictions to test their calibration levels, validating against the findings in [GPSW17]. Finally, we implement our method from Theorem 1 on our Bernoulli dataset, showing that it scales to high dimensions and runs faster than both a linear program solver from CVXPY for computing the empirical sm CE, as well as a commercial minimum-cost flow solver from Gurobi Optimization (combined with our reformulation in Lemma 2).8

8Our code is included in the supplementary material.

1.2 Our techniques

Theorems 1 and 2 follow from designing custom algorithms for approximating empirical linear programs associated with the sm CE and d CE of a sampled dataset b Dn := {(vi, yi)}i [n] i.i.d. D. In both cases, generalization bounds from [BGHN23a] show it suffices to approximate the value of the empirical calibration measures to error ε = Ω(n 1/2), though our solver in Theorem 1 will be exact.

We begin by explaining our strategy for estimating sm CE( b Dn) (Definition 3). By definition, the smooth calibration error of b Dn can be formulated as a linear program,

min x [ 1,1]n 1 n

i [n] xi(vi yi), where |xi xj| |vi vj| for all (i, j) [n] [n]. (2)

Here, xi [ 1, 1] corresponds to the weight on vi, and there are 2 n 2 constraints on the decision variable x, each of which corresponds to a Lipschitz constraint. We make the simple observation that every Lipschitz inequality constraint can be replaced by two constraints of the form xi xj |vj vi| (with i, j swapped). Moreover, the box constraints x [ 1, 1]n can be handled by introducing a dummy variable xn+1 and writing max(xi xn+1, xn+1 xi) 1, after penalizing xn+1 appropriately in the objective. Notably, this substitution makes every constraint the difference of two decision variables, which is enforceable using the edge-vertex incidence matrix of a graph. Finally, the triangle inequality implies that we only need to enforce Lipschitz constraints in (2) corresponding to adjacent i, j. After making these simplifications, the result is the dual of a minimum-cost flow problem on a graph which is the union of a star and a path; this argument is carried out in Lemma 2.

Because of the sequential structure of the induced graph, we show in Appendix B.2 that a dynamic programming-based approach, which maintains the minimum-cost flow value after committing to the first i < n flow variables in the graph recursively, succeeds in computing the value (2). To implement each iteration of our dynamic program in polylogarithmic time, we rely on a generalization of the classical segment tree data structure that we develop in Appendix B.3; combining gives Theorem 1.

On the other hand, the linear program corresponding to the empirical d CE is more complex (with two types of constraints), and to our knowledge lacks the graphical structure to be compatible with the aforementioned approach. Moreover, it is not obvious how to use first-order methods, an alternative linear programming framework suitable when only approximate answers are needed, to solve this problem more quickly. This is because the empirical d CE linear program enforces hard constraints to a set that is difficult to project to under standard distance metrics. To develop our faster algorithm in Theorem 2, we instead follow an augmented Lagrangian method where we lift the constraints directly into the objective as a soft-constrained penalty term. To prove correctness of this lifting, we follow a line of results in combinatorial optimization [She13, JST19]. These works develop a proof-by-rounding algorithm framework to show that the hard-constrained and soft-constrained linear programs have equal values, summarized in Appendix C.1 (see Lemma 14).

To use this augmented Lagrangian framework, it remains to develop an appropriate rounding algorithm to the feasible polytope for the empirical d CE linear program, which enforces two types of constraints: marginal satisfaction of (v, y), and calibration of (u, y) (using notation from Definition 1). In Appendix C.3, we design a two-step rounding procedure, which first fixes the marginals on the (v, y) coordinates, and then calibrates the u coordinates without affecting any (v, y) marginal.

1.3 Related work

The calibration performance of deep neural networks has been studied extensively in the literature (e.g. [GPSW17, MDR+21b, Rt21a, BGHN23b]). Measuring the calibration error in a meaningful way can be challenging, especially when the predictions are not naturally discretized (e.g. in neural networks). Recently, [BGHN23a] addresses this challenge using the distance to calibration as a central notion. They consider a calibration measure to be consistent if it is polynomially-related to the distance to calibration. Consistent calibration measures include the smooth calibration error [KF04], Laplace kernel calibration error [KSJ18], interval calibration error [BGHN23a], and convolved ECE [BN23].9

9The calibration measure we call the convolved ECE in our work was originally called the smooth ECE in [BN23]. We change the name slightly to reduce overlap with the smooth calibration error (Definition 3), a central object throughout the paper.

On the algorithmic front, substantial observations were made by [BGHN23a] on linear programming characterizations of calibration measures such as the LDTC and smooth calibration. While there have been significant advances on the runtime frontier of linear programming solvers, current runtimes for handling an n d linear program constraint matrix with n d remain Ω(min(nd + d2.5, nω)) [CLS21, vd BLL+21, JSWZ21]. Our constraint matrix is roughly-square and highly-sparse, so it is plausible that e.g. the recent research on sparse linear system solvers [PV21, Nie22] could apply to the relevant Newton s method subproblems and improve upon these rates. Moreover, while efficient estimation algorithms have been proposed by [BGHN23a] for (surrogate) interval calibration error and by [BN23] for convolved ECE, these algorithms require suboptimal sample complexity for solving our testing task in Definition 2 (see Appendix D). To compute their respective distances to error ε from samples, these algorithms require Ω(ε 5) and Ω(ε 3) time. As comparison, under this parameterization Theorems 1 and 2 require e O(ε 2) and e O(ε 4) time, but can solve stronger testing problems with the same sample complexity, experimentally validated in Section 3.

Notation. Throughout, D denotes a distribution over [0, 1] {0, 1}. When D is clear from context, we let b Dn = {(vi, yi)}i [n] denote a dataset of n independent samples from D and, in a slight abuse of notation, the distribution with probability 1

n for each (vi, yi). We say d is a calibration measure if it takes distributions on [0, 1] {0, 1} to the nonnegative reals R 0, so d CE (Definition 1) and sm CE (Definition 3) are both calibration measures. We use e O and eΩto hide polylogarithmic factors in the argument. We denote [n] := {i N | i n}. We denote matrices in boldface throughout. For any A Rm n, we refer to its ith row by Ai: and its jth column by A:j. For a set S identified with rows of a matrix A, we let As: denote the row indexed by s S, and use similar notation for columns. For a directed graph G = (V, E), we define its edge-vertex incidence matrix B { 1, 0, 1}E V which has a row corresponding to each e = (u, v) E with Beu = 1 and Bev = 1. When G is undirected, we similarly define B { 1, 0, 1}E V with arbitrary edge orientations.

2 Smooth calibration

In this section, we overview our main result on approximating the smooth calibration of a distribution on [0, 1] {0, 1}, deferring some aspects of the proof to Appendix B. We first show that the linear program corresponding to the smooth calibration of an empirical distribution can be reformulated as an instance of minimum-cost flow on a highly-structured graph. We then explain our dynamic programming approach to solving this minimum-cost flow problem and state a runtime guarantee. Finally, we give our main result on near-linear time calibration testing, Theorem 1.

Throughout this section, we fix a dataset under consideration, b Dn := {(vi, yi)}i [n] [0, 1] {0, 1},

and the corresponding empirical distribution (which, in an abuse of notation, we also denote b Dn), i.e. we use (v, y) b Dn to mean that (v, y) = (vi, yi) with probability 1

n for each i [n]. We also assume without loss of generality that the {vi}i [n] are in sorted order, so 0 v1 . . . vn 1. Recalling Definition 3, the associated empirical smooth calibration linear program is

sm CE( b Dn) := max x [ 1,1]n b x,

where |xi xj| vj vi for all (i, j) [n] [n] with i < j,

and bi := 1

n(yi vi) for all i [n].

We first make a simplifying observation, which shows that it suffices to replace the Lipschitz constraints in (3) with only the Lipschitz constraints corresponding to adjacent indices (i, j). Lemma 1. If x, v Rn, where v has monotonically nondecreasing coordinates, and |xi xi+1| vi+1 vi for all i [n 1], then |xi xj| vj vi for all (i, j) [n] [n] with i < j.

We now reformulate (3) as a (variant of a) minimum-cost flow problem. Lemma 2. Consider an instance of (3). Let G = (V, E) be an undirected graph on n + 1 vertices labeled by V := [n + 1], and with 2n 1 directed edges E defined and with edge costs as follows.

For all i [n 1], there is an edge between vertices (i, i + 1) with edge cost vi+1 vi.

For all i [n], there is an edge between vertices (i, n + 1) with edge cost 1.

Let c RE be the vector of all edge costs, let d Rn+1 be the demand vector which concatenates b in (3) with a last coordinate set to P

i [n] bi, and let B { 1, 0, 1}E V be the edge-vertex incidence matrix of G. Then the problem

e E ce|fe| (4)

has the same value as the empirical smooth calibration linear program (3).

Figure 1: Example graph G for n = 5 with n + 1 = 6 vertices and 2n 1 = 9 edges.

Proof. By Lemma 1, solving (3) is equivalent to solving

min x [ 1,1]n b x, where |xi xi+1| vi+1 vi for all i [n 1], (5)

We create a dummy variable xn+1, and rewrite (5) as

min x Rn+1 X

i [n] bi(xi xn+1), where |xi xi+1| vi+1 vi for all i [n 1],

and 1 xi xn+1 1 for all i [n]. (6)

Next, consider a directed graph e G = (V, e E) with 4n 2 edges which duplicate the undirected edges described in the lemma statement in both directions. Let e B { 1, 0, 1} e E V be the edge-vertex incidence matrix of e G, and let c R e E be the edge cost vector so that both edges in e E corresponding to e E have the same cost ce. Then (6) is equivalent to the linear program maxx Rn+1 d x such that e Bx c, where d is described as in the lemma statement. The dual of this linear program is

min f R e E 0 e B f=d

a minimum-cost flow problem on e G. In particular, based on the way we defined e B, we can check that e B f encodes the net flow at each vertex of e G, which is set according to the demand vector d in the above optimization problem. Next, for each pair of directed edges (e , e ) in e G corresponding to some e E, note that an optimal solution to (7) will only put nonzero flow on one of e or e , else we can achieve a smaller cost by canceling out redundant flow. Therefore, we can collapse each pair of directed edges into a single undirected edge, where we allow the flow variable f to be negative but charge its magnitude |f| in cost, proving equivalence of (7) and (4) as claimed.

We believe this observation in Lemma 2 is already interesting, as it lets us to use specialized graph algorithms to achieve faster runtimes in both theory and practice for solving (3). By using the special structure of the graph (the union of a star and path), we show in Appendix B that we can develop a more efficient custom algorithm for this problem. Specifically, we show how to replace the constrained problem (4) with an unconstrained problem on only the path edges, of the form

min f Rn 1 A(f) := |d1 + f1| + |dn fn 1| + X

i [n 2] |fi fi+1 di+1| + X

i [n 1] ci|fi|. (8)

We prove the following result in Appendix B.2. Proposition 1. There is an algorithm which computes a minimizer f Rn 1 to A in (8), as well as the minimizing value A(f), in time O(n log2(n)).

Our algorithm for establishing Proposition 1 is based on dynamic programming, and recursively represents partial solutions to A as a piecewise-linear function. We implement updates to this representation via a segment tree data structure in polylogarithmic time, giving our overall solution.

Corollary 1. There is an algorithm which computes the value of (3) in time O(n log2(n)).

Proof. This is immediate from Lemma 2, the equivalence between the constrained problem (4) and the unconstrained problem (8) established in Appendix B.2, and Proposition 1.

We now describe how to build upon Corollary 1 to give an algorithm for proving Theorem 1, using a result from [BGHN23a] which bounds how well the smooth calibration of an empirical distribution approximates the smooth calibration of the population.

Lemma 3 (Corollary 9.9, [BGHN23a]). For ε (0, 1), there is an n = O( 1

ε2 ) such that if b Dn is the empirical distribution over n i.i.d. draws from D, with probability 2

3, |sm CE(D) sm CE( b Dn)| ε.

Further, we recall the smooth calibration error is constant-factor related to the LDTC.

Lemma 4 (Theorem 7.3, [BGHN23a]). For any distribution D over [0, 1] {0, 1}, we have 1 2d CE(D) sm CE(D) 2d CE(D).

Proof of Theorem 1. We take n = O( 1

ε2 ) samples so Lemma 3 ensures |sm CE(D) sm CE( b Dn)| ε

2 with probability 2

3. We then compute β = sm CE( b Dn) using Corollary 1, and return yes iff β ε

4, which distinguishes between the two cases in Definition 2 via Lemma 4.

3 Experiments

In this section, we present experiments on synthetic data and CIFAR-100 supporting our argument that d CE and sm CE are reliable measures of calibration for use in defining a testing problem. We then evaluate our custom algorithms from Section 2 and Appendix B, showing promising results on their runtimes outperforming standard packages for linear programming and minimum-cost flow. The experiments in the first and third part of this section are run on a 2018 laptop with 2.2 GHz 6-Core Intel Core i7 processor. The experiments in the second part are run on a cluster using 2x AMD EPYC 7763 64-Core Processor and a single NVIDIA A100 PCIE 40GB.

Synthetic dataset. In our first experiment, we considered the ability of ε-d-testers (Definition 5) to detect the miscalibration of a synthetic dataset, for various levels of ε {0.01, 0.03, 0.05, 0.07, 0.1}, and various choices of d {sm CE, d CE, Conv ECE}.10 The synthetic dataset we used is n independent draws from D, where a draw (v, y) D first draws v unif. [0, 1 ε ], and y Bern(v + ε ), for ε := 0.01.11 Note that d CE(D) = ε = 0.01, by the proof in Lemma 13. In Table 1, where the columns index n (the number of samples), for each choice of d we report the smallest value of ε such that a majority of 100 runs of an ε-d-tester report yes. For d = Conv ECE, we implemented our tester by running code in [BN23] to compute Conv ECE and thresholding at ε

2. For d {sm CE, d CE}, we used the standard linear program solver from CVXPY [DB16, AVDB18] and again thresholded at ε 2. We remark that the CVXPY solver, when run on the d CE linear program, fails to produce stable results for n > 29 due to the size of the constraint matrix. As seen from Table 1, both sm CE and d CE testers are more reliable estimators of the ground truth calibration error ε than Conv ECE.

Table 1: Calibration testing thresholds (smallest passed on half of 100 runs).

n 26 + 1 27 + 1 28 + 1 29 + 1 210 + 1 211 + 1 sm CE 0.07 0.05 0.03 0.03 0.01 0.01 d CE 0.03 0.01 0.01 c ECE 0.1 0.1 0.07 0.07 0.05 0.03 Ground Truth 0.01 0.01 0.01 0.01 0.01 0.01

10We implemented Conv ECE using code from [BN23], which automatically conducts a search for σ. 11This is a slight variation on the synthetic dataset used in [BGHN23a].

Table 2: Empirical sm CE on postprocessed Dense Net40 predictions (median over 20 runs)

D Dbase Diso Dtemp Empirical sm CE 0.2269 0.2150 0.1542

In Figure 2, we plot the median error with error bars for each calibration measure, where the x axis denotes log2(n 1), and results are reported over 100 runs.

Figure 2: The 25% quantile, median, and 75% quantile (over 100 runs) for sm CE, d CE and c ECE respectively. The x-axis is for dataset with size 2x + 1.

Postprocessed neural networks. In [GPSW17], which observed modern deep neural networks may be very miscalibrated, various strategies were proposed for postprocessing network predictions to calibrate them. We evaluate two of these strategies using our testing algorithms. We trained a Dense Net40 model [HLvd MW17] on the CIFAR-100 dataset [Kri09], producing a distribution Dbase, where a draw (v, y) Dbase selects a random example from the test dataset, sets y to be its label, and v to be the prediction of the neural network. We also learned calibrating postprocessing functions fiso and ftemp from the training dataset, the former via isotonic regression and the latter via temperature scaling. These induce (ideally, calibrated) distributions Diso, Dtemp, where a draw from Diso samples (v, y) Dbase and returns (fiso(v), y), and Dtemp is defined analogously. The neural network and postprocessing functions were all trained by adapting code from [GPSW17].

We computed the median smooth calibration error of 20 runs of the following experiment. In each run, for each D {Dbase, Diso, Dtemp}, we drew 256 random examples from D, and computed the average smooth calibration error sm CE of the empirical dataset using a linear program solver from CVXPY. We report our findings in Table 2. We also compared computing sm CE using the CVXPY solver and a commercial minimum-cost flow solver from Gurobi Optimization [Opt23] (on the objective from Lemma 2) in this setting. The absolute difference between outputs is smaller than 10 5 in all cases, verifying that minimum-cost flow solvers accurately measure smooth calibration.

Qualitatively, our results (based on sm CE) agree with findings in [GPSW17] (based on binned variants of ECE), in that temperature scaling appears to be the most effective postprocessing technique.

sm CE tester. Finally, we evaluated the efficiency of our proposed approaches to computing the empirical sm CE. Specifically, we measure the runtime of four solvers for computing (3): a linear program solver from CVXPY, a commercial minimum-cost flow solver from Gurobi Optimization, a naïve implementation of our algorithm from Corollary 1 using Python, and a slightly-optimized implementation using the Py Py package [Py P19]. We use the same experimental setup as in Table 1, i.e. measuring calibration of a uniform predictor on a miscalibrated synthetic dataset, with ε = 0.01.12 In Table 3, we report the average runtimes for each trial (across 10 runs), varying the sample size. Again, the absolute difference between the outputs of all methods is negligible ( 10 9 in all cases). As seen in Table 3, our custom algorithm (optimized with Py Py) outperforms standard packages from CVXPY and Gurobi Optimization starting from moderate sample sizes. We believe that Table 3 demonstrates that our new algorithms are a scalable, reliable way of testing calibration, and that these performance gains may be significantly improvable by further optimizing our code.

12We found similar runtime trends when using our algorithms to test calibration on the postprocessed neural network dataset, but the runtime gains were not as drastic as the sample size n = 28 was smaller in that case.

Acknowledgements

We thank Edgar Dobriban for pointing us to the reference [LHHD23], and Yang P. Liu and Richard Peng for helpful discussions on the segment tree data structure in Section B.3. We also thank Yue Zhao for advice on running our experiments.

Table 3: Runtimes (in seconds) for computing the value of (3), using various solvers

n 210 211 212 213 214 215 CVXPY LP solver 0.105 0.370 1.58 6.51 45.7 245 Gurobi minimum-cost flow solver 0.063 0.179 0.238 0.539 1.45 3.19 Solver from Corollary 1 0.177 0.389 0.899 2.01 4.66 10.6 Solver from Corollary 1 with Py Py 0.079 0.115 0.176 0.307 0.621 2.05

[AAA+16] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse H. Engel, Linxi Fan, Christopher Fougner, Awni Y. Hannun, Billy Jun, Tony Han, Patrick Le Gresley, Xiangang Li, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Sheng Qian, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Chong Wang, Yi Wang, Zhiqian Wang, Bo Xiao, Yan Xie, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep speech 2 : End-to-end speech recognition in english and mandarin. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 173 182. JMLR.org, 2016.

[AVDB18] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42 60, 2018.

[BGHN23a] Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. A unifying theory of distance from calibration. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1727 1740, 2023.

[BGHN23b] Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. When does optimizing a proper loss yield calibration? ar Xiv preprint ar Xiv:2305.18764, 2023.

[BN23] Jarosław Błasiok and Preetum Nakkiran. Smooth ECE: Principled reliability diagrams via kernel smoothing. ar Xiv preprint ar Xiv:2309.12236, 2023.

[Can22] Clément L. Canonne. Topics and techniques in distribution testing: A biased but representative sample. Foundations and Trends in Communications and Information Theory, 19(6):1032 1198, 2022.

[CKL+22] Li Chen, Rasmus Kyng, Yang P Liu, Richard Peng, Maximilian Probst Gutenberg, and Sushant Sachdeva. Maximum flow and minimum-cost flow in almost-linear time. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 612 623. IEEE, 2022.

[CLS21] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. J. ACM, 68(1):3:1 3:39, 2021.

[CST21] Michael B. Cohen, Aaron Sidford, and Kevin Tian. Relative lipschitzness in extragradient methods and a direct recipe for acceleration. In 12th Innovations in Theoretical Computer Science Conference, ITCS 2021, volume 185 of LIPIcs, pages 62:1 62:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.

[DB16] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1 5, 2016.

[DGG+22] Sally Dong, Yu Gao, Gramoz Goranci, Yin Tat Lee, Richard Peng, Sushant Sachdeva, and Guanghao Ye. Nested dissection meets ipms: Planar min-cost flow in nearlylinear time. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, pages 124 153. SIAM, 2022.

[DKR+21] Cynthia Dwork, Michael P. Kim, Omer Reingold, Guy N. Rothblum, and Gal Yona. Outcome indistinguishability. In STOC 21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021, pages 1095 1108. ACM, 2021.

[Doi07] Kunio Doi. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Computerized medical imaging and graphics, 31(4 5):198 211, 2007.

[GKR+22] Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In 13th Innovations in Theoretical Computer Science Conference, ITCS 2022, volume 215 of LIPIcs, pages 79:1 79:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.

[Gol17] Oded Goldreich. Introduction to Property Testing. Cambridge University Press, 2017.

[GPSW17] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321 1330. PMLR, 2017.

[HKRR18] Úrsula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 1944 1953. PMLR, 2018.

[HLvd MW17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 2261 2269. IEEE Computer Society, 2017.

[IS03] Y. I. Ingster and I. A. Suslina. Nonparametric Goodness-of-Fit Testing under Gaussian Models, volume 169 of Lecture Notes in Statistics. Springer-Verlag, New York, 2003.

[JST19] Arun Jambulapati, Aaron Sidford, and Kevin Tian. A direct e O(1/ε) iteration parallel algorithm for optimal transport. Advances in Neural Information Processing Systems, 32, 2019.

[JSWZ21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. A faster algorithm for solving general lps. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 823 832, 2021.

[JT23] Arun Jambulapati and Kevin Tian. Revisiting area convexity: Faster box-simplex games and spectrahedral generalizations. ar Xiv preprint ar Xiv:2303.15627, 2023.

[KF04] Sham M. Kakade and Dean P. Foster. Deterministic calibration and nash equilibrium. In John Shawe-Taylor and Yoram Singer, editors, Learning Theory, pages 33 48, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.

[KF08] Sham M. Kakade and Dean P. Foster. Deterministic calibration and nash equilibrium. J. Comput. Syst. Sci., 74(1):115 130, 2008.

[KLM19] Ananya Kumar, Percy Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, pages 3787 3798, 2019.

[Kri09] Alex Krizhevsky. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/ kriz/learning-features-2009-TR.pdf, 2009. Accessed: 2024-01-31.

[KSJ18] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2805 2814. PMLR, 10 15 Jul 2018.

[LHHD23] Donghwan Lee, Xinmeng Huang, Hamed Hassani, and Edgar Dobriban. T-cal: An optimal test for the calibration of predictive models. Journal of Machine Learning Research, 24:1 72, 2023.

[LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in õ(vrank) iterations and faster algorithms for maximum flow. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, pages 424 433. IEEE Computer Society, 2014.

[MDR+21a] Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, pages 15682 15694, 2021.

[MDR+21b] Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 15682 15694. Curran Associates, Inc., 2021.

[Mur98] Allan H. Murphy. The early history of probability forecasts: Some extensions and clarifications. Weather and forecasting, 13(1):5 15, 1998.

[MW84] Allan H. Murphy and Robert L. Winkler. Probability forecasting in meteorology. Journal of the American Statistical Association, 79(387):489 500, 1984.

[NCH15] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, pages 2901 2907. AAAI Press, 2015.

[NDZ+19] Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, pages 38 41. Computer Vision Foundation / IEEE, 2019.

[Nie22] Zipei Nie. Matrix anti-concentration inequalities with applications. In STOC 22: 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 568 581. ACM, 2022.

[Opt23] Gurobi Optimization. Minimum-cost flow. https://gurobi-optimization-gurobioptimods.readthedocs-hosted.com/en/latest/mods/min-cost-flow.html, 2023. Accessed: 2024-05-21.

[PV21] Richard Peng and Santosh S. Vempala. Solving sparse linear systems faster than matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, pages 504 521. SIAM, 2021.

[Py P19] Py Py. Pypy. https://www.pypy.org/, 2019. Accessed: 2024-05-21.

[QM22] Benjamin Qi and Dustin Miao. Segment tree beats. https://usaco.guide/adv/segtreebeats?lang=cpp, 2022. Accessed: 2024-05-20.

[Ron08] Dana Ron. Property testing: A learning theory perspective. Found. Trends Mach. Learn., 1(3):307 402, 2008.

[Ron09] Dana Ron. Algorithmic and analysis techniques in property testing. Found. Trends Theor. Comput. Sci., 5(2):73 205, 2009.

[RRSK11] Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor. Recommender Systems Handbook. Springer New York, 2011.

[Rt21a] Rahul Rahaman and alexandre thiery. Uncertainty quantification and deep ensembles. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 20063 20075. Curran Associates, Inc., 2021.

[RT21b] Rahul Rahaman and Alexandre H. Thiéry. Uncertainty quantification and deep ensembles. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, pages 20063 20075, 2021.

[She13] Jonah Sherman. Nearly maximum flows in nearly linear time. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, pages 263 269. IEEE Computer Society, 2013.

[She17] Jonah Sherman. Area-convexity, l regularization, and undirected multicommodity flow. In Hamed Hatami, Pierre Mc Kenzie, and Valerie King, editors, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 452 460. ACM, 2017.

[vd BLL+21] Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. Minimum cost flows, mdps, and ℓ1-regression in nearly linear time for dense instances. In STOC 21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021, pages 859 869. ACM, 2021.

[vd BLSS20] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. Solving tall dense linear programs in nearly linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pages 775 788. ACM, 2020.

[VDDP17] Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopapadakis. Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience, 2018, 2017.

[WXXZ23] Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. New bounds for matrix multiplication: from alpha to omega. Co RR, abs/2307.07970, 2023.

A Additional preliminaries

We first introduce some notation used in Appendices B, C, and D. When applied to a vector, we let p denote the ℓp norm for p 1. We let 0d and 1d denote the all-zeroes and all-ones vectors in dimension d. We let d := {x Rd 0 | x 1 = 1} denote the probability simplex in dimension d. The ith coordinate basis vector is denoted ei. We say x R is an ε-additive approximation of x R if | x x| ε. For a set S R, we say another set T R is an ε-cover of S if for all s S, there is t T with |s t| ε. For a, b R with a b, we let clip[a,b](t) := min(b, max(a, t)) denote the result of projecting t R onto [a, b].

We next formally define our tolerant variant of the calibration testing problem in Definition 2. Definition 4 (Tolerant calibration testing). Let 0 ε2 ε1 1. We say algorithm A solves the (ε1, ε2)-tolerant calibration testing problem with n samples, if given n i.i.d. draws from a distribution D over [0, 1] {0, 1}, A returns either yes or no and satisfies the following with probability 2

A returns no if d CE(D) ε1.

A returns yes if d CE(D) ε2.

In this case, we also call A an (ε1, ε2)-tolerant calibration tester.

Note that an algorithm which solves the (ε1, ε2)-tolerant calibration testing problem with n samples also solves the ε1-calibration testing problem with the same sample complexity. Moreover, we give a simple impossibility result on parameter ranges for calibration testing. Lemma 5. Let 0 ε2 ε1 1

2 satisfy ε1 ε2 = ε. There is a universal constant Ccoin such that, given n Ccoin

ε2 samples from a distribution on [0, 1] {0, 1}, it is information-theoretically impossible to solve the (ε1, ε2)-tolerant calibration testing problem.

Proof. Suppose ε 1 10, else we can choose Ccoin small enough such that n < 1. We consider two distributions over [0, 1] {0, 1}, D and D , with d CE(D) ε1 but d CE(D ) ε2, so if A succeeds at tolerant calibration testing for both D and D , we must have d TV(D n, (D ) n) 1

3, where we denote the n-fold product of a distribution with n. Else, A cannot return different answers from n samples with probability 2

3, as required by Definition 4. Specifically, we define D, D as follows.

To draw (v, y) D, let v = 1

2 + ε1 and y Bern( 1

To draw (v, y) D , let v = 1

2 + ε1 and y Bern( 1

We claim that d CE(D) = ε1. To see this, let Π ext(D) and (u, v, y) Π, so E(u,v,y) Π[u] = 1

2 because u is calibrated. By Jensen s inequality, we have:

E(u,v,y) Π [|u v|] E(u,v,y) Π [u] 1

The equality case is realized when u = 1

2 with probability 1, proving the claim. Similarly, d CE(D ) = ε2. Finally, let π := Bern( 1

2), π := Bern( 1

2 + ε), and π n, (π ) n denote their n-fold product distributions. Pinsker s inequality shows that it suffices to show that d KL(π n (π ) n) 1 5 to contradict our earlier claim d TV((D) n, (D ) n) 1

3. To this end, we have

d KL(π n (π ) n) = n d KL (π π ) = n

log 1 2 1 2 + ε

+ log 1 2 1 2 ε

2 log 1 1 4ε2

where the first line used tensorization of d KL, and the last chose Ccoin small enough.

We also generalize Definitions 2 and 4 to apply to an arbitrary calibration measure.

Definition 5 (d testing). Let d be a calibration measure. For ε R 0, we say algorithm A solves the ε-d-testing problem (or, A is an ε-d tester) with n samples, if given n i.i.d. draws from a distribution D over [0, 1] {0, 1}, A returns either yes or no and satisfies the following with probability 2

1. A returns no if d(D) ε.

2. A returns yes if d(D) = 0.

For 0 ε2 ε1, we say algorithm A solves the (ε1, ε2)-tolerant d testing problem (or, A is an (ε1, ε2)-tolerant d tester) with n samples, if given n i.i.d. draws from a distribution D over [0, 1] {0, 1}, A returns either yes or no and satisfies the following with probability 2

1. A returns no if d(D) ε1.

2. A returns yes if d(D) ε2.

B Appendix for Section 2

B.1 Deferred lemma proofs

Proof of Lemma 1. This follows from the triangle inequality:

k=i |xk xk+1|

k=i vk+1 vk = vj vi.

B.2 Dynamic programming

In this section, we give our dynamic programming approach to solving (4), which establishes Proposition 1. The graph G in (4) is the union of a path P on n vertices and a star S to the (n + 1)th vertex. Let us identify the edges in P with [n 1] (where edge i corresponds to vertices (i, i + 1)), and the edges in S with [2n 1] \ [n 1]. We first make a simplifying observation, which is that given the coordinates of a flow variable f RE on the edges in the path P, there is a unique way to set the values of f on the edges in the star S so that the demands B f = d are satisfied. Concretely, we require

fn 1+i = di + fi fi 1 for all 2 i n 1, fn = d1 + f1, and f2n 1 = dn fn 1.

Hence, minimizing the constrained problem in (4) is equivalent to minimizing the following unconstrained problem on the first n 1 flow variables, stated in (8) and reproduced here:

min f Rn 1 A(f) := |d1 + f1| + |dn fn 1| + X

i [n 2] |fi fi+1 di+1| + X

i [n 1] ci|fi|.

We now solve (8). We first define a sequence of partial functions {Aj : R R}j [n 1] by

A1(z) := |d1 + z| + c1|z|,

Aj(z) := min f Rj fj=z

|d1 + f1| + X

i [j 1] |fi fi+1 di+1| + X

i [j 1] ci|fi| for all 2 j n 2,

An 1(z) := min f Rj fj=z

(9) In other words, Aj(z) asks to minimize the partial function in (8) over the first j flow variables {fi}i [j], corresponding to all terms in which these flow variables participate, subject to fixing fj = z. We make some preliminary observations about the partial functions {Aj}j [n 1].

Lemma 6. For all j [n 2], Aj is a convex, continuous, piecewise linear function with at most j + 2 pieces, and An 1 is a convex, continuous, piecewise linear function with at most n + 2 pieces.

Proof. We first establish convexity by induction; the base case j = 1 is clear. We next observe that

Aj(z) = cj|z| + min w R |w z dj| + Aj 1(w) for all 2 j n 2,

An 1(z) = |dn z| + cn 1|z| + min w R |w z dn 1| + An 2(w). (10)

In other words, each partial function Aj can be recursively defined by first minimizing the first j 2 flow variables for a fixed value fj 1 = w, and then taking the optimal choice of w. Moreover, supposing inductively Aj 1 is convex, Aj is the sum of a convex function and a partial minimization over a jointly convex function of (w, z), so it is also convex, completing the induction.

To see that Aj is continuous (assuming continuity of Aj 1 inductively), it suffices to note Aj is the sum of a continuous function and a partial minimization over a continuous function in two variables.

We now prove the claims about piecewise linearity, and the number of pieces. Clearly, A1 is continuous and piecewise linear with at most 3 pieces. Next, for some 2 j n 2, suppose Aj 1 is piecewise linear with vertices {vi}i [j] in nondecreasing order (possibly with repetition) and slopes {ti}j i=0, so ti is the slope of the segment of Aj 1 between vi and vi+1, and t0 and tj are the leftmost and rightmost slopes. For convenience we define v0 := and vj+1 := . Consider the function

min w R |w z dj| + Aj 1(w). (11)

For all values z satisfying vi z +dj vi+1 where 0 i j, the function |w z dj|+Aj 1(w) is piecewise linear with vertices v1, . . . , vi, z + dj, vi+1, . . . , vj, and correspondingly ordered slopes t0 1, t1 1, . . . , ti 1, ti + 1, . . . , tj + 1. The minimizing w in (11) corresponds to any v {vi}i [j] {z + dj} where the slope switches from nonpositive to nonnegative, which is either a fixed vertex v = vk for the entire range vi z + dj vi+1, or the new vertex z + dj for this entire range.

In the former case, we have

min w R |w z dj| + Aj 1(w) = |vk z dj| + Aj 1(vk), (12)

which up to a constant additive shift is |z (vk dj)|, a linear function in z in the range vi z + dj vi+1 because the sign of z (vk dj) does not change. In the latter case, we have

min w R |w z dj| + Aj 1(w) = Aj 1(z + dj), (13)

which again is a linear function in z in the range vi z + dj vi+1 by induction. In conclusion, (11) is linear in each range z [vi dj, vi+1 dj], so it is piecewise linear with at most j +1 pieces; adding cj|z|, which introduces at most 1 more piece, completes the induction. An analogous argument holds for j = n 1, but we potentially introduce two more pieces due to adding |dn z| + cn 1|z|.

For convenience, we now describe how to update the slopes and vertices going from the piecewise linear function Aj 1 to Aj. Also, as above suppose the vertices of Aj 1 are {vi}i [j] and the corresponding slopes are {ti}j i=0. Then because we argued (11) is linear in each range z [vi dj, vi+1 dj], it has vertices {vi dj}i [j]. If z [vi dj, vi+1 dj] and ti [ 1, 1], then we are in the case of (13) and the corresponding slope in this range is ti. Otherwise, if ti 1 we are in the case of (12) with slope 1, and if ti 1 the slope of the piece is similarly 1. We then add cj|z| (and |dn z| if j = n 1). In summary, the new slopes and vertices are as follows.

If j n 2, the new vertices are {vi dj}i [j] {0}. If vk dj 0 vk+1 dj for some 0 k j, using our convention v0 = and vj+1 = , the new slopes are

{clip[ 1,1](ti) cj}0 i<k {clip[ 1,1](tk) cj}

{clip[ 1,1](tk) + cj} {clip[ 1,1](ti) + cj}k<i j. (14)

If j = n 1, the new vertices are {vi dj}i [j] {0} {dn}. If vk dj 0 vk+1 dj and vh dj dn vh+1 dn for some 0 h, k j, the new slopes are

{clip[ 1,1](ti) cjιvi+1 dj + cjιvi dj} 0 i j i {h,k}

n clip[ 1,1](tk) cj o n clip[ 1,1](tk) + cj o

n clip[ 1,1](th) 1 o n clip[ 1,1](th) + 1 o ,

where we let ιE denote the 0-1 indicator variable of an event E.

The ordering of these vertices and their slopes are uniquely determined, because they are sorted similarly due to convexity, which implies nondecreasing slopes as z increases. Finally, we note that assuming the invariant that at least one {ti}j i=0 is nonnegative and at least one is nonpositive (which holds in the first iteration), in either of the cases (14) or (15) this invariant is preserved, since clipping preserves signs, the smallest slope decreases, and the largest slope increases.

We require one additional property of the slope updates (14).

Lemma 7. For j [n 1], let {ti}j i=0 be the nondecreasing slopes of Aj. Then t0 1 and tj 1.

Proof. By observation, the smallest and largest slopes of A1 (9) are 1 c1 and 1 + c1. Hence, assuming inductively the lemma statement is true for Aj 1, the slope updates (14) result in smallest and largest slopes 1 cj and 1 + cj in Aj, completing the induction.

We next observe that, by storing a constant amount of information in each iteration, we can work backwards from an optimal solution to An 1 and recover all flow variables which realized this value.

Lemma 8. Let {vi}i [j] and {ti}j i=0 be the nondecreasing vertices and slopes of Aj 1 for some 2 j n 1. Suppose we know vℓand vr for ℓ, r [j], defined such that tℓ 1 1 but tℓ 1, and similarly tr 1 1 but tr 1. Then given a value of z, we can compute in O(1) time

argminw R|w z dj| + Aj 1(w).

Proof. Note that existence of vℓ, vr is guaranteed by Lemma 7. We consider three cases.

First, if z + dj [vℓ, vr], we claim w = z + dj. To see this, recall that if z + dj [vi, vi+1], we proved in Lemma 6 that the slopes of |w z dj|+Aj 1(w) (in w) are t0 1, t1 1, . . . , ti 1, ti + 1, . . . , tj +1. Hence, the assumptions imply ti [ 1, 1], so a vertex of |w z dj|+Aj 1(w) where the slope changes from nonpositive to nonnegative (i.e. the minimizing argument w) is w = z + dj, as claimed.

To handle the other two cases, if z + dj vℓ, then the above calculation shows the new optimal vertex is w = vℓ; similarly, if z + dj vr the new optimal vertex is w = vr.

We now describe an interface for a data structure that we use to efficiently implement the updates (14) and (15), whose existence we prove in the following Appendix B.3.

Lemma 9. There is a data structure, Segment Tree, which initializes a vector t Rn to t 0n, and supports the following operations each in O(log n) time.

Query(i): Return ti.

Add(ℓ, r, c): Update ti ti + c for all ℓ i r.

Set(ℓ, r, c): Update ti c for all ℓ i r.

Given access to Segment Tree, we now describe how to solve (8) in nearly-linear time, proving Proposition 1.

Proof of Proposition 1. We first describe how to compute all of the vertices and slopes {vi}i [n 1], {ti}n 1 i=0 of An 2 in time O(n log2(n)). The proof of Lemma 6 shows that the vertices are

where we treat the empty sum as 0. Moreover, the vertex Pn 2 k=j+1 dk is the new vertex which was inserted (as 0) when computing Aj, and then advanced through the remaining iterations. We sort the vertices (16) into nondecreasing order, keeping track of the resulting permutation π : [n 2] {0} [n 1], i.e. if the vertex Pn 2 k=j+1 dk is the ith smallest in (16), then π(j) = i. This step takes time O(n log n) and does not dominate.

We next initialize a Segment Tree (Lemma 9) with n vertices. We update it through the first n 2 recursive computations of slopes, via (14), keeping track of a time counter j, i.e. after we are done updating the slopes of Aj the time counter increments. The state of Segment Tree at the end of time j is as follows. For all k [j], letting π(h) be the next largest value after π(k) amongst {π(k )}k [j], we require that ti equals the slope of the segment to the right of the vertex inserted at time k in Aj for all the coordinates π(k) + 1 i π(h). In other words, Segment Tree stores all of the slopes of Aj in its coordinates (with redundancies due to vertices which will be inserted after time j), and π maps the times vertices are inserted to their coordinate values in Segment Tree.

We next show how to maintain this state in O(log2(n)) time per time increment. In iteration j, the update (14) requires us to clip all previous slopes to the range [ 1, 1], subtract cj from all slopes in the range [1, π(j)], and add cj to all slopes in the range [π(j) + 1, n]. We first use Query to perform binary searches for the values ℓ, r as defined in Lemma 7. We then sequentially apply

Set(1, ℓ, 1), Set(r + 1, n, 1), Add(1, π(j), cj), Add(π(j) + 1, n, cj).

The dominant runtime term is the cost of O(log(n)) calls to Query to perform the binary search, giving the claimed O(log2(n)) runtime per time increment. We can now call Query n times to compute all slopes and vertices of An 2. We will also store the values of vℓand vr at time j, which takes O(1) time given ℓ, r, π, and partial sums which can be precomputed in time O(n).

Given these slopes and vertices, it is straightforward to apply (15) to compute all slopes and vertices of An 1 in time O(n), at which point we can find z := argminz RAn 1(z). Next, by using our stored values of vℓand vr in every iteration, we can then use Lemma 8 to compute all the optimal flow values realizing An 1(z). Finally, we can compute the optimal value (8) in time O(n).

B.3 Implementation of Segment Tree

In this section, we develop a data structure known as a segment tree which plays a vital role in our main algorithm. In particular, it allows us to prove Lemma 9. While this data structure is well-known folklore in the competitive programming community (see e.g. an overview of this technique in [QM22]), we provide a full description and proof for completeness.

For integers ℓand r satisfying ℓ r, we use [ℓ: r] to denote the set {ℓ, ℓ+ 1, . . . , r}. Lemma 10 (Segment tree). Let G be a semigroup with an identity element e, where the semigroup product of a, b G is denoted by a b or ab and is not necessarily commutative. Let v be an array of length n, where each element of v is initialized to be the identity element e of G. There is a data structure D, called a segment tree, that can perform each of the following operations in O(log n) time (assuming a semigroup product can be computed in constant time).

1. Access(i): given i [1 : n], return the ith element in v.

2. Apply(g, ℓ, r): given ℓ, r [1 : n] satisfying ℓ r, and given a semigroup element g G, for each index i [ℓ: r], replace v[i] with g v[i].

Before proving Lemma 10, we first use it to prove Lemma 9.

Proof of Lemma 9. We apply the data structure in Lemma 10 to a specific semigroup G defined as follows. The elements of G are functions τ : R R, where the identity element e is the identity

function e(u) = u for every u R, and the semigroup product is defined as function composition: (a b)(u) = a(b(u)) for every a, b G and u R. The semigroup G consists of the following functions: addc and setc for every c R. These functions are defined as follows: addc(u) = u + c, setc(u) = c, for every u R. It is easy to check that these functions are closed under composition: addc addc = addc+c , setc setc = setc, addc setc = setc+c , setc addc = setc. Therefore, G is a valid semigroup. We can now implement the operations Query, Add, Set in Lemma 9 using the operations Access and Apply in Lemma 10 as follows.

To implement Query(i), we run Access(i) to obtain its output g = v[i] G, and return g(0).

To implement Add(ℓ, r, c), we run Apply(addc, ℓ, r).

To implement Set(ℓ, r, c), we run Apply(setc, ℓ, r).

The correctness of this implementation can be shown inductively. At initialization, v[i] = e, and thus Query(i) returns e(0) = 0, which is the correct value of ti at initialization. It remains to show inductively that after each Add and Set operation, the output of Query(i), i.e., v[i](0), is the intended value of ti. Indeed, after an Add operation, for any i [ℓ: r], the element v[i] is updated to v[i] := (addc v[i]), and thus v[i] (0) = (addc v[i])(0) = addc(v[i](0)) = v[i](0) + c = ti + c, which is the intended new value of ti. For i / [ℓ: r], the element v[i] remains unchanged, and thus v[i](0) remains unchanged. This is as desired because the new value of ti is intended to be the same as the old value. Combining these two cases, we have shown that the Add operation maintains that v[i](0) is the intended value of ti for every i [1 : n]. We can similarly show that the Set operation also has this property, and thus our implementation is correct. The running time guarantee of the implementation follows directly from the running time guarantee in Lemma 10.

We prove Lemma 10 by describing the construction of the segment tree data structure and analyzing its correctness (Lemma 11) and efficiency (Lemma 12). By appending to the array an appropriate number of auxiliary entries, we can assume without loss of generality that the array length n is a power of 2, i.e., n = 2r for a positive integer r. The data structure is implemented using a complete binary tree T with depth r, where the 2r leaves correspond to the n = 2r entries in the array.

More specifically, we use seg(τ) [1 : n] to denote the set of indices i that a node τ T is associated with. For the root τ0 of the tree T, we set seg(τ0) = [1 : n] = [1 : 2r], and for its two children τ1, τ2 we set seg(τ1) = [1 : n/2] = [1 : 2r 1] and seg(τ2) = [n/2 + 1 : n] = [2r 1 + 1 : 2r]. In general, for any non-leaf node τ and its two children τ1, τ2, we set seg(τ1) as the first half of seg(τ), and set seg(τ2) as the second half. In particular, for every leaf τ, seg(τ) is a singleton set consisting of a unique index i [1 : n], and we say τ is the (unique) leaf corresponding to index i. See Figure 3 for an example with depth r = 3.

B2j Ez RGFH0A/1Ev2p/n A3nhf Ny EVpb K3O20X/mv P4LKy Mc Q=</latexit>Layer 1

I7RCZogij6g X6i X7U/zpbz3Hmx DK1tl Dn P0H/mv Po LMRscw=</latexit>Layer 3

[1 : 4] [5 : 8]

[1 : 2] [3 : 4] [5 : 6] [7 : 8]

{1} {2} {3} {4} {5} {6} {7} {8}

Figure 3: Example of segment tree with depth r = 3.

At every node τ of the tree, we maintain a semigroup element gτ G initialized to be the identity element e.

Access. We implement Access(i) as follows. Let τ0 τr be the directed path from the root τ0 to the leaf τr corresponding to index i. Return the semigroup product gτ0gτ1 gτr.

Apply. We implement Apply(g, ℓ, r) recursively. That is, we implement Apply(g, ℓ, r, τ) in Algorithm 1 which takes a node τ T as an additional input. We then define Apply(g, ℓ, r) to be Apply(g, ℓ, r, τ0), where the additional input is set as the root τ0 of the tree T.

Algorithm 1 Apply(g, ℓ, r, τ)

1: if seg(τ) [ℓ: r] = then 2: return 3: end if 4: if seg(τ) [ℓ: r] then 5: gτ g gτ 6: return 7: end if 8: Let τ1, τ2 be the two children of τ 9: gτ1 gτ gτ1 10: gτ2 gτ gτ2 11: gτ e 12: Apply(g, ℓ, r, τ1) 13: Apply(g, ℓ, r, τ2)

Correctness. At initialization, each array element v[i] is initialized to be the identity element e. The semigroup element gτ stored at each tree node τ is also initialized to be e, so Access(i) returns the correct value e. It remains to show that Access(i) still returns the correct value of v[i] after each Apply operation. This is established in the following lemma:

Lemma 11. For i {1, . . . , n}, let τ0 τr be the directed path from the root τ0 to the leaf τr corresponding to index i. Let ˆgτ0, . . . , ˆgτr G denote the current states of the semigroup elements gτ0, . . . , gτr stored in the data structure. Then after we call Apply(g, l, r), we have

gτ0 gτr = g ˆgτ0 ˆgτr, if i [ℓ: r]; ˆgτ0 ˆgτr, if i / [ℓ: r].

Proof. It is clear that [1 : n] = seg(τ0) seg(τr) = {i}. If i [ℓ: r], let τj be the first node among τ0, . . . , τr such that seg(τj) [ℓ, r]. We can inductively show that for every j = 1, . . . , j, right before we make the recursive call to Apply(g, ℓ, r, τj ), we have

e, if i < j , ˆgτ0 ˆgτj, if i = j , ˆgτi, if i > j .

When we call Apply(g, ℓ, r, τj), since seg(τj) [ℓ: r], Line 5 is executed and the function returns after that. Now we have

e, if i < j, g ˆgτ0 ˆgτj, if i = j, ˆgτi, if i > j.

This implies gτ0 gτr = g ˆgτ0 ˆgτr, as desired.

Similarly, if i / [ℓ, r], let τj be the first node among τ0, . . . , τr such that seg(τj) [ℓ: r] = . We can show that

e, if i < j; ˆgτ0 ˆgτj, if i = j; ˆgτi, if i > j.

This implies gτ0 gτr = ˆgτ0 ˆgτr, as desired.

Efficiency. The following result establishes the running time guarantee in Lemma 10.

Lemma 12. Both Access and Apply run in O(log n) time.

Proof. It is clear that Access runs in time O(r) = O(log n). When we run Apply(g, ℓ, r, τ), the recursive calls at Lines 12-13 are made only when [ℓ: r] intersects but does not contain seg(τ), i.e., seg(τ) [ℓ: r] seg(τ). There are at most 2 such nodes τ at each level of the binary tree, so the total number of such nodes τ is O(log n). This implies that Apply runs in time O(log n).

B.4 Tolerant testing via smooth calibration

For completeness, we first make the simple (but to our knowledge, new) observation that, while the constants in Lemma 4 are not necessarily tight, there is a constant gap between d CE and sm CE.

Lemma 13. Suppose for constants B A > 0, it is the case that A d CE(D) sm CE(D) B d CE(D) for all distributions D over [0, 1] {0, 1}. Then, B

Proof. First, we claim that A 1. To see this, let (v, y) D be distributed where v = 1

2 with probability 1, and y Bern( 1

2 + ε) for some ε [0, 1

2]. Clearly, sm CE(D) = | 1

2 + ε)| = ε. Moreover, d CE(D) = ε, which follows from the same Jensen s inequality argument as in Lemma 5, so this shows that A 1. Next, we claim that B 3

2, concluding the proof. Consider the joint distribution over (u, v, y) in Table B.4, and let D be the marginal of (v, y). It is straightforward

Probability mass u v y w(v)

1 2 1 2 1 2 ε 1 1 1 2 1 2 1 2 0 1 ε

to check (u, y) is calibrated and E|u v| = ε

2, so d CE(D) ε

2. Moreover, sm CE(v, y) 3ε

4 , as witnessed by the Lipschitz weight function w(v) in Table B.4, finishing our proof that B 3

sm CE(v, y) E[(y v)w(v)] = 1

2 + ε 1 + 1

Using these claims, we now give our tolerant calibration tester in the regime ε1 > 4ε2.

Theorem 3. Let 0 ε2 ε1 1 satisfy ε1 > 4ε2, and let n Ctct 1 (ε1 4ε2)2 for a universal constant Ctct. There is an algorithm A which solves the (ε1, ε2)-tolerant calibration testing problem with n samples, which runs in time O n log2(n) .

Proof. Throughout the proof, let α := ε1

2 2ε2 > 0. Consider the following algorithm.

1. Sample n Ctct

α2 samples to form an empirical distribution b Dn, where Ctct is chosen large enough so that Lemma 3 guarantees |sm CE(D) sm CE( b Dn)| α

2 with probability 2

2. Call Corollary 1 to obtain β, the value of sm CE( b Dn).

3. Return yes if β 2ε2 + α

2 , and return no otherwise.

Conditioned on the event |sm CE(D) sm CE( b Dn)| α

2 , we show that the algorithm succeeds in tolerant calibration testing. First, if d CE(D) ε2, then sm CE(D) 2ε2 by Lemma 4, and therefore by the assumed success of Lemma 3, the algorithm will return yes. Second, if d CE(D) ε1, then sm CE(D) ε1

2 by Lemma 4, and similarly the algorithm returns no in this case. Finally, the runtime is immediate from Corollary 1; all other steps take O(1) time.

C Lower distance to calibration

In this section, we provide our main result on approximating the lower distance to calibration of a distribution on [0, 1] {0, 1}. We provide details on a framework for lifting constrained linear programs to equivalent unconstrained counterparts in Appendix C.1. In Appendix C.2, we next state preliminary definitions and results from [BGHN23a] used in our algorithm. In Appendix C.3, we then develop a rounding procedure compatible with a linear program which closely approximates the empirical lower distance to calibration. Finally, in Appendix C.4, we use our rounding procedure to design an algorithm for calibration testing, which solves the problem for a larger range of parameters than Theorem 4 (i.e. the entire relevant parameter range), at a quadratic runtime overhead.

C.1 Rounding linear programs

In this section, we give a general framework for approximately solving linear programs, following similar developments in the recent combinatorial optimization literature [She13, JST19]. Roughly speaking, this framework is a technique for losslessly converting a constrained convex program to an unconstrained one, provided we can show existence of a rounding procedure compatible with the constrained program in an appropriate sense. We begin with our definition of a rounding procedure.

Definition 6 (Rounding procedure). Consider a convex program defined on the intersection of convex set X with linear equality constraints Ax = b:

min x X Ax=b c x. (17)

We say Round is a ( e A, b, p)-equality rounding procedure for (A, b, c, X) if p 1, and for any x X, there exists x := Round(x) X such that Ax = b, e Ax = b, and

c x c x + e Ax b p . (18)

Intuitively, rounding procedures replace the hard-constrained problem (17) with its soft-constrained variants, i.e. the soft equality-constrained

min x X c x + e Ax b p , (19)

for some ( e A, b, p) constructed from the corresponding hard-constrained problem instance (parameterized by A, b, c, X). Leveraging the assumptions on our rounding procedure, we now show how to relate approximate solutions to these problems, generalizing Lemma 1 of [JST19].

Lemma 14. Let x be an ε-approximate minimizer to (19), and let Round be a ( e A, b, p)-equality rounding procedure for (A, b, c, X). Then x := Round(x) is an ε-approximate minimizer to (17).

Proof. We first claim that a minimizing solution to (19) satisfies the constraints Ax = b. To see this, given any x X, we can produce x X with Ax = b and such that x has smaller objective value in (19). Indeed, letting x := Round(x ), (18) guarantees

c x = c x + e Ax b p c x + e Ax b p ,

as claimed. Now let x X satisfying Ax = b minimize (19). Then, if x is an ε-approximate minimizer to (19) and x = Round(x), we have the desired claim from Ax = b, and

c x = c x + e Ax b p c x + e Ax b p c x + e Ax b p = c x + ε.

In the remainder of the section, we apply our rounding framework to a hard-constrained linear program, in the p = 1 geometry. To aid in approximately solving the soft-constrained linear programs arising from our framework, we use the following procedure from [JT23], building upon the recent

literature for solving box-simplex games at accelerated rates [She17, JST19, CST21]. In the statement of Proposition 2, we use the notation

A p q := max x Rn| x p 1 Ax q .

Notice that in particular, A 1 1 is the largest ℓ1 norm of any column of A.

Proposition 2 (Theorem 1, [JT23]). Let A Rn d, b Rd, c Rn, and ε > 0. There is an algorithm which computes an ε-approximate saddle point to the box-simplex game

min x [ 1,1]n max y d x Ay b y + c x, (20)

O nnz(A) A 1 1 log d

We also require a standard claim on converting minimax optimization error to error on an induced minimization objective. To introduce our notation, we say that x X is an ε-approximate minimizer of f : X R if f(x) minx X f(x ) ε. We call (x, y) X Y an ε-approximate saddle point to a convex-concave function f : X Y R if its duality gap is at most ε, i.e.

max y Y f(x, y ) min x X f(x , y) ε.

Lemma 15. Let f : X Y R be convex-concave for compact X, Y, and let g(x) := maxy Y f(x, y) for x X. If (x, y) is an ε-approximate saddle point to f, x is an ε-approximate minimizer to g.

Proof. Let y := argmaxy Yf(x, y) and x := argminx X f(x , y). The conclusion follows from

min x X g(x ) = min x X max y Y f(x , y ) = max y Y min x X f(x , y ) min x X f(x , y),

where we used strong duality (via Sion s minimax theorem), so that

g(x) min x X g(x ) g(x) f(x , y) = f(x, y ) f(x , y) ε.

The following corollary of Proposition 2 will be particularly useful in our development, which is immediate using Lemma 15, upon negating the box-simplex game (20), exchanging the names of the variables (x, y), (b, c), and explicitly maximizing over y [ 1, 1]n, i.e.

min x d c, x + Ax b 1 = min x d max y [ 1,1]n c, x + y (Ax b).

Corollary 2. Let A Rn d, b Rn, c Rd, and ε > 0. There is an algorithm which computes an ε-approximate minimizer to minx d c x + Ax b 1, in time

O nnz(A) A 1 1 log d

C.2 LDTC preliminaries

In this section, we collect preliminaries for our testing algorithm based on estimating the LDTC. First, analogously to Lemma 3, we recall a bound from [BGHN23a] on the deviation of the empirical estimate of d CE(D) from the population truth which holds with constant probability. Lemma 16 (Theorem 9.10, [BGHN23a]). For any ε (0, 1), there is an n = O( 1

ε2 ) such that, if b Dn is the empirical distribution over n i.i.d. draws from D, with probability 2

3, d CE(D) d CE( b Dn) ε.

Next, given a set U [0, 1], we provide an analog of Definition 1 which is restricted to U.

Definition 7 (U-LDTC). Let U [0, 1], and let D be a distribution over [0, 1] {0, 1}. Define ext U(D) to be all joint distributions Π over (u, v, y) U [0, 1] {0, 1}, with the following properties.

The marginal distribution of (v, y) is D.

The marginal distribution (u, y) is perfectly calibrated, i.e. EΠ[y|u] = u.

The U-lower distance to calibration (U-LDTC) of D, denoted d CEU(D), is defined by

d CEU(D) := inf Π ext U(D) E(u,v,y) Π |u v| .

Note that if we require {0, 1} U, then ext U(D) is always nonempty, because we can let u = y with probability 1. We also state a helper claim from [BGHN23a], which relates d CEU to d CE.

Lemma 17 (Lemma 7.11, [BGHN23a]). Let D be a distribution over [0, 1] {0, 1}, and let U be a finite ε

2-covering of [0, 1] satisfying {0, 1} U. Then, d CE(D) d CEU(D) d CE(D) + ε.

To this end, in the rest of the section we define, for any ε (0, 1),

Uε := {0, 1} iε

which is an ε

2-cover of [0, 1] satisfying |Uε| = O( 1

ε). Finally, we state a linear program, derived in [BGHN23a], whose value equals d CEU(D), when the first marginal of D is discretely supported.

Lemma 18 (Lemma 7.6, [BGHN23a]). Let U, V [0, 1] be discrete sets, where {0, 1} U, and let D be a distribution over V {0, 1}, where for (v, y) V {0, 1} we denote the probability of (v, y) D by D(v, y). The following linear program with 2|U||V | variables Π(u, v, y) for all (u, v, y) U V {0, 1}, is feasible, and its optimal value equals d CEU(D):

min Π R2|U||V | 0

(u,v,y) U V {0,1} |u v| Π(u, v, y)

such that X

u U Π(u, v, y) = D(v, y), for all (v, y) V {0, 1},

and (1 u) X

v V Π(u, v, 1) = u X

v V Π(u, v, 0), for all u U.

C.3 Rounding for empirical U-LDTC

In this section we fix a dataset under consideration,

b Dn := {(vi, yi)}i [n] [0, 1] {0, 1},

and the corresponding empirical distribution, also denoted b Dn, where (v, y) b Dn means (v, y) = (vi, yi) with probability 1

n for each i [n]. We let V := {vi}i [n] be identified with [n] in the natural way. Moreover, for a fixed parameter ε (0, 1) throughout, we let U := Uε defined in (21). Finally, we denote m := |U| = O( 1

ε), and let U [0, 1]m m be the diagonal matrix whose diagonal entries correspond to U. We also identify elements of U with j [m] in an arbitrary but consistent way, writing uj [0, 1] to mean the jth element of U according to this identification.

We next rewrite the linear program in Lemma 18 into a more convenient reformulation.

Lemma 19. The linear program in Lemma 18 can equivalently be written as:

d CEU( b Dn) := min x X Mx= 1

n 1n UB0x0=(Im U)B1x1

c x, where X := 2mn and we denote x = x0 Rmn x1 Rmn

where we define c R2mn, M Rn 2mn, and B0, B1 Rm mn by c(i,j,k) := |uj vi| for all (i, j, k) [n] [m] {0, 1},

Mi ,(i,j,k) := 1 yi = k, i = i 0 else for all i [n], (i, j, k) [n] [m] {0, 1},

and [B0]j ,(i,j,k) := 1 j = j , k = 0 0 else for all j [m], (i, j, k) [n] [m] {0, 1},

[B1]j ,(i,j,k) := 1 j = j , k = 1 0 else for all j [m], (i, j, k) [n] [m] {0, 1}.

Proof. This is clear from observation, but we give a brief explanation of the notation. First, x X represents the density function of our joint distribution Π over U V {0, 1}, and has 2mn coordinates identified with elements (i, j, k) V U {0, 1} [n] [m] {0, 1}. We let the subset of coordinates with k = 0 be denoted x0 Rmn, defining x1 similarly. Recalling the definition of the linear program in Lemma 18, x(i,j,k) is indeed reweighted by c(i,j,k) = |uj vi|.

Next, M represents the marginal constraints in Lemma 18, and enforcing Mx = 1

n1n is equivalent to the statement that, for each i [n], the sum of all entries (i, j, k) of x with i = i and k = yi is 1

n, since that is the probability density assigned to (vi , yi ) by the distribution b Dn.

Lastly, the jth calibration constraint in Lemma 18 is enforced by the jth row of the equation UB0x0 = (Im U)B1x1, which reads uj [B0]j:, x0 = (1 uj) [B1]j:, x1 . We can check by the definitions of [B0]j:, [B1]j: that this is consistent with our earlier calibration constraints.

We give a convenient way of visualizing the marginal and calibration constraints described in Lemma 19. For convenience, we identify each x 2mn with an m 2n matrix mat(x) = X = X0 Rm n X1 Rm n , (23)

where X0 consists of entries of x0 arranged in a matrix fashion (with rows corresponding to [m] U and columns corresponding to [n] V ), and similarly X1 is a rearrangement of x1, recalling (22). When explaining how we design our rounding procedures to modify X to satisfy constraints, it will be helpful to view entries of X as denoting an amount of physical mass which we can move around.

There are 2n columns in X, corresponding to pairs (i, k) V {0, 1}; among these, we say n columns are active, where column (i, k) is active iff yi = k, and we say the other n columns are inactive. Following notation (23), the marginal constraints X = 1 n1n simply ask that the total amount of mass in each active column is 1

n, so there is no mass in any inactive column since x 2mn.

Moreover, there are m rows in X, each corresponding to some j U. If we let ℓj denote the amount of mass on [X0]j: and rj the amount of mass on [X1]j:, the jth calibration constraint simply asks that ujℓj = (1 uj)rj, i.e. it enforces balance on the amount of mass in each row s two halves.

Finally, for consistency with Definition 6, the linear program in (22) can be concisely written as

min x X Ax=b c x, where A := M B

, B := (UB0 (Im U)B1) , b := 1

and c, X are as defined in (22). In the rest of the section, following Definition 6, we develop an equality rounding procedure for the equality-constrained linear program in (24) in two steps.

1. In Lemma 20, we first show how to take x X with Mx 1

n1n 1 = , and produce x X such that Mx = 1

n1n (i.e. x now satisfies the marginal constraints) and x x 1 = O( ). 2. In Lemma 22, we then consider x X such that, following the notation (22), UB0x0 (Im U)B1x1 1 = . We show how to produce x X such that Mx = Mx (i.e. the marginals of x are unchanged), UB0x 0 = (Im U)B1x 1 (i.e. x is calibrated), and c, x x = O( ).

Our rounding procedure uses Lemma 20 to satisfy the marginal constraints in (22), and then applies Lemma 22 to the result to satisfy the calibration constraints in (22) without affecting the marginal constraints. By leveraging the stability guarantees on these steps, we can show this is indeed a valid rounding procedure in the sense of (18). We now give our first step for marginal satisfication.

Lemma 20 (Marginal satisfaction). Following notation in (22), let x X satisfy Mx 1

n1n 1 = . There is an algorithm which runs in time O(mn), and returns x with

n1n, x x 1 2 .

Proof. Recall for i [n], we say column i of X0 is active if yi = 0, and similarly column i of X1 is active if yi = 1. We call I the set of n inactive columns, and partition A, which we call the set of n active columns, into three sets A>, A=, and A<, where A> are the columns whose sums are > 1

n, A< are the columns whose sums are < 1

n, and A= are the remaining columns. Hence, every column of X belongs to I, A>, A=, or A . Note that until |A=| = n, we can never have A< = , since this means all column sums in A are 1

n (with at least 1 strict inequality), contradicting x X.

We first take columns i A> one at a time, and pair them with an arbitrary column in i A<, moving mass from column i arbitrarily to column i until either column i or column i enters A=. We charge this movement to the marginal constraints corresponding to i and i , since the constraints were violated by the same amount as the mass being moved. After this process is complete, A> is empty, and we only moved mass from columns originally in A> to columns originally in A<.

Next, we take columns i I one at a time, and pair them with an arbitrary column i A<, moving mass until either column i is 0m or column i enters A=. We can charge half this movement to the marginal constraint corresponding to i , since the sign of the marginal violation stays the same throughout. Hence, the overall movement is 2 . After this is complete, all columns in I are 0m and all columns in A are in A=, so we can return x 2mn corresponding to the new matrix.

It is clear both steps of this marginal satisfaction procedure take O(mn) time, since we can sequentially process columns in A until they enter A=, and will never be considered again.

We next describe a procedure which takes x 2mn, and modifies it to satisfy the calibration constraints UB0x = (Im U)B1x without changing the marginals Mx. We first provide a helper lemma used in our rounding procedure, which describes how to fix the jth marginal constraint.

Lemma 21. Let x 2mn and X := mat(x) as defined in (23). Let j [m] correspond to an element uj U, let ℓj := [X0]j: 1, rj := [X1]j: 1, and let j := |ujℓj (1 uj)rj|. There exists j [m] such that we can move mass from only Xj: to Xj :, resulting in Rm 2n X x 2mn such that Mx = Mx, uj [X 0]j: 1 = (1 uj) [X 1]j: 1, and c, x x j.

Proof. Without loss of generality, suppose that the row j = 1 corresponds to uj = 0, and j = m corresponds to uj = 1. We split the proof into two cases, depending on the sign of ujℓj (1 uj)rj.

Case 1: ujℓj > (1 uj)rj. We let j = 1, i.e. we only move mass from the jth row to the first row. Specifically, we leave [X1]j: unchanged, and move mass from [X0]j: to [X0]1:, making sure to only move mass in the same column. The total amount of mass we must delete from [X0]j: is

uj rj = ujℓj (1 uj)rj

Our strategy is to arbitrarily move mass within columns until we have deleted j

uj total mass. If we denote the mass moved in column i [n] as δij, and let x be the result after the move,

i [n] |c(i,1,0) c(i,j,0)|δij = X

i [n] ||uj vi| |u1 vi|| δij X

i [n] ujδij = j.

Here, the first inequality was the triangle inequality, the first equality used the definition of c in (22), the second inequality used u1 = 0 and the triangle inequality, and the last used P

i [n] δij = j

Case 2: ujℓj < (1 uj)rj. This case is entirely analogous; we move mass arbitrarily from row j to row m, i.e. the last row with um = 1. The amount of mass we must move is

rj uj 1 uj ℓj = (1 uj)rj ujℓj

1 uj = j 1 uj .

Again denoting the amount of mass moved from column i [n] as δij, the claim follows:

i [n] ||uj vi| |um vi|| δij X

i [n] (1 uj)δij = j.

By iteratively applying Lemma 21, we have our marginal-preserving calibration procedure. Lemma 22 (Marginal-preserving calibration). Following the notation (24), given x 2mn with Bx 1 = , we can compute x with Mx = Mx, Bx = 0m, and c, x x in O(mn) time.

Proof. It suffices to apply Lemma 21 to each row i [m]. All of the movement in the rows i [2, m 1] are independent of each other, and do not affect the imbalance in the rows i {1, m} when we have finished applying Lemma 21, since e.g. u1ℓ1 = 0 regardless of how much mass is moved to [X0]1:, and a similar property holds for the mth row. The total change in c, x x is thus boundable by P

j [m] j , and applying Lemma 21 to each row takes O(n) time. Finally, Mx = Mx follows because we only move mass within the same column, so no marginal changes.

By combining Lemma 20 with Lemma 22, we can complete our rounding procedure.

Lemma 23. Let (A, b, c, X) be defined as in (22), (24), and let ( e A, b) := (4A, 4b). There exists Round, a ( e A, b, 1)-equality rounding procedure for (A, b, c, X), running in O(mn) time.

Proof. Throughout the proof, let M := Mx 1

n1n 1 and B := Bx 1, following the notation (24). We also denote the total violation by

:= e Ax b 1 = 4 M + 4 B.

We first apply Lemma 20 to x to produce x satisfying x x 1 2 M and M x = 1

n1n, in O(mn) time. Note that, because B 1 1 1 since all columns of B are 1-sparse, we have B x 1 Bx 1 + B 1 1 x x 1 B + 2 M.

Next, we apply Lemma 22 to x, resulting in x with Mx = 1

n1n, Bx = 0m, and c, x x B + 2 M, in O(mn) time. Recalling the definition (19), we have the conclusion from c 1, so c (x x) c ( x x) + c (x x)

c x x 1 + c (x x) 2 M + B + 2 M .

We conclude by applying the solver from Corollary 2 to our resulting unconstrained linear program. Proposition 3. Let ε 0. We can compute x X, an ε-approximate minimizer to (22), in time

Further, the objective value of x in (22) is a 2ε-additive approximation of d CE( b Dn).

Proof. Observe that for e A = 4A, we have e A 1 1 8 and nnz( e A) = O(mn), since no column is more than 2-sparse and all entries of A are in [ 1, 1]. Further, recalling the definition of U from (21), we have m = O( 1

ε). So, Corollary 2 shows we can compute an ε-approximate minimizer to

min x 2mn c x + e Ax b 1 within the stated runtime. The rest of the proof follows using Round from Lemma 23, where we recall |d CE( b Dn) d CEU( b Dn)| ε due to our definition of U and Lemma 17.

C.4 Testing via LDTC

We now give analogs of Theorems 1 and 3, using our solver in Proposition 3.

Theorem 4. Let 0 ε2 ε1 1 satisfy ε1 > ε2, and let n Ctct 1 (ε1 ε2)2 for a universal constant Ctct. There is an algorithm A which solves the (ε1, ε2)-tolerant calibration testing problem with n samples, which runs in time

Proof. Throughout the proof, let α := ε1 ε2 > 0. Consider the following algorithm.

1. For |U| = m 6

α, sample n Ctct

α2 samples to form an empirical distribution b Dn, where Ctct is chosen so Lemma 3 guarantees |d CE(D) d CE( b Dn)| α

6 with probability 2

2. Call Proposition 2 with ε α

6 to obtain β, an α

3 -additive approximation to |d CE( b Dn)|.

3. Return yes if β ε2 + α

2 , and return no otherwise.

Conditioned on the event that |d CE(D) d CE( b Dn)| α

6 , we show that the algorithm succeeds. First, if d CE(D) ε2, then d CE( b Dn) ε2 + α

6 by assumption, and so β ε2 + α

2 by Proposition 2, so the tester will return yes. Second, if d CE(D) ε1, then d CE( b Dn) ε1 α

6 by assumption, so β ε1 α

2 by Proposition 2 and similarly the tester will return no in this case. Finally, the runtime is immediate from Proposition 3 and the definition of α.

Theorem 4 has the following implication for (standard) calibration testing, by letting ε2 = 0. Corollary 3. Let n N and let εn (0, 1) be minimal such that it is information-theoretically possible to solve the εn-calibration testing problem with n samples. For some ε = Θ(εn), there is an algorithm A which solves the ε-calibration testing problem with n samples, which runs in time

O n2 log(n) .

D Sample complexity lower bounds for calibration measures

Recent works [BN23, BGHN23a] have introduced other calibration measures (e.g. the convolved ECE and interval CE), given efficient estimation algorithms for them, and showed that they are polynomially related to the lower distance to calibration d CE. Therefore, an alternative approach to the (non-tolerant) testing problem for d CE is by reducing it to testing problems for these measures. The main result of this section is that this approach leads to suboptimal sample complexity: the testing problems for these measures cannot be solved given only O(ε 2) data points {(vi, yi)}i [n].

To establish this sample complexity lower bound, we construct a perfectly calibrated distribution D0 and a family of miscalibrated distributions Dθ parameterized by θ belonging to a finite set. We use D n 0 (and D n θ ) to denote the joint distribution of n independent examples from D0 (and Dθ). In Lemma 24, we show that the total variation distance between D n 0 and the mixture E[D n θ ] of D n θ is small unless n is large, and thus distinguishing them requires large sample complexity. Consequently, the testing problem for a calibration measure has large sample complexity if it assigns every Dθ a large calibration error. Finally, we show every Dθ indeed has large convolved ECE and interval CE, establishing sample complexity lower bounds for these measures in Theorems 5 and 6.

To construct D0 and Dθ, we consider t values {ui}i [t] [ 1

3] where ui = 1

3 + i 3t for i [t]. We will determine the value of t N later. We also define the following distribution, a perfectly calibrated distribution which is related to the miscalibrated synthetic dataset used in Section 3. Definition 8. The distribution D0 of (v, y) [0, 1] {0, 1} is defined such that the marginal distribution of v is uniform over {ui}i [t] and ED0[y|v] = v.

Fix α (0, 1

3). For θ { 1, 1}t, we define distribution Dθ of (v, y) [0, 1] {0, 1} such that the marginal distribution of v is uniform over {ui}i [t] and EDθ[y|v = ui] = ui + θiα. In other words,

each conditional distribution given v is miscalibrated by α, but the bias takes a random direction. We now follow a standard approach by [IS03] to bound the total variation between our distributions. Lemma 24. For any t N and α (0, 1

d TV(D n 0 , Eθ[D n θ ]) 1

Here, to construct the mixture distribution Eθ[D n θ ], we first draw θ unif. { 1, 1}t, and then draw n independent examples from Dθ. We denote the distribution of the n examples by Eθ[D n θ ].

Proof. By a standard inequality between the total variation distance and the χ2 distance, we have

d TV(D n 0 , Eθ[D n θ ]) 1

χ2(Eθ[D n θ ] D n 0 ). (25)

By Ingster s method [IS03] (see also Section 3.1 of [Can22]),

χ2(Eθ[D n θ ] D n 0 ) = Eθ,θ

Dθ(ui, j)Dθ (ui, j)

where the expectation is over θ, θ drawn i.i.d. unif. { 1, 1}t. For every i [t], we have

Dθ(ui, 1)Dθ (ui, 1)

D0(ui, 1) =

t ui+θ iα t

t + (θi + θ i)α t + θiθ iα2

and similarly

Dθ(ui, 0)Dθ (ui, 0)

D0(ui, 0) =

t (1 ui θ iα) t

t (θi θ i) t + θiθ iα2

Adding up the two equations, we get X

Dθ(ui, j)Dθ (ui, j)

D0(ui, j) = 1

t + θiθ iα2

ui + 1 1 ui

t + 9θiθ iα2

where the last inequality uses the fact that ui [ 1

3]. Plugging this into (26), we get

χ2(Eθ[D n θ ] D n 0 ) Eθ,θ

i=1 Eθ,θ exp 9α2n

1 (by Hoeffding s lemma)

= exp 81α4n2

Plugging this into (25) completes the proof.

D.1 Lower bound for convolved ECE

We now introduce the definition of convolved ECE from [BN23], and show that for every θ { 1, 1}t, Dθ has a large convolved ECE in Lemma 26. This allows us to prove our sample complexity lower bound for convolved ECE in Theorem 5, by applying Lemma 24.

Definition 9 (Convolved ECE [BN23]). Let πR : R [0, 1] be the periodic function with period 2 satisfying πR(v) = v if v [0, 1], and πR(v) = 2 v if v [1, 2]. Consider a distribution D over [0, 1] {0, 1}. For (v, y) D, define random variable ˆv := πR(v + η), where η is drawn independently from N(0, σ2) for a parameter σ 0. The σ-convolved ECE is defined as follows:

c ECEσ(D) := E|E[(y v)|ˆv]|,

where the outer expectation is over the marginal distribution of ˆv, and the inner expectation is over the conditional distribution of (y, v) given ˆv. It has been shown in [BN23] that c ECEσ(D) [0, 1] is a nonincreasing function of σ 0 and there exists a unique σ 0 satisfying c ECEσ (D) = σ . The convolved ECE c ECE(D) is defined to be c ECEσ (D).

We also mention that the following relationship is known between c ECE and d CE. Lemma 25 (Theorem 7, [BN23]). For any distribution D over [0, 1] {0, 1}, it holds that

1 2d CE(D) c ECE(D) 2 p

We have the following lower bound on c ECE(Dθ): Lemma 26. For integer t 3, choose α = 1 t

ln t. Then for every θ { 1, 1}t,

c ECE(Dθ) 1 100t

Proof. It suffices to show that c ECEσ(Dθ) 1 100t

ln t whenever σ 1 100t

Consider (v, y) D and ˆv = πR(v + η), where η is drawn independently from N(0, σ2). By standard Gaussian tail bounds, we have

Next, consider a function ℓ: [0, 1] [t] such that ℓ(ˆv) argmini [t]|ui ˆv|. Let E denote the event that v = uℓ(ˆv). Let IE and I E be the indicators of E and its complement, respectively. We have

E[(y v)IE | ˆv, v] = IEE[y v | ˆv, v] (IE is fully determined by v and ˆv) = IEE[y v | v] (y is independent of ˆv given v) = IEE[y v | v = uℓ(ˆv)] = IEθℓ(ˆv)α.

Taking expectation over v conditioned on ˆv, we have

|E[(y v)IE | ˆv]| = | Pr[E | ˆv]θℓ(ˆv)α| = Pr[E | ˆv]α.

We also have E h (y v)I E | ˆv i E h |(y v)I E| | ˆv i Pr[ E | ˆv].

Therefore, |E[y v | ˆv]| Pr[E|ˆv]α Pr[ E | ˆv]. Taking expectations over ˆv, we have

c ECEσ(Dθ) = E[|E[y v | ˆv]|] Pr[E]α Pr[ E]. (28)

Whenever E does not occur, it must hold that |v ˆv| 1 6t, which can only hold when |η| 1 6t. Therefore, by plugging (27) into (28), we get

c ECEσ(Dθ) 1 1

Theorem 5. If A is an ε-c ECE tester with n samples (Definition 5), for ε (0, 1

1 ε2.5 ln0.25( 1

Proof. Without loss of generality, assume that ε 10 3. Let t 3 be the largest integer satisfying ε 1 100t

ln t. We choose α = 1 t

By Lemma 26, we have c ECE(Dθ) ε for every θ { 1, 1}t. By the guarantee of the tester, we have

d TV(D n 0 , Eθ[D n θ ]) 1

Combining this with Lemma 24, we get n = Ω(α 2

t) = Ω(t2.5 ln t), so the claim holds.

D.2 Lower bound for (surrogate) interval CE

The interval calibration error was introduced in [BGHN23a] as a modified version of the popular binned ECE to obtain a polynomial relationship to the lower distance to calibration (d CE). To give an efficient estimation algorithm, [BGHN23a] considered a slight variant of the interval calibration error, called the surrogate interval calibration error, which preserves the polynomial relationship. Below we include the definition of the surrogate interval calibration error, and its polynomial relationship with d CE. We then establish our sample complexity lower bound (Theorem 6) for surrogate interval CE by showing that every Dθ has a large surrogate interval CE (Lemma 28).

Definition 10 ([BGHN23a]). For a distribution D over [0, 1] {0, 1} and an interval width parameter w > 0, the random interval calibration error is defined to be

Rint CE(D, w) := Er

j Z |E(v,y) D[(y v)I(v Iw r,j)]|

where the outer expectation is over r drawn uniformly from [0, w) and Iw r,j is the interval [r + jε, r + (j + 1)ε). Note that although the summation is over j Z, there are only finitely many j that can contribute to the sum (which are the j that satisfy Iw r,j [0, 1] = ). The surrogate interval calibration error is defined as follows:

Sint CE(D) := inf k Z 0

Rint CE(D, 2 k) + 2 k .

Lemma 27 (Theorem 6.11, [BGHN23a]). For any distribution D over [0, 1] {0, 1}, it holds that

d CE(D) Sint CE(D) 6 p

Lemma 28. For t N, let α = 1

3t. Then for every θ { 1, 1}t, it holds that Sint CE(Dθ) 1

Proof. It suffices to prove that Rint CE(Dθ, w) 1

3t whenever w < 1

3t, where we recall the definition (29). Fix some r [0, 1]. Every ui belongs to the interval Iw r,ji for a unique ji Z. Since the interval width w is smaller than the gap between ui and ui for distinct i, i , we have ji = ji . Therefore, X

j Z |E(v,y) D[(y v)I(v Iw r,j)]| X

i [t] |E(v,y) D[(y v)I(v Iw r,ji)]|

i [t] |E(v,y) D[(y v)I(v = ui)]|

i [t] Pr[v = ui]E[y v|v = ui]

Plugging this into (29), we get Rint CE(Dθ, w) 1

Theorem 6. If A is an ε-Sint CE tester with n samples (Definition 5), for ε (0, 1

Proof. Choose t 1 to be the largest integer satisfying 1

3t ε, and choose α = 1

3t. By Lemma 28, Sint CE(Dθ) ε for every θ { 1, 1}t. By the guarantee of the tester, we have

d TV(D n 0 , Eθ[D n θ ]) 1

Combining this with Lemma 24, we get n = Ω(α 2

t) = Ω(ε 2.5).

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: We claimed a novel testing problem for measuring calibration, faster algorithms for solving it, and lower bounds for alternative measures. This is precisely what we provide. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We explain that our basic tester does not solve the tolerant testing problem in all regimes, and that our more powerful tolerant tester has a slower runtimes. We also state that our practical implementation is preliminary and could be optimized. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: We give full, detailed proofs of all theoretical claims.

Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We provide a complete description of our experimental setups, and all of the relevant code in the supplementary material for reproducibility.

Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: As previously mentioned, we give full descriptions and provide our code. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: See above. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We replicated our experiments across multiple runs and provide error bars. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We describe all our computing environments and provide runtime metrics. Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We read the ethics guidelines and believe we meet them. Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: We provide a new tool for assessing calibration, a measure of interpretability of models, which is not tied to any particular societal application, and do not foresee a pathway to use this tool for negative societal impact. Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: We only use simple synthetic or public datasets for our experiments.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We made sure to cite all relevant code and data used in our experiments.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: We do not release new assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Our paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Our paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.