# Diagnostic Tool for Out-of-Sample Model Evaluation

Published in Transactions on Machine Learning Research (10/2023)

Ludvig Hult (ludvig.hult@it.uu.se), Department of Information Technology, Uppsala University
Dave Zachariah (dave.zachariah@it.uu.se), Department of Information Technology, Uppsala University
Petre Stoica (ps@it.uu.se), Department of Information Technology, Uppsala University

Reviewed on OpenReview: https://openreview.net/forum?id=Ulf3QZG9DC

Abstract

Assessment of model fitness is a key part of machine learning. The standard paradigm of model evaluation is analysis of the average loss over future data. This is often explicit in model fitting, where we select models that minimize the average loss over training data as a surrogate, but comes with limited theoretical guarantees. In this paper, we consider the problem of characterizing a batch of out-of-sample losses of a model using a calibration data set. We provide finite-sample limits on the out-of-sample losses that are statistically valid under quite general conditions and propose a diagnostic tool that is simple to compute and interpret. Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyperparameter tuning.

1 Introduction

Fitting a model to data is a central task in machine learning, signal processing, statistics and other areas (Bishop, 2006; Hastie et al., 2009; Söderström & Stoica, 2001; Kay, 1993; Fitzmaurice et al., 2011). A fitted model f can be assessed by considering a loss function ℓ(·) that evaluates the model on future data points. This is called out-of-sample analysis, since it considers data points beyond those in the training data sample.
Classical statistical models are often assessed by verifying assumption validity through statistical or graphical tools, a process called model diagnostics, diagnostic checks, or regression diagnostics (Ruppert et al., 2003; Belsley et al., 1980). The term has also been established in a wider sense for general checks to verify a model's fitness for given data, see e.g. Casper et al. (2022); Zhang et al. (2022). We propose a diagnostic tool that helps evaluate the model performance on out-of-sample data.

In this paper, we consider the problem of characterizing m out-of-sample losses of a model f using a calibration data set. The choice of m will depend on the application. In online learning, m = 1 is reasonable since the model may only be valid for a single prediction. In problems with low data volumes, such as a predictive health-care model to be used on e.g. 30 future patients with a rare disease, m = 30 may be appropriate. For models in high-volume applications, such as e-commerce, the limit m → ∞ may be relevant. We derive finite-sample limits on the out-of-sample losses that are statistically valid under quite general conditions and for any of these choices of m. An illustration is provided in Fig. 1, where we fit a model for predicting house prices to training data and evaluate its absolute prediction error on a calibration data set D. The average

Figure 1: Model error diagnostic. (Legend: lal-curve (β = 0.50), lal-curve (β = 0.75), expected loss from CV.) Consider a predictive model f(X) of house price Y in an area described by a feature vector X. How will this model perform in m future price predictions? We evaluate the model using the prediction error ℓ(X, Y) = |Y − f(X)| as the chosen loss function. The cross-validation method (CV) evaluates the model by estimating the expected out-of-sample loss, which in this case is circa $36K (indicated by the black vertical line).
Being an estimate of an expectation, it provides limited information about what individual loss values might be observed. Suppose the model will be used in m = 100 prediction instances, and we want to bound at least β = 50% of these future losses. Then the lal-curve infers that with probability at least 80% (level α = 20%), this fraction of prediction errors will not exceed $25K (see blue solid curve). For stronger guarantees, e.g. bounding at least β = 75% of the future losses, the lal-curve indicates that prediction errors of $48K must be tolerated at level α = 20% (see orange dashed curve). Full details for this experiment are provided in Sec. 4.1.

loss on D is a form of cross-validation and is indicated in the figure. The figure also illustrates the proposed diagnostic statistic: an upper bound ℓ^β_α(D) on the β-fraction of m yet unobserved prediction errors that holds with confidence level 1 − α. By computing this statistical limit ℓ^β_α(D) as a function of α, we quantify how probable different out-of-sample losses are for the model f. Since this quantification is valid for any size of D, the limit can be employed as a diagnostic tool also in cases where calibration data is scarce or costly to obtain. Moreover, as the validity does not depend on the distribution of the data used to train f, the limit can also be used to analyze the severity of distribution shifts, as we will illustrate in the numerical experiments below.

The rest of the paper is organized as follows. We first formalize the general problem of interest, then propose a measure ℓ^β_α(D), which we refer to as the level-α limit (lal) on the β-fraction of m out-of-sample losses, and prove its statistical guarantees. This is followed by a series of numerical experiments that demonstrate the utility of lal. In the closing discussion, the proposed method is related to existing literature.

2 Problem Statement

Let f denote a model fitted to data samples D0 drawn from a distribution p0.
We aim to quantify the performance of f on m out-of-sample data points {Z_1, . . . , Z_m} drawn from a distribution p, which may differ from p0. The performance is quantified using any real-valued loss function ℓ(z) the user wants. We let upper-case letters denote random variables, e.g., Z, and let the lower-case version, z, represent their realization.

Example 1 (Density Estimation using a Gaussian model). A data point is a vector z ∈ R^K and we consider a Gaussian density model f(z) = N(z; µ̂, Σ̂), with a fitted mean µ̂ and covariance matrix Σ̂. A common loss function is the negative log-likelihood

ℓ(z) = −2 ln f(z) = (z − µ̂)^T Σ̂^{-1} (z − µ̂) + ln |Σ̂|,

ignoring the constant and scaling by a factor 2.

Example 2 (Regression). A data point is a pair of features x and a label y, i.e., z = (x, y). The model f(x) is any estimate of the conditional expectation function E[y|x]. Example models include Gaussian process regressors, random forest regressors, and neural networks. A common loss function is the squared-error loss ℓ(x, y) = |y − f(x)|^2.

We assume that we have access to a calibration data set D = {Z^c_1, . . . , Z^c_n} with n samples drawn randomly from p, such that the combined data set of n + m samples is exchangeable, i.e., the joint density of the random vector (Z^c_1, . . . , Z^c_n, Z_1, . . . , Z_m) is invariant under relabeling of the data points. While the calibration data set {Z^c_1, . . . , Z^c_n} is available to use in computations, the out-of-sample data {Z_1, . . . , Z_m} is future, yet unseen data. Exchangeability includes the common independent and identically distributed (iid) data assumption as a special case. An example of exchangeable non-iid data is sampling without replacement from a finite population. The problem we consider is to characterize the m unknown out-of-sample losses

{ℓ(Z_1), . . . , ℓ(Z_m)}    (1)

of the model f.
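As a quick numerical illustration of Example 1's loss, consider the scalar case (K = 1, so Σ̂ reduces to a variance); this is our own sketch, and the fitted parameters below are hypothetical:

```python
import math

def gaussian_nll_loss(z, mu_hat, var_hat):
    """Example 1's loss for K = 1: -2 ln f(z) with the constant dropped,
    i.e. (z - mu)^2 / var + ln(var)."""
    return (z - mu_hat) ** 2 / var_hat + math.log(var_hat)

# hypothetical fitted model: mu_hat = 0, var_hat = 1
print(gaussian_nll_loss(1.0, 0.0, 1.0))  # quadratic term 1.0, ln(1) = 0
```

For K > 1 the same formula applies with the quadratic form (z − µ̂)^T Σ̂^{-1} (z − µ̂) and the log-determinant ln |Σ̂|.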
Specifically, we want to bound a fraction β ∈ (0, 1) of the m losses with confidence 1 − α. That is, we want to find a statistical limit ℓ^β_α(D) on how large future losses we are likely to observe for the model:

P[ at least a fraction β of the losses ℓ(Z_i) satisfies ℓ(Z_i) ≤ ℓ^β_α(D) ] ≥ 1 − α.    (2)

We will call such a value ℓ^β_α(D) the level-α limit for the β-fraction of m losses, or a lal for short. We call the graph of ℓ^β_α(D) versus α a lal-curve, as illustrated in, e.g., Fig. 1. By plotting the value ℓ^β_α(D) on the horizontal axis, we can visualize the tail behaviour of the out-of-sample losses for f in a transparent manner. Thus, given a valid lal, we propose to use the lal-curve as a diagnostic tool for model evaluation.

For notational convenience, let L_i = ℓ(Z_i) and arrange the m out-of-sample losses (1) in increasing order: L_(1) ≤ L_(2) ≤ · · · ≤ L_(m). The criterion (2) can be expressed compactly as

P[ L_(⌈mβ⌉) > ℓ^β_α(D) ] ≤ α    (3)

and our objective is to find a limit ℓ^β_α(D) for any given β ∈ (0, 1] and α ∈ (0, 1). For the calibration data D, let L^c_i = ℓ(Z^c_i) and define the set of losses {L^c_1, . . . , L^c_n} and order statistics L^c_(1) ≤ L^c_(2) ≤ · · · ≤ L^c_(n). We also define the special cases L^c_(0) and L^c_(n+1) to be the infimum and supremum of the support of the distribution of {L^c_i}_{i=1}^n, possibly ∓∞.

3.1 General LAL expression

The following result holds generically; it assumes that two or more continuously distributed random variables do not take exactly the same value. The probability of such an event is zero. To simplify the reading, we will skip the assertions of "almost surely" and "assuming no ties".

Theorem 1. Let ℓ^β_α(D) = L^c_(k*), where

k* = min{ k ∈ {1, . . . , n + 1} | a(k) ≤ α },   a(k) = Σ_{j=k}^{n} C(n−j+m−⌈mβ⌉, n−j) · C(j+⌈mβ⌉−1, j) / C(n+m, m),    (4)

and C(·, ·) denotes the binomial coefficient. This is a valid lal, satisfying (3).

Proof. Define the set of calibration losses L^c = {L^c_1, . . . , L^c_n} and out-of-sample losses L = {L_1, . . . , L_m}. We will first prove the result in the case of continuous random variables, and then for discrete random variables.
Consider the case when L^c ∪ L has a continuous joint distribution, so there are no ties. We need to show that

P[ L_(⌈mβ⌉) > L^c_(k*) ] ≤ α.    (5)

Because of the decomposition into the sum

P[ L_(⌈mβ⌉) > L^c_(k*) ] = P[ L^c_(k*) ≤ L_(⌈mβ⌉) ] = Σ_{j=k*}^{n} P[ L^c_(j) ≤ L_(⌈mβ⌉) < L^c_(j+1) ],    (6)

a closed-form expression for P[ L^c_(j) ≤ L_(i) < L^c_(j+1) ] may lead us to proving (5). The set of random variables L^c ∪ L = {L^c_1, . . . , L^c_n, L_1, . . . , L_m} can be sorted. Denote these sorted variables J_(1) < J_(2) < · · · < J_(n+m). Every such value J_(i) comes from an origin set, which is either L^c or L. This assignment to the origin set partitions the set of J_(i)-values into two subsets. There are C(n+m, m) such partitions, and they are equally probable, due to the assumed exchangeability.

We next investigate how many of the partitions fulfil L^c_(j) ≤ L_(i) < L^c_(j+1), for 1 ≤ i ≤ m, 1 ≤ j ≤ n. In this situation, L_(i) = J_(i+j). Of the i + j − 1 losses less than J_(i+j), j have L^c as the origin set. There are thus C(j+i−1, j) equally probable partitions of the lesser loss values. Similarly, the loss values greater than L_(i) can be partitioned in C(n−j+m−i, n−j) ways. This shows that

P[ L^c_(j) ≤ L_(i) < L^c_(j+1) ] = C(n−j+m−i, n−j) · C(j+i−1, j) / C(n+m, m).    (7)

By combining with (6), we have found

P[ L_(⌈mβ⌉) > L^c_(k) ] = Σ_{j=k}^{n} C(n−j+m−⌈mβ⌉, n−j) · C(j+⌈mβ⌉−1, j) / C(n+m, m) =: ā(k).

Any k such that ā(k) ≤ α would provide us with a valid lal. However, since the range of ā(k) is [1/C(n+m, m), 1], the case α < 1/C(n+m, m) means no such k exists. Defining C(b, r) = 0 for all b and r < 0 allows us to finally define a(k) by the same expression as ā(k), increasing its domain of definition to {1, . . . , n + 1} and its range to [0, 1]. By selecting k* = min{k ∈ {1, . . . , n + 1} | a(k) ≤ α}, we have proven the theorem for continuous variables.

Next, we prove the result for discrete random variables. Let F be the joint cdf for L^c ∪ L, where each loss takes values in a finite or countable set V = {v_i} of real numbers.
Since we may get ties with non-zero probability, the preceding analysis fails. To circumvent this problem, construct a random vector (λ̃_1, . . . , λ̃_{n+m}) with cdf F̃ such that

F̃(l_1, . . . , l_{n+m}) ≤ F(l_1, . . . , l_{n+m}) always,
F̃(l_1, . . . , l_{n+m}) = F(l_1, . . . , l_{n+m}) if (l_1, . . . , l_{n+m}) ∈ V^{n+m}.

Define further a set of random variables λ_i = min{v ∈ V | v ≥ λ̃_i} for all i. Now (λ_1, . . . , λ_{n+m}) is equal to (L^c_1, . . . , L^c_n, L_1, . . . , L_m) in distribution. Also, (λ_(i) > λ_(j)) ⇒ (λ̃_(i) > λ̃_(j)) for all i, j. Together, this means

P[ L_(⌈βm⌉) > L^c_(k*) ] = P[ λ_(⌈βm⌉+n) > λ_(k*) ] ≤ P[ λ̃_(⌈βm⌉+n) > λ̃_(k*) ] = a(k*),

where we have used the result on continuous random variables on F̃ to compute k*.

Remark 1. The definition of the lal, (2), only demands that the limit level is at least 1 − α. However, the more conservative ℓ^β_α(D) is, the larger is the excess coverage. From the proof, we see that if the joint set of losses (L^c_1, . . . , L^c_n, L_1, . . . , L_m) has no ties, one may compute the exact coverage 1 − a(k*). Thus when the method is conservative, it is transparently so.

Remark 2. The proof technique above is inspired by Fligner & Wolfe (1976), which proves the result for the special case of iid data. It is noteworthy that when generalizing from iid to exchangeable data, we keep the same level of precision.

3.2 LAL for a single out-of-sample data point

When m = 1, the lal takes a very simple closed form. By deriving it from basic principles rather than using Thm. 1, we also obtain a tightness guarantee.

Theorem 2. For a single out-of-sample data point (m = 1), a lal can be constructed as ℓ^1_α(D) = L^c_(k*), where k* = ⌈(n + 1)(1 − α)⌉. For continuous data distributions, the almost sure out-of-sample loss guarantee is

α − 1/(1 + n) ≤ P[ L_1 > ℓ^1_α(D) ] ≤ α.    (8)

For discrete data distributions, only the upper bound in (8) can be guaranteed.

Proof. When (L^c_1, . . .
, L^c_n, L_1) are continuous, the values are almost surely unique, and therefore there are n + 1 equally likely, mutually exclusive positions for L_1 among the order statistics, each with probability 1/(n + 1). Only one such position obeys L^c_(j) ≤ L_1 ≤ L^c_(j+1). Therefore,

P[ L^c_(j) ≤ L_1 ≤ L^c_(j+1) ] = 1/(1 + n)  and  P[ L_1 ≤ L^c_(k*) ] = k*/(n + 1).

Because (1 − α) ≤ ⌈(n + 1)(1 − α)⌉/(n + 1) ≤ (1 − α) + 1/(n + 1), we compute

α − 1/(1 + n) ≤ P[ L_1 > L^c_(k*) ] ≤ α.

When (L^c_1, . . . , L^c_n, L_1) are discrete and ties are possible, we can still prove the upper bound. Consider the rank R of L_1, i.e., put {L^c_1, . . . , L^c_n, L_1} in nondecreasing order, and let R denote the position of L_1. When there are ties, position the tied values in a uniformly random way. By construction, R is uniformly distributed over 1, . . . , n + 1 and

P[ L_1 ≤ L^c_(k*) ] ≥ P[ R ≤ k* ] = k*/(n + 1).

Remark 3. For iid data and m = 1, the lal-curve approaches the complementary cdf of L_1, i.e., 1 − F. This facilitates the interpretation of the lal-curve as a quantile point estimate. To see this, let F̂^{-1}_n denote the empirical quantile function of L_1 based on the losses L^c_1, . . . , L^c_n. Then the lal of Thm. 2 can equivalently be defined as

ℓ^1_α(D) = F̂^{-1}_n( (n+1)/n · (1 − α) )  if (n+1)/n · (1 − α) ∈ (0, 1),  and  ℓ^1_α(D) = L^c_(n+1)  otherwise.    (9)

If L_1 has a bounded and connected range, F̂^{-1}_n converges uniformly to F^{-1} (Bogoya et al., 2016), and ℓ^1_α(D) → F^{-1}(1 − α). Plotting α on the vertical axis against F^{-1}(1 − α) is identical to plotting the complementary cdf 1 − F(ℓ) on the vertical axis against ℓ, so in this case, the lal-curve converges to the graph of the complementary cdf.

3.3 LAL for large out-of-sample data sets

If the number of out-of-sample data points is very large, we may use a limit argument and let m → ∞ in Thm. 1.

Corollary 1. Let BIN^{-1}(·; n, β) denote the quantile function of a binomial distribution with parameters (n, β). For an infinite sequence of exchangeable losses (L^c_1, . . . , L^c_n, L_1, . . . ), let ℓ^β_α(D) = L^c_(k*) with k* = 1 + BIN^{-1}(1 − α; n, β).
This lal satisfies a limit form of (3):

lim_{m→∞} P[ L_(⌈mβ⌉) > ℓ^β_α(D) ] ≤ α.    (10)

Proof. The result follows by expanding the binomial coefficients with Stirling's formula and taking the limit m → ∞. See App. A for details.

Starting from Cor. 1, we can find a different interpretation of the lal when m → ∞.

Remark 4. Consider the case of iid data, where L^c_i and L_j have a common cdf F for all i, j. Let F^{-1} be the quantile function, and F̂^{-1}_m be the empirical quantile function based on {L_1, . . . , L_m}. As m → ∞, we have that L_(⌈mβ⌉) = F̂^{-1}_m(β) → F^{-1}(β). Therefore, (10) simplifies to P[ F^{-1}(β) > ℓ^β_α(D) ] ≤ α, which gives another interpretation of the lal as the boundary of a confidence interval C^β_α(L^c) := (−∞, ℓ^β_α(D)]. This interval satisfies P[ F^{-1}(β) ∉ C^β_α(L^c) ] ≤ α and is thus a valid confidence interval for the β-quantile of L_i for all i. This intuition is also useful in the non-iid case as m → ∞. De Finetti's theorem (Kingman, 1978) states that if (L^c_1, . . . , L^c_n, L_1, L_2, . . . ) forms an infinite sequence of exchangeable random variables, there is an auxiliary random variable ζ such that all the conditional variables (L^c_i | ζ) and (L_j | ζ) are iid with cdf F_ζ. Therefore, the interpretation of the lal as the boundary of a confidence interval is useful in the exchangeable-but-not-iid data setting, even if its interpretation must be handled with more caution.

4 Experiments

This section presents examples of how the lal can be applied to analyzing the out-of-sample loss of a model or family of models. Code to reproduce all experiments can be found at https://github.com/el-hult/lal. The experiments are intended to be illustrative rather than exhaustive. They are limited in two respects. Firstly, to reduce runtime and simplify reproducibility of the experiments, we consider models f that have relatively few parameters and are trained using small data sets.
Since the theoretical guarantees hold for any fitted model, they also hold for large models and models trained on large data sets. Secondly, the experiments use small to moderate calibration set sizes n. As suggested in Remark 3, increasing n improves the inference of the loss distribution, but there is a practical limitation to be considered: when n is large, it is advisable to transfer some data to the training set to improve the model performance, rather than improving our inferences about the model performance.

4.1 Study of asymptotics

This experiment illustrates the limit for different m corresponding to Thm. 1, Cor. 1 and Thm. 2. The data set consists of California housing prices from the 1990 census (Kelley Pace & Barry, 1997), covering 20 640 housing blocks. Each data point z = (x, y) represents a single city block. The label y ∈ R is the median house value in the block, and the feature vector x ∈ R^8 consists of continuous variables such as block coordinates, median house age and tenant median income. The training data set D0 consists of n0 = 15 000 points sampled without replacement. The calibration data set D has n = 150 points, sampled without replacement from the remaining data. The model is a regression for the logarithm of the median house value, operating on standardized features and labels, and uses a random Fourier feature basis (Rahimi & Recht, 2008). Let K random Fourier functions with bandwidth b be stacked in a vector ϕ. The model is f(x) = exp[ϕ(x)^T θ̂], where θ̂ is found by L2-regularized least-squares regression. The hyperparameters (number of basis functions K, bandwidth b and regularization strength λ) are tuned by five-fold cross-validation on the training data. The loss function is the absolute error expressed in dollars, ℓ(x, y) = |y − f(x)|.

Figure 2: lal-curves. From left to right: m = 1 computed with Thm. 2, m = 30 computed with Thm. 1, and m = ∞ computed with Cor. 1.
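The indices k* behind such curves involve only binomial coefficients, so they are cheap to compute. The following is our own minimal Python sketch of Thm. 1 and Cor. 1 (not the authors' reference implementation at the repository above):

```python
import math

def a(k, n, m, beta):
    """Tail probability a(k) from Theorem 1. math.comb already returns 0
    when the lower index exceeds the upper one, matching the paper's convention."""
    r = math.ceil(m * beta)
    s = sum(math.comb(n - j + m - r, n - j) * math.comb(j + r - 1, j)
            for j in range(k, n + 1))
    return s / math.comb(n + m, m)

def k_star(n, m, beta, alpha):
    """k* = min{k in {1, ..., n+1} : a(k) <= alpha}; a(n+1) = 0 (empty sum),
    so the minimum always exists. The lal is the k*-th smallest calibration loss."""
    return min(k for k in range(1, n + 2) if a(k, n, m, beta) <= alpha)

def k_star_minf(n, beta, alpha):
    """Cor. 1: k* = 1 + BIN^{-1}(1 - alpha; n, beta), the m -> infinity limit,
    with the binomial quantile computed by accumulating the pmf."""
    cdf = 0.0
    for q in range(n + 1):
        cdf += math.comb(n, q) * beta**q * (1 - beta)**(n - q)
        if cdf >= 1 - alpha:
            return q + 1
    return n + 1

# Consistency with Thm. 2: for m = 1, beta = 1, k* reduces to ceil((n+1)(1-alpha)).
n, alpha = 150, 0.10
assert k_star(n, 1, 1.0, alpha) == math.ceil((n + 1) * (1 - alpha))
```

Given sorted calibration losses, the lal of Thm. 1 is then simply the k*-th order statistic (or the supremum of the support when k* = n + 1).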
The curves assure that a fraction β of out-of-sample losses will not exceed ℓ^β_α(D), with probability at least 1 − α. For example, we can see that among the m = 30 next samples, at least 80% of them will have prediction losses less than $70 000, with a confidence of 90%.

We now turn to limiting m out-of-sample losses, where m ∈ {1, 30, ∞}. lal-curves were drawn with varying β-fractions, shown in Fig. 1 and Fig. 2. The empirical risk, defined as the average loss on the calibration data, is also calculated, and the calibration losses are presented as a histogram. Averaging the losses over the calibration data gives an unbiased point estimate of the expected out-of-sample loss for f but it, by itself, lacks statistical guarantees. While it estimates the mean out-of-sample prediction error to be around $35 000, the lal-curve for a single new prediction error (m = 1) in Fig. 2 shows that it may be close to $80 000 (for α = 10%). If we now consider a batch of m = 30 predictions, the lal-curve for m = 30 in Fig. 2 informs us that β = 80% of them will have errors smaller than $70 000 (for α = 10%). The number of data blocks not otherwise used is 5 490, so we may also consider the limiting case m → ∞. The lal-curve now tells us that β = 80% of prediction errors will be smaller than $60 000 (for α = 10%).

By comparing the lal-curves in Fig. 2, we learn how the out-of-sample batch size m affects the tail behavior of the lal at a fixed β. The lal-curve is slowly decaying for m = 1. For m = 30, the decay is faster. For m = ∞, the decay is even more abrupt. Some intuition can be gained for iid data. As m increases, the variance of L_(⌈mβ⌉) decreases (it is asymptotically normal and √m-consistent, see e.g. (Vaart, 1998, Cor. 21.5)), and the bound in (3) can be made tighter.

4.2 Distribution shift analysis

This experiment illustrates the analysis of distribution shifts (Quiñonero-Candela et al., 2009), also known as concept drift, using the lal-curve. The experiment also verifies Thm.
2 numerically. The availability of calibration data D invites the notion of refitting or fine-tuning to adapt to a potential distribution shift (Lee et al., 2023). This is not always possible, e.g. when calibration data is limited (too small n). It may still be desirable to evaluate the model performance on out-of-distribution data. Since the lal is valid for any n, it will always be an available diagnostic tool to analyze the extent to which distribution shift affects the model performance.

Lu et al. (2018) propose a taxonomy for distribution shift detection methods. In this taxonomy, the lal would qualify as an error-rate-based method, as it is based on a loss function. Example methods in this group are the Drift Detection Method (DDM) (Gama et al., 2004) and Black Box Shift Estimation (BBSE) (Lipton et al., 2018), which detect distribution shift by monitoring when classifier performance changes with statistical significance. In contrast, the lal focuses on the distribution of out-of-sample losses, enabling analysis of how severe a distribution shift is (the how aspect in the Lu et al. (2018) taxonomy). Other methods for how focus on statistical distances between feature distributions (Rabanser et al., 2019). While such methods do not need labels, they do not measure shifts in the way that matters for the model, i.e., the loss.

Figure 3: Quadratic regression model f(x) evaluated using the absolute prediction error loss function ℓ(x, y) = |y − f(x)|. (a) A fitted model and two different calibration data sets, D1 drawn iid from p1 (identical to the training data distribution) and D2 drawn iid from p2 (shifted distribution). (b) lal-curves under distributions p1 and p2. A single out-of-sample loss exceeds the limit ℓ^β_α(D) with probability at most α, as given by the curve (see also Thm. 2).
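A lal-based shift analysis of this kind needs only the calibration losses. A minimal sketch of the m = 1 limit of Thm. 2, applied to two hypothetical calibration loss sets (toy numbers, not the paper's data):

```python
import math

def lal_m1(cal_losses, alpha):
    """Thm. 2 lal for a single out-of-sample point: the k*-th smallest
    calibration loss with k* = ceil((n+1)(1-alpha)); support supremum
    (here +inf) when k* = n + 1."""
    n = len(cal_losses)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_losses)[k - 1] if k <= n else float("inf")

# two hypothetical calibration sets of n = 30 losses each;
# the "shifted" losses are uniformly doubled, which doubles the lal
in_dist = [0.1 * i for i in range(1, 31)]
shifted = [0.2 * i for i in range(1, 31)]
print(lal_m1(in_dist, 0.2), lal_m1(shifted, 0.2))
```

Comparing the resulting lal-curves over a grid of α values reproduces the kind of comparison shown in Fig. 3b.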
Let Z_i = (X_i, Y_i), with real-valued X_i and Y_i, and generate data according to

X_i ∼ N(µ, σ^2),    (11)
Y_i | X_i = x ∼ N(x(x − 1)(x + 1), 1).    (12)

The training data set D0 was created with n0 = 100, µ = 1, σ = 0.5. A quadratic regression model was used, since it approximates the conditional expectation function well over the training data; f(x) = θ̂_0 + θ̂_1 x + θ̂_2 x^2 was fitted to D0 via least squares. See Fig. 3a. We analyze the case of out-of-sample batch size m = 1. The performance of the model is evaluated using the absolute error loss

ℓ(x, y) = |y − f(x)|.    (13)

The out-of-sample losses are quantified for two different data distributions. In the first case, µ = 1, σ = 0.5, so there is no distribution shift, and we use a calibration data set D1 with n = 30. In the second case, µ = 0.75, σ = 0.75, resulting in a shift in X, and we use a calibration data set D2 also having n = 30. The lal-curves in both cases are presented in Fig. 3b, which reveals significantly larger out-of-sample losses for the distribution p2 from which D2 was drawn. The 5%-tail losses are nearly twice as large for the shifted distribution.

Under the same experimental setup, we verified the tightness result of Thm. 2. Using 2 000 Monte Carlo runs, the empirical coverage was computed for different α and plotted in Fig. 4a. Moreover, Rem. 3 states the convergence of the lal to the quantile function F^{-1}(1 − α) as n increases. This is illustrated in Fig. 4b. The quantile function was numerically approximated using 10^5 samples. The expected lal was computed using 2 000 Monte Carlo runs with data sampled from p2. We see that the lal approaches the quantile function as n increases.

4.3 Classification error analysis

This experiment shows that the proposed methodology can be applied to classification models as well. We will also consider how adversarial distribution shifts manifest in the lal-curve. We use the Palmer Penguin data set, popularized by Horst et al. (2020).
The 333 complete-record data points are pairs z_i = (x_i, y_i) of features x_i ∈ R^7 and labels y_i ∈ {1, 2, 3}. The labels y_i are categorical, encoding the penguin species. The features x_i are vectors of (Island, Bill Length, Bill Depth, Flipper Length, Body Mass, Sex, Year), a mixture of categorical and integer-valued features.

Figure 4: The finite-sample guarantee of Thm. 2 is verified in (a), which illustrates both the upper bound (solid line) and the lower bound (dashed line). (b) Illustration of the connection between the lal and the quantile function (Rem. 3). The expected value of the ratio ℓ^β_α(D)/F^{-1}(1 − α) is evaluated for n = 30, 300, 3000, with data drawn from the shifted distribution p2. As n increases, the expected ratio approaches 1 from above.

We use a training data set D0 of n0 = 150 points, leaving 183 samples for calibration. Using D0, we fit f(x) via multinomial logistic regression with L2-regularization and cross-validation. The model output is a three-dimensional vector f(x) = [f_1(x), f_2(x), f_3(x)]^T approximating the conditional probabilities, so that f_i(x) approximates P[Y = i|X = x]. The model is to be evaluated on calibration data using the misclassification probability loss

ℓ(x, y) = 1 − f_y(x).    (14)

The out-of-sample batch size was set to m = 1. Two different calibration data sets D1 and D2 of sample size n = 50 were constructed, by sampling from the 183 held-out data points without replacement. For D1, the initial probability of selecting a sample was uniform over the data. For D2, the initial probability of selecting a sample was proportional to ℓ(x, y), effectively an adversarial reweighting of the samples. The lal-curves of Thm. 2 (m = 1, β arbitrary) are presented in Fig.
5, showing that lal-curves can be applied to classification models. The fact that one curve comes from an adversarial calibration sample is manifest in a larger lal for every confidence level α, compared to the non-adversarial calibration sample.

Figure 5: lal-curves (m = 1) for the classification error analysis. The loss function indicates the certainty of misclassification: a loss of 80% means that the model assigned 80% probability to the wrong labels. Data set D1 is exchangeable with the training data. Data set D2 is adversarially sampled, presenting a notably larger lal.

4.4 Regression error analysis

This experiment shows how alternative loss functions can be used to analyze the asymmetry of errors in regression problems. We use the UCI Airfoil data set (Dua & Graff, 2017). The task is to predict a label y ∈ R representing the sound level measured in dB. The features are x ∈ R^5, and each data point is a feature-label pair z_i = (x_i, y_i). The calibration data D is constructed by weighted sampling of n = 100 samples. The probability of drawing a data point (x_i, y_i) is proportional to exp([1, 0, 0, 0, −1] x_i), making data points with high frequency and small displacement more likely to be sampled, similar to the distribution shift experiments in Tibshirani et al. (2020, sec. 2.2). The remaining 1 403 data points constitute the training data set D0. The model f(x) uses a spline basis and is fit using least squares with L2-regularization and cross-validation.

Figure 6: lal-curves (m = 1) for the regression error analysis. Using different loss functions, all in units of dB, gives deeper insight into the model fit. Looking at α around 2-8%, using losses for under- and overshoot, we see that the model is more likely to overshoot out-of-sample data than it is to undershoot.

To study the asymmetry of prediction errors, we compute the lal-curves for a subsequent experiment
(m = 1) using three different loss functions:

overshoot loss ℓ(x, y) = max(0, f(x) − y),    (15)
undershoot loss ℓ(x, y) = max(0, y − f(x)),    (16)
absolute loss ℓ(x, y) = |y − f(x)|,    (17)

which are all in units of dB to enable a physical interpretation. The results are shown in Fig. 6, where we observe that f produces more severe overshoots than undershoots. The chosen loss functions (15)-(17) differ from the loss function used for model fitting (squared error), illustrating the freedom that an analyst has to characterize the performance of f.

4.5 Model comparison

This experiment is concerned with model selection. The data used is the monthly number of earthquakes worldwide with magnitude ≥ 5 between 2012 and 2022 (USGS, 2022). There are 120 data points z_1, . . . , z_120, with z_i ∈ {0, 1, 2, . . . }. These are randomly split into 100 and 20 data points, forming D0 and D. We learn two models of the number of earthquakes per month z, P(Z = z), using the maximum likelihood method: a Poisson model f_Poisson(z) and a Negative Binomial model f_NegBin(z). The models are evaluated using the negative log-likelihood loss

ℓ(z) = − log f(z)    (18)

and Fig. 7 presents their respective lal-curves for a subsequent earthquake count, i.e., m = 1. From this result we can conclude that the simpler one-parameter Poisson model produces much larger out-of-sample losses than the two-parameter Negative Binomial model. We compare the lal analysis with a common model selection metric, the Akaike Information Criterion (AIC) (Ding et al., 2018). It evaluates the models as AIC_Poisson = 1 926.34 and AIC_NegBin = 1 015.63. One may also compare models by the average loss on the calibration data, yielding 5.69/4.78 nats for the Poisson/NegBin model. Both these model selection metrics favor the NegBin model, in agreement with the lal-curve. Whereas the model selection metrics report a single number, the lal-curve presents a more nuanced picture, showing the reduction in tail losses.
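For the Poisson model, the per-month loss (18) has a simple closed form; a small sketch of how such calibration losses could be computed (the fitted rate below is hypothetical, chosen only for illustration):

```python
import math

def poisson_nll(z, lam):
    """Negative log-likelihood loss -log f(z) for a Poisson model with
    rate lam: lam - z*log(lam) + log(z!), using lgamma for log(z!)."""
    return lam - z * math.log(lam) + math.lgamma(z + 1)

# hypothetical fitted rate (the MLE is the mean of the training counts)
# and a few example monthly counts; the loss is smallest near z = lam
losses = [poisson_nll(z, lam=150.0) for z in (120, 150, 200)]
```

Sorting such calibration losses and picking the k*-th smallest, as in Thm. 2, yields the Poisson lal-curve; repeating with the Negative Binomial likelihood allows the comparison shown in Fig. 7.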
Figure 7: Comparison of models for earthquake statistics, using lal-curves and m = 1, β = 1. The loss function used is the negative log-likelihood of the data under the fitted model. By considering how it handles the α = 20% most difficult-to-fit data points, we see that the Negative Binomial model provides a better fit than the Poisson model.

Figure 8: Relation between the regularization parameter λ in the neural network model f_λ(x) and the corresponding lal for m → ∞, varying β, and (1 − α) = 99% confidence. Appropriate regularization improves the performance in the tail of the loss distribution (β = 99%), with an optimum around λ ≈ 10^{-3}. The bulk of the losses (β = 95%) are not reduced by regularization; their lal increases with λ.

4.6 Hyper-parameter tuning

This experiment shows how the lal-curve can be used for tuning a hyperparameter when training a neural network. The data set is the MNIST handwritten digits (Bottou et al., 1994). The training data set D0 has n0 = 6 · 10^4 data points, and the calibration data set D has n = 10^4 data points. The features x_i are images of handwritten digits, and the labels y_i are the integers 0-9. Data points are tuples z_i = (x_i, y_i). We construct a family of models f_λ(x) = NN(x; θ_λ). The function NN(x; θ_λ) consists of a dense feedforward neural network with 3 hidden layers of 250 units each, parameters θ_λ, ReLU activations and a softmax output. The parameter θ_λ is learned by minimizing the cross-entropy, using the Adam optimizer with an L2-regularization parameter λ (also known as weight decay in the deep learning literature). The optimization was run for 100 epochs, employing a batch size of 1024 and a learning rate of 0.01. The output of the network f_λ(x) is a 10-dimensional vector, where the ith component approximates P[y = i|x].
We evaluate the model using the negative log-likelihood loss ℓ(x, y) = −log f^λ_y(x) = −log [NN(x; θ_λ)]_y. Since deep learning methods may be deployed without retraining over a large number of predictions, we consider the case m → ∞ (Cor. 1) and study the lal for a β-fraction of out-of-sample losses. The results in Fig. 8 show that small regularization initially reduces losses for outliers without increasing loss on nominal samples.

5 Discussion

This section elaborates on connections to related fields.

5.1 Model evaluation and selection techniques

Cross-validatory out-of-sample expected loss estimation is arguably the most common approach to model evaluation in machine learning (Arlot & Celisse, 2010; Stone, 1974; Bishop, 2006; Hastie et al., 2009). The idea is that model performance is measured by the expected loss on out-of-sample data (called the risk), and cross-validation estimates this quantity. In its most common form, a fraction of the training data is held out from model fitting. The average loss of the model is computed on the held-out data, forming an estimate of the risk. Repeated splitting and refitting of the model (k-fold cross-validation) can be used to estimate the bias and variance of such estimates. Model evaluation with lal shifts the focus to probabilistic bounds on out-of-sample losses. This is significant when the out-of-sample loss distribution is multimodal, skewed, or otherwise not well described by its mean and variance alone. Evaluating models with respect to the risk may be done without cross-validation. Statistical learning theories, such as VC-theory (Vapnik, 1991), provide asymptotic bounds on the risk under certain assumptions on the models and the data. Similarly, M-estimation (Vaart, 1998) provides another asymptotically valid method of fitting parametric models, quantifying the convergence of the average loss on training data to the risk.
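The k-fold cross-validatory risk estimate described above can be sketched as follows, here for a least-squares line under the absolute loss on synthetic data; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + rng.normal(scale=0.3, size=200)

def kfold_risk(x, y, k=5):
    """k-fold CV estimate of the expected absolute loss of a least-squares line."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    risks = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        a, b = np.polyfit(x[train], y[train], 1)              # fit on k-1 folds
        risks.append(np.mean(np.abs(y[f] - (a * x[f] + b))))  # held-out average loss
    return np.mean(risks), np.std(risks)

mean_risk, sd_risk = kfold_risk(x, y)
print(mean_risk, sd_risk)
```

The fold-to-fold spread gives a rough sense of the variability of the risk estimate, but says nothing about the tail of the per-point loss distribution, which is what the lal targets.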
If the losses are bounded and the samples are iid or constructed by sampling without replacement, non-parametric results for inferring the risk appear to be promising (Waudby-Smith & Ramdas, 2020), improving on both the Hoeffding inequality and empirical Bernstein bounds (Audibert et al., 2009; Maurer & Pontil, 2009). For unbounded losses and non-iid exchangeable data, the inferential problem of estimating the risk remains open. lal-curves provide a nonparametric and nonasymptotic way to evaluate model performance by not using the risk as the quantity of comparison. In this work, we have computed ℓ^β_α(D) at given confidence levels α. Conversely, one can interpret ℓ^β_α(D) as the boundary value for rejection of the null hypothesis that model predictions are exchangeable, at level α. Others have used hypothesis testing for model evaluation. Posterior predictive checks (Gelman et al., 2013; Rubin, 1984) and the data consistency criterion (Lindholm et al., 2019) rely on exchangeability between observed data and data generated by the model to this end. Those methods are used to test whether a model is compatible with data, and report a p-value for the test. A lal-based analysis acknowledges that models are always misspecified and instead quantifies how well a model performs. An analyst may wish to not only diagnose and compare models, but also decide which model in a set of candidates to use. This is called model selection (Ding et al., 2018). Two principal goals motivate model selection algorithms: 1) model selection for inference, or 2) model selection for prediction. The first goal has been used to motivate well-known information criteria such as the AIC (Stoica & Selen, 2004). Information criteria have also been combined with hypothesis testing to assess whether a certain model or model class can be said to be significantly closer to the true one than other candidates (Vuong, 1989). Regarding the second goal, Ding et al.
(2018) posit that the best model is the one with the smallest expected out-of-sample loss, which may be estimated by cross-validation. By fixing α at any desired level, the lal can be used to rank models in a fashion similar to cross-validatory risk estimation. However, by employing the full lal-curve as indicated in Sec. 4.5, a better-informed decision can be taken by balancing typical versus outlier performance. The Value-at-Risk (VaR) is a risk measure used in the finance industry. For a random variable L describing financial loss, the value-at-risk at level 1 − α can be defined as (Jorion, 2007, Eq. 5.1)

VaR_{1−α}(L) = inf{x : P[L > x] ≤ α}.

Note that x is not a random variable in the above expression. Comparing with (3) and fixing m = 1, we find that the VaR is the smallest lal, conditional on the calibration data D. Any VaR constitutes a valid lal for m = 1, but the converse is not true. Many VaR estimation methods rely on parametric assumptions or are only asymptotically valid (Christoffersen, 2009; Alemany et al., 2013). The relaxed definition of the lal in contrast to the VaR enables the distribution-free methodology presented in this article. For its use in e.g. financial risk measurement, the VaR has been further developed into several variants such as Conditional VaR, Tilted VaR, Entropic VaR, and more (Rockafellar & Uryasev, 2002; Li et al., 2021). This line of research may be beneficial for the lal as a diagnostic tool, and constitutes a direction for further research, but some of the requirements on financial risk metrics may not necessarily carry over to all model performance metrics.

5.2 Non-parametric statistics and conformal prediction

For ease of exposition, the lal has primarily been discussed as a point value ℓ^β_α(D). One can also consider it as the boundary point of the interval (−∞, ℓ^β_α(D)]. As such, it can be understood as a variant of a non-parametric tolerance interval (Thm.
1), a prediction interval (Thm. 2) or a confidence interval (Rem. 4). See for instance Vardeman (1992) for a comparison between the different statistical intervals. Fligner & Wolfe (1976) show how to construct non-asymptotic, non-parametric prediction intervals for quantiles of future data, and their work is a methodological precursor for the results in this paper. Considering the lal a confidence interval of a quantile for iid data, there are other results with exact coverage (Zieliński & Zieliński, 2005), whereas the formula presented in this article is sometimes conservative. One could use that method with the lal to get an exact guarantee. Such intervals are constructed via extra randomization and become harder to interpret. The lal-curves in particular have no exact counterpart. We have therefore chosen to avoid this construction. The theory of nonparametric prediction intervals also forms a foundation for conformal prediction. This field focuses on producing prediction sets for the output of any predictive model f, see e.g. Vovk et al. (2005) or Angelopoulos & Bates (2021) for introductions to the field. We will clarify the connection between conformal prediction and the lal in the case of split-conformal inference. The general case is essentially identical. Consider a set of exchangeable random vectors {W_i}_{i=1}^{M+1} taking values in W. We wish to produce a prediction set C_α({W_i}_{i=1}^M) so that

P[W_{M+1} ∈ C_α({W_i}_{i=1}^M)] ≥ 1 − α.

To this end, define a real-valued nonconformity score A : W_i ↦ A(W_i) = A_i, with the semantics that a large value means W_i does not conform to the general data set. Since the score is real-valued, one can employ similar principles as in Thm. 2 to define a prediction interval A_α such that

P[A_{M+1} ∈ A_α({A_i}_{i=1}^M)] ≥ 1 − α.

By letting C_α be the inverse image of A_α({A_i}) under A, i.e., C_α({W_i}) = {w | A(w) ∈ A_α({A_i}_{i=1}^M)}, we ensure the desired coverage.
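The split-conformal construction above can be sketched numerically. The snippet uses the standard ⌈(M+1)(1−α)⌉-th order statistic of the calibration scores as the quantile; the scores and the regression model's prediction are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

M, alpha = 100, 0.1
A = rng.exponential(size=M)   # nonconformity scores A_1..A_M on calibration data

# Smallest score q with at least ceil((M+1)(1-alpha)) calibration scores <= q;
# under exchangeability, P[A_{M+1} <= q] >= 1 - alpha.
k = int(np.ceil((M + 1) * (1 - alpha)))   # 1-based order-statistic index
q = np.sort(A)[k - 1] if k <= M else np.inf

# With the score A(w) = |y - f(x)|, the inverse image is the interval
# [f(x) - q, f(x) + q] around a (hypothetical) model prediction f(x).
fx = 3.0
interval = (fx - q, fx + q)
print(q, interval)
```

Note that with M = 100 and α = 0.1 the index is k = 91, so roughly the 10% largest calibration scores fall outside the prediction set, matching the 1 − α target.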
Conformal prediction methodology is thus largely centered on finding suitable nonconformity scores A that are computationally tractable and handle various inference targets and data distributions (Angelopoulos & Bates, 2021). More recently (Lei et al., 2018, Thm. 2.1), an upper bound on the coverage rate was derived,

1 − α ≤ P[W_{M+1} ∈ C_α({W_i})] ≤ 1 − α + 1/(M + 1),

that holds whenever the non-conformity scores are almost surely unique. The line of reasoning is similar but not identical to Thm. 2.

6 Conclusion

We have proposed the level-α loss (lal) curve as a diagnostic tool for out-of-sample analysis of a model f. The method requires specifying a loss function of interest and access to a calibration data set. In return it provides finite-sample guarantees about the probability of a batch of out-of-sample losses exceeding a certain threshold. The lal is simple to compute and easy to interpret. A series of numerical experiments have been presented to show its usefulness in regression error analysis, distribution shift analysis, model selection and hyper-parameter tuning. We anticipate that there are many other areas of application for this methodology.

References

Ramon Alemany, Catalina Bolancé, and Montserrat Guillén. A nonparametric approach to calculating value-at-risk. Insurance: Mathematics and Economics, 52(2):255–262, 2013. ISSN 0167-6687. doi: https://doi.org/10.1016/j.insmatheco.2012.12.008. URL https://www.sciencedirect.com/science/article/pii/S0167668713000048.

Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification, 2021. URL https://arxiv.org/abs/2107.07511.

Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4, Jan 2010. ISSN 1935-7516. doi: 10.1214/09-SS054.
URL https://projecteuclid.org/journals/statistics-surveys/volume-4/issue-none/A-survey-of-cross-validation-procedures-for-model-selection/10.1214/09-SS054.full.

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, April 2009. ISSN 03043975. doi: 10.1016/j.tcs.2009.01.016. URL https://linkinghub.elsevier.com/retrieve/pii/S030439750900067X.

David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley Series in Probability and Statistics. Wiley, 1 edition, Jun 1980. ISBN 978-0-471-05856-4. doi: 10.1002/0471725153. URL https://onlinelibrary.wiley.com/doi/book/10.1002/0471725153.

Christopher M. Bishop. Pattern recognition and machine learning. Information science and statistics. Springer, New York, 2006. ISBN 978-0-387-31073-2.

J. M. Bogoya, A. Böttcher, and E. A. Maximenko. From convergence in distribution to uniform convergence. Boletín de la Sociedad Matemática Mexicana, 22(2):695–710, October 2016. ISSN 1405-213X, 2296-4495. doi: 10.1007/s40590-016-0105-y. URL http://link.springer.com/10.1007/s40590-016-0105-y.

Léon Bottou, Corinna Cortes, John S. Denker, Harris Drucker, Isabelle Guyon, Lawrence D. Jackel, Yann LeCun, Urs A. Muller, Eduard Säckinger, Patrice Simard, and Vladimir Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Conference B: Computer Vision & Image Processing, volume 2, pp. 77–82, Jerusalem, October 1994. IEEE. URL http://leon.bottou.org/papers/bottou-cortes-94.

Stephen Casper, Kaivalya Hariharan, and Dylan Hadfield-Menell. Diagnostics for deep neural networks with automated copy/paste attacks. In NeurIPS ML Safety Workshop, 2022. URL https://openreview.net/forum?id=l-kqvueSRp7.

Peter Christoffersen.
Value at risk models. In Handbook of financial time series, pp. 753–766. Springer, 2009.

Jie Ding, Vahid Tarokh, and Yuhong Yang. Model selection techniques: An overview. IEEE Signal Processing Magazine, 35(6):16–34, 2018. doi: 10.1109/MSP.2018.2867638. URL https://doi.org/10.1109/MSP.2018.2867638.

Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.

Garrett M. Fitzmaurice, Nan M. Laird, and James H. Ware. Applied longitudinal analysis. Wiley series in probability and statistics. Wiley, Hoboken, N.J, 2nd edition, 2011. ISBN 978-0-470-38027-7.

Michael A. Fligner and Douglas A. Wolfe. Some Applications of Sample Analogues to the Probability Integral Transformation and a Coverage Property. The American Statistician, 30(2):78, May 1976. ISSN 00031305. doi: 10.2307/2683799. URL https://www.jstor.org/stable/2683799?origin=crossref.

João Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. Learning with drift detection. In Ana L. C. Bazzan and Sofiane Labidi (eds.), Advances in Artificial Intelligence – SBIA 2004, volume 3171 of Lecture Notes in Computer Science, pp. 286–295. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. ISBN 978-3-540-23237-7. doi: 10.1007/978-3-540-28645-5_29. URL http://link.springer.com/10.1007/978-3-540-28645-5_29.

Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis, Third Edition. CRC Press, Hoboken, 2013. ISBN 978-1-4398-9820-8.

Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer series in statistics. Springer, New York, NY, 2nd edition, 2009. ISBN 978-0-387-84857-0.

Allison Marie Horst, Alison Presmanes Hill, and Kristen B Gorman. palmerpenguins: Palmer Archipelago (Antarctica) penguin data, 2020. URL https://allisonhorst.github.io/palmerpenguins/.

Philippe Jorion.
Value at risk: the new benchmark for managing financial risk. McGraw-Hill, New York, 3rd edition, 2007. ISBN 978-0-07-173692-3.

Steven M. Kay. Fundamentals of statistical signal processing. Prentice Hall signal processing series. Prentice Hall PTR, Englewood Cliffs, N.J, 1993. ISBN 978-0-13-345711-7, 978-0-13-504135-2, 978-0-13-280803-3.

R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, May 1997. ISSN 01677152. doi: 10.1016/S0167-7152(96)00140-X. URL https://linkinghub.elsevier.com/retrieve/pii/S016771529600140X.

J. F. C. Kingman. Uses of exchangeability. The Annals of Probability, 6(2), Apr 1978. ISSN 0091-1798. doi: 10.1214/aop/1176995566. URL https://projecteuclid.org/journals/annals-of-probability/volume-6/issue-2/Uses-of-Exchangeability/10.1214/aop/1176995566.full.

Yoonho Lee, Annie S Chen, Fahim Tajwar, Ananya Kumar, Huaxiu Yao, Percy Liang, and Chelsea Finn. Surgical fine-tuning improves adaptation to distribution shifts. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=APuPRxjHvZ.

Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-Free Predictive Inference for Regression. Journal of the American Statistical Association, 113(523):1094–1111, July 2018. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.2017.1307116. URL https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1307116.

Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization, Mar 2021. URL http://arxiv.org/abs/2007.01162. arXiv:2007.01162 [cs, math, stat].

Andreas Lindholm, Dave Zachariah, Petre Stoica, and Thomas B. Schön. Data Consistency Approach to Model Validation. IEEE Access, 7:59788–59796, 2019. ISSN 2169-3536. doi: 10.1109/ACCESS.2019.2915109. URL https://ieeexplore.ieee.org/document/8708204/.

Zachary Lipton, Yu-Xiang Wang, and Alexander Smola.
Detecting and correcting for label shift with black box predictors. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3122–3130. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/lipton18a.html.

Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. Learning under Concept Drift: A Review. IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2018. ISSN 1041-4347, 1558-2191, 2326-3865. doi: 10.1109/TKDE.2018.2876857. URL http://arxiv.org/abs/2004.05785. arXiv:2004.05785.

Ning Lu, Guangquan Zhang, and Jie Lu. Concept drift detection via competence models. Artificial Intelligence, 209:11–28, April 2014. ISSN 00043702. doi: 10.1016/j.artint.2014.01.001. URL https://linkinghub.elsevier.com/retrieve/pii/S0004370214000034.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein Bounds and Sample-Variance Penalization. In COLT 2009 Proceedings, Montreal, Quebec, Canada, 2009. URL https://www.cs.mcgill.ca/~colt2009/papers/012.pdf.

Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence (eds.). Dataset shift in machine learning. Neural information processing series. MIT Press, Cambridge, Mass, 2009. ISBN 978-0-262-17005-5.

Stephan Rabanser, Stephan Günnemann, and Zachary Lipton. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/846c260d715e5b854ffad5f70a516c88-Paper.pdf.

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In D. Koller, D. Schuurmans, Y. Bengio, and L.
Bottou (eds.), Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008. URL https://proceedings.neurips.cc/paper/2008/file/0efe32849d230d7f53049ddc4a4b0c60-Paper.pdf.

Herbert Robbins. A remark on Stirling's formula. The American Mathematical Monthly, 62(1):26–29, 1955. ISSN 00029890, 19300972. URL http://www.jstor.org/stable/2308012.

R. Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471, 2002. ISSN 0378-4266. doi: https://doi.org/10.1016/S0378-4266(02)00271-6. URL https://www.sciencedirect.com/science/article/pii/S0378426602002716.

Donald B. Rubin. Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician. The Annals of Statistics, 12(4):1151–1172, December 1984. ISSN 0090-5364. doi: 10.1214/aos/1176346785. URL http://projecteuclid.org/euclid.aos/1176346785.

David Ruppert, M. P. Wand, and R. J. Carroll. Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2003. ISBN 978-0-511-75545-3.

P. Stoica and Y. Selen. Model-order selection: a review of information criterion rules. IEEE Signal Processing Magazine, 21(4):36–47, Jul 2004. ISSN 1558-0792. doi: 10.1109/MSP.2004.1311138.

M. Stone. Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, January 1974. ISSN 00359246. doi: 10.1111/j.2517-6161.1974.tb00994.x. URL https://onlinelibrary.wiley.com/doi/10.1111/j.2517-6161.1974.tb00994.x.

Torsten Söderström and Petre Stoica. System identification. Prentice Hall international series in systems and control engineering. Prentice-Hall, New York, NY, reprint edition, 2001. ISBN 978-0-13-881236-2.

Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas. Conformal Prediction Under Covariate Shift. arXiv:1904.06019 [stat], July 2020.
URL http://arxiv.org/abs/1904.06019.

USGS. Earthquake Catalog, 2022. URL https://earthquake.usgs.gov/fdsnws/event/1/query.csv?starttime=2012-01-01%2000:00:00&endtime=2022-01-01%2000:00:00&minmagnitude=5&orderby=time.

A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1 edition, October 1998. doi: 10.1017/CBO9780511802256. URL https://www.cambridge.org/core/product/identifier/9780511802256/type/book.

V. Vapnik. Principles of Risk Minimization for Learning Theory. In J. Moody, S. Hanson, and R. P. Lippmann (eds.), Advances in Neural Information Processing Systems, volume 4. Morgan-Kaufmann, 1991. URL https://proceedings.neurips.cc/paper/1991/file/ff4d5fbbafdf976cfdc032e3bde78de5-Paper.pdf.

Stephen B. Vardeman. What about the other intervals? The American Statistician, 46(3):193, Aug 1992. ISSN 00031305. doi: 10.2307/2685212. URL https://www.jstor.org/stable/2685212?origin=crossref.

Vladimir Vovk, A. Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer, New York, 2005. URL https://doi.org/10.1007/b106715.

Quang H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57(2):307, Mar 1989. ISSN 00129682. doi: 10.2307/1912557. URL https://www.jstor.org/stable/1912557?origin=crossref.

Ian Waudby-Smith and Aaditya Ramdas. Estimating means of bounded random variables by betting, 2020. URL https://arxiv.org/abs/2010.09686.

Yuhui Zhang, Jeff Z. HaoChen, Shih-Cheng Huang, Kuan-Chieh Wang, James Zou, and Serena Yeung. DrML: Diagnosing and rectifying vision models using language. In NeurIPS ML Safety Workshop, 2022. URL https://openreview.net/forum?id=losu6IAaPeB.

Ryszard Zieliński and Wojciech Zieliński. Best exact nonparametric confidence intervals for quantiles. Statistics, 39(1):67–71, 2005. URL https://doi.org/10.1080/02331880412331329854.
A Details for Proof of Corollary 1

We must first show that

$$\lim_{m\to\infty}\frac{\binom{n-j+m-\lceil m\beta\rceil}{n-j}\binom{j+\lceil m\beta\rceil-1}{j}}{\binom{n+m}{m}}=\binom{n}{j}\beta^{j}(1-\beta)^{n-j}.$$

By definition of the binomial coefficient, the left-hand side is

$$\lim_{m\to\infty}\frac{(n-j+m-\lceil m\beta\rceil)!\,(j+\lceil m\beta\rceil-1)!\,m!\,n!}{(n-j)!\,(m-\lceil m\beta\rceil)!\,j!\,(\lceil m\beta\rceil-1)!\,(n+m)!}.$$

Rearrangement of factors gives

$$\binom{n}{j}\lim_{m\to\infty}\frac{(n-j+m-\lceil m\beta\rceil)!\,(j+\lceil m\beta\rceil-1)!\,m!}{(m-\lceil m\beta\rceil)!\,(\lceil m\beta\rceil-1)!\,(n+m)!}.$$

So we must now show that

$$\lim_{m\to\infty}\underbrace{\frac{(n-j+m-\lceil m\beta\rceil)!\,(j+\lceil m\beta\rceil-1)!\,m!}{(m-\lceil m\beta\rceil)!\,(\lceil m\beta\rceil-1)!\,(n+m)!}}_{=:H(m)}=\beta^{j}(1-\beta)^{n-j}.$$

We use the upper and lower bounds in Stirling's formula (Robbins, 1955),

$$\sqrt{2\pi n}\,\Big(\frac{n}{e}\Big)^{n}e^{\frac{1}{12n+1}}<n!<\sqrt{2\pi n}\,\Big(\frac{n}{e}\Big)^{n}e^{\frac{1}{12n}}.$$

Let $\delta:=\lceil m\beta\rceil-m\beta\in[0,1)$, so that $m-\lceil m\beta\rceil=m(1-\beta)-\delta$ and $\lceil m\beta\rceil-1=m\beta+\delta-1$. With the shorthand $s(x,Q):=\sqrt{2\pi x}\,(x/e)^{x}\exp\big(\tfrac{1}{12x+Q}\big)$ for $Q\in\{0,1\}$, we introduce

$$h(m,Q):=\frac{s(n-j+m(1-\beta)-\delta,\,Q)\;s(j+m\beta+\delta-1,\,Q)\;s(m,\,Q)}{s(m(1-\beta)-\delta,\,1-Q)\;s(m\beta-1+\delta,\,1-Q)\;s(n+m,\,1-Q)},$$

allowing us to state succinctly

$$h(m,1)<H(m)<h(m,0).$$

The proof is complete if we can show that $\lim_{m\to\infty}h(m,Q)=\beta^{j}(1-\beta)^{n-j}$. To see this we rearrange the factors. We will also use little-oh notation, i.e. $f(m)=o(g(m))$ iff $\lim_{m\to\infty}|f(m)|/g(m)=0$. When taking limits, we use that $0\le\delta<1$ for all $m$. The square-root factors combine to

$$\bigg(\frac{(n-j+m(1-\beta)-\delta)(j+m\beta+\delta-1)\,m}{(m(1-\beta)-\delta)(m\beta-1+\delta)(n+m)}\bigg)^{1/2}=1+o(1),$$

and the powers of $e$ coming from the $(x/e)^{x}$ factors cancel exactly, since the exponents in the numerator and denominator both sum to $n+2m-1$. For the remaining power factors, write $n-j+m(1-\beta)-\delta=(m(1-\beta)-\delta)+(n-j)$ and similarly for the other terms, so that

$$\Big(1+\tfrac{n-j}{m(1-\beta)-\delta}\Big)^{m(1-\beta)-\delta}=e^{n-j}+o(1),\quad\Big(1+\tfrac{j}{m\beta-1+\delta}\Big)^{m\beta-1+\delta}=e^{j}+o(1),\quad\Big(1+\tfrac{n}{m}\Big)^{-m}=e^{-n}+o(1),$$

while

$$(n-j+m(1-\beta)-\delta)^{n-j}=m^{n-j}(1-\beta)^{n-j}+o(m^{n-j}),\quad(j+m\beta+\delta-1)^{j}=m^{j}\beta^{j}+o(m^{j}),\quad(n+m)^{-n}=m^{-n}+o(m^{-n}).$$

Finally, the Stirling correction factors $\exp\big(\tfrac{1}{12x+Q}\big)$ all tend to $1$. By the calculus of little-oh notation we find

$$h(m,Q)=e^{n-j}\,e^{j}\,e^{-n}\,m^{n-j}\,m^{j}\,m^{-n}\,\beta^{j}(1-\beta)^{n-j}+o(1),$$

which in turn means $\lim_{m\to\infty}h(m,Q)=\beta^{j}(1-\beta)^{n-j}$. We have thus shown that the limit of (4) is

$$\lim_{m\to\infty}a(k)=\sum_{j=k}^{n}\binom{n}{j}\beta^{j}(1-\beta)^{n-j}.$$
By recognizing the binomial cumulative distribution function (and denoting it with the symbol BIN( · ; n, β)) we see that

lim_{m→∞} a(k) = 1 − BIN(k − 1; n, β).

Finally,

k⋆ = min{ k ∈ {1, . . . , n + 1} : lim_{m→∞} a(k) ≤ α } = 1 + BIN⁻¹(1 − α; n, β).

B Extended Distribution Shift experiment

Continuing the experiment on distribution shift from Sec. 4.2, we keep n = 30 and also fix the random seed across runs to make the comparison clear. We generate calibration data sets D for three scenarios:

Scenario 1, Increasing mean: The mean µ shifts from 0.5 (no shift) to 3 (large shift), and the standard deviation is σ = 0.5 (no shift).

Scenario 2, Increasing variance: The mean is µ = 0.5 (no shift), and the standard deviation shifts from σ = 0.5 (no shift) to 2 (large shift).

Scenario 3, Decreasing variance: The mean is µ = 0.5 (no shift), and the standard deviation shifts from σ = 0.5 (no shift) to 0.05 (large shift).

These three series of calibration sets have their lal-curves plotted; see Fig. 9. The results of Scenario 1 show that as the mean shift increases, model performance generally deteriorates. One detail should be noted for moderate mean shifts. Some of the outlier data points with x < 0.5 contributed to a large lal when there is no shift, but with increasing shift, these data points are moved towards 0.5, where model performance is generally better. This effect means that the lal decreases somewhat, as can be observed for α ≤ 10%. When the mean shift increases, outliers with x > 0.5 produce larger losses, which is observed through the lal-curve moving to the right. A large mean shift moves most of the calibration set to x where model performance is worse, as indicated by the whole lal-curve shifting to the right, not only for selected α. The results of Scenario 2 show that the tail of the lal-curve increases with feature variance, indicating that model performance on outliers gets worse. Inliers are not as much affected.
The results of Scenario 3 show that the lal-curve moves to the left with decreasing feature variance, indicating smaller losses. This means that the data concentrate in regions of feature space where the model performs well. We do observe some distribution shift, but the model still performs well on the data. Taken together, these results indicate that the lal can indicate distribution shift, that it only detects shifts relevant for model performance, and that it indicates whether the shift incurs model degradation generally, or only on outliers.

Figure 9: lal-curves are drawn for various calibration sets D under distribution shift, as described in Appendix B. Dark blue line = no shift; yellow line = large shift. (a) lal-curves for distribution shift in Scenario 1, Increasing mean. (b) lal-curves for distribution shift in Scenario 2, Increasing variance. (c) lal-curves for distribution shift in Scenario 3, Decreasing variance.
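The closed-form index k⋆ = 1 + BIN⁻¹(1 − α; n, β) derived in Appendix A can be sketched numerically. Reading Cor. 1 as returning the k⋆-th smallest calibration loss (or +∞ when k⋆ exceeds n) is our interpretation of the corollary, and the losses below are purely illustrative.

```python
import numpy as np
from scipy import stats

def lal_infinite_m(losses, alpha, beta):
    """Sketch of the m -> infinity lal: the k*-th smallest of the n calibration
    losses, with k* = 1 + BIN^{-1}(1 - alpha; n, beta); if k* > n, no finite
    bound is available and +infinity is returned (our reading of Cor. 1)."""
    n = len(losses)
    k_star = 1 + int(stats.binom.ppf(1 - alpha, n, beta))
    s = np.sort(losses)
    return s[k_star - 1] if k_star <= n else np.inf

losses = np.arange(1.0, 31.0)   # n = 30 illustrative calibration losses
print(lal_infinite_m(losses, alpha=0.05, beta=0.5))
```

With n = 30, α = 5% and β = 50%, the returned threshold sits well above the median loss, reflecting the conservatism needed for a finite-sample guarantee; pushing β toward 1 quickly drives the bound to +∞.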