# Model Similarity Mitigates Test Set Overuse

Horia Mania (UC Berkeley, hmania@berkeley.edu), John Miller (UC Berkeley, miller_john@berkeley.edu), Ludwig Schmidt (UC Berkeley, ludwig@berkeley.edu), Moritz Hardt (UC Berkeley, hardt@berkeley.edu), Benjamin Recht (UC Berkeley, brecht@berkeley.edu)

Excessive reuse of test data has become commonplace in today's machine learning workflows. Popular benchmarks, competitions, industrial-scale tuning, among other applications, all involve test data reuse beyond guidance by statistical confidence bounds. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse. We proffer a new explanation for the apparent longevity of test data: many proposed models are similar in their predictions, and we prove that this similarity mitigates overfitting. Specifically, we show empirically that models proposed for the ImageNet ILSVRC benchmark agree in their predictions well beyond what we can conclude from their accuracy levels alone. Likewise, models created by large-scale hyperparameter search enjoy high levels of similarity. Motivated by these empirical observations, we give a non-asymptotic generalization bound that takes similarity into account, leading to meaningful confidence bounds in practical settings.

1 Introduction

Be it validation sets for model tuning, popular benchmark data, or machine learning competitions, the holdout method is central to the scientific and industrial activities of the machine learning community. As compute resources scale, a growing number of practitioners evaluate an unprecedented number of models against various holdout sets. These practices, collectively, put significant pressure on the statistical guarantees of the holdout method.

Theory suggests that for k models chosen independently of n test data points, the holdout method provides valid risk estimates for each of these models up to a deviation on the order of $\sqrt{\log(k)/n}$ [5]. But this bound is the consequence of an unrealistic assumption. In practice, models incorporate prior information about the available test data since human analysts choose models in a manner guided by previous results. Adaptive hyperparameter search algorithms similarly evolve models on the basis of past trials [12]. Adaptivity significantly complicates the theoretical guarantees of the holdout method. A simple adaptive strategy, resembling the practice of selectively ensembling k models, can bias the holdout method by as much as $\sqrt{k/n}$ [5]. If this bound were attained in practice, holdout data across the board would rapidly lose its value over time. Nonetheless, recent replication studies give evidence that popular benchmarks continue to support progress despite years of extensive reuse [15, 20].

In this work, we contribute a new explanation for why the adaptive bound is not attained in practice and why even the standard non-adaptive bound is more pessimistic than it needs to be. Our explanation centers around the phenomenon of model similarity. Practitioners evaluate models that incorporate common priors, past experiences, and standard practices.
Figure 1: (a) shows the empirical pairwise similarity between ImageNet models and the hypothetical similarity between models if they were making mistakes independently. (b) plots the number of testable models on ImageNet such that the population error rates for all models are estimated up to 1% error with probability 0.95. We compare the guarantee of the standard union bound with that of a union bound which considers model similarities.

As we show empirically, this results in models that exhibit significant agreement in their predictions, well beyond what would follow from their accuracy values alone. Complementing our empirical investigation of model similarity, we provide a new theoretical analysis of the holdout method that takes model similarity into account, vastly improving over known bounds in the adaptive and non-adaptive cases when model similarity is high.

1.1 Our contributions

Our contributions are two-fold. On the empirical side, we demonstrate that a large number of proposed ImageNet [3, 16] and CIFAR-10 [9] models exhibit a high degree of similarity: their predictions agree far more than we would be able to deduce from their accuracy levels alone. Complementing our empirical findings, we give new generalization bounds that incorporate a measure of similarity. Our generalization bounds help to explain why holdout data has much greater longevity than prior bounds suggest when models are highly similar, as is the case in practice.

Figure 1 summarizes these two complementary developments. Underlying Figure 1a is a family of representative ImageNet models whose pairwise similarity we evaluate. The mean level of similarity of these models, together with a refined union bound, offers a 4× improvement over a carefully optimized baseline bound that does not take model similarity into account. In Figure 1b we compare our guarantee on the number of holdout reuses with the baseline bound. This illustrates that our bound is not just asymptotic, but concrete: it gives meaningful values in the practical regime. Moreover, in Section 5 we discuss how an additional assumption on model predictions can boost the similarity-based guarantee by multiple orders of magnitude.

Investigating model similarity in practice further, we evaluate the similarity of models encountered during the course of a large random hyperparameter search and a large neural architecture search for the CIFAR-10 dataset. We find that the pairwise model similarities throughout both procedures remain high. The similarity provides a counterweight to the massive number of model evaluations, limiting the amount of overfitting we observe.

1.2 Related work

Recht et al. [15] recently created new test sets for ImageNet and CIFAR-10, carefully following the original test set creation processes. Reevaluating all proposed models on the new test sets showed that while there was generally an absolute performance drop, the effect of overfitting due to adaptive behavior was limited to non-existent. Indeed, newer and better models on the old test set also performed better on the new test set, even though they had in principle more time to adapt to the test set.
Also, Yadav and Bottou [20] recently released a new test set for the seminal MNIST task, on which they observed no overfitting. Dwork et al. [5] recognized the issue of adaptivity in holdout reuse and provided new holdout mechanisms based on noise addition that support quadratically more queries than the standard method in the worst case. There is a rich line of work on adaptive data analysis; Smith [18] offers a comprehensive survey of the field.

We are not the first to proffer an explanation for the apparent lack of overfitting in machine learning benchmarks. Blum and Hardt [2] argued that if analysts only check whether they improved on the previous best model, while ignoring models that did not improve, better adaptive generalization bounds are possible. Zrnic and Hardt [21] offered improved guarantees for adaptive analysts that satisfy natural assumptions, e.g. the analyst is unable to arbitrarily use information from queries asked far in the past. More recently, Feldman et al. [6] gave evidence that the number of classes in a classification problem helps mitigate overfitting in benchmarks. We see these different explanations as playing together in what is likely the full explanation of the available empirical evidence. In parallel to our work, Yadav and Bottou [20] discussed the advantages of comparing models on the same test set; paired tests can provide tighter confidence bounds for model comparisons in this setting than individual confidence intervals for each model.

2 Problem setup

Let $f: \mathcal{X} \to \mathcal{Y}$ be a classifier mapping examples from a domain $\mathcal{X}$ to labels in a set $\mathcal{Y}$. Moreover, we consider a test set $S = \{z_1, \dots, z_n\}$ of $n$ examples $z_i = (x_i, y_i)$ sampled i.i.d. from a data distribution $D$. The main quantity we aim to analyze is the gap between the accuracy of the classifier $f$ on the test set $S$ and the population accuracy of the same classifier under the distribution $D$. If the gap between the two accuracies is large, we say $f$ has overfit to the test set.

As is commonly done in the adaptive data analysis literature [1], we formalize interactions with the test set via statistical queries $q: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. In our case, the queries are $\{0, 1\}$-valued; given a classifier $f$ we consider the query $q_f$ defined by $q_f(z) = \mathbb{1}\{f(x) \neq y\}$, where $z = (x, y)$. Then, we denote the empirical mean of the query $q_f$ on the test set $S$ (i.e., $f$'s test error) by $E_S[q_f] = \frac{1}{n}\sum_{i=1}^{n} q_f(z_i)$. The population mean (population error) is accordingly defined as $E_D[q] = \mathbb{E}_{z \sim D}\, q(z)$.

When discussing overfitting, we are usually interested in a set of classifiers, e.g., obtained via a hyperparameter search. Let $f_1, \dots, f_k$ be such a set of classifiers and $q_1, \dots, q_k$ be the set of corresponding queries. To quantify the probability that overfitting occurs (i.e., one of the $f_i$ has a large deviation between test and population accuracy), we would like to upper bound the probability

$$\mathbb{P}\left(\max_{1 \le i \le k} \left|E_S[q_i] - E_D[q_i]\right| \ge \varepsilon\right). \tag{1}$$

A standard way to bound (1) is to invoke the union bound and treat each query separately:

$$\mathbb{P}\left(\max_{1 \le i \le k} \left|E_S[q_i] - E_D[q_i]\right| \ge \varepsilon\right) \le \sum_{i=1}^{k} \mathbb{P}\left(\left|E_S[q_i] - E_D[q_i]\right| \ge \varepsilon\right). \tag{2}$$

We can then utilize standard concentration results to bound the right-hand side. However, such an approach inherently cannot capture dependencies between the queries $q_i$ (or classifiers $f_i$). In particular, we are interested in the similarity between two queries $q$ and $q'$ measured by $\mathbb{P}(q(z) = q'(z))$ (the probability of agreement between the 0-1 losses of the corresponding two classifiers).
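The quantities above are straightforward to estimate from model predictions on a labeled test set. The snippet below is a minimal sketch, not code from the paper; the prediction and label arrays are hypothetical inputs and only numpy is assumed.

```python
import numpy as np

def query_values(preds, labels):
    """0-1 loss vector: q_f(z_i) = 1{f(x_i) != y_i} for one model."""
    return (np.asarray(preds) != np.asarray(labels)).astype(float)

def test_error(q):
    """Empirical mean E_S[q_f], i.e. the test error of the model."""
    return float(np.mean(q))

def similarity(q1, q2):
    """Empirical estimate of P(q(z) = q'(z)): the fraction of test points on
    which the two models are simultaneously right or simultaneously wrong."""
    return float(np.mean(q1 == q2))

def independent_similarity(mu1, mu2):
    """Similarity two models with error rates mu1, mu2 would have if their
    mistakes were independent (the baseline plotted in Figure 1a)."""
    return mu1 * mu2 + (1 - mu1) * (1 - mu2)
```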
The main goal of this paper is to understand how high similarity can lead to better bounds on (1), both in theory and in numerical experiments with real data from ImageNet and CIFAR-10.

3 Non-adaptive classification

We begin by analyzing the effect of classifier similarity when the classifiers to be evaluated are chosen non-adaptively. For instance, this is the case when the algorithm designer fixes a grid of hyperparameters to be explored before evaluating any of the classifiers on the test set. To draw valid conclusions from the hyperparameter search, it is important that the resulting test accuracies reflect the true population accuracies, i.e., that probability (1) is small.

Bound (2) is sharp when the events $\{|E_S[q_i] - E_D[q_i]| \ge \varepsilon\}$ are almost disjoint, which is not true when the queries are similar to each other. To address this issue, we modify our use of the union bound. We consider the tail events $E_i = \{E_S[q_i] - E_D[q_i] \ge \varepsilon\}$. For any $t \ge 0$, we obtain

$$\mathbb{P}\left(\bigcup_{i=1}^{k} E_i\right) \le \mathbb{P}\left(E_S[q_1] - E_D[q_1] \ge \varepsilon - t\right) + \mathbb{P}\left(\bigcup_{i=2}^{k} E_i \cap \{E_S[q_1] - E_D[q_1] < \varepsilon - t\}\right)$$
$$\le \mathbb{P}\left(E_S[q_1] - E_D[q_1] \ge \varepsilon - t\right) + \sum_{i=2}^{k} \mathbb{P}\left(E_i \cap \{E_S[q_1] - E_D[q_1] < \varepsilon - t\}\right). \tag{3}$$

Intuitively, the terms $\mathbb{P}\left(E_i \cap \{E_S[q_1] - E_D[q_1] < \varepsilon - t\}\right)$ are small when the queries $q_1$ and $q_i$ are similar: if $\mathbb{P}(q_1(z) = q_i(z))$ is large, we cannot simultaneously have $E_S[q_1] < E_D[q_1] + \varepsilon - t$ and $E_S[q_i] \ge E_D[q_i] + \varepsilon$, since the deviations go in opposite directions. In the rest of this section, we make this intuition precise and derive an upper bound on (1) in terms of the query similarities. Before we state our main result, we introduce the following notion of a similarity cover.

Definition 1. Let $F$ be a set of queries. We say a query set $M$ is an $\eta$-similarity cover of $F$ if for any query $q \in F$ there exist $q', q'' \in M$ such that $E_D[q'] \le E_D[q]$, $E_D[q''] \ge E_D[q]$, $\mathbb{P}(q'(z) = q(z)) \ge \eta$, and $\mathbb{P}(q''(z) = q(z)) \ge \eta$ ($M$ does not necessarily have to be a subset of $F$). Let $N_\eta(F)$ denote the size of a minimal $\eta$-similarity cover of $F$ (when the query set $F$ is clear from context we use the simpler notation $N_\eta$).

Theorem 2. Let $F = \{q_1, q_2, \dots, q_k\}$ be a collection of queries $q_i: \mathcal{Z} \to \{0, 1\}$ independent of the test set $\{z_1, z_2, \dots, z_n\}$. Then, for any $\eta \in [0, 1]$ we have

$$\mathbb{P}\left(\max_{1 \le i \le k} \left|E_S[q_i] - E_D[q_i]\right| \ge \varepsilon\right) \le 2N_\eta\, e^{-\frac{n\varepsilon^2}{2}} + 2k\, e^{-\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{4(1-\eta)}\right)}. \tag{4}$$

Then, for all $\eta \le 1 - \frac{2\log(4k/\delta)}{n}$, we have with probability $1 - \delta$

$$\max_{1 \le i \le k} \left|E_S[q_i] - E_D[q_i]\right| \le \max\left\{\sqrt{\frac{2\log(4N_\eta/\delta)}{n}},\ \sqrt{\frac{32(1-\eta)\log(4k/\delta)}{n}}\right\}. \tag{5}$$

Moreover, if $\varepsilon = \sqrt{\frac{\log((2N_\eta + 1)/\delta)}{n}}$ and $\eta \ge 1 - \frac{\varepsilon}{4}\left(e^{2\varepsilon}(2k)^{4/(n\varepsilon)} - 1\right)^{-1}$, we have with probability $1 - \delta$

$$\max_{1 \le i \le k} \left|E_S[q_i] - E_D[q_i]\right| \le \varepsilon. \tag{6}$$

To elucidate how the model similarity $\eta$ controls the number of queries $k$ for which Theorem 2 gives a non-trivial bound, consider the case where $N_\eta = 1$, i.e. at least one model is $\eta$-similar to all of the others. As the similarity $\eta$ of the model collection grows, the number of queries $k$ grows as well, as the following simple result shows.

Corollary 3. Let $F = \{q_1, q_2, \dots, q_k\}$ be a collection of $k$ queries $q_i: \mathcal{Z} \to \{0, 1\}$ fixed independently of the test set. Choose $\eta$ so that $N_\eta = 1$. Suppose $n \ge c_1 \max\{\varepsilon^{-1}, \varepsilon^{-2}\}$ and the number of queries $k$ satisfies

$$k \le c_2 \exp\left(\frac{n\varepsilon}{4}\log\left(1 + \frac{\varepsilon}{4(1-\eta)}\right)\right) \tag{7}$$

for positive constants $c_1, c_2$. Then, with probability $3/4$, $\max_{1 \le i \le k} |E_S[q_i] - E_D[q_i]| \le \varepsilon$.

The proof of Theorem 2 starts with the refined union bound (3), or a standard triangle inequality, and then applies the Chernoff-style concentration bound shown in Lemma 4 for random variables which take values in $\{-1, 0, 1\}$. We defer the proof details of both the lemma and the theorem to Appendix A.
Lemma 4. Suppose $X_i$ are i.i.d. discrete random variables which take values $-1$, $0$, and $1$ with probabilities $p_{-1}$, $p_0$, and $p_1$ respectively, and hence $\mathbb{E}X_i = p_1 - p_{-1}$. Then, for any $t \ge 0$ such that $p_1 - p_{-1} + t/2 \ge 0$ we have

$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n} X_i > p_1 - p_{-1} + t\right) \le e^{-\frac{nt}{2}\log\left(1 + \frac{t}{2p_1}\right)}.$$

Discretization arguments based on coverings are standard in statistical learning theory. Covers based on the population Hamming distance $\mathbb{P}(q'(z) \neq q(z))$ have been previously studied [4, 11] (note that for $\{0,1\}$-valued queries the Hamming distance coincides with the $L_1$ distance and the squared $L_2$ distance). An important distinction between our result and prior work is that prior work requires $\eta$ to be greater than $1 - \varepsilon$. Theorem 2 can offer an improvement over the standard guarantee $\sqrt{\log(k)/n}$ even when $\eta$ is much smaller than $1 - \varepsilon$. First of all, note that (5) holds for $\eta$ bounded away from one. Moreover, since $e^{2\varepsilon} \ge 1 + 2\varepsilon$, if $(2k)^{4/(n\varepsilon)} \le 1 + \varepsilon$ (the choice of $1 + \varepsilon$ is somewhat arbitrary), we see that the requirement on $\eta$ for (6) is satisfied when $\eta$ is on the order of $1 - \varepsilon$.

4 Adaptive classification

In the previous section, we showed that similarity can prevent overfitting when the sequence of queries is chosen non-adaptively, i.e. when the queries $\{q_1, q_2, \dots, q_k\}$ are fixed independently of the test set $S$. In the adaptive setting, we assume the query $q_t$ can be selected as a function of the previous queries $\{q_1, q_2, \dots, q_{t-1}\}$ and estimates $\{E_S[q_1], E_S[q_2], \dots, E_S[q_{t-1}]\}$. Even when queries are chosen adaptively, we show that leveraging similarity can provide sharper bounds on the probability of overfitting, $\mathbb{P}\left(\max_{1 \le i \le k} |E_S[q_i] - E_D[q_i]| \ge \varepsilon\right)$.

The field of adaptive data analysis offers a rich technical repertoire to address overfitting in this setting [5, 18]. In this framework, analogous to the typical machine learning workflow, an analyst iteratively selects a classifier and then queries a mechanism to provide an estimate of test-set performance. In practice, the mechanism often used is the Trivial Mechanism, which computes the empirical mean of the query on the test set and returns the exact value to the analyst. For simplicity, we study how similarity improves the performance of the Trivial Mechanism.

The empirical mean of any query can take at most $n + 1$ values, and thus a deterministic analyst might ask at most $(n+1)^{k-1}$ queries in $k$ rounds of interaction with the Trivial Mechanism. Let $F$ denote the set of $(n+1)^{k-1}$ possible queries. Then, we apply Theorem 2 to $F$.

Corollary 5. Let $F$ be the set of queries that a fixed analyst $A$ might pose to the Trivial Mechanism. We assume that the Trivial Mechanism has access to a test set of size $n$. Let $\alpha \in [0, 1]$, $\varepsilon = \sqrt{\frac{4\left(k^{1-\alpha}\log(n+1) + \log(2/\delta)\right)}{n}}$ and $\eta = 1 - \frac{\varepsilon}{4\left(e^{\varepsilon k^{\alpha}} - 1\right)}$. If $N_\eta(F) \le (n+1)^{k^{1-\alpha}}$, we have with probability $1 - \delta$

$$\max_{1 \le i \le k} \left|E_S[q_i] - E_D[q_i]\right| \le \varepsilon \tag{8}$$

for any queries $q_1, q_2, \dots, q_k$ chosen adaptively by $A$.

Proof. Note that when $\eta = 1 - \frac{\varepsilon}{4\left(e^{\varepsilon k^{\alpha}} - 1\right)}$ we have $\log\left(1 + \frac{\varepsilon}{4(1-\eta)}\right) = \varepsilon k^{\alpha}$. Then, the result follows from the first part of Theorem 2.

In Corollary 5, the parameter $\alpha$ quantifies the strength of the similarity assumption. For $\alpha = 0$, there is no similarity requirement, and Corollary 5 always applies. In this case, the bound matches standard results for the Trivial Mechanism with $\varepsilon = O(\sqrt{k/n})$. However, as $\alpha$ grows, the similarity requirement becomes more restrictive while the corresponding confidence interval becomes increasingly tight. In particular, for any $\alpha > 0$, if $F$ admits a similarity cover of size $N_\eta(F) \le (n+1)^{k^{1-\alpha}}$ for $\eta = 1 - (\varepsilon/4)\left(e^{\varepsilon k^{\alpha}} - 1\right)^{-1}$, we obtain a superlinear improvement in the dependence on $k$. For instance, if $\alpha = 1/2$, then $\varepsilon = O\left(\sqrt{k^{1/2}/n}\right)$, and we obtain a quadratic improvement in the number of queries for a fixed sample size. This improvement is similar to that achieved by the Gaussian mechanism [1, 5]. Moreover, since our technique is essentially tightening a union bound, this improvement easily extends to other mechanisms that rely on compression-based arguments, for instance, the Ladder Mechanism [2].
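To make the interaction model of this section concrete, the following is a minimal sketch of the Trivial Mechanism, assuming classifiers are callables that return predicted labels; it is an illustration, not an implementation from the paper.

```python
import numpy as np

class TrivialMechanism:
    """Answers statistical queries with exact empirical means on a fixed test set."""

    def __init__(self, x_test, y_test):
        self.x_test = np.asarray(x_test)
        self.y_test = np.asarray(y_test)
        self.n = len(self.y_test)

    def query(self, classifier):
        # Returns E_S[q_f] for q_f(z) = 1{f(x) != y}, i.e. the exact test error.
        preds = classifier(self.x_test)
        return float(np.mean(preds != self.y_test))

# An adaptive analyst picks classifier t+1 after seeing the first t answers.
# Each answer takes one of n+1 values, so a deterministic analyst can reach at
# most (n+1)^(k-1) distinct queries in k rounds; this is the set F to which
# Corollary 5 applies.
```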
5 Empirical results

So far, we have established theoretically that similarity between classifiers allows us to evaluate a larger number of classifiers on the test set without overfitting. In this section, we investigate whether these improvements already occur in the regime of contemporary machine learning. We specifically focus on ImageNet and CIFAR-10, two widely used machine learning benchmarks that have recently been shown to exhibit little to no adaptive overfitting in spite of almost a decade of test set re-use [15]. For both datasets, we empirically measure two main quantities: (i) the similarity between a wide range of models, some of them arising from hyperparameter search experiments, and (ii) the resulting increase in the number of models we can evaluate in a non-adaptive setting compared to a baseline that does not utilize the model similarities.

5.1 Similarities on ImageNet

We utilize the model testbed from Recht et al. [15] (available at https://github.com/modestyachts/ImageNetV2), who collected a dataset of 66 image classifiers that includes a wide range of standard ImageNet models such as AlexNet [10], ResNets [7], DenseNets [8], VGG [17], Inception [19], and several other models. As a baseline for the observed similarities between these models, we compare them to classifiers with the same accuracy but otherwise random predictions: given two models $f_1$ and $f_2$ with population error rates $\mu_1$ and $\mu_2$, we know that the similarity $\mathbb{P}\left(\mathbb{1}\{f_1(x) \neq y\} = \mathbb{1}\{f_2(x) \neq y\}\right)$ equals $\mu_1\mu_2 + (1-\mu_1)(1-\mu_2)$ if the random variables $\mathbb{1}\{f_1(x) \neq y\}$ and $\mathbb{1}\{f_2(x) \neq y\}$ are independent. Figure 1a in the introduction shows these model similarities assuming the models make independent mistakes and also the empirical data for the $\binom{66}{2} = 2{,}145$ pairs of models. We see that the empirical similarities are significantly higher than the random baseline (mean 0.85 vs 0.62).

The corresponding Figure 1b shows two lower bounds on the number of models that can be evaluated for the empirical ImageNet data. In particular, we use n = 50,000 (the size of the ImageNet validation set) and a target probability δ = 0.05 for the overfitting event (1) with error ε = 0.01. We compare two methods for computing the number of non-adaptively testable models: a guarantee based on the simple union bound (2) and a guarantee based on the more refined union bound derived from our theoretical analysis in Section 3. Later in this section, we introduce an even stronger bound that utilizes higher-order interactions between the model similarities and yields significantly larger improvements under an assumption on the structure among the classifiers. To obtain meaningful quantities in the regime of ImageNet, all bounds here require significantly sharper numerical calculations than standard theoretical tools such as Chernoff bounds provide. We now describe these calculations at a high level and defer the details to Appendix B. After introducing the three methods, we compare them on the ImageNet data.

Standard union bound. Given n, ε, and the population error rate $E_D[q_i]$ of all models, we can compute the right-hand side of (2) exactly (after an additional union bound to decouple the left and right tails). It is well known that higher accuracies lead to a smaller probability of a large deviation and hence allow for a larger number of test set reuses. We assume all models have population accuracy 75.6%, the average top-1 accuracy of the 66 ImageNet models. In this case, the vanilla union bound (2) guarantees that k = 257,397 models can be evaluated on a test set of size 50,000 so that their empirical accuracies lie in the confidence interval 0.756 ± 0.01 with probability at least 95%.
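This count can be reproduced, at least approximately, with exact binomial tail probabilities. The sketch below is not the paper's code, and the exact boundary conventions used in Appendix B may differ slightly from the ones chosen here.

```python
import math
from scipy.stats import binom

def max_testable_models(n, p, eps, delta):
    """Largest k with k * P(|Bin(n, p)/n - p| >= eps) <= delta, i.e. the
    guarantee of the standard union bound (2) with exact binomial tails."""
    upper = binom.sf(math.ceil(n * (p + eps)) - 1, n, p)  # P(test error >= p + eps)
    lower = binom.cdf(math.floor(n * (p - eps)), n, p)    # P(test error <= p - eps)
    return int(delta / (upper + lower))

# Setting from the text: n = 50,000, error rate p = 1 - 0.756, eps = 0.01, delta = 0.05.
print(max_testable_models(50_000, 0.244, 0.01, 0.05))
```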
Similarity Union Bound. While the union bound (2) is easy to use, it does not leverage the dependencies between the random variables $\mathbb{1}\{f_i(x) \neq y\}$ for $i \in \{1, 2, \dots, k\}$. To exploit this property, we utilize the refined union bound (3), which is guaranteed to be an improvement over (2) when the parameter $t$ is optimized. In order to use (3), we must compute the probabilities

$$\mathbb{P}\left(\{E_S[q_2] - E_D[q_2] \ge \alpha_2\} \cap \{E_S[q_1] - E_D[q_1] \le \alpha_1\}\right) \tag{9}$$

for given $\alpha_1$, $\alpha_2$, $E_D[q_1]$, $E_D[q_2]$, and similarity $\mathbb{P}(q_1(z) = q_2(z))$. In Appendix B, we show that we can compute these probabilities efficiently by assigning success probabilities to three independent Bernoulli random variables $X_1$, $X_2$, and $W$ such that $(X_1 W, X_2 W)$ is equal to $(q_1(z), q_2(z))$ in distribution. Let $p_w := \mathbb{P}(W = 1)$. Then, given i.i.d. draws $X_{1i}$, $X_{2i}$, and $W_i$, we condition on the values of the $W_i$ to express probability (9) as

$$\mathbb{P}\left(\{E_S[q_2] - E_D[q_2] \ge \alpha_2\} \cap \{E_S[q_1] - E_D[q_1] \le \alpha_1\}\right) = \sum_{j=0}^{n} \binom{n}{j} p_w^j (1-p_w)^{n-j}\, \mathbb{P}\left(\sum_{i=1}^{j} X_{2i} \ge n(p_2 + \alpha_2)\right) \mathbb{P}\left(\sum_{i=1}^{j} X_{1i} \le n(p_1 + \alpha_1)\right), \tag{10}$$

where $p_1 = E_D[q_1]$ and $p_2 = E_D[q_2]$. We refer the reader to Appendix B for more details. The two tail probabilities for the $X_{1i}$ and $X_{2i}$ can be computed efficiently with the use of beta functions. Using (10) and (3) with a binary search over $t$, we can compute the probability of making an error of ε when estimating the population error rates of k models with given error rates and pairwise similarities. Figure 1b shows the maximum number of models k that can be evaluated on the same test set so that the probability of making an ε = 0.01 error in estimating all their error rates is at most 0.05 when the models satisfy $E_D[q_i] = 0.244$ and $\mathbb{P}(q_i(z) = q_j(z)) \ge 0.85$ for all $1 \le i, j \le k$. The figure shows that our new bound offers a significant improvement over the guarantee given by the standard union bound (2).
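As a concrete sketch of how (10) can be evaluated, the snippet below chooses the coupling parameters by matching the two marginal error rates and the agreement probability; this parameter choice and the boundary conventions are assumptions of the sketch, and the precise construction is the one given in Appendix B of the paper.

```python
import numpy as np
from scipy.stats import binom

def joint_deviation_prob(n, p1, p2, eta, alpha1, alpha2):
    """P(E_S[q2] - p2 >= alpha2 and E_S[q1] - p1 <= alpha1) under the coupling
    (q1(z), q2(z)) =_d (X1*W, X2*W) with independent Bernoulli variables."""
    # Matching marginals and agreement: p_i = P(X_i = 1) P(W = 1) and
    # 1 - eta = P(W = 1) * P(X1 != X2).
    pw = 2 * p1 * p2 / (p1 + p2 + eta - 1)   # P(W = 1)
    px1, px2 = p1 / pw, p2 / pw              # P(X1 = 1), P(X2 = 1)
    j = np.arange(n + 1)                     # number of test points with W_i = 1
    w_pmf = binom.pmf(j, n, pw)
    # Conditioned on j, the two error counts are independent Binomial(j, px).
    tail2 = binom.sf(np.ceil(n * (p2 + alpha2)) - 1, j, px2)  # sum_i X2i >= n(p2 + alpha2)
    tail1 = binom.cdf(np.floor(n * (p1 + alpha1)), j, px1)    # sum_i X1i <= n(p1 + alpha1)
    return float(np.sum(w_pmf * tail2 * tail1))

# Two ImageNet-like models: error rate 0.244 each, pairwise similarity 0.85.
print(joint_deviation_prob(50_000, 0.244, 0.244, 0.85, alpha1=0.005, alpha2=0.01))
```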
Similarity union bound with a Naive Bayes assumption. While the previous computation uses the pairwise similarities observed empirically to offer an improved guarantee on the number of allowed test set reuses, it does not take into account higher-order dependencies between the models. In particular, Figure 4 in Appendix C shows that 27.8% of test images are correctly classified by all the models, 55.9% of test images are correctly classified by 60 of the 66 models considered, and 4.7% of test images are incorrectly classified by all the models. We now show how this kind of agreement between models enables a larger number of test set reuses. Inspired by the coupling used in (10), we make the following assumption.

Assumption A1 (Naive Bayes). Let $q_1, q_2, \dots, q_k$ be a collection of queries such that $E_D[q_i] = p$ and $\mathbb{P}(q_i(z) = q_j(z)) = \eta$ for some $p$ and $\eta$, for all $1 \le i, j \le k$. We say such a collection has a Naive Bayes structure if there exist $p_x$ and $p_w$ in $[0, 1]$ such that $(q_1(z), q_2(z), \dots, q_k(z))$ is equal to $(X_1 W, X_2 W, \dots, X_k W)$ in distribution, where $W, X_1, \dots, X_k$ are independent Bernoulli random variables with $\mathbb{P}(W = 1) = p_w$ and $\mathbb{P}(X_i = 1) = p_x$ for all $1 \le i \le k$.

Intuitively, a collection of queries $\mathbb{1}\{f_i(x) \neq y\}$ has a Naive Bayes structure if the data distribution $D$ generates easy examples $(x, y)$ with probability $1 - p_w$, which all the models $f_i$ classify correctly, and if an example is not easy, the models make mistakes independently. As mentioned before, Figure 4 supports the existence of such an easy set. When a test point in the ImageNet test set is not an easy example, the models do not make mistakes independently. Therefore, Assumption A1 is not exactly satisfied by existing ImageNet models. However, we know that independent Bernoulli trials saturate the standard union bound (2). This effect can also be observed in Figure 2: as the similarity between the models decreases, i.e. as $1 - p_w$ decreases, the models approach making mistakes independently and the guarantee with Assumption A1 converges to the standard union bound guarantee. So while Assumption A1 is not exactly satisfied in practice, the violation among the ImageNet classifiers likely implies an even better lower bound on the number of testable models.

Assumption A1 is also computationally advantageous: it allows us to compute the overfitting probability (1) exactly, as we detail in Appendix B. Figure 2 is an extension of Figure 1b; it shows the relative improvement of our bounds over the standard union bound in terms of the number of testable models when ε = 0.01 and δ = 0.01. Moreover, Figure 2 also shows that the relative improvement of our bounds increases quickly with ε. According to Figure 2, Assumption A1 implies that we can evaluate $10^8$ models on the test set in the regime of ImageNet without overfitting. While this number of models might seem unnecessarily large, in Section 4 we saw that when models are chosen adaptively we must consider a tree of possible models, which can easily contain $10^8$ models.
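Under Assumption A1, the overfitting probability (1) can be evaluated by conditioning on the number of hard test points. The sketch below is one way to carry out this computation; the parameter matching and numerical details are assumptions of the sketch rather than the exact procedure of Appendix B.

```python
import numpy as np
from scipy.stats import binom

def overfit_prob_naive_bayes(n, p, eta, eps, k):
    """P(max_i |E_S[q_i] - p| >= eps) for k queries satisfying Assumption A1."""
    # Matching p = pw * px and 1 - eta = 2 * pw * px * (1 - px).
    px = 1 - (1 - eta) / (2 * p)   # P(X_i = 1)
    pw = p / px                    # P(W = 1)
    j = np.arange(n + 1)           # number of hard test points
    w_pmf = binom.pmf(j, n, pw)
    # Given j hard points, each model's error count is an independent Bin(j, px).
    hi = binom.sf(np.ceil(n * (p + eps)) - 1, j, px)
    lo = binom.cdf(np.floor(n * (p - eps)), j, px)
    per_model_dev = np.clip(hi + lo, 0.0, 1.0)
    with np.errstate(divide="ignore"):
        all_within = np.exp(k * np.log1p(-per_model_dev))  # all k models stay within eps
    return float(1.0 - np.sum(w_pmf * all_within))

# Illustration at ImageNet scale: n = 50,000, error 0.244, similarity 0.85.
print(overfit_prob_naive_bayes(50_000, 0.244, 0.85, eps=0.01, k=10**6))
```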
5.2 Similarities on CIFAR-10

Practitioners often evaluate many more models than the handful that ultimately appear in publication. The choice of architecture is the result of a long period of iterative refinement, and the hyperparameters for any fixed architecture are often chosen by evaluating a large grid of plausible models. Using data from CIFAR-10, we demonstrate that both of these common practices generate large classes of very similar models.

Figure 2: The left panel shows the multiplicative gains in the number of testable models, as a function of model similarity, over the guarantee offered by the standard union plus binomial bound, with ε = 0.01 and δ = 0.05. The right panel shows the same multiplicative gains, but as a function of ε, when δ = 0.05 and the pairwise similarity is η = 0.85.

Figure 3: Model similarities and covering numbers for random hyperparameter search on CIFAR-10.

Random hyperparameter search. To understand the similarity between models evaluated in hyperparameter search, we ran our own random search to choose hyperparameters for a ResNet-110. The grid included properties of the architecture (e.g. type of residual block), the optimization algorithm (e.g. choice of optimizer), and the data distribution (e.g. data augmentation strategies). A full specification of the grid is included in Appendix D. We sample and train 320 models, and, for each model, we select 10 checkpoints evenly spaced throughout training. The best model considered achieves an accuracy of 96.6%, and, after restricting to models with accuracy at least 50%, we are left with 1,235 model checkpoints. In Figure 3, we show the similarity for each pair of checkpoints and compute an upper bound on the corresponding similarity covering number $N_\eta(F)$ for each possible value of η. As in the case of ImageNet, CIFAR-10 models found by random search are significantly more similar than random chance would suggest.

Neural architecture search. In the random search experiment, all of the models were chosen non-adaptively: the grid of models is fixed in advance. However, similarity also protects against overfitting in the adaptive setting. To illustrate this, we compute the similarity for models evaluated by automatic neural architecture search. In particular, we ran the DARTS neural architecture search pipeline to adaptively evaluate a large number of plausible models in search of promising configurations [13, 14]. In Table 1, we report the mean accuracies and pairwise similarities for 20 randomly selected configurations evaluated by DARTS, as well as the top 20 scoring configurations according to DARTS' internal scoring mechanism. Table 1 also shows that the multiplicative gains in the number of testable models offered by our similarity bound (SB) and our naive Bayes bound (NBB) over the standard union bound are between one and four orders of magnitude. Therefore, even in a high-accuracy regime we can guarantee a significantly higher number of test set reuses without overfitting when taking model similarities into account.

Table 1: Neural Architecture Search Similarities

| Models | Mean Accuracy | Mean Similarity | Increase in Testable Models (SB) | Increase in Testable Models (NBB) |
|---|---|---|---|---|
| 20 Random | 96.8% | 97.5% | 9.9× | 1.6 × 10^4 |
| 20 Highest Scoring | 96.9% | 97.6% | 12.0× | 3.4 × 10^4 |
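For completeness, here is a small sketch of how the pairwise similarities and an upper bound on the covering number $N_\eta(F)$ can be computed from a matrix of 0-1 losses (e.g. for the 1,235 CIFAR-10 checkpoints); the greedy procedure below is one simple way to obtain such an upper bound and is not necessarily the procedure used to produce Figure 3.

```python
import numpy as np

def pairwise_similarity(Q):
    """Q: (num_models, n) array of 0-1 losses. Returns the matrix of empirical
    agreement probabilities P(q_i(z) = q_j(z))."""
    n = Q.shape[1]
    agree = Q @ Q.T + (1 - Q) @ (1 - Q).T   # joint-error counts + joint-success counts
    return agree / n

def greedy_cover_size(Q, eta):
    """Greedy upper bound on the eta-similarity covering number N_eta.

    A query counts as covered once the cover contains an eta-similar member with
    error rate at most its own and one with error rate at least its own
    (Definition 1); every query trivially plays both roles for itself."""
    sims = pairwise_similarity(Q)
    errs = Q.mean(axis=1)
    cover = []
    for i in np.argsort(errs):   # scan checkpoints from low to high error
        below = any(sims[i, j] >= eta and errs[j] <= errs[i] for j in cover)
        above = any(sims[i, j] >= eta and errs[j] >= errs[i] for j in cover)
        if not (below and above):
            cover.append(i)
    return len(cover)
```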
6 Conclusions and future work

We have shown that contemporary image classification models are highly similar, and that this similarity increases the longevity of the test set both in theory and in experiment. It is worth noting that model similarity does not preclude progress on the test set: two models that are 85% similar can differ by as much as 15% in accuracy (for context, the top-5 accuracy improvement from the seminal AlexNet to the current state of the art on ImageNet is about 17%). In addition, it is well known that higher model accuracy implies a larger number of test set reuses without overfitting. So as the machine learning practitioner explores increasingly better performing models that also become more similar, it can actually become harder to overfit.

There are multiple important avenues for future work. First, one natural question is why the classification models turn out to be so similar. In addition, it would be insightful to understand whether the similarity phenomenon is specific to image classification or also arises in other classification tasks. There may also be further structural dependencies between models that mitigate the amount of overfitting. Finally, it would be ideal to have a statistical procedure that leverages such model structure to provide reliable and accurate performance bounds for test set re-use.

Acknowledgements. We thank Vitaly Feldman for helpful discussions. This work is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, a Siemens Futuremakers Fellowship, an Amazon AWS AI Research Award, a gift from Microsoft Research, and the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814.

References

[1] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman. Algorithmic stability for adaptive data analysis. In Symposium on Theory of Computing (STOC), 2016. https://arxiv.org/abs/1511.02513.
[2] A. Blum and M. Hardt. The Ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning (ICML), 2015. https://arxiv.org/abs/1502.04585.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009. http://www.image-net.org/papers/imagenet_cvpr09.pdf.
[4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996. http://www.szit.bme.hu/~gyorfi/pbook.pdf.
[5] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In Symposium on Theory of Computing (STOC), 2015. https://arxiv.org/abs/1411.2664.
[6] V. Feldman, R. Frostig, and M. Hardt. The advantages of multiple classes for reducing overfitting from test set reuse. In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1905.10360.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016. https://arxiv.org/abs/1512.03385.
[8] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017. https://arxiv.org/abs/1608.06993.
[9] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.
[11] J. Langford. Quantitatively Tight Sample Complexity Bounds. PhD thesis, Carnegie Mellon University, 2002. http://hunch.net/~jl/projects/prediction_bounds/thesis/thesis.pdf.
[12] L. Li and K. Jamieson. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:1-52, 2018.
[13] L. Li and A. Talwalkar. Random search and reproducibility for neural architecture search. In Conference on Uncertainty in Artificial Intelligence (UAI), 2019. https://arxiv.org/abs/1902.07638.
[14] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019. https://arxiv.org/abs/1806.09055.
[15] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1902.10811.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F.-F. Li. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015. https://arxiv.org/abs/1409.0575.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 2014. https://arxiv.org/abs/1409.1556.
[18] A. Smith. Information, privacy and stability in adaptive data analysis, 2017. https://arxiv.org/abs/1706.00820.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015. https://arxiv.org/abs/1409.4842v1.
[20] C. Yadav and L. Bottou. Cold Case: The Lost MNIST Digits. 2019. https://arxiv.org/abs/1905.10498.
[21] T. Zrnic and M. Hardt. Natural analysts in adaptive data analysis. In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1901.11143.