# A Theory of Dynamic Benchmarks

Published as a conference paper at ICLR 2023

Ali Shirali (University of California, Berkeley), Rediet Abebe (Harvard Society of Fellows), Moritz Hardt (Max Planck Institute for Intelligent Systems, Tübingen)

**Abstract.** Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.

## 1 INTRODUCTION

In response to concerns around the limitations of static datasets as benchmarks, researchers have proposed dynamic benchmarking, a setting where data collection and model building happen iteratively in tandem, as an alternative (Nie et al., 2020; Potts et al., 2021; Kiela et al., 2021; Ma et al., 2021; Gehrmann et al., 2021). In dynamic benchmarking, model builders fit models against the current dataset, while annotators contribute new data points selected to challenge previously built models. The hope is that this iterative process results in a more diverse set of test cases that help induce better model performance. Though proponents argue that dynamic adversarial data collection, where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets, there is also a recognition that the long-term benefits or drawbacks of adopting it as a core dataset creation paradigm remain poorly understood (Wallace et al., 2022). A major concern is that adversarial data collection proliferates idiosyncratic examples that do well in fooling models but eliminate coverage of necessary yet easier test cases. This can, in turn, reduce dataset diversity and limit external validity (Bowman & Dahl, 2021).

A growing line of theoretical and empirical research on static benchmarks has improved our understanding of the strengths and limitations of that setting. In contrast, similar research on dynamic benchmarks has been limited. The high complexity and cost of studying live benchmarks impede experimental work. A stronger theoretical foundation for dynamic benchmarks could help guide empirical exploration of the vast design space, avoiding costly trial-and-error experiments.
However, there has been no apparent theory of dynamic benchmarking that could offer provable guarantees about its performance and clarify how issues such as label noise interact with this setting.

### 1.1 OUR CONTRIBUTIONS

In this work, we initiate a theoretical study of dynamic benchmarks. We contribute a versatile formal model of dynamic benchmarks that serves as the basis for our investigation. We start with a fundamental question:

Question 1: Can we design dynamic benchmarks in such a way that models continue to improve as the number of rounds of data collection grows?

We start from a theoretical model capturing existing implementations of dynamic benchmarking. This model proceeds in multiple rounds, interweaving data collection and model building sequentially. In round $t$, model builders face a distribution $D_t$ and are tasked with finding a classifier $h_t$ that performs well on $D_t$. We assume that model fitting succeeds in minimizing risk up to a positive classification error $\epsilon > 0$. We further assume that annotators succeed in identifying the failure cases of the current model $h_t$, giving us access to the error distribution $\check{D}_t$ supported on the error cases of the model $h_t$. We determine the new distribution $D_{t+1}$ by mixing $D_t$ and $\check{D}_t$ in some proportion.

We assume a starting distribution $D_0$ on the instances of interest. We can think of $D_0$ as the distribution corresponding to standard data collection. Mirroring the motivation for dynamic benchmarking, this distribution might assign little to no weight to important families of instances. In particular, an error set of measure $\epsilon$, which we assume we can achieve from the get-go, might contain many relevant instances. The goal is therefore to converge to well below the $\epsilon$-error level guaranteed by the above assumption. We assume the distribution admits a perfect classifier so that the process could, in principle, converge to zero error.

In this setting, we show that three rounds are guaranteed to converge to $O(\epsilon^2)$ error. Unfortunately, this is where it ends. In general, there is no reason to expect this dynamic benchmark to progress below $\Omega(\epsilon^2)$ error. Put differently, there is no provable benefit to dynamic data collection beyond three rounds. The cause of our negative result is a form of catastrophic forgetting that mirrors the concerns quoted earlier. As the benchmark moves beyond three rounds, there is provably no way to retain knowledge of instances correctly classified at earlier stages. Furthermore, we show through experiments that this lower bound may also be encountered in practice, preventing dynamic benchmarks from progressing beyond a small number of rounds. In doing so, we propose a concrete way to simulate the performance of a dynamic benchmark that may be of independent interest in the empirical study of dynamic benchmarks.

There is yet another impediment to successful dynamic benchmarking. Above, we considered the case where the underlying learning problem is realizable, meaning that there exists a model that achieves zero error on the distribution. In practice, unrealizable settings where we have label noise are commonplace. Unrealizability can result from, for instance, annotator disagreement, for which there is an emerging line of work aiming to understand its impact on data diversity, label noise, and model performance. We show that in this unrealizable setting, dynamic benchmarks concentrate on mislabeled instances, losing their representativeness of the underlying distribution.

Though pessimistic, the above negative results may be inherent to the simple sequential design of dynamic benchmarks currently used in practice. To further probe this issue, we ask:

Question 2: Are there more sophisticated dynamic benchmark designs that can guarantee convergence below the error barrier of the standard setting?

We answer this question in the affirmative by considering a hierarchical model, which recursively uses the above sequential setting as a building block. In this setting, the organizer of a benchmark creates multiple instances of dynamic data collection and combines the outcomes in a particular way, e.g., by ensembling the resulting models and feeding the output into a new instance. We study the setting where the hierarchy has depth two and show that it guarantees convergence to error $O(\epsilon^3)$, providing a strict separation with the standard model. Despite the improved performance, this depth-two setting significantly complicates the benchmarking process, and executing a design with further depth may be prohibitive in practice.

In search of alternative designs that can consistently improve the model's performance, we study a complementary design to dynamic benchmarks in Section A of the Appendix. Instead of accumulating adversarial examples in a dynamic benchmark, here the model-in-the-loop carries the information by directly using new examples and becoming more complex throughout the process. We also make a natural connection to boosting methods. Despite achieving zero risk theoretically, this alternative has limited applicability due to either slow convergence or computational infeasibility.

In sum, our results indicate that the current bottlenecks observed in empirical settings under the sequential model are inherent to the setup. This can alert practitioners to the limitations of current practice before many rounds of data are collected. Further, more complex designs such as the hierarchical setting can result in improved performance but may suffer from the organizational complexity of data collection. Combined, these results highlight stark tradeoffs in switching from static to dynamic settings and suggest that exploration of the design space for modeling dynamic benchmarks can play an important role.

### 1.2 RELATED WORKS

For an introduction to datasets as benchmarks, see Chapter 8 in Hardt & Recht (2022). Concerns around static benchmarks are summarized in recent works, including adaptivity (Dwork et al., 2015), violation of sample independence in sequentially generated data (Shirali, 2022), and issues of annotator disagreement (Pavlick & Kwiatkowski, 2019; Prabhakaran et al., 2021; Davani et al., 2022).

Numerous new benchmarks and benchmarking systems have recently been proposed that integrate some aspects of dynamic data collection. Adversarial data collection continually adds challenging examples found by annotators for an existing model (Dinan et al., 2019; Nie et al., 2020; Kiela et al., 2021; Potts et al., 2021; Wallace et al., 2022). Empirical studies show this does not necessarily lead to better performance or robustness (Kaushik et al., 2021; Wallace et al., 2022). A dynamic leaderboard periodically renews the test set (Zellers et al., 2021). In response to the fast growth of dynamic benchmarking, various tools and platforms have also been developed.
Examples include platforms for adversarial data collection (Kiela et al., 2021), assistance for annotators in finding challenging examples (Bartolo et al., 2021), personalized benchmarking (Narayan et al., 2021), and automatic crowdsourcing of leaderboard submissions (Khashabi et al., 2021). Adversarial filtering, which filters out examples from a static dataset that are identified to be easy for a given model, is another related technique (Paperno et al., 2016; Zellers et al., 2018; Le Bras et al., 2020). Such datasets are susceptible to bias (Phang et al., 2022) and may saturate faster than static datasets (Taori et al., 2020). Le Bras et al. (2020) include theoretical considerations on eliminating bias. The flurry of newly minted benchmarks stands in stark contrast with the scarcity of theory on the topic. Our work was inspired by the thought-provoking discussion of dynamic benchmarks by Bowman & Dahl (2021).

## 2 PROBLEM FORMULATION

Our primary goal in this work is to understand the population-level dynamics that a benchmark design induces. We are centrally interested in capturing what the iterative process of model building and data collection converges to. Consequently, we ignore finite-sample issues in our formulation and focus on distributions rather than samples. We assume that model builders successfully minimize risk approximately. While risk minimization may be computationally hard in the worst case, this assumption reflects the empirical reality that machine learning practitioners seem to be able to make consistent progress on fixed benchmarks. While we focus on population-level dynamics in the design of benchmarks, we restrict ourselves to operations with standard finite-sample counterparts.

A dynamic benchmark design can be represented as a directed acyclic graph where the nodes and edges correspond to classifiers, distributions, and the operations defined below:

1. Model building: Given a distribution, find an approximate risk minimizer.
2. Data collection: Given a model, find a new distribution.
3. Model combination: Combine a set of models into a single model.
4. Data combination: Combine a set of distributions into a single distribution.

[Figure 1: Example of a path dynamic benchmark. Symbol A represents risk minimization, H represents human data collection, and the mixing nodes show distribution mixing.]

We describe these operations in turn. First, we explain model building by defining the notion of risk and risk minimization. The risk of a classifier $h: X \to Y$ on a distribution $P$ supported on the data universe $X \times Y$, with respect to the zero-one loss, is defined as
$$R_P(h) = \mathbb{E}_{(x,y) \sim P}\big[\mathbf{1}\{h(x) \neq y\}\big].$$
An $\epsilon$-approximate risk minimizer $A$ is an algorithm that takes a distribution $P$ as input and returns a classifier $h: X \to Y$ such that
$$R_P(h) \le \min_{h' \in \mathcal{H}} R_P(h') + \epsilon,$$
where $\mathcal{H}$ is a family of classifiers. In the benchmark setting, risk minimization is a collective effort of numerous participants. The algorithm $A$ represents these collective model-building efforts.

We abstract data collection as an operation that takes a classifier $h$ and, for an underlying distribution $D$, returns a new distribution $\check{D}$. In the context of adversarial data collection, we assume that annotators can find the conditional distribution over instances on which $h$ errs, $\check{D} = D|_{h(x) \neq y}$. This is an idealized representation of data collection that ignores numerous real-world issues. Nonetheless, this idealized assumption will make our negative results even stronger.
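
To make these two primitives concrete, the sketch below computes the zero-one risk and the error-conditional distribution over a finite domain, representing a distribution as a probability vector. This is an illustration only; the function names and the NumPy representation are our own choices, not part of the paper.

```python
import numpy as np

def zero_one_risk(h, X, y, p):
    """Zero-one risk R_P(h): expected indicator of error under a finite distribution.

    h: vectorized classifier, h(X) returns predicted labels for all instances
    X, y: arrays of instances and labels; p: probability of each (x, y) pair
    """
    errors = (h(X) != y)
    return float(np.sum(p * errors))

def condition_on_errors(h, X, y, p):
    """The distribution D | h(x) != y, i.e., D conditioned on the error set of h."""
    errors = (h(X) != y)
    mass = np.sum(p * errors)
    if mass == 0:
        raise ValueError("h makes no errors under this distribution")
    return np.where(errors, p, 0.0) / mass
```
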

Model combination happens via a weighted majority vote among multiple classifiers. Distribution combination takes a mixture of multiple given distributions according to some proportions.

The final piece in our problem formulation is the ultimate success criterion for a benchmark design. One natural goal of a dynamic benchmark is to output a classifier that minimizes risk on a fixed underlying distribution $D$. We envision that this distribution represents all instances of interest. There is a subtle aspect to this choice of success criterion. If we already assume we have an $\epsilon$-approximate risk minimizer on $D$, why are we not done from the get-go? The reason is that the $\epsilon$-error term might cover instances crucially important for the success of a classifier in real tasks. After all, a 95% accurate classifier can still perform poorly in practice if real-world instances concentrate on a set of small measure in $D$. The goal is therefore to find classifiers achieving risk significantly below the error level guaranteed by assumption. By asking for successively higher accuracy, we can ensure that the benchmark continues to elicit better-performing models as time goes on.

Notation. We define the error set of a classifier $h$ as $E_h = \{(x, y) \in X \times Y \mid h(x) \neq y\}$. Note that $R_P(h) = \Pr_P(E_h)$. We drop $P$ from $\Pr_P(\cdot)$ when it is clear from the context. We say a classification problem is realizable on distribution $P$ if there is a classifier $f \in \mathcal{H}$ such that $R_P(f) = 0$. Here, $\mathcal{H}$ is the hypothesis class and $f$ is called the true classifier. For realizable problems, no uncertainty is left over $Y$ once $x \in X$ is drawn, so we use $P$ to refer to the distribution over $X$ and define error sets as subsets of $X$. Let $p_P$ be the probability density function associated with $P$. We say $P$ is conditioned on $E \subseteq X$ and denote it by $P|_E$ if for any $x \in X$ we have $p_{P|_E}(x) = p_P(x \mid E)$. For notational convenience, we sometimes use $P(x)$ in place of $p_P(x)$. The support $\mathrm{supp}(P)$ of a distribution $P$ is the largest subset of $X$ such that $P(x) > 0$ for all $x \in \mathrm{supp}(P)$. Given probability distributions $P_1, P_2, \ldots, P_T$, we denote the mixture distribution with weights $w_t \ge 0$ such that $\sum_t w_t = 1$ by $\mathrm{mix}(P_1, P_2, \ldots, P_T)$, where $p_{\mathrm{mix}}(x) = \sum_{t=1}^{T} w_t\, p_{P_t}(x)$. For a set of classifiers $h_1, h_2, \ldots, h_T$, we denote the weighted majority vote classifier by $\mathrm{maj}(h_1, h_2, \ldots, h_T)$.

## 3 PATH DYNAMIC BENCHMARKS

The simplest case of a dynamic benchmark corresponds to the design of a directed path interleaving model building and data collection, as illustrated in Figure 1. This is the design most similar to current proposals of dynamic benchmarks and adversarial data collection (Nie et al., 2020; Kiela et al., 2021). Starting from an initial distribution, at each round a new classifier is obtained from the latest distribution. The annotators are then asked to find the vulnerabilities of this model. This new insight is leveraged to update the latest distribution. We call this procedure path dynamic benchmarking.

Path dynamic benchmarking: For an underlying distribution $D$ with true classifier $f$, given an initial distribution $D_0$ and an approximate risk minimizer $A$, at each round $t$:

1. $h_t = A(D_t)$
2. $\check{D}_t = D|_{h_t(x) \neq f(x)}$
3. $D_{t+1} = \mathrm{mix}(D_0, \check{D}_0, \check{D}_1, \ldots, \check{D}_t)$
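
To illustrate this loop on a finite domain, here is a minimal sketch in which the risk minimizer $A$ and the idealized annotators are passed in as callables and distributions are probability vectors. The function and argument names are ours, not the authors', and uniform mixture weights are assumed.

```python
import numpy as np

def path_dynamic_benchmark(D0, fit, errors_of, rounds):
    """One run of path dynamic benchmarking over a finite domain.

    D0:        initial distribution D_0, a probability vector over the domain
    fit:       approximate risk minimizer A, maps a distribution to a classifier
    errors_of: idealized annotators, map a classifier h_t to the error
               distribution D | h_t(x) != f(x), also a probability vector
    Returns the list of classifiers h_0, ..., h_{rounds-1}.
    """
    components = [D0]                 # D_0 followed by the error distributions
    classifiers = []
    D_t = D0
    for _ in range(rounds):
        h_t = fit(D_t)                          # step 1: h_t = A(D_t)
        classifiers.append(h_t)
        components.append(errors_of(h_t))       # step 2: error distribution of h_t
        D_t = np.mean(components, axis=0)       # step 3: uniform mixture
        D_t = D_t / D_t.sum()                   # guard against rounding drift
    return classifiers
```
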

We first formalize the rationale behind path dynamic benchmarks. Ideally, given a perfect, i.e., 0-approximate, risk minimizer, every time the current classifier misclassifies some part of the underlying distribution, annotators reveal that part and the updated classifier avoids repeating those mistakes. Since errors are not repeated across the sequence, there can only be a limited number of very bad classifiers. The following simple lemma formalizes this intuition for a target error level $\alpha > 0$.

Lemma 3.1. For any hypothesis class $\mathcal{H}$, true classifier $f \in \mathcal{H}$, perfect risk minimizer $A$, underlying distribution $D$, and initial distribution $D_0$ such that $\mathrm{supp}(D_0) \subseteq \mathrm{supp}(D)$, let $(h_t)_{t=0}^{T-1}$ be any sequence of classifiers obtained in a path dynamic benchmark with equally weighted $\mathrm{mix}(\cdot)$. Then, for any $\alpha > 0$, there are at most $\frac{1}{\alpha}$ classifiers of risk more than $\alpha$. In other words,
$$\big|\{t < T \mid R_D(h_t) > \alpha\}\big| \le \frac{1}{\alpha}.$$

See proof on page 15. The lemma does not guarantee the risk of the latest classifier, but it is straightforward to see that a classifier selected at random after many rounds is accurate with high probability (see Corollary C.1). A more effective way to construct an accurate classifier from the sequence of classifiers is to take their majority vote. In this case, three rounds of model building suffice to find a perfect classifier.

Proposition 3.2. Under the conditions of Lemma 3.1, let $(h_t)_{t=0}^{T-1}$ be any sequence of classifiers obtained in a path dynamic benchmark with uniform mixture weights. If $T \ge 3$,
$$R_D\big(\mathrm{maj}(h_0, h_1, \ldots, h_{T-1})\big) = 0.$$

Proof. From the proof of Lemma 3.1 we know $\Pr_D\big(E_t \cap \mathrm{supp}(D_t)\big) = 0$. So, $\Pr_D(E_t \cap E_{t'}) = 0$ for every $t' < t$. The majority vote of the $h_t$'s misclassifies $x$ only if half or more of the $h_t$'s misclassify $x$. But no two distinct $h_t$'s make a common mistake. So, for three or more classifiers, the majority vote classifier is always correct.

So far, path dynamic benchmarking seems to be a promising choice when a perfect risk minimizer is available and the problem is realizable. The situation changes significantly when we move to approximate risk minimizers. We first study a three-round path dynamic benchmark and then show how the results generalize to an arbitrary number of rounds. Our results apply even when the initial distribution $D_0$ is not identical to the target distribution $D$. To measure the distance between distributions with respect to a hypothesis class, we use the following notion.

Definition 3.3 (See Ben-David et al. (2010)). For a hypothesis class $\mathcal{H}$ and distributions $P_1$ and $P_2$, the $\mathcal{H}\Delta\mathcal{H}$-distance between $P_1$ and $P_2$ is defined as
$$d_{\mathcal{H}\Delta\mathcal{H}}(P_1, P_2) = \sup_{h, h' \in \mathcal{H}} \Big| \mathbb{E}_{x \sim P_1}\big[\mathbf{1}\{h(x) \neq h'(x)\}\big] - \mathbb{E}_{x \sim P_2}\big[\mathbf{1}\{h(x) \neq h'(x)\}\big] \Big|. \quad (1)$$

The next theorem discusses how path dynamic benchmarking with three rounds performs in the case of an $\epsilon$-approximate risk minimizer.

Theorem 3.4. For any hypothesis class $\mathcal{H}$, true classifier $f \in \mathcal{H}$, underlying distribution $D$, initial distribution $D_0$ with $\mathrm{supp}(D_0) \subseteq \mathrm{supp}(D)$, and any $\epsilon$-approximate risk minimizer $A$, let $h_0$, $h_1$, and $h_2$ be the three classifiers obtained after three model-building rounds in a path dynamic benchmark with uniform mixture weights. Then, the risk of the majority vote classifier is bounded by
$$R_D\big(\mathrm{maj}(h_0, h_1, h_2)\big) \le O\big(\epsilon^2 + \epsilon\, d_{\mathcal{H}\Delta\mathcal{H}}(D_0, D)\big). \quad (2)$$
Note that for sufficiently similar $D_0$ and $D$, i.e., $d_{\mathcal{H}\Delta\mathcal{H}}(D_0, D) = O(\epsilon)$, the risk is bounded by $O(\epsilon^2)$.

See proof on page 15. The obtained $O(\epsilon^2)$ error with only three rounds of model building is a significant improvement over the $O(\epsilon)$ error that could be achieved with static benchmarks and an $\epsilon$-approximate risk minimizer.
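
The argument behind Proposition 3.2 can be checked numerically on a toy finite domain: if three classifiers have pairwise-disjoint error sets, no point is misclassified by two of them, so their majority vote makes no errors. The error sets below are hypothetical and chosen only for illustration.

```python
import numpy as np

n = 12                                    # toy finite domain of 12 points
f = np.ones(n, dtype=int)                 # true labels, all +1 for simplicity

# Three classifiers whose error sets are pairwise disjoint, as guaranteed
# by Lemma 3.1 when a perfect risk minimizer is available.
error_sets = [{0, 1}, {2, 3}, {4, 5}]
classifiers = []
for E in error_sets:
    h = f.copy()
    h[list(E)] *= -1                      # flip labels on the error set
    classifiers.append(h)

majority = np.sign(np.sum(classifiers, axis=0))   # uniform majority vote of +/-1 votes
print((majority != f).mean())                     # 0.0: the majority vote is always correct
```
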

We now consider what happens if we continue dynamic benchmarking for many rounds.

Theorem 3.5. For any $\epsilon$-approximate risk minimizer $A$ with $\frac{1}{\epsilon} \in \mathbb{N}$, hypothesis class $\mathcal{H}$ with $\mathrm{VCdim}(\mathcal{H}) \ge \frac{8}{\epsilon^2}$, and any path dynamic benchmark with $L \ge 3$ rounds of model building and arbitrary mixture weights, there exists an underlying distribution $D$ such that for any true classifier $f \in \mathcal{H}$ and initial distribution $D_0$ with $\mathrm{supp}(D_0) \subseteq \mathrm{supp}(D)$, there exists a sequence $(h_t)_{t=0}^{L-1}$ of classifiers consistent with the path dynamic benchmark where the risk of their weighted majority vote is lower-bounded by
$$R_D\big(\mathrm{maj}(h_0, h_1, \ldots, h_{L-1})\big) \ge \epsilon^2$$
for any weighting of $\mathrm{maj}(\cdot)$. Further, Theorem C.2 shows that for any path dynamic benchmark there exists $\mathcal{H}$ with constant VC dimension such that a similar lower bound holds.

See proof on page 16. Theorem 3.5 shows that $\Omega(\epsilon^2)$ error serves as a lower bound in the approximate risk minimizer setting for any path design (any mixture weighting and weighted majority). Theorem 3.4 then shows that three rounds of model building with uniform weights achieve this lower bound, so three rounds are already optimal, and continuing dynamic benchmarking for more rounds might not be helpful.

### 3.1 CHALLENGES IN NON-REALIZABLE SETTINGS

For many reasons, our problem may not be realizable. For example, a class of functions that is not complex enough to explain the true model constitutes an unrealizable setting. Even for a sufficiently complex class, annotators might fail to label instances correctly. In a simplified model for unrealizable problems, we assign random labels to a small part of the distribution and let the rest of the distribution be realizable. Formally, let $X_\delta$ be the randomly labeled subset of $X$ and $\bar{X}_\delta = X \setminus X_\delta$ be the rest of the domain, labeled by $f$. Since no classifier can do well on $X_\delta$, dynamic benchmarks are prone to overrepresent $X_\delta$. This intuition is formalized in Theorem 3.6, where we show that a significant portion of $D_t$ will be concentrated on $X_\delta$.

Theorem 3.6. For any hypothesis class $\mathcal{H}$, true classifier $f$, $\epsilon$-approximate risk minimizer $A$, and any underlying distribution $D$ such that a $\delta$-proportion of $D$ is labeled randomly and the rest is labeled by $f$, if $\delta > \epsilon$, then as long as $t = O(\frac{\delta}{\epsilon})$, at least an $\Omega(1)$-proportion of the distribution $D_t$ obtained through a path dynamic benchmark will be concentrated around the unrealizable instances, i.e., $\Pr_{D_t}(x \in X_\delta) = \Omega(1)$.

See proof on page 18. As a direct consequence, classifiers trained on a path dynamic benchmark lose their sensitivity to realizable instances and might show unexpectedly bad performance on the rest of the distribution.

## 4 HIERARCHICAL DYNAMIC BENCHMARKS

Thus far, we have observed that given an $\epsilon$-approximate risk minimizer, the smallest achievable risk through path dynamic benchmarks is $\Omega(\epsilon^2)$. This observation evokes the idea of possibly achieving a squared risk by adding a new layer to the benchmarking routine, which calls path dynamic benchmarking as a subroutine. Figure 2 shows such a structure with three steps. At each step, path dynamic benchmarks are obtained starting from a mixture of the initial distribution and the error distributions of all previous steps. In the second layer, the models found in different steps are aggregated.

[Figure 2: Depth-2 width-3 hierarchical dynamic benchmark. Symbols as in Figure 1, with M representing majority vote.]

We call the dynamic benchmark of Figure 2 a depth-2 hierarchical dynamic benchmark.

In an extended form, a depth-$k$ hierarchical dynamic benchmark can be designed by adding a new layer on top of the models obtained from some depth-$(k-1)$ benchmarks.

Hierarchical dynamic benchmarking: For an underlying distribution $D$ and true classifier $f$, given an initial distribution $D_0$ and an approximate risk minimizer $A$, depth-$k$ width-$w$ hierarchical dynamic benchmarks are constructed recursively:

def $A^{(k)}(D_0)$:
1. $h_0 = A^{(k-1)}(D_0)$
2. For $t = 1, \ldots, w-1$:
   (a) $\check{D}_{t-1} = D|_{h_{t-1}(x) \neq f(x)}$
   (b) $D_t = \mathrm{mix}(D_0, \check{D}_0, \check{D}_1, \ldots, \check{D}_{t-1})$
   (c) $h_t = A^{(k-1)}(D_t)$
3. return $\mathrm{maj}(h_0, h_1, \ldots, h_{w-1})$

where $A^{(0)} = A$.

Note that this setting does not mean that different steps of path dynamic benchmarking can be done in isolation. Running two benchmarking tasks independently does not yield benefits, since the humans in the loop might return similar vulnerabilities. In the depth-2 hierarchy of Figure 2, the bold-faced arrows correspond to the sequence of steps that should be taken. So, any reasonable dynamic benchmark based on our model is a sequential process, and the name hierarchy is meant to carry the intuition of how the aggregation of classifiers happens. Also note that $A^{(k)}$ calls annotators for $w^k$ rounds. Since no known empirical dynamic benchmarking study has ever had more than 20 rounds (as studied in Wallace et al. (2022)), we limit our analysis to a depth-2 width-3 structure (Figure 2). Here $w \ge 3$ is the necessary and sufficient number of rounds for path dynamic benchmarks to ensure $O(\epsilon^2)$ risk when $A$ is an $\epsilon$-approximate risk minimizer and $D_0$ is sufficiently similar to $D$. The next theorem provides an upper bound on the risk of any classifier obtained through the hierarchical dynamic benchmark of Figure 2.

Theorem 4.1. For any hypothesis class $\mathcal{H}$, true classifier $f \in \mathcal{H}$, underlying distribution $D$, initial distribution $D_0$ with $\mathrm{supp}(D_0) \subseteq \mathrm{supp}(D)$, and any $\epsilon$-approximate risk minimizer $A$, the risk of any classifier obtained from a hierarchical dynamic benchmark with depth 2 and width 3 (Figure 2), where $\mathrm{mix}(\cdot)$ and $\mathrm{maj}(\cdot)$ uniformly weight their inputs, is bounded by
$$R_D\big(\mathrm{maj}(g_0, g_1, g_2)\big) \le O\big(\epsilon^3 + \epsilon^2\, d_{\mathcal{H}\Delta\mathcal{H}}(D_0, D)\big). \quad (4)$$
Note that for sufficiently similar $D_0$ and $D$, i.e., $d_{\mathcal{H}\Delta\mathcal{H}}(D_0, D) = O(\epsilon)$, the risk is bounded by $O(\epsilon^3)$.

See proof on page 19. The hope in adding a new layer to dynamic benchmarking was to obtain the square of the previous layer's risk. So, ideally, a depth-2 hierarchical dynamic benchmark could achieve $O(\epsilon^4)$ error. But this is not the case; next, we show that the upper bound of Theorem 4.1 is tight up to a constant, and the conjecture of squared risk per layer does not hold.

Theorem 4.2. For any $\epsilon$-approximate risk minimizer $A$ with $\frac{1}{\epsilon} \in \mathbb{N}$, hypothesis class $\mathcal{H}$ with $\mathrm{VCdim}(\mathcal{H}) \ge \frac{2}{\epsilon^3}$, and any hierarchical dynamic benchmark with depth 2 and width 3 (Figure 2) with arbitrary mixture and majority weights, there exists an underlying distribution $D$ such that for any true classifier $f \in \mathcal{H}$ and initial distribution $D_0$ with $\mathrm{supp}(D_0) \subseteq \mathrm{supp}(D)$, there exist classifiers consistent with the hierarchical dynamic benchmark for which
$$R_D\big(\mathrm{maj}(g_0, g_1, g_2)\big) \ge \epsilon^3.$$
Further, Theorem C.3 shows that for any hierarchical dynamic benchmark there exists $\mathcal{H}$ with constant VC dimension such that a similar lower bound holds.

See proof on page 20. Combined, these results show that design structures that are more intricate than current practice, as captured by the path dynamic benchmark, can yield improved results but also have strong limitations and are challenging to implement in practice.
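
To make the recursive construction $A^{(k)}$ above concrete, the following sketch reuses the abstractions from the earlier path-benchmark sketch: a risk minimizer and an annotator oracle passed in as callables. Uniform mixture and majority weights are assumed, and all names are ours.

```python
import numpy as np

def hierarchical_benchmark(D0, fit, errors_of, depth, width):
    """Depth-k width-w hierarchical dynamic benchmark A^(k), sketched recursively.

    fit:       the base epsilon-approximate risk minimizer A (this is A^(0))
    errors_of: idealized annotators returning D | h(x) != f(x) for a classifier h
    Classifiers are callables returning +/-1; width is assumed odd to avoid ties.
    """
    def majority(classifiers):
        return lambda x: np.sign(sum(h(x) for h in classifiers))

    def A_k(D_init, k):
        if k == 0:
            return fit(D_init)                               # A^(0) = A
        components = [D_init]
        classifiers = [A_k(D_init, k - 1)]                   # step 1
        for _ in range(1, width):                            # step 2
            components.append(errors_of(classifiers[-1]))    # (a) error distribution
            D_t = np.mean(components, axis=0)                # (b) uniform mixture
            classifiers.append(A_k(D_t, k - 1))              # (c) recurse one level down
        return majority(classifiers)                         # step 3

    return A_k(D0, depth)
```
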

## 5 EXPERIMENTS

Theorem 3.5 provides an $\Omega(\epsilon^2)$ lower bound on the risk achievable with an $\epsilon$-approximate risk minimizer. The proof of this theorem is constructive, introducing a bad sequence of distributions and classifiers, consistent with the path design, with $\Theta(\epsilon^2)$ error. But the theorem does not rule out the existence of a good sequence achieving arbitrarily small risk. An important question is how frequently bad sequences appear in practice, keeping the error above zero even after many rounds. We study this question by simulating path dynamic benchmarks on two popular static benchmarks, CIFAR-10 (Krizhevsky et al., 2009) and SNLI (Bowman et al., 2015). The details of the data and models are reported in Section B.1 of the Appendix. Our aim in these experiments is not to obtain a state-of-the-art model. Instead, we want to study the effectiveness of path dynamic benchmarks in a controlled experimental setting with light models. Our simulation design is similar for both datasets:

1. Train a base classifier on the whole dataset and fix it for the next steps.
2. Construct a new dataset from the samples correctly classified by the base model and define the true and initial distributions as a uniform (point-mass) distribution over these samples. Note that in this case the empirical risk on samples weighted according to a point-mass distribution is equivalent to the risk on that distribution. Since the base model correctly labels all selected samples, the problem is also realizable.
3. Draw multiple rollouts of path dynamic benchmarks. A rollout from a path dynamic benchmark is a sequence of distributions and models obtained by alternately training a new classifier on the weighted extracted dataset and up-weighting (down-weighting) the distribution over misclassified (correctly classified) samples according to a mixture rule with uniform weights. Note that the base model is fixed, and the randomness across rollouts comes solely from different initializations of the new classifiers and possible randomness in the optimization methods.

There are two deviations from our theoretical study in this design. First, we studied binary classification, but both CIFAR-10 and SNLI define multi-class problems. We believe our main arguments hold for multi-class tasks as well. Second, although the problem is realizable, the solution is unlikely to have zero risk, as training a new classifier at each round of a rollout typically involves non-convex optimization. This is addressed in our theoretical framework as $\epsilon$-approximate risk minimization. Our analysis requires a fixed $\epsilon$ across all rounds, which approximately holds in practice.

Figure 3a shows results from 100 rollouts of path dynamic benchmarks simulated on CIFAR-10. At each round, the value on the vertical axis shows the risk of the majority vote of all the classifiers obtained in a rollout so far. The solid line is the average of the majority vote's risk over all rollouts, and the shaded area shows the standard deviation. There are a few observations. First, although zero risk is attainable, path dynamic benchmarks fail to reach it. Second, the variance across rollouts remains quite consistent at each round. This shows that a good rollout stays good, and likewise a bad rollout stays bad, so continuing dynamic benchmarking does not help.

[Figure 3: Path dynamic benchmarks simulated on (a) CIFAR-10 and (b) SNLI. The base model is kept fixed for all rollouts. The y-axis shows the risk of the majority vote of all classifiers obtained until that round, averaged over all rollouts, together with the best and worst rollout.]
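
A minimal sketch of one rollout of the simulation described above follows. It assumes a finite dataset of base-model-correct samples with integer class labels and a generic `train(X, y, sample_weight)` routine standing in for the round-wise model fitting; the names and the plurality-vote tie-breaking are our own simplifications, not the authors' implementation.

```python
import numpy as np

def rollout(X, y, train, rounds):
    """One rollout of a simulated path dynamic benchmark on a fixed dataset.

    X, y:  samples correctly classified by the base model (the point-mass D_0)
    train: fits a model given data and per-sample weights (the risk minimizer A)
    Returns the per-round risk of the majority vote of all models so far.
    """
    n = len(y)
    components = [np.full(n, 1.0 / n)]       # D_0: uniform point-mass distribution
    weights = components[0]
    models, risks = [], []
    for _ in range(rounds):
        models.append(train(X, y, sample_weight=weights))        # h_t = A(D_t)
        preds = np.array([m.predict(X) for m in models], dtype=int)
        # error distribution of the newest model: uniform over its misclassified samples
        err = (preds[-1] != y).astype(float)
        components.append(err / err.sum() if err.sum() > 0 else components[0])
        weights = np.mean(components, axis=0)                     # uniform mixture D_{t+1}
        # risk of the majority (plurality) vote of h_0, ..., h_t under D_0
        maj = np.array([np.bincount(col).argmax() for col in preds.T])
        risks.append(float(np.mean(maj != y)))
    return risks
```
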

We further discuss why a rollout turns out to be a good or a bad one, and how this relates to our theoretical negative example, in Section B.2 of the Appendix.

Figure 3b shows the results from 50 rollouts of path dynamic benchmarks simulated on SNLI. A similar observation regarding non-zero risk in the limit can be made here. Compared to the image classification task, path dynamic benchmarks in the NLI task show more fluctuation across rounds, which might be due to the harder nature of the problem or the more complex model.

Summarizing these observations, path dynamic benchmarks, no matter how long we run them, confront a lower bound even in simple realizable settings. This provides empirical evidence that our negative results are not contrived, but point to what may be inherent limitations of dynamic benchmarking.

## 6 DISCUSSION

The scientific foundation of traditional benchmarks is the holdout method, whereby we split the dataset into a training and a testing component. Although the machine learning community routinely operates well outside the statistical guarantees of the holdout method, at least there is some cognizable theoretical framework for static datasets as benchmarks. Moreover, more recent works have found new support for static benchmarks (Blum & Hardt, 2015; Recht et al., 2019). Dynamic benchmarks provide an intriguing proposal aimed at mitigating known issues with static benchmarks. Platforms for dynamic benchmarks are already up and running. However, machine learning theory has little to offer in the way of guiding the principled design of valid dynamic benchmarks.

Responding to this lack of theoretical foundations, we propose a formal model of dynamic benchmarks that allows combining model building and data collection steps in a flexible manner. We focus on what we think is the first-order concern in the design of a benchmark: Does the benchmark, in principle, induce the creation of better models? While our results show a provable benefit to dynamic benchmarking, it is arguably more subtle than hoped and comes at a significant increase in complexity. Our negative results are particularly concerning given that we optimistically assume that both model builders and annotators have significant competence. In practice, dynamic benchmarks likely face additional obstacles neglected by our idealized assumptions. As a result, it is less clear to what extent our positive results have prescriptive value. On the other hand, our positive results make it clear that the design space for dynamic benchmarks is larger than currently utilized, thus pointing at a range of interesting open problems. Finally, lowering risk or improving the accuracy of the induced models is the ultimate success criterion of a benchmark; however, there are other concerns, including the robustness of the models, that can be the topic of future work based on our proposed framework.

## REFERENCES

Max Bartolo, Tristan Thrush, Sebastian Riedel, Pontus Stenetorp, Robin Jia, and Douwe Kiela. Models in the loop: Aiding crowdworkers with generative annotation assistants. CoRR, abs/2112.09062, 2021. URL https://arxiv.org/abs/2112.09062.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains.
Machine Learning, 79(1):151-175, 2010.

Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning, pp. 1006-1014. PMLR, 2015.

Samuel Bowman and George Dahl. What will it take to fix benchmarking in natural language understanding? In Proc. of NAACL, pp. 4843-4855, 2021.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632-642, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1075. URL https://aclanthology.org/D15-1075.

Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10:92-110, 2022.

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4537-4546, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1461. URL https://www.aclweb.org/anthology/D19-1461.

Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 117-126, 2015.

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. The GEM benchmark: Natural language generation, its evaluation and metrics. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pp. 96-120, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.gem-1.10. URL https://aclanthology.org/2021.gem-1.10.

Moritz Hardt and Benjamin Recht. Patterns, Predictions, and Actions: Foundations of Machine Learning. Princeton University Press, 2022.

Divyansh Kaushik, Douwe Kiela, Zachary C. Lipton, and Wen-tau Yih. On the efficacy of adversarial data collection for question answering: Results from a large-scale randomized study.
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6618-6633, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.517. URL https://aclanthology.org/2021.acl-long.517.

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, and Daniel S. Weld. Genie: A leaderboard for human-in-the-loop evaluation of text generation. arXiv preprint arXiv:2101.06561, 2021.

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4110-4124, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.324. URL https://aclanthology.org/2021.naacl-main.324.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, and Yejin Choi. Adversarial filters of dataset biases. In International Conference on Machine Learning, pp. 1078-1088. PMLR, 2020.

Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, and Douwe Kiela. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. Advances in Neural Information Processing Systems, 34, 2021.

Avanika Narayan, Piero Molino, Karan Goel, Willie Neiswanger, and Christopher Ré. Personalized benchmarking with the Ludwig benchmarking toolkit. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=hwjnu6qW7E4.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525-1534, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144.

Ellie Pavlick and Tom Kwiatkowski. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677-694, 2019.

Jason Phang, Angelica Chen, William Huang, and Samuel R. Bowman. Adversarially constructed evaluation sets are more challenging, but may not be fair. In Proceedings of the First Workshop on Dynamic Adversarial Data Collection, pp. 62-62, Seattle, WA, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.dadc-1.8. URL https://aclanthology.org/2022.dadc-1.8.

Christopher Potts, Zhengxuan Wu, Atticus Geiger, and Douwe Kiela. DynaSent: A dynamic benchmark for sentiment analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2388-2404, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.186. URL https://aclanthology.org/2021.acl-long.186.

Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, and Mark Diaz. On releasing annotator-level labels and information in datasets. In Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop, pp. 133-138, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.law-1.14. URL https://aclanthology.org/2021.law-1.14.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389-5400. PMLR, 2019.

Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.

Ali Shirali. Sequential nature of recommender systems disrupts the evaluation process. In Advances in Bias and Fairness in Information Retrieval: Third International Workshop, BIAS 2022, Stavanger, Norway, April 10, 2022, Revised Selected Papers, pp. 21-34. Springer, 2022.

Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583-18599, 2020.

Eric Wallace, Adina Williams, Robin Jia, and Douwe Kiela. Analyzing dynamic adversarial training data in the limit. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 202-217, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.18. URL https://aclanthology.org/2022.findings-acl.18.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 93-104, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1009. URL https://www.aclweb.org/anthology/D18-1009.

Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, and Yejin Choi. TuringAdvice: A generative and dynamic evaluation of language use. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4856-4880, Online, June 2021. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2021.naacl-main.386.

## A DYNAMIC GRADIENT-BASED UPDATE

The current practice of dynamic benchmarking aims at diversifying the benchmark by dynamically adding new hard examples to it. The hope is to obtain better models from the benchmark every time new samples are added. As we observed, despite the initial benefit of this method, there exist arbitrarily long dynamic benchmarks with no significant improvement after the first few steps. But if the ultimate goal is to obtain improved models, there is a more direct way to do it.
Instead of keeping all the adversarial feedback in the form of a dynamic benchmark, the model-in-the-loop itself can carry all the information by becoming more and more complex throughout the process. In other words, rather than updating the benchmark with adversarial examples, these examples can be collected in a way that directly updates the model. Dynamic adversarial feedback from annotators is helpful here, providing fresh examples at each round that prevent overfitting. This is also close to the boosting technique in machine learning. In this section, we discuss how directly updating the model using adversarial data can be formulated within our framework and why the resulting methods are not practical. Since we use gradient-based techniques to update the model, we call these methods dynamic gradient-based updating.

We first formally introduce a vector notation for functions and distributions, which makes our arguments easier to follow. We then discuss how a classifier's risk with respect to the zero-one loss is represented in this notation. In search of a classifier that minimizes this risk, we minimize surrogate convex losses rather than directly optimizing the zero-one risk. We discuss two popular choices, namely the hinge and exponential losses, and for each, present the corresponding method with its strengths and limitations.

[Figure 4: Dynamic gradient-based update: hinge loss minimization.]

Notation. Let $h: X \to \{-1, 1\}$ be a binary classifier defined on the finite set $X$. The vector representation of $h$ in $X$ is $\mathbf{h} = (h(x))_{x \in X}$. Similarly, let $P$ be a probability distribution over $X$. The vector representation of $P$ in $X$ is $\mathbf{P} = (P(x))_{x \in X}$. The entrywise product of two vectors $\mathbf{h}_1$ and $\mathbf{h}_2$ is denoted by $\mathbf{h}_1 \circ \mathbf{h}_2$.

For an underlying distribution $D$ and true classifier $f$, we can use vector representations to write the risk with respect to the zero-one loss. The risk of a binary classifier $h$ on $D$ is
$$R_D(h) = \tfrac{1}{2}\,\langle \mathbf{1} - \mathbf{h} \circ \mathbf{f}, \mathbf{D} \rangle.$$
For a general $h: X \to \mathbb{R}$, we can still define the risk with respect to the zero-one loss as $R_D(h) = \tfrac{1}{2}\langle \mathbf{1} - \mathrm{sign}(\mathbf{h} \circ \mathbf{f}), \mathbf{D} \rangle$, where $\mathrm{sign}(\cdot)$ is applied entrywise.

Upper-bounded risk. There are many ways to upper-bound $R_D(h)$. For example, for any entrywise function $\ell(\cdot)$ such that $\ell(x) \ge \frac{1 - \mathrm{sign}(x)}{2}$ for all $x \in \mathbb{R}$, the risk of $h$ with respect to the zero-one loss can be upper-bounded by
$$R_D(h) \le R^{\ell}_D(h) = \langle \ell(\mathbf{h} \circ \mathbf{f}), \mathbf{D} \rangle. \quad (6)$$

### A.1 MINIMIZING HINGE LOSS

A popular function to upper-bound the zero-one risk is the hinge loss $\ell(x) = \max(1 - x, 0)$. Plugging $\ell(\cdot)$ into Equation 6 gives
$$R_D(h) \le R^{\mathrm{hinge}}_D(h) = \langle \max(\mathbf{1} - \mathbf{h} \circ \mathbf{f}, \mathbf{0}), \mathbf{D} \rangle, \quad (7)$$
where $\max(\cdot)$ is element-wise maximization. Let $\mathbf{g} = \nabla_{\mathbf{h}} R^{\mathrm{hinge}}_D$. Looking for an update of the form $\mathbf{h} := \mathbf{h} + \Delta\mathbf{h}$ to reduce $R^{\mathrm{hinge}}_D(h)$, any direction such that $\langle \mathbf{g}, \Delta\mathbf{h} \rangle < 0$ is a descent direction, and a small step size guarantees a consistent decrease of $R^{\mathrm{hinge}}_D$. As we show in the proof of Lemma A.1, directly applying gradient descent, i.e., $\Delta\mathbf{h} = -\eta\, \mathbf{g}$, is not practical, as it involves adding a rescaled distribution vector to a classifier vector. Unlike classifiers, which are known at every point in the domain, distributions are in practice only accessible through the available samples, so this summation is not implementable. Alternatively, let $E_h = \{x \in X \mid h(x)f(x) < 1\}$ be the set of margin errors of classifier $h$. We task annotators with returning $D_h = D|_{E_h}$ given $h$. Let $\tilde{h} = A(D_h)$ be the model built on the vulnerabilities of $h$. The next lemma shows that $\tilde{\mathbf{h}}$ is a descent direction for the hinge loss.

Lemma A.1. For any hypothesis class $\mathcal{H}$, true classifier $f \in \mathcal{H}$, current classifier $h \in \mathcal{H}$, $\epsilon$-approximate risk minimizer $A$, and any underlying distribution $D$, the vector representation $\tilde{\mathbf{h}}$ of the classifier $\tilde{h} = A(D|_{h(x)f(x) < 1})$ is a descent direction for $R^{\mathrm{hinge}}_D(h)$.

See proof on page 22. This lemma lets us write the updating rule $\mathbf{h} := \mathbf{h} + \eta\, R^{\mathrm{hinge}}_D(h)\, \tilde{\mathbf{h}}$, depicted graphically in Figure 4. Since the gradient dominance condition holds for this update and the hinge loss is 1-Lipschitz, $\mathbf{h}$ converges to $\mathbf{f}$ at a rate of $O\big(\frac{|X|}{t^2}\big)$. Although this method guarantees convergence, the dependence on the domain size makes the bound useless for continuous or very large domains.
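
As a sketch of the hinge-loss update described above, the loop below alternates between asking an abstracted annotator oracle for the margin-error distribution and asking an abstracted risk minimizer for a classifier on it, then applies the rule $\mathbf{h} := \mathbf{h} + \eta\, R^{\mathrm{hinge}}_D(h)\, \tilde{\mathbf{h}}$. Distributions are probability vectors over a finite domain; the names and the fixed step-size factor are our own simplifications.

```python
import numpy as np

def hinge_risk(h_vec, f_vec, D_vec):
    """R_D^hinge(h) = < max(1 - h o f, 0), D > in the vector notation of Appendix A."""
    return float(np.dot(np.maximum(1.0 - h_vec * f_vec, 0.0), D_vec))

def hinge_gradient_update(h_vec, f_vec, D_vec, fit_on, eta=0.5, steps=50):
    """Dynamic gradient-based update with the hinge loss.

    fit_on(D): approximate risk minimizer; returns a +/-1 classifier vector
               trained on the margin-error distribution D | h(x) f(x) < 1.
    """
    for _ in range(steps):
        margin_errors = (h_vec * f_vec < 1.0)
        if not margin_errors.any():
            break                                     # no margin errors remain
        D_h = np.where(margin_errors, D_vec, 0.0)     # condition D on the margin-error set
        D_h = D_h / D_h.sum()
        h_tilde = fit_on(D_h)                         # descent direction (Lemma A.1)
        h_vec = h_vec + eta * hinge_risk(h_vec, f_vec, D_vec) * h_tilde
    return h_vec
```
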
We restrict our simulation to a 50k random subset of this data. Our model is a slightly modified model of Bowman et al. (2015) where words are represented with pre-trained 100d Glo Ve1 vectors, and two parallel Bidirectional Long Short-Term Memory layers (Bi LSTM) are used to extract sentence embeddings. The concatenated embeddings are then fed into three more hidden dense layers (each has 128 units) before going to the last dense layer with 3 units. The model has a total number of 120k trainable parameters, and the base model achieved an accuracy of 68% on the training data and 61% on the test data, consistent with the original study when the number of samples was limited. B.2 FURTHER OBSERVATIONS The observations we had in Section 5 raise another question: what makes a rollout a bad rollout? Do bad rollouts share a common characteristic? We hypothesize the more similar a rollout to the bad sequence constructed in the proof of Theorem 3.5, the worse its final risk. A unique feature of the negative example in Theorem 3.5 is that initial classifiers in the sequence cleverly choose their error sets to only overlap on a fixed part of the distribution, which will turn out to be the error set of the majority. We define a score that captures this behavior without being directly related to the accuracy of the final majority vote model. Let Em be the error set for the majority vote classifier and Eht be the error set of the round t classifier. Then for every pair of distinct rounds t1, t2 < T of a rollout, define zt1,t2 = |Eht1 Eht2 Em| |Eht1 Eht2 | . We use z T = 1 T (T 1) PT 1 t1=0 PT 1 t2=0,t2 =t1 zt1,t2 to 1http://nlp.stanford.edu/data/glove.6B.zip Published as a conference paper at ICLR 2023 0.46 0.48 0.50 0.52 0.54 z4 risk of majority vote at round 25 (a) CIFAR-10 0.20 0.22 0.24 0.26 z6 risk of majority vote at round 20 Figure 5: Similarity to the theoretical negative example measured by z T can partially explain the risk of the majority vote classifier of all rounds. The blue dashed line is the minimum squared error fit. quantify similarity of a rollout to the theoretical negative example. The proof of Theorem 3.5 shows that as long as T 2 ϵ , the theoretical negative example gives z T = 1. Figure 5a depicts the risk of the majority vote at the end of the rollout (round 25) vs. z4 score. The correlation is positive and statistically significant (Pearson r = 0.46, p < 0.001). So, the more similar a rollout to the theoretical bad example in terms of the defined score, the more likely it will be a bad rollout. Interestingly, at round 4, the risk of the majority vote classifier can barely explain the risk after 25 rounds (Pearson r = 0.18, p > 0.05). This shows our negative example construction has not only introduced a worst-case example but might have identified an important characteristic that naturally appears in dynamic benchmarks and limits their effectiveness. Figure 5b also shows a significant positive correlation (Pearson r = 0.68, p < 0.001) between the final risk of a rollout and its similarity to the theoretical negative example measured early at round 6. At this round, the risk of the current majority vote classifier is barely informative about the risk after 20 rounds (Pearson r = 0.27, p > 0.05). Once more, it shows the negative example constructed theoretically can explain the natural failure of dynamic benchmarking in different contexts. C ADDITIONAL STATEMENTS AND PROOFS Corollary C.1. Under conditions of Lemma 3.1, let ˆh be one of the hts selected uniformly at random. 

## C ADDITIONAL STATEMENTS AND PROOFS

Corollary C.1. Under the conditions of Lemma 3.1, let $\hat{h}$ be one of the $h_t$'s selected uniformly at random. For any $\delta > 0$ and $\alpha > 0$, if $T \ge \frac{1}{\delta\alpha}$, then with probability at least $1 - \delta$, $\hat{h}$ is $\alpha$-accurate.

Proof. From Lemma 3.1, the probability that $\hat{h}$ is $\alpha$-bad is less than $\frac{1/\alpha}{T} \le \delta$.

Proof of Lemma 3.1. As $\mathrm{supp}(D_0) \subseteq \mathrm{supp}(D)$, running the path dynamic benchmark preserves $\mathrm{supp}(D_t) \subseteq \mathrm{supp}(D)$ for every $t$. On the other hand, we know $R_D(f) = 0$, so $R_{D_t}(f) = 0$. As $A$ is a perfect risk minimizer and $f \in \mathcal{H}$, we must have $R_{D_t}(h_t) = 0$. Therefore, the errors of $h_t$ must happen outside of $\mathrm{supp}(D_t)$:
$$\Pr_D\big(E_t \cap \mathrm{supp}(D_t)\big) = 0, \quad (9)$$
where $E_t$ is the error set of $h_t$. Now assume $t$ is one of the indices where $h_t$ is $\alpha$-bad, i.e., $\Pr_D(E_t) > \alpha$. Then we have
$$\Pr_D\big(x \in \mathrm{supp}(D_{t+1})\big) = \Pr_D\big(x \in \mathrm{supp}(D_t)\big) + \Pr_D(x \in E_t) \ge \Pr_D\big(x \in \mathrm{supp}(D_t)\big) + \alpha. \quad (10)$$
This cannot happen more than $\frac{1}{\alpha}$ times.

Proof of Theorem 3.4. Starting from $D_0$ with $\mathrm{supp}(D_0) \subseteq \mathrm{supp}(D)$, path dynamic benchmarks preserve $\mathrm{supp}(D_t) \subseteq \mathrm{supp}(D)$ for every $t$. Since the problem is realizable and $A$ is an $\epsilon$-approximate risk minimizer, any consistent $h_t$ must satisfy $R_{D_t}(h_t) \le \epsilon$. The distribution $D_t$ itself is a mixture of the initial and error distributions: $D_t = \mathrm{mix}(D_0, \check{D}_0, \ldots, \check{D}_{t-1})$. So, we can expand $R_{D_t}(\cdot)$ as a linear combination of the risks under the mixture components. Note that the risk of $h_t$ under an error distribution $\check{D}_{t'}$, where $t' < t$, can equivalently be written as $R_{\check{D}_{t'}}(h_t) = \Pr_D(E_t \mid E_{t'})$. For $t = 0, 1, 2$ we have:
$$R_{D_0}(h_0) \le \epsilon,$$
$$\tfrac{1}{2} R_{D_0}(h_1) + \tfrac{1}{2} \Pr_D(E_{h_1} \mid E_{h_0}) \le \epsilon,$$
$$\tfrac{1}{3} R_{D_0}(h_2) + \tfrac{1}{3} \Pr_D(E_{h_2} \mid E_{h_0}) + \tfrac{1}{3} \Pr_D(E_{h_2} \mid E_{h_1}) \le \epsilon.$$
The above constraints impose $R_{D_0}(h_t) = \Pr_{D_0}(E_t) \le (t+1)\epsilon$ and $\Pr_D(E_t \mid E_{t'}) \le (t+1)\epsilon$. However, to bound the joint error probability of two classifiers we also need to bound $\Pr_D(E_t)$:
$$\Pr_D(E_t) = R_D(h_t) \le R_{D_0}(h_t) + d_{\mathcal{H}\Delta\mathcal{H}}(D_0, D). \quad (11)$$
Finally, by a union bound, the error probability of $\mathrm{maj}(h_0, h_1, h_2)$ is at most the sum of all pairwise joint error probabilities. Putting these together:
$$R_D\big(\mathrm{maj}(h_0, h_1, h_2)\big) \le \Pr_D(E_{h_0} \cap E_{h_1}) + \Pr_D(E_{h_0} \cap E_{h_2}) + \Pr_D(E_{h_1} \cap E_{h_2}) \le 11\epsilon^2 + 8\epsilon\, d_{\mathcal{H}\Delta\mathcal{H}}(D_0, D). \quad (12)$$

Proof of Theorem 3.5. For a given domain $X$, we define a new finite domain $X_d \subseteq X$ with $|X_d| = d$ such that $X_d$ can be shattered by $\mathcal{H}$. For an arbitrary but fixed order on $X_d$, we can define equivalent vector representations of functions and distributions in the new domain. In particular, let $h: X \to \{-1, 1\}$ be a binary classifier defined on $X$. The vector representation of $h$ in $X_d$ is $\mathbf{h} = (h(x))_{x \in X_d} \in \{-1, 1\}^d$. Similarly, let $P$ be a probability distribution over $X$. The vector representation of $P$ in $X_d$ is $\mathbf{P} = (P(x))_{x \in X_d}$. Throughout the proof, we use $K$ to denote a subset of $X_d$ with $k$ elements. We also use $P(K)$ as shorthand for $\sum_{x \in K} P(x)$. Throughout the proof we use interval notation for discrete intervals; for example, $[a, b)$ denotes $\{a, a+1, \ldots, b-1\}$. Finally, $\mathbf{h}_1 \circ \mathbf{h}_2$ denotes the entrywise product of $\mathbf{h}_1$ and $\mathbf{h}_2$.

The four allowed operations on classifiers and distributions have new interpretations in the new domain:

1. $D_t \xrightarrow{A} h_t$: The fact that $A$ is an $\epsilon$-approximate risk minimizer imposes the constraint
$$\langle \mathbf{h}_t \circ \mathbf{f}, \mathbf{D}_t \rangle = \sum_{x:\, h_t(x)f(x)=1} D_t(x) - \sum_{x:\, h_t(x)f(x)=-1} D_t(x) = 1 - 2 R_{D_t}(h_t) \ge 1 - 2\epsilon. \quad (13)$$

2. $h_t \xrightarrow{H} \check{D}_t$: Let $K_t$ be the set of indices on which $h_t$ and $f$ disagree. Then it is straightforward to see that
$$\check{\mathbf{D}}_t = \frac{1}{2\, D(K_t)}\big(\mathbf{D} - \mathbf{h}_t \circ \mathbf{f} \circ \mathbf{D}\big). \quad (14)$$

3. $(h_0, h_1, \ldots, h_t) \xrightarrow{\mathrm{maj}} \hat{h}$: This is equivalent to per-dimension voting: $\hat{h}(x) = \mathrm{maj}\big(h_0(x), h_1(x), \ldots, h_t(x)\big)$ for $x \in X_d$.

4. $(D_0, \check{D}_0, \check{D}_1, \ldots, \check{D}_{t-1}) \xrightarrow{\mathrm{mix}} D_t$: This is simply the weighted sum
$$\mathbf{D}_t = w_{t,0}\, \mathbf{D}_0 + \sum_{t' < t} w_{t,t'}\, \check{\mathbf{D}}_{t'}, \quad (15)$$
where the $w$'s are the weights of the mixture components, summing up to 1.

Note that in all of these operations, $h_t$ and $f$ appear only as $\mathbf{h}_t \circ \mathbf{f}$. Since $X_d$ can be shattered by $\mathcal{H}$, without loss of generality we assume $f$ is all ones and write $\mathbf{h}_t$ in place of $\mathbf{h}_t \circ \mathbf{f}$ in the following. Next, we start from the initial distribution $D_0$ as the input to the path dynamic benchmark and derive the constraints imposed on the variables throughout the process.

$D_0 \xrightarrow{A} h_0$: In this case $\langle \mathbf{h}_0, \mathbf{D}_0 \rangle = 1 - 2 D_0(K_0)$ and Equation 13 requires
$$D_0(K_0) \le \epsilon. \quad (16)$$

$D_t \xrightarrow{A} h_t$: For $D_t$ defined in Equation 15, we have
$$\langle \mathbf{h}_t, \mathbf{D}_t \rangle = w_{t,0}\, \langle \mathbf{h}_t, \mathbf{D}_0 \rangle + \sum_{t' < t} w_{t,t'}\, \langle \mathbf{h}_t, \check{\mathbf{D}}_{t'} \rangle, \quad (17)$$
where $\langle \mathbf{h}_t, \mathbf{D}_0 \rangle = 1 - 2 D_0(K_t)$ and
$$\langle \mathbf{h}_t, \check{\mathbf{D}}_{t'} \rangle = \frac{1}{2 D(K_{t'})}\Big(\langle \mathbf{h}_t, \mathbf{D} \rangle - \langle \mathbf{h}_t \circ \mathbf{h}_{t'}, \mathbf{D} \rangle\Big) = \frac{\big(1 - 2D(K_t)\big) - \big(1 - 2D(K_t \,\triangle\, K_{t'})\big)}{2 D(K_{t'})} = 1 - \frac{2\, D(K_t \cap K_{t'})}{D(K_{t'})}. \quad (18)$$
Here $K_t \,\triangle\, K_{t'} = (K_t \cup K_{t'}) \setminus (K_t \cap K_{t'})$, and we used $D(K_t \,\triangle\, K_{t'}) = D(K_t) + D(K_{t'}) - 2 D(K_t \cap K_{t'})$. Plugging $\langle \mathbf{h}_t, \check{\mathbf{D}}_{t'} \rangle$ into Equation 17 and requiring $\langle \mathbf{h}_t, \mathbf{D}_t \rangle \ge 1 - 2\epsilon$ from Equation 13 imposes
$$w_{t,0}\, D_0(K_t) + \sum_{t' < t} w_{t,t'}\, \frac{D(K_t \cap K_{t'})}{D(K_{t'})} \le \epsilon. \quad (19)$$

In the following, we propose $K_t$'s such that the constraints of Equations 16 and 19 are satisfied. These are the necessary and sufficient conditions for the $h_t$'s to be consistent with a path dynamic benchmark.

For $0 \le t < T = \frac{2}{\epsilon}$: Consider $K_t$'s whose pairwise intersections all equal a fixed set $K$ of $k$ elements. Sufficient conditions to satisfy Equations 16 and 19 are then, for all $t' < t$,
$$K_t \cap K_{t'} = K, \quad (20)$$
$$D(K) \le \frac{\epsilon}{2}\, D(K_{t'}), \quad (21)$$
$$D_0(K_t) \le \frac{\epsilon}{2} \quad (22)$$
at every time step. Let us reorder the axes in ascending order of $D_0$ and name them from 1 to $d$. Also assume this results in an ascending order of $D$ as well; for example, $D = \mathrm{unif}(X_d)$ always satisfies this. We set $K = [1, k]$ and $K_0 = [1, k']$ where $k' = \frac{2k}{\epsilon}$. Then for $1 \le t < T = \frac{2}{\epsilon}$, we set $K_t = K \cup \big(k' + (t-1)(k'-k),\; k' + t(k'-k)\big]$. It is straightforward to see that this assignment satisfies Equations 20 and 21. Since the axes are ordered ascendingly according to $D_0$, a sufficient condition to satisfy Equation 22 for all rounds is
$$D_0(K_{T-1}) \le D_0\big(k' + (T-1)(k'-k)\big)\, k' \le D_0(T k')\, k' = D_0\Big(\tfrac{2k'}{\epsilon}\Big)\, k' \le \frac{\epsilon}{2}. \quad (23)$$
Writing $y = \frac{2k'}{\epsilon}$, the above condition is $D_0(y) \le \frac{1}{y}$. Again, since the axes are ordered ascendingly according to $D_0$, we have $D_0(y) \le \frac{1}{d - y + 1}$ (otherwise, the sum of the $D_0$ elements would exceed 1). Then simple calculations show that $d \ge 2y = \frac{8k}{\epsilon^2}$ is sufficient for Equation 23 to hold. Since $k$ is an integer, this is possible for $d \ge \frac{8}{\epsilon^2}$, which requires $\mathrm{VCdim}(\mathcal{H}) \ge \frac{8}{\epsilon^2}$. For $d = \frac{8}{\epsilon^2}$, this gives $k = 1$.

For $t \ge T$: We set $h_t = h_{\phi(t)}$, where $\phi: [T, \infty) \to [0, T-1]$ is an assignment function working as follows. We define updated weights for $0 \le \tau < T$:
$$v_{t,0} = w_{t,0}, \quad (24)$$
$$v_{t,\tau} = w_{t,\tau} + \sum_{T \le t' < t} w_{t,t'}\, \mathbf{1}\{\phi(t') = \tau\}. \quad (25)$$
Then $\phi(\cdot)$ assigns round $t$ according to $\phi(t) = \arg\min_{0 \le \tau < T} v_{t,\tau}$.
Published as a conference paper at ICLR 2023 Note that in all of the operations ht and f have appeared only as ht f. Since Xd can be shattered by H, without loss of generality, we assume f is all one and use ht instead of ht f in the following. Next, we start from the initial distribution D0 as the input to the path dynamic benchmarks and find the constraints imposed over variables throughout the process: D0 A h0: In this case h0, D0 = 1 2D0(K0) and Equation 13 requires: D0(K0) ϵ. (16) Dt A ht: For Dt defined in Equation 15, we have: ht, Dt = wt,0 ht, D0 + X wt,t ht, Dt , (17) where ht, D0 = 1 2D0(K0) and ht, Dt = 1 2D(Kt ) ht, D ht ht , D = 1 2D(Kt ) 1 2D(Kt) 1 2D(Kt Kt Kt Kt ) = 1 D(Kt Kt ) D(Kt ) . (18) Here we used D(Kt Kt Kt Kt ) = D(Kt) + D(Kt ) 2D(Kt Kt ). Plugging ht, Dt into Equation 17 and requiring ht, Dt 1 2ϵ from Equation 13, impose wt,0D0(Kt) + X wt,t D(Kt Kt ) D(Kt ) ϵ. (19) In the following, we propose Kts such that constraints of Equations 16 and 19 are satisfied. These are the necessary and sufficient conditions to have hts consistent with path dynamic benchmarks. For 0 t < T = 2 ϵ : Consider Kts such that Kt Kt = K for any t < t. Sufficient conditions to satisfy Equations 16 and 19 are Kt Kt = K (20) D(K) D(Kt) = k at every time step. Let s reorder axes in an ascending order of D0 and name them from 1 to d. Also assume this results in an ascending order of D as well. For example, D = unif(Xd) always satisfies this. We set K = [1, k] and K0 = [1, k ] where k = 2k ϵ . Then for 1 t < T = 2 ϵ , we set Kt = K (k + (t 1)(k k), k + t(k k)]: K0 : k = 2k K2 : k k k . . . It is straightforward to see that this assignment satisfies Equations 20 and 21. Since axes are ordered ascendingly according to D0, a sufficient condition to satisfy Equation 22 for all rounds is D0(KT 1) D0 k + (T 1)(k k) k D0(Tk )k = D0 2k Published as a conference paper at ICLR 2023 ϵ , the above condition is D0(y) 1 y. Again, since axes are ordered in an ascending order of D0, D0(y) 1 d y+1 (otherwise, the sum of D0 elements will go above 1). Then simple calculations show d 2y = 8k ϵ2 is sufficient to hold Equation 23. Since k is an integer, this is possible for d 8 ϵ2 which requires VCdim(H) 8 ϵ2 . For d = 8 ϵ2 , this gives k For t T: We set ht = hϕ(t) where ϕ : [T, ) [0, T 1] is an assignment function working as follows. We define updated weights for 0 τ < T: vt,0 = wt,0 (24) vt,τ = wt,τ + X wt,t 1{ϕ(t ) = τ}. (25) Then ϕ( ) assigns round t according to ϕ(t) = arg min 0 τ