# optimizing_data_collection_for_machine_learning__78d6c10f.pdf Journal of Machine Learning Research 26 (2025) 1-52 Submitted 3/23; Revised 10/24; Published 3/25 Optimizing Data Collection for Machine Learning Rafid Mahmood1,2 rmahmood@nvidia.com James Lucas1 jalucas@nvidia.com Jose M. Alvarez1 josea@nvidia.com Sanja Fidler1,3,4 sfidler@nvidia.com Marc T. Law1 marcl@nvidia.com 1 NVIDIA 2 University of Ottawa, Ottawa, Canada 3 University of Toronto, Toronto, Canada 4 Vector Institute, Toronto, Canada Editor: Isabelle Guyon Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm to model the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. This formulation generalizes to tasks with multiple data sources, such as labeled and unlabeled data used in semi-supervised learning, and can be easily modified to customized analyses such as how to introduce data from new classes to an existing model. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs. Keywords: data collection, neural scaling laws, active learning 1. Introduction Before deploying a machine learning model in production, stakeholders may mandate that the model meets a pre-determined baseline performance, such as a target score over a validation data set. One of the most reliable ways to achieve a desired performance is by augmenting current training sets with more data. Consequently, engineers must regularly determine how much and what kind of data they need. Managing data collection campaigns can impact costs and delays in model development. Overestimating how much data the model needs to reach a target performance will incur excess costs from collection, cleaning, and annotation. For example, annotating segmentation masks on driving data requires 15 to 40 seconds and between $0.02 to $0.08 per object (Acuna c 2025 Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc T. Law. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-0292.html. Mahmood, Lucas, Alvarez, Fidler, and Law et al., 2018; AWS, 2023). Processing 100,000 images with an average of 10 cars per image can take between 170 to 460 days-equivalent of time and cost between $20,000 to $80,000. Meanwhile, underestimating how much data is needed will incur delays from having to collect more data later. For instance, a 2019 survey of machine learning practitioners revealed 51% of engineering teams faced delays due to failing to collect enough data (Dimensional Research, 2019). This problem grows more challenging when given multiple vendors who can provide different levels of quality at different prices. Consider the following examples: Medical Imaging: A medical imaging company is planning to deploy automatic segmentation software on their devices within the next three years. They need to achieve an 80% Intersection-over-Union (Io U) to meet clinical standards. The company will partner with hospitals and hire clinicians to collect and annotate patient data, which can be expensive. If the company overestimates how much data to collect, they will spend more than necessary, but if they underestimate, they may have to restart the collection process next year. Autonomous Vehicles: A startup working on autonomous vehicles needs to build an object detector in the next five years. This model must achieve a minimum mean Average Precision of 95% on their validation set, or else it will not be deployable and the company will lose $1,000,000 in revenue. Collecting high-quality training data requires employing drivers to record video and annotators to label the data, where the marginal cost of each image is approximately $5. Alternatively, the startup could use lower quality synthetic data at $1 per image. To manage their resources, the startup must plan how much real versus synthetic data they need at the start of each year. The extant literature on learning curves and neural scaling laws suggests that the relationship between the performance of a deep learning model and training data set size follows a power law (Frey and Fisher, 1999; Gu et al., 2001; Hestness et al., 2017; Rosenfeld et al., 2020; Kaplan et al., 2020; Hoiem et al., 2021; Bahri et al., 2024; Bisla et al., 2021). This motivates an intuitive approach of fitting a power law learning curve with the performance statistics of the current data set, extrapolating this learning curve to estimate how the model will perform with more data, and then forecasting how much data is needed to reach the desired performance (Rosenfeld et al., 2020). However, the decay rate of power laws implies that even small errors in estimating a learning curve can lead to massively overor underestimating how much data is actually needed (Mahmood et al., 2022b). Moreover, estimating these learning curves becomes difficult with multiple data vendors, since different types of data have different costs and scale differently with performance (Mikami et al., 2022; Acuna et al., 2021; Prakash et al., 2021; Acuna et al., 2022; Prabhu et al., 2023). For example, in a semi-supervised learning task, unlabeled data may be easier to collect than labeled data, but we may require an order of magnitude more unlabeled data to match the performance of a small labeled set (Viering and Loog, 2022). Thus, collecting more data based only on scaling law estimates will fail to capture uncertainty and collection costs. In this paper, we propose a new paradigm to model the data collection workflow as an optimal data collection problem, where a firm must minimize the cost of collecting enough data to obtain a model capable of achieving a desired performance score. They have multiple collection rounds, and after each round, they re-evaluate their model and decide how much more data to order. There are per-sample collection costs, and the firm pays a penalty if they fail to meet the target score within a finite horizon. Using this framework, we develop Optimizing Data Collection for Machine Learning an optimization approach to minimize expected future costs and show that this problem can be optimized in each collection round via gradient descent. Our optimization problem generalizes to decisions over multiple data sources (e.g., unlabeled, long-tail, cross-domain, synthetic) that have different costs and impacts on performance. Most importantly, our framework presents natural tools for custom economic analyses such as comparing different collection strategies or introducing new classes to expand an existing model. Finally, we demonstrate the value of optimization over na ıvely estimating data set requirements for several machine learning tasks and data sets. Our contributions are as follows: 1. We propose the optimal data collection problem in machine learning, which extends the estimation of learning curves to a formal dynamic optimization problem to determine how much and what kind of data to collect over the model development life cycle. 2. We introduce Learn-Optimize-Collect (LOC), a learning-and-optimizing framework that minimizes expected collection costs and can be solved via gradient descent. This is the first exploration of optimizing data collection with multiple arbitrary data sources in machine learning, covering, for example, semi-supervised and long-tail learning. 3. We analyze the one-round problem of deciding how much data to collect. We show that this problem is equivalent to estimating a (1 ϵ)-quantile of the distribution of the minimum data needed to meet the target. Under Gaussian assumptions, LOC guarantees lower regret than estimation-only baselines. 4. We evaluate LOC over classification, segmentation, and detection tasks to show, on average, approximately a 2 reduction in the chances of failing to meet performance targets, versus estimation baselines. We also show the flexibility of our framework to solve customized managerial questions faced by engineering firms. The overall goal of this work is to provide a high-level framework with which machine learning model developers can obtain policy insights on how much data to collect. In practice, these decisions are strategic in nature and are determined over long-term horizons (see our motivating examples above). Moreover, the final policy decisions are often informed via a combination of our data-driven approach and human expert judgment. Consequently, we validate our methodology via a breadth of experiments on different data sets, tasks, and different managerial scenarios. A preliminary version of this article was published in Mahmood et al. (2022c). Our complete paper introduces theoretical analysis of the optimal data collection problem. First, we characterize the optimal solution space of the data collection problem (Theorem 1) and show that LOC yields the optimal solution to the problem (Lemma 2). Further, we extend the theoretical analysis of one-round data collection by deriving an analytic formula for the optimal solution as well as a regret bound under Gaussian assumptions (Proposition 5, Corollary 6, Lemma 7, Proposition 8). Finally, we present several new experiments, including comparisons with different learning curve estimators (Section C.2), analysis under active learning strategies (Appendix C.3), and two new empirical case studies (Appendix 6.4) to show how our high-level modeling approach can answer customized data collection challenges. In these experiments, we adapt LOC to estimate the potential costs and yield decisions when choosing between different data collection strategies. Mahmood, Lucas, Alvarez, Fidler, and Law 2. Related work This paper employs neural scaling laws to solve operational design problems. The problem of data-efficient learning is closely tied to active learning and statistical sample complexity. Below, we summarize the most relevant literature. 2.1 Learning Curves and Neural Scaling Laws. According to the learning curve literature, the performance of a machine learning model on a validation set scales with the size of the training data set with diminishing marginal value in a way that can usually be modeled via concave functions (Cortes et al., 1993; Frey and Fisher, 1999; Provost et al., 1999; Meek et al., 2002; Tomanek and Hahn, 2008; Figueroa et al., 2012; Kolachina et al., 2012). Consequently, performance at large data scales can be extrapolated by estimating the learning curve at smaller scales, both from a point perspective and by capturing uncertainty (Domhan et al., 2015). Although the literature primarily considers a single-round estimate of data collection requirements, John and Langley (1996) propose dynamic sampling over multiple data collection rounds, similar to our view. Furthermore while the literature primarily explores estimating dataset sizes, Last (2007) propose an optimization approach for data collection. Their framework minimizes the data collection cost plus a penalty associated with the amount of generalization error of the downstream model. In contrast to these two works, we propose a multi-round data collection optimization problem that minimizes the collection cost plus a penalty associated with failure to achieve a performance requirement. We refer to Viering and Loog (2022) for a detailed review on learning curves. The neural scaling law literature focuses on deep learning to empirically demonstrate a power law relationship between model performance and data set size (Hoiem et al., 2021; Bisla et al., 2021; Zhai et al., 2022; Caballero et al., 2023). For instance, Hestness et al. (2017) observe this property over vision, language, and audio tasks, Bahri et al. (2024) develop a theoretical relationship under assumptions on over-parametrization and the Lipschitz continuity of the loss, model, and data, and Rosenfeld et al. (2020) estimate power laws using smaller data sets and models to extrapolate future performance. Multi-variate scaling laws have also been considered for some specific tasks, for example in transfer learning from synthetic to real data sets (Mikami et al., 2022). Finally, Mahmood et al. (2022b) explore data collection by estimating the minimum amount of data needed to meet a given target performance over multiple rounds. Our paper extends these studies by introducing a sequential optimization problem where the amount of data to collect is optimized, rather than estimated, over multiple collection rounds and from multiple data sources that may feature different costs. 2.2 Active Learning In active learning, a model sequentially collects data by selecting new subsets of an unlabeled data pool to label under a predetermined labeling budget that replenishes after each round (Settles, 2009; Sener and Savarese, 2018; Yoo and Kweon, 2019; Sinha et al., 2019; Mahmood et al., 2022a). In contrast, we focus on systematically determining an optimal Optimizing Data Collection for Machine Learning collection budget. Thus, our work complements active learning which can be used to collect data up to the budget determined by our optimization policy. 2.3 Statistical Learning Theory The rich theoretical analysis on the sample complexity of machine learning explores worstcase bounds relating data set size and model performance, but these bounds are typically only tight asymptotically; we refer the reader to Section 7.1 of Viering and Loog (2022) for details on this challenge. Recent work have empirically analyzed these relationships (Jiang et al., 2020, 2021) For instance, Bisla et al. (2021) study generalization bounds for deep neural networks, provide empirical validation, and suggest using them to estimate data requirements. Ultimately, making data collection decisions based on worst-case bounds may be pessimistic and have consequences on collection costs. 2.4 Optimal Experiment Design The topic of how to collect data, select samples, and design statistical experiments is well-studied in econometrics (Smith, 1918; Cohn, 1993; Emery and Nenarokomov, 1998). For instance, pseudo-random strategies such as Latin Hypercube Sampling may be more efficient than i.i.d. sampling in data collection for statistical tests (Viana, 2016). Similarly, Bertsimas et al. (2015) optimize the assignment of samples into control and trial groups to minimize inter-group variances. Most recently, Carneiro et al. (2020) optimize how many samples and covariates to collect in a statistical experiment by minimizing a treatment effect estimation error or maximizing t-test power. However, our focus on industrial machine learning applications differs from experiment design by having target performance metrics and continual rounds of collection and modeling. 2.5 Sequential Decision-Making Our problem is a sequential decision-making problem with an unobservable state, i.e., we determine how much to collect in each round without knowing how much we will need. Although such problems can be formed as Partially Observable Markov Decision Processes (POMDPs) (Smallwood and Sondik, 1973; Puterman, 2014), the dimension of the state and action space combined with sampling limitations make such approaches untenable. We provide a detailed reformulation and discussion in Appendix A. 3. Main Problem We first introduce a formal model of data collection in machine learning as a dynamic decision-making game played over multiple collection rounds. We then discuss challenges with current intuitive, but na ıve approaches for determining how much data to collect. 3.1 Optimal Data Collection Consider K N different data sources, where for each k {1, . . . , K}, let zk be a data point and let Dk be a data set of points generated from a fixed algorithm, such as i.i.d. sampling or an active learning strategy (Settles, 2009). We train a learning model with data Mahmood, Lucas, Alvarez, Fidler, and Law sets D1, . . . , DK and evaluate a score function V (D1, . . . , DK). For example, if the learning problem is binary image classification, let K = 1 where z1 := (x, y) corresponds to images x X and labels y {0, 1}, and V (D1) is the validation set accuracy of a model trained on D1. Alternatively in semi-supervised learning, let K = 2 where z1 := (x1, y1) and z2 := x2, while V (D1, D2) is the validation accuracy of a model trained with both data sets. We omit superscripts and subscripts unless necessary. In general, we have training sets D1 q0,1, . . . , DK q0,K of q0,1, . . . , q0,K points, respectively, a target score V > V (D1 q0,1, . . . , DK q0,K), and a horizon of T rounds. Let q0 := (q0,1, . . . , q0,K)T be a vector of data set sizes and let Vq0 := V (D1 q0,1, . . . , DK q0,K). For each t {1, . . . , T}, we (i) Determine how much data to have at the end of the round qt := (qt,1, . . . , qt,K)T. (ii) Generate data until each Dk has qt,k points. (iii) Re-train our learning model. If Vqt V or if t = T, we terminate. In each round, we pay a cost ck > 0 for each additional point generated for the k-th data set. Further, if we do not reach V after T rounds, we pay a penalty P. Let c := (c1, . . . , c K)T be the cost vector. Then, the optimal data collection problem is min q1 q T c T (q1 q0) + 1 {Vq1 < V } c T (q2 q1) + 1 {Vq2 < V } c T (q3 q2) + 1 Vq T 1 < V c T (q T q T 1) + P1 {Vq T < V } ! = min q1 q T t=1 c T (qt qt 1) s=1 1 {Vqs < V } + P t=1 1 {Vqt < V } (2) Problem (1) is defined recursively where the objective includes the cost of collecting additional data in each round t and then conditioned on not collecting enough data in that round, the problem continues to the next round. Problem (2) refactors the objective by extracting the action in each round t to be depedendent on Qt 1 s=1 1 {Vqs < V }. If we use randomized algorithms to train a learning model and to sample data, the score function must be a random variable. Moreover, a general observation is that the score function typically increases monotonically with data set size (Frey and Fisher, 1999; Sun et al., 2017; Rosenfeld et al., 2020). We combine these two into the following assumption. Assumption 1 The score function is a realization of a stochastic process Vq := V (D1, . . . , DK) as a function of the data set size. Furthermore, Vq increases monotonically with q. Optimizing Data Collection for Machine Learning The score function may not always increase monotonically (Mohr et al., 2022), due to factors such as model mis-specification (Viering and Loog, 2022), active learning (Tomanek and Hahn, 2008), or mixing labeled and unlabeled data in semi-supervised learning (Cozman et al., 2002)). However, monotonic non-decreasing trends such as power laws has been consistently observed in large-scale deep learning systems (Hestness et al., 2017; Rosenfeld et al., 2020; Hernandez et al., 2021; Hoffmann et al., 2022). Furthermore, this assumption ensures Vq1 Vq T , meaning Qt 1 s=1 1 {Vqs < V } = 1 Vqt 1 < V , which allows us to simplify problem (2) to RHS (2) = min q1 q T t=1 c T (qt qt 1) 1 Vqt 1 < V + P1 {Vq T < V } . (3) As Vq is monotonically increasing, an intuitive strategy may be to collect the minimum amount of data such that Vq = V . We refer to this amount as the minimum data requirement D := arg min q n c Tq | Vq V o . (4) The minimum data requirement is also the stopping time of the stochastic process, i.e., a random variable that gives the lowest-cost index that passes V . We assume that problem (4) always has a solution, i.e., we can achieve V performance with a finite amount of data. Furthermore, we randomly pick a unique solution to break ties in (4). Below, we show that unless our penalty for failing to reach the target is too small, the optimal solution is to collect this minimum data requirement. Theorem 1 Suppose Assumption 1 holds and that q0 < D . If P < c T(D q0), then an optimal solution to problem (3) is q 1 = = q T = q0. Otherwise, an optimal solution is q 1 = = q T = D . Proof We prove for P < c T(D q0) by breaking into two cases. First, consider any solution q1, . . . , q T where q T > q0 but Vˆq T < V . Inputting this solution to problem (3) incurs an objective function value of c T(ˆq T q0) + P P, where the right-hand-side is the objective function value incurred by q 1 = = q T = q0. Next, consider the case where Vq T V . Let ˆt indicate the smallest qt for which Vqt V . The objective function value is at least c T(qˆt q0) c T(D q0) > P. Here, the first inequality follows from the definition of D in (4), and the second follows from our assumption. Therefore, q T = q0, which implies our unique optimal solution. We now prove for P c T(D q0). First, consider any solution q1, . . . , q T for which Vq T < V . This means that the objective function value is greater than P c T(D q0), where the right-hand-side is the objective function value incurred by q 1 = = q T . Second, consider any solution where Vq T V . Let ˆt indicate the smallest qt for which Vqt V . The objective function value is c T(qˆt q0) c T(D q0), where the inequality follows from the optimality of D . Theorem 1 shows that the penalty for failing to meet a target performance must be sufficiently high for the optimal data collection problem to be non-trivial. In practice, c and T are determined by the real costs and constraints of the machine learning project, but P simply Mahmood, Lucas, Alvarez, Fidler, and Law Scaling Law Estimator v(q; θ) Power Law θ1qθ2 + θ3 π arctan θ1 π 2 q + θ2 + θ3 Logarithmic θ1 log(q + θ2) + θ3 Algebraic Root 100q (1 + |θ1q|θ2)1/θ2 + θ3 Table 1: Four common scaling law functions with learnable parameters θ = {θ1, θ2, θ3} when K = 1. See Viering and Loog (2022) for an extensive list. For K > 1, we can add the scaling law for each data source according to (5). Algorithm 1 Na ıve Estimation of the Data Requirement 1: Input: Initial data set D1, . . . , DK of q points, Regression model ˆv(q; θ), Regression set size R. 2: Collect Performance Statistics(q) 3: Initialize R = , ˆD1 = = ˆDK = 4: for r {1, . . . , R} do 5: Sub-sample qk/R additional points from each Dk without replacement to augment ˆDk. 6: Evaluate V ( ˆD) and update R R n qr/R , V ( ˆD) o . 7: end for 8: end 9: Estimate D 10: Fit regression model θ = arg minθ P|R| r=1 (Vqr v(qr; θ))2. 11: Estimate the data requirement by solving ˆD = arg minq n c Tq v(q; θ ) V o . 12: end 13: Output: Estimated ˆD. reflects how much a firm stands to lose from not meeting performance targets. As such, practitioners have freedom to tune this parameter when modeling data collection, where setting a high P suggests that a strong need to meet the target within the time horizon. We expand on this observation in our theoretical analysis in Section 5. We assume in the rest of this paper that P > c T(D q0), which implies that the optimal amount of data to have is D . However, D is unknown to us at the time of decision-making. 3.2 An Intuitive but Na ıve Approach to Estimating the Data Requirement The recent neural scaling law literature suggests that if we can model the score function Vq, then we can estimate D directly and use this to determine how much data to collect. For instance, Rosenfeld et al. (2020); Hoiem et al. (2021); Caballero et al. (2023) fit parameters θ to an estimator v(q; θ) Vq of the score. They subsample from the current data sets D1 qt,1, . . . , DK qt,K to simulate small data set sizes, retrain the model, and evaluate the score. Repeating this process with R different training subsets yields a data set of training statistics R := {qr, Vqr}R r=1, which can be used to solve a Least Squares minimization problem. Once fitted, v(q; θ ) can replace Vq in problem (4) to estimate the stopping time. Algorithm 1 summarizes the general steps. Optimizing Data Collection for Machine Learning Figure 1: Extrapolating scaling laws to estimate D on Image Net (Deng et al., 2009). The solid blue line is the ground truth test accuracy as a function of data set size. Left: Fitting four different scaling law functions from Table 1 when initializing with 10% (q0 = 125, 000, dotted) and 50% (q0 = 600, 000, dashed) of the data set. All functions struggle to accurately extrapolate accuracy when q0 is small, but are accurate when q0 is large. Right: To hit a target V = 67% accuracy, we need 900, 000 images. If the scaling laws overor underestimate by only a small amount ( 6% error at q = 900, 000), they massively under- (e.g., ˆD = 580, 000) or overestimate (e.g., ˆD = 3, 000, 000) how much data is needed. The scaling law literature almost exclusively focuses on K = 1, wherein they propose different parametric functions for v(q; θ)(Rosenfeld et al., 2020; Viering and Loog, 2022; Hoiem et al., 2021; Caballero et al., 2023); we give examples in Table 1. The most common choice is a power law v(q; θ) = θ1qθ2 +θ3. We remark that some recent research has explored K > 1; for instance, Mikami et al. (2022) explore a K = 2 power law for transfer learning from synthetic to real data. To explore arbitrary K, we propose an easy-to-implement estimator that adds the contributions of each data source v(q; θ) := θ0 + k=1 vk(qk; θk), (5) where vk(qk; θk) can be any K = 1 estimator of the learning curve. Moreover, this additive model can be easily fit with a Least Squares algorithm and offers interpetable explanations of the contributions of different data types by assumping each data set has an independent contribution (Ghorbani and Zou, 2019). Remark 1 The above estimator extends the existing literature to arbitrary K N and can also be compounded on future K = 1 estimation algorithms. Further, different estimation functions v(q; θ) demonstrate specific biases towards either overor underestimating performance (Mahmood et al., 2022b). Designing a specific estimator is not the focus of this paper, as we will next show that all estimators have the same weaknesses with respect to data collection, i.e., disproportionate overor undercollection due to the concavity of the learning curve. For the remainder of the main paper, we focus on the specific case of v(q; θ) Mahmood, Lucas, Alvarez, Fidler, and Law being a linear combination of power laws, since power laws are the most common choice for learning curves. In Appendix C, we include ablations with different estimators (see Table 7). Due to the fact that estimation will have some degree of inaccuracy, this intuitive strategy na ıvely suffers from two major consequences when used in data collection. These properties were initially observed by Mahmood et al. (2022b), but we expand on their analysis below. We highlight them using the Image Net data set in Figure 1: 1. Extrapolation performance is tied to current amount of data: Figure 1 (Left) plots four different scaling law functions that were fit using an initial q0 = 125, 000 and q0 = 600, 000 images, i.e., 10% and 50% of the entire data set respectively. When q0 is small, every function fails to accurately extrapolate future performance and ultimately diverges from the ground truth learning curve. Alternatively when q0 is large, every function reasonably accurately follows the learning curve. Thus, an initial estimate of D from a small q0 is likely to be inaccurate. This property was also observed by Hestness et al. (2017), who referred to this as the the small-data regime. 2. Small estimation errors lead to large overor undercollection: Figure 1 (Right) plots two different scaling law functions. To collect a target V = 67%, the ground truth D is 900,000 images. Both estimated v(q; θ) functions are reasonably accurate and have less than 6% error at q = 900, 000. However, the function that overestimates Vq (orange) suggests collecting 580, 000 images, which is about 60% of the true amount, whereas the function underestimating Vq (green) suggests collecting 3 million images, which is over three times the true amount. Finally, note that the magnitude of error is much larger in this example when we underestimate v(q; θ) (i.e., the green curve). As q increases, both v(q; θ) and the estimated curve flatten. If the rate of change for both curves approach zero, the distance between the points at which the two curves respectively reach V can increase arbitrarily. We conclude that data collection policies that simply estimate how much data is needed for a given task can lead to paying arbitrarily large collection costs, even if the estimated learning curves are close to the true curves. Moreover, estimators diverge drastically from the true learning curves when given a limited amount of data. As a result, robust data collection policies must also capture the uncertainty of estimation. 4. Learn-Optimize-Collect (LOC) Our solution approach, which we refer to as Learn-Optimize-Collect (LOC), minimizes the total collection cost while incorporating the uncertainty of estimation. To facilitate this problem, we assume that D is a continuous random variable that has a well-behaved, differentiable cumulative density function. Assumption 2 The random variable D is absolutely continuous and has a differentiable cumulative density function (CDF) F(q) := Pr{D q} and probability density function (PDF) f(q) := d F(q)/dq. Under this assumption, we will optimize over a continuous decision space for q and round any non-integer values. Although in practice, D is discrete, it is often realized on the order Optimizing Data Collection for Machine Learning Optimize-Collect Sample bootstrap performance with subsets of data to learn data requirement distribution Use estimated distribution to optimize collection cost plus risk of failing to meet V* Pay c(qt - qt-1) to collect additional data Have we achieved score V*? Have we hit time limit T? Terminate Pay P and Initialize with q0 data points & target V* Learn Optimize Collect Figure 2: In the optimal data collection problem, we iteratively determine the amount of data that we should have, pay to collect the additional data, and then re-evaluate our model. Learn-Optimize-Collect (LOC) optimizes for the minimum amount of data q t to collect. of thousands or greater, which makes the assumption of continuity mild and rounding errors generally minor. In this section, we first propose a stochastic optimization alternative to the optimal data collection problem (3). We then develop our framework, where we estimate the probability distribution of D and optimize over the estimated distribution. 4.1 A Stochastic Reformulation of Optimal Data Collection Solving problem (3) directly is difficult because evaluating whether or not a given amount of data q is sufficient to reach V requires collecting the data itself and training the learning model. However, note that for any solution q to problem (3), if q D , then by definition Vq V . As a result, consider the following approximation of the original problem: t=1 c T (qt qt 1) 1 {qt 1 D } + P1 {q T D } . (6) Problem (6) replaces the condition of achieving V from problem (3) with the condition of collecting at least D points over all of the data sources. We show below that when K = 1, these two problems are exactly equivalent. For general K, the two problems share the same optimal solution. Lemma 2 The following statements are true: 1. If P < c T (D q0), then an optimal solution to Problem (6) is q 1 = = q T = q0. Otherwise, an optimal solution is q 1 = = q T = D . 2. If K = 1, then problem (6) is equivalent to problem (3). Proof The proof for the first statement is identical to the proof of Theorem 1. To prove the second statement note that from Assumption 1 and the fact that q is a scalar, D = arg minq{q | Vq V }. Thus, we have the following equivalence q D Vq V . Mahmood, Lucas, Alvarez, Fidler, and Law Substituting this into the indicators completes the proof. Lemma 2 states that D is an optimal solution to the approximation (6). Because this is also an optimal solution to the original problem, solving either problem to optimality can achieve the same decision. Furthermore, the total cost is the same for both problems when collecting up to D data points. The approximation (6) relies on D , which is not observable a priori. However, since D is a random variable, we can take the expectation over the objective in problem (6) to obtain min q1 q T ED f(q) t=1 c T (qt qt 1) 1 {qt 1 D } + P1 {q T D } = min q1 q T t=1 c T (qt qt 1) (1 F(qt 1)) + P (1 F(q T )) , (7) where F(q) is the CDF of D . Due to Assumption 2, the reformulated stochastic objective is differentiable in q. We treat problem (7) to be continuous over q1, , q T and round any non-integer values to determine the amount of data to collect. Therefore, this problem can be solved via gradient descent-based algorithms. Finally, we remark on the interpretation of problem (7). Recall that problem (6) is equivalent to the original data collection problem (3) insofar as they both share an optimal solution D . Furthermore, for any data collection decision qt, the CDF F(qt) gives the probability that qt D . In the optimization problem (7), we determine how much data to collect in each round qt by minimizing the probability of qt D , i.e., 1 F(qt), multiplied by the cost of collecting this additional data and the penalty of failing to achieve the model within T rounds. This approach contrasts with the previous na ıve estimator which directly used a point estimate of D , by now incorporating the risk and potential cost of overor under-collecting data due to the stochasticity in this random variable. 4.2 Estimating-then-Optimizing How Much Data to Collect Solving problem (7) requires access to the distribution F(q). However, just as we can estimate D , we can now estimate the distribution and use this estimated distribution in the optimization problem. We propose a simple strategy of estimating this distribution by bootstrapping the point estimates of D obtained via Algorithm 1. We first use the same steps as before to create a regression set of training statistics R. Then, let B > 1 be the number of bootstrap estimates. For each b {1, . . . , B}, we create a bootstrap resampled set of R and solve a corresponding Least Squares minimization problem to fit a scaling law estimator vb(q; θb) with parameters θb. We use this estimator in place of Vq in problem (4) to estimate the minimum data requirement. After repeating this process, we obtain a bootstrap set of estimates {ˆqb}B b=1, which we then use to fit a Kernel Density Estimator (KDE) ˆf(q) of the PDF of the data requirement. Numerical integration of this KDE model yields the estimated CDF ˆF(q) := R q 0 ˆf(q)dq. Algorithm 2 summarizes the steps. Although there may exist several strategies for estimating this distribution, our bootstrapping procedure remains easy-to-perform and requires computation only to call Algorithm 1 for B rounds. In numerical experiments, we set B = 500 and find that this can generate Optimizing Data Collection for Machine Learning Algorithm 2 Estimating the Data Requirement Distribution F(q) 1: Input: Initial data set Dq, Regression model ˆv(q; θ), Regression set size R, Number of bootstrap samples B, Kernel Density Estimation (KDE) model ˆf(q). 2: Initialize R = , ˆD = 3: Update R using the Collect Performance Statistics(q) sub-routine of Algorithm 1 4: Initialize Q = 5: Bootstrapped CDF(R) 6: for b {1, . . . , B} do 7: Create bootstrap Rb by sub-sampling R points with replacement from R 8: Fit regression model θ = arg minθ P (q,Vq) Rb (Vq v(q; θ))2 9: Estimate the data requirement ˆqb = arg minq c Tq | v(q; θ ) V 10: Update Q Q {ˆqb} 11: end for 12: Fit KDE model ˆf(q) using the empirical distribution Q and let ˆF(q) := R q 0 ˆf(q)dq 13: end 14: Output: Estimate of the requirement distribution ˆF(q) high-quality distribution estimates (see Appendix C.1 for details). Most importantly, the first derivative of this CDF ˆF(q) is immediately recoverable as the original KDE model ˆf(q). This derivative is necessary for gradient descent optimization of problem (7). 4.3 Putting It All Together We now describe the complete framework for optimizing the amount of data to collect in each round of the data collection problem. This framework uses a model-predictivecontrol approach of re-estimating the data requirement distribution, solving the stochastic optimization problem (8), and taking only the recommendation of how much data to collect for the immediate round. Figure 2 summarizes the steps of our algorithm, Learn-Optimize Collect (LOC), which is also described in detail in Algorithm 3. Given a target score V and an initial amount of data q0, we must collect data until we have met the target or until T rounds have passed. In the t-th round, we initialize with qt 1 data points over the K data sources. We first collect performance statistics by measuring the training dynamics of the current datasets D1, . . . , DK. In each round t, rather than re-collecting the full performance statistic R in each round, we alternatively update R with the latest result (q, Vqt); this significantly reduces the computational burden of collecting training statistics in each round. We then bootstrap these statistics to estimate the data requirement distribution for the t-th round, which we refer to as ˆFt(q). Given an estimated distribution of D , we first define the variables dt RK where dt := qt qt 1 and solve a reformulated version of problem (7), below min d1,...,d T 0 The above problem (8) is differentiable on its domain, and the variables are only constrained to non-negativity. Thus, this problem can be treated as a continuous optimization problem with only non-negativity constraints, and can be optimized via projected gradient descent Mahmood, Lucas, Alvarez, Fidler, and Law Algorithm 3 Optimal Data Collection via LOC 1: Input: Initial data sets D1, . . . , DK of q0 points, Regression model ˆv(q; θ), Regression set size R, Number of bootstrap samples B, Kernel Density Estimation (KDE) model f(q), Target V , Maximum number of rounds T, Cost c, Penalty P. 2: Initialize round t = 0, Total cost L = 0 3: Initialize statistics R = , datasets ˆD1 = = ˆDK = 4: repeat 5: Update R using the Collect Performance Statistics(qt) sub-routine of Algorithm 1 6: Learn-Optimize-Collect 7: Initialize Q = 8: Update KDE model ˆFt(q) using the Bootstrapped CDF(R) sub-routine of Algorithm 2 9: Freeze variables ds for s < t and solve problem (8) using ˆFt(q) to obtain d t , . . . , d T . 10: end 11: Collect Data(dt) 12: for k {1, . . . , K} do 13: Collect data zk until |Dk| = qt,k + dt,k 14: Update cost L L + ckdt,k 15: end for 16: Re-train model and update performance Vqt 17: end 18: t t + 1 19: until Vqt V or t = T 20: if Vq T < V then 21: Update loss L L + P 22: end if 23: Output: Final collected data sets D1, . . . , DK, Total cost L, Final model performance Vqt algorithms. Finally, in each round t, we fix the previous decision variables d1, . . . , dt 1 constant to their previously optimized values since we have already collected this data. Let d t , . . . , d T be the solution obtained from using a gradient method to optimize problem (8). We then collect data from each data source as determined by d t , i.e., the recommendation of how much data to collect immediately in round t. Once this data is collected, we re-train the machine learning model to evaluate the current performance Vqt and proceed to the next round of the LOC pipeline. 5. Theoretical Insights in One-Round Data Collection Although LOC can be used to generate long-term strategic data collection-term decisions over multiple rounds, a common use-case is to obtain a one-round T = 1 estimate of how much data is needed to meet the target V . For example, consider a firm that is deciding whether or not they should pursue a machine learning-based solution for a specific problem; to make an informed decision, the firm may want a single estimate of whether the amount of data needed to build the desired model is financially feasible. Such use-cases typically feature a single data type K = 1, a limited or potentially zero initial data q0, and a noisy estimator ˆF(q) of the data requirement distribution F(q) (for example, see Section 3.2). This setting permits theoretical analysis wherein we can derive exact globally optimal solutions Optimizing Data Collection for Machine Learning for how much data to collect as well as structural insights to the relationships between costs, penalties, and the estimated data distribution. In this section, we focus on the one-round, single-data source problem: min d1 cd1 + P 1 ˆF(q0 + d1) . (9) In Section 5.1, we first develop an analytic solution to problem (9). This solution is dependent on the ratio c/P of the per-sample cost and how much we stand to pay in penalty for failing to collect. Then in Section 5.2, under the assumption that ˆF(q) follows a Gaussian distribution, we fully characterize the analytic solution to obtain the globally optimal amount of data to collect. Finally in Section 5.3, we consider the case where ˆF(q) is a noisy estimate of the true data requirement distribution F(q). Here, we develop a regret bound to show that our optimization strategy outperforms the estimation-based approach (i.e., Algorithm 2) to data collection summarized in Section 3.2. 5.1 An Analytic Solution to One-Round Data Collection The cost c parameter reflects real data collection costs whereas the penalty P reflects how much a firm stands to pay if they cannot obtain a model with performance V . Since this term may not have an exact real value, it may be difficult to determine the appropriate P in practice. Instead, consider an alternate, more intuitive parameter ϵ 0 to be a measure of the maximum tolerable probability of not meeting V . Since the data requirement D is stochastic, ϵ represents how much a firm is willing to tolerate the chance of undercollecting. That is, we should collect enough data d1 such that ˆF(q0 + d1) 1 ϵ. Our main theorem below states an optimal solution d 1 to problem (9) in terms of this acceptable risk ϵ. Theorem 3 Assume ˆF(q) is strictly increasing and continuous. If there exists d1 0 where c P ˆF(q0 + d1) ˆF(q0) then there exists an ϵ 1 ˆF(q0) that satisfies P = c/ ˆf( ˆF 1(1 ϵ)) and an optimal solution to problem (9) is d 1 := ˆF 1(1 ϵ) q0. Otherwise, d 1 := 0. Before proving this result, we first summarize the intuition and consequences. Theorem 3 states that for the one-round T = 1 problem, when there is only a single data type K = 1, the optimal one-round estimate of the data requirement is determined by taking a 1 ϵ quantile of the distribution of D . Rather than solving the optimization problem (9), we can instead just use the estimated data requirement distribution ˆF(q) and prescribe a maximum acceptable risk of failing to collect enough data ϵ := Pr{q0 + d1 < D }. We then collect exactly d 1 = ˆF 1(1 ϵ) q0 additional points. The equivalence between simply prescribing a maximum risk ϵ and solving problem (9) is determined via choice of the cost and penalty parameters to satisfy c/P = ˆf( ˆF 1(1 ϵ)). Nonetheless, we remark that this equivalence only holds if the cost-to-penalty ratio c/P is sufficiently small and also only for T = 1, K = 1. Thus, problem (9) can also be seen as a generalization of Bayesian approaches to determine how much data to collect with respect to estimating the distribution of the learning curves and selecting a minimum pre-determined quantile (Domhan et al., 2015). Mahmood, Lucas, Alvarez, Fidler, and Law To prove Theorem 3, we will equate the original problem (9) to the following alternative constrained optimal data collection problem min d1 c d1 s. t. ˆF(q0 + d1) 1 ϵ , d1 0. (11) The above problem incorporates the maximum acceptable risk parameter ϵ as a constraint to replace the previous penalty-based formulation. We first prove that this problem is is convex and has an intuitive analytic solution. Lemma 4 Assume that ˆF(q) is strictly increasing and continuous. Then, 1. Problem (11) is a convex optimization problem. 2. The unique optimal solution to problem (11) is d 1 = max{ ˆF 1(1 ϵ) q0, 0}. Proof To prove the first statement, note that the objective and the second constraint are convex. Thus, we only need to prove that the set {d1 | ˆF(q0 + d1) 1 ϵ} is a convex set. Since ˆF(q) is strictly increasing in q, for any θ [0, 1] and ˆd d 0, we have ˆF(q0 + θ ˆd + (1 θ)d) ˆF(q0 + d) 1 ϵ. Because the convex combination of any two points is in the set, the set must be convex, which makes the optimization problem convex. To prove the second statement, we consider two cases. First, if ˆF 1(1 ϵ) q0, let d1 be the value that satisfies q0 + d1 = ˆF 1(1 ϵ) := inf n q | ˆF(q) 1 ϵ o , which makes d1 the smallest value to satisfy ˆF(q0+d1) 1 ϵ. Therefore, d 1 = ˆF 1(1 ϵ) q0 is a minimizer in this case. In the second case where ˆF 1(1 ϵ) < q0, we have ˆF(q0) > 1 ϵ. Since d1 0 is a constraint, the minimizer is d 1 = 0. We prove Theorem 3 by showing that under (10), problems (9) and (11) are equivalent. Proof of Theorem 3 First, note that when T = 1 and K = 1, problem (9) is a minimization of a continuous function over a single variable with a single non-negative constraint. Then, there must exist an optimal solution d 1 that satisfies either d 1 = 0 via the boundary condition, or ˆf(q0 + d 1) = c/P as a zero-gradient solution. Moreover, for the zero gradient solution to be optimal, there must exist some decision d1 > 0 that achieves a lower objective than the zero solution. We rewrite this condition to cd1 + P(1 ˆF(q0 + d1)) P(1 ˆF(q0)) = c P ˆF(q0 + d1) ˆF(q0) To derive the structure of the optimal solution, we rewrite the original problem (9) to min ϵ min d1 cd1 + Pϵ s. t. ϵ 1 ˆF(q0 + d1), d1 0, (13) where ϵ is a slack variable. However, for any fixed value of ϵ, the inner problem (13) is equivalent to the constrained alternative problem (11), meaning that it has an analytic Optimizing Data Collection for Machine Learning solution as determined by Lemma 4. Specifically, for any ϵ 1 ˆF(q0), the optimal solution is d 1 = 0, but for any ϵ 1 ˆF(q0), the optimal solution is d 1 = ˆF 1(1 ϵ) q0. Consequently, problem (13) can be further rewritten as n c ˆF 1(1 ϵ) q0 + Pϵ | ϵ 1 ˆF(q0) o , P(1 ˆF(q0)) . (14) Note that problem (14) has an equivalent optimality condition to (12). That is, for the left term to be the minimum, there must exist some ϵ 1 ˆF(q0) that satisfies c ˆF 1(1 ϵ) q0 + Pϵ P(1 ˆF(q0)) = c P 1 ˆF(q0) ϵ ˆF 1(1 ϵ) q0 (15) Setting d1 = ˆF 1(1 ϵ) q0 make conditions (12) and (15) equivalent. Most importantly, this must also hold for the minimizer d 1 = ˆF 1(1 ϵ ) q0. We can substitute this condition back into the zero-gradient solution of problem (9) to obtain f ˆF 1(1 ϵ) = c P , ϵ 1 ˆF(q0). If such an ϵ does not exist, then the optimal solution must be d 1 = 0 and ϵ = 1 ˆF(q0). Finally, we remark on the assumption that ˆF(q) be strictly increasing and continuous. Since in practice, ˆF(q) is the integral of a KDE ˆf(q), our assumption is always satisfied given an appropriate kernel (e.g., Gaussian). Furthermore as we show next, we can derive an exact formula under the case where the estimated data requirement distribution is Gaussian. 5.2 An Analytic Solution for Gaussian Distributions Theorem 3 demonstrates an equivalence between optimizing how much data to collect given certain costs and penalties versus collecting data according to a minimum pre-specified risk tolerance ϵ. However, to use Theorem 3, we must compute both ˆF(q) and ˆF 1(1 ϵ). With a structural form of this CDF, we may further simplify Theorem 3 to obtain an exact analytic formula on how much data to collect. We focus on the case where the estimated data requirement follows a Gaussian distribution D N(ˆµ, ˆσ) and where F(q) := 1/2 + erf((q ˆµ)/( 2ˆσ))/2 is the corresponding Gaussian CDF. The motivation for a Gaussian distribution stems from an observation that even when we estimate ˆF(q) using KDE, the estimated distribution is often unimodular (see Appendix C.1 for details on this observation). As such, an intuitive strategy may be to simply estimate ˆF(q) using a single Gaussian distribution rather than with many Gaussian kernels. We first consider the case where our estimated CDF is exactly equal to the true CDF ˆF(q) = F(q). Here, problem (9) simplifies to min d1 cd1 + P 1 erf q0 + d1 ˆµ We can obtain an exact solution for any cost and penalty values. We first show that if the penalty P is too small, then the optimal solution to problem (16) is to collect no data, Mahmood, Lucas, Alvarez, Fidler, and Law i.e., the penalty for failing to achieve V is insufficient to outweigh the cost of collecting additional data. We then show that given a sufficiently large P, the optimal amount of data to collect can only take one of two possible values. Proposition 5 Suppose that ˆF(q) models a Gaussian distribution N(ˆµ, ˆσ). If P < cˆσ 2π, then the optimal solution to problem (16) is d 1 = 0. Proof The gradient of the objective function of problem (16) is 2πˆσ exp (q0 + d1 ˆµ)2 Setting this gradient to zero yields However, this only exists if P > cˆσ 2π; otherwise if P < cˆσ 2π, then there is no zerogradient solution to problem (16). Furthermore, by inspection, the objective function is monotone non-decreasing, meaning that the minimizer is the boundary point d 1 = 0. Proposition 5 states that if the penalty is less than cˆσ 2π, then it is strategically better to not collect any data at all. Recall that P represents a potential loss from failing to develop a machine learning model that achieves V performance. Thus, not collecting additional data is equivalent to deciding not to build this model and simply accepting the penalty. This represents real-world scenarios where the costs of developing a machine learning model outweigh the benefits that this machine learning model can yield to the developer. If the penalty is sufficiently large such that d 1 > 0, then building the machine learning model can be deemed useful. The minimum value on the penalty parameter is dependent only on the cost and the standard deviation ˆσ of the estimated data requirement distribution. When using LOC, if the variance of the estimated distribution is too large, then there is an increasingly larger chance that the minimum amount of data required D is low. If the penalty for undercollecting is low compared to this probability, then the optimal strategy would be to refrain from collecting data and correspondingly incur the penalty, rather than overcollect data. Figure 3 (Left) visualizes this trend by sweeping different values for P and plotting the objective function of problem (16). Here, if P < ˆσ 2π (blue and orange curves), then the optimal d 1 = 0. Corollary 6 Suppose that ˆF(q) models a Gaussian distribution N(ˆµ, ˆσ) and P > cˆσ log P log(cˆσ 2π). The optimal solution to problem (16) satisfies: 1. If q0 ˆµ 2ˆσζ q0 if c P < erf ζ erf q0 ˆµ 0 otherwise Optimizing Data Collection for Machine Learning 2ˆσζ < q0 ˆµ + 2ˆσζ, then d 1 = ˆµ + 3. If q0 ˆµ + 2ˆσζ, then d 1 = 0. Proof Following the same steps from the proof to Proposition 5, we find that the optimal solution must be a zero-gradient point or the boundary point d 1 n 0, ˆµ 2ˆσζ q0 o , conditioned on whether the zero-gradient solution is feasible. Furthermore, we can show below that ˆµ 2ˆσζ q0 is a local maximizer whereas ˆµ + 2ˆσζ q0 is a local minimizer via the second-order condition. Specifically, note that the second derivative of the objective function is P (q0 + d1 ˆµ) (q0 + d1 ˆµ)2 Substituting d1 = ˆµ 2ˆσζ q0 into the above equation admits a negative second derivative, whereas d1 = ˆµ + 2ˆσζ q0 admits a positive second derivative. Therefore, the optimal amount of data to collect can only be in d 1 n 0, ˆµ + 2ˆσζ q0 o . We now break the problem down into the three cases. First, if q0 ˆµ 2ˆσζ, then both the local minimizer and the boundary point are feasible. For d 1 = ˆµ + 2ˆσζ to be the global minimizer, the condition on the c/P ratio from Theorem 3 must hold. That is, 1 2 1 + erf q0+ 2 1 + erf q0 ˆµ 2ˆσζ q0 = erf ζ erf q0 ˆµ Otherwise, the optimal solution in this regime is d 1 = 0. Second, if ˆµ 2ˆσζ < q0 ˆµ + 2ˆσζ, then note that the objective function within the feasible region for d1 consists of a curve with a single local minimum and no other zero-gradient points. Therefore, this local minimum d 1 = ˆµ + 2ˆσζ must be the global minimizer. Finally, if q0 ˆµ + 2ˆσζ, then none of the zero-gradient points are feasible solutions. Therefore, the optimal solution must lie on the boundary point d 1 = 0. Corollary 6 gives an exact characterization of the optimal data collection problem when ˆF(q) is modeled via a single Gaussian distribution. We first confirm that the penalty is sufficiently high for the problem to be meaningful; if the penalty for under-collection is too small, then it would incur lower costs to simply not collect any data. Managerially, this implies that the machine learning product is a poor investment as the costs of data collection are higher than simply not building the machine learning model. Mahmood, Lucas, Alvarez, Fidler, and Law 0 200000 400000 600000 800000 d1 Collection Costs with different P N(106, 104), q0 = 104, c = 1 1e+03 1e+04 1e+05 1e+06 300000 400000 500000 q0 + d1 Collection Costs with different q0 N(106, 104), c = 1, P = 105 Figure 3: Evaluating problem (16). Left: We set c = 1, q0 = 104 and sweep different values for P. If P is sufficiently small (i.e., P < cˆσ 2π), then the total expected cost is minimized by setting d1 = 0 (i.e., orange, blue curves). Right: We set c = 1, P = 105 and sweep different values for q0. The dashed lines point to the local maxima and minima at ˆµ 2ˆσζ. When q0 ˆµ + 2ˆσζ, the optimal d 1 = 0 (i.e., red, purple). When q0 [ˆµ 2ˆσζ], the optimal d 1 is at the zero-gradient minima (i.e., green, red). When q0 < ˆµ 2ˆσζ, the optimal d 1 depends on the relationship between q0, c, P. That is, d 1 = 0 for the blue curve, but d 1 is equal to the zero-gradient minima for the orange curve. Assuming the penalty is sufficiently high, we can compare the value of the initial amount of data q0 to the critical points ˆµ 2ˆσζ. Specifically, if q0 is greater than ˆµ + 2ˆσζ, then the optimal solution to the data collection problem is to not collect any data. This scenario is equivalent to setting a small value of P (c.f. Proposition 5), since ζ is a function of P, c, and ˆσ. On the other hand, if q0 is between ˆµ 2ˆσζ and ˆµ + 2ˆσζ, then the optimal solution is always to collect data up to having ˆµ + 2ˆσζ points. Finally, if q0 is very small, i.e., less than ˆµ 2ˆσζ, then we must again check if the ratio of cost over penalty is sufficiently small before collecting data. If this ratio is too high, then the penalty is again too low, and the optimal strategy would be to not collect any data and incur the penalty. Figure 3 (Right) visualizes the implications of Corollary 6 by evaluating the objective function of problem (16) for different values of q0; we plot with respect to q = q0 + d1, since the zero-gradient points for this formulation are always ˆµ 2ˆσζ for any q0. Here, we observe that if q0 ˆµ 2ˆσζ, the optimal total amount of data is q = max ˆµ + 2ˆσζ, q0 (i.e., the green, red, and purple curves). On the other hand if q0 < ˆµ 2ˆσζ, then the optimal amount of data to collect is equal to the zero-gradient minimizer only if q0 is sufficiently large with respect to ζ, and therefore c and P (i.e., the orange curve). Otherwise d 1 = 0 and q = q0 (i.e., the blue curve). This condition for how large q0 must be can be most easily determined by checking whether the cost-to-penalty ratio c/P satisfies the conditions from Theorem 3. 5.3 Regret Bounds under Noisy Estimates of the Gaussian CDF We have shown so far that in the single-round setting, the optimal decision d 1 has an analytic solution, especially when the estimated ˆF(q) of D follows a Gaussian distribution N(ˆµ, ˆσ). Optimizing Data Collection for Machine Learning In practice however, the estimated ˆF(q) = F(q) will differ from the true CDF of D . In this sub-section, we assume that D N(µ, σ) follows a Gaussian distribution and ˆF(q) is a noisy estimate where ˆµ = µ and ˆσ = σ. Our goal is to study the expected regret R(d1) from a data collection decision d1 made via the noisy estimate ˆF(q) versus the true optimal solution D to problem (9): R(d1) :=ED N(σ,µ) h cd1 + P1 {q0 + d1 < D } c (D q0) i =c (d1 µ + q0) + P (1 F(q0 + d1)) . This regret is measured in expectation over the true distribution of D . We compare two data collection strategies on the amount of regret incurred. First, we consider LOC, which determines the amount of data to collect d 1 from Corollary 6. Specifically, we focus on the case where d 1 = ˆµ + 2ˆσζ q0, rather than equal to 0. To compare against LOC, we consider the na ıve estimation-only baseline from Section 3.2. This baseline sets d1 = ˆµ q0, which means it collects enough data to reach the estimated expected value of the minimum data requirement. We show that when the data collection decision is made using the LOC strategy, the expected regret is lower than if the d1 is determined using the estimation-only approach. We consider two scenarios, where the estimated distribution is noise-free (i.e., ˆµ = µ, ˆσ = ˆσ) and one where the estimated mean and variance differ from their true values (i.e., ˆµ = µ, ˆσ = σ). Below, we first consider the noise-free setting to demonstrate that LOC incurs a lower expected regret over the distribution of D versus the na ıve estimation-only approach. Lemma 7 Suppose we perfectly estimate F(q), i.e., ˆµ = µ, ˆσ = σ. Then, R(d 1) R(ˆµ q0) = P Proof The inequality follows from the fact that d 1 minimizes the objective cd1 + P(1 F(q0 + d1)). The equality follows from substituting ˆµ q0 into the regret equation. Lemma 7 states that even when we know the exact distribution D , it is disadvantageous to make data collection decisions from using only the estimate. Due to the inherent variability in the distribution, the expected regret is P/2 when using an estimation-only strategy. Specifically, using the expected value means that with 50% probability, the na ıve strategy will under-estimate how much data to collect, and consequently incur the penalty. Instead, setting d 1 according to Corollary 6 will incur a lower expected regret. Note that Lemma 7 holds for any symmetric probability distribution on D and not just the Gaussian setting. We now present our main theoretical result, which explores the case where our estimated distribution N(ˆµ, ˆσ) of the minimum data requirement D differs from the true distribution N(µ, σ). If the estimated distribution is sufficiently close to the true distribution, then the data collection decision generated by LOC will still incur a lower expected regret than a data collection decision generated by estimation alone. This sufficiency is characterized as an upper bound on the Total Variation distance between the two distributions, with respect to the cost, penalty, initial data, and estimated mean. Mahmood, Lucas, Alvarez, Fidler, and Law Proposition 8 Let d TV := sup A | Pr D N(µ,σ){A} Pr D N(ˆµ,ˆσ){A}| be the Total Variation distance between N(µ, σ) and N(ˆµ, ˆσ). If d TV c 2P (ˆµ q0) 1 then, R(d 1) R(ˆµ q0). Proof We want to show 0 R(d 1) R(ˆµ q0) (18) 2ˆσζ + P 1 F(ˆµ + 2ˆσζ) c (ˆµ µ) P (1 F(ˆµ)) (19) 2ˆσζ + P Pr D N(µ,σ){ˆµ + 2ˆσζ < D } Pr D N(µ,σ){ˆµ < D } (20) 2ˆσζ P Pr D N(µ,σ){ˆµ D < ˆµ + 2ˆσζ} . (21) Above, (19) applies the definition of R(d 1) and R(ˆµ q0), (20) rewrites the CDFs and (21) collects the corresponding probabilities. Note from Corollary 6 that d 1 = ˆµ + 2ˆσζ implies it incurs a lower objective function value that d 1 = 0 with respect to problem (9). That is, 2ˆσζ q0 + P Pr D N(ˆµ,ˆσ){ˆµ + 2ˆσζ < D } P Pr D N(ˆµ,ˆσ){q0 < D } . (22) By adding and subtracting P Pr D N(µ,σ){ˆµ + 2ˆσζ D } and P Pr D N(µ,σ){q0 D }, respectively, on both sides of the above equation and re-arranging the terms, we obtain 2ˆσζ q0 P Pr D N(µ,σ){q0 < D } Pr D N(µ,σ){ˆµ + P Pr D N(ˆµ,ˆσ){q0 < D } Pr D N(µ,σ){q0 < D } + P Pr D N(µ,σ){ˆµ + 2ˆσζ < D } Pr D N(ˆµ,ˆσ){ˆµ + 2ˆσζ < D } (23) Collecting the probability terms in the left-hand-side of (23) and bounding the right-hand-side by the total variation distance yields 2ˆσζ + c (ˆµ q0) P Pr D N(µ,σ){q0 < D < ˆµ + 2ˆσζ} 2Pd TV (24) 2ˆσζ + c (ˆµ q0) P Pr D N(µ,σ){q0 < D < ˆµ} P Pr D N(µ,σ){ˆµ D < ˆµ + 2ˆσζ} 2Pd TV (25) Above (25) breaks the probability into two independent components between [q0, ˆµ) and [ˆµ, ˆµ + Optimizing Data Collection for Machine Learning From (25), it remains only to prove that 2Pd TV + P Pr D N(µ,σ){q0 D < ˆµ} c(ˆµ q0) 0. Here, we rearrange the Total Variation distance bound (17) to c(ˆµ q0) P(2d TV + 1) P 2d TV + Pr D N(µ,σ){q0 D < ˆµ} , showing that the inequality is satisfied and proving inequality (21). Proposition 8 guarantees that even when LOC uses a noisy estimate of ˆF(q) of the true CDF F(q) of the minimum data requirement, the optimization problem solved by LOC will yield, on average, lower cost and regret when compared to a strategy that only estimates D from the noisy ˆF(q). This guarantee holds under condition (17), which states that the estimated distribution is sufficiently close to the true distribution. We first note that this upper bound can be computed from known values and furthermore, can be controlled by appropriate selection of the penalty parameter. Specifically, if we suspect the estimated ˆF(q) is far from F(q), then d TV is large, and we can select a smaller P to ensure that LOC will generate a high-quality data collection decision. If we suspect that ˆF(q) is close to F(q) or if we observe that ˆµ q0 is large, then we are safe to select a large P and still ensure that LOC will generate a high-quality decision. Finally, we note that the bound in Proposition 8 is a sufficiency condition only and not necessary for guaranteeing the quality of LOC-generated data collection decisions. 6. Empirical Results In this section, we numerically evaluate LOC on data collection for six computer vision data sets spread over three tasks: image classification, object detection, and segmentation. We evaluate on K = 1 and K = 2, which are the two most common use-cases for obtaining high-level collection guidelines on data collection. In the first setting, a firm can only determine the data set size without focusing on the type of data. The second setting reflects a common problem class where data can be categorized into expensive and inexpensive types, e.g., long-tail versus common data or acquiring real versus auto-labeled data via semi-supervised learning. Our experiments reveal: Section 6.2: For every data set and task, LOC significantly reduces the number of instances where we fail to meet the data requirement V , while incurring only marginally more expensive costs compared to an oracle policy that determines exactly the minimum amount of data. In contrast, data collection decisions via estimation alone almost always leads to undercollecting and failing to meet V . Section 6.3: LOC is robust to both the cost and penalty parameters. For nearly all settings, modifying either parameter by orders of magnitude does not significantly affect the quality of decisions. Therefore in practical use, it is sufficient to roughly estimate the costs and penalties rather than obtain precise values. Section 6.4: LOC can be easily adapted to solve custom economic analyses and machine learning scenarios. We demonstrate two unique examples where a firm would like to (i) modify an existing machine learning model to now accommodate a new class; Mahmood, Lucas, Alvarez, Fidler, and Law and (ii) contrast two entirely different data collection policies. For both cases, LOC yields appropriate data collection decisions. We expand on our experiment setup in Appendix B. We expand on the main results in Appendix C and include further ablations and comparisons against different baselines. 6.1 Experiment Setup We explore three tasks for K = 1. First, we consider classification on CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), and Image Net (Deng et al., 2009), where we train Res Nets (He et al., 2016) to meet a target validation accuracy. We explore semantic segmentation using Deeplabv3 (Chen et al., 2018) on BDD100K (Yu et al., 2020), which is a large-scale driving data set, as well as Bird s-Eye-View (BEV) segmentation on nu Scenes (Caesar et al., 2020) using the Lift Splat architecture (Philion and Fidler, 2020); both tasks require a target mean intersection-over-union (Io U). Finally, we explore 2-D object detection on PASCAL VOC (Everingham et al., 2007, 2012) using SSD300 (Liu et al., 2016), where we evaluate mean average precision (m AP). We evaluate two tasks for K = 2. First, we mimic the scenario of long-tail or imbalanced learning where data for some classes (e.g., long-tail) is much more expensive to collect than others. Here, we divide CIFAR-100 into two subsets containing data from the first and last 50 classes, respectively, and assign a higher cost to the first 50 classes versus a lower cost to the last 50. Our second experiment models the scenario where one can acquire real labeled data versus use a prior model to cheaply autolabel data. Labeled data incurs a higher cost from collection plus annotation whereas autolabeled data incurs only collection costs, since autolabeled annotations are effectively free. We design this experiment using the labeled and unlabeled data splits of BDD100K. We simulate the deep learning workflow following the procedure of Mahmood et al. (2022b), to approximate the true problem while simplifying the experiments (see Appendix B for details). To avoid repeatedly sampling data, re-training a model, and evaluating the score, this simulation uses a piecewise-linear approximation of a ground truth learning curve that returns model performance as a function of data set size. We initialize with q0 = 10% of the full data set (we use 20% for VOC). In each round, we solve for the amount of data to collect and call the piecewise-linear learning curve to obtain the current score. For each data set and task, we consider T = 1, 3, 5 rounds. For each T, we sweep V [V (Dq0) + 1, V (D)] where D is the entire labeled data set; we then aggregate the LOC performance over the full range of potential target settings. For example, in our experiments of optimizing the number of training examples of CIFAR-100 to collect, we find that the initial dataset Dq0 achieves 42% accuracy and D achieves 75% accuracy; consequently, we sweep over all values of V [42, 75] and evaluate the average performance of the data collection policies. We first evaluate policies on the failure rate, which is the fraction of V settings in which our data collection policy fails to achieve the target performance within T rounds. Second, we evaluate the cost ratio c T (q T q0) c T (D q0) 1, Optimizing Data Collection for Machine Learning which is the relative sub-optimality of the policy compared to an oracle policy that knows exactly D a priori. To ensure that this metric is non-negative and is scaled reasonably, we average the cost ratio over all settings of V in which the policy yields a model that exceeds V ; i.e., this metric ignores instances that fail to collect enough data. For K = 1, we also measure the ratio of points collected q T /D following Mahmood et al. (2022b). Although there is a trade-offbetween low cost ratio (undercollecting) and failure rate (overcollecting), we emphasize that our goal is to have low cost but with zero chance of failure. Our primary baseline is the conventional estimation detailed in Section 3.2, which fits a regression model to the learning curve statistics, extrapolates the learning curve for more data, and then solves for the minimum data requirement under this extrapolation. We refer to this as the Power Law Regression approach. In addition, we also ablate the effect of optimizing the minimum data requirement by comparing against an intermediate estimation baseline which estimates the distribution of the minimum data requirement and selects the average estimated value. This baseline, referred to as Ensemble Regression, leverages the bootstrapping and density estimation procedure to learn the distribution. There are many different regression models that can be used to fit learning curves (Jones et al., 2003; Figueroa et al., 2012; Hestness et al., 2017; Hoiem et al., 2021; Viering and Loog, 2022). Since power laws are the most commonly studied approach in the neural scaling law literature, we focus on power laws here, but compare against the other functions from Table 1 in Appendix C.2. 6.2 Main Results: The Value of Optimization over Estimation Below, we discuss results for K = 1 and K = 2. 6.2.1 Case with K = 1 We first discuss experiments on K = 1. Figure 4 compares LOC versus the corresponding power law regression baseline when c = 1. We fix P = 107 as the default penalty for CIFAR10, CIFAR-100, BDD100K, and nu Scenes, but set P = 108 for Image Net and P = 106 for VOC1. If a curve is below the black line, then it failed to collect enough data to meet the target. LOC consistently remains above this black line for most settings, whereas even with T = 5 rounds, collecting data based only on regression estimates leads to failure. Table 2 aggregates failure rates and cost ratios for each setting. Here, LOC fails at less than 10% of instances for 12/18 settings, whereas regression fails over 30% for 15/18 settings. In particular, regression nearly always under-collects data when given a single T = 1 round. Here, LOC reduces the risk of under-collecting by 40% to 90% over the baseline. Moreover, our cost ratios are consistently less than 0.5 for 12/18 settings, meaning that we spend at most 50% more than the true minimum cost. Table 2 also ablates the effect of optimization versus only estimating the distribution. The ensemble uses the same KDE distribution as in LOC but simply outputs the mean of the distribution. First, LOC always outperforms the Ensemble Regression baseline, showing that 1. This tuning is due to the fact that Image Net naturally has an order of magnitude more data within it, meaning that the scale of data required to reach target performances is naturally higher; nonetheless in Section 6.3, we demonstrate that our LOC results are generally robust to order-of-magnitude shifts in the penalty parameter for most settings. Mahmood, Lucas, Alvarez, Fidler, and Law 80 85 90 V * Ratio of points q * /D * CIFAR-10 (T=1) Power Law LOC 80 85 90 V * Ratio of points q * /D * CIFAR-10 (T=3) Power Law LOC 80 85 90 V * Ratio of points q * /D * CIFAR-10 (T=5) Power Law LOC 40 50 60 70 V * Ratio of points q * /D * CIFAR-100 (T=1) Power Law LOC 40 50 60 70 V * Ratio of points q * /D * CIFAR-100 (T=3) Power Law LOC 40 50 60 70 V * 0.8 Ratio of points q * /D * CIFAR-100 (T=5) Power Law LOC 45 50 55 60 65 V * Ratio of points q * /D * Imagenet (T=1) Power Law LOC 45 50 55 60 65 V * Ratio of points q * /D * Imagenet (T=3) Power Law LOC 45 50 55 60 65 V * Ratio of points q * /D * Imagenet (T=5) Power Law LOC 66 68 70 72 74 V * Ratio of points q * /D * Power Law LOC 66 68 70 72 74 V * Ratio of points q * /D * Power Law LOC 66 68 70 72 74 V * Ratio of points q * /D * Power Law LOC 20 22 24 26 V * Ratio of points q * /D * BDD100K (T=1) Power Law LOC 20 22 24 26 V * Ratio of points q * /D * BDD100K (T=3) Power Law LOC 20 22 24 26 V * Ratio of points q * /D * BDD100K (T=5) Power Law LOC 22 24 26 28 30 V * Ratio of points q * /D * nu Scenes (T=1) Power Law LOC 22 24 26 28 30 V * Ratio of points q * /D * nu Scenes (T=3) Power Law LOC 22 24 26 28 30 V * Ratio of points q * /D * nu Scenes (T=5) Power Law LOC Figure 4: Mean standard deviation of 5 seeds of the ratio of data collected q T /D for different V . The rows correspond to T = 1, 3, 5 and the columns to different data sets. The black line corresponds to collecting exactly the minimum data requirement. LOC almost always remains slightly above the black line, meaning we rarely fail to meet the target. Optimizing Data Collection for Machine Learning Data set T Power Law Regression Ensemble Regression LOC (Ours) Failure rate Cost ratio Failure rate Cost ratio Failure rate Cost ratio CIFAR-10 1 100% 100% 60% 0.19 0.00 3 95% 0.00 0.00 100% 32% 0.05 0.00 5 86% 0.00 0.00 92% 0.00 0.00 29% 0.03 0.00 CIFAR-100 1 56% 0.12 0.00 57% 0.11 0.00 4% 0.99 0.01 3 48% 0.10 0.00 51% 0.10 0.00 3% 0.31 0.00 5 48% 0.10 0.00 46% 0.09 0.00 2% 0.19 0.00 Imagenet 1 99% 0.00 0.00 100% 37% 0.49 0.00 3 75% 0.01 0.00 96% 0.00 0.00 5% 0.16 0.00 5 56% 0.01 0.00 82% 0.00 0.00 2% 0.10 0.00 BDD100K 1 77% 0.03 0.01 69% 0.07 0.01 12% 2.03 0.10 3 31% 0.00 0.00 23% 0.03 0.00 0% 0.72 0.03 5 23% 0.01 0.00 15% 0.03 0.00 0% 0.35 0.02 nu Scenes 1 95% 0.00 0.00 95% 0.00 0.00 52% 0.16 0.00 3 71% 0.01 0.00 90% 0.00 0.00 0% 0.09 0.00 5 62% 0.00 0.00 76% 0.00 0.00 0% 0.04 0.00 VOC 1 36% 1.24 0.06 33% 1.12 0.06 25% 0.56 0.02 3 8% 0.88 0.04 6% 0.80 0.04 0% 1.10 0.07 5 6% 0.86 0.04 6% 0.81 0.04 0% 0.84 0.04 Table 2: Average cost ratio standard error and failure rate measured over a range of V for each T and data set. We fix c = 1 and P = 107 (P = 106 for VOC and P = 108 for Image Net). The best performing failure rate for each setting is bolded. LOC consistently reduces the average failure rate, often down to 0%, while keeping the average cost ratio almost always below 1 (i.e., spending at most 2 the optimal amount). optimization is indeed necessary when determining how much data to collect. Furthermore, the Ensemble Regression only outperforms the na ıve Power Law Regression baseline for BDD100K and VOC. For the other four data sets, the two baselines are either equivalent or Power Law Regression outperforms Ensemble Regression. This shows that estimating the distribution of the minimum data requirement is insufficient for determining how much data to collect. Specifically, estimating this distribution can capture the uncertainty in the na ıve Pow Law Regression estimator, but knowing this stochasticity does not yield accurate data collection decisions. This result validates the two-step approach of LOC. Finally, we remark that our results here for T = 1 validate our theoretical analysis in Section 5, which showed that when our estimator of the data requirement distribution produces underestimates, it incurs significantly larger regret than LOC. This can be observed particularly in the failure rates, as every failure will incur a penalty of P. Note that the cost ratios listed in Table 2 do not include P, since the penalty is always significantly greater than the collection cost and including it would obfuscate cost comparisons for instances where V was reached. In other words, for CIFAR-10, Image Net, or nu Scenes, the baseline incurs a true unfiltered cost ratio (i.e., factoring in the penalty) at approximately P divided by the oracle cost. Due to the high failure rate of the baseline, this unfiltered cost ratio would be several orders of magnitude higher than the values reported in Table 2. Since, Mahmood, Lucas, Alvarez, Fidler, and Law 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=5) Power Law c=[0.01 0.005] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=5) Power Law c=[0.01 0.002] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=5) Power Law c=[0.01 0.001] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=5) Power Law c=[0.01 0.0005] 40 50 60 70 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=5) Power Law c=[1. 0.1] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=5) Power Law c=[1. 0.05] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=5) Power Law c=[1. 0.01] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=5) Power Law c=[1. 0.005] 36 38 40 42 V * 0 Failure Rate Figure 5: Mean standard deviation over 5 seeds of the cost ratio and failure rate for different V , after removing 99-th percentile outliers. The columns correspond to scenarios where the first set c1 costs increasingly more than the second c2. See Appendix C for the results with T = 1, 3. LOC incurs less than 50% failure rate for all three of these instances, our reported cost ratios more accurately reflect the true values. 6.2.2 Case with K = 2 We now discuss experiments on K = 2. Figure 5 compares LOC versus regression at T = 5 with different costs. LOC decreases the failure rates while keeping a similar cost ratio to the baseline. We include similar plots for T = 1, 3 in Appendix C.4. Table 3 aggregates failure rates and cost ratios for all settings, showing LOC consistently achieves lower failure rates for nearly all settings of T. When T = 5, LOC also achieves lower cost ratios versus regression on CIFAR-100, meaning that with multiple rounds of collection, we can ensure meeting performance requirements while paying nearly the optimal amount of data. However, the optimization problem is generally more difficult as K increases and we sometimes over-collect data by several orders of magnitude margins. Because these outliers drastically skew the summary average statistics that we measure, we remove the 99-th percentile with respect to total cost for both LOC and the baseline regression estimator. In practice, outlier data collection decisions would naturally be pruned by decision-makers via common-sense reasoning. For instance, if an algorithm suggests to collect 10 million unlabeled images when a human expert would guess at 10 thousand images, the practical decision would not be to blindly use the algorithmic solution. Although the results in this section compare specifically against power law-based regression, Mahmood et al. (2022b) show that we can use other regression functions (e.g., see Table 1) as well as modifications to reduce the undercollection of these estimation-based Optimizing Data Collection for Machine Learning Data set T Cost Power Law Regression LOC Failure rate Cost ratio Failure rate Cost ratio CIFAR-100 (2 Types) (0.01, 0.0005) 62% 0.84 0.02 40% 41.80 1.80 (0.01, 0.001) 58% 1.19 0.01 46% 9.85 0.27 (0.01, 0.002) 56% 1.55 0.01 54% 6.98 0.18 (0.01, 0.005) 54% 1.65 0.01 33% 4.43 0.07 (0.01, 0.0005) 43% 3.47 0.16 30% 4.88 0.26 (0.01, 0.001) 45% 1.22 0.02 43% 1.31 0.03 (0.01, 0.002) 45% 1.47 0.02 44% 1.21 0.02 (0.01, 0.005) 38% 1.31 0.01 36% 1.17 0.01 (0.01, 0.0005) 38% 3.31 0.16 24% 5.19 0.22 (0.01, 0.001) 35% 1.22 0.02 24% 0.79 0.01 (0.01, 0.002) 37% 1.33 0.01 38% 0.90 0.01 (0.01, 0.005) 36% 1.30 0.01 24% 0.82 0.00 BDD100K (Semi-supervised) (1, 0.005) 86% 0.11 0.01 44% 7.02 1.11 (1, 0.01) 79% 0.15 0.01 30% 13.47 1.50 (1, 0.05) 72% 0.19 0.01 49% 1.02 0.19 (1, 0.1) 70% 0.19 0.01 65% 0.40 0.05 (1, 0.005) 23% 0.18 0.01 7% 1.20 0.12 (1, 0.01) 21% 0.15 0.00 7% 2.57 0.58 (1, 0.05) 26% 0.18 0.01 23% 0.50 0.06 (1, 0.1) 26% 0.21 0.01 30% 0.15 0.01 (1, 0.005) 16% 0.22 0.01 2% 1.91 0.30 (1, 0.01) 21% 0.15 0.00 2% 0.86 0.13 (1, 0.05) 16% 0.17 0.01 9% 0.27 0.03 (1, 0.1) 16% 0.20 0.01 7% 0.32 0.03 Table 3: Average cost ratio standard error and failure rate over different V for each T and c, after removing 99-th percentile outliers. We fix P = 1013 for CIFAR-100 and P = 108 for BDD100K. The best performing failure rate for each setting is bolded. LOC reduces the average failure rate, is more robust to uneven costs than regression, and for T > 1, preserves the cost ratio. baselines. In Appendix C.2, we include further results comparing against these alternate functions. In every setting, we show that while baseline estimators consistently either underor overestimate how much data to collect, LOC consistently reduces either the cost or the failure rates significantly. 6.3 Robustness to Cost and Penalty Parameters Figure 6 evaluates the ratio of points collected for T = 5 when the cost and the penalty of the optimization problem are varied. We include similar results for T = 1, 3 in Appendix C.5. LOC is robust to variations in these parameters, as LOC retains the same shape and scale for almost every parameter setting and data set. Further, LOC consistently remains above the horizontal 1 line, showing that even after varying c and P, we do not fail as frequently as the baseline. Finally, validating Theorem 3, the penalty parameter P provides natural Mahmood, Lucas, Alvarez, Fidler, and Law 80 85 90 V * Ratio of points q * /D * CIFAR-10 (T=5) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 40 50 60 70 V * Ratio of points q * /D * CIFAR-100 (T=5) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 45 50 55 60 65 V * Ratio of points q * /D * Imagenet (T=5) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 80 85 90 V * Ratio of points q * /D * CIFAR-10 (T=5) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 40 50 60 70 V * 0.8 Ratio of points q * /D * CIFAR-100 (T=5) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 45 50 55 60 65 V * Ratio of points q * /D * Imagenet (T=5) Power Law P=1e+07 P=1e+08 P=1e+09 P=1e+10 66 68 70 72 74 V * Ratio of points q * /D * Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 20 22 24 26 V * Ratio of points q * /D * BDD100K (T=5) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 22 24 26 28 30 V * Ratio of points q * /D * nu Scenes (T=5) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 66 68 70 72 74 V * Ratio of points q * /D * Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 20 22 24 26 V * Ratio of points q * /D * BDD100K (T=5) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 22 24 26 28 30 V * Ratio of points q * /D * nu Scenes (T=5) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 Figure 6: Mean standard deviation of 5 seeds of the ratio of data collected q T /D for different V and fixed T = 5. Rows 1 & 3: We sweep the cost parameter from 0.001 to 1 and fix P = 107. Rows 2 & 4: We sweep the penalty parameter from 106 to 109 and fix c = 1. The dashed black line corresponds to collecting exactly the minimum data requirement. See Appendix C for all T. control over the amount of data collected. As we increase P, the ratio of data collected increases consistently. 6.4 Adapting LOC to Custom Modeling Problems Our optimal data collection framework and LOC can also be adapted to solve custom questions faced by machine learning developers. Here, we present two case studies where we demonstrate how LOC yields high-quality data collection decisions. Optimizing Data Collection for Machine Learning T Power Law Regression LOC Failure rate Cost ratio Failure rate Cost ratio 1 89% 0.19 8% 1.35 3 38% 0.78 1% 0.63 5 24% 0.64 0% 0.26 Table 4: On CIFAR-100, average cost ratio and failure rate for beaver (new class) measured over a range of V for each T, when the model is initialized with only 99 classes and zero training examples for the new class. We fix c = 1 and P = 105. The best performing failure rate for each setting is bolded. LOC consistently achieves less than 10% failure rate, while keeping the average cost ratio even lower than the estimation baseline when T 3. 40 50 60 70 80 V * Collection Policies for BDD100K Active Learning Autolabeling Figure 7: Mean standard deviation of 5 seeds when comparing the total collection cost to reach different V on BDD100K depending on active learning versus autolabeling. We fix c L = 0.995 and c U = 0.005. Autolabeling is more cost-effective until V 70 m Io U, at which point the diminishing power law of the unlabeled data make active learning more effective. 6.4.1 Adding New Classes to an Existing Model Suppose that we have an existing M-class classifier achieving a desired accuracy V for each of the current classes individually. We now want to update this model with a new (M + 1)-th class. We require the classifier to also achieve V validation set accuracy for this class as well or else we will pay P. If we can collect training examples for this specific class at a per-sample cost c, how many should we obtain within the time horizon T? To address this problem, we use LOC where qt represents the number of training examples of the (M + 1)-th class. However, q0 = 0 since this is a new class; this means that we cannot fit training statistics to estimate F(q) at time t = 1. Instead in the first data collection round, we select one of the existing M classes, model the data requirement distribution for this prior class, and use this as a proxy for F(q). If T > 1, then after the first round, we will have an initial amount of data for the (M + 1)-th class, which we can use to learn F(q) using the standard technique. We simulate this problem using CIFAR-100, where we set M = 99 and first train the model using the training set data of all classes except for beaver (the 100-th class). For the first round estimate, we design F(q) using the statistics of dolphin , which is a similar class in the same hierarchy. Table 4 provides the failure rate and cost ratio for both LOC and the power law regression baseline. Here, LOC consistently achieves less than 10% failure rate and moreover, outperforms the baseline on both the failure rate and cost ratio for both T = 3, 5. Note that since we start with q0 = 0 initial data, our first estimation of F(q) will be poor for both our method and the baseline; this leads the baseline to high failure rates for T = 1 and surprisingly high costs for T = 3, 5. Instead, LOC, by virtue of optimization, yields low failure rates for T = 1 and strong performance at larger T. Mahmood, Lucas, Alvarez, Fidler, and Law 6.4.2 Deciding Between Two Data Annotation Campaigns Suppose that we have collected a large pool of unlabeled data. To annotate each data point, there is a small cost c C for collecting and pre-processing an unlabeled image and a labeling cost c L > c C for manually labeling the image. Using an initial labeled set of q0 points, we must decide between two potential annotation strategies via T = 1 round forecasts: 1. Autolabeling: Rather than manually labeling every data point, we can use our current model to synthetically generate labels for the unlabeled data. These autolabeled points may not as effective as the manually labeled points, but autolabeling incurs zero additional labeling costs. Consider a strategy where we manually label a fraction of the data that we have collected (i.e., incurring the cost of collecting and labeling the data c C + c L) and autolabel the rest of the data (i.e., incurring only the collection cost c C per data point). Let q = (q L, q C) where q L is the amount of manually labeled data and q U is the amount of unlabeled data that we autolabel. Then, our total cost under this strategy is (c L + c C) (q L q0) + c Cq C + P1 {Vq V } . 2. Active learning: Since the quality of autolabeled data may not be as high as that of manually labeled data, we may instead opt to only manually label data. Moreover, we can re-direct the engineering effort of building an autolabeler towards designing an active learning framework, which mines the collected (unlabeled) data to find the most useful training examples. With active learning, q = q L is the amount of data that is manually labeled. Furthermore, the cost of labeling each data point is equal to the cost of collection plus labeling c L + c C. The optimization objective under this strategy is (c L + c C) (q L q0) + P1 {Vq L V } Given a target performance V , we can solve both optimization problems to determine which of the two policies is more cost-efficient. We evaluate this experiment using the labeled and unlabeled splits of BDD100K. The autolabeling setting is equivalent to the experiments performed in Section 6 for K = 2. The active learning setting is a special case of the K = 1 experiment where the data collected in each round and subsampled for fitting scaling laws, are collected according to an active learning sampler rather than random sampling. We follow the approach in Mahmood et al. (2022b) where the steps of Algorithm 3 that involve estimating the minimum data requirement distribution are unchanged other than the fact that the data is collected via active learning (see Appendix C.3 for details and additional active learning experiments). For data collection, we use a simple strategy of computing the average Confidence scores of estimates per scene and selecting the Least Confident scenes; we leave further details in Appendix B.2 (Settles, 2009). Figure 7 plots the total cost as determined from an initial q0 = 7, 000 labeled points and costs c L = 0.995 and c C = 0.005; these costs replicate the settings explored in Table 3. We extrapolate the performance and costs up to V = 80. Since in this scenario, autolabeling data is 20 cheaper than manually labeling it, our Optimizing Data Collection for Machine Learning analysis suggests that for moderate target scores (e.g., V = 58 m Io U), we can save up to 100, 000 in total costs. However, in general, active learning produces more high-quality data than autolabeling, so we expect the 2-dimensional power law for autolabeling to flatten sooner than the active learning estimator. For V 70, the minimum data requirement and consequently, the total cost of autolabeling grows exponentially and exceeds the total cost of active learning. This leads us to an intuitive conclusion: in applications requiring high performance models, at some point it becomes more important to use high-quality data rather than simply lots of data. 7. Discussion Decisions on how much data to collect to improve a given machine learning model are fundamental in all machine learning applications, but such methods are typically handled by estimating a scaling law of training data set size to model performance and then extrapolating future behavior. This na ıve extrapolation tends to lead to costly decisions from overcollecting data or delays and future costs from undercollecting data. In our paper, we develop a rigorous framework for optimizing data collection workflows by introducing an optimal data collection problem that captures the uncertainty in estimating data requirements. Our general framework can model a variety of settings such as where multiple data sources incur different collection costs, where an existing model must be upgraded with data for a new class, or where we must assess the benefits of different sampling and labeling practices. We numerically validate our solution algorithm, Learn-Optimize-Collect, on six computer vision data sets covering classification, segmentation, and detection tasks to show that we consistently meet pre-determined performance metrics regardless of costs and time horizons. Our optimization model minimizes the expected total future collection cost, which is modeled by two parameters: (i) a per-sample cost c that models the cost of collecting, cleaning, and annotating data; and (ii) a penalty P that models the opportunity cost of the model failing to meet our desired performance metric. The second parameter may not be readily available in practice. However, we show empirically that LOC is typically robust to parameter variations on one to three orders of magnitude. That is, as long as we can roughly estimate these parameters for our setting, we will still be able to make good data collection decisions. Moreover, we theoretically analyze the one-round T = 1 setting to draw two high-level insights. First, our problem is equivalent to specifying a minimum tolerance ϵ Pr{Vq < V } on the probability that we do not collect enough data and simply collecting the quantile ˆF 1(1 ϵ) of the distribution of the minimum amount of data needed; consequently, practitioners can instantly obtain one-round estimates for how much data to collect. This analysis can be further specialized when the distribution has a given structure (e.g., Gaussian). Second, we prove that as long as the estimated distribution ˆF is sufficiently close to the true distribution of the minimum data requirement, our optimization model provably improves upon estimation-only strategies in terms of minimizing the total cost. LOC combines neural scaling law estimators with a stochastic optimization problem that can be solved via gradient descent. Our fundamental step is to treat the minimum amount of data we would need as a random variable and bootstrap a neural scaling law estimator to estimate its probability distribution. As future advances in neural scaling laws arrive, our bootstrapping can be deployed on top of new scaling law estimation algorithms to Mahmood, Lucas, Alvarez, Fidler, and Law optimize data collection decisions better than any estimation-only techniques. In particular, estimating neural scaling laws for arbitrary multi-dimensional settings (e.g., when combining different data sources) remains a difficult problem, but a side-product of this paper presents a simple multi-dimensional additive power law estimator that works reasonably well in our experiments. Our framework is designed specifically for settings where the machine learning model and training algorithm are fixed, that the data is generally of high-quality, and that there is no misspecification between model and data. In practice, these assumptions may not always be satisfied. For example, designers may modify the model or the data source in between collection rounds, change the evaluation metric, or seek to address alternative targets. Mathematically, our framework assumes that the desired performance target is achievable with a finite amount of data and that the true neural scaling law of a given problem instance is monotonically non-decreasing with the dataset size (Viering and Loog, 2022). The first assumption is necessary for this problem to be solvable. Although the second assumption is not always satisfied in machine learning tasks, it has been empirically observed to hold in most practical deep learning applications and is a standard assumption when scaling deep learning models (Hestness et al., 2017; Hoffmann et al., 2022). Finally, our numerical experiments rely on simulations with pre-constructed ground truth learning curves v(n). An alternative experimental setup may be to explicitly sample points based on the decisions generated by LOC, train the neural network model, and evaluate its score. However, such repeated retraining is computationally too expensive to perform for the range of experiments explored in this paper. Furthermore, the quality of our simulation is proportional to the number of data points that we sample, meaning that we can maintain accurate experimental analysis. Improving data collection practices yields potentially positive and negative societal impacts. By reducing the collection of extraneous data, we implicitly reduce the environmental costs of training models. On the other hand, equitable data collection should also be considered in real-world data collection practices that involve humans. We envision a potential future work to incorporate privacy and fairness constraints to prevent overor under-sampling of protected groups. Finally, our method is guided by a score function on a held-out validation set. Biases in this set may be exacerbated when optimizing data collection to meet target performance. Finally, we emphasize that this work addresses a longstanding problem in machine learning practice. There is a folklore observation that over 80% of industry machine learning projects fail to reach production, often due to insufficient, noisy, or inappropriate data (Venture Beat, 2019). As mentioned previously, industry surveys have reported that 51% of practitioners face delays from under-collection (Dimensional Research, 2019). Our numerical experiments verify this observation by showing that na ıvely estimating power laws typically leads to undercollection. We believe that robust data collection policies obtained via LOC can reduce failures while further guiding practitioners on how to manage both costs and time. Acknowledgments Optimizing Data Collection for Machine Learning The authors thank the Action Editor and all referees for their valuable comments, which have significantly improved this work. The authors also thank Daiqing Li, Jonah Philion, and Zhiding Yu for valuable feedback and help on earlier versions of this work. Appendix A. An Alternative Partially Observable Markov Decision Process Formulation of the Optimal Data Collection Problem Our problem defined in Section 3 can be written as a Partially Observable Markov Decision Process (POMDP) (Puterman, 2014; Bertsekas, 2012), modeled by the tuple (Θ, A, S, p, rt). Here, the state space characterizes the data requirement D Θ := RK +, the action space characterizes the additional data collected dt := (qt qt 1) A := RK +, and the observation set S := {0, 1} characterizes a binary variable 1{V (Dqt) V }. Furthermore, p( |D , dt) is the observation transition probability and rt( ) is the reward function where rt(qt, qt 1, D ) := c(qt qt 1) for t T and r T+1(qt, qt 1, D ) := P1{q T < D }. Finally, note that the state variable is constant throughout the MDP, meaning this problem can be written as an EK Learning-and-Doing model (Easley and Kiefer, 1988). POMDPs are typically solved by using a belief distribution of the state variable to average the reward in the value function. In general, these methods are susceptible to a curse of dimensionality and can sometimes be only tackled via approximations (Zhao et al., 2021). Alternatively, we may consider applying reinforcement learning. However, note that real-world data collection tasks do not contain the requisite sizes of learning data or generalizable simulation mechanisms that are staples in reinforcement learning techniques. All of these challenges motivate our approach, which has the benefit of being an easy-to-solve optimization problem on top of existing neural scaling law methods. Appendix B. Simulation Experiment Setup The most intuitive approach of validating our data collection problem is by repeatedly sampling from a data set, training a model, and solving the optimization problem. However, since performing a large set of such experiments over many data sets becomes computationally intractable, we follow the approach of Mahmood et al. (2022b), who propose a simulation model of the data collection problem. Below, we summarize the simulation setup. The simulation replicates the steps in Algorithm 3 except with one key difference. In the simulation, we replace the score function V (D) with a ground truth function vgt(q) that serves as an oracle which reports the expected score of the model trained with q data points. Thus, rather than having to collect data and train a model in each round, we evaluate vgt(qt) and treat this as the current model score. The optimization and regression models do not have access to vgt(q). B.1 A Piecewise-Linear Ground Truth Approximation In order to build a ground truth function, we first use the sub-sampling procedure in Algorithm 3 to collect performance statistics over subsets of the entire training data set. Mahmood, Lucas, Alvarez, Fidler, and Law Data set Task Score Full data set size CIFAR-10 (Krizhevsky, 2009) Classification Accuracy 50, 000 CIFAR-100 (Krizhevsky, 2009) Classification Accuracy 50, 000 Image Net (Deng et al., 2009) Classification Accuracy 1, 281, 167 BDD100K (Yu et al., 2020) Semantic Segmentation Mean Io U 7, 000 nu Scenes (Caesar et al., 2020) BEV Segmentation Mean Io U 28, 130 VOC (Everingham et al., 2007, 2012) 2-D Object Detection Mean AP 16, 551 CIFAR-100 (Krizhevsky, 2009) Classification Accuracy 25, 000 (Classes 0-49) 25, 000 (Classes 50-99) BDD100K (Yu et al., 2020) Semantic Segmentation Mean Io U 7, 000 (Labeled) 70, 000 (Unlabeled) Table 5: Data sets, tasks, and score functions considered. Using these observed statistics, we then build a piecewise-linear model of the ground truth. Below, we first highlight how to construct a piecewise-linear model when given a set of data set sizes and their corresponding scores. In the next subsection, we will detail the exact data collection process. The Single-variate (K = 1) Case. Mahmood et al. (2022b) develop a ground truth function by collecting training statistics over subsets of the entire data set and performing a linear interpolation. Let q0 q1 q2 be a series of data set sizes and let Dq0 Dq1 Dq2 be their corresponding sets. Then, the piecewise-linear function: q0 n, q q0 V (Dqt) V (Dqt 1) qt qt 1 (q qt) + V (Dqt 1), qt 1 q qt is concave and monotonically increasing, which follows the general trend of real learning curves Hestness et al. (2017). Furthermore Mahmood et al. (2022b) show that given sufficient resolution, i.e., enough data subsets, this piecewise linear function is an accurate approximation of the true learning curve V (D). The Multi-variate (K = 2) Case. In the previous K = 1 case, the ground truth was formed by taking linear interpolations between different subset sizes. When K > 1, we have multiple subsets that are used to evaluate the score V (D1, . . . , DK). In our numerical experiments, we focus on K = 2. Here, we can generalize the linear interpolation process to a bilinear interpolation (Wang and Yang, 2008). Let q1 0 q1 1 q1 2 and q2 0 q2 1 q2 2 be two series of data set sizes, and consider the grid (q1 0, q2 0) (q1 1, q2 0) (q1 2, q2 0) (q1 0, q2 1) (q1 1, q2 1) (q1 2, q2 1) (q1 0, q2 2) (q1 1, q2 2) (q1 2, q2 2) ... ... ... ... We estimate v(q1, q2) for any q1, q2 [q1 s 1, q1 s] [q2 t 1, q2 t ] via bilinear interpolation over the square defined by these borders (Wang and Yang, 2008). For K > 2. The piecewise linear approximations grow increasingly complex as the dimension K increases. Furthermore, the number of subsets of data set sizes required to create a piecewise linear approximation increases exponentially with K. Specifically for k {1, . . . , K}, let Mk denote the number of subsets (i.e., |{qk 0, qk 1, . . . , qk Mk}|) of a data set that Optimizing Data Collection for Machine Learning we consider when creating subsets. For each combination of K subsets, we must then train a model and evaluate it s performance to record V (D1, . . . , DK). Thus, we must subsample and train our model for O(Q k Mk) combinations, which quickly becomes computationally prohibitive. B.2 Data Collection We now summarize the data collection and training process used to create the above piecewise-linear functions for each data set and task. All models were implemented using Py Torch and trained on machines with up to eight NVIDIA V100 GPU cards. Table 5 details each task and data set size. Image Classification Tasks. For all experiments with CIFAR-10 and CIFAR-100, we use a Res Net18 He et al. (2016) following the same procedure as in Coleman et al. (2020). For Image Net, we use a Res Net34 He et al. (2016) using the procedure in Coleman et al. (2020). All models are trained with cross entropy loss using SGD with momentum. We evaluate all models on Top-1 Accuracy. For all experiments, we set the initial data set at q0 = 10% of the data. In data collection, we create five subsets containing 2%, 4%, , 10% of the training data, five subsets containing 12%, 14%, , 20% of the training data, and eight subsets containing 30%, 40%, , 100% of the data. Each subset is contained in the following subsets. Note that we use higher granularity in the early stage as this is where the dynamics of the learning curve vary the most. With more data, the learning curve eventually has a nearly zero slope. For each subset, we train our respective model and evaluate performance. VOC. We use the Single-Shot Detector 300 (SSD300) Liu et al. (2016) based on a VGG16 backbone Simonyan and Zisserman (2015), following the same procedure as in Elezi et al. (2022). We evaluate all models on mean AP. For all experiments, we set the initial data set at q0 = 10% of the data. In data collection, we sample twenty subsets at 5% intervals, i.e., 5%, 10%, 15%, , 100% of the training data. BDD100K. We use Deeplabv3 Chen et al. (2018) with Res Net50 backbone. We use random initialization for the backbone. We use the original data set split from Yu et al. (2020) with 7, 000 and 1, 000 data points in the train and validation sets respectively. The evaluation metrics is mean Intersection over Union (Io U). We follow the same protocol used in the Image classification tasks to create our subsets of data. Active Learning. We perform experiments involving active learning on CIFAR-100 and BDD100K. For these experiments, rather than collecting data by i.i.d. sampling of subsets at the above respective intervals, we use an active learning algorithm to select the subsets at each interval; for example, with CIFAR-100, we start with a random i.i.d. sample of 2% of the data and for each subsequent 4%, 6%, 8%, , 20%, 30%, 40%, , 100%, we select this data with an active learning algorithm. Each subset is contained in the following subsets. For CIFAR-100, we explore active learning via Maximum Entropy (Settles, 2009), Least Confidence (Settles, 2009), and Greedy k-Centers (Sener and Savarese, 2018). For BDD100K, we use only Least Confidence (Settles, 2009). We use the same models and training practice as described previously. nu Scenes. We use the Lift Splat architecture Philion and Fidler (2020), which is used for BEV segmentation from driving scenes, following the steps from the original paper to Mahmood, Lucas, Alvarez, Fidler, and Law Parameter Setting Optimizer GD with Momentum (β = 0.9), Adam (β0, β1 = 0.9, 0.999) Learning rate 0.005, . . . , 500 Number of bootstrap samples B 500 Number of regression subsets R See Appendix B.2 Density Estimation Model KDE for K = 1, GMM for K = 2 KDE Bandwidth 20000, . . . , 20000000 for Image Net 200, . . . , 4000 for all others GMM number of clusters 4, . . . , 10 Table 6: Summary of hyperparameters used in our experiments. train this model. We evaluate on mean Io U. Our data collection procedure follows the same steps as used for BDD100K and the Image classification tasks. CIFAR-100 (2 Types). We partition this data set into two subsets D1 and D2 of 25, 000 images each containing the first 50 and last 50 classes, respectively. We then train a Res Net18 (He et al., 2016) using different fractions of the two subsets. We follow the same training procedure as in the single-variate case except with one difference. Since some of the data sets will naturally be imbalanced (e.g., if we train with half of the first subset and all of the second subset), we employ a class-balanced cross entropy loss using the inverse frequency of samples per class. For each Dk subsets, respectively, we follow the same subsampling procedure used in the single-variate case. That is, we let q1 0 = 10% of the first data subset and q2 0 = 10% of the second data subset. For each subset, we create 10 subsampled sets at intervals of 2%, 4%, 6%, , 20% of the respective data subset. We then create eight further subsampled sets at 30%, 40%, , 100% of the respective data subset. Finally, we train our model and evaluate the score on every combination of the subsampled subsets of D1 D2. BDD100K (Semi-supervised). For this task, we consider semi-supervised segmentation via pseudo-labeling the unlabeled data set in BDD100K. The data is partitioned into two subsets D1 and D2 containing 7, 000 labeled and 70, 000 unlabeled scenes. We use Deeplabv3 (Chen et al., 2018) with a Res Net50 backbone. Here however, we: 1. First train with a labeled subset of D1 via supervised learning. 2. Pseudo-label an unlabeled subset of D2 using the trained model. 3. Re-train the segmentation model with the labeled subset and the pseudo-labeled subset. We follow the same procedure as in the single-variate case for both training steps, except we weigh the unlabeled data by 0.2 to reduce its contribution to the loss. Training via semi-supervised learning on BDD100K requires long compute times, so we reduce the number of subsets used in this experiment. For the labeled set D1, we create subsets with 5%, 10%, 15%, 20%, 40%, 60%, 80%, 100% of the data. For the unlabeled set D2, we create subsets with 0%, 10%, 25%, 50%, 100% of the data. Note that we have five settings of unlabeled data since we include the case of training with no unlabeled data as well. Optimizing Data Collection for Machine Learning B.3 LOC Implementation For all experiments, we initialize with 10% of the training data set. We consider T = 1, 3, 5 rounds and sweep a range of V . We provide a summary of parameters in Table 6. For the experiments with K = 1, we model the data requirement PDF f(q) in each round of the problem as follows. We first draw B = 500 bootstrap resamples of the current training statistics R, where R = {(rq0/R, V (Drq0/R))}R r=1 {(qs, V (Dqs))}t s=1 contains all of the measured statistics up to the initial data set (e.g., for CIFAR-10, this includes performance with 2%, 4%, , 10% of the data), and the previous collected data. The latter is obtained by calling our piecewise-linear ground truth approximation. For each bootstrap resample, we fit a power regression model ˆv(q; θ) = θ0qθ1 + θ2 and solve for the estimated minimum data requirement by minimizing a least squares problem using the Trust Region Reflective algorithm, with initial values set to 1, 0, 0 and bounds set to [0, 100], [0, 0.999], [ 10, 10] for θ0, θ1, θ2, respectively. After fitting ˆv(q; θ) for each resample, we then compute the corresponding estimate of D . We then use our set of estimates to fit a Kernel Density Estimation (KDE) model after gridsearching for the best bandwidth parameter. For the Image Net data set, we grid search for the best bandwidth parameter between [20000, 40000, 100000, 200000, 400000, 1000000, 2000000, 4000000, 10000000, 20000000], whereas for all other data sets, we grid search for the best bandwidth parameter between [200, 400, 1000, 2000, 4000]. We use these different settings because the scale of data on Image Net is between two to three orders of magnitude greater than the scale of data for the other data sets that we consider. Thus, to ensure that the estimated distribution can reasonably capture the variation on larger scales, we find that larger bandwidths are necessary. We note that in practice, a reasonable choice of bandwidth ranges for gridsearching can be visually determined by plotting the histogram of the estimated data requirement distribution (e.g., see Figure 9 in Appendix C.1) for different choices of bandwidth and inspecting the fitted distributions. For the experiments with K = 2, we use the same above procedure but fit Gaussian Mixture Models (GMM) due to their having an easily computable CDF via the Gaussian erf( ), rather than numerically integrating the PDF. We grid-search over the number of mixture components for the GMM model. We optimize over problems (8) using gradient descent. Depending on the current state and data set, different hyperparameters perform better. As a result, we perform extensive hyperparameter tuning every time we need to solve the optimization problem. Here, we sweep all combinations of gradient descent with momentum (β = 0.9), and Adam (β0 = 0.9, β1 = 0.999), and learning rates between [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500]. We initialize each problem with qt equal to the baseline regression solution and qt+s = qt/(s + 1) for all 1 s T t. That is, we set the initial value for future collection amounts to be fractions of the initial value of the immediate amount of data to collect. We identified this initialization by manually inspecting the solutions found by LOC; it improves the conditioning of the loss landscape relative to other random initialization schemes. Appendix C. Additional Numerical Results This section contains expanded results of our numerical experiments and further ablations. Our key results include: Mahmood, Lucas, Alvarez, Fidler, and Law 0 10000 20000 30000 40000 50000 0 10000 20000 30000 40000 50000 0.0 0.2 0.4 0.6 0.8 1.0 q 1e6 0 2000 4000 6000 8000 10000 0 2000 4000 6000 q 0 5000 1000015000200002500030000 Figure 8: For a fixed seed, ground truth learning curves (black) and the estimated power law learning curves (blue) obtained via bootstrapping and ensembling. The shaded region represents the 95 percentile of the ensemble and the dashed blue line represents the mean of the regression functions. The mean is consistently higher than the unknown ground truth, whereas the shaded region can at times cover it. 8000 9000 1000011000120001300014000 Relative frequency CIFAR-100 (V * = 50) 10000 12500 15000 17500 20000 22500 Relative frequency CIFAR-100 (V * = 60) 15000 20000 25000 30000 35000 40000 Relative frequency CIFAR-100 (V * = 70) Figure 9: For a fixed seed, the histogram of estimates of D from different bootstrapped models (blue bars), the estimated F(q) (orange curve), and the ground truth D (black dashed line). Each plot corresponds to a different V for CIFAR-100 (see Figure 8 for the learning curve). With higher targets, regression (i.e., collecting the mean of the distribution) will lead to larger under-estimations. In Appendix C.1, we evaluate the effectiveness of estimating F(q) by plotting the estimated learning curves as well as the empirical histograms used to model the data requirement distribution. In Appendix C.2, we consider variants of LOC where we use different regression functions to estimate the data requirement distribution. Our optimization framework can be deployed on top of any regression function to reduce the failure rate. In Appendix C.3, we consider LOC for scenarios where data is not collected via random sampling, but instead by active learning. Our optimization framework consistently outperforms baselines even under such non-i.i.d. settings. In Appendix C.4, we explore the multi-variate LOC (i.e., K = 2) for problems where we have a small number of T = 1, 3 rounds. The baseline fails for almost all instances of T = 1, whereas LOC maintains a low failure rate. In Appendix C.5, we explore the sensitivity of our optimization algorithm to variations in the cost and penalty parameters. In all except one instance, LOC consistently maintains a low total cost and failure rate. Optimizing Data Collection for Machine Learning 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=1) Power Law c=[0.01 0.005] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=1) Power Law c=[0.01 0.002] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=1) Power Law c=[0.01 0.001] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=1) Power Law c=[0.01 0.0005] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=3) Power Law c=[0.01 0.005] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=3) Power Law c=[0.01 0.002] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=3) Power Law c=[0.01 0.001] 40 50 60 70 V * 0 Failure Rate 40 50 60 70 V * 10 2 CIFAR-100 (2 Types) (T=3) Power Law c=[0.01 0.0005] 40 50 60 70 V * 0 Failure Rate Figure 10: For experiments on CIFAR-100 with two data types, mean standard deviation over 5 seeds of the cost ratio c T(q T q0)/c T(D q0) 1 and failure rate for different V after removing 99-th percentile outliers. We fix c0 = 1 and P = 1013. The rows correspond to T = 1, 3 (see the main paper for T = 5) and the columns correspond to c1 = c0/2, c0/5, c0/10, c0/20. C.1 Estimating the Data Requirement Distribution F(q) To estimate F(q), we first create an ensemble of estimated learning curves, which we then invert to obtain an empirical distribution of estimated values for D . Figure 8 plots our bootstrap resampled estimated learning curves versus the ground truth performance for the first round of data collection when we have access to an initial Dq0 containing 10% of the full data set. As noted in Mahmood et al. (2022b), the mean estimated learning curve diverges from the ground truth. However, by bootstrap resampling an ensemble of learning curves, we can cover the ground truth with some probability. Figure 9 plots the empirical histograms of estimated D as well as the estimated F(q) obtained via KDE on CIFAR-10 with three different values for V . Although the mode of the estimated distribution is far from the ground truth D , the estimated distribution assigns some probability to the ground truth region. LOC optimizes over this estimated F(q), which allows us to conservatively collect data and reduce the chances of failure. C.2 LOC with Alternative Regression Functions Mahmood et al. (2022b) show that we can use other regression functions instead of the power law to estimate the data requirement. Moreover, some functions tend to consistently overor under-estimate the requirement. LOC can be deployed on top of any such regression function, since the regression function is only used to generate bootstrap samples. In this Mahmood, Lucas, Alvarez, Fidler, and Law Regression With Regression With Regression With Data set T Logarithmic ˆv(q) LOC Arctan ˆv(q) LOC Algebraic Root ˆv(q) LOC Failure rate (FR) Cost ratio (CR) FR CR FR CR FR CR FR CR FR CR CIFAR-10 1 3% 0.59 0% 2.18 88% 0.01 6% 1.93 100% 12% 1.74 3 3% 0.59 0% 1.16 57% 0.01 0% 0.91 97% 0 0% 0.87 5 3% 0.59 0% 0.9 43% 0.01 0% 0.62 88% 0 0% 0.70 CIFAR-100 1 43% 0.19 2% 1.17 23% 3.31 0% 5.56 52% 0.11 23% 0.81 3 37% 0.17 2% 0.54 15% 3.01 0% 3.92 44% 0.09 2% 0.87 5 34% 0.16 2% 0.39 12% 2.90 0% 3.6 44% 0.09 2% 0.54 Imagenet 1 0% 24.5 34% 0.48 100% 0% 2.65 100% 40% 0.48 3 0% 24.5 2% 0.17 98% 0.01 0% 0.81 94% 0.01 0% 0.19 5 0% 24.5 0% 0.1 89% 0.01 0% 0.93 79% 0.01 0% 0.10 BDD100K 1 0% 0.55 12% 4.78 85% 0.04 12% 9.51 73% 0.08 12% 4.82 3 0% 0.55 0% 1.78 62% 0.02 0% 5.33 27% 0.04 0% 1.48 5 0% 0.55 0% 0.72 54% 0.01 0% 2.77 15% 0.03 0% 0.79 nu Scenes 1 0% 0.52 0% 2.78 95% 0 0% 8.96 95% 0 0% 3.42 3 0% 0.52 0% 0.78 95% 0 0% 6.63 81% 0.02 0% 1.73 5 0% 0.52 0% 0.5 95% 0 0% 7.55 67% 0.01 0% 2.25 VOC 1 100% 6% 1.85 83% 0.11 14% 0.49 14% 11.2 14% 13.5 3 100% 0% 1.60 75% 0.07 0% 1.00 0% 9.61 0% 12.1 5 94% 0 0% 1.37 67% 0.06 0% 1.32 0% 9.61 0% 18.4 Table 7: Comparing against the following alternate estimators of Mahmood et al. (2022b) with the same setup as in Table 2: Logarithmic ˆv(q; θ) = θ0 log(q + θ1) + θ2, Arctan ˆv(q; θ) = 200 π arctan(θ0 π 2 q + θ1) + θ2. Algebraic Root ˆv(q; θ) = 100q 1+|θ0q|θ1)1/θ1 + θ2. The best performing cost ratio is underlined and the best performing failure rate for each setting is bolded. Optimizing Data Collection for Machine Learning Data set T Regression With Correction (Mahmood et al., 2022b) LOC Failure rate Cost ratio Failure rate Cost ratio CIFAR-100 1 14% 0.94 4% 0.99 3 1% 0.23 3% 0.31 5 0% 0.17 2% 0.19 Imagenet 1 7% 1.03 37% 0.49 3 0% 0.21 5% 0.16 5 0% 0.14 2% 0.10 BDD100K 1 4% 4.03 12% 2.03 3 0% 1.02 0% 0.72 5 0% 0.62 0% 0.35 nu Scenes 1 0% 27.2 52% 0.16 3 0% 0.75 0% 0.09 5 0% 0.30 0% 0.04 VOC 1 0% 44.6 25% 0.56 3 0% 7.02 0% 1.10 5 0% 3.98 0% 0.84 Table 8: Comparing against the correction factor-based Power Law Regression of Mahmood et al. (2022b) with the same setup as in Table 2. The best performing cost ratio is underlined and the best performing failure rate for each setting is bolded. Although the baseline achieves low failure rates, LOC often can achieve competitive failure rates while reducing the cost ratios by an order of magnitude. Strategy T Regression LOC Failure rate Cost ratio Failure rate Cost ratio Entropy 1 23% 0.53 1% 1.79 3 18% 0.50 3% 1.02 5 18% 0.50 2% 0.83 k-Centers 1 51% 0.16 26% 0.38 3 44% 0.14 13% 0.17 5 44% 0.14 8% 0.13 Least Confidence 1 24% 0.70 14% 3.36 3 23% 0.68 5% 1.71 5 21% 0.67 6% 1.44 Table 9: Different active learning policies for experiments on CIFAR-100, measuring the average cost ratio and failure rate measured over a range of V and T. We fix c = 1 and P = 107. The best performing failure rate for each setting is bolded. The cost ratio is measured only for instances that achieve V . section, we show that these baseline estimators consistently either underor overestimate how much data to collect, in contrast to LOC. Table 7 highlights experiments on all six data sets with three alternative regression functions that were used by Mahmood et al. (2022b). For each function and almost on each Mahmood, Lucas, Alvarez, Fidler, and Law 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=1) Power Law c=[1. 0.1] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=1) Power Law c=[1. 0.05] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=1) Power Law c=[1. 0.01] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=1) Power Law c=[1. 0.005] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=3) Power Law c=[1. 0.1] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=3) Power Law c=[1. 0.05] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=3) Power Law c=[1. 0.01] 36 38 40 42 V * 0 Failure Rate 36 38 40 42 V * 10 2 BDD100K (Semi-supervised) (T=3) Power Law c=[1. 0.005] 36 38 40 42 V * 0 Failure Rate Figure 11: For experiments on BDD100K with two data types, mean standard deviation over 5 seeds of the cost ratio c T(q T q0)/c T(D q0) 1 and failure rate for different V after removing 99-th percentile outliers. We fix c0 = 1 and P = 1013. The rows correspond to T = 1, 3 (see the main paper for T = 5) and the columns correspond to c1 = c0/2, c0/5, c0/10, c0/20. dataset, we observe the same trends seen in the original Table 2. That is, LOC reduces the failure rate down to approximately zero, at a marginal relative increase in cost. Noting that Power Law Regression often leads to failure, Mahmood et al. (2022b) also propose a correction factor heuristic wherein they learn a parameter τ such that if the data collection problem requires a target performance V , we should instead aim to collect enough data to meet V + τ. In order to learn this correction factor, we require a pre-existing data set upon which we can simulate a data collection policy. Mahmood et al. (2022b) set τ such that we can achieve the data requirement V for any V on the pre-existing data set, and then fixing this parameter for new data sets. Table 8 compares LOC (i.e., repeating Table 2) with the Correction factor-based Power Law regression baseline of Mahmood et al. (2022b). Following the original paper, we tune τ using CIFAR-10 and apply it on all other data sets. The correction factor is designed to minimize the failure rate and thus, achieves nearly 0% failure rate for all settings, but often at high cost ratios. On the other hand, LOC achieves generally low failure rates and low cost ratios. Specifically, for T = 3, 5, we are competitive with the baseline on failure rates for most tasks while obtaining up to an order of magnitude decrease in costs. For T = 1, we typically admit higher failure rates; however for the segmentation and detection tasks, we obtain up multiple orders of magnitude lower costs. Finally, note that this baseline requires a similar prior task to be effective. For example, the baseline outperforms us on cost and failure rate both only on CIFAR-100, since it is tuned on CIFAR-10. On the other hand, Optimizing Data Collection for Machine Learning 80 85 90 V * Ratio of points q * /D * CIFAR-10 (T=1) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 40 50 60 70 V * Ratio of points q * /D * CIFAR-100 (T=1) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 45 50 55 60 65 V * Ratio of points q * /D * Imagenet (T=1) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 80 85 90 V * Ratio of points q * /D * CIFAR-10 (T=3) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 40 50 60 70 V * Ratio of points q * /D * CIFAR-100 (T=3) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 45 50 55 60 65 V * Ratio of points q * /D * Imagenet (T=3) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 66 68 70 72 74 V * Ratio of points q * /D * Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 20 22 24 26 V * Ratio of points q * /D * BDD100K (T=1) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 22 24 26 28 30 V * Ratio of points q * /D * nu Scenes (T=1) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 66 68 70 72 74 V * Ratio of points q * /D * Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 20 22 24 26 V * Ratio of points q * /D * BDD100K (T=3) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 22 24 26 28 30 V * Ratio of points q * /D * nu Scenes (T=3) Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00 Figure 12: Mean standard deviation of the ratio of data collected q T /D for different V when we sweep the cost parameter from 0.001 to 1 and fix P = 107. We show T = 1, 3 and refer to the main paper for T = 5. The dashed black line corresponds to collecting exactly the minimum data requirement. LOC does not require this prior data set to be effective as evidence by its performance on non-classification tasks. C.3 LOC with Active Learning Although most of our experiments assume that data is collected via i.i.d. random sampling, we may also consider alternative approaches to data collection such as active learning. Here, the optimal data collection problem amounts to determining the optimal budget to set when running an active learning query algorithm. Given the budget, we then sample data according to the active learning strategy. This relies on the assumption that the learning curve under active learning is monotone non-decreasing; this may not always be the case Mahmood, Lucas, Alvarez, Fidler, and Law 80 85 90 V * Ratio of points q * /D * CIFAR-10 (T=1) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 40 50 60 70 V * Ratio of points q * /D * CIFAR-100 (T=1) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 45 50 55 60 65 V * Ratio of points q * /D * Imagenet (T=1) Power Law P=1e+07 P=1e+08 P=1e+09 P=1e+10 80 85 90 V * Ratio of points q * /D * CIFAR-10 (T=3) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 40 50 60 70 V * Ratio of points q * /D * CIFAR-100 (T=3) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 45 50 55 60 65 V * Ratio of points q * /D * Imagenet (T=3) Power Law P=1e+07 P=1e+08 P=1e+09 P=1e+10 66 68 70 72 74 V * Ratio of points q * /D * Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 20 22 24 26 V * Ratio of points q * /D * BDD100K (T=1) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 22 24 26 28 30 V * Ratio of points q * /D * nu Scenes (T=1) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 66 68 70 72 74 V * Ratio of points q * /D * Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 20 22 24 26 V * Ratio of points q * /D * BDD100K (T=3) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 22 24 26 28 30 V * Ratio of points q * /D * nu Scenes (T=3) Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09 Figure 13: Mean std of the ratio of data collected q T /D for different V when we sweep the penalty parameter from 106 to 109 and fix c = 1. We show T = 1, 3 and refer to the main paper for T = 5. The dashed black line corresponds to collecting exactly the minimum data requirement. (e.g., see Viering and Loog (2022)), but has been empirically demonstrated for many common batch-mode active learning strategies for computer vision (e.g., see Sener and Savarese (2018); Mahmood et al. (2022b)). Overall, the LOC framework does not change from Algorithm 3, except in how the training set statistics are collected in Algorithm 1. Given a small amount of data, we can estimate the neural scaling law from the active learning algorithm and apply estimation-only approaches and LOC. Table 9 highlights the effect of using LOC when the data collection policy uses active learning instead of random sampling. We focus on CIFAR-100 and test three different active learning strategies: Maximum Entropy (Settles, 2009), k-Centers (Sener and Savarese, 2018), and Least Confidence (Settles, 2009). This table demonstrates the same trends as before, namely that our policy Optimizing Data Collection for Machine Learning can better make decisions on the data collection budget compared to an estimation-only policy regardless of the specific collection algorithm. C.4 The Value of Optimization over Estimation when K = 2 Figure 10 and Figure 11 expand Figure 5 to T = 1, 3 rounds. The results validate the summary observations from Table 3 in that the baseline has considerably higher failure rates versus LOC. In particular for BDD100K at T = 1, the baseline fails consistently for four out of five random seeds. On the other hand, recall that LOC admits a higher cost ratio compared to the baseline when T = 1. We can observe now that this high cost ratio is due to the method incurring high cost for a few target V values. This behavior is similar to the observation above on VOC with high penalties at T = 3. C.5 Robustness to the Cost and Penalty Parameters Figure 12 expands the cost parameter sweep from Figure 6 (Top row) to the settings of T = 1, 3. For nearly all settings, LOC remains stable to variations in the cost parameter. Nonetheless, careful parameter selection becomes important as T decreases. This is due to the fact that for low costs, the total amount of data collected increases as T decreases (e.g., c = 0.001 for BDD100K). Furthermore, Figure 13 expands the penalty parameter sweep from Figure 6 (Bottom row). Here, we observe similar properties to the cost parameter sweep. Although LOC is relatively stable on all other data sets, our results demonstrate some extreme results for VOC, potentially due to noise in the simulation. For example in Figure 13, setting P = 109, V = 71, and T = 3 led to collecting 10, 000 times the minimum data requirement. Such a situation is unrealistic in a production-level implementation, since in a real implementation, we could impose further constraints onto problem (8), such as upper bounds on the total amount of data permissible. Amazon sagemaker data labeling pricing, 2023. URL https://aws.amazon.com/sagemaker/ data-labeling/pricing/. David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. David Acuna, Guojun Zhang, Marc T Law, and Sanja Fidler. f-domain adversarial learning: Theory and algorithms. In International Conference on Machine Learning, pages 66 75. PMLR, 2021. David Acuna, Marc T Law, Guojun Zhang, and Sanja Fidler. Domain adversarial training: A game perspective. In International Conference on Learning Representations, 2022. Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024. Mahmood, Lucas, Alvarez, Fidler, and Law Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 1. Athena scientific, 2012. Dimitris Bertsimas, Mac Johnson, and Nathan Kallus. The power of optimization over randomization in designing experiments involving small samples. Operations Research, 63 (4):868 876, 2015. Devansh Bisla, Apoorva Nandini Saridena, and Anna Choromanska. A theoretical-empirical approach to estimating sample complexity of dnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3270 3280, 2021. Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws. In The Eleventh International Conference on Learning Representations, 2023. Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621 11631, 2020. Pedro Carneiro, Sokbae Lee, and Daniel Wilhelm. Optimal data collection for randomized control trials. The Econometrics Journal, 23(1):1 31, 2020. Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. David Cohn. Neural network exploration using optimal experiment design. In Advances in Neural Information Processing Systems, volume 6, 1993. Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020. Corinna Cortes, Lawrence D Jackel, Sara Solla, Vladimir Vapnik, and John Denker. Learning curves: Asymptotic values and rate of convergence. In Advances in Neural Information Processing Systems, volume 6, 1993. Fabio Gagliardi Cozman, Ira Cohen, and M Cirelo. Unlabeled data can degrade classification performance of generative classifiers. In FLAIRS, pages 327 331, 2002. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248 255. IEEE, 2009. Dimensional Research. Artificial intelligence and machine learning projects are obstructed by data issues: Global survey of data scientists, ai experts and stakeholders. Technical report, 05 2019. Optimizing Data Collection for Machine Learning Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-fourth International Joint Conference on Artificial Intelligence, 2015. David Easley and Nicholas M Kiefer. Controlling a stochastic process with unknown parameters. Econometrica: Journal of the Econometric Society, pages 1045 1064, 1988. Ismail Elezi, Zhiding Yu, Anima Anandkumar, Laura Leal-Taixe, and Jose M Alvarez. Not all labels are equal: Rationalizing the labeling costs for training object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. Ashley F Emery and Aleksey V Nenarokomov. Optimal experiment design. Measurement Science and Technology, 9(6):864, 1998. Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html, 2007. Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012. Rosa L Figueroa, Qing Zeng-Treitler, Sasikiran Kandula, and Long H Ngo. Predicting sample size required for classification performance. BMC Medical Informatics and Decision Making, 12(1):1 10, 2012. Lewis J Frey and Douglas H Fisher. Modeling decision tree performance with the power law. In Seventh International Workshop on Artificial Intelligence and Statistics. PMLR, 1999. Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pages 2242 2251. PMLR, 2019. Baohua Gu, Feifang Hu, and Huan Liu. Modelling classification performance for large data sets. In International Conference on Web-Age Information Management, pages 317 328. Springer, 2001. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016. Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam Mc Candlish. Scaling laws for transfer. ar Xiv preprint ar Xiv:2102.01293, 2021. Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. ar Xiv preprint ar Xiv:1712.00409, 2017. Mahmood, Lucas, Alvarez, Fidler, and Law Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Advances in Neural Information Processing Systems, volume 36, 2022. Derek Hoiem, Tanmay Gupta, Zhizhong Li, and Michal Shlapentokh-Rothman. Learning curves for analysis of deep networks. In International Conference on Machine Learning, pages 4287 4296. PMLR, 2021. Yiding Jiang, Pierre Foret, Scott Yak, Daniel M Roy, Hossein Mobahi, Gintare Karolina Dziugaite, Samy Bengio, Suriya Gunasekar, Isabelle Guyon, and Behnam Neyshabur. Neurips 2020 competition: Predicting generalization in deep learning. ar Xiv preprint ar Xiv:2012.07976, 2020. Yiding Jiang, Parth Natekar, Manik Sharma, Sumukh K Aithal, Dhruva Kashyap, Natarajan Subramanyam, Carlos Lassance, Daniel M Roy, Gintare Karolina Dziugaite, Suriya Gunasekar, et al. Methods and analysis of the first competition in predicting generalization of deep learning. In Neur IPS 2020 Competition and Demonstration Track, pages 170 190. PMLR, 2021. George H. John and Pat Langley. Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD 96, page 367 370. AAAI Press, 1996. S Jones, S Carley, and M Harrison. An introduction to power and sample size estimation. Emergency Medicine Journal: EMJ, 20(5):453, 2003. Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020. Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22 30, 2012. Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. Mark Last. Predicting and optimizing classifier utility with the power law. In Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), pages 219 224. IEEE, 2007. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, pages 21 37. Springer, 2016. Rafid Mahmood, Sanja Fidler, and Marc T. Law. Low-budget active learning via wasserstein distance: An integer programming approach. In International Conference on Learning Representations, 2022a. Optimizing Data Collection for Machine Learning Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, and Marc T. Law. How much more data do we need? estimating requirements for downstream tasks. In 2022 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2022b. Rafid Mahmood, James Lucas, Jose M Alvarez, Sanja Fidler, and Marc T Law. Optimizing data collection for machine learning. In Advances in Neural Information Processing Systems, volume 36, 2022c. Christopher Meek, Bo Thiesson, and David Heckerman. The learning-curve sampling method applied to model-based clustering. Journal of Machine Learning Research, 2(Feb):397 418, 2002. Hiroaki Mikami, Kenji Fukumizu, Shogo Murai, Shuji Suzuki, Yuta Kikuchi, Taiji Suzuki, Shin-ichi Maeda, and Kohei Hayashi. A scaling law for syn2real transfer: How much is your pre-training effective? In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19 23, 2022, Proceedings, Part III, page 477 492, 2022. Felix Mohr, Tom J Viering, Marco Loog, and Jan N van Rijn. Lcdb 1.0: An extensive learning curves database for classification tasks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 3 19. Springer, 2022. Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the European Conference on Computer Vision, 2020. Viraj Uday Prabhu, David Acuna, Rafid Mahmood, Marc T Law, Yuan-Hong Liao, Judy Hoffman, Sanja Fidler, and James Lucas. Bridging the sim2real gap with CARE: Supervised detection adaptation with conditional alignment and reweighting. Transactions on Machine Learning Research, 2023. Aayush Prakash, Shoubhik Debnath, Jean-Francois Lafleche, Eric Cameracci, Gavriel State, Stan Birchfield, and Marc T. Law. Self-supervised real-to-sim scene generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16044 16054, October 2021. Foster Provost, David Jensen, and Tim Oates. Efficient progressive sampling. In Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 23 32, 1999. Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Jonathan S Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations, 2020. Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. Mahmood, Lucas, Alvarez, Fidler, and Law Burr Settles. Active learning literature survey. 2009. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015. Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972 5981, 2019. Richard D Smallwood and Edward J Sondik. The optimal control of partially observable markov processes over a finite horizon. Operations Research, 21(5):1071 1088, 1973. Kirstine Smith. On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika, 12(1/2):1 85, 1918. Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pages 843 852, 2017. Katrin Tomanek and Udo Hahn. Approximating learning curves for active-learning-driven annotation. In LREC, volume 8, pages 1319 1324, 2008. Venture Beat. Why do 87% of data science projects never make it into production?, Jul 2019. URL https://venturebeat.com/2019/07/19/ why-do-87-of-data-science-projects-never-make-it-into-production/. Felipe AC Viana. A tutorial on latin hypercube design of experiments. Quality and reliability engineering international, 32(5):1975 1985, 2016. Tom Viering and Marco Loog. The shape of learning curves: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. Sen Wang and KJ Yang. An image scaling algorithm based on bilinear interpolation with vc++. Techniques of Automation and Applications, 27(7):44 45, 2008. Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93 102, 2019. Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2636 2645, 2020. Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. Eric Zhao, Anqi Liu, Animashree Anandkumar, and Yisong Yue. Active learning under label shift. In International Conference on Artificial Intelligence and Statistics, pages 3412 3420. PMLR, 2021.