# optimizing_data_collection_for_machine_learning__78d6c10f.pdf

Journal of Machine Learning Research 26 (2025) 1-52 Submitted 3/23; Revised 10/24; Published 3/25

Optimizing Data Collection for Machine Learning

Raﬁd Mahmood1,2 rmahmood@nvidia.com

James Lucas1 jalucas@nvidia.com

Jose M. Alvarez1 josea@nvidia.com

Sanja Fidler1,3,4 sfidler@nvidia.com

Marc T. Law1 marcl@nvidia.com

1 NVIDIA 2 University of Ottawa, Ottawa, Canada 3 University of Toronto, Toronto, Canada 4 Vector Institute, Toronto, Canada

Editor: Isabelle Guyon

Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workﬂows. We propose a new paradigm to model the data collection workﬂow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. This formulation generalizes to tasks with multiple data sources, such as labeled and unlabeled data used in semi-supervised learning, and can be easily modiﬁed to customized analyses such as how to introduce data from new classes to an existing model. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We signiﬁcantly reduce the risks of failing to meet desired performance targets on several classiﬁcation, segmentation, and detection tasks, while maintaining low total collection costs.

Keywords: data collection, neural scaling laws, active learning

1. Introduction

Before deploying a machine learning model in production, stakeholders may mandate that the model meets a pre-determined baseline performance, such as a target score over a validation data set. One of the most reliable ways to achieve a desired performance is by augmenting current training sets with more data. Consequently, engineers must regularly determine how much and what kind of data they need. Managing data collection campaigns can impact costs and delays in model development. Overestimating how much data the model needs to reach a target performance will incur excess costs from collection, cleaning, and annotation. For example, annotating segmentation masks on driving data requires 15 to 40 seconds and between $0.02 to $0.08 per object (Acuna

c 2025 Raﬁd Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc T. Law.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-0292.html.

Mahmood, Lucas, Alvarez, Fidler, and Law

et al., 2018; AWS, 2023). Processing 100,000 images with an average of 10 cars per image can take between 170 to 460 days-equivalent of time and cost between $20,000 to $80,000. Meanwhile, underestimating how much data is needed will incur delays from having to collect more data later. For instance, a 2019 survey of machine learning practitioners revealed 51% of engineering teams faced delays due to failing to collect enough data (Dimensional Research, 2019). This problem grows more challenging when given multiple vendors who can provide diﬀerent levels of quality at diﬀerent prices. Consider the following examples:

Medical Imaging: A medical imaging company is planning to deploy automatic segmentation software on their devices within the next three years. They need to achieve an 80% Intersection-over-Union (Io U) to meet clinical standards. The company will partner with hospitals and hire clinicians to collect and annotate patient data, which can be expensive. If the company overestimates how much data to collect, they will spend more than necessary, but if they underestimate, they may have to restart the collection process next year.

Autonomous Vehicles: A startup working on autonomous vehicles needs to build an object detector in the next ﬁve years. This model must achieve a minimum mean Average Precision of 95% on their validation set, or else it will not be deployable and the company will lose $1,000,000 in revenue. Collecting high-quality training data requires employing drivers to record video and annotators to label the data, where the marginal cost of each image is approximately $5. Alternatively, the startup could use lower quality synthetic data at $1 per image. To manage their resources, the startup must plan how much real versus synthetic data they need at the start of each year.

The extant literature on learning curves and neural scaling laws suggests that the relationship between the performance of a deep learning model and training data set size follows a power law (Frey and Fisher, 1999; Gu et al., 2001; Hestness et al., 2017; Rosenfeld et al., 2020; Kaplan et al., 2020; Hoiem et al., 2021; Bahri et al., 2024; Bisla et al., 2021). This motivates an intuitive approach of ﬁtting a power law learning curve with the performance statistics of the current data set, extrapolating this learning curve to estimate how the model will perform with more data, and then forecasting how much data is needed to reach the desired performance (Rosenfeld et al., 2020). However, the decay rate of power laws implies that even small errors in estimating a learning curve can lead to massively overor underestimating how much data is actually needed (Mahmood et al., 2022b). Moreover, estimating these learning curves becomes diﬃcult with multiple data vendors, since diﬀerent types of data have diﬀerent costs and scale diﬀerently with performance (Mikami et al., 2022; Acuna et al., 2021; Prakash et al., 2021; Acuna et al., 2022; Prabhu et al., 2023). For example, in a semi-supervised learning task, unlabeled data may be easier to collect than labeled data, but we may require an order of magnitude more unlabeled data to match the performance of a small labeled set (Viering and Loog, 2022). Thus, collecting more data based only on scaling law estimates will fail to capture uncertainty and collection costs.

In this paper, we propose a new paradigm to model the data collection workﬂow as an optimal data collection problem, where a ﬁrm must minimize the cost of collecting enough data to obtain a model capable of achieving a desired performance score. They have multiple collection rounds, and after each round, they re-evaluate their model and decide how much more data to order. There are per-sample collection costs, and the ﬁrm pays a penalty if they fail to meet the target score within a ﬁnite horizon. Using this framework, we develop

Optimizing Data Collection for Machine Learning

an optimization approach to minimize expected future costs and show that this problem can be optimized in each collection round via gradient descent. Our optimization problem generalizes to decisions over multiple data sources (e.g., unlabeled, long-tail, cross-domain, synthetic) that have diﬀerent costs and impacts on performance. Most importantly, our framework presents natural tools for custom economic analyses such as comparing diﬀerent collection strategies or introducing new classes to expand an existing model. Finally, we demonstrate the value of optimization over na ıvely estimating data set requirements for several machine learning tasks and data sets. Our contributions are as follows:

1. We propose the optimal data collection problem in machine learning, which extends the estimation of learning curves to a formal dynamic optimization problem to determine how much and what kind of data to collect over the model development life cycle.

2. We introduce Learn-Optimize-Collect (LOC), a learning-and-optimizing framework that minimizes expected collection costs and can be solved via gradient descent. This is the ﬁrst exploration of optimizing data collection with multiple arbitrary data sources in machine learning, covering, for example, semi-supervised and long-tail learning.

3. We analyze the one-round problem of deciding how much data to collect. We show that this problem is equivalent to estimating a (1 ϵ)-quantile of the distribution of the minimum data needed to meet the target. Under Gaussian assumptions, LOC guarantees lower regret than estimation-only baselines.

4. We evaluate LOC over classiﬁcation, segmentation, and detection tasks to show, on average, approximately a 2 reduction in the chances of failing to meet performance targets, versus estimation baselines. We also show the ﬂexibility of our framework to solve customized managerial questions faced by engineering ﬁrms.

The overall goal of this work is to provide a high-level framework with which machine learning model developers can obtain policy insights on how much data to collect. In practice, these decisions are strategic in nature and are determined over long-term horizons (see our motivating examples above). Moreover, the ﬁnal policy decisions are often informed via a combination of our data-driven approach and human expert judgment. Consequently, we validate our methodology via a breadth of experiments on diﬀerent data sets, tasks, and diﬀerent managerial scenarios. A preliminary version of this article was published in Mahmood et al. (2022c). Our complete paper introduces theoretical analysis of the optimal data collection problem. First, we characterize the optimal solution space of the data collection problem (Theorem 1) and show that LOC yields the optimal solution to the problem (Lemma 2). Further, we extend the theoretical analysis of one-round data collection by deriving an analytic formula for the optimal solution as well as a regret bound under Gaussian assumptions (Proposition 5, Corollary 6, Lemma 7, Proposition 8). Finally, we present several new experiments, including comparisons with diﬀerent learning curve estimators (Section C.2), analysis under active learning strategies (Appendix C.3), and two new empirical case studies (Appendix 6.4) to show how our high-level modeling approach can answer customized data collection challenges. In these experiments, we adapt LOC to estimate the potential costs and yield decisions when choosing between diﬀerent data collection strategies.

Mahmood, Lucas, Alvarez, Fidler, and Law

2. Related work

This paper employs neural scaling laws to solve operational design problems. The problem of data-eﬃcient learning is closely tied to active learning and statistical sample complexity. Below, we summarize the most relevant literature.

2.1 Learning Curves and Neural Scaling Laws.

According to the learning curve literature, the performance of a machine learning model on a validation set scales with the size of the training data set with diminishing marginal value in a way that can usually be modeled via concave functions (Cortes et al., 1993; Frey and Fisher, 1999; Provost et al., 1999; Meek et al., 2002; Tomanek and Hahn, 2008; Figueroa et al., 2012; Kolachina et al., 2012). Consequently, performance at large data scales can be extrapolated by estimating the learning curve at smaller scales, both from a point perspective and by capturing uncertainty (Domhan et al., 2015). Although the literature primarily considers a single-round estimate of data collection requirements, John and Langley (1996) propose dynamic sampling over multiple data collection rounds, similar to our view. Furthermore while the literature primarily explores estimating dataset sizes, Last (2007) propose an optimization approach for data collection. Their framework minimizes the data collection cost plus a penalty associated with the amount of generalization error of the downstream model. In contrast to these two works, we propose a multi-round data collection optimization problem that minimizes the collection cost plus a penalty associated with failure to achieve a performance requirement. We refer to Viering and Loog (2022) for a detailed review on learning curves.

The neural scaling law literature focuses on deep learning to empirically demonstrate a power law relationship between model performance and data set size (Hoiem et al., 2021; Bisla et al., 2021; Zhai et al., 2022; Caballero et al., 2023). For instance, Hestness et al. (2017) observe this property over vision, language, and audio tasks, Bahri et al. (2024) develop a theoretical relationship under assumptions on over-parametrization and the Lipschitz continuity of the loss, model, and data, and Rosenfeld et al. (2020) estimate power laws using smaller data sets and models to extrapolate future performance. Multi-variate scaling laws have also been considered for some speciﬁc tasks, for example in transfer learning from synthetic to real data sets (Mikami et al., 2022). Finally, Mahmood et al. (2022b) explore data collection by estimating the minimum amount of data needed to meet a given target performance over multiple rounds. Our paper extends these studies by introducing a sequential optimization problem where the amount of data to collect is optimized, rather than estimated, over multiple collection rounds and from multiple data sources that may feature diﬀerent costs.

2.2 Active Learning

In active learning, a model sequentially collects data by selecting new subsets of an unlabeled data pool to label under a predetermined labeling budget that replenishes after each round (Settles, 2009; Sener and Savarese, 2018; Yoo and Kweon, 2019; Sinha et al., 2019; Mahmood et al., 2022a). In contrast, we focus on systematically determining an optimal

Optimizing Data Collection for Machine Learning

collection budget. Thus, our work complements active learning which can be used to collect data up to the budget determined by our optimization policy.

2.3 Statistical Learning Theory

The rich theoretical analysis on the sample complexity of machine learning explores worstcase bounds relating data set size and model performance, but these bounds are typically only tight asymptotically; we refer the reader to Section 7.1 of Viering and Loog (2022) for details on this challenge. Recent work have empirically analyzed these relationships (Jiang et al., 2020, 2021) For instance, Bisla et al. (2021) study generalization bounds for deep neural networks, provide empirical validation, and suggest using them to estimate data requirements. Ultimately, making data collection decisions based on worst-case bounds may be pessimistic and have consequences on collection costs.

2.4 Optimal Experiment Design

The topic of how to collect data, select samples, and design statistical experiments is well-studied in econometrics (Smith, 1918; Cohn, 1993; Emery and Nenarokomov, 1998). For instance, pseudo-random strategies such as Latin Hypercube Sampling may be more eﬃcient than i.i.d. sampling in data collection for statistical tests (Viana, 2016). Similarly, Bertsimas et al. (2015) optimize the assignment of samples into control and trial groups to minimize inter-group variances. Most recently, Carneiro et al. (2020) optimize how many samples and covariates to collect in a statistical experiment by minimizing a treatment eﬀect estimation error or maximizing t-test power. However, our focus on industrial machine learning applications diﬀers from experiment design by having target performance metrics and continual rounds of collection and modeling.

2.5 Sequential Decision-Making

Our problem is a sequential decision-making problem with an unobservable state, i.e., we determine how much to collect in each round without knowing how much we will need. Although such problems can be formed as Partially Observable Markov Decision Processes (POMDPs) (Smallwood and Sondik, 1973; Puterman, 2014), the dimension of the state and action space combined with sampling limitations make such approaches untenable. We provide a detailed reformulation and discussion in Appendix A.

3. Main Problem

We ﬁrst introduce a formal model of data collection in machine learning as a dynamic decision-making game played over multiple collection rounds. We then discuss challenges with current intuitive, but na ıve approaches for determining how much data to collect.

3.1 Optimal Data Collection

Consider K N diﬀerent data sources, where for each k {1, . . . , K}, let zk be a data point and let Dk be a data set of points generated from a ﬁxed algorithm, such as i.i.d. sampling or an active learning strategy (Settles, 2009). We train a learning model with data

Mahmood, Lucas, Alvarez, Fidler, and Law

sets D1, . . . , DK and evaluate a score function V (D1, . . . , DK). For example, if the learning problem is binary image classiﬁcation, let K = 1 where z1 := (x, y) corresponds to images x X and labels y {0, 1}, and V (D1) is the validation set accuracy of a model trained on D1. Alternatively in semi-supervised learning, let K = 2 where z1 := (x1, y1) and z2 := x2, while V (D1, D2) is the validation accuracy of a model trained with both data sets. We omit superscripts and subscripts unless necessary. In general, we have training sets D1 q0,1, . . . , DK q0,K of q0,1, . . . , q0,K points, respectively, a target score V > V (D1 q0,1, . . . , DK q0,K), and a horizon of T rounds. Let q0 := (q0,1, . . . , q0,K)T

be a vector of data set sizes and let Vq0 := V (D1 q0,1, . . . , DK q0,K). For each t {1, . . . , T}, we

(i) Determine how much data to have at the end of the round qt := (qt,1, . . . , qt,K)T.

(ii) Generate data until each Dk has qt,k points.

(iii) Re-train our learning model. If Vqt V or if t = T, we terminate.

In each round, we pay a cost ck > 0 for each additional point generated for the k-th data set. Further, if we do not reach V after T rounds, we pay a penalty P. Let c := (c1, . . . , c K)T

be the cost vector. Then, the optimal data collection problem is

min q1 q T c T (q1 q0) + 1 {Vq1 < V }

c T (q2 q1)

+ 1 {Vq2 < V } c T (q3 q2)

+ 1 Vq T 1 < V c T (q T q T 1)

+ P1 {Vq T < V } !

= min q1 q T

t=1 c T (qt qt 1)

s=1 1 {Vqs < V } + P

t=1 1 {Vqt < V } (2)

Problem (1) is deﬁned recursively where the objective includes the cost of collecting additional data in each round t and then conditioned on not collecting enough data in that round, the problem continues to the next round. Problem (2) refactors the objective by extracting the action in each round t to be depedendent on Qt 1 s=1 1 {Vqs < V }. If we use randomized algorithms to train a learning model and to sample data, the score function must be a random variable. Moreover, a general observation is that the score function typically increases monotonically with data set size (Frey and Fisher, 1999; Sun et al., 2017; Rosenfeld et al., 2020). We combine these two into the following assumption.

Assumption 1 The score function is a realization of a stochastic process Vq := V (D1, . . . , DK) as a function of the data set size. Furthermore, Vq increases monotonically with q.

Optimizing Data Collection for Machine Learning

The score function may not always increase monotonically (Mohr et al., 2022), due to factors such as model mis-speciﬁcation (Viering and Loog, 2022), active learning (Tomanek and Hahn, 2008), or mixing labeled and unlabeled data in semi-supervised learning (Cozman et al., 2002)). However, monotonic non-decreasing trends such as power laws has been consistently observed in large-scale deep learning systems (Hestness et al., 2017; Rosenfeld et al., 2020; Hernandez et al., 2021; Hoﬀmann et al., 2022). Furthermore, this assumption ensures Vq1 Vq T , meaning Qt 1 s=1 1 {Vqs < V } = 1 Vqt 1 < V , which allows us to simplify problem (2) to

RHS (2) = min q1 q T

t=1 c T (qt qt 1) 1 Vqt 1 < V + P1 {Vq T < V } . (3)

As Vq is monotonically increasing, an intuitive strategy may be to collect the minimum amount of data such that Vq = V . We refer to this amount as the minimum data requirement

D := arg min q

n c Tq | Vq V o . (4)

The minimum data requirement is also the stopping time of the stochastic process, i.e., a random variable that gives the lowest-cost index that passes V . We assume that problem (4) always has a solution, i.e., we can achieve V performance with a ﬁnite amount of data. Furthermore, we randomly pick a unique solution to break ties in (4). Below, we show that unless our penalty for failing to reach the target is too small, the optimal solution is to collect this minimum data requirement.

Theorem 1 Suppose Assumption 1 holds and that q0 < D . If P < c T(D q0), then an optimal solution to problem (3) is q 1 = = q T = q0. Otherwise, an optimal solution is q 1 = = q T = D .

Proof We prove for P < c T(D q0) by breaking into two cases. First, consider any solution q1, . . . , q T where q T > q0 but Vˆq T < V . Inputting this solution to problem (3) incurs an objective function value of c T(ˆq T q0) + P P, where the right-hand-side is the objective function value incurred by q 1 = = q T = q0. Next, consider the case where Vq T V . Let ˆt indicate the smallest qt for which Vqt V . The objective function value is at least c T(qˆt q0) c T(D q0) > P. Here, the ﬁrst inequality follows from the deﬁnition of D in (4), and the second follows from our assumption. Therefore, q T = q0, which implies our unique optimal solution. We now prove for P c T(D q0). First, consider any solution q1, . . . , q T for which Vq T < V . This means that the objective function value is greater than P c T(D q0), where the right-hand-side is the objective function value incurred by q 1 = = q T . Second, consider any solution where Vq T V . Let ˆt indicate the smallest qt for which Vqt V . The objective function value is c T(qˆt q0) c T(D q0), where the inequality follows from the optimality of D .

Theorem 1 shows that the penalty for failing to meet a target performance must be suﬃciently high for the optimal data collection problem to be non-trivial. In practice, c and T are determined by the real costs and constraints of the machine learning project, but P simply

Mahmood, Lucas, Alvarez, Fidler, and Law

Scaling Law Estimator v(q; θ)

Power Law θ1qθ2 + θ3

π arctan θ1 π

2 q + θ2 + θ3 Logarithmic θ1 log(q + θ2) + θ3

Algebraic Root 100q (1 + |θ1q|θ2)1/θ2 + θ3

Table 1: Four common scaling law functions with learnable parameters θ = {θ1, θ2, θ3} when K = 1. See Viering and Loog (2022) for an extensive list. For K > 1, we can add the scaling law for each data source according to (5).

Algorithm 1 Na ıve Estimation of the Data Requirement

1: Input: Initial data set D1, . . . , DK of q points, Regression model ˆv(q; θ), Regression set size R. 2: Collect Performance Statistics(q) 3: Initialize R = , ˆD1 = = ˆDK = 4: for r {1, . . . , R} do 5: Sub-sample qk/R additional points from each Dk without replacement to augment ˆDk.

6: Evaluate V ( ˆD) and update R R n qr/R , V ( ˆD) o .

7: end for 8: end 9: Estimate D

10: Fit regression model θ = arg minθ P|R| r=1 (Vqr v(qr; θ))2.

11: Estimate the data requirement by solving ˆD = arg minq n c Tq v(q; θ ) V o .

12: end 13: Output: Estimated ˆD.

reﬂects how much a ﬁrm stands to lose from not meeting performance targets. As such, practitioners have freedom to tune this parameter when modeling data collection, where setting a high P suggests that a strong need to meet the target within the time horizon. We expand on this observation in our theoretical analysis in Section 5. We assume in the rest of this paper that P > c T(D q0), which implies that the optimal amount of data to have is D . However, D is unknown to us at the time of decision-making.

3.2 An Intuitive but Na ıve Approach to Estimating the Data Requirement

The recent neural scaling law literature suggests that if we can model the score function Vq, then we can estimate D directly and use this to determine how much data to collect. For instance, Rosenfeld et al. (2020); Hoiem et al. (2021); Caballero et al. (2023) ﬁt parameters θ to an estimator v(q; θ) Vq of the score. They subsample from the current data sets D1 qt,1, . . . , DK qt,K to simulate small data set sizes, retrain the model, and evaluate the score. Repeating this process with R diﬀerent training subsets yields a data set of training statistics R := {qr, Vqr}R r=1, which can be used to solve a Least Squares minimization problem. Once ﬁtted, v(q; θ ) can replace Vq in problem (4) to estimate the stopping time. Algorithm 1 summarizes the general steps.

Optimizing Data Collection for Machine Learning

Figure 1: Extrapolating scaling laws to estimate D on Image Net (Deng et al., 2009). The solid blue line is the ground truth test accuracy as a function of data set size. Left: Fitting four diﬀerent scaling law functions from Table 1 when initializing with 10% (q0 = 125, 000, dotted) and 50% (q0 = 600, 000, dashed) of the data set. All functions struggle to accurately extrapolate accuracy when q0 is small, but are accurate when q0 is large. Right: To hit a target V = 67% accuracy, we need 900, 000 images. If the scaling laws overor underestimate by only a small amount ( 6% error at q = 900, 000), they massively under- (e.g., ˆD = 580, 000) or overestimate (e.g., ˆD = 3, 000, 000) how much data is needed.

The scaling law literature almost exclusively focuses on K = 1, wherein they propose diﬀerent parametric functions for v(q; θ)(Rosenfeld et al., 2020; Viering and Loog, 2022; Hoiem et al., 2021; Caballero et al., 2023); we give examples in Table 1. The most common choice is a power law v(q; θ) = θ1qθ2 +θ3. We remark that some recent research has explored K > 1; for instance, Mikami et al. (2022) explore a K = 2 power law for transfer learning from synthetic to real data. To explore arbitrary K, we propose an easy-to-implement estimator that adds the contributions of each data source

v(q; θ) := θ0 +

k=1 vk(qk; θk), (5)

where vk(qk; θk) can be any K = 1 estimator of the learning curve. Moreover, this additive model can be easily ﬁt with a Least Squares algorithm and oﬀers interpetable explanations of the contributions of diﬀerent data types by assumping each data set has an independent contribution (Ghorbani and Zou, 2019).

Remark 1 The above estimator extends the existing literature to arbitrary K N and can also be compounded on future K = 1 estimation algorithms. Further, diﬀerent estimation functions v(q; θ) demonstrate speciﬁc biases towards either overor underestimating performance (Mahmood et al., 2022b). Designing a speciﬁc estimator is not the focus of this paper, as we will next show that all estimators have the same weaknesses with respect to data collection, i.e., disproportionate overor undercollection due to the concavity of the learning curve. For the remainder of the main paper, we focus on the speciﬁc case of v(q; θ)

Mahmood, Lucas, Alvarez, Fidler, and Law

being a linear combination of power laws, since power laws are the most common choice for learning curves. In Appendix C, we include ablations with diﬀerent estimators (see Table 7).

Due to the fact that estimation will have some degree of inaccuracy, this intuitive strategy na ıvely suﬀers from two major consequences when used in data collection. These properties were initially observed by Mahmood et al. (2022b), but we expand on their analysis below. We highlight them using the Image Net data set in Figure 1:

1. Extrapolation performance is tied to current amount of data: Figure 1 (Left) plots four diﬀerent scaling law functions that were ﬁt using an initial q0 = 125, 000 and q0 = 600, 000 images, i.e., 10% and 50% of the entire data set respectively. When q0 is small, every function fails to accurately extrapolate future performance and ultimately diverges from the ground truth learning curve. Alternatively when q0 is large, every function reasonably accurately follows the learning curve. Thus, an initial estimate of D from a small q0 is likely to be inaccurate. This property was also observed by Hestness et al. (2017), who referred to this as the the small-data regime.

2. Small estimation errors lead to large overor undercollection: Figure 1 (Right) plots two diﬀerent scaling law functions. To collect a target V = 67%, the ground truth D is 900,000 images. Both estimated v(q; θ) functions are reasonably accurate and have less than 6% error at q = 900, 000. However, the function that overestimates Vq (orange) suggests collecting 580, 000 images, which is about 60% of the true amount, whereas the function underestimating Vq (green) suggests collecting 3 million images, which is over three times the true amount. Finally, note that the magnitude of error is much larger in this example when we underestimate v(q; θ) (i.e., the green curve). As q increases, both v(q; θ) and the estimated curve ﬂatten. If the rate of change for both curves approach zero, the distance between the points at which the two curves respectively reach V can increase arbitrarily.

We conclude that data collection policies that simply estimate how much data is needed for a given task can lead to paying arbitrarily large collection costs, even if the estimated learning curves are close to the true curves. Moreover, estimators diverge drastically from the true learning curves when given a limited amount of data. As a result, robust data collection policies must also capture the uncertainty of estimation.

4. Learn-Optimize-Collect (LOC)

Our solution approach, which we refer to as Learn-Optimize-Collect (LOC), minimizes the total collection cost while incorporating the uncertainty of estimation. To facilitate this problem, we assume that D is a continuous random variable that has a well-behaved, diﬀerentiable cumulative density function.

Assumption 2 The random variable D is absolutely continuous and has a diﬀerentiable cumulative density function (CDF) F(q) := Pr{D q} and probability density function (PDF) f(q) := d F(q)/dq.

Under this assumption, we will optimize over a continuous decision space for q and round any non-integer values. Although in practice, D is discrete, it is often realized on the order

Optimizing Data Collection for Machine Learning

Optimize-Collect Sample bootstrap performance

with subsets of data to learn data requirement distribution

Use estimated distribution to

optimize collection cost plus

risk of failing to meet V*

Pay c(qt - qt-1) to collect

additional data

Have we achieved score V*?

Have we hit time limit T?

Terminate Pay P and

Initialize with q0 data

points & target V*

Learn Optimize Collect

Figure 2: In the optimal data collection problem, we iteratively determine the amount of data that we should have, pay to collect the additional data, and then re-evaluate our model. Learn-Optimize-Collect (LOC) optimizes for the minimum amount of data q t to collect.

of thousands or greater, which makes the assumption of continuity mild and rounding errors generally minor. In this section, we ﬁrst propose a stochastic optimization alternative to the optimal data collection problem (3). We then develop our framework, where we estimate the probability distribution of D and optimize over the estimated distribution.

4.1 A Stochastic Reformulation of Optimal Data Collection

Solving problem (3) directly is diﬃcult because evaluating whether or not a given amount of data q is suﬃcient to reach V requires collecting the data itself and training the learning model. However, note that for any solution q to problem (3), if q D , then by deﬁnition Vq V . As a result, consider the following approximation of the original problem:

t=1 c T (qt qt 1) 1 {qt 1 D } + P1 {q T D } . (6)

Problem (6) replaces the condition of achieving V from problem (3) with the condition of collecting at least D points over all of the data sources. We show below that when K = 1, these two problems are exactly equivalent. For general K, the two problems share the same optimal solution.

Lemma 2 The following statements are true:

1. If P < c T (D q0), then an optimal solution to Problem (6) is q 1 = = q T = q0. Otherwise, an optimal solution is q 1 = = q T = D .

2. If K = 1, then problem (6) is equivalent to problem (3).

Proof The proof for the ﬁrst statement is identical to the proof of Theorem 1. To prove the second statement note that from Assumption 1 and the fact that q is a scalar, D = arg minq{q | Vq V }. Thus, we have the following equivalence q D Vq V .

Mahmood, Lucas, Alvarez, Fidler, and Law

Substituting this into the indicators completes the proof.

Lemma 2 states that D is an optimal solution to the approximation (6). Because this is also an optimal solution to the original problem, solving either problem to optimality can achieve the same decision. Furthermore, the total cost is the same for both problems when collecting up to D data points. The approximation (6) relies on D , which is not observable a priori. However, since D

is a random variable, we can take the expectation over the objective in problem (6) to obtain

min q1 q T ED f(q)

t=1 c T (qt qt 1) 1 {qt 1 D } + P1 {q T D }

= min q1 q T

t=1 c T (qt qt 1) (1 F(qt 1)) + P (1 F(q T )) , (7)

where F(q) is the CDF of D . Due to Assumption 2, the reformulated stochastic objective is diﬀerentiable in q. We treat problem (7) to be continuous over q1, , q T and round any non-integer values to determine the amount of data to collect. Therefore, this problem can be solved via gradient descent-based algorithms. Finally, we remark on the interpretation of problem (7). Recall that problem (6) is equivalent to the original data collection problem (3) insofar as they both share an optimal solution D . Furthermore, for any data collection decision qt, the CDF F(qt) gives the probability that qt D . In the optimization problem (7), we determine how much data to collect in each round qt by minimizing the probability of qt D , i.e., 1 F(qt), multiplied by the cost of collecting this additional data and the penalty of failing to achieve the model within T rounds. This approach contrasts with the previous na ıve estimator which directly used a point estimate of D , by now incorporating the risk and potential cost of overor under-collecting data due to the stochasticity in this random variable.

4.2 Estimating-then-Optimizing How Much Data to Collect

Solving problem (7) requires access to the distribution F(q). However, just as we can estimate D , we can now estimate the distribution and use this estimated distribution in the optimization problem. We propose a simple strategy of estimating this distribution by bootstrapping the point estimates of D obtained via Algorithm 1. We ﬁrst use the same steps as before to create a regression set of training statistics R. Then, let B > 1 be the number of bootstrap estimates. For each b {1, . . . , B}, we create a bootstrap resampled set of R and solve a corresponding Least Squares minimization problem to ﬁt a scaling law estimator vb(q; θb) with parameters θb. We use this estimator in place of Vq in problem (4) to estimate the minimum data requirement. After repeating this process, we obtain a bootstrap set of estimates {ˆqb}B b=1, which we then use to ﬁt a Kernel Density Estimator (KDE) ˆf(q) of the PDF of the data requirement. Numerical integration of this KDE model yields the estimated CDF ˆF(q) := R q 0 ˆf(q)dq. Algorithm 2 summarizes the steps. Although there may exist several strategies for estimating this distribution, our bootstrapping procedure remains easy-to-perform and requires computation only to call Algorithm 1 for B rounds. In numerical experiments, we set B = 500 and ﬁnd that this can generate

Optimizing Data Collection for Machine Learning

Algorithm 2 Estimating the Data Requirement Distribution F(q)

1: Input: Initial data set Dq, Regression model ˆv(q; θ), Regression set size R, Number of bootstrap samples B, Kernel Density Estimation (KDE) model ˆf(q). 2: Initialize R = , ˆD = 3: Update R using the Collect Performance Statistics(q) sub-routine of Algorithm 1 4: Initialize Q = 5: Bootstrapped CDF(R) 6: for b {1, . . . , B} do 7: Create bootstrap Rb by sub-sampling R points with replacement from R 8: Fit regression model θ = arg minθ P

(q,Vq) Rb (Vq v(q; θ))2

9: Estimate the data requirement ˆqb = arg minq c Tq | v(q; θ ) V

10: Update Q Q {ˆqb} 11: end for 12: Fit KDE model ˆf(q) using the empirical distribution Q and let ˆF(q) := R q 0 ˆf(q)dq 13: end 14: Output: Estimate of the requirement distribution ˆF(q)

high-quality distribution estimates (see Appendix C.1 for details). Most importantly, the ﬁrst derivative of this CDF ˆF(q) is immediately recoverable as the original KDE model ˆf(q). This derivative is necessary for gradient descent optimization of problem (7).

4.3 Putting It All Together

We now describe the complete framework for optimizing the amount of data to collect in each round of the data collection problem. This framework uses a model-predictivecontrol approach of re-estimating the data requirement distribution, solving the stochastic optimization problem (8), and taking only the recommendation of how much data to collect for the immediate round. Figure 2 summarizes the steps of our algorithm, Learn-Optimize Collect (LOC), which is also described in detail in Algorithm 3. Given a target score V and an initial amount of data q0, we must collect data until we have met the target or until T rounds have passed. In the t-th round, we initialize with qt 1 data points over the K data sources. We ﬁrst collect performance statistics by measuring the training dynamics of the current datasets D1, . . . , DK. In each round t, rather than re-collecting the full performance statistic R in each round, we alternatively update R with the latest result (q, Vqt); this signiﬁcantly reduces the computational burden of collecting training statistics in each round. We then bootstrap these statistics to estimate the data requirement distribution for the t-th round, which we refer to as ˆFt(q). Given an estimated distribution of D , we ﬁrst deﬁne the variables dt RK where dt := qt qt 1 and solve a reformulated version of problem (7), below

min d1,...,d T 0

The above problem (8) is diﬀerentiable on its domain, and the variables are only constrained to non-negativity. Thus, this problem can be treated as a continuous optimization problem with only non-negativity constraints, and can be optimized via projected gradient descent

Mahmood, Lucas, Alvarez, Fidler, and Law

Algorithm 3 Optimal Data Collection via LOC

1: Input: Initial data sets D1, . . . , DK of q0 points, Regression model ˆv(q; θ), Regression set size R, Number of bootstrap samples B, Kernel Density Estimation (KDE) model f(q), Target V , Maximum number of rounds T, Cost c, Penalty P. 2: Initialize round t = 0, Total cost L = 0 3: Initialize statistics R = , datasets ˆD1 = = ˆDK = 4: repeat 5: Update R using the Collect Performance Statistics(qt) sub-routine of Algorithm 1 6: Learn-Optimize-Collect 7: Initialize Q = 8: Update KDE model ˆFt(q) using the Bootstrapped CDF(R) sub-routine of Algorithm 2 9: Freeze variables ds for s < t and solve problem (8) using ˆFt(q) to obtain d t , . . . , d T . 10: end 11: Collect Data(dt) 12: for k {1, . . . , K} do 13: Collect data zk until |Dk| = qt,k + dt,k 14: Update cost L L + ckdt,k 15: end for 16: Re-train model and update performance Vqt 17: end 18: t t + 1 19: until Vqt V or t = T 20: if Vq T < V then 21: Update loss L L + P 22: end if 23: Output: Final collected data sets D1, . . . , DK, Total cost L, Final model performance Vqt

algorithms. Finally, in each round t, we ﬁx the previous decision variables d1, . . . , dt 1 constant to their previously optimized values since we have already collected this data. Let d t , . . . , d T be the solution obtained from using a gradient method to optimize problem (8). We then collect data from each data source as determined by d t , i.e., the recommendation of how much data to collect immediately in round t. Once this data is collected, we re-train the machine learning model to evaluate the current performance Vqt and proceed to the next round of the LOC pipeline.

5. Theoretical Insights in One-Round Data Collection

Although LOC can be used to generate long-term strategic data collection-term decisions over multiple rounds, a common use-case is to obtain a one-round T = 1 estimate of how much data is needed to meet the target V . For example, consider a ﬁrm that is deciding whether or not they should pursue a machine learning-based solution for a speciﬁc problem; to make an informed decision, the ﬁrm may want a single estimate of whether the amount of data needed to build the desired model is ﬁnancially feasible. Such use-cases typically feature a single data type K = 1, a limited or potentially zero initial data q0, and a noisy estimator ˆF(q) of the data requirement distribution F(q) (for example, see Section 3.2). This setting permits theoretical analysis wherein we can derive exact globally optimal solutions

Optimizing Data Collection for Machine Learning

for how much data to collect as well as structural insights to the relationships between costs, penalties, and the estimated data distribution. In this section, we focus on the one-round, single-data source problem:

min d1 cd1 + P 1 ˆF(q0 + d1) . (9)

In Section 5.1, we ﬁrst develop an analytic solution to problem (9). This solution is dependent on the ratio c/P of the per-sample cost and how much we stand to pay in penalty for failing to collect. Then in Section 5.2, under the assumption that ˆF(q) follows a Gaussian distribution, we fully characterize the analytic solution to obtain the globally optimal amount of data to collect. Finally in Section 5.3, we consider the case where ˆF(q) is a noisy estimate of the true data requirement distribution F(q). Here, we develop a regret bound to show that our optimization strategy outperforms the estimation-based approach (i.e., Algorithm 2) to data collection summarized in Section 3.2.

5.1 An Analytic Solution to One-Round Data Collection

The cost c parameter reﬂects real data collection costs whereas the penalty P reﬂects how much a ﬁrm stands to pay if they cannot obtain a model with performance V . Since this term may not have an exact real value, it may be diﬃcult to determine the appropriate P in practice. Instead, consider an alternate, more intuitive parameter ϵ 0 to be a measure of the maximum tolerable probability of not meeting V . Since the data requirement D is stochastic, ϵ represents how much a ﬁrm is willing to tolerate the chance of undercollecting. That is, we should collect enough data d1 such that ˆF(q0 + d1) 1 ϵ. Our main theorem below states an optimal solution d 1 to problem (9) in terms of this acceptable risk ϵ.

Theorem 3 Assume ˆF(q) is strictly increasing and continuous. If there exists d1 0 where

c P ˆF(q0 + d1) ˆF(q0)

then there exists an ϵ 1 ˆF(q0) that satisﬁes P = c/ ˆf( ˆF 1(1 ϵ)) and an optimal solution to problem (9) is d 1 := ˆF 1(1 ϵ) q0. Otherwise, d 1 := 0.

Before proving this result, we ﬁrst summarize the intuition and consequences. Theorem 3 states that for the one-round T = 1 problem, when there is only a single data type K = 1, the optimal one-round estimate of the data requirement is determined by taking a 1 ϵ quantile of the distribution of D . Rather than solving the optimization problem (9), we can instead just use the estimated data requirement distribution ˆF(q) and prescribe a maximum acceptable risk of failing to collect enough data ϵ := Pr{q0 + d1 < D }. We then collect exactly d 1 = ˆF 1(1 ϵ) q0 additional points. The equivalence between simply prescribing a maximum risk ϵ and solving problem (9) is determined via choice of the cost and penalty parameters to satisfy c/P = ˆf( ˆF 1(1 ϵ)). Nonetheless, we remark that this equivalence only holds if the cost-to-penalty ratio c/P is suﬃciently small and also only for T = 1, K = 1. Thus, problem (9) can also be seen as a generalization of Bayesian approaches to determine how much data to collect with respect to estimating the distribution of the learning curves and selecting a minimum pre-determined quantile (Domhan et al., 2015).

Mahmood, Lucas, Alvarez, Fidler, and Law

To prove Theorem 3, we will equate the original problem (9) to the following alternative constrained optimal data collection problem

min d1 c d1 s. t. ˆF(q0 + d1) 1 ϵ , d1 0. (11)

The above problem incorporates the maximum acceptable risk parameter ϵ as a constraint to replace the previous penalty-based formulation. We ﬁrst prove that this problem is is convex and has an intuitive analytic solution.

Lemma 4 Assume that ˆF(q) is strictly increasing and continuous. Then,

1. Problem (11) is a convex optimization problem.

2. The unique optimal solution to problem (11) is d 1 = max{ ˆF 1(1 ϵ) q0, 0}.

Proof To prove the ﬁrst statement, note that the objective and the second constraint are convex. Thus, we only need to prove that the set {d1 | ˆF(q0 + d1) 1 ϵ} is a convex set. Since ˆF(q) is strictly increasing in q, for any θ [0, 1] and ˆd d 0, we have ˆF(q0 + θ ˆd + (1 θ)d) ˆF(q0 + d) 1 ϵ. Because the convex combination of any two points is in the set, the set must be convex, which makes the optimization problem convex. To prove the second statement, we consider two cases. First, if ˆF 1(1 ϵ) q0, let d1 be the value that satisﬁes

q0 + d1 = ˆF 1(1 ϵ) := inf n q | ˆF(q) 1 ϵ o ,

which makes d1 the smallest value to satisfy ˆF(q0+d1) 1 ϵ. Therefore, d 1 = ˆF 1(1 ϵ) q0 is a minimizer in this case. In the second case where ˆF 1(1 ϵ) < q0, we have ˆF(q0) > 1 ϵ. Since d1 0 is a constraint, the minimizer is d 1 = 0.

We prove Theorem 3 by showing that under (10), problems (9) and (11) are equivalent.

Proof of Theorem 3 First, note that when T = 1 and K = 1, problem (9) is a minimization of a continuous function over a single variable with a single non-negative constraint. Then, there must exist an optimal solution d 1 that satisﬁes either d 1 = 0 via the boundary condition, or ˆf(q0 + d 1) = c/P as a zero-gradient solution. Moreover, for the zero gradient solution to be optimal, there must exist some decision d1 > 0 that achieves a lower objective than the zero solution. We rewrite this condition to

cd1 + P(1 ˆF(q0 + d1)) P(1 ˆF(q0)) = c

P ˆF(q0 + d1) ˆF(q0)

To derive the structure of the optimal solution, we rewrite the original problem (9) to

min ϵ min d1 cd1 + Pϵ s. t. ϵ 1 ˆF(q0 + d1), d1 0, (13)

where ϵ is a slack variable. However, for any ﬁxed value of ϵ, the inner problem (13) is equivalent to the constrained alternative problem (11), meaning that it has an analytic

Optimizing Data Collection for Machine Learning

solution as determined by Lemma 4. Speciﬁcally, for any ϵ 1 ˆF(q0), the optimal solution is d 1 = 0, but for any ϵ 1 ˆF(q0), the optimal solution is d 1 = ˆF 1(1 ϵ) q0. Consequently, problem (13) can be further rewritten as

n c ˆF 1(1 ϵ) q0 + Pϵ | ϵ 1 ˆF(q0) o , P(1 ˆF(q0)) . (14)

Note that problem (14) has an equivalent optimality condition to (12). That is, for the left term to be the minimum, there must exist some ϵ 1 ˆF(q0) that satisﬁes

c ˆF 1(1 ϵ) q0 + Pϵ P(1 ˆF(q0)) = c

P 1 ˆF(q0) ϵ ˆF 1(1 ϵ) q0 (15)

Setting d1 = ˆF 1(1 ϵ) q0 make conditions (12) and (15) equivalent. Most importantly, this must also hold for the minimizer d 1 = ˆF 1(1 ϵ ) q0. We can substitute this condition back into the zero-gradient solution of problem (9) to obtain

f ˆF 1(1 ϵ) = c

P , ϵ 1 ˆF(q0).

If such an ϵ does not exist, then the optimal solution must be d 1 = 0 and ϵ = 1 ˆF(q0).

Finally, we remark on the assumption that ˆF(q) be strictly increasing and continuous. Since in practice, ˆF(q) is the integral of a KDE ˆf(q), our assumption is always satisﬁed given an appropriate kernel (e.g., Gaussian). Furthermore as we show next, we can derive an exact formula under the case where the estimated data requirement distribution is Gaussian.

5.2 An Analytic Solution for Gaussian Distributions

Theorem 3 demonstrates an equivalence between optimizing how much data to collect given certain costs and penalties versus collecting data according to a minimum pre-speciﬁed risk tolerance ϵ. However, to use Theorem 3, we must compute both ˆF(q) and ˆF 1(1 ϵ). With a structural form of this CDF, we may further simplify Theorem 3 to obtain an exact analytic formula on how much data to collect. We focus on the case where the estimated data requirement follows a Gaussian distribution D N(ˆµ, ˆσ) and where F(q) := 1/2 + erf((q ˆµ)/(

2ˆσ))/2 is the corresponding Gaussian CDF. The motivation for a Gaussian distribution stems from an observation that even when we estimate ˆF(q) using KDE, the estimated distribution is often unimodular (see Appendix C.1 for details on this observation). As such, an intuitive strategy may be to simply estimate ˆF(q) using a single Gaussian distribution rather than with many Gaussian kernels. We ﬁrst consider the case where our estimated CDF is exactly equal to the true CDF ˆF(q) = F(q). Here, problem (9) simpliﬁes to

min d1 cd1 + P

1 erf q0 + d1 ˆµ

We can obtain an exact solution for any cost and penalty values. We ﬁrst show that if the penalty P is too small, then the optimal solution to problem (16) is to collect no data,

Mahmood, Lucas, Alvarez, Fidler, and Law

i.e., the penalty for failing to achieve V is insuﬃcient to outweigh the cost of collecting additional data. We then show that given a suﬃciently large P, the optimal amount of data to collect can only take one of two possible values.

Proposition 5 Suppose that ˆF(q) models a Gaussian distribution N(ˆµ, ˆσ). If P < cˆσ

2π, then the optimal solution to problem (16) is d 1 = 0.

Proof The gradient of the objective function of problem (16) is

2πˆσ exp (q0 + d1 ˆµ)2

Setting this gradient to zero yields

However, this only exists if P > cˆσ

2π; otherwise if P < cˆσ

2π, then there is no zerogradient solution to problem (16). Furthermore, by inspection, the objective function is monotone non-decreasing, meaning that the minimizer is the boundary point d 1 = 0.

Proposition 5 states that if the penalty is less than cˆσ

2π, then it is strategically better to not collect any data at all. Recall that P represents a potential loss from failing to develop a machine learning model that achieves V performance. Thus, not collecting additional data is equivalent to deciding not to build this model and simply accepting the penalty. This represents real-world scenarios where the costs of developing a machine learning model outweigh the beneﬁts that this machine learning model can yield to the developer. If the penalty is suﬃciently large such that d 1 > 0, then building the machine learning model can be deemed useful. The minimum value on the penalty parameter is dependent only on the cost and the standard deviation ˆσ of the estimated data requirement distribution. When using LOC, if the variance of the estimated distribution is too large, then there is an increasingly larger chance that the minimum amount of data required D is low. If the penalty for undercollecting is low compared to this probability, then the optimal strategy would be to refrain from collecting data and correspondingly incur the penalty, rather than overcollect data. Figure 3 (Left) visualizes this trend by sweeping diﬀerent values for P and plotting the objective function of problem (16). Here, if P < ˆσ

2π (blue and orange curves), then the optimal d 1 = 0.

Corollary 6 Suppose that ˆF(q) models a Gaussian distribution N(ˆµ, ˆσ) and P > cˆσ

log P log(cˆσ

2π). The optimal solution to problem (16) satisﬁes:

1. If q0 ˆµ

2ˆσζ q0 if c

P < erf ζ erf q0 ˆµ

0 otherwise

Optimizing Data Collection for Machine Learning

2ˆσζ < q0 ˆµ +

2ˆσζ, then d 1 = ˆµ +

3. If q0 ˆµ +

2ˆσζ, then d 1 = 0.

Proof Following the same steps from the proof to Proposition 5, we ﬁnd that the optimal solution must be a zero-gradient point or the boundary point

d 1 n 0, ˆµ

2ˆσζ q0 o ,

conditioned on whether the zero-gradient solution is feasible. Furthermore, we can show below that ˆµ

2ˆσζ q0 is a local maximizer whereas ˆµ +

2ˆσζ q0 is a local minimizer via the second-order condition. Speciﬁcally, note that the second derivative of the objective function is

P (q0 + d1 ˆµ)

(q0 + d1 ˆµ)2

Substituting d1 = ˆµ

2ˆσζ q0 into the above equation admits a negative second derivative, whereas d1 = ˆµ +

2ˆσζ q0 admits a positive second derivative. Therefore, the optimal amount of data to collect can only be in

d 1 n 0, ˆµ +

2ˆσζ q0 o .

We now break the problem down into the three cases. First, if q0 ˆµ

2ˆσζ, then both the local minimizer and the boundary point are feasible. For d 1 = ˆµ +

2ˆσζ to be the global minimizer, the condition on the c/P ratio from Theorem 3 must hold. That is,

1 2 1 + erf q0+

2 1 + erf q0 ˆµ

2ˆσζ q0 = erf ζ erf q0 ˆµ

Otherwise, the optimal solution in this regime is d 1 = 0. Second, if ˆµ

2ˆσζ < q0 ˆµ +

2ˆσζ, then note that the objective function within the feasible region for d1 consists of a curve with a single local minimum and no other zero-gradient points. Therefore, this local minimum d 1 = ˆµ +

2ˆσζ must be the global minimizer. Finally, if q0 ˆµ +

2ˆσζ, then none of the zero-gradient points are feasible solutions. Therefore, the optimal solution must lie on the boundary point d 1 = 0.

Corollary 6 gives an exact characterization of the optimal data collection problem when ˆF(q) is modeled via a single Gaussian distribution. We ﬁrst conﬁrm that the penalty is suﬃciently high for the problem to be meaningful; if the penalty for under-collection is too small, then it would incur lower costs to simply not collect any data. Managerially, this implies that the machine learning product is a poor investment as the costs of data collection are higher than simply not building the machine learning model.

Mahmood, Lucas, Alvarez, Fidler, and Law

0 200000 400000 600000 800000 d1

Collection Costs with different P

N(106, 104), q0 = 104, c = 1

1e+03 1e+04 1e+05 1e+06

300000 400000 500000 q0 + d1

Collection Costs with different q0

N(106, 104), c = 1, P = 105

Figure 3: Evaluating problem (16). Left: We set c = 1, q0 = 104 and sweep diﬀerent values for P. If P is suﬃciently small (i.e., P < cˆσ

2π), then the total expected cost is minimized by setting d1 = 0 (i.e., orange, blue curves). Right: We set c = 1, P = 105 and sweep diﬀerent values for q0. The dashed lines point to the local maxima and minima at ˆµ

2ˆσζ. When q0 ˆµ +

2ˆσζ, the optimal d 1 = 0 (i.e., red, purple). When q0 [ˆµ

2ˆσζ], the optimal d 1 is at the zero-gradient minima (i.e., green, red). When q0 < ˆµ

2ˆσζ, the optimal d 1 depends on the relationship between q0, c, P. That is, d 1 = 0 for the blue curve, but d 1 is equal to the zero-gradient minima for the orange curve.

Assuming the penalty is suﬃciently high, we can compare the value of the initial amount of data q0 to the critical points ˆµ

2ˆσζ. Speciﬁcally, if q0 is greater than ˆµ +

2ˆσζ, then the optimal solution to the data collection problem is to not collect any data. This scenario is equivalent to setting a small value of P (c.f. Proposition 5), since ζ is a function of P, c, and ˆσ. On the other hand, if q0 is between ˆµ

2ˆσζ and ˆµ +

2ˆσζ, then the optimal solution is always to collect data up to having ˆµ +

2ˆσζ points. Finally, if q0 is very small, i.e., less than ˆµ

2ˆσζ, then we must again check if the ratio of cost over penalty is suﬃciently small before collecting data. If this ratio is too high, then the penalty is again too low, and the optimal strategy would be to not collect any data and incur the penalty. Figure 3 (Right) visualizes the implications of Corollary 6 by evaluating the objective function of problem (16) for diﬀerent values of q0; we plot with respect to q = q0 + d1, since the zero-gradient points for this formulation are always ˆµ

2ˆσζ for any q0. Here, we observe that if q0 ˆµ

2ˆσζ, the optimal total amount of data is q = max ˆµ +

2ˆσζ, q0 (i.e., the green, red, and purple curves). On the other hand if q0 < ˆµ

2ˆσζ, then the optimal amount of data to collect is equal to the zero-gradient minimizer only if q0 is suﬃciently large with respect to ζ, and therefore c and P (i.e., the orange curve). Otherwise d 1 = 0 and q = q0 (i.e., the blue curve). This condition for how large q0 must be can be most easily determined by checking whether the cost-to-penalty ratio c/P satisﬁes the conditions from Theorem 3.

5.3 Regret Bounds under Noisy Estimates of the Gaussian CDF

We have shown so far that in the single-round setting, the optimal decision d 1 has an analytic solution, especially when the estimated ˆF(q) of D follows a Gaussian distribution N(ˆµ, ˆσ).

Optimizing Data Collection for Machine Learning

In practice however, the estimated ˆF(q) = F(q) will diﬀer from the true CDF of D . In this sub-section, we assume that D N(µ, σ) follows a Gaussian distribution and ˆF(q) is a noisy estimate where ˆµ = µ and ˆσ = σ. Our goal is to study the expected regret R(d1) from a data collection decision d1 made via the noisy estimate ˆF(q) versus the true optimal solution D to problem (9):

R(d1) :=ED N(σ,µ) h cd1 + P1 {q0 + d1 < D } c (D q0) i

=c (d1 µ + q0) + P (1 F(q0 + d1)) .

This regret is measured in expectation over the true distribution of D . We compare two data collection strategies on the amount of regret incurred. First, we consider LOC, which determines the amount of data to collect d 1 from Corollary 6. Speciﬁcally, we focus on the case where d 1 = ˆµ +

2ˆσζ q0, rather than equal to 0. To compare against LOC, we consider the na ıve estimation-only baseline from Section 3.2. This baseline sets d1 = ˆµ q0, which means it collects enough data to reach the estimated expected value of the minimum data requirement. We show that when the data collection decision is made using the LOC strategy, the expected regret is lower than if the d1 is determined using the estimation-only approach. We consider two scenarios, where the estimated distribution is noise-free (i.e., ˆµ = µ, ˆσ = ˆσ) and one where the estimated mean and variance diﬀer from their true values (i.e., ˆµ = µ, ˆσ = σ). Below, we ﬁrst consider the noise-free setting to demonstrate that LOC incurs a lower expected regret over the distribution of D versus the na ıve estimation-only approach.

Lemma 7 Suppose we perfectly estimate F(q), i.e., ˆµ = µ, ˆσ = σ. Then, R(d 1) R(ˆµ q0) = P

Proof The inequality follows from the fact that d 1 minimizes the objective cd1 + P(1 F(q0 + d1)). The equality follows from substituting ˆµ q0 into the regret equation.

Lemma 7 states that even when we know the exact distribution D , it is disadvantageous to make data collection decisions from using only the estimate. Due to the inherent variability in the distribution, the expected regret is P/2 when using an estimation-only strategy. Speciﬁcally, using the expected value means that with 50% probability, the na ıve strategy will under-estimate how much data to collect, and consequently incur the penalty. Instead, setting d 1 according to Corollary 6 will incur a lower expected regret. Note that Lemma 7 holds for any symmetric probability distribution on D and not just the Gaussian setting. We now present our main theoretical result, which explores the case where our estimated distribution N(ˆµ, ˆσ) of the minimum data requirement D diﬀers from the true distribution N(µ, σ). If the estimated distribution is suﬃciently close to the true distribution, then the data collection decision generated by LOC will still incur a lower expected regret than a data collection decision generated by estimation alone. This suﬃciency is characterized as an upper bound on the Total Variation distance between the two distributions, with respect to the cost, penalty, initial data, and estimated mean.

Mahmood, Lucas, Alvarez, Fidler, and Law

Proposition 8 Let d TV := sup A | Pr D N(µ,σ){A} Pr D N(ˆµ,ˆσ){A}| be the Total Variation distance between N(µ, σ) and N(ˆµ, ˆσ). If

d TV c 2P (ˆµ q0) 1

then, R(d 1) R(ˆµ q0).

Proof We want to show

0 R(d 1) R(ˆµ q0) (18)

2ˆσζ + P 1 F(ˆµ +

2ˆσζ) c (ˆµ µ) P (1 F(ˆµ)) (19)

2ˆσζ + P Pr D N(µ,σ){ˆµ +

2ˆσζ < D } Pr D N(µ,σ){ˆµ < D } (20)

2ˆσζ P Pr D N(µ,σ){ˆµ D < ˆµ +

2ˆσζ} . (21)

Above, (19) applies the deﬁnition of R(d 1) and R(ˆµ q0), (20) rewrites the CDFs and (21) collects the corresponding probabilities. Note from Corollary 6 that d 1 = ˆµ +

2ˆσζ implies it incurs a lower objective function value that d 1 = 0 with respect to problem (9). That is,

2ˆσζ q0 + P Pr D N(ˆµ,ˆσ){ˆµ +

2ˆσζ < D } P Pr D N(ˆµ,ˆσ){q0 < D } . (22)

By adding and subtracting P Pr D N(µ,σ){ˆµ +

2ˆσζ D } and P Pr D N(µ,σ){q0 D }, respectively, on both sides of the above equation and re-arranging the terms, we obtain

2ˆσζ q0 P Pr D N(µ,σ){q0 < D } Pr D N(µ,σ){ˆµ +

P Pr D N(ˆµ,ˆσ){q0 < D } Pr D N(µ,σ){q0 < D }

+ P Pr D N(µ,σ){ˆµ +

2ˆσζ < D } Pr D N(ˆµ,ˆσ){ˆµ +

2ˆσζ < D } (23)

Collecting the probability terms in the left-hand-side of (23) and bounding the right-hand-side by the total variation distance yields

2ˆσζ + c (ˆµ q0) P Pr D N(µ,σ){q0 < D < ˆµ +

2ˆσζ} 2Pd TV (24)

2ˆσζ + c (ˆµ q0) P Pr D N(µ,σ){q0 < D < ˆµ}

P Pr D N(µ,σ){ˆµ D < ˆµ +

2ˆσζ} 2Pd TV (25)

Above (25) breaks the probability into two independent components between [q0, ˆµ) and [ˆµ, ˆµ +

Optimizing Data Collection for Machine Learning

From (25), it remains only to prove that 2Pd TV + P Pr D N(µ,σ){q0 D < ˆµ} c(ˆµ q0) 0. Here, we rearrange the Total Variation distance bound (17) to

c(ˆµ q0) P(2d TV + 1) P 2d TV + Pr D N(µ,σ){q0 D < ˆµ} ,

showing that the inequality is satisﬁed and proving inequality (21).

Proposition 8 guarantees that even when LOC uses a noisy estimate of ˆF(q) of the true CDF F(q) of the minimum data requirement, the optimization problem solved by LOC will yield, on average, lower cost and regret when compared to a strategy that only estimates D from the noisy ˆF(q). This guarantee holds under condition (17), which states that the estimated distribution is suﬃciently close to the true distribution. We ﬁrst note that this upper bound can be computed from known values and furthermore, can be controlled by appropriate selection of the penalty parameter. Speciﬁcally, if we suspect the estimated ˆF(q) is far from F(q), then d TV is large, and we can select a smaller P to ensure that LOC will generate a high-quality data collection decision. If we suspect that ˆF(q) is close to F(q) or if we observe that ˆµ q0 is large, then we are safe to select a large P and still ensure that LOC will generate a high-quality decision. Finally, we note that the bound in Proposition 8 is a suﬃciency condition only and not necessary for guaranteeing the quality of LOC-generated data collection decisions.

6. Empirical Results

In this section, we numerically evaluate LOC on data collection for six computer vision data sets spread over three tasks: image classiﬁcation, object detection, and segmentation. We evaluate on K = 1 and K = 2, which are the two most common use-cases for obtaining high-level collection guidelines on data collection. In the ﬁrst setting, a ﬁrm can only determine the data set size without focusing on the type of data. The second setting reﬂects a common problem class where data can be categorized into expensive and inexpensive types, e.g., long-tail versus common data or acquiring real versus auto-labeled data via semi-supervised learning. Our experiments reveal:

Section 6.2: For every data set and task, LOC signiﬁcantly reduces the number of instances where we fail to meet the data requirement V , while incurring only marginally more expensive costs compared to an oracle policy that determines exactly the minimum amount of data. In contrast, data collection decisions via estimation alone almost always leads to undercollecting and failing to meet V .

Section 6.3: LOC is robust to both the cost and penalty parameters. For nearly all settings, modifying either parameter by orders of magnitude does not signiﬁcantly aﬀect the quality of decisions. Therefore in practical use, it is suﬃcient to roughly estimate the costs and penalties rather than obtain precise values.

Section 6.4: LOC can be easily adapted to solve custom economic analyses and machine learning scenarios. We demonstrate two unique examples where a ﬁrm would like to (i) modify an existing machine learning model to now accommodate a new class;

Mahmood, Lucas, Alvarez, Fidler, and Law

and (ii) contrast two entirely diﬀerent data collection policies. For both cases, LOC yields appropriate data collection decisions.

We expand on our experiment setup in Appendix B. We expand on the main results in Appendix C and include further ablations and comparisons against diﬀerent baselines.

6.1 Experiment Setup

We explore three tasks for K = 1. First, we consider classiﬁcation on CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), and Image Net (Deng et al., 2009), where we train Res Nets (He et al., 2016) to meet a target validation accuracy. We explore semantic segmentation using Deeplabv3 (Chen et al., 2018) on BDD100K (Yu et al., 2020), which is a large-scale driving data set, as well as Bird s-Eye-View (BEV) segmentation on nu Scenes (Caesar et al., 2020) using the Lift Splat architecture (Philion and Fidler, 2020); both tasks require a target mean intersection-over-union (Io U). Finally, we explore 2-D object detection on PASCAL VOC (Everingham et al., 2007, 2012) using SSD300 (Liu et al., 2016), where we evaluate mean average precision (m AP). We evaluate two tasks for K = 2. First, we mimic the scenario of long-tail or imbalanced learning where data for some classes (e.g., long-tail) is much more expensive to collect than others. Here, we divide CIFAR-100 into two subsets containing data from the ﬁrst and last 50 classes, respectively, and assign a higher cost to the ﬁrst 50 classes versus a lower cost to the last 50. Our second experiment models the scenario where one can acquire real labeled data versus use a prior model to cheaply autolabel data. Labeled data incurs a higher cost from collection plus annotation whereas autolabeled data incurs only collection costs, since autolabeled annotations are eﬀectively free. We design this experiment using the labeled and unlabeled data splits of BDD100K. We simulate the deep learning workﬂow following the procedure of Mahmood et al. (2022b), to approximate the true problem while simplifying the experiments (see Appendix B for details). To avoid repeatedly sampling data, re-training a model, and evaluating the score, this simulation uses a piecewise-linear approximation of a ground truth learning curve that returns model performance as a function of data set size. We initialize with q0 = 10% of the full data set (we use 20% for VOC). In each round, we solve for the amount of data to collect and call the piecewise-linear learning curve to obtain the current score. For each data set and task, we consider T = 1, 3, 5 rounds. For each T, we sweep V [V (Dq0) + 1, V (D)] where D is the entire labeled data set; we then aggregate the LOC performance over the full range of potential target settings. For example, in our experiments of optimizing the number of training examples of CIFAR-100 to collect, we ﬁnd that the initial dataset Dq0 achieves 42% accuracy and D achieves 75% accuracy; consequently, we sweep over all values of V [42, 75] and evaluate the average performance of the data collection policies. We ﬁrst evaluate policies on the failure rate, which is the fraction of V

settings in which our data collection policy fails to achieve the target performance within T rounds. Second, we evaluate the cost ratio

c T (q T q0) c T (D q0) 1,

Optimizing Data Collection for Machine Learning

which is the relative sub-optimality of the policy compared to an oracle policy that knows exactly D a priori. To ensure that this metric is non-negative and is scaled reasonably, we average the cost ratio over all settings of V in which the policy yields a model that exceeds V ; i.e., this metric ignores instances that fail to collect enough data. For K = 1, we also measure the ratio of points collected q T /D following Mahmood et al. (2022b). Although there is a trade-oﬀbetween low cost ratio (undercollecting) and failure rate (overcollecting), we emphasize that our goal is to have low cost but with zero chance of failure. Our primary baseline is the conventional estimation detailed in Section 3.2, which ﬁts a regression model to the learning curve statistics, extrapolates the learning curve for more data, and then solves for the minimum data requirement under this extrapolation. We refer to this as the Power Law Regression approach. In addition, we also ablate the eﬀect of optimizing the minimum data requirement by comparing against an intermediate estimation baseline which estimates the distribution of the minimum data requirement and selects the average estimated value. This baseline, referred to as Ensemble Regression, leverages the bootstrapping and density estimation procedure to learn the distribution. There are many diﬀerent regression models that can be used to ﬁt learning curves (Jones et al., 2003; Figueroa et al., 2012; Hestness et al., 2017; Hoiem et al., 2021; Viering and Loog, 2022). Since power laws are the most commonly studied approach in the neural scaling law literature, we focus on power laws here, but compare against the other functions from Table 1 in Appendix C.2.

6.2 Main Results: The Value of Optimization over Estimation

Below, we discuss results for K = 1 and K = 2.

6.2.1 Case with K = 1

We ﬁrst discuss experiments on K = 1. Figure 4 compares LOC versus the corresponding power law regression baseline when c = 1. We ﬁx P = 107 as the default penalty for CIFAR10, CIFAR-100, BDD100K, and nu Scenes, but set P = 108 for Image Net and P = 106 for VOC1. If a curve is below the black line, then it failed to collect enough data to meet the target. LOC consistently remains above this black line for most settings, whereas even with T = 5 rounds, collecting data based only on regression estimates leads to failure. Table 2 aggregates failure rates and cost ratios for each setting. Here, LOC fails at less than 10% of instances for 12/18 settings, whereas regression fails over 30% for 15/18 settings. In particular, regression nearly always under-collects data when given a single T = 1 round. Here, LOC reduces the risk of under-collecting by 40% to 90% over the baseline. Moreover, our cost ratios are consistently less than 0.5 for 12/18 settings, meaning that we spend at most 50% more than the true minimum cost. Table 2 also ablates the eﬀect of optimization versus only estimating the distribution. The ensemble uses the same KDE distribution as in LOC but simply outputs the mean of the distribution. First, LOC always outperforms the Ensemble Regression baseline, showing that

1. This tuning is due to the fact that Image Net naturally has an order of magnitude more data within it, meaning that the scale of data required to reach target performances is naturally higher; nonetheless in Section 6.3, we demonstrate that our LOC results are generally robust to order-of-magnitude shifts in the penalty parameter for most settings.

Mahmood, Lucas, Alvarez, Fidler, and Law

80 85 90 V *

Ratio of points q * /D *

CIFAR-10 (T=1)

Power Law LOC

80 85 90 V *

Ratio of points q * /D *

CIFAR-10 (T=3)

Power Law LOC

80 85 90 V *

Ratio of points q * /D *

CIFAR-10 (T=5)

Power Law LOC

40 50 60 70 V *

Ratio of points q * /D *

CIFAR-100 (T=1)

Power Law LOC

40 50 60 70 V *

Ratio of points q * /D *

CIFAR-100 (T=3)

Power Law LOC

40 50 60 70 V * 0.8

Ratio of points q * /D *

CIFAR-100 (T=5)

Power Law LOC

45 50 55 60 65 V *

Ratio of points q * /D *

Imagenet (T=1)

Power Law LOC

45 50 55 60 65 V *

Ratio of points q * /D *

Imagenet (T=3)

Power Law LOC

45 50 55 60 65 V *

Ratio of points q * /D *

Imagenet (T=5)

Power Law LOC

66 68 70 72 74 V *

Ratio of points q * /D *

Power Law LOC

66 68 70 72 74 V *

Ratio of points q * /D *

Power Law LOC

66 68 70 72 74 V *

Ratio of points q * /D *

Power Law LOC

20 22 24 26 V *

Ratio of points q * /D *

BDD100K (T=1)

Power Law LOC

20 22 24 26 V *

Ratio of points q * /D *

BDD100K (T=3)

Power Law LOC

20 22 24 26 V *

Ratio of points q * /D *

BDD100K (T=5)

Power Law LOC

22 24 26 28 30 V *

Ratio of points q * /D *

nu Scenes (T=1)

Power Law LOC

22 24 26 28 30 V *

Ratio of points q * /D *

nu Scenes (T=3)

Power Law LOC

22 24 26 28 30 V *

Ratio of points q * /D *

nu Scenes (T=5)

Power Law LOC

Figure 4: Mean standard deviation of 5 seeds of the ratio of data collected q T /D for diﬀerent V . The rows correspond to T = 1, 3, 5 and the columns to diﬀerent data sets. The black line corresponds to collecting exactly the minimum data requirement. LOC almost always remains slightly above the black line, meaning we rarely fail to meet the target.

Optimizing Data Collection for Machine Learning

Data set T Power Law Regression Ensemble Regression LOC (Ours)

Failure rate Cost ratio Failure rate Cost ratio Failure rate Cost ratio

CIFAR-10 1 100% 100% 60% 0.19 0.00 3 95% 0.00 0.00 100% 32% 0.05 0.00 5 86% 0.00 0.00 92% 0.00 0.00 29% 0.03 0.00

CIFAR-100 1 56% 0.12 0.00 57% 0.11 0.00 4% 0.99 0.01 3 48% 0.10 0.00 51% 0.10 0.00 3% 0.31 0.00 5 48% 0.10 0.00 46% 0.09 0.00 2% 0.19 0.00

Imagenet 1 99% 0.00 0.00 100% 37% 0.49 0.00 3 75% 0.01 0.00 96% 0.00 0.00 5% 0.16 0.00 5 56% 0.01 0.00 82% 0.00 0.00 2% 0.10 0.00

BDD100K 1 77% 0.03 0.01 69% 0.07 0.01 12% 2.03 0.10 3 31% 0.00 0.00 23% 0.03 0.00 0% 0.72 0.03 5 23% 0.01 0.00 15% 0.03 0.00 0% 0.35 0.02

nu Scenes 1 95% 0.00 0.00 95% 0.00 0.00 52% 0.16 0.00 3 71% 0.01 0.00 90% 0.00 0.00 0% 0.09 0.00 5 62% 0.00 0.00 76% 0.00 0.00 0% 0.04 0.00

VOC 1 36% 1.24 0.06 33% 1.12 0.06 25% 0.56 0.02 3 8% 0.88 0.04 6% 0.80 0.04 0% 1.10 0.07 5 6% 0.86 0.04 6% 0.81 0.04 0% 0.84 0.04

Table 2: Average cost ratio standard error and failure rate measured over a range of V

for each T and data set. We ﬁx c = 1 and P = 107 (P = 106 for VOC and P = 108 for Image Net). The best performing failure rate for each setting is bolded. LOC consistently reduces the average failure rate, often down to 0%, while keeping the average cost ratio almost always below 1 (i.e., spending at most 2 the optimal amount).

optimization is indeed necessary when determining how much data to collect. Furthermore, the Ensemble Regression only outperforms the na ıve Power Law Regression baseline for BDD100K and VOC. For the other four data sets, the two baselines are either equivalent or Power Law Regression outperforms Ensemble Regression. This shows that estimating the distribution of the minimum data requirement is insuﬃcient for determining how much data to collect. Speciﬁcally, estimating this distribution can capture the uncertainty in the na ıve Pow Law Regression estimator, but knowing this stochasticity does not yield accurate data collection decisions. This result validates the two-step approach of LOC. Finally, we remark that our results here for T = 1 validate our theoretical analysis in Section 5, which showed that when our estimator of the data requirement distribution produces underestimates, it incurs signiﬁcantly larger regret than LOC. This can be observed particularly in the failure rates, as every failure will incur a penalty of P. Note that the cost ratios listed in Table 2 do not include P, since the penalty is always signiﬁcantly greater than the collection cost and including it would obfuscate cost comparisons for instances where V was reached. In other words, for CIFAR-10, Image Net, or nu Scenes, the baseline incurs a true unﬁltered cost ratio (i.e., factoring in the penalty) at approximately P divided by the oracle cost. Due to the high failure rate of the baseline, this unﬁltered cost ratio would be several orders of magnitude higher than the values reported in Table 2. Since,

Mahmood, Lucas, Alvarez, Fidler, and Law

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=5)

Power Law c=[0.01 0.005]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=5)

Power Law c=[0.01 0.002]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=5)

Power Law c=[0.01 0.001]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=5)

Power Law c=[0.01 0.0005]

40 50 60 70 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=5)

Power Law c=[1. 0.1]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=5)

Power Law c=[1. 0.05]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=5)

Power Law c=[1. 0.01]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=5)

Power Law c=[1. 0.005]

36 38 40 42 V * 0

Failure Rate

Figure 5: Mean standard deviation over 5 seeds of the cost ratio and failure rate for diﬀerent V , after removing 99-th percentile outliers. The columns correspond to scenarios where the ﬁrst set c1 costs increasingly more than the second c2. See Appendix C for the results with T = 1, 3.

LOC incurs less than 50% failure rate for all three of these instances, our reported cost ratios more accurately reﬂect the true values.

6.2.2 Case with K = 2

We now discuss experiments on K = 2. Figure 5 compares LOC versus regression at T = 5 with diﬀerent costs. LOC decreases the failure rates while keeping a similar cost ratio to the baseline. We include similar plots for T = 1, 3 in Appendix C.4. Table 3 aggregates failure rates and cost ratios for all settings, showing LOC consistently achieves lower failure rates for nearly all settings of T. When T = 5, LOC also achieves lower cost ratios versus regression on CIFAR-100, meaning that with multiple rounds of collection, we can ensure meeting performance requirements while paying nearly the optimal amount of data. However, the optimization problem is generally more diﬃcult as K increases and we sometimes over-collect data by several orders of magnitude margins. Because these outliers drastically skew the summary average statistics that we measure, we remove the 99-th percentile with respect to total cost for both LOC and the baseline regression estimator. In practice, outlier data collection decisions would naturally be pruned by decision-makers via common-sense reasoning. For instance, if an algorithm suggests to collect 10 million unlabeled images when a human expert would guess at 10 thousand images, the practical decision would not be to blindly use the algorithmic solution. Although the results in this section compare speciﬁcally against power law-based regression, Mahmood et al. (2022b) show that we can use other regression functions (e.g., see Table 1) as well as modiﬁcations to reduce the undercollection of these estimation-based

Optimizing Data Collection for Machine Learning

Data set T Cost Power Law Regression LOC

Failure rate Cost ratio Failure rate Cost ratio

CIFAR-100 (2 Types)

(0.01, 0.0005) 62% 0.84 0.02 40% 41.80 1.80 (0.01, 0.001) 58% 1.19 0.01 46% 9.85 0.27 (0.01, 0.002) 56% 1.55 0.01 54% 6.98 0.18 (0.01, 0.005) 54% 1.65 0.01 33% 4.43 0.07

(0.01, 0.0005) 43% 3.47 0.16 30% 4.88 0.26 (0.01, 0.001) 45% 1.22 0.02 43% 1.31 0.03 (0.01, 0.002) 45% 1.47 0.02 44% 1.21 0.02 (0.01, 0.005) 38% 1.31 0.01 36% 1.17 0.01

(0.01, 0.0005) 38% 3.31 0.16 24% 5.19 0.22 (0.01, 0.001) 35% 1.22 0.02 24% 0.79 0.01 (0.01, 0.002) 37% 1.33 0.01 38% 0.90 0.01 (0.01, 0.005) 36% 1.30 0.01 24% 0.82 0.00

BDD100K (Semi-supervised)

(1, 0.005) 86% 0.11 0.01 44% 7.02 1.11 (1, 0.01) 79% 0.15 0.01 30% 13.47 1.50 (1, 0.05) 72% 0.19 0.01 49% 1.02 0.19 (1, 0.1) 70% 0.19 0.01 65% 0.40 0.05

(1, 0.005) 23% 0.18 0.01 7% 1.20 0.12 (1, 0.01) 21% 0.15 0.00 7% 2.57 0.58 (1, 0.05) 26% 0.18 0.01 23% 0.50 0.06 (1, 0.1) 26% 0.21 0.01 30% 0.15 0.01

(1, 0.005) 16% 0.22 0.01 2% 1.91 0.30 (1, 0.01) 21% 0.15 0.00 2% 0.86 0.13 (1, 0.05) 16% 0.17 0.01 9% 0.27 0.03 (1, 0.1) 16% 0.20 0.01 7% 0.32 0.03

Table 3: Average cost ratio standard error and failure rate over diﬀerent V for each T and c, after removing 99-th percentile outliers. We ﬁx P = 1013 for CIFAR-100 and P = 108

for BDD100K. The best performing failure rate for each setting is bolded. LOC reduces the average failure rate, is more robust to uneven costs than regression, and for T > 1, preserves the cost ratio.

baselines. In Appendix C.2, we include further results comparing against these alternate functions. In every setting, we show that while baseline estimators consistently either underor overestimate how much data to collect, LOC consistently reduces either the cost or the failure rates signiﬁcantly.

6.3 Robustness to Cost and Penalty Parameters

Figure 6 evaluates the ratio of points collected for T = 5 when the cost and the penalty of the optimization problem are varied. We include similar results for T = 1, 3 in Appendix C.5. LOC is robust to variations in these parameters, as LOC retains the same shape and scale for almost every parameter setting and data set. Further, LOC consistently remains above the horizontal 1 line, showing that even after varying c and P, we do not fail as frequently as the baseline. Finally, validating Theorem 3, the penalty parameter P provides natural

Mahmood, Lucas, Alvarez, Fidler, and Law

80 85 90 V *

Ratio of points q * /D *

CIFAR-10 (T=5)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

40 50 60 70 V *

Ratio of points q * /D *

CIFAR-100 (T=5)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

45 50 55 60 65 V *

Ratio of points q * /D *

Imagenet (T=5)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

80 85 90 V *

Ratio of points q * /D *

CIFAR-10 (T=5)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

40 50 60 70 V * 0.8

Ratio of points q * /D *

CIFAR-100 (T=5)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

45 50 55 60 65 V *

Ratio of points q * /D *

Imagenet (T=5)

Power Law P=1e+07 P=1e+08 P=1e+09 P=1e+10

66 68 70 72 74 V *

Ratio of points q * /D *

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

20 22 24 26 V *

Ratio of points q * /D *

BDD100K (T=5)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

22 24 26 28 30 V *

Ratio of points q * /D *

nu Scenes (T=5)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

66 68 70 72 74 V *

Ratio of points q * /D *

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

20 22 24 26 V *

Ratio of points q * /D *

BDD100K (T=5)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

22 24 26 28 30 V *

Ratio of points q * /D *

nu Scenes (T=5)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

Figure 6: Mean standard deviation of 5 seeds of the ratio of data collected q T /D for diﬀerent V and ﬁxed T = 5. Rows 1 & 3: We sweep the cost parameter from 0.001 to 1 and ﬁx P = 107. Rows 2 & 4: We sweep the penalty parameter from 106 to 109 and ﬁx c = 1. The dashed black line corresponds to collecting exactly the minimum data requirement. See Appendix C for all T.

control over the amount of data collected. As we increase P, the ratio of data collected increases consistently.

6.4 Adapting LOC to Custom Modeling Problems

Our optimal data collection framework and LOC can also be adapted to solve custom questions faced by machine learning developers. Here, we present two case studies where we demonstrate how LOC yields high-quality data collection decisions.

Optimizing Data Collection for Machine Learning

T Power Law Regression LOC

Failure rate Cost ratio Failure rate Cost ratio

1 89% 0.19 8% 1.35 3 38% 0.78 1% 0.63 5 24% 0.64 0% 0.26

Table 4: On CIFAR-100, average cost ratio and failure rate for beaver (new class) measured over a range of V for each T, when the model is initialized with only 99 classes and zero training examples for the new class. We ﬁx c = 1 and P = 105. The best performing failure rate for each setting is bolded. LOC consistently achieves less than 10% failure rate, while keeping the average cost ratio even lower than the estimation baseline when T 3.

40 50 60 70 80 V *

Collection Policies for BDD100K

Active Learning Autolabeling

Figure 7: Mean standard deviation of 5 seeds when comparing the total collection cost to reach diﬀerent V on BDD100K depending on active learning versus autolabeling. We ﬁx c L = 0.995 and c U = 0.005. Autolabeling is more cost-eﬀective until V 70 m Io U, at which point the diminishing power law of the unlabeled data make active learning more eﬀective.

6.4.1 Adding New Classes to an Existing Model

Suppose that we have an existing M-class classiﬁer achieving a desired accuracy V for each of the current classes individually. We now want to update this model with a new (M + 1)-th class. We require the classiﬁer to also achieve V validation set accuracy for this class as well or else we will pay P. If we can collect training examples for this speciﬁc class at a per-sample cost c, how many should we obtain within the time horizon T?

To address this problem, we use LOC where qt represents the number of training examples of the (M + 1)-th class. However, q0 = 0 since this is a new class; this means that we cannot ﬁt training statistics to estimate F(q) at time t = 1. Instead in the ﬁrst data collection round, we select one of the existing M classes, model the data requirement distribution for this prior class, and use this as a proxy for F(q). If T > 1, then after the ﬁrst round, we will have an initial amount of data for the (M + 1)-th class, which we can use to learn F(q) using the standard technique.

We simulate this problem using CIFAR-100, where we set M = 99 and ﬁrst train the model using the training set data of all classes except for beaver (the 100-th class). For the ﬁrst round estimate, we design F(q) using the statistics of dolphin , which is a similar class in the same hierarchy. Table 4 provides the failure rate and cost ratio for both LOC and the power law regression baseline. Here, LOC consistently achieves less than 10% failure rate and moreover, outperforms the baseline on both the failure rate and cost ratio for both T = 3, 5. Note that since we start with q0 = 0 initial data, our ﬁrst estimation of F(q) will be poor for both our method and the baseline; this leads the baseline to high failure rates for T = 1 and surprisingly high costs for T = 3, 5. Instead, LOC, by virtue of optimization, yields low failure rates for T = 1 and strong performance at larger T.

Mahmood, Lucas, Alvarez, Fidler, and Law

6.4.2 Deciding Between Two Data Annotation Campaigns

Suppose that we have collected a large pool of unlabeled data. To annotate each data point, there is a small cost c C for collecting and pre-processing an unlabeled image and a labeling cost c L > c C for manually labeling the image. Using an initial labeled set of q0 points, we must decide between two potential annotation strategies via T = 1 round forecasts:

1. Autolabeling: Rather than manually labeling every data point, we can use our current model to synthetically generate labels for the unlabeled data. These autolabeled points may not as eﬀective as the manually labeled points, but autolabeling incurs zero additional labeling costs. Consider a strategy where we manually label a fraction of the data that we have collected (i.e., incurring the cost of collecting and labeling the data c C + c L) and autolabel the rest of the data (i.e., incurring only the collection cost c C per data point).

Let q = (q L, q C) where q L is the amount of manually labeled data and q U is the amount of unlabeled data that we autolabel. Then, our total cost under this strategy is

(c L + c C) (q L q0) + c Cq C + P1 {Vq V } .

2. Active learning: Since the quality of autolabeled data may not be as high as that of manually labeled data, we may instead opt to only manually label data. Moreover, we can re-direct the engineering eﬀort of building an autolabeler towards designing an active learning framework, which mines the collected (unlabeled) data to ﬁnd the most useful training examples.

With active learning, q = q L is the amount of data that is manually labeled. Furthermore, the cost of labeling each data point is equal to the cost of collection plus labeling c L + c C. The optimization objective under this strategy is

(c L + c C) (q L q0) + P1 {Vq L V }

Given a target performance V , we can solve both optimization problems to determine which of the two policies is more cost-eﬃcient. We evaluate this experiment using the labeled and unlabeled splits of BDD100K. The autolabeling setting is equivalent to the experiments performed in Section 6 for K = 2. The active learning setting is a special case of the K = 1 experiment where the data collected in each round and subsampled for ﬁtting scaling laws, are collected according to an active learning sampler rather than random sampling. We follow the approach in Mahmood et al. (2022b) where the steps of Algorithm 3 that involve estimating the minimum data requirement distribution are unchanged other than the fact that the data is collected via active learning (see Appendix C.3 for details and additional active learning experiments). For data collection, we use a simple strategy of computing the average Conﬁdence scores of estimates per scene and selecting the Least Conﬁdent scenes; we leave further details in Appendix B.2 (Settles, 2009). Figure 7 plots the total cost as determined from an initial q0 = 7, 000 labeled points and costs c L = 0.995 and c C = 0.005; these costs replicate the settings explored in Table 3. We extrapolate the performance and costs up to V = 80. Since in this scenario, autolabeling data is 20 cheaper than manually labeling it, our

Optimizing Data Collection for Machine Learning

analysis suggests that for moderate target scores (e.g., V = 58 m Io U), we can save up to 100, 000 in total costs. However, in general, active learning produces more high-quality data than autolabeling, so we expect the 2-dimensional power law for autolabeling to ﬂatten sooner than the active learning estimator. For V 70, the minimum data requirement and consequently, the total cost of autolabeling grows exponentially and exceeds the total cost of active learning. This leads us to an intuitive conclusion: in applications requiring high performance models, at some point it becomes more important to use high-quality data rather than simply lots of data.

7. Discussion

Decisions on how much data to collect to improve a given machine learning model are fundamental in all machine learning applications, but such methods are typically handled by estimating a scaling law of training data set size to model performance and then extrapolating future behavior. This na ıve extrapolation tends to lead to costly decisions from overcollecting data or delays and future costs from undercollecting data. In our paper, we develop a rigorous framework for optimizing data collection workﬂows by introducing an optimal data collection problem that captures the uncertainty in estimating data requirements. Our general framework can model a variety of settings such as where multiple data sources incur diﬀerent collection costs, where an existing model must be upgraded with data for a new class, or where we must assess the beneﬁts of diﬀerent sampling and labeling practices. We numerically validate our solution algorithm, Learn-Optimize-Collect, on six computer vision data sets covering classiﬁcation, segmentation, and detection tasks to show that we consistently meet pre-determined performance metrics regardless of costs and time horizons. Our optimization model minimizes the expected total future collection cost, which is modeled by two parameters: (i) a per-sample cost c that models the cost of collecting, cleaning, and annotating data; and (ii) a penalty P that models the opportunity cost of the model failing to meet our desired performance metric. The second parameter may not be readily available in practice. However, we show empirically that LOC is typically robust to parameter variations on one to three orders of magnitude. That is, as long as we can roughly estimate these parameters for our setting, we will still be able to make good data collection decisions. Moreover, we theoretically analyze the one-round T = 1 setting to draw two high-level insights. First, our problem is equivalent to specifying a minimum tolerance ϵ Pr{Vq < V } on the probability that we do not collect enough data and simply collecting the quantile ˆF 1(1 ϵ) of the distribution of the minimum amount of data needed; consequently, practitioners can instantly obtain one-round estimates for how much data to collect. This analysis can be further specialized when the distribution has a given structure (e.g., Gaussian). Second, we prove that as long as the estimated distribution ˆF is suﬃciently close to the true distribution of the minimum data requirement, our optimization model provably improves upon estimation-only strategies in terms of minimizing the total cost. LOC combines neural scaling law estimators with a stochastic optimization problem that can be solved via gradient descent. Our fundamental step is to treat the minimum amount of data we would need as a random variable and bootstrap a neural scaling law estimator to estimate its probability distribution. As future advances in neural scaling laws arrive, our bootstrapping can be deployed on top of new scaling law estimation algorithms to

Mahmood, Lucas, Alvarez, Fidler, and Law

optimize data collection decisions better than any estimation-only techniques. In particular, estimating neural scaling laws for arbitrary multi-dimensional settings (e.g., when combining diﬀerent data sources) remains a diﬃcult problem, but a side-product of this paper presents a simple multi-dimensional additive power law estimator that works reasonably well in our experiments.

Our framework is designed speciﬁcally for settings where the machine learning model and training algorithm are ﬁxed, that the data is generally of high-quality, and that there is no misspeciﬁcation between model and data. In practice, these assumptions may not always be satisﬁed. For example, designers may modify the model or the data source in between collection rounds, change the evaluation metric, or seek to address alternative targets. Mathematically, our framework assumes that the desired performance target is achievable with a ﬁnite amount of data and that the true neural scaling law of a given problem instance is monotonically non-decreasing with the dataset size (Viering and Loog, 2022). The ﬁrst assumption is necessary for this problem to be solvable. Although the second assumption is not always satisﬁed in machine learning tasks, it has been empirically observed to hold in most practical deep learning applications and is a standard assumption when scaling deep learning models (Hestness et al., 2017; Hoﬀmann et al., 2022). Finally, our numerical experiments rely on simulations with pre-constructed ground truth learning curves v(n). An alternative experimental setup may be to explicitly sample points based on the decisions generated by LOC, train the neural network model, and evaluate its score. However, such repeated retraining is computationally too expensive to perform for the range of experiments explored in this paper. Furthermore, the quality of our simulation is proportional to the number of data points that we sample, meaning that we can maintain accurate experimental analysis.

Improving data collection practices yields potentially positive and negative societal impacts. By reducing the collection of extraneous data, we implicitly reduce the environmental costs of training models. On the other hand, equitable data collection should also be considered in real-world data collection practices that involve humans. We envision a potential future work to incorporate privacy and fairness constraints to prevent overor under-sampling of protected groups. Finally, our method is guided by a score function on a held-out validation set. Biases in this set may be exacerbated when optimizing data collection to meet target performance.

Finally, we emphasize that this work addresses a longstanding problem in machine learning practice. There is a folklore observation that over 80% of industry machine learning projects fail to reach production, often due to insuﬃcient, noisy, or inappropriate data (Venture Beat, 2019). As mentioned previously, industry surveys have reported that 51% of practitioners face delays from under-collection (Dimensional Research, 2019). Our numerical experiments verify this observation by showing that na ıvely estimating power laws typically leads to undercollection. We believe that robust data collection policies obtained via LOC can reduce failures while further guiding practitioners on how to manage both costs and time.

Acknowledgments

Optimizing Data Collection for Machine Learning

The authors thank the Action Editor and all referees for their valuable comments, which have signiﬁcantly improved this work. The authors also thank Daiqing Li, Jonah Philion, and Zhiding Yu for valuable feedback and help on earlier versions of this work.

Appendix A. An Alternative Partially Observable Markov Decision Process Formulation of the Optimal Data Collection Problem

Our problem deﬁned in Section 3 can be written as a Partially Observable Markov Decision Process (POMDP) (Puterman, 2014; Bertsekas, 2012), modeled by the tuple (Θ, A, S, p, rt). Here, the state space characterizes the data requirement D Θ := RK +, the action space characterizes the additional data collected dt := (qt qt 1) A := RK +, and the observation set S := {0, 1} characterizes a binary variable 1{V (Dqt) V }. Furthermore, p( |D , dt) is the observation transition probability and rt( ) is the reward function where rt(qt, qt 1, D ) := c(qt qt 1) for t T and r T+1(qt, qt 1, D ) := P1{q T < D }. Finally, note that the state variable is constant throughout the MDP, meaning this problem can be written as an EK Learning-and-Doing model (Easley and Kiefer, 1988). POMDPs are typically solved by using a belief distribution of the state variable to average the reward in the value function. In general, these methods are susceptible to a curse of dimensionality and can sometimes be only tackled via approximations (Zhao et al., 2021). Alternatively, we may consider applying reinforcement learning. However, note that real-world data collection tasks do not contain the requisite sizes of learning data or generalizable simulation mechanisms that are staples in reinforcement learning techniques. All of these challenges motivate our approach, which has the beneﬁt of being an easy-to-solve optimization problem on top of existing neural scaling law methods.

Appendix B. Simulation Experiment Setup

The most intuitive approach of validating our data collection problem is by repeatedly sampling from a data set, training a model, and solving the optimization problem. However, since performing a large set of such experiments over many data sets becomes computationally intractable, we follow the approach of Mahmood et al. (2022b), who propose a simulation model of the data collection problem. Below, we summarize the simulation setup. The simulation replicates the steps in Algorithm 3 except with one key diﬀerence. In the simulation, we replace the score function V (D) with a ground truth function vgt(q) that serves as an oracle which reports the expected score of the model trained with q data points. Thus, rather than having to collect data and train a model in each round, we evaluate vgt(qt) and treat this as the current model score. The optimization and regression models do not have access to vgt(q).

B.1 A Piecewise-Linear Ground Truth Approximation

In order to build a ground truth function, we ﬁrst use the sub-sampling procedure in Algorithm 3 to collect performance statistics over subsets of the entire training data set.

Mahmood, Lucas, Alvarez, Fidler, and Law

Data set Task Score Full data set size

CIFAR-10 (Krizhevsky, 2009) Classiﬁcation Accuracy 50, 000 CIFAR-100 (Krizhevsky, 2009) Classiﬁcation Accuracy 50, 000 Image Net (Deng et al., 2009) Classiﬁcation Accuracy 1, 281, 167 BDD100K (Yu et al., 2020) Semantic Segmentation Mean Io U 7, 000 nu Scenes (Caesar et al., 2020) BEV Segmentation Mean Io U 28, 130 VOC (Everingham et al., 2007, 2012) 2-D Object Detection Mean AP 16, 551

CIFAR-100 (Krizhevsky, 2009) Classiﬁcation Accuracy 25, 000 (Classes 0-49) 25, 000 (Classes 50-99) BDD100K (Yu et al., 2020) Semantic Segmentation Mean Io U 7, 000 (Labeled) 70, 000 (Unlabeled)

Table 5: Data sets, tasks, and score functions considered.

Using these observed statistics, we then build a piecewise-linear model of the ground truth. Below, we ﬁrst highlight how to construct a piecewise-linear model when given a set of data set sizes and their corresponding scores. In the next subsection, we will detail the exact data collection process. The Single-variate (K = 1) Case. Mahmood et al. (2022b) develop a ground truth function by collecting training statistics over subsets of the entire data set and performing a linear interpolation. Let q0 q1 q2 be a series of data set sizes and let Dq0 Dq1 Dq2 be their corresponding sets. Then, the piecewise-linear function:

q0 n, q q0 V (Dqt) V (Dqt 1)

qt qt 1 (q qt) + V (Dqt 1), qt 1 q qt

is concave and monotonically increasing, which follows the general trend of real learning curves Hestness et al. (2017). Furthermore Mahmood et al. (2022b) show that given suﬃcient resolution, i.e., enough data subsets, this piecewise linear function is an accurate approximation of the true learning curve V (D). The Multi-variate (K = 2) Case. In the previous K = 1 case, the ground truth was formed by taking linear interpolations between diﬀerent subset sizes. When K > 1, we have multiple subsets that are used to evaluate the score V (D1, . . . , DK). In our numerical experiments, we focus on K = 2. Here, we can generalize the linear interpolation process to a bilinear interpolation (Wang and Yang, 2008). Let q1 0 q1 1 q1 2 and q2 0 q2 1 q2 2 be two series of data set sizes, and consider the grid

(q1 0, q2 0) (q1 1, q2 0) (q1 2, q2 0) (q1 0, q2 1) (q1 1, q2 1) (q1 2, q2 1) (q1 0, q2 2) (q1 1, q2 2) (q1 2, q2 2) ... ... ... ...

We estimate v(q1, q2) for any q1, q2 [q1 s 1, q1 s] [q2 t 1, q2 t ] via bilinear interpolation over the square deﬁned by these borders (Wang and Yang, 2008). For K > 2. The piecewise linear approximations grow increasingly complex as the dimension K increases. Furthermore, the number of subsets of data set sizes required to create a piecewise linear approximation increases exponentially with K. Speciﬁcally for k {1, . . . , K}, let Mk denote the number of subsets (i.e., |{qk 0, qk 1, . . . , qk Mk}|) of a data set that

Optimizing Data Collection for Machine Learning

we consider when creating subsets. For each combination of K subsets, we must then train a model and evaluate it s performance to record V (D1, . . . , DK). Thus, we must subsample and train our model for O(Q k Mk) combinations, which quickly becomes computationally prohibitive.

B.2 Data Collection

We now summarize the data collection and training process used to create the above piecewise-linear functions for each data set and task. All models were implemented using Py Torch and trained on machines with up to eight NVIDIA V100 GPU cards. Table 5 details each task and data set size. Image Classiﬁcation Tasks. For all experiments with CIFAR-10 and CIFAR-100, we use a Res Net18 He et al. (2016) following the same procedure as in Coleman et al. (2020). For Image Net, we use a Res Net34 He et al. (2016) using the procedure in Coleman et al. (2020). All models are trained with cross entropy loss using SGD with momentum. We evaluate all models on Top-1 Accuracy. For all experiments, we set the initial data set at q0 = 10% of the data. In data collection, we create ﬁve subsets containing 2%, 4%, , 10% of the training data, ﬁve subsets containing 12%, 14%, , 20% of the training data, and eight subsets containing 30%, 40%, , 100% of the data. Each subset is contained in the following subsets. Note that we use higher granularity in the early stage as this is where the dynamics of the learning curve vary the most. With more data, the learning curve eventually has a nearly zero slope. For each subset, we train our respective model and evaluate performance. VOC. We use the Single-Shot Detector 300 (SSD300) Liu et al. (2016) based on a VGG16 backbone Simonyan and Zisserman (2015), following the same procedure as in Elezi et al. (2022). We evaluate all models on mean AP. For all experiments, we set the initial data set at q0 = 10% of the data. In data collection, we sample twenty subsets at 5% intervals, i.e., 5%, 10%, 15%, , 100% of the training data. BDD100K. We use Deeplabv3 Chen et al. (2018) with Res Net50 backbone. We use random initialization for the backbone. We use the original data set split from Yu et al. (2020) with 7, 000 and 1, 000 data points in the train and validation sets respectively. The evaluation metrics is mean Intersection over Union (Io U). We follow the same protocol used in the Image classiﬁcation tasks to create our subsets of data. Active Learning. We perform experiments involving active learning on CIFAR-100 and BDD100K. For these experiments, rather than collecting data by i.i.d. sampling of subsets at the above respective intervals, we use an active learning algorithm to select the subsets at each interval; for example, with CIFAR-100, we start with a random i.i.d. sample of 2% of the data and for each subsequent 4%, 6%, 8%, , 20%, 30%, 40%, , 100%, we select this data with an active learning algorithm. Each subset is contained in the following subsets. For CIFAR-100, we explore active learning via Maximum Entropy (Settles, 2009), Least Conﬁdence (Settles, 2009), and Greedy k-Centers (Sener and Savarese, 2018). For BDD100K, we use only Least Conﬁdence (Settles, 2009). We use the same models and training practice as described previously. nu Scenes. We use the Lift Splat architecture Philion and Fidler (2020), which is used for BEV segmentation from driving scenes, following the steps from the original paper to

Mahmood, Lucas, Alvarez, Fidler, and Law

Parameter Setting

Optimizer GD with Momentum (β = 0.9), Adam (β0, β1 = 0.9, 0.999) Learning rate 0.005, . . . , 500 Number of bootstrap samples B 500 Number of regression subsets R See Appendix B.2 Density Estimation Model KDE for K = 1, GMM for K = 2

KDE Bandwidth 20000, . . . , 20000000 for Image Net 200, . . . , 4000 for all others GMM number of clusters 4, . . . , 10

Table 6: Summary of hyperparameters used in our experiments.

train this model. We evaluate on mean Io U. Our data collection procedure follows the same steps as used for BDD100K and the Image classiﬁcation tasks. CIFAR-100 (2 Types). We partition this data set into two subsets D1 and D2 of 25, 000 images each containing the ﬁrst 50 and last 50 classes, respectively. We then train a Res Net18 (He et al., 2016) using diﬀerent fractions of the two subsets. We follow the same training procedure as in the single-variate case except with one diﬀerence. Since some of the data sets will naturally be imbalanced (e.g., if we train with half of the ﬁrst subset and all of the second subset), we employ a class-balanced cross entropy loss using the inverse frequency of samples per class. For each Dk subsets, respectively, we follow the same subsampling procedure used in the single-variate case. That is, we let q1 0 = 10% of the ﬁrst data subset and q2 0 = 10% of the second data subset. For each subset, we create 10 subsampled sets at intervals of 2%, 4%, 6%, , 20% of the respective data subset. We then create eight further subsampled sets at 30%, 40%, , 100% of the respective data subset. Finally, we train our model and evaluate the score on every combination of the subsampled subsets of D1 D2. BDD100K (Semi-supervised). For this task, we consider semi-supervised segmentation via pseudo-labeling the unlabeled data set in BDD100K. The data is partitioned into two subsets D1 and D2 containing 7, 000 labeled and 70, 000 unlabeled scenes. We use Deeplabv3 (Chen et al., 2018) with a Res Net50 backbone. Here however, we:

1. First train with a labeled subset of D1 via supervised learning.

2. Pseudo-label an unlabeled subset of D2 using the trained model.

3. Re-train the segmentation model with the labeled subset and the pseudo-labeled subset.

We follow the same procedure as in the single-variate case for both training steps, except we weigh the unlabeled data by 0.2 to reduce its contribution to the loss. Training via semi-supervised learning on BDD100K requires long compute times, so we reduce the number of subsets used in this experiment. For the labeled set D1, we create subsets with 5%, 10%, 15%, 20%, 40%, 60%, 80%, 100% of the data. For the unlabeled set D2, we create subsets with 0%, 10%, 25%, 50%, 100% of the data. Note that we have ﬁve settings of unlabeled data since we include the case of training with no unlabeled data as well.

Optimizing Data Collection for Machine Learning

B.3 LOC Implementation

For all experiments, we initialize with 10% of the training data set. We consider T = 1, 3, 5 rounds and sweep a range of V . We provide a summary of parameters in Table 6. For the experiments with K = 1, we model the data requirement PDF f(q) in each round of the problem as follows. We ﬁrst draw B = 500 bootstrap resamples of the current training statistics R, where R = {(rq0/R, V (Drq0/R))}R r=1 {(qs, V (Dqs))}t s=1 contains all of the measured statistics up to the initial data set (e.g., for CIFAR-10, this includes performance with 2%, 4%, , 10% of the data), and the previous collected data. The latter is obtained by calling our piecewise-linear ground truth approximation. For each bootstrap resample, we ﬁt a power regression model ˆv(q; θ) = θ0qθ1 + θ2 and solve for the estimated minimum data requirement by minimizing a least squares problem using the Trust Region Reﬂective algorithm, with initial values set to 1, 0, 0 and bounds set to [0, 100], [0, 0.999], [ 10, 10] for θ0, θ1, θ2, respectively. After ﬁtting ˆv(q; θ) for each resample, we then compute the corresponding estimate of D . We then use our set of estimates to ﬁt a Kernel Density Estimation (KDE) model after gridsearching for the best bandwidth parameter. For the Image Net data set, we grid search for the best bandwidth parameter between [20000, 40000, 100000, 200000, 400000, 1000000, 2000000, 4000000, 10000000, 20000000], whereas for all other data sets, we grid search for the best bandwidth parameter between [200, 400, 1000, 2000, 4000]. We use these diﬀerent settings because the scale of data on Image Net is between two to three orders of magnitude greater than the scale of data for the other data sets that we consider. Thus, to ensure that the estimated distribution can reasonably capture the variation on larger scales, we ﬁnd that larger bandwidths are necessary. We note that in practice, a reasonable choice of bandwidth ranges for gridsearching can be visually determined by plotting the histogram of the estimated data requirement distribution (e.g., see Figure 9 in Appendix C.1) for diﬀerent choices of bandwidth and inspecting the ﬁtted distributions. For the experiments with K = 2, we use the same above procedure but ﬁt Gaussian Mixture Models (GMM) due to their having an easily computable CDF via the Gaussian erf( ), rather than numerically integrating the PDF. We grid-search over the number of mixture components for the GMM model. We optimize over problems (8) using gradient descent. Depending on the current state and data set, diﬀerent hyperparameters perform better. As a result, we perform extensive hyperparameter tuning every time we need to solve the optimization problem. Here, we sweep all combinations of gradient descent with momentum (β = 0.9), and Adam (β0 = 0.9, β1 = 0.999), and learning rates between [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500]. We initialize each problem with qt equal to the baseline regression solution and qt+s = qt/(s + 1) for all 1 s T t. That is, we set the initial value for future collection amounts to be fractions of the initial value of the immediate amount of data to collect. We identiﬁed this initialization by manually inspecting the solutions found by LOC; it improves the conditioning of the loss landscape relative to other random initialization schemes.

Appendix C. Additional Numerical Results

This section contains expanded results of our numerical experiments and further ablations. Our key results include:

Mahmood, Lucas, Alvarez, Fidler, and Law

0 10000 20000 30000 40000 50000

0 10000 20000 30000 40000 50000

0.0 0.2 0.4 0.6 0.8 1.0 q 1e6

0 2000 4000 6000 8000 10000

0 2000 4000 6000 q

0 5000 1000015000200002500030000

Figure 8: For a ﬁxed seed, ground truth learning curves (black) and the estimated power law learning curves (blue) obtained via bootstrapping and ensembling. The shaded region represents the 95 percentile of the ensemble and the dashed blue line represents the mean of the regression functions. The mean is consistently higher than the unknown ground truth, whereas the shaded region can at times cover it.

8000 9000 1000011000120001300014000

Relative frequency

CIFAR-100 (V * = 50)

10000 12500 15000 17500 20000 22500

Relative frequency

CIFAR-100 (V * = 60)

15000 20000 25000 30000 35000 40000

Relative frequency

CIFAR-100 (V * = 70)

Figure 9: For a ﬁxed seed, the histogram of estimates of D from diﬀerent bootstrapped models (blue bars), the estimated F(q) (orange curve), and the ground truth D (black dashed line). Each plot corresponds to a diﬀerent V for CIFAR-100 (see Figure 8 for the learning curve). With higher targets, regression (i.e., collecting the mean of the distribution) will lead to larger under-estimations.

In Appendix C.1, we evaluate the eﬀectiveness of estimating F(q) by plotting the estimated learning curves as well as the empirical histograms used to model the data requirement distribution.

In Appendix C.2, we consider variants of LOC where we use diﬀerent regression functions to estimate the data requirement distribution. Our optimization framework can be deployed on top of any regression function to reduce the failure rate.

In Appendix C.3, we consider LOC for scenarios where data is not collected via random sampling, but instead by active learning. Our optimization framework consistently outperforms baselines even under such non-i.i.d. settings.

In Appendix C.4, we explore the multi-variate LOC (i.e., K = 2) for problems where we have a small number of T = 1, 3 rounds. The baseline fails for almost all instances of T = 1, whereas LOC maintains a low failure rate.

In Appendix C.5, we explore the sensitivity of our optimization algorithm to variations in the cost and penalty parameters. In all except one instance, LOC consistently maintains a low total cost and failure rate.

Optimizing Data Collection for Machine Learning

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=1)

Power Law c=[0.01 0.005]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=1)

Power Law c=[0.01 0.002]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=1)

Power Law c=[0.01 0.001]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=1)

Power Law c=[0.01 0.0005]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=3)

Power Law c=[0.01 0.005]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=3)

Power Law c=[0.01 0.002]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=3)

Power Law c=[0.01 0.001]

40 50 60 70 V * 0

Failure Rate

40 50 60 70 V * 10 2

CIFAR-100 (2 Types) (T=3)

Power Law c=[0.01 0.0005]

40 50 60 70 V * 0

Failure Rate

Figure 10: For experiments on CIFAR-100 with two data types, mean standard deviation over 5 seeds of the cost ratio c T(q T q0)/c T(D q0) 1 and failure rate for diﬀerent V after removing 99-th percentile outliers. We ﬁx c0 = 1 and P = 1013. The rows correspond to T = 1, 3 (see the main paper for T = 5) and the columns correspond to c1 = c0/2, c0/5, c0/10, c0/20.

C.1 Estimating the Data Requirement Distribution F(q)

To estimate F(q), we ﬁrst create an ensemble of estimated learning curves, which we then invert to obtain an empirical distribution of estimated values for D . Figure 8 plots our bootstrap resampled estimated learning curves versus the ground truth performance for the ﬁrst round of data collection when we have access to an initial Dq0 containing 10% of the full data set. As noted in Mahmood et al. (2022b), the mean estimated learning curve diverges from the ground truth. However, by bootstrap resampling an ensemble of learning curves, we can cover the ground truth with some probability. Figure 9 plots the empirical histograms of estimated D as well as the estimated F(q) obtained via KDE on CIFAR-10 with three diﬀerent values for V . Although the mode of the estimated distribution is far from the ground truth D , the estimated distribution assigns some probability to the ground truth region. LOC optimizes over this estimated F(q), which allows us to conservatively collect data and reduce the chances of failure.

C.2 LOC with Alternative Regression Functions

Mahmood et al. (2022b) show that we can use other regression functions instead of the power law to estimate the data requirement. Moreover, some functions tend to consistently overor under-estimate the requirement. LOC can be deployed on top of any such regression function, since the regression function is only used to generate bootstrap samples. In this

Mahmood, Lucas, Alvarez, Fidler, and Law

Regression With Regression With Regression With Data set T Logarithmic ˆv(q) LOC Arctan ˆv(q) LOC Algebraic Root ˆv(q) LOC

Failure rate (FR) Cost ratio (CR) FR CR FR CR FR CR FR CR FR CR

CIFAR-10 1 3% 0.59 0% 2.18 88% 0.01 6% 1.93 100% 12% 1.74 3 3% 0.59 0% 1.16 57% 0.01 0% 0.91 97% 0 0% 0.87 5 3% 0.59 0% 0.9 43% 0.01 0% 0.62 88% 0 0% 0.70

CIFAR-100 1 43% 0.19 2% 1.17 23% 3.31 0% 5.56 52% 0.11 23% 0.81 3 37% 0.17 2% 0.54 15% 3.01 0% 3.92 44% 0.09 2% 0.87 5 34% 0.16 2% 0.39 12% 2.90 0% 3.6 44% 0.09 2% 0.54

Imagenet 1 0% 24.5 34% 0.48 100% 0% 2.65 100% 40% 0.48 3 0% 24.5 2% 0.17 98% 0.01 0% 0.81 94% 0.01 0% 0.19 5 0% 24.5 0% 0.1 89% 0.01 0% 0.93 79% 0.01 0% 0.10

BDD100K 1 0% 0.55 12% 4.78 85% 0.04 12% 9.51 73% 0.08 12% 4.82 3 0% 0.55 0% 1.78 62% 0.02 0% 5.33 27% 0.04 0% 1.48 5 0% 0.55 0% 0.72 54% 0.01 0% 2.77 15% 0.03 0% 0.79

nu Scenes 1 0% 0.52 0% 2.78 95% 0 0% 8.96 95% 0 0% 3.42 3 0% 0.52 0% 0.78 95% 0 0% 6.63 81% 0.02 0% 1.73 5 0% 0.52 0% 0.5 95% 0 0% 7.55 67% 0.01 0% 2.25

VOC 1 100% 6% 1.85 83% 0.11 14% 0.49 14% 11.2 14% 13.5 3 100% 0% 1.60 75% 0.07 0% 1.00 0% 9.61 0% 12.1 5 94% 0 0% 1.37 67% 0.06 0% 1.32 0% 9.61 0% 18.4

Table 7: Comparing against the following alternate estimators of Mahmood et al. (2022b) with the same setup as in Table 2: Logarithmic ˆv(q; θ) = θ0 log(q + θ1) + θ2, Arctan ˆv(q; θ) = 200 π arctan(θ0 π 2 q + θ1) + θ2. Algebraic Root ˆv(q; θ) = 100q 1+|θ0q|θ1)1/θ1 + θ2. The best performing cost ratio is underlined and the best performing failure rate for each setting is bolded.

Optimizing Data Collection for Machine Learning

Data set T Regression With Correction (Mahmood et al., 2022b) LOC

Failure rate Cost ratio Failure rate Cost ratio

CIFAR-100 1 14% 0.94 4% 0.99 3 1% 0.23 3% 0.31 5 0% 0.17 2% 0.19

Imagenet 1 7% 1.03 37% 0.49 3 0% 0.21 5% 0.16 5 0% 0.14 2% 0.10

BDD100K 1 4% 4.03 12% 2.03 3 0% 1.02 0% 0.72 5 0% 0.62 0% 0.35

nu Scenes 1 0% 27.2 52% 0.16 3 0% 0.75 0% 0.09 5 0% 0.30 0% 0.04

VOC 1 0% 44.6 25% 0.56 3 0% 7.02 0% 1.10 5 0% 3.98 0% 0.84

Table 8: Comparing against the correction factor-based Power Law Regression of Mahmood et al. (2022b) with the same setup as in Table 2. The best performing cost ratio is underlined and the best performing failure rate for each setting is bolded. Although the baseline achieves low failure rates, LOC often can achieve competitive failure rates while reducing the cost ratios by an order of magnitude.

Strategy T Regression LOC

Failure rate Cost ratio Failure rate Cost ratio

Entropy 1 23% 0.53 1% 1.79 3 18% 0.50 3% 1.02 5 18% 0.50 2% 0.83

k-Centers 1 51% 0.16 26% 0.38 3 44% 0.14 13% 0.17 5 44% 0.14 8% 0.13

Least Conﬁdence 1 24% 0.70 14% 3.36 3 23% 0.68 5% 1.71 5 21% 0.67 6% 1.44

Table 9: Diﬀerent active learning policies for experiments on CIFAR-100, measuring the average cost ratio and failure rate measured over a range of V and T. We ﬁx c = 1 and P = 107. The best performing failure rate for each setting is bolded. The cost ratio is measured only for instances that achieve V .

section, we show that these baseline estimators consistently either underor overestimate how much data to collect, in contrast to LOC. Table 7 highlights experiments on all six data sets with three alternative regression functions that were used by Mahmood et al. (2022b). For each function and almost on each

Mahmood, Lucas, Alvarez, Fidler, and Law

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=1)

Power Law c=[1. 0.1]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=1)

Power Law c=[1. 0.05]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=1)

Power Law c=[1. 0.01]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=1)

Power Law c=[1. 0.005]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=3)

Power Law c=[1. 0.1]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=3)

Power Law c=[1. 0.05]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=3)

Power Law c=[1. 0.01]

36 38 40 42 V * 0

Failure Rate

36 38 40 42 V * 10 2

BDD100K (Semi-supervised) (T=3)

Power Law c=[1. 0.005]

36 38 40 42 V * 0

Failure Rate

Figure 11: For experiments on BDD100K with two data types, mean standard deviation over 5 seeds of the cost ratio c T(q T q0)/c T(D q0) 1 and failure rate for diﬀerent V after removing 99-th percentile outliers. We ﬁx c0 = 1 and P = 1013. The rows correspond to T = 1, 3 (see the main paper for T = 5) and the columns correspond to c1 = c0/2, c0/5, c0/10, c0/20.

dataset, we observe the same trends seen in the original Table 2. That is, LOC reduces the failure rate down to approximately zero, at a marginal relative increase in cost. Noting that Power Law Regression often leads to failure, Mahmood et al. (2022b) also propose a correction factor heuristic wherein they learn a parameter τ such that if the data collection problem requires a target performance V , we should instead aim to collect enough data to meet V + τ. In order to learn this correction factor, we require a pre-existing data set upon which we can simulate a data collection policy. Mahmood et al. (2022b) set τ such that we can achieve the data requirement V for any V on the pre-existing data set, and then ﬁxing this parameter for new data sets. Table 8 compares LOC (i.e., repeating Table 2) with the Correction factor-based Power Law regression baseline of Mahmood et al. (2022b). Following the original paper, we tune τ using CIFAR-10 and apply it on all other data sets. The correction factor is designed to minimize the failure rate and thus, achieves nearly 0% failure rate for all settings, but often at high cost ratios. On the other hand, LOC achieves generally low failure rates and low cost ratios. Speciﬁcally, for T = 3, 5, we are competitive with the baseline on failure rates for most tasks while obtaining up to an order of magnitude decrease in costs. For T = 1, we typically admit higher failure rates; however for the segmentation and detection tasks, we obtain up multiple orders of magnitude lower costs. Finally, note that this baseline requires a similar prior task to be eﬀective. For example, the baseline outperforms us on cost and failure rate both only on CIFAR-100, since it is tuned on CIFAR-10. On the other hand,

Optimizing Data Collection for Machine Learning

80 85 90 V *

Ratio of points q * /D *

CIFAR-10 (T=1)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

40 50 60 70 V *

Ratio of points q * /D *

CIFAR-100 (T=1)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

45 50 55 60 65 V *

Ratio of points q * /D *

Imagenet (T=1)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

80 85 90 V *

Ratio of points q * /D *

CIFAR-10 (T=3)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

40 50 60 70 V *

Ratio of points q * /D *

CIFAR-100 (T=3)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

45 50 55 60 65 V *

Ratio of points q * /D *

Imagenet (T=3)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

66 68 70 72 74 V *

Ratio of points q * /D *

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

20 22 24 26 V *

Ratio of points q * /D *

BDD100K (T=1)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

22 24 26 28 30 V *

Ratio of points q * /D *

nu Scenes (T=1)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

66 68 70 72 74 V *

Ratio of points q * /D *

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

20 22 24 26 V *

Ratio of points q * /D *

BDD100K (T=3)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

22 24 26 28 30 V *

Ratio of points q * /D *

nu Scenes (T=3)

Power Law c=1e-03 c=1e-02 c=1e-01 c=1e+00

Figure 12: Mean standard deviation of the ratio of data collected q T /D for diﬀerent V

when we sweep the cost parameter from 0.001 to 1 and ﬁx P = 107. We show T = 1, 3 and refer to the main paper for T = 5. The dashed black line corresponds to collecting exactly the minimum data requirement.

LOC does not require this prior data set to be eﬀective as evidence by its performance on non-classiﬁcation tasks.

C.3 LOC with Active Learning

Although most of our experiments assume that data is collected via i.i.d. random sampling, we may also consider alternative approaches to data collection such as active learning. Here, the optimal data collection problem amounts to determining the optimal budget to set when running an active learning query algorithm. Given the budget, we then sample data according to the active learning strategy. This relies on the assumption that the learning curve under active learning is monotone non-decreasing; this may not always be the case

Mahmood, Lucas, Alvarez, Fidler, and Law

80 85 90 V *

Ratio of points q * /D *

CIFAR-10 (T=1)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

40 50 60 70 V *

Ratio of points q * /D *

CIFAR-100 (T=1)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

45 50 55 60 65 V *

Ratio of points q * /D *

Imagenet (T=1)

Power Law P=1e+07 P=1e+08 P=1e+09 P=1e+10

80 85 90 V *

Ratio of points q * /D *

CIFAR-10 (T=3)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

40 50 60 70 V *

Ratio of points q * /D *

CIFAR-100 (T=3)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

45 50 55 60 65 V *

Ratio of points q * /D *

Imagenet (T=3)

Power Law P=1e+07 P=1e+08 P=1e+09 P=1e+10

66 68 70 72 74 V *

Ratio of points q * /D *

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

20 22 24 26 V *

Ratio of points q * /D *

BDD100K (T=1)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

22 24 26 28 30 V *

Ratio of points q * /D *

nu Scenes (T=1)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

66 68 70 72 74 V *

Ratio of points q * /D *

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

20 22 24 26 V *

Ratio of points q * /D *

BDD100K (T=3)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

22 24 26 28 30 V *

Ratio of points q * /D *

nu Scenes (T=3)

Power Law P=1e+06 P=1e+07 P=1e+08 P=1e+09

Figure 13: Mean std of the ratio of data collected q T /D for diﬀerent V when we sweep the penalty parameter from 106 to 109 and ﬁx c = 1. We show T = 1, 3 and refer to the main paper for T = 5. The dashed black line corresponds to collecting exactly the minimum data requirement.

(e.g., see Viering and Loog (2022)), but has been empirically demonstrated for many common batch-mode active learning strategies for computer vision (e.g., see Sener and Savarese (2018); Mahmood et al. (2022b)). Overall, the LOC framework does not change from Algorithm 3, except in how the training set statistics are collected in Algorithm 1.

Given a small amount of data, we can estimate the neural scaling law from the active learning algorithm and apply estimation-only approaches and LOC. Table 9 highlights the eﬀect of using LOC when the data collection policy uses active learning instead of random sampling. We focus on CIFAR-100 and test three diﬀerent active learning strategies: Maximum Entropy (Settles, 2009), k-Centers (Sener and Savarese, 2018), and Least Conﬁdence (Settles, 2009). This table demonstrates the same trends as before, namely that our policy

Optimizing Data Collection for Machine Learning

can better make decisions on the data collection budget compared to an estimation-only policy regardless of the speciﬁc collection algorithm.

C.4 The Value of Optimization over Estimation when K = 2

Figure 10 and Figure 11 expand Figure 5 to T = 1, 3 rounds. The results validate the summary observations from Table 3 in that the baseline has considerably higher failure rates versus LOC. In particular for BDD100K at T = 1, the baseline fails consistently for four out of ﬁve random seeds. On the other hand, recall that LOC admits a higher cost ratio compared to the baseline when T = 1. We can observe now that this high cost ratio is due to the method incurring high cost for a few target V values. This behavior is similar to the observation above on VOC with high penalties at T = 3.

C.5 Robustness to the Cost and Penalty Parameters

Figure 12 expands the cost parameter sweep from Figure 6 (Top row) to the settings of T = 1, 3. For nearly all settings, LOC remains stable to variations in the cost parameter. Nonetheless, careful parameter selection becomes important as T decreases. This is due to the fact that for low costs, the total amount of data collected increases as T decreases (e.g., c = 0.001 for BDD100K). Furthermore, Figure 13 expands the penalty parameter sweep from Figure 6 (Bottom row). Here, we observe similar properties to the cost parameter sweep. Although LOC is relatively stable on all other data sets, our results demonstrate some extreme results for VOC, potentially due to noise in the simulation. For example in Figure 13, setting P = 109, V = 71, and T = 3 led to collecting 10, 000 times the minimum data requirement. Such a situation is unrealistic in a production-level implementation, since in a real implementation, we could impose further constraints onto problem (8), such as upper bounds on the total amount of data permissible.

Amazon sagemaker data labeling pricing, 2023. URL https://aws.amazon.com/sagemaker/ data-labeling/pricing/.

David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Eﬃcient interactive annotation of segmentation datasets with polygon-rnn++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

David Acuna, Guojun Zhang, Marc T Law, and Sanja Fidler. f-domain adversarial learning: Theory and algorithms. In International Conference on Machine Learning, pages 66 75. PMLR, 2021.

David Acuna, Marc T Law, Guojun Zhang, and Sanja Fidler. Domain adversarial training: A game perspective. In International Conference on Learning Representations, 2022.

Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024.

Mahmood, Lucas, Alvarez, Fidler, and Law

Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 1. Athena scientiﬁc, 2012.

Dimitris Bertsimas, Mac Johnson, and Nathan Kallus. The power of optimization over randomization in designing experiments involving small samples. Operations Research, 63 (4):868 876, 2015.

Devansh Bisla, Apoorva Nandini Saridena, and Anna Choromanska. A theoretical-empirical approach to estimating sample complexity of dnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3270 3280, 2021.

Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws. In The Eleventh International Conference on Learning Representations, 2023.

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621 11631, 2020.

Pedro Carneiro, Sokbae Lee, and Daniel Wilhelm. Optimal data collection for randomized control trials. The Econometrics Journal, 23(1):1 31, 2020.

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroﬀ, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

David Cohn. Neural network exploration using optimal experiment design. In Advances in Neural Information Processing Systems, volume 6, 1993.

Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Eﬃcient data selection for deep learning. In International Conference on Learning Representations, 2020.

Corinna Cortes, Lawrence D Jackel, Sara Solla, Vladimir Vapnik, and John Denker. Learning curves: Asymptotic values and rate of convergence. In Advances in Neural Information Processing Systems, volume 6, 1993.

Fabio Gagliardi Cozman, Ira Cohen, and M Cirelo. Unlabeled data can degrade classiﬁcation performance of generative classiﬁers. In FLAIRS, pages 327 331, 2002.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248 255. IEEE, 2009.

Dimensional Research. Artiﬁcial intelligence and machine learning projects are obstructed by data issues: Global survey of data scientists, ai experts and stakeholders. Technical report, 05 2019.

Optimizing Data Collection for Machine Learning

Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-fourth International Joint Conference on Artiﬁcial Intelligence, 2015.

David Easley and Nicholas M Kiefer. Controlling a stochastic process with unknown parameters. Econometrica: Journal of the Econometric Society, pages 1045 1064, 1988.

Ismail Elezi, Zhiding Yu, Anima Anandkumar, Laura Leal-Taixe, and Jose M Alvarez. Not all labels are equal: Rationalizing the labeling costs for training object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

Ashley F Emery and Aleksey V Nenarokomov. Optimal experiment design. Measurement Science and Technology, 9(6):864, 1998.

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html, 2007.

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.

Rosa L Figueroa, Qing Zeng-Treitler, Sasikiran Kandula, and Long H Ngo. Predicting sample size required for classiﬁcation performance. BMC Medical Informatics and Decision Making, 12(1):1 10, 2012.

Lewis J Frey and Douglas H Fisher. Modeling decision tree performance with the power law. In Seventh International Workshop on Artiﬁcial Intelligence and Statistics. PMLR, 1999.

Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pages 2242 2251. PMLR, 2019.

Baohua Gu, Feifang Hu, and Huan Liu. Modelling classiﬁcation performance for large data sets. In International Conference on Web-Age Information Management, pages 317 328. Springer, 2001.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016.

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam Mc Candlish. Scaling laws for transfer. ar Xiv preprint ar Xiv:2102.01293, 2021.

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. ar Xiv preprint ar Xiv:1712.00409, 2017.

Mahmood, Lucas, Alvarez, Fidler, and Law

Jordan Hoﬀmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Advances in Neural Information Processing Systems, volume 36, 2022.

Derek Hoiem, Tanmay Gupta, Zhizhong Li, and Michal Shlapentokh-Rothman. Learning curves for analysis of deep networks. In International Conference on Machine Learning, pages 4287 4296. PMLR, 2021.

Yiding Jiang, Pierre Foret, Scott Yak, Daniel M Roy, Hossein Mobahi, Gintare Karolina Dziugaite, Samy Bengio, Suriya Gunasekar, Isabelle Guyon, and Behnam Neyshabur. Neurips 2020 competition: Predicting generalization in deep learning. ar Xiv preprint ar Xiv:2012.07976, 2020.

Yiding Jiang, Parth Natekar, Manik Sharma, Sumukh K Aithal, Dhruva Kashyap, Natarajan Subramanyam, Carlos Lassance, Daniel M Roy, Gintare Karolina Dziugaite, Suriya Gunasekar, et al. Methods and analysis of the ﬁrst competition in predicting generalization of deep learning. In Neur IPS 2020 Competition and Demonstration Track, pages 170 190. PMLR, 2021.

George H. John and Pat Langley. Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD 96, page 367 370. AAAI Press, 1996.

S Jones, S Carley, and M Harrison. An introduction to power and sample size estimation. Emergency Medicine Journal: EMJ, 20(5):453, 2003.

Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeﬀrey Wu, and Dario Amodei. Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020.

Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22 30, 2012.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Mark Last. Predicting and optimizing classiﬁer utility with the power law. In Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), pages 219 224. IEEE, 2007.

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, pages 21 37. Springer, 2016.

Raﬁd Mahmood, Sanja Fidler, and Marc T. Law. Low-budget active learning via wasserstein distance: An integer programming approach. In International Conference on Learning Representations, 2022a.

Optimizing Data Collection for Machine Learning

Raﬁd Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, and Marc T. Law. How much more data do we need? estimating requirements for downstream tasks. In 2022 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2022b.

Raﬁd Mahmood, James Lucas, Jose M Alvarez, Sanja Fidler, and Marc T Law. Optimizing data collection for machine learning. In Advances in Neural Information Processing Systems, volume 36, 2022c.

Christopher Meek, Bo Thiesson, and David Heckerman. The learning-curve sampling method applied to model-based clustering. Journal of Machine Learning Research, 2(Feb):397 418, 2002.

Hiroaki Mikami, Kenji Fukumizu, Shogo Murai, Shuji Suzuki, Yuta Kikuchi, Taiji Suzuki, Shin-ichi Maeda, and Kohei Hayashi. A scaling law for syn2real transfer: How much is your pre-training eﬀective? In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19 23, 2022, Proceedings, Part III, page 477 492, 2022.

Felix Mohr, Tom J Viering, Marco Loog, and Jan N van Rijn. Lcdb 1.0: An extensive learning curves database for classiﬁcation tasks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 3 19. Springer, 2022.

Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the European Conference on Computer Vision, 2020.

Viraj Uday Prabhu, David Acuna, Raﬁd Mahmood, Marc T Law, Yuan-Hong Liao, Judy Hoﬀman, Sanja Fidler, and James Lucas. Bridging the sim2real gap with CARE: Supervised detection adaptation with conditional alignment and reweighting. Transactions on Machine Learning Research, 2023.

Aayush Prakash, Shoubhik Debnath, Jean-Francois Laﬂeche, Eric Cameracci, Gavriel State, Stan Birchﬁeld, and Marc T. Law. Self-supervised real-to-sim scene generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16044 16054, October 2021.

Foster Provost, David Jensen, and Tim Oates. Eﬃcient progressive sampling. In Proceedings of the ﬁfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 23 32, 1999.

Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

Jonathan S Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. In International Conference on Learning Representations, 2020.

Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.

Mahmood, Lucas, Alvarez, Fidler, and Law

Burr Settles. Active learning literature survey. 2009.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.

Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972 5981, 2019.

Richard D Smallwood and Edward J Sondik. The optimal control of partially observable markov processes over a ﬁnite horizon. Operations Research, 21(5):1071 1088, 1973.

Kirstine Smith. On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika, 12(1/2):1 85, 1918.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable eﬀectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pages 843 852, 2017.

Katrin Tomanek and Udo Hahn. Approximating learning curves for active-learning-driven annotation. In LREC, volume 8, pages 1319 1324, 2008.

Venture Beat. Why do 87% of data science projects never make it into production?, Jul 2019. URL https://venturebeat.com/2019/07/19/ why-do-87-of-data-science-projects-never-make-it-into-production/.

Felipe AC Viana. A tutorial on latin hypercube design of experiments. Quality and reliability engineering international, 32(5):1975 1985, 2016.

Tom Viering and Marco Loog. The shape of learning curves: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

Sen Wang and KJ Yang. An image scaling algorithm based on bilinear interpolation with vc++. Techniques of Automation and Applications, 27(7):44 45, 2008.

Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 93 102, 2019.

Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2636 2645, 2020.

Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

Eric Zhao, Anqi Liu, Animashree Anandkumar, and Yisong Yue. Active learning under label shift. In International Conference on Artiﬁcial Intelligence and Statistics, pages 3412 3420. PMLR, 2021.