# DSDM: Model-Aware Dataset Selection with Datamodels

Logan Engstrom¹, Axel Feldmann¹, Aleksander Mądry¹

¹MIT. Correspondence to: Logan Engstrom.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior. However, in practice the opposite can often happen: we find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. To develop better methods for selecting data, we start by framing dataset selection as an optimization problem that we can directly solve for: given target tasks, a learning algorithm, and candidate data, select the subset that maximizes model performance. This framework thus avoids handpicked notions of data quality, and instead explicitly models how the learning process uses training datapoints to predict on the target tasks. Our resulting method greatly improves language model (LM) performance on both pre-specified tasks and previously unseen tasks. Specifically, choosing target tasks representative of standard LM problems and evaluating on diverse held-out benchmarks, our selected datasets provide a 2× compute multiplier over baseline methods.

1. Introduction

Suppose we want to train a large-scale machine learning model. What data should we train on? The simple answer is: as much data as possible. For example, we train language and vision models on vast quantities of text (Radford et al., 2019) and image-caption (Ramesh et al., 2021) data from sources like internet crawls. This seemingly straightforward recipe yields models that generalize remarkably well to a broad range of tasks.

A closer look, however, reveals that choosing training data is not actually so straightforward. Indeed, not all data is equally useful; for example, internet data sources frequently contain low-quality data like spam, poor writing, or nonsense text. Therefore, in practice, we tend to filter training data according to intuitive notions of quality, e.g., choosing documents similar to a high-quality data source like Wikipedia or discarding documents with fewer than five sentences. These steps choose (qualitatively) clean samples that should intuitively improve performance. However, do such samples improve performance in practice too?

Contributions. We find that the opposite can happen: selecting data according to similarity with high-quality data sources may not improve (and, in fact, can even hurt) model performance. Specifically, we train language models with standard, similarity-based selection methods previously used to select data for models like PaLM and GPT-3 (Brown et al., 2020; Xie et al., 2023b), and find these methods do not outperform (and can even underperform) selecting data at random (cf. Section 4). To develop better methods for selecting training data, we start from first principles. That is, we avoid intuitive notions of data quality, and instead frame dataset selection as an optimization problem where the goal is, given target tasks, a learning algorithm, and a candidate data pool, to select the data that maximizes model performance.
However, actually finding the optimal solution to this problem is difficult. While we can calculate the performance of a specific training set by training a model on that set (and then evaluating), it is (generally) unclear how to calculate the best possible training subset without examining every possible subset one by one, a computationally infeasible procedure. We instead approximate the optimal subset by (approximately) modeling how the learning algorithm actually uses training data to predict. Specifically, in Section 2, we model target task performance as a function of the training subset using datamodels (which efficiently approximate the mapping between training subset and model performance (Ilyas et al., 2022)), and select the subset that maximizes our estimate. Then, in Section 3, we demonstrate that our resulting method, dataset selection with datamodels (DSDM), consistently improves language model performance on diverse target tasks (e.g., SQuAD (Rajpurkar et al., 2016) and LAMBADA (Paperno et al., 2016)), even when existing selection methods do not.

DSDM-selected data can improve performance on pre-specified tasks. However, in practice we train large-scale models to generalize to yet unseen tasks. Our framework suggests a principled approach to selecting data in this scenario too: choose target tasks similar to those we expect at deployment time, then select the optimal dataset subset for these target tasks. Following this strategy, in Section 4, we choose target tasks that cover a range of natural language problem categories (SQuAD, Jeopardy (MosaicML, 2023), and LAMBADA), and select data from C4, a canonical web crawl (Raffel et al., 2020). Our selections deliver a 2× compute multiplier on a diverse set of test benchmarks: DSDM-selected datasets yield LMs that perform as well as those trained with 2× the compute budget on randomly selected data (we train up to 1.8B-parameter models). In contrast, no baseline method outperforms randomly selecting data even at the same compute budget.

2. Estimating the Optimal Dataset Selection

To select better data for training large-scale models, we start by defining the optimal dataset selection as an optimization problem. We then select data by finding a train subset that is approximately the best solution to that problem. Specifically, we use datamodels (Ilyas et al., 2022) to approximate how the learning algorithm uses data to predict on the tasks of interest. We describe the resulting framework in more detail below.

2.1. Task-Optimal Dataset Selection

We frame dataset selection as an optimization problem where the goal is to minimize trained model loss on a set of target tasks with respect to training data choice. Given a learning algorithm $\mathcal{A}$ (e.g., SGD on a neural network) that maps a train set to a trained model, and a target distribution $\mathcal{D}_{\text{targ}}$ (e.g., a language modeling task), the size-$k$ task-optimal dataset selection over the set $\mathcal{S}$ of available data (e.g., documents from an internet scrape) is the subset

$$S^{\star} := \arg\min_{S \subseteq \mathcal{S},\, |S| = k} L_{\mathcal{D}_{\text{targ}}}(S), \qquad \text{where} \quad L_{\mathcal{D}}(S) := \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(x;\, \mathcal{A}(S))\right], \tag{1}$$

that minimizes the trained model population loss $L_{\mathcal{D}_{\text{targ}}}(S)$; here $\ell(x; g)$ denotes the loss (e.g., cross-entropy loss) for model $g$ on example $x$. Note that the expectation in the population loss is over both target dataset and learning algorithm randomness (as, e.g., SGD is a non-deterministic algorithm).
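To make the objective in (1) concrete, the sketch below spells it out as a brute-force search; the `train` and `target_loss` callables are hypothetical stand-ins for the learning algorithm $\mathcal{A}$ and an empirical estimate of $L_{\mathcal{D}_{\text{targ}}}$, and the sketch is only meant to illustrate the objective (and why solving it exactly is infeasible: it trains one model per candidate subset).

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple


def task_optimal_subset(
    candidate_pool: Sequence,                    # the available data S (e.g., tokenized documents)
    k: int,                                      # desired training-subset size
    train: Callable[[Sequence], object],         # learning algorithm A: train subset -> trained model
    target_loss: Callable[[object], float],      # empirical estimate of L_{D_targ}: model -> mean target loss
) -> Tuple[Sequence, float]:
    """Brute-force version of Eq. (1): return the size-k subset minimizing trained-model target loss.

    Only feasible for toy pools: the loop performs one full training run for each of
    the C(|S|, k) candidate subsets.
    """
    best_subset, best_loss = None, float("inf")
    for subset in combinations(candidate_pool, k):
        model = train(subset)          # run the learning algorithm on this candidate subset
        loss = target_loss(model)      # evaluate E_{x ~ D_targ}[l(x; A(S))] empirically
        if loss < best_loss:
            best_subset, best_loss = subset, loss
    return best_subset, best_loss
```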
In our setting, minimizing (1) is difficult. Indeed, we do not have an easy-to-optimize, closed-form expression for trained model loss in terms of training set choice $S$ for large-scale model learning algorithms.¹ While we can directly calculate the trained model loss for a given $S$ by actually training on $S$ with $\mathcal{A}$ (and then evaluating loss), using this method to find the best subset is generally computationally infeasible: we would need to train (and evaluate) a model for each of the $\binom{|\mathcal{S}|}{k}$ possible size-$k$ train subsets.

¹ Depending on the setup, we may have such a form for other classes of learning algorithms, like linear regression (with influence functions (Cook, 1977; Giordano et al., 2019)) or kernel regression (Bierens, 1988).

2.2. Estimating Model Loss Efficiently with Datamodels

To circumvent this computational challenge, we trade optimality for feasibility, and instead estimate the best train subset. Specifically, we approximate the trained model loss in place of calculating it directly, then select the subset that minimizes our approximation. The core primitive we use to approximate the trained model loss is datamodeling (Ilyas et al., 2022), a framework originally designed to predict how the choice of training set changes model predictions. More precisely, a datamodel for a fixed sample $x$ approximates the mapping from train subset choice $S$ (out of the available dataset $\mathcal{S}$) to the resulting trained model loss on $x$, i.e., the function

$$L_x(S) := \mathbb{E}\left[\ell(x;\, \mathcal{A}(S))\right].$$

Previous work used datamodels primarily for reliability purposes, e.g., to detect data poisoning (Khaddaj et al., 2022) or train-test leakage (Ilyas et al., 2022). In contrast, we leverage datamodels to cheaply approximate the trained model loss $L_x$.

Formally, given a candidate data subset $S \subseteq \mathcal{S}$, datamodels take as input the corresponding characteristic vector $\mathbf{1}_S \in \{0, 1\}^{|\mathcal{S}|}$, defined as

$$(\mathbf{1}_S)_i = \begin{cases} 1 & \text{if } S_i \in S \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$

instead of the subset $S$ directly. Then, the datamodel $\tau_{\theta_x}$ for $x$ is the parameterized function $\tau_{\theta_x}: \{0,1\}^{|\mathcal{S}|} \to \mathbb{R}$ that optimally predicts $L_x$ over a (chosen) distribution of train subsets $\mathcal{D}_S$, i.e.,

$$\theta_x = \arg\min_{\theta}\; \hat{\mathbb{E}}^{(m)}_{S_i \sim \mathcal{D}_S}\left[ L_{\text{reg}}\big(\tau_{\theta}(\mathbf{1}_{S_i}),\, L_x(S_i)\big) \right], \tag{3}$$

where $L_{\text{reg}}(\cdot, \cdot)$ is a regression loss function (e.g., mean squared error), and $\hat{\mathbb{E}}^{(m)}$ is an $m$-sample empirical expectation. Note that in practice, we estimate the datamodel parameters that minimize (3) (i.e., we estimate the parameters of the function we use to approximate model loss).

Linear datamodels. So far we have only defined the datamodeling framework; we have not actually defined the parameterized function $\tau_\theta$ or described how to estimate the parameters $\theta$. In this work, we instantiate datamodels as a linear function of the characteristic vector $\mathbf{1}_S$ (a standard choice (Ilyas et al., 2022; Saunshi et al., 2023)), such that $\tau_{\theta_x}(\mathbf{1}_S) := \theta_x^{\top} \mathbf{1}_S$. Note that, being a linear model, $\tau_{\theta_x}$ treats the inclusion of an example $S_i$ in the train set as having a fixed effect on $L_x(S)$, irrespective of the other examples in $S$ (this fixed effect is exactly the value of index $i$ of $\theta_x$). In this work, to estimate linear datamodel parameters $\theta_x$ we largely follow the procedures of previous work (Park et al., 2023; Ilyas et al., 2022); in particular, we use the TRAK estimator, but make changes needed for the language modeling domain (see Appendix B for full details).
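The paper estimates $\theta_x$ with TRAK rather than by directly solving (3) (see Appendix B), but the regression view is easy to state in code. Below is a minimal sketch of that formulation as a ridge regression on characteristic vectors; the function name, the ridge penalty, and the omission of a bias term are illustrative choices, and the approach is only workable for small candidate pools (each row of the design matrix requires training and evaluating one model).

```python
import numpy as np


def fit_linear_datamodel(
    subset_masks: np.ndarray,   # shape (m, n): characteristic vectors 1_{S_i} of m sampled train subsets
    losses: np.ndarray,         # shape (m,): observed losses L_x(S_i) of models trained on each subset
    l2: float = 1e-3,           # small ridge penalty for numerical stability (a practical choice)
) -> np.ndarray:
    """Estimate theta_x in Eq. (3) with ridge regression, so theta_x @ 1_S approximates L_x(S)."""
    m, n = subset_masks.shape
    gram = subset_masks.T @ subset_masks + l2 * np.eye(n)
    theta_x = np.linalg.solve(gram, subset_masks.T @ losses)
    return theta_x


# Usage: predict the loss on x for a new candidate subset via its characteristic vector.
# theta_x = fit_linear_datamodel(masks, losses); predicted_loss = theta_x @ new_mask
```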
2.3. DSDM: Dataset Selection with Datamodels

Recall that our goal is to estimate the candidate data subset that minimizes trained model loss on the target task (cf. (1)). To do so, we approximate the mapping between training subset $S$ and target distribution loss, i.e., $L_{\mathcal{D}_{\text{targ}}}(S)$, with datamodels as a primitive, then select the candidate data subset that minimizes our approximation of the target loss. Specifically, given a train subset $S$, we estimate the corresponding target distribution loss with an $n$-sample empirical expectation of datamodel loss estimates over $\mathcal{D}_{\text{targ}}$ samples:

$$\hat{L}_{\mathcal{D}_{\text{targ}}}(S) := \hat{\mathbb{E}}^{(n)}_{x_i \sim \mathcal{D}_{\text{targ}}}\left[ \tau_{\theta_{x_i}}(\mathbf{1}_S) \right].$$

Then, our size-$k$ dataset selection with datamodels (DSDM) estimate of the optimal dataset selection is the subset that minimizes the approximated target loss $\hat{L}_{\mathcal{D}_{\text{targ}}}(S)$ with respect to training set choice:

$$\hat{S}_{\text{DM}} := \arg\min_{S \subseteq \mathcal{S},\, |S| = k} \hat{L}_{\mathcal{D}_{\text{targ}}}(S).$$

In our instantiation, the considered datamodels are linear, so $\hat{L}_{\mathcal{D}_{\text{targ}}}(S) = \left(\frac{1}{n}\sum_{i=1}^{n} \theta_{x_i}\right)^{\top} \mathbf{1}_S$, and DSDM selects the examples corresponding to the smallest $k$ indices (the "arg bot-$k$") of $\frac{1}{n}\sum_{i=1}^{n} \theta_{x_i}$. (Note that linear datamodels are a design choice: DSDM can use any datamodel parameterization that can be optimized over.)
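With linear datamodels in hand, the selection step itself is a simple vector operation. The following sketch implements the "arg bot-$k$" rule above; names are illustrative, and it assumes the per-target-sample weight vectors fit in a dense array (at C4 scale one would process them in chunks).

```python
import numpy as np


def dsdm_select(datamodel_weights: np.ndarray, k: int) -> np.ndarray:
    """DSDM selection with linear datamodels.

    datamodel_weights: shape (n_targets, n_candidates); row i holds theta_{x_i} for target sample x_i.
    Returns the indices of the k candidate examples with the smallest averaged weight, i.e., the
    examples predicted to most reduce target loss when included (assumes k < n_candidates).
    For multiple target tasks (Section 4), stack each task's target samples into the same array
    so that D_targ is an equal mix of the tasks.
    """
    avg_weights = datamodel_weights.mean(axis=0)        # (1/n) * sum_i theta_{x_i}
    bottom_k = np.argpartition(avg_weights, k)[:k]      # unordered indices of the k smallest entries
    return bottom_k[np.argsort(avg_weights[bottom_k])]  # sort selected indices by predicted usefulness
```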
3. Evaluating DSDM

To what extent does DSDM actually minimize trained model target task loss? In this section, we demonstrate that DSDM consistently reduces LM target task loss in practice. In contrast, baseline targeted dataset selection methods, all of which ignore the model training process and instead select data according to textual similarity with target task samples, often do not outperform randomly selecting data. Below, we describe our experimental setup, then discuss results.

To capture the effectiveness of a given data selection method, we measure the extent to which it reduces the optimal dataset selection objective of (1), $L_{\mathcal{D}_{\text{targ}}}(S) := \mathbb{E}_{x \sim \mathcal{D}_{\text{targ}}}\left[\ell(x;\, \mathcal{A}(S))\right]$, across varying target tasks. For each considered target task, we split samples into a target set and a separate test set, and only use the target set to select training subsets. We then train an LM on the resulting dataset, and inspect target task performance (using the test set). Below, we describe the experimental setup as well as the baselines we use (see Appendix C for more setup details).

3.1. Experimental Setup

Target tasks, candidate dataset, and model training. We consider four separate LM target tasks: LAMBADA (Paperno et al., 2016), CS-Algorithms (Srivastava et al., 2022), SQuAD (Rajpurkar et al., 2016), and Jeopardy (Tunguz, 2019); see Appendix C.1 for more details on each task. Our candidate dataset $\mathcal{S}$ is the English subset of the Colossal Clean Crawled Corpus (C4), a standard web scrape (Raffel et al., 2020).² On each selected train dataset, we train a 125M-parameter GPT-2 style model on 6 billion tokens.

² Each candidate example $S_i$ is a sequence-length (1024-token) corpus slice; $|\mathcal{S}| \approx 217{,}000{,}000$ (cf. Appendix A.1).

Baselines. We compare DSDM with two standard targeted dataset selection methods, both of which select according to textual similarity between candidate training samples and $\mathcal{D}_{\text{targ}}$ samples: CLASSIFIER (selects the top examples in $\mathcal{S}$ as scored by a logistic model trained, on fastText features, to classify between $\mathcal{S}$ and $\mathcal{D}_{\text{targ}}$ samples; used by GPT-3/PaLM/The Pile (Chowdhery et al., 2022; Gao et al., 2020)) and DSIR (Data Selection with Importance Resampling, which chooses train samples whose n-grams distributionally match those of $\mathcal{D}_{\text{targ}}$ (Xie et al., 2023b)). We also compare with randomly selecting data (RANDOM).

3.2. Results

In Figure 1 we display the mean log-probability (of the label given the context, across task samples; larger is better) achieved on each target task by training a model with each selection method (varying dataset selection size). Each model was trained on the same number of total tokens, with models trained on smaller fractions of C4 traversing more epochs.

[Figure 1 (plots omitted): log-probability (higher is better) on each target task, e.g., CS-Algorithms, versus the fraction of C4 selected, for DSDM, DSIR, CLASSIFIER, RANDOM (1× compute), and RANDOM (10× compute).]

Figure 1: Target task performance by selection method, varying dataset selection size. We train 125M models on a fixed number of tokens for each selection, adjusting epochs accordingly. DSDM consistently improves performance, even when baselines do not outperform randomly selecting data (e.g., on SQuAD and CS-Algorithms). DSDM models also consistently match a larger model trained with 10× the compute budget on random data (a Chinchilla-optimal 1.3B model). DSDM performance decreases with larger selection fraction, indicating that higher-ranked DSDM samples (i.e., data in the smallest selections) tend to improve performance more than less highly ranked samples (i.e., data only present in larger selections). We measure the average log-probability of the label across samples. The RANDOM shaded area is the range of values achieved by 10 RANDOM models trained on one epoch of data (RANDOM performance is not x-axis dependent). Measuring accuracy in place of log-probability yields similar conclusions (cf. Figure 9).

We find that DSDM most improves target task performance on all tasks. Models trained with DSDM even outperform a larger model trained with 10× the compute on randomly selected data. Additionally, DSDM performance decreases with larger selection fraction, indicating that the samples predicted by DSDM to most improve performance actually do so in practice. After all, smaller selections will contain more useful data (as predicted by DSDM) on average compared to larger selections (e.g., all methods select the same subset for selection fraction 1).

In contrast, baselines that select according to textual similarity with the target task, CLASSIFIER and DSIR, do not consistently beat randomly selecting data (e.g., on SQuAD and CS-Algorithms). These results suggest that similarity with the target task does not suffice to find useful samples. Note that baselines only match DSDM on LAMBADA (a passage completion task), which is also the only task without in-context instructions. We hypothesize that n-gram similarity may not capture how instructions define tasks.

To better understand how dataset choice relates to performance, we inspect the datapoints that each method is most and least likely to select (for SQuAD: in Figure 2; for all other targets: in Appendix C.3). We find that:

Useful data is not necessarily similar to the target task (or intuitively helpful at all). Looking at selected data for SQuAD in Figure 2, DSIR and CLASSIFIER select data that is more qualitatively similar to SQuAD samples (which are Wikipedia excerpts with questions, cf. Appendix Figure 5) than DSDM does. Instead, DSDM samples often contain question answering-related text that does not match the SQuAD format; DSDM performance shows that qualitatively similar data is not necessarily the best data. However, helpful data is not always intuitively useful. Indeed, the DSDM examples for CS-Algorithms and Jeopardy (cf. Appendix Figures 21 and 15) often contain seemingly nonsense text. Yet, DSDM yields the best models for these tasks.

DSDM discards mislabeled data.
Samples that DSIR and CLASSIFIER are least likely to select are qualitatively different from those of DSDM. Inspecting Appendix Figure 11 for data selected for SQuAD: the least likely samples for all methods are incoherent/malformed, but those of DSDM also often contain QA text. Despite this, such DSDM samples hurt model performance: training on them is worse than selecting randomly (cf. Appendix Figure 10). We liken these samples to mislabeled examples from supervised learning, and conjecture that excluding such data could (in part) explain DSDM performance.

[Figure 2 (text excerpts omitted): two randomly chosen selected excerpts per method, in panels (a) DSDM samples, (b) DSIR samples, and (c) CLASSIFIER samples.]

Figure 2: Samples selected by each method for SQuAD. Selected CLASSIFIER and DSIR samples are intuitively high-quality text, and more similar to SQuAD examples (which are Wikipedia excerpts with questions) than DSDM samples are. DSDM samples do not match SQuAD, but do contain QA-style text, e.g., (1) left (a question in an ad) or (2) left (a dictionary definition). We display random samples from each method's selected subset (cf. Appendix C.3).

4. Selecting Data for Broad Model Capabilities

So far, we have shown that DSDM consistently reduces loss on pre-specified target tasks. However, when we train large-scale models in practice, our hope is that they will perform well on yet unseen tasks too. Our framework suggests a straightforward approach to improving this kind of performance: choose target tasks that match those we expect to see at model deployment time, then estimate the optimal dataset selection for these proxy target tasks. In this section, we demonstrate that this approach to selecting data can greatly improve held-out task performance compared to baselines.

Specifically, we consider three target tasks that cover a broad range of language modeling problem categories, LAMBADA (language understanding problems), SQuAD (reading comprehension problems), and Jeopardy (world knowledge problems), and estimate the optimal training dataset selection for these tasks (all together) via DSDM. We then compare models trained on this data with models trained via existing dataset selection baselines. Overall, evaluating on a diverse set of held-out benchmarks (meant to model "yet unseen" tasks), we find that: (a) randomly selecting data is a surprisingly strong baseline: no baseline selection method outperforms selecting data at random; and (b) our approach yields models that match those trained with 2× the training compute on randomly selected data. In particular, models trained with our approach reliably improve performance on benchmarks that are qualitatively related to the target tasks. We describe our setup below, and defer additional details to Appendix D.

Model training, scaling DSDM, selection baselines, and evaluation. We train GPT-2 style LMs with varying compute budgets. To train the best possible model for a given compute budget, we use Chinchilla-optimal parameter-to-train-tokens ratios (Hoffmann et al., 2022) and train up to 1.8B-parameter models. To select with DSDM, we use 125M proxy models: we calculate DSDM subsets for 125M models, then train on these selections at each compute budget (instead of computing DSDM separately for each model class). DSDM cost scales linearly with model size, so this procedure greatly reduces overhead (cf. Appendix B.5). For baselines, we compare with two methods that select via textual similarity with a specified "high quality" data source (DSIR and CLASSIFIER, the baselines of Section 3), a data deduplication method (SemDeDup (Abbas et al., 2023)), and selecting data randomly. We evaluate on 15 standard benchmarks (cf. Table 1).
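As a rough illustration of what "Chinchilla-optimal" budgets look like at these model sizes, the sketch below uses the roughly 20-tokens-per-parameter rule of thumb commonly derived from Hoffmann et al. (2022) and the standard 6·N·D estimate of training FLOPs; both are approximations we assume here, not numbers taken from this paper.

```python
# Approximate Chinchilla-style training budgets (a sketch; assumes the ~20 tokens-per-parameter
# rule of thumb from Hoffmann et al. (2022) and the common FLOPs ~= 6 * params * tokens estimate).
TOKENS_PER_PARAM = 20

model_sizes = {"125M": 125e6, "356M": 356e6, "760M": 760e6, "1.3B": 1.3e9, "1.8B": 1.8e9}

for name, params in model_sizes.items():
    tokens = TOKENS_PER_PARAM * params
    flops = 6 * params * tokens
    print(f"{name}: ~{tokens / 1e9:.0f}B tokens, ~{flops:.1e} training FLOPs")
```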
Target tasks. We execute each targeted dataset selection method using its originally proposed target task. For DSDM, we apply the framework described above: we select three target tasks that cover a broad range of LM problem categories, LAMBADA, SQuAD, and Jeopardy, then estimate the optimal dataset selection for these tasks together (i.e., $\mathcal{D}_{\text{targ}}$ as an equal mix of these tasks). For CLASSIFIER and DSIR, we target a replication of the "high quality" target distribution proposed by these methods (a mix of Wikipedia (Foundation, 2022), Books1 (Presser, 2021), and OpenWebText (Gokaslan et al., 2019); cf. Appendix D.1).

4.1. Results

In Figure 3, we display the mean benchmark performance of models trained with each selection method, varying training compute budget. Randomly selecting data is a strong baseline: all baseline methods generally match or perform worse than random selection across training compute budgets (Figure 3, left). In the case of CLASSIFIER and DSIR, we hypothesize that data selected via similarity with a fixed source hurts model performance by trading off data diversity for (qualitatively) cleaner data. In contrast, DSDM is a 2× compute multiplier: DSDM yields 1.3B models that match models trained with 2× the compute budget on randomly selected data (Figure 3, right). Furthermore, across compute budgets, DSDM consistently outperforms all selection baselines (Figure 3, left).

[Figure 3 (plots omitted): left, mean accuracy versus training budget (in multiples of the 1.3B training compute) for DSDM, DSIR, CLASSIFIER, SemDeDup, and RANDOM, with the 2× compute efficiency gap marked; right, mean accuracy of 1.3B models by selection method, with RANDOM at 1× and 2× compute shown for reference.]

Figure 3: Left: mean benchmark performance, varying training compute budget. Right: mean performance for 1.3B models. Randomly selecting data is a strong baseline: no active selection baseline outperforms random selection. In contrast, DSDM outperforms all baselines across compute budgets (left panel), and even matches training with 2× the compute on randomly selected data (when training 1.3B models, right panel). Our train budgets correspond to training 125M, 356M, 760M, and 1.3B parameter Chinchilla-optimal LMs. To contextualize 1.3B results, we show the performance of a model trained on randomly selected data with 2× the 1.3B compute budget (i.e., a 1.8B Chinchilla-optimal model).

Going beyond aggregate performance, we find that DSDM greatly improves on benchmarks related to the target tasks, while simultaneously not reducing performance on unrelated categories (on average). More precisely, inspecting individual benchmark performance in Table 1, DSDM most improves reading comprehension and world knowledge benchmarks compared to selecting randomly. We hypothesize that our choice of target tasks leads to improved performance on these benchmarks (which are qualitatively similar to SQuAD, a reading comprehension task, and Jeopardy, a world knowledge task). Furthermore, in these categories DSDM consistently matches or outperforms training with 2× the compute budget on randomly selected data (i.e., the 1.8B model in Table 1). Crucially, DSDM improves on these categories while also not reducing performance in other categories. As a comparison, DSIR, which targets mostly formal text, performs well on language understanding tasks but poorly on other categories (e.g., world knowledge and symbolic problem solving).

Target tasks improve performance on qualitatively similar benchmarks. So far, we have only targeted DSDM with a mix of LAMBADA, Jeopardy, and SQuAD. How does target task choice change model behavior?
We find that targeting a specific task generally improves performance on qualitatively related tasks. To demonstrate, in Figure 4 we display accuracy by benchmark category while varying the target task across LAMBADA, Jeopardy, SQuAD, and all three at once. Here, targeting a task generally improves accuracy on related tasks, e.g., SQuAD most improves reading comprehension, and Jeopardy most improves world knowledge. Furthermore, targeting all tasks at once improves overall accuracy the most. However, targeting can also decrease accuracy on unrelated tasks. For example, targeting LAMBADA, a language understanding task, reduces world knowledge accuracy compared to randomly selecting data. Our results suggest that we can tailor target tasks to improve deployment-time performance, but also that we need to be careful to choose targets that are diverse enough to capture a range of downstream problems.

[Figure 4 (plot omitted): per-category accuracy (relative to random answers) for 760M models, comparing DSDM targeted at Jeopardy + SQuAD + LAMBADA, SQuAD, Jeopardy, or LAMBADA against RANDOM, across the commonsense reasoning, language understanding, reading comprehension, symbolic problem solving, and world knowledge categories.]

Figure 4: Per-category performance for 760M models trained with DSDM-selected data, varying target task. DSDM target tasks generally improve performance on (qualitatively) related benchmark task categories. Specifically, models targeted towards SQuAD/Jeopardy/LAMBADA improve accuracy on reading comprehension/world knowledge/language understanding, respectively. Targeting all three tasks at once improves overall accuracy. However, target tasks can also reduce performance on (qualitatively) unrelated tasks. For example, targeting with LAMBADA (a language understanding task) reduces performance on world knowledge tasks compared to randomly selecting. See each category's constituent benchmarks in Table 1. We plot improvement compared to randomly guessing answers (e.g., some benchmarks are multiple-choice).

DSDM is necessary to improve performance (with the targeted tasks). DSDM selections yield much better models than CLASSIFIER and DSIR selections. However, we have not yet compared these selection methods head-to-head with the same target task. CLASSIFIER and DSIR target a mix of "high quality" sources, while DSDM targets three LM tasks (Jeopardy, SQuAD, and LAMBADA). To what extent does selecting with DSDM drive performance compared to the difference in target tasks? We demonstrate that selecting with DSDM is necessary to improve performance on the considered target tasks. Specifically, we train models on data selected with DSIR and CLASSIFIER targeting LAMBADA, Jeopardy, and SQuAD, and find that (just as when targeting "high quality" text) neither outperforms randomly selecting data (cf. Appendix Figure 25).

Table 1: Accuracies on the considered benchmarks for 1.3B models trained with each selection method, along with a model trained with 2× the 1.3B compute budget on randomly selected data (a 1.8B model; Chinchilla-optimal models with larger parameter counts train with more tokens as well). In parentheses, we contextualize accuracy with the difference compared to a 1.3B model trained on randomly selected data.
| Category | Benchmark | DSDM (1.3B) | RANDOM (1.3B) | CLASSIFIER (1.3B) | DSIR (1.3B) | SemDeDup (1.3B) | RANDOM (1.8B) |
|---|---|---|---|---|---|---|---|
| Commonsense Reasoning | copa | 63.0 (+1) | 62.0 (+0) | 66.0 (+4) | 67.0 (+5) | 68.0 (+6) | 64.0 (+2) |
| Commonsense Reasoning | openbook_qa | 31.2 (-2) | 33.4 (+0) | 32.0 (-1) | 32.0 (-1) | 32.2 (-1) | 33.6 (+0) |
| Commonsense Reasoning | piqa | 69.0 (+0) | 68.9 (+0) | 69.4 (+1) | 65.7 (-3) | 69.7 (+1) | 71.5 (+3) |
| Language Understanding | cbt | 88.2 (+2) | 86.4 (+0) | 85.1 (-1) | 92.4 (+6) | 86.2 (+0) | 88.4 (+2) |
| Language Understanding | hellaswag | 42.3 (-3) | 44.9 (+0) | 42.7 (-2) | 40.4 (-5) | 44.9 (+0) | 50.1 (+5) |
| Language Understanding | winogrande | 51.1 (-1) | 52.2 (+0) | 50.5 (-2) | 55.3 (+3) | 50.3 (-2) | 50.9 (-1) |
| Reading Comprehension | boolq | 58.0 (+3) | 54.9 (+0) | 60.9 (+6) | 61.0 (+6) | 49.9 (-5) | 53.4 (-2) |
| Reading Comprehension | coqa | 25.5 (+7) | 18.8 (+0) | 16.7 (-2) | 16.5 (-2) | 22.9 (+4) | 24.9 (+6) |
| Reading Comprehension | news_qa | 15.6 (+8) | 7.5 (+0) | 5.1 (-2) | 5.5 (-2) | 8.6 (+1) | 9.5 (+2) |
| Symbolic Problem Solving | bb_copy_logic | 3.1 (+0) | 3.1 (+0) | 0.0 (-3) | 0.0 (-3) | 3.1 (+0) | 3.1 (+0) |
| Symbolic Problem Solving | bb_dyck_lang | 11.9 (-2) | 13.5 (+0) | 3.4 (-10) | 1.0 (-13) | 7.3 (-6) | 8.9 (-5) |
| Symbolic Problem Solving | bb_operators | 13.3 (+3) | 10.5 (+0) | 6.7 (-4) | 10.5 (+0) | 11.4 (+1) | 9.5 (-1) |
| World Knowledge | arc_easy | 47.6 (+3) | 44.8 (+0) | 44.7 (+0) | 39.6 (-5) | 43.5 (-1) | 48.5 (+4) |
| World Knowledge | bb_qa_wikidata | 48.1 (+8) | 40.6 (+0) | 48.3 (+8) | 37.7 (-3) | 45.5 (+5) | 53.6 (+13) |
| World Knowledge | trivia_qa | 7.1 (+3) | 3.7 (+0) | 2.5 (-1) | 3.5 (+0) | 2.4 (-1) | 4.1 (+0) |

All values are accuracy (%).

5. Discussion

Ostensibly, the sole goal of our dataset selection framework is to improve model performance by better selecting training data. However, one can view our framework more broadly. That is, one can also use our framework to select data that boosts any chosen downstream property of our trained models, not just performance on a given benchmark. In this sense, our framework (and accompanying method) unlocks data curation as another stage of the model training pipeline that we can intervene on to control downstream model behavior in a fine-grained manner. Below, we discuss in more detail the broader opportunities this view opens up, as well as other aspects of the framework, such as proxy modeling and computational efficiency.

Applications and broader opportunities. DSDM can optimize for any specified downstream model behavior. Indeed, by an appropriate choice of the target tasks, we can use our framework to improve a wide range of model behaviors, including: aligning models at pretraining time (in addition to or in place of existing methods, which typically operate post model training (Bai et al., 2022; Ziegler et al., 2019; Taori et al., 2023)); optimizing for notions of fairness; and improving performance on specific domains of interest (such as low-resource languages or programming).

Training stronger models with weaker proxy models. We select data for large models by using smaller models to proxy large-model behavior (recall that we use DSDM to select data for smaller proxy models, then train large models on these selections). Despite the fact that these proxy models are much worse than larger models on benchmarks (cf. Appendix Table 2), the corresponding selections nonetheless greatly improve performance. Furthermore, training on proxy models' selections is the simplest possible approach to scaling. Therefore, we suspect that scaling to larger models less naïvely could yield even better results. More broadly, our findings are in line with previous work showing that smaller models can still be leveraged to determine better training hyperparameters for larger models (Kaplan et al., 2020; Coleman et al., 2020; Hoffmann et al., 2022; Yang et al., 2022; Xie et al., 2023a).

Computational cost. DSDM is relatively inexpensive to compute in practical model training scenarios.
At a high level, the most expensive part of estimating DSDM is computing the gradient for each training example on a handful of small proxy models (in our case, four 125M-parameter LMs; see Appendix B.5 for a full cost breakdown). To contextualize DSDM cost with model training: computing gradients also dominates the cost of training LMs. Since the cost of computing a 125M-model gradient is orders of magnitude lower than the cost of computing gradients for standard model sizes,³ even a small compute multiplier (let alone the 2× improvement DSDM seems to offer) quickly makes the overhead of DSDM worthwhile. Additionally, after computing DSDM on a set of datapoints once, the cost of computing DSDM on those datapoints again is essentially negligible (as the required computations are easy to cache). Therefore, we can amortize DSDM's computational cost over the entire lifetime of training on the given dataset.

³ For reference: models trained today generally range from 3B to 175B parameters. The cost of a gradient is (roughly) linear in model size, so it is 24× to 1400× more expensive to compute gradients for these models than for 125M models.
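To make the dominant cost concrete, here is a minimal sketch of the kind of per-example computation involved: a per-example gradient on a small proxy model, compressed with a random (Johnson-Lindenstrauss-style) projection. This is not the paper's TRAK implementation (see Appendix B for that); the naive loop and the materialized projection matrix are simplifications that only make sense for small models.

```python
import torch


def projected_gradient_features(model, loss_fn, examples, proj_dim=1024, seed=0):
    """Per-example gradients of a proxy model, compressed with a fixed Gaussian random projection.

    `examples` is an iterable of (inputs, targets) pairs. Returns a (num_examples, proj_dim) tensor.
    A real implementation vectorizes the per-example gradients and streams the projection in chunks;
    here both are done naively for clarity.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    n_params = sum(p.numel() for p in params)
    gen = torch.Generator().manual_seed(seed)
    proj = torch.randn(n_params, proj_dim, generator=gen) / proj_dim ** 0.5  # JL projection matrix

    features = []
    for inputs, targets in examples:
        model.zero_grad(set_to_none=True)
        loss = loss_fn(model(inputs), targets)
        grads = torch.autograd.grad(loss, params)        # gradient for this single example
        flat = torch.cat([g.reshape(-1) for g in grads])
        features.append(flat @ proj)                     # project to proj_dim dimensions
    return torch.stack(features)
```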
6. Related Work

Current methods for selecting LM pretraining datasets tend to follow a two-step framework: (a) choose an intuitively high-quality reference corpus, like Wikipedia (Foundation, 2022), then (b) select data that matches it. There are two standard methods that adhere to this framework: DSIR (Data Selection with Importance Resampling (Xie et al., 2023b)) and CLASSIFIER (originally introduced in Brown et al. (2020) and used by other work (Gao et al., 2020; Chowdhery et al., 2022; Du et al., 2022)). Other work on selecting data for LM pretraining has included deduplicating examples in LM activation space (Abbas et al., 2023), and selecting examples with the largest difference in loss between LMs trained on the candidate and reference sets (Moore and Lewis, 2010; Axelrod, 2017; Feng et al., 2022). Simpler methods for selecting data are also commonplace; these include removing documents that are too short or contain too many special characters (Raffel et al., 2020; Computer, 2023; Xie et al., 2023b). In the LM domain, a related (but different) task to dataset selection is choosing weights for sampling from mixtures of data sources (Chen et al., 2023b; Xie et al., 2023a; Albalak et al., 2023).

Beyond LM pre-training, previous work also selects data in other domains. These works aim to: improve the performance on a given task (Wei et al., 2015; Kaushal et al., 2019; Wang et al., 2020; Killamsetty et al., 2021a; Chitta et al., 2021; Mindermann et al., 2022), identify core-sets of large training datasets (Sener and Savarese, 2017; Phillips, 2017; Coleman et al., 2020; Mirzasoleiman et al., 2020; Paul et al., 2021; Killamsetty et al., 2021b; Okanovic et al., 2023), and fine-tune LMs (Antonello et al., 2022; Chen et al., 2023a; Cao et al., 2023). Broadly, such methods select by prompting pretrained models, discriminating on proxies for model uncertainty like loss or gradient norm, matching on gradients, or deduplicating in model output space.

DSDM uses TRAK to estimate datamodel weights, which calculates influences using model gradients. We therefore additionally overview gradient-based methods for dataset selection, in the LM domain and in machine learning more broadly (Bejan et al., 2023; Wang et al., 2023; Xia et al., 2024). These methods estimate the effect of including a given train sample on a given test example by calculating the inner product between the two examples' gradients. Through this lens, all of these methods can be seen as applying a variant of TracIn (Pruthi et al., 2020) to compute influences that are then used to select data. In comparison, TRAK estimates the effect of including a given train sample on a given test sample by calculating influences on a linearized version of the model of interest (cf. Appendix B.2 for more details).

7. Conclusion

In this work, we cast dataset selection as an optimization problem: given target tasks, a learning algorithm, and a candidate training dataset, choose the training subset that maximizes performance. We then propose a method for approximating the solution to this optimization problem, DSDM, that selects by modeling how the learning algorithm uses training data to predict on the target tasks. We show that our method reliably improves target task performance in the LM setting, and furthermore use our framework to improve broader model generalization. By choosing target tasks similar to those we expect to see at deployment time, we can greatly improve model performance on yet unseen tasks.

Our findings prompt us to take a much broader view of the role of the dataset selection stage in model training. In particular, our framework demonstrates that dataset selection can be an effective tool for fine-grained control of model behavior. Indeed, we hypothesize that carefully choosing data can not only improve downstream task performance, but also improve other downstream properties of trained models, such as notions of predictor fairness, alignment with human preferences, or capabilities in specific domains like low-resource languages or programming. We also suspect that current methods for datamodeling only scratch the surface of understanding how models learn from data, and that we can greatly improve our ability to manipulate model behavior through training data by developing better datamodeling techniques.

Impact Statement

We introduce a new method for selecting data to improve model performance. Narrowly considering the direct use of our method, selecting certain data can unpredictably change fine-grained notions of model behavior. Therefore, as a method for selecting data, DSDM-selected data could cause unintended changes to model behavior. More broadly, DSDM is a method meant to improve machine learning models, and there are many potential consequences associated with advancing machine learning.

References
Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication. ar Xiv preprint ar Xiv:2303.09540, 2023. Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. Efficient online data mixing for language model pre-training. Training, 20000(40000): 60000, 2023. Richard Antonello, Nicole Beckage, Javier Turek, and Alexander Huth. Selecting informative contexts improves language model finetuning, 2022. Amittai Axelrod. Cynical selection of language model training data. ar Xiv preprint ar Xiv:1709.02279, 2017. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron Mc Kinnon, et al. Constitutional ai: Harmlessness from ai feedback. ar Xiv preprint ar Xiv:2212.08073, 2022. Irina Bejan, Artem Sokolov, and Katja Filippova. Make every example count: On the stability and utility of selfinfluence for learning from noisy nlp datasets. ar Xiv preprint ar Xiv:2302.13959, 2023. Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397 2430. PMLR, 2023. Hermanus Josephus Bierens. The nadaraya-watson kernel regression function estimator. 1988. Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language, 2019. Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle Mc Donell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. Gpt-neox-20b: An open-source autoregressive language model, 2022. URL https://arxiv.org/ abs/2204.06745. Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. ar Xiv preprint ar Xiv:2005.14165, 2020. Yihan Cao, Yanbin Kang, and Lichao Sun. Instruction mining: High-quality instruction data selection for large language models. ar Xiv preprint ar Xiv:2307.06290, 2023. Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. Alpagasus: Training a better alpaca with fewer data. ar Xiv preprint ar Xiv:2307.08701, 2023a. Mayee F Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré. Skill-it! a data-driven skills framework for understanding and training language models. ar Xiv preprint ar Xiv:2307.14430, 2023b. Kashyap Chitta, José M Álvarez, Elmar Haussmann, and Clément Farabet. Training data subset search with ensemble active learning. IEEE Transactions on Intelligent Transportation Systems, 23(9):14741 14752, 2021. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. ar Xiv preprint ar Xiv:2204.02311, 2022. DSDM: Model-Aware Dataset Selection with Datamodels Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 
Boolq: Exploring the surprising difficulty of natural yes/no questions, 2019. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ar Xiv:1803.05457v1, 2018. Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations (ICLR), 2020. Together Computer. Redpajama: an open dataset for training large language models. https://github.com/ togethercomputer/Red Pajama-Data, October 2023. R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 19(1):15 18, 1977. Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547 5569. PMLR, 2022. Yukun Feng, Patrick Xia, Benjamin Van Durme, and João Sedoc. Automatic document selection for efficient encoder pretraining. ar Xiv preprint ar Xiv:2210.10951, 2022. Wikimedia Foundation. English wikipedia. https:// huggingface.co/datasets/wikipedia, 2022. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. ar Xiv preprint ar Xiv:2101.00027, 2020. Ryan Giordano, William Stephenson, Runjing Liu, Michael Jordan, and Tamara Broderick. A swiss army infinitesimal jackknife. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1139 1147. PMLR, 2019. Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus. http://Skylion007. github.io/Open Web Text Corpus, 2019. Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children s books with explicit memory representations. ar Xiv preprint ar Xiv:1511.02301, 2015. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In ar Xiv preprint ar Xiv:2203.15556, 2022. Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Predicting predictions from training data. In International Conference on Machine Learning (ICML), 2022. William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. In Contemporary mathematics, 1984. Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601 1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. ar Xiv preprint ar Xiv:1607.01759, 2016. Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 
Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020. Vishal Kaushal, Rishabh Iyer, Suraj Kothawade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan. Learning from less data: A unified data subset selection and active learning framework for computer vision. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1289 1299. IEEE, 2019. Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Andrew Ilyas, Hadi Salman, and Aleksander Madry. Backdoor or feature? a new perspective on data poisoning. 2022. Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8110 8118, 2021a. DSDM: Model-Aware Dataset Selection with Datamodels Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh Iyer. Retrieve: Coreset selection for efficient and robust semi-supervised learning. Advances in Neural Information Processing Systems, 34:14488 14501, 2021b. Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, 2017. Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. ar Xiv preprint ar Xiv:1801.10198, 2018. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. ar Xiv preprint ar Xiv:1809.02789, 2018. Sören Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, et al. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pages 15630 15649. PMLR, 2022. Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pages 6950 6960. PMLR, 2020. Robert C Moore and William Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 conference short papers, pages 220 224, 2010. Mosaic ML. Composer, 2021. URL https://www. github.com/mosaicml/composer. Mosaic ML. LLM Foundry, 2023. URL https://www. github.com/mosaicml/llm-foundry. Patrik Okanovic, Roger Waleffe, Vasilis Mageirakos, Konstantinos E Nikolakakis, Amin Karbasi, Dionysis Kalogerias, Nezihe Merve Gürel, and Theodoros Rekatsinas. Repeated random sampling for minimizing the time-toaccuracy of learning. ar Xiv preprint ar Xiv:2305.18424, 2023. Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. ar Xiv preprint ar Xiv:1606.06031, 2016. Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. Trak: Attributing model behavior at scale. In Arxiv preprint ar Xiv:2303.14186, 2023. Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34:20596 20607, 2021. Jeff M Phillips. Coresets and sketches. 
In Handbook of discrete and computational geometry, pages 1269 1288. Chapman and Hall/CRC, 2017. Daryl Pregibon. Logistic regression diagnostics. In The Annals of Statistics, 1981. Shawn Presser. Bookcorpusopen. https: //huggingface.co/datasets/ bookcorpusopen, 2021. Garima Pruthi, Frederick Liu, Mukund Sundararajan, and Satyen Kale. Estimating training data influence by tracing gradient descent. In Neural Information Processing Systems (Neur IPS), 2020. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 2020. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. ar Xiv preprint ar Xiv:1606.05250, 2016. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821 8831. PMLR, 2021. Siva Reddy, Danqi Chen, and Christopher D. Manning. Coqa: A conversational question answering challenge, 2019. Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd DSDM: Model-Aware Dataset Selection with Datamodels schema challenge at scale. Communications of the ACM, 64(9):99 106, 2021. Nikunj Saunshi, Arushi Gupta, Mark Braverman, and Sanjeev Arora. Understanding influence functions and datamodels via harmonic analysis. In ICLR, 2023. Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. ar Xiv preprint ar Xiv:1708.00489, 2017. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. ar Xiv preprint ar Xiv:2206.04615, 2022. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instructionfollowing llama model. https://github.com/ tatsu-lab/stanford_alpaca, 2023. Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. Newsqa: A machine comprehension dataset. ar Xiv preprint ar Xiv:1611.09830, 2016. Bojan Tunguz. Jeopardy! questions, 2019. URL https://www.kaggle.com/datasets/ tunguz/200000-jeopardy-questions. Xiao Wang, Weikang Zhou, Qi Zhang, Jie Zhou, Songyang Gao, Junzhe Wang, Menghan Zhang, Xiang Gao, Yunwen Chen, and Tao Gui. Farewell to aimless large-scale pretraining: Influential subset selection for language model. ar Xiv preprint ar Xiv:2305.12816, 2023. Xinyi Wang, Hieu Pham, Paul Michel, Antonios Anastasopoulos, Jaime Carbonell, and Graham Neubig. Optimizing data usage via differentiable rewards. In International Conference on Machine Learning, pages 9983 9995. PMLR, 2020. Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. 
In International conference on machine learning, pages 1954 1963. PMLR, 2015. Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: Selecting influential data for targeted instruction tuning, 2024. Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. ar Xiv preprint ar Xiv:2305.10429, 2023a. Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. ar Xiv preprint ar Xiv:2302.03169, 2023b. Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. ar Xiv preprint ar Xiv:2203.03466, 2022. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? ar Xiv preprint ar Xiv:1905.07830, 2019. Yin Zhang, Rong Jin, and Zhi-Hua Zhou. Understanding bag-of-words model: a statistical framework. International journal of machine learning and cybernetics, 2010. Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, pages 185 205, 2005. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV), December 2015. Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. ar Xiv preprint ar Xiv:1909.08593, 2019. DSDM: Model-Aware Dataset Selection with Datamodels Appendices A Experimental Setup 14 A.1 Candidate dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A.2 Target tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A.2.1 Mitigating train-test leakage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.3 Data selection baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.3.1 Targeted baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.3.2 Untargeted baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.4 LM training details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.5 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.5.1 Log-probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.5.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B Datamodel estimation 19 B.1 Datamodels refresher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 B.1.1 Estimating datamodels with data regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 B.2 Estimating datamodels with TRAK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
A. Experimental Setup

In this section we discuss the general experimental setup, including the candidate data pool, the considered target tasks, baselines, evaluation metrics, and model training choices.

A.1. Candidate dataset

Our candidate dataset is the full English subset of C4 (Raffel et al., 2020). We use the train split of the en.noblocklist subset of the C4 version prepared by AllenAI at https://huggingface.co/datasets/c4. The subset name noblocklist signifies that curse words were not filtered out of the subset. To split the text from the documents into examples, we tokenize all the documents, concatenate them together (separated by end-of-text tokens), and then slice the result into 1024-token chunks. These 1024-token examples generally contain between 3,000 and 6,000 characters (roughly a thousand words). The final candidate dataset has 216,948,746 examples. We tokenize with the Pythia tokenizer (Black et al., 2022; Biderman et al., 2023). As a public internet crawl, C4 contains diverse text. To contextualize the dataset, we show (excerpts of) random C4 samples in Figure 24.
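As a rough illustration of this chunking step, the sketch below tokenizes a list of documents, joins them with an end-of-text token, and slices the resulting stream into fixed-length examples. The tokenizer interface and end-of-text id are placeholders (any BPE tokenizer with an end-of-text token, such as Pythia's, would slot in); this is a minimal sketch of the procedure, not the exact pipeline used here.

```python
# Minimal sketch: pack tokenized documents into fixed-length training examples.
# `tokenize` and `eot_id` are stand-ins for the real tokenizer and its
# end-of-text token id (the paper's pipeline uses the Pythia BPE tokenizer).
from typing import Callable, Iterable, List

SEQ_LEN = 1024  # tokens per candidate example


def pack_documents(
    documents: Iterable[str],
    tokenize: Callable[[str], List[int]],
    eot_id: int,
    seq_len: int = SEQ_LEN,
) -> List[List[int]]:
    stream: List[int] = []
    for doc in documents:
        stream.extend(tokenize(doc))
        stream.append(eot_id)  # separate documents with end-of-text
    # Slice the concatenated stream into non-overlapping seq_len chunks,
    # dropping the final partial chunk.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


if __name__ == "__main__":
    # Toy usage with a whitespace "tokenizer" standing in for a real BPE tokenizer.
    vocab = {}
    fake_tokenize = lambda s: [vocab.setdefault(w, len(vocab) + 1) for w in s.split()]
    print(pack_documents(["a b c", "d e"], fake_tokenize, eot_id=0, seq_len=4))
    # -> [[1, 2, 3, 0]]
```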
A.2. Target tasks

We describe each of the considered target tasks below. For each task, we describe the task itself and how we split its samples into distinct sets of target samples (used to select datasets for the target task) and holdout samples (used to evaluate models on the target task):

SQuAD. The Stanford Question Answering Dataset (SQuAD (Rajpurkar et al., 2016)) is an open-book, reading comprehension dataset of questions about Wikipedia articles. The goal is to answer questions using the corresponding article as context. Our target set is 25% of the SQuAD train set (23,107 examples); our holdout set is the SQuAD validation set (10,557 examples).

Jeopardy. Jeopardy (Tunguz, 2019) is a set of trivia questions taken directly from the show Jeopardy! We use the version of Jeopardy published by MosaicML (MosaicML, 2023).[4] We include all the samples except for the Word Origins subset.[5] We randomly partition the remaining samples into 876 target samples and 876 holdout samples.

LAMBADA. LAnguage Modeling Broadened to Account for Discourse Aspects (LAMBADA (Paperno et al., 2016)) is an open-ended cloze task measuring broad-context text understanding. The goal is to predict the last word of curated passages from BookCorpus (Zhu et al., 2015) given the rest of the passage as context. The task is meant to be challenging: Paperno et al. (2016) only select passages such that crowdworkers could not guess the final word from the final sentence alone (up until the final word), but could guess the final word given the entire passage. We use the LAMBADA version curated by EleutherAI.[6] Finally, we split the LAMBADA test set into separate target and holdout sets, then remove 6 samples from the LAMBADA holdout set due to overlap with samples in our candidate train dataset (cf. Subsection A.2.1 for details on this procedure). We conclude with 2,570 holdout samples and 2,577 target samples.

CS-Algorithms. BIG-bench CS Algorithms (Srivastava et al., 2022) measures the ability of models to solve basic algorithmic problems. In particular, this benchmark contains two kinds of problems: testing for balanced parentheses, and finding the longest common subsequence of multiple strings. For each considered example, the goal is to directly output the answer to the posed algorithmic question. We randomly split the test set into 660 target samples and 660 holdout samples.

We include samples of each benchmark in Figure 5 (SQuAD), Figure 6 (Jeopardy), Figure 7 (LAMBADA), and Figure 8 (CS-Algorithms). We evaluate in the 0-shot (for LAMBADA and CS-Algorithms) and 3-shot (for SQuAD and Jeopardy) regimes. In the 3-shot setting, we separate each example with a single newline. We use standard prompts for each task (see the samples for details).

[4] Located at: https://github.com/mosaicml/llm-foundry/blob/v0.2.0/scripts/eval/local_data/world_knowledge/jeopardy_all.jsonl
[5] We originally intended this subset as a hold-out set for our broader evaluation, but decided not to use the subset as we deemed it unfairly close to the original task to serve as a true hold-out set.
[6] Located at https://huggingface.co/datasets/EleutherAI/lambada_openai/viewer/en.
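To make the few-shot format concrete, here is a minimal sketch of assembling a 3-shot prompt by joining demonstration examples and the query with single newlines. The exact prompt templates are task-specific (see the samples in the figures below); the `format_example` helper and its field names are illustrative placeholders, not the templates used for evaluation.

```python
# Minimal sketch: build a k-shot prompt with newline-separated demonstrations.
# The field names and template below are illustrative, not the exact task prompts.
from typing import Dict, List


def format_example(example: Dict[str, str], include_answer: bool = True) -> str:
    text = f"Question: {example['question']} Answer:"
    if include_answer:
        text += f" {example['answer']}"
    return text


def build_prompt(demos: List[Dict[str, str]], query: Dict[str, str]) -> str:
    # Demonstrations (with answers) and the query (without its answer),
    # separated by single newlines.
    parts = [format_example(d) for d in demos]
    parts.append(format_example(query, include_answer=False))
    return "\n".join(parts)


if __name__ == "__main__":
    demos = [
        {"question": "2 + 2?", "answer": "4"},
        {"question": "Capital of France?", "answer": "Paris"},
        {"question": "Color of the sky?", "answer": "Blue"},
    ]
    print(build_prompt(demos, {"question": "3 + 5?", "answer": "8"}))
```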
1. Context: The chloroplasts of some hornworts and algae contain structures called pyrenoids. They are not found in higher plants. Pyrenoids are roughly spherical and highly refractive bodies which are a site of starch accumulation in plants that contain them. They consist of a matrix opaque to electrons, surrounded by two hemispherical starch plates. The starch is accumulated as the pyrenoids mature. In algae with carbon concentrating mechanisms, the enzyme rubisco is found in the pyrenoids. Starch can also accumulate around the pyrenoids when CO2 is scarce. Pyrenoids can divide to form new pyrenoids, or be produced de novo. Question: What shape are pyrenoids? Answer: roughly spherical

2. Context: In this dioxygen, the two oxygen atoms are chemically bonded to each other. The bond can be variously described based on level of theory, but is reasonably and simply described as a covalent double bond that results from the filling of molecular orbitals formed from the atomic orbitals of the individual oxygen atoms, the filling of which results in a bond order of two. More specifically, the double bond is the result of sequential, low-to-high energy, or Aufbau, filling of orbitals, and the resulting cancellation of contributions from the 2s electrons, after sequential filling of the low σ and σ orbitals; σ overlap of the two atomic 2p orbitals that lie along the O-O molecular axis and π overlap of two pairs of atomic 2p orbitals perpendicular to the O-O molecular axis, and then cancellation of contributions from the remaining two of the six 2p electrons after their partial filling of the lowest π and π orbitals. Question: What is a descriptive term for a low-to-high energy bond? Answer: Aufbau

Figure 5: Random SQuAD samples. Context is normal text, and the continuation label is highlighted.

1. WORLD HISTORY: In 1191 this Lion-Hearted king of England captured Cyprus & Acre during the Crusades Answer: Richard I

2. LITERATURE: 1719 novel about a mariner who lived 8 & 20 years all alone in an uninhabited island Answer: Robinson Crusoe

Figure 6: Random Jeopardy samples. Context is normal text, and the continuation label is highlighted.

1. The Simplification Movement wasn't really an organized movement. It was more of an ideological shift by a large number of believers. There were quite a few Simpletons among the Mother Assembly denomination, but the High Sire had never recognized their movement as an order or organization. However, some other denominations were founded on the principles of the Simplification Movement

2. Here, said Jacob, handing them what was a rope attached to the ground next to them, the other end at the bottom of the well. You first. Will stood there. Why am I doing this? he thought. Come on, let's go! ordered Jacob. Will took the rope and began to climb down the well. Thatta boy, you've got this, said Jacob

Figure 7: Random LAMBADA samples. We show the context as normal text, and the continuation label as highlighted.

1. Given two strings, determine the length of the longest common subsequence. Strings: REFVJLZIV PJIQB Length of longest common subsequence: 2

2. Determine whether the given sequence of parentheses is properly matched. Sequence: [ ] ( ) ( ( ( ) ) ) Valid/Invalid? Valid

Figure 8: Random CS-Algorithms samples. We show the context as normal text, and the continuation label as highlighted.

A.2.1. MITIGATING TRAIN-TEST LEAKAGE

We mitigate train-test leakage by filtering out test examples that overlap with our candidate data samples. Specifically, we define a test example as leaked if both its context and continuation are present in a single C4 example. To upper-bound train-test leakage, we test for the context and continuation separately (i.e., for a given test sample, the context and continuation do not have to be contiguous in a train sample to count as leaked). We investigate train-test leakage for all the test examples in each of the test sets (i.e., LAMBADA, SQuAD, Jeopardy, and CS-Algorithms) across the entire candidate train set (i.e., the C4 English subset). Note that we match strings after lowercasing and removing whitespace. We find 6 LAMBADA test examples with overlap in C4, and remove them from our LAMBADA test split. We do not find any train-test leakage for SQuAD, Jeopardy, or CS-Algorithms.
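The following is a minimal sketch of this leakage check under the stated matching rule (lowercase, strip all whitespace, then look for the context and the continuation separately inside a candidate example). It is written for clarity rather than for scanning hundreds of millions of examples efficiently; a real pass over C4 would need a batched or indexed implementation.

```python
# Minimal sketch of the train-test leakage check: a test example counts as
# leaked if BOTH its (normalized) context and continuation appear somewhere
# in a single normalized candidate training example.
import re
from typing import Iterable, List, Tuple


def normalize(text: str) -> str:
    # Lowercase and remove all whitespace before matching.
    return re.sub(r"\s+", "", text.lower())


def is_leaked(context: str, continuation: str, train_example: str) -> bool:
    doc = normalize(train_example)
    return normalize(context) in doc and normalize(continuation) in doc


def count_leaked(test_set: Iterable[Tuple[str, str]], train_examples: List[str]) -> int:
    return sum(
        any(is_leaked(ctx, cont, doc) for doc in train_examples)
        for ctx, cont in test_set
    )


if __name__ == "__main__":
    train = ["The quick brown fox jumps over the lazy dog."]
    tests = [("the quick brown fox", "lazy dog"), ("hello", "world")]
    print(count_leaked(tests, train))  # 1
```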
A.3. Data selection baselines

We consider four baselines for selecting language modeling data. These fall into two categories: targeted data selection methods (which select data according to a target distribution), and untargeted data selection methods (which do not take in a target distribution).

A.3.1. TARGETED BASELINES

The two targeted dataset selection methods we consider, CLASSIFIER (originally used to select the GPT-3 dataset (Brown et al., 2020)) and DSIR, both select according to textual similarity with a target distribution. We describe the details of these methods below:

CLASSIFIER. The dataset selection method originally developed to select data for GPT-3, and additionally used to select data for PaLM (Chowdhery et al., 2022) and The Pile (Gao et al., 2020). The method trains a logistic regression model on fastText (Joulin et al., 2016) features to classify between (held-out) samples of the candidate dataset (in our case, C4) and the target distribution, then chooses training data according to how likely the model predicts the data as being sampled from the target distribution. More specifically, CLASSIFIER keeps a given document if the scored document satisfies

ϵ > 1 − document_score, where ϵ ∼ Lomax(α),

a Lomax sample ϵ is drawn independently for each considered document, and document_score is the classifier-given probability that the given sample is in the target distribution. Sampling a threshold according to the Lomax distribution is meant to improve the diversity of the selected data. In this work, we learn the classifier on the C4 en.noblocklist validation set, and choose α = 12 via the parameter selection procedure described in Brown et al. (2020) (score each document in C4 with the classifier, then fit the parameters of a Lomax distribution via maximum likelihood estimation according to these scores).
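As a concrete sketch of this filtering rule: NumPy's `pareto` sampler draws from the Lomax (Pareto II) distribution, so the keep/drop decision can be written as below. The classifier itself is abstracted behind a `document_scores` array (in the paper it is a logistic regression on fastText features); treat this as an illustrative sketch rather than the exact implementation.

```python
# Minimal sketch of the CLASSIFIER keep/drop rule: keep a document when a
# Lomax(alpha) sample exceeds 1 - document_score, where document_score is the
# classifier's probability that the document comes from the target distribution.
import numpy as np


def classifier_filter(document_scores: np.ndarray, alpha: float = 12.0,
                      seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # rng.pareto(alpha) samples the Lomax (Pareto II) distribution.
    eps = rng.pareto(alpha, size=document_scores.shape)
    return eps > 1.0 - document_scores  # boolean keep-mask


if __name__ == "__main__":
    scores = np.array([0.99, 0.5, 0.01])  # high score = close to the target data
    print(classifier_filter(scores))
    # High-scoring documents are kept with much higher probability; the random
    # threshold occasionally keeps low-scoring documents, which adds diversity.
```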
DSIR. Dataset Selection with Importance Resampling (Xie et al., 2023b) aims to select a data subset with a distribution similar to that of the target task in terms of n-gram counts. DSIR comprises two steps: (a) find the (hashed) n-gram counts for each train set example (each example is represented as a vector of counts, with n-grams hashed into buckets to reduce dimensionality), then (b) importance sample to select candidate train set examples that are distributed similarly to target distribution samples in terms of n-gram counts. DSIR calculates importance weights by modeling the distribution of examples (in feature space) under the target distribution and under the candidate data distribution separately, using bag-of-words style models. In greater detail, DSIR consists of the following steps (a minimal sketch of these steps appears below):

1. Fit p̂feat and q̂feat, estimates of the distributions of target examples and candidate training examples in hashed n-gram space (respectively). DSIR parameterizes p̂feat and q̂feat through the following general procedure for estimating the distribution of hashed n-grams[7] for a given set of documents. First, calculate the hashed n-gram counts (with d hash buckets) across the documents as the vector γ ∈ R^d, where γ_k corresponds to the number of n-grams that hash to bucket k in the documents. Then, normalize γ so that its values sum to 1, forming a probability distribution over buckets. Finally, parameterize the distribution of hashed n-grams for this set of documents as a bag-of-words style model (Zhang et al., 2010) such that the probability of a document with hashed n-gram counts c is Π_{i=1}^{d} γ_i^{c_i} (here, the bag-of-words model is over hashed n-grams instead of words).

2. Calculate importance weights for each example in the candidate training set, such that example i with hashed n-gram counts c_i has weight w_i = p̂feat(c_i) / q̂feat(c_i).

3. Sample examples without replacement according to the categorical distribution with (unscaled) weights w_i.

[7] For example, if we wanted to make a d-dimensional hashed n-gram feature vector for a document, we would find all the n-grams in the document, hash the n-grams into integers up to size d, then go through each integer and increment the corresponding feature vector index.

For more details on DSIR, see Section 4 of Xie et al. (2023b). We adapt implementations of both DSIR and CLASSIFIER from https://github.com/p-lambda/dsir.
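Below is a minimal NumPy sketch of these three steps under simplifying assumptions that are not from the paper or the DSIR reference implementation: unigram "n-grams" over whitespace tokens, Python's built-in hash for bucketing, add-one smoothing so empty buckets do not zero out probabilities, and the Gumbel top-k trick as the without-replacement sampler. It only illustrates the structure of the computation, working in log space for stability.

```python
# Minimal sketch of DSIR-style importance resampling over hashed n-gram counts.
# Assumptions (not from the paper): unigrams, Python's hash() for bucketing,
# add-one smoothing, and Gumbel top-k for sampling without replacement.
from typing import List

import numpy as np

D = 64  # number of hash buckets (the real method uses a much larger d)


def hashed_counts(doc: str, d: int = D) -> np.ndarray:
    counts = np.zeros(d)
    for token in doc.lower().split():  # "n-grams" = unigrams in this sketch
        counts[hash(token) % d] += 1
    return counts


def fit_bucket_distribution(docs: List[str], d: int = D) -> np.ndarray:
    gamma = np.ones(d)  # add-one smoothing
    for doc in docs:
        gamma += hashed_counts(doc, d)
    return gamma / gamma.sum()


def log_importance_weights(candidates: List[str], targets: List[str]) -> np.ndarray:
    p_feat = fit_bucket_distribution(targets)     # target distribution
    q_feat = fit_bucket_distribution(candidates)  # candidate distribution
    log_ratio = np.log(p_feat) - np.log(q_feat)
    # log w_i = sum_j c_ij * (log p_j - log q_j)  (bag-of-words likelihood ratio)
    return np.array([hashed_counts(doc) @ log_ratio for doc in candidates])


def sample_without_replacement(log_w: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    gumbel = rng.gumbel(size=log_w.shape)
    return np.argsort(-(log_w + gumbel))[:k]  # Gumbel top-k trick


if __name__ == "__main__":
    targets = ["the capital of france is paris", "who wrote war and peace"]
    candidates = ["paris is the capital of france", "buy cheap watches online now",
                  "war and peace is a novel", "click here to win a prize"]
    idx = sample_without_replacement(log_importance_weights(candidates, targets), k=2)
    print([candidates[i] for i in idx])
```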
Considered target distributions. We apply targeted dataset selection methods with different target distributions depending on the context. In Section 3, we measure the extent to which different selection methods can reduce loss on individual target tasks, so we select data for individual tasks (i.e., Jeopardy, SQuAD, CS-Algorithms, and LAMBADA). In Section 4 we use these targeted baselines to select data for general-purpose language modeling, so we use the recommended target task from each work (intuitively high-quality data sources; see Appendix D.1 for more details).

A.3.2. UNTARGETED BASELINES

The two untargeted dataset selection methods we consider are: RANDOM (select data randomly) and SemDeDup (Semantic Deduplication (Abbas et al., 2023)). SemDeDup selects by clustering data according to the last-layer activations for the last token in the given document, then choosing only the examples in each cluster that have the lowest cosine similarity with the cluster centroid. We follow the hyperparameters from the original work (11,000 clusters, deduplicating down to 20% of the dataset for optimal model performance). We use the implementation from https://github.com/facebookresearch/SemDeDup/.

A.4. LM training details

We train GPT-2 family decoder-only transformer models (Radford et al., 2019; Liu et al., 2018) using LLM-Foundry (MosaicML, 2023). To train models, we use ADAM (β1 = 0.9, β2 = 0.95, ϵ = 10^-8), sequence length 1024, batch size 1024, a cosine learning rate schedule (with 200 warm-up batches and α = 0.1), and ℓ2 gradient clipping with threshold 1. We train on A100s (with BF16 precision) and H100s (with FP8 precision), and tokenize text with the BPE tokenizer used by Pythia (Biderman et al., 2023). We summarize the remaining hyperparameter choices used to train the models in this work in Table 2 (including weight decay, learning rate, model architecture, and training token count). We select all hyperparameters to minimize 125M held-out perplexity on C4. The only exception: we increase the weight decay for the Section 4 models to ensure that the larger model training runs converge (with smaller weight decay, larger models diverge in loss). Model parameterization choices (i.e., number of heads or layers), optimizer hyperparameters, and the learning rate schedule are generally chosen according to the default LM training configurations in LLM-Foundry.

Chinchilla-optimal compute ratios. To train the best possible LM for a given compute budget, one must trade off the two hyperparameters that control the compute used: model size and number of training tokens. We use Chinchilla-optimal parameter-to-training-token ratios to trade off these parameters (Hoffmann et al., 2022). In our compute regime, this (roughly) amounts to training on a number of tokens equal to 20× the number of parameters.

Table 2: Training configurations for models trained across this work. Accuracy is measured as the mean accuracy across the benchmarks considered in Section 4 for a model trained with randomly selected data with the corresponding configuration. The Chinchilla-optimal (760M, 1.3B, 1.8B) models (Hoffmann et al., 2022) of Section 4 are much more accurate than the 125M models used to calculate datamodels. Following previous work, we approximate FLOPs (Floating Point OPerations) via parameters × tokens × 6 (Kaplan et al., 2020; Hoffmann et al., 2022); FLOPs proxy the computational cost of training a given model. LR is learning rate, WD is weight decay. Each batch contains 1024 samples of 1024 tokens each.

Parameters | LR | WD | d_model | Heads | Layers | Tokens | Batches | Train FLOPs | Accuracy
Estimating datamodels:
125M | 6 × 10^-4 | 2 × 10^-4 | 768 | 12 | 12 | 8.4 × 10^10 | 80,000 | 6.3 × 10^19 | 31.8%
Section 3: Evaluating optimal dataset selection estimators:
125M | 6 × 10^-4 | 2 × 10^-4 | 768 | 12 | 12 | 2.6 × 10^10 | 25,000 | 2.0 × 10^19 | n/a
Section 4: Evaluating unseen-task generalization (chosen as ~Chinchilla-optimal):
125M | 6 × 10^-4 | 4 × 10^-4 | 768 | 12 | 12 | 2.5 × 10^9 | 2,400 | 1.9 × 10^18 | 26.6%
356M | 6 × 10^-4 | 4 × 10^-4 | 1024 | 16 | 24 | 7.0 × 10^9 | 6,700 | 1.5 × 10^19 | 29.2%
760M | 6 × 10^-4 | 4 × 10^-4 | 1536 | 12 | 24 | 1.5 × 10^10 | 14,400 | 6.9 × 10^19 | 32.9%
1.3B | 6 × 10^-4 | 4 × 10^-4 | 2048 | 16 | 24 | 2.6 × 10^10 | 24,700 | 2.0 × 10^20 | 36.3%
1.8B | 6 × 10^-4 | 4 × 10^-4 | 2432 | 19 | 24 | 3.7 × 10^10 | 34,931 | 4.0 × 10^20 | 38.3%
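As a small worked example of these two rules of thumb (tokens ≈ 20 × parameters, training FLOPs ≈ 6 × parameters × tokens), the snippet below reproduces the order of magnitude of the Section 4 rows of Table 2. It is a back-of-the-envelope sketch, not the budgeting code used for the paper.

```python
# Back-of-the-envelope sketch: Chinchilla-style token budget and training FLOPs.
# tokens ~= 20 * parameters, FLOPs ~= 6 * parameters * tokens.

def chinchilla_budget(n_params: float, tokens_per_param: float = 20.0,
                      batch_tokens: int = 1024 * 1024) -> dict:
    tokens = tokens_per_param * n_params
    return {
        "tokens": tokens,
        "batches": round(tokens / batch_tokens),  # 1024 samples x 1024 tokens
        "train_flops": 6.0 * n_params * tokens,
    }


if __name__ == "__main__":
    for n_params in [125e6, 760e6, 1.3e9, 1.8e9]:
        b = chinchilla_budget(n_params)
        print(f"{n_params / 1e6:.0f}M params: {b['tokens']:.1e} tokens, "
              f"{b['batches']} batches, {b['train_flops']:.1e} FLOPs")
```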
A.5. Evaluation metrics

In this work, we measure model performance using two different metrics: log-probability (in Section 3, to compare model performance on target tasks) and accuracy (in Section 4, to compare model performance on a broad set of yet unseen tasks). Below, we describe how we measure both metrics.

A.5.1. LOG-PROBABILITY

To calculate mean log-probability, we compute the log-probability of the model generating the correct label, then take the mean across benchmark samples. More specifically, all the tasks we evaluate with log-probability are open-ended LM tasks (e.g., LAMBADA), where the goal is to generate a desired continuation from the context (e.g., for LAMBADA, generate the last word of a paragraph, given the rest of the paragraph as context). Therefore, the log-probability of the model answering correctly is the log-probability that the model generates the label, given the context. That is, for a sample x with k continuation tokens starting at index C,

Log_Probability(x; g_w) = Σ_{i=C}^{C+k-1} log(p_i),     (4)

where p_i is the correct-label probability given by model g_w at index i.

A.5.2. ACCURACY

To evaluate accuracy, we use one of three separate accuracy procedures depending on the considered benchmark: (a) multiple choice accuracy, (b) exact text match, or (c) fuzzy text match. These are:

Multiple choice accuracy: For multiple choice question benchmarks, we choose the answer with the maximal predicted probability out of the possible choices, then measure the accuracy as the fraction of correct answers.

Exact match: We mark an example as correct if the generated tokens for the context exactly match the label tokens, then measure the accuracy as the fraction of correct answers.

Fuzzy match: For open-ended benchmarks like TriviaQA whose questions have multiple textually different but correct answers, we measure whether our model is correct on a given example through the following procedure. We generate text for the example context, then normalize this text with the standard TriviaQA text normalizer[8] (which removes articles, extraneous whitespace, and punctuation, and normalizes underscores and casing), and finally count the example as correct if the resulting normalized text exactly matches any of the (normalized) labels. We then measure accuracy as the fraction of correct answers.

Table 4 lists the exact accuracy procedure used for each considered benchmark.

[8] Default choice for this accuracy measurement procedure in the MosaicML Composer (MosaicML, 2021); see https://github.com/mandarjoshi90/triviaqa/blob/master/evaluation/triviaqa_evaluation.py
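To make Equation (4) concrete, the sketch below computes the continuation log-probability from a model's per-position next-token logits. The `logits` tensor and token id arrays are placeholders for whatever causal LM and tokenizer are in use (PyTorch is assumed here); this is a sketch of the metric itself, not of the evaluation harness used in the paper.

```python
# Minimal sketch of the continuation log-probability metric (Equation 4):
# sum the log-probabilities the model assigns to the label tokens, conditioned
# on everything before them. `logits` has shape (seq_len, vocab) and row t
# scores the token at position t + 1, as in a standard causal LM.
import torch
import torch.nn.functional as F


def continuation_log_prob(logits: torch.Tensor, token_ids: torch.Tensor,
                          continuation_start: int) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)               # (seq_len, vocab)
    # Positions C .. end of the sequence are the continuation tokens.
    target_positions = torch.arange(continuation_start, token_ids.shape[0])
    # Row t - 1 of `log_probs` predicts token t.
    per_token = log_probs[target_positions - 1, token_ids[target_positions]]
    return per_token.sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    vocab, seq_len, cont_start = 50, 8, 5
    fake_logits = torch.randn(seq_len, vocab)      # stand-in for model output
    fake_tokens = torch.randint(vocab, (seq_len,))
    print(continuation_log_prob(fake_logits, fake_tokens, cont_start).item())
```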
B. Datamodel estimation

In this section, we describe how we estimate datamodels for GPT-2 style LMs. We start by briefly giving an overview of datamodels (cf. Appendix B.1), then describe the datamodel estimator we use, TRAK (cf. Appendix B.2). Finally, we conclude by instantiating datamodels for language modeling (cf. Appendix B.3) and analyzing the computational cost of our procedure (cf. Appendix B.5). For the impatient reader, we include a standalone section on how to mechanically compute datamodel estimates with TRAK (without background) in Appendix B.4.

B.1. Datamodels refresher

The goal of datamodeling is to approximate the mapping from choice of training subset to trained model loss on a given, fixed sample. Datamodels frame this problem as a supervised learning problem: datamodels learn an approximation of the map from the former to the latter. Recall from Section 2.2 that the datamodel τ_θ for an example x is a parameterized function that, given a candidate training dataset S, a learning algorithm A (mapping train set to trained model), and a model output function f (in the main text we simplify by referring to this quantity as the loss ℓ, but in reality f can capture any function of the trained model) that maps a test example and a model to the resulting loss, optimally predicts the model output on x over a (chosen) distribution of train subsets D_S, i.e.,

τ_{θ_x} : {0, 1}^{|S|} → R, where θ_x = argmin_θ Ê^{(m)}_{S_i ∼ D_S} [ L_reg(τ_θ(1_{S_i}), f(x; A(S_i))) ],     (5)

where L_reg(·, ·) is a regression loss function (e.g., mean squared error), and Ê^{(m)} is an m-sample empirical expectation. Note that datamodels operate on the characteristic vector 1_S of each subset (cf. Equation 2), not the subset directly. In this work, we parameterize τ_{θ_x} as linear in the choice of training data, i.e., such that τ_{θ_x}(1_S) = 1_S^⊤ θ_x. Intuitively, such linear datamodels model each datapoint S_i as having a constant effect on the loss when it is included in the training set (this effect is exactly the value of θ_x at index i).

B.1.1. ESTIMATING DATAMODELS WITH DATA REGRESSION

So far we have only defined linear datamodels. How do we actually estimate the linear parameters θ_x? When introducing datamodels, Ilyas et al. (2022) originally did so with a linear regression predicting loss from training subset, i.e., directly minimizing Equation 5 by collecting a large number of training data pairs of (randomly chosen training data subset, corresponding trained model output on x) and then learning the mapping from train subset to output on the collected data. This estimator, which we refer to as data regression, proceeds in two steps.

The first step is to collect regression data. Here, we repeatedly: sample a random train subset S_i (from a chosen distribution S_i ∼ D_S[9]), train a model A(S_i) on the subset, then evaluate the model output on x (and record the (train subset, model output) pair). This step yields training data for the regression in the form of m (train subset, loss) pairs: {(1_{S_i}, ℓ(x; A(S_i)))}_{i=1}^m (recall that our datamodel takes as input the characteristic vector of a subset rather than the subset directly). The second step is to actually estimate the linear datamodel parameters with linear regression. Here, the regression minimizes the (empirical) squared error over datamodel parameters:

θ_x = argmin_θ Ê^{(m)}_{S_i ∼ D_S} [ L_reg(τ_θ(1_{S_i}), ℓ(x; A(S_i))) ] = argmin_θ Ê^{(m)}_{S_i ∼ D_S} [ (1_{S_i}^⊤ θ − ℓ(x; A(S_i)))^2 ].

Linear regression estimates the datamodel parameters directly, and asymptotically yields the true datamodel parameters (given enough training data, i.e., enough pairs of training subset and corresponding trained model output). While data regression optimally estimates linear datamodel parameters, it is expensive due to the training data collection process. Obtaining a single training datapoint for the regression, i.e., a single (train set, corresponding loss on x) pair, is expensive because training even a single model can be expensive (particularly in the large-scale model setting), and in practice, previous work has found that we need to train at least thousands of models to collect enough regression datapoints (Ilyas et al., 2022).

[9] A standard choice is uniformly random subsets of a fixed size.
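The sketch below runs this two-step procedure on a toy problem: it draws random fixed-size subsets, queries a black-box `train_and_eval` oracle for the resulting loss on x (here a synthetic linear function plus noise, standing in for actually training a model), and then fits θ_x by least squares. It illustrates only the structure of the estimator; in the paper's setting each oracle call would be a full LM training run.

```python
# Minimal sketch of datamodel estimation by data regression:
# (1) sample random train subsets and record the model's loss on x,
# (2) regress loss on the subset's characteristic vector to get theta_x.
import numpy as np

rng = np.random.default_rng(0)
N, M, SUBSET_SIZE = 40, 600, 20          # pool size, #regression pairs, |S_i|

# Synthetic "ground truth" per-example effects; stands in for real training.
true_theta = rng.normal(size=N)


def train_and_eval(mask: np.ndarray) -> float:
    # Placeholder oracle: loss on x for a model trained on the masked subset.
    return float(mask @ true_theta + 0.1 * rng.normal())


# Step 1: collect (characteristic vector, loss) pairs.
masks = np.zeros((M, N))
losses = np.zeros(M)
for i in range(M):
    idx = rng.choice(N, size=SUBSET_SIZE, replace=False)
    masks[i, idx] = 1.0
    losses[i] = train_and_eval(masks[i])

# Step 2: least-squares fit of the linear datamodel parameters theta_x.
theta_hat, *_ = np.linalg.lstsq(masks, losses, rcond=None)
print("correlation with true effects:", np.corrcoef(theta_hat, true_theta)[0, 1])
```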
B.2. Estimating datamodels with TRAK

Rather than estimating with data regression, we estimate linear datamodel parameters with a more computationally efficient linear datamodel estimator: TRAK (Park et al., 2023). TRAK estimates datamodels more efficiently by exploiting the fact that datamodels are efficient to calculate for convex learning problems: TRAK (approximately) transforms the original learning algorithm into a convex learning problem, computes datamodels in this new regime, then returns these datamodels as an estimate of the datamodels for the originally considered learning algorithm. TRAK trades off approximation error (i.e., the transformation is inexact) for computational efficiency.

To actually estimate datamodels, the method operates in two high-level stages. Given a held-out sample x, learning algorithm A, and training dataset S, TRAK first constructs a new algorithm A′ that approximates the corresponding trained model output on x as if the model output were obtained by solving a convex problem over the train set datapoints, such that f(x; A(S)) ≈ f(x; A′(S)). Then, TRAK estimates the datamodel parameters for the original learning problem by estimating the datamodel parameters for executing A′ on S (datamodels are inexpensive to compute for convex problems like A′). We break these stages into two steps below, and start with a primer on calculating datamodels for the logistic regression setting.

B.2.1. DATAMODELS FOR LOGISTIC REGRESSION

We first describe how to efficiently estimate datamodels for models with a convex objective. We will use the logistic loss to simplify the analysis, but the procedure applies to other convex loss functions as well. Consider a (generalized, including biases) binary classification task learning from n = |S| candidate training samples: S = {z_1, ..., z_n : z_i = (x_i, b_i, y_i)}, where each sample z_i is a triplet containing an input x_i, a bias b_i, and a binary label y_i ∈ {−1, 1}. In this setup, training a logistic regression model on a training subset S′ ⊆ S yields the corresponding parameters A_Log(S′):

A_Log(S′) := argmin_θ Σ_{z_i ∈ S′} log(1 + exp(−y_i (x_i^⊤ θ + b_i))).     (6)

Note that including the biases b_i makes this version of logistic regression more general; setting b_i = 0 yields standard logistic regression.

How do we estimate datamodels for logistic regression? We start by defining the output function that we want to approximate using datamodels in the first place: we approximate the logistic linear model output f(z; θ) := x^⊤ θ + b, where z = (x, b, y). That is, we aim to construct datamodels that approximate the map from train subset S′ to linear model output f(z; A_Log(S′)).

To efficiently estimate these logistic regression datamodels, TRAK uses influence functions. Influence functions are a standard method for efficiently approximating the effect of excluding a single training point (hence, "leave-one-out") on linear regression outputs compared to training on the entire set (Pregibon, 1981), and they apply to other classes of models as well (Giordano et al., 2019). Specifically, the leave-one-out influence of training example i on example z, IF(z)_i, approximates this effect as:

IF(z)_i := [ x^⊤ (X^⊤ R X)^{-1} x_i / (1 − x_i^⊤ (X^⊤ R X)^{-1} x_i · p*_i (1 − p*_i)) ] · (1 − p*_i) ≈ f(z; A_Log(S)) − f(z; A_Log(S \ z_i)),     (7)

where X ∈ R^{n×k} is the matrix of stacked train example inputs (k is the input dimension of each x_i), p*_i = (1 + exp(−y_i · f(z_i; θ*)))^{-1}, and R is an n × n diagonal matrix with R_ii = p*_i (1 − p*_i); this estimate arises from performing a Newton step from the logistic model parameters for S to minimize the loss on S \ z_i. In practice, influence functions closely approximate the effect of removing a single train example on logistic model predictions (Koh and Liang, 2017). Furthermore, influences are efficient to estimate: computing the influence of i on example z requires only a few inner products and scalar multiplications (the most expensive term to compute, the inverse (X^⊤ R X)^{-1}, does not depend on z or i and therefore can be computed just once).

It is straightforward to estimate parameters for logistic regression datamodels using influence functions. We consider leave-one-out datamodels, i.e., referring back to the datamodel definition of (5), datamodels for a distribution of training sets D_S that is supported on train subsets missing a single train example. In this setting, we can estimate a leave-one-out linear datamodel τ_θ with θ = IF(z) and a bias of f(z; A_Log(S)) − Σ_{k=1}^n IF(z)_k, i.e., in full:

τ_θ(1_S) = IF(z)^⊤ 1_S + f(z; A_Log(S)) − Σ_{k=1}^n IF(z)_k.     (8)

Then, on a data subset with a single removed example S \ z_i, the datamodel approximation of f(z; A_Log(S \ z_i)) is:

τ_θ(1_{S \ z_i}) = IF(z)^⊤ 1_{S \ z_i} + f(z; A_Log(S)) − Σ_{k=1}^n IF(z)_k
                = f(z; A_Log(S)) − IF(z)_i
                ≈ f(z; A_Log(S)) − (f(z; A_Log(S)) − f(z; A_Log(S \ z_i)))
                = f(z; A_Log(S \ z_i)),

which is exactly the approximation of the effect of removing z_i on z given by the influence function. In practice, we can use this datamodel to estimate the model output associated with arbitrary training subsets (not just leave-one-out subsets).
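The sketch below computes these leave-one-out influences for a small logistic regression problem and compares them against brute-force retraining, using scikit-learn's LogisticRegression (effectively unregularized) as the solver and setting the biases b_i to zero. The training example's label y_i is folded into its feature direction inside the numerator; for y_i = +1 this reduces to Equation (7) exactly as written. It is an illustrative numerical check, not the TRAK code path.

```python
# Minimal numerical sketch of the leave-one-out influence in Equation (7) for
# logistic regression, compared against brute-force retraining. Biases b_i = 0.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, k = 80, 5
X = rng.normal(size=(n, k))
y = np.where(0.7 * X @ rng.normal(size=k) + rng.normal(size=n) > 0, 1, -1)
z = rng.normal(size=k)  # held-out input; we track its raw margin x^T theta


def fit(Xs, ys) -> np.ndarray:
    # (Effectively) unregularized logistic regression without an intercept.
    model = LogisticRegression(C=1e6, fit_intercept=False, max_iter=5000)
    return model.fit(Xs, ys).coef_.ravel()


theta = fit(X, y)
p_star = 1.0 / (1.0 + np.exp(-y * (X @ theta)))   # correct-class probabilities
r = p_star * (1.0 - p_star)
M = np.linalg.inv(X.T @ (X * r[:, None]))          # (X^T R X)^{-1}


def influence(i: int) -> float:
    # Label y_i folded into the train-example direction (equals Eq. (7) for y_i = +1).
    num = (z @ M @ (y[i] * X[i])) * (1.0 - p_star[i])
    den = 1.0 - (X[i] @ M @ X[i]) * r[i]
    return num / den


for i in range(3):
    keep = np.arange(n) != i
    exact = z @ theta - z @ fit(X[keep], y[keep])
    print(f"train example {i}: influence {influence(i):+.4f}, retrained {exact:+.4f}")
```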
B.2.2. TRANSFORMING LEARNING ALGORITHMS TO LINEAR REGRESSION

We now discuss how TRAK uses these logistic datamodels to estimate datamodels for non-linear models. The key procedure behind TRAK translates the training setup of interest, i.e., that defined by the learning algorithm A and candidate dataset S, into a new setup with a carefully constructed convex (in our case, logistic regression) learning algorithm A′ (on the same candidate dataset S). Here, TRAK approximates the model output f(z; A(S)) for a given subset with the logistic output f(z; A′(S)), then estimates datamodels for A′, which can be computed efficiently.

To set up this transformation, consider a (binary classification[10]) machine learning model with learned parameters θ* = A(S) (trained on the full candidate set) that outputs a (binary) logit model output f(z; θ*) for a given example z. TRAK starts by linearizing f with a Taylor expansion at the model weights θ*:

f̂(z; θ) = f(z; θ*) + ∇_θ f(z; θ*)^⊤ (θ − θ*).     (9)

Here, the approximation f̂ of f is linear in the gradient of the considered example z; f̂ is a linear function that approximates the model output f for arbitrary parameters. However, the goal of datamodeling is to approximate the map from training dataset to model output, not from model parameters to model output. To model how the choice of training dataset changes the model output, TRAK approximates the original learning algorithm A as minimizing the logistic loss for the (linear) predictor f̂(z; θ) over parameters θ. TRAK does so by directly replacing the original linear model in the logistic regression objective of (6) with the linearization f̂(z; θ) (which is also linear in θ). This yields the logistic regression algorithm A′:

A′(S) = argmin_θ Σ_{z_i ∈ S} log(1 + exp(−y_i [ θ^⊤ ∇_θ f(z_i; θ*) + f(z_i; θ*) − ∇_θ f(z_i; θ*)^⊤ θ* ])).

Rearranging the terms with new linear regression inputs x′_i = ∇_θ f(z_i; θ*) and biases b′_i = f(z_i; θ*) − ∇_θ f(z_i; θ*)^⊤ θ*, A′ is exactly logistic regression over the dataset triplets (x′_i, b′_i, y_i):

A′(S) = argmin_θ Σ_{z_i ∈ S} log(1 + exp(−y_i (θ^⊤ x′_i + b′_i))).     (10)

Finally, TRAK estimates datamodel parameters for training A on S, the original problem of interest, by estimating datamodels for the logistic regression algorithm A′ on S.

[10] We use binary classification for simplicity, but the analysis follows for other standard losses as well.

B.2.3. TRAK ESTIMATOR

In this section, we detail the exact form TRAK uses to estimate datamodels. TRAK does not estimate using exactly the influence function estimate of (7) with the input triplets (x′_i, b′_i, y_i) of (10); instead it uses a similar form found by ablating over the relevant terms and performing dimensionality reduction.

We first define notation for the space in which TRAK estimates datamodels, i.e., the logistic regression setting of (10). Suppose that θ* = A(S) are the final model parameters obtained after training on the entire candidate dataset. Recall that the logistic regression problem of A′ trains on inputs x′_i = ∇_θ f(z_i; θ*); we therefore define the feature map ϕ that translates examples into this input space as:

ϕ(z) := ∇_θ f(z; θ*) ∈ R^{|θ*|}.

We additionally define Φ := [ϕ(z_1), ..., ϕ(z_n)]^⊤ ∈ R^{|S| × |θ*|} as the matrix of stacked candidate train set examples in this space. Finally, we define

Q := diag( ∂L(y_i, f(z_i; θ*)) / ∂f(z_i; θ*) ),

where L is the convex loss we consider (in our case above, the logistic loss). Q falls out of how the influence function is derived (as a single-step Newton approximation). As an example, in the logistic regression case above, Q is:

Q = diag({1 − p*_i}) = diag({(1 + exp(y_i · f(z_i; θ*)))^{-1}}),

the |S| × |S| sparse matrix with one minus the correct-prediction probabilities on the diagonal. With our notation in hand, we describe the TRAK estimator in two stages: we first present the most basic version of the estimator, then apply two changes to make it more practical for real-world estimation (following the original TRAK work).
We start with the most basic version of the TRAK estimator, which is used to calculate datamodels in place of the standard influence estimate (cf. (8)):

TRAK(z) = ϕ(z)^⊤ (Φ^⊤ Φ)^{-1} Φ^⊤ Q ∈ R^{|S|}.     (11)

To give intuition for this form: Park et al. (2023) construct TRAK by starting with (7), removing the R term, and removing the denominator; these terms were found not to aid datamodel predictiveness (see Park et al. (2023) for more details). The Q term is a vectorized (over the candidate train set) version of the (1 − p*) term in (7), and ϕ(z)^⊤ (Φ^⊤ Φ)^{-1} Φ^⊤ is a vectorized (over the candidate train set) version of the numerator in (7).

Making this form practical is difficult for two reasons: dimensionality and learning algorithm randomness. For the former problem: calculating TRAK requires storing and inverting the term Φ^⊤ Φ, a square matrix with side length equal to the number of model parameters. The smallest models we estimate datamodels for in this work have 125M parameters; even these models would require storing and inverting a 500TB matrix (assuming we invert in float32). To circumvent this issue, TRAK reduces the dimensionality of the input space using Johnson-Lindenstrauss (JL) random projection matrices (Johnson and Lindenstrauss, 1984); JL projections preserve the inner products between projected vectors (and the logistic regression objective can be factored in terms of inner products between inputs (Zhu and Hastie, 2005)). For the latter problem: in practice, θ* = A(S) is generally not unique. For example, for large-scale models, the final trained model obtained by training on the entirety of S changes based on initialization or minibatch randomness. This means that calculating TRAK can yield different datamodel estimates depending on the initialization. To average over training randomness, TRAK calculates (11) over multiple trained models by estimating each term independently and then taking a mean over models.

To both (a) add random projections that reduce the input dimensionality to d ≪ |θ*| and (b) average training randomness over m models, we start by defining a collection of model parameters {θ*_k}_k in place of θ*, where each θ*_k is a vector of model parameters corresponding to training a model on S with A. We then define our new, dimensionality-reduced mapping into the gradient space of trained model k (with parameters θ*_k) as

ϕ_k(z) := P_k^⊤ ∇_θ f(z; θ*_k) ∈ R^d, where P_k ∼ N(0, 1)^{|θ*| × d},

replacing ϕ from (11), and the corresponding stacked, projected candidate train vectors for model k as Φ_k := [ϕ_k(z_1), ..., ϕ_k(z_n)]^⊤ ∈ R^{|S| × d}, replacing Φ from (11). We additionally define Q_k as:

Q_k := diag( ∂L(y_i, f(z_i; θ*_k)) / ∂f(z_i; θ*_k) ),

replacing Q from (11). Finally, we define the final TRAK estimator by starting from the basic TRAK estimator (11) and averaging each term across the m models:

TRAK(z) = ( (1/m) Σ_{k=1}^m ϕ_k(z)^⊤ (Φ_k^⊤ Φ_k)^{-1} Φ_k^⊤ ) ( (1/m) Σ_{k=1}^m Q_k ).
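The sketch below strings these pieces together in NumPy on synthetic gradient features: a JL random projection of per-example gradients, the (Φ_k^⊤ Φ_k)^{-1} term per model, and averaging of both factors over m models. The gradients themselves are random placeholders (in practice they come from backpropagation through each trained checkpoint), and a small ridge term is added before inversion for numerical stability, so this only illustrates the linear algebra of the estimator.

```python
# Minimal sketch of the averaged TRAK estimator: project per-example gradients
# with a JL matrix, form phi(z)^T (Phi^T Phi)^{-1} Phi^T per model, average that
# factor and the Q term over m models, and multiply. Gradients are synthetic
# stand-ins for backprop through real checkpoints.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_params, d, m = 200, 1000, 32, 3   # |S|, |theta|, projection dim, #models


def trak_scores(train_grads, test_grads, q_diags, proj_dim=d):
    """train_grads: list of (n_train, n_params) arrays, one per model;
    test_grads: list of (n_params,) arrays; q_diags: list of (n_train,) arrays."""
    lin_terms, q_terms = [], []
    for G, g_z, q in zip(train_grads, test_grads, q_diags):
        P = rng.normal(size=(G.shape[1], proj_dim)) / np.sqrt(proj_dim)  # JL projection
        Phi, phi_z = G @ P, g_z @ P
        kernel_inv = np.linalg.inv(Phi.T @ Phi + 1e-6 * np.eye(proj_dim))  # small ridge
        lin_terms.append(phi_z @ kernel_inv @ Phi.T)   # (n_train,)
        q_terms.append(q)
    # Average each factor over models, then combine (Q is diagonal, so this is
    # an elementwise product over candidate train examples).
    return np.mean(lin_terms, axis=0) * np.mean(q_terms, axis=0)


# Synthetic inputs: m "checkpoints", each with per-example gradients and Q diagonals.
train_grads = [rng.normal(size=(n_train, n_params)) for _ in range(m)]
test_grads = [rng.normal(size=n_params) for _ in range(m)]
q_diags = [rng.uniform(0.0, 1.0, size=n_train) for _ in range(m)]

scores = trak_scores(train_grads, test_grads, q_diags)
print(scores.shape, scores[:5])   # one influence-style score per candidate train example
```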
B.3. Datamodels for language modeling

In this section, we discuss how to formulate datamodels for LMs. The standard loss function for LM training is simply the cross-entropy loss across tokens. The main question is: what output function do we use? Previous datamodel work studied classifiers, which do not precisely fit the LM objective of predicting sequences of tokens. We therefore extend a standard multi-class classification output function used in previous datamodel instantiations (Saunshi et al., 2023; Park et al., 2023). These methods use the multi-class margin output function:

f(x; θ) := log( p(x; θ) / (1 − p(x; θ)) ),

where p(x; θ) is the probability of the correct class given by the model θ. Since each LM training example consists of many classification problems, we employ what we call the mean multi-class margin output function:

f(x; θ) := (1 / (k − 1)) Σ_{j=2}^{k} log( p(x_j | x_{<j}; θ) / (1 − p(x_j | x_{<j}; θ)) ),

where x is a sequence of k tokens and p(x_j | x_{<j}; θ) is the probability the model assigns to the correct token x_j given the preceding tokens x_{<j}.
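A minimal sketch of this output function, computed from the per-position correct-token probabilities of a causal LM (the probability array here is a placeholder for real model outputs): it simply averages the per-token logit margins log(p / (1 − p)) over the predicted positions.

```python
# Minimal sketch of the mean multi-class margin output function for an LM
# training example: average log(p / (1 - p)) of the correct next token over
# all predicted positions. `correct_token_probs[j]` stands in for
# p(x_{j+1} | x_{<=j}) taken from a real model's softmax output.
import numpy as np


def mean_multiclass_margin(correct_token_probs: np.ndarray, eps: float = 1e-12) -> float:
    p = np.clip(correct_token_probs, eps, 1.0 - eps)  # guard against p in {0, 1}
    return float(np.mean(np.log(p) - np.log1p(-p)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake per-position correct-token probabilities for a 1024-token example
    # (positions 2..k, i.e., 1023 predicted tokens).
    probs = rng.beta(2.0, 5.0, size=1023)
    print(mean_multiclass_margin(probs))
```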
Figure 17: According to CLASSIFIER: the best and worst training examples for improving Jeopardy performance. Samples randomly chosen from the top/bottom (respectively) 0.01% of train samples as determined by CLASSIFIER (cf. Appendix C.3 for details); we display (random) 512-character slices of samples. \n denotes a newline.

Figure 18: According to DSDM: the best and worst training examples for improving LAMBADA performance. Samples randomly chosen from the top/bottom (respectively) 0.01% of train samples as determined by DSDM (cf. Appendix C.3 for details); we display (random) 512-character slices of samples. \n denotes a newline.

Figure 19: According to DSIR: the best and worst training examples for improving LAMBADA performance. Samples randomly chosen from the top/bottom (respectively) 0.01% of train samples as determined by DSIR (cf. Appendix C.3 for details); we display (random) 512-character slices of samples. \n denotes a newline.

Figure 20: According to CLASSIFIER: the best and worst training examples for improving LAMBADA performance. Samples randomly chosen from the top/bottom (respectively) 0.01% of train samples as determined by CLASSIFIER (cf. Appendix C.3 for details); we display (random) 512-character slices of samples. \n denotes a newline.

Figure 21: According to DSDM: the best and worst training examples for improving CS-Algorithms performance. Samples randomly chosen from the top/bottom (respectively) 0.01% of train samples as determined by DSDM (cf. Appendix C.3 for details); we display (random) 512-character slices of samples. \n denotes a newline.

Figure 22: According to DSIR: the best and worst training examples for improving CS-Algorithms performance. Samples randomly chosen from the top/bottom (respectively) 0.01% of train samples as determined by DSIR (cf. Appendix C.3 for details); we display (random) 512-character slices of samples. \n denotes a newline.

Figure 23: According to CLASSIFIER: the best and worst training examples for improving CS-Algorithms performance. Samples randomly chosen from the top/bottom (respectively) 0.01% of train samples as determined by CLASSIFIER (cf. Appendix C.3 for details); we display (random) 512-character slices of samples. \n denotes a newline. The third "best train samples" sample was slightly modified to render in LaTeX.
Third "best train samples" sample slightly modified to render in LATEX. DSDM: Model-Aware Dataset Selection with Datamodels (1) then color it with colored markers or wax paper, learn about it and share it in the comments, show it to your friends. It is a fun and educational activity for children, which helps them develop motor skills and coordination while having fun.<|endoftext|>Since 1997, Futura Kitchen sinks Ind Pvt Ltd. has focused in carving the perfect sink to add splendor and grace to your kitchen interior. The company has today evolved as one of the leading and reputed manufacturer of kitchen sinks and accessories establis (2) quirements for withstanding wind pressure in railway structures. Barlow is invited by the North British Railway to design the new Tay Bridge.\n1882: Work on the new Tay Bridge begins. The bridge opens for traffic in June 1887.\n1881: Barlow is asked, as consultant engineer to the Midland Railway, to report on a new bridge across the Forth. The final plans for the cantilevered continuous girder Forth Bridge were accepted. Work on the bridge by Sir John Fowler, Benjamin Baker and William Arrol starts in 1883 an (3) hare Your Universe at New York Comic Con with our many panels; free all ages giveaways and events at the Marvel booth; exclusive signing events; and chance to connect with the timeless Super Heroes that have inspired us all.?\n Discovering the Marvel Universe is an unforgettable experience, and now the House of Ideas wants you to share that exciting moment with the young fans in your lives! Enjoy your favorite Marvel Super Heroes in animation, comic books, and interactive digital media with your loved ones ev (4) use proven search engine optimization strategies to increase the ranking and popularity of personal, branded career Websites. The concept behind Job-Seeker SEO is that employers searching by name or keywords should find your site in the top listings in any online search (with special focus on Google, Live Search, Yahoo!). Read more.\n One of the most popular work-based learning activities because it provides job-seekers with opportunities to gather information on a wide variety of career possibilities before (5) ow do I register participants paying separately?Can I register onsite? What are the policies for cancellation, substitutions and refunds?\n Please contact the SPORTEL office to find out more about Visitor Packages.<|endoftext|>the magnitude and nature of the problem of alcohol and road accidents in great britain has been monitored through special returns of blood alcohol concentration (bac) in fatalities, through routine reporting of positive screening (breath) tests recorded by the police for drivers involve (6) e to upload photos to Facebook, Picasa, or Shutterfly.\n Q: How many phone numbers can I store on my Jitterbug Plus phone?\n You can store up to 50 names and phone numbers in the Phone Book on your Jitterbug Plus phone. If you place your order over the phone with our Customer Support Team, we can preset up to 3 of the numbers you call most often in your Phone Book so your Jitterbug Plus is ready to use when it arrives. You can add, delete or edit names and numbers anytime directly on the Jitterbug Plus phone or Figure 24: (Random) 512 character slices of random train samples. Samples are generally 3, 000 to 6, 000 characters (each is 1024 tokens). \n denotes a newline. 
[Plot: Jeopardy+SQuAD mean accuracy for 760M models; methods compared: CLASSIFIER, DSIR, DSDM, RANDOM.]

Figure 25: Overall 760M model performance while varying the target task for both DSDM and the targeted baselines. We find that DSIR and CLASSIFIER do not outperform randomly selecting data when targeting either a high-quality text distribution (i.e., the GPT-3 target distribution replication) or the mixture of DSDM LM target tasks. Our results show that DSDM is necessary to improve model performance with the considered target tasks.

D. Evaluating data selections for broad model performance

In this section we provide further information on the results of Section 4, including: the model training procedure, dataset selection baseline specifics, the exact evaluation procedure, and omitted figures.

D.1. Experimental setup

Below, we describe each aspect of our experimental setup in greater detail.

Model training. To evaluate selected datasets we train GPT-2-style, decoder-only LMs. We train models for each dataset selection method at varying training compute budgets: 125M, 356M, 760M, and 1.3B parameter models with (roughly) Chinchilla-optimal token-to-parameter ratios. We additionally train a 1.8B parameter model (which uses 2× the train budget of the 1.3B models) on randomly selected data to contextualize 1.3B model performance. We train each model with the procedure described in Appendix A.4 and the hyperparameters listed in the Section 4 part of Table 2. For the targeted selection methods (DSDM, CLASSIFIER, and DSIR), we select data to train on for four epochs (following previous dataset selection work (Xie et al., 2023b)). For the untargeted baselines (RANDOM and SemDeDup), we select data to train on for a single epoch. Note that we do not perform any hyperparameter tuning over the choice of target tasks (for any method) or the number of epochs.

CLASSIFIER and DSIR target task. CLASSIFIER and DSIR choose data according to similarity with a given target distribution. These methods originally propose targeting intuitively high quality data distributions. CLASSIFIER (when selecting the GPT-3 dataset) originally targeted a proprietary (not publicly known) mix of data sources that includes Wikipedia, book text, and web articles vetted by Reddit popularity (Radford et al., 2019). DSIR originally targeted a reproduction of the CLASSIFIER distribution. Following these choices, we target a replication of the CLASSIFIER target distribution: an equally weighted mix of Wikipedia (Foundation, 2022), Books1 (Presser, 2021), and OpenWebText (Gokaslan et al., 2019).

SemDeDup hyperparameters. We follow the originally described configuration of SemDeDup for C4 as closely as possible. We deduplicate down to ~20% of the original C4 dataset (ϵ = 0.3), the fraction originally found to maximize downstream model accuracy, and use 11,000 clusters.

Evaluation details. We describe the fifteen considered benchmarks in Table 4. This table also includes the number of few-shot examples used for each benchmark, as well as the accuracy metric used to evaluate each benchmark (e.g., fuzzy string matching for open-ended benchmarks; see Appendix A.5.2 for more details). To construct this set of benchmarks, we use the category designations and few-shot choices originally developed for the Mosaic Eval Gauntlet (MosaicML, 2023).

D.2. Omitted figures

We target the two baselines, DSIR and CLASSIFIER, towards the DSDM LM target tasks in Figure 25. The resulting models do not beat selecting data at random.
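To make the training budgets described in Appendix D.1 concrete, the sketch below estimates how many unique sequences each selection method must supply at each model scale. It assumes the usual Chinchilla heuristic of roughly 20 training tokens per parameter (the text only says "roughly Chinchilla-optimal", so this exact ratio is an assumption) and 1024-token sequences (cf. Figure 24); targeted methods repeat their selection for four epochs, so they need roughly a quarter as many unique sequences as the untargeted, single-epoch baselines.

```python
# Back-of-the-envelope estimate of unique training sequences per method.
# TOKENS_PER_PARAM = 20 is the standard Chinchilla heuristic and is an
# assumption here; SEQ_LEN = 1024 follows the sequence length noted in Figure 24.
TOKENS_PER_PARAM = 20
SEQ_LEN = 1024

model_sizes = {"125M": 125e6, "356M": 356e6, "760M": 760e6, "1.3B": 1.3e9}

for name, params in model_sizes.items():
    train_tokens = TOKENS_PER_PARAM * params
    total_seqs = train_tokens / SEQ_LEN
    # Targeted methods (DSDM, CLASSIFIER, DSIR): 4 epochs over their selection.
    # Untargeted baselines (RANDOM, SemDeDup): 1 epoch over their selection.
    targeted_unique = total_seqs / 4
    untargeted_unique = total_seqs
    print(f"{name}: ~{train_tokens:.2e} train tokens, "
          f"~{targeted_unique:.2e} unique seqs (targeted, 4 epochs), "
          f"~{untargeted_unique:.2e} unique seqs (untargeted, 1 epoch)")
```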
Table 4: Description and category of each benchmark, with the corresponding accuracy evaluation procedure (cf. Appendix A.5.2). Benchmarks are taken primarily from the Mosaic Eval Gauntlet (MosaicML, 2023).

Commonsense Reasoning
  copa (MC), 0 shots: Causal reasoning questions about short scenarios (Roemmele et al., 2011)
  openbook_qa (MC), 0 shots: Elementary science questions (Mihaylov et al., 2018)
  piqa (MC), 3 shots: Physical intuition questions (Bisk et al., 2019)

Language Understanding
  cbt (MC), 0 shots: Complete passages from children's books (Hill et al., 2015)
  hellaswag (MC), 3 shots: Complete sentences requiring commonsense reasoning (Zellers et al., 2019)
  winogrande (MC), 0 shots: Resolve (harder) Winograd schema questions (Sakaguchi et al., 2021)

Reading Comprehension
  coqa (Fuzzy), 0 shots: Questions about given conversations (Reddy et al., 2019)
  news_qa (Fuzzy), 3 shots: Questions about news articles in context (Trischler et al., 2016)
  boolq (MC), 3 shots: True/false questions about given Wikipedia passages (Clark et al., 2019)

Symbolic Problem Solving
  bb_copy_logic (Exact), 3 shots: Repeat text in a given order (Srivastava et al., 2022)
  bb_dyck_lang (Exact), 3 shots: Balance the parentheses of a given expression (Srivastava et al., 2022)
  bb_operators (Exact), 3 shots: Calculate expressions defined in context (Srivastava et al., 2022)

World Knowledge
  arc_easy (MC), 3 shots: Grade school science questions (Clark et al., 2018)
  bb_qa_wikidata (Fuzzy), 3 shots: Complete sentences about entities present in Wikidata (Srivastava et al., 2022)
  trivia_qa (Fuzzy), 3 shots: Trivia questions (Joshi et al., 2017)
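Several open-ended benchmarks in Table 4 (coqa, news_qa, bb_qa_wikidata, trivia_qa) are scored with fuzzy string matching; the exact procedure is specified in Appendix A.5.2 and is not reproduced here. The following is a minimal sketch of one common form of fuzzy matching, with answer normalization and a similarity-ratio threshold; the 0.8 threshold and the specific normalization rules are illustrative assumptions, not the paper's exact metric.

```python
import re
import string
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def fuzzy_match(prediction: str, references: list[str], threshold: float = 0.8) -> bool:
    """Count a prediction as correct if it matches any reference answer closely
    enough after normalization. The 0.8 threshold is an assumed value for
    illustration (cf. Appendix A.5.2 for the procedure actually used)."""
    pred = normalize(prediction)
    for ref in references:
        ref_n = normalize(ref)
        if pred == ref_n or ref_n in pred:
            return True
        if SequenceMatcher(None, pred, ref_n).ratio() >= threshold:
            return True
    return False

# Example: a trivia_qa-style check against multiple acceptable answers.
print(fuzzy_match("It was William Shakespeare.", ["William Shakespeare", "Shakespeare"]))  # True
```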