Published as a conference paper at ICLR 2025

DELIFT: DATA EFFICIENT LANGUAGE MODEL INSTRUCTION FINE-TUNING

Ishika Agarwal1, Krishnateja Killamsetty2, Lucian Popa2, Marina Danilevsky2
1University of Illinois Urbana-Champaign, 2IBM Research
1ishikaa2@illinois.edu, 2krishnateja.k@ibm.com, {lpopa, mdanile}@us.ibm.com

Fine-tuning large language models (LLMs) is crucial for task specialization but often becomes resource-intensive due to redundant or uninformative data. Existing data selection methods typically rely either on computationally expensive gradient-based metrics or on static embeddings that fail to adapt dynamically to the model's evolving state, limiting their practical effectiveness. To address this, we propose DELIFT (Data Efficient Language model Instruction Fine-Tuning), which leverages a novel, computationally efficient utility metric inspired by In-Context Learning (ICL). Our ICL-based metric measures the informational value of each data sample by quantifying its effectiveness as an in-context example in improving model predictions for other samples, reflecting its actual contribution relative to the model's current state. Integrated with tailored submodular optimization methods, DELIFT systematically selects diverse, informative subsets optimized specifically for each fine-tuning stage: instruction tuning, task-specific adaptation, and continual fine-tuning. Experimental results across multiple datasets and model scales show that DELIFT reduces fine-tuning data requirements by up to 70% without compromising performance, consistently outperforming existing methods by up to 26% in effectiveness and efficiency.

1 INTRODUCTION

Large Language Models (LLMs) have become indispensable for solving a variety of natural language processing tasks, ranging from question answering and summarization to complex dialogue and reasoning (Brown et al., 2020; Touvron et al., 2023).
Despite their remarkable adaptability, fine-tuning LLMs often requires enormous computational resources and time, especially when a significant portion of the training data is redundant or uninformative (Gururangan et al., 2020; Sorscher et al., 2023). This challenge grows more critical with increasing model and dataset sizes, posing a key limitation to the broader deployment of LLMs.

Existing data selection methods generally fall under two paradigms: (1) static embedding-based approaches that compute sample similarities without reflecting the model's evolving state (Bukharin & Zhao, 2024; Chen et al., 2024), and (2) gradient-based methods that offer more model-specific feedback but often entail prohibitive computational overhead, especially for large-scale models (Killamsetty et al., 2021b; Xia et al., 2024). Although both paradigms can yield initial benefits, they often fail to account for how a model's knowledge shifts over multiple fine-tuning phases: (1) Instruction Tuning (Mishra et al., 2022; Wei et al., 2022; Longpre et al., 2023), which enhances the model's ability to follow diverse instructions; (2) Task-Specific Fine-Tuning (Gururangan et al., 2020; Cobbe et al., 2021), which focuses on refining domain expertise; and (3) Continual Fine-Tuning (Madotto et al., 2021; Wu et al., 2024), which incrementally incorporates new knowledge while mitigating catastrophic forgetting.

Thus, a natural question arises: Can we develop a unified, computationally efficient data selection framework that adapts to all stages of fine-tuning and maximizes model performance while minimizing data redundancy?

[Figure 1 appears here: three panels of example data points. Use Case 1: fine-tune a model to follow instructions (subset should contain diverse points). Use Case 2: improve the model's performance on a mathematical reasoning benchmark (subset should be diverse and representative of the benchmark). Use Case 3: continual learning on review sentiment analysis datasets (subset should be diverse and complementary to previously trained Phase I data).]
Figure 1: DELIFT data selection across fine-tuning stages. (a) Instruction Tuning: diverse instructions are selected; redundant samples are pruned. (b) Task-Specific Fine-Tuning: samples that are diverse and mutually informative with the benchmark data are prioritized for selection. (c) Continual Fine-Tuning: new samples that are novel are integrated; new samples with overlapping information are pruned.

In this paper, we introduce DELIFT (Data-Efficient Language Model Instruction Fine-Tuning), a single-stop solution designed to address data selection across all fine-tuning stages within a single framework. DELIFT is grounded in information theory yet uses the practical intuition of in-context examples to assess the information gain of each data sample relative to the current state of a model. Specifically, we propose a new utility metric that captures how effectively one sample improves the model's prediction of another. By combining these pairwise utilities with submodular optimization, DELIFT generates diverse, non-redundant subsets uniquely tailored to each fine-tuning phase.

We evaluated DELIFT on various tasks and model scales, consistently observing that it can prune up to 70% of the training data without hurting performance (and often improving it), outperforming existing methods by up to 26% in efficiency and effectiveness. In doing so, we show that careful, utility-driven data selection can be far more effective than sheer data volume, opening the door to more resource-friendly and targeted fine-tuning.

Our primary contributions are as follows:

1. A unified information-theoretic data selection paradigm that leverages pairwise utilities grounded in conditional pointwise mutual information, making it adaptable to instruction tuning, task-specific adaptation, and continual fine-tuning.
2. A single-stop, submodular optimization framework that integrates these utilities to provide diverse, high-value subsets for each fine-tuning stage without incurring prohibitive computation.
3. Extensive empirical validation showing up to 70% data reduction with minimal (and sometimes zero) performance loss across multiple domains, demonstrating substantial gains in both efficacy and efficiency.

The remainder of this paper is organized as follows. Section 2 reviews prior work on data-efficient strategies for fine-tuning LLMs and situates our approach within the literature. Section 3 introduces our information-theoretic utility metric and describes how it integrates with submodular optimization to enable data selection across diverse fine-tuning stages. Section 4 presents comprehensive experiments demonstrating the effectiveness and efficiency of our framework on multiple tasks and models. Finally, Section 5 discusses the broader implications of our results, outlines limitations, and suggests directions for future research. Our complete code base is publicly available at https://github.com/agarwalishika/delift, enabling further exploration and replication.

2 RELATED WORK

Data Subset Selection for Deep Neural Networks. Selecting an informative subset of training samples is a longstanding strategy to reduce computational costs and enhance model generalization.

Model-Independent Approaches. Traditional model-independent techniques, such as clustering or distance metrics on pre-trained embeddings (Bukharin & Zhao, 2024; Du et al., 2023; Killamsetty et al., 2023), capture broad semantic similarities but do not reflect the model's changing state, limiting their effectiveness during iterative fine-tuning.

Model-Dependent Approaches. Model-dependent methods incorporate the model's evolving knowledge by analyzing gradients or loss values (Killamsetty et al., 2021b;a; Xia et al., 2024), often outperforming static approaches.
However, performing gradient or influence estimations at scale becomes prohibitively expensive for large models. Techniques like LESS (Xia et al., 2024) alleviate some overhead via parameter-efficient fine-tuning (e.g., LoRA), yet still incur repeated gradient or influence calculations that scale poorly with dataset size.

Subset Selection with LLM Feedback. Another emerging direction leverages LLM feedback to score or filter training samples. For instance, SelectIT (Liu et al., 2024) employs self-reflection prompts to rate data quality, while filtering approaches using GPT-4 (Chen et al., 2024) rely on external heuristics. Though these provide a form of model-aware sampling, they typically lack a principled theoretical grounding. In addition, all of these approaches primarily target a single fine-tuning stage, limiting their adaptability for instruction tuning, task-specific adaptation, or continual learning.

Our Contribution. In contrast, we present a unified, information-theoretic framework that operates effectively across all fine-tuning stages: instruction tuning, task-specific adaptation, and continual fine-tuning. Our novel utility metric quantifies how one data point aids the prediction of another, mirroring the model's evolving knowledge. Integrated within a submodular selection paradigm (Fujishige, 2005; Iyer et al., 2021), this approach balances diversity, coverage, and informativeness throughout the entire fine-tuning pipeline. As a result, we bridge the gap left by existing methods that are either restricted to a single phase or computationally infeasible at scale, demonstrating consistent performance improvements and notable efficiency gains.

3 METHODOLOGY

Our goal is to efficiently identify a subset of data that maximizes the performance of large language models across three fine-tuning stages: (1) Instruction Tuning, (2) Task-Specific Fine-Tuning, and (3) Continual Fine-Tuning.
This section first introduces our novel information-theoretic pairwise utility metric (Section 3.1) and then explains how we leverage submodular optimization to achieve data-efficient selection (Section 3.2). Finally, we show how these components unify into a single-stop solution for all fine-tuning stages (Section 3.3).

3.1 PAIRWISE UTILITY METRIC

Let $D = \{(x_i, y_i)\}$ be a training set, where each $x_i$ is an input sequence and $y_i$ is the corresponding output. Consider two samples $(x_i, y_i)$ and $(x_j, y_j)$, and let $GT_i$ denote the ideal ground-truth distribution that assigns probability 1 to the token sequence $y_i$ and 0 otherwise. We define $p(y_i \mid x_i)$ as the model's predicted probability distribution of $y_i$ given $x_i$ alone, and $p(y_i \mid x_i, x_j, y_j)$ as the predicted distribution of $y_i$ when $(x_j, y_j)$ is also provided (e.g., as an in-context example).

Definition (Utility Function). We capture the information gain of $(x_j, y_j)$ for predicting $(x_i, y_i)$ via:

$$UF_{ij} = d\big(GT_i,\, p(y_i \mid x_i)\big) - d\big(GT_i,\, p(y_i \mid x_i, x_j, y_j)\big), \tag{1}$$

where $d(\cdot, \cdot)$ is a distance between probability distributions. A positive $UF_{ij}$ thus means that conditioning on $(x_j, y_j)$ moves the model's prediction of $y_i$ closer to the ground truth.

Information-Theoretic Interpretation. Below we state a simplified version of our main theoretical result (see Appendix A for the full proof):

Theorem 1 (Informal Statement). If $d(\cdot, \cdot)$ is chosen to be the Kullback-Leibler (KL) divergence, then the utility $UF_{ij}$ coincides with the (conditional) pointwise mutual information between $y_i$ and $(x_j, y_j)$ given $x_i$. Formally,

$$UF_{ij} = \log \frac{p(y_i \mid x_i, x_j, y_j)}{p(y_i \mid x_i)} = \sum_{t=1}^{|y_i|} \log \frac{p(y_{i,t} \mid x_i, x_j, y_j, y_{i,<t})}{p(y_{i,t} \mid x_i, y_{i,<t})}$$
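Because $GT_i$ is a point mass on $y_i$, choosing KL divergence makes each term of Eq. (1) a negative log-likelihood, so $UF_{ij}$ can be computed from two forward passes: the target's per-token log-probabilities with and without the in-context example. The sketch below is illustrative, not the paper's implementation: the function names and the toy utility matrix are hypothetical, the per-token log-probabilities are assumed to come from the model, and facility location is used as a representative submodular objective (the specific objectives DELIFT uses per stage are defined in Section 3.2).

```python
import numpy as np

def utility_from_logprobs(logp_with_icl, logp_without_icl):
    """UF_ij of Theorem 1: per-token conditional PMI summed over the target y_i.

    logp_with_icl[t]    = log p(y_{i,t} | x_i, x_j, y_j, y_{i,<t})
    logp_without_icl[t] = log p(y_{i,t} | x_i, y_{i,<t})
    """
    return float(np.sum(np.asarray(logp_with_icl) - np.asarray(logp_without_icl)))

def greedy_facility_location(U, k):
    """Greedily pick k columns (candidate samples) maximizing
    f(S) = sum_i max(0, max_{j in S} U[i, j]).

    U[i, j] holds the utility of sample j as an in-context example for
    sample i. The 0 baseline clips negative utilities so f is monotone
    submodular; the greedy rule then gives the classic (1 - 1/e)
    approximation guarantee.
    """
    n_rows, _ = U.shape
    covered = np.zeros(n_rows)      # best utility achieved so far per row
    selected = []
    for _ in range(k):
        # marginal gain of adding each candidate column j
        gains = np.maximum(U, covered[:, None]).sum(axis=0) - covered.sum()
        gains[selected] = -np.inf   # never re-pick a selected sample
        j = int(np.argmax(gains))
        selected.append(j)
        covered = np.maximum(covered, U[:, j])
    return selected
```

On a toy 3x3 utility matrix, the greedy loop first picks the sample that helps the most targets, then the one that best covers what the first pick left unexplained, illustrating how the pairwise utilities and the submodular objective interact.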