Published as a conference paper at ICLR 2025

DATA PRUNING BY INFORMATION MAXIMIZATION

Haoru Tan1, Sitong Wu2, Wei Huang1, Shizhen Zhao1, Xiaojuan Qi1
1The University of Hong Kong  2The Chinese University of Hong Kong

ABSTRACT

In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. In doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. These enable InfoMax to scale seamlessly to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models.

1 INTRODUCTION

Figure 1: Summarization of InfoMax's performance in vision-language pre-training and image classification (both at a 10% selection ratio), and instruction fine-tuning for large language models (5% selection ratio); panels report ImageNet-1K, zero-shot ImageNet-1K, zero-shot Flickr30K, and linear-probe ImageNet-1K results with CLIP ViT-B/32.
The results show that InfoMax has demonstrated substantial progress in various scenarios; see Section 4 for details.

Large-scale datasets have been pivotal in the recent breakthroughs in deep learning (Brown, 2020; Kirillov et al., 2023; Radford et al., 2021; Rombach et al., 2022). However, the growing size of training data substantially increases both training costs and storage demands. Moreover, significant redundancies within these datasets highlight the importance of data-pruning methods that remove redundant samples and identify a compact yet informative subset, known as a coreset, to enable more efficient model training and data storage. Research in this field can be broadly divided into two categories: score-based methods (Sorscher et al., 2022; Radford et al., 2021) and geometry-based methods (Sener & Savarese, 2017; Ash et al., 2020). Score-based methods focus on developing metrics to evaluate a sample's informativeness, such as prediction uncertainty (Har-Peled et al., 2007), loss value (Coleman et al., 2019), or influence score (Xia et al., 2024; Tan et al., 2023). However, as shown in Figure 2(a), these methods often select samples densely concentrated in regions with the highest scores, leading to redundancies and failing to consider simpler samples with lower scores, which results in biased selections. Recent work (Sorscher et al., 2022) highlights that even simple samples are important for improving model generalization. On the other hand,

Equal Contribution.
Corresponding Author.

Figure 2: The coreset distributions from InfoMax, the score-based method (output margin (Har-Peled et al., 2007)), the diversity-based method (k-Median clustering (Feldman & Langberg, 2011; Har-Peled & Mazumdar, 2004)), and the hybrid method (D2-Pruning (Maharana et al., 2023)); panels: (a) Score-based, (b) Diversity-based, (c) D2-Pruning, (d) InfoMax. In the upper figures, the background illustrates the distribution density of the data in the space after PCA dimensionality reduction; brighter colors indicate a higher density of samples. The locations of the samples selected by different methods are shown as scatter points. The lower figures show the score distributions of the corresponding coresets. This set of experiments is conducted on CIFAR-100 (Krizhevsky, 2009). The coreset selected by our InfoMax covers both high-density and low-density regions in terms of spatial uniformity, as well as high-information and low-information data in terms of score distribution.

geometry-based methods aim to select a diverse subset of the data, reducing redundancies between samples (Sener & Savarese, 2017). As illustrated in Figure 2(b), while these methods prioritize diversity, they often overlook informative samples with high-importance scores, leaving a large number of low-scoring samples in the selection. Recently, hybrid approaches have emerged that combine importance scores with diversity to design more effective algorithms (Ash et al., 2020; Maharana et al., 2023; Zheng et al., 2022; Yang et al., 2024). One of the most notable examples is D2-Pruning (Maharana et al., 2023), which models a dataset as a graph.
In this framework, node scores represent a sample's informativeness, such as difficulty or importance, while edges capture the similarities between samples. The data pruning process is formulated as an iterative node selection problem, where at each step, nodes with the highest scores are selected, and the scores of neighboring nodes are reduced to account for redundancy. However, due to its greedy selection process, the algorithm is prone to getting stuck in suboptimal solutions, making it challenging to maintain a proper balance between importance and diversity. For example, as shown in Figure 2(c), many samples remain concentrated in high-density areas with low scores, leaving significant portions of the space uncovered.

Figure 3: Cases of samples with low informativeness (left) and high informativeness (right).

In this paper, we introduce InfoMax, a new and effective approach for data pruning and coreset selection. Our core insight is to find a subset of samples that maximizes overall information by simultaneously considering each sample's information contribution and the information overlap among samples. First, a sample's informativeness can be explained by its importance or difficulty. For instance, a sample with intricate structures, complex backgrounds, and occlusions provides more information than one with simple patterns; see Figure 3. Therefore, we explore score-based metrics that evaluate a sample's difficulty or importance, providing an information assessment. Second, the information overlap among samples is evaluated in a pairwise manner and quantified by their pairwise similarities: samples with greater similarity have a higher degree of information overlap. Using these considerations, we formulate coreset selection as a discrete quadratic programming (DQP) problem with equality constraints that specify the desired number of selected samples; see Eq. (2).
The objective function represents the overall information as the sum of each sample's individual information, reduced by redundancies introduced by the presence of similar samples within the coreset. Furthermore, we propose an efficient and robust gradient-based solver to address scalability. This solver is enhanced by sparsification techniques applied to the similarity matrix and dataset partitioning strategies, enabling practical and scalable computation. For instance, our method processes 12 million data points in just 37 minutes (see Section B.2 for details). By solving this problem, InfoMax identifies a subset of samples that maximizes overall information, thereby reducing the likelihood of the suboptimal solutions often seen with greedy-based methods; see the discussion in Section 3.3. As shown in Figure 2(d), the coreset generated by our approach strikes a good balance between diversity (i.e., well-distributed across the entire space) and informativeness (i.e., containing a reasonable number of important samples). The superior performance across multiple tasks, including image classification, vision-language pre-training, and instruction fine-tuning for large language models (see Section 4), further supports the effectiveness of our new formulation. We have summarized the overall pipeline in Figure 4 and Algorithm 1. In summary, our contributions are:

- We propose InfoMax, a new coreset algorithm designed to maximize overall information by accounting for each sample's individual contribution while reducing information overlap, with a simultaneous focus on maintaining diversity and importance.
- We propose an efficient gradient-based solver enhanced by sparsification techniques and dataset partitioning strategies to make InfoMax scale to large-scale datasets.
- Extensive experiments show that InfoMax exhibits the best performance and consistently outperforms state-of-the-art schemes in a series of tasks, including image classification, vision-language pre-training (Radford et al., 2021), and large language model supervised fine-tuning (Xia et al., 2024). Notably, it brings a significant improvement of approximately 5.5% over the previous best methods at a 5% selection rate in instruction fine-tuning experiments. Additionally, it shows around 2% performance enhancement on classification tasks at a 10% selection rate; see Figure 1.

2 PRELIMINARIES

In this section, we first present the problem definition for data pruning, followed by a discussion of existing methods, including score-based, diversity-based, and hybrid approaches.

2.1 PROBLEM DEFINITION

Before diving into the literature review of existing methods, we first define the problem of data pruning, also known as coreset selection (Sener & Savarese, 2017; Mirzasoleiman et al., 2020; Killamsetty et al., 2021). Let $D = \{z_i\}_{i=1}^N$ represent the training set, drawn independently and identically distributed (i.i.d.) from an underlying data distribution. The goal of dataset pruning is to identify the most informative subset of the training data that minimizes information loss while maximizing model performance. Formally, the problem can be stated as:

$$S^* = \arg\max_{S \subset D,\, |S| = p} I(S), \qquad (1)$$

where $p$ is the budget of the target coreset and $|\cdot|$ represents the cardinality of a set. $I(S)$ measures the set-level information of the candidate subset $S$. There are multiple choices (Sorscher et al., 2022; Wei et al., 2015; Kaushal et al., 2021) for the instantiation of the set information $I(S)$. For example, given the loss function $l(\cdot)$ and the network parameters $w$, previous works (Yang et al., 2023; Zheng et al., 2022) often use the overall test loss reduction caused by training the model on $S$ as the information measure, $I(S) = \mathbb{E}_z\big[\, l(z, w_0) - l(z, w_S) \,\big]$. In this case, Eq.
(1) would be identical to

$$S^* = \arg\min_{S \subset D,\, |S| = p} \mathbb{E}_z\big[\, l(z, w_S) \,\big],$$

where $w_0$ is the initial parameter and $w_S$ indicates the optimal network parameter learned on $S$. The key distinction between various data pruning methods lies in how they define $I(S)$, which is detailed below.

Figure 4: The overall pipeline of our InfoMax. In the first two steps, we use a network to extract features to calculate the similarity matrix and the intra-sample information terms. Then, in step 3, we construct the quadratic optimization problem for InfoMax; see Eq. (2) for details. Finally, we perform the iterative InfoMax solver as defined by Eq. (3) to obtain the final InfoMax coreset.

2.2 SCORE-BASED METHOD

Score-based methods (Paul et al., 2018; Tan et al., 2023; Toneva et al., 2018) often select samples solely based on score values. They generally model $I(S)$ as $I(S) = I(z_1) + I(z_2) + \ldots + I(z_p)$, where $S = \{z_1, \ldots, z_p\} \subset D$, with each $z$ representing a data sample and $I(z)$ denoting the information associated with that sample. The primary task then becomes identifying a suitable score metric to evaluate the individual contribution, $I(z)$, of each sample. Various scores have been proposed to quantify $I(z)$, such as model uncertainty (Har-Peled et al., 2007), loss value (Coleman et al., 2019), and influence score (Tan et al., 2023; Pruthi et al., 2020). The assumption underlying the score-based formulation is that there is no information overlap between different samples, allowing the individual contributions of samples to be summed to represent the overall contribution.
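Under this additive model, selection reduces to keeping the $p$ highest-scoring samples. A minimal sketch (the score vector is assumed to be precomputed by any of the metrics above; the function name is ours):

```python
import numpy as np

def score_based_coreset(scores, p):
    """Select the p samples with the largest individual scores I(z)."""
    return np.argsort(-np.asarray(scores))[:p]

# toy example: four samples, keep the two highest-scoring ones
idx = score_based_coreset([0.1, 0.9, 0.4, 0.8], 2)  # -> array([1, 3])
```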
However, this approach fails to account for overlap in the information provided by different samples. For instance, if two identical samples, $z_i$ and $z_j$, both receive high scores, the information gained from adding $z_j$ after selecting $z_i$ would be negligible. As a result, this method cannot ensure that the selected samples offer broad coverage of the data space, leading to a loss of diversity and suboptimal solutions. This limitation is illustrated in Figure 2(a): score-based methods select samples densely concentrated in regions with the highest scores, leading to redundancies and failing to consider simpler samples with lower scores, which results in biased selections.

2.3 GEOMETRY-BASED METHOD AND HYBRID METHOD

Geometry-based Method. In contrast to score-based methods, geometry-based methods design $I(S)$ to maximize diversity and minimize the information overlap among samples. In many works (Lloyd, 1982; Tan et al., 2006; Coates & Ng, 2012; Har-Peled & Mazumdar, 2004; Feldman & Langberg, 2011), the set-level information is regarded as $I(S) \propto -\sum_{(z_i, z_j) \in S} I(z_i; z_j)$, where $I(z_i; z_j)$ measures the similarity of two samples, indicating information overlap (i.e., their mutual information). To solve the above problem, Sener & Savarese (2017) applied a greedy k-center method to choose a coreset with good data coverage. This formulation assumes that all individual samples carry equal amounts of information, disregarding the varying significance of each sample. Consequently, it tends to overlook critical samples while retaining a large number of non-informative ones. As illustrated in Figure 2(b), while these methods prioritize diversity, they often overlook informative samples with high-importance scores, leaving a large number of low-scoring samples in the selection.
Hybrid Method. More recently, hybrid methods (Zheng et al., 2022; Ash et al., 2020; Maharana et al., 2023) have been developed that account for both the individual importance and the diversity of samples simultaneously. For instance, CCS (Zheng et al., 2022) sets different sampling ratios for samples with different scores to enhance data coverage and balance both easy and hard samples. D2-Pruning (Maharana et al., 2023) proposes to select data on a graph. One of its core steps is the Inverse Message Passing operation: it selects the sample with the highest score among all unselected candidates and subsequently reduces the scores of its neighborhood. However, the above approaches are primarily based on heuristic designs and are often solved using greedy algorithms. These methods are prone to local optima, as the algorithm lacks a holistic view of the problem at each step, resulting in suboptimal outcomes. In Figure 2, the selected samples demonstrate inadequate coverage of the entire space, with many concentrated in low-importance regions, which results in suboptimal performance in practice; see Figure 1 for details.

In this paper, we propose a unified formulation with a scalable solver for coreset selection, aimed at maximizing the information of the selected subset by accounting for both the individual contributions of samples and their redundancies.

3.1 INFOMAX: FORMULATIONS AND SOLUTIONS

The objective of InfoMax is to identify a subset of data samples that maximizes intra-sample information while minimizing inter-sample redundancies caused by similar samples, thereby achieving an optimal balance between diversity and importance. For a sample $z$, we represent its information content as $I(z)$. For two samples $z$ and $s$, their information overlap or redundancy is denoted as $I(z, s)$.
The total information is measured by summing the individual information of the samples, with deductions for inter-sample redundancy, as outlined below. We will discuss in Section 3.3 that, under some mild settings, the optimization problem in Eq. (2) is equivalent to solving the original information maximization problem for data pruning as defined in Eq. (1).

Problem Formulation for InfoMax. Here, we consider inter-sample redundancy in a pairwise manner using $I(z, s)$, and the matrix containing all pairwise sample similarities is denoted as $K_{z,s}$. Then, given the dataset $D = \{z_1, z_2, \ldots, z_N\}$, we introduce a variable $X_z \in \{0, 1\}$ to represent whether a sample is selected ($X_z = 1$) or pruned ($X_z = 0$). The overall information is formulated as a quadratic function of the variable $X$:

$$\max_{X \in \{0,1\}^N} \; \sum_{z \in D} X_z I(z) \;-\; \alpha \sum_{z, s \in D} K_{z,s} X_z X_s, \quad \text{s.t.} \; \sum_{z \in D} X_z = p, \qquad (2)$$

where $p = (1 - \delta) N$ is the size budget of the selected set. In this formulation, the first-order term $I(z)$ measures the intra-sample information of the sample $z$, which can be instantiated using any existing sample-wise score, e.g., EL2N (Paul et al., 2018) or SSP (Sorscher et al., 2022). The quadratic term $K_{z,s}$ measures the inter-sample redundancy between the samples $z$ and $s$.

Objective Construction. For the intra-sample information $I \in \mathbb{R}^N$ and the pairwise feature similarity $K \in \mathbb{R}^{N \times N}$, we introduce two calculation modes following Maharana et al. (2023), namely the supervised mode and the unsupervised mode.

The supervised mode. We initially train a surrogate model on the dataset. Subsequently, we employ it to calculate sample-wise scores, such as the loss value (Coleman et al., 2019) and the gradient norm (Paul et al., 2018). Additionally, we use the activations before the classification layer as features $F \in \mathbb{R}^{N \times d}$, where $d$ is the feature dimension, and utilize the inner product as the pairwise feature similarity, that is, $K = F F^\top$.
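As an illustrative sketch of this construction and of evaluating the Eq. (2) objective for a candidate selection (the feature matrix and per-sample scores from a surrogate model are assumed given; the function names are ours):

```python
import numpy as np

def build_objective_terms(features, scores):
    """Build the terms of Eq. (2): intra-sample information I and similarity K.

    features: (N, d) penultimate-layer features of the surrogate model.
    scores:   (N,) sample-wise scores (e.g. loss values) used as I(z).
    """
    K = features @ features.T   # inner-product pairwise similarity
    I = np.asarray(scores)      # intra-sample information vector
    return I, K

def infomax_objective(I, K, X, alpha):
    """Objective value of Eq. (2) for a binary selection vector X."""
    return X @ I - alpha * (X @ K @ X)

# toy example: samples 0 and 1 are duplicates, sample 2 is distinct
F = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
I, K = build_objective_terms(F, [1.0, 2.0, 3.0])
# selecting the two non-redundant samples scores higher than the duplicates
a = infomax_objective(I, K, np.array([0.0, 1.0, 1.0]), alpha=0.5)  # 2+3 - 0.5*2 = 4.0
b = infomax_objective(I, K, np.array([1.0, 1.0, 0.0]), alpha=0.5)  # 1+2 - 0.5*4 = 1.0
```

The second selection is penalized because the duplicated pair contributes a large quadratic (redundancy) term, which is exactly the behavior the formulation is designed to capture.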
However, the supervised mode is somewhat unfriendly in practice, as training an additional model incurs a non-negligible expense, especially on large-scale datasets.

The unsupervised mode. We extract features using existing open-source models such as DINO (Oquab et al., 2023), and again use the inner product as the pairwise feature similarity. For the intra-sample information, we employ the SSP score (Sorscher et al., 2022). Specifically, it entails first conducting clustering in the feature space; the distance between a sample and its corresponding cluster center constitutes the SSP score.

Analysis. If we preserve only the first-order term by setting $\alpha$ to a value near zero, InfoMax degenerates into a vanilla score-based scheme: the samples with the highest scores are selected. If we discard the first-order term by setting $\alpha$ to a very large value, it degenerates into the problem of finding a coreset $S \subset D$ that minimizes $\sum_{z, s \in S} K_{z,s}$; in other words, finding a subset with the minimum similarity within the set. This is a variant of the classic k-Median problem (Lloyd, 1982; Tan et al., 2006; Har-Peled & Mazumdar, 2004), which minimizes $\mathrm{Cost}(S, D) = \sum_{z \in D \setminus S} \min_{s \in S} d(z, s)$ if we set the distance measurement as $d(z, s) = 1 - K_{z,s}$.

3.2 SCALABLE INFOMAX SOLVER

Solving the quadratic problem defined in Eq. (2) directly can be extremely computationally burdensome and may even prove intractable, since the budget size $p$ is generally on the order of tens of thousands or even millions (Krähenbühl & Koltun, 2011; Larsson et al., 2018). Here, we propose a continuous relaxation of the problem, enabling the use of a gradient-based solver for more efficient optimization. First, we perform a convex relaxation of the feasible domain of the original optimization problem by relaxing the binary constraint $X \in \{0, 1\}^N$ to the continuous version $X \in [0, 1]^N$.
Then, we derive a solver based on the proximal gradient method (Baqué et al., 2016; Tan et al., 2021). This solver decomposes the original complex and non-convex continuous problem into a series of simple and convex sub-problems; see Appendix D for details. Each sub-problem has an analytic solution that serves as the update rule of our InfoMax solver:

$$X_{t+1} = \mathrm{Softmax}\big( p \, (I - 2 p \alpha K X_t) \big), \qquad (3)$$

where $I \in \mathbb{R}^N$ is the vectorized $I(z)$ with $I_z = I(z)$, and $K \in \mathbb{R}^{N \times N}$ is the similarity matrix. Softmax maps a real-valued vector to a non-negative vector summing to 1; specifically, $\mathrm{Softmax}(X)_i = \exp(X_i) / \sum_j \exp(X_j)$. After $T$ iterations, we obtain the final solution $X_T$, of which each item represents the probability that the corresponding sample should be selected. Finally, we take the coreset corresponding to the top-$p$ largest values.

We have summarized InfoMax in Algorithm 1. First, the algorithm requires as input the intra-sample information vector $I \in \mathbb{R}^N$ and the pairwise feature similarity $K \in \mathbb{R}^{N \times N}$. Then, it iteratively executes the update rule defined in Eq. (3) for $T$ iterations. Next, we present additional design enhancements to further improve efficiency.

Efficiency Enhancement Techniques. Naively applying the InfoMax solver in Eq. (3) to the original dataset leads to a quadratic complexity of $O(N^2)$, where $N$ is the number of samples to be processed. This complexity is rather high when dealing with large-scale datasets. Here, we introduce two practical techniques to boost the efficiency of InfoMax.

Dataset partition: Before executing the InfoMax algorithm, we divide the original dataset into $d$ smaller random subsets and then conduct pruning on each subset independently. With this scheme, the complexity of the algorithm on each subset is significantly reduced to $O(N^2 / d^2)$. At the same time, pruning for each subset can be carried out simultaneously on multiple computing devices, further reducing the time consumption.
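The update rule in Eq. (3) can be sketched as follows; this is an illustrative reading, and the uniform initialization and top-$p$ extraction are our assumptions about details not restated here:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # shift by the max for numerical stability
    return e / e.sum()

def infomax_solve(I, K, p, alpha=0.3, T=20):
    """Iterative InfoMax solver sketch: return indices of the top-p entries of X_T.

    I: (N,) intra-sample information; K: (N, N) pairwise similarity; p: budget.
    """
    X = np.full(len(I), 1.0 / len(I))  # uniform initialization (assumption)
    for _ in range(T):
        X = softmax(p * (I - 2.0 * p * alpha * (K @ X)))  # Eq. (3)
    return np.argsort(-X)[:p]
```

With $\alpha = 0$ the update preserves the ordering of $I$, recovering score-based top-$p$ selection; a larger $\alpha$ pushes probability mass away from mutually similar samples via the $KX_t$ term.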
Sparsification: Since the information overlap between distant samples is minimal, we further improve computational efficiency by sparsifying the similarity matrix $K$, retaining only the top-$k$ values per row as non-zero and setting the rest to zero. That is, we only take into account the similarity between each sample and its $k$ nearest neighbors (for instance, $k = 5$). As a result, $K$ has only $Nk$ non-zero elements. With the sparsification technique, the complexity of the algorithm on each subset is further reduced to $O(Nk / d^2)$. Note that the partition factor $d$ and the sparsification rate $k$ are two hyperparameters that determine the trade-off between efficiency and performance. In our experiments (Sec. 4.4), we study the effects of these two hyperparameters.

3.3 HOW DOES INFOMAX FIND THE MOST INFORMATIVE CORESET?

We explain why InfoMax can find the most informative coreset from the perspective of information theory. First, we restate the information maximization formulation of data pruning defined in Eq. (1), $S^* = \arg\max_{S \subset D, |S| = p} I(S)$, where $p$ is the size of the target coreset and $I(S)$ measures the set-level information of the candidate subset $S$. By the chain rule,

$$I(S) = I(z_1) + I(z_2 \mid z_1) + \ldots + I(z_p \mid z_1, \ldots, z_{p-1}) = I(z_1) + \sum_{2 \le k \le p} I(z_k \mid z_1, \ldots, z_{k-1}), \qquad (4)$$

where $\{z_1, \ldots, z_p\} = S$ are all samples from the set. Note that this equation always holds regardless of the order of samples. The intra-sample information $I(z_1)$ of the sample $z_1$ can be instantiated by various sample-wise scores (Sorscher et al., 2022; Coleman et al., 2019; Tan et al., 2023; Paul et al., 2018). Regarding the conditional gain term $I(z \mid Z)$, we refer to recent progress in submodular information measures (SIM) (Wei et al., 2015; Kaushal et al., 2021; Kothawade et al., 2021), which presents several instantiations of the submodular conditional gain.
Here, we select the simplest yet effective instantiation, the Graph Cut conditional gain (GCCG):

$$I(z_k \mid z_1, \ldots, z_{k-1}) = f(z_k) - 2\lambda \sum_{i < k} K_{z_i, z_k}.$$

It measures the dissimilarity between the sample $z_k$ and all conditional samples. Specifically, $K_{z_i, z_k}$ measures the similarity between the samples $z_i$ and $z_k$, and $\lambda$ is an undetermined coefficient, a hyperparameter of the system. The submodular function $f$ maps from a ground set to a real value, and we can simply set $f(z) = I(z)$. Hence, we have the following instantiation for the conditional gain term: $I(z_k \mid z_1, \ldots, z_{k-1}) = I(z_k) - 2\lambda \sum_{i < k} K_{z_i, z_k}$. Substituting it into Eq. (4), we can instantiate and reformulate the set-level information as $I(S) = \sum_{z \in S} I(z) - 2\lambda \sum_{\{z, s\} \subset S} K_{z,s}$. We introduce a set of binary variables $X \in \{0, 1\}^N$, where $N = |D|$ is the size of the whole training set. In the selection procedure, $X_z = 1$ indicates that the sample $z$ is selected; otherwise it is pruned. By the problem definition in Eq. (1) and the set-level information formulation in Eq. (4), we can transform the original information maximization problem in Eq. (1) into the following combinatorial optimization problem:

$$\max_{X \in \{0,1\}^N,\, |X| = p} \; \sum_{z \in D} I(z) X_z \;-\; \lambda (p-1) \sum_{z \ne s \in D} K_{z,s} X_z X_s, \qquad (5)$$

where $\lambda$ is a flexible hyperparameter to be determined in Eq. (4). Hence, if we set $\alpha = \lambda (p-1)$, we obtain exactly the quadratic programming problem defined in InfoMax. Therefore, we have shown that, under the premise of using the Graph Cut conditional gain (GCCG) instantiation for the conditional gain term, solving the data pruning problem in Eq. (1) to find the most informative subset is equivalent to solving the quadratic problem defined in Eq. (2). See Appendix D.2 for the full proof.

4 EXPERIMENTS

We carried out extensive experiments on three tasks, namely image classification, multi-modality pre-training, and instruction tuning for large language models (LLMs), to investigate the performance of our InfoMax.
Subsequently, we conducted ablation studies to explore the component design within InfoMax. Each result of InfoMax is the average of five independent runs; the standard deviation corresponding to each result is less than 0.85.

4.1 IMAGE CLASSIFICATION

The image classification task encompasses experiments on three datasets, namely CIFAR-10, CIFAR-100 (Krizhevsky, 2009), and ImageNet-1K (Russakovsky et al., 2015). Following coreset selection, we train a model on the chosen subset and examine its accuracy as the performance of the coreset. The model employed here is ResNet-18 for CIFAR and ResNet-34 for ImageNet. For InfoMax, the dataset partition scheme is not employed here, as the dataset scale is not large. The sparsification rate $k$ is set to 5, the pairwise term weight $\alpha$ is set to 0.3, and the iteration number $T$ is set to 20. For the detailed experimental settings, please refer to the Appendix.

For the supervised setting, we compare InfoMax with several baselines: 1) Random selection of examples. 2) Entropy (Coleman et al., 2019) of the model's prediction. 3) Forgetting (Toneva et al., 2018) score for each example, i.e., the number of times a model predicts the example incorrectly after having predicted it correctly in the previous epoch. 4) EL2N (Paul et al., 2018), the L2 norm of error vectors. 5) Moderate coresets (Xia et al., 2023), which select samples at about the median distance from the class center. 6) CCS (Zheng et al., 2022), which divides a range of difficulty scores into equal-sized bins and randomly samples from each bin. 7) D2-Pruning (Maharana et al., 2023), which selects samples by performing message passing on the data graph. 8) K-center (Sener & Savarese, 2017), the standard geometry-based coreset method. The scoring model for each method is a ResNet model trained on the target dataset.
For the unsupervised setting, we select the following baselines: 1) SSP (Sorscher et al., 2022), which uses self-supervised embeddings to compute k-means clusters and treats samples at a farther distance from the cluster center as more important; 2) CCS over SSP scores; and 3) D2-Pruning over SSP scores. The unsupervised features used here are from the officially released DINO-ViT-Base-V2 model (Oquab et al., 2023).

Table 1: A comparative analysis of the performance (Acc@1) of InfoMax on image classification datasets with ResNet-18 (on CIFAR) and ResNet-34 (on ImageNet), across selection rates.

CIFAR-10:

| Method | 100% | 70% | 50% | 30% | 20% | 10% |
|---|---|---|---|---|---|---|
| Random | 95.5 | 94.3 | 93.4 | 90.9 | 88.0 | 79.0 |
| Entropy | - | 94.8 | 92.9 | 90.1 | 84.1 | 72.1 |
| K-center | - | 94.1 | 92.5 | 91.0 | 82.5 | 68.4 |
| Forgetting | - | 95.7 | 94.9 | 88.1 | 73.8 | 46.3 |
| EL2N | - | 95.4 | 94.8 | 89.2 | 78.6 | 30.3 |
| SSP | - | 93.8 | 93.1 | 81.2 | 64.4 | 42.9 |
| Moderate | - | 93.9 | 92.6 | 90.6 | 87.3 | 81.0 |
| CCS (unsupervised) | - | 95.2 | 93.4 | 90.6 | 85.5 | 80.6 |
| CCS | - | 95.4 | 95.0 | 93.0 | 91.0 | 86.9 |
| D2-Pruning (unsupervised) | - | 94.4 | 94.2 | 87.6 | 86.1 | 83.9 |
| D2-Pruning | - | 95.7 | 94.9 | 93.3 | 91.4 | 87.1 |
| InfoMax (unsupervised) | - | 94.9 | 93.6 | 92.2 | 90.1 | 87.0 |
| InfoMax | - | 95.5 | 94.7 | 94.1 | 92.7 | 89.1 |

CIFAR-100:

| Method | 100% | 70% | 50% | 30% | 20% | 10% |
|---|---|---|---|---|---|---|
| Random | 78.7 | 74.6 | 71.1 | 65.3 | 57.4 | 44.8 |
| Entropy | - | 74.7 | 68.9 | 60.3 | 49.6 | 35.0 |
| K-center | - | 73.4 | 69.5 | 61.4 | 47.2 | 40.5 |
| Forgetting | - | 76.0 | 68.1 | 49.3 | 30.3 | 20.6 |
| EL2N | - | 75.6 | 68.1 | 47.2 | 24.8 | 11.8 |
| SSP | - | 76.5 | 66.4 | 50.3 | 27.9 | 19.4 |
| Moderate | - | 74.6 | 71.1 | 65.3 | 58.5 | 45.5 |
| CCS (unsupervised) | - | 76.4 | 71.8 | 61.2 | 37.5 | 25.4 |
| CCS | - | 77.1 | 74.4 | 68.9 | 64.0 | 57.3 |
| D2-Pruning (unsupervised) | - | 77.8 | 70.0 | 66.6 | 62.3 | 51.7 |
| D2-Pruning | - | 78.2 | 75.9 | 70.5 | 65.2 | 56.9 |
| InfoMax (unsupervised) | - | 77.9 | 74.6 | 70.1 | 64.7 | 56.2 |
| InfoMax | - | 79.0 | 76.7 | 71.5 | 67.9 | 58.7 |

ImageNet-1K:

| Method | 100% | 70% | 50% | 30% | 20% | 10% |
|---|---|---|---|---|---|---|
| Random | 73.1 | 72.2 | 70.3 | 66.7 | 62.5 | 52.3 |
| Entropy | - | 72.3 | 70.8 | 64.0 | 55.8 | 39.0 |
| K-center | - | 73.0 | 71.4 | 64.8 | 53.1 | 42.0 |
| Forgetting | - | 72.6 | 70.9 | 66.5 | 62.9 | 52.3 |
| EL2N | - | 72.2 | 67.2 | 48.8 | 31.2 | 12.9 |
| SSP | - | 71.1 | 68.5 | 52.7 | 40.3 | 20.5 |
| Moderate | - | 72.0 | 70.3 | 65.9 | 61.3 | 52.1 |
| CCS (unsupervised) | - | 71.6 | 69.5 | 62.1 | 47.4 | 30.2 |
| CCS | - | 72.3 | 70.5 | 67.8 | 64.5 | 57.3 |
| D2-Pruning (unsupervised) | - | 72.6 | 69.4 | 66.1 | 60.9 | 52.5 |
| D2-Pruning | - | 72.9 | 71.8 | 68.1 | 65.9 | 55.6 |
| InfoMax (unsupervised) | - | 72.8 | 70.5 | 68.0 | 64.9 | 56.8 |
| InfoMax | - | 73.3 | 72.8 | 69.4 | 66.5 | 59.0 |

Table 1 presents the results on the three image classification datasets, comparing the accuracy of InfoMax with several baselines. First, consider the supervised setting. On CIFAR-10, at various pruning rates (30%-90%), InfoMax outperforms the other methods in most settings. For instance, at a 90% pruning rate, InfoMax achieves an accuracy of 89.1, outperforming other methods (such as D2-Pruning) by 2.0% in accuracy. On CIFAR-100 and ImageNet, InfoMax attains higher accuracy than the other methods across different pruning rates, and this advantage becomes more pronounced at high pruning ratios. For example, at a 90% pruning rate, InfoMax surpasses the second-ranked CCS by 1.4% and 1.7% on the two datasets respectively, and outperforms the third-ranked D2-Pruning by 1.8% and 3.4% respectively.

Next, we highlight the performance of InfoMax in the unsupervised setting, where it steadily exceeds the other unsupervised schemes. Even at high pruning ratios, it approaches and can even outperform most supervised schemes. For example, at a 90% pruning rate, unsupervised InfoMax on CIFAR-10 lags only 0.1% behind supervised D2-Pruning, while unsupervised InfoMax on ImageNet leads supervised D2-Pruning by 1.2%. This showcases that InfoMax exhibits superior performance and notable effectiveness across different settings.

4.2 MULTI-MODALITY PRE-TRAINING

For multi-modality pre-training tasks, we conducted experiments on the popular vision-language dataset CC12M (Changpinyo et al., 2021), which contains 12 million image-text pairs from the Internet, for CLIP-like vision-language pre-training (Radford et al., 2021).
A common practice (Schuhmann et al., 2022; Gadre et al., 2024) for selecting a coreset from vision-language datasets is to use the pre-trained CLIP model (Radford et al., 2021) to score each image-text pair, where a higher CLIP score indicates better image-text alignment. Hence, we set this as the most basic baseline, termed 1) CLIP score. Additionally, we select the following baselines: 2) Clustering + CLIP: since the CLIP score only reflects the matching degree between an image and its text, it cannot capture the redundancy among samples; some schemes (Li et al., 2022; 2023) therefore combine clustering with CLIP scores, first clustering in the feature space and then selecting a portion of high-scoring samples within each cluster. 3) Moderate coreset (Xia et al., 2023) over CLIP scores: selecting the samples with median scores, since the highest-scoring samples may be overly easy. 4) CCS over CLIP scores: following the implementation in Maharana et al. (2023), the range of CLIP scores is divided into equal-sized bins and samples are drawn randomly from each bin. 5) D2-Pruning over CLIP scores: it constructs a sample graph using the CLIP score as the node value and image features from the CLIP vision encoder to compute the edge values (similarities). For our InfoMax approach, we likewise designate the CLIP score as the intra-sample information measurement. For the pairwise similarity, we employ the inner product between sample features from the CLIP vision encoder. The dataset partition factor d is set to 10; that is, we randomly partition the 12M data into 10 subsets, each of size 1.2M. The sparse rate k is set to 5, the pairwise term weight α is set to 0.3, and the iteration number T is set to 20.
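The objective construction described above (CLIP scores as per-sample information, sparsified inner products of vision features as pairwise similarity) can be sketched as follows. This is a minimal illustration with hypothetical function and variable names, not the authors' implementation; it assumes L2-normalized CLIP embeddings as input.

```python
import numpy as np

def build_infomax_objective(img_feats, txt_feats, k=5):
    """Build the per-sample score vector I and a sparsified similarity
    matrix K (illustrative sketch of the CC12M setup described above).
    img_feats, txt_feats: (N, D) L2-normalized embeddings."""
    # Intra-sample information: CLIP score = image-text inner product.
    I = np.sum(img_feats * txt_feats, axis=1)          # shape (N,)
    # Pairwise redundancy: inner products between image features.
    K = img_feats @ img_feats.T                        # shape (N, N)
    np.fill_diagonal(K, -np.inf)                       # exclude self-matches
    # Sparsify: keep an entry if it is among either endpoint's k nearest neighbors.
    rows = np.arange(K.shape[0])[:, None]
    topk = np.argsort(-K, axis=1)[:, :k]
    mask = np.zeros(K.shape, dtype=bool)
    mask[rows, topk] = True
    mask |= mask.T                                     # symmetrize the support
    return I, np.where(mask, K, 0.0)
```

With k = 5, each row of K retains only a handful of nonzero entries, which keeps the quadratic term cheap to evaluate at CC12M scale.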
After data selection, we perform CLIP pre-training on the coreset and evaluate the trained model on four downstream tasks: Zero-shot ImageNet-1K Classification, Linear Probing ImageNet-1K Classification, Image-to-Text (I2T) Flickr30K (Plummer et al., 2015) Retrieval, and Text-to-Image (T2I) Flickr30K Retrieval. For the detailed experimental settings and results along with standard errors, please refer to the Appendix.

Figure 5: Experimental results of coreset selection on CC12M (Changpinyo et al., 2021) for multi-modality (vision-language) pre-training with the CLIP-ViT-B/32 (Radford et al., 2021) model (panels: Zero-shot ImageNet-1K, I2T Flickr30K, T2I Flickr30K, Linear-probe ImageNet-1K; x-axis: selection rate).

The experimental results are presented in Figure 5. InfoMax consistently outperforms all baselines across all selection ratios. This advantage is particularly prominent when the selection ratio is small. For instance, at a 10% selection ratio, InfoMax surpasses D2-Pruning by 1.5%, 1.3%, 2.7%, and 1.0% respectively on the four downstream tasks, and exceeds the widely-used Clustering + CLIP selection method by 2.1%, 1.9%, 4.0%, and 1.8%. This demonstrates the superiority and effectiveness of InfoMax across scenarios and selection conditions, highlighting its significance in data-centric research.
4.3 INSTRUCTION TUNING FOR LARGE LANGUAGE MODELS

Recently, some works (Xia et al., 2024; Maharana et al., 2023) have demonstrated significant redundancy in language datasets. For instance, LESS (Xia et al., 2024) reduced the size of instruction tuning datasets to merely 5% of their original amount. The core of LESS is the influence score (Pruthi et al., 2020; Tan et al., 2023), which gauges how a particular data point impacts the performance of the learned model on the validation set. Following Xia et al. (2024), we also conduct coreset selection on a mixed dataset containing data from FLAN-V2 (Longpre et al., 2023), COT (Wei et al., 2022), DOLLY (Conover et al., 2023), and OPEN-ASSISTANT-1 (Köpf et al., 2023). The mixed training set, consisting of approximately 270K data points, exhibits wide variation in format and underlying reasoning tasks. After the selection process, we conduct LoRA fine-tuning (Hu et al., 2021) on the selected coreset for a LLaMA2-7B model (Touvron et al., 2023). For D2-Pruning (Maharana et al., 2023) and our InfoMax, we employ the LESS score as the intra-sample measurement and utilize the gradient of each sample as its feature. Additionally, we select Moderate (Xia et al., 2023) over LESS and CCS (Zheng et al., 2022) over LESS as baselines.

Figure 6: A comparative analysis of the performance of InfoMax on instruction tuning datasets for the LLaMA2-7B (Touvron et al., 2023) model (panels: MMLU 5-shot accuracy, TYDIQA 1-shot F1 score, BBH 3-shot exact match score; x-axis: selection rate).
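The gradient-based scoring used in this setup can be illustrated with a simplified sketch: each training sample is scored by the alignment between its (randomly projected) gradient and an aggregate gradient of the target set. The function and variable names below are hypothetical, and this captures only the spirit of the LESS estimator; see Xia et al. (2024) for the actual formulation.

```python
import numpy as np

def gradient_alignment_scores(sample_grads, target_grad):
    """Score each training sample by the cosine similarity between its
    projected gradient and the target-set gradient. Higher scores mean
    the sample's update direction helps the target task.
    Simplified sketch in the spirit of LESS, not the exact estimator."""
    g = sample_grads / np.linalg.norm(sample_grads, axis=1, keepdims=True)
    t = target_grad / np.linalg.norm(target_grad)
    return g @ t
```

Samples whose gradients point with, against, or orthogonal to the target gradient receive scores near +1, -1, and 0 respectively, which is what makes this score usable as the intra-sample term in InfoMax.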
After the training process, we assess our method on three widely recognized benchmarks: 1) MMLU (Hendrycks et al., 2020), which covers 57 knowledge areas such as mathematics and computer science; 2) TYDIQA (Clark et al., 2020), a multilingual question-answering dataset covering 9 languages; and 3) BBH (Suzgun et al., 2023), which encompasses 27 hard tasks assessing whether a model can handle complex reasoning. The experimental results are presented in Figure 6. InfoMax demonstrates significant performance advantages over the competing methods, particularly at small selection rates (≤ 5%). On MMLU, it shows notable improvements over the second-ranked method, D2-Pruning: at the 95% and 99% pruning rates, InfoMax achieves accuracies of 51.8 and 50.2, about 2% higher than D2-Pruning. Similar trends are observed on TYDIQA, where InfoMax outperforms LESS (the second-ranked approach) at the 90%, 95%, and 99% pruning rates by about 1.1 to 1.5 percentage points. On BBH, the superiority of InfoMax is even more pronounced: at a 95% pruning rate, it reaches 47.2 versus D2-Pruning's 41.7, a substantial improvement of about 5.5 percentage points. Overall, InfoMax consistently shows better results, highlighting its effectiveness even when a large portion of the data is pruned.

4.4 ANALYSIS & DISCUSSION

Here, we investigate four hyperparameters of InfoMax: the dataset partition rate d, the sparse rate k, the weight α of the pairwise term, and the iteration number T. In Appendix Sec. B, we study the time cost and generalization ability of InfoMax. (a) Partition strategy. The experiments on CC3M/CC12M (Changpinyo et al., 2021) in Figure 7(a) study the effect of the dataset partition strategy.
It suggests that decreasing the partitioned subset size inevitably reduces the performance of the coreset, with a pronounced drop once each subset holds fewer than approximately 1M samples. Consequently, for relatively large-scale datasets, we recommend keeping each subset at a size of at least 1M samples. (b) Sparse rate k: The sparse rate k is the neighborhood size of each sample (see Sec. 3.2); a smaller k implies more zero entries in the similarity matrix K. Figure 7(b) shows that the performance of InfoMax is fairly robust to the choice of k, so we set k = 5 for better efficiency in all experiments. (c) Pairwise weight α: The pairwise weight α controls the trade-off between the first-order and second-order terms in the InfoMax formulation of Eq. (2). As shown in Figure 7(c), the optimal range of α varies across selection ratios, but values around 0.3 to 0.4 yield satisfactory results in all cases; we therefore recommend setting α = 0.3. (d) Iteration number T: Figure 7(d) reveals that the performance of InfoMax rises as T increases, with the gain saturating after T > 20. Hence, we suggest setting T = 20 to balance good performance and efficiency.

Figure 7: Ablation study of the hyperparameters in InfoMax: the dataset partition strategy (the size of each subset), the sparse rate k, the weight α of the pairwise term, and the iteration number T of the InfoMax solver. Experiments in (a) are conducted on CC3M & CC12M (Changpinyo et al., 2021); all experiments in (b, c, d) are conducted on CC12M. The model is CLIP-ViT-B/32 (Radford et al., 2021), and the reported metric is the accuracy of the coreset-trained model on the zero-shot ImageNet-1K classification task.
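The solver whose iteration number T and pairwise weight α are ablated above can be sketched as a softmax fixed-point iteration on the relaxed quadratic objective max_X I^T X − α X^T K X, consistent with the update in Algorithm 1. The function names and the toy numbers below are illustrative, not the authors' code.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())                       # shift for numerical stability
    return e / e.sum()

def infomax_solver(I, K, alpha=0.3, T=20):
    """Relaxed solver for max_X  I^T X - alpha * X^T K X.
    I: (N,) per-sample information scores; K: (N, N) sparse similarity.
    Returns a soft selection vector; samples are then ranked by its
    entries. Sketch of the iterative update, not the official code."""
    X = np.full(I.shape[0], 1.0 / I.shape[0])     # uniform initialization
    for _ in range(T):
        X = softmax(I - 2.0 * alpha * (K @ X))    # gradient-style update
    return X

# Toy example: samples 0 and 1 have high scores but are near-duplicates,
# while sample 2 has a lower score but is unique.
I = np.array([1.0, 1.0, 0.8])
K = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
X = infomax_solver(I, K, alpha=1.0, T=50)
# The redundancy penalty lets the unique sample 2 outrank the duplicate pair.
```

The quadratic penalty is what differentiates this from pure score ranking: the duplicated pair suppresses itself through the K @ X term, so a lower-scored but non-redundant sample can rise to the top.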
5 CONCLUSION

This paper has presented InfoMax, a novel and effective data pruning method. InfoMax is formulated as a quadratic optimization problem that maximizes sample-wise informativeness while minimizing the redundancy among selected samples. Furthermore, an efficient gradient-based solver, together with sparsification techniques and dataset partitioning strategies, is introduced to ensure scalability, enabling InfoMax to handle datasets with millions of samples within tens of minutes. Overall, InfoMax shows great potential for improving the efficiency and performance of data processing in a wide range of applications.

ACKNOWLEDGMENT

This work has been supported in part by the Hong Kong Research Grant Council - Early Career Scheme (Grant No. 27209621), General Research Fund Scheme (Grant No. 17202422, 17212923), Theme-based Research (Grant No. T45-701/22-R), and the Shenzhen Science and Technology Innovation Commission (SGDX20220530111405040). Part of the described research work was conducted in the JC STEM Lab of Robotics for Soft Materials funded by The Hong Kong Jockey Club Charities Trust.

REFERENCES

Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations, 2020.
Pierre Baqué, Timur Bagautdinov, François Fleuret, and Pascal Fua. Principled parallel mean-field inference for discrete random fields. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5848-5857, 2016.
Tom B. Brown. Language models are few-shot learners. 2020.
Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, and Yi Ma. ReduNet: A white-box deep network from the principle of maximizing rate reduction. The Journal of Machine Learning Research, 23(1):4907-5009, 2022.
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3558-3568, 2021.
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 2020.
Adam Coates and Andrew Y. Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade - Second Edition, pp. 561-580. 2012.
Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2019.
Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free Dolly: Introducing the world's first truly open instruction-tuned LLM, 2023.
Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020.
Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 569-578. ACM, 2011.
Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434-1453. SIAM, 2013.
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. DataComp: In search of the next generation of multimodal datasets.
Advances in Neural Information Processing Systems, 36, 2024.
Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 291-300, 2004.
Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 836-841, 2007.
Muyang He, Shuo Yang, Tiejun Huang, and Bo Zhao. Large-scale dataset pruning with dynamic uncertainty. In Computer Vision and Pattern Recognition Workshops, 2024.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Wenyu Jiang, Hao Cheng, Ming Cai Chen, Chongjun Wang, and Hongxin Wei. DOS: Diverse outlier sampling for out-of-distribution detection. In The Twelfth International Conference on Learning Representations, 2024.
Vishal Kaushal, Suraj Kothawade, Ganesh Ramakrishnan, Jeff Bilmes, and Rishabh Iyer. Prism: A unified framework of parameterized submodular information measures for targeted data subset selection and summarization. 2021.
Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer.
Grad-Match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, pp. 5464-5474. PMLR, 2021.
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. pp. 4015-4026, 2023.
Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. OpenAssistant conversations: democratizing large language model alignment. 2023.
Suraj Kothawade, Nathan Beck, Krishnateja Killamsetty, and Rishabh Iyer. Similar: Submodular information measures based active learning in realistic scenarios. volume 34, pp. 18685-18697, 2021.
Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. Advances in Neural Information Processing Systems, 24, 2011.
Kristof Meding, Luca M. Schulze Buschoff, Robert Geirhos, and Felix A. Wichmann. Trivial or impossible: dichotomous data difficulty masks model differences (on ImageNet and beyond). In International Conference on Learning Representations, 2022.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
Måns Larsson, Anurag Arnab, Fredrik Kahl, Shuai Zheng, and Philip Torr. A projected gradient descent method for CRF inference allowing end-to-end training of arbitrary pairwise potentials. In Marcello Pelillo and Edwin Hancock (eds.), Energy Minimization Methods in Computer Vision and Pattern Recognition, 2018.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
Stuart P.
Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-136, 1982.
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, et al. The Flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.
Adyasha Maharana, Prateek Yadav, and Mohit Bansal. D2 Pruning: Message passing for balancing diversity & difficulty in data pruning. 2023.
Anas Mahmoud, Mostafa Elhoushi, Amro Abbas, Yu Yang, Newsha Ardalani, Hugh Leather, and Ari S. Morcos. Sieve: Multimodal dataset pruning using image captioning models. In Computer Vision and Pattern Recognition (CVPR), 2024.
Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pp. 6950-6960. PMLR, 2020.
Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision, 2023.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. In Advances in Neural Information Processing Systems, 2018.
Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik.
Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2641-2649, 2015.
Garima Pruthi, Frederick Liu, Sundararajan Mukund, and Satyen Kale. Estimating training data influence by tracing gradient descent. arXiv preprint arXiv:2002.08484, 2020.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. pp. 8748-8763, 2021.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. pp. 10684-10695, 2022.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211-252, 2015.
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. volume 35, pp. 25278-25294, 2022.
Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. 2017.
Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. In Advances in Neural Information Processing Systems, 2022.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003-13051, 2023.
H.-R. Tan, C. Wang, S.-T. Wu, T.-Q. Wang, X.-Y. Zhang, and C.-L. Liu. Proxy graph matching with proximal matching networks. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11):9808-9815, 2021.
Haoru Tan, Sitong Wu, Fei Du, Yukang Chen, Zhibin Wang, Fan Wang, and Xiaojuan Qi. Data pruning via moving-one-sample-out. In Advances in Neural Information Processing Systems, 2023.
Pang-Ning Tan, Michael Steinbach, Vipin Kumar, et al. Cluster analysis: basic concepts and algorithms. Introduction to Data Mining, 8:487-568, 2006.
Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2018.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. In Advances in Neural Information Processing Systems, 2020.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.
Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pp. 1954-1963. PMLR, 2015.
Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. LESS: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333, 2024.
Xiaobo Xia, Jiale Liu, Jun Yu, Xu Shen, Bo Han, and Tongliang Liu.
Moderate coreset: A universal method of data selection for real-world data-efficient deep learning. In International Conference on Learning Representations, 2023.
Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, and Ping Li. Dataset pruning: Reducing training data by examining generalization influence. In International Conference on Learning Representations, 2023.
Shuo Yang, Zhe Cao, Sheng Guo, Ruiheng Zhang, Ping Luo, Shengping Zhang, and Liqiang Nie. Mind the boundary: Coreset selection via reconstructing the decision boundary. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 55948-55960, 2024.
Yaodong Yu, Kwan Ho Ryan Chan, Chong You, Chaobing Song, and Yi Ma. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Advances in Neural Information Processing Systems, 33:9422-9434, 2020.
Yu Yu, Shahram Khadivi, and Jia Xu. Can data diversity enhance learning generalization? In Proceedings of the 29th International Conference on Computational Linguistics, pp. 4933-4945, 2022.
Haizhong Zheng, Rui Liu, Fan Lai, and Atul Prakash. Coverage-centric coreset selection for high pruning rates. 2022.

BROADER IMPACT

This paper presents a novel and effective data pruning algorithm to advance the deep learning area. There are potential positive societal effects: it can help people better understand the role of data in building more robust deep learning systems, and it may reduce training and storage costs. At the same time, with the explosive growth of data, similar data-centric schemes could also be applied to inhumane surveillance or to scenarios that violate data privacy and data copyright. We therefore believe that strict legislation is needed to prevent such misuse.

LIMITATIONS

This paper proposes an efficient and high-performance coreset selection scheme.
However, due to limitations of our experimental equipment, the largest experiments in this paper reach a data scale of 12M. In the future, we will invest in additional computing resources and explore the performance boundaries of InfoMax on larger (billion-scale) datasets.

Table 2: Comprehensive overview of the notational conventions.

| Notation | Description |
|---|---|
| D | The training set, of size N. |
| S ⊂ D | A subset of D. |
| S* ⊂ D | The optimal coreset. |
| p | The coreset budget, i.e., the target coreset size. |
| z ∈ D | A training sample. |
| I(z) | The information of sample z. |
| I(S) | The set-level information of the set S. |
| I(z given ...) | The conditional information gain of sample z. |
| I(z_1, ..., z_n) | The set-level information of the set [z_1, ..., z_n]. |
| I(z; s) | The mutual information between samples z and s. |
| I ∈ R^N | The information vector; each entry measures the informativeness of the corresponding sample, i.e., I_z = I(z). |
| K ∈ R^{N×N} | The similarity matrix measuring the similarity between samples. |
| K_{z,s} | The similarity between sample z and sample s. |
| k | The sparse rate of K. |
| d | The dataset partitioning rate. |
| α | The pairwise weight in the objective of InfoMax, see Eq. (2). |
| λ | The hyper-parameter in the Graph-Cut Conditional Gain measurement (Kothawade et al., 2021), see Eq. (15). |
| X ∈ {0, 1}^N | The binary variable in Eq. (2) (before continuous relaxation). |
| X_z | The variable entry corresponding to sample z. |
| X^t ∈ R^N | The iterate at the t-th iteration (after continuous relaxation). |
| T | The maximum iteration number of the InfoMax solver. |

Algorithm 1: InfoMax Coreset Selection.
1: Input: A dataset D with N samples, the division coefficient d, the target coreset size budget p.
2: Initialization: Set the coreset S* ← ∅; divide the dataset D into d random subsets [D_1, ..., D_d], each of size N/d.
3: Initialization: Uniformly initialize the initial guess X^0 = [1/N]^{N/d};
4: for i ∈ {1, ..., d} do
5:     // Objective construction: supervised or unsupervised mode.
6:     Calculate the information vector I and the similarity matrix K;
7:     // Iteratively run the InfoMax solver.
8:     for t ∈ {0, ..., T-1} do
9:         X^{t+1} = Softmax(I - 2αKX^t);
10:     end for
11:     Append the samples corresponding to the top-(p/d) largest entries in X^T to S*;
12: end for
13: Output: The InfoMax coreset S*.

A EXPERIMENTAL SETTINGS

A.1 IMAGE CLASSIFICATION

We utilized PyTorch (Paszke et al., 2017) to implement our method. Our experiments were conducted on a server equipped with 8 Tesla V100 GPUs. Additionally, we maintained identical hyper-parameters and experimental settings for training before and after dataset pruning. To ensure fairness, we kept the number of iterations on the selected subset and on the full set the same, following Zheng et al. (2022) and Maharana et al. (2023). For CIFAR-100, we use the SGD optimizer with weight decay 5e-4, a learning rate of 0.1, and a batch size of 128. For Tiny-ImageNet, we use the SGD optimizer with weight decay 5e-4, a learning rate of 0.3, and a batch size of 64. For ImageNet-1K, we use the SGD optimizer with weight decay 1e-4, a 5-epoch warmup, a learning rate of 0.4, and a batch size of 256. For data augmentation, we adopt only RandomResizedCrop and RandomHorizontalFlip in all experiments.

A.2 VISION-LANGUAGE PRE-TRAINING

For coreset selection on the vision-language dataset CC12M (Changpinyo et al., 2021), all experiments are conducted on 2 servers with a total of 16 NVIDIA V100 GPUs. The selected model is the CLIP model (Radford et al., 2021), and we follow the settings of the original paper. Specifically, the CLIP model is trained for 32 epochs with the AdamW optimizer, weight decay 0.2, and a batch size of 2048.
After 1 warmup epoch, the learning rate gradually decreases from 1e-4 following a cosine schedule.

Zero-shot ImageNet classification. The CLIP model has two encoders, one for text and one for images. During zero-shot classification, text descriptions corresponding to the ImageNet classes are formulated; for example, for the class "dog", a description like "a picture of a dog" might be created. These descriptions are encoded by the CLIP text encoder to obtain text embeddings, while the images from the ImageNet dataset are encoded by the CLIP image encoder to obtain image embeddings. Then, the similarity between each image embedding and all text embeddings (one per class) is computed, and the image is classified into the class whose text embedding has the highest similarity to the image embedding.

Linear probing. This technique evaluates a pre-trained model by adding a linear layer on top of the pre-trained CLIP model and training only this layer while keeping the rest of the CLIP parameters fixed.

Image-text retrieval. The goal is to find the most relevant text description for a given image, or the most relevant image for a given text description. Taking image-to-text retrieval as an example: the image is encoded with the vision encoder, yielding an image embedding that represents its visual features; the candidate texts are encoded with the text encoder to obtain their respective text embeddings; the similarity between the image embedding and all text embeddings is computed, and the text with the highest similarity score is retrieved as the relevant description for the image.
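Once embeddings are computed, the zero-shot procedure described above reduces to a similarity lookup. Below is a minimal sketch with placeholder embeddings standing in for real CLIP encoder outputs; the function name and inputs are illustrative, not CLIP's API.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Assign the image to the class whose prompt embedding
    ("a picture of a {class}") is most similar to the image embedding.
    Both inputs are assumed L2-normalized, so the inner product equals
    the cosine similarity used for the comparison."""
    sims = class_text_embs @ image_emb      # one similarity per class
    return int(np.argmax(sims))

# Placeholder embeddings: 3 classes in a toy 3-d embedding space.
class_embs = np.eye(3)                      # stand-ins for prompt embeddings
img = np.array([0.1, 0.9, 0.1])
img = img / np.linalg.norm(img)             # closest to class index 1
```

Linear probing and retrieval reuse the same embeddings: the former trains a linear head on the image embeddings, the latter replaces the per-class prompts with the candidate captions and ranks them by the same similarity.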
A.3 INSTRUCTION TUNING

The specific settings for LoRA fine-tuning are as follows: the LoRA rank is 64, bf16 precision is used, the number of epochs is 4, the LoRA target modules are q-proj, k-proj, v-proj, and o-proj, the learning rate is 1e-5, the batch size is 8, the gradient accumulation steps are 16, and the AdamW optimizer is used. These experiments are conducted on a server with 8 A100 GPUs. For the calculation of the LESS score, please refer to Xia et al. (2024) for details. Before computing the score, LESS performs LoRA fine-tuning of the LLM on the training dataset and then retains the randomly projected LoRA gradient of each sample. The LESS score reflects how well the gradient of a training sample aligns with the gradient on the target dataset of interest.

B FURTHER ANALYSIS

B.1 GENERALIZATION ABILITY TEST

Here, we demonstrate the generalization ability of InfoMax. In the cross-model generalization test, we observe that InfoMax's coreset generalizes better than the other coresets. Experimental results are reported in Table 3: InfoMax achieves the best performance in the cross-model generalization test. Notably, when using unsupervised scores (SSP (Sorscher et al., 2022)) and unsupervised features from DINO (Oquab et al., 2023), InfoMax shows better cross-model generalization than its supervised version, which also indicates the value of using unsupervised base models to extract features during data selection.

Table 3: Cross-model generalization test, including the ResNet-to-SENet and ResNet-to-EfficientNet-B0 settings. The experiments are conducted on CIFAR-10 (Krizhevsky, 2009).
InfoMax (Unsupervised) uses the unsupervised scores (SSP (Sorscher et al., 2022)) and unsupervised features from DINO (Oquab et al., 2023), while all remaining methods are based on the EL2N score (Paul et al., 2018) and a feature extractor trained on CIFAR-10.

Dataset                  ResNet              ResNet to SENet     ResNet to EfficientNet-B0
Selection Rate           50%   30%   20%     50%   30%   20%     50%   30%   20%
Random                   93.4  90.9  88.0    93.8  92.9  88.8    90.0  87.2  84.6
Moderate                 92.6  90.6  87.3    91.4  88.3  85.7    87.1  86.9  82.2
CCS                      95.0  93.0  91.0    92.7  89.4  88.1    90.5  86.7  84.1
D2-Pruning               93.3  91.4  87.1    94.3  91.4  87.1    93.3  91.4  87.1
InfoMax                  94.1  92.7  89.1    94.4  93.3  89.9    92.5  90.6  87.8
InfoMax (Unsupervised)   93.6  92.2  90.1    94.4  93.8  90.5    93.1  90.8  88.4

Furthermore, we assess cross-setting generalization by varying the score (Forgetting (Toneva et al., 2018) and Margin (Har-Peled et al., 2007)) and the type of feature (unsupervised DINO features (Oquab et al., 2023) and unsupervised VQGAN features (Esser et al., 2020)). We observe that, regardless of the configuration, InfoMax achieves markedly superior results compared to score-based approaches that use only the score, or schemes that run a K-Median coreset on features alone. This thoroughly demonstrates the cross-setting generalization ability of InfoMax.

B.2 TIME COST TEST

We study the time cost of InfoMax in Table 5. InfoMax has two main stages: the objective-construction stage, which runs inference on all data, computes the similarity matrix K, and computes the sample-wise scores I; and the optimization stage, which iteratively runs the solver defined in Eq. (3). Although InfoMax is computationally slower than score-based schemes, it is still faster than D2-Pruning

Table 4: Cross-setting generalization ability test. The experiments are conducted on CIFAR-10 (Krizhevsky, 2009).
The pruning ratio is 10%. VQGAN features are obtained from the VQGAN model (Esser et al., 2020).

Method                                                           Accuracy
Forgetting (score-based) (Toneva et al., 2018)                   46.3
Margin (score-based) (Har-Peled et al., 2007)                    34.3
Supervised feature (K-Median, diversity-based)                   38.9
DINO feature (K-Median, diversity-based) (Oquab et al., 2023)    31.7
VQGAN feature (K-Median, diversity-based) (Esser et al., 2020)   28.2
InfoMax: forgetting + supervised feature                         89.4
InfoMax: margin + supervised feature                             88.0
InfoMax: margin + DINO feature                                   85.4
InfoMax: margin + VQGAN feature                                  83.7

(Maharana et al., 2023). This is because InfoMax does not perform a greedy, per-sample selection process. For CC12M (Changpinyo et al., 2021), we divide the dataset into 10 subsets, each with approximately 1.2 million samples, and run InfoMax in multiple threads on two server nodes with eight GPUs each to process these subsets. The overall time consumption is approximately 37 minutes. Note that this is significantly longer than processing the 1 million ImageNet samples, because multiple processes contending for the shared disk I/O slow down execution. The main sources of this time consumption lie in the first stage: the inference pass over all the data and the construction of the similarity matrix K with the k-NN algorithm; disk I/O contention among multiple processes further reduces efficiency. Therefore, a more advanced k-NN algorithm and faster storage can significantly improve efficiency.

Table 5: A study on the time cost of InfoMax and two selected competitors, D2-Pruning (Maharana et al., 2023) and a score-based method (Entropy (Cody Coleman et al., 2019)), for 1 million image samples (the training set of ImageNet-1K (Russakovsky et al., 2015)).
Name                                                        Time cost (min)
Stage 1 of InfoMax                                          14.6
Stage 2 of InfoMax                                          1.7
Overall InfoMax                                             16.1
D2-Pruning (Maharana et al., 2023)                          23
Score-based method (Entropy (Cody Coleman et al., 2019))    4.1

C RELATED WORKS

Score-based methods. Score-based techniques are the most popular data selection approaches. The EL2N score (Paul et al., 2018) measures data difficulty by computing the average ℓ2-norm of the error vector over a set of networks. GraNd (Paul et al., 2018) measures importance by the expectation of the gradient norm. The Forgetting score (Toneva et al., 2018) counts how many times a model changes its prediction from correct to incorrect for each example during training. Memorization (Vitaly Feldman & Chiyuan Zhang, 2020) assigns a score to each example based on how much its presence or absence in the training set affects the model's ability to predict it correctly. Diverse ensembles (Kristof Meding et al., 2022) score each sample by how many models in a group misclassify it. (Sorscher et al., 2022) proposed using the distance between a sample and its corresponding cluster center as the importance score. The influence score (Tan et al., 2023; Xia et al., 2024) measures the sample-wise leave-one-out retraining influence on the model's performance. Score-based approaches often suffer performance problems in practice, especially when the pruning ratio is large. Some recent works try to address this problem in various ways; for example, Moderate (Xia et al., 2023) selects data points with scores close to the score median. Note that Moderate (Xia et al., 2023) can build on any selection criterion, such as the EL2N score (Paul et al., 2018). Dyn-Unc (He et al., 2024) proposed an efficient uncertainty-based score that is aware of training dynamics.
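As a concrete illustration of the EL2N score described above, a minimal sketch (the array names and shapes are our own; in the original, the probabilities come from an ensemble of briefly trained networks):

```python
import numpy as np

def el2n_scores(prob_ensemble, labels):
    """EL2N (Paul et al., 2018): average l2-norm of the error vector
    p(x) - one_hot(y) over an ensemble of networks.

    prob_ensemble: (n_models, n_samples, n_classes) softmax outputs.
    labels:        (n_samples,) integer class labels.
    """
    n_classes = prob_ensemble.shape[-1]
    one_hot = np.eye(n_classes)[labels]          # (n_samples, n_classes)
    errors = prob_ensemble - one_hot[None]       # broadcast over the model axis
    norms = np.linalg.norm(errors, axis=-1)      # (n_models, n_samples)
    return norms.mean(axis=0)                    # higher score = harder sample
```
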
Some related works also use sample-wise scores (Radford et al., 2021; Mahmoud et al., 2024) to reflect the quality of multi-modality data.

Diversity-based (geometry-based) methods. Diversity-based coreset selection is a classic computer science problem (Lloyd, 1982; Tan et al., 2006; Coates & Ng, 2012; Har-Peled & Mazumdar, 2004; Feldman & Langberg, 2011; Feldman et al., 2013; Jiang et al., 2024). These schemes aim to find a subset of data points that maximizes the diversity among the selected elements. Sener & Savarese (2017) applied greedy k-center to choose a coreset with good data coverage. Yu et al. (2020); Chan et al. (2022) proposed using the coding rate to measure diversity. Yu et al. (2022) formulates the problem of finding the most diverse subset as maximizing the dispersion or the convex-hull volume. In addition, some works prune data from the perspective of submodular theory (Wei et al., 2015; Kaushal et al., 2021; Kothawade et al., 2021) or linear programming (Yang et al., 2023) to ensure diversity.

Hybrid methods. Diversity-based methods alone can hardly perform satisfactorily because they do not consider intra-sample information. Recently, some hybrid works have introduced score-based criteria into diversity-driven pipelines to achieve better performance. CCS (Zheng et al., 2022) sets different sampling ratios for samples with different scores to enhance data coverage and balance easy and hard samples. Yang et al. (2024) introduces reconstructing the classification boundary of the original dataset as a goal and brings it into the CCS framework. BADGE (Jordan T Ash et al., 2019) is a diversity-based selection method in active learning that clusters the gradient embeddings of the current model using k-means and selects a subset from each cluster.
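The greedy k-center strategy of Sener & Savarese (2017) mentioned above can be sketched in a few lines (the feature matrix and the random seeding of the first center are illustrative assumptions):

```python
import numpy as np

def k_center_greedy(features, budget, seed=0):
    """Greedy k-center: repeatedly add the point farthest from the current
    selection, so the coreset covers the feature space well.

    features: (n, d) sample features; budget: desired coreset size.
    Returns a list of selected sample indices.
    """
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(features.shape[0]))]   # arbitrary first center
    # distance from every point to its nearest selected center so far
    dists = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(dists.argmax())                       # farthest uncovered point
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return selected
```

Because each new center is the point worst covered by the current selection, the method spreads the budget across the whole feature space rather than concentrating it in high-density regions.
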
D2-Pruning (Maharana et al., 2023) views data pruning as a node-selection problem based on message passing on a graph, where the intra-sample information serves as the node values of the sample graph. One of the core steps of D2-Pruning is the inverse message-passing operation, which iteratively performs a greedy sample-selection step: in each iteration, it selects the sample with the highest score among all unselected candidates and then reduces the scores of samples in its neighborhood, so that highly redundant parts are not selected in subsequent iterations. Among the above methods, D2-Pruning is currently the most advanced solution in terms of performance. However, due to the heuristic, greedy nature of the algorithm, the result it obtains is often suboptimal; see Figure 1 for details.

D.1 DERIVATION OF THE INFOMAX SOLVER

Firstly, we restate the quadratic optimization problem of InfoMax:
$$\max_{X \in \{0,1\}^{N}} \; \sum_{z \in D} X_z\, I(z) \;-\; \alpha \sum_{z, s \in D} K_{z,s}\, X_z X_s, \qquad \text{s.t.} \;\; \sum_{z \in D} X_z = p, \tag{6}$$
Solving the quadratic problem defined in Eq. (2) directly can be extremely computationally burdensome and may even prove intractable, since the budget size $p$ is generally on the order of tens of thousands or even millions (Krähenbühl & Koltun, 2011; Larsson et al., 2018). Continuous relaxation simplifies the problem by relaxing some of the discrete constraints to continuous ones, reducing the complexity of the search space and making the problem amenable to efficient optimization algorithms, such as gradient-based methods (Krähenbühl & Koltun, 2011; Larsson et al., 2018). Here, we also introduce a continuous relaxation of the problem, enabling the use of a gradient-based solver for more efficient optimization:
$$\max_{X \in [0,1]^{N}} \; \sum_{z \in D} X_z\, I(z) \;-\; \alpha \sum_{z, s \in D} K_{z,s}\, X_z X_s, \qquad \text{s.t.} \;\; \sum_{z \in D} X_z = p, \tag{7}$$
According to (Krähenbühl & Koltun, 2011; Larsson et al., 2018; Baqué et al., 2016; Tan et al., 2021), the continuous (complex and non-convex) problem in Eq.
(7) can be optimized by solving a series of the following (convex) sub-problems:
$$X^{t+1} = \arg\min_{X \in \mathbb{R}^{N},\; \sum_{z \in D} X_z = p} \; -X^{\top}\!\left(I - 2\alpha K X^{t}\right) + \lambda\, h(X) + \beta\, D(X, X^{t}), \tag{8}$$
where $X^{t}$ is the solution of the $t$-th sub-problem and $I - 2\alpha K X^{t}$ is the gradient of the objective in Eq. (7) at $X^{t}$. We introduce the convex entropy function $h(\cdot)$, controlled by a factor $\lambda$, as the regularization term. A large $\lambda$ makes the problem easier to solve (Krähenbühl & Koltun, 2011; Larsson et al., 2018; Baqué et al., 2016; Tan et al., 2021), but it may deviate from the original problem. The proximal term $D(X, X^{t})$ is an optional regularizer that prevents the solutions of two consecutive iterations from differing too much. Following tradition (Krähenbühl & Koltun, 2011; Larsson et al., 2018; Baqué et al., 2016; Tan et al., 2021), we take $D(X, X^{t})$ to be the Kullback-Leibler divergence between $X/p$ and $X^{t}/p$. Note that, due to the non-negativity of $X$ and the property that its sum is the fixed value $p$, $X/p$ has a probabilistic meaning: each of its elements represents the probability that the corresponding sample is selected. Differentiating this convex sub-problem, the optimal solution is identified by solving
$$0 = -\left(I - 2\alpha K X^{t}\right) + \lambda\, \nabla h(X) + \beta\, \nabla_{X} D(X, X^{t}) + \nu \mathbf{1}, \qquad \text{s.t.} \;\; X \in \mathbb{R}_{\ge 0}^{N}, \;\; \sum_{z \in D} X_z = p, \tag{9}$$
where $\nu$ is the Lagrange multiplier of the budget constraint. This has an analytic solution (Krähenbühl & Koltun, 2011; Larsson et al., 2018; Baqué et al., 2016; Tan et al., 2021):
$$\hat{X} = \exp\!\left(\frac{\beta}{1+\lambda\beta}\left(I - 2\alpha K X^{t}\right) + \frac{1}{1+\lambda\beta}\log\frac{X^{t}}{p} - \frac{\lambda\beta}{1+\lambda\beta}\right), \tag{10}$$
$$X^{t+1} = p\, \hat{X} \Big/ \sum_{z} \hat{X}_z. \tag{11}$$
Since the third term in Eq. (10) is a constant, the solution can be formulated as
$$X^{t+1} = p\, \mathrm{Softmax}\!\left(\frac{\beta}{1+\lambda\beta}\left(I - 2\alpha K X^{t}\right) + \frac{1}{1+\lambda\beta}\log\frac{X^{t}}{p}\right). \tag{12}$$
The mapping by the exponential function followed by summation normalization is exactly the Softmax operator widely used in deep learning. By setting $\lambda = 1$ and letting $\beta \to \infty$, we obtain the most simplified iterative solution:
$$X^{t+1} = p\, \mathrm{Softmax}\!\left(I - 2\alpha K X^{t}\right). \tag{13}$$
Note that the convergence rate of this solver is quite fast. In particular, the norm of the iterate difference $|X^{t+1} - X^{t}|$ converges at a rate of $O(\frac{1}{T})$ according to (Baqué et al., 2016; Tan et al., 2021).

D.2 PROOF: HOW DOES INFOMAX FIND THE MOST INFORMATIVE CORESET?

We explain why InfoMax can find the most informative coreset from the perspective of information theory. First, we restate the information-maximization formulation of data pruning defined in Eq. (1), $S^{*} = \arg\max_{S \subset D,\, |S| = p} I(S)$, where $p$ is the size of the target coreset. Here, $I(S)$ measures the set-level information of the candidate subset $S$; specifically,
$$I(S) = I(z_1) + I(z_2 \mid z_1) + \cdots + I(z_p \mid z_1, \ldots, z_{p-1}) = I(z_1) + \sum_{2 \le k \le p} I(z_k \mid z_1, \ldots, z_{k-1}), \tag{14}$$
where $\{z_1, \ldots, z_p\} = S$ are all the samples of the set. Note that this equation holds regardless of the order of the samples. The intra-sample information $I(z_1)$ of the sample $z_1$ can be instantiated by various sample-wise scores (Sorscher et al., 2022; Cody Coleman et al., 2019; Tan et al., 2023; Paul et al., 2018). For the conditional gain term $I(z \mid Z)$, we refer to recent progress on submodular information measures (SIM) (Wei et al., 2015; Kaushal et al., 2021; Kothawade et al., 2021), which present several instantiations of the submodular conditional gain. Here, we select the simplest yet effective instantiation, the Graph-Cut conditional gain (GCCG),
$$I(z_k \mid z_1, \ldots, z_{k-1}) = f(z_k) - 2\lambda \sum_{i < k} K_{z_i, z_k}. \tag{15}$$
It measures the dissimilarity between the sample $z_k$ and all conditioning samples. Specifically, $K_{z_i, z_k}$ measures the similarity between samples $z_i$ and $z_k$, and $\lambda$ is an undetermined coefficient, a hyperparameter of the system. The submodular function $f$ maps from a ground set to a real value, and we can simply set $f(z) = I(z)$. Hence, we have the following instantiation of the conditional gain term: $I(z_k \mid z_1, \ldots, z_{k-1}) = I(z_k) - 2\lambda \sum_{i < k} K_{z_i, z_k}$. By substituting it into Eq.
(14), we can instantiate and reformulate the set-level information as:
$$I(S) = \sum_{z \in S} I(z) - 2\lambda \sum_{\substack{z, s \in S \\ z < s}} K_{z,s}. \tag{16}$$
We introduce a set of binary variables $X \in \{0,1\}^{N}$, where $N = |D|$ is the size of the whole training set. In the selection procedure, $X_z = 1$ indicates that the sample $z$ is selected; otherwise, it is pruned. By the problem definition in Eq. (1) and the set-level information formulation in Eq. (14), we can transform the original information-maximization problem in Eq. (1) into the combinatorial optimization problem
$$\max_{X \in \{0,1\}^{N},\; \sum_z X_z = p} \;\; \sum_{z_1 \ne \cdots \ne z_p \in D} \Big(\prod_{i \le p} X_{z_i}\Big)\, I([z_1, \ldots, z_p]),$$
which sums the set-level information over all ordered $p$-tuples of selected samples; this is valid because the decomposition in Eq. (14) holds for every ordering. Expanding $I([z_1, \ldots, z_p])$ with the GCCG instantiation and using the fact that $X$ is binary, each score term $X_z\, I(z)$ and each pairwise term $X_z X_s\, K_{z,s}$ appears with a fixed multiplicity obtained by counting the orderings in which it occurs; collecting these coefficients yields
$$\max_{X \in \{0,1\}^{N},\; \sum_z X_z = p} \;\; (p-1)! \sum_{z \in D} X_z\, I(z) \;-\; (p-2)!\, \lambda \sum_{z \ne s \in D} X_z X_s\, K_{z,s},$$
where $(\cdot)!$ is the factorial function and $\lambda$ is the hyperparameter to be determined in Eq. (4). Since $\frac{(p-2)!}{(p-1)!} = \frac{1}{p-1}$, dividing the objective by $(p-1)!$ and setting $\alpha = \lambda \frac{(p-2)!}{(p-1)!}$, we obtain the quadratic programming problem as defined in InfoMax. Therefore, we have proved that, under the Graph-Cut conditional gain (GCCG) instantiation of the conditional gain term, solving the data pruning problem in Eq. (1) to find the most informative subset is equivalent to solving the quadratic problem defined in Eq. (2).
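Putting the pieces together, the simplified update of Eq. (13) can be sketched in a few lines (the dense similarity matrix, uniform initialization, and top-p rounding of the relaxed solution are our own simplifications; the paper additionally sparsifies K and partitions the dataset for scalability):

```python
import numpy as np

def infomax_select(scores, K, p, alpha=0.5, n_iters=50):
    """Iterate X_{t+1} = p * Softmax(I - 2*alpha*K X_t) as in Eq. (13),
    then round the relaxed solution by keeping the p largest entries.

    scores: (n,) sample-wise importance scores I.
    K:      (n, n) symmetric pairwise-similarity matrix.
    p:      coreset budget.
    """
    n = scores.shape[0]
    X = np.full(n, p / n)                        # uniform start, sums to p
    for _ in range(n_iters):
        logits = scores - 2.0 * alpha * (K @ X)  # gradient of the objective
        logits -= logits.max()                   # stabilize the exponential
        e = np.exp(logits)
        X = p * e / e.sum()                      # Softmax rescaled to budget p
    return np.argsort(-X)[:p]                    # indices of selected samples
```

On a toy example with two near-duplicate high-score samples, the solver keeps only one of them and fills the budget with a dissimilar sample, illustrating how the quadratic term penalizes redundancy.
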