# CurBench: Curriculum Learning Benchmark

Yuwei Zhou, Zirui Pan, Xin Wang, Hong Chen, Haoyang Li, Yanwen Huang, Zhixiao Xiong, Fangzhou Xiong, Peiyang Xu, Shengnan Liu, Wenwu Zhu

Department of Computer Science and Technology, BNRIST, Tsinghua University. Correspondence to: Xin Wang, Wenwu Zhu. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract: Curriculum learning is a training paradigm where machine learning models are trained in a meaningful order, inspired by the way humans learn curricula. Due to its capability to improve model generalization and convergence, curriculum learning has gained considerable attention and has been widely applied to various research domains. Nevertheless, as new curriculum learning methods continue to emerge, it remains an open issue to benchmark them fairly. Therefore, we develop CurBench, the first benchmark that supports systematic evaluations for curriculum learning. Specifically, it consists of 15 datasets spanning 3 research domains, namely computer vision, natural language processing, and graph machine learning, along with 3 settings: standard, noise, and imbalance. To facilitate a comprehensive comparison, we establish the evaluation from 2 dimensions: performance and complexity. CurBench also provides a unified toolkit that plugs automatic curricula into general machine learning processes, enabling the implementation of 15 core curriculum learning methods. On the basis of this benchmark, we conduct comparative experiments and make empirical analyses of existing methods. CurBench is open-source and publicly available at https://github.com/THUMNLab/CurBench.

1. Introduction

Throughout the development of machine learning, a large number of works have been greatly influenced by human learning. Curriculum learning is a research topic within machine learning that draws inspiration from a remarkable aspect of human learning: the curriculum, i.e., learning in a purposeful and meaningful order (Wang et al., 2021a; Soviany et al., 2022). In contrast to conventional machine learning methods where training examples are input in random order, curriculum learning aims to facilitate learning by gradually increasing the difficulty of the data or tasks experienced by the model (Bengio et al., 2009). Since this easy-to-hard training paradigm has been verified to enhance model generalization and accelerate convergence (Gong et al., 2016; Weinshall et al., 2018), it has aroused widespread interest among researchers in harnessing its potential across diverse application domains, such as computer vision (CV) (Guo et al., 2018; Soviany et al., 2020; Gui et al., 2017), natural language processing (NLP) (Platanios et al., 2019; Tay et al., 2019; Liu et al., 2018), graph machine learning (Li et al., 2023; Wang et al., 2021b; Wei et al., 2023; Qin et al., 2024; Yao et al., 2024), multimodal learning (Lan et al., 2023; Chen et al., 2023; Zhou et al., 2023), recommender systems (Chen et al., 2021b;a; Wu et al., 2023; Wang et al., 2023a), reinforcement learning (RL) (Florensa et al., 2017; Narvekar et al., 2017; Ren et al., 2018b), and others (Zhang et al., 2022; Zhou et al., 2022b).
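To make the easy-to-hard paradigm concrete, the following is a minimal, generic sketch (our own illustration, not the code of any particular method or of CurBench): a difficulty measurer ranks samples, here by a per-sample loss, and a scheduler gradually enlarges the training subset from the easiest fraction to the full set.

```python
import numpy as np

def easy_to_hard_order(losses):
    """Order sample indices from easy to hard, using per-sample loss as a
    difficulty proxy (one common choice; any difficulty measurer works)."""
    return np.argsort(losses)

def curriculum_subset(order, epoch, total_epochs, start_rate=0.2):
    """Baby-step style scheduler: start from the easiest fraction of the data
    and linearly grow to the full training set."""
    frac = min(1.0, start_rate + (1.0 - start_rate) * epoch / total_epochs)
    return order[: max(1, int(frac * len(order)))]

# Toy usage: 10 samples whose "difficulty" is a random loss value.
losses = np.random.rand(10)
order = easy_to_hard_order(losses)
for epoch in range(0, 100, 25):
    print(epoch, curriculum_subset(order, epoch, total_epochs=100))
```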
Despite the significant progress and the wide application of curriculum learning, the increasing number of works has posed challenges in terms of their comparison and evaluation, mainly due to the differences in their experimental setups, including datasets, backbone models, and settings. For instance, DCL (Saxena et al., 2019) and DDS (Wang et al., 2020) use the same WideResNet-28-10 model (Zagoruyko & Komodakis, 2016), but perform experiments on different datasets: CIFAR-100 and CIFAR-10 (Krizhevsky et al., 2009), respectively. Similarly, DIHCL (Zhou et al., 2020) and CBS (Sinha et al., 2020) leverage the same ImageNet (Deng et al., 2009) dataset, but employ distinct models: ResNet-50 and ResNet-18 (He et al., 2016), respectively. Furthermore, while MCL (Zhou & Bilmes, 2018) and LRE (Ren et al., 2018a) utilize the same MNIST dataset and LeNet model (LeCun et al., 1998), they adopt different settings: standard and imbalanced labels, respectively. Consequently, their experimental results cannot be compared directly, which makes it challenging to conduct a fair evaluation. The absence of a standardized evaluation not only hinders researchers from accurately assessing their own contributions when they propose a new method, but also poses barriers for users when they seek a suitable method for their specific tasks.

[Figure 1. CurBench includes 15 datasets spanning 3 research domains, 9 backbone models, 3 training settings, and 2 evaluation dimensions, providing a comprehensive benchmark for existing curriculum learning methods.]

To deal with this issue, researchers have made notable efforts to evaluate and summarize existing works. From a theoretical perspective, there have been surveys covering general curriculum learning (Wang et al., 2021a; Soviany et al., 2022) as well as specific ones for graph (Li et al., 2023) and RL (Narvekar et al., 2020; Portelas et al., 2020), all of which manage to formulate and categorize relevant methods comprehensively. Although they offer valuable theoretical insights, current surveys do not incorporate any practical implementation or experimental results. From an empirical perspective, there has been an open-source library on curriculum learning (Zhou et al., 2022a), which reproduces multiple related methods through a unified framework. Although it provides empirical results of the implemented methods, this library only supports the classification task on CIFAR-10 and is thus limited in experimental setups. In conclusion, the related works fail to fully address the open issue of evaluating and comparing curriculum learning methods.

In order to address the absence of benchmarks in this field, we propose CurBench, the first benchmark for systematic evaluations of curriculum learning, as shown in Figure 1. Concretely, it encompasses 15 prevalent datasets spanning 3 research domains, including CV, NLP, and graph, to ensure the reliability of evaluation. These datasets are further preprocessed into 3 settings, including standard, noise, and imbalance, to reveal the capability of methods to enhance model generalization and robustness.
Without loss of generality, a total of 9 prevalent backbone models of varying types and scales, adapted to the above datasets, are employed in an appropriate manner, incorporating corresponding hyperparameters, optimizers, and so on. Most of the datasets, settings, and models are commonly used in previous related works, while the rest are supplemented in this work to investigate how these methods can adapt to the tasks in other domains. For ease of use, this benchmark also provides a unified toolkit that plugs automatic curricula into general machine learning processes and reproduces a collection of 15 core curriculum learning approaches. Based on these implementations in CurBench, we further perform a comprehensive evaluation from 2 dimensions, including performance and complexity, presenting the improvements the methods bring and the additional resources they consume. Furthermore, we delve into our benchmark, organize experimental outcomes, conduct in-depth analyses, and obtain some intriguing findings. First, no method outperforms the others all the time, and the effectiveness depends on specific scenarios. Second, curriculum learning brings more significant improvements in noise settings than in standard and imbalance ones. Third, methods based on teacher transferring have an edge in noise settings, while methods based on reweighting perform relatively well in imbalance settings. Lastly, methods involving gradient calculation and extra learnable networks generally have higher time and space complexity.

Our contributions are summarized as follows:

- We propose CurBench, the first benchmark on curriculum learning to the best of our knowledge.
- We conduct extensive experiments to impartially evaluate and compare the performance and complexity of existing curriculum learning methods under various experimental setups.
- We make in-depth analyses and demonstrate intriguing observations on curriculum learning based on empirical results derived from CurBench.

| Domain | Dataset | Setting | Training | Validation | Test | Class | Metrics |
|---|---|---|---|---|---|---|---|
| CV | CIFAR-10 | Standard / Noise-0.4 | 45,000 | 5,000 | 10,000 | 10 | Accuracy |
| CV | CIFAR-10 | Imbalance-50 | 12,536 | 5,000 | 10,000 | 10 | Accuracy |
| CV | CIFAR-100 | Standard / Noise-0.4 | 45,000 | 5,000 | 10,000 | 100 | Accuracy |
| CV | CIFAR-100 | Imbalance-50 | 12,536 | 5,000 | 10,000 | 100 | Accuracy |
| CV | Tiny-ImageNet | Standard / Noise-0.4 | 90,000 | 10,000 | 10,000 | 200 | Accuracy |
| CV | Tiny-ImageNet | Imbalance-50 | 22,700 | 10,000 | 10,000 | 200 | Accuracy |
| NLP | RTE | Standard / Noise-0.4 | 2,490 | 277 | - | 2 | Accuracy |
| NLP | MRPC | Standard / Noise-0.4 | 3,668 | 408 | - | 2 | F1 Score |
| NLP | STS-B | Standard / Noise-0.4 | 5,749 | 1,500 | - | 6 | Spearman |
| NLP | CoLA | Standard / Noise-0.4 | 8,551 | 1,043 | - | 2 | Matthews |
| NLP | SST-2 | Standard / Noise-0.4 | 67,349 | 872 | - | 2 | Accuracy |
| NLP | QNLI | Standard / Noise-0.4 | 104,743 | 5,463 | - | 2 | Accuracy |
| NLP | QQP | Standard / Noise-0.4 | 363,846 | 40,430 | - | 2 | F1 Score |
| NLP | MNLI-(m/mm) | Standard / Noise-0.4 | 392,702 | 9,815/9,832 | - | 3 | Accuracy |
| Graph | MUTAG | Standard / Noise-0.4 | 150 | 19 | 19 | 2 | Accuracy |
| Graph | PROTEINS | Standard / Noise-0.4 | 890 | 111 | 112 | 2 | Accuracy |
| Graph | NCI1 | Standard / Noise-0.4 | 3,288 | 411 | 411 | 2 | Accuracy |
| Graph | ogbg-molhiv | Standard / Noise-0.4 | 32,901 | 4,113 | 4,113 | 2 | ROC-AUC |

Table 1. The statistics of the 15 datasets adopted in CurBench, which cover a wide range of scales across 3 research domains in 3 settings. Spearman and Matthews refer to the correlation coefficients. Noise-0.4 means 40% of the data samples are independently attached with random incorrect labels.
Imbalance-50 means a ratio of 50 between the number of samples in the largest class and that in the smallest class in a long-tailed dataset where the number of samples per class follows a geometric sequence. The imbalance setting is not applied to NLP and graph datasets, which are imbalanced originally.

2. Related Work

2.1. Curriculum Learning

Curriculum learning, much like many other topics in machine learning, draws inspiration from human learning. It refers to a training strategy where models learn from input data in a meaningful order, imitating the way humans learn from curricula. The emergence of this idea can be traced back at least to Elman's work (Elman, 1993) in 1993, which advocated the importance of starting small. In 2009, Bengio et al. (Bengio et al., 2009) first introduced a formal definition of curriculum learning and explored when, why, and how a curriculum could benefit machine learning. In the early stages, curricula for models were entirely predefined by humans, and the most typical method was named Baby Step (Spitkovsky et al., 2010). However, this type of predefined approach is not flexible and general enough for widespread applications. In 2010, Kumar et al. (Kumar et al., 2010) proposed self-paced learning (SPL), enabling automatic curriculum scheduling by ordering data according to their training loss. Subsequently, a variety of automatic curriculum learning methods have continued to emerge. For example, transfer learning methods (Weinshall et al., 2018; Hacohen & Weinshall, 2019) employ teacher models to offer student models curricula. Reinforcement learning methods (Graves et al., 2017; Matiisen et al., 2019; Zhao et al., 2020) allow teacher models to adapt the curriculum based on feedback from student models. In addition, there are other approaches based on Bayesian optimization (Tsvetkov et al., 2016), meta-learning (Ren et al., 2018a; Shu et al., 2019), and adversarial learning (Zhang et al., 2020) for implementing automatic curriculum learning.

2.2. Summative Work on Curriculum Learning

To the best of our knowledge, CurBench is the first benchmark on curriculum learning. Despite the lack of related benchmarks, there have been numerous efforts to investigate and summarize curriculum learning methods from different perspectives. For example, Wang et al. (Wang et al., 2021a) survey curriculum learning and propose a general framework that covers the related methods by abstracting them into two key components, i.e., a difficulty measurer to tell what data or task is easy or hard to learn and a learning scheduler to decide when to learn the easier or harder part, and further categorize the methods according to the implementation of these two components. Soviany et al. (Soviany et al., 2022) also survey curriculum learning and propose a generic algorithm for it based on the definition of machine learning, i.e., data, model, and task, and organize the methods according to their application domains and tasks. Narvekar et al. (Narvekar et al., 2020) survey the relevant methods applied to RL and abstract them into three steps, i.e., task generation, sequencing, and transfer learning. Portelas et al. (Portelas et al., 2020) also focus on curriculum learning for RL, and classify the methods based on three questions, i.e., why, what to control, and what to optimize. Li et al. (Li et al., 2023) review the methods tailored for graphs, and group them according to the tasks, i.e., node-level, link-level, and graph-level.
However, these works only summarize and analyze the methods from the theoretical aspect. On the other hand, Zhou et al. (Zhou et al., 2022a) develop CurML, a code library for curriculum learning, which designs a unified framework for the reproduction and comparison of existing methods from the empirical aspect. Nevertheless, it can only conduct experiments on a single task within a specific domain, significantly limiting its generality and reliability. Therefore, it is necessary to develop a benchmark across diverse experimental setups for a fair, reliable, and systematic study of curriculum learning.

3. Curriculum Learning Benchmark

In this section, we describe our design for the benchmark in detail. First, we clarify the scope of this benchmark in Section 3.1. Then, we introduce the adopted datasets in Section 3.2, followed by the corresponding settings in Section 3.3 and the backbone models in Section 3.4. Lastly, we elaborate on the evaluation dimensions in Section 3.5.

3.1. Benchmark Scope

CurBench focuses on benchmarking existing prevalent curriculum learning methods for supervised tasks in the CV, NLP, and graph domains. This is because CV and NLP are representative research domains in machine learning, with datasets in these areas frequently used to validate the performance of curriculum learning methods, as shown in Table 6. Graph data, being structured, differs from the unstructured data of images and text, contributing to the diversity of CurBench, and curriculum learning in the graph domain has gained significant attention recently. Besides, the main challenge of the tasks included in CurBench lies in designing appropriate curricula at the data level so that the models can be guided to better cope with standard, noisy, and imbalanced datasets. In contrast, methods designed at the task level and specifically targeting the RL domain are not within the scope of this work. We plan to expand the scope of CurBench in a future version, as stated in Section 6.

3.2. Dataset

Table 1 outlines the datasets included in CurBench, all of which are publicly available and widely used in their respective domains. Besides, they vary in scale from hundreds of samples to hundreds of thousands. A brief introduction to the datasets and our preprocessing is listed as follows.

CV Domain: CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) consist of 32×32×3 color images in 10 and 100 classes, respectively. Tiny-ImageNet (Le & Yang, 2015) is a subset of the ILSVRC2012 version of ImageNet (Deng et al., 2009) and consists of 64×64×3 down-sampled images. Since the test set of Tiny-ImageNet is not released with labels, we use the validation set as the test set. For these 3 datasets, we split the original training set into a new training set and a validation set with a 9:1 ratio.

NLP Domain: All 8 datasets are sourced from GLUE (Wang et al., 2018), which is a collection of tools for evaluating models across diverse natural language understanding tasks. GLUE originally contains 9 datasets, and we follow BERT (Devlin et al., 2018), excluding the problematic WNLI set and using the remaining 8 datasets. Since the test sets are not released with labels, we report the results on the validation sets.

Graph Domain: The ogbg-molhiv dataset belongs to the Open Graph Benchmark (OGB) (Hu et al., 2020), a collection of realistic, large-scale, and diverse benchmark datasets for graphs. We strictly follow its original split scheme, split ratios, and metrics.
The other 3 datasets come from TUDataset (Morris et al., 2020), a collection of over 120 graph datasets of varying sizes from a wide range of applications. Since there is no established training and test set split, we randomly divide the original datasets into training, validation, and test sets with an 8:1:1 ratio.

3.3. Setting

To robustly evaluate the curriculum learning methods, we establish the 3 settings as follows.

Standard: After dividing the datasets into training, validation, and test sets as mentioned above, we do not perform any further data processing.

Noise-p: We follow previous works (Zhang et al., 2016; Ren et al., 2018a; Shu et al., 2019) and apply uniform noise by independently changing the label of each sample in the training set to a random one with a probability of p ∈ (0.0, 1.0]. When p = 0, it degenerates to the standard setting.

Imbalance-r: We follow previous works (Cui et al., 2019; Shu et al., 2019) to form a long-tailed dataset by reducing the number of samples per class in the training set. Let c ∈ {0, 1, 2, ..., C−1} be the class index, C be the number of classes, and n_c be the number of samples in the c-th class; an originally balanced dataset then satisfies n_0 = n_1 = ... = n_{C−1}. We implement the imbalance setting by requiring n_c to follow the exponential function n_c = n_0 · d^c, where d ∈ (0, 1), and define the imbalance factor r = n_0 : n_{C−1} as the ratio between the number of samples in the largest class and that in the smallest class. When r = 1, it degenerates to the standard setting.
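Both corrupted settings are fully determined by p and r. A minimal sketch of how such training sets might be constructed is shown below; it is our own illustration of the definitions above, not the toolkit's implementation (e.g., whether a flipped label may coincide with the original class is a convention detail).

```python
import numpy as np

def add_label_noise(labels, p, num_classes, seed=0):
    """Noise-p: independently replace each training label with a uniformly
    random class with probability p (illustrative sketch of Section 3.3)."""
    rng = np.random.default_rng(seed)
    labels = np.array(labels).copy()
    flip = rng.random(len(labels)) < p
    labels[flip] = rng.integers(0, num_classes, size=flip.sum())
    return labels

def long_tail_counts(n0, num_classes, r):
    """Imbalance-r: class c keeps n_c = n_0 * d^c samples, with d chosen so
    that n_0 / n_{C-1} = r, i.e., d = r ** (-1 / (C - 1))."""
    d = r ** (-1.0 / (num_classes - 1))
    return [int(n0 * d ** c) for c in range(num_classes)]

# e.g., 4,500 samples per class and r = 50: the tail decays from 4,500
# in the largest class down to 4,500 / 50 = 90 in the smallest class.
print(long_tail_counts(4500, 10, 50))
```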
| Domain | Model | Mechanism | Parameters |
|---|---|---|---|
| CV | LeNet | Convolution | ~0.07M |
| CV | ResNet-18 | Convolution | ~11.2M |
| CV | ViT | Attention | ~9.6M |
| NLP | LSTM | Recurrent | ~10.4M |
| NLP | BERT | Attention | ~109M |
| NLP | GPT2 | Attention | ~124M |
| Graph | GCN | Convolution | ~0.01M |
| Graph | GAT | Attention | ~0.14M |
| Graph | GIN | Isomorphism | ~0.01M |

Table 2. The statistics of the 9 backbone models adopted in CurBench, which cover various mechanisms and scales. "~" signifies an approximation, and M represents million.

3.4. Backbone Model

Table 2 overviews the backbone models that we employ in CurBench. All the values in the last column are approximations because the number of parameters varies depending on the input sizes and output classes. All of the models are commonly applied to the aforementioned datasets, and they are distinct from each other in mechanism and model size.

CV Domain: LeNet (LeCun et al., 1998) is one of the earliest convolutional neural networks (CNNs), composed of 3 convolution layers, 2 pooling layers, and some fully-connected layers. ResNet (He et al., 2016) is a classic CNN with residual connections designed for easier training of deeper networks, and ResNet-18 refers to the 18-layer version. ViT (Dosovitskiy et al., 2020) is the standard Transformer directly applied to images by treating image patches as word tokens. ViT in CurBench is not pretrained because its pretrained weights are derived from ImageNet (Deng et al., 2009), which leads to the risk of data leakage when evaluating its performance on Tiny-ImageNet (Le & Yang, 2015), a subset of ImageNet.

NLP Domain: LSTM (Hochreiter & Schmidhuber, 1997) is a typical recurrent neural network (RNN), which introduces gate functions to control what to remember and what to forget in the face of long sequences. BERT (Devlin et al., 2018) is a deep bidirectional Transformer pretrained with the masked language modeling task, and it excels at semantic representation due to its encoder-based architecture. GPT2 (Radford et al., 2019) is a decoder-based Transformer pretrained with a left-to-right language modeling objective and, as a result, works well on text generation. BERT and GPT2 in CurBench are pretrained because training them from scratch would result in poor performance, making it difficult to maintain consistency with their reported performance.

Graph Domain: GCN (Kipf & Welling, 2016) is a variant of CNN designed to operate directly on graphs. Its insight lies in the choice of convolutional architecture via a localized first-order approximation of spectral graph convolutions. GAT (Veličković et al., 2017) introduces masked self-attentional layers based on GCN to enable implicitly specifying different weights for different nodes in a neighborhood. GIN (Xu et al., 2018) is developed based on the Weisfeiler-Lehman test theory and emphasizes the importance of summation as the readout function.

3.5. Evaluation

To ensure a comprehensive analysis of existing methods, we consider the following 2 evaluation dimensions.

Performance: We adopt the widely accepted metrics on each dataset, such as accuracy on the image datasets; F1 score, Spearman correlation, and Matthews correlation on the GLUE benchmark; and AUC (Yang et al., 2021) on the graph datasets. To display the results clearly, we report the average and standard deviation of the metric over 5 runs for each dataset.

Complexity: It is essential to examine the time and space complexity of each method because curriculum learning methods always cost extra computational time and resources to assess model competence and data difficulty for appropriate curriculum design. We record the training time and maximum memory consumption on the same GPU device as the indicators of complexity.

4. CurBench Toolkit

4.1. Modules

To facilitate the use of our CurBench, we develop a companion toolkit based on CurML (Zhou et al., 2022a) for the entire pipeline of applying curriculum learning to various machine learning tasks, reproducing 15 core methods. Compared to CurML, this toolkit extends the methods to accommodate inputs in various data formats and diverse output evaluation metrics, and provides searched hyperparameters for each method. As illustrated in Figure 2, we summarize and abstract the whole toolkit into 5 modules: data processing, model loading, objective fitting, curriculum learning, and evaluation.

Data Processing: This module aims to prepare data according to the specified dataset and setting. Given a data name in a format like "cifar10", "cifar100-noise-p", or "tinyimagenet-imbalance-r", this module can automatically parse it, split the dataset into training, validation, and test sets, and process the training set by adding noise with probability p or forming imbalance with factor r.

Model Loading: This module is used to initialize the model based on the model name and the target dataset. For instance, CV models need to modify their input layer to accommodate input images and patch sizes.
Similarly, graph models require node features and edge relationships when constructing graph convolutional layers. Besides, the class number of the dataset determines the model's output layer.

[Figure 2. Our CurBench toolkit, which is composed of 5 modules (data processing, model loading, objective fitting, curriculum learning, and evaluation), offers a unified and complete pipeline from initiation to evaluation, aiming for easy implementation and reproduction of curriculum learning methods. The curriculum learning module covers methods via data selection (SPL (NeurIPS 2010), MCL (ICLR 2018), LGL (CVPR 2019), C2F (arXiv), TTCL (ICML 2018), DIHCL (NeurIPS 2020), Adaptive CL (ICCV 2021), EfficientTrain (ICCV 2023)), via model adjustment (CBS (NeurIPS 2020)), and via loss reweighting (ScreenerNet (arXiv), MW-Net (NeurIPS 2019), SuperLoss (NeurIPS 2020), LRE (ICML 2018), DCL (NeurIPS 2019), DDS (ICML 2020)). This figure showcases an example of noisy CIFAR-10.]

Objective Fitting: This module handles the process where models learn and fit datasets to accomplish target tasks. For different research domains, we select tailored hyperparameters, optimizers, loss functions, and so on. Unlike common machine learning, the training procedure in this module is guided by the curriculum learning module.

Curriculum Learning: This module integrates 15 core curriculum learning methods, all of which are abstracted as a class for easy plug-in into the objective fitting module. This design of abstracting methods as classes ensures that the module is extensible to new methods. Currently, we divide the existing methods into the following 3 categories. It is worth noting that this categorization is intended to facilitate the implementation and extension of various methods within a unified framework, but it does not imply that methods within the same category necessarily share similar properties or performance.

via Data Selection: The primary approach to implementing a curriculum is through data selection, so that models can progressively learn from a subset to the entire dataset in a meaningful order. The methods belonging to this category include vanilla SPL (Kumar et al., 2010), DIHCL (Zhou et al., 2020), and others (Weinshall et al., 2018; Zhou & Bilmes, 2018; Cheng et al., 2019; Kong et al., 2021; Wang et al., 2023b). Some methods select data subsets based on sample difficulty, while others select data based on sample class.

via Model Adjustment: An innovative idea for designing curricula is to regulate the amount of data information the model receives by modifying its architecture. CBS (Sinha et al., 2020), which employs a Gaussian filter to manage information intake, is a typical one.

via Loss Reweighting: Loss reweighting can be regarded as a soft version of data selection. Intuitively, assigning a low weight to a data sample is almost equivalent to disregarding it. A common practice to reweight losses is through meta-learning (Finn et al., 2017), such as LRE (Ren et al., 2018a), MW-Net (Shu et al., 2019), and DDS (Wang et al., 2020), all of which employ a meta-network to assess the weights of losses and optimize the meta-network with the validation set. Additionally, there are other approaches, such as variants of SPL (Fan et al., 2017; Castells et al., 2020), DCL (Saxena et al., 2019), ScreenerNet (Kim & Choi, 2018), and SuperLoss (Castells et al., 2020). A minimal illustrative sketch of such weighting is given after the module overview below.

Evaluation: This module is utilized to report results from 2 aspects, i.e., performance and complexity, in order to respectively demonstrate the effectiveness and efficiency of different methods. The performance metrics depend on the target datasets and tasks, and the complexity metrics include training time and maximum GPU memory consumption.
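To make the reweighting idea concrete, here is a minimal, generic sketch of self-paced-style weighting applied inside an ordinary PyTorch training step. It is our own illustration under the description above, not the exact rule of any particular method; `lam` plays the role of a pace/threshold parameter that typically grows during training.

```python
import torch

def self_paced_weights(losses, lam, mode="hard"):
    """Weight per-sample losses in the spirit of self-paced learning:
    samples with loss below the threshold lam are kept (weight 1),
    harder ones are discarded (weight 0) or softly down-weighted.
    Illustrative sketch only; each method defines its own weighting."""
    if mode == "hard":
        return (losses < lam).float()
    # A simple soft variant: weights decay linearly above the threshold.
    return torch.clamp(1.0 - losses / lam, min=0.0)

# Usage inside a standard training step (criterion with reduction='none').
criterion = torch.nn.CrossEntropyLoss(reduction="none")
logits = torch.randn(4, 3)                 # toy batch of 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])
losses = criterion(logits, targets)        # per-sample losses
weights = self_paced_weights(losses.detach(), lam=1.0)
loss = (weights * losses).sum() / weights.sum().clamp(min=1.0)
```

Meta-learning-based methods such as LRE, MW-Net, and DDS replace this fixed rule with weights produced by a meta-network or validation-set gradients, which is also why they incur the extra time and memory cost discussed in Section 5.3.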
4.2. Example Usage

Figure 3 illustrates Python-like sample code for our CurBench toolkit, where an object of the SPLTrainer class is instantiated given the essential parameters, including a CIFAR-10 dataset name with the noise setting for data processing and a ResNet-18 net name for model loading. All of the above are put together to fit and evaluate the final result.

```python
from curbench.algorithms import SPLTrainer

# Instantiate curriculum learning class
trainer = SPLTrainer(
    # CIFAR-10 with 40% wrong labels
    data_name='cifar10-noise-0.4',
    # ResNet-18 with 32x32 input size
    net_name='resnet18',
    # Self-Paced Learning in a linear way
    start_rate=0.0,
    grow_epochs=100,
    grow_fn='linear',
    weight_fn='hard',
)

# Automatic, no need to specify:
# trainer._init_dataloader()
# trainer._init_model()

# Fitting and evaluating
trainer.fit()
trainer.evaluation()
```

Figure 3. Python-like sample code for an example of Self-Paced Learning applied to image classification with the CurBench toolkit.

With only a few lines of code, a dozen curriculum learning methods can be easily implemented and reproduced. On the basis of this toolkit, we conduct a multitude of experiments, and we report the experimental setups and results in the next section.

5. Experiments and Analyses

5.1. Experimental Setup

To ensure a fair and reproducible evaluation, we fix all possible confounding factors and report the average and standard deviation results of 5 runs with different fixed random seeds for each combination of datasets, backbone models, and settings. The detailed hyperparameters for both the training processes and the curriculum learning methods are presented in the Appendix.
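The reporting protocol itself is easy to restate in code. Below is a minimal sketch of the 5-run average-and-deviation computation, using the fixed seeds listed in Appendix E; `run_once` is a stand-in for a full training-plus-evaluation run (e.g., the SPLTrainer pipeline in Figure 3), not a function of the toolkit.

```python
import numpy as np

def run_once(seed):
    """Stand-in for one full training + evaluation run with the given seed.
    Here it only returns a dummy accuracy so the sketch is self-contained."""
    rng = np.random.default_rng(seed)
    return 85.0 + rng.normal(0.0, 0.5)

seeds = [42, 666, 777, 888, 999]  # the fixed seeds listed in Appendix E
scores = [run_once(s) for s in seeds]
print(f"{np.mean(scores):.2f} ± {np.std(scores):.2f}")
```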
5.2. Performance

5.2.1. Main Results

Table 3 presents the overall performances with and without curriculum learning under different combinations of backbone models, datasets, and settings. The detailed results of each specific curriculum learning method are attached in the Appendix, and we report the best ones among them in this table. The imbalance setting is not applied to NLP and graph datasets, where the number of samples in each class is imbalanced originally. It is observed that curriculum learning can bring consistent improvement across domains. Compared to the standard and imbalance settings, curriculum learning benefits much more in noise settings. This phenomenon is consistent with existing theoretical analysis, where curriculum learning is able to denoise and guide machine learning by discarding the difficult and possibly noisy data in the early stages of training. Besides, there is no method that can outperform the others all the time, and the effectiveness of curriculum learning methods still depends on the target scenarios. For example, ScreenerNet (Kim & Choi, 2018) exhibits superior performance on CV datasets compared to graph datasets, and TTCL (Weinshall et al., 2018) performs better in noise settings than in standard and imbalance ones. Therefore, it is essential to explore more general methods while also researching methods tailored to specific environments.

5.2.2. Results in Noise Settings

Figure 4 demonstrates the performances of curriculum learning methods on datasets with different noise ratios p ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. Without loss of generality, we select a backbone model and a dataset from each research domain. Some methods, such as CBS, LGL, C2F, and EfficientTrain, are only applied to CV datasets and not to NLP and graph datasets due to the following reasons. CBS (Sinha et al., 2020) requires convolutional layers in backbone models, and such models in CurBench are only within the CV domain. LGL (Cheng et al., 2019) and C2F (Stretcu et al., 2021) require multiple classes for clustering, but most NLP and graph datasets in CurBench have only two classes. EfficientTrain (Wang et al., 2023b) is based on data augmentation techniques for images. We can observe that TTCL (Weinshall et al., 2018), the method by teacher transferring, obtains competitive performances regardless of the noise ratio, thanks to the guidance from the teacher model pretrained on the clean dataset. In contrast, SPL (Kumar et al., 2010), which is similar to TTCL but guides the learning by itself, performs relatively poorly. This is because a model that is not fully trained is not competent enough to accurately distinguish noisy or hard data.

5.2.3. Results in Imbalance Settings

Figure 5 depicts the performances on CIFAR-10 with varying imbalance factor r ∈ {1, 10, 20, 50, 100, 200}. It is observed that all methods achieve similar performances under different imbalance ratios. When the imbalance factor r increases, the differences between the methods become more evident. Relatively speaking, the methods based on data reweighting, such as DCL (Saxena et al., 2019) and SuperLoss (Castells et al., 2020), perform well because they can mitigate the impact of imbalanced classes by reassigning the weights of data or even classes. Compared with noise settings, curriculum learning brings less significant improvements and shows less variation between methods in imbalance settings. This is primarily because most curriculum learning methods focus on the difficulty of samples instead of classes, leading to overall better performances in noise settings than in imbalance settings. Additionally, the differences in judging difficult or noisy samples result in larger performance disparities among methods in noise settings.
Cur Bench: Curriculum Learning Benchmark CIFAR-10 CIFAR-100 Tiny-Image Net Standard Noise-0.4 Imbalance-50 Standard Noise-0.4 Imbalance-50 Standard Noise-0.4 Imbalance-50 Le Net 69.951.00 65.021.12 44.930.56 35.460.70 29.590.40 19.570.64 22.080.61 18.630.43 11.650.30 Le Net + CL 70.430.41 65.930.57 45.280.56 35.630.78 30.870.48 19.740.17 22.830.44 19.910.26 12.360.47 Res Net-18 92.330.16 82.752.06 75.490.87 69.970.27 52.140.39 42.570.68 51.411.74 39.420.21 28.830.38 Res Net-18 + CL 92.880.23 86.920.20 76.430.96 71.310.14 58.560.60 43.470.43 53.610.48 43.640.72 30.820.36 Vi T 79.900.38 64.190.51 52.120.81 51.050.62 35.250.24 26.050.52 38.160.53 24.900.26 17.150.31 Vi T + CL 80.660.27 69.830.53 52.850.81 51.930.64 39.150.30 26.400.34 38.920.53 29.760.34 17.470.14 RTE MRPC STS-B Co LA Standard Noise-0.4 Standard Noise-0.4 Standard Noise-0.4 Standard Noise-0.4 LSTM 52.951.34 53.431.77 81.430.14 81.220.00 12.730.72 10.901.19 11.291.27 3.271.68 LSTM + CL 53.071.29 54.221.77 81.540.18 81.240.05 14.112.21 11.751.61 12.651.21 8.552.10 BERT 64.623.33 54.223.14 88.540.45 81.890.83 85.260.22 80.711.01 57.391.30 32.350.79 BERT + CL 66.351.76 56.325.04 88.691.24 81.940.55 85.420.22 81.310.25 57.801.96 45.791.64 GPT2 65.341.95 52.924.49 85.490.86 78.231.72 76.441.20 69.651.85 37.003.72 5.861.69 GPT2 + CL 66.352.10 57.403.39 86.290.36 82.550.88 80.821.39 71.571.74 39.953.16 12.542.75 SST-2 QNLI QQP MNLI-(m/mm) Standard Noise-0.4 Standard Noise-0.4 Standard Noise-0.4 Standard Noise-0.4 LSTM 81.670.85 64.361.12 50.540.00 50.620.16 75.690.27 60.720.79 61.380.30 / 61.210.45 44.410.51 / 44.830.90 LSTM + CL 82.870.88 78.581.64 51.020.46 50.830.45 75.730.21 66.470.72 62.470.36 / 62.330.42 58.590.54 / 58.500.64 BERT 92.660.28 87.220.82 91.210.24 81.210.76 88.050.12 76.230.48 83.890.31 / 84.380.29 78.650.70 / 79.210.62 BERT + CL 92.820.16 91.250.59 91.490.13 89.450.44 88.160.13 84.500.25 84.270.07 / 84.400.42 81.730.31 / 82.250.40 GPT2 91.950.49 85.830.57 87.920.31 78.720.37 86.000.23 75.400.84 81.530.21 / 82.400.21 76.560.15 / 77.690.15 GPT2 + CL 92.250.42 90.340.53 88.170.67 84.000.70 86.680.16 82.160.35 81.900.23 / 82.590.35 78.360.19 / 79.620.44 MUTAG PROTEINS NCI1 ogbg-molhiv Standard Noise-0.4 Standard Noise-0.4 Standard Noise-0.4 Standard Noise-0.4 GCN 73.682.11 66.317.14 70.714.20 63.576.45 69.591.23 55.233.21 75.841.02 64.294.55 GCN + CL 74.743.94 71.585.37 73.214.41 71.616.62 71.391.29 67.982.01 77.411.15 72.811.14 GAT 69.476.14 65.265.37 64.462.96 65.719.13 56.742.86 53.772.12 68.072.34 65.372.66 GAT + CL 72.638.42 69.4710.21 69.827.13 69.113.77 59.371.59 55.674.70 72.641.16 66.731.84 GIN 86.847.90 78.953.72 74.114.24 69.821.73 79.321.40 60.243.92 74.721.36 63.073.73 GIN + CL 88.422.10 81.584.56 77.144.88 73.931.82 82.041.90 62.146.47 76.531.97 65.531.61 Table 3. The empirical performances of 9 backbone models over 15 datasets in 3 settings with and without curriculum learning methods. The rows with + CL present the best performances achieved among the methods involved in this benchmark. The bold font highlights the superior performances brought by curriculum learning. The imbalance setting is not applied to NLP and graph datasets, which are imbalanced originally. Note: The detailed performances of each method are reported in Table 9-11 in the Appendix. 
[Figure 4. The performances as a function of noise ratio p for different curriculum learning methods on datasets from 3 research domains: ResNet-18 on noisy CIFAR-10, BERT on noisy RTE, and GCN on noisy NCI1.]

[Figure 5. The performances as a function of imbalance factor r for different curriculum learning methods (ResNet-18 on imbalanced CIFAR-10).]

5.3. Complexity

Figure 6 shows the time and space complexity of each method in the case of ResNet-18 and CIFAR-10, measured by GPU training time (hours) and maximum GPU memory consumption (GB). The whole figure can be divided into 3 parts. The first is the upper right corner, which contains the methods requiring gradient calculation and meta-network training, resulting in high time and space complexity. The second is the middle part with the point of ScreenerNet, which also introduces an extra network but only requires one backward pass, leading to less complexity. The third is the lower left corner, which includes most of the methods, consuming similarly small amounts of training time and GPU memory, because they measure data difficulty and schedule the curriculum in a relatively intuitive way and do not demand a learnable network with a large number of parameters.

[Figure 6. Time and space complexity of different methods in the case of ResNet-18 and CIFAR-10. Note: The numerical results of 3 different cases are reported in Table 8 in the Appendix.]

6. Conclusion

In this paper, we propose CurBench, the first benchmark for curriculum learning. It covers a broad range of research domains, datasets, backbone models, settings, and evaluation dimensions, ensuring a fair, reliable, and systematic evaluation of existing curriculum learning methods. For convenient utilization, it is complemented by a toolkit that implements essential related works in a unified pipeline and applies them to various machine learning tasks. Through empirical results and theoretical analyses, we provide valuable findings on curriculum learning. In conclusion, CurBench holds the potential to benefit future research and suggest promising directions.

Limitations: Despite the benefits of our CurBench, we also recognize the following limitations in this version and intend to refine them in future expansions.

CurBench mainly covers supervised learning in the CV, NLP, and graph domains, but has not incorporated the datasets, backbone models, and tasks related to other domains such as audio processing, multimodal learning, recommender systems, and robotics. Additionally, CurBench has not involved unsupervised, semi-supervised, and reinforcement learning. Given the importance of these topics in the context of curriculum learning applications, they will be integrated as a significant part of future versions.

CurBench currently employs publicly available datasets that are commonly used in their respective domains. However, CurBench has not yet introduced any new datasets.
Designing specialized datasets for curriculum learning is essential because these datasets can better align with the unique requirements and objectives of curriculum learning methodologies. We recognize the importance of this task and intend to undertake it in the future. Cur Bench has not evaluated the performance of curriculum learning on large models, which deserves in-depth exploration in this era of large models. Considering that large models often encounter vast amounts of data with varying quality when learning, it is suitable to utilize curriculum learning for guidance and denoising. We plan to include the prevalent large-scale language and multimodal models in our future work. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Cur Bench: Curriculum Learning Benchmark Acknowledgements This work is supported by the National Key Research and Development Program of China No.2023YFF1205001, National Natural Science Foundation of China (No. 62222209, 62250008, 62102222), Beijing National Research Center for Information Science and Technology under Grant No. BNR2023RC01003, BNR2023TD03006, and Beijing Key Lab of Networked Multimedia. Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41 48, 2009. Castells, T., Weinzaepfel, P., and Revaud, J. Superloss: A generic loss for robust curriculum learning. Advances in Neural Information Processing Systems, 33:4308 4319, 2020. Chen, H., Chen, Y., Wang, X., Xie, R., Wang, R., Xia, F., and Zhu, W. Curriculum disentangled recommendation with noisy multi-feedback. Advances in Neural Information Processing Systems, 34:26924 26936, 2021a. Chen, H., Wang, X., Lan, X., Chen, H., Duan, X., Jia, J., and Zhu, W. Curriculum-listener: Consistency-and complementarity-aware audio-enhanced temporal sentence grounding. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 3117 3128, 2023. Chen, Y., Wang, X., Fan, M., Huang, J., Yang, S., and Zhu, W. Curriculum meta-learning for next poi recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2692 2702, 2021b. Cheng, H., Lian, D., Deng, B., Gao, S., Tan, T., and Geng, Y. Local to global learning: Gradually adding classes for training deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4748 4756, 2019. Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. Classbalanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9268 9277, 2019. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255. Ieee, 2009. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020. Elman, J. L. 
Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71 99, 1993. Fan, Y., He, R., Liang, J., and Hu, B. Self-paced learning: An implicit regularization perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In International conference on machine learning, pp. 1126 1135. PMLR, 2017. Florensa, C., Held, D., Wulfmeier, M., Zhang, M., and Abbeel, P. Reverse curriculum generation for reinforcement learning. In Conference on robot learning, pp. 482 495. PMLR, 2017. Gong, T., Zhao, Q., Meng, D., and Xu, Z. Why curriculum learning & self-paced learning work in big/noisy data: A theoretical perspective. Big Data & Information Analytics, 1(1):111, 2016. Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. Automated curriculum learning for neural networks. In international conference on machine learning, pp. 1311 1320. PMLR, 2017. Gui, L., Baltruˇsaitis, T., and Morency, L.-P. Curriculum learning for facial expression recognition. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 505 511. IEEE, 2017. Guo, S., Huang, W., Zhang, H., Zhuang, C., Dong, D., Scott, M. R., and Huang, D. Curriculumnet: Weakly supervised learning from large-scale web images. In Proceedings of the European conference on computer vision (ECCV), pp. 135 150, 2018. Hacohen, G. and Weinshall, D. On the power of curriculum learning in training deep networks. In International Conference on Machine Learning, pp. 2535 2544. PMLR, 2019. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Cur Bench: Curriculum Learning Benchmark conference on computer vision and pattern recognition, pp. 770 778, 2016. Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735 1780, 1997. Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. ar Xiv preprint ar Xiv:2005.00687, 2020. Kim, T.-H. and Choi, J. Screenernet: Learning self-paced curriculum for deep neural networks. ar Xiv preprint ar Xiv:1801.00904, 2018. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. ar Xiv preprint ar Xiv:1609.02907, 2016. Kong, Y., Liu, L., Wang, J., and Tao, D. Adaptive curriculum learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5067 5076, 2021. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Kumar, M., Packer, B., and Koller, D. Self-paced learning for latent variable models. Advances in neural information processing systems, 23, 2010. Lan, X., Yuan, Y., Chen, H., Wang, X., Jie, Z., Ma, L., Wang, Z., and Zhu, W. Curriculum multi-negative augmentation for debiased video grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 1213 1221, 2023. Le, Y. and Yang, X. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015. Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Li, H., Wang, X., and Zhu, W. Curriculum graph machine learning: A survey. ar Xiv preprint ar Xiv:2302.02926, 2023. Liu, C., He, S., Liu, K., Zhao, J., et al. 
Curriculum learning for natural answer generation. In IJCAI, pp. 4223 4229, 2018. Matiisen, T., Oliver, A., Cohen, T., and Schulman, J. Teacher student curriculum learning. IEEE transactions on neural networks and learning systems, 31(9):3732 3740, 2019. Morris, C., Kriege, N. M., Bause, F., Kersting, K., Mutzel, P., and Neumann, M. Tudataset: A collection of benchmark datasets for learning with graphs. ar Xiv preprint ar Xiv:2007.08663, 2020. Narvekar, S., Sinapov, J., and Stone, P. Autonomous task sequencing for customized curriculum design in reinforcement learning. In IJCAI, pp. 2536 2542, 2017. Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M. E., and Stone, P. Curriculum learning for reinforcement learning domains: A framework and survey. The Journal of Machine Learning Research, 21(1):7382 7431, 2020. Platanios, E. A., Stretcu, O., Neubig, G., Poczos, B., and Mitchell, T. M. Competence-based curriculum learning for neural machine translation. ar Xiv preprint ar Xiv:1903.09848, 2019. Portelas, R., Colas, C., Weng, L., Hofmann, K., and Oudeyer, P.-Y. Automatic curriculum learning for deep rl: A short survey. ar Xiv preprint ar Xiv:2003.04664, 2020. Qin, Y., Wang, X., Zhang, Z., Chen, H., and Zhu, W. Multitask graph neural architecture search with task-aware collaboration and curriculum. Advances in neural information processing systems, 36, 2024. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In International conference on machine learning, pp. 4334 4343. PMLR, 2018a. Ren, Z., Dong, D., Li, H., and Chen, C. Self-paced prioritized curriculum learning with coverage penalty in deep reinforcement learning. IEEE transactions on neural networks and learning systems, 29(6):2216 2226, 2018b. Saxena, S., Tuzel, O., and De Coste, D. Data parameters: A new family of parameters for learning a differentiable curriculum. Advances in Neural Information Processing Systems, 32, 2019. Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and Meng, D. Meta-weight-net: Learning an explicit mapping for sample weighting. Advances in neural information processing systems, 32, 2019. Sinha, S., Garg, A., and Larochelle, H. Curriculum by smoothing. Advances in Neural Information Processing Systems, 33:21653 21664, 2020. Soviany, P., Ardei, C., Ionescu, R. T., and Leordeanu, M. Image difficulty curriculum for generative adversarial networks (cugan). In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 3463 3472, 2020. Cur Bench: Curriculum Learning Benchmark Soviany, P., Ionescu, R. T., Rota, P., and Sebe, N. Curriculum learning: A survey. International Journal of Computer Vision, 130(6):1526 1565, 2022. Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. From baby steps to leapfrog: How less is more in unsupervised dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 751 759, 2010. Stretcu, O., Platanios, E. A., Mitchell, T. M., and P oczos, B. Coarse-to-fine curriculum learning. ar Xiv preprint ar Xiv:2106.04072, 2021. Tay, Y., Wang, S., Tuan, L. A., Fu, J., Phan, M. C., Yuan, X., Rao, J., Hui, S. C., and Zhang, A. Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives. 
ar Xiv preprint ar Xiv:1905.10847, 2019. Tsvetkov, Y., Faruqui, M., Ling, W., Mac Whinney, B., and Dyer, C. Learning the curriculum with bayesian optimization for task-specific word representation learning. ar Xiv preprint ar Xiv:1605.03852, 2016. Veliˇckovi c, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. ar Xiv preprint ar Xiv:1710.10903, 2017. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Glue: A multi-task benchmark and analysis platform for natural language understanding. ar Xiv preprint ar Xiv:1804.07461, 2018. Wang, X., Pham, H., Michel, P., Anastasopoulos, A., Carbonell, J., and Neubig, G. Optimizing data usage via differentiable rewards. In International Conference on Machine Learning, pp. 9983 9995. PMLR, 2020. Wang, X., Chen, Y., and Zhu, W. A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021a. Wang, X., Pan, Z., Zhou, Y., Chen, H., Ge, C., and Zhu, W. Curriculum co-disentangled representation learning across multiple environments for social recommendation. In International Conference on Machine Learning, pp. 36174 36192. PMLR, 2023a. Wang, Y., Wang, W., Liang, Y., Cai, Y., and Hooi, B. Curgraph: Curriculum learning for graph classification. In Proceedings of the Web Conference 2021, pp. 1238 1248, 2021b. Wang, Y., Yue, Y., Lu, R., Liu, T., Zhong, Z., Song, S., and Huang, G. Efficienttrain: Exploring generalized curriculum learning for training visual backbones. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5852 5864, 2023b. Wei, X., Gong, X., Zhan, Y., Du, B., Luo, Y., and Hu, W. Clnode: Curriculum learning for node classification. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pp. 670 678, 2023. Weinshall, D., Cohen, G., and Amir, D. Curriculum learning by transfer learning: Theory and experiments with deep networks. In International Conference on Machine Learning, pp. 5238 5246. PMLR, 2018. Wu, Z., Wang, X., Chen, H., Li, K., Han, Y., Sun, L., and Zhu, W. Diff4rec: Sequential recommendation with curriculum-scheduled diffusion augmentation. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9329 9335, 2023. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? ar Xiv preprint ar Xiv:1810.00826, 2018. Yang, Z., Xu, Q., Bao, S., Cao, X., and Huang, Q. Learning with multiclass auc: Theory and algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (11):7747 7763, 2021. Yao, Y., Wang, X., Qin, Y., Zhang, Z., Zhu, W., and Mei, H. Data-augmented curriculum graph neural architecture search under distribution shifts. Proceedings of the AAAI Conference on Artificial Intelligence, 38(15):16433 16441, Mar. 2024. Zagoruyko, S. and Komodakis, N. Wide residual networks. ar Xiv preprint ar Xiv:1605.07146, 2016. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. Ar Xiv, abs/1611.03530, 2016. URL https://api.semanticscholar. org/Corpus ID:6212000. Zhang, D., Tian, H., and Han, J. Few-cost salient object detection with adversarial-paced learning. Advances in Neural Information Processing Systems, 33:12236 12247, 2020. Zhang, Z., Zhang, Z., Wang, X., and Zhu, W. Learning to solve travelling salesman problem with hardness-adaptive curriculum. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 9136 9144, 2022. 
Zhao, M., Wu, H., Niu, D., and Wang, X. Reinforced curriculum learning on pre-trained neural machine translation models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 9652 9659, 2020. Cur Bench: Curriculum Learning Benchmark Zhou, T. and Bilmes, J. Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity. In International Conference on Learning Representations, 2018. Zhou, T., Wang, S., and Bilmes, J. Curriculum learning by dynamic instance hardness. Advances in Neural Information Processing Systems, 33:8602 8613, 2020. Zhou, Y., Chen, H., Pan, Z., Yan, C., Lin, F., Wang, X., and Zhu, W. Curml: A curriculum machine learning library. In Proceedings of the 30th ACM International Conference on Multimedia, pp. 7359 7363, 2022a. Zhou, Y., Wang, X., Chen, H., Duan, X., Guan, C., and Zhu, W. Curriculum-nas: Curriculum weight-sharing neural architecture search. In Proceedings of the 30th ACM International Conference on Multimedia, pp. 6792 6801, 2022b. Zhou, Y., Wang, X., Chen, H., Duan, X., and Zhu, W. Intraand inter-modal curriculum for multimodal learning. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 3724 3735, 2023. Cur Bench: Curriculum Learning Benchmark A. Appendix Abstract In this appendix, we first list the essential information of the datasets in Section B and backbone models in Section C. Then we summarize the curriculum methods implemented in this work in Section D to present how these methods were evaluated when they were proposed. After providing the training hyperparameters in Section E and method hyperparameters in Section F, we report the detailed performance and complexity of each method in various experimental setups in Section G. B. Datasets All the datasets included in Cur Bench are publicly available for research. To eliminate the risk of ethical or license issues, we list the essential information of the datasets, such as their home pages, common download links, and licenses. Domain Home Page Download Link License CV CIFAR Py Torch MIT Tiny-Image Net CS231n MIT NLP GLUE Hugging Face Various Graph TUDataset Py Torch Geometric Various OGB OGB Dataset MIT Table 4. The home pages, download links, and licenses of datasets. Concretely, in this work, we download CIFAR via Py Torch API, GLUE via Hugging Face API, TUDataset via Py Torch Geometric (Py G) API, and OGB dataset via OGB API. For Tiny-Image Net, we download the zip file from CS231n, and adjust its file structure to the same form as CIFAR for easier loading with the help of the tool code from Github: lromor/tinyimagenet.py. C. Backbone Models For the standardization and reliability of Cur Bench, we implement all backbone models by referencing highly recognized code repositories as shown in Table 5. Domain Model Reference CV Le Net, Res Net-18 pytorch-cifar Vi T vit-pytorch NLP LSTM lstm-gru-pytorch BERT, GPT2 Hugging Face Graph GCN, GAT, GIN Py Torch Geometric Table 5. The implementation references of backbone models. Among these models, BERT and GPT2 are initiated with the pretrained parameters from Hugging Face and finetuned in this work, while others are trained from scratch. D. Curriculum Learning Methods When designing Cur Bench, we are inclined to the datasets and models used in previous works for evaluation. Therefore, we have surveyed what datasets and models are commonly employed and completed the Table 6. 
D. Curriculum Learning Methods

When designing CurBench, we lean toward the datasets and models used in previous works for evaluation. Therefore, we have surveyed which datasets and models are commonly employed and summarize the results in Table 6. Table 6 shows that when researchers propose a curriculum learning method, they almost always evaluate its performance on image classification tasks; only a few works additionally apply their methods to datasets for object detection or neural machine translation. Besides, not all works take different settings, such as noise or imbalance, into consideration. Therefore, as stated in the main text, we not only select the datasets and models in the CV domain, which are commonly used in previous related works, but also supplement those in the NLP and graph domains to investigate how well the methods adapt to various scenarios.

E. Training Hyperparameters

To ensure a fair evaluation, we run each experiment 5 times with fixed random seeds s ∈ {42, 666, 777, 888, 999}, and report the average and standard deviation of the results. Besides, we strictly set the training hyperparameters as follows (a minimal sketch of the CV setup is given after this list):

- LeNet, ResNet-18, ViT: We choose a batch size of 50, and use an Adam optimizer to train the model with a constant learning rate of 0.0001 for 200 epochs.
- LSTM: We choose a batch size of 50, and use an SGD optimizer to train the model with a cosine annealing learning rate of 0.00001 for 10 epochs.
- BERT, GPT2: We choose a batch size of 50, and use an AdamW optimizer to train the model with a constant learning rate of 0.00002 for 3 epochs.
- GCN, GAT, GIN: We choose a batch size of 50, and use an Adam optimizer to train the model for 200 epochs, with learning rates of 0.01 for TUDataset and 0.001 for OGB.
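The following is a minimal sketch of the CV training setup described above (Adam optimizer, constant learning rate 0.0001, batch size 50, 200 epochs, five fixed seeds). The build_model and train_set arguments are assumed placeholders for a backbone constructor and a dataset, not CurBench's actual API.

```python
# Minimal sketch of the CV training setup (assumed placeholders, not CurBench's API).
import torch
from torch.utils.data import DataLoader

SEEDS = [42, 666, 777, 888, 999]
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def run_once(seed, build_model, train_set, epochs=200, lr=1e-4, batch_size=50):
    torch.manual_seed(seed)                                   # fix the random seed for this run
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    model = build_model().to(DEVICE)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # constant learning rate of 0.0001
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):                                   # 200 epochs for the CV backbones
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs.to(DEVICE)), targets.to(DEVICE))
            loss.backward()
            optimizer.step()
    return model

# Each configuration is run once per seed in SEEDS, and the mean and standard
# deviation over the five runs are reported.
```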
F. Method Hyperparameters

For a reproducible evaluation, we list the hyperparameters that we select for the curriculum learning methods in Table 7. It should be noted that this table includes the hyperparameters for the experiments with 200 epochs. For text-domain tasks trained for 3 or 10 epochs, we slightly adjust some epoch-related hyperparameters, such as grow epochs, warm epochs, and schedule epochs, to adapt to these tasks.

G. Detailed Complexity and Performance

Tables 8 to 11 report the detailed complexity and performance of each method in the various experimental setups.

Method (Reference) | Venue | Datasets | Models
SPL (Kumar et al., 2010) | NIPS, 2010 | MUC6, UniProbe, MNIST, Mammals | SSVM
TTCL (Weinshall et al., 2018) | ICML, 2018 | CIFAR-100, STL-10 | CNN
MCL (Zhou & Bilmes, 2018) | ICLR, 2018 | News-20, MNIST, CIFAR-10, STL-10, SVHN, Fashion | LeNet5, CNN
ScreenerNet (Kim & Choi, 2018) | arXiv, 2018 | Cart-pole-v0, CIFAR-10, MNIST, Pascal VOC | DDQN, CNN
LRE (Ren et al., 2018a) | ICML, 2018 | MNIST, CIFAR-10, CIFAR-100 | LeNet, ResNet-32, WideResNet-28-10
MW-Net (Shu et al., 2019) | NIPS, 2019 | CIFAR-10, CIFAR-100, Clothing1M | ResNet-32, ResNet-50, WideResNet-28-10
DCL (Saxena et al., 2019) | NIPS, 2019 | CIFAR-10, CIFAR-100, ImageNet, WebVision, KITTI | VGG-16, SSDNet, ResNet-18, WideResNet-28-10
LGL (Cheng et al., 2019) | CVPR, 2019 | CIFAR-10, CIFAR-100, ImageNet | VGG-16, ResNet-50
DDS (Wang et al., 2020) | ICML, 2020 | CIFAR-10, ImageNet, TED | LSTM, ResNet-50, WideResNet-28-10
DIHCL (Zhou et al., 2020) | NIPS, 2020 | CIFAR-10, CIFAR-100, ImageNet, Food-101, FGVC Aircraft, Stanford Cars, Birdsnap, FMNIST, KMNIST, STL-10, SVHN | ResNet-50, WideResNet-16-8, WideResNet-28-10, ResNeXt50-32x4d, PreActResNet34
SuperLoss (Castells et al., 2020) | NIPS, 2020 | MNIST, UTKFace, CIFAR-10, CIFAR-100, WebVision, Pascal VOC, Revisited Oxford and Paris | ResNet-18, ResNet-50, ResNet-101, WideResNet-28-10, Faster R-CNN, RetinaNet
CBS (Sinha et al., 2020) | NIPS, 2020 | CIFAR-10, CIFAR-100, ImageNet, SVHN, CelebA, Pascal VOC, MNIST, USPS | VGG-16, ResNet-18, Wide-ResNet-50, ResNeXt-50, VAE, β-VAE
C2F (Stretcu et al., 2021) | arXiv, 2021 | CIFAR-10, CIFAR-100, Shapes, Tiny-ImageNet | ResNet-18, ResNet-50, WideResNet-28-10
Adaptive CL (Kong et al., 2021) | ICCV, 2021 | CIFAR-10, CIFAR-100, subset of ImageNet | MLP, HNN, VGG-16, ResNet-18, ResNet-v1-14
EfficientTrain (Wang et al., 2023b) | ICCV, 2023 | ImageNet-1K/22K, MS COCO, Flowers-102, CIFAR, Stanford Dogs | ResNet, ConvNeXt, DeiT, PVT, Swin, CSWin

Table 6. Summary of the methods reproduced in CurBench, where we overview the datasets and models involved in the related works. Std stands for the standard setting, Noi for noise, and Imb for imbalance.

start ratio 0.0; grow epochs 100; grow fn linear; weight fn hard
start ratio 0.0; grow epochs 100; grow fn linear; weight fn hard
schedule epochs 20; warm epochs 5; lam 1; minlam 0.2; gamma 0.1; fe alpha 2; fe beta 0.75; fe gamma 0.9; fe lambda 0.9
ScreenerNet: M 1.0
LRE: meta split 0.1
MW-Net: meta split 0.1; VNet [1, 100, 1]
init class param 0.0; lr class param 0.1; wd class param 0.0; init data param 1.0; lr data param 0.1; wd data param 0.0
start ratio 0.1; grow ratio 0.3; grow interval 20; strategy random
DDS: meta split 0.1; eps 0.001
warm epochs 50; discount factor 0.9; decay rate 0.9; bottom size 0.5; type loss; sample type random
SuperLoss: tau 0.0; lam 1.0; fac 0.9
kernel size 3; start std 1.0; grow factor 0.9; grow interval 5
C2F: cluster K 3
Adaptive CL: pace p 0.1; pace q 2.5; pace r 15; inv 20; alpha 0.7; gamma 0.1; bottom gamma 0.1
EfficientTrain: epochs {120, 160, 200}; crop size {160, 192, 224}; rand aug 0–9

Table 7. The default hyperparameters we set for each method when the number of training epochs is 200.
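Several entries in Table 7 (e.g., start ratio, grow epochs, and grow fn) parameterize a data-pacing schedule that controls what fraction of the difficulty-sorted training set is available at each epoch. The following is a minimal sketch of such a linear pacing schedule under these assumptions; sorted_indices is an assumed easy-to-hard ordering, and this illustrates the general idea rather than CurBench's actual implementation.

```python
# Minimal sketch of a linear data-pacing schedule parameterized as in Table 7
# (start ratio, grow epochs); not CurBench's actual code.
def pacing_ratio(epoch, start_ratio=0.0, grow_epochs=100):
    """Fraction of the easiest samples available at a given epoch (linear growth)."""
    return min(1.0, start_ratio + (1.0 - start_ratio) * epoch / grow_epochs)

def current_subset(sorted_indices, epoch, start_ratio=0.0, grow_epochs=100):
    """Indices of the samples the model is allowed to see at this epoch."""
    n_visible = max(1, int(pacing_ratio(epoch, start_ratio, grow_epochs) * len(sorted_indices)))
    return sorted_indices[:n_visible]
```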
Method | Training Time (Minute) | GPU Memory (MB)
SPL | 175 | 420
TTCL | 111 | 464
MCL | 106 | 422
ScreenerNet | 291 | 825
LRE | 665 | 1241
MW-Net | 728 | 1241
DCL | 158 | 422
LGL | 149 | 421
DDS | 632 | 1411
DIHCL | 107 | 421
SuperLoss | 156 | 421
CBS | 155 | 538
C2F | 159 | 464
Adaptive CL | 132 | 468
EfficientTrain | 214 | 421
(a) ResNet-18 on CIFAR-10

Method | Training Time (Minute) | GPU Memory (MB)
SPL | 1.28 | 6615
TTCL | 1.12 | 7036
MCL | 2.12 | 6615
ScreenerNet | 2.05 | 13114
LRE | 3.27 | 22989
MW-Net | 4.23 | 22989
DCL | 1.18 | 6615
DDS | 4.02 | 23997
DIHCL | 1.05 | 6615
SuperLoss | 1.10 | 6615
Adaptive CL | 0.68 | 7036
(b) BERT on RTE

Method | Training Time (Minute) | GPU Memory (MB)
SPL | 4.75 | 6.12
TTCL | 3.50 | 5.76
MCL | 3.03 | 5.82
ScreenerNet | 6.25 | 7.46
LRE | 7.77 | 105.41
MW-Net | 8.65 | 24.79
DCL | 3.87 | 5.79
DDS | 11.62 | 20.50
DIHCL | 2.12 | 5.71
SuperLoss | 3.90 | 5.76
Adaptive CL | 3.53 | 5.46
(c) GCN on NCI1

Table 8. Time and space complexity, measured by training time and GPU memory usage on an NVIDIA V100 GPU.

Method | CIFAR-10: Standard, Noise-0.4, Imbalance-50 | CIFAR-100: Standard, Noise-0.4, Imbalance-50 | Tiny-ImageNet: Standard, Noise-0.4, Imbalance-50
SPL | 69.08±0.78, 63.68±1.01, 42.34±0.90 | 34.70±0.72, 26.09±0.69, 18.15±0.68 | 21.53±0.25, 15.59±0.63, 10.17±0.12
TTCL | 68.87±0.69, 64.63±1.00, 44.03±0.54 | 34.19±0.88, 28.83±0.96, 18.39±0.42 | 22.08±0.48, 18.84±0.18, 11.17±0.34
MCL | 65.86±0.31, 62.50±1.01, 34.59±0.90 | 32.60±0.75, 27.09±0.34, 15.90±0.31 | 20.99±0.37, 17.06±0.37, 9.82±0.35
ScreenerNet | 70.43±0.41, 65.45±0.92, 45.28±0.56 | 35.63±0.78, 29.72±0.69, 19.74±0.17 | 22.83±0.44, 18.54±0.29, 11.77±0.19
LRE | 64.52±0.86, 59.88±0.49, 36.24±2.17 | 29.29±0.73, 23.37±0.34, 14.52±0.19 | 18.86±0.66, 14.97±0.21, 8.23±0.13
MW-Net | 69.13±0.44, 63.92±0.98, 45.17±0.82 | 35.40±0.54, 28.09±0.66, 18.95±0.32 | 22.16±0.36, 17.88±0.25, 10.97±0.30
DCL | 67.23±0.49, 64.77±0.59, 39.16±0.87 | 34.09±0.51, 30.02±0.82, 18.13±0.42 | 22.01±0.55, 19.65±0.20, 10.95±0.20
LGL | 69.87±0.71, 65.09±0.78, 44.94±1.25 | 35.04±0.84, 29.56±0.54, 19.28±0.64 | 22.55±0.30, 18.40±0.05, 11.25±0.43
DDS | 65.65±2.84, 63.45±1.84, 41.51±4.52 | 35.11±1.04, 28.49±0.47, 19.05±0.40 | 22.29±0.41, 17.03±1.19, 10.30±1.3
DIHCL | 66.46±0.83, 58.42±0.73, 40.89±1.21 | 28.49±0.59, 27.87±0.23, 15.77±0.62 | 17.72±0.50, 14.74±0.43, 8.16±0.33
SuperLoss | 70.29±0.68, 65.93±0.57, 43.13±0.51 | 34.91±0.68, 30.87±0.48, 18.57±0.17 | 22.27±0.29, 19.91±0.26, 11.23±0.31
CBS | 69.79±0.36, 63.47±0.96, 44.60±1.77 | 35.17±0.63, 28.14±0.74, 18.87±0.60 | 21.87±0.58, 17.78±0.58, 11.10±0.43
C2F | 69.49±0.37, 64.35±0.79, 43.74±0.86 | 35.51±0.40, 29.92±0.58, 19.24±0.52 | 22.44±0.22, 18.78±0.23, 11.69±0.32
Adaptive CL | 69.25±0.43, 63.93±0.97, 42.87±0.47 | 34.58±0.51, 28.36±0.43, 18.59±0.26 | 22.62±0.30, 18.09±0.37, 10.98±0.20
EfficientTrain | 70.34±0.44, 62.96±0.84, 43.92±1.01 | 35.59±0.66, 28.04±0.71, 18.78±0.62 | 22.31±0.42, 18.05±0.17, 12.36±0.47
(a) LeNet

Method | CIFAR-10: Standard, Noise-0.4, Imbalance-50 | CIFAR-100: Standard, Noise-0.4, Imbalance-50 | Tiny-ImageNet: Standard, Noise-0.4, Imbalance-50
SPL | 91.54±0.26, 70.68±2.25, 74.71±0.74 | 68.13±0.47, 34.09±0.39, 39.80±0.93 | 48.99±0.41, 22.49±0.41, 26.04±0.93
TTCL | 92.35±0.13, 86.92±0.20, 75.59±0.56 | 67.52±0.46, 58.56±0.60, 38.40±0.97 | 48.50±0.34, 41.81±0.67, 25.32±0.46
MCL | 91.76±0.15, 77.84±0.33, 73.71±0.85 | 68.68±0.37, 45.95±0.58, 40.49±0.67 | 51.46±0.16, 34.39±0.66, 28.08±0.28
ScreenerNet | 92.74±0.20, 81.63±0.70, 75.37±0.56 | 71.31±0.14, 51.96±0.56, 43.47±0.43 | 53.61±0.48, 39.22±0.57, 30.82±0.36
LRE | 90.80±0.22, 80.35±0.50, 73.71±0.36 | 66.99±0.24, 50.31±0.88, 40.69±0.69 | 49.86±0.37, 36.40±0.33, 27.49±0.41
MW-Net | 91.79±0.26, 79.77±0.44, 74.86±0.59 | 69.09±0.25, 49.87±0.32, 40.99±0.48 | 50.93±0.36, 37.79±0.43, 27.96±0.55
DCL | 92.41±0.25, 82.44±0.66, 76.30±0.88 | 69.80±0.47, 54.01±0.57, 42.31±0.41 | 52.25±0.43, 40.67±0.42, 28.83±0.63
LGL | 92.19±0.20, 73.42±0.41, 74.87±0.40 | 69.08±0.15, 39.93±0.58, 41.07±0.31 | 50.32±0.38, 27.35±0.32, 27.21±0.21
DDS | 90.94±2.26, 78.74±3.07, 70.24±8.53 | 68.87±0.17, 46.87±1.72, 37.93±2.71 | 50.84±0.30, 37.54±0.37, 26.96±1.29
DIHCL | 91.87±0.21, 77.38±0.42, 74.31±0.60 | 67.36±0.33, 44.19±0.37, 39.51±0.75 | 50.59±0.32, 32.70±0.38, 26.36±0.34
SuperLoss | 92.27±0.22, 84.54±0.40, 76.43±0.96 | 69.53±0.43, 57.51±0.45, 42.43±1.01 | 52.38±0.53, 43.64±0.72, 28.85±0.38
CBS | 90.94±0.27, 75.79±0.79, 72.90±0.66 | 63.67±0.37, 41.14±0.39, 36.19±0.91 | 45.67±0.25, 30.42±0.53, 24.19±0.34
C2F | 91.98±0.17, 80.27±0.52, 75.26±1.16 | 69.86±0.17, 50.48±1.32, 42.47±0.79 | 51.96±0.45, 38.04±0.43, 28.90±0.39
Adaptive CL | 91.91±0.08, 74.30±0.79, 73.18±1.37 | 66.04±0.41, 38.13±0.88, 36.30±0.49 | 46.47±0.24, 27.75±0.34, 23.31±0.51
EfficientTrain | 92.88±0.23, 79.91±0.23, 74.58±1.84 | 69.40±0.20, 50.52±0.32, 39.91±0.62 | 51.76±0.42, 38.33±0.24, 28.15±0.52
(b) ResNet-18

Method | CIFAR-10: Standard, Noise-0.4, Imbalance-50 | CIFAR-100: Standard, Noise-0.4, Imbalance-50 | Tiny-ImageNet: Standard, Noise-0.4, Imbalance-50
SPL | 78.10±1.29, 60.82±0.92, 49.81±1.29 | 47.66±0.40, 28.42±1.21, 24.39±0.78 | 33.71±0.63, 17.30±0.87, 15.09±0.39
TTCL | 77.36±0.34, 69.83±0.53, 50.82±0.79 | 45.35±0.59, 39.15±0.30, 24.15±0.25 | 35.61±0.15, 29.76±0.34, 15.83±0.38
MCL | 77.85±0.55, 61.61±0.65, 49.68±0.79 | 49.90±0.53, 31.46±0.59, 25.02±0.62 | 36.66±0.75, 21.50±0.58, 16.30±0.43
ScreenerNet | 80.45±0.53, 64.20±0.50, 51.34±1.08 | 51.93±0.64, 34.77±0.21, 26.32±0.23 | 38.14±0.80, 24.90±0.49, 17.47±0.14
LRE | 75.81±0.52, 61.11±2.95, 46.13±2.60 | 45.59±0.64, 30.91±0.29, 24.00±0.51 | 34.10±0.71, 21.42±0.28, 13.98±0.44
MW-Net | 77.39±2.30, 63.01±0.60, 51.19±1.25 | 49.46±0.44, 33.99±0.38, 24.86±0.47 | 37.16±0.29, 23.49±0.33, 16.13±0.40
DCL | 80.66±0.27, 66.00±0.07, 51.73±1.25 | 51.23±0.62, 37.01±0.35, 26.40±0.34 | 38.92±0.53, 26.17±0.37, 17.20±0.34
LGL | 79.52±0.38, 63.19±0.91, 52.14±1.18 | 50.39±0.63, 31.34±1.16, 26.09±0.51 | 36.25±0.47, 20.22±0.52, 16.43±0.43
DDS | 77.54±2.13, 63.46±0.22, 51.12±0.73 | 49.67±0.66, 33.79±0.45, 24.81±0.38 | 36.60±0.46, 23.47±0.42, 16.18±0.61
DIHCL | 78.09±0.73, 63.39±0.41, 50.78±0.72 | 49.80±0.34, 33.64±0.22, 25.49±0.32 | 37.89±0.48, 22.36±0.57, 16.29±0.32
SuperLoss | 79.42±0.25, 66.13±0.49, 51.86±0.60 | 49.25±0.37, 37.84±0.39, 25.72±0.27 | 38.25±0.42, 28.04±0.39, 16.93±0.26
CBS | 79.85±0.37, 64.07±0.65, 52.85±0.81 | 51.05±0.62, 35.25±0.24, 26.05±0.52 | 38.28±0.71, 24.88±0.27, 17.15±0.31
C2F | 79.63±0.65, 61.97±1.38, 52.00±1.14 | 50.16±0.74, 32.58±0.67, 25.28±0.32 | 38.51±0.21, 25.22±0.77, 17.02±0.68
Adaptive CL | 78.85±0.60, 62.55±0.78, 51.60±1.49 | 48.30±0.68, 31.73±0.58, 24.81±0.56 | 33.94±0.45, 20.12±0.43, 15.27±0.40
EfficientTrain | 79.67±0.47, 62.62±0.37, 50.71±1.53 | 50.98±0.50, 34.56±0.22, 25.47±0.62 | 38.21±1.06, 25.08±0.33, 16.20±0.17
(c) ViT

Table 9. The performance of each curriculum learning method in the CV research domain.
Method | RTE: Standard, Noise-0.4 | MRPC: Standard, Noise-0.4 | STS-B: Standard, Noise-0.4 | CoLA: Standard, Noise-0.4
SPL | 52.42±0.84, 53.36±0.53 | 80.64±0.87, 80.46±1.17 | 11.04±1.13, 8.76±2.64 | 9.96±2.17, 3.69±2.58
TTCL | 52.78±0.14, 53.79±1.81 | 81.54±0.18, 81.22±0.00 | 14.11±2.21, 11.10±2.25 | 12.44±2.22, 8.55±2.10
MCL | 52.85±0.29, 52.64±0.58 | 81.22±0.00, 80.95±0.54 | 12.95±1.23, 10.55±1.32 | 10.13±1.36, 4.16±1.92
ScreenerNet | 52.85±0.18, 53.72±0.86 | 81.40±0.11, 81.24±0.05 | 13.22±0.96, 10.99±1.41 | 12.33±1.01, 3.51±2.16
DCL | 53.07±1.29, 54.22±1.77 | 81.46±0.18, 81.22±0.00 | 12.67±0.79, 11.62±1.10 | 11.06±1.68, 2.50±1.89
DDS | 52.71±0.00, 53.14±0.42 | 81.37±0.08, 81.23±0.03 | 12.54±1.28, 11.27±2.73 | 12.65±1.21, 3.51±2.26
DIHCL | 52.71±0.00, 53.72±0.77 | 81.37±0.14, 81.22±0.00 | 13.99±1.26, 9.89±0.80 | 11.69±2.90, 3.41±2.69
SuperLoss | 52.71±0.00, 53.43±1.10 | 81.39±0.14, 81.22±0.00 | 12.36±1.65, 11.75±1.61 | 10.82±1.93, 3.59±1.65
Adaptive CL | 52.06±1.30, 53.00±0.27 | 81.39±0.17, 81.22±0.00 | 12.91±1.16, 10.32±0.91 | 9.82±0.68, 4.38±2.36

Method | SST-2: Standard, Noise-0.4 | QNLI: Standard, Noise-0.4 | QQP: Standard, Noise-0.4 | MNLI-(m): Standard, Noise-0.4 | MNLI-(mm): Standard, Noise-0.4
SPL | 81.90±0.62, 63.23±0.76 | 51.02±0.46, 50.74±0.19 | 74.39±0.35, 59.63±0.79 | 60.62±0.30, 36.58±0.97 | 60.45±0.36, 36.36±0.99
TTCL | 82.13±0.91, 78.58±1.64 | 50.65±0.22, 50.73±0.18 | 75.14±0.16, 66.47±0.72 | 62.47±0.36, 58.59±0.54 | 62.33±0.42, 58.50±0.64
MCL | 82.52±0.99, 63.10±2.08 | 50.54±0.00, 50.72±0.29 | 75.10±0.15, 59.29±0.39 | 60.92±0.42, 45.55±1.91 | 60.82±0.24, 46.32±2.08
ScreenerNet | 82.07±0.43, 64.42±0.85 | 50.55±0.02, 50.72±0.23 | 74.27±0.19, 61.33±0.30 | 61.38±0.37, 42.36±1.49 | 60.71±0.25, 43.03±1.60
DCL | 82.02±0.76, 64.36±1.08 | 50.54±0.00, 50.62±0.15 | 75.58±0.31, 60.77±0.70 | 61.61±0.34, 44.13±0.74 | 61.21±0.41, 45.04±0.77
DDS | 82.48±0.68, 62.16±1.36 | 50.54±0.00, 50.77±0.27 | 74.92±0.14, 60.95±0.42 | 60.75±0.42, 42.46±0.89 | 60.43±0.19, 42.85±0.98
DIHCL | 82.09±0.88, 62.43±0.92 | 50.54±0.00, 50.83±0.45 | 74.09±0.10, 59.71±1.05 | 58.84±0.39, 37.17±0.60 | 58.84±0.74, 36.65±0.81
SuperLoss | 82.87±0.88, 65.48±0.62 | 50.59±0.10, 50.76±0.18 | 75.73±0.21, 59.83±0.19 | 60.64±0.33, 47.08±1.68 | 60.91±0.58, 47.63±1.52
Adaptive CL | 82.74±0.75, 64.22±2.23 | 50.54±0.00, 50.70±0.23 | 74.85±0.45, 60.05±1.30 | 61.39±0.34, 41.43±1.69 | 60.65±0.45, 42.10±1.82
(a) LSTM

Method | RTE: Standard, Noise-0.4 | MRPC: Standard, Noise-0.4 | STS-B: Standard, Noise-0.4 | CoLA: Standard, Noise-0.4
SPL | 61.37±3.63, 51.34±3.48 | 87.21±2.02, 80.57±1.61 | 85.07±0.49, 80.91±0.63 | 56.07±4.94, 15.08±6.41
TTCL | 66.35±1.76, 56.32±5.04 | 88.63±1.88, 81.79±0.57 | 84.91±0.68, 80.74±1.66 | 57.26±0.87, 45.79±1.64
MCL | 66.35±2.02, 55.09±2.22 | 88.69±1.24, 78.94±2.59 | 85.42±0.22, 79.21±0.65 | 56.24±2.37, 30.20±5.94
ScreenerNet | 64.69±1.62, 52.49±5.06 | 87.78±0.99, 79.04±4.22 | 84.91±0.45, 80.69±0.97 | 56.37±1.62, 33.25±3.26
LRE | 58.94±1.34, 53.36±1.24 | 81.73±0.34, 80.90±0.64 | 81.08±1.76, 75.52±2.07 | 51.56±2.12, 26.92±3.88
MW-Net | 66.28±0.81, 53.86±2.73 | 88.09±0.61, 80.89±0.67 | 84.99±0.92, 79.16±1.19 | 56.34±2.19, 30.80±1.89
DCL | 66.21±2.58, 53.79±4.20 | 88.53±1.13, 81.94±0.55 | 85.09±0.51, 80.99±1.22 | 57.47±1.91, 32.66±3.66
DDS | 64.55±1.03, 55.45±3.89 | 87.32±1.11, 79.41±2.43 | 84.38±0.88, 78.00±1.97 | 56.12±1.23, 27.49±1.52
DIHCL | 64.48±1.22, 54.80±2.44 | 86.85±1.12, 81.47±0.39 | 85.05±0.27, 81.31±0.25 | 52.34±1.49, 30.49±4.98
SuperLoss | 66.06±1.98, 53.79±4.85 | 88.05±0.95, 81.82±0.65 | 84.58±0.68, 79.78±1.35 | 57.35±2.10, 31.81±2.97
Adaptive CL | 65.85±1.18, 54.80±4.51 | 87.54±0.61, 81.64±0.64 | 85.27±0.35, 79.72±0.64 | 57.80±1.96, 31.58±3.18

Method | SST-2: Standard, Noise-0.4 | QNLI: Standard, Noise-0.4 | QQP: Standard, Noise-0.4 | MNLI-(m): Standard, Noise-0.4 | MNLI-(mm): Standard, Noise-0.4
SPL | 91.49±1.78, 85.13±2.62 | 90.28±0.62, 80.98±1.29 | 87.30±0.34, 76.17±1.30 | 83.87±0.61, 77.63±0.63 | 84.25±0.61, 78.59±0.71
TTCL | 92.48±0.41, 91.25±0.59 | 91.37±0.16, 89.45±0.44 | 87.45±0.46, 84.50±0.25 | 83.99±0.31, 81.73±0.31 | 84.34±0.45, 82.25±0.40
MCL | 92.41±0.20, 84.33±0.91 | 91.24±0.23, 80.71±1.08 | 88.16±0.13, 74.19±1.02 | 83.86±0.42, 76.85±0.79 | 84.11±0.29, 77.92±0.79
ScreenerNet | 92.48±0.27, 87.75±0.96 | 91.18±0.11, 81.87±1.40 | 87.53±0.22, 75.85±1.26 | 83.83±0.42, 78.59±0.52 | 84.13±0.44, 79.16±0.61
LRE | 92.18±0.38, 86.61±1.54 | 89.32±0.47, 80.37±0.83 | 84.56±0.32, 72.30±0.02 | 82.21±0.29, 75.63±0.51 | 82.58±0.29, 76.40±0.53
MW-Net | 92.62±0.41, 87.06±0.96 | 91.28±0.20, 81.27±1.40 | 87.44±0.19, 75.48±0.61 | 84.01±0.23, 78.35±0.68 | 84.39±0.38, 78.96±0.62
DCL | 92.82±0.16, 86.67±2.23 | 91.49±0.13, 81.41±1.98 | 88.03±0.21, 75.26±0.95 | 84.24±0.27, 78.55±0.46 | 84.40±0.42, 79.39±0.89
DDS | 92.41±0.28, 86.19±0.56 | 91.14±0.14, 81.88±0.71 | 87.50±0.25, 76.04±0.57 | 83.89±0.12, 78.51±0.37 | 84.38±0.18, 78.85±0.25
DIHCL | 92.52±0.31, 87.75±0.81 | 91.23±0.11, 83.03±1.09 | 86.74±0.35, 76.75±0.43 | 83.28±0.32, 78.51±0.73 | 83.57±0.32, 79.42±0.86
SuperLoss | 92.69±0.41, 87.57±1.45 | 91.18±0.14, 82.33±0.51 | 87.79±0.20, 75.90±0.55 | 84.27±0.07, 77.68±0.65 | 84.36±0.22, 78.71±0.67
Adaptive CL | 92.32±0.32, 85.89±1.43 | 91.24±0.27, 80.58±1.91 | 87.60±0.40, 76.27±1.02 | 84.11±0.50, 78.79±0.35 | 84.39±0.45, 79.35±0.59
(b) BERT

Method | RTE: Standard, Noise-0.4 | MRPC: Standard, Noise-0.4 | STS-B: Standard, Noise-0.4 | CoLA: Standard, Noise-0.4
SPL | 59.42±4.22, 54.73±2.38 | 84.32±1.09, 79.47±2.26 | 76.66±2.48, 63.16±4.95 | 30.72±8.65, 2.95±0.70
TTCL | 64.55±0.62, 57.40±3.39 | 84.79±0.81, 82.55±0.88 | 73.06±4.74, 68.35±4.64 | 33.83±3.10, 12.54±2.75
MCL | 66.21±0.87, 54.95±1.97 | 86.29±0.36, 80.26±1.29 | 80.82±1.39, 71.57±1.74 | 39.95±3.16, 8.40±1.86
ScreenerNet | 65.13±1.61, 53.36±3.78 | 84.97±0.54, 78.58±3.03 | 74.77±4.11, 69.49±2.69 | 35.89±7.49, 6.27±2.33
LRE | 60.22±2.11, 52.85±2.54 | 81.98±0.08, 75.27±3.39 | 56.41±1.41, 65.02±4.00 | 35.00±1.98, 3.31±2.50
MW-Net | 64.33±3.46, 54.94±4.57 | 84.06±0.82, 77.33±5.87 | 77.11±2.14, 65.77±3.44 | 35.24±5.04, 3.47±1.68
DCL | 66.35±2.10, 55.52±2.75 | 85.39±0.89, 77.80±3.50 | 77.63±1.76, 68.68±2.96 | 36.59±3.57, 6.95±3.88
DDS | 61.23±3.39, 53.79±3.00 | 82.63±0.69, 74.59±3.52 | 72.41±6.45, 60.72±3.34 | 31.87±1.82, 4.11±2.82
DIHCL | 63.83±2.48, 55.45±2.38 | 83.26±0.53, 78.61±2.44 | 73.10±3.53, 63.71±1.27 | 33.58±1.92, 3.66±1.96
SuperLoss | 66.21±0.96, 53.72±1.70 | 85.12±0.62, 79.18±3.14 | 73.65±4.55, 66.13±3.65 | 37.60±2.98, 8.90±5.55
Adaptive CL | 65.49±1.38, 53.86±0.86 | 84.82±0.98, 78.05±3.47 | 76.58±3.05, 66.30±2.02 | 33.61±3.90, 6.50±1.55

Method | SST-2: Standard, Noise-0.4 | QNLI: Standard, Noise-0.4 | QQP: Standard, Noise-0.4 | MNLI-(m): Standard, Noise-0.4 | MNLI-(mm): Standard, Noise-0.4
SPL | 91.93±0.45, 85.44±1.40 | 87.79±0.35, 76.29±2.81 | 85.29±0.31, 73.76±1.51 | 81.05±0.27, 76.47±0.27 | 81.93±0.52, 77.48±0.40
TTCL | 92.18±0.66, 90.34±0.53 | 88.10±0.22, 84.00±0.70 | 85.50±0.28, 82.16±0.35 | 81.55±0.27, 78.36±0.19 | 82.18±0.23, 79.62±0.44
MCL | 92.18±0.44, 84.13±1.83 | 88.17±0.67, 77.80±1.75 | 86.68±0.16, 74.27±2.28 | 81.90±0.23, 75.44±0.86 | 82.59±0.35, 76.92±0.92
ScreenerNet | 91.77±0.63, 86.74±1.35 | 87.88±0.50, 77.93±1.86 | 85.87±0.05, 73.43±2.11 | 81.78±0.22, 76.29±0.30 | 82.40±0.11, 77.54±0.44
LRE | 91.24±0.20, 84.44±1.20 | 84.83±0.58, 63.25±3.95 | 83.11±0.73, 70.22±1.22 | 78.93±0.47, 72.35±0.71 | 80.06±0.51, 74.05±0.66
MW-Net | 91.56±0.28, 86.40±1.58 | 88.00±0.38, 75.53±3.18 | 85.70±0.27, 74.64±0.64 | 81.58±0.36, 75.89±0.28 | 82.42±0.30, 76.81±0.18
DCL | 92.06±0.49, 86.05±1.25 | 87.98±0.19, 78.82±0.64 | 85.99±0.21, 75.44±0.72 | 81.53±0.27, 76.60±0.47 | 82.41±0.20, 77.53±0.30
DDS | 91.97±0.23, 87.73±1.61 | 84.59±2.24, 79.88±0.02 | 85.73±0.05, 72.56±2.21 | 81.41±0.31, 75.49±0.21 | 82.14±0.40, 76.86±0.08
DIHCL | 91.88±0.41, 87.02±1.14 | 86.85±0.34, 78.97±0.88 | 83.92±0.41, 75.07±0.57 | 80.30±0.23, 76.41±0.14 | 81.69±0.12, 77.68±0.12
SuperLoss | 92.25±0.42, 87.55±0.72 | 87.99±0.52, 79.70±0.65 | 86.13±0.18, 75.83±0.70 | 81.33±0.18, 75.90±0.29 | 82.14±0.27, 77.05±0.25
Adaptive CL | 92.11±0.24, 85.78±1.42 | 87.79±0.18, 78.14±1.66 | 85.72±0.21, 75.72±0.57 | 81.38±0.11, 76.04±0.41 | 82.38±0.34, 77.44±0.37
(c) GPT2

Table 10. The performance of each curriculum learning method in the NLP research domain.
Method | MUTAG: Standard, Noise-0.4 | PROTEINS: Standard, Noise-0.4 | NCI1: Standard, Noise-0.4 | ogbg-molhiv: Standard, Noise-0.4
SPL | 71.58±7.14, 62.10±3.94 | 69.46±5.91, 65.54±6.48 | 68.42±1.90, 60.05±2.38 | 77.41±1.15, 60.87±3.09
TTCL | 70.52±7.14, 71.58±5.37 | 72.68±7.63, 71.61±6.62 | 70.90±2.21, 67.98±2.01 | 75.89±0.81, 72.81±1.14
MCL | 71.58±7.14, 71.58±8.55 | 70.54±5.15, 65.00±4.71 | 68.56±1.04, 54.50±2.85 | 74.10±1.40, 64.26±4.17
ScreenerNet | 72.63±3.94, 64.21±5.16 | 71.96±5.61, 67.14±4.05 | 69.78±2.22, 56.06±5.14 | 73.71±0.45, 61.00±7.79
LRE | 70.52±9.18, 61.40±4.96 | 68.03±6.17, 66.61±5.25 | 58.23±1.60, 51.22±2.38 | 73.74±1.48, 57.92±7.98
MW-Net | 74.73±2.11, 63.16±4.71 | 70.54±4.55, 66.79±4.13 | 68.71±1.78, 56.01±1.37 | 75.57±1.03, 62.81±6.19
DCL | 74.73±2.11, 61.05±13.56 | 71.96±3.46, 63.57±6.32 | 70.51±0.66, 56.69±1.58 | 75.78±1.39, 61.26±3.57
DDS | 74.74±3.94, 64.21±5.16 | 73.21±4.41, 64.11±6.50 | 71.39±1.29, 58.10±3.28 | 70.48±3.02, 57.09±4.80
DIHCL | 71.58±5.37, 68.42±7.44 | 73.03±3.59, 63.22±7.02 | 67.40±1.71, 57.86±2.04 | 70.47±2.10, 61.20±4.67
SuperLoss | 71.58±5.37, 69.47±6.14 | 72.32±3.44, 65.89±3.84 | 70.22±2.00, 57.17±3.38 | 75.97±1.03, 61.21±5.12
Adaptive CL | 73.68±3.33, 66.31±7.88 | 72.68±4.84, 65.71±4.51 | 69.88±2.17, 58.44±5.29 | 75.49±1.13, 60.95±7.96
(a) GCN

Method | MUTAG: Standard, Noise-0.4 | PROTEINS: Standard, Noise-0.4 | NCI1: Standard, Noise-0.4 | ogbg-molhiv: Standard, Noise-0.4
SPL | 64.21±11.72, 65.26±5.37 | 69.29±6.93, 67.14±3.85 | 56.49±2.61, 54.74±4.10 | 69.69±2.38, 64.88±2.73
TTCL | 69.47±6.98, 65.26±10.84 | 69.82±7.13, 64.46±1.31 | 56.79±1.40, 55.47±2.73 | 68.27±2.04, 66.73±1.84
MCL | 64.21±11.24, 68.42±8.81 | 69.64±6.29, 66.96±6.89 | 57.56±2.63, 55.23±4.73 | 69.25±3.06, 63.20±3.03
ScreenerNet | 64.21±8.42, 65.26±7.88 | 65.71±5.25, 69.11±3.77 | 54.55±2.64, 55.28±1.53 | 71.13±2.07, 65.94±2.61
LRE | 66.31±11.34, 63.16±4.71 | 66.43±1.84, 66.07±3.19 | 54.11±2.32, 52.94±2.36 | 66.59±2.45, 63.74±2.61
MW-Net | 61.84±9.40, 65.26±7.14 | 66.78±3.01, 67.14±4.42 | 57.56±2.29, 55.33±1.18 | 68.54±3.76, 62.39±2.60
DCL | 67.37±14.28, 69.47±10.21 | 68.03±7.39, 64.28±3.84 | 59.37±1.59, 55.33±1.72 | 72.64±1.16, 62.22±3.98
DDS | 66.31±7.14, 67.37±6.14 | 67.14±3.31, 66.78±9.69 | 53.24±1.81, 54.45±3.35 | 68.50±2.05, 62.22±5.88
DIHCL | 72.63±8.42, 66.32±8.55 | 65.00±6.81, 68.57±6.93 | 57.18±1.73, 55.67±4.70 | 69.07±2.79, 66.38±2.78
SuperLoss | 67.37±13.06, 68.42±7.44 | 63.93±2.63, 66.07±7.23 | 57.08±2.27, 55.13±2.39 | 70.58±1.52, 60.92±2.13
Adaptive CL | 67.37±10.21, 66.32±7.14 | 68.39±3.07, 64.47±6.37 | 57.61±2.22, 55.08±2.02 | 69.71±1.84, 62.98±2.53
(b) GAT

Method | MUTAG: Standard, Noise-0.4 | PROTEINS: Standard, Noise-0.4 | NCI1: Standard, Noise-0.4 | ogbg-molhiv: Standard, Noise-0.4
SPL | 82.10±5.37, 72.37±6.84 | 72.86±5.13, 72.86±2.37 | 77.54±1.69, 56.87±4.93 | 76.53±1.97, 63.35±2.34
TTCL | 84.21±7.44, 81.58±4.56 | 75.71±2.36, 73.93±1.82 | 80.24±1.67, 56.27±4.27 | 75.13±1.55, 62.16±3.07
MCL | 84.21±6.45, 73.69±5.27 | 75.72±3.93, 70.00±2.68 | 75.67±1.00, 57.73±4.11 | 74.20±0.48, 63.82±3.85
ScreenerNet | 82.10±7.14, 75.00±5.74 | 75.71±1.82, 68.39±4.88 | 79.61±1.09, 55.57±5.11 | 74.39±1.24, 61.07±2.33
LRE | 78.95±3.72, 80.27±2.28 | 72.68±5.46, 66.43±6.48 | 71.41±1.71, 54.08±1.72 | 73.49±2.36, 63.30±4.44
MW-Net | 88.42±2.10, 75.00±4.37 | 73.75±4.10, 66.61±8.54 | 79.22±1.21, 55.52±4.78 | 75.22±0.80, 65.43±2.70
DCL | 85.26±8.42, 76.32±4.56 | 74.11±3.14, 64.46±4.39 | 79.66±1.39, 56.06±3.79 | 75.23±2.22, 61.65±3.38
DDS | 85.26±3.94, 80.26±5.73 | 70.31±2.78, 65.89±6.69 | 77.62±3.58, 54.89±4.85 | 72.85±2.67, 63.38±3.91
DIHCL | 85.53±4.36, 73.68±3.72 | 73.75±4.54, 71.61±4.28 | 76.55±1.70, 53.33±1.28 | 72.43±1.80, 62.23±5.86
SuperLoss | 88.42±5.16, 77.63±4.37 | 77.14±4.88, 71.25±5.69 | 82.04±1.90, 62.14±6.47 | 74.51±1.47, 65.53±1.61
Adaptive CL | 86.31±4.21, 80.26±9.40 | 75.89±3.79, 70.36±3.97 | 79.32±1.90, 62.05±1.67 | 76.17±1.46, 61.81±4.81
(c) GIN

Table 11. The performance of each curriculum learning method in the graph research domain.