
Published in Transactions on Machine Learning Research (08/2024)

AdaFlood: Adaptive Flood Regularization

Wonho Bae whbae@cs.ubc.ca University of British Columbia

Yi Ren renyi.joshua@gmail.com University of British Columbia

Mohamed Osama Ahmed mohamed.o.ahmed@borealisai.com Borealis AI

Frederick Tung frederick.tung@borealisai.com Borealis AI

Danica J. Sutherland dsuth@cs.ubc.ca University of British Columbia & Amii

Gabriel L. Oliveira gabriel.oliveira@borealisai.com Borealis AI

Reviewed on OpenReview: https://openreview.net/forum?id=2s5YU6CSEz

Although neural networks are conventionally optimized towards zero training loss, it has recently been shown that targeting a non-zero training loss threshold, referred to as a flood level, often enables better test-time generalization. Current approaches, however, apply the same constant flood level to all training samples, which inherently assumes all the samples have the same difficulty. We present AdaFlood, a novel flood regularization method that adapts the flood level of each training sample according to the difficulty of the sample. Intuitively, since training samples are not equal in difficulty, the target training loss should be conditioned on the instance. Experiments on datasets covering four diverse input modalities (text, images, asynchronous event sequences, and tabular data) demonstrate the versatility of AdaFlood across data domains and noise levels.

1 Introduction

Preventing overfitting is an important problem of great practical interest in training deep neural networks, which often have the capacity to memorize entire training sets, even ones with incorrect labels (Neyshabur et al., 2015; Zhang et al., 2021). Common strategies to reduce overfitting and improve generalization performance include weight regularization (Krogh & Hertz, 1991; Tibshirani, 1996; Liu & Ye, 2010), dropout (Wager et al., 2013; Srivastava et al., 2014; Liang et al., 2021), label smoothing (Yuan et al., 2020), and data augmentation (Balestriero et al., 2022).

Although neural networks are conventionally optimized towards zero training loss, it has recently been shown that targeting a non-zero training loss threshold, referred to as a flood level, provides a surprisingly simple yet effective strategy to reduce overfitting (Ishida et al., 2020; Xie et al., 2022). The original Flood regularizer (Ishida et al., 2020) drives the mean training loss towards a constant, non-zero flood level, while the state-of-the-art iFlood regularizer (Xie et al., 2022) applies a constant, non-zero flood level to each training instance.

Training samples are, however, not uniformly difficult: some instances have more irreducible uncertainty than others (i.e. heteroskedastic noise), while some instances are simply easier to fit than others. It may not be beneficial to aggressively drive down the training loss for training samples that are outliers, noisy, or mislabeled. We explore this difference in the difficulty of training samples further in Section 3.1. To address this issue, we present Adaptive Flooding (AdaFlood), a novel flood regularizer that adapts the flood level of each training sample according to the difficulty of the sample (Section 3.2). We present theoretical support for AdaFlood in Section 3.4.

Like previous flood regularizers, AdaFlood is simple to implement and compatible with any optimizer. AdaFlood determines the appropriate flood level for each sample using an auxiliary network that is trained on a subset of the training dataset. Adaptive flood levels need to be computed for each instance only once, in a pre-processing step prior to training the main network. The results of this pre-processing step are not specific to the main network, and so can be shared across multiple hyper-parameter tuning runs. Furthermore, we propose a significantly more efficient way to train an auxiliary model based on fine-tuning, which saves substantially in memory and computation, especially for overparameterized neural networks (Sections 3.3 and 4.6).

Our experiments (Section 4) demonstrate that AdaFlood generally outperforms previous flood methods on a variety of tasks, including image and text classification, probability density estimation for asynchronous event sequences, and regression for tabular datasets. Models trained with AdaFlood are also more robust to noise (Section 4.3) and better-calibrated (Section 4.4) than those trained with other flood regularizers.

2 Related Work

Regularization techniques have been broadly explored in the machine learning community to improve the generalization ability of neural networks. Regularizers augment or modify the training objective and are typically compatible with different model architectures, base loss functions, and optimizers. They can be used to achieve diverse purposes including reducing overfitting (Hanson & Pratt, 1988; Ioffe & Szegedy, 2015; Krogh & Hertz, 1991; Liang et al., 2021; Lim et al., 2022; Srivastava et al., 2014; Szegedy et al., 2016; Verma et al., 2019; Yuan et al., 2020; Zhang et al., 2018), addressing data imbalance (Cao et al., 2019; Gong et al., 2022), and compressing models (Ding et al., 2019; Li et al., 2020; Zhuang et al., 2020).

AdaFlood is a regularization technique for reducing overfitting. Commonly adopted techniques for reducing overfitting include weight decay (Hanson & Pratt, 1988; Krogh & Hertz, 1991), dropout (Liang et al., 2021; Srivastava et al., 2014), batch normalization (Ioffe & Szegedy, 2015), label smoothing (Szegedy et al., 2016; Yuan et al., 2020), and data augmentation (Lim et al., 2022; Verma et al., 2019; Zhang et al., 2018). Inspired by work on double descent (Belkin et al., 2019; Nakkiran et al., 2021), Ishida et al. (2020) and Xie et al. (2022) proposed Flood and iFlood, respectively, to prevent the training loss from reaching zero by maintaining a small constant value. In contrast to the original flood regularizer, which encourages the overall training loss towards a constant target, iFlood drives each training sample's loss towards some constant $b$.

AdaFlood instead uses an auxiliary model trained on a heldout dataset to assign an adaptive flood level to each training sample. Using a heldout dataset to condition the training of the primary model is an effective strategy in machine learning, and is regularly seen in meta-learning (Bertinetto et al., 2019; Franceschi et al., 2018), batch or data selection (Fan et al., 2018; Mindermann et al., 2022), and neural architecture search (Liu et al., 2019; Wang et al., 2021), among other areas.

3 Adaptive Flooding

Adaptive Flooding (AdaFlood) is a general regularization method for training neural networks; it can accommodate any typical loss function and optimizer.

3.1 Problem Statement

Background Given a labeled training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathcal{X}$ are data samples and $y_i \in \mathcal{Y}$ are labels, we train a neural network $f : \mathcal{X} \to \hat{\mathcal{Y}}$ by minimizing a training loss $\ell : \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}$. In supervised learning we usually have $\ell \ge 0$, but in settings such as density estimation it may be negative.

[Figure 1 panels. (a) Dispersion of difficulty: difficulty (test loss, axis from 0.0 to 3.0) for CIFAR10 and CIFAR100 with 0%, 20%, and 40% label noise. (b) Training dynamics by difficulty: training loss against epoch (0 to 250) for the Easy, Wrong, and Hard examples under iFlood and AdaFlood.]

Figure 1: (a) Illustration of how difficulties of examples are dispersed with and without label noise (where the relevant portion of examples have their label switched to a random other label). (b) Comparison of training dynamics on some examples between iFlood and AdaFlood. The Hard example is labeled horse, but models usually predict cow; the Wrong example is incorrectly labeled in the dataset as cat (there is no rat class).

While conventional training procedures attempt to minimize the average training loss, this can lead to overfitting on training samples.

The original flood regularizer (Ishida et al., 2020) defines a global flood level for the average training loss, attempting to reduce the incentive to overfit. Denote the average training loss by $\mathcal{L}(f, \mathcal{B}) = \frac{1}{B} \sum_{i=1}^{B} \ell(y_i, f(x_i))$, where $f(x_i)$ denotes the model prediction and $\mathcal{B} = \{(x_i, y_i)\}_{i=1}^{B}$ is a mini-batch of size $B$. Instead of minimizing $\mathcal{L}$, Flood (Ishida et al., 2020) regularizes the training by minimizing

$$\mathcal{L}_{\text{Flood}}(f, \mathcal{B}, b) = |\mathcal{L}(f, \mathcal{B}) - b| + b, \tag{1}$$

where the hyperparameter $b$ is a fixed flood level. Individual Flood (iFlood) instead assigns a local flood level, trying to avoid instability observed with Flood (Xie et al., 2022):

$$\mathcal{L}_{\text{iFlood}}(f, \mathcal{B}, b) = \frac{1}{B} \sum_{i=1}^{B} \left( |\ell(y_i, f(x_i)) - b| + b \right). \tag{2}$$
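The difference between equations 1 and 2 is only where the absolute value is applied: to the mini-batch mean, or to each sample. A minimal sketch in plain Python, operating on a list of already-computed per-sample losses (the function names and the numbers are illustrative assumptions, not taken from the paper's code):

```python
def flood_loss(losses, b):
    """Flood (Eq. 1): apply the flood level b to the mini-batch mean loss."""
    mean_loss = sum(losses) / len(losses)
    return abs(mean_loss - b) + b


def iflood_loss(losses, b):
    """iFlood (Eq. 2): apply the flood level b to each per-sample loss."""
    return sum(abs(loss - b) + b for loss in losses) / len(losses)


# Hypothetical per-sample losses for a mini-batch of four examples.
batch_losses = [0.0, 0.1, 0.6, 1.3]
b = 0.5
flood_value = flood_loss(batch_losses, b)    # mean loss equals b, so this is b
iflood_value = iflood_loss(batch_losses, b)  # each sample is penalized separately
```

In this example the mean loss already equals the flood level, so Flood sits at its minimum value $b$, while iFlood still penalizes every sample whose individual loss differs from $b$.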

Motivation Training samples are, however, not uniformly difficult: some are inherently easier to fit than others. Figure 1a shows the dispersion of difficulty on CIFAR10 and CIFAR100 with various levels of added label noise, as measured by the heldout cross-entropy loss from cross-validated models. Although difficulties on CIFAR10 without noise are concentrated around difficulty 0.5, they spread out widely as the noise increases. CIFAR100 has a wide spread in difficulty, even without noise. A constant flood level as used in iFlood may be reasonable for un-noised CIFAR10, but it seems less appropriate for CIFAR100 or noisy-label cases.

Moreover, it may not be beneficial to aggressively drive down the training loss for training samples that are outliers, noisy, or mislabeled. In Figure 1b, we show training dynamics on an easy, a wrong, and a hard example from the training set of CIFAR10. With iFlood, each example's loss converges to the pre-determined flood level (0.03); with AdaFlood, the easy example converges towards zero loss, while the wrong and hard examples maintain higher loss.

3.2 Proposed Method: Ada Flood

Figure 2: AdaFlood for settings where training data is limited and acquiring additional data is impractical. In the first stage, we partition the training set into two halves and train two auxiliary networks $f^{\text{aux},1}$ and $f^{\text{aux},2}$: one on each half. In the second stage, we use each auxiliary network to set the adaptive flood level of training samples from the half it has not seen, via equation 4. The main network $f$ is then trained on the entire training set, minimizing the AdaFlood-regularized loss, equation 3. Note that the flood levels are fixed over the course of training $f$ and need to be pre-computed once only.

Many advances in efficient neural network training and inference, such as batch or data selection (Coleman et al., 2020; Fan et al., 2018; Mindermann et al., 2022) and dynamic neural networks (Li et al., 2021; Verelst & Tuytelaars, 2020), stem from efforts to address the differences in per-sample difficulty. AdaFlood connects this observation to flooding. Intuitively, easy training samples (e.g. a correctly-labeled image of a cat in a typical pose) can be driven more aggressively to zero training loss without overfitting the model, while doing so for noisy, outlier, or incorrectly-labeled training samples may cause overfitting. These types of data points behave differently during training (Ren et al., 2022), and so should probably not be treated the same. AdaFlood differentiates training samples by setting a sample-specific flood level $\theta = \{\theta_i\}_{i=1}^{B}$ in its objective:

$$\mathcal{L}_{\text{AdaFlood}}(f, \mathcal{B}, \theta) = \frac{1}{B} \sum_{i=1}^{B} \left( |\ell(y_i, f(x_i)) - \theta_i| + \theta_i \right). \tag{3}$$

Here the sample-specific parameters $\theta_i$ should be set according to the individual sample's difficulty. AdaFlood estimates this quantity according to

$$\theta_i = \ell\left(y_i, \phi_\gamma(f^{\text{aux},i}(x_i), y_i)\right), \tag{4}$$

where $f^{\text{aux},i}$ is an auxiliary model trained with cross-validation such that $x_i$ is in its heldout set, and $\phi_\gamma(\cdot)$ is a correction function explained in a moment. Figure 2 illustrates the training process using equation 3, Section 3.5 gives further motivation, and Section 3.4 provides further theoretical support.

The flood targets $\theta_i$ are fixed over the course of training the main network $f$, and can be pre-computed for each training sample prior to the first epoch of training $f$. We typically use five-fold cross-validation as a reasonable trade-off between computational expense and good-enough models to estimate $\theta_i$, but see further discussion in Section 3.3. The cost of this pre-processing step can be further amortized over many training runs of the main network $f$, since different variations and configurations of $f$ can reuse the adaptive flood levels.
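Because the flood levels are fixed inputs, the AdaFlood objective is a small change to the per-sample loss reduction. The sketch below evaluates equation 3 on a list of per-sample losses and their pre-computed flood levels; the function name and the example numbers are assumptions made for illustration.

```python
def adaflood_loss(losses, thetas):
    """AdaFlood (Eq. 3): one fixed, pre-computed flood level per sample."""
    if len(losses) != len(thetas):
        raise ValueError("need exactly one flood level per sample")
    total = sum(abs(loss - theta) + theta for loss, theta in zip(losses, thetas))
    return total / len(losses)


# Hypothetical values for an easy, a hard, and a mislabeled example.
losses = [0.02, 0.70, 2.10]
thetas = [0.00, 0.60, 2.30]
value = adaflood_loss(losses, thetas)

# With all flood levels at zero and nonnegative losses, Eq. 3 is the mean loss.
plain_mean = adaflood_loss(losses, [0.0, 0.0, 0.0])
```

For the third example the current loss (2.10) is below its flood level (2.30), so the absolute value flips the sign of its gradient contribution and the objective pushes that sample's loss up towards 2.30 rather than down to zero.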

Correction function. Unfortunately, the predictions from auxiliary models are not always correct, even when trained on most of the training set; if they were, our model would already be perfect. In particular, the adaptive flood levels $\theta_i$ can be arbitrarily large for any difficult examples where the auxiliary model is incorrect; this could lead to strange behavior when we encourage the primary model $f$ to be very incorrect. We thus correct the predictions with the correction function $\phi_\gamma$, which mixes between the dataset's label and the heldout model's signal.

For regression tasks, the predictions $f(x_i) \in \mathbb{R}$ should simply be close to the labels $y_i \in \mathbb{R}$. The correction function linearly interpolates the predictions and labels as

$$\phi_\gamma(f^{\text{aux}}(x_i), y_i) = (1 - \gamma) f^{\text{aux}}(x_i) + \gamma y_i. \tag{5}$$

Here $\gamma = 0$ fully trusts the auxiliary models (no "correction"), while $\gamma = 1$ disables flooding.

Algorithm 1 Training of Auxiliary Network(s) and AdaFlood

1: Train a single auxiliary network $f^{\text{aux}}$ on the entire training set $\mathcal{D}$ (fine-tuning method only)
2: for $\mathcal{D}^{\text{aux},i}$ in $\{\mathcal{D}^{\text{aux},i}\}_{i=1}^{n}$ do
3: Train $f^{\text{aux},i}$, either from scratch or by fine-tuning $f^{\text{aux}}$, on $\mathcal{D} \setminus \mathcal{D}^{\text{aux},i}$
4: Save the adaptive flood level $\theta_i$ for each $x_i \in \mathcal{D}^{\text{aux},i}$ using $f^{\text{aux},i}$
5: end for
6: Train the main model $f$ using Equation (3) and adaptive flood levels $\theta$ computed above
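The from-scratch variant of Algorithm 1 (steps 2 to 5), combined with the regression correction of equation 5, can be sketched in plain Python. This is an illustration under loud assumptions: a closed-form least-squares line stands in for the auxiliary network, the loss is squared error, and every name and number below is hypothetical rather than taken from the authors' implementation.

```python
import random


def fit_line(points):
    """Least-squares fit y ~ a*x + c; a stand-in for an auxiliary network."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx if sxx > 0 else 0.0
    return lambda x: a * x + (my - a * mx)


def squared_error(y, pred):
    return (y - pred) ** 2


def correct(pred, y, gamma):
    """Eq. 5: interpolate between the auxiliary prediction and the label."""
    return (1.0 - gamma) * pred + gamma * y


def flood_levels(data, n_folds, gamma, seed=0):
    """Hold out each fold once and score it with a model trained on the rest."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    thetas = [None] * len(data)
    for held_out in folds:
        held = set(held_out)
        f_aux = fit_line([data[j] for j in order if j not in held])
        for j in held_out:
            x, y = data[j]
            thetas[j] = squared_error(y, correct(f_aux(x), y, gamma))  # Eq. 4
    return thetas


# Toy data: y = 2x, except for one corrupted label at index 5.
data = [(float(x), 2.0 * x) for x in range(10)]
data[5] = (5.0, 30.0)
thetas = flood_levels(data, n_folds=5, gamma=0.3)
```

The corrupted point receives by far the largest flood level, because the model that scores it never saw its label. Setting `gamma=1.0` returns all zeros, which recovers ordinary training on a nonnegative loss.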

For $K$-way classification tasks, $f(x_i) \in \mathbb{R}^K$ is a vector of output probabilities (following a softmax layer) and the label is $y_i \in [0, 1]^K$, usually considered as a one-hot vector. Cross-entropy loss is then computed as $\ell(y_i, f(x_i)) = -\sum_{k=1}^{K} y_{i,k} \log f(x_i)_k$. Similar to the regression tasks, we define the correction function $\phi_\gamma(f^{\text{aux}}(x_i), y_i)$ for classification tasks as a linear interpolation between the predictions and labels:

$$\phi_\gamma(f^{\text{aux},i}(x_i), y_i) = (1 - \gamma) f^{\text{aux},i}(x_i) + \gamma y_i. \tag{6}$$

Again, for $\gamma = 0$ there is no correction, and for $\gamma = 1$ flooding is disabled, as $\theta_i = -\sum_{k=1}^{K} y_{i,k} \log y_{i,k}$ is the lower bound of the cross-entropy loss.
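To make the classification case concrete, the following sketch computes the flood level of equation 4 with the correction of equation 6 for a single sample. The probabilities are invented for illustration and the helper names are assumptions.

```python
import math


def cross_entropy(y, p, eps=1e-12):
    """l(y, p) = -sum_k y_k log p_k; terms with y_k = 0 contribute nothing."""
    return -sum(yk * math.log(max(pk, eps)) for yk, pk in zip(y, p) if yk > 0)


def correct(p_aux, y, gamma):
    """Eq. 6: mix the auxiliary probabilities with the label vector."""
    return [(1.0 - gamma) * pk + gamma * yk for pk, yk in zip(p_aux, y)]


def flood_level(p_aux, y, gamma):
    """Eq. 4 for classification: loss of the corrected held-out prediction."""
    return cross_entropy(y, correct(p_aux, y, gamma))


y = [0.0, 1.0, 0.0]      # one-hot label (class 1)
p_aux = [0.7, 0.2, 0.1]  # hypothetical held-out prediction favoring class 0

theta_raw = flood_level(p_aux, y, gamma=0.0)  # -log 0.2, no correction
theta_mid = flood_level(p_aux, y, gamma=0.5)  # -log 0.6
theta_off = flood_level(p_aux, y, gamma=1.0)  # -log 1 = 0, flooding disabled
```

With a one-hot label, the corrected probability of the labeled class is at least $\gamma$, so the flood level is bounded above by $-\log \gamma$ whenever $\gamma > 0$; this is how the correction keeps $\theta_i$ from becoming arbitrarily large.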

The hyperparameter $\gamma \in [0, 1]$ is perhaps simpler to interpret and search for than directly identifying a flood level as in Flood or iFlood; in those cases, the level is unbounded (in $[0, \infty)$ for supervised tasks and all of $\mathbb{R}$ for density estimation) and the choice is quite sensitive to the particular task.

3.3 Efficiently Training Auxiliary Networks

Although the losses from auxiliary networks can often be good measures for the difficulties of samples, this is only true when the number of folds $n$ is reasonably large; otherwise the training set of size about $\frac{n-1}{n} |\mathcal{D}|$ may be too much smaller than $\mathcal{D}$ for the model to have comparable performance. The computational cost scales roughly linearly with $n$, however, since we must train $n$ auxiliary networks: if we do this in parallel it requires $n$ times the computational resources, or if we do it sequentially it takes $n$ times as long as training a single model.

To alleviate the computational overhead for training auxiliary networks, we sometimes instead approximate the process by fine-tuning a single auxiliary network. More specifically, we first train a single base model $f^{\text{aux}}$ on the entire training set $\mathcal{D}$. We then train each of the $n$ auxiliary models by randomly re-initializing the last few layers, then re-training with the relevant fold held out. The overall process is given in Algorithm 1, and the $n = 2$ case is illustrated in Figure 3.

Although this means that $x_i$ does slightly influence the final prediction $f^{\text{aux},i}(x_i)$ ("training on the test set"), it is worth remembering that we use $\theta_i$ only as a parameter in our model, not to evaluate its performance: $x_i$ is in fact a training data point for the overall model $f$ being trained. This procedure is justified by recent understanding in the field that, in typical settings, a single data point only loosely influences the early layers of a network. In highly over-parameterized settings (the "kernel regime") where neural tangent kernel theory is a good approximation to the training of $f^{\text{aux}}$ (Jacot et al., 2018), re-initializing the last layer would completely remove the effect of $x_i$ on the model. Even in more realistic settings, although the mechanism is not yet fully understood, last layer re-training seems to do an excellent job at retaining core features and removing spurious ones that are more specific to individual data points (Kirichenko et al., 2023; LaBonte et al., 2023).

For smaller models with fewer than a million parameters, we use 2- or 5-fold cross-validation, since training multiple auxiliary models is not much of a computational burden. For larger models such as ResNet18, however, we use the fine-tuning method. This substantially reduces training time, since each fine-tuning gradient step is less expensive and the models converge much faster given strong features from lower levels than they do starting from scratch; Section 4.6 gives a comparison.

Figure 3: Efficient fine-tuning method for training an auxiliary network with $n = 2$ held-out splits. First, a single model $f^{\text{aux}}$ is trained on the entire training set $\mathcal{D}$. Then, the last few layers of each of the $n$ auxiliary models are randomly re-initialized and re-trained with the relevant fold held out.

To validate the quality of the flood levels from the fine-tuned auxiliary network, we compare them to the flood levels from $n = 50$ auxiliary models using ResNet18 (He et al., 2016) on CIFAR10 (Krizhevsky et al., 2009); with $n = 50$, each model is being trained on 98% of the full dataset, and thus should be a good approximation to the best that this kind of method can achieve. The Spearman rank correlation between the flood levels $\theta_i$ from the fine-tuned method and the full cross-validation is 0.63, a healthy indication that this method provides substantial signal for the correct $\theta_i$. Our experimental results also reinforce that this procedure chooses a reasonable set of parameters.
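The Spearman rank correlation used for this comparison is the Pearson correlation of the two rank vectors, so it only measures whether the two procedures order the samples similarly. A self-contained sketch follows; the two short lists of flood levels are made up for illustration and are not the paper's data.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    return pearson(ranks(a), ranks(b))


# Hypothetical flood levels for five samples from the two procedures.
theta_finetuned = [0.05, 0.40, 0.10, 2.00, 0.90]
theta_full_cv = [0.02, 0.35, 0.30, 1.50, 1.70]
rho = spearman(theta_finetuned, theta_full_cv)
```

A rank-based measure is a natural fit here because the absolute scale of the flood levels can still be adjusted afterwards through $\gamma$, whereas the relative ordering of easy and hard samples comes only from the auxiliary models.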

3.4 Theoretical Intuition

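For a deeper understanding of AdaFlood's advantages, we now examine a somewhat stylized supervised learning setting: an overparameterized regime where the $\theta_i$ are nonetheless optimal.

Proposition 1. Let $\mathcal{F}$ be a set of candidate models, and suppose there exists an optimal model $f_{\text{opt}} \in \arg\min_{f \in \mathcal{F}} \mathbb{E}_{x,y}\, \ell(y, f(x))$, where $\ell$ is a nonnegative loss function. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, let $f_{\text{emp}}$ denote a minimizer of the empirical loss $\mathcal{L}(f, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i))$; suppose that, as in an overparameterized setting, $\mathcal{L}(f_{\text{emp}}, \mathcal{D}) = 0$. Also, let $f_{\text{ada}}$ be a minimizer of the AdaFlood loss equation 3 using "perfect" flood levels $\theta = \{\theta_i\}_{i=1}^{N}$ where $\theta_i = \ell(y_i, f_{\text{opt}}(x_i))$. Then we have that

$$\mathcal{L}(f_{\text{emp}}, \mathcal{D}) = 0 \le \mathcal{L}(f_{\text{opt}}, \mathcal{D}) = \mathcal{L}(f_{\text{ada}}, \mathcal{D}). \tag{7}$$

Furthermore, we have that

$$\mathcal{L}_{\text{AdaFlood}}(f_{\text{emp}}, \mathcal{D}, \theta) = 2\mathcal{L}(f_{\text{opt}}, \mathcal{D}) \ge \mathcal{L}(f_{\text{opt}}, \mathcal{D}) = \mathcal{L}_{\text{AdaFlood}}(f_{\text{opt}}, \mathcal{D}, \theta) = \mathcal{L}_{\text{AdaFlood}}(f_{\text{ada}}, \mathcal{D}, \theta). \tag{8}$$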

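We know that $\mathcal{L}(f_{\text{opt}}, \mathcal{D})$ will be approximately the Bayes risk, the irreducible distributional error achieved by $f_{\text{opt}}$; this holds for instance by the law of large numbers, since $f_{\text{opt}}$ is independent of the random sample $\mathcal{D}$. Thus, if the Bayes risk is nonzero and the $\theta_i$ are optimal, we can see that empirical risk minimization of overparameterized models will find $f_{\text{emp}}$, and disallow $f_{\text{opt}}$; minimizing $\mathcal{L}_{\text{AdaFlood}}$, on the other hand, will allow the solution $f_{\text{opt}}$ and disallow the empirical risk minimizer $f_{\text{emp}}$.

Proof. With this choice of $\theta_i$, we have that

$$\mathcal{L}_{\text{AdaFlood}}(f, \mathcal{D}, \theta) = \frac{1}{N} \sum_{i=1}^{N} \left( |\ell(y_i, f(x_i)) - \ell(y_i, f_{\text{opt}}(x_i))| + \ell(y_i, f_{\text{opt}}(x_i)) \right).$$

Since $|\cdot|$ is nonnegative, we have $\mathcal{L}_{\text{AdaFlood}}(f, \mathcal{D}, \theta) \ge \mathcal{L}(f_{\text{opt}}, \mathcal{D})$ for any $f$, and $\mathcal{L}_{\text{AdaFlood}}(f_{\text{opt}}, \mathcal{D}, \theta) = \mathcal{L}(f_{\text{opt}}, \mathcal{D})$; this establishes that $f_{\text{opt}}$ minimizes $\mathcal{L}_{\text{AdaFlood}}$, and that any minimizer $f_{\text{ada}}$ must achieve $\ell(y_i, f_{\text{ada}}(x_i)) = \theta_i$ for each $i$, so $\mathcal{L}(f_{\text{ada}}, \mathcal{D}) = \mathcal{L}(f_{\text{opt}}, \mathcal{D})$. Using that $\ell(y_i, f_{\text{emp}}(x_i)) = 0$ for each $i$, as is necessary for $\ell \ge 0$ when $\mathcal{L}(f_{\text{emp}}, \mathcal{D}) = 0$, shows $\mathcal{L}_{\text{AdaFlood}}(f_{\text{emp}}, \mathcal{D}, \theta) = \frac{1}{N} \sum_{i=1}^{N} 2\theta_i = 2\mathcal{L}(f_{\text{opt}}, \mathcal{D})$.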
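Proposition 1 can be checked numerically on a handful of per-sample losses. In the sketch below the losses of $f_{\text{opt}}$ are invented, $f_{\text{emp}}$ is represented only by its all-zero training losses, and the function names are assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def adaflood_loss(losses, thetas):
    """Eq. 3 evaluated on the full dataset."""
    return mean([abs(loss - theta) + theta for loss, theta in zip(losses, thetas)])


# Hypothetical per-sample losses of f_opt on D (nonzero irreducible error).
opt_losses = [0.3, 0.0, 1.2, 0.5]
thetas = list(opt_losses)              # "perfect" flood levels
emp_losses = [0.0] * len(opt_losses)   # interpolating model f_emp

risk_opt = mean(opt_losses)                     # L(f_opt, D)
ada_emp = adaflood_loss(emp_losses, thetas)     # equals 2 * L(f_opt, D)
ada_opt = adaflood_loss(opt_losses, thetas)     # equals L(f_opt, D)

# Any other loss profile cannot go below L(f_opt, D) under the AdaFlood loss.
ada_other = adaflood_loss([0.1, 0.4, 0.9, 0.5], thetas)
```

The empirical risk minimizer is optimal for the plain loss but pays twice the risk of $f_{\text{opt}}$ under the AdaFlood loss, matching equations 7 and 8.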

[Figure 4 panels: t-SNE of the toy data (left); loss of training samples (middle) and loss of valid samples (right), each with a horizontal axis from 0 to 200 and loss from 0.0 to 1.0, for Regular, Irregular, and Wrong Label samples.]

Figure 4: Left: the t-SNE (Van der Maaten & Hinton, 2008) of the toy Gaussian example; middle: loss of different samples in the training set; right: loss of different samples in the validation set.

In settings where $\theta_i$ is not perfect (and we would not expect the auxiliary models to obtain perfect estimates of the loss) the comparison will still approximately hold. If $\theta_i$ consistently overestimates the $f_{\text{opt}}$ loss, $f_{\text{opt}}$ will still be preferred to $f_{\text{emp}}$: for instance, if $\theta_i = 2\ell(y_i, f_{\text{opt}}(x_i))$, then $\mathcal{L}_{\text{AdaFlood}}(f_{\text{emp}}, \mathcal{D}, \theta) = 4\mathcal{L}(f_{\text{opt}}, \mathcal{D}) \ge 3\mathcal{L}(f_{\text{opt}}, \mathcal{D}) = \mathcal{L}_{\text{AdaFlood}}(f_{\text{opt}}, \mathcal{D}, \theta)$. On the other hand, if $\theta_i = \frac{1}{2}\ell(y_i, f_{\text{opt}}(x_i))$ (a not-unreasonable situation when using a correction function), then $\mathcal{L}_{\text{AdaFlood}}(f_{\text{emp}}, \mathcal{D}, \theta) = \mathcal{L}(f_{\text{opt}}, \mathcal{D}) = \mathcal{L}_{\text{AdaFlood}}(f_{\text{opt}}, \mathcal{D}, \theta)$. When $\theta_i$ is random, the situation is more complex, but we can expect that noisy $\theta_i$ which somewhat overestimate the loss of $f_{\text{opt}}$ will still prefer $f_{\text{opt}}$ to $f_{\text{emp}}$.
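Both special cases follow from one short calculation. As a worked sketch, assume every flood level is off by the same factor, $\theta_i = c\,\ell(y_i, f_{\text{opt}}(x_i))$ with $c \ge 0$ (the constant $c$ is notation introduced here for illustration, not from the paper):

```latex
% Assumption: \theta_i = c\,\ell(y_i, f_{\mathrm{opt}}(x_i)), with c \ge 0 and \ell \ge 0.
\begin{align*}
\mathcal{L}_{\mathrm{AdaFlood}}(f_{\mathrm{emp}}, \mathcal{D}, \theta)
  &= \frac{1}{N}\sum_{i=1}^{N} \bigl(|0 - \theta_i| + \theta_i\bigr)
   = 2c\,\mathcal{L}(f_{\mathrm{opt}}, \mathcal{D}), \\
\mathcal{L}_{\mathrm{AdaFlood}}(f_{\mathrm{opt}}, \mathcal{D}, \theta)
  &= \frac{1}{N}\sum_{i=1}^{N} \bigl(|1 - c| + c\bigr)\,\ell(y_i, f_{\mathrm{opt}}(x_i))
   = \bigl(|1 - c| + c\bigr)\,\mathcal{L}(f_{\mathrm{opt}}, \mathcal{D}).
\end{align*}
% c = 2:   4 L(f_opt, D)  versus  3 L(f_opt, D)
% c = 1/2: L(f_opt, D)    versus  L(f_opt, D)
% In general, 2c \ge |1 - c| + c holds exactly when c \ge 1/2.
```

Under this uniform-scaling assumption, $f_{\text{opt}}$ has an AdaFlood loss no larger than that of $f_{\text{emp}}$ exactly when $c \ge \frac{1}{2}$, which is why overestimated flood levels are the safer direction of error.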

3.5 Discussion: Why We Calculate θ Using Held-out Data

In Section 3.2, we estimate $\theta_i$ for each training sample using the output of an auxiliary network $f^{\text{aux}}(x_i)$ that is trained on a held-out dataset. In fact, this adaptive flood level $\theta_i$ can be considered as the sample difficulty when training the main network. Hence, it is reasonable to consider existing difficulty measurements based on learning dynamics, like C-score (Jiang et al., 2021) or forgetting score (Maini et al., 2022). However, we find these methods are not robust when wrong labels exist in the training data, because the network will learn to remember the wrong label of $x_i$, and hence provide a low $\theta_i$ for the wrong sample, which is harmful to our method. That is why we propose to split the whole training set into $n$ parts and train $f^{\text{aux}}$ $n$ times (each time on a different $n - 1$ parts).

Dataset and implementation To verify this, we conduct experiments on a toy Gaussian dataset, as illustrated in the first panel in Figure 4. Assume we have $N$ samples, each a 2-tuple $(x, y)$. To draw a sample, we first select the label $y = k$ following a uniform distribution over all $K$ classes. After that, we sample the input signal $x \mid (y = k) \sim \mathcal{N}(\mu_k, \sigma^2 I)$, where $\sigma$ is the noise level of the sample and $\mu_k$ is the mean vector for all the samples in class $k$. Each $\mu_k$ is a 10-dimensional vector, in which each dimension is randomly selected from $\{-\delta_\mu, 0, \delta_\mu\}$. Such a process is similar to selecting 10 different features for each class. We consider 3 types of samples for each class: regular samples, the typical or easy samples in our training set, have a small $\sigma$; irregular samples have a larger $\sigma$; mislabeled samples have a small $\sigma$, but with a flipped label. We generate two datasets following this same procedure (call them datasets A and B). Then, we randomly initialize a 2-layer MLP with ReLU layers and train it on dataset A. At the end of every epoch, we record the loss of each sample in dataset A.
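The sampling procedure above can be sketched with the standard library. The text does not give the number of classes, the two noise levels, $\delta_\mu$, or the proportions of irregular and mislabeled samples, so every numeric default below is a placeholder assumption, as are the function names.

```python
import random


def make_class_means(num_classes, dim, delta, rng):
    """Each coordinate of each class mean is drawn from {-delta, 0, +delta}."""
    return [[rng.choice([-delta, 0.0, delta]) for _ in range(dim)]
            for _ in range(num_classes)]


def draw_dataset(means, n, rng, sigma_small=0.1, sigma_large=1.0,
                 p_irregular=0.2, p_mislabeled=0.1):
    """Draw n samples as tuples (x, y, kind, true_class)."""
    num_classes = len(means)
    samples = []
    for _ in range(n):
        k = rng.randrange(num_classes)  # uniform class choice
        u = rng.random()
        if u < p_mislabeled:
            kind, sigma = "mislabeled", sigma_small
        elif u < p_mislabeled + p_irregular:
            kind, sigma = "irregular", sigma_large
        else:
            kind, sigma = "regular", sigma_small
        x = [rng.gauss(m, sigma) for m in means[k]]  # x | y=k ~ N(mu_k, sigma^2 I)
        y = k
        if kind == "mislabeled":
            # Flip the recorded label to a different class.
            y = rng.choice([c for c in range(num_classes) if c != k])
        samples.append((x, y, kind, k))
    return samples


rng = random.Random(0)
means = make_class_means(num_classes=3, dim=10, delta=1.0, rng=rng)
dataset_a = draw_dataset(means, 200, rng)  # used for training
dataset_b = draw_dataset(means, 200, rng)  # same distribution, different draws
```

Both datasets share the same class means, so they come from the same distribution while containing different samples, which is the property the held-out comparison relies on.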

Result The learning paths are illustrated in the second panel in Figure 4. The model is clearly able to remember all the wrong labels, as all the curves converge to a small value. If we calculate $\theta_i$ in this way, all $\theta_i$ would have similar values. However, if we instead train the model using dataset B, which comes from the same distribution but is different from dataset A, the learning curves of samples in dataset A will behave like the last panel in Figure 4. The mislabeled and some irregular samples can be clearly identified from the figure. Calculating $\theta_i$ in this way gives different samples more distinct flood values, which makes our method more robust to sample noise, as our experiments on various scenarios show.

| NTPP | Method | Uber RMSE | Uber NLL | Reddit RMSE | Reddit NLL | Reddit ACC | Stack Overflow RMSE | Stack Overflow NLL | Stack Overflow ACC |
|---|---|---|---|---|---|---|---|---|---|
| Intensity-free | Unreg. | 75.83 (6.12) | 3.86 (0.05) | 0.25 (0.01) | 1.28 (0.07) | 55.26 (0.57) | 6.69 (0.98) | 3.66 (0.12) | 45.52 (0.07) |
| Intensity-free | Flood | 64.34 (3.85) | 4.01 (0.02) | 0.25 (0.01) | 1.17 (0.06) | 57.46 (0.84) | 4.12 (0.23) | 3.46 (0.03) | 45.76 (0.03) |
| Intensity-free | iFlood | 67.07 (3.12) | 3.97 (0.06) | 0.23 (0.01) | 1.11 (0.12) | 56.59 (0.92) | 4.12 (0.23) | 3.46 (0.03) | 45.76 (0.03) |
| Intensity-free | AdaFlood | 59.69 (1.49) | 3.75 (0.01) | 0.26 (0.02) | 1.09 (0.13) | 59.02 (0.91) | 3.26 (0.25) | 3.45 (0.04) | 45.67 (0.03) |
| THP+ | Unreg. | 71.01 (6.12) | 3.73 (0.05) | 0.28 (0.01) | 0.82 (0.07) | 58.63 (0.57) | 1.46 (0.98) | 2.82 (0.12) | 46.24 (0.07) |
| THP+ | Flood | 68.61 (3.85) | 3.70 (0.02) | 0.26 (0.01) | 1.02 (0.06) | 58.05 (0.84) | 1.39 (0.23) | 2.79 (0.03) | 46.31 (0.03) |
| THP+ | iFlood | 68.61 (4.76) | 3.70 (0.17) | 0.25 (0.01) | 0.92 (0.23) | 58.93 (1.26) | 1.46 (0.06) | 2.82 (0.04) | 46.24 (0.08) |
| THP+ | AdaFlood | 54.85 (1.49) | 3.55 (0.01) | 0.25 (0.02) | 0.80 (0.13) | 61.34 (0.91) | 1.38 (0.25) | 2.77 (0.04) | 46.41 (0.03) |

Table 1: Comparison of flooding methods on asynchronous event sequence datasets. The numbers are the means and standard errors (in parentheses) over three runs.

4 Experiments

We demonstrate the effectiveness of AdaFlood on three tasks (probability density estimation, classification and regression) in four domains (asynchronous event sequences, image, text and tabular). We compare flooding methods on asynchronous event sequences in Section 4.1 and image classification tasks in Section 4.2. We also demonstrate that AdaFlood is more robust to various noisy settings in Section 4.3, and that it yields better-calibrated models in Section 4.4. Some ablation studies are provided in Sections 4.5 and 4.6.

4.1 Results on Asynchronous Event Sequences

We compare flooding methods on asynchronous event sequence datasets, where the goal is to estimate the probability distribution of the next event time given the previous event times. Each event may have a class label. Asynchronous event sequences are often modeled as temporal point processes (TPPs).

Datasets We use two popular benchmark datasets, Stack Overflow (predicting the times at which users receive badges) and Reddit (predicting posting times). Following Bae et al. (2023), we also benchmark our method on a dataset with stronger periodic patterns: Uber (predicting pick-up times). We split each training dataset into train (80%) and validation (20%) sets. Details are provided in Appendix A.

Following the literature in TPPs, we use two metrics to evaluate models: root mean squared error (RMSE) and negative log-likelihood (NLL). As NLL can be misleadingly low if the probability density is mostly focused on the correct event time, RMSE is also considered a complementary metric. However, RMSE has its own limitation: if a baseline is directly trained on the ground truth event times as point estimation, the stochastic components of TPPs are ignored. Therefore, we train our TPP models on NLL and use RMSE at test time to ensure that we do not rely too heavily on RMSE scores and account for the stochastic nature of TPPs. When class labels for events are available, we also report the accuracy of class predictions.

Implementation For TPP models to predict the asynchronous event times, we employ Intensity-free models (Shchur et al., 2020) based on GRU (Chung et al., 2014), and Transformer Hawkes Processes (THP) (Zuo et al., 2020) based on Transformer (Vaswani et al., 2017). THP predicts intensities to compute log-likelihood and expected event times, but this approach can be computationally expensive due to the need to compute integrals, particularly double integrals to calculate the expected event times. To overcome this challenge while maintaining performance, we follow Bae et al. (2023) in using a mixture of log-normal distributions, proposed in Shchur et al. (2020), for the decoder; we call this THP+.

| Method | SVHN w/o L2 reg. | SVHN w/ L2 reg. | CIFAR10 w/o L2 reg. | CIFAR10 w/ L2 reg. | CIFAR100 w/o L2 reg. | CIFAR100 w/ L2 reg. |
|---|---|---|---|---|---|---|
| Unreg. | 95.65 ± 0.05 | 96.07 ± 0.01 | 87.80 ± 0.31 | 90.35 ± 0.21 | 56.59 ± 0.32 | 61.49 ± 0.16 |
| Flood | 95.63 ± 0.02 | 96.13 ± 0.02 | 87.57 ± 0.16 | 90.09 ± 0.20 | 55.88 ± 0.18 | 60.96 ± 0.03 |
| iFlood | 95.63 ± 0.08 | 96.05 ± 0.02 | 87.96 ± 0.07 | 90.57 ± 0.12 | 56.32 ± 0.05 | 61.63 ± 0.12 |
| KD | 95.69 ± 0.02 | 96.08 ± 0.10 | 88.06 ± 0.23 | 90.65 ± 0.03 | 56.67 ± 0.15 | 61.29 ± 0.03 |
| AdaFlood | 95.72 ± 0.01 | 96.16 ± 0.02 | 88.38 ± 0.18 | 90.82 ± 0.08 | 57.25 ± 0.14 | 62.31 ± 0.14 |

Table 2: Comparison of flooding methods on image classification datasets with and without L2 regularization. The numbers are the means and standard errors over three runs.

For each dataset, we conduct hyper-parameter tuning for learning rate and the weight for L2 regularization with the unregularized baseline (we still apply early stopping and L2 regularization by default). Once learning rate and weight decay parameters are fixed, we search for the optimal flood levels. The optimal flood levels are selected via a grid search on $\{-50, -45, -40, \dots, 0, 5\} \cup \{-4, -3, \dots, 3, 4\}$ for Flood and iFlood, and optimal $\gamma$ on $\{0.0, 0.1, \dots, 0.9\}$ for AdaFlood using the validation set. We use five auxiliary models.

Results We present the results of our experiments in Table 1 (showing means and standard errors from three runs). To our knowledge, this is the first time flooding methods have been applied in this domain; we see that all flooding methods improve the generalization performance here, sometimes substantially. Furthermore, AdaFlood often outperforms other flooding methods on various datasets, suggesting that instance-wise flooding level adaptation using auxiliary models can effectively enhance the generalization capabilities of TPP models. However, there are instances where AdaFlood's performance is comparable to or slightly worse than other methods, indicating that its effectiveness may vary depending on the specific context. Despite this variability, AdaFlood generally appears to be the best choice for training TPP models.

4.2 Results on Image Classification

Datasets We use SVHN (Netzer et al., 2011), CIFAR10, and CIFAR100 (Krizhevsky et al., 2009) for image classification with random crop and horizontal flip as augmentation. Unlike Xie et al. (2022), we split each training dataset into train (80%) and validation (20%) sets for hyperparameter search; thus our numbers are generally somewhat worse than what they reported, as we do not directly tune on the test set.

Implementation Following Ishida et al. (2020) and similar to Xie et al. (2022), we consider training ResNet18 (He et al., 2016) with and without L2 regularization (with a weight of $10^{-4}$). All methods are trained with SGD for 300 epochs, with early stopping. We use a multi-step learning rate scheduler with an initial learning rate of 0.1 and decay coefficient of 0.2, applied every 60 epochs. The optimal flood levels are selected based on validation performance with a grid search on $\{0.01, 0.02, \dots, 0.1, 0.15, 0.2, \dots, 1.0\}$ for Flood and iFlood, and $\{0.05, 0.1, \dots, 0.95\}$ for AdaFlood. We use a single ResNet18 auxiliary network whose layers 3 and 4 are randomly re-initialized and fine-tuned on held-out sets with $n = 10$ splits.
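The learning rate schedule described above is simple enough to write out. The sketch below assumes zero-indexed epochs with the decay first taking effect at epoch 60; the text does not specify the indexing, so that detail and the function name are assumptions.

```python
def multistep_lr(epoch, base_lr=0.1, decay=0.2, step=60):
    """Learning rate after multiplying by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Learning rates at a few epochs of the 300-epoch run.
schedule = {epoch: multistep_lr(epoch) for epoch in (0, 59, 60, 120, 299)}
```

Over 300 epochs this applies the decay four times, so the final learning rate is $0.1 \times 0.2^4 = 1.6 \times 10^{-4}$.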

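Furthermore, we compare with knowledge distillation (KD) baselines, following Hinton et al. (2014) for the implementation of the loss. For a mini-batch $\mathcal{B}$, the KD loss is defined as:

$$\mathcal{L}_{\text{KD}}(f_s, f_t, \mathcal{B}, \tau, \alpha) = \alpha \mathcal{L}_{\text{CE}}(f_s, \mathcal{B}) + (1 - \alpha) \mathcal{L}_{\text{Distill}}(f_s, f_t, \mathcal{B}, \tau), \tag{9}$$

where $\tau$ denotes a temperature scale, which is an additional input to a student model $f_s$ and teacher model $f_t$. Also, $\mathcal{L}_{\text{CE}}$ and $\mathcal{L}_{\text{Distill}}$ are defined as:

$$\mathcal{L}_{\text{CE}}(f, \mathcal{B}) = \frac{1}{B} \sum_{i=1}^{B} \ell(y_i, f(x_i)), \qquad \mathcal{L}_{\text{Distill}}(f_s, f_t, \mathcal{B}, \tau) = \frac{1}{\tau^2 B} \sum_{i=1}^{B} \ell(f_t(x_i, \tau), f_s(x_i, \tau)). \tag{10}$$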

We set α = 0.5, following one of the experiments in Hinton et al. (2014), so that every method has only one hyperparameter to tune. We tune the temperature scale τ with a grid search on {1, 2, 3, . . . , 9, 10}.
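To make the structure of Equations (9) and (10) concrete, the sketch below evaluates the KD loss on raw logits with NumPy. It is an illustration under stated assumptions rather than the implementation used in the experiments: ℓ is taken to be cross-entropy, the temperature divides the logits before the softmax, and the distillation term is scaled by 1/τ² as Equation (10) appears to read (Hinton et al. instead multiply the soft-target term by τ², so this factor should be treated as an assumption). All function names are hypothetical.

```python
import numpy as np


def softmax(logits, tau=1.0):
    """Row-wise softmax of logits / tau, computed stably."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(targets, probs, eps=1e-12):
    """Batch mean of -sum_c targets_c * log(probs_c)."""
    return float(-(targets * np.log(probs + eps)).sum(axis=1).mean())


def kd_loss(student_logits, teacher_logits, labels, tau=2.0, alpha=0.5):
    """alpha * L_CE + (1 - alpha) * L_Distill for one mini-batch."""
    num_classes = student_logits.shape[1]
    onehot = np.eye(num_classes)[labels]
    # Hard-label term: student predictions at temperature 1.
    l_ce = cross_entropy(onehot, softmax(student_logits))
    # Soft-label term: teacher distribution as target, both at temperature tau.
    # The 1 / tau**2 factor is an assumed reading of Equation (10).
    l_distill = cross_entropy(
        softmax(teacher_logits, tau), softmax(student_logits, tau)
    ) / tau ** 2
    return alpha * l_ce + (1.0 - alpha) * l_distill
```

With α fixed at 0.5, τ is the only remaining knob, which matches the one-hyperparameter setup described above.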

Results The results are presented in Table 2. We report the means and standard errors of accuracies over three runs. We can observe that KD and flooding methods, including Ada Flood, are not significantly better than the unregularized baseline on SVHN. However, Ada Flood noticeably improves the performance over the other methods on harder datasets like CIFAR10 and CIFAR100, whereas i Flood is not obviously better than the baseline and Flood is worse than the baseline on CIFAR100. The gap between Ada Flood and KD is more noticeable on CIFAR100, particularly with L2 regularization.

Discussion While i Flood is closely related to label smoothing, Ada Flood shares similarities with KD, as both utilize auxiliary networks. However, the motivations behind the two algorithms are fundamentally different. KD relies on predictions made on already-seen training examples, whereas Ada Flood leverages predictions on intentionally forgotten (or unseen) examples. Since the predictions of teacher networks in KD are based on already-seen examples, they do not serve as meaningful measures of uncertainty. In contrast, the predictions from an auxiliary network in Ada Flood can effectively measure uncertainty, and flood levels computed from these predictions can function as uncertainty regularizers. A disadvantage of Ada Flood, however, is the additional fine-tuning step required to forget already-seen examples, which is not necessary in KD.

4.3 Noisy Labels

Datasets In addition to CIFAR10 for image classification, we also use the tabular datasets Brazilian Houses and Wine Quality from Open ML (Vanschoren et al., 2013), following Grinsztajn et al. (2022), for regression tasks. We further employ Stanford Sentiment Treebank (SST-2) for the text classification task, following Xie et al. (2022). Details of datasets are provided in Appendix A.

We report accuracy for classification tasks. For regression tasks, we report mean squared error (MSE) in the main body, as well as mean absolute error (MAE) and R2 score in Figure 8 (Appendix C).

Implementation We inject noise for both image and text classification by changing the label to a uniformly randomly selected wrong class, following Xie et al. (2022). More specifically, for α% of the training data, we change the label to a uniformly random class other than the original label. For the regression tasks, we add errors sampled from a skewed normal distribution, with skewness parameter ranging from 0.0 to 3.0. Similar to the previous experiments, we tune learning rate and the weight for L2 regularization with the unregularized baseline (with early stopping and L2 regularization by default except for Figure 5c). Then, we tune the flood levels with the fixed learning rate and L2 regularization.
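The two noise-injection schemes described above can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function names are hypothetical, the noise rate is passed as a fraction rather than a percentage, and the skew-normal sampler uses the standard two-Gaussian representation with zero location and unit scale, since the location and scale used in the experiments are not stated here.

```python
import numpy as np


def flip_labels(labels, noise_rate, num_classes, rng):
    """Replace a fraction of labels with a uniformly random *wrong* class."""
    labels = np.asarray(labels).copy()
    n = len(labels)
    n_noisy = int(round(noise_rate * n))
    idx = rng.choice(n, size=n_noisy, replace=False)
    # An offset in {1, ..., K-1} modulo K is uniform over the other K-1 classes.
    offsets = rng.integers(1, num_classes, size=n_noisy)
    labels[idx] = (labels[idx] + offsets) % num_classes
    return labels, idx


def skew_normal_noise(size, skewness, rng):
    """Sample skew-normal noise; skewness 0 reduces to a standard normal."""
    delta = skewness / np.sqrt(1.0 + skewness ** 2)
    u0 = rng.standard_normal(size)
    u1 = rng.standard_normal(size)
    return delta * np.abs(u0) + np.sqrt(1.0 - delta ** 2) * u1
```

For regression, the sampled noise would be added to the targets; a larger skewness parameter makes the errors increasingly one-sided.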

Results Figure 5 compares the flooding methods in noisy settings. We report the mean and standard error over three runs for CIFAR10, and over five and seven runs for the tabular datasets and SST-2, respectively. For CIFAR10 and SST-2, we report ∆Acc (%) relative to the unregularized model: that is, we plot the accuracy of each method minus the accuracy of the unregularized method, to display the gaps between methods more clearly. The mean accuracies of the unregularized method are displayed below the zero line.

Wine Quality, Figure 5a: Ada Flood slightly outperforms the other methods at first, but the gap significantly increases as the noise level increases.

Brazilian Houses, Figure 5b: There is no significant difference between the methods at small noise levels (e.g., noise ≤ 1.5), but the MSE of Ada Flood becomes significantly lower as the noise level increases.

CIFAR10, Figure 5c: i Flood and Ada Flood significantly outperform Flood and unregularized. Ada Flood also outperforms i Flood when the noise level is high (e.g. 50%).

SST-2, Figure 5d: Flooding methods outperform the unregularized method. Ada Flood is comparable to i Flood up to the noise level of 30%, but noticeably outperforms it as the noise level increases.

[Figure 5 panels: (a) Wine Quality, (b) Brazilian Houses, (c) CIFAR10, (d) SST-2. The x-axis is the noise level (0.0 to 3.0 for the tabular panels; 10% to 60% for CIFAR10 and 10% to 50% for SST-2). Methods shown: Unreg., Flood, i Flood, Ada Flood. Unregularized mean accuracies printed below the zero line: 82.0, 79.0, 76.0, 72.0, 66.0, 59.0 (%) for CIFAR10 and 90.0, 89.0, 84.0, 78.0, 50.0 (%) for SST-2.]

Figure 5: Comparison of flooding methods on tabular and image datasets with noise and bias.

[Figure 6 panels: reliability diagrams over confidence bins 0.1 to 1.0 for (a) Unregularized, ECE: 18.31 ± 0.24; (b) Flood, ECE: 13.96 ± 1.6; (c) i Flood, ECE: 19.82 ± 2.27; (d) Ada Flood, ECE: 5.49 ± 0.33.]

Figure 6: Calibration results of flooding methods with 10 bins on CIFAR100. The bars and errors are the means and standard errors over three runs, respectively.

Overall, Ada Flood is more robust to noise than other flooding methods, since the model pays less attention to samples with high losses.

4.4 Calibration

Datasets and implementation Miscalibration, i.e., neural networks being over- or under-confident, is a well-known issue in deep learning. We thus evaluate the quality of calibration with different flooding methods on CIFAR100, as measured by the Expected Calibration Error (ECE) metric. (Figure 9 does the same for CIFAR10, but since model predictions are usually quite confident, this becomes difficult to measure.) We use a Res Net18 with L2 regularization and the optimal hyperparameters for the baseline and flooding methods. The optimal hyperparameter varies by seed for each run.
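For reference, ECE with equal-width confidence bins can be computed as in the sketch below. This is the commonly used binned definition and is assumed, not confirmed, to match the evaluation here; the function name is hypothetical, and the result is a fraction in [0, 1] (the figures report it as a percentage).

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-weighted mean of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)
```

Here `confidences` would be the maximum predicted class probability per test sample and `correct` a 0/1 indicator of whether the predicted class matches the label.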

Result Figure 6 reports the calibration quality in terms of ECE, together with a visualization over three runs compared to perfect calibration (dotted red lines). Ada Flood significantly improves calibration, both in ECE and visually. Note that i Flood is significantly miscalibrated at the bins corresponding to high probability (e.g., bin 0.7) compared to the other methods, and also has high standard errors. This behavior is expected, since i Flood encourages the model not to predict a probability higher than exp(−b), where b denotes the flood level used in i Flood.
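The exp(−b) cap follows directly from the cross-entropy loss. The short derivation below is a sketch under the assumption that the per-sample loss is the negative log-probability of the true class and that the per-sample loss is held at or above b; the two example values of b are taken from the i Flood CIFAR100 entries in Table 8.

```latex
\ell_i = -\log p_\theta(y_i \mid x_i), \qquad
\ell_i \ge b \;\Longleftrightarrow\; p_\theta(y_i \mid x_i) \le \exp(-b).
% Examples: b = 0.30 gives exp(-0.30) \approx 0.741,
%           b = 0.65 gives exp(-0.65) \approx 0.522.
```

With flood levels in this range, the confidence on the true class is pushed to roughly 0.5 to 0.75 for every sample, which is consistent with the miscalibration observed around bin 0.7.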

4.5 Ablation study: Relationship with Other Regularization

In this ablation study, we design an experiment that shows how different regularization methods interact with flooding methods. We conduct the experiment on CIFAR100 with Res Net18, gradually adding regularization methods in the order of early stopping, L2 regularization, dropout and Cut Mix (Yun et al., 2019), a popular data augmentation method, as shown in Table 3. Please note that the second row with early stopping and the third row with early stopping + L2 regularization are the same as what we report in Table 2.

Similar to the results in Table 2, Flood is comparable to or slightly worse than the unregularized baseline for the case with dropout and with both dropout and Cut Mix. Although i Flood is generally better than the unregularized baseline, as shown in the third to fifth rows, the gap is limited. Compared to Flood or i Flood, Ada Flood shows larger and more consistent improvements.

| Early Stopping | L2 | Dropout | Cut Mix | Unreg. | Flood | i Flood | Ada Flood |
|---|---|---|---|---|---|---|---|
| - | - | - | - | 56.39 ± 0.25 | 56.07 ± 0.19 | 56.07 ± 0.06 | 56.89 ± 0.19 |
| ✓ | - | - | - | 56.59 ± 0.32 | 55.88 ± 0.18 | 56.32 ± 0.05 | 57.25 ± 0.14 |
| ✓ | ✓ | - | - | 61.49 ± 0.16 | 60.96 ± 0.03 | 61.63 ± 0.12 | 62.31 ± 0.14 |
| ✓ | ✓ | ✓ | - | 62.04 ± 0.17 | 61.73 ± 0.13 | 62.19 ± 0.27 | 63.15 ± 0.10 |
| ✓ | ✓ | ✓ | ✓ | 67.08 ± 0.28 | 67.11 ± 0.26 | 67.09 ± 0.08 | 67.50 ± 0.16 |

Table 3: Comparison between flooding methods with and without various regularization methods on CIFAR100. We report the means and standard errors of three runs.

4.6 Ablation study: Fine-tuning vs. Multiple Auxiliaries

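[Figure 7: bar chart of wall-clock time (hours) for auxiliary training with 10 Auxiliaries, Layer 3, 4 + FC, Layer 4 + FC, and FC only. Test accuracies annotated on the chart: 90.81 ± 0.07%, 90.82 ± 0.08%, 90.92 ± 0.05%, 90.93 ± 0.08%.]

Figure 7: Comparison of aux. training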

Figure 7 compares training ten Res Net18 auxiliary networks (the original proposal) with fine-tuning a single auxiliary network (the efficient variant), in terms of the wall-clock time for training the auxiliary network(s) and the performance of the corresponding main model on the CIFAR10 test set. For the efficient variant, we fine-tune different sets of layers to show insensitivity to this choice: Layer3, 4 + FC; Layer4 + FC; and FC only, where Layer3 and Layer4 are the 3rd and 4th layers of Res Net18 and FC denotes the last fully connected layer. For example, Layer4 + FC means we fine-tune only Layer4 and the FC layer, freezing all previous layers. The results show that training multiple auxiliary networks yields a model of the same quality as fine-tuning, although training takes 3 to 4 times longer. There is also little difference in performance between the fine-tuning variants: fine-tuning only the FC layer seems sufficient to forget the samples, with early stopping regularizing well enough for similar generalization ability. We also compare Ada Flood with various architectures for auxiliary networks in Appendix F.
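The held-out bookkeeping shared by both variants can be sketched independently of any deep learning framework. The code below is a hypothetical illustration, not the procedure of Section 3.3 itself: `fit_and_score` is an assumed callback that trains (or fine-tunes) an auxiliary model on the `train` indices and returns per-sample losses on the `held` indices, so that every sample's loss comes from a model that did not see it.

```python
import numpy as np


def heldout_splits(num_samples, n_splits=10, seed=0):
    """Randomly partition sample indices into n_splits disjoint held-out sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)
    return np.array_split(perm, n_splits)


def heldout_losses(num_samples, fit_and_score, n_splits=10, seed=0):
    """Collect, for every sample, the loss of a model that was not trained on it."""
    losses = np.empty(num_samples)
    all_idx = np.arange(num_samples)
    for held in heldout_splits(num_samples, n_splits, seed):
        train = np.setdiff1d(all_idx, held)
        # Hypothetical callback: fit on `train`, return losses on `held`.
        losses[held] = fit_and_score(train, held)
    return losses
```

In the original proposal the callback would train a fresh network per split; in the efficient variant it would re-initialize and fine-tune only the chosen layers of one pre-trained network.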

5 Conclusion

In this paper, we introduced the Adaptive Flooding (Ada Flood) regularizer, a novel regularization technique that adaptively regularizes the loss of each sample based on the difficulty of the sample. Each flood level is computed only once through an auxiliary training procedure with held-out splitting, which we can make more efficient by fine-tuning the last few layers on held-out sets. Experimental results across domains and tasks (density estimation for asynchronous event sequences, image and text classification, and regression on tabular datasets, with and without noise) demonstrated that our approach applies robustly to a wide range of tasks, including calibration.

Limitation Although Ada Flood is a robust and effective regularizer on many different tasks, particularly in high-noise settings, an open question that we leave for future work is how to best apply Ada Flood in long-tailed learning. For long-tailed data, it is expected that samples from the rare classes will tend to have higher losses. During the training of the main model, Ada Flood will direct the model to keep the higher losses for rare classes and lower losses for common classes, which may not be desirable. One potential solution could be to adaptively adjust γ for different classes. Alternatively, imbalanced learning techniques such as resampling, reweighting, or two-stage training could be adopted.

Reproducibility For each experiment, we listed implementation details such as model, regularization, and search space for hyperparameters. We also specified datasets we used for each experiment, and how they were split and augmented, along with the description of metrics. The code is released with the final version.

References

Wonho Bae, Mohamed Osama Ahmed, Frederick Tung, and Gabriel L Oliveira. Meta temporal point processes. In ICLR, 2023.

Randall Balestriero, Leon Bottou, and Yann Le Cun. The effects of regularization and data augmentation are class dependent. In Neur IPS, 2022.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

Luca Bertinetto, João Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In ICLR, 2019.

Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020.

Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Neur IPS, 2019.

Junyoung Chung, Caglar Gulcehre, Kyung Hyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neur IPS, 2014.

Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. In ICLR, 2020.

Ruizhou Ding, Ting-Wu Chin, Zeye Liu, and Diana Marculescu. Regularizing activation distribution for training binarized deep networks. In CVPR, 2019.

Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In KDD, 2016.

Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In ICLR, 2018.

Luca Franceschi, Paolo Frasconi, Saverio Salvo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In ICML, 2018.

Yu Gong, Greg Mori, and Frederick Tung. Rank Sim: Ranking similarity regularization for deep imbalanced regression. In ICML, 2022.

Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Neur IPS, 2022.

Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with backpropagation. In Neur IPS, 1988.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. NIPSW, 2014.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, and Masashi Sugiyama. Do we need zero training loss after achieving zero training error? In ICML, 2020.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Neur IPS, 2018.

Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C Mozer. Characterizing structural regularities of labeled data in overparameterized models. In ICML, 2021.

Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. In ICLR, 2023.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In Neur IPS, 1991.

Tyler La Bonte, Vidya Muthukumar, and Abhishek Kumar. Towards last-layer retraining for group robustness with fewer annotations. In Neur IPS, 2023.

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Neur IPS, 2019.

Changlin Li, Guangrun Wang, Bing Wang, Xiaodan Liang, Zhihui Li, and Xiaojun Chang. Dynamic slimmable network. In CVPR, 2021.

Yawei Li, Shuhang Gu, Christoph Mayer, Luc Van Gool, and Radu Timofte. Group sparsity: The hinge between filter pruning and decomposition for network compression. In CVPR, 2020.

Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, and Tie-Yan Liu. R-Drop: Regularized dropout for neural networks. In Neur IPS, 2021.

Soon Hoe Lim, N. Benjamin Erichson, Francisco Utrera, Winnie Xu, and Michael W. Mahoney. Noisy feature mixup. In ICLR, 2022.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.

Jun Liu and Jieping Ye. Efficient l1/lq norm regularization. arXiv preprint arXiv:1009.4766, 2010.

Pratyush Maini, Saurabh Garg, Zachary Lipton, and J Zico Kolter. Characterizing datapoints via second-split forgetting. In Neur IPS, volume 35, pp. 30044–30057, 2022.

Sören Mindermann, Jan M Brauner, Muhammed T Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N Gomez, Adrien Morisot, Sebastian Farquhar, and Yarin Gal. Prioritized training on points that are learnable, worth learning, and not yet learnt. In ICML, 2022.

Mohamad Amin Mohamadi, Wonho Bae, and Danica J Sutherland. A fast, well-founded approximation to the empirical neural tangent kernel. In ICML, 2023.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR Workshop, 2015.

Yi Ren, Shangmin Guo, and Danica J. Sutherland. Better supervisory signals by observing learning paths. In ICLR, 2022.

Oleksandr Shchur, Marin Biloš, and Stephan Günnemann. Intensity-free learning of temporal point processes. In ICLR, 2020.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, pp. 1929–1958, 2014.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020.

Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(11), 2008.

Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. Openml: networked science in machine learning. SIGKDD Explorations, pp. 49–60, 2013.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neur IPS, 2017.

Thomas Verelst and Tinne Tuytelaars. Dynamic convolutions: Exploiting spatial sparsity for faster inference. In CVPR, 2020.

Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In ICML, 2019.

Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Neur IPS, 2013.

Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. Rethinking architecture selection in differentiable NAS. In ICLR, 2021.

Yuexiang Xie, Zhen Wang, Yaliang Li, Ce Zhang, Jingren Zhou, and Bolin Ding. i Flood: A stable and effective regularizer. In ICLR, 2022.

Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisiting knowledge distillation via label smoothing regularization. In CVPR, 2020.

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

Tao Zhuang, Zhixuan Zhang, Yuheng Huang, Xiaoyi Zeng, Kai Shuang, and Xiang Li. Neuron-level structured pruning using polarization regularizer. In Neur IPS, 2020.

Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer Hawkes process. In ICML, 2020.

A Details about Datasets

Stack Overflow It contains 6,633 sequences with 480,414 events, where an event is the acquisition of a badge by a user. The maximum sequence length is 736 and the number of marks is 22. The dataset is provided by Du et al. (2016); we use the first folder, following Shchur et al. (2020) and Bae et al. (2023).

Reddit It contains 10,000 sequences with 532,026 events, where an event is a post on Reddit. The maximum sequence length is 736 and the number of marks is 22. Marks represent sub-reddit categories.

Uber It contains 791 sequences with 701,579 events, where an event is a customer pick-up. The maximum sequence length is 2,977 and there are no marks. It is processed and provided by Bae et al. (2023).

Brazilian Houses It contains information on 10,962 houses for rent in Brazil in 2020, with 13 features. The target is the rent price of each house in Brazilian Real. According to Open ML (Vanschoren et al., 2013), where we obtained this dataset, the data is web-scraped, so some values in the dataset can be considered outliers.

Wine Quality It contains 6,497 samples with 11 features and the quality of wine is numerically labeled as targets. This dataset is also obtained from Open ML (Vanschoren et al., 2013).

SST-2 The Stanford Sentiment Treebank (SST-2) is a dataset containing fully annotated parse trees, enabling a comprehensive exploration of how sentiment influences language composition. Comprising 11,855 individual sentences extracted from film reviews, this dataset underwent parsing using the Stanford parser, resulting in a collection of 215,154 distinct phrases.

B Additional Results on Image Classification

Datasets We use Image Net100 (Tian et al., 2020) for image classification with random crop, horizontal flip, and color jitter as augmentation. We also add 30% of label noise as done in Section 4.3.

Implementation We train Res Net34 (He et al., 2016) on the dataset with L2 regularization (with a weight of 0.0001). All methods are trained for 200 epochs with early stopping using SGD. We use a multistep learning rate scheduler with an initial learning rate of 0.1 and decay coefficient of 0.5, applied at every 25 epochs. The optimal flood levels are selected based on validation performance with a grid search on {0.01, 0.02..., 0.1, 0.15, 0.2..., 0.3} for Flood and i Flood, and {0.05, 0.1..., 0.95} for Ada Flood. We use a single Res Net34 auxiliary network where its last FC layer is randomly initialized and fine-tuned on held-out sets with n = 10 splits.

Results Table 4 (Left) compares flooding methods on the Image Net100 dataset with and without 30% label noise. We report test accuracies, with the expected calibration error (ECE) on the right. Although Flood and i Flood do not improve the performance over the unregularized model, Ada Flood improves it by about 0.80% over the unregularized baseline. Given the size of the dataset, this gap is not marginal. It is even larger than the gaps we observed on the SVHN and CIFAR datasets (Table 2). We conjecture that this is because Image Net contains more noisy samples: it is well known that many Image Net images contain multiple objects even though the label names only one (Beyer et al., 2020).

C Additional Results on Tabular Regression

Datasets We use the NYC Taxi Tip dataset from Open ML (Vanschoren et al., 2013), one of the largest tabular datasets used in Grinsztajn et al. (2022), for regression tasks. The NYC Taxi Tip dataset contains 581,835 rows and 9 features. As the name of the dataset implies, the target variable is "tip amount". To increase the importance of the other features, the creator of the dataset deliberately omits "fare amount" and "trip distance".

| Method | Mislabeled Sample Rate 0% | Mislabeled Sample Rate 30% |
|---|---|---|
| Unreg. | 81.00 (6.64) | 68.12 (17.23) |
| Flood | 81.18 (6.44) | 68.19 (17.71) |
| i Flood | 81.04 (6.86) | 68.24 (20.67) |
| Ada Flood | 81.79 (4.81) | 69.22 (17.45) |

| Method | Noise Level 0.0 | Noise Level 1.5 | Noise Level 3.0 |
|---|---|---|---|
| Unreg. | 0.2373 (0.3335) | 0.3707 (-0.0409) | 0.3910 (-0.0978) |
| Flood | 0.2373 (0.3335) | 0.3707 (-0.0409) | 0.3904 (-0.0980) |
| i Flood | 0.2370 (0.3374) | 0.3652 (-0.0255) | 0.3902 (-0.0986) |
| Ada Flood | 0.2369 (0.3348) | 0.3520 (0.0250) | 0.3465 (0.0119) |

Table 4: Comparison of flooding methods on the Image Net100 (Left) and NYC Taxi Tip (Right) datasets with and without label noise. The numbers on the right of each entry are the ECE metric for Image Net100 and R2 scores for NYC Taxi Tip.

[Figure 8 panels: (a) Brazilian House (MAE), (b) Brazilian House (R2), (c) Wine Quality (MAE), (d) Wine Quality (R2). The x-axis is the noise level (0.0 to 3.0). Methods shown: Unreg., Flood, i Flood, Ada Flood.]

Figure 8: Additional results in various metrics on tabular datasets with noise and bias

Implementation As in Section 4.3, we use a model tailored for tabular datasets proposed by Grinsztajn et al. (2022) and add errors sampled from a skewed normal distribution, with the skewness parameter ranging from 0.0 to 3.0.

Results Table 4 (Right) compares flooding methods on the NYC Taxi Tip dataset (Grinsztajn et al., 2022) with and without noise. We report the mean squared error (MSE), with the R2 score on the right. Note that the R2 score usually lies between 0 and 1, but it can fall below 0 when predictions are poor.
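The remark about negative R2 scores can be checked with the standard definition, R2 = 1 − SS_res / SS_tot. The snippet below is a small illustrative sketch (the function name is hypothetical): a model that always predicts the mean of the targets scores 0, and any model with a larger squared error than that constant predictor scores below 0.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variance around the mean
    return float(1.0 - ss_res / ss_tot)
```

For targets [1, 2, 3, 4], predicting the mean 2.5 everywhere gives 0, while the reversed predictions [4, 3, 2, 1] give 1 − 20/5 = −3.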

From the table, we can observe that all flooding methods perform similarly to the unregularized baseline when there is no noise. Flood and i Flood remain close to the baseline under noisy settings as well, whereas Ada Flood significantly outperforms the other methods (lower MSE and higher R2 scores) at noise levels 1.5 and 3.0. In particular, while the R2 scores of the other methods fall below 0, this does not happen with Ada Flood, which demonstrates its robustness even on a large-scale dataset like NYC Taxi Tip. This is consistent with the results in Section 4.3.

For additional information, Figure 8 provides results in MAE and R2 on the Brazilian House and Wine Quality datasets, whose MSE results are given in Figures 5a and 5b.

D Additional Results on Calibration

We provide calibration results on CIFAR10 in Figure 9. As briefly mentioned in Section 4.4, since model predictions are generally accurate and quite confident, the gaps between methods are not as significant as on CIFAR100 (Figure 6).

E Initialization of the Main Model using an Auxiliary Network

Even though we efficiently fine-tune an auxiliary network to compute flood levels, it may still be too expensive to train both the auxiliary network $f^{\mathrm{aux}}$ and the main model f. To reduce computation further, we can reuse the pre-trained auxiliary network when the auxiliary and main models share the same architecture: instead of randomly initializing the main model, we initialize its parameters with those of the pre-trained auxiliary network (before fine-tuning).

[Figure 9 panels: reliability diagrams over confidence bins 0.1 to 1.0 for (a) Unregularized, ECE: 5.9 ± 1.06; (b) Flood, ECE: 4.06 ± 1.3; (c) i Flood, ECE: 4.5 ± 0.57; (d) Ada Flood, ECE: 4.23 ± 0.85.]

Figure 9: Calibration results of flooding methods with 10 bins on CIFAR10.

| Random Init | Freezing first k layers: k = 0 | k = 1 | k = 2 | k = 3 |
|---|---|---|---|---|
| 90.83 / 3.52 | 91.30 / 2.50 | 90.90 / 3.44 | 90.68 / 3.96 | 89.99 / 4.69 |

Table 5: Comparison between random initialization and initialization with the pre-trained auxiliary network. ECE metrics are reported on the right.

We conduct an experiment on CIFAR10 with Res Net18 to validate whether this actually saves computation without hurting performance. We first train an auxiliary network on the whole training set. The pre-trained auxiliary network is then used both for computing θ through the fine-tuning step described in Section 3.3 and for initializing the main model. When we fine-tune the main model, we freeze the first k layers (among the four layers of Res Net18) to save computation, and randomly initialize the last fully connected layer so that the model forgets some information.

Table 5 shows that initializing the main model with the parameters of the pre-trained auxiliary model, without freezing any layers, performs better than random initialization. As we freeze more layers to save computation, the performance gradually decreases relative to random initialization, but it remains comparable up to k = 2.

F Ablation study: Various Architectures for Auxiliary Networks

| | VGG11 | VGG19 | Res Net18 Small | Res Net18 |
|---|---|---|---|---|
| # of params | 9.8 M | 21.6 M | 2.8 M | 11.2 M |
| Ada Flood | 90.74 ± 0.13 | 91.09 ± 0.06 | 90.73 ± 0.12 | 90.82 ± 0.08 |

Table 6: Comparison of various architectures for auxiliary networks in Ada Flood.

To investigate the robustness of Ada Flood in terms of the choice of architectures for auxiliary networks, we conduct an ablation study on Ada Flood with various architectures for auxiliary networks: VGG11, VGG19, Res Net18 small and Res Net18. Here, we use a Res Net18 for the main model, and utilize the efficient finetuning method to train auxiliary networks. We report the mean and standard error of three runs in Table 6.

With VGG11, which has slightly fewer parameters than Res Net18, the mean test accuracy of the main model is lower, but the gap is marginal. With VGG19, which is larger than Res Net18, the performance slightly improves. We also try a smaller variant of Res Net18 for the auxiliary network, with a quarter of the parameters of the original Res Net18. Even with this significantly smaller architecture, the mean test accuracy of the main model is only slightly worse than with much larger models such as VGG11 and Res Net18. This implies that what matters from the auxiliary network is the relative magnitude of the losses (or flood levels), not their absolute values.

| NTPP | Method | Uber | Reddit | Stack Overflow |
|---|---|---|---|---|
| Intensity-free | Flood | {4.0, 4.0, 3.0} | {-1.0, -5.0, -3.0} | {0.0, 3.0, 1.0} |
| Intensity-free | i Flood | {4.0, 0.0, 1.0} | {-2.0, -3.0, -4.0} | {2.0, 2.0, 3.0} |
| Intensity-free | Ada Flood | {0.2, 0.0, 0.3} | {0.0, 0.1, 0.1} | {0.1, 0.4, 0.2} |
| (second NTPP model) | Flood | {0.0, -5.0, 0.0} | {-1.0, -2.0, -15.0} | {0.0, 2.0, 1.0} |
| (second NTPP model) | i Flood | {-1.0, 4.0, 3.0} | {-4.0, -4.0, -5.0} | {0.0, 0.0, -1.0} |
| (second NTPP model) | Ada Flood | {0.0, 0.1, 0.3} | {0.0, 0.0, 0.1} | {0.3, 0.3, 0.3} |

Table 7: Choice of flood levels for (i)Flood, and γ for Ada Flood on TPP datasets with three seeds.

| Method | SVHN w/o L2 reg. | SVHN w/ L2 reg. | CIFAR10 w/o L2 reg. | CIFAR10 w/ L2 reg. | CIFAR100 w/o L2 reg. | CIFAR100 w/ L2 reg. |
|---|---|---|---|---|---|---|
| Flood | {0.04, 0.02, 0.03} | {0.02, 0.03, 0.04} | {0.02, 0.04, 0.05} | {0.07, 0.02, 0.01} | {0.01, 0.01, 0.01} | {0.01, 0.01, 0.05} |
| i Flood | {0.05, 0.09, 0.02} | {0.05, 0.06, 0.06} | {0.07, 0.04, 0.07} | {0.07, 0.03, 0.01} | {0.45, 0.35, 0.30} | {0.30, 0.40, 0.65} |
| Ada Flood | {0.75, 0.85, 0.70} | {0.35, 0.50, 0.65} | {0.95, 0.70, 0.65} | {0.75, 0.95, 0.65} | {0.40, 0.50, 0.45} | {0.50, 0.50, 0.55} |

Table 8: Choice of flood levels for (i)Flood, and γ for Ada Flood on image classification with three seeds.

G Theoretic Intuition for Efficient Training of Auxiliary Networks

In this section, we provide theoretic intuition for efficient training of auxiliary networks. Following Lee et al. (2019), we approximate the prediction of a neural network $f$ on a test sample $x_i$ as

$$f(x_i) \approx \sum_{j=1}^{n} \alpha_j \, \mathrm{eNTK}(x_i, x_j), \tag{11}$$

where eNTK stands for the empirical neural tangent kernel (NTK), following Mohamadi et al. (2023), and $\{x_j\}_{j=1}^{n}$ denotes the data from the training set. Equation (11) says that a prediction on $x_i$ can be approximated as an interpolation of $\mathrm{eNTK}(x_i, \cdot)$ with some weights $\alpha$.
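As a rough illustration of Equation (11), the following NumPy sketch builds the empirical NTK of a tiny one-hidden-layer network with a scalar output and uses it for kernel interpolation. The network, the toy data, the ridge term, and the helper names `param_gradient` and `entk` are assumptions made for this example only; they are not taken from the paper's experiments.

```python
import numpy as np

def param_gradient(params, x):
    """Gradient of the scalar output f(x) = v . tanh(W x) w.r.t. all parameters."""
    W, v = params
    phi = np.tanh(W @ x)                        # penultimate feature, shape (h,)
    grad_W = np.outer(v * (1.0 - phi ** 2), x)  # df/dW, shape (h, d)
    grad_v = phi                                # df/dv, shape (h,)
    return np.concatenate([grad_W.ravel(), grad_v])

def entk(params, xs_a, xs_b):
    """Empirical NTK Gram matrix: inner products of parameter gradients."""
    grads_a = np.stack([param_gradient(params, x) for x in xs_a])
    grads_b = np.stack([param_gradient(params, x) for x in xs_b])
    return grads_a @ grads_b.T

rng = np.random.default_rng(0)
d, h, n = 3, 64, 20                             # input dim, width, training size
params = (rng.normal(size=(h, d)) / np.sqrt(d),
          rng.normal(size=h) / np.sqrt(h))

x_train = rng.normal(size=(n, d))
y_train = np.sin(x_train[:, 0])                 # toy regression targets
x_test = rng.normal(size=(5, d))

# Solve for the weights alpha; the small ridge term is an assumption that
# keeps the linear system well conditioned.
K_train = entk(params, x_train, x_train)
alpha = np.linalg.solve(K_train + 1e-3 * np.eye(n), y_train)

# Equation (11): f(x_i) ~= sum_j alpha_j * eNTK(x_i, x_j)
pred_test = entk(params, x_test, x_train) @ alpha
pred_train = K_train @ alpha
```

Here `alpha` plays the role of the weights α, and each test prediction is a weighted sum of kernel values between the test point and the training points.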

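Suppose $f(x) = V\phi(x)$, where $\phi(x) \in \mathbb{R}^{h}$ denotes the feature from the penultimate layer and $V \in \mathbb{R}^{k \times h}$ denotes the weights of the last fully connected layer ($k$ being the number of classes), whose $j$-th row is $v_j \in \mathbb{R}^{h}$. If $v_{j,i}$, the $i$-th entry of $v_j$, is drawn from $\mathcal{N}(0, \sigma^2)$, then Mohamadi et al. (2023) have shown that

$$\mathrm{eNTK}_{w}(x_1, x_2)_{jj'} = v_j^{\top} \, \mathrm{eNTK}^{\phi}_{w \setminus V}(x_1, x_2) \, v_{j'} + \mathbb{1}(j = j') \, \phi(x_1)^{\top} \phi(x_2), \tag{12}$$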

where $w$ denotes the set of all model parameters and $w \setminus V$ denotes the set of all parameters except those of the last fully connected layer.
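The decomposition in Equation (12) can be checked numerically on a toy model. The sketch below assumes a one-hidden-layer network $f(x) = V \tanh(Wx)$, so that $w \setminus V = \{W\}$, and writes out the Jacobians by hand; the dimensions, random inputs, and function names are illustrative choices, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, k = 4, 6, 3                        # input dim, feature dim, classes
W = rng.normal(size=(h, d))              # parameters w \ V (one hidden layer)
V = rng.normal(size=(k, h))              # last layer; row j is v_j

def features(x):
    return np.tanh(W @ x)                # phi(x) in R^h

def jac_phi_wrt_W(x):
    """Jacobian of phi(x) w.r.t. W, flattened to shape (h, h*d)."""
    phi = features(x)
    J = np.zeros((h, h, d))
    for a in range(h):
        J[a, a, :] = (1.0 - phi[a] ** 2) * x   # d phi_a / d W[a, :]
    return J.reshape(h, h * d)

def jac_f_wrt_all(x):
    """Jacobian of f(x) = V phi(x) w.r.t. (W, V), shape (k, h*d + k*h)."""
    phi = features(x)
    J_W = V @ jac_phi_wrt_W(x)                 # chain rule through phi
    J_V = np.kron(np.eye(k), phi[None, :])     # d f_j / d V[j', :] = 1(j=j') phi
    return np.hstack([J_W, J_V])

x1, x2 = rng.normal(size=d), rng.normal(size=d)

# Left-hand side of Equation (12): the full eNTK, a k x k matrix.
lhs = jac_f_wrt_all(x1) @ jac_f_wrt_all(x2).T

# Right-hand side: feature-kernel term plus the last-layer term.
entk_phi = jac_phi_wrt_W(x1) @ jac_phi_wrt_W(x2).T        # h x h
rhs = V @ entk_phi @ V.T + np.eye(k) * (features(x1) @ features(x2))

max_abs_diff = np.max(np.abs(lhs - rhs))
```

In this toy setting the two sides agree up to floating-point error, so `max_abs_diff` should be close to zero. The first term carries the contribution of the parameters below the last layer, and the second term carries the contribution of $V$ itself.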

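With this framework, we can approximate the predictions of the efficiently trained auxiliary network, denoted as $f^{\text{tune}}$, as follows:

$$f^{\text{tune}}(x_i) \approx \sum_{j \neq i} \alpha^{\text{tune}}_j \left( v^{\top} \, \mathrm{eNTK}^{\phi,\text{trained}}_{w \setminus V}(x_j, x_i) \, v + \phi^{\text{trained}}(x_j)^{\top} \phi^{\text{trained}}(x_i) \right). \tag{13}$$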

Here, superscript trained means the model parameters are pre-trained on the whole training set.

On the other hand, the original direct training algorithm trains each auxiliary network from scratch on the training set, excluding the samples whose difficulty we measure. Similarly, the predictions of a single auxiliary network, denoted as $f^{\text{direct}}$, are approximated as

$$f^{\text{direct}}(x_i) \approx \sum_{j \neq i} \alpha^{\text{direct}}_j \left( v^{\top} \, \mathrm{eNTK}^{\phi,\text{untrained}}_{w \setminus V}(x_j, x_i) \, v + \phi^{\text{untrained}}(x_j)^{\top} \phi^{\text{untrained}}(x_i) \right), \tag{14}$$

where superscript untrained means the model parameters are randomly initialized.

In the NTK regime, where the neural network has infinite width, the kernel stays fixed at its initialization value during training (Lee et al., 2019), so the terms in the parentheses (everything except the α's) are the same in Equation (13) and Equation (14). Therefore, the difficulty measures obtained from the efficient training of a single auxiliary network and from the direct training of multiple auxiliary networks are equivalent in the highly overparameterized regime.

H Choice of Flood Levels

In Table 7 and Table 8, we report the chosen flood levels for Flood and i Flood, and the chosen γ for Ada Flood (recall that γ is the hyperparameter of the correction function that adjusts the interpolation level), on TPP and image classification tasks with three different random seeds.

One interesting observation is that the flood levels for i Flood on CIFAR100 are significantly larger than on the other datasets. This is because CIFAR100 is considerably harder than the other two datasets. Even when a flood level is high, e.g., 0.65, it is still a reasonable choice: a flood level of 0.65 implies that the model's highest predicted probabilities do not deviate much from exp(−0.65) ≈ 0.52. With this strong regularization, the model does not become overconfident in its predictions.
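The arithmetic behind this reading of a flood level can be sketched in a few lines of Python. The sketch assumes the per-sample loss is the cross-entropy, so a loss of b corresponds to a probability of exp(−b) on the true class, and it assumes the flood objective in its usual form |ℓ − b| + b; the function names are hypothetical.

```python
import math

def implied_confidence(flood_level: float) -> float:
    """Probability on the true class at which cross-entropy equals the flood level."""
    return math.exp(-flood_level)

def flooded_loss(loss: float, flood_level: float) -> float:
    """Flood objective |loss - b| + b, assumed here in its usual form."""
    return abs(loss - flood_level) + flood_level

# A few flood levels that appear in Table 8.
for b in (0.01, 0.30, 0.65):
    print(f"flood level {b:.2f} -> confidence about {implied_confidence(b):.3f}")
# flood level 0.01 -> confidence about 0.990
# flood level 0.30 -> confidence about 0.741
# flood level 0.65 -> confidence about 0.522
```

Under these assumptions, once a sample's loss falls below its flood level the objective increases again, which discourages the model from pushing the true-class probability far beyond `implied_confidence(b)`.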