# Harnessing Hard Mixed Samples with Decoupled Regularizer

Zicheng Liu1,2, Siyuan Li1,2, Ge Wang1,2, Chen Tan1,2, Lirong Wu1,2, Stan Z. Li2
AI Lab, Research Center for Industries of the Future, Hangzhou, China; 1Zhejiang University; 2Westlake University
{liuzicheng; lisiyuan; wangge; tanchen; lirongwu; stan.zq.li}@westlake.edu.cn

Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data. Recently, dynamic mixup methods have effectively improved previous static policies (e.g., linear interpolation) by maximizing target-related salient regions in mixed samples, but excessive additional time costs are not acceptable. These additional computational overheads mainly come from optimizing the mixed samples according to the mixed labels. However, we found that the extra optimizing step may be redundant because label-mismatched mixed samples are informative hard mixed samples for deep models to localize discriminative features. In this paper, we thus do not try to propose a more complicated dynamic mixup policy but rather an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM). The primary effect is that DM can adaptively utilize those hard mixed samples to mine discriminative features without losing the original smoothness of mixup. As a result, DM enables static mixup methods to achieve comparable or even better performance than dynamic methods without any extra computation. This also leads to an interesting objective design problem for mixup training: we need to focus on both smoothing the decision boundaries and identifying discriminative features. Extensive experiments on supervised and semi-supervised learning benchmarks across seven datasets validate the effectiveness of DM as a plug-and-play module. Source code and models are available at https://github.com/Westlake-AI/openmixup.

Equal contribution. Stan Z. Li (Stan.ZQ.Li@westlake.edu.cn) is the corresponding author.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

1 Introduction

Figure 1: Visualization of hard mixed sample mining by class activation mapping (CAM) [49] of ResNet-50 on ImageNet. From left to right: the mixed inputs (Mixup and CutMix) and the CAM of the top-2 predicted classes (Squirrel and Panda) using the mixup cross-entropy (MCE) loss and the decoupled mixup (DM) loss.

Figure 2: Illustration of the two types of hard mixed samples in CutMix, using Squirrel and Panda as an example (e.g., a sample mixed as 0.3 Squirrel / 0.7 Panda is a hard mixed sample for Squirrel, and 0.7 Squirrel / 0.3 Panda is a hard mixed sample for Panda). A hard mixed sample contains salient features of a class while the value of the corresponding label is small. The MCE loss fails to leverage these samples.

Deep learning has become the bedrock of modern AI for many tasks in machine learning [3], such as computer vision [19, 18] and natural language processing [12]. Using a large number of learnable parameters, deep neural networks (DNNs) can recognize subtle dependencies in large training datasets to be later leveraged for accurate predictions on unseen data. However, models might overfit the training set without constraints or enough data [53].
To this end, regularization techniques have been deployed to improve generalization [61], which can be categorized into data-independent or data-dependent ones [16]. Some data-independent strategies constrain the model by penalizing parameter norms, such as weight decay [40]. Among data-dependent strategies, data augmentations [51] are widely used, and the augmentation policies often rely on particular domain knowledge [58] in different fields. Mixup [77], a data-dependent augmentation technique, generates virtual samples by a linear combination of data pairs and the corresponding labels with a mixing ratio λ ∈ [0, 1]. Recently, a line of optimizable mixup methods has been proposed to improve mixing policies and generate object-aware virtual samples by optimizing discriminative regions in the data space to match the corresponding labels [56, 23, 22] (referred to as dynamic methods). However, although the dynamic approach brings some performance gain, the extra computational overhead significantly degrades the efficiency of mixup augmentation. Specifically, most of the computation of dynamic methods is spent on optimizing away label-mismatched samples, but the question of why these label-mismatched samples should be avoided during mixup training has rarely been analyzed.

In this paper, we find that these mismatched samples are completely underutilized by static mixup methods, and that the problem lies in the loss function, the mixed cross-entropy (MCE) loss. We therefore argue that these mismatched samples are not a disadvantage of static mixup but rather hard mixed samples full of discriminative information. Taking CutMix [74] as an example, two types of hard mixed samples are shown on the right of Figure 2. Since the MCE loss forces the model's predictions to be consistent with the soft label distribution, i.e., the model cannot give high-confidence predictions for the relevant classes even if the feature is salient in hard mixed samples, these hard samples are not fully leveraged. From this perspective, we expect the model to be able to mine these hard samples, i.e., to give confident predictions according to salient features for localizing discriminative characteristics, even if the proportion of those features is small.

Motivated by this finding, we introduce the simple yet effective Decoupled Mixup (DM) loss, a mixup objective function for explicitly leveraging hard samples during mixup training. On top of the standard mixed cross-entropy (MCE) loss, an extra decoupled regularizer term is introduced to enhance the ability to mine the underlying discriminative statistics in a mixed sample by computing the predicted probabilities of each mixed class independently. Figure 1 shows that the proposed DM loss empowers static mixup methods to explore more discriminative features. Extensive experiments demonstrate that DM achieves data-efficient training on supervised and semi-supervised learning benchmarks. Our contributions are summarized below: Unlike dynamic mixup policies that design complicated mixing policies, we propose DM, a mixup objective function that mines discriminative features adaptively. Our work contributes more broadly to understanding mixup training: it is essential to focus not only on smoothness, by regressing the mixed labels, but also on discrimination, by encouraging the model to give reliable and confident predictions.
Not only in supervised learning but the proposed DM can also be easily generalized to semi-supervised learning with a minor modification. By leveraging the unlabeled data, it can reduce the conformation bias and significantly improve performance. Comprehensive experiments on various tasks verify the effectiveness of DM, e.g., DMbased static mixup policies achieve a comparable or even better performance than dynamic methods without the extra computation. 2 Related Work Mixup Augmentation. As data-dependent augmentation techniques, mixup methods generate new samples by mixing samples and corresponding labels with well-designed mixing policies [77, 57, 69, 64]. The pioneering mixing method is Mixup [77], whose mixed samples are generated by linear interpolation between pairs of samples. Manifold Mix variants [57, 14] extend Mixup to the latent space of DNNs. After that, cut-based methods [74] are proposed to improve the mixup for localizing important features, especially in the vision field. Many researchers explore using nonlinear or optimizable sample mixup policies to generate more reliable mixed samples according to mixed labels, such as Puzzle Mix variants [23, 22, 45], Saliency Mix variants [56, 60], Auto Mix variants [38, 31], and Super Mix [11]. Concurrently, recent works try to generate more accurate mixed labels with saliency information [20] or attention maps [5, 9, 7] for Transformer architectures, which require prior pre-trained knowledge or attention information. On the contrary, the proposed decoupled mixup is a pluggable learning objective for mixup augmentations. Moreover, mixup methods extend to more than two elements [22, 11] and regression tasks [70]. Some researchers also utilize mixup augmentations to enhance contrastive learning [8, 21, 28, 50, 31] or masked image modeling [33, 6] to learn general representation in a self-supervised manner. Semi-supervised Learning and Transfer Learning. Pseudo-Labeling [27] is a popular semisupervised learning (SSL) method that utilizes artificial labels converted from teacher model predictions. Mix Match [2] and Re Mix Match [1] apply mixup on labeled and unlabeled data to enhance the diversity of the dataset. More accurate pseudo-labeling relies on data augmentation techniques to introduce consistency regularization, e.g., UDA [65] and Fix Match [52] employ weak and strong augmentations to improve the consistency. Furthermore, Co Match [29] unifies consistency regularization, entropy minimization, and graph-based contrastive learning to mitigate confirmation bias. Recently proposed works [62, 4] that improve Fix Match by designing more accurate confidence-based pseudo-label selection strategies, e.g., Flex Match [76] applying curriculum learning for updating confidence threshold dynamically and class-wisely. More recently, Semi Reward [30] proposes a reward model to filter out accurate pseudo labels with reward scores. Fine-tuning a pre-trained model on labeled datasets is a widely adopted form of transfer learning (TL) in various applications. Previously, [13, 44] show that transferring pre-trained Alex Net features to downstream tasks outperforms hand-crafted features. Recent works mainly focus on better exploiting the discriminative knowledge of pre-trained models from different perspectives. L2-SP [35] promotes the similarity of the final solution with pre-trained weights by a simple L2 penalty. DELTA [34] constrains the model by a subset of pre-trained feature maps selected by channel-wise attention. 
BSS [68] avoids negative transfer by penalizing small singular values. More recently, Self-Tuning variants [67, 54] combine contrastive learning with TL to tackle confirmation bias and model shift issues in a one-stage framework.

3 Decoupled Mixup

3.1 Preliminary

Mixed Cross-Entropy Underutilizes Mixup. Let y ∈ R^C denote the ground-truth label with C categories. For a labeled data point x ∈ R^{W×H×C}, its embedded representation z is obtained from the model M, and the predicted probability p is calculated through a Softmax function, p = σ(z). Given the mixing ratio λ ∈ [0, 1] and a λ-related mixup mask H ∈ R^{W×H}, the mixed sample (x_(a,b), y_(a,b)) is generated as

x_(a,b) = H ⊙ x_a + (1 − H) ⊙ x_b,  and  y_(a,b) = λ y_a + (1 − λ) y_b,

where ⊙ denotes the element-wise product, and (x_a, y_a) and (x_b, y_b) are sampled from a labeled dataset L = {(x_a, y_a)}_{a=1}^{n_L}. Note that superscripts denote the class index and subscripts indicate the type of data, e.g., x_(a,b) represents a mixed sample generated from x_a and x_b, and y^i indicates the label value at the i-th position. Since the mixed labels are obtained by λ-based interpolation, the standard CE loss weighted by λ, L_CE = −y_(a,b)^T log σ(z_(a,b)), is typically used as the objective in mixup training:

L_MCE = − Σ_{i=1}^{C} [ λ I(y_a^i = 1) log p^i_(a,b) + (1 − λ) I(y_b^i = 1) log p^i_(a,b) ],   (1)

where I(·) ∈ {0, 1} is an indicator function that equals one if and only if the input condition holds. Noticeably, the two terms in Equation 1 classify y_a and y_b while keeping linear consistency with the mixing coefficient λ. As a result, DNNs trained with this mixup consistency prefer relatively less confident results with high-entropy behaviour [46] and require longer training time in practice. The main reason is that, in addition to the λ constraint, the competing relationships defined by the Softmax in L_MCE are the main cause of the confidence drop, which is more obvious when dealing with hard mixed samples. Precisely, the competition between the mixed classes a and b in Equation 1 can severely affect the prediction of a single class; that is, interference from other classes prevents the model from focusing its attention. This typically causes the model to be insensitive to the salient features of the target and thus undermines model performance, as shown in Figure 1. Although dynamic mixup alleviates this problem, the extra time overhead is unavoidable if one only focuses on mixing policies at the data level. Therefore, the key challenge is to design an ideal objective function for mixup training that maintains the smoothness of mixup while simultaneously exploring discriminative features without any extra computation cost.

3.2 Decoupled Regularizer

To achieve the above goal, we first dive into L_MCE and propose the efficient decoupled mixup.

Proposition 1. Assuming x_(a,b) is generated from two different classes, minimizing L_MCE is equivalent to regressing the corresponding λ in the gradient:

(∇_{z_(a,b)} L_MCE)_i =
  −λ + exp(z^i_(a,b)) / Σ_c exp(z^c_(a,b)),        if i = a,
  −(1 − λ) + exp(z^i_(a,b)) / Σ_c exp(z^c_(a,b)),  if i = b,
  exp(z^i_(a,b)) / Σ_c exp(z^c_(a,b)),              if i ≠ a, b.

Softmax Degrades Confidence. As we can see from Proposition 1, the predicted probability of x_(a,b) is forced to be consistent with λ, and the probability is computed from the Softmax directly. The Softmax forces the sum of predictions to one (winner takes all), which is undesirable in mixup classification, especially when there are multiple and non-salient targets in mixed samples, e.g., the hard mixed samples shown in Figure 2.
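To make the preliminary concrete, below is a minimal PyTorch sketch of interpolation-based mixing and the MCE objective of Equation 1, written for one-hot labels; the function names are illustrative, and this is a simplified sketch rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mixup_data(x_a, x_b, y_a, y_b, lam):
    """Interpolation-based mixup: the mask H reduces to the constant lam."""
    x_mix = lam * x_a + (1.0 - lam) * x_b      # x_(a,b) = H*x_a + (1-H)*x_b with H = lam
    y_mix = lam * y_a + (1.0 - lam) * y_b      # y_(a,b) = lam*y_a + (1-lam)*y_b (one-hot y_a, y_b)
    return x_mix, y_mix

def mixup_cross_entropy(logits, y_mix):
    """Standard MCE loss: -y_(a,b)^T log softmax(z_(a,b)), averaged over the batch."""
    log_prob = F.log_softmax(logits, dim=1)
    return -(y_mix * log_prob).sum(dim=1).mean()
```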
The standard Softmax in L_MCE deliberately suppresses confidence and produces high-entropy predictions by coupling all classes. As a consequence, L_MCE makes many static mixup methods require longer training than vanilla training to achieve the desired results [57, 73]. Based on the previous analysis, a novel mixup objective, decoupled mixup (DM), is proposed to remove this coupling and thus utilize hard mixed samples adaptively, finally improving the performance of mixup methods. Specifically, for a mixed data point x_(a,b) generated from a random pair in the labeled dataset L, an encoded mixed representation z_(a,b) = f_θ(x_(a,b)) is produced by a feature extractor f_θ. The mixed categorical probability of the i-th class is obtained as

σ(z_(a,b))^i = exp(z^i_(a,b)) / Σ_c exp(z^c_(a,b)),   (3)

where σ(·) is the standard Softmax.

Decoupled Softmax. Equation 3 shows how the mixed probabilities are computed for a mixed sample. The competition between a and b is the main reason for the low confidence of the model, i.e., the total semantic information of a hard mixed sample can exceed the budget of 1 imposed by the Softmax. Therefore, we propose to simply remove the competitor class in Equation 3 to obtain the decoupled Softmax, in which the score of the i-th class is not affected by the j-th class:

ϕ(z_(a,b))^{i,j} = exp(z^i_(a,b)) / Σ_{c ≠ j} exp(z^c_(a,b)),   (4)

where ϕ(·) is the proposed decoupled Softmax. By removing the competitor in Equation 4, compared with Equation 1, the decoupled Softmax makes all terms associated with λ become −1 in the gradient; the derivation is given in Appendix A.1. Our Proposition 2 verifies that the expected results are achieved with the decoupled Softmax.

Proposition 2. With the decoupled Softmax defined above, the decoupled mixup cross-entropy L_DM can boost the prediction confidence of the interested classes mutually and escape from the λ-constraint:

L_DM = − Σ_{i,j=1}^{C} y_a^i y_b^j [ log( p^i_(a,b) / (1 − p^j_(a,b)) ) + log( p^j_(a,b) / (1 − p^i_(a,b)) ) ].

Figure 3: Illustration of the results of applying decoupled mixup. Left: taking Mixup as an example, our proposed decoupled mixup cross-entropy, DM(CE), significantly improves training efficiency (top-1 accuracy of mixed data vs. training epochs) by exploring hard mixed samples. Middle: accuracy vs. training cost (hours per 100 epochs) on ImageNet-1k for Mixup and CutMix with and without DM(CE). Right: top-2 accuracy, computed as the rate at which the top-2 predicted classes equal {y_a, y_b}.

The Decoupled Mixup. The proofs of Propositions 1 and 2 are given in the Appendix. In practice, the original smoothness of L_MCE should not be lost, and thus the proposed DM term serves as a regularizer for discriminability. The final form of decoupled mixup is formulated as

L_DM(CE) = −y_(a,b)^T log σ(z_(a,b)) + η ( −y_[a,b]^T log ϕ(z_(a,b)) y_[a,b] ),   (5)

where the first term is L_MCE, the second term is the decoupled regularizer L_DM, y_(a,b) denotes the mixed label, y_[a,b] is the two-hot label encoding, and η is a trade-off factor. Notice that η is robust and can be set according to the characteristics of the mixup method (see Sec. 5.4).
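Below is a minimal PyTorch sketch of the DM(CE) objective in Equation 5 for a batch with hard (index) labels, where the decoupled term drops the competitor class from the Softmax denominator as in Equation 4. The helper names and the default η = 0.1 (the value preferred by static mixups in Sec. 5.4) are illustrative assumptions; this is a sketch, not the OpenMixup implementation.

```python
import torch
import torch.nn.functional as F

def decoupled_mixup_loss(logits, label_a, label_b, lam, eta=0.1):
    """DM(CE) = MCE + eta * decoupled regularizer (Eq. 5) for one label pair per sample."""
    n, _ = logits.shape
    idx = torch.arange(n, device=logits.device)
    log_prob = F.log_softmax(logits, dim=1)

    # Mixed cross-entropy term: -[lam * log p_a + (1 - lam) * log p_b]
    mce = -(lam * log_prob[idx, label_a]
            + (1.0 - lam) * log_prob[idx, label_b]).mean()

    def log_decoupled_softmax(keep, drop):
        # log phi(z)^{keep,drop} = z_keep - log sum_{c != drop} exp(z_c)
        masked = logits.clone()
        masked[idx, drop] = float('-inf')   # remove the competitor class from the denominator
        return logits[idx, keep] - torch.logsumexp(masked, dim=1)

    # Decoupled regularizer: boost both mixed classes, independently of lam
    dm = -(log_decoupled_softmax(label_a, label_b)
           + log_decoupled_softmax(label_b, label_a)).mean()
    return mce + eta * dm
```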
Practical consequences of this simple modification to mixup training and its performance are as follows.

Make What Should be Certain More Certain. As expected, mixup training with the decoupling mechanism becomes more accurate and confident when handling hard mixed samples, which we construct artificially using PuzzleMix. Figure 3 (right) shows that the model trained with decoupled mixup roughly doubles the top-2 accuracy on these mixed samples, which also verifies that the information contained in mixed samples goes beyond the budget of 1 defined by the standard Softmax. More interestingly, this advantage of decoupled mixup, i.e., higher confidence and accuracy, can be further amplified in semi-supervised learning due to the uncertainty of pseudo-labeling.

Enhance the Training Efficiency. It is straightforward to notice that there is no extra computation cost when using DM in vanilla mixup training, while the performance we can achieve is the same as or even better than optimizable mixup policies, e.g., PuzzleMix, Co-Mixup, etc. Figure 3 (left and middle) shows that decoupled mixup unveils the power of static mixup, making it more accurate and faster.

4 Extensions of Decoupled Mixup

Given the highly accurate nature of decoupled mixup for mining hard mixed samples, semi-supervised learning is a suitable scenario for propagating accurate labels from the labeled space to the unlabeled space by using asymmetrical mixup. In addition, we can also generalize the decoupled mechanism to the binary cross-entropy loss for boosting multi-label classification.

4.1 Asymmetrical Strategy for Semi-supervised Learning

Based on labeled data L = {(x_a, y_a)}_{a=1}^{n_L}, if we further consider unlabeled data U = {(u_a, v_a)}_{a=1}^{n_U}, decoupled mixup can be a strong connection between L and U. Recall the confirmation bias [67] problem of SSL: the performance of the student model is restricted by the teacher model when learning from inaccurate pseudo-labels. To fully use L and strengthen the teacher model to provide more robust and accurate predictions, the unlabeled data can be mixed with the labeled data with a large mixing weight to form hard mixed samples. With these hard mixed samples, we can employ decoupled mixup in semi-supervised learning effectively. Since only the labels of L are accurate, we make a small asymmetric modification to the decoupled mixup, called the Asymmetrical Strategy (AS). Formally, given the labeled and unlabeled datasets L and U, AS builds a reliable connection by generating hard mixed samples between L and U in an asymmetric manner (λ < 0.5):

x̂_(a,b) = λ x_a + (1 − λ) u_b;  ŷ_(a,b) = λ y_a + (1 − λ) v_b.

Due to the uncertainty of the pseudo-label, only the labeled part is retained in L_DM:

L̂_DM = −y_a^T log ϕ(z_(a,b)) y_b,

where y_a and y_b are one-hot labels. AS can be regarded as a special case of DM that only decouples the labeled data. Simply replacing L_DM with L̂_DM leverages the hard samples and alleviates confirmation bias in semi-supervised learning.
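A minimal PyTorch sketch of AS is given below. It assumes hard pseudo-labels for the unlabeled batch, a Beta-sampled λ forced below 0.5, and that the pseudo-labeled class serves as the dropped competitor in the decoupled term; how pseudo-labels are produced and filtered (e.g., FixMatch-style confidence thresholding) is omitted, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def asymmetric_dm_mix(x_l, y_l, x_u, pseudo_u, model, alpha=1.0, eta=0.1):
    """Asymmetric strategy (AS) sketch: mix labeled data with unlabeled data using lam < 0.5,
    so the labeled class becomes the hard (minority) component of the mixed sample.
    Assumes x_l and x_u have the same batch size and pseudo_u are hard pseudo-labels."""
    n = x_l.size(0)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = min(lam, 1.0 - lam)                      # enforce lam < 0.5 (asymmetric mixing)

    x_mix = lam * x_l + (1.0 - lam) * x_u          # \hat{x}_(a,b) = lam*x_a + (1-lam)*u_b
    logits = model(x_mix)
    log_prob = F.log_softmax(logits, dim=1)
    idx = torch.arange(n, device=logits.device)

    # MCE on the mixed (soft) target: lam*y_a + (1-lam)*v_b
    mce = -(lam * log_prob[idx, y_l] + (1.0 - lam) * log_prob[idx, pseudo_u]).mean()

    # \hat{L}_DM: decoupled term for the labeled class only; the pseudo-labeled class is
    # dropped from the Softmax denominator but gets no decoupled term of its own.
    masked = logits.clone()
    masked[idx, pseudo_u] = float('-inf')
    dm = -(logits[idx, y_l] - torch.logsumexp(masked, dim=1)).mean()
    return mce + eta * dm
```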
4.2 Decoupled Binary Cross-entropy Loss

Binary Cross-entropy Form of DM. Different from Softmax-based classification, we can also build decoupled mixup for multi-label classification tasks (1-vs-all) by using the mixup binary cross-entropy (MBCE) loss [63], where σ(·) denotes the Sigmoid rather than the Softmax. Proposition 2 demonstrates that the decoupled CE can mutually enhance the confidence of predictions for the interested classes and be free from the λ limitation. For MBCE, since it is not inherently bound by mutual interference between classes through a Softmax, we instead have to preserve partial consistency while encouraging more confident predictions, and thus propose a decoupled mixup binary cross-entropy loss, DM(BCE). To this end, a rescaling function r : (λ, t, ξ) → λ' is designed to achieve this goal. The mixed label is rescaled by r(·): y_mix = λ_a y_a + λ_b y_b, where λ_a = r(λ, t, ξ) and λ_b = r(1 − λ, t, ξ) are the rescaled coefficients. The rescaling function is defined as

r(λ, t, ξ) = min( (λ / ξ)^t, 1 ),  with t ≥ 0 and 0 ≤ ξ ≤ 1,   (6)

where ξ is the truncation threshold and t is an exponent that controls the convexity.

Figure 4: Rescaled labels for different λ values, showing curves for ξ = 0.8 with t ∈ {3.0, 2.0, 1.0, 0.5, 0.3} and for ξ = 1.0 with t = 1.0.

As shown in Figure 4, Equation 6 covers three situations: (a) when ξ = 0 and t = 0, the rescaled label is always equal to 1, i.e., two-hot encoding; (b) when ξ = 1 and t = 1, r(·) is a linear function (vanilla mixup); (c) the remaining curves demonstrate that t changes the concavity while ξ is responsible for truncation.

Empirical Results. For interpolation-based mixup methods (e.g., Mixup, ManifoldMix, etc.) that keep linearity between the mixed label and sample, the decoupled mechanism can be introduced by adjusting only the exponent t. For cutting-based mixing policies (e.g., CutMix, etc.), where the mixed samples and labels have a square relationship (generally a convex function), we can approximate the convexity by adjusting ξ, as detailed in Sec. 5.4 and Appendix C.5.
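As a concrete illustration, the sketch below implements the rescaling of Equation 6 and the resulting DM(BCE) targets. It assumes the clamped form given above with ξ > 0; the function names are illustrative, and the targets would be fed to a standard binary cross-entropy (Sigmoid) loss.

```python
import torch

def rescale_lambda(lam, t=1.0, xi=1.0):
    """Label rescaling r(lam, t, xi) for DM(BCE), assuming the clamped form min((lam/xi)^t, 1).
    t controls the convexity of the curve; xi truncates it to 1 once lam >= xi."""
    lam = torch.as_tensor(lam, dtype=torch.float32)
    return torch.clamp((lam / xi) ** t, max=1.0)

def dm_bce_targets(y_a, y_b, lam, t=0.5, xi=1.0):
    """Rescaled two-sided mixed target: y_mix = r(lam)*y_a + r(1-lam)*y_b, to be used with
    F.binary_cross_entropy_with_logits(logits, y_mix)."""
    return rescale_lambda(lam, t, xi) * y_a + rescale_lambda(1.0 - lam, t, xi) * y_b
```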
5 Experiments

We adopt two types of top-1 classification accuracy (Acc) metrics (the mean of three trials): (i) the median top-1 Acc of the last 10 epochs [52, 38] for supervised image classification tasks with mixup variants, and (ii) the best top-1 Acc over all checkpoints for SSL tasks. Popular ConvNets and Transformer-based architectures are used as backbone networks: ResNet variants including ResNet [19] (R), Wide-ResNet (WRN) [75], and ResNeXt-32x4d (RX) [66], and Vision Transformers including DeiT [55] and Swin Transformer (Swin) [37].

5.1 Image Classification Benchmarks

This subsection evaluates the performance gains of DM on six image classification benchmarks, including CIFAR-100 [25], Tiny-ImageNet (Tiny) [10], ImageNet-1k [48], CUB-200-2011 (CUB) [59], and FGVC-Aircraft (Aircraft) [42]. The compared mixup methods fall into two types according to their mixing policies: static methods including Mixup [77], CutMix [74], ManifoldMix [57], SaliencyMix [56], FMix [17], and ResizeMix [47], and dynamic methods including PuzzleMix [23], AutoMix [38], and SAMix [31]. For a fair comparison, we use the optimal α in {0.1, 0.2, 0.5, 0.8, 1.0, 2.0} for all mixup algorithms and follow the original hyper-parameters in their papers. We adopt the open-source codebase OpenMixup [32] for most mixup methods. The detailed training recipes and hyper-parameters are provided in Appendix B. We also evaluate adversarial robustness on CIFAR-100 in Appendix C.2.

Small-scale Classification Benchmarks. For small-scale classification benchmarks on CIFAR-100 and Tiny, we adopt the CIFAR version of ResNet variants and train with the SGD optimizer following the common training settings [23, 38]. Tables 1 and A2 show the small-scale classification results. The proposed DM(CE) significantly improves MCE for various mixup algorithms. On CIFAR-100 with three different CNNs (R-18, RX-50, and WRN-28-8), the decoupled mixup brings average performance gains of 0.78%, 0.77%, and 0.34%, respectively. More notably, the gains on the Tiny dataset are significant, with average performance gains of 1.18% and 1.62% on R-18 and RX-50.

Table 1: Top-1 Acc (%) of small-scale image classification on CIFAR-100 and Tiny-ImageNet based on ResNet variants. Each cell reports MCE / DM(CE).

| Methods | CIFAR-100 R-18 | CIFAR-100 RX-50 | CIFAR-100 WRN-28-8 | Tiny R-18 | Tiny RX-50 |
|---|---|---|---|---|---|
| Mixup | 79.12 / 80.44 | 82.10 / 82.96 | 82.82 / 83.51 | 63.86 / 65.07 | 66.36 / 67.70 |
| CutMix | 78.17 / 79.39 | 81.67 / 82.39 | 84.45 / 84.88 | 65.53 / 66.45 | 66.47 / 67.46 |
| ManifoldMix | 80.35 / 81.05 | 82.88 / 83.15 | 83.24 / 83.72 | 64.15 / 65.45 | 67.30 / 68.48 |
| FMix | 79.69 / 80.12 | 81.90 / 82.74 | 84.21 / 84.47 | 63.47 / 65.34 | 65.08 / 66.96 |
| ResizeMix | 80.01 / 80.26 | 81.82 / 82.96 | 84.87 / 84.72 | 63.74 / 64.33 | 65.87 / 68.56 |
| Avg. Gain | +0.78 | +0.77 | +0.34 | +1.18 | +1.62 |

ImageNet and Fine-grained Classification Benchmarks. For experiments on ImageNet-1k, we follow three popular training procedures, the PyTorch-style setting [19], the DeiT [55] setting, and the RSB A3 [63] setting, to demonstrate the generalizability of decoupled mixup. As shown in Tables 2, 3, and 4, DM(CE) improves consistently over MCE for all mixup algorithms on the three training settings we considered. The relative improvements are given in the last row of each table. For example, DM(CE) yields around +0.4% for mixup methods based on ResNet variants using the PyTorch-style and RSB A3 settings, and around +0.5% and +0.2% for all methods based on DeiT-S and Swin-T using the DeiT setting. Notice that MBCE (two) denotes using two-hot encoding for the corresponding mixing classes, which yields worse performance than MBCE (one), while DM(BCE) adjusts the labels of the mixing classes by Equation 6; this verifies the necessity of DM(BCE) when using MBCE. As for fine-grained benchmarks, we follow the training settings in AutoMix and initialize models with the official PyTorch pre-trained models (as supervised transfer learning). Tables 5 and A6 show that DM(CE) noticeably boosts the original MCE for eight popular mixup variants, especially bringing 0.53%–3.14% gains on Aircraft based on ResNet-18.

Table 2: Top-1 Acc (%) of image classification on ImageNet-1k with ResNet variants using the PyTorch-style 100-epoch training recipe. Each cell reports MCE / DM(CE).

| Methods | R-18 | R-34 | R-50 |
|---|---|---|---|
| Vanilla | 70.04 / - | 73.85 / - | 76.83 / - |
| Mixup | 69.98 / 70.20 | 73.97 / 74.26 | 77.12 / 77.41 |
| CutMix | 68.95 / 69.26 | 73.58 / 73.88 | 77.07 / 77.32 |
| ManifoldMix | 69.98 / 70.33 | 73.98 / 74.25 | 77.01 / 77.30 |
| FMix | 69.96 / 70.26 | 74.08 / 74.34 | 77.19 / 77.38 |
| ResizeMix | 69.50 / 69.90 | 73.88 / 74.00 | 77.42 / 77.65 |
| Avg. Gain | +0.32 | +0.24 | +0.25 |

Table 3: Top-1 Acc (%) of image classification on ImageNet-1k based on ResNet-50 using the RSB A3 100-epoch training recipe.

| Methods | MCE | DM(CE) | MBCE (one) | MBCE (two) | DM(BCE) (one) |
|---|---|---|---|---|---|
| RSB | 76.49 | 77.72 | 78.08 | 76.95 | 78.43 |
| Mixup | 76.01 | 76.69 | 77.66 | 77.42 | 78.28 |
| CutMix | 76.47 | 77.22 | 77.62 | 67.54 | 78.21 |
| ManifoldMix | 76.14 | 76.93 | 77.78 | 67.78 | 78.20 |
| FMix | 76.09 | 76.87 | 77.76 | 73.44 | 78.11 |
| ResizeMix | 76.90 | 77.21 | 77.85 | 77.30 | 78.32 |
| Avg. Gain | | +0.76 | | -4.38 | +0.47 |

Table 4: Top-1 Acc (%) of classification on ImageNet-1k with ViTs. Each cell reports MCE / DM(CE).

| Methods | DeiT-S | Swin-T |
|---|---|---|
| DeiT | 79.80 / 80.37 | 81.28 / 81.49 |
| Mixup | 79.65 / 80.04 | 80.71 / 80.97 |
| CutMix | 79.78 / 80.20 | 80.83 / 81.05 |
| FMix | 79.41 / 79.89 | 80.37 / 80.54 |
| ResizeMix | 79.93 / 80.03 | 80.94 / 81.01 |
| Avg. Gain | +0.39 | +0.19 |

Table 5: Top-1 Acc (%) of fine-grained image classification on CUB-200 and FGVC-Aircraft with ResNet variants. Each cell reports MCE / DM(CE).

| Methods | CUB-200 R-18 | CUB-200 RX-50 | Aircraft R-18 | Aircraft RX-50 |
|---|---|---|---|---|
| Mixup | 78.39 / 79.90 | 84.58 / 85.04 | 79.52 / 82.66 | 85.18 / 86.68 |
| CutMix | 78.40 / 78.76 | 85.68 / 85.97 | 78.84 / 81.64 | 84.55 / 85.75 |
| ManifoldMix | 79.76 / 79.92 | 86.38 / 86.42 | 80.68 / 82.57 | 86.60 / 86.92 |
| FMix | 77.28 / 80.10 | 84.06 / 84.85 | 79.36 / 80.44 | 84.85 / 85.04 |
| ResizeMix | 78.50 / 79.58 | 84.77 / 84.92 | 78.10 / 79.54 | 84.08 / 84.51 |
| Avg. Gain | +1.19 | +0.35 | +2.07 | +0.73 |
5.2 Semi-supervised Transfer Learning Benchmarks

Following the transfer learning (TL) benchmarks [71], we perform TL experiments on CUB, Aircraft, and Stanford-Cars [24] (Cars). Besides the vanilla Fine-Tuning baseline, we compare against current state-of-the-art TL methods, including BSS [68], Co-Tuning [71], and Self-Tuning [67]. For a fair comparison, we use the same hyper-parameters and augmentations as Self-Tuning, detailed in Appendix B.2. In Table 6, we adopt DM(CE) and AS for Fine-Tuning, Co-Tuning, and Self-Tuning using Mixup. DM(CE) and AS steadily improve Mixup and the baselines by large margins, e.g., +4.62%–9.19% for 15% labels, +2.02%–5.67% for 30% labels, and +2.09%–3.15% for 50% labels on Cars. This outstanding improvement implies that generating mixed samples efficiently is essential for data-limited scenarios. A similar trend appears in the SSL setting presented next.

Table 6: Top-1 Acc (%) of semi-supervised transfer learning on various TL benchmarks (CUB-200, FGVC-Aircraft, and Stanford-Cars) using only 15%, 30%, and 50% labels based on ResNet-50.

| Methods | CUB 15% | CUB 30% | CUB 50% | Aircraft 15% | Aircraft 30% | Aircraft 50% | Cars 15% | Cars 30% | Cars 50% |
|---|---|---|---|---|---|---|---|---|---|
| Fine-Tuning | 45.25±0.12 | 59.68±0.21 | 70.12±0.29 | 39.57±0.20 | 57.46±0.12 | 67.93±0.28 | 36.77±0.12 | 60.63±0.18 | 75.10±0.21 |
| +DM | 50.04±0.17 | 61.39±0.24 | 71.87±0.23 | 43.15±0.22 | 61.02±0.15 | 70.38±0.18 | 41.30±0.16 | 62.65±0.21 | 77.19±0.19 |
| BSS | 47.74±0.23 | 63.38±0.29 | 72.56±0.17 | 40.41±0.12 | 59.23±0.31 | 69.19±0.13 | 40.57±0.12 | 64.13±0.18 | 76.78±0.21 |
| Co-Tuning | 52.58±0.53 | 66.47±0.17 | 74.64±0.36 | 44.09±0.67 | 61.65±0.32 | 72.73±0.08 | 46.02±0.18 | 69.09±0.10 | 80.66±0.25 |
| +DM | 54.96±0.65 | 68.25±0.21 | 75.72±0.37 | 49.27±0.83 | 65.60±0.41 | 74.89±0.17 | 51.78±0.34 | 74.15±0.29 | 83.02±0.26 |
| Self-Tuning | 64.17±0.47 | 75.13±0.35 | 80.22±0.36 | 64.11±0.32 | 76.03±0.25 | 81.22±0.29 | 72.50±0.45 | 83.58±0.28 | 88.11±0.29 |
| +Mixup | 62.38±0.32 | 74.65±0.24 | 81.46±0.27 | 59.38±0.31 | 74.65±0.26 | 81.46±0.27 | 70.31±0.27 | 83.63±0.23 | 88.66±0.21 |
| +DM | 73.06±0.38 | 79.50±0.35 | 82.64±0.24 | 67.57±0.27 | 80.71±0.25 | 84.82±0.26 | 81.69±0.23 | 89.22±0.21 | 91.26±0.19 |
| Avg. Gain | +5.95 | +2.77 | +1.34 | +5.65 | +4.52 | +2.65 | +7.22 | +4.22 | +2.35 |

5.3 Semi-supervised Learning Benchmarks

Following [52, 76], we adopt the most commonly used CIFAR-10/100 datasets among the popular SSL benchmarks based on WRN-28-2 and WRN-28-8. We mainly evaluate the proposed DM on the popular SSL methods MixMatch [2] and FixMatch [52], and compare with Pseudo-Labeling [27], ReMixMatch [1], UDA [65], and FlexMatch [76].

Table 7: Top-1 Acc (%) of semi-supervised learning on CIFAR-10 (using 250 and 4000 labels) and CIFAR-100 (using 400, 2500, and 10000 labels). Notice that DM denotes using DM(CE) and AS, Con denotes various unsupervised consistency losses, Rot denotes the rotation loss in ReMixMatch, and CPL denotes the curriculum pseudo labeling in FlexMatch.

| Methods | Losses | CIFAR-10 250 | CIFAR-10 4000 | CIFAR-100 400 | CIFAR-100 2500 | CIFAR-100 10000 |
|---|---|---|---|---|---|---|
| Pseudo-Labeling | CE | 53.51±2.20 | 84.92±0.19 | 12.55±0.85 | 42.26±0.28 | 63.45±0.24 |
| MixMatch | CE+Con | 86.37±0.59 | 93.34±0.26 | 32.41±0.66 | 60.24±0.48 | 72.22±0.29 |
| ReMixMatch | CE+Con+Rot | 93.70±0.05 | 95.16±0.01 | 57.15±1.05 | 73.87±0.35 | 79.08±0.27 |
| MixMatch+DM | CE+Con+DM | 89.16±0.71 | 95.15±0.68 | 35.72±0.53 | 62.51±0.37 | 74.70±0.28 |
| UDA | CE+Con | 94.84±0.06 | 95.71±0.07 | 53.61±1.59 | 72.27±0.21 | 77.51±0.23 |
| FixMatch | CE+Con | 95.14±0.05 | 95.79±0.08 | 53.58±0.82 | 71.97±0.16 | 77.80±0.12 |
| FlexMatch | CE+Con+CPL | 95.02±0.09 | 95.81±0.01 | 60.06±1.62 | 73.51±0.20 | 78.10±0.15 |
| FixMatch+Mixup | CE+Con+MCE | 95.05±0.23 | 95.83±0.19 | 50.61±0.73 | 72.16±0.18 | 78.75±0.14 |
| FixMatch+DM | CE+Con+DM | 95.23±0.09 | 95.87±0.11 | 59.75±0.95 | 74.12±0.23 | 79.58±0.17 |
| Average Gain | | +1.44 | +0.95 | +4.74 | +2.30 | +2.13 |
For a fair comparison, we use the same hyper-parameters and training settings as the original papers and conduct experiments with the open-source codebase TorchSSL [76], detailed in Appendix B.2. Table 7 shows that adding DM(CE) and AS significantly improves MixMatch and FixMatch: DM(CE) brings 1.81%–2.89% gains on CIFAR-10 and 1.27%–3.31% gains on CIFAR-100 over MixMatch, while bringing 1.78%–4.17% gains on CIFAR-100 over FixMatch. Meanwhile, we find that directly applying mixup augmentations to FixMatch brings limited improvements, while FixMatch+DM achieves the best performance in most cases on the CIFAR-10/100 datasets. Appendix C.3 provides further studies with limited labeled data. Therefore, mixup augmentations with DM can achieve data-efficient training in SSL.

5.4 Ablation Study and Analysis

Hyper-parameters and Proposed Components. Since the effectiveness of DM has been demonstrated in the above experiments, and Figures 1 and 5 verify that DM can well explore hard mixed samples, we now verify whether DM is robust to hyper-parameters (full hyper-parameters in Appendix C.5) and study the effectiveness of AS in SSL.

Figure 5: Top-1 Acc (%) of mixed samples on the ImageNet-1k validation set across mixing ratios, for Mixup and CutMix with and without DM.

Table 8: Ablation of the proposed asymmetric strategy (AS) and DM(CE) upon Self-Tuning for semi-supervised transfer learning on CUB-200 based on R-18.

| Methods | 15% | 30% | 50% | 100% |
|---|---|---|---|---|
| Self-Tuning | 57.82 | 69.12 | 73.59 | 75.08 |
| +MCE | 63.36 | 72.81 | 75.73 | 76.67 |
| +MCE+AS(λ ≥ 0.5) | 59.04 | 69.67 | 74.89 | 75.96 |
| +MCE+AS(λ < 0.5) | 62.97 | 72.46 | 75.40 | 76.34 |
| +DM(CE)+AS(λ < 0.5) | 66.17 | 74.25 | 77.68 | 78.52 |

(1) The only hyper-parameter η in DM(CE) and DM(BCE) can be set according to the type of mixup method. We grid-search η in {0.01, 0.1, 0.5, 1, 2} on ImageNet-1k. As shown in Figure A2 (left), the static methods (Mixup and CutMix) and the dynamic methods (PuzzleMix and AutoMix) prefer η = 0.1 and η = 1, respectively, which might be because the dynamic variants generate more discriminative and reliable mixed samples than the static methods. (2) The hyper-parameters ξ and t in DM(BCE) can also be determined by the characteristics of the mixing policy. We grid-search ξ ∈ {1, 0.9, 0.8, 0.7} and t ∈ {2, 1, 0.5, 0.3}. Figure A2 (middle and right) shows that cutting-based methods (CutMix and AutoMix) prefer ξ = 0.8 and t = 1, while interpolation-based policies (Mixup and ManifoldMix) use ξ = 1.0 and t = 0.5. (3) Table 8 shows the superiority of AS(λ < 0.5) over AS(λ ≥ 0.5) when combined with MCE, while using DM(CE) with AS(λ < 0.5) further improves over MCE. (4) Experiments with different sizes of training data are performed to verify the data efficiency of DM: decoupled mixup improves accuracy by around 2% without any computational overhead; the detailed results are shown in Appendix C.3.

Figure 6: Robustness against different occlusion ratios of images for mixup augmentations using the MCE and our DM(CE) loss based on ResNet-50 (left) and DeiT-S (right) on ImageNet-1k. RSB and DeiT denote using CutMix+Mixup (static mixup policies) in the RSB A3 [63] and DeiT [55] training settings. DM(CE) improves mixups by exploring hard mixed samples.

Occlusion Robustness. We also analyze robustness against random occlusion [43] for models trained on ImageNet-1k using the official implementation². Concretely, a classifier is considered robust if it predicts the correct label given an occluded version of the image; in other words, the network has learned the essential features (e.g., semantic regions) that discriminate each class. For occlusion, we consider patch-based random masking: we split an image of 224×224 resolution into 16×16 patches and randomly mask M patches out of the total number of N patches, where the occlusion ratio is defined as M/N. As shown in Figure 6, the proposed DM helps various mixup methods achieve better occlusion robustness, indicating that DM forces the model to learn discriminative features, e.g., image patches with semantic information that remain decisive when the occlusion ratio is high. A sketch of this patch-masking protocol is given below.

²https://github.com/Muzammal-Naseer/Intriguing-Properties-of-Vision-Transformers
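Below is a minimal PyTorch sketch of the random patch-drop protocol described above (224×224 images split into 16×16 patches); it assumes masked patches are simply zeroed out and is not the official evaluation script of [43].

```python
import torch

def random_patch_occlusion(images, occlusion_ratio, patch=16):
    """Randomly drop M of the N non-overlapping patch x patch regions of each image."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    n = gh * gw                                   # total number of patches N
    m = int(round(occlusion_ratio * n))           # number of occluded patches M
    # Build a (B, N) keep-mask with exactly M dropped patches per image
    scores = torch.rand(b, n, device=images.device)
    drop = scores.argsort(dim=1)[:, :m]           # indices of patches to occlude
    keep = torch.ones(b, n, device=images.device)
    keep.scatter_(1, drop, 0.0)
    keep = keep.view(b, 1, gh, gw)
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return images * keep                          # occluded pixels are set to zero
```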
6 Conclusion, Limitations, and Broader Impacts

Decoupled Mixup and Dynamic Mixups. We investigate and show two limitations of the decoupled mixup. Different from static mixup methods, dynamic mixup spends extra time optimizing mixing masks in the input space to align the mixed samples and labels. Although the optimized mixing policies can help the model find discriminative features [38], their predictions are also under-confident. In Table 9, we combine some advanced dynamic mixup policies, e.g., PuzzleMix [23] and AutoMix [38], with decoupled mixup and find that the improvement is limited. The main reason is that there are not many hard mixed samples in dynamic mixups. Therefore, we additionally incorporate two static mixups in the RSB training setting, i.e., with a half probability either Mixup or CutMix is selected during training (see the sketch after Table 9). As expected, the improvements from the decoupled mixup become obvious upon the static mixup variants. This is a very preliminary attempt that deserves more exploration in future work, and we provide more results of dynamic mixups in Appendix C.

Table 9: Top-1 Acc (%) on CIFAR-100 (WRN-28-8) and Tiny-ImageNet (RX-50). Each cell reports MCE / DM(CE).

| Methods | CIFAR-100 WRN-28-8 | Tiny-ImageNet RX-50 |
|---|---|---|
| PuzzleMix | 85.02 / 85.25 | 67.83 / 68.04 |
| +RSB | 85.24 / 85.61 | 68.17 / 68.86 |
| AutoMix | 85.18 / 85.38 | 70.72 / 71.56 |
| +RSB | 85.35 / 85.54 | 70.98 / 72.37 |
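For reference, the sketch below shows the Mixup/CutMix switching used in the DeiT and RSB A3 recipes and in the "+RSB" rows of Table 9, with one policy chosen at random per iteration. The default α values follow the RSB A3 row of Table A1; the rest is a simplified illustration rather than the timm or OpenMixup implementation.

```python
import numpy as np
import torch

def mixup_or_cutmix(x, y_onehot, alpha_mixup=0.1, alpha_cutmix=1.0, switch_prob=0.5):
    """Randomly switch between Mixup and CutMix (single lam per batch, permuted pairing)."""
    x = x.clone()
    index = torch.randperm(x.size(0), device=x.device)
    if np.random.rand() < switch_prob:             # Mixup branch: linear interpolation
        lam = np.random.beta(alpha_mixup, alpha_mixup)
        x = lam * x + (1.0 - lam) * x[index]
    else:                                          # CutMix branch: paste a random box
        lam = np.random.beta(alpha_cutmix, alpha_cutmix)
        h, w = x.shape[2], x.shape[3]
        rh, rw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
        cy, cx = np.random.randint(h), np.random.randint(w)
        y1, y2 = np.clip(cy - rh // 2, 0, h), np.clip(cy + rh // 2, 0, h)
        x1, x2 = np.clip(cx - rw // 2, 0, w), np.clip(cx + rw // 2, 0, w)
        x[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]
        lam = 1.0 - ((y2 - y1) * (x2 - x1) / (h * w))   # adjust lam to the actual box area
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[index]
    return x, y_mix
```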
Table 10: Top-1 Acc (%) on CIFAR-100 training 200 and 600 epochs based on DeiT-S and ConvNeXt-T. Total training time (hours) and GPU memory (GB) are collected on a single A100 GPU.

| Methods | DeiT-S 200 ep | DeiT-S 600 ep | DeiT-S Mem. | DeiT-S Time | ConvNeXt-T 200 ep | ConvNeXt-T 600 ep | ConvNeXt-T Mem. | ConvNeXt-T Time |
|---|---|---|---|---|---|---|---|---|
| Vanilla | 65.81 | 68.50 | 8.1 | 27 | 78.70 | 80.65 | 4.2 | 10 |
| Mixup | 69.98 | 76.35 | 8.2 | 27 | 81.13 | 83.08 | 4.2 | 10 |
| CutMix | 74.12 | 79.54 | 8.2 | 27 | 82.46 | 83.20 | 4.2 | 10 |
| DeiT | 75.92 | 79.38 | 8.2 | 27 | 83.09 | 84.12 | 4.2 | 10 |
| SmoothMix | 67.54 | 80.25 | 8.2 | 27 | 78.87 | 81.31 | 4.2 | 10 |
| SaliencyMix | 69.78 | 76.60 | 8.2 | 27 | 82.82 | 83.03 | 4.2 | 10 |
| AttentiveMix+ | 75.98 | 80.33 | 8.3 | 35 | 82.59 | 83.04 | 4.3 | 14 |
| FMix | 70.41 | 74.31 | 8.2 | 27 | 81.79 | 82.29 | 4.2 | 10 |
| GridMix | 68.86 | 74.96 | 8.2 | 27 | 79.53 | 79.66 | 4.2 | 10 |
| ResizeMix | 68.45 | 71.95 | 8.2 | 27 | 82.53 | 82.91 | 4.2 | 10 |
| PuzzleMix | 73.60 | 81.01 | 8.3 | 35 | 82.29 | 84.17 | 4.3 | 53 |
| AutoMix | 76.24 | 80.91 | 18.2 | 59 | 83.30 | 84.79 | 10.2 | 56 |
| SAMix | 77.94 | 82.49 | 21.3 | 58 | 83.56 | 84.98 | 10.3 | 57 |
| DeiT+TransMix | 76.17 | 79.33 | 8.4 | 28 | - | - | - | - |
| DeiT+TokenMix | 76.25 | 79.57 | 8.4 | 34 | - | - | - | - |
| DeiT+DM(CE) | 76.20 | 79.92 | 8.2 | 27 | 83.44 | 84.49 | 4.2 | 10 |

Meanwhile, we further conduct comprehensive comparison experiments with modern Transformer-based architectures on CIFAR-100, considering the concurrent works TransMix [5] and TokenMix [36]. As shown in Table 10, where some results are reproduced with the official implementations and the others are based on OpenMixup [32], DM(CE) enables DeiT (CutMix and Mixup) to achieve performances competitive with dynamic mixup variants like AutoMix and SAMix [31] based on ConvNeXt-T without introducing extra computational costs, while still performing worse than them based on DeiT-S. Compared with specially designed label-mixing methods using attention maps, DM(CE) also achieves performances competitive with TransMix and TokenMix. How to further improve the decoupled mixup with salient regions or dynamic attention information to reach similar performances to dynamic mixing variants can be studied in future work.

The Next Mixup. In a word, we introduce Decoupled Mixup (DM), a new objective function that considers both smoothness and the mining of discriminative features in mixup augmentations. The proposed DM helps static mixup methods (e.g., Mixup and CutMix) achieve comparable or better performance than the computationally expensive dynamic mixup policies. Most importantly, DM raises a question worth researching: is it necessary to design very complex mixup policies? We also find that decoupled mixup could be the bridge to combining static and dynamic mixup. However, the introduction of additional hyper-parameters may take users some extra time to tune on modalities other than images or on other mixup methods. This also leads to the core question for the next step of this work: how can we design a more elegant and adaptive mixup training objective that connects different types of mixups to achieve high data efficiency? We believe these explorations and questions can inspire future research in the community of mixup augmentations.

Acknowledgement

This work was supported by the National Key R&D Program of China (No. 2022ZD0115100), the National Natural Science Foundation of China Project (No. U21A20427), and Project (No. WU2022A009) from the Center of Synthetic Biology and Integrated Bioengineering of Westlake University. We thank the AI Station of Westlake University for the support of GPUs and thank all reviewers for polishing the manuscript.

References

[1] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. ar Xiv preprint ar Xiv:1911.09785, 2019. 3, 8 [2] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning.
ar Xiv preprint ar Xiv:1905.02249, 2019. 3, 8 [3] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006. 1 [4] Hao Chen, Ran Tao, Yue Fan, Yidong Wang, Jindong Wang, Bernt Schiele, Xing Xie, Bhiksha Raj, and Marios Savvides. Softmatch: Addressing the quantity-quality tradeoff in semisupervised learning. In The Eleventh International Conference on Learning Representations, 2022. 3 [5] Jie-Neng Chen, Shuyang Sun, Ju He, Philip Torr, Alan Yuille, and Song Bai. Transmix: Attend to mix for vision transformers. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 3, 10 [6] Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, and Dit-Yan Yeung. Mixed autoencoder for self-supervised visual representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 3 [7] Mengzhao Chen, Mingbao Lin, Zhihang Lin, Yu xin Zhang, Fei Chao, and Rongrong Ji. Smmix: Self-motivated image mixing for vision transformers. Proceedings of the International Conference on Computer Vision (ICCV), 2023. 3 [8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), 2020. 3 [9] Hyeong Kyu Choi, Joonmyung Choi, and Hyunwoo J. Kim. Tokenmixup: Efficient attentionguided token-level data augmentation for transformers. Advances in Neural Information Processing Systems (Neur IPS), 2022. 3 [10] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. ar Xiv preprint ar Xiv:1707.08819, 2017. 6, 19 [11] Ali Dabouei, Sobhan Soleymani, Fariborz Taherkhani, and Nasser M Nasrabadi. Supermix: Supervising the mixing data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13794 13803, 2021. 3 [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. 1 [13] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the International Conference on Machine Learning (ICML), 2014. 3 [14] Mojtaba Faramarzi, Mohammad Amini, Akilesh Badrinaaraayanan, Vikas Verma, and Sarath Chandar. Patchup: A regularization technique for convolutional neural networks. ar Xiv preprint ar Xiv:2006.07794, 2020. 3 [15] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015. 22 [16] Hongyu Guo, Yongyi Mao, and Richong Zhang. Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 3714 3722, 2019. 2 [17] Ethan Harris, Antonia Marcu, Matthew Painter, Mahesan Niranjan, and Adam Pr ugel Bennett Jonathon Hare. Fmix: Enhancing mixed sample data augmentation. ar Xiv preprint ar Xiv:2002.12047, 2(3):4, 2020. 6, 20 [18] Kaiming He, Georgia Gkioxari, Piotr Doll ar, and Ross Girshick. Mask r-cnn. In Proceedings of the International Conference on Computer Vision (ICCV), 2017. 1 [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 
In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 770 778, 2016. 1, 6, 7, 21 [20] Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. All tokens matter: Token labeling for training better vision transformers. In Advances in Neural Information Processing Systems (Neur IPS), 2021. 3 [21] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. In Advances in Neural Information Processing Systems (Neur IPS), 2020. 3 [22] Jang-Hyun Kim, Wonho Choo, Hosan Jeong, and Hyun Oh Song. Co-mixup: Saliency guided joint mixup with supermodular diversity. In International Conference on Learning Representations (ICLR), 2021. 2, 3 [23] Jang-Hyun Kim, Wonho Choo, and Hyun Oh Song. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning (ICML), pages 5275 5285. PMLR, 2020. 2, 3, 6, 7, 10, 19, 20, 22 [24] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3d RR-13), 2013. 7, 19 [25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 6, 19 [26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (Neur IPS), pages 1097 1105, 2012. 19 [27] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, page 896, 2013. 3, 8 [28] Kibok Lee, Yian Zhu, Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin, and Honglak Lee. I-mix: A domain-agnostic strategy for contrastive representation learning. In International Conference on Learning Representations (ICLR), 2021. 3 [29] Junnan Li, Caiming Xiong, and Steven Hoi. Comatch: Semi-supervised learning with contrastive graph regularization. In Proceedings of the International Conference on Computer Vision (ICCV), 2021. 3 [30] Siyuan Li, Weiyang Jin, Zedong Wang, Fang Wu, Zicheng Liu, Cheng Tan, and Stan Z. Li. Semireward: A general reward model for semi-supervised learning. Ar Xiv, abs/2111.15454, 2023. 3 [31] Siyuan Li, Zicheng Liu, Zedong Wang, Di Wu, Zihan Liu, and Stan Z. Li. Boosting discriminative visual representation learning with scenario-agnostic mixup. Ar Xiv, abs/2111.15454, 2021. 3, 6, 10, 20 [32] Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, and Stan Z. Li. Openmixup: Open mixup toolbox and benchmark for visual representation learning. https://github.com/Westlak e-AI/openmixup, 2022. 7, 10, 19, 20 [33] Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Kai Wang, Lei Shang, Baigui Sun, Haoyang Li, and Stan.Z.Li. Architecture-agnostic masked image modeling - from vit back to cnn. In International Conference on Machine Learning (ICML), 2023. 3 [34] Xingjian Li, Haoyi Xiong, Hanchao Wang, Yuxuan Rao, Liping Liu, and Jun Huan. Delta: Deep learning transfer using feature map with attention for convolutional networks. In International Conference on Learning Representations (ICLR), 2019. 3 [35] Xuhong Li, Yves Grandvalet, and Franck Davoine. Explicit inductive bias for transfer learning with convolutional networks. In Proceedings of the International Conference on Machine Learning (ICML), 2018. 3 [36] Jihao Liu, B. 
Liu, Hang Zhou, Hongsheng Li, and Yu Liu. Tokenmix: Rethinking image mixing for data augmentation in vision transformers. In European Conference on Computer Vision (ECCV), 2022. 10 [37] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In International Conference on Computer Vision (ICCV), 2021. 6 [38] Zicheng Liu, Siyuan Li, Di Wu, Zhiyuan Chen, Lirong Wu, Liu Zihan, and Stan Z Li. Automix: Unveiling the power of mixup for stronger classifier. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2022. 3, 6, 7, 9, 10, 19, 20, 22 [39] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. ar Xiv preprint ar Xiv:1608.03983, 2016. 19 [40] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. 2 [41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019. 19 [42] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Finegrained visual classification of aircraft. ar Xiv preprint ar Xiv:1306.5151, 2013. 6, 19 [43] Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34:23296 23308, 2021. 9 [44] Maxime Oquab, L eon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 3 [45] Joonhyung Park, June Yong Yang, Jinwoo Shin, Sung Ju Hwang, and Eunho Yang. Saliency grafting: Innocuous attribution-guided mixup with calibrated label mixing. In AAAI Conference on Artificial Intelligence, 2022. 3 [46] Francesco Pinto, Harry Yang, Ser-Nam Lim, Philip HS Torr, and Puneet K Dokania. Regmixup: Mixup as a regularizer can surprisingly improve accuracy and out distribution robustness. ar Xiv preprint ar Xiv:2206.14502, 2022. 4 [47] Jie Qin, Jiemin Fang, Qian Zhang, Wenyu Liu, Xingang Wang, and Xinggang Wang. Resizemix: Mixing data with preserved object information and true labels. ar Xiv preprint ar Xiv:2012.11101, 2020. 6, 20 [48] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision (IJCV), pages 211 252, 2015. 6 [49] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. ar Xiv preprint ar Xiv:1610.02391, 2019. 1 [50] Zhiqiang Shen, Zechun Liu, Zhuang Liu, Marios Savvides, Trevor Darrell, and Eric Xing. Unmix: Rethinking image mixtures for unsupervised visual representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021. 3 [51] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1 48, 2019. 2 [52] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. 
In Advances in Neural Information Processing Systems (Neur IPS), 2020. 3, 6, 8, 19 [53] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research (JMLR), 15(1):1929 1958, 2014. 1 [54] Cheng Tan, Zhangyang Gao, Lirong Wu, Siyuan Li, and Stan Z Li. Hyperspherical consistency regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7244 7255, 2022. 3 [55] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICML), pages 10347 10357, 2021. 6, 7, 9, 19, 21 [56] AFM Uddin, Mst Monira, Wheemyung Shin, Tae Choong Chung, Sung-Ho Bae, et al. Saliencymix: A saliency guided data augmentation strategy for better regularization. ar Xiv preprint ar Xiv:2006.01791, 2020. 2, 3, 6, 20 [57] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning (ICML), pages 6438 6447. PMLR, 2019. 3, 4, 6, 20 [58] Vikas Verma, Thang Luong, Kenji Kawaguchi, Hieu Pham, and Quoc Le. Towards domainagnostic contrastive learning. In International Conference on Machine Learning (ICML), pages 10530 10541. PMLR, 2021. 2 [59] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011. 6, 19 [60] Devesh Walawalkar, Zhiqiang Shen, Zechun Liu, and Marios Savvides. Attentive cutmix: An enhanced data augmentation approach for deep learning based image classification. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3642 3646, 2020. 3 [61] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International conference on machine learning (ICML), pages 1058 1066. PMLR, 2013. 2 [62] Yidong Wang, Hao Chen, Qiang Heng, Wenxin Hou, Yue Fan, Zhen Wu, Jindong Wang, Marios Savvides, Takahiro Shinozaki, Bhiksha Raj, et al. Freematch: Self-adaptive thresholding for semi-supervised learning. In The Eleventh International Conference on Learning Representations, 2022. 3 [63] Ross Wightman, Hugo Touvron, and Herv e J egou. Resnet strikes back: An improved training procedure in timm, 2021. 6, 7, 9, 19, 21 [64] Lirong Wu, Jun Xia, Zhangyang Gao, Haitao Lin, Cheng Tan, and Stan Z Li. Graphmixup: Improving class-imbalanced node classification by reinforcement mixup and self-supervised context prediction. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 519 535. Springer, 2022. 3 [65] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation for consistency training. ar Xiv preprint ar Xiv:1904.12848, 2019. 3, 8 [66] Saining Xie, Ross Girshick, Piotr Doll ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1492 1500, 2017. 6 [67] Wang Ximei, Gao Jinghan, Long Mingsheng, and Wang Jianmin. Self-tuning for data-efficient deep learning. 
In Proceedings of the International Conference on Machine Learning (ICML), 2021. 3, 5, 8 [68] Chen Xinyang, Wang Sinan, Fu Bo, Long Mingsheng, and Wang Jianmin. Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning. In Advances in Neural Information Processing Systems (Neur IPS), 2019. 3, 8 [69] Huaxiu Yao, Yiping Wang, Linjun Zhang, James Y Zou, and Chelsea Finn. C-mixup: Improving generalization in regression. Advances in Neural Information Processing Systems, 35:3361–3376, 2022. 3 [70] Huaxiu Yao, Yiping Wang, Linjun Zhang, James Y. Zou, and Chelsea Finn. C-mixup: Improving generalization in regression. In Advances in Neural Information Processing Systems (Neur IPS), 2022. 3 [71] Kaichao You, Zhi Kou, Mingsheng Long, and Jianmin Wang. Co-tuning for transfer learning. In Advances in Neural Information Processing Systems (Neur IPS), 2020. 7, 8 [72] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations (ICLR), 2020. 19 [73] Hao Yu, Huanyu Wang, and Jianxin Wu. Mixup without hesitation. In International Conference on Image and Graphics (ICIG), pages 143–154. Springer, 2021. 4 [74] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6023–6032, 2019. 2, 3, 6, 19, 20, 22 [75] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), 2016. 6 [76] Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. In Advances in Neural Information Processing Systems (Neur IPS), 2021. 3, 8, 19 [77] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. ar Xiv preprint ar Xiv:1710.09412, 2017. 2, 3, 6, 20, 21, 22

In the Appendix sections, we provide proofs of Proposition 1 (A.1) and Proposition 2 (A.2), implementation details (B), and more results of comparison experiments and empirical analysis (C).

A Proof of Propositions

A.1 Proof of Proposition 1

Proposition 1. Assuming x_(a,b) is generated from two different classes i and j, minimizing L_MCE is equivalent to regressing the corresponding λ in the gradient of L_MCE:

(∇_{z_(a,b)} L_MCE)_l =
  −λ + exp(z^i_(a,b)) / Σ_c exp(z^c_(a,b)),        if l = i,
  −(1 − λ) + exp(z^j_(a,b)) / Σ_c exp(z^c_(a,b)),  if l = j,
  exp(z^l_(a,b)) / Σ_c exp(z^c_(a,b)),              if l ≠ i, j.

Proof. For the mixed sample (x_(a,b), y_(a,b)), z_(a,b) is derived from a feature extractor f_θ (i.e., z_(a,b) = f_θ(x_(a,b))). According to the definition of the mixup cross-entropy loss L_MCE, we have

(∇_{z_(a,b)} L_MCE)_l = ∂L_MCE / ∂z^l_(a,b)
= − ∂/∂z^l_(a,b) [ y_(a,b)^T log σ(z_(a,b)) ]
= − Σ_{i=1}^{C} y^i_(a,b) ∂/∂z^l_(a,b) log( exp(z^i_(a,b)) / Σ_{j=1}^{C} exp(z^j_(a,b)) )
= − Σ_{i=1}^{C} y^i_(a,b) ( δ^l_i − exp(z^l_(a,b)) / Σ_{j=1}^{C} exp(z^j_(a,b)) )
= exp(z^l_(a,b)) / Σ_{j=1}^{C} exp(z^j_(a,b)) − y^l_(a,b),

where δ^l_i is the Kronecker delta and the last step uses Σ_i y^i_(a,b) = 1. Substituting y^i_(a,b) = λ, y^j_(a,b) = 1 − λ, and y^l_(a,b) = 0 for l ≠ i, j yields the cases above.
Similarly, for the decoupled regularizer $\mathcal{L}_{DM}$ we have:
$$
\begin{aligned}
\big(\nabla_{z_{(a,b)}}\mathcal{L}_{DM}\big)_l
&=\frac{\partial\mathcal{L}_{DM}}{\partial z^{l}_{(a,b)}}
=\frac{\partial}{\partial z^{l}_{(a,b)}}\Big(-y_{[a,b]}^{\top}\log\mathcal{H}(z_{(a,b)})\,y_{[a,b]}\Big)\\
&=-\frac{\partial}{\partial z^{l}_{(a,b)}}\sum_{i,j=1}^{C} y^{i}_{a}y^{j}_{b}\Big[\log\Big(\frac{\exp(z^{i}_{(a,b)})}{\sum_{k\neq j}\exp(z^{k}_{(a,b)})}\Big)+\log\Big(\frac{\exp(z^{j}_{(a,b)})}{\sum_{k\neq i}\exp(z^{k}_{(a,b)})}\Big)\Big]\\
&=-\sum_{i,j=1}^{C} y^{i}_{a}y^{j}_{b}\Big[\delta^{i}_{l}-\frac{\sum_{k\neq j}\exp(z^{k}_{(a,b)})\,\delta^{k}_{l}}{\sum_{k\neq j}\exp(z^{k}_{(a,b)})}+\delta^{j}_{l}-\frac{\sum_{k\neq i}\exp(z^{k}_{(a,b)})\,\delta^{k}_{l}}{\sum_{k\neq i}\exp(z^{k}_{(a,b)})}\Big]\\
&=\frac{\sum_{k\neq i}\exp(z^{k}_{(a,b)})\,\delta^{k}_{l}}{\sum_{k\neq i}\exp(z^{k}_{(a,b)})}+\frac{\sum_{k\neq j}\exp(z^{k}_{(a,b)})\,\delta^{k}_{l}}{\sum_{k\neq j}\exp(z^{k}_{(a,b)})}-\delta^{i}_{l}-\delta^{j}_{l},
\end{aligned}
$$
where the last equality uses that $y_a$ and $y_b$ are one-hot labels for classes $i$ and $j$. Thus, for the $\mathcal{L}_{DM}$ loss:
$$
\big(\nabla_{z_{(a,b)}}\mathcal{L}_{DM}\big)_l=
\begin{cases}
-1+\dfrac{\exp(z^{i}_{(a,b)})}{\sum_{c\neq j}\exp(z^{c}_{(a,b)})}, & l=i,\\[6pt]
-1+\dfrac{\exp(z^{j}_{(a,b)})}{\sum_{c\neq i}\exp(z^{c}_{(a,b)})}, & l=j,\\[6pt]
\dfrac{\exp(z^{l}_{(a,b)})}{\sum_{c\neq i}\exp(z^{c}_{(a,b)})}+\dfrac{\exp(z^{l}_{(a,b)})}{\sum_{c\neq j}\exp(z^{c}_{(a,b)})}, & l\neq i,j.
\end{cases}
$$

A.2 Proof of Proposition 2

Proposition 2. With the decoupled softmax defined above, the decoupled mixup cross-entropy $\mathcal{L}_{DM(CE)}$ can mutually boost the prediction confidence of the classes of interest and escape from the $\lambda$-constraint:
$$
\mathcal{L}_{DM}=-\sum_{i,j=1}^{C} y^{i}_{a}y^{j}_{b}\Big[\log\frac{p^{i}_{(a,b)}}{1-p^{j}_{(a,b)}}+\log\frac{p^{j}_{(a,b)}}{1-p^{i}_{(a,b)}}\Big].
$$

Proof. For the mixed sample $(x_{(a,b)}, y_{(a,b)})$, $z_{(a,b)}$ is derived from a feature extractor $f_\theta$, i.e., $z_{(a,b)}=f_\theta(x_{(a,b)})$. According to the definition of $\mathcal{L}_{DM(CE)}$, the decoupled term can be rewritten as:
$$
\begin{aligned}
\mathcal{L}_{DM}
&=-y_{[a,b]}^{\top}\log\mathcal{H}(z_{(a,b)})\,y_{[a,b]}
=-\big(y_{a}^{\top}\log\mathcal{H}(z_{(a,b)})\,y_{b}+y_{b}^{\top}\log\mathcal{H}(z_{(a,b)})\,y_{a}\big)\\
&=-\sum_{i,j=1}^{C} y^{i}_{a}y^{j}_{b}\Big[\log\Big(\frac{\exp(z^{i}_{(a,b)})}{\sum_{k\neq j}\exp(z^{k}_{(a,b)})}\Big)+\log\Big(\frac{\exp(z^{j}_{(a,b)})}{\sum_{k\neq i}\exp(z^{k}_{(a,b)})}\Big)\Big]\\
&=-\sum_{i,j=1}^{C} y^{i}_{a}y^{j}_{b}\Big[\log\Big(\frac{\exp(z^{i}_{(a,b)})/\sum_{k=1}^{C}\exp(z^{k}_{(a,b)})}{\sum_{k\neq j}\exp(z^{k}_{(a,b)})/\sum_{k=1}^{C}\exp(z^{k}_{(a,b)})}\Big)+\log\Big(\frac{\exp(z^{j}_{(a,b)})/\sum_{k=1}^{C}\exp(z^{k}_{(a,b)})}{\sum_{k\neq i}\exp(z^{k}_{(a,b)})/\sum_{k=1}^{C}\exp(z^{k}_{(a,b)})}\Big)\Big]\\
&=-\sum_{i,j=1}^{C} y^{i}_{a}y^{j}_{b}\Big[\log\Big(\frac{p^{i}_{(a,b)}}{1-p^{j}_{(a,b)}}\Big)+\log\Big(\frac{p^{j}_{(a,b)}}{1-p^{i}_{(a,b)}}\Big)\Big],
\end{aligned}
$$
where $p_{(a,b)}=\sigma(z_{(a,b)})$.
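To make the objective concrete, the following is a minimal sketch of DM(CE), combining the mixed cross-entropy with the decoupled regularizer above as $\mathcal{L}_{MCE}+\eta\,\mathcal{L}_{DM}$ (the balancing weight $\eta$ is discussed in B.3). It is an illustrative re-implementation in PyTorch rather than the official OpenMixup code, so reduction and numerical details may differ:

```python
import torch
import torch.nn.functional as F

def dm_ce_loss(logits, y_a, y_b, lam, eta=0.1, eps=1e-8):
    """logits: (N, C); y_a, y_b: (N,) integer labels of the two mixed sources; lam: mixing ratio."""
    p = torch.softmax(logits, dim=1)                           # p_(a,b)
    num_classes = p.size(1)
    y_mix = lam * F.one_hot(y_a, num_classes).float() + \
            (1.0 - lam) * F.one_hot(y_b, num_classes).float()  # y_(a,b)
    mce = -(y_mix * torch.log(p + eps)).sum(dim=1)             # mixed cross-entropy

    p_a = p.gather(1, y_a[:, None]).squeeze(1)                 # p^i of class from x_a
    p_b = p.gather(1, y_b[:, None]).squeeze(1)                 # p^j of class from x_b
    # decoupled regularizer: -[log(p_i / (1 - p_j)) + log(p_j / (1 - p_i))]
    dm = -(torch.log(p_a / (1.0 - p_b + eps) + eps) +
           torch.log(p_b / (1.0 - p_a + eps) + eps))
    return (mce + eta * dm).mean()
```

In practice, `lam` is simply the ratio produced by the mixup policy in use (e.g., the pasted-area ratio for CutMix), so the regularizer adds no extra computation beyond a few elementwise operations.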
B Implementation Details

B.1 Dataset

We briefly introduce the image datasets used. (1) Small-scale classification benchmarks: CIFAR-10/100 [25] contains 50,000 training images and 10,000 test images at 32×32 resolution, with 10 and 100 classes, respectively. Tiny-ImageNet [10] is a rescaled version of ImageNet-1k with 100,000 training images and 10,000 validation images of 200 classes at 64×64 resolution. (2) Large-scale classification benchmark: ImageNet-1k [26] contains 1,281,167 training images and 50,000 validation images of 1,000 classes at 224×224 resolution. (3) Small-scale fine-grained classification scenarios: CUB-200-2011 [59] contains 11,788 images from 200 wild bird species, FGVC-Aircraft [42] contains 10,000 images of 100 classes of aircraft, and Stanford-Cars [24] is also adopted for fine-grained evaluation.

B.2 Training Settings

Small-scale image classification. For the small-scale classification benchmarks on CIFAR-100 and Tiny-ImageNet, we adopt the CIFAR version of ResNet variants, i.e., using a 3×3 convolution instead of the 7×7 convolution and MaxPooling in the stem, and follow the common training settings [23, 38]: the basic data augmentations are RandomFlip and RandomCrop with 4-pixel padding; the SGD optimizer and a cosine learning-rate scheduler [39] are used with a weight decay of 0.0001, a momentum of 0.9, and a batch size of 100; all methods are trained for 800 epochs with a basic learning rate lr = 0.1 on CIFAR-100 and for 400 epochs with lr = 0.2 on Tiny-ImageNet.

Fine-grained image classification. For the fine-grained classification experiments on CUB-200 and Aircraft, all mixup methods are trained for 200 epochs by the SGD optimizer with an initial learning rate lr = 0.001, a weight decay of 0.0005, and a batch size of 16. We use the standard augmentations RandomFlip and RandomResizedCrop, and load the official PyTorch pre-trained models on ImageNet-1k as initialization.

ImageNet image classification. For large-scale classification on ImageNet-1k, we evaluate mixup methods with three popular training procedures; Table A1 lists the full settings. Notice that the DeiT [55] and RSB A3 [63] settings employ Mixup and CutMix with a switching probability of 0.5 during training. (a) PyTorch-style setting. Without any advanced training strategies, the PyTorch-style setting is used to study the performance gains of mixup methods: the SGD optimizer is used to train for 100 epochs with a weight decay of 0.0001, a momentum of 0.9, a batch size of 256, and a basic learning rate of 0.1 adjusted by a cosine scheduler. Notice that we replace the step learning-rate decay with the cosine scheduler [39] for better performance, following [74]. (b) DeiT [55] setting. We use the DeiT setting to verify the effectiveness of DM(CE) for training Transformer-based networks: the AdamW optimizer [41] is used to train for 300 epochs with a batch size of 1024, a basic learning rate of 0.001, and a weight decay of 0.05. (c) RSB A3 [63] setting. This setting adopts training techniques similar to DeiT for ConvNets, especially using MBCE instead of MCE: the LAMB optimizer [72] is used to train for 100 epochs with a batch size of 2048, a basic learning rate of 0.008, and a weight decay of 0.02. The DeiT and RSB A3 settings use the combination of Mixup and CutMix (50% random switching probability) as the baseline.

Semi-supervised transfer learning. For the semi-supervised transfer learning benchmarks, we use the same hyper-parameters and augmentations as Self-Tuning^3: all methods are initialized with PyTorch pre-trained models on ImageNet-1k and trained for 27k steps in total by the SGD optimizer with a basic learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.0005. We reproduced Self-Tuning and conducted all experiments in OpenMixup [32].

Semi-supervised learning. For the semi-supervised learning benchmarks (training from scratch), we adopt the most commonly used CIFAR-10/100 datasets among the popular SSL benchmarks based on WRN-28-2 and WRN-28-8, following [52, 76]. For a fair comparison, we use the same hyper-parameters and training settings as the original papers and adopt the open-source codebase TorchSSL [76] for all methods. Concretely, we use the SGD optimizer with a basic learning rate lr = 0.03 adjusted by a cosine scheduler, 2^20 total training steps, a batch size of 64 for labeled data, and a confidence threshold τ = 0.95.

^3 https://github.com/thuml/Self-Tuning

Table A1: Ingredients and hyper-parameters used for the ImageNet-1k training settings ("–" indicates the ingredient is not used or not specified for that procedure).

Ingredient | PyTorch | DeiT | RSB A3
Train Res | 224² | 224² | 224²
Test Res | 224² | 224² | 224²
Test crop ratio | 0.875 | 0.875 | 0.95
Epochs | 100/300 | 300 | 100
Batch size | 256 | 1024 | 2048
Optimizer | SGD | AdamW | LAMB
LR | 0.1 | 1e-3 | 8e-3
LR decay | cosine | cosine | cosine
Weight decay | 1e-4 | 0.05 | 0.02
Optimizer momentum | 0.9 | β1, β2 = 0.9, 0.999 | –
Warmup epochs | – | 5 | 5
Label smoothing ε | – | 0.1 | –
Dropout | – | – | –
Stoch. Depth | – | 0.1 | 0.05
Repeated Aug | – | – | –
Gradient Clip. | – | 1.0 | –
H. flip | ✓ | ✓ | ✓
RRC | ✓ | ✓ | ✓
RandAugment | – | 9/0.5 | 6/0.5
AutoAugment | – | – | –
Mixup α | – | 0.8 | 0.1
CutMix α | – | 1.0 | 1.0
Erasing prob. | – | 0.25 | –
Color Jitter | – | – | –
EMA | – | 0.99996 | –
CE loss | ✓ | ✓ | –
BCE loss | – | – | ✓
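For instance, the PyTorch-style column of Table A1 corresponds to a very plain recipe; a minimal sketch (assuming torchvision-style training code, with illustrative names) is:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def pytorch_style_recipe(model, epochs=100, base_lr=0.1):
    # SGD with momentum 0.9 and weight decay 1e-4 (batch size 256 is set in the
    # data loader), with a cosine learning-rate schedule replacing step decay.
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```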
B.3 Hyper-parameter Settings

We follow the basic hyper-parameter settings (e.g., α) for the mixup variants in OpenMixup [32], where we reproduce most comparison methods. Notice that static methods denote Mixup [77], CutMix [74], ManifoldMix [57], SaliencyMix [56], FMix [17], and ResizeMix [47], while dynamic methods denote PuzzleMix [23], AutoMix [38], and SAMix [31]. Similarly, interpolation-based methods denote Mixup and ManifoldMix, while cutting-based methods denote the rest of the mixup variants mentioned above. We set the hyper-parameters of DM(CE) as follows: for CIFAR-100 and ImageNet-1k, static methods use η = 0.1 and dynamic methods use η = 1; for Tiny-ImageNet and the fine-grained datasets, static methods use η = 1 with ResNet-18 and η = 0.1 with ResNeXt-50, while dynamic methods use η = 1. As for the hyper-parameters of DM(BCE) on ImageNet-1k, cutting-based methods use t = 1 and ξ = 0.8, while interpolation-based methods use t = 0.5 and ξ = 1. Note that we use α = 0.2 and α = 2 for the static and dynamic methods, respectively, when using the proposed DM. These settings are summarized in the configuration sketch after Table A6 below.

Table A2: Top-1 Acc (%) of small-scale image classification on CIFAR-100 and Tiny-ImageNet based on ResNet variants; each cell reports MCE / DM(CE).

Method | CIFAR-100 R-18 | CIFAR-100 RX-50 | CIFAR-100 WRN-28-8 | Tiny-ImageNet R-18 | Tiny-ImageNet RX-50
SaliencyMix | 79.12 / 79.28 | 81.53 / 82.61 | 84.35 / 84.41 | 64.60 / 66.56 | 66.55 / 67.52
PuzzleMix | 81.13 / 81.34 | 82.85 / 82.97 | 85.02 / 85.25 | 65.81 / 66.52 | 67.83 / 68.04
AutoMix | 82.04 / 82.32 | 83.64 / 83.94 | 85.18 / 85.38 | 67.33 / 68.18 | 70.72 / 71.56
SAMix | 82.30 / 82.40 | 84.42 / 84.53 | 85.50 / 85.59 | 68.89 / 69.16 | 72.18 / 72.39
Avg. Gain | +0.19 | +0.40 | +0.15 | +0.95 | +0.56

Table A3: Top-1 Acc (%) of image classification on ImageNet-1k with ResNet variants using the PyTorch-style 100-epoch training recipe; each cell reports MCE / DM(CE).

Method | R-18 | R-34 | R-50
SaliencyMix | 69.16 / 69.57 | 73.56 / 73.92 | 77.14 / 77.42
PuzzleMix | 70.12 / 70.32 | 74.26 / 74.51 | 77.54 / 77.71
AutoMix | 70.51 / 70.64 | 74.52 / 74.77 | 77.91 / 78.15
SAMix | 70.85 / 70.90 | 74.96 / 75.10 | 78.11 / 78.36
Avg. Gain | +0.20 | +0.25 | +0.23

Table A4: Top-1 Acc (%) of image classification on ImageNet-1k based on ResNet-50 using the RSB A3 100-epoch training recipe.

Method | MCE | DM(CE) | MBCE (one) | MBCE (two) | DM(BCE) (one)
SaliencyMix | 76.85 | 77.25 | 77.93 | 72.74 | 78.24
PuzzleMix | 77.27 | 77.60 | 78.02 | 77.19 | 78.15
AutoMix | 77.45 | 77.82 | 78.33 | 77.46 | 78.62
SAMix | 78.33 | 78.45 | 78.64 | 77.58 | 78.75
Avg. Gain | – | +0.30 | – | -1.99 | +0.04

Table A5: Top-1 Acc (%) of classification on ImageNet-1k with ViTs; each cell reports MCE / DM(CE).

Method | DeiT-S | Swin-T
DeiT | 79.80 / 80.37 | 81.28 / 81.49
SaliencyMix | 79.32 / 79.86 | 80.68 / 80.83
PuzzleMix | 79.84 / 80.25 | 81.03 / 81.16
AutoMix | 80.78 / 80.91 | 81.80 / 81.92
SAMix | 80.94 / 81.12 | 81.87 / 81.97
Avg. Gain | +0.32 | +0.13

Table A6: Top-1 Acc (%) of fine-grained image classification on CUB-200 and FGVC-Aircraft with ResNet variants; each cell reports MCE / DM(CE).

Method | CUB-200 R-18 | CUB-200 RX-50 | FGVC-Aircraft R-18 | FGVC-Aircraft RX-50
SaliencyMix | 77.95 / 78.28 | 83.29 / 84.51 | 80.02 / 81.31 | 84.31 / 85.07
PuzzleMix | 78.63 / 78.74 | 84.51 / 84.67 | 80.76 / 80.89 | 86.23 / 86.36
AutoMix | 79.87 / 81.08 | 86.56 / 86.74 | 81.37 / 82.18 | 86.69 / 86.82
SAMix | 81.11 / 81.27 | 86.83 / 86.95 | 82.15 / 83.68 | 86.80 / 87.22
Avg. Gain | +0.45 | +0.42 | +0.94 | +0.36
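As referenced in B.3, the hyper-parameter choices can be summarized in a small configuration sketch (a hypothetical layout with illustrative keys, not the actual OpenMixup config schema):

```python
# Illustrative summary of the DM hyper-parameters described in B.3; keys are hypothetical.
DM_HYPERPARAMS = {
    "dm_ce_eta": {
        "cifar100":   {"static": 0.1, "dynamic": 1.0},
        "imagenet1k": {"static": 0.1, "dynamic": 1.0},
        # Tiny-ImageNet and the fine-grained datasets: eta depends on the backbone.
        "tiny_imagenet_and_finegrained": {
            "static_resnet18": 1.0, "static_resnext50": 0.1, "dynamic": 1.0,
        },
    },
    "dm_bce_imagenet1k": {
        "cutting_based":       {"t": 1.0, "xi": 0.8},  # e.g., CutMix-style policies
        "interpolation_based": {"t": 0.5, "xi": 1.0},  # e.g., Mixup, ManifoldMix
    },
    "mixup_alpha": {"static": 0.2, "dynamic": 2.0},
}
```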
C More Experiment Results

C.1 Image Classification Benchmarks

Small-scale classification benchmarks. For the small-scale classification benchmarks on CIFAR-100 and Tiny-ImageNet, we also apply the proposed DM(CE) to dynamic mixup methods, even though these algorithms already achieve high performance, as shown in Table A2: DM(CE) brings +0.23% to +0.36% on CIFAR-100 for the previous state-of-the-art PuzzleMix and +0.21% to +0.27% on Tiny-ImageNet for the current state-of-the-art SAMix. Overall, the proposed DM(CE) produces +0.15% to +0.40% and +0.56% to +0.95% average gains on CIFAR-100 and Tiny-ImageNet, respectively, demonstrating its generalizability to advanced mixup augmentations.

ImageNet and fine-grained classification benchmarks. For the experiments on ImageNet-1k, we also apply the proposed DM(CE) to dynamic mixup approaches under the PyTorch-style [19], DeiT [55], and RSB A3 [63] training settings to further evaluate the generalizability of decoupled mixup. As shown in Table A3 and Table A4, DM(CE) gains +0.2% to +0.3% top-1 accuracy over MCE on average for four dynamic mixup methods based on ResNet variants on ImageNet-1k; Table A5 shows that DM(CE) also improves dynamic methods based on the popular DeiT-S and Swin-T backbones with modern training recipes. These results indicate that the proposed decoupled mixup can also boost high-performing dynamic mixup augmentations on ImageNet-1k. Moreover, the proposed DM(CE) improves dynamic mixup variants on fine-grained classification benchmarks, as shown in Table A6, with around +0.4% to +0.9% average gains over MCE based on ResNet variants.

Table A7: Top-1 Acc (%) and FGSM error (%) on CIFAR-100 and Tiny-ImageNet based on ResNet-18 trained for 400 epochs; each cell reports MCE / DM(CE).

Method | CIFAR-100 Acc (%) | CIFAR-100 FGSM Error (%) | Tiny-ImageNet Acc (%) | Tiny-ImageNet FGSM Error (%)
Mixup | 79.34 / 79.70 | 70.28 / 70.05 | 63.86 / 65.07 | 89.06 / 88.91
CutMix | 79.58 / 79.77 | 87.43 / 86.84 | 65.53 / 66.45 | 89.14 / 88.79
ManifoldMix | 80.18 / 81.06 | 72.50 / 72.19 | 64.15 / 65.45 | 88.78 / 88.52
PuzzleMix | 80.22 / 80.58 | 79.76 / 79.53 | 65.81 / 66.13 | 91.83 / 92.05
AutoMix | 81.78 / 81.96 | 69.94 / 69.80 | 67.33 / 68.18 | 88.37 / 88.34

Figure A1: Experimental overview of hard mixed sample mining. Left: top-1 and top-2 accuracy of mixed data versus the mixing ratio λ for ResNet-50 trained 100 epochs on ImageNet-1k; a prediction is counted as top-1 correct if the top-1 prediction belongs to {y_a, y_b}, and as top-2 correct if the top-2 predictions are equal to {y_a, y_b}. Compared with static policies like Mixup [77] and CutMix [74], the dynamic method AutoMix [38] significantly reduces the difficulty of mixup classification and alleviates the label mismatch issue [23] by providing more reliable mixed samples, but it also requires a large computational overhead. Right: taking Mixup as an example, the top-1 accuracy of mixed data over training epochs shows that our proposed decoupled mixup cross-entropy, DM(CE), significantly improves training efficiency by exploring hard mixed samples and alleviating the label mismatch issue.

C.2 Adversarial Robustness

Since mixup variants are proven to enhance the robustness of DNNs against adversarial samples [77], we compare the robustness of the original MCE and the proposed DM(CE) by performing the FGSM [15] white-box attack within an 8/255 ℓ∞ ε-ball, following [23].
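The evaluation protocol can be sketched as follows (a minimal FGSM implementation in PyTorch; it assumes inputs normalized to [0, 1] and is illustrative rather than the exact script used for Table A7):

```python
import torch
import torch.nn.functional as F

def fgsm_error(model, loader, device, eps=8/255):
    """Return the FGSM error rate (%) under an L-inf ball of radius eps."""
    model.eval()
    wrong, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad = torch.autograd.grad(loss, x)[0]
        x_adv = (x + eps * grad.sign()).clamp(0, 1)   # one-step FGSM perturbation
        with torch.no_grad():
            wrong += (model(x_adv).argmax(1) != y).sum().item()
            total += y.numel()
    return 100.0 * wrong / total
```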
Table A7 shows that DM(CE) improves the top-1 accuracy of MCE while maintaining competitive FGSM error rates for five popular mixup algorithms, which indicates that DM(CE) can boost discrimination without disturbing the smoothness properties of mixup variants.

C.3 Data-efficient Mixup with Limited Training Labels

To further examine whether DM truly enables data-efficient mixup training, we conducted supervised experiments on CIFAR-100 with different amounts of training data: 15%, 30%, and 50% of the CIFAR-100 training set are randomly selected as training data, and the test set is unchanged. The proposed decoupled mixup uses DM(CE) as the loss function by default. From Table A8, we can see that DM improves performance consistently without any computational overhead. Especially when using only 15% of the data, DM can improve accuracy by around 2%. Therefore, combined with the semi-supervised experimental results in Sec. 5.3 and Sec. 5.2, we can say that mixup training with DM is more data-efficient with limited data.

Table A8: Top-1 Acc (%) of image classification on CIFAR-100 with ResNet-18 using 15%, 30%, and 50% labeled training sets; each cell reports MCE / DM(CE).

Method | 15% | 30% | 50%
Vanilla | 42.48 / – | 56.41 / – | 64.32 / –
Mixup | 42.23 / 44.39 | 55.61 / 56.78 | 64.55 / 65.92
CutMix | 43.81 / 44.85 | 55.99 / 57.14 | 64.38 / 65.87
SaliencyMix | 42.95 / 44.01 | 55.42 / 56.51 | 64.56 / 66.10
PuzzleMix | 42.67 / 43.87 | 56.19 / 57.36 | 64.74 / 66.26
Avg. Gain | +1.36 | +1.14 | +1.48

C.4 Empirical Analysis

In addition to the occlusion robustness in Figure 6, we analyze the top-1 and top-2 mixup classification accuracy and visualize the validation accuracy curves during training to empirically demonstrate the effectiveness of DM in Figure A1.

Figure A2: Ablation of hyper-parameters on ImageNet-1k based on ResNet-34. Left: analyzing the balancing weight η in DM(CE); Middle: analyzing ξ in DM(BCE) when t is fixed to 1 and 0.5; Right: analyzing t in DM(BCE) when ξ is fixed to 1 and 0.8.

Figure A3: Sensitivity analysis of the hyper-parameter η on CIFAR-100, Tiny-ImageNet, and CUB-200 based on ResNet-18.

C.5 Ablation Study and Analysis

Ablation of hyper-parameters. We first provide ablation experiments on the shared hyper-parameter η in DM(CE) and DM(BCE). In Figure A2 (left), the static methods (Mixup and CutMix) and the dynamic methods (PuzzleMix and AutoMix) prefer η = 0.1 and η = 1, respectively, which might be because the dynamic variants generate more discriminative and reliable mixed samples than the static methods. Figure A2 (middle and right) shows the ablation studies of the hyper-parameters ξ and t in DM(BCE), where cutting-based methods (CutMix and AutoMix) prefer ξ = 0.8 and t = 1, while interpolation-based policies (Mixup and ManifoldMix) use ξ = 1.0 and t = 0.5.

Sensitivity analysis. To verify the robustness of the hyper-parameter η, extra experiments are conducted on the CIFAR-100, Tiny-ImageNet, and CUB-200 datasets. Figure A3 shows results consistent with our ablation study in Sec. 5.4: dynamic mixup methods prefer a large value of η (e.g., 1.0), while static ones prefer a small value (e.g., 0.1).
The main reason is that the dynamic methods generate mixed samples in which label mismatch is relatively rare, so they rely on a larger weight to achieve better results, whereas the opposite holds for the static methods.