# RegMixMatch: Optimizing Mixup Utilization in Semi-Supervised Learning

Haorong Han1,2, Jidong Yuan1,2*, Chixuan Wei3, Zhongyang Yu1,2
1Key Laboratory of Big Data and Artificial Intelligence in Transportation, Ministry of Education, China
2School of Computer Science and Technology, Beijing Jiaotong University
3School of Data Science and Intelligent Media, Communication University of China
{harohan, yuanjd, zhongyangyu}@bjtu.edu.cn, chxwei@cuc.edu.cn

Consistency regularization and pseudo-labeling have significantly advanced semi-supervised learning (SSL). Prior works have effectively employed Mixup for consistency regularization in SSL. However, our findings indicate that applying Mixup for consistency regularization may degrade SSL performance by compromising the purity of artificial labels. Moreover, most pseudo-labeling based methods use a thresholding strategy to exclude low-confidence data, aiming to mitigate confirmation bias; however, this approach limits the utility of unlabeled samples. To address these challenges, we propose RegMixMatch, a novel framework that optimizes the use of Mixup with both high- and low-confidence samples in SSL. First, we introduce Semi-supervised RegMixup, which effectively addresses reduced artificial label purity by using both mixed samples and clean samples for training. Second, we develop a class-aware Mixup technique that integrates information from the top-2 predicted classes into low-confidence samples and their artificial labels, reducing the confirmation bias associated with these samples and enhancing their effective utilization. Experimental results demonstrate that RegMixMatch achieves state-of-the-art performance across various SSL benchmarks.
Code: https://github.com/hhrd9/regmixmatch
Extended version: https://arxiv.org/abs/2412.10741

Introduction

Semi-supervised learning (SSL) aims to leverage a small amount of labeled data to enable the model to extract useful information from a large volume of unlabeled data during training. FixMatch (Sohn et al. 2020) and its variants (Zhang et al. 2021; Xu et al. 2021; Wang et al. 2023; Huang et al. 2023) have demonstrated competitive results in SSL by retaining pseudo-labels (Lee et al. 2013) for high-confidence samples using a threshold strategy combined with consistency regularization. However, these methods rely heavily on data augmentation techniques for consistency regularization, and their pseudo-labeling approach is constrained to high-confidence samples alone. This raises two critical questions: 1) Are existing data augmentation techniques optimal for SSL? 2) Are the low-confidence samples filtered by the threshold truly unusable?

*Corresponding Author
Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Motivating example of RegMixMatch. (a) Mixup's confidence-reducing behavior undermines SSL (purity and accuracy under Mixup vs. ERM over training iterations). (b) The top-2 accuracy rate across all unlabeled data surpasses the reliability of pseudo labels.

At the core of these questions is the challenge of effectively utilizing both high- and low-confidence samples. In this paper, we aim to address these questions by proposing solutions that optimize the use of the Mixup technique (Zhang et al. 2018) in SSL. Mixup, which generates augmented samples by interpolating between randomly shuffled data, has been introduced into SSL (Berthelot et al. 2019b,a) to enhance model generalization, yielding significant results.
However, recent studies (Wen et al. 2021; Pinto et al. 2022) have revealed that Mixup can induce confidence-reducing behavior because it inherently predicts interpolated labels for every input. We argue that this confidence reduction is problematic in SSL because creating effective artificial labels for unlabeled samples is crucial for training. Methods like sharpening (Berthelot et al. 2019a) and pseudo-labeling (Sohn et al. 2020), which harden softmax prediction probabilities, have been shown to be effective. This suggests that SSL benefits from purer artificial labels obtained through low-entropy prediction probabilities, which better support model training. However, Mixup increases the entropy of prediction probabilities, adding noise to artificial labels and consequently hurting SSL classification performance due to the lack of high-purity labels. As illustrated in Figure 1(a), the purity (the proportion of artificial labels whose highest predicted probability exceeds a threshold, reflecting the model's prediction confidence) and test accuracy are compared when FixMatch is trained with pseudo labels using Empirical Risk Minimization (ERM) (Vapnik 1991) and Mixup. The results indicate that Mixup reduces prediction confidence, decreasing the number of high-purity artificial labels and leading to lower classification performance than ERM, in contrast to its success in supervised learning (Zhang et al. 2018). RegMixup (Pinto et al. 2022) addresses this by using Mixup as a regularizer, training with both unmixed and mixed samples to mitigate the high-entropy issue. Inspired by this, and to counter the reduction in artificial label purity caused by Mixup's high-entropy behavior in SSL, we propose Semi-supervised RegMixup (SRM). SRM combines SSL's unique pseudo-labeling technique with weak-to-strong consistency regularization, applying RegMixup within SSL.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)
However, pseudo-labeling alone only allows SRM to effectively utilize high-confidence samples. In the following section, we demonstrate how to effectively utilize low-confidence data. Pseudo-labeling typically employs a fixed or dynamic threshold (Zhang et al. 2021; Xu et al. 2021; Wang et al. 2023; Wei et al. 2023) to filter out low-confidence samples, retaining pseudo labels only for those with confidence above the threshold. While this strategy fully exploits high-confidence samples, it leaves samples with confidence below the threshold underutilized, as the supervisory information from these samples is generally noisier. However, unlike previous approaches that focus exclusively on leveraging high-confidence samples, we argue that low-confidence samples also hold significant potential. In Figure 1(b), we present the reliability of pseudo labels (i.e., the accuracy of predictions for unlabeled data that surpass the threshold) during FixMatch training, alongside the top-1 and top-2 accuracy rates of the model's predictions for all unlabeled data. Notably, while the top-1 accuracy is relatively low, the top-2 accuracy exceeds the reliability of pseudo labels. This indicates that by developing methods that incorporate information from the top-2 classes into the samples and their artificial labels, we could enhance the utilization of a broader range of samples, thereby generating more reliable artificial labels even for those with low initial confidence. To this end, we introduce a heuristic method called Class-Aware Mixup (CAM), designed to effectively utilize low-confidence samples by generating mixed samples for model training. Since artificial labels for low-confidence samples often contain significant noise due to uncertain predictions, mixing these samples with high-confidence samples of the same predicted class helps to reduce noise in the artificial labels, thereby improving their quality for these challenging samples.
Our approach, dubbed RegMixMatch, integrates SRM and CAM, which leverage Mixup to exploit high- and low-confidence samples, respectively. To validate the effectiveness of this approach, we compare the training time and classification performance of RegMixMatch with previous Mixup-based SSL methods in our experiments, demonstrating significant improvements. The key contributions of our work can be summarized as follows:

- We discover and verify that Mixup reduces the purity of artificial labels in SSL, ultimately leading to a decline in model performance.
- We propose SRM, which applies RegMixup to SSL by combining pseudo-labeling and consistency regularization to address the aforementioned issue.
- We introduce CAM for low-confidence samples, aimed at mitigating confirmation bias and fully exploiting the potential of low-confidence data.
- We conduct extensive experiments to validate the performance of RegMixMatch, achieving state-of-the-art results in most scenarios.

Related Work

Since the advent of deep learning, SSL has experienced rapid advancement. In this paper, we present an overview of SSL from two key perspectives: consistency regularization and pseudo-labeling. Additionally, we provide a brief outline of data augmentation techniques based on mixing strategies.

Consistency Regularization. Consistency regularization (Bachman, Alsharif, and Precup 2014) strengthens SSL by enforcing the principle that a classifier should maintain consistent class predictions for an unlabeled example, even after the application of data augmentation. This principle has led to the development of various SSL algorithms based on different augmentation techniques. Traditional SSL methods typically employ softmax or sharpened softmax outputs to supervise perturbed data, thereby achieving consistency regularization.
For instance, Temporal Ensembling (Laine and Aila 2017) ensures model consistency by periodically updating the output mean and minimizing the difference in predictions for the same input across different training epochs. Mean Teacher (Tarvainen and Valpola 2017) enhances learning targets by generating a superior teacher model through the exponential moving average of model weights. VAT (Miyato et al. 2018) introduces adversarial perturbations to input data, increasing the model's robustness. More recently, MixMatch (Berthelot et al. 2019b) and ReMixMatch (Berthelot et al. 2019a) have incorporated the Mixup technique to create augmented samples, thereby boosting generalization performance. ReMixMatch, along with UDA (Xie et al. 2020), also integrates advanced image augmentation techniques such as AutoAugment (Cubuk et al. 2019) and RandAugment (Cubuk et al. 2020), which generate heavily distorted yet semantically intact images, further enhancing classification performance.

Pseudo-Labeling. Pseudo-labeling (Lee et al. 2013) has been demonstrated to offer superior supervision for augmented samples in SSL. To address the issue of confirmation bias inherent in pseudo-labeling, confidence-based thresholding techniques have been developed to enhance the reliability of pseudo labels. FixMatch (Sohn et al. 2020) employs a fixed threshold to generate pseudo labels from high-confidence predictions on weakly augmented images, ensuring consistency with their strongly augmented counterparts. FlexMatch (Zhang et al. 2021) and FreeMatch (Wang et al. 2023) introduce class-specific thresholds that adapt based on each class's learning progress. Dash (Xu et al. 2021) dynamically adjusts the threshold during training to refine the filtering of pseudo labels. MPL (Pham et al. 2021) leverages a teacher network to produce adaptive pseudo labels for guiding a student network on unlabeled data. SoftMatch (Chen et al.
2023) tackles the trade-off between pseudo-label quality and quantity through a weighting function. Building on pseudo-labeling, FlatMatch (Huang et al. 2023) minimizes a cross-sharpness measure to ensure consistent learning across labeled and unlabeled datasets, while SequenceMatch (Nguyen 2024) introduces consistency constraints between augmented sample pairs, reducing discrepancies in the model's predicted distributions across different augmented views. Although pseudo-labeling enhances supervision and overall performance in these methods, its inherent limitations restrict the utilization to primarily high-confidence samples. In this paper, we employ SRM, specifically designed for SSL, to achieve consistency regularization by addressing the issue of reduced artificial label purity through the retention of a portion of clean samples for training. Additionally, we explore methods to utilize low-confidence samples through CAM. The differences from previous Mixup-based approaches are detailed in the appendix.

Mixing-Based Data Augmentation. Mixup (Zhang et al. 2018) generates augmented data by performing a linear interpolation between samples and their corresponding labels, aiming to achieve smoother decision boundaries. Unlike pixel-level Mixup, Manifold Mixup (Verma et al. 2019) applies interpolation within the model's hidden layers, allowing regularization at different representation levels. CutMix (Yun et al. 2019) combines the principles of Mixup and Cutout (DeVries and Taylor 2017) by cutting a region from one image and pasting it onto another, thereby mixing samples while preserving the local structure of the images. SaliencyMix (Uddin et al. 2021) further improves upon CutMix by detecting salient regions, ensuring that the cropped regions express label-relevant features, thus enhancing performance. ResizeMix (Qin et al.
2020) simplifies the process by resizing the source image to a small patch and pasting it onto another image, effectively retaining essential image information. For a more detailed comparison of different image mixing methods, we recommend reviewing OpenMixup (Li et al. 2023). In this work, ResizeMix is chosen as the primary mixing method, with results based on Mixup presented in the appendix for comparison.

Preliminary: From Mixup to RegMixup

Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, ERM (Vapnik 1991) uses the empirical data distribution $P_\delta(x, y)$ from the training set to approximate the true data distribution $P(x, y)$:

$$P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i), \quad (1)$$

where $\delta(x = x_i, y = y_i)$ denotes the Dirac mass centered at $(x_i, y_i)$. The loss function is then minimized over the data distribution $P_\delta$. In contrast to ERM, Vicinal Risk Minimization (VRM) (Chapelle et al. 2000) employs strategies to generate new data distributions in the vicinity of each sample, thereby estimating a richer distribution that provides a more informed risk assessment within the neighborhood around each sample. Based on VRM, Mixup (Zhang et al. 2018) achieves a convex combination of samples to approximate the true data distribution, enhancing the model's generalization performance. The vicinal distribution of Mixup can be represented as

$$P_v(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = \tilde{x}_i, y = \tilde{y}_i), \quad (2)$$

where $\tilde{x}_i = \lambda x_i + (1 - \lambda) x_j$ and $\tilde{y}_i = \lambda y_i + (1 - \lambda) y_j$. For each feature-target pair $(x_i, y_i)$, the corresponding $(x_j, y_j)$ is drawn at random from the training data, and $\lambda$ follows a Beta distribution with parameters $(\alpha, \alpha)$. The larger the $\alpha$, the closer $\lambda$ is to 0.5, resulting in stronger interpolation. Research on out-of-distribution (OOD) detection by RegMixup (Pinto et al. 2022) reveals that Mixup can induce confidence-reducing behavior, impairing the model's ability to effectively differentiate between in-distribution and out-of-distribution samples.
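As a concrete illustration of the vicinal distribution in Equation 2, the following is a minimal NumPy sketch of Mixup batch construction; the `mixup` helper and the toy batch are ours, not the paper's implementation:

```python
import numpy as np

def mixup(x, y, alpha=1.0, rng=np.random.default_rng(0)):
    """Draw lambda ~ Beta(alpha, alpha) and convexly combine a batch with a
    shuffled copy of itself, for both inputs and (one-hot) labels."""
    lam = rng.beta(alpha, alpha)
    idx = rng.permutation(len(x))          # random partner (x_j, y_j) per sample
    x_mix = lam * x + (1 - lam) * x[idx]   # x~_i = lam*x_i + (1-lam)*x_j
    y_mix = lam * y + (1 - lam) * y[idx]   # y~_i = lam*y_i + (1-lam)*y_j
    return x_mix, y_mix

# toy batch: 4 samples with 3 features, one-hot labels over 2 classes
x = np.arange(12, dtype=float).reshape(4, 3)
y = np.eye(2)[[0, 1, 0, 1]]
x_mix, y_mix = mixup(x, y, alpha=1.0)
```

Because the combination is convex, the mixed labels remain valid probability vectors, which is what allows the interpolated targets to be trained against directly.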
To counteract this issue, they propose that by using Mixup to generate mixed samples while retaining clean samples for training, the high-entropy behavior can be mitigated, thereby enhancing the model's robustness to OOD samples. In this context, the approximate data distribution $P_v(x, y)$ can be constructed as

$$P_v(x, y) = \frac{1}{n} \sum_{i=1}^{n} \left[ \delta(x = x_i, y = y_i) + \delta(x = \tilde{x}_i, y = \tilde{y}_i) \right]. \quad (3)$$

Based on this distribution, the following loss function is minimized:

$$H(y_i, p_\theta(y \mid x_i)) + H(\tilde{y}_i, p_\theta(y \mid \tilde{x}_i)), \quad (4)$$

where $H(\cdot, \cdot)$ denotes the standard cross-entropy loss, and $p_\theta$ represents the softmax output of a neural network parameterized by $\theta$. Later, we will introduce the proposed SRM, which integrates RegMixup into SSL by utilizing pseudo-labeling to address the reduction in artificial label purity caused by Mixup.

RegMixMatch

Given a batch of labeled data containing $B$ samples $\mathcal{X} = \{(x_b^l, y_b) : b \in (1, \ldots, B)\}$ and a batch of unlabeled data containing $\mu B$ samples $\mathcal{U} = \{x_b^u : b \in (1, \ldots, \mu B)\}$, where $\mu$ is a hyperparameter that determines the relative sizes of $\mathcal{X}$ and $\mathcal{U}$, the supervised loss for the labeled samples in SSL methods is typically constructed as the standard cross-entropy loss between the model's predictions and the true labels:

$$L_s = \frac{1}{B} \sum_{b=1}^{B} H(y_b, p_m(y \mid x_b^l)), \quad (5)$$

where $p_m(y \mid x)$ is the predicted class distribution produced by the model for input $x$. Despite the variety of data augmentation methods, they all follow consistency regularization when utilizing unlabeled data: the expectation is that the model outputs similar results for differently perturbed versions of the unlabeled data.

Figure 2: Overview of RegMixMatch. (a) shows the main idea of SRM. A weakly augmented image is fed into the model to obtain predictions.
If the prediction confidence surpasses the threshold $\tau_c$, a pseudo label is used to compute a consistency loss against the model's prediction on the corresponding strongly augmented view. If the confidence also exceeds the threshold $\tau_m$, the strongly augmented view and pseudo label are mixed with those of another high-confidence sample to implement Mixup. (b) illustrates why CAM works. When a low-confidence sample is mispredicted, CAM converts noise (red part) from the artificial label into useful information, while Mixup suffers from it. (c) demonstrates how ResizeMix generates mixed images and labels.

Based on this idea, the consistency loss is typically constructed as

$$\frac{1}{\mu B} \sum_{b=1}^{\mu B} \ell_c\left(p_m(y \mid \alpha(x_b^u)),\; p_m(y \mid \alpha(x_b^u))\right), \quad (6)$$

where $\ell_c$ is the consistency loss, which can be cross-entropy or $\ell_2$ loss, and $\alpha(x)$ represents a random augmentation of input $x$, meaning the two terms in Equation 6 can have different values due to the different random augmentations applied.

From RegMixup to Semi-supervised RegMixup

The overall architecture of SRM is depicted in Figure 2(a). In supervised learning, RegMixup employs cross-entropy loss on both clean and mixed samples. However, in SSL, a substantial portion of the data remains unlabeled. By leveraging pseudo-labeling, we retain pseudo labels for unlabeled samples that exceed a specific confidence threshold. In this section, we illustrate how the two terms in Equation 4 are adapted for SSL. For an unlabeled image $x_b^u$, we first apply a weak augmentation $\alpha(x_b^u)$ and compute the model's predicted class distribution $q_b = p_m(y \mid \alpha(x_b^u))$. The class with the highest probability, denoted as $\hat{q}_b = \arg\max(q_b)$, is then assigned as the pseudo label for the corresponding strongly augmented view $A(x_b^u)$.
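The pseudo-label step just described can be sketched in a few lines (a hedged NumPy illustration; the `pseudo_labels` helper and the example probabilities are ours, not from the paper):

```python
import numpy as np

def pseudo_labels(probs_weak, tau_c=0.95):
    """Hard pseudo-labels q_hat = argmax(q_b), plus a mask keeping only
    predictions whose confidence max(q_b) reaches the threshold tau_c."""
    conf = probs_weak.max(axis=1)
    labels = probs_weak.argmax(axis=1)
    mask = conf >= tau_c
    return labels, mask

probs = np.array([[0.97, 0.02, 0.01],   # confident -> pseudo label retained
                  [0.50, 0.30, 0.20]])  # uncertain -> filtered out
labels, mask = pseudo_labels(probs)
```

The mask is what realizes the indicator function in the consistency loss: filtered samples simply contribute nothing to the clean-sample term.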
Consequently, we transform the RegMixup loss function for clean samples into the following form:

$$L_u = \frac{1}{\mu B} \sum_{b=1}^{\mu B} \mathbb{1}\left(\max(q_b) \geq \tau_c\right) H\left(\hat{q}_b, p_m(y \mid A(x_b^u))\right), \quad (7)$$

where $\tau_c$ is the threshold above which a pseudo label is retained, and $\max(q_b)$ represents the maximum predicted probability for the weak augmentation $\alpha(x_b^u)$. To implement Mixup, we also apply thresholding to retain pseudo labels only for high-confidence unlabeled images. Specifically, for a weakly augmented view $\alpha(x_i^u)$ of an unlabeled image within a data batch, we retain the pseudo label $\hat{q}_i$ only if its confidence exceeds the threshold $\tau_m$. High-confidence images from this batch are then grouped into a set denoted as $H = \{x_b^u \mid \max(q_b) > \tau_m\}$. For the loss on mixed data, as shown in Figure 2(a), given a strongly augmented view $A(x_i^u)$ and its pseudo label $\hat{q}_i$ from the set $H$, we randomly select $A(x_j^u)$ and $\hat{q}_j$ from the same set $H$, generating mixed images and labels using the ResizeMix technique for the calculation of $H(\cdot, \cdot)$. Thus, we transform the loss function of RegMixup for mixed samples into

$$L_m = \frac{1}{|H|} \sum_{i=1}^{|H|} H\left(\hat{q}_i \oplus \hat{q}_j,\; p_m(y \mid A(x_i^u) \oplus A(x_j^u))\right), \quad (8)$$

where the symbol $\oplus$ denotes the image or label mixing operation, and $|\cdot|$ represents the number of elements in the set. Note that SRM employs two different thresholds, $\tau_c$ and $\tau_m$. It has been observed that when $\tau_m$ is close to or lower than $\tau_c$, the model is prone to significant confirmation bias, particularly during the early training stages. To mitigate this, $\tau_m$ is set to a higher value. Additionally, as described in the preliminary, $\alpha$ is used to control the mixing intensity; a higher $\alpha$ results in stronger mixing. Our parameter $\alpha_h$, which controls the mixing intensity of high-confidence samples, can be set to a higher value than in previous Mixup-based SSL methods due to the retention of clean samples, allowing for better generalization performance.
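Under the assumptions that cross-entropy is averaged over the batch and that label mixing is a simple convex combination (the paper mixes the *images* with ResizeMix and re-predicts; this sketch only shows the label side and reuses the strong-view predictions to stay self-contained), Equations 7 and 8 might be organized as follows; all names here are illustrative:

```python
import numpy as np

def cross_entropy(target, probs, eps=1e-12):
    """Mean cross-entropy H(target, probs) over a batch of (soft or one-hot) targets."""
    return float(-(target * np.log(probs + eps)).sum(axis=1).mean())

def srm_losses(q_weak, p_strong, n_classes, tau_c=0.95, tau_m=0.999,
               alpha_h=1.0, rng=np.random.default_rng(0)):
    """Illustrative L_u (Eq. 7) and L_m (Eq. 8): a clean-sample consistency
    loss plus a mixed-sample loss over the stricter high-confidence set H."""
    conf = q_weak.max(axis=1)
    one_hot = np.eye(n_classes)[q_weak.argmax(axis=1)]

    # Eq. 7: pseudo-label consistency loss on clean samples with conf >= tau_c
    keep = conf >= tau_c
    l_u = cross_entropy(one_hot[keep], p_strong[keep]) if keep.any() else 0.0

    # Eq. 8: mix labels within H = {conf > tau_m}; lambda ~ Beta(alpha_h, alpha_h)
    h = np.flatnonzero(conf > tau_m)
    if len(h) < 2:
        return l_u, 0.0
    lam = rng.beta(alpha_h, alpha_h)
    partner = rng.permutation(h)                        # random partner drawn from H
    y_mix = lam * one_hot[h] + (1 - lam) * one_hot[partner]
    # NOTE: the paper mixes the images themselves (ResizeMix) and re-predicts;
    # p_strong on the source view is reused here purely for illustration.
    l_m = cross_entropy(y_mix, p_strong[h])
    return l_u, l_m
```

Note how the two thresholds play different roles: `tau_c` gates which samples contribute a clean consistency term at all, while the stricter `tau_m` gates membership in the mixing pool.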
In summary, we select high-confidence samples as clean samples using a high threshold $\tau_c$ and retain their pseudo labels to compute the consistency loss. Concurrently, we set an even higher threshold $\tau_m$ to filter and retain the high-confidence samples and their pseudo labels for implementing Mixup. Consequently, we can transform Equation 4 into the following form to achieve the transition from RegMixup to SRM:

$$L_{rm} = L_u + L_m. \quad (9)$$

Class-Aware Mixup

Pseudo-labeling enables our SRM to effectively leverage high-confidence samples. However, low-confidence samples are often discarded due to the uncertainty in their predictions. Our proposed CAM method facilitates the utilization of these low-confidence samples while maintaining the quality of the artificial labels. Figure 2(b) provides a conceptual overview of CAM. Mixup randomly selects a target image, which can belong to any class, to blend with the source image. For low-confidence unlabeled samples, where the predicted class often diverges from the true class, randomly selecting a target image can lead to confirmation bias: the mixed label may incorporate noisy features from the predicted class while the mixed image does not reflect these features. However, by limiting the samples mixed with low-confidence ones to high-confidence samples sharing the same predicted classes (i.e., class-aware), the noise can be reduced due to the inclusion of predicted class features. Additionally, Figure 1(b) shows that the top-2 accuracy is high, indicating a strong likelihood that the correct class is among the top-2 predicted classes in the mixed labels, making the inclusion of predicted class information sufficient. We denote the complement of $H$ as $H^c$.
Given a strongly augmented view $A(x_i^u)$ of a low-confidence sample in $H^c$ and its softmax output $q_i$, we use the strongly augmented view $A(x_j^u)$ and pseudo label $\hat{q}_j$ of the class-aware sample ($\hat{q}_j = \arg\max(q_i)$) from the high-confidence set $H$, generating mixed images and labels using the ResizeMix technique for the calculation of the $\ell_2$ loss (as shown in Equation 10):

$$L_{cm} = \frac{1}{|H^c|} \sum_{i=1}^{|H^c|} \left\| q_i \oplus \hat{q}_j - p_m(y \mid A(x_i^u) \oplus A(x_j^u)) \right\|_2^2. \quad (10)$$

To reduce the negative impact of uncertain predictions, we use softmax outputs as artificial labels for low-confidence data and apply the $\ell_2$ loss instead of cross-entropy loss. Importantly, the parameter $\alpha_l$ in CAM, which controls the mixing intensity for low-confidence samples, is set higher than $\alpha_h$. This choice helps to minimize the risk of the model overfitting to specific noisy samples, a point that will be further explored in the sensitivity analysis. Although the top-$n$ ($n \geq 3$) accuracy of the model's predictions would be higher, our experiments show that mixing more than two samples does not yield additional improvements. Ultimately, CAM enables the incorporation of information from the top-2 classes into the samples and their artificial labels, thereby facilitating the utilization of low-confidence samples.

Algorithm 1: RegMixMatch Algorithm
1: Input: Labeled batch $\mathcal{X} = \{(x_b^l, y_b) : b \in (1, \ldots, B)\}$, unlabeled batch $\mathcal{U} = \{x_b^u : b \in (1, \ldots, \mu B)\}$, confidence thresholds $\tau_c$ and $\tau_m$, mixing intensities $\alpha_h$ and $\alpha_l$ for high- and low-confidence data
2: Calculate $L_s$ using Equation 5 // Loss for labeled data
3: for $b = 1$ to $\mu B$ do
4:   $q_b = p_m(y \mid \alpha(x_b^u); \theta)$ // Prediction after weak augmentation
5:   $\hat{q}_b = \arg\max(q_b)$ // Pseudo labels for unlabeled data
6: end for
7: Calculate $L_u$ using Equation 7 // Loss for clean unlabeled data
8: Construct high-confidence set $H = \{x_b^u \mid \max(q_b) > \tau_m\}$
9: Construct low-confidence set $H^c = \mathcal{U} \setminus H$
10: Calculate $L_m$ using Equation 8 // Loss for mixed high-confidence unlabeled data
11: $L_{rm} = L_m + L_u$ // Using Mixup as a regularizer
12: Calculate $L_{cm}$ using Equation 10 // Loss for mixed low-confidence unlabeled data
13: return $L_s + L_{rm} + L_{cm}$

Overall, our proposed SRM employs pseudo-labeling and consistency regularization for high-confidence data, applying Mixup to generate mixed samples while preserving unmixed samples for training in SSL. SRM enhances model generalization by leveraging Mixup, while simultaneously addressing the issue of degraded artificial label purity caused by Mixup's high-entropy behavior. For low-confidence samples, our CAM method mixes them with samples from specific classes, effectively utilizing these challenging samples while mitigating confirmation bias. The full algorithm of RegMixMatch is shown in Algorithm 1. The overall loss of RegMixMatch is formulated as

$$L = L_s + L_{rm} + L_{cm}. \quad (11)$$

Experiments

In this section, we present an extensive experimental evaluation of the proposed RegMixMatch method. We assess its performance across a variety of widely used SSL datasets, including CIFAR-10/100 (Krizhevsky, Hinton et al. 2009), SVHN (Netzer et al. 2011), STL-10 (Coates, Ng, and Lee 2011), and ImageNet (Deng et al. 2009), under different labeled data conditions.
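Before turning to the experiments, the class-aware partner selection at the heart of CAM can be sketched as follows (a hedged illustration; `cam_pairs` is our name, and the actual method additionally mixes the strongly augmented views with ResizeMix and applies the ℓ2 loss of Equation 10):

```python
import numpy as np

def cam_pairs(labels, conf, tau_m=0.999):
    """For each low-confidence sample, select a high-confidence partner whose
    pseudo label matches the low-confidence sample's predicted class."""
    high = np.flatnonzero(conf > tau_m)
    low = np.flatnonzero(conf <= tau_m)
    pairs = []
    for i in low:
        same_class = high[labels[high] == labels[i]]
        if len(same_class):
            pairs.append((int(i), int(same_class[0])))  # (low idx, partner idx)
    return pairs

labels = np.array([0, 1, 0, 1])                  # predicted classes
conf = np.array([0.9999, 0.99995, 0.70, 0.60])   # two high-, two low-confidence
pairs = cam_pairs(labels, conf)
```

Restricting partners to the same predicted class is what injects predicted-class features into the mixed image, so the mixed label's noise becomes consistent with the mixed input rather than spurious.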
The experimental results are benchmarked against 13 established SSL algorithms: VAT (Miyato et al. 2018), Mean Teacher (Tarvainen and Valpola 2017), MixMatch (Berthelot et al. 2019b), ReMixMatch (Berthelot et al. 2019a), UDA (Xie et al. 2020), FixMatch (Sohn et al. 2020), Dash (Xu et al. 2021), MPL (Pham et al. 2021), FlexMatch (Zhang et al. 2021), SoftMatch (Chen et al. 2023), FreeMatch (Wang et al. 2023), SequenceMatch (Nguyen 2024), and FlatMatch (Huang et al. 2023).

Table 1: Error rates on CIFAR10/100, SVHN, and STL10 datasets.

| Method | CIFAR10 (10) | CIFAR10 (40) | CIFAR10 (250) | CIFAR10 (4000) | CIFAR100 (400) | CIFAR100 (2500) | CIFAR100 (10000) | SVHN (40) | SVHN (250) | SVHN (1000) | STL10 (40) | STL10 (1000) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VAT | 79.81 | 74.66 | 41.03 | 10.51 | 85.20 | 46.84 | 32.14 | 74.75 | 4.33 | 4.11 | 74.74 | 37.95 |
| Mean Teacher | 76.37 | 70.09 | 37.46 | 8.10 | 81.11 | 45.17 | 31.75 | 36.09 | 3.45 | 3.27 | 71.72 | 33.90 |
| MixMatch | 65.76 | 36.19 | 13.63 | 6.66 | 67.59 | 39.76 | 27.78 | 30.60 | 4.56 | 3.69 | 54.93 | 21.70 |
| ReMixMatch | 20.77 | 9.88 | 6.30 | 4.84 | 42.75 | 26.03 | 20.02 | 24.04 | 6.36 | 5.16 | 32.12 | 6.74 |
| UDA | 34.53 | 10.62 | 5.16 | 4.29 | 46.39 | 27.73 | 22.49 | 5.12 | 1.92 | 1.89 | 37.42 | 6.64 |
| FixMatch | 24.79 | 7.47 | 4.86 | 4.21 | 46.42 | 28.03 | 22.20 | 3.81 | 2.02 | 1.96 | 35.97 | 6.25 |
| Dash | 27.28 | 8.93 | 5.16 | 4.36 | 44.82 | 27.15 | 21.88 | 2.19 | 2.04 | 1.97 | 34.52 | 6.39 |
| MPL | 23.55 | 6.62 | 5.76 | 4.55 | 46.26 | 27.71 | 21.74 | 9.33 | 2.29 | 2.28 | 35.76 | 6.66 |
| FlexMatch | 13.85 | 4.97 | 4.98 | 4.19 | 39.94 | 26.49 | 21.90 | 8.19 | 6.59 | 6.72 | 29.15 | 5.77 |
| SoftMatch | - | 4.91 | 4.82 | 4.04 | 37.10 | 26.66 | 22.03 | 2.33 | - | 2.01 | 21.42 | 5.73 |
| FreeMatch | 8.07 | 4.90 | 4.88 | 4.10 | 37.98 | 26.47 | 21.68 | 1.97 | 1.97 | 1.96 | 15.56 | 5.63 |
| SequenceMatch | - | 4.80 | 4.75 | 4.15 | 37.86 | 25.99 | 20.10 | 1.96 | 1.89 | 1.79 | 15.45 | 5.56 |
| FlatMatch | 15.23 | 5.58 | 4.22 | 3.61 | 38.76 | 25.38 | 19.01 | 2.46 | 1.43 | 1.41 | 16.20 | 4.82 |
| RegMixMatch | 4.35 | 4.24 | 4.21 | 3.38 | 35.27 | 23.78 | 19.41 | 1.81 | 1.77 | 1.79 | 11.74 | 4.66 |

Fully-supervised error rates: 4.62 (CIFAR10), 19.30 (CIFAR100), 2.13 (SVHN), - (STL10).
The fully-supervised results for STL10 are unavailable since we do not have label information for its unlabeled data. The best results are highlighted in bold and the second-best results are underlined. We ran each task three times, and the results with standard deviations are presented in the appendix.

To underscore the rationale behind leveraging Mixup and the effectiveness of RegMixMatch, we compare its training time and classification performance with prior Mixup-based SSL methods (Berthelot et al. 2019a) and the latest state-of-the-art SSL approaches (Wang et al. 2023; Huang et al. 2023). Additionally, we perform comprehensive ablation studies and hyperparameter analysis to validate the design choices behind RegMixMatch. We also present the results of RegMixMatch on pre-trained backbones in the appendix. For a fair comparison, and in line with previous SSL studies, we use WideResNet-28-2 (Zagoruyko and Komodakis 2016) for CIFAR-10 and SVHN, WideResNet-28-8 for CIFAR-100, and ResNet-37-2 (He et al. 2016) for STL-10. All hyperparameter settings are kept consistent with those used in prior work, as detailed in the appendix. For the implementation of RegMixMatch, the threshold $\tau_c$ for the consistency loss is set in accordance with the FreeMatch approach, as adopted in the state-of-the-art FlatMatch method. Additionally, we provide experimental results based on the settings of FixMatch in the appendix. The values of $\tau_m$, $\alpha_h$, and $\alpha_l$ are set to 0.999, 1.0, and 16.0, respectively.

Main Results

Table 1 compares the performance of RegMixMatch with existing methods across various datasets. In 12 commonly used SSL scenarios, RegMixMatch achieves state-of-the-art results in 9 cases, demonstrating significant improvement. In the remaining 3 cases, it secures the second-best performance.
Notably, previous methods either exhibit minimal improvement or perform poorly in specific scenarios, such as certain datasets or particular labeled data quantities. Our experimental results highlight the comprehensiveness of RegMixMatch. Specifically, RegMixMatch achieves an error rate of only 4.35% on CIFAR-10 with 10 labels and 11.74% on STL-10 with 40 labels, surpassing the second-best results by 3.72% and 3.71%, respectively.

Table 2 presents the results of training RegMixMatch on the ImageNet dataset using MAE pre-trained ViT-B (He et al. 2022). RegMixMatch notably outperforms other methods, particularly achieving a 3.66% improvement on ImageNet with 10,000 labels. More comprehensive results and hyperparameter settings are detailed in the appendix.

Table 2: Error rates on ImageNet.

| # Label | FixMatch | FlexMatch | FreeMatch | RegMixMatch |
|---|---|---|---|---|
| 1w | 46.39 | 45.79 | 45.31 | 41.65 |
| 10w | 28.47 | 27.83 | 27.43 | 26.34 |

Figure 3: Efficiency analysis of RegMixMatch (training time in sec./iter. and test error for RegMixMatch, FlatMatch, FreeMatch, and ReMixMatch).

Efficiency Study

For algorithms handling classification tasks, both runtime and classification performance are critical factors. In this section, we compare RegMixMatch with several SSL algorithms on both metrics. Specifically, we select FlatMatch (the previous state-of-the-art method), FreeMatch (benchmark), and ReMixMatch (a Mixup-based SSL method) for comparison. The experiments are conducted on two 24GB RTX 3090 GPUs. Figure 3 presents a comparison of the four algorithms under two settings: CIFAR-100 with 2500 labels and STL-10 with 40 labels. The results show that RegMixMatch not only significantly improves classification performance compared to FlatMatch but also requires less training time, highlighting the efficiency of RegMixMatch.
Moreover, compared to ReMixMatch and FreeMatch, RegMixMatch achieves substantial performance gains with only a minimal increase in runtime, demonstrating the effectiveness of leveraging Mixup.

Figure 4: Ablation study of RegMixMatch. (a) demonstrates that RegMixMatch alleviates the reduced purity caused by Mixup (purity of artificial labels over iterations for RegMixMatch, w/o clean samples, and w/o mixed samples). (b) depicts how CAM improves learning efficiency (accuracy over iterations with and without CAM).

Ablation Study

In addition to utilizing low-confidence samples, a key distinction of RegMixMatch from other Mixup-based SSL methods is its retention of clean samples within SRM during training. To verify the rationale behind this design, we compare RegMixMatch with two scenarios: one where only clean samples are retained during training (w/o mixed samples) and another where only mixed samples are used (w/o clean samples). To further validate the rationale and effectiveness of CAM in leveraging low-confidence data, we also present results for two additional configurations: removing CAM entirely (w/o CAM) and replacing CAM with Mixup (CAM → Mixup). The results for these ablation studies on STL-10 with 40 labels are shown in Table 3.

Table 3 highlights several key findings: 1) Training with only clean samples or only mixed samples yields suboptimal results compared to using both. RegMixMatch leverages SRM not only to enhance generalization performance but also to mitigate the negative effects of high-entropy behavior in SSL, as shown in Figure 4(a). 2) CAM improves both classification accuracy and learning efficiency, as evidenced by the higher accuracy at the same iterations in Figure 4(b). This suggests that CAM facilitates the safer utilization of low-confidence samples, enriching the training process with more informative data. 3) Replacing CAM with Mixup results in performance degradation.
Ablation            | Avg. Error Rate (%)
RegMixMatch         | 11.74
w/o mixed samples   | 15.56 (+3.82)
w/o clean samples   | 15.78 (+4.04)
w/o CAM             | 12.30 (+0.56)
CAM→Mixup           | 12.12 (+0.38)

Table 3: Ablation results of RegMixMatch.

[Figure 5: Parameter sensitivity analysis of RegMixMatch: (a) threshold τm; (b) mixing intensity αh; (c) mixing intensity αl.]

Sensitivity Analysis

RegMixMatch incorporates three key hyperparameters: the threshold τm, which retains high-confidence samples for SRM; the parameter αh, which controls the mixing intensity for high-confidence samples; and the parameter αl, which controls the mixing intensity for low-confidence samples. To assess the specific impact of these hyperparameters on model performance, we conduct experiments on CIFAR-10 with 250 labels.

Confidence Threshold. As illustrated in Figure 5(a), the optimal value for τm is 0.999, which is higher than the threshold τc (0.95). During training, high-confidence clean samples are already used to compute the consistency loss based on pseudo labels, which validates the model's predictions on these samples. If τm were set to a lower value (e.g., 0.95), the mixed samples would validate those same predictions a second time, leading to overconfidence and exacerbating confirmation bias. Thus, τm = 0.999 is chosen to balance the quality and quantity of artificial labels for mixed samples.

Mixing Intensity. In addition to retaining clean samples, RegMixMatch distinguishes itself from previous Mixup-based SSL methods by generating more robust mixed samples to enhance generalization. The parameter α is typically set to a value less than 1.0 in prior work. However, as shown in Figures 5(b) and 5(c), the optimal values for αh and αl in RegMixMatch are 1.0 and 16.0, respectively. A higher α indicates stronger augmentation, improving generalization performance. It is noteworthy that αl ≫ αh. As observed with Mixup, a greater mixing intensity enhances the model's robustness to noise.
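The effect of α on mixing intensity can be checked directly: Mixup draws λ ~ Beta(α, α), so α = 1.0 yields uniform draws on [0, 1], while α = 16.0 concentrates λ tightly around 0.5, producing the heavily blended, intermediate-state samples discussed here. A quick NumPy sketch (the sampling follows standard Mixup; the sample size and 0.2 window are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# lambda ~ Beta(alpha, alpha): alpha = 1.0 is uniform on [0, 1];
# alpha = 16.0 concentrates lambda near 0.5, i.e. much stronger mixing.
for alpha in (1.0, 16.0):
    lam = rng.beta(alpha, alpha, size=100_000)
    strong = np.mean(np.abs(lam - 0.5) < 0.2)  # fraction of heavy mixes
    print(f"alpha={alpha:>4}: mean lambda={lam.mean():.3f}, "
          f"P(|lambda-0.5|<0.2)={strong:.3f}")
```

For α = 1.0 roughly 40% of draws fall in that central window, versus nearly all of them for α = 16.0, which matches the intuition that a large αl keeps low-confidence samples strongly diluted.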
Since the supervisory information for low-confidence samples is more prone to noise, a higher αl generates more intermediate-state samples, increasing the quantity and variability of the training data and thereby reducing the risk of overfitting to specific noisy samples.

Conclusion

In this paper, we first demonstrate that Mixup's high-entropy behavior degrades the purity of artificial labels in SSL. To address this issue, we introduce SRM, a framework that integrates pseudo-labeling with consistency regularization through the application of RegMixup in SSL. We show that SRM effectively mitigates the reduction in artificial-label purity, enhancing Mixup's utility for high-confidence samples. Additionally, we investigate strategies for leveraging low-confidence samples. Specifically, we propose CAM, a method that combines low-confidence samples with samples from specific classes to enhance the quality and utility of artificial labels. Extensive experiments validate the efficiency and effectiveness of RegMixMatch. Our approach represents an initial exploration into the potential of low-confidence samples, and we recommend further in-depth studies as a direction for future research.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62402031) and the Beijing Nova Program (No. 20240484620).

References

Bachman, P.; Alsharif, O.; and Precup, D. 2014. Learning with pseudo-ensembles. Advances in neural information processing systems, 27.
Berthelot, D.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Sohn, K.; Zhang, H.; and Raffel, C. 2019a. ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785.
Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; and Raffel, C. A. 2019b. MixMatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems, 32.
Chapelle, O.; Weston, J.; Bottou, L.; and Vapnik, V. 2000.
Vicinal risk minimization. Advances in neural information processing systems, 13.
Chen, H.; Tao, R.; Fan, Y.; Wang, Y.; Wang, J.; Schiele, B.; Xie, X.; Raj, B.; and Savvides, M. 2023. SoftMatch: Addressing the quantity-quality tradeoff in semi-supervised learning. In The eleventh international conference on learning representations.
Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, 215–223.
Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2019. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 113–123.
Cubuk, E. D.; Zoph, B.; Shlens, J.; and Le, Q. V. 2020. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 702–703.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. IEEE.
DeVries, T.; and Taylor, G. W. 2017. Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552.
He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; and Girshick, R. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16000–16009.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
Huang, Z.; Shen, L.; Yu, J.; Han, B.; and Liu, T. 2023. FlatMatch: Bridging labeled data and unlabeled data with cross-sharpness for semi-supervised learning. Advances in neural information processing systems, 36: 18474–18494.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
Laine, S.; and Aila, T. 2017. Temporal ensembling for semi-supervised learning. In International conference on learning representations.
Lee, D.-H.; et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, 896. Atlanta.
Li, S.; Wang, Z.; Liu, Z.; Wu, D.; and Li, S. Z. 2023. OpenMixup: Open mixup toolbox and benchmark for visual representation learning. arXiv preprint arXiv:2209.04851.
Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2018. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8): 1979–1993.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A. Y.; et al. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, 4. Granada.
Nguyen, K.-B. 2024. SequenceMatch: Revisiting the design of weak-strong augmentations for semi-supervised learning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, 96–106.
Pham, H.; Dai, Z.; Xie, Q.; and Le, Q. V. 2021. Meta pseudo labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11557–11568.
Pinto, F.; Yang, H.; Lim, S. N.; Torr, P.; and Dokania, P. 2022. Using Mixup as a regularizer can surprisingly improve accuracy & out-of-distribution robustness. Advances in neural information processing systems, 35: 14608–14622.
Qin, J.; Fang, J.; Zhang, Q.; Liu, W.; Wang, X.; and Wang, X. 2020. ResizeMix: Mixing data with preserved object information and true labels. arXiv preprint arXiv:2012.11101.
Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C. A.; Cubuk, E. D.; Kurakin, A.; and Li, C.-L. 2020.
FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33: 596–608.
Tarvainen, A.; and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30.
Uddin, A. F. M. S.; Monira, M. S.; Shin, W.; Chung, T.; and Bae, S.-H. 2021. SaliencyMix: A saliency guided data augmentation strategy for better regularization. In International conference on learning representations.
Vapnik, V. 1991. Principles of risk minimization for learning theory. Advances in neural information processing systems, 4.
Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Mitliagkas, I.; Lopez-Paz, D.; and Bengio, Y. 2019. Manifold Mixup: Better representations by interpolating hidden states. In International conference on machine learning, 6438–6447. PMLR.
Wang, Y.; Chen, H.; Heng, Q.; Hou, W.; Fan, Y.; Wu, Z.; Wang, J.; Savvides, M.; Shinozaki, T.; Raj, B.; Schiele, B.; and Xie, X. 2023. FreeMatch: Self-adaptive thresholding for semi-supervised learning. In The eleventh international conference on learning representations.
Wei, C.; Wang, Z.; Yuan, J.; Li, C.; and Chen, S. 2023. Time-frequency based multi-task learning for semi-supervised time series classification. Information sciences, 619: 762–780.
Wen, Y.; Jerfel, G.; Muller, R.; Dusenberry, M. W.; Snoek, J.; Lakshminarayanan, B.; and Tran, D. 2021. Combining ensembles and data augmentation can harm your calibration. In International conference on learning representations.
Xie, Q.; Dai, Z.; Hovy, E.; Luong, T.; and Le, Q. 2020. Unsupervised data augmentation for consistency training. Advances in neural information processing systems, 33: 6256–6268.
Xu, Y.; Shang, L.; Ye, J.; Qian, Q.; Li, Y.-F.; Sun, B.; Li, H.; and Jin, R. 2021. Dash: Semi-supervised learning with dynamic thresholding.
In International conference on machine learning, 11525–11536. PMLR.
Yun, S.; Han, D.; Oh, S. J.; Chun, S.; Choe, J.; and Yoo, Y. 2019. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, 6023–6032.
Zagoruyko, S.; and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
Zhang, B.; Wang, Y.; Hou, W.; Wu, H.; Wang, J.; Okumura, M.; and Shinozaki, T. 2021. FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in neural information processing systems, 34: 18408–18419.
Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2018. mixup: Beyond empirical risk minimization. In International conference on learning representations.