# Unsupervised Data Augmentation for Consistency Training

Qizhe Xie 1,2, Zihang Dai 1,2, Eduard Hovy 2, Minh-Thang Luong 1, Quoc V. Le 1
1 Google Research, Brain Team, 2 Carnegie Mellon University
{qizhex, dzihang, hovy}@cs.cmu.edu, {thangluong, qvl}@google.com

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Semi-supervised learning has lately shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically that produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning. By substituting simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, our method brings substantial improvements across six language and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, our method achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On a standard semi-supervised learning benchmark, CIFAR-10, our method outperforms all previous approaches and achieves an error rate of 5.43 with only 250 examples. Our method also combines well with transfer learning, e.g., when fine-tuning from BERT, and yields improvements in the high-data regime, such as ImageNet, both when there is only 10% labeled data and when a full labeled set with 1.3M extra unlabeled examples is used.¹

¹ Code is available at https://github.com/google-research/uda.

## 1 Introduction

A fundamental weakness of deep learning is that it typically requires a lot of labeled data to work well. Semi-supervised learning (SSL) [1] is one of the most promising paradigms for leveraging unlabeled data to address this weakness. Recent works in SSL are diverse, but those based on consistency training [2, 3, 4, 5] have been shown to work well on many benchmarks. In a nutshell, consistency training methods simply regularize model predictions to be invariant to small noise applied to either input examples [6, 7, 8] or hidden states [2, 4]. This framework makes sense intuitively because a good model should be robust to any small change in an input example or hidden states. Under this framework, methods in this category differ mostly in how and where the noise injection is applied. Typical noise injection methods are additive Gaussian noise, dropout noise or adversarial noise.

In this work, we investigate the role of noise injection in consistency training and observe that advanced data augmentation methods, specifically those that work best in supervised learning [9, 10, 11, 12], also perform well in semi-supervised learning. There is indeed a strong correlation between the performance of data augmentation operations in supervised learning and their performance in consistency training. We hence propose to substitute the traditional noise injection methods with high-quality data augmentation methods in order to improve consistency training. To emphasize the use of better data augmentation in consistency training, we name our method Unsupervised Data Augmentation or UDA.
We evaluate UDA on a wide variety of language and vision tasks. On six text classification tasks, our method achieves significant improvements over state-of-the-art models. Notably, on IMDb, UDA with 20 labeled examples outperforms the state-of-the-art model trained on 1250x more labeled data. On the standard semi-supervised learning benchmarks CIFAR-10 and SVHN, UDA outperforms all existing semi-supervised learning methods by significant margins and achieves error rates of 5.43 and 2.72 with 250 labeled examples respectively. Finally, we also find UDA to be beneficial when there is a large amount of supervised data. For instance, on ImageNet, UDA improves top-1 accuracy from 58.84 to 68.78 with 10% of the labeled set, and from 78.43 to 79.05 when we use the full labeled set and an external dataset with 1.3M unlabeled examples.

Our key contributions and findings can be summarized as follows:

- First, we show that state-of-the-art data augmentations found in supervised learning can also serve as a superior source of noise under the consistency-enforcing semi-supervised framework. See results in Table 1 and Table 2.
- Second, we show that UDA can match and even outperform purely supervised learning that uses orders of magnitude more labeled data. See results in Table 4 and Figure 4. State-of-the-art results for both vision and language tasks are reported in Table 3 and Table 4. The effectiveness of UDA across different training data sizes is highlighted in Figure 4 and Figure 7.
- Third, we show that UDA combines well with transfer learning, e.g., when fine-tuning from BERT (see Table 4), and is effective in the high-data regime, e.g., on ImageNet (see Table 5).
- Lastly, we provide a theoretical analysis of how UDA improves classification performance and of the corresponding role of state-of-the-art augmentation in Section 3.

## 2 Unsupervised Data Augmentation (UDA)

In this section, we first formulate our task and then present the key method and insights behind UDA. Throughout this paper, we focus on classification problems and will use x to denote the input and y to denote its ground-truth prediction target. We are interested in learning a model p_θ(y | x) to predict y based on the input x, where θ denotes the model parameters. Finally, we will use p_L(x) and p_U(x) to denote the distributions of labeled and unlabeled examples respectively, and use f* to denote the perfect classifier that we hope to learn.

### 2.1 Background: Supervised Data Augmentation

Data augmentation aims at creating novel and realistic-looking training data by applying a transformation to an example, without changing its label. Formally, let q(x̂ | x) be the augmentation transformation from which one can draw augmented examples x̂ based on an original example x. For an augmentation transformation to be valid, it is required that any example x̂ ∼ q(x̂ | x) drawn from the distribution shares the same ground-truth label as x. Given a valid augmentation transformation, we can simply minimize the negative log-likelihood on augmented examples.

Supervised data augmentation can be equivalently seen as constructing an augmented labeled set from the original supervised set and then training the model on the augmented set. Therefore, the augmented set needs to provide additional inductive biases to be more effective. How to design the augmentation transformation has, thus, become critical.
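As a concrete illustration of the paragraph above, the following is a minimal PyTorch sketch of supervised data augmentation: draw x̂ ∼ q(x̂ | x) and minimize the negative log-likelihood on the augmented example. The names `model` and `augment` are illustrative placeholders (not part of the released UDA code); `augment` stands for any valid, label-preserving transformation such as crop-and-flip.

```python
import torch.nn.functional as F


def supervised_augmentation_loss(model, augment, x, y):
    """Negative log-likelihood on an augmented labeled mini-batch (sketch).

    `augment` draws x_hat ~ q(x_hat | x) with a label-preserving
    transformation, so the original label y is reused unchanged.
    """
    x_hat = augment(x)                 # x_hat ~ q(x_hat | x)
    logits = model(x_hat)              # unnormalized scores for p_theta(y | x_hat)
    return F.cross_entropy(logits, y)  # -log p_theta(y | x_hat), averaged over the batch
```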
In recent years, there have been significant advancements in the design of data augmentations for NLP [12], vision [10, 11] and speech [13, 14] in supervised settings. Despite the promising results, data augmentation is mostly regarded as the cherry on the cake, providing a steady but limited performance boost, because these augmentations have so far only been applied to a set of labeled examples which is usually of a small size. Motivated by this limitation, via the consistency training framework, we extend the advancement in supervised data augmentation to semi-supervised learning where abundant unlabeled data is available.

### 2.2 Unsupervised Data Augmentation

As discussed in the introduction, a recent line of work in semi-supervised learning has been utilizing unlabeled examples to enforce smoothness of the model. The general form of these works can be summarized as follows:

- Given an input x, compute the output distribution p_θ(y | x) given x and a noised version p_θ(y | x, ε) by injecting a small noise ε. The noise can be applied to x or to hidden states.
- Minimize a divergence metric between the two distributions: D(p_θ(y | x) ‖ p_θ(y | x, ε)).

This procedure enforces the model to be insensitive to the noise ε and hence smoother with respect to changes in the input (or hidden) space. From another perspective, minimizing the consistency loss gradually propagates label information from labeled examples to unlabeled ones.

Figure 1: Training objective for UDA, where M is a model that predicts a distribution of y given x. (The figure depicts the supervised cross-entropy loss computed on labeled data and the unsupervised consistency loss computed between an unlabeled example and its augmented version, where the augmentations include back-translation, RandAugment and TF-IDF word replacement.)

In this work, we are interested in a particular setting where the noise is injected to the input x, i.e., x̂ = q(x, ε), as considered by prior works [7, 4, 6]. But different from existing work, we focus on the unattended question of how the form or quality of the noising operation q can influence the performance of this consistency training framework. Specifically, to enforce consistency, prior methods generally employ simple noise injection methods such as adding Gaussian noise or applying simple input augmentations to noise unlabeled examples. In contrast, we hypothesize that stronger data augmentations in supervised learning can also lead to superior performance when used to noise unlabeled examples in the semi-supervised consistency training framework, since it has been shown that more advanced data augmentations that are more diverse and natural can lead to significant performance gains in the supervised setting. Following this idea, we propose to use a rich set of state-of-the-art data augmentations verified in various supervised settings to inject noise and optimize the same consistency training objective on unlabeled examples. When jointly trained with labeled examples, we utilize a weighting factor λ to balance the supervised cross-entropy loss and the unsupervised consistency training loss, which is illustrated in Figure 1. Formally, the full objective can be written as follows:

$$\mathcal{J}(\theta) = \mathbb{E}_{x_1 \sim p_L(x)}\big[-\log p_\theta(f^*(x_1) \mid x_1)\big] + \lambda\, \mathbb{E}_{x_2 \sim p_U(x)}\, \mathbb{E}_{\hat{x} \sim q(\hat{x} \mid x_2)}\big[\mathrm{CE}\big(p_{\tilde{\theta}}(y \mid x_2)\,\big\|\, p_\theta(y \mid \hat{x})\big)\big] \tag{1}$$

where CE denotes cross entropy, q(x̂ | x) is a data augmentation transformation, and θ̃ is a fixed copy of the current parameters θ, indicating that the gradient is not propagated through θ̃, as suggested by VAT [6]. We set λ to 1 for most of our experiments.
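To make Eq. (1) concrete, here is a minimal PyTorch sketch of the UDA objective on one labeled and one unlabeled mini-batch, assuming a classification model that returns logits. The names `model`, `augment`, `x_l`, `y_l` and `x_u` are our own illustrative placeholders rather than the paper's released TensorFlow implementation; `augment` stands for a strong augmentation q(x̂ | x) such as RandAugment or back-translation.

```python
import torch
import torch.nn.functional as F


def uda_loss(model, augment, x_l, y_l, x_u, lam=1.0):
    """UDA objective of Eq. (1) on one labeled and one unlabeled mini-batch (sketch)."""
    # Supervised cross-entropy on labeled examples.
    sup_loss = F.cross_entropy(model(x_l), y_l)

    # Target distribution p_{theta~}(y | x): computed with a fixed copy of the
    # parameters, i.e. no gradient flows through it.
    with torch.no_grad():
        target = F.softmax(model(x_u), dim=-1)

    # Prediction on the augmented (noised) unlabeled examples x_hat ~ q(x_hat | x).
    log_p_aug = F.log_softmax(model(augment(x_u)), dim=-1)

    # Consistency term between the fixed target and the augmented prediction;
    # since the target is detached, the KL and cross-entropy forms give the same gradients.
    consistency = F.kl_div(log_p_aug, target, reduction="batchmean")

    return sup_loss + lam * consistency
```

As stated above, λ is set to 1 in most of the experiments.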
In practice, in each iteration, we compute the supervised loss on a mini-batch of labeled examples and the consistency loss on a mini-batch of unlabeled data. The two losses are then summed for the final loss. We use a larger batch size for the consistency loss. In the vision domain, simple augmentations including cropping and flipping are applied to labeled examples. To minimize the discrepancy between supervised training and prediction on unlabeled examples, we apply the same simple augmentations to unlabeled examples for computing p_θ(y | x).

Discussion. Before detailing the augmentation operations used in this work, we first provide some intuitions on how more advanced data augmentations can provide extra advantages over the simple ones used in earlier works, from three aspects:

- Valid noise: Advanced data augmentation methods that achieve great performance in supervised learning usually generate realistic augmented examples that share the same ground-truth labels with the original example. Thus, it is safe to encourage consistency between predictions on the original unlabeled example and the augmented unlabeled examples.
- Diverse noise: Advanced data augmentation can generate a diverse set of examples since it can make large modifications to the input example without changing its label, while simple Gaussian noise only makes local changes. Encouraging consistency on a diverse set of augmented examples can significantly improve the sample efficiency.
- Targeted inductive biases: Different tasks require different inductive biases. Data augmentation operations that work well in supervised training essentially provide the missing inductive biases.

### 2.3 Augmentation Strategies for Different Tasks

We now detail the augmentation methods, tailored for different tasks, that we use in this work.

RandAugment for Image Classification. We use a data augmentation method called RandAugment [15], which is inspired by AutoAugment [11]. AutoAugment uses a search method to combine all image processing transformations in the Python Imaging Library (PIL) to find a good augmentation strategy. In RandAugment, we do not use search, but instead uniformly sample from the same set of augmentation transformations in PIL. In other words, RandAugment is simpler and requires no labeled data as there is no need to search for optimal policies. A minimal sketch of this uniform sampling scheme is given below.

Back-translation for Text Classification. When used as an augmentation method, back-translation [16, 17] refers to the procedure of translating an existing example x in language A into another language B and then translating it back into A to obtain an augmented example x̂. As observed by [12], back-translation can generate diverse paraphrases while preserving the semantics of the original sentences, leading to significant performance improvements in question answering. In our case, we use back-translation to paraphrase the training data of our text classification tasks.² We find that the diversity of the paraphrases is important; hence, we employ random sampling with a tunable temperature instead of beam search for the generation. As shown in Figure 2, the paraphrases generated by back-translation are diverse and have similar semantic meanings. More specifically, we use WMT 14 English-French translation models (in both directions) to perform back-translation on each sentence. To facilitate future research, we have open-sourced our back-translation system together with the translation checkpoints.
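Below is a minimal, self-contained sketch of the RandAugment-style sampling described above, using only a few PIL operations for illustration. The reduced pool of four transformations and the magnitude convention are simplifications of the real method [15], which samples from a larger set of PIL operations; the function and variable names are ours, not the released implementation.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# A reduced pool of label-preserving PIL transformations (the real RandAugment
# pool is larger and uses its own magnitude scale).
def autocontrast(img, _): return ImageOps.autocontrast(img)
def rotate(img, m):       return img.rotate(30 * m)                      # angle scaled by magnitude
def color(img, m):        return ImageEnhance.Color(img).enhance(1 + m)  # color balance
def sharpness(img, m):    return ImageEnhance.Sharpness(img).enhance(1 + m)

TRANSFORMS = [autocontrast, rotate, color, sharpness]


def rand_augment(img: Image.Image, n_ops: int = 2, magnitude: float = 0.3) -> Image.Image:
    """Uniformly sample n_ops transformations and apply them in sequence.
    No policy search and no labeled data are required."""
    for op in random.choices(TRANSFORMS, k=n_ops):
        img = op(img, magnitude)
    return img
```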
Figure 2: Augmented examples using back-translation and RandAugment. The back-translation paraphrases of the sentence "Given the low budget and production limitations, this movie is very good." include: "Since it was highly limited in terms of budget, and the production restrictions, the film was cheerful."; "There are few budget items and production limitations to make this film a really good one."; and "Due to the small dollar amount and production limitations the ouest film is very beautiful."

Word replacing with TF-IDF for Text Classification. While back-translation is good at maintaining the global semantics of a sentence, there is little control over which words will be retained. This requirement is important for topic classification tasks, such as DBPedia, in which some keywords are more informative than other words in determining the topic. We therefore propose an augmentation method that replaces uninformative words with low TF-IDF scores while keeping those with high TF-IDF values. We refer readers to Appendix A.2 for a detailed description.

² We also note that while translation uses a labeled dataset, the translation task itself is quite distinct from a text classification task and does not make use of any text classification label. In addition, back-translation is a general data augmentation method that can be applied to many tasks with the same model checkpoints.

### 2.4 Additional Training Techniques

In this section, we present additional techniques targeting some commonly encountered problems.

Confidence-based masking. We find it helpful to mask out examples that the current model is not confident about. Specifically, in each minibatch, the consistency loss term is computed only on examples whose highest probability among classification categories is greater than a threshold β. We set the threshold β to a high value: specifically, β is set to 0.8 for CIFAR-10 and SVHN and 0.5 for ImageNet.

Sharpening Predictions. Since regularizing the predictions to have low entropy has been shown to be beneficial [18, 6], we sharpen predictions when computing the target distribution on unlabeled examples by using a low softmax temperature τ. When combined with confidence-based masking, the loss on unlabeled examples, E_{x∼p_U(x)} E_{x̂∼q(x̂|x)}[CE(p_θ̃(y | x) ‖ p_θ(y | x̂))], on a minibatch B is computed as:

$$\frac{1}{|B|} \sum_{x \in B} \mathbb{I}\Big(\max_{y'} p_{\tilde{\theta}}(y' \mid x) > \beta\Big)\, \mathrm{CE}\big(\tilde{p}_{\tilde{\theta}}(y \mid x) \,\big\|\, p_\theta(y \mid \hat{x})\big), \qquad \tilde{p}_{\tilde{\theta}}(y \mid x) = \frac{\exp(z_y/\tau)}{\sum_{y'} \exp(z_{y'}/\tau)},$$

where I(·) is the indicator function and z_y is the logit of label y for example x. We set τ to 0.4 for CIFAR-10, SVHN and ImageNet.

Domain-relevance Data Filtering. Ideally, we would like to make use of out-of-domain unlabeled data since it is usually much easier to collect, but the class distributions of out-of-domain data are mismatched with those of in-domain data, which can result in performance loss if it is used directly [19]. To obtain data relevant to the domain of the task at hand, we adopt a common technique for detecting out-of-domain data. We use our baseline model trained on the in-domain data to infer the labels of data in a large out-of-domain dataset and pick out the examples that the model is most confident about. Specifically, for each category, we sort all examples based on the classified probabilities of being in that category and select the examples with the highest probabilities.
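The following PyTorch sketch combines confidence-based masking with prediction sharpening for the per-batch consistency term, using the β and τ values quoted above as defaults. It assumes the model outputs logits, with `logits_orig` and `logits_aug` coming from the original and augmented unlabeled examples respectively; the function name and exact normalization are our own illustrative choices, not the released implementation.

```python
import torch
import torch.nn.functional as F


def masked_sharpened_consistency(logits_orig, logits_aug, beta=0.8, tau=0.4):
    """Per-batch consistency loss with confidence-based masking and
    prediction sharpening (sketch).

    logits_orig: model logits on the original unlabeled examples (fixed copy).
    logits_aug:  model logits on their augmented versions.
    """
    with torch.no_grad():
        probs = F.softmax(logits_orig, dim=-1)
        # Indicator I(max_y' p(y'|x) > beta): keep only confident examples.
        mask = (probs.max(dim=-1).values > beta).float()
        # Sharpened target: softmax of the logits divided by a low temperature tau.
        target = F.softmax(logits_orig / tau, dim=-1)

    log_p_aug = F.log_softmax(logits_aug, dim=-1)
    # Cross entropy CE(target || p(y | x_hat)) per example.
    per_example = -(target * log_p_aug).sum(dim=-1)
    # Average over the mini-batch, zeroing out the masked-out examples.
    return (per_example * mask).sum() / logits_orig.size(0)
```

Because the sharpened target is detached, it plays the role of p_θ̃ in Eq. (1): no gradient is propagated through it.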
## 3 Theoretical Analysis

In this section, we theoretically analyze why UDA can improve the performance of a model and the number of labeled examples required to achieve a certain error rate.

Following previous sections, we will use f* to denote the perfect classifier that we hope to learn, use p_U to denote the marginal distribution of the unlabeled data, and use q(x̂ | x) to denote the augmentation distribution. To make the analysis tractable, we make the following simplistic assumptions about the data augmentation transformation:

- In-domain augmentation: data examples generated by data augmentation have non-zero probability under p_U, i.e., p_U(x̂) > 0 for x̂ ∼ q(x̂ | x), x ∼ p_U(x).
- Label-preserving augmentation: data augmentation preserves the label of the original example, i.e., f*(x) = f*(x̂) for x̂ ∼ q(x̂ | x), x ∼ p_U(x).
- Reversible augmentation: the data augmentation operation can be reversed, i.e., if q(x̂ | x) > 0 then q(x | x̂) > 0.

As the first step, we hope to provide an intuitive sketch of our formal analysis. Let us define a graph G_{p_U} where each node corresponds to a data sample x ∈ X and an edge (x̂, x) exists in the graph if and only if q(x̂ | x) > 0. Due to the label-preserving assumption, it is easy to see that examples with different labels must reside in different components (disconnected sub-graphs) of the graph G_{p_U}. Hence, for an N-category classification problem, the graph has N components (sub-graphs) when all examples within each category can be traversed by the augmentation operation. Otherwise, the graph will have more than N components.

Given this construction, notice that for each component C_i of the graph, as long as there is a single labeled example in the component, i.e., (x*, y*) ∈ C_i, one can propagate the label of that node to the rest of the nodes in C_i by traversing C_i via the augmentation operation q(x̂ | x). More importantly, if one only performs supervised data augmentation, one can only propagate the label information to the directly connected neighbors of the labeled node. In contrast, performing unsupervised data augmentation ensures the traversal of the entire sub-graph C_i. This provides the first high-level intuition for how UDA could help.

Taking one step further, in order to find a perfect classifier via such label propagation, it is required that there exists at least one labeled example in each component. In other words, the number of components lower bounds the minimum number of labeled examples needed to learn a perfect classifier. Importantly, the number of components is actually decided by the quality of the augmentation operation: an ideal augmentation should be able to reach all other examples of the same category given a starting instance. This matches our discussion of the benefits of state-of-the-art data augmentation methods in generating more diverse examples: the augmentation diversity effectively leads to more neighbors for each node, and hence reduces the number of components in the graph. Whereas supervised data augmentation only propagates the label information to the directly connected neighbors of the labeled nodes, an advanced, high-quality data augmentation leads to a graph where each node has more neighbors. Effectively, such a graph has more edges and better connectivity. Hence, it is also more likely that this graph will have a smaller number of components. To further illustrate this intuition, Figure 3 provides a comparison between different algorithms.

Figure 3: Prediction results of different settings, where green and red nodes are labeled nodes, white nodes are unlabeled nodes whose labels cannot be determined, and light green and light red nodes are unlabeled nodes whose labels can be correctly determined. The accuracy of each setting is shown in parentheses: (a) supervised learning (4/15); (b) advanced supervised augmentation (9/15); (c) UDA with advanced augmentation (15/15); (d) simple supervised augmentation (7/15); (e) UDA with simple augmentation (10/15).
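To make the graph intuition above concrete, here is a small self-contained Python sketch (our own toy construction, not part of the paper) that counts the components of an augmentation graph: nodes are examples and an edge exists whenever one example can be reached from another by the augmentation. A more diverse augmentation connects more pairs, yields fewer components, and therefore requires fewer labeled examples to cover every component.

```python
def components(nodes, connected):
    """Connected components of an augmentation graph: an edge (x, x_hat)
    exists iff q(x_hat | x) > 0, abstracted here by the `connected` predicate."""
    remaining, comps = set(nodes), []
    while remaining:
        frontier, comp = {remaining.pop()}, set()
        while frontier:
            x = frontier.pop()
            comp.add(x)
            neighbors = {y for y in remaining if connected(x, y)}
            remaining -= neighbors
            frontier |= neighbors
        comps.append(comp)
    return comps


# Toy 1-D example: five unlabeled points of the same class. A "simple"
# augmentation only reaches points within 0.1, a more "diverse" one reaches
# points within 0.5 (both label-preserving by construction).
points = [0.0, 0.3, 0.6, 0.9, 1.2]
simple = components(points, lambda a, b: abs(a - b) <= 0.1)
diverse = components(points, lambda a, b: abs(a - b) <= 0.5)
print(len(simple), len(diverse))  # the diverse augmentation leaves far fewer components
```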
With the intuition described, we state our formal results. Without loss of generality, assume there are k components in the graph. For each component C_i (i = 1, . . . , k), let P_i be the total probability mass that an observed labeled example falls into the i-th component, i.e., P_i = Σ_{x∈C_i} p_L(x). The following theorem characterizes the relationship between the UDA error rate and the amount of labeled examples.

Theorem 1. Under UDA, let Pr(A) denote the probability that the algorithm cannot infer the label of a new test example given m labeled examples from p_L. Pr(A) is given by

$$\Pr(A) = \sum_{i=1}^{k} P_i\, (1 - P_i)^m.$$

In addition, O(k/δ) labeled examples can guarantee an error rate of O(δ), i.e., m = O(k/δ) ⟹ Pr(A) = O(δ).

Proof. Please see Appendix C for details.

From the theorem, we can see that the number of components, i.e., k, directly governs the amount of labeled data required to reach a desired performance. As we have discussed above, the number of components effectively relies on the quality of the augmentation function, where better augmentation functions result in fewer components. This echoes our discussion of the benefits of state-of-the-art data augmentation operations in generating more diverse examples. Hence, with state-of-the-art augmentation operations, UDA is able to achieve good performance using fewer labeled examples.

## 4 Experiments

In this section, we evaluate UDA on a variety of language and vision tasks. For language, we rely on six text classification benchmark datasets, including IMDb, Yelp-2, Yelp-5, Amazon-2 and Amazon-5 sentiment classification and DBPedia topic classification [20, 21]. For vision, we employ two smaller datasets, CIFAR-10 [22] and SVHN [23], which are often used to compare semi-supervised algorithms, as well as ImageNet [24] of a larger scale to test the scalability of UDA. For ablation studies and experiment details, we refer readers to Appendix B and Appendix E.

### 4.1 Correlation between Supervised and Semi-supervised Performances

As the first step, we try to verify the fundamental idea of UDA, i.e., that there is a positive correlation between a data augmentation's effectiveness in supervised learning and in semi-supervised learning. Based on Yelp-5 (a language task) and CIFAR-10 (a vision task), we compare the performance of different data augmentation methods in either fully supervised or semi-supervised settings. For Yelp-5, apart from back-translation, we include a simpler method, Switchout [25], which replaces a token with a random token uniformly sampled from the vocabulary. For CIFAR-10, we compare RandAugment with two simpler methods: (1) cropping & flipping augmentation and (2) Cutout. Based on this setting, Table 1 and Table 2 exhibit a strong correlation of an augmentation's effectiveness between supervised and semi-supervised settings.

| Augmentation | Sup (50k examples) | Semi-Sup (4k examples) |
|---|---|---|
| Crop & flip | 5.36 | 10.94 |
| Cutout | 4.42 | 5.43 |
| RandAugment | 4.23 | 4.32 |

Table 1: Error rates on CIFAR-10.

| Augmentation | Sup (650k examples) | Semi-Sup (2.5k examples) |
|---|---|---|
| ✗ (none) | 38.36 | 50.80 |
| Switchout | 37.24 | 43.38 |
| Back-translation | 36.71 | 41.35 |

Table 2: Error rates on Yelp-5.
This validates our idea that stronger data augmentations found in supervised learning also lead to larger gains when applied to the semi-supervised learning setting.

### 4.2 Algorithm Comparison on Vision Semi-supervised Learning Benchmarks

With the correlation established above, the next question we ask is how well UDA performs compared to existing semi-supervised learning algorithms. To answer this question, we focus on the most commonly used semi-supervised learning benchmarks, CIFAR-10 and SVHN.

Vary the size of labeled data. Firstly, we follow the settings in [19] and employ Wide-ResNet-28-2 [26, 27] as the backbone model and evaluate UDA with varied supervised data sizes. Specifically, we compare UDA with two highly competitive baselines: (1) Virtual adversarial training (VAT) [6], an algorithm that generates adversarial noise on the input, and (2) MixMatch [28], a parallel work that combines previous advancements in semi-supervised learning. The comparison is shown in Figure 4 with two key observations. First, UDA consistently outperforms the two baselines given different sizes of labeled data. Moreover, the performance difference between UDA and VAT shows the superiority of data-augmentation-based noise. The difference between UDA and VAT is essentially the noise process: while the noise produced by VAT often contains high-frequency artifacts that do not exist in real images, data augmentation mostly generates diverse and realistic images.

Figure 4: Comparison with two semi-supervised learning methods on (a) CIFAR-10 and (b) SVHN with varied numbers of labeled examples.

Vary model architecture. Next, we directly compare UDA with previously published results under different model architectures. Following previous work, 4k and 1k labeled examples are used for CIFAR-10 and SVHN respectively. As shown in Table 3, given the same architecture, UDA outperforms all published results by significant margins and nearly matches the fully supervised performance, which uses 10x more labeled examples. This shows the huge potential of state-of-the-art data augmentations under the consistency training framework in the vision domain.

| Method | Model | # Param | CIFAR-10 (4k) | SVHN (1k) |
|---|---|---|---|---|
| Π-Model [4] | Conv-Large | 3.1M | 12.36 ± 0.31 | 4.82 ± 0.17 |
| Mean Teacher [5] | Conv-Large | 3.1M | 12.31 ± 0.28 | 3.95 ± 0.19 |
| VAT + EntMin [6] | Conv-Large | 3.1M | 10.55 ± 0.05 | 3.86 ± 0.11 |
| SNTG [29] | Conv-Large | 3.1M | 10.93 ± 0.14 | 3.86 ± 0.27 |
| ICT [30] | Conv-Large | 3.1M | 7.29 ± 0.02 | 3.89 ± 0.04 |
| Pseudo-Label [31] | WRN-28-2 | 1.5M | 16.21 ± 0.11 | 7.62 ± 0.29 |
| LGA + VAT [32] | WRN-28-2 | 1.5M | 12.06 ± 0.19 | 6.58 ± 0.36 |
| ICT [30] | WRN-28-2 | 1.5M | 7.66 ± 0.17 | 3.53 ± 0.07 |
| MixMatch [28] | WRN-28-2 | 1.5M | 6.24 ± 0.06 | 2.89 ± 0.06 |
| Mean Teacher [5] | Shake-Shake | 26M | 6.28 ± 0.15 | - |
| Fast-SWA [33] | Shake-Shake | 26M | 5.0 | - |
| MixMatch [28] | WRN | 26M | 4.95 ± 0.08 | - |
| UDA (RandAugment) | WRN-28-2 | 1.5M | 4.32 ± 0.08 | 2.23 ± 0.07 |
| UDA (RandAugment) | Shake-Shake | 26M | 3.7 | - |
| UDA (RandAugment) | PyramidNet | 26M | 2.7 | - |

Table 3: Comparison between methods using different models, where PyramidNet is used with ShakeDrop regularization.
On CIFAR-10, with only 4,000 labeled examples, UDA matches the performance of the fully supervised Wide-ResNet-28-2 and PyramidNet+ShakeDrop, which have error rates of 5.4 and 2.7 respectively when trained on 50,000 examples without RandAugment. On SVHN, UDA also matches the performance of our fully supervised model trained on 73,257 examples without RandAugment, which has an error rate of 2.84.

### 4.3 Evaluation on Text Classification Datasets

Next, we further evaluate UDA in the language domain. Moreover, in order to test whether UDA can be combined with the success of unsupervised representation learning, such as BERT [34], we further consider four initialization schemes: (a) random Transformer; (b) BERT_BASE; (c) BERT_LARGE; (d) BERT_FINETUNE: BERT_LARGE fine-tuned on in-domain unlabeled data.³ Under each of these four initialization schemes, we compare the performances with and without UDA.

Fully supervised baseline:

| Datasets (# Sup examples) | IMDb (25k) | Yelp-2 (560k) | Yelp-5 (650k) | Amazon-2 (3.6m) | Amazon-5 (3m) | DBpedia (560k) |
|---|---|---|---|---|---|---|
| Pre-BERT SOTA | 4.32 | 2.16 | 29.98 | 3.32 | 34.81 | 0.70 |
| BERT_LARGE | 4.51 | 1.89 | 29.32 | 2.63 | 34.17 | 0.64 |

Semi-supervised setting:

| Initialization | UDA | IMDb (20) | Yelp-2 (20) | Yelp-5 (2.5k) | Amazon-2 (20) | Amazon-5 (2.5k) | DBpedia (140) |
|---|---|---|---|---|---|---|---|
| Random | ✗ | 43.27 | 40.25 | 50.80 | 45.39 | 55.70 | 41.14 |
| Random | ✓ | 25.23 | 8.33 | 41.35 | 16.16 | 44.19 | 7.24 |
| BERT_BASE | ✗ | 18.40 | 13.60 | 41.00 | 26.75 | 44.09 | 2.58 |
| BERT_BASE | ✓ | 5.45 | 2.61 | 33.80 | 3.96 | 38.40 | 1.33 |
| BERT_LARGE | ✗ | 11.72 | 10.55 | 38.90 | 15.54 | 42.30 | 1.68 |
| BERT_LARGE | ✓ | 4.78 | 2.50 | 33.54 | 3.93 | 37.80 | 1.09 |
| BERT_FINETUNE | ✗ | 6.50 | 2.94 | 32.39 | 12.17 | 37.32 | - |
| BERT_FINETUNE | ✓ | 4.20 | 2.05 | 32.08 | 3.50 | 37.12 | - |

Table 4: Error rates on text classification datasets. In the fully supervised settings, the pre-BERT SOTAs include ULMFiT [35] for Yelp-2 and Yelp-5, DPCNN [36] for Amazon-2 and Amazon-5, and Mixed VAT [37] for IMDb and DBPedia. All of our experiments use a sequence length of 512.

³ One exception is that we do not pursue BERT_FINETUNE on DBPedia, as fine-tuning BERT on DBPedia does not yield further performance gains. This is probably due to the fact that DBPedia is based on Wikipedia while BERT is already trained on the whole Wikipedia corpus.

The results are presented in Table 4, where we would like to emphasize three observations. First, even with very few labeled examples, UDA can offer decent or even competitive performance compared to the SOTA model trained with the full supervised data. Particularly, on binary sentiment analysis tasks, with only 20 supervised examples, UDA outperforms the previous SOTA trained with full supervised data on IMDb and is competitive on Yelp-2 and Amazon-2. Second, UDA is complementary to transfer learning / representation learning. As we can see, when initialized with BERT and further fine-tuned on in-domain data, UDA can still significantly reduce the error rate from 6.50 to 4.20 on IMDb. Finally, we also note that for five-category sentiment classification tasks, there still exists a clear gap between UDA with 500 labeled examples per class and BERT trained on the entire supervised set. Intuitively, five-category sentiment classifications are much more difficult than their binary counterparts. This suggests room for further improvement in the future.

### 4.4 Scalability Test on the ImageNet Dataset

Then, to evaluate whether UDA can scale to problems of a larger scale and a higher difficulty, we now turn to the ImageNet dataset with ResNet-50 being the underlying architecture. Specifically, we consider two experiment settings with different natures:

- We use 10% of the supervised data of ImageNet while using all other data as unlabeled data. As a result, the unlabeled examples are entirely in-domain.
- In the second setting, we keep all images in ImageNet as supervised data. Then, we use the domain-relevance data filtering method to filter out 1.3M images from JFT [38, 39]. Hence, the unlabeled set is not necessarily in-domain.
The results are summarized in Table 5. In both the 10% and the full data settings, UDA consistently brings significant gains compared to the supervised baseline. This shows UDA is not only able to scale but also able to utilize out-of-domain unlabeled examples to improve model performance. In parallel to our work, S4L [40] and CPC [41] also show significant improvements on ImageNet.

| Methods | SSL | 10% | 100% |
|---|---|---|---|
| ResNet-50 | ✗ | 55.09 / 77.26 | 77.28 / 93.73 |
| w. RandAugment | ✗ | 58.84 / 80.56 | 78.43 / 94.37 |
| UDA (RandAugment) | ✓ | 68.78 / 88.80 | 79.05 / 94.49 |

Table 5: Top-1 / top-5 accuracy on ImageNet with 10% and 100% of the labeled set. We use image sizes 224 and 331 for the 10% and 100% experiments respectively.

## 5 Related Work

Existing works in consistency training do make use of data augmentation [4, 7]; however, they only apply weak augmentation methods such as random translations and cropping. In parallel to our work, ICT [30] and MixMatch [28] also show improvements for semi-supervised learning. These methods employ mixup [42] on top of simple augmentations such as flipping and cropping; instead, UDA emphasizes the use of state-of-the-art data augmentations, leading to significantly better results on CIFAR-10 and SVHN. In addition, UDA is also applicable to the language domain and can scale well to more challenging vision datasets, such as ImageNet.

Other works in the consistency training family mostly differ in how the noise is defined: Pseudo-ensemble [2] directly applies Gaussian noise and dropout noise; VAT [6, 43] defines the noise by approximating the direction of change in the input space that the model is most sensitive to; Cross-view training [8] masks out part of the input data. Apart from enforcing consistency on the input examples and the hidden representations, another line of research enforces consistency in the model parameter space. Works in this category include Mean Teacher [5], fast-Stochastic Weight Averaging [33] and Smooth Neighbors on Teacher Graphs [29]. For a complete version of related work, please refer to Appendix D.

## 6 Conclusion

In this paper, we show that data augmentation and semi-supervised learning are well connected: better data augmentation can lead to significantly better semi-supervised learning. Our method, UDA, employs state-of-the-art data augmentations found in supervised learning to generate diverse and realistic noise and enforces the model to be consistent with respect to this noise. For text, UDA combines well with representation learning, e.g., BERT. For vision, UDA outperforms prior works by a clear margin and nearly matches the performance of fully supervised models trained on the full labeled sets, which are one order of magnitude larger. We hope that UDA will encourage future research to transfer advanced supervised augmentations to the semi-supervised setting for different tasks.

## Acknowledgements

We want to thank Hieu Pham, Adams Wei Yu, Zhilin Yang and Ekin Dogus Cubuk for their tireless help to the authors at different stages of this project, and thank Colin Raffel for pointing out the connections between our work and previous works. We also would like to thank Olga Wichrowska, Barret Zoph, Jiateng Xie, Guokun Lai, Yulun Du, Chen Dan, David Berthelot, Avital Oliver, Trieu Trinh, Ran Zhao, Ola Spyra, Brandon Yang, Daiyi Peng, Andrew Dai, Samy Bengio, Jeff Dean and the Google Brain team for insightful discussions and support of this work. Lastly, we thank the anonymous reviewers for their valuable feedback.
## Broader Impact

This work shows that it is possible to achieve great performance with limited labeled data. Hence, groups and institutes with limited budgets for annotating data may benefit from this research. To the best of our knowledge, nobody will be put at a disadvantage by this research. Our method does not leverage biases in the data. Our tasks include standard benchmarks such as IMDb, CIFAR-10, SVHN and ImageNet.

## References

[1] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.

[2] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, pages 3365–3373, 2014.

[3] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.

[4] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.

[5] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195–1204, 2017.

[6] Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[7] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1163–1171, 2016.

[8] Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. Semi-supervised sequence modeling with cross-view training. arXiv preprint arXiv:1809.08370, 2018.

[9] Patrice Y Simard, Yann A LeCun, John S Denker, and Bernard Victorri. Transformation invariance in pattern recognition: tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239–274. Springer, 1998.

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[11] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

[12] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541, 2018.

[13] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

[14] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.

[15] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719, 2019.

[16] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709, 2015.
[17] Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381, 2018.

[18] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.

[19] Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018.

[20] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics, 2011.

[21] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.

[22] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[23] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[24] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[25] Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. Switchout: an efficient data augmentation algorithm for neural machine translation. arXiv preprint arXiv:1808.07512, 2018.

[26] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. BMVC, 2016.

[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[28] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.

[29] Yucen Luo, Jun Zhu, Mengxi Li, Yong Ren, and Bo Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8896–8905, 2018.

[30] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825, 2019.

[31] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.

[32] Jacob Jackson and John Schulman. Semi-supervised learning by label gradient alignment. arXiv preprint arXiv:1902.02336, 2019.

[33] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. ICLR, 2019.

[34] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[35] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 328–339, 2018.
[36] Rie Johnson and Tong Zhang. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 562–570, 2017.

[37] Devendra Singh Sachan, Manzil Zaheer, and Ruslan Salakhutdinov. Revisiting lstm networks for semi-supervised text classification via mixed objective function. 2018.

[38] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[39] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.

[40] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, 2019.

[41] Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.

[42] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[43] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.

[44] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.

[45] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.

[46] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.

[47] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[48] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[49] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

[50] Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William W Cohen. Semi-supervised qa with generative domain-adaptive nets. arXiv preprint arXiv:1702.02206, 2017.

[51] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[52] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov. Good semi-supervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, pages 6510–6520, 2017.

[53] Alex Hernández-García and Peter König. Data augmentation instead of explicit regularization. arXiv preprint arXiv:1806.03852, 2018.
[54] Xuanli He, Gholamreza Haffari, and Mohammad Norouzi. Sequence to sequence mixture model for diverse machine translation. arXiv preprint arXiv:1810.07391, 2018.

[55] Tianxiao Shen, Myle Ott, Michael Auli, and Marc'Aurelio Ranzato. Mixture models for diverse machine translation: Tricks of the trade. arXiv preprint arXiv:1902.07816, 2019.

[56] Wouter Kool, Herke van Hoof, and Max Welling. Stochastic beams and where to find them: The gumbel-top-k trick for sampling sequences without replacement. arXiv preprint arXiv:1903.06059, 2019.

[57] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.

[58] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[59] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[60] Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087, 2015.

[61] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

[62] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openaiassets/research-covers/languageunsupervised/language understanding paper.pdf, 2018.

[63] Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940, 2019.

[64] Robert Stanforth, Alhussein Fawzi, Pushmeet Kohli, et al. Are labels required for improving adversarial robustness? arXiv preprint arXiv:1905.13725, 2019.

[65] Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, and Liwei Wang. Adversarially robust generalization just requires more unlabeled data. arXiv preprint arXiv:1906.00555, 2019.

[66] Amir Najafi, Shin-ichi Maeda, Masanori Koyama, and Takeru Miyato. Robustness to adversarial perturbations in learning from incomplete data. arXiv preprint arXiv:1905.13021, 2019.

[67] Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John C Duchi. Unlabeled data improves adversarial robustness. arXiv preprint arXiv:1905.13736, 2019.

[68] Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1558–1567. JMLR.org, 2017.

[69] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6210–6219, 2019.

[70] Davis Liang, Zhiheng Huang, and Zachary C Lipton. Learning noise-invariant representations for robust speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 56–63. IEEE, 2018.

[71] Julian Salazar, Davis Liang, Zhiheng Huang, and Zachary C Lipton. Invariant representation learning for robust deep networks. In Workshop on Integration of Deep Learning Theories, NeurIPS, 2018.
[72] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52. ACM, 2015.