# Cycle Self-Training for Domain Adaptation

Hong Liu (Dept. of Electronic Engineering, Tsinghua University) hongliu9903@gmail.com · Jianmin Wang (School of Software, BNRist, Tsinghua University) jimwang@tsinghua.edu.cn · Mingsheng Long (School of Software, BNRist, Tsinghua University) mingsheng@tsinghua.edu.cn

**Abstract.** Mainstream approaches for unsupervised domain adaptation (UDA) learn domain-invariant representations to narrow the domain shift, which are empirically effective but theoretically challenged by hardness or impossibility theorems. Recently, self-training has been gaining momentum in UDA, which exploits unlabeled target data by training with target pseudo-labels. However, as corroborated in this work, under distributional shift the pseudo-labels can be unreliable, in terms of their large discrepancy from the target ground truth. In this paper, we propose Cycle Self-Training (CST), a principled self-training algorithm that explicitly enforces pseudo-labels to generalize across domains. CST cycles between a forward step and a reverse step until convergence. In the forward step, CST generates target pseudo-labels with a source-trained classifier. In the reverse step, CST trains a target classifier using the target pseudo-labels, and then updates the shared representations to make the target classifier perform well on the source data. We introduce the Tsallis entropy as a confidence-friendly regularization to improve the quality of target pseudo-labels. We analyze CST theoretically under realistic assumptions, and provide hard cases where CST recovers the target ground truth while both invariant feature learning and vanilla self-training fail. Empirical results indicate that CST significantly improves over the state of the art on visual recognition and sentiment analysis benchmarks.

## 1 Introduction

Transferring knowledge from a source domain with rich supervision to an unlabeled target domain is an important yet challenging problem.
Since deep neural networks are known to be sensitive to subtle changes in the underlying distributions [70], models trained on one labeled dataset often fail to generalize to another unlabeled dataset [58, 1]. Unsupervised domain adaptation (UDA) addresses the challenge of distributional shift by adapting the source model to the unlabeled target data [50, 43]. The mainstream paradigm for UDA is feature adaptation, a.k.a. domain alignment. By reducing the distance between the source and target feature distributions, these methods learn invariant representations to facilitate knowledge transfer between domains [34, 22, 36, 54, 37, 73], with successful applications in various areas such as computer vision [63, 27, 77] and natural language processing [75, 49]. Despite their popularity, the impossibility theories [6] uncovered intrinsic limitations of learning invariant representations when it comes to label shift [74, 32] and shift in the support of domains [29].

Recently, self-training (a.k.a. pseudo-labeling) [21, 78, 30, 32, 47, 68] has been gaining momentum as a promising alternative to feature adaptation. Originally tailored to semi-supervised learning, self-training generates pseudo-labels for unlabeled data, and jointly trains the model with source labels and target pseudo-labels [31, 39, 30]. However, the distributional shift in UDA makes pseudo-labeling more difficult. Directly using all pseudo-labels is risky due to accumulated error and even trivial solutions [14]. Thus previous works tailor self-training to UDA by selecting trustworthy pseudo-labels. Using confidence thresholds or reweighting, recent works try to alleviate the negative effect of domain shift in standard self-training [78, 47], but they can be brittle and require expensive tweaking of the threshold or weight for different tasks, and their performance gain is still inconsistent.

(Corresponding author: Mingsheng Long (mingsheng@tsinghua.edu.cn). 35th Conference on Neural Information Processing Systems (NeurIPS 2021).)

Figure 1: Standard self-training vs. cycle self-training. In standard self-training, we generate target pseudo-labels with a source model, and then train the model with both source ground-truths and target pseudo-labels. In cycle self-training, we train a target classifier with target pseudo-labels in the inner loop, and make the target classifier perform well on the source domain by updating the shared representations in the outer loop.

In this work, we first analyze the quality of pseudo-labels with or without domain shift to delve deeper into the difficulty of standard self-training in UDA. On popular benchmark datasets, when the source and target are the same, our analysis indicates that the pseudo-label distribution is almost identical to the ground-truth distribution. However, with distributional shift, their discrepancy can be very large, with examples of several classes mostly misclassified into other classes. We also study the difficulty of selecting correct pseudo-labels with popular criteria under domain shift. Although entropy and confidence are reasonable selection criteria for correct pseudo-labels without domain shift, domain shift makes their accuracy decrease sharply. Our analysis shows that domain shift makes pseudo-labels unreliable and that self-training on selected target instances with accurate pseudo-labels is less successful. Thereby, more principled improvements of standard self-training should be tailored to UDA and address the domain shift explicitly.
In this work, we propose Cycle Self-Training (CST), a principled self-training approach to UDA, which overcomes the limitations of standard self-training (see Figure 1). Different from previous works that select target pseudo-labels with hard-to-tweak protocols, CST learns to generalize the pseudo-labels across domains. Specifically, CST cycles between the use of target pseudo-labels to train a target classifier, and the update of shared representations to make the target classifier perform well on the source data. In contrast to the standard Gibbs entropy that makes the target predictions over-confident, we propose a confidence-friendly uncertainty measure based on the Tsallis entropy in information theory, which adaptively minimizes the uncertainty without manual tuning or thresholds. Our method is simple and generally applicable to vision and language tasks with various backbones. We empirically evaluate our method on a series of standard UDA benchmarks. Results indicate that CST outperforms previous state-of-the-art methods in 21 out of 25 tasks for object recognition and sentiment classification. Theoretically, we prove that the minimizer of the CST objective is endowed with general guarantees of target performance. We also study hard cases on specific distributions, showing that CST recovers target ground-truths while both feature adaptation and standard self-training fail.

## 2 Preliminaries

We study unsupervised domain adaptation (UDA). Consider a source distribution $P$ and a target distribution $Q$ over the input-label space $\mathcal{X} \times \mathcal{Y}$. We have access to $n_s$ labeled i.i.d. samples $\hat{P} = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ from $P$ and $n_t$ unlabeled i.i.d. samples $\hat{Q} = \{x_i^t\}_{i=1}^{n_t}$ from $Q$. The model $f$ comprises a feature extractor $h_\phi$ parametrized by $\phi$ and a head (linear classifier) $g_\theta$ parametrized by $\theta$, i.e. $f_{\theta,\phi}(x) = g_\theta(h_\phi(x))$. The loss function is $\ell(\cdot, \cdot)$. Denote by $L_P(\theta, \phi) := \mathbb{E}_{(x,y)\sim P}\,\ell(f_{\theta,\phi}(x), y)$ the expected error on $P$. Similarly, we use $L_{\hat{P}}(\theta, \phi)$ to denote the empirical error on the dataset $\hat{P}$.
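To make the notation concrete, here is a toy numpy sketch of the model composition $f_{\theta,\phi} = g_\theta \circ h_\phi$ and the empirical 0-1 error; the shapes and the ReLU feature map are our illustrative choices, not part of the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x, phi):
    """Feature extractor h_phi (here a linear map followed by ReLU)."""
    return np.maximum(x @ phi, 0.0)

def g(z, theta):
    """Linear head g_theta returning class scores."""
    return z @ theta

def empirical_error(X, y, theta, phi):
    """Empirical error L_{P_hat}(theta, phi) with the 0-1 loss:
    the mean misclassification rate on the dataset (X, y)."""
    pred = g(h(X, phi), theta).argmax(axis=1)
    return (pred != y).mean()

X = rng.normal(size=(32, 5))          # 32 samples, 5 input dims
y = rng.integers(0, 3, size=32)       # 3-way labels
phi = rng.normal(size=(5, 8))         # feature extractor parameters
theta = rng.normal(size=(8, 3))       # head parameters
err = empirical_error(X, y, theta, phi)  # a value in [0, 1]
```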
We discuss two mainstream UDA methods and their formulations: feature adaptation and self-training.

**Feature Adaptation** trains the model $f$ on the source dataset $\hat{P}$, and simultaneously matches the source and target distributions in the representation space $\mathcal{Z} = h(\mathcal{X})$:

$$\min_{\theta,\phi}\; L_{\hat{P}}(\theta, \phi) + d\big(h_\sharp \hat{P},\, h_\sharp \hat{Q}\big). \tag{1}$$

Figure 2: Analysis of pseudo-labels under domain shift on VisDA-2017. Left: Pseudo-label distributions with and without domain shift. Middle: Changes of pseudo-label distributions throughout training. Right: Quality of pseudo-labels under different pseudo-label selection criteria.

Here, $h_\sharp \hat{P}$ denotes the pushforward distribution of $\hat{P}$, and $d(\cdot, \cdot)$ is some distribution distance. For instance, Long et al. [34] used the maximum mean discrepancy $d_{\mathrm{MMD}}$, and Ganin et al. [22] approximated the $\mathcal{H}\Delta\mathcal{H}$-distance $d_{\mathcal{H}\Delta\mathcal{H}}$ [7] with adversarial training. Despite its pervasiveness, recent works have shown the intrinsic limitations of feature adaptation in real-world situations [6, 74, 33, 32, 29].

**Self-Training** is considered a promising alternative to feature adaptation. In this work we mainly focus on pseudo-labeling [31, 30]. Stemming from semi-supervised learning, standard self-training trains a source model $f_s$ on the source dataset $\hat{P}$: $\min_{\theta_s,\phi_s} L_{\hat{P}}(\theta_s, \phi_s)$. The target pseudo-labels are then generated by $f_s$ on the target dataset $\hat{Q}$. To leverage unlabeled target data, self-training trains the model on the source and target datasets together with source ground-truths and target pseudo-labels:

$$\min_{\theta,\phi}\; L_{\hat{P}}(\theta, \phi) + \mathbb{E}_{x\sim \hat{Q}}\,\ell\Big(f_{\theta,\phi}(x),\, \arg\max_i \{f_{\theta_s,\phi_s}(x)[i]\}\Big). \tag{2}$$

Self-training also uses label sharpening as a standard protocol [31, 57]. Another popular variant of pseudo-labeling is the teacher-student model [4, 61], which iteratively improves the quality of pseudo-labels by alternately replacing $\theta_s$ and $\phi_s$ with the $\theta$ and $\phi$ of the previous iteration.
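The pseudo-labeling step of equation 2, including sharpening to hard labels, can be sketched as follows (a generic softmax classifier; all names and the toy logits are illustrative):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the class dimension."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pseudo_labels(source_model_logits):
    """Label sharpening: take the argmax of the source model's softmax
    output as a hard pseudo-label for each unlabeled target example."""
    probs = softmax(source_model_logits)
    return probs.argmax(axis=1)

# Toy target batch: logits from a hypothetical source model f_{theta_s, phi_s}.
logits = np.array([[2.0, 0.1, -1.0],
                   [0.2, 1.5,  0.3]])
y_prime = pseudo_labels(logits)  # hard labels used in the target term of eq. (2)
```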
### 2.1 Limitations of Standard Self-Training

Standard self-training with pseudo-labels uses unlabeled data efficiently for semi-supervised learning [31, 39, 57]. Here we carry out exploratory studies on the popular VisDA-2017 [45] dataset using ResNet-50 backbones. We find that domain shift makes the pseudo-labels biased towards several classes and thereby unreliable in UDA. See Appendix C.1 for details and results on more datasets.

**Pseudo-label distributions with or without domain shift.** We resample the original VisDA-2017 to simulate different relationships between the source and target domains: 1) i.i.d., 2) covariate shift, and 3) label shift. We train the model on the three variants of the source dataset and use it to generate target pseudo-labels. We show the distributions of target ground-truths and pseudo-labels in Figure 2 (Left). When the source and target distributions are identical, the distribution of pseudo-labels is almost the same as that of the ground-truths, indicating the reliability of pseudo-labels. In contrast, when exposed to label shift or covariate shift, the distribution of pseudo-labels is significantly different from the target ground-truths. Note that classes 2, 7, 8 and 12 appear rarely in the target pseudo-labels in the covariate shift setting, indicating that the pseudo-labels are biased towards several classes due to domain shift. Self-training with these pseudo-labels is risky since it may lead to misalignment of distributions and misclassify many examples of classes 2, 7, 8 and 12.

**Change of pseudo-label distributions throughout training.** To further study the change of pseudo-labels in standard self-training, we compute the total variation (TV) distance between target ground-truths and target pseudo-labels: $d_{\mathrm{TV}}(c, c') = \frac{1}{2}\sum_i |c_i - c'_i|$, where $c_i$ is the ratio of class $i$. We plot its change during training in Figure 2 (Middle). Although the error rate of pseudo-labels continues to decrease, $d_{\mathrm{TV}}$ remains almost unchanged at 0.26 throughout training.
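The TV distance between class-ratio vectors is simple to compute; a small numpy sketch (the ratios below are made-up illustrations, not the paper's measurements):

```python
import numpy as np

def tv_distance(c, c_prime):
    """Total variation distance between two class-ratio vectors:
    d_TV(c, c') = 0.5 * sum_i |c_i - c'_i|."""
    c, c_prime = np.asarray(c, float), np.asarray(c_prime, float)
    return 0.5 * np.abs(c - c_prime).sum()

# Class ratios of ground-truths vs. pseudo-labels (illustrative numbers).
truth  = [0.25, 0.25, 0.25, 0.25]
pseudo = [0.40, 0.30, 0.20, 0.10]
d = tv_distance(truth, pseudo)  # 0.5*(0.15+0.05+0.05+0.15) = 0.2
```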
Note that $d_{\mathrm{TV}}$ is a lower bound of the error rate of the pseudo-labels (shown in Appendix C.1). If $d_{\mathrm{TV}}$ converges to 0.26, then the accuracy of the pseudo-labels is upper-bounded by 0.74. This indicates that the important denoising ability [66] of pseudo-labels in standard self-training is hindered by domain shift.

**Difficulty of selecting reliable pseudo-labels under domain shift.** To mitigate the negative effect of false pseudo-labels, recent works proposed to select correct pseudo-labels by thresholding the entropy or confidence criteria [35, 21, 37, 57]. However, it remains unclear whether these strategies are still effective under domain shift. Here we compare the quality of pseudo-labels selected by different strategies with or without domain shift. For each strategy, we compute the False Positive Rate and True Positive Rate at different thresholds and plot the ROC curve in Figure 2 (Right). When the source and target distributions are identical, both entropy and confidence are reasonable strategies for selecting correct pseudo-labels (AUC=0.89). However, when the target pseudo-labels are generated by the source model, the quality of the selected pseudo-labels decreases sharply under domain shift (AUC=0.78).

## 3 Cycle Self-Training (CST)

We present Cycle Self-Training (CST) to improve pseudo-labels under domain shift. An overview of our method is given in Figure 1. Cycle Self-Training iterates between a forward step and a reverse step to make self-trained classifiers generalize well on both target and source domains.

### 3.1 Cycle Self-Training

**Forward Step.** Similar to standard self-training, we have a source classifier $\theta_s$ trained on top of the shared representations $\phi$ on the labeled source domain, and use it to generate target pseudo-labels as

$$y' = \arg\max_i \{f_{\theta_s,\phi}(x)[i]\}, \tag{3}$$

for each $x$ in the target dataset $\hat{Q}$. Traditional self-training methods use confidence thresholding or reweighting to select reliable pseudo-labels. For example, Sohn et al. [57] select pseudo-labels with large softmax values and Long et al.
[37] add entropy reweighting to rely on examples with more confident predictions. However, the output of deep networks is usually miscalibrated [25], and is not necessarily related to the ground-truth confidence even on the same distribution. In domain adaptation, as shown in Section 2.1, the discrepancy between the source and target domains makes pseudo-labels even more unreliable, and the performance of commonly used selection strategies is also unsatisfactory. Another drawback is the expensive tweaking required to find the optimal confidence threshold for new tasks. To better apply self-training to domain adaptation, we expect the model to gradually refine the pseudo-labels by itself, without cumbersome selection or thresholding.

**Reverse Step.** We design a complementary step with the following insights to improve self-training. Intuitively, the labels on the source domain contain both useful information that can transfer to the target domain and harmful information that can make pseudo-labels incorrect. Similarly, reliable pseudo-labels on the target domain can transfer to the source domain in turn, while models trained with incorrect pseudo-labels on the target domain cannot transfer to the source domain. In this sense, if we explicitly train the model to make target pseudo-labels informative of the source domain, we can gradually make the pseudo-labels more accurate and learn to generalize to the target domain. Specifically, with the pseudo-labels $y'$ generated by the source classifier $\theta_s$ as in equation 3, we train a target head $\hat\theta_t(\phi)$ on top of the representation $\phi$ with pseudo-labels on the target domain $\hat{Q}$:

$$\hat\theta_t(\phi) = \arg\min_{\theta}\; \mathbb{E}_{x\sim \hat{Q}}\,\ell\big(f_{\theta,\phi}(x), y'\big). \tag{4}$$

We wish to make the target pseudo-labels informative of the source domain and gradually refine them. To this end, we update the shared feature extractor $\phi$ to predict accurately on the source domain, and jointly enforce the target classifier $\hat\theta_t(\phi)$ to perform well on the source domain.
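A minimal numpy sketch of one reverse step under the squared loss, where the target head has a ridge-regression closed form and is then scored on labeled source data; the regularizer `lam` and all names are our illustrative choices, not the authors' exact implementation:

```python
import numpy as np

def fit_target_head(feats_t, y_pseudo, num_classes, lam=1e-3):
    """Closed-form inner step: ridge-regress one-hot pseudo-labels on
    target features, theta_t = (Z^T Z + lam*I)^{-1} Z^T Y."""
    Z = feats_t
    Y = np.eye(num_classes)[y_pseudo]  # one-hot target pseudo-labels
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)

def cycle_error(theta_t, feats_s, y_s):
    """Outer-step quantity: error of the pseudo-label-trained target
    head when evaluated on the labeled *source* data."""
    pred = (feats_s @ theta_t).argmax(axis=1)
    return (pred != y_s).mean()

rng = np.random.default_rng(0)
feats_t = rng.normal(size=(100, 8))          # hypothetical target features h_phi(x)
y_pseudo = rng.integers(0, 3, size=100)      # pseudo-labels y' from eq. (3)
theta_t = fit_target_head(feats_t, y_pseudo, num_classes=3)
```

In the full method this source-side error is back-propagated through the closed-form solution to update $\phi$, instead of using second-order meta-learning derivatives.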
This naturally leads to the objective of Cycle Self-Training:

$$\min_{\theta_s,\phi}\; L_{\mathrm{Cycle}}(\theta_s, \phi) := L_{\hat{P}}(\theta_s, \phi) + L_{\hat{P}}\big(\hat\theta_t(\phi), \phi\big). \tag{5}$$

**Bi-level Optimization.** The objective in equation 5 relies on the solution $\hat\theta_t(\phi)$ of the objective in equation 4. Thus, CST formulates a bi-level optimization problem. In the inner loop we generate target pseudo-labels with the source classifier (equation 3), and train a target classifier with the target pseudo-labels (equation 4). After each inner loop, we update the feature extractor $\phi$ for one step in the outer loop (equation 5), and start a new inner loop again. However, since the inner optimization in equation 4 only involves the light-weight linear head $\theta_t$, we propose to calculate the analytical form of $\hat\theta_t(\phi)$ and directly back-propagate to the feature extractor $\phi$, instead of calculating second-order derivatives as in MAML [18]. The resulting framework is as fast as training two heads jointly. Also note that the solution $\hat\theta_t(\phi)$ relies on $\theta_s$ implicitly through $y'$. However, both standard self-training and our implementation use label sharpening, making $y'$ non-differentiable. Thus we follow vanilla self-training and do not consider the gradient of $\hat\theta_t(\phi)$ w.r.t. $y'$ in the outer loop optimization. We defer the derivation and implementation of the bi-level optimization to Appendix B.2.

### 3.2 Tsallis Entropy Minimization

Gibbs entropy is widely used by existing semi-supervised learning methods to regularize the model output and minimize the uncertainty of predictions on unlabeled data [24]. In this work, we generalize the Gibbs entropy to the Tsallis entropy [62] from information theory. Suppose the softmax output of a model is $y \in \mathbb{R}^K$; then the $\alpha$-Tsallis entropy is defined as

$$S_\alpha(y) = \frac{1}{\alpha - 1}\Big(1 - \sum_i y[i]^\alpha\Big),$$

where $\alpha > 0$ is the entropic-index. Note that $\lim_{\alpha\to 1} S_\alpha(y) = -\sum_i y[i]\log(y[i])$, which exactly recovers the Gibbs entropy.
When $\alpha = 2$, $S_\alpha(y)$ becomes the Gini impurity $1 - \sum_i y[i]^2$. We propose to control the uncertainty of target pseudo-labels based on Tsallis entropy minimization:

$$L_{\hat{Q},\mathrm{Tsallis},\alpha}(\theta, \phi) := \mathbb{E}_{x\sim \hat{Q}}\, S_\alpha\big(f_{\theta,\phi}(x)\big). \tag{9}$$

Figure 3: Tsallis entropy vs. entropic-index $\alpha$.

Figure 3 shows the change of the Tsallis entropy with different entropic-indices $\alpha$ for binary problems. Intuitively, smaller $\alpha$ exerts more penalization on uncertain predictions, and larger $\alpha$ allows several scores $y[i]$ to be similar. This is critical in self-training, since an overly small $\alpha$ (as in the Gibbs entropy) pushes the incorrect dimension of a pseudo-label close to 1, leaving it no chance to be corrected throughout training. In Section 5.4, we further verify this property with experiments. An important improvement of the Tsallis entropy over the Gibbs entropy is that it can choose a suitable measure of uncertainty for different systems, avoiding the over-confidence caused by overly penalizing uncertain pseudo-labels.

To automatically find the suitable $\alpha$, we adopt a strategy similar to Section 3.1. The intuition is that if we use a suitable entropic-index $\alpha$ to train the source classifier $\theta_{s,\alpha}$, the target pseudo-labels generated by $\theta_{s,\alpha}$ will contain desirable knowledge of the source dataset, i.e. a target classifier $\theta_{t,\alpha}$ trained with these pseudo-labels will perform well on the source domain. Therefore, we semi-supervisedly train a classifier $\hat\theta_{s,\alpha}$ on the source domain with the $\alpha$-Tsallis entropy regularization $L_{\hat{Q},\mathrm{Tsallis},\alpha}$ on the target domain: $\hat\theta_{s,\alpha} = \arg\min_\theta L_{\hat{P}}(\theta, \phi) + L_{\hat{Q},\mathrm{Tsallis},\alpha}(\theta, \phi)$, from which we obtain the target pseudo-labels. Then we train another head $\hat\theta_{t,\alpha}$ with the target pseudo-labels. We automatically find $\alpha$ by minimizing the loss of $\hat\theta_{t,\alpha}$ on the source data:

$$\hat\alpha = \arg\min_\alpha\; L_{\hat{P}}\big(\hat\theta_{t,\alpha}, \phi\big). \tag{10}$$

To solve equation 10, we discretize the feasible region $[1, 2]$ of $\alpha$ and use discrete optimization to lower the computational cost.
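A direct implementation of the $\alpha$-Tsallis entropy, checking the Gibbs limit at $\alpha \to 1$ and the Gini case at $\alpha = 2$ (a minimal sketch):

```python
import numpy as np

def tsallis_entropy(y, alpha):
    """alpha-Tsallis entropy S_alpha(y) = (1 - sum_i y[i]^alpha) / (alpha - 1).
    alpha -> 1 recovers the Gibbs entropy; alpha = 2 gives the Gini impurity."""
    y = np.asarray(y, float)
    if abs(alpha - 1.0) < 1e-8:                      # Gibbs limit
        return -(y * np.log(y + 1e-12)).sum()
    return (1.0 - (y ** alpha).sum()) / (alpha - 1.0)

p = np.array([0.7, 0.3])
gini  = tsallis_entropy(p, 2.0)     # 1 - (0.49 + 0.09) = 0.42
near1 = tsallis_entropy(p, 1.0001)  # close to the Gibbs entropy of p
```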
We also update $\hat\alpha$ at the start of each epoch, since we found that more frequent updates lead to no performance gain. Details are deferred to Appendix B.3.

**Algorithm 1** Cycle Self-Training (CST)

1: **Input:** source dataset $\hat{P}$ and target dataset $\hat{Q}$.
2: **for** epoch = 0 **to** MaxEpoch **do**
3: &nbsp;&nbsp;&nbsp;&nbsp;Select $\hat\alpha$ as in equation 10 at the start of the epoch.
4: &nbsp;&nbsp;&nbsp;&nbsp;**for** t = 0 **to** MaxIter **do**
5: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Forward Step**
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Generate pseudo-labels on the target domain with $\phi$ and $\theta_s$: $y' = \arg\max_i \{f_{\theta_s,\phi}(x)[i]\}$.
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Reverse Step**
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Train a target head $\hat\theta_t(\phi)$ with target pseudo-labels $y'$ on the feature extractor $\phi$: $\hat\theta_t(\phi) = \arg\min_\theta \mathbb{E}_{x\sim \hat{Q}}\,\ell(f_{\theta,\phi}(x), y')$.
9: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Update the feature extractor $\phi$ and the source head $\theta_s$ to make $\hat\theta_t(\phi)$ perform well on the source dataset and minimize the $\hat\alpha$-Tsallis entropy on the target dataset:
$$\phi \leftarrow \phi - \eta\,\nabla_\phi\big[L_{\hat{P}}(\theta_s, \phi) + L_{\hat{P}}(\hat\theta_t(\phi), \phi) + L_{\hat{Q},\mathrm{Tsallis},\hat\alpha}(\theta_s, \phi)\big], \tag{7}$$
$$\theta_s \leftarrow \theta_s - \eta\,\nabla_{\theta_s}\big[L_{\hat{P}}(\theta_s, \phi) + L_{\hat{Q},\mathrm{Tsallis},\hat\alpha}(\theta_s, \phi)\big]. \tag{8}$$
10: &nbsp;&nbsp;&nbsp;&nbsp;**end for**
11: **end for**

Finally, with the optimal $\hat\alpha$ found, we add the $\hat\alpha$-Tsallis entropy minimization term $L_{\hat{Q},\mathrm{Tsallis},\hat\alpha}$ to the overall objective:

$$\min_{\theta_s,\phi}\; L_{\mathrm{Cycle}}(\theta_s, \phi) + L_{\hat{Q},\mathrm{Tsallis},\hat\alpha}(\theta_s, \phi). \tag{11}$$

In summary, Algorithm 1 depicts the complete training procedure of Cycle Self-Training (CST).

## 4 Theoretical Analysis

We analyze the properties of CST theoretically. First, we prove that the minimizer of the CST loss $L_{\mathrm{CST}}(f_s, f_t)$ leads to a small target loss $\mathrm{Err}_Q(f_s)$ under a simple but realistic expansion assumption. Then, we further demonstrate a concrete instantiation where cycle self-training provably recovers the target ground truth, while both feature adaptation and standard self-training fail. Due to space limit, we state the main results here and defer all proof details to Appendix A.

### 4.1 CST Provably Works under the Expansion Assumption

We start from a $K$-way classification model $f: \mathcal{X} \to [0,1]^K \in \mathcal{F}$, where $\bar{f}(x) := \arg\max_i f(x)[i]$ denotes the prediction. Denote by $P_i$ the conditional distribution of $P$ given $y = i$. Assume the supports of $P_i$ and $P_j$ are disjoint for $i \neq j$.
The definition is similar for $Q_i$. We further assume $P(y=i) = Q(y=i)$. For any $x \in \mathcal{X}$, $N(x)$ is defined as the neighboring set of $x$ under a proper metric $d(\cdot, \cdot)$, $N(x) = \{x' : d(x, x') \le r\}$ for a small radius $r$, and $N(A) := \cup_{x\in A} N(x)$. Denote the expected error on the target domain by $\mathrm{Err}_Q(f) := \mathbb{E}_{(x,y)\sim Q}\, I(\bar{f}(x) \neq y)$. We study the CST algorithm under the expansion assumption on the mixture distribution [66, 11]. Intuitively, this assumption indicates that the conditional distributions $P_i$ and $Q_i$ are closely located and regularly shaped, enabling knowledge transfer from the source domain to the target domain.

**Definition 1** ($(q, \epsilon)$-constant expansion [66]). We say $P$ and $Q$ satisfy $(q, \epsilon)$-constant expansion for some constants $q, \epsilon \in (0, 1)$, if for any set $A \subseteq \mathcal{X}$ and any $i \in [K]$ with $\frac{1}{2}(P_i + Q_i)(A) > q$, we have $\frac{1}{2}(P_i + Q_i)\big(N(A) \setminus A\big) > \min\big\{\epsilon,\, \frac{1}{2}(P_i + Q_i)(A)\big\}$.

Based on this expansion assumption, we consider a robustness-constrained version of CST. Later we will show that the robustness is closely related to the uncertainty. Denote by $f_s$ the source model and $f_t$ the model trained on the target with pseudo-labels. Let $R(f_t) := \frac{1}{2}(P + Q)\big(\{x : \exists x' \in N(x),\ \bar{f}_t(x) \neq \bar{f}_t(x')\}\big)$ represent the robustness [66] of $f_t$ on $P$ and $Q$. Suppose $\mathbb{E}_{(x,y)\sim Q}\, I(\bar{f}_s(x) \neq \bar{f}_t(x)) \le c$ and $R(f_t) \le \rho$. The following theorem states that when $f_s$ and $f_t$ behave similarly on the target domain $Q$ and $f_t$ is robust to local changes in the input, minimizing the cycle source error $\mathrm{Err}_P(f_t)$ guarantees a low error of $f_s$ on the target domain $Q$.

**Theorem 1.** Suppose Definition 1 holds for $P$ and $Q$. For any $f_s, f_t$ satisfying $\mathbb{E}_{(x,y)\sim Q}\, I(\bar{f}_s(x) \neq \bar{f}_t(x)) \le c$ and $R(f_t) \le \rho$, the expected error of $f_s$ on the target domain $Q$ is bounded:

$$\mathrm{Err}_Q(f_s) \le \mathrm{Err}_P(f_t) + c + 2q + \frac{\rho}{\min\{\epsilon, q\}}. \tag{12}$$

To further relate the expected error with the CST training objective and obtain finite-sample guarantees, we use the multi-class margin loss $l_\gamma(f(x), y) := \gamma\big(M(f(x), y)\big)$, where $M(v, y) = v[y] - \max_{y'\neq y} v[y']$ and $\gamma$ is the ramp function.
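The multi-class margin $M(v, y)$ can be computed directly; a small numpy sketch (illustrative, not the authors' code):

```python
import numpy as np

def margin(v, y):
    """Multi-class margin M(v, y) = v[y] - max_{y' != y} v[y']:
    positive iff the model scores the label y strictly highest."""
    v = np.asarray(v, float)
    others = np.delete(v, y)
    return v[y] - others.max()

m_correct = margin([0.8, 0.15, 0.05], 0)  # 0.8 - 0.15 = 0.65
m_wrong   = margin([0.8, 0.15, 0.05], 1)  # 0.15 - 0.8 = -0.65
```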
We then extend the margin to $M(v) = \max_y\big(v[y] - \max_{y'\neq y} v[y']\big)$ (the difference between the largest and the second largest scores in $v$), and define $l_\gamma(f_t(x), \bar{f}_s(x)) := \gamma\big(M(f_t(x), \bar{f}_s(x))\big)$. Further suppose each $f[i]$ is $L_f$-Lipschitz w.r.t. the metric $d(\cdot, \cdot)$ and $\tau := \frac{1}{2L_f}\min\{\epsilon, q\} > 0$. Consider the following training objective for CST, denoted $L_{\mathrm{CST}}(f_s, f_t)$, where $L_{\hat{P},\gamma}(f_t) := \mathbb{E}_{(x,y)\sim \hat{P}}\, l_\gamma(f_t(x), y)$ corresponds to the cycle source loss in equation 5, $L_{\hat{Q},\gamma}(f_t, f_s) := \mathbb{E}_{x\sim \hat{Q}}\, l_\gamma(f_t(x), \bar{f}_s(x))$ is consistent with the target loss in equation 4, and the last term, involving $M(f_t(x))$, is closely related to the uncertainty of predictions in equation 11:

$$\min_{f_s, f_t}\; L_{\mathrm{CST}}(f_s, f_t) := L_{\hat{P},\gamma}(f_t) + L_{\hat{Q},\gamma}(f_t, f_s) + \mathbb{E}_{x\sim \frac{1}{2}(\hat{P}+\hat{Q})}\,\gamma\big(M(f_t(x))\big). \tag{13}$$

The following theorem shows that the minimizer of the training objective $L_{\mathrm{CST}}(f_s, f_t)$ guarantees a low population error of $f_s$ on the target domain $Q$.

**Theorem 2.** Let $\hat{R}(\mathcal{F}|_{\hat{P}})$ denote the empirical Rademacher complexity of the function class $\mathcal{F}$ on the dataset $\hat{P}$. For any solution of equation 13 and $\gamma > 0$, with probability larger than $1 - \delta$,

$$\mathrm{Err}_Q(f_s) \le L_{\mathrm{CST}}(f_s, f_t) + 2q + 4K\Big(\hat{R}(\mathcal{F}|_{\hat{P}}) + \hat{R}(\bar{\mathcal{F}}\mathcal{F}|_{\hat{Q}})\Big) + \zeta,$$

where $\zeta \lesssim \hat{R}(\mathcal{F}|_{\hat{P}}) + \hat{R}(\mathcal{F}|_{\hat{Q}}) + \sqrt{\log(1/\delta)/n_s} + \sqrt{\log(1/\delta)/n_t}$ is a low-order term, and $\bar{\mathcal{F}}\mathcal{F}$ refers to the function class $\{x \mapsto f(x)[\bar{f}'(x)] : f, f' \in \mathcal{F}\}$.

**Main insights.** Theorem 2 justifies CST under the expansion assumption. The generalization error of the classifier $f_s$ on the target domain is bounded by the CST training objective $L_{\mathrm{CST}}(f_s, f_t)$, the intrinsic property $q$ of the data distribution, and the complexity of the function classes. In our algorithm, $L_{\mathrm{CST}}(f_s, f_t)$ is minimized by the neural networks and $q$ is a constant. The complexity of the function class can be controlled with proper regularization.

### 4.2 Hard Case for Feature Adaptation and Standard Self-Training

To gain more insight, we study UDA in a quadratic neural network $f_{\theta,\phi}(x) = \theta^\top (\phi^\top x)^{\circ 2}$, where $\circ 2$ denotes the element-wise square. In UDA, the source can have multiple solutions, but we aim to learn the one that works on the target [34].
We design the underlying distributions $p$ and $q$ in Table 1 to reflect this. Consider the following $P$ and $Q$: $x[1]$ and $x[2]$ are sampled i.i.d. from the distribution $p$ on $P$, and from $q$ on $Q$.

Table 1: The design of $p$ and $q$.

| Distribution | $-1$ | $+1$ | $0$ |
|---|---|---|---|
| Source $p$ | 0.05 | 0.05 | 0.90 |
| Target $q$ | 0.25 | 0.25 | 0.50 |

For $i \in [3, d]$, $x[i] = \sigma_i x[2]$ on $P$ and $x[i] = \sigma_i x[1]$ on $Q$, where the $\sigma_i \in \{\pm 1\}$ are i.i.d. and uniform. We also assume realizability: $y = x[2]^2$ for both source and target. Note that $y = x[i]^2$ for all $i \in [2, d]$ are solutions on $P$, but only $y = x[2]^2$ works on $Q$. We visualize this specialized setting in Figure 4.

Figure 4: The hard case where $d = 3$. Green and blue dots for $y = 1$, red dots for $y = 0$. The grey curve is the classification boundary of different features. The good feature $x[2]^2$ works on the target domain (shown in (a) and (c)), whereas the spurious feature $x[3]^2$ only works on the source domain (shown in (b) and (d)).

Below, we show that feature adaptation and standard self-training learn the spurious feature $x[3]^2$, while CST learns $x[2]^2$. To make the features more tractable, we study norm-constrained versions of the algorithms (details are deferred to Section A.3.2). We compare the features learned by feature adaptation, standard self-training, and CST. Intuitively, feature adaptation fails because the ideal target solution $y = x[2]^2$ has a larger distance in the feature space than the spurious solutions $y = x[i]^2$. Standard self-training also fails since it chooses randomly among all solutions. In comparison, CST can recover the ground truth, because it can distinguish the spurious solution that results in bad pseudo-labels: a classifier trained with those pseudo-labels cannot work on the source domain in turn. This intuition is rigorously justified in the following two theorems.

**Theorem 3.**
For $\epsilon \in (0, 0.5)$, the following statements hold for feature adaptation and self-training:

- For failure rate $\delta > 0$ and target dataset size $n_t > \Omega(\log\frac{1}{\delta})$, with probability at least $1 - \delta$ over the sampling of target data, the solution $(\hat\theta_{\mathrm{FA}}, \hat\phi_{\mathrm{FA}})$ found by feature adaptation satisfies
$$\mathrm{Err}_Q(\hat\theta_{\mathrm{FA}}, \hat\phi_{\mathrm{FA}}) \ge \epsilon. \tag{14}$$
- With probability at least $1 - \frac{1}{d-1}$, the solution $(\hat\theta_{\mathrm{ST}}, \hat\phi_{\mathrm{ST}})$ of standard self-training satisfies
$$\mathrm{Err}_Q(\hat\theta_{\mathrm{ST}}, \hat\phi_{\mathrm{ST}}) \ge \epsilon. \tag{15}$$

**Theorem 4.** For failure rate $\delta > 0$ and target dataset size $n_t > \Omega(\log\frac{1}{\delta})$, with probability at least $1 - \delta$, the solution $(\hat\theta_{\mathrm{CST}}, \hat\phi_{\mathrm{CST}})$ of CST recovers the ground truth on the target dataset:

$$\mathrm{Err}_Q(\hat\theta_{\mathrm{CST}}, \hat\phi_{\mathrm{CST}}) = 0. \tag{16}$$

## 5 Experiments

We test the performance of the proposed method on both vision and language datasets. Cycle Self-Training (CST) consistently outperforms state-of-the-art feature adaptation and self-training methods. Code is available at https://github.com/Liuhong99/CST.

**Datasets.** We experiment on visual object recognition and linguistic sentiment classification tasks. Office-Home [64] has 65 classes from four environments with large domain gap: Artistic (Ar), Clip Art (Cl), Product (Pr), and Real-World (Rw). VisDA-2017 [45] is a large-scale UDA dataset with two domains named Synthetic and Real, consisting of over 200k images from 12 categories of objects. Amazon Review [10] is a linguistic sentiment classification dataset of product reviews in four product categories: Books (B), DVDs (D), Electronics (E), and Kitchen (K).

**Implementation.** We use ResNet-50 [26] (pretrained on ImageNet [53]) as the feature extractor for vision tasks, and BERT [16] for linguistic tasks. On VisDA-2017, we also provide results with ResNet-101 to include more baselines. We use the cross-entropy loss for classification on the source domain. When training the target head $\hat\theta_t$ and updating the feature extractor with CST, we use the squared loss to obtain the analytical solution of $\hat\theta_t$ directly and avoid calculating second-order derivatives as in meta-learning [18].
Details on adapting the squared loss to multi-class classification are deferred to Appendix B. We adopt SGD with initial learning rate $\eta_0 = 2\mathrm{e}{-3}$ for image classification and $\eta_0 = 5\mathrm{e}{-4}$ for sentiment classification. Following the standard protocol in [26], we decay the learning rate by 0.1 every 50 epochs until 150 epochs. We run all tasks 3 times and report the mean and deviation in top-1 accuracy. For VisDA-2017, we report the mean class accuracy. Following Theorem 2, we also enhance CST with sharpness-aware regularization [19] (CST+SAM), which helps regularize the Lipschitzness of the function class. Due to space limit, we report mean accuracies in Tables 2 and 3 and defer standard deviations to Appendix C.

### 5.2 Baselines

We compare with two lines of work in domain adaptation: feature adaptation and self-training. We also compare with more complex state-of-the-art methods and create stronger baselines by combining feature adaptation and self-training.

**Feature Adaptation:** DANN [22], MCD [54], CDAN [37] (which improves DANN with pseudo-label conditioning), MDD [73] (which improves previous domain adaptation with margin theory), and Implicit Alignment (IA) [28] (which improves MDD to deal with label shift).

**Self-Training:** We include VAT [40], MixMatch [8] and FixMatch [57] from the semi-supervised learning literature as self-training methods. We also compare with self-training methods for UDA: CBST [77], which considers class imbalance in standard self-training, and KLD [78], which improves CBST with label regularization. However, these methods involve tricks specific to convolutional networks. Thus, in sentiment classification tasks where we use BERT backbones, we compare with other consistency regularization baselines: VAT [40] and VAT+Entropy Minimization.

**Feature Adaptation + Self-Training:** DIRT-T [56] combines DANN, VAT, and entropy minimization. We also create more powerful baselines: CDAN+VAT+Entropy and MDD+FixMatch.

**Other SOTA:** AFN [69] boosts transferability with large norms.
STAR [38] aligns domains with stochastic classifiers. SENTRY [48] selects confident examples with a committee of random augmentations.

### 5.3 Results

Results on the 12 pairs of Office-Home tasks are shown in Table 2. When domain shift is large, standard self-training methods such as VAT and FixMatch suffer from the decay in pseudo-label quality. CST outperforms feature adaptation and self-training methods significantly on 9 out of 12 tasks. Note that CST does not involve manually setting confidence thresholds or reweighting.

Table 2: Accuracy (%) on Office-Home for unsupervised domain adaptation (ResNet-50).

| Method | Ar-Cl | Ar-Pr | Ar-Rw | Cl-Ar | Cl-Pr | Cl-Rw | Pr-Ar | Pr-Cl | Pr-Rw | Rw-Ar | Rw-Cl | Rw-Pr | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DANN [22] | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| CDAN [37] | 50.7 | 70.6 | 76.0 | 57.6 | 70.0 | 70.0 | 57.4 | 50.9 | 77.3 | 70.9 | 56.7 | 81.6 | 65.8 |
| CDAN+VAT+Entropy | 52.2 | 71.5 | 76.4 | 61.1 | 70.3 | 67.8 | 59.5 | 54.4 | 78.6 | 73.2 | 59.0 | 82.7 | 67.3 |
| FixMatch [57] | 51.8 | 74.2 | 80.1 | 63.5 | 73.8 | 61.3 | 64.7 | 51.4 | 80.0 | 73.3 | 56.8 | 81.7 | 67.7 |
| MDD [73] | 54.9 | 73.7 | 77.8 | 60.0 | 71.4 | 71.8 | 61.2 | 53.6 | 78.1 | 72.5 | 60.2 | 82.3 | 68.1 |
| MDD+IA [28] | 56.2 | 77.9 | 79.2 | 64.4 | 73.1 | 74.4 | 64.2 | 54.2 | 79.9 | 71.2 | 58.1 | 83.1 | 69.5 |
| SENTRY [48] | 61.8 | 77.4 | 80.1 | 66.3 | 71.6 | 74.7 | 66.8 | 63.0 | 80.9 | 74.0 | 66.3 | 84.1 | 72.2 |
| CST | 59.0 | 79.6 | 83.4 | 68.4 | 77.1 | 76.7 | 68.9 | 56.4 | 83.0 | 75.3 | 62.2 | 85.1 | 73.0 |

Table 3: Accuracy (%) on the Multi-Domain Sentiment Dataset for domain adaptation with BERT.

| Method | B-D | B-E | B-K | D-B | D-E | D-K | E-B | E-D | E-K | K-B | K-D | K-E | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source-only | 89.7 | 88.4 | 90.9 | 90.1 | 88.5 | 90.2 | 86.9 | 88.5 | 91.5 | 87.6 | 87.3 | 91.2 | 89.2 |
| DANN [22] | 90.2 | 89.5 | 90.9 | 91.0 | 90.6 | 90.2 | 87.1 | 87.5 | 92.8 | 87.8 | 87.6 | 93.2 | 89.9 |
| VAT [40] | 90.6 | 91.0 | 91.7 | 90.8 | 90.8 | 92.0 | 87.2 | 86.9 | 92.6 | 86.9 | 87.7 | 92.9 | 90.1 |
| VAT+Entropy | 90.4 | 91.3 | 91.5 | 91.0 | 91.1 | 92.4 | 87.5 | 86.3 | 92.4 | 86.5 | 87.5 | 93.1 | 90.1 |
| MDD [73] | 90.4 | 90.4 | 91.8 | 90.2 | 90.9 | 91.0 | 87.5 | 86.3 | 92.5 | 89.0 | 87.9 | 92.1 | 90.0 |
| CST | 91.5 | 92.9 | 92.6 | 91.9 | 92.6 | 93.5 | 90.2 | 89.4 | 93.8 | 87.9 | 88.3 | 93.5 | 91.5 |

Table 4 shows the results on VisDA-2017.
CST surpasses the state-of-the-art with both ResNet-50 and ResNet-101 backbones. We also combine feature adaptation and self-training (DIRT-T, CDAN+VAT+Entropy, and MDD+FixMatch) to test whether feature adaptation alleviates the negative effect of domain shift on standard self-training. The results indicate that CST is a better solution than a simple combination. While most traditional self-training methods include techniques specific to ConvNets such as Mixup [72], CST is a universal method and works directly on sentiment classification by simply replacing the head and training objective of BERT [16]. In Table 3, most feature adaptation baselines improve over source-only only marginally, but CST outperforms all baselines on most tasks significantly.

5.4 Analysis

Table 5: Ablation on VisDA-2017.

| Method | Accuracy ↑ | d_TV ↓ |
|---|---|---|
| FixMatch [57] | 74.5 ± 0.2 | 0.22 |
| FixMatch+Tsallis | 76.3 ± 0.8 | 0.15 |
| CST w/o Tsallis | 72.0 ± 0.4 | 0.16 |
| CST+Entropy | 76.2 ± 0.6 | 0.20 |
| CST | 79.9 ± 0.5 | 0.12 |

Ablation Study. We study the role of each part of CST in self-training. CST w/o Tsallis removes the Tsallis entropy regularizer. CST+Entropy replaces the Tsallis entropy with the standard entropy. FixMatch+Tsallis adds the Tsallis entropy regularizer to standard self-training. Results are shown in Table 5. CST+Entropy performs 3.7% worse than CST, indicating that the Tsallis entropy is a better regularization for pseudo-labels than the standard entropy. CST performs 5.4% better than FixMatch, indicating that CST adapts better to domain shift than standard self-training. While FixMatch+Tsallis outperforms FixMatch, it is still 3.6% behind CST, with a much larger total variation distance d_TV between pseudo-labels and ground truths, indicating that CST makes pseudo-labels more reliable than standard self-training under domain shift.

Quality of Pseudo-labels. We visualize the error of pseudo-labels during training on VisDA-2017 in Figure 5 (Left).
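As a reference for the ablation above, here is a minimal sketch (with toy values, assumed purely for illustration) of the two quantities it tracks: the Tsallis entropy of a softmax prediction, which recovers the standard Gibbs entropy as the entropic index α → 1, and the total variation distance d_TV between pseudo-label and ground-truth class marginals:

```python
import math

def gibbs_entropy(p):
    """Standard (Gibbs/Shannon) entropy of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tsallis_entropy(p, alpha=2.0):
    """Tsallis entropy S_alpha(p) = (1 - sum_i p_i^alpha) / (alpha - 1);
    converges to the Gibbs entropy as alpha -> 1."""
    return (1.0 - sum(pi ** alpha for pi in p)) / (alpha - 1.0)

def total_variation(p, q):
    """TV distance between two class-marginal distributions: half the L1 norm."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Toy softmax prediction: top-2 scores close, as Tsallis regularization permits.
pred = [0.5, 0.45, 0.05]
print(round(gibbs_entropy(pred), 3))
print(round(tsallis_entropy(pred, alpha=2.0), 3))

# Toy class marginals for pseudo-labels vs. ground truth.
pseudo_marginal = [0.5, 0.3, 0.2]
true_marginal = [0.4, 0.4, 0.2]
print(round(total_variation(pseudo_marginal, true_marginal), 2))  # 0.1
```

A one-hot (fully confident) prediction has zero entropy under both measures; the difference lies in how strongly near-one-hot predictions are rewarded.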
The error of the target classifier on the source domain decreases quickly during training, while both the error of the pseudo-labels (the error of the source classifier on the target domain Q) and the total variation (TV) distance between pseudo-labels and ground truths continue to decay, indicating that CST gradually refines the pseudo-labels. This forms a clear contrast to standard self-training as visualized in Figure 2 (Middle), where the distance d_TV remains nearly unchanged throughout training.

Comparison of Gibbs entropy and Tsallis entropy. We compare the pseudo-labels learned with the standard Gibbs entropy and the Tsallis entropy on Ar→Cl with ResNet-50 at epoch 40. We compute the difference between the largest and the second largest softmax scores of each target example and plot the histogram in Figure 5 (Right). The Gibbs entropy pushes the largest softmax output close to 1, indicating over-confidence. In this case, if a prediction is wrong, it can be hard to correct with self-training. In contrast, the Tsallis entropy allows the largest and the second largest scores to remain similar.

Table 4: Mean class accuracy (%) for unsupervised domain adaptation on VisDA-2017 ("–": not reported).

| Method | ResNet-50 | ResNet-101 |
|---|---|---|
| DANN [22] | 69.3 | 79.5 |
| VAT [40] | 68.0 ± 0.3 | 73.4 ± 0.5 |
| DIRT-T [56] | 68.2 ± 0.3 | 77.2 ± 0.5 |
| MCD [54] | 69.2 | 77.7 |
| CDAN [37] | 70.0 | 80.1 |
| CDAN+VAT+Entropy | 76.5 ± 0.5 | 80.4 ± 0.7 |
| MixMatch [8] | 69.3 ± 0.4 | 77.0 ± 0.5 |
| FixMatch [57] | 74.5 ± 0.2 | 79.5 ± 0.3 |
| CBST [77] | – | 76.4 ± 0.9 |
| KLD [78] | – | 78.1 ± 0.2 |
| MDD [73] | 74.6 | 81.6 ± 0.3 |
| AFN [69] | – | 76.1 |
| MDD+IA [28] | 75.8 | – |
| MDD+FixMatch | 77.8 ± 0.3 | 82.4 ± 0.4 |
| STAR [38] | – | 82.7 |
| SENTRY [48] | 76.7 | – |
| CST | 79.9 ± 0.5 | 84.8 ± 0.6 |
| CST+SAM | 80.6 ± 0.5 | 86.5 ± 0.7 |

Figure 5: Analysis. Left: Error of pseudo-labels and reverse pseudo-labels. The error of the target classifier on the source domain decreases, indicating that the quality of pseudo-labels is refined. Right: Histograms of the difference between the largest and the second largest softmax scores. The Tsallis entropy avoids over-confidence.

6 Related Work

Self-Training.
Self-training is a mainstream technique for semi-supervised learning [13]. In this work, we focus on pseudo-labeling [52, 31, 2], which uses unlabeled data by training on pseudo-labels generated by a source model. Other lines of work study consistency regularization [4, 51, 55, 40]. Recent works demonstrate the power of such methods [67, 57, 23]. Equipped with proper training techniques, these methods can achieve results comparable to standard training that uses many more labeled examples [17]. Zoph et al. [76] compare self-training with pre-training and joint training. Vu et al. [65] and Mukherjee & Awadallah [42] show that task-level self-training works well in few-shot learning. These methods are tailored to semi-supervised learning or general representation learning and do not explicitly take domain shift into consideration. Wei et al. [66] and Frei et al. [20] provide the first theoretical analyses of self-training based on the expansion assumption.

Domain Adaptation. Inspired by the generalization error bound of Ben-David et al. [7], Long et al. [34] and Zellinger et al. [71] minimize distance measures between source and target distributions to learn domain-invariant features. Ganin et al. [22] (DANN) propose to approximate the domain distance by adversarial learning. Follow-up works propose various improvements upon DANN [63, 54, 37, 73, 28]. Popular as they are, failure cases exist in situations like label shift [74, 32], shift in the support of domains [29], and large discrepancy between source and target [33]. Another line of work addresses domain adaptation with self-training. Shu et al. [56] improve DANN with VAT and entropy minimization. French et al. [21], Zou et al. [78], and Li et al. [32] incorporate various semi-supervised learning techniques to boost domain adaptation performance. Kumar et al. [30], Chen et al. [15], and Cai et al. [11] show that self-training provably works in domain adaptation under certain assumptions.
7 Conclusion

We propose cycle self-training in place of standard self-training to explicitly address the distribution shift in domain adaptation. We show that our method provably works under the expansion assumption and demonstrate hard cases for feature adaptation and standard self-training. Self-training (or pseudo-labeling) is only one line of work in the semi-supervised learning literature. Future work can delve into the behaviors of other semi-supervised learning techniques, including consistency regularization and data augmentation, under distribution shift, and exploit them extensively for domain adaptation.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 62022050 and 62021002, Beijing Nova Program under Grant Z201100006820041, China's Ministry of Industry and Information Technology, the MOE Innovation Plan and the BNRist Innovation Fund.

References

[1] Albadawy, E. A., Saha, A., and Mazurowski, M. A. Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing. Medical Physics, 45(3), 2018.
[2] Arazo, E., Ortego, D., Albert, P., O'Connor, N. E., and McGuinness, K. Pseudo-labeling and confirmation bias in deep semi-supervised learning. CoRR, abs/1908.02983, 2019.
[3] Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural net. In NeurIPS, pp. 8141–8150, 2019.
[4] Bachman, P., Alsharif, O., and Precup, D. Learning with pseudo-ensembles. In NeurIPS, volume 27, pp. 3365–3373, 2014.
[5] Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3(Nov):463–482, 2002.
[6] Ben-David, S. and Urner, R. On the hardness of domain adaptation and the utility of unlabeled target samples. In ALT, pp. 139–153, 2012.
[7] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains.
Machine Learning, 79(1-2):151–175, 2010.
[8] Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. MixMatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.
[9] Bertinetto, L., Henriques, J. F., Torr, P., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. In ICLR, 2019.
[10] Blitzer, J., Dredze, M., and Pereira, F. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, pp. 440–447, 2007.
[11] Cai, T., Gao, R., Lee, J. D., and Lei, Q. A theory of label propagation for subpopulation shift. In ICML, 2021.
[12] Carlini, N. Poisoning the unlabeled dataset of semi-supervised learning, 2021.
[13] Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, Cambridge, 2006.
[14] Chen, C., Xie, W., Huang, W., Rong, Y., Ding, X., Huang, Y., Xu, T., and Huang, J. Progressive feature alignment for unsupervised domain adaptation. In CVPR, pp. 627–636, 2019.
[15] Chen, Y., Wei, C., Kumar, A., and Ma, T. Self-training avoids using spurious features under domain shift. In NeurIPS, pp. 21061–21071, 2020.
[16] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186, 2019.
[17] Du, J., Grave, E., Gunel, B., Chaudhary, V., Celebi, O., Auli, M., Stoyanov, V., and Conneau, A. Self-training improves pre-training for natural language understanding. In NAACL, pp. 5408–5418, 2021.
[18] Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135, 2017.
[19] Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021.
[20] Frei, S., Zou, D., Chen, Z., and Gu, Q. Self-training converts weak learners to strong learners in mixture models. arXiv preprint arXiv:2106.13805, 2021.
[21] French, G., Mackiewicz, M., and Fisher, M. Self-ensembling for visual domain adaptation. In ICLR, 2018.
[22] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. JMLR, 17(1):2096–2030, 2016.
[23] Ghiasi, G., Zoph, B., Cubuk, E. D., Le, Q. V., and Lin, T.-Y. Multi-task self-training for learning general representations. In ICCV, pp. 8856–8865, 2021.
[24] Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In NeurIPS, pp. 529–536, 2004.
[25] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In ICML, pp. 1321–1330, 2017.
[26] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
[27] Hoffman, J., Tzeng, E., Park, T., Zhu, J., Isola, P., Saenko, K., Efros, A. A., and Darrell, T. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, pp. 1994–2003, 2018.
[28] Jiang, X., Lao, Q., Matwin, S., and Havaei, M. Implicit class-conditioned domain alignment for unsupervised domain adaptation. In ICML, pp. 4816–4827, 2020.
[29] Johansson, F. D., Sontag, D., and Ranganath, R. Support and invertibility in domain-invariant representations. In AISTATS, pp. 527–536, 2019.
[30] Kumar, A., Ma, T., and Liang, P. Understanding self-training for gradual domain adaptation. In ICML, pp. 5468–5479, 2020.
[31] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. ICML Workshop: Challenges in Representation Learning (WREPL), 2013.
[32] Li, B., Wang, Y., Che, T., Zhang, S., Zhao, S., Xu, P., Zhou, W., Bengio, Y., and Keutzer, K. Rethinking distributional matching based domain adaptation. arXiv, abs/2006.13352, 2020.
[33] Liu, H., Long, M., Wang, J., and Jordan, M. Transferable adversarial training: A general approach to adapting deep classifiers. In ICML, volume 97, pp. 4013–4022, 2019.
[34] Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In ICML, pp. 97–105, 2015.
[35] Long, M., Zhu, H., Wang, J., and Jordan, M. I. Unsupervised domain adaptation with residual transfer networks. In NeurIPS, pp. 136–144, 2016.
[36] Long, M., Zhu, H., Wang, J., and Jordan, M. I. Deep transfer learning with joint adaptation networks. In ICML, pp. 2208–2217, 2017.
[37] Long, M., Cao, Z., Wang, J., and Jordan, M. I. Conditional adversarial domain adaptation. In NeurIPS, pp. 1640–1650, 2018.
[38] Lu, Z., Yang, Y., Zhu, X., Liu, C., Song, Y.-Z., and Xiang, T. Stochastic classifiers for unsupervised domain adaptation. In CVPR, pp. 9111–9120, 2020.
[39] Mey, A. and Loog, M. A soft-labeled self-training approach. In ICPR, 2016.
[40] Miyato, T., Maeda, S., Ishii, S., and Koyama, M. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. TPAMI, 2018.
[41] Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press.
[42] Mukherjee, S. and Awadallah, A. Uncertainty-aware self-training for few-shot text classification. In NeurIPS, volume 33, pp. 21199–21212, 2020.
[43] Pan, S. J. and Yang, Q. A survey on transfer learning. TKDE, 22(10):1345–1359, 2010.
[44] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, volume 32, pp. 8026–8037, 2019.
[45] Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., and Saenko, K. VisDA: The visual domain adaptation challenge. CoRR, abs/1710.06924, 2017.
[46] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. Moment matching for multi-source domain adaptation. In ICCV, pp. 1406–1415, 2019.
[47] Prabhu, V., Khare, S., Kartik, D., and Hoffman, J. SENTRY: Selective entropy optimization via committee consistency for unsupervised domain adaptation, 2020.
[48] Prabhu, V., Khare, S., Kartik, D., and Hoffman, J. SENTRY: Selective entropy optimization via committee consistency for unsupervised domain adaptation. In ICCV, pp. 8558–8567, October 2021.
[49] Qu, X., Zou, Z., Cheng, Y., Yang, Y., and Zhou, P. Adversarial category alignment network for cross-domain sentiment classification. In NAACL, 2019.
[50] Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. The MIT Press, 2009.
[51] Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. Semi-supervised learning with ladder networks. In NeurIPS, volume 28, pp. 3546–3554, 2015.
[52] Rosenberg, C., Hebert, M., and Schneiderman, H. Semi-supervised self-training of object detection models. In WACV, volume 1, pp. 29–36, 2005.
[53] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[54] Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, pp. 3723–3732, 2018.
[55] Sajjadi, M., Javanmardi, M., and Tasdizen, T. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NeurIPS, volume 29, pp. 1163–1171, 2016.
[56] Shu, R., Bui, H., Narui, H., and Ermon, S. A DIRT-T approach to unsupervised domain adaptation. In ICLR, 2018.
[57] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C. A., Cubuk, E. D., Kurakin, A., and Li, C.-L. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
[58] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R.
Intriguing properties of neural networks. In ICLR, 2014.
[59] Talagrand, M. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.
[60] Tan, S., Peng, X., and Saenko, K. Class-imbalanced domain adaptation: An empirical odyssey. In ECCV Workshop, 2020.
[61] Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, volume 30, pp. 1195–1204, 2017.
[62] Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1-2):479–487, 1988.
[63] Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In CVPR, pp. 7167–7176, 2017.
[64] Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In CVPR, pp. 5018–5027, 2017.
[65] Vu, T., Luong, M.-T., Le, Q. V., Simon, G., and Iyyer, M. STraTA: Self-training with task augmentation for better few-shot learning. arXiv preprint arXiv:2109.06270, 2021.
[66] Wei, C., Shen, K., Yining, C., and Ma, T. Theoretical analysis of self-training with deep networks on unlabeled data. In ICLR, 2021.
[67] Xie, Q., Luong, M. T., Hovy, E., and Le, Q. V. Self-training with Noisy Student improves ImageNet classification. In CVPR, 2020.
[68] Xie, S. M., Kumar, A., Jones, R., Khani, F., Ma, T., and Liang, P. In-N-Out: Pre-training and self-training using auxiliary information for out-of-distribution robustness. In ICLR, 2021.
[69] Xu, R., Li, G., Yang, J., and Lin, L. Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In ICCV, 2019.
[70] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In NeurIPS, pp. 3320–3328, 2014.
[71] Zellinger, W., Grubinger, T., Lughofer, E., Natschläger, T., and Saminger-Platz, S.
Central moment discrepancy (CMD) for domain-invariant representation learning. In ICLR, 2017.
[72] Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[73] Zhang, Y., Liu, T., Long, M., and Jordan, M. Bridging theory and algorithm for domain adaptation. In ICML, pp. 7404–7413, 2019.
[74] Zhao, H., Combes, R. T. D., Zhang, K., and Gordon, G. On learning invariant representations for domain adaptation. In ICML, volume 97, pp. 7523–7532, 2019.
[75] Ziser, Y. and Reichart, R. Pivot based language modeling for improved neural domain adaptation. In NAACL, pp. 1241–1251, 2018.
[76] Zoph, B., Ghiasi, G., Lin, T.-Y., Cui, Y., Liu, H., Cubuk, E. D., and Le, Q. Rethinking pre-training and self-training. In NeurIPS, volume 33, pp. 3833–3845, 2020.
[77] Zou, Y., Yu, Z., Vijaya Kumar, B. V. K., and Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, pp. 297–313, 2018.
[78] Zou, Y., Yu, Z., Liu, X., Kumar, B. V., and Wang, J. Confidence regularized self-training. In ICCV, October 2019.