Understanding Self-Training for Gradual Domain Adaptation

Ananya Kumar, Tengyu Ma, Percy Liang (Stanford University)

Abstract

Machine learning systems must adapt to data distributions that evolve over time, in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces. Traditional domain adaptation is only guaranteed to work when the distribution shift is small; empirical methods combine several heuristics for larger shifts but can be dataset specific. To adapt to larger shifts we consider gradual domain adaptation, where the goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. We prove the first non-vacuous upper bound on the error of self-training with gradual shifts, under settings where directly adapting to the target domain can result in unbounded error. The theoretical analysis leads to algorithmic insights, highlighting that regularization and label sharpening are essential even when we have infinite data. Leveraging the gradual shift structure leads to higher accuracies on a rotating MNIST dataset, a forest Cover Type dataset, and a realistic Portraits dataset.

1. Introduction

Machine learning models are typically trained and tested on the same data distribution. However, when a model is deployed in the real world, the data distribution typically evolves over time, leading to a drop in performance. This problem is widespread: sensor measurements drift over time due to sensor aging (Vergara et al., 2012), self-driving car vision modules have to deal with evolving road conditions (Bobu et al., 2018), and neural signals received by brain-machine interfaces change within the span of a day (Farshchian et al., 2019). Repeatedly gathering large sets of labeled examples to retrain the model can be impractical, so we would like to leverage unlabeled examples to adapt the model and maintain high accuracy (Farshchian et al., 2019; Sethi and Kantardzic, 2017).

[Figure 1. In gradual domain adaptation we are given labeled data from a source domain, and unlabeled data from intermediate domains that shift gradually in distribution towards a target domain. Here, blue = female, red = male, and gray = unlabeled data.]

The traditional solution to distribution shift is unsupervised domain adaptation, but existing methods are only guaranteed to work when the shift is small, that is, when the source and target distributions cannot be easily distinguished from each other (Zhao et al., 2019). To adapt to larger shifts, recent empirical papers propose combining several heuristics (Hoffman et al., 2018; Shu et al., 2018), but these methods can be brittle, working well on some datasets but not on others (Peng et al., 2019), and they require substantial tuning for new domains. In many real applications the domain shift does not happen all at once but gradually, although this structure is ignored by most domain adaptation methods. We show that the gradual shift structure allows us to reliably adapt to very different distributions, both in theory and in practice.
We analyze self-training (also known as pseudolabeling), a method for semi-supervised learning (Chapelle et al., 2006) that has led to state-of-the-art results on ImageNet (Xie et al., 2020) and for adversarial robustness on CIFAR-10 (Uesato et al., 2019; Carmon et al., 2019; Najafi et al., 2019). Intuitively, it is easier to handle smaller shifts, but each shift can incur some error, so the more steps, the more degradation; this makes it unclear whether leveraging the gradual shift structure is better than directly adapting to the target. In this paper, we provide the first theoretical analysis showing that gradual domain adaptation improves over the traditional approach of direct domain adaptation, allowing us to adapt to very different target distributions.

[Figure 2. The source classifier w_0 gets 100% accuracy on the source domain (Figure 2a), where we have labeled data. But after 3 time steps (Figure 2d) the source classifier is stale, classifying most examples incorrectly. At that point we cannot correct the classifier using unlabeled data from the target domain, which corresponds to traditional domain adaptation directly to the target. Given unlabeled data in an intermediate domain (Figure 2b) where the shift is gradual, the source classifier pseudolabels most points correctly, and self-training learns an accurate classifier (shown in green) that separates the classes. Successively applying self-training learns a good classifier on the target domain (green classifier in Figure 2d).]

As a concrete example of our setting, the Portraits dataset (Ginosar et al., 2017) contains photos of high school seniors taken across many years, labeled by gender (Figure 1). We use the first 2000 images (1905-1935) as the source, the next 14000 (1935-1969) as intermediate domains, and the next 2000 images (1969-1973) as the target. A model trained on labeled examples from the source gets 98% accuracy on held-out examples from the same years, but only 75% accuracy on the target domain. Assuming access to unlabeled images from the intermediate domains, our goal is to adapt the model to do well on the target domain. Direct adaptation to the target with self-training only improves the accuracy a little, from 75% to 77%.

The gradual self-training algorithm begins with a classifier w_0 trained on labeled examples from the source domain (Figure 2a). For each successive domain P_t, the algorithm generates pseudolabels for unlabeled examples from that domain, and then trains a regularized supervised classifier on the pseudolabeled examples. The intuition, visualized in Figure 2, is that after a single gradual shift most examples are pseudolabeled correctly, so self-training learns a good classifier on the shifted data, whereas the shift from the source to the target can be too large for self-training to correct. We find that gradual self-training on the Portraits dataset improves upon direct target adaptation (77% to 84% accuracy).
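To make the procedure concrete, here is a minimal sketch of gradual self-training. For illustration we use a scikit-learn-style linear classifier with arbitrary hyperparameters; our experiments in Section 5 use convolutional networks and additionally filter out low-confidence examples, which we omit here.

```python
from sklearn.linear_model import LogisticRegression

def self_train_step(model, X_unlabeled, C=0.1):
    """One step of self-training: pseudolabel with the current model
    (hard labels, i.e. label sharpening), then fit a fresh regularized
    classifier on the pseudolabeled data."""
    pseudolabels = model.predict(X_unlabeled)
    return LogisticRegression(C=C).fit(X_unlabeled, pseudolabels)

def gradual_self_train(X_source, y_source, unlabeled_domains, C=0.1):
    """Fit on the labeled source, then self-train once on each
    successive domain, ending at the target."""
    model = LogisticRegression(C=C).fit(X_source, y_source)
    for X_t in unlabeled_domains:  # ordered from source-like to target-like
        model = self_train_step(model, X_t, C=C)
    return model
```

The two choices our theory turns out to care about are both visible here: the pseudolabels are hard labels (model.predict), and every new classifier is explicitly regularized (the small C).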
Our results: We analyze gradual domain adaptation in two settings. The key challenge for domain adaptation theory is dealing with source and target domains that are very different, for example when the source and target can be easily distinguished from each other (Zhao et al., 2019; Shu et al., 2018), which is typical in the modern high-dimensional regime. The gradual shift structure inherent in many applications gives us leverage to adapt to target distributions that are very different.

Our first setting, the margin setting, is distribution-free: we only assume that at every point in time there exists some Lipschitz classifier that can classify most of the data correctly with a margin, where the classifier may be different at each time step (so this is more general than covariate shift), and that the shifts are small in Wasserstein-infinity distance. The classifier can be non-linear. A simple example (as in Figure 2) shows that a classifier that gets 100% accuracy can get 0% accuracy after a constant number of time steps. Directly adapting to the final target domain also gets 0% accuracy. Gradual self-training does better, letting us bound the error after T steps: err_T ≤ e^{cT} (α_0 + O(1/√n)), where α_0 is the error of the classifier on the source domain and n is the number of unlabeled examples in each intermediate domain. While this bound is exponential in T, it is non-vacuous for small α_0, and we show that it is tight for gradual self-training.

In the second setting, stronger distributional assumptions allow us to do better: we assume that P(X | Y = y) is a d-dimensional isotropic Gaussian for each y. Here, we show that if we begin with a classifier w_0 that is nearly Bayes optimal for the initial distribution, we can recover a classifier w_T that is Bayes optimal for the target distribution, given infinite unlabeled data. This is an idealized setting for understanding what properties of the data might allow self-training to do better than the exponential bound.

Our theory leads to practical insights, showing that regularization (even when we have infinite data) and label sharpening are essential for gradual self-training. Without regularization, the accuracy of gradual self-training drops from 84% to 77% on Portraits and from 88% to 46% on rotating MNIST. Even when we self-train with more examples, the accuracy gap between regularized and unregularized models stays the same, unlike in supervised learning where the benefit of regularization diminishes with more examples. Finally, our theory suggests that the gradual shift structure helps when the shift is small in Wasserstein-infinity distance, as opposed to other distance metrics like the KL-divergence. For example, one way to interpolate between the source and target domains is to gradually introduce more images from the target, but this shift is large in Wasserstein-infinity distance, and we see experimentally that gradual self-training does not help in this setting. We hope this gives practitioners some insight into when gradual self-training can work.

2. Setup

Gradually shifting distributions: Consider a binary classification task of predicting labels y ∈ {−1, 1} from input features x ∈ ℝ^d. We have joint distributions over the inputs and labels ℝ^d × {−1, 1}: P_0, P_1, ..., P_T, where P_0 is the source domain, P_T is the target domain, and P_1, ..., P_{T−1} are intermediate domains. We assume the shift is gradual: for some ε > 0, ρ(P_t, P_{t+1}) < ε for all 0 ≤ t < T, where ρ(P, Q) is some distance function between distributions P and Q. We have n_0 labeled examples S_0 = {(x_i^{(0)}, y_i^{(0)})}_{i=1}^{n_0} sampled independently from the source P_0, and n unlabeled examples S_t = {x_i^{(t)}}_{i=1}^{n} sampled independently from P_t for each 1 ≤ t ≤ T.
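As a small synthetic illustration of this setup (all parameter values here are our own, for illustration only), the following sketch builds a sequence of domains by rotating a two-Gaussian mixture a few degrees per step. Consecutive domains are close, in the sense that every point moves only a short distance, while the source and target differ by a large total rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(angle_deg, n=500):
    """One domain P_t: two unit-covariance Gaussians with class means at
    +/- (rotated mu). A small per-step rotation moves every point only a
    short distance, so consecutive domains are close."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    y = rng.choice([-1, 1], size=n)
    X = y[:, None] * (rot @ np.array([4.0, 0.0])) + rng.normal(size=(n, 2))
    return X, y

# T = 20 steps of 3 degrees each: 60 degrees from source to target.
domains = [sample_domain(3 * t) for t in range(21)]
```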
Models and objectives: We have a model family Θ, where a model M_θ : ℝ^d → ℝ, for each θ ∈ Θ, outputs a score representing its confidence that the label y is 1 for the given example. The model's prediction for an input x is sign(M_θ(x)), where sign(r) = 1 if r ≥ 0 and sign(r) = −1 if r < 0.

We evaluate models on the fraction of examples where they make a wrong prediction, also known as the 0-1 loss:

$$\mathrm{Err}(\theta, P) = \mathbb{E}_{X,Y \sim P}\big[\mathbf{1}\{\operatorname{sign}(M_\theta(X)) \neq Y\}\big] \tag{1}$$

The goal is to find a classifier θ that gets high accuracy on the target domain P_T, that is, low Err(θ, P_T). In an online setting we may care about the accuracy on the current P_t at every time t, and our analysis works in this setting as well.

Baseline methods: We select a loss function ℓ : ℝ × {−1, 1} → ℝ_+, which takes a prediction and a label and outputs a non-negative loss value, and we begin by training a source model θ_0 that minimizes the loss on labeled data in the source domain:

$$\theta_0 = \operatorname*{arg\,min}_{\theta' \in \Theta} \sum_{(x_i, y_i) \in S_0} \ell(M_{\theta'}(x_i), y_i) \tag{2}$$

The non-adaptive baseline is to use θ_0 on the target domain, which incurs error Err(θ_0, P_T). Self-training uses unlabeled data to adapt a model. Given a model θ and unlabeled data S, ST(θ, S) denotes the output of self-training. Self-training pseudolabels each example in S using M_θ, and then selects a new model θ' that minimizes the loss on this pseudolabeled dataset. Formally,

$$ST(\theta, S) = \operatorname*{arg\,min}_{\theta' \in \Theta} \sum_{x_i \in S} \ell\big(M_{\theta'}(x_i),\ \operatorname{sign}(M_\theta(x_i))\big) \tag{3}$$

Here, self-training uses hard labels: we pseudolabel examples as either 1 or −1 based on the output of the classifier, instead of a probabilistic label based on the model's confidence; we refer to this as label sharpening. In our theoretical analysis, we sometimes want to describe the behavior of self-training when run on infinite unlabeled data from a probability distribution P:

$$ST(\theta, P) = \operatorname*{arg\,min}_{\theta' \in \Theta} \mathbb{E}_{X \sim P}\big[\ell\big(M_{\theta'}(X),\ \operatorname{sign}(M_\theta(X))\big)\big] \tag{4}$$

The direct-adaptation-to-target baseline takes the source model θ_0 and self-trains on the target data S_T, denoted ST(θ_0, S_T). Prior work often repeats this process of self-training on the target k times, which we denote by ST^k(θ_0, S_T).

Gradual self-training: In gradual self-training, we self-train on the finite unlabeled examples from each domain successively. That is, for i ≥ 1, we set:

$$\theta_i = ST(\theta_{i-1}, S_i) \tag{5}$$

Let ST(θ_0, (S_1, ..., S_T)) = θ_T denote the output of gradual self-training, which we evaluate on the target distribution P_T. As defined here, gradual self-training uses more data than directly adapting to the target, but we account for this in our theory and experiments.
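In code, these definitions look as follows. This is a sketch, again with a scikit-learn-style linear classifier and ±1 labels standing in for the general model family M_θ, which is an illustrative choice of ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def err(model, X, y):
    """Empirical 0-1 error Err(theta, P) of Equation (1), labels in {-1, +1}."""
    return float(np.mean(np.sign(model.decision_function(X)) != y))

def st(model, X_unlabeled, C=0.1):
    """One self-training step ST(theta, S): hard pseudolabels, then refit
    a regularized classifier, as in Equation (3)."""
    return LogisticRegression(C=C).fit(X_unlabeled, model.predict(X_unlabeled))

def st_k(model, X_target, k=10, C=0.1):
    """Direct-adaptation baseline ST^k(theta_0, S_T): self-train k times
    on the target alone, ignoring the intermediate domains."""
    for _ in range(k):
        model = st(model, X_target, C=C)
    return model
```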
3. Theory for the margin setting

We show that gradual self-training does better than directly adapting to the target, assuming that at each time step there exists some Lipschitz classifier (which can be different at each step) that can classify most of the data correctly with a margin (a standard assumption in learning theory), and that the shifts are small. Our main result (Theorem 3.2) bounds the error of gradual self-training. We show that our analysis is tight for gradual self-training (Example 3.4), and we explain why regularization, label sharpening, and the ramp loss are key to our bounds. Proofs are in Appendix A.

3.1. Assumptions

Models: We consider a model family Θ_R where each model M_θ, for θ ∈ Θ_R, is R-Lipschitz in the input in ℓ_2 norm, for some fixed R > 0. That is, for all x, x' ∈ ℝ^d:

$$|M_\theta(x) - M_\theta(x')| \le R \, \|x - x'\|_2 \tag{6}$$

An example is the set of regularized linear models whose weights have bounded ℓ_2 norm:

$$\Theta^L_R = \{(w, b) : w \in \mathbb{R}^d,\ b \in \mathbb{R},\ \|w\|_2 \le R\} \tag{7}$$

In this case, given (w, b) ∈ Θ^L_R, the model's output is M_{w,b}(x) = w·x + b. Our theory applies to non-linear models that are Lipschitz, but it may help the reader to think of linear models on a first reading.

Losses: We consider margin loss functions such as the hinge and ramp losses. Intuitively, a margin loss encourages a model to classify points correctly and confidently, by keeping correctly classified points far from the decision boundary. We consider the hinge function h and the ramp function r:

$$h(m) = \max(1 - m,\ 0) \tag{8}$$

$$r(m) = \min(h(m),\ 1) \tag{9}$$

The ramp loss is ℓ_r(ŷ, y) = r(y·ŷ), where ŷ ∈ ℝ is a model's prediction and y ∈ {−1, 1} is the true label. The hinge loss is the standard way to enforce a margin, but the ramp loss is more robust to outliers because it is bounded above: no single point contributes too much to the loss. We will see that the ramp loss is key to the theoretical guarantees for gradual self-training because of its robustness. We denote the population ramp loss as:

$$L_r(\theta, P) = \mathbb{E}_{X,Y \sim P}\big[\ell_r(M_\theta(X), Y)\big] \tag{10}$$

Given a finite sample S, the empirical loss is:

$$L_r(\theta, S) = \frac{1}{|S|} \sum_{(x, y) \in S} \ell_r(M_\theta(x), y) \tag{11}$$
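A quick numeric sketch of the two margin losses, showing the boundedness that makes the ramp loss robust to outliers:

```python
import numpy as np

def hinge(m):
    """Hinge function h(m) = max(1 - m, 0): unbounded above."""
    return np.maximum(1.0 - m, 0.0)

def ramp(m):
    """Ramp function r(m) = min(h(m), 1): clipped at 1, so no single
    point can contribute more than 1 to the loss."""
    return np.minimum(hinge(m), 1.0)

def ramp_loss(scores, labels):
    """Empirical ramp loss: mean of r(y * y_hat)."""
    return ramp(labels * scores).mean()

# One far-side outlier: the hinge loss blows up, the ramp loss saturates.
scores = np.array([2.0, 0.5, -10.0])
labels = np.array([1, 1, 1])
print(hinge(labels * scores).mean())  # (0 + 0.5 + 11) / 3 ~ 3.83
print(ramp_loss(scores, labels))      # (0 + 0.5 + 1) / 3 = 0.5
```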
Distributional distance: Our notion of distance is W_∞, the Wasserstein-infinity distance. Intuitively, W_∞ moves points from distribution P to Q by distance at most ε to match the distributions. For ease of exposition we consider the Monge form of W_∞, although the results can be extended to the Kantorovich formulation as well. Formally, given probability measures P, Q on ℝ^d:

$$W_\infty(P, Q) = \inf\Big\{\, \sup_{x \in \mathbb{R}^d} \|f(x) - x\|_2 \ :\ f : \mathbb{R}^d \to \mathbb{R}^d,\ f_{\#}P = Q \,\Big\} \tag{12}$$

As usual, # denotes the push-forward of a measure: for every set A ⊆ ℝ^d, f_#P(A) = P(f^{−1}(A)). In our case, we require that the conditional distributions do not shift too much. Given joint probability measures P, Q on the inputs and labels ℝ^d × {−1, 1}, the distance is:

$$\rho(P, Q) = \max\big( W_\infty(P_{X|Y=1}, Q_{X|Y=1}),\ W_\infty(P_{X|Y=-1}, Q_{X|Y=-1}) \big) \tag{13}$$

α*-low-loss assumption: Assume every domain admits a classifier with low loss α*: that is, there exists α* ≥ 0 such that for every domain P_t, there exists some θ_t* ∈ Θ_R with L_r(θ_t*, P_t) ≤ α* (θ_t* can be different for each domain).

Gradual shift assumption: For some ρ < 1/R, assume ρ(P_t, P_{t+1}) ≤ ρ for every pair of consecutive domains, where 1/R can be interpreted in terms of the regularization strength of the model class Θ_R, or as the geometric margin (distance from the decision boundary to the data) that the model tries to enforce.

Bounded model complexity assumption: For finite-sample guarantees, we assume that the Rademacher complexity of the model family, R_n(Θ_R; P), is bounded for all distributions P_0, ..., P_T. That is, for some fixed B > 0:

$$\mathcal{R}_n(\Theta_R; P) \le \frac{B}{\sqrt{n}}, \qquad P = P_0, \ldots, P_T \tag{14}$$

where R_n(Θ_R; P) is defined as usual as:

$$\mathcal{R}_n(\Theta_R; P) = \mathbb{E}\Big[\, \sup_{\theta \in \Theta_R} \frac{1}{n} \sum_{i=1}^{n} \sigma_i M_\theta(x_i) \,\Big] \tag{15}$$

where the expectation is taken over x_i ∼ P and σ_i ∼ Uniform({−1, 1}) sampled independently for i = 1, ..., n, so σ_i = 1 or σ_i = −1 with equal probability 0.5. An example is the set of regularized linear models Θ^L_R when the data is not too large on average: E_{X∼P}[∥X∥_2^2] ≤ β^2 for some β > 0. In this case a standard result, e.g. Theorem 11, page 82 in (Liang, 2016), gives R_n(Θ^L_R; P) ≤ βR/√n, so B = βR is the desired bound on the model complexity.

No label shift assumption: Assume that the fraction of Y = 1 labels does not change: P_t(Y) is the same for all t.

3.2. Domain shift: baselines fail

While the distribution shift from P_t to P_{t+1} is small, the distribution shift from the source P_0 to the target P_T can be large, as visualized in Figure 2. A classifier that gets 100% accuracy on P_0 might classify every example wrongly on P_T, even for T ≥ 2. In this case, directly adapting to P_T would not help. The following example formalizes this:

Example 3.1. Even under the α*-low-loss, no label shift, gradual shift, and bounded model complexity assumptions, there exist distributions P_0, P_1, P_2 and a source model θ ∈ Θ^L_R that gets 0 loss on the source (L_r(θ, P_0) = 0) but high loss on the target: L_r(θ, P_2) = 1. Self-training directly on the target does not help: L_r(ST(θ, P_2), P_2) = 1. This holds even if every domain is separable, so α* = 0.

Other methods: Our analysis focuses on self-training, but other bounds do not apply in this setting because they either assume that the density ratio between the target and source exists and is not too small (Jiayuan et al., 2006), or that the source and target are similar enough that we cannot discriminate between them (Ben-David et al., 2010).

3.3. Gradual self-training improves error

We show that gradual self-training helps over direct adaptation. For intuition, consider a simple example where α* = 0 and θ_0 classifies every example in P_0 correctly with geometric margin γ = 1/R. If each point shifts by distance < γ, then θ_0 gets every example in the new domain P_1 correct. If we had infinite unlabeled data from P_1, we could learn a model θ' that classifies every example in P_1 correctly with margin γ, since α* = 0. Repeating the process for P_2, ..., P_T, we get every example in P_T correct.

But what happens when we start with a model that has some error, for example because the data cannot be perfectly separated, and we have only finite unlabeled samples? We show that self-training still does better than adapting to the target domain directly, or than using the non-adaptive source classifier. The first main result of the paper says that if we have a model θ that gets low loss and the distribution shifts slightly, self-training gives us a model θ' that does not do too badly on the new distribution.

Theorem 3.2. Given P, Q with ρ(P, Q) = ρ < 1/R and with the same marginals on Y, so P(Y) = Q(Y), and assuming Θ_R has bounded model complexity with respect to P and Q: if we have an initial model θ and n unlabeled samples S from Q, and we set θ' = ST(θ, S), then with probability at least 1 − δ over the sampling of S, letting α* = min_{θ*∈Θ_R} L_r(θ*, Q),

$$L_r(\theta', Q) \ \le\ \frac{2}{1 - \rho R}\, L_r(\theta, P) \ +\ \alpha^* \ +\ \frac{4B}{\sqrt{n}} \ +\ \sqrt{\frac{2 \log(2/\delta)}{n}} \tag{16}$$

The proof of this result is in Appendix A, but we give a high-level sketch here. There exists some classifier that gets loss α* on Q, so if we had access to n labeled examples from Q, then, by a Rademacher complexity argument, empirical risk minimization would give us a classifier θ* with loss at most α* + O(B/√n) on the population; these are the second and third terms on the RHS of the bound. Since we only have unlabeled examples from Q, self-training uses θ to pseudolabel these n examples and then trains on this generated dataset. Now, if the distribution shift ρ is small relative to the geometric margin γ = 1/R, then we can show that the original model θ labels most examples in the new distribution Q correctly; that is, Err(θ, Q) is small if L_r(θ, P) is small. Finally, if most examples are labeled correctly, we show that because there exists some classifier θ* with low margin loss, self-training will also learn a classifier θ' with low margin loss L_r(θ', Q), which completes the proof.

We apply this argument inductively to show that after T time steps, the error of gradual self-training is exp(cT)·α_0, for some constant c, if the original error is α_0.
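To get a feel for what the unrolled recursion implies quantitatively, here is a small numeric sketch evaluating the bound in the form of Corollary 3.3 below. All parameter values are made up for illustration:

```python
import math

def gradual_bound(alpha0, rho_R, B, n, T, delta=0.1):
    """Evaluate beta^(T+1) * (alpha0 + (4B + sqrt(2 log(2T/delta))) / sqrt(n)),
    with beta = 2 / (1 - rho*R), the Corollary 3.3 bound."""
    beta = 2.0 / (1.0 - rho_R)
    sampling = (4 * B + math.sqrt(2 * math.log(2 * T / delta))) / math.sqrt(n)
    return beta ** (T + 1) * (alpha0 + sampling)

# Small source error, mild shifts, lots of unlabeled data per domain:
print(gradual_bound(alpha0=1e-3, rho_R=0.02, B=1.0, n=10**6, T=5))   # ~0.58
# The exponential dependence on T bites quickly:
print(gradual_bound(alpha0=1e-3, rho_R=0.02, B=1.0, n=10**6, T=10))  # > 1, vacuous
```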
Corollary 3.3. Under the α*-low-loss, no label shift, gradual shift, and bounded model complexity assumptions, if the source model θ_0 has low loss α_0 ≥ α* on P_0 (i.e. L_r(θ_0, P_0) ≤ α_0) and θ is the result of gradual self-training, θ = ST(θ_0, (S_1, ..., S_T)), then, letting β = 2/(1 − ρR),

$$L_r(\theta, P_T) \ \le\ \beta^{\,T+1} \left( \alpha_0 + \frac{4B + \sqrt{2 \log(2T/\delta)}}{\sqrt{n}} \right) \tag{17}$$

Corollary 3.3 says that the gradual structure allows some control of the error, unlike direct adaptation, where the accuracy on the target domain can be 0% even for T ≥ 2. Note that if the classes are separable and we have infinite data, then gradual self-training maintains 0 error.

Our next example shows that our analysis for gradual self-training in this setting is tight: if we start with a model with loss α_0, then the error can in fact increase exponentially, even with infinite unlabeled examples. Intuitively, at each step of self-training the loss can increase by a constant factor, which leads to exponential growth in the error.

Example 3.4. Even under the α*-low-loss, no label shift, gradual shift, and bounded model complexity assumptions, given 0 < α_0 ≤ 1/4, for every T there exist distributions P_0, ..., P_{2T} and θ_0 ∈ Θ^L_R with L_r(θ_0, P_0) ≤ α_0 such that, if θ' = ST(θ_0, (P_1, ..., P_{2T})), then L_r(θ', P_{2T}) ≥ min(0.5, ½·2^{2T}·α_0). Note that L_r is always in [0, 1].

This suggests that if we want sub-exponential bounds, we either need to make additional assumptions on the data distributions, or devise alternative algorithms that achieve better bounds (which we believe is unlikely).

3.4. Essential ingredients for gradual self-training

In this section, we explain why regularization, label sharpening, and the ramp loss are essential to bounding the error of gradual self-training (Theorem 3.2).

Regularization: Without regularization there is no incentive for the model to change when self-training: if we self-train without regularization, one optimal choice is to output the original model. The intuition is that since the model θ = (w, b) is used to pseudolabel examples, θ gets every pseudolabeled example correct. The scaled classifier θ' = (αw, αb) for large α then gets optimal loss, but θ and θ' make the same predictions on every example. We use ST*(θ, S) to denote the set of possible θ' that minimize the loss on the pseudolabeled distribution (Equation (3)):

Example 3.5. Given a model θ ∈ Θ^L (in other words, R = ∞) and unlabeled examples S where M_θ(x) ≠ 0 for all x ∈ S, there exists θ' ∈ ST*(θ, S) such that sign(M_θ(x)) = sign(M_{θ'}(x)) for all x ∈ ℝ^d.

More specific to our setting, our bounds require regularized models because regularized models classify the data correctly with a margin, so even after a mild distribution shift we get most new examples correct. Note that in traditional supervised learning, regularization is usually required when we have few examples, for better generalization to the population; in our setting, regularization is important for maintaining a margin even with infinite data.
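A numeric sketch of the scaling argument in Example 3.5: scaling an unconstrained linear model drives the hinge loss on its own pseudolabels toward zero without changing a single prediction, so unregularized self-training has no incentive to move.

```python
import numpy as np

# Scaling (w, b) -> (a*w, a*b) leaves every prediction unchanged, but the
# hard pseudolabels agree with (w, b) by construction, so the margins
# a * |X @ w + b| grow with a and the hinge loss shrinks toward 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w, b = np.array([1.0, -0.5]), 0.2

pseudo = np.sign(X @ w + b)                      # hard pseudolabels from (w, b)
for a in [1.0, 10.0, 100.0]:
    margins = pseudo * (X @ (a * w) + a * b)     # = a * |X @ w + b| >= 0
    print(a, np.maximum(1 - margins, 0).mean())  # hinge loss shrinks toward 0
```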
Label sharpening: When self-training, we pseudolabel examples as 1 or −1 based on the output of the classifier. Prior work sometimes uses soft labels (Najafi et al., 2019), where each example is assigned a probability of its label being 1 or −1, and training uses a logistic loss. The loss on the soft-pseudolabeled distribution is defined as:

$$L_{\sigma,\theta}(\theta') = \mathbb{E}_{X \sim P}\big[\, \mathrm{ll}\big(\sigma(M_\theta(X)),\ \sigma(M_{\theta'}(X))\big) \,\big] \tag{18}$$

where σ is the sigmoid function and ll is the log loss:

$$\mathrm{ll}(p, p') = -\big( p \log p' + (1 - p) \log(1 - p') \big) \tag{19}$$

Self-training then picks θ' ∈ Θ_R minimizing L_{σ,θ}(θ'). A simple example shows that this form of self-training may never update the parameters, because θ itself minimizes L_{σ,θ}:

Example 3.6. For all Θ_R and θ ∈ Θ_R, θ is a minimizer of L_{σ,θ}; that is, for all θ' ∈ Θ_R, L_{σ,θ}(θ) ≤ L_{σ,θ}(θ').

This suggests that we sharpen the soft labels to encourage the model to update its parameters. Note that this is true even on finite data: set P to be the empirical distribution.
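A numeric check of Example 3.6 (a sketch, taking M_θ linear): the soft-label objective is the cross-entropy between the teacher's probabilities and the student's, which the teacher itself already minimizes (Gibbs' inequality), so random perturbations never improve it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w = np.array([1.0, -2.0])                        # teacher parameters

def soft_loss(w_student, w_teacher):
    """Mean cross-entropy ll(sigma(M_teacher(x)), sigma(M_student(x)))."""
    p = 1 / (1 + np.exp(-(X @ w_teacher)))       # teacher's soft labels
    q = 1 / (1 + np.exp(-(X @ w_student)))       # student's probabilities
    return -(p * np.log(q) + (1 - p) * np.log(1 - q)).mean()

base = soft_loss(w, w)                           # loss of the teacher itself
for _ in range(5):                               # perturbations never do better
    print(soft_loss(w + 0.1 * rng.normal(size=2), w) >= base)  # always True
```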
Ramp versus hinge loss: We use the ramp loss, but does the more popular hinge loss L_h work? Unfortunately, the next example shows that we cannot control the error of gradual self-training with the hinge loss, even with infinite examples, so the ramp loss is important for Theorem 3.2.

Example 3.7. Even under the α*-low-loss, no label shift, and gradual shift assumptions, given α_0 > 0, there exist distributions P_0, P_1, P_2 and θ_0 ∈ Θ^L_R with L_h(θ_0, P_0) ≤ α_0 such that, if θ' = ST(θ_0, (P_1, P_2)), then L_h(θ', P_2) ≥ Err(θ', P_2) = 1 (θ' gets every example in P_2 wrong), where we use the hinge loss in self-training.

We only analyzed the statistical effects here; the hinge loss tends to work better in practice because it is much easier to optimize and is convex for linear models.

3.5. Self-training without domain shift

Example 3.4 showed that when the distribution shifts, the loss of gradual self-training can grow exponentially (though the non-adaptive baseline has unbounded error). Here we show that if there is no distribution shift, the error can only grow linearly: if P_0 = ... = P_T, given a classifier with loss α_0, the loss after gradual self-training is at most linear in T.

Proposition 3.8. Given α_0 > 0, distributions P_0 = ... = P_T, and a model θ_0 ∈ Θ_R with L_r(θ_0, P_0) ≤ α_0, we have L_r(θ', P_T) ≤ α_0(T + 1), where θ' = ST(θ_0, (P_1, ..., P_T)).

In Appendix A, we show that self-training can indeed hurt without domain shift: given a classifier with loss α on P, self-training on P can increase the classifier's loss on P to 2α, whereas the non-adaptive baseline keeps error α.

4. Theory for the Gaussian setting

In this section we study an idealized Gaussian setting to understand the conditions under which self-training can have better-than-exponential error bounds: we show that if we begin with a good classifier, the distribution shifts are not too large, and we have infinite unlabeled data, then gradual self-training maintains a good classifier.

4.1. Setting

We assume P_t(X | Y = y) is an isotropic Gaussian in d dimensions for each y ∈ {−1, 1}. We can shift the data to have mean 0, so we suppose:

$$P_t(X \mid Y = y) = \mathcal{N}(y \mu_t,\ \sigma_t^2 I) \tag{20}$$

where µ_t ∈ ℝ^d and σ_t > 0 for each t. As usual, we assume the shifts are gradual: for some B > 0, ∥µ_{t+1} − µ_t∥_2 ≤ B/4. We assume that the means of the two classes do not get closer than the shift, or else it would be impossible to distinguish between no shift and the distributions of the two classes swapping: so ∥µ_t∥_2 ≥ B for all t. We assume infinite unlabeled data (access to P_t(X)) in our analysis. Given labeled data in the source, we use the objective:

$$L(w, P) = \mathbb{E}_{X,Y \sim P}\big[\varphi\big(Y (w \cdot X)\big)\big] \tag{21}$$

For unlabeled data, self-training performs descent steps on an underlying objective function (Amini and Gallinari, 2003), which we focus on:

$$U(w, P) = \mathbb{E}_{X \sim P}\big[\varphi\big(|w \cdot X|\big)\big] \tag{22}$$

We assume φ : ℝ → ℝ_+ is a continuous, non-increasing function which is strictly decreasing on [0, 1]; these are regularity conditions which the hinge, ramp, and logistic losses satisfy. If w' = ST(w, P), then U(w', P) ≤ U(w, P) (Amini and Gallinari, 2003).

The algorithm we analyze begins by choosing w_0 from labeled data in P_0, and then updates the parameters with unlabeled data from P_t for 1 ≤ t ≤ T:

$$w_t = \operatorname*{arg\,min}_{\|w\|_2 \le 1,\ \|w - w_{t-1}\|_2 \le 1/2} U(w, P_t) \tag{23}$$

Note that we do not show that self-training actually converges to the constrained minimum of U in Equation (23); prior work only shows that self-training descends on U, and we leave this optimization analysis to future work.

4.2. Analysis

Let w*(µ) = µ / ∥µ∥_2, where ∥µ∥_2 ≥ B > 0. Note that w*(µ_t) minimizes the 0-1 error on P_t. Our main theorem says that if we start with a regularized classifier w_0 that is near w*(µ_0), which we can learn from labeled data, and the distribution shifts ∥µ_{t+1} − µ_t∥_2 are not too large, then we recover the optimal w_T = w*(µ_T). The key challenge is that the unlabeled loss U in d dimensions is non-convex, with multiple local minima, so directly minimizing U does not guarantee a solution that minimizes the labeled loss L.

Theorem 4.1. In the Gaussian setting, if ∥w_0 − w*(µ_0)∥_2 ≤ 1/4, then we recover w_T = w*(µ_T).

Proving this reduces to proving the single-step case: at each step t + 1, if we have a classifier w_t that is close to w*(µ_t), then we recover w_{t+1} = w*(µ_{t+1}). We give intuition here and the formal proof in Appendix B. We first show that if µ changes by a small amount, the optimal parameters (for the labeled loss) do not change too much. Then, since w_t is close to w*(µ_t), w_t is not too far from w*(µ_{t+1}). The key step in our argument is showing that the unique minimum of the unlabeled loss U(w, P_{µ_{t+1}}) in the neighborhood of w_t is w*(µ_{t+1}); looking for a minimum nearby is important because if we deviate too far we might select other, bad minima. We consider an arbitrary w near w*(µ_{t+1}) and construct a pairing of points (a, b) in ℝ^d, using a convexity argument to show that (a, b) contributes more to the loss of w than to that of w*(µ_{t+1}).
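As a rough simulation of this setting (a sketch: we run plain projected gradient descent on the hinge-based unlabeled objective as a stand-in for the constrained minimization in Equation (23), and all constants are illustrative rather than the ones in our analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_inputs(mu, n=5000, sigma=1.0):
    """Unlabeled inputs from P_t: X | Y=y ~ N(y * mu, sigma^2 I)."""
    y = rng.choice([-1, 1], size=n)
    return y[:, None] * mu + sigma * rng.normal(size=(n, len(mu)))

def project(w, center, radius):
    """Project onto {||w||_2 <= 1} intersected with a ball around `center`."""
    gap = np.linalg.norm(w - center)
    if gap > radius:
        w = center + radius * (w - center) / gap
    return w / max(1.0, np.linalg.norm(w))

def self_train_gaussian(w_prev, X, radius=0.5, lr=0.05, steps=300):
    """Approximately minimize U(w) = E[h(|w.x|)] (h = hinge) near w_prev,
    as in Equation (23), by projected gradient descent."""
    w = w_prev.copy()
    for _ in range(steps):
        m = X @ w
        g = np.where(np.abs(m) < 1, -np.sign(m), 0.0)   # d h(|m|) / dm
        w = project(w - lr * (g[:, None] * X).mean(axis=0), w_prev, radius)
    return w

# The class means rotate gradually; w should track w*(mu_t) = mu_t / ||mu_t||_2.
w = np.array([1.0, 0.0])  # assumed close to w*(mu_0)
for t in range(1, 11):
    angle = np.deg2rad(6 * t)
    mu = 2.0 * np.array([np.cos(angle), np.sin(angle)])
    w = self_train_gaussian(w, sample_inputs(mu))
    print(t, round(float(w @ mu) / np.linalg.norm(mu), 3))  # alignment near 1
```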
5. Experiments

Our theory leads to practical insights: we show that regularization and label sharpening are important for gradual self-training, that leveraging the gradual shift structure improves target accuracy, and we give intuition for when the gradual shift assumption may not help. We run experiments on three datasets (see Appendix C for more details):

Rotating MNIST: a semi-synthetic dataset where we rotate each MNIST image by an angle between 0 and 60 degrees. We split the 50,000 MNIST training images into a source domain (images rotated between 0 and 5 degrees), intermediate domains (rotations between 5 and 60 degrees), and a target domain (rotations between 55 and 60 degrees). Note that each image is seen at exactly one angle, so the training procedure cannot track a single image across different angles.

Cover Type: a dataset from the UCI repository where the goal is to predict the forest cover type at a particular location given 54 features (Blackard and Dean, 1999). We sort the examples by increasing distance to a water body, splitting the data into a source domain (first 50K examples), intermediate domains (next 400K examples), and a target domain (final 50K examples).

Portraits: a real dataset comprising photos of high school seniors across years (Ginosar et al., 2017); the model's goal is to classify gender. We split the data into a source domain (first 2000 images), intermediate domains (next 14000 images), and a target domain (next 2000 images).

In Appendix C we also include synthetic experiments on a mixture-of-Gaussians dataset which resembles the Gaussian setting of our theory, except that the covariance matrices are not isotropic and the numbers of labeled and unlabeled samples are finite, on the order of the dimension d.

5.1. Leveraging gradual shifts improves adaptation

Our goal is to see whether adapting to the gradual shift sequentially helps compared to directly adapting to the target. We evaluate four methods:

Source: simply train a classifier on the labeled source examples.
Target self-train: repeatedly self-train on the unlabeled target examples, ignoring the intermediate examples.
All self-train: pool all the unlabeled examples from the intermediate and target domains, and repeatedly self-train on this pooled dataset to adapt the initial source classifier.
Gradual self-train: sequentially self-train on unlabeled data from each successive intermediate domain, and finally on unlabeled data from the target domain, to adapt the initial source classifier.

For rotating MNIST, we ensured that target self-train sees as many unlabeled target examples as gradual self-train sees across all the intermediate examples. Since Portraits and Cover Type are real datasets, we cannot synthesize more examples from the target, so target self-train uses fewer unlabeled examples there; however, the all self-train baseline uses all of the unlabeled examples from all domains. For rotating MNIST and Portraits we used a 3-layer convolutional network with dropout (0.5) and batchnorm on the last layer, which achieves 97%-98% accuracy on held-out examples in the source domain. For the Cover Type dataset we used a feedforward neural network with 2 hidden layers, dropout (0.5), and batchnorm on the last layer, which got higher accuracy than logistic regression. For each step of self-training, we filter out the 10% of examples where the model's prediction was least confident; Appendix C shows similar findings without this filtering. To account for variance in initialization and optimization, we ran each method 5 times and give 90% confidence intervals. More experimental details are in Appendix C.

Table 1. Percentage classification accuracies for gradual self-training (ST) and baselines on three datasets, with 90% standard errors for the mean over 5 runs in parentheses. Gradual ST closes about half the gap between the source and oracle classifiers on all three datasets, and does better than self-training directly on the target or self-training on all the unlabeled data pooled together.

                ROT MNIST      COVER TYPE     PORTRAITS
SOURCE MODEL    31.9 (±1.7)    62.8 (±5.0)    75.3 (±1.6)
TARGET ST       +1.0 (±0.5)    +0.7 (±1.5)    +1.6 (±1.2)
ALL ST          +6.1 (±1.1)    +1.5 (±2.2)    +3.5 (±1.8)
GRADUAL ST      +56.0 (±1.5)   +7.6 (±3.7)    +8.5 (±0.9)
ORACLE          +59.6 (±1.2)   +16.8 (±3.7)   +17.0 (±0.8)

Table 1 shows that leveraging the gradual structure leads to improvements over the baselines on all 3 datasets, and closes over half the gap between the source and oracle classifiers.
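The confidence-filtering step used in each round of self-training above can be sketched as follows (assuming a scikit-learn-style predict_proba for illustration; our actual models are neural networks):

```python
import numpy as np

def filter_confident(model, X, keep=0.9):
    """Keep the `keep` fraction of examples where the model's predicted
    class probability is highest, dropping the least-confident 10%
    before the self-training update."""
    confidence = model.predict_proba(X).max(axis=1)
    cutoff = np.quantile(confidence, 1.0 - keep)
    return X[confidence >= cutoff]
```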
5.2. Important ingredients for gradual self-training

Our theory suggests that regularization and label sharpening are important for gradual self-training, because without them there is no incentive for the model to change (Section 3.4). However, prior work suggests that overparameterized neural networks trained with stochastic gradient methods have strong implicit regularization (Zhang et al., 2017; Hardt et al., 2016): in the supervised setting they perform well without explicit regularization even though the number of parameters is much larger than the number of data points. Is this implicit regularization enough for gradual self-training?

In our experiments, we see that even without explicit regularization, or with soft probabilistic labels, gradual self-training does slightly better than the non-adaptive source classifier, suggesting that this implicit regularization may have some effect. However, explicit regularization and hard labeling give a much larger accuracy boost.

Regularization is important: We repeat the experiment of Section 5.1, comparing gradual self-training with and without regularization, that is, disabling dropout and batchnorm (Ioffe and Szegedy, 2015) in the neural network experiments. In both cases, we first train an unregularized model on labeled examples in the source domain. Then, we either turn on regularization during self-training, or keep the model unregularized. We control the original model to be the same in both cases, to test whether regularization helps in the self-training process, as opposed to in learning a better supervised classifier. Table 2 shows that accuracies are significantly better with regularization, though unregularized performance is still better than the non-adaptive source classifier.

Soft labeling hurts: We ran the same experiment as Section 5.1, comparing gradual self-training with hard labels versus the probabilistic labels output by the model. Table 2 shows that accuracies are better with hard labels. Note that in datasets with more intrinsic uncertainty, soft labeling may work well (Mey and Loog, 2016).

Table 2. Classification accuracies for gradual self-train with explicit regularization and hard labels (Gradual ST), without regularization but with hard labels (No Reg), and with regularization but with soft labels (Soft Labels). Gradual self-train does best with explicit regularization and hard labels, as our theory suggests, even for neural networks with implicit regularization.

              ROT MNIST     COV TYPE      PORTRAITS
SOFT LABELS   44.1 (±2.3)   63.2 (±8.5)   80.1 (±1.8)
NO REG        45.8 (±2.5)   70.7 (±1.8)   76.5 (±1.0)
GRADUAL ST    83.8 (±2.5)   73.5 (±1.6)   82.6 (±0.8)

Regularization is still important with more data: In supervised learning, the importance of regularization diminishes as we have more training examples; if we had access to infinite data (the population), we would not need regularization. On the other hand, for gradual domain adaptation, the theory says regularization is needed to adapt to the dataset shift even with infinite data, and it predicts that regularization remains important even as the sample size grows. To test this hypothesis, we construct a rotating MNIST dataset where we increase the sample sizes. The source domain P_0 consists of N ∈ {2000, 5000, 20000} MNIST images. P_t then consists of these same N images rotated by angle 3t, for 0 ≤ t ≤ 20. The goal is to get high accuracy on P_20, these images rotated by 60 degrees: the model does not have to generalize to unseen images, only to seen images at different angles. We compare using versus not using regularization during gradual self-training. Table 3 shows that regularization is still important here, and the gap between regularized and unregularized gradual self-training does not shrink much with more data.

Table 3. Classification accuracies for gradual self-train on rotating MNIST as we vary the number of samples. Unlike in previous experiments, here the same N samples are rotated, so the models do not have to generalize to unseen images, only to seen images at different angles. The gap between regularized and unregularized gradual self-training does not shrink much with more data.

          N=2000        N=5000        N=20,000
SOURCE    28.3 (±1.4)   29.9 (±2.5)   33.9 (±2.6)
NO REG    55.7 (±3.9)   53.6 (±4.0)   55.1 (±3.9)
REG       93.1 (±0.8)   91.7 (±2.4)   87.4 (±3.1)
5.3. When does gradual shift help?

Our theory in Section 3 says that gradual self-training works well if the shift between domains is small in Wasserstein-infinity distance; it may not be enough for the total variation or KL-divergence between P and Q to be small. To test this, we run an experiment on a modified version of the rotating MNIST dataset. We keep the source and target domains the same as before, but change the intermediate domains. In Table 1 we saw that gradual self-training works well when we have intermediate images rotated by gradually increasing angles. Another type of gradual transformation is to gradually introduce more examples rotated by 55 to 60 degrees. That is, in the i-th domain, a (20 − i)/20 fraction of the examples are MNIST images rotated by 0 to 5 degrees, and an i/20 fraction are MNIST images rotated by 55 to 60 degrees, for 1 ≤ i ≤ 20. Here the total-variation distance between successive domains is small, but intuitively the Wasserstein distance is large, because each moved image undergoes a large (roughly 55 degree) rotation. As the theory suggests, gradual self-training does not outperform direct self-training on the target here: gradual self-training gets 33.5 (±1.5)% accuracy on the target, while direct adaptation to the target gets 33.0 (±2.2)% over 5 runs. Intuitively, gradual self-training helps when most of the distribution shifts by a small amount; it may not be sufficient if only a small fraction of the distribution shifts, but by a large amount. We hope this gives practitioners some insight into when gradual self-training helps.
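To make the contrast concrete, here is a sketch of these mixing-based intermediate domains (a hypothetical helper; we assume source and target pools of the same size):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_domain(X_source, X_target, i, T=20):
    """Domain i of the alternative interpolation in Section 5.3: each
    example is target-like with probability i/T and source-like
    otherwise. Adjacent domains are close in total variation, but far in
    W_infinity: every example that flips moves the whole source-to-target
    distance at once, which is why gradual self-training stops helping."""
    is_target = rng.random(len(X_source)) < i / T
    return np.where(is_target[:, None], X_target, X_source)
```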
6. Related work

Self-training is a popular method in semi-supervised learning (Lee, 2013; Sohn et al., 2020) and domain adaptation (Long et al., 2013; Zou et al., 2019; Inoue et al., 2018), and is related to entropy minimization (Grandvalet and Bengio, 2005). Recent work shows that a robust variant of self-training can mitigate the tradeoff between standard and adversarial accuracy (Raghunathan et al., 2020). Related to self-training is co-training (Blum and Mitchell, 1998), which assumes that the input features can be split into two or more views that are conditionally independent given the label. Other theory in semi-supervised learning (Rigollet, 2007; Singh et al., 2008; Ben-David et al., 2008) does not analyze domain shift.

Unsupervised domain adaptation, where the goal is to adapt directly from a labeled source domain to an unlabeled target domain, is widely studied (Quiñonero-Candela et al., 2009). The key challenge for domain adaptation arises when the source and target domains are very different, when it is easy to discriminate between the two domains and their supports do not overlap (Zhao et al., 2019; Shu et al., 2018), which is typical in the modern high-dimensional regime. Importance-weighting based methods (Shimodaira, 2000; Sugiyama et al., 2007; Jiayuan et al., 2006) assume the domains are close, with bounds depending on the expected density ratios between the source and target. In practice, even if the domains overlap, the density ratio often scales exponentially in the dimension, in which case these methods perform poorly. These methods also assume that P(Y | X) is the same for the source and target, which we do not require. The theory of HΔH-divergence (Ben-David et al., 2010; Mansour et al., 2009) gives conditions for when a model trained on the source does well on the target without any adaptation. Empirical methods aim to learn domain-invariant representations (Tzeng et al., 2014; Ganin and Lempitsky, 2015; Tzeng et al., 2017), but there are no theoretical guarantees for these methods (Zhao et al., 2019); they require several additional heuristics (Hoffman et al., 2018), and they work well on some tasks but not others (Bobu et al., 2018; Peng et al., 2019). Hoffman et al. (2014); Michael et al. (2018); Markus et al. (2018); Bobu et al. (2018), among others, propose approaches for gradual domain adaptation. Our setting differs from online learning (Shalev-Shwartz, 2007), lifelong learning (Silver et al., 2013), and concept drift (Kramer, 1988; Bartlett, 1992; Bartlett et al., 1996), since we only have unlabeled data from the shifted distributions. To the best of our knowledge, we are the first to develop a theory for gradual domain adaptation, and to investigate when and why the gradual structure helps.

7. Conclusion and Future Work

Our work suggests that the gradual shift structure, which appears often in applications, enables us to reliably adapt to very different target distributions. There are many exciting avenues for future work:

1. Better algorithms for gradual shifts: Can we develop better algorithms and more general theory for gradual domain adaptation?
2. Discovering gradual shifts: Can we apply gradual self-training to spatial data, where datapoints close in space may be similar? More generally, can we learn which datapoints are similar, and then do gradual self-training?
3. Structured domain adaptation: Are there other structures in real applications that we can leverage to reliably adapt to very different target distributions?

Acknowledgements: The authors would like to thank the Open Philanthropy Project and the Stanford Graduate Fellowship program for funding. This work is also partially supported by the Stanford Data Science Initiative and the Stanford Artificial Intelligence Laboratory. We are grateful to the anonymous reviewers, and to Rui Shu, Robin Jia, Pang Wei Koh, Michael Kim, Stephen Mussman, Csaba Szepesvari, Judy Hoffman, Shai Ben-David, Lin Yang, Michael Xie, Aditi Raghunathan, Sanjana Srivastava, Nelson Liu, Rachel Holladay, Megha Srivastava, and Albert Gu for insightful comments.

Reproducibility: All code, data, and experiments can be found on CodaLab at https://bit.ly/gradual-shift-codalab; the code is also on GitHub at https://github.com/p-lambda/gradual_domain_adaptation.

References

A. Vergara, S. Vembu, T. Ayhan, M. A. Ryan, M. L. Homer, and R. Huerta. Chemical gas sensor drift compensation using classifier ensembles. Journal of the American Statistical Association, pages 320–329, 2012.

A. Bobu, E. Tzeng, J. Hoffman, and T. Darrell. Adapting to continuously shifting domains. In International Conference on Learning Representations Workshop (ICLR), 2018.

A. Farshchian, J. A. Gallego, J. P. Cohen, Y. Bengio, L. E. Miller, and S. A. Solla. Adversarial domain adaptation for stable brain-machine interfaces. In International Conference on Learning Representations (ICLR), 2019.
T. S. Sethi and M. Kantardzic. On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications, 82:77–99, 2017.

H. Zhao, R. T. des Combes, K. Zhang, and G. J. Gordon. On learning invariant representation for domain adaptation. In International Conference on Machine Learning (ICML), 2019.

J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.

R. Shu, H. H. Bui, H. Narui, and S. Ermon. A DIRT-T approach to unsupervised domain adaptation. In International Conference on Learning Representations (ICLR), 2018.

X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang. Moment matching for multi-source domain adaptation. In International Conference on Computer Vision (ICCV), 2019.

O. Chapelle, A. Zien, and B. Schölkopf. Semi-Supervised Learning. MIT Press, 2006.

Q. Xie, M. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves ImageNet classification. arXiv, 2020.

J. Uesato, J. Alayrac, P. Huang, R. Stanforth, A. Fawzi, and P. Kohli. Are labels required for improving adversarial robustness? In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

A. Najafi, S. Maeda, M. Koyama, and T. Miyato. Robustness to adversarial perturbations in learning from incomplete data. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

S. Ginosar, K. Rakelly, S. M. Sachs, B. Yin, C. Lee, P. Krähenbühl, and A. A. Efros. A century of portraits: A visual historical record of American high school yearbooks. IEEE Transactions on Computational Imaging, 3, 2017.

P. Liang. Statistical learning theory. https://web.stanford.edu/class/cs229t/notes.pdf, 2016.

H. Jiayuan, S. A. J., G. Arthur, B. K. M., and S. Bernhard. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems (NeurIPS), 2006.

S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.

M. Amini and P. Gallinari. Semi-supervised learning with explicit misclassification modeling. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.

J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. In Computers and Electronics in Agriculture, 1999.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.

M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (ICML), pages 1225–1234, 2016.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.

A. Mey and M. Loog. A soft-labeled self-training approach. In International Conference on Pattern Recognition, 2016.
D. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, 2013.

K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv, 2020.

M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2200–2207, 2013.

Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang. Confidence regularized self-training. arXiv preprint arXiv:1908.09822, 2019.

N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5001–5009, 2018.

Y. Grandvalet and Y. Bengio. Entropy regularization. In Semi-Supervised Learning, 2005.

A. Raghunathan, S. M. Xie, F. Yang, J. C. Duchi, and P. Liang. Understanding and mitigating the tradeoff between robustness and accuracy. In International Conference on Machine Learning (ICML), 2020.

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Learning Theory (COLT), 1998.

P. Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research (JMLR), 8:1369–1392, 2007.

A. Singh, R. Nowak, and J. Zhu. Unlabeled data: Now it helps, now it doesn't. In Advances in Neural Information Processing Systems (NeurIPS), 2008.

S. Ben-David, T. Lu, and D. Pál. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In Conference on Learning Theory (COLT), 2008.

J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.

H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227–244, 2000.

M. Sugiyama, M. Krauledat, and K. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research (JMLR), 8:985–1005, 2007.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory (COLT), 2009.

E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), pages 1180–1189, 2015.

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.

J. Hoffman, T. Darrell, and K. Saenko. Continuous manifold based adaptation for evolving visual domains. In Computer Vision and Pattern Recognition (CVPR), 2014.

G. Michael, E. Dennis, K. B. Mara, B. Peter, and M. Dorit. Gradual domain adaptation for segmenting whole slide images showing pathological variability. In Image and Signal Processing, 2018.

W. Markus, B. Alex, and P. Ingmar. Incremental adversarial domain adaptation for continually changing environments. In International Conference on Robotics and Automation (ICRA), 2018.

S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
D. L. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: Beyond learning algorithms. In Association for the Advancement of Artificial Intelligence (AAAI), volume 13, 2013.

A. H. Kramer. Learning despite distribution shift. In Connectionist Models Summer School, 1988.

P. L. Bartlett. Learning with a slowly changing distribution. In Conference on Learning Theory (COLT), 1992.

P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing concepts by exploiting the structure of change. Machine Learning, 41, 1996.