# Online Adaptation to Label Distribution Shift

Ruihan Wu (Cornell University, rw565@cornell.edu) · Chuan Guo (Facebook AI Research, chuanguo@fb.com) · Yi Su, Kilian Q. Weinberger (Cornell University, {ys756, kqw4}@cornell.edu)

**Abstract.** Machine learning models often encounter distribution shifts when deployed in the real world. In this paper, we focus on adaptation to label distribution shift in the online setting, where the test-time label distribution is continually changing and the model must dynamically adapt to it without observing the true labels. Leveraging a novel analysis, we show that the lack of true labels does not hinder estimation of the expected test loss, which enables the reduction of online label shift adaptation to conventional online learning. Informed by this observation, we propose adaptation algorithms inspired by classical online learning techniques such as Follow The Leader (FTL) and Online Gradient Descent (OGD) and derive their regret bounds. We empirically verify our findings under both simulated and real-world label distribution shifts and show that OGD is particularly effective and robust to a variety of challenging label shift scenarios.

## 1 Introduction

A common assumption in machine learning is that the training set and test set are drawn from the same distribution [25]. However, this assumption often does not hold in practice when models are deployed in the real world [3, 28]. One common type of distribution shift is *label shift*, where the conditional distribution $p(x|y)$ is fixed but the label distribution $p(y)$ changes over time. This phenomenon is most typical when the label $y$ is the causal variable and the feature $x$ is the observation [31]. For instance, a model trained to diagnose malaria can encounter a much higher prevalence of the disease in tropical regions.

Prior work has primarily studied the problem of label shift in the *offline* setting [2, 4, 22, 42], where the shift occurs only once after the model is trained. However, in many common scenarios, the label distribution can change continually over time. For example, the prevalence of a disease such as influenza changes depending on the season and whether an outbreak occurs. The distribution of news article categories is influenced by real-world events such as political tension and the economy. In these scenarios, modeling the test-time label distribution as stationary can be inaccurate and over-simplistic, especially if the model is deployed over a long period of time.

To address this shortcoming, we define and study the problem of *online label shift*, where the distribution shift is modeled as an online process. An adaptation algorithm in this setting is tasked with making sequential predictions on random samples from a drifting test distribution and dynamically adjusting the model's predictions in real time. Different from online learning, the test label is not observed after making a prediction, which makes the problem much more challenging. Nevertheless, we show that it is possible to adapt to the drifting test distribution in an online fashion, despite never observing a single label. In detail, we describe a method for obtaining unbiased estimates of the expected 0-1 loss and its gradient using only unlabeled samples.
This allows the reduction of online label shift adaptation to conventional online learning, which we then utilize to define two algorithms inspired by the classical techniques Online Gradient Descent (OGD) and Follow The History (FTH), the latter being a close relative of the Follow The Leader algorithm. Under mild and empirically verifiable assumptions, we prove that OGD and FTH are as optimal as a fixed classifier that had knowledge of the shifting test distribution in advance.

To validate our theoretical findings, we evaluate our adaptation algorithms on CIFAR-10 [20] under simulated online label shifts, as well as on the arXiv dataset (https://www.kaggle.com/Cornell-University/arxiv) for paper categorization, which exhibits real-world label shift across years of submission history. We find that OGD and FTH are able to consistently match or outperform the optimal fixed classifier, corroborating our theoretical results. Furthermore, OGD empirically achieves the best classification accuracy on average against a diverse set of challenging label shift scenarios, and can be easily adopted in practical settings.

## 2 Problem Setting

We first introduce the necessary notation and the general problem of label shift adaptation. Consider a classification problem with feature domain $X$ and label space $Y = \{1, \dots, M\}$. We assume that a classifier $f : X \to Y$ operates by predicting a probability vector $P_f(x) \in \Delta^{M-1}$, where $\Delta^{M-1}$ denotes the $(M-1)$-dimensional probability simplex. The corresponding classification function is $f(x) = \arg\max_{y \in Y} P_f(x)[y]$. While the focus of our paper is on neural networks, other models such as SVMs [7] and decision trees [27] that do not explicitly output probabilities can be calibrated to do so as well [26, 41].

**Label distribution shift.** Let $Q$ be a data distribution defined on $X \times Y$, and denote by $Q(x|y)$ the class-conditional distribution and by $Q(y)$ the marginal label distribution, so that $Q(x, y) = Q(x|y)Q(y)$. Standard supervised learning assumes that the training distribution $Q_{\text{train}}$ and the test distribution $Q_{\text{test}}$ are identical. In reality, a deployed model often encounters distribution shift, where the test distribution $Q_{\text{test}}$ may be substantially different from $Q_{\text{train}}$. We are interested in the scenario where the class-conditional distribution remains constant (i.e., $Q_{\text{test}}(x|y) = Q_{\text{train}}(x|y)$ for all $x \in X, y \in Y$) while the marginal label distribution changes (i.e., $Q_{\text{test}}(y) \neq Q_{\text{train}}(y)$ for some $y \in Y$), a setting we refer to as *label shift*. The problem of *label shift adaptation* is to design algorithms that can adjust the predictions of a classifier $f$ trained on $Q_{\text{train}}$ to perform well on $Q_{\text{test}}$.

**Offline label shift.** The general setting in this paper and in prior work on label shift adaptation is that the label marginal distribution $Q_{\text{test}}(y)$ is unknown. Indeed, the model may be deployed in a foreign environment where prior knowledge about the label marginal distribution is limited or inaccurate. Prior work tackles this challenge by estimating $Q_{\text{test}}(y)$ using unlabeled samples drawn from $Q_{\text{test}}$ [2, 4, 10, 22, 30].
Such adaptation methods operate under the *offline* setting, since the distribution shift occurs only once and the adaptation algorithm is provided with an unlabeled set of samples from the test distribution $Q_{\text{test}}$ upfront.

**Online label shift.** In certain scenarios, offline label shift adaptation may not be applicable. For example, consider a medical diagnosis model classifying whether a patient suffers from flu or hay fever. Although the two diseases share similar symptoms, throughout the year one is far more prevalent than the other, depending on the season and whether an outbreak occurs. Importantly, this label distribution shift is gradual and continual, and the model must make predictions on incoming patients in real time without observing an unlabeled batch of examples first. Motivated by these challenging use cases, we deviate from prior work and study the problem where $Q_{\text{test}}$ is not stationary during test time and must be adapted to in an online fashion.

Formally, we consider a discrete time scale with a shifting label distribution $Q_t(y)$ for $t = 0, 1, 2, \dots$, where $Q_0 = Q_{\text{train}}$ denotes the training distribution. For each $t$, denote by $q_t \in \Delta^{M-1}$ the label marginal probability vector, so that $q_t[i] = P_{Q_t}(y_t = i)$ for all $i = 1, \dots, M$. Let $H$ be a fixed hypothesis space, let $f_0 \in H$ be a classifier trained on samples from $Q_0$, and let $f_1 = f_0$. At time step $t$, nature draws $(x_t, y_t) \sim Q_t$ and the learner predicts the label of $x_t$ using $f_t$. Similar to prior work on offline label shift, we impose a crucial but realistic restriction: the true label $y_t$ and any incurred loss are both unobserved. This restriction mimics real-world situations such as medical diagnosis, where the model may not receive any feedback while deployed. Despite this limitation, the learner may still seek to adapt the classifier to the test-time distribution after predicting on $x_t$. Namely, the adaptation algorithm $A$ takes as input $f_0$ and the unlabeled set of historical data $\{x_1, \dots, x_t\}$ and outputs a new classifier $f_{t+1} \in H$ for time step $t + 1$. This process is repeated until some end time $T$ and is summarized in pseudo-code in Framework 1.

```
Framework 1: The general framework for online label shift adaptation.
Input: A
1: f_1 = f_0
2: for t = 1, ..., T do
3:   Nature provides x_t, where (x_t, y_t) ~ Q_t and Q_t(x|y_t) = Q_0(x|y_t)
4:   Learner predicts f_t(x_t)
5:   Learner updates the classifier: f_{t+1} = A(f_0, {x_1, ..., x_t})
6: end for
```

To quantify the effectiveness of the adaptation algorithm $A$, we measure the expected regret across time steps $t = 1, \dots, T$. Formally, we consider the expected 0-1 loss

$$\ell(f; Q) = P_{(x,y) \sim Q}(f(x) \neq y), \tag{1}$$

which we average across $t$ and measure against the best-in-class predictor from $H$:

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\ell(f_t; Q_t)\right] - \inf_{f \in H} \frac{1}{T} \sum_{t=1}^{T} \ell(f; Q_t). \tag{2}$$

The goal of online label shift adaptation is to design an algorithm $A$ that minimizes the expected regret.
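To make the protocol concrete, here is a minimal Python sketch of Framework 1 (our illustration, not code from the paper; `sample_x`, `predict`, and `adapt` are hypothetical callables). It emphasizes the key constraint: the adaptation step sees only the unlabeled history.

```python
def run_online_adaptation(sample_x, predict, adapt, T):
    """Framework 1: sequential prediction with label-free adaptation.

    sample_x(t)    -> x_t     draws a feature from the drifting distribution Q_t;
    predict(x)     -> y_hat   is the current classifier f_t;
    adapt(history) -> predict returns f_{t+1} = A(f_0, {x_1, ..., x_t}).
    """
    history, predictions = [], []
    for t in range(1, T + 1):
        x_t = sample_x(t)                 # nature provides x_t; the label y_t stays hidden
        predictions.append(predict(x_t))  # learner predicts f_t(x_t)
        history.append(x_t)
        predict = adapt(history)          # the update uses only unlabeled data
    return predictions
```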
## 3 Reduction to Online Learning

One of the main differences between online label shift adaptation and conventional online learning is that in the former, the learner does not receive any feedback after making a prediction. This restriction prohibits the application of classical online learning techniques that operate on a loss function observed after making a prediction at each time step. In this section, we show that in fact the expected 0-1 loss can be estimated using only the unlabeled sample $x_t$, which in turn reduces the problem of online label shift adaptation to online learning. This reduction enables a variety of solutions derived from classical techniques, which we analyze in Section 4.

**Estimating the expected 0-1 loss.** For any classifier $f$ and distribution $Q_t$, let $C_{f,Q_t} \in \mathbb{R}^{M \times M}$ denote the confusion matrix for classifying samples from $Q_t$. Formally,

$$C_{f,Q_t}[i, j] = P_{x_t \sim Q_t(\cdot \mid y_t = i)}(f(x_t) = j).$$

Then the expected 0-1 loss of $f$ under distribution $Q_t$ (cf. Equation 1) can be written in terms of the confusion matrix $C_{f,Q_t}$ and the label marginal probabilities $q_t$:

$$\ell(f; Q_t) = P_{(x_t, y_t) \sim Q_t}(f(x_t) \neq y_t) = \sum_{i=1}^{M} P_{x_t \sim Q_t(\cdot \mid y_t = i)}(f(x_t) \neq i) \, P_{Q_t}(y_t = i) = \sum_{i=1}^{M} \left(1 - P_{x_t \sim Q_t(\cdot \mid y_t = i)}(f(x_t) = i)\right) q_t[i] = \langle \mathbf{1} - \mathrm{diag}(C_{f,Q_t}),\, q_t \rangle,$$

where $\mathbf{1}$ denotes the all-ones vector and $\mathrm{diag}(C_{f,Q_t})$ is the diagonal of the confusion matrix. The difficulty in computing $\ell(f; Q_t)$ is that both $C_{f,Q_t}$ and $q_t$ depend on the unknown distribution $Q_t$. However, by the label shift assumption that the conditional distribution $Q_t(x|y)$ does not change over time, it immediately follows that $C_{f,Q_t} = C_{f,Q_0}$ for all $t$, and therefore $\ell(f; Q_t)$ is linear in $q_t$:

$$\ell(f; Q_t) = \langle \mathbf{1} - \mathrm{diag}(C_{f,Q_0}),\, q_t \rangle. \tag{3}$$

In the theorem below, we utilize this property to derive unbiased estimates of $\ell(f; Q_t)$ and its gradient $\nabla_f \ell(f; Q_t)$ under the assumption that $\mathrm{diag}(C_{f,Q_0})$ is differentiable with respect to $f$; we discuss this assumption in more detail in Section 4.1. The proof is included in the appendix and is inspired by prior work on offline label shift adaptation [22].

**Assumption 1.** $\mathrm{diag}(C_{f,Q_0})$ is differentiable with respect to $f$.

**Theorem 1.** Let $f$ be any classifier and let $f_0$ be the classifier trained on data from $Q_0$. Suppose that $f_0$ predicts $f_0(x_t) = i$ on input $x_t \sim Q_t$, and let $e_i$ denote the one-hot vector whose non-zero entry is $i$. If the confusion matrix $C_{f_0,Q_0}$ is invertible, then $\hat{q}_t = C_{f_0,Q_0}^{-\top} e_i$ is an unbiased estimator of the label marginal probability vector $q_t$. Further, under Assumption 1, we obtain unbiased estimators of the loss and gradient of $f$ on $Q_t$:

$$\ell(f; Q_t) = \mathbb{E}_{Q_t}\left[\langle \mathbf{1} - \mathrm{diag}(C_{f,Q_0}),\, \hat{q}_t \rangle\right], \qquad \nabla_f \ell(f; Q_t) = \mathbb{E}_{Q_t}\left[J_f \, \hat{q}_t\right],$$

where $J_f = \nabla_f \left[\mathbf{1} - \mathrm{diag}(C_{f,Q_0})\right]$ denotes the Jacobian of $\mathbf{1} - \mathrm{diag}(C_{f,Q_0})$ with respect to $f$.

Note that in the theorem, the true confusion matrices $C_{f,Q_0}$ and $C_{f_0,Q_0}$ require knowledge of the training distribution $Q_0$ and are generally unobservable. For the theoretical analysis in the remainder of the paper, we assume full access to the distribution $Q_0$, including both values and gradients of $C_{f,Q_0}$ for any $f$. In practice, we can estimate the value of $C_{f,Q_0}$ using a large labeled hold-out set $D_0$ drawn from $Q_0$, and the estimation error can be made arbitrarily small by increasing the size of $D_0$ [22]. We discuss practical gradient estimation in detail in Section 4.1.
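As a concrete illustration of Theorem 1, the following sketch (ours; function names are illustrative) estimates the confusion matrix from a labeled hold-out set $D_0$ and converts a single prediction of $f_0$ into the unbiased marginal estimate $\hat{q}_t$. With the row convention $C[i,j] = P(f_0(x) = j \mid y = i)$ used above, the estimator solves $C^{\top} \hat{q}_t = e_{f_0(x_t)}$.

```python
import numpy as np

def confusion_matrix(preds, labels, M):
    """Estimate C[i, j] = P(f0(x) = j | y = i) from hold-out predictions and labels.
    Assumes every class appears at least once in the hold-out set."""
    C = np.zeros((M, M))
    for i in range(M):
        mask = labels == i
        for j in range(M):
            C[i, j] = np.mean(preds[mask] == j)
    return C

def q_hat(C, pred_label):
    """Unbiased single-sample estimate of q_t (Theorem 1): solves C^T q = e_{f0(x_t)}."""
    e = np.zeros(C.shape[0])
    e[pred_label] = 1.0
    return np.linalg.solve(C.T, e)   # = C^{-T} e_i; requires C to be invertible
```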
**Re-weighting algorithms for online label shift adaptation.** The result of Theorem 1 allows us to approximate the expected 0-1 loss and its gradient for any function $f$ over $Q_t$, which naturally suggests the following adaptation strategy: choose a hypothesis space $G$, and at each time step $t$, estimate the expected 0-1 loss and/or its gradient and apply any online learning algorithm to select the classifier $f_{t+1} \in G$ for time step $t + 1$. A natural choice for $G$ is the hypothesis space of the original classifier $f_0$, i.e., $G = H$. Indeed, the unbiased estimator for the gradient $\nabla_f \ell(f; Q_t)$ in Theorem 1 can be used to directly update $f_0$ with stochastic online gradient descent.

However, as we assume that only the marginal distribution $Q_t(y)$ drifts and $Q_t(x|y) = Q_0(x|y)$, the conditional distribution $Q_t(y|x)$ must be a re-weighted version of $Q_0(y|x)$. Specifically,

$$Q_t(y|x) = \frac{Q_t(y)}{Q_t(x)} Q_t(x|y) = \frac{Q_t(y)}{Q_t(x)} Q_0(x|y) = \frac{Q_t(y)}{Q_t(x)} \cdot \frac{Q_0(x)}{Q_0(y)} Q_0(y|x) \propto \frac{Q_t(y)}{Q_0(y)} Q_0(y|x). \tag{4}$$

Since the goal of $f_t$ is to approximate this conditional distribution, in principle only a re-weighting of the predicted probabilities is needed to correct for any label distribution shift. A similar insight has previously been exploited to correct for training-time label imbalance [6, 8, 16, 39] and for offline label shift adaptation [4, 22]. We therefore focus our attention on the hypothesis space of re-weighted classifiers $G(f_0, q_0) = \{g(x; f_0, q_0, p) \mid p \in \Delta^{M-1}\}$, where the classifier $g(x; f_0, q_0, p)$ is defined by the learned parameter vector $p$ and takes the following form:

$$g(x; f_0, q_0, p) = \arg\max_{y \in Y} \frac{1}{Z(x)} \frac{p[y]}{q_0[y]} P_{f_0}(x)[y], \tag{5}$$

with $Z(x) = \sum_{y \in Y} \frac{p[y]}{q_0[y]} P_{f_0}(x)[y]$ being the normalization factor. We specialize the online label shift adaptation framework to the hypothesis space $G(f_0, q_0)$ in Framework 2, where the adaptation algorithm $A$ generates a re-weighting vector $p_t$ at each time step to construct the classifier $g(x; f_0, q_0, p_t)$.

```
Framework 2: The re-weighting framework for online label shift adaptation.
Input: A, f_0, q_0, D_0
1: f_1 = f_0
2: for t = 1, ..., T do
3:   Nature provides x_t, where (x_t, y_t) ~ Q_t and Q_t(x|y_t) = Q_0(x|y_t)
4:   Learner predicts f_t(x_t)
5:   Learner updates the re-weighting vector: p_{t+1} = A(f_0, q_0, D_0, {x_1, ..., x_t})
6:   Learner updates the classifier: f_{t+1} = g(·; f_0, q_0, p_{t+1})
7: end for
```

The online learning objective can then be re-framed as choosing the re-weighting vector $p_t$ that minimizes

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\ell(p_t; q_t)\right] - \inf_{p \in \Delta^{M-1}} \frac{1}{T} \sum_{t=1}^{T} \ell(p; q_t), \tag{6}$$

where $\ell(p_t; q_t) := \ell(g(\cdot; f_0, q_0, p_t); Q_t)$. For the remainder of this paper, we analyze several classical online learning techniques under this framework.

## 4 Online Adaptation Algorithms

We now describe our main algorithms for online label shift adaptation. In particular, we present and analyze two online learning techniques: Online Gradient Descent (OGD) and Follow The History (FTH), the latter of which closely resembles Follow The Leader [32]. We show that under mild and empirically verifiable assumptions, OGD and FTH both achieve $O(1/\sqrt{T})$ regret compared to the optimal fixed classifier in $G(f_0, q_0)$.

### 4.1 Algorithm 1: Online Gradient Descent

Online gradient descent (OGD) [32] is a classical online learning algorithm based on iteratively updating the hypothesis by following the gradient of the loss. Applied to our setting, the algorithm $A_{\text{ogd}}$ computes the stochastic gradient $\nabla_p \ell(p; \hat{q}_t)\big|_{p=p_t} = J_p(p_t) \, \hat{q}_t$ using Theorem 1, where

$$J_p(p_t) = \nabla_p \left[\mathbf{1} - \mathrm{diag}\left(C_{g(\cdot;\, f_0, q_0, p),\, Q_0}\right)\right]\Big|_{p=p_t} \tag{7}$$

denotes the $M \times M$ Jacobian with respect to $p$. OGD then applies the following update:

$$p_{t+1} = \mathrm{Proj}_{\Delta^{M-1}}\left(p_t - \eta \, \nabla_p \ell(p; \hat{q}_t)\big|_{p=p_t}\right), \tag{8}$$

where $\eta > 0$ is the learning rate and $\mathrm{Proj}_{\Delta^{M-1}}$ projects the updated vector onto $\Delta^{M-1}$.
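A minimal sketch of the update in Equation 8 (our implementation; the paper does not prescribe a particular projection routine, so we use the standard Euclidean projection onto the simplex). Here `grad_hat` stands for the stochastic gradient $J_p(p_t)\,\hat{q}_t$, obtained via either of the two gradient estimators described below.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]                      # coordinates in decreasing order
    cumsum = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - cumsum) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - cumsum[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def reweighted_predict(probs_f0, q0, p):
    """g(x; f0, q0, p) from Eq. 5: re-weight P_{f0}(x) by p[y]/q0[y] and take the argmax.
    The normalizer Z(x) is positive, so it does not affect the argmax."""
    return int(np.argmax(probs_f0 * p / q0))

def ogd_step(p_t, grad_hat, eta):
    """One projected online gradient step (Eq. 8)."""
    return project_to_simplex(p_t - eta * grad_hat)
```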
The convergence rate of $A_{\text{ogd}}$ depends on properties of the loss function $\ell(p; q)$. We empirically observe that $\ell(p; q)$ is approximately convex in the re-weighting parameter vector $p$, which we justify in detail in the appendix. To derive meaningful regret bounds for OGD, we further assume that the loss function $\ell(p; q)$ is Lipschitz with respect to $p$. Formally:

**Assumption 2 (Convexity).** For all $q \in \Delta^{M-1}$, $\ell(p; q)$ is convex in $p$.

**Assumption 3 (Lipschitz-ness).** $\sup_{p \in \Delta^{M-1},\, i = 1, \dots, M} \left\| \nabla_p \ell\left(p;\, C_{f_0,Q_0}^{-\top} e_i\right) \right\|_2$ is finite.

Below, we provide our regret bound for OGD, which guarantees an $O(1/\sqrt{T})$ convergence rate. The proof follows the classical proof for online gradient descent and is given in the appendix.

**Theorem 2 (Regret bound for OGD).** Under Assumptions 1, 2 and 3, let $L = \sup_{p \in \Delta^{M-1},\, i = 1, \dots, M} \left\| \nabla_p \ell\left(p;\, C_{f_0,Q_0}^{-\top} e_i\right) \right\|_2$. If $\eta = \frac{1}{L}\sqrt{\frac{2}{T}}$, then $A_{\text{ogd}}$ satisfies

$$\mathbb{E}_{(x_t, y_t) \sim Q_t}\left[\frac{1}{T} \sum_{t=1}^{T} \ell(p_t; q_t)\right] - \inf_{p \in \Delta^{M-1}} \frac{1}{T} \sum_{t=1}^{T} \ell(p; q_t) \;\leq\; L\sqrt{\frac{2}{T}}.$$

**Gradient estimation.** Computing the unbiased gradient estimator $\nabla_p \ell(p; \hat{q}_t)\big|_{p=p_t}$ involves the Jacobian term $J_p(p_t)$ in Equation 7, which is discontinuous when estimated using the hold-out set $D_0$. More precisely, each entry of $\mathbf{1} - \mathrm{diag}\left(C_{g(\cdot;\, f_0, q_0, p),\, Q_0}\right)$ is the expected 0-1 loss for a particular class, whose estimate using $D_0$ is a step function in $p$. This means that taking the derivative naively results in a gradient value of 0. To circumvent this issue, we apply finite-difference approximation [1] to compute $\nabla_p \ell(p; \hat{q}_t)\big|_{p=p_t}$, as detailed in Algorithm 3. We also apply smoothing, computing the average estimated gradient around the target point $p_t$, to improve gradient stability.

```
Algorithm 3: Gradient estimator for ∇_p ℓ(p; q)
Input: p, q, δ, k
1: for i = 1, ..., M do
2:   Δ_i^j := ℓ(p + jδ e_i; q) − ℓ(p − jδ e_i; q) for j = 1, ..., k
3:   ∇̂_p[i] := Σ_{j=1}^k α_j Δ_i^j / (2δj), where α_j = 2 (−1)^{j+1} binom(k, k−j) / binom(k+j, k)
4: end for
5: Return: ∇̂_p
```

Alternatively, we can minimize a smooth surrogate of the 0-1 loss that enables direct gradient computation. In detail, we define $\ell_{\text{prob}}(f; Q) := \mathbb{E}_{(x,y) \sim Q}\left[1 - P_f(x)[y]\right] \in [0, 1]$, so that $\ell_{\text{prob}} = \ell$ when $P_f$ outputs one-hot probability vectors. Furthermore, we show that $\ell_{\text{prob}}$ enjoys the same unbiased estimation properties as $\ell$ in Theorem 1, admits smooth gradient estimates using a finite hold-out set $D_0$, and is classification-calibrated in the sense of Tewari and Bartlett [38]. The formal statement and proof of the above properties are given in the appendix. In Section 5 we empirically evaluate OGD using both the finite-difference approach and the surrogate-loss approach for gradient estimation.

### 4.2 Algorithm 2: Follow The History

Next, we describe Follow The History (FTH), a minor variant of the prominent online learning strategy known as Follow The Leader (FTL) [32]. The basic intuition behind FTL is that the predictor $f_t$ for time step $t$ is the one that minimizes the average loss over the previous $t - 1$ time steps. Formally:

$$p_{t+1} = \arg\min_{p \in \Delta^{M-1}} \frac{1}{t} \sum_{\tau=1}^{t} \ell(p; \hat{q}_\tau) = \arg\min_{p \in \Delta^{M-1}} \ell\left(p;\, \frac{1}{t} \sum_{\tau=1}^{t} \hat{q}_\tau\right), \tag{9}$$

where the second equality holds by the linearity of $\ell$ in $q$. However, faithfully executing FTL requires optimizing the loss in Equation 9 at each time step, which can be very inefficient, since multiple gradients of $\ell$ need to be computed, as opposed to a single gradient computation for OGD.

To address this efficiency concern, observe that if the original classifier $f_0$ is Bayes optimal, then for any $q_t \in \Delta^{M-1}$, the minimizer over the re-weighting vector $p$ of $\ell(p; q_t)$ is the test-time label marginal probability vector $q_t$ itself (cf. Equation 4). In fact, we show in the appendix that this assumption often holds approximately in practice, especially when $f_0$ achieves a low error on $Q_0$ and is well-calibrated [12, 26, 41]. Assuming this approximation error is bounded by some $\delta \geq 0$, we can derive a more efficient update rule and a corresponding regret bound. Formally:

**Assumption 4 (Symmetric optimality).** For any $q \in \Delta^{M-1}$, $\left\| q - \arg\min_{p \in \Delta^{M-1}} \ell(p; q) \right\|_2 \leq \delta$.

We define the more efficient Follow The History update rule $A_{\text{fth}}$ as

$$p_{t+1} = \frac{1}{t} \sum_{\tau=1}^{t} \hat{q}_\tau,$$

which is a simple average of the estimates $\hat{q}_\tau$ from all previous iterations of the algorithm. FTH coincides with FTL when Assumption 4 holds with $\delta = 0$.
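The FTH update is just a running average of the per-step estimates $\hat{q}_\tau$, which the following sketch (ours) maintains incrementally in $O(M)$ time per step:

```python
import numpy as np

class FollowTheHistory:
    """FTH: p_{t+1} = (1/t) * sum_{tau <= t} q_hat_tau, kept as a running mean."""
    def __init__(self, q0):
        # p_1 corresponds to the training marginals q_0, so f_1 = f_0.
        self.p = np.asarray(q0, dtype=float).copy()
        self.t = 0

    def update(self, q_hat):
        self.t += 1
        self.p += (q_hat - self.p) / self.t   # incremental running mean
        return self.p                         # this is p_{t+1}
```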
In the following theorem, we derive the regret bound for FTH when $\delta = 0$; the general case of $\delta \geq 0$ is proved in the appendix. The theorem relies on a Lipschitz-ness assumption slightly different from Assumption 3, which we state formally below.

**Assumption 5 (Lipschitz-ness for FTH).** $\sup_{p, q \in \Delta^{M-1}} \left\| \nabla_p \ell(p; q) \right\|_2$ is finite.

**Theorem 3.** Under Assumptions 4 and 5 with $\delta = 0$, with probability at least $1 - 2MT^{-7}$ over samples $(x_t, y_t) \sim Q_t$ for $t = 1, \dots, T$, $A_{\text{fth}}$ satisfies

$$\frac{1}{T} \sum_{t=1}^{T} \ell(p_t; q_t) - \inf_{p \in \Delta^{M-1}} \frac{1}{T} \sum_{t=1}^{T} \ell(p; q_t) \;\leq\; cL\sqrt{\frac{2 \ln T}{T}},$$

where $c = 2 \max_{i=1,\dots,M} \left\| C_{f_0,Q_0}^{-\top} e_i \right\|_2$ and $L = \sup_{p, q \in \Delta^{M-1}} \left\| \nabla_p \ell(p; q) \right\|_2$.

## 5 Experiments

In this section, we empirically evaluate the proposed algorithms on datasets with both simulated and real-world online label shifts. Our simulated label shift experiments are performed on CIFAR-10 [20], where we vary the shift process and explore the robustness of different algorithms. For real-world label shift, we evaluate on the arXiv dataset for paper categorization, where label shift occurs naturally over years of paper submissions due to changing interest in different academic disciplines.

### 5.1 Experiment set-up

**Online algorithm set-up.** We evaluate both the OGD and FTH algorithms from Section 4. For OGD, we use the learning rate $\eta = \frac{1}{L}\sqrt{\frac{2}{T}}$ suggested by Theorem 2, where $L$ is estimated by taking the maximum of the gradient norm in Assumption 3 over $\{e_y : y \in Y\}$ and 100 vectors $p$ uniformly sampled from $\Delta^{M-1}$. The gradient estimate can be derived either with the finite-difference method in Algorithm 3 or by directly differentiating the surrogate loss $\ell_{\text{prob}}$; we evaluate both methods in our experiments.

For FTH, we evaluate both the algorithm $A_{\text{fth}}$ defined in Section 4.2 and a heuristic algorithm $A_{\text{ftfwh}}$ which we call Follow The Fixed Window History (FTFWH). Different from FTH, where $p_{t+1}$ is the simple average of $\hat{q}_\tau$ across all previous time steps $\tau = 1, \dots, t$, FTFWH averages the previous estimates $\hat{q}_\tau$ over a fixed window of size $w$:

$$p_{t+1} = \frac{1}{\min\{w, t\}} \sum_{\tau=\max\{1,\, t-w+1\}}^{t} \hat{q}_\tau.$$

Intuitively, FTFWH treats the distribution $Q_t$ as fixed for $w$ time steps and solves the offline label shift adaptation problem for the next time step. We use three different window lengths $w = 100, 1000, 10000$ in our experiments. We will show that FTFWH can be optimal at times but is inconsistent in performance, especially when the window size $w$ coincides with the periodicity of the label distribution shift.
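FTFWH replaces FTH's running mean with a sliding-window mean; a minimal sketch (ours), using a bounded deque:

```python
from collections import deque
import numpy as np

class FollowTheFixedWindowHistory:
    """FTFWH: p_{t+1} = average of q_hat over the last min(w, t) time steps."""
    def __init__(self, w):
        self.window = deque(maxlen=w)   # the deque silently drops estimates older than w steps

    def update(self, q_hat):
        self.window.append(q_hat)
        return np.mean(self.window, axis=0)   # this is p_{t+1}
```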
**Baselines.** In addition, we consider the following baseline classifiers as benchmarks for the online adaptation algorithms. *Base Classifier (BC)* refers to the classifier $f_0$ without any online adaptation, which serves as a reference point for evaluating the performance of online adaptation algorithms. *Optimal Fixed Classifier (OFC)* refers to the best-in-class classifier in $G(f_0, q_0)$, i.e., the optimum in Equation 6. We denote the re-weighting vector that achieves this optimum by $p_{\text{opt}}$. Under simulated label shift, we can define the ground-truth label marginal probability vector $q_t$ and optimize for $p_{\text{opt}}$ directly. For the experiment on arXiv, we derive the optimum using the empirical loss: $p_{\text{opt}} = \arg\min_{p \in \Delta^{M-1}} \ell\left(p;\, \frac{1}{T}\sum_{t=1}^{T} e_{y_t}\right)$, where $y_t$ is the ground-truth label at time $t$. Note that OFC is not a practical algorithm, since $y_t$ is not observed, but it can be used to benchmark different adaptation algorithms and estimate their empirical regret.

**Evaluation metric.** Computing the actual regret requires access to $q_t$ for all $t$, which we do not observe in the real-world dataset. To make all evaluations consistent, we report the average error $\frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\left(g(x_t; f_0, q_0, p_t) \neq y_t\right)$ as an approximation of $\frac{1}{T}\sum_{t=1}^{T} \ell(p_t; q_t)$. This approximation is valid for large $T$ due to its exponential concentration rate by the Azuma-Hoeffding inequality.

### 5.2 Evaluation on CIFAR-10 under simulated shift

**Dataset and model.** We conduct our simulated label shift experiments on CIFAR-10 [20] with a ResNet-18 [13] classifier. We divide the original training set into a training and a validation split at a ratio of 3:2. The training split is used to train the base model $f_0$, and the validation split $D_0$ is used both for temperature scaling calibration [12] and to estimate the confusion matrix. The original test set is used for online data sampling. Additional training details are provided in the appendix.

**Simulated shift processes.** Let $q^{(1)}, q^{(2)} \in \Delta^{M-1}$ be two fixed probability vectors. We define the following simulated shift processes for the test-time label marginal probability vector $q_t$ (a generator sketch follows below):

- *Constant shift:* $q_t = q^{(1)}$ for all $t$, which coincides with the setting of offline label shift.
- *Monotone shift:* $q_t$ interpolates linearly from $q^{(1)}$ to $q^{(2)}$, i.e., $q_t := \left(1 - \frac{t}{T}\right) q^{(1)} + \frac{t}{T} q^{(2)}$.
- *Periodic shift:* $q_t$ alternates between $q^{(1)}$ and $q^{(2)}$ at a fixed period $T_p$. We test three different periods $T_p = 100, 1000, 10000$.
- *Exponential periodic shift:* $q_t$ alternates between $q^{(1)}$ and $q^{(2)}$ with an exponentially growing period. Formally, for $t \in [k^{2i}, k^{2i+1})$, $q_t := q^{(1)}$; for $t \in [k^{2i-1}, k^{2i})$, $q_t := q^{(2)}$. We use $k = 2, 5$ in our experiments.

In our experiments, $q^{(1)}$ and $q^{(2)}$ are defined to concentrate on the dog and cat classes, respectively. That is, $q^{(1)}[\text{dog}] = 0.55$ and $q^{(1)}[y] = 0.05$ for every other class $y$, and similarly for $q^{(2)}$. The end time $T$ is set to 100,000 for all simulation experiments. All results are repeated using three different random seeds that randomize the samples drawn at each time step $t$.
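For reproducibility, here is a sketch (ours; the interval conventions for the exponential schedule follow our reading of the definitions above) that generates the sequence $q_1, \dots, q_T$ for the four shift processes:

```python
import numpy as np

def shift_schedule(q1, q2, T, kind, Tp=1000, k=2):
    """Generate q_t for t = 1, ..., T under the simulated shift processes."""
    qs = np.empty((T, len(q1)))
    for t in range(1, T + 1):
        if kind == "constant":
            q = q1
        elif kind == "monotone":        # linear interpolation from q1 to q2
            q = (1 - t / T) * q1 + (t / T) * q2
        elif kind == "periodic":        # alternate between q1 and q2 every Tp steps
            q = q1 if (t // Tp) % 2 == 0 else q2
        elif kind == "exp_periodic":    # alternate on intervals [k^i, k^{i+1});
            # floating-point rounding at exact powers of k is ignored in this sketch
            i = int(np.log(t) // np.log(k))
            q = q1 if i % 2 == 0 else q2
        else:
            raise ValueError(kind)
        qs[t - 1] = q
    return qs

# Example: q1 concentrated on "dog", q2 on "cat" (indices 5 and 3 in the
# standard CIFAR-10 class ordering), 0.55 mass on one class and 0.05 elsewhere.
q1 = np.full(10, 0.05); q1[5] = 0.55
q2 = np.full(10, 0.05); q2[3] = 0.55
qs = shift_schedule(q1, q2, T=100_000, kind="periodic", Tp=1000)
```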
**Results.** Table 1 shows the average error of various adaptation algorithms when applied to the simulated label shift processes. All adaptation algorithms outperform the base classifier $f_0$ (except FTFWH for periodic shift with $T_p = 100$), which serves as a sanity check that the algorithms are indeed adapting to the test distribution.

| Method | Constant | Monotone | Periodic $T_p=100$ | Periodic $T_p=1000$ | Periodic $T_p=10000$ | Exp. Periodic $k=2$ | Exp. Periodic $k=5$ |
|---|---|---|---|---|---|---|---|
| Base Classifier ($f_0$) | 12.43 ± 0.04 | 11.63 ± 0.08 | 11.63 ± 0.09 | 11.62 ± 0.08 | 11.63 ± 0.10 | 11.67 ± 0.07 | 11.81 ± 0.11 |
| Opt. Fixed Classifier | 7.78 ± 0.10 | 10.24 ± 0.08 | 10.25 ± 0.08 | 10.24 ± 0.08 | 10.24 ± 0.08 | 10.27 ± 0.07 | 10.25 ± 0.09 |
| FTH | 7.68 ± 0.11 | 10.36 ± 0.06 | 10.27 ± 0.08 | 10.25 ± 0.06 | 10.33 ± 0.10 | 10.23 ± 0.06 | 10.25 ± 0.04 |
| FTFWH $w=10^2$ | *8.71 ± 0.10* | 10.19 ± 0.05 | *12.15 ± 0.01* | **8.85 ± 0.04** | **8.46 ± 0.11** | **8.52 ± 0.06** | **8.54 ± 0.05** |
| FTFWH $w=10^3$ | 7.74 ± 0.07 | **9.52 ± 0.10** | 10.25 ± 0.07 | *11.16 ± 0.12* | **7.84 ± 0.09** | **7.77 ± 0.06** | **7.67 ± 0.08** |
| FTFWH $w=10^4$ | 7.67 ± 0.08 | **9.53 ± 0.10** | 10.26 ± 0.07 | 10.26 ± 0.06 | *10.83 ± 0.07* | **8.93 ± 0.07** | **8.46 ± 0.07** |
| OGD (finite diff.) | 8.08 ± 0.08 | 9.79 ± 0.09 | 10.71 ± 0.10 | 10.62 ± 0.09 | 10.11 ± 0.06 | **8.99 ± 0.05** | **8.56 ± 0.12** |
| OGD (surrogate loss) | 7.78 ± 0.11 | 9.75 ± 0.07 | 10.24 ± 0.06 | 10.21 ± 0.07 | 10.05 ± 0.09 | **8.92 ± 0.04** | **8.50 ± 0.11** |

Table 1: Average error (%) for different adaptation algorithms under simulated label shift on CIFAR-10. Standard deviation is computed across three runs. Results that are better than the OFC benchmark by 0.5 or more are shown in bold, and results that are worse than the OFC benchmark by 0.5 or more are shown in italics.

Comparison with the optimal fixed classifier (OFC) reveals more insightful characteristics of the different algorithms, which we discuss separately as follows.

The performance of **Follow The History (FTH)** is guaranteed by Theorem 3 to be competitive with OFC when the base model $f_0$ is well-calibrated. Indeed, as shown in the table, the average error of FTH is close to that of OFC for all shift processes. However, it is also very conservative and never beats OFC by a margin larger than 0.5.

**Follow The Fixed Window History (FTFWH)** with a suitably chosen window size performs very well empirically, especially for constant shift, monotone shift, and exponential periodic shift. However, when encountering periodic shift with periodicity $T_p$ equal to the window size $w$ (shown in italics in Table 1), FTFWH is consistently subpar compared to OFC, and sometimes even worse than the base classifier $f_0$. Since real-world distribution shifts are often periodic in nature (e.g., seasonal trends for flu and hay fever), this result suggests that deploying FTFWH may require knowledge of the periodicity in advance, which may not be feasible. In fact, we show in our experiment on real-world label shift on arXiv that FTFWH is never better than FTH and OGD.

**Online gradient descent (OGD)** with learning rate $\eta = \frac{1}{L}\sqrt{\frac{2}{T}}$ is likewise guaranteed by Theorem 2 to be as good as OFC, which we observe empirically as well. Moreover, unlike FTH, which only achieves an average error no worse than that of OFC, OGD is able to outperform OFC in certain scenarios, such as monotone shift, periodic shift with $T_p = 10000$, and exponential periodic shift with $k = 2, 5$. We also observe that OGD using the surrogate loss for gradient estimation is consistently better than OGD with finite-difference gradient estimation. Overall, OGD is the most reliable adaptation algorithm in the above simulations, as it is uniformly as good as OFC and sometimes achieves even better results.

### 5.3 Evaluation on arXiv under real-world distribution shift

**Dataset and model.** We experiment on the arXiv dataset for categorization of papers from the Computer Science domain into 23 refined categories (the domain actually contains 40 categories, from which we select the 23 most populated). There are a total of 233,748 papers spanning the years 1991 to 2020, which we sort by submission time and divide at a ratio of 2:1:1 into training, validation, and test sets. For each paper, we compute the tf-idf vector of its abstract as the feature $x$, and we use the first category in its primary category set as the true label $y$. The base model $f_0$ is an L2-regularized multinomial regressor. As in the simulated shift experiments, the validation set $D_0$ is used to calibrate the base model $f_0$ and estimate the confusion matrix. Additional details on data processing and training are given in the appendix.

**Results.** Table 2 shows the average error of each adaptation algorithm over the test set, which consists of papers sorted by submission time with end time $T = 58{,}437$.

| Method | Base ($f_0$) | Opt. Fixed | FTH | FTFWH $w=10^2$ | FTFWH $w=10^3$ | FTFWH $w=10^4$ | OGD (finite diff.) | OGD (surr. loss) |
|---|---|---|---|---|---|---|---|---|
| Avg. Error (%) | 27.21 | 25.56 | 25.62 | 30.14 | 26.09 | 25.87 | 25.52 | 25.70 |

Table 2: Average test error (%) on the arXiv dataset.

In contrast to the simulated shift experiments in Section 5.2, FTFWH is consistently worse than the optimal fixed classifier (OFC), especially for window size $w = 100$, where it is even worse than the base classifier $f_0$. This result shows that despite the good performance of FTFWH for simulated label shifts on CIFAR-10, it encounters significant challenges when deployed in the real world. On the other hand, FTH and OGD both achieve an average error close to that of OFC, which again validates the theoretical regret bounds in Theorems 2 and 3.
Given the empirical performance of OGD on both simulated and real-world label shift, as well as its conceptual simplicity and ease of implementation, we recommend it as a practical go-to solution for online label shift adaptation in real-world settings.

## 6 Related Work

Label shift adaptation has received much attention lately. The seminal work by Saerens et al. [30] made the critical observation that the label shift condition of $p(x|y)$ being stationary implies the optimality of re-weighted classifiers. They defined an Expectation-Maximization (EM) procedure that utilizes this insight to learn the re-weighting vector maximizing likelihood on the test distribution. Alexandari et al. [2] studied this approach further and discovered that calibrating the model can lead to a significant improvement in performance. Another prominent strategy for learning the re-weighting vector is inverting the confusion matrix [22], which inspired our approach for the online setting. Extensions of this method include regularizing the re-weighting vector [4], generalizing the label shift condition to feature space rather than input space [37], and unifying the confusion matrix approach with the EM approach [10].

Another type of test-time distribution shift that has been widely studied is covariate shift. In contrast to the label shift assumption that $p(x|y)$ is constant, covariate shift assumes that $p(y|x)$ is constant and $p(x)$ changes between the training and test distributions. Earlier work by Lin et al. [21] relied on the assumption that the density function $p(x)$ is known, while subsequent work relaxed this assumption by estimating the density from data [11, 17, 34, 40]. Online extensions of the covariate shift problem have also been considered. Solutions to this problem either rely on knowledge of the test labels [18] or use unsupervised test-time adaptation methods tailored to visual applications [14, 24, 36].

More broadly, both label shift adaptation and covariate shift adaptation can be categorized under the general problem of domain adaptation. This problem is much more challenging because a test sample may not necessarily belong to the support of the training distribution. In this setting, there is a large body of theoretical work proving performance guarantees under bounded divergence between the training and test distributions [5, 15, 23, 33], as well as empirical methods for domain adaptation in the realm of deep learning [9, 19, 29, 35].

## 7 Conclusion

We presented a rigorous framework for studying online label shift adaptation, which addresses limitations in prior work on offline label shift. Under our framework, we showed that it is possible to obtain unbiased estimates of the expected 0-1 loss and its gradient without observing a single label at test time. This reduction enables the application of classical techniques from online learning to define practical adaptation algorithms. We showed that these algorithms admit rigorous theoretical guarantees on performance, while at the same time performing very well empirically and adapting to a variety of challenging label shift scenarios. One potential direction for future work is to relax Assumption 2 to a weak convexity assumption. With this relaxation, one would need to take a closer look at online learning techniques for weakly convex losses and apply them to the online label shift problem through the reduction introduced in this paper, possibly with additional reductions.
Another extension follows from the fact that, while we focused on re-weighting algorithms for online label shift adaptation in this paper, the reduction presented in Section 3 technically enables a larger set of solutions. Indeed, one possible future direction is to directly apply OGD to update the base classifier $f_0$, which may further improve the adaptation algorithm's performance.

## Acknowledgments and Disclosure of Funding

RW and KQW are supported by grants from the National Science Foundation (III-1618134, III-1526012, IIS-1149882, and IIS-1724282), the Bill and Melinda Gates Foundation, the Cornell Center for Materials Research with funding from the NSF MRSEC program (DMR-1719875), and SAP America.

## References

[1] Karsten Ahnert and Markus Abel. Numerical differentiation of experimental data: local versus global methods. Computer Physics Communications, 177(10):764-774, 2007.

[2] Amr Alexandari, Anshul Kundaje, and Avanti Shrikumar. Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning, pages 222-232. PMLR, 2020.

[3] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[4] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations, 2019.

[5] Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, volume 19, page 137, 2007.

[6] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[7] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

[8] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268-9277, 2019.

[9] Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and Nicolas Courty. DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 447-463, 2018.

[10] Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary C. Lipton. A unified view of label shift estimation. arXiv preprint arXiv:2003.07554, 2020.

[11] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.

[12] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321-1330. PMLR, 2017.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[14] Judy Hoffman, Trevor Darrell, and Kate Saenko. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 867-874, 2014.
[15] Judy Hoffman, Mehryar Mohri, and Ningshan Zhang. Algorithms and theory for multiple-source adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8256-8266, 2018.

[16] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375-5384, 2016.

[17] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, volume 19, pages 601-608, 2006.

[18] Vidit Jain and Erik Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In CVPR 2011, pages 577-584. IEEE, 2011.

[19] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G. Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4893-4902, 2019.

[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[21] Yi Lin, Yoonkyung Lee, and Grace Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1):191-202, 2002.

[22] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122-3130. PMLR, 2018.

[23] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In 22nd Conference on Learning Theory, COLT 2009, 2009.

[24] Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. Online model distillation for efficient video inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3573-3582, 2019.

[25] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[26] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625-632, 2005.

[27] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[28] Joaquin Quiñonero-Candela, Masashi Sugiyama, Neil D. Lawrence, and Anton Schwaighofer. Dataset Shift in Machine Learning. MIT Press, 2009.

[29] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):801-814, 2019.

[30] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21-41, 2002.

[31] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris M. Mooij. On causal and anticausal learning. In ICML, 2012.

[32] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

[33] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[34] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
[35] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2058-2065, 2016.

[36] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229-9248. PMLR, 2020.

[37] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J. Gordon. Domain adaptation with conditional distribution matching and generalized label shift. In Advances in Neural Information Processing Systems, volume 33, 2020.

[38] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(5), 2007.

[39] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 7032-7042, 2017.

[40] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, page 114, 2004.

[41] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In International Conference on Machine Learning, volume 1, pages 609-616, 2001.

[42] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819-827. PMLR, 2013.