# Online Adaptation to Label Distribution Shift

Ruihan Wu (Cornell University, rw565@cornell.edu) · Chuan Guo (Facebook AI Research, chuanguo@fb.com) · Yi Su, Kilian Q. Weinberger (Cornell University, {ys756, kqw4}@cornell.edu)

**Abstract.** Machine learning models often encounter distribution shifts when deployed in the real world. In this paper, we focus on adaptation to label distribution shift in the online setting, where the test-time label distribution is continually changing and the model must dynamically adapt to it without observing the true labels. Leveraging a novel analysis, we show that the lack of true labels does not hinder estimation of the expected test loss, which enables the reduction of online label shift adaptation to conventional online learning. Informed by this observation, we propose adaptation algorithms inspired by classical online learning techniques such as Follow The Leader (FTL) and Online Gradient Descent (OGD) and derive their regret bounds. We empirically verify our findings under both simulated and real-world label distribution shifts and show that OGD is particularly effective and robust to a variety of challenging label shift scenarios.

## 1 Introduction

A common assumption in machine learning is that the training set and test set are drawn from the same distribution [25]. However, this assumption often does not hold in practice when models are deployed in the real world [3, 28]. One common type of distribution shift is *label shift*, where the conditional distribution $p(x|y)$ is fixed but the label distribution $p(y)$ changes over time. This phenomenon is most typical when the label $y$ is the causal variable and the feature $x$ is the observation [31]. For instance, a model trained to diagnose malaria can encounter a much higher prevalence of the disease in tropical regions.

Prior work has primarily studied the problem of label shift in the *offline* setting [2, 4, 22, 42], where the shift occurs only once after the model is trained. However, in many common scenarios, the label distribution can change continually over time. For example, the prevalence of a disease such as influenza changes depending on the season and whether an outbreak occurs. The distribution of news article categories is influenced by real-world events such as political tension and the economy. In these scenarios, modeling the test-time label distribution as stationary can be inaccurate and over-simplistic, especially if the model is deployed over a long period of time.

To address this shortcoming, we define and study the problem of *online label shift*, where the distribution shift is modeled as an online process. An adaptation algorithm in this setting is tasked with making sequential predictions on random samples from a drifting test distribution and dynamically adjusting the model's predictions in real time. Different from online learning, the test label is not observed after making a prediction, which makes the problem much more challenging. Nevertheless, we show that it is possible to adapt to the drifting test distribution in an online fashion, despite never observing a single label. In detail, we describe a method for obtaining unbiased estimates of the expected 0-1 loss and its gradient using only unlabeled samples.
This allows the reduction of online label shift adaptation to conventional online learning, which we then utilize to define two algorithms inspired by the classical techniques Online Gradient Descent (OGD) and Follow The History (FTH), the latter being a close relative of the Follow The Leader algorithm. Under mild and empirically verifiable assumptions, we prove that OGD and FTH are as optimal as a fixed classifier that had knowledge of the shifting test distribution in advance.

To validate our theoretical findings, we evaluate our adaptation algorithms on CIFAR-10 [20] under simulated online label shifts, as well as on the arXiv dataset (https://www.kaggle.com/Cornell-University/arxiv) for paper categorization, which exhibits real-world label shift across years of submission history. We find that OGD and FTH are able to consistently match or outperform the optimal fixed classifier, corroborating our theoretical results. Furthermore, OGD empirically achieves the best classification accuracy on average against a diverse set of challenging label shift scenarios, and can be easily adopted in practical settings.

## 2 Problem Setting

We first introduce the necessary notation and the general problem of label shift adaptation. Consider a classification problem with feature domain $X$ and label space $Y = \{1, \dots, M\}$. We assume that a classifier $f : X \to Y$ operates by predicting a probability vector $P_f(x) \in \Delta^{M-1}$, where $\Delta^{M-1}$ denotes the $(M-1)$-dimensional probability simplex. The corresponding classification function is $f(x) = \arg\max_{y \in Y} P_f(x)[y]$. While the focus of our paper is on neural networks, other models such as SVMs [7] and decision trees [27] that do not explicitly output probabilities can be calibrated to do so as well [26, 41].

**Label distribution shift.** Let $Q$ be a data distribution defined on $X \times Y$, and denote by $Q(x|y)$ the class-conditional distribution and by $Q(y)$ the marginal label distribution, so that $Q(x, y) = Q(x|y)Q(y)$. Standard supervised learning assumes that the training distribution $Q_{\text{train}}$ and the test distribution $Q_{\text{test}}$ are identical. In reality, a deployed model often encounters distribution shift, where the test distribution $Q_{\text{test}}$ may be substantially different from $Q_{\text{train}}$. We are interested in the scenario where the class-conditional distribution remains constant (i.e., $Q_{\text{test}}(x|y) = Q_{\text{train}}(x|y)$ for all $x \in X, y \in Y$) while the marginal label distribution changes (i.e., $Q_{\text{test}}(y) \neq Q_{\text{train}}(y)$ for some $y \in Y$), a setting we refer to as *label shift*. The problem of *label shift adaptation* is to design algorithms that can adjust the predictions of a classifier $f$ trained on $Q_{\text{train}}$ to perform well on $Q_{\text{test}}$.

**Offline label shift.** The general setting in this paper and in prior work on label shift adaptation is that the label marginal distribution $Q_{\text{test}}(y)$ is unknown. Indeed, the model may be deployed in a foreign environment where prior knowledge about the label marginal distribution is limited or inaccurate. Prior work tackles this challenge by estimating $Q_{\text{test}}(y)$ using unlabeled samples drawn from $Q_{\text{test}}$ [2, 4, 10, 22, 30].
Such adaptation methods operate under the *offline* setting, since the distribution shift occurs only once and the adaptation algorithm is provided with an unlabeled set of samples from the test distribution $Q_{\text{test}}$ upfront.

**Online label shift.** In certain scenarios, offline label shift adaptation may not be applicable. For example, consider a medical diagnosis model classifying whether a patient suffers from flu or hay fever. Although the two diseases share similar symptoms, throughout the year one is far more prevalent than the other, depending on the season and whether an outbreak occurs. Importantly, this label distribution shift is gradual and continual, and the model must make predictions on incoming patients in real time without observing an unlabeled batch of examples first. Motivated by these challenging use cases, we deviate from prior work and study the problem where $Q_{\text{test}}$ is not stationary during test time and must be adapted to in an online fashion.

Formally, we consider a discrete time scale with a shifting label distribution $Q_t(y)$ for $t = 0, 1, 2, \dots$, where $Q_0 = Q_{\text{train}}$ denotes the training distribution. For each $t$, denote by $q_t \in \Delta^{M-1}$ the label marginal probability vector, so that $q_t[i] = P_{Q_t}(y_t = i)$ for all $i = 1, \dots, M$. Let $H$ be a fixed hypothesis space, let $f_0 \in H$ be a classifier trained on samples from $Q_0$, and let $f_1 = f_0$. At time step $t$, nature draws $(x_t, y_t) \sim Q_t$ and the learner predicts the label of $x_t$ using $f_t$. Similar to prior work on offline label shift, we impose a crucial but realistic restriction: the true label $y_t$ and any incurred loss are both unobserved. This restriction mimics real-world situations such as medical diagnosis, where the model may not receive any feedback while deployed. Despite this limitation, the learner may still seek to adapt the classifier to the test-time distribution after predicting on $x_t$. Namely, the adaptation algorithm $A$ takes as input $f_0$ and the unlabeled set of historical data $\{x_1, \dots, x_t\}$ and outputs a new classifier $f_{t+1} \in H$ for time step $t + 1$. This process is repeated until some end time $T$ and is summarized in pseudo-code in Framework 1.

```
Framework 1: The general framework for online label shift adaptation.
Input: A
1: f_1 = f_0
2: for t = 1, ..., T do
3:   Nature provides x_t, where (x_t, y_t) ~ Q_t and Q_t(x|y_t) = Q_0(x|y_t)
4:   Learner predicts f_t(x_t)
5:   Learner updates the classifier: f_{t+1} = A(f_0, {x_1, ..., x_t})
6: end for
```

To quantify the effectiveness of the adaptation algorithm $A$, we measure the expected regret across time steps $t = 1, \dots, T$. Formally, we consider the expected 0-1 loss

$$\ell(f; Q) = P_{(x,y) \sim Q}(f(x) \neq y), \tag{1}$$

which we average across $t$ and measure against the best-in-class predictor from $H$:

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\ell(f_t; Q_t)\right] - \inf_{f \in H} \frac{1}{T} \sum_{t=1}^{T} \ell(f; Q_t). \tag{2}$$

The goal of online label shift adaptation is to design an algorithm $A$ that minimizes the expected regret.
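To make the protocol concrete, here is a minimal Python sketch of Framework 1 (our illustration, not code from the paper; `sample_x`, `predict`, and `adapt` are hypothetical callables). It emphasizes the key constraint: the adaptation step sees only the unlabeled history.

```python
def run_online_adaptation(sample_x, predict, adapt, T):
    """Framework 1: sequential prediction with label-free adaptation.

    sample_x(t)    -> x_t     draws a feature from the drifting distribution Q_t;
    predict(x)     -> y_hat   is the current classifier f_t;
    adapt(history) -> predict returns f_{t+1} = A(f_0, {x_1, ..., x_t}).
    """
    history, predictions = [], []
    for t in range(1, T + 1):
        x_t = sample_x(t)                 # nature provides x_t; the label y_t stays hidden
        predictions.append(predict(x_t))  # learner predicts f_t(x_t)
        history.append(x_t)
        predict = adapt(history)          # the update uses only unlabeled data
    return predictions
```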
## 3 Reduction to Online Learning

One of the main differences between online label shift adaptation and conventional online learning is that in the former, the learner does not receive any feedback after making a prediction. This restriction prohibits the application of classical online learning techniques that operate on a loss function observed after making a prediction at each time step. In this section, we show that in fact the expected 0-1 loss can be estimated using only the unlabeled sample $x_t$, which in turn reduces the problem of online label shift adaptation to online learning. This reduction enables a variety of solutions derived from classical techniques, which we analyze in Section 4.

**Estimating the expected 0-1 loss.** For any classifier $f$ and distribution $Q_t$, let $C_{f,Q_t} \in \mathbb{R}^{M \times M}$ denote the confusion matrix for classifying samples from $Q_t$. Formally,

$$C_{f,Q_t}[i, j] = P_{x_t \sim Q_t(\cdot \mid y_t = i)}(f(x_t) = j).$$

Then the expected 0-1 loss of $f$ under distribution $Q_t$ (cf. Equation 1) can be written in terms of the confusion matrix $C_{f,Q_t}$ and the label marginal probabilities $q_t$:

$$\ell(f; Q_t) = P_{(x_t, y_t) \sim Q_t}(f(x_t) \neq y_t) = \sum_{i=1}^{M} P_{x_t \sim Q_t(\cdot \mid y_t = i)}(f(x_t) \neq i) \, P_{Q_t}(y_t = i) = \sum_{i=1}^{M} \left(1 - P_{x_t \sim Q_t(\cdot \mid y_t = i)}(f(x_t) = i)\right) q_t[i] = \langle \mathbf{1} - \mathrm{diag}(C_{f,Q_t}),\, q_t \rangle,$$

where $\mathbf{1}$ denotes the all-ones vector and $\mathrm{diag}(C_{f,Q_t})$ is the diagonal of the confusion matrix. The difficulty in computing $\ell(f; Q_t)$ is that both $C_{f,Q_t}$ and $q_t$ depend on the unknown distribution $Q_t$. However, by the label shift assumption that the conditional distribution $Q_t(x|y)$ does not change over time, it immediately follows that $C_{f,Q_t} = C_{f,Q_0}$ for all $t$, and therefore $\ell(f; Q_t)$ is linear in $q_t$:

$$\ell(f; Q_t) = \langle \mathbf{1} - \mathrm{diag}(C_{f,Q_0}),\, q_t \rangle. \tag{3}$$

In the theorem below, we utilize this property to derive unbiased estimates of $\ell(f; Q_t)$ and its gradient $\nabla_f \ell(f; Q_t)$ under the assumption that $\mathrm{diag}(C_{f,Q_0})$ is differentiable with respect to $f$; we discuss this assumption in more detail in Section 4.1. The proof is included in the appendix and is inspired by prior work on offline label shift adaptation [22].

**Assumption 1.** $\mathrm{diag}(C_{f,Q_0})$ is differentiable with respect to $f$.

**Theorem 1.** Let $f$ be any classifier and let $f_0$ be the classifier trained on data from $Q_0$. Suppose that $f_0$ predicts $f_0(x_t) = i$ on input $x_t \sim Q_t$, and let $e_i$ denote the one-hot vector whose non-zero entry is $i$. If the confusion matrix $C_{f_0,Q_0}$ is invertible, then $\hat{q}_t = C_{f_0,Q_0}^{-\top} e_i$ is an unbiased estimator of the label marginal probability vector $q_t$. Further, under Assumption 1, we obtain unbiased estimators of the loss and gradient of $f$ on $Q_t$:

$$\ell(f; Q_t) = \mathbb{E}_{Q_t}\left[\langle \mathbf{1} - \mathrm{diag}(C_{f,Q_0}),\, \hat{q}_t \rangle\right], \qquad \nabla_f \ell(f; Q_t) = \mathbb{E}_{Q_t}\left[J_f \, \hat{q}_t\right],$$

where $J_f = \nabla_f \left[\mathbf{1} - \mathrm{diag}(C_{f,Q_0})\right]$ denotes the Jacobian of $\mathbf{1} - \mathrm{diag}(C_{f,Q_0})$ with respect to $f$.

Note that in the theorem, the true confusion matrices $C_{f,Q_0}$ and $C_{f_0,Q_0}$ require knowledge of the training distribution $Q_0$ and are generally unobservable. For the theoretical analysis in the remainder of the paper, we assume full access to the distribution $Q_0$, including both values and gradients of $C_{f,Q_0}$ for any $f$. In practice, we can estimate the value of $C_{f,Q_0}$ using a large labeled hold-out set $D_0$ drawn from $Q_0$, and the estimation error can be made arbitrarily small by increasing the size of $D_0$ [22]. We discuss practical gradient estimation in detail in Section 4.1.
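As a concrete illustration of Theorem 1, the following sketch (ours; function names are illustrative) estimates the confusion matrix from a labeled hold-out set $D_0$ and converts a single prediction of $f_0$ into the unbiased marginal estimate $\hat{q}_t$. With the row convention $C[i,j] = P(f_0(x) = j \mid y = i)$ used above, the estimator solves $C^{\top} \hat{q}_t = e_{f_0(x_t)}$.

```python
import numpy as np

def confusion_matrix(preds, labels, M):
    """Estimate C[i, j] = P(f0(x) = j | y = i) from hold-out predictions and labels.
    Assumes every class appears at least once in the hold-out set."""
    C = np.zeros((M, M))
    for i in range(M):
        mask = labels == i
        for j in range(M):
            C[i, j] = np.mean(preds[mask] == j)
    return C

def q_hat(C, pred_label):
    """Unbiased single-sample estimate of q_t (Theorem 1): solves C^T q = e_{f0(x_t)}."""
    e = np.zeros(C.shape[0])
    e[pred_label] = 1.0
    return np.linalg.solve(C.T, e)   # = C^{-T} e_i; requires C to be invertible
```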
**Re-weighting algorithms for online label shift adaptation.** The result of Theorem 1 allows us to approximate the expected 0-1 loss and its gradient for any function $f$ over $Q_t$, which naturally suggests the following adaptation strategy: choose a hypothesis space $G$, and at each time step $t$, estimate the expected 0-1 loss and/or its gradient and apply any online learning algorithm to select the classifier $f_{t+1} \in G$ for time step $t + 1$. A natural choice for $G$ is the hypothesis space of the original classifier $f_0$, i.e., $G = H$. Indeed, the unbiased estimator for the gradient $\nabla_f \ell(f; Q_t)$ in Theorem 1 can be used to directly update $f_0$ with stochastic online gradient descent.

However, as we assume that only the marginal distribution $Q_t(y)$ drifts and $Q_t(x|y) = Q_0(x|y)$, the conditional distribution $Q_t(y|x)$ must be a re-weighted version of $Q_0(y|x)$. Specifically,

$$Q_t(y|x) = \frac{Q_t(y)}{Q_t(x)} Q_t(x|y) = \frac{Q_t(y)}{Q_t(x)} Q_0(x|y) = \frac{Q_t(y)}{Q_t(x)} \cdot \frac{Q_0(x)}{Q_0(y)} Q_0(y|x) \propto \frac{Q_t(y)}{Q_0(y)} Q_0(y|x). \tag{4}$$

Since the goal of $f_t$ is to approximate this conditional distribution, in principle only a re-weighting of the predicted probabilities is needed to correct for any label distribution shift. A similar insight has previously been exploited to correct for training-time label imbalance [6, 8, 16, 39] and for offline label shift adaptation [4, 22]. We therefore focus our attention on the hypothesis space of re-weighted classifiers $G(f_0, q_0) = \{g(x; f_0, q_0, p) \mid p \in \Delta^{M-1}\}$, where the classifier $g(x; f_0, q_0, p)$ is defined by the learned parameter vector $p$ and takes the following form:

$$g(x; f_0, q_0, p) = \arg\max_{y \in Y} \frac{1}{Z(x)} \frac{p[y]}{q_0[y]} P_{f_0}(x)[y], \tag{5}$$

with $Z(x) = \sum_{y \in Y} \frac{p[y]}{q_0[y]} P_{f_0}(x)[y]$ being the normalization factor. We specialize the online label shift adaptation framework to the hypothesis space $G(f_0, q_0)$ in Framework 2, where the adaptation algorithm $A$ generates a re-weighting vector $p_t$ at each time step to construct the classifier $g(x; f_0, q_0, p_t)$.

```
Framework 2: The re-weighting framework for online label shift adaptation.
Input: A, f_0, q_0, D_0
1: f_1 = f_0
2: for t = 1, ..., T do
3:   Nature provides x_t, where (x_t, y_t) ~ Q_t and Q_t(x|y_t) = Q_0(x|y_t)
4:   Learner predicts f_t(x_t)
5:   Learner updates the re-weighting vector: p_{t+1} = A(f_0, q_0, D_0, {x_1, ..., x_t})
6:   Learner updates the classifier: f_{t+1} = g(·; f_0, q_0, p_{t+1})
7: end for
```

The online learning objective can then be re-framed as choosing the re-weighting vector $p_t$ that minimizes

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\ell(p_t; q_t)\right] - \inf_{p \in \Delta^{M-1}} \frac{1}{T} \sum_{t=1}^{T} \ell(p; q_t), \tag{6}$$

where $\ell(p_t; q_t) := \ell(g(\cdot; f_0, q_0, p_t); Q_t)$. For the remainder of this paper, we analyze several classical online learning techniques under this framework.

## 4 Online Adaptation Algorithms

We now describe our main algorithms for online label shift adaptation. In particular, we present and analyze two online learning techniques: Online Gradient Descent (OGD) and Follow The History (FTH), the latter of which closely resembles Follow The Leader [32]. We show that under mild and empirically verifiable assumptions, OGD and FTH both achieve $O(1/\sqrt{T})$ regret compared to the optimal fixed classifier in $G(f_0, q_0)$.

### 4.1 Algorithm 1: Online Gradient Descent

Online gradient descent (OGD) [32] is a classical online learning algorithm based on iteratively updating the hypothesis by following the gradient of the loss. Applied to our setting, the algorithm $A_{\text{ogd}}$ computes the stochastic gradient $\nabla_p \ell(p; \hat{q}_t)\big|_{p=p_t} = J_p(p_t) \, \hat{q}_t$ using Theorem 1, where

$$J_p(p_t) = \nabla_p \left[\mathbf{1} - \mathrm{diag}\left(C_{g(\cdot;\, f_0, q_0, p),\, Q_0}\right)\right]\Big|_{p=p_t} \tag{7}$$

denotes the $M \times M$ Jacobian with respect to $p$. OGD then applies the following update:

$$p_{t+1} = \mathrm{Proj}_{\Delta^{M-1}}\left(p_t - \eta \, \nabla_p \ell(p; \hat{q}_t)\big|_{p=p_t}\right), \tag{8}$$

where $\eta > 0$ is the learning rate and $\mathrm{Proj}_{\Delta^{M-1}}$ projects the updated vector onto $\Delta^{M-1}$.
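A minimal sketch of the update in Equation 8 (our implementation; the paper does not prescribe a particular projection routine, so we use the standard Euclidean projection onto the simplex). Here `grad_hat` stands for the stochastic gradient $J_p(p_t)\,\hat{q}_t$, obtained via either of the two gradient estimators described below.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]                      # coordinates in decreasing order
    cumsum = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - cumsum) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - cumsum[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def reweighted_predict(probs_f0, q0, p):
    """g(x; f0, q0, p) from Eq. 5: re-weight P_{f0}(x) by p[y]/q0[y] and take the argmax.
    The normalizer Z(x) is positive, so it does not affect the argmax."""
    return int(np.argmax(probs_f0 * p / q0))

def ogd_step(p_t, grad_hat, eta):
    """One projected online gradient step (Eq. 8)."""
    return project_to_simplex(p_t - eta * grad_hat)
```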
The convergence rate of $A_{\text{ogd}}$ depends on properties of the loss function $\ell(p; q)$. We empirically observe that $\ell(p; q)$ is approximately convex in the re-weighting parameter vector $p$, which we justify in detail in the appendix. To derive meaningful regret bounds for OGD, we further assume that the loss function $\ell(p; q)$ is Lipschitz with respect to $p$. Formally:

**Assumption 2 (Convexity).** For all $q \in \Delta^{M-1}$, $\ell(p; q)$ is convex in $p$.

**Assumption 3 (Lipschitz-ness).** $\sup_{p \in \Delta^{M-1},\, i = 1, \dots, M} \left\| \nabla_p \ell\left(p;\, C_{f_0,Q_0}^{-\top} e_i\right) \right\|_2$ is finite.

Below, we provide our regret bound for OGD, which guarantees an $O(1/\sqrt{T})$ convergence rate. The proof follows the classical proof for online gradient descent and is given in the appendix.

**Theorem 2 (Regret bound for OGD).** Under Assumptions 1, 2 and 3, let $L = \sup_{p \in \Delta^{M-1},\, i = 1, \dots, M} \left\| \nabla_p \ell\left(p;\, C_{f_0,Q_0}^{-\top} e_i\right) \right\|_2$. If $\eta = \frac{1}{L}\sqrt{\frac{2}{T}}$, then $A_{\text{ogd}}$ satisfies

$$\mathbb{E}_{(x_t, y_t) \sim Q_t}\left[\frac{1}{T} \sum_{t=1}^{T} \ell(p_t; q_t)\right] - \inf_{p \in \Delta^{M-1}} \frac{1}{T} \sum_{t=1}^{T} \ell(p; q_t) \;\leq\; L\sqrt{\frac{2}{T}}.$$

**Gradient estimation.** Computing the unbiased gradient estimator $\nabla_p \ell(p; \hat{q}_t)\big|_{p=p_t}$ involves the Jacobian term $J_p(p_t)$ in Equation 7, which is discontinuous when estimated using the hold-out set $D_0$. More precisely, each entry of $\mathbf{1} - \mathrm{diag}\left(C_{g(\cdot;\, f_0, q_0, p),\, Q_0}\right)$ is the expected 0-1 loss for a particular class, whose estimate using $D_0$ is a step function in $p$. This means that taking the derivative naively results in a gradient value of 0. To circumvent this issue, we apply finite-difference approximation [1] to compute $\nabla_p \ell(p; \hat{q}_t)\big|_{p=p_t}$, as detailed in Algorithm 3. We also apply smoothing, computing the average estimated gradient around the target point $p_t$, to improve gradient stability.

```
Algorithm 3: Gradient estimator for ∇_p ℓ(p; q)
Input: p, q, δ, k
1: for i = 1, ..., M do
2:   Δ_i^j := ℓ(p + jδ e_i; q) − ℓ(p − jδ e_i; q) for j = 1, ..., k
3:   ∇̂_p[i] := Σ_{j=1}^k α_j Δ_i^j / (2δj), where α_j = 2 (−1)^{j+1} binom(k, k−j) / binom(k+j, k)
4: end for
5: Return: ∇̂_p
```

Alternatively, we can minimize a smooth surrogate of the 0-1 loss that enables direct gradient computation. In detail, we define $\ell_{\text{prob}}(f; Q) := \mathbb{E}_{(x,y) \sim Q}\left[1 - P_f(x)[y]\right] \in [0, 1]$, so that $\ell_{\text{prob}} = \ell$ when $P_f$ outputs one-hot probability vectors. Furthermore, we show that $\ell_{\text{prob}}$ enjoys the same unbiased estimation properties as $\ell$ in Theorem 1, admits smooth gradient estimates using a finite hold-out set $D_0$, and is classification-calibrated in the sense of Tewari and Bartlett [38]. The formal statement and proof of the above properties are given in the appendix. In Section 5 we empirically evaluate OGD using both the finite-difference approach and the surrogate-loss approach for gradient estimation.

### 4.2 Algorithm 2: Follow The History

Next, we describe Follow The History (FTH), a minor variant of the prominent online learning strategy known as Follow The Leader (FTL) [32]. The basic intuition behind FTL is that the predictor $f_t$ for time step $t$ is the one that minimizes the average loss over the previous $t - 1$ time steps. Formally:

$$p_{t+1} = \arg\min_{p \in \Delta^{M-1}} \frac{1}{t} \sum_{\tau=1}^{t} \ell(p; \hat{q}_\tau) = \arg\min_{p \in \Delta^{M-1}} \ell\left(p;\, \frac{1}{t} \sum_{\tau=1}^{t} \hat{q}_\tau\right), \tag{9}$$

where the second equality holds by the linearity of $\ell$ in $q$. However, faithfully executing FTL requires optimizing the loss in Equation 9 at each time step, which can be very inefficient, since multiple gradients of $\ell$ need to be computed, as opposed to a single gradient computation for OGD.

To address this efficiency concern, observe that if the original classifier $f_0$ is Bayes optimal, then for any $q_t \in \Delta^{M-1}$, the minimizer over the re-weighting vector $p$ of $\ell(p; q_t)$ is the test-time label marginal probability vector $q_t$ itself (cf. Equation 4). In fact, we show in the appendix that this assumption often holds approximately in practice, especially when $f_0$ achieves a low error on $Q_0$ and is well-calibrated [12, 26, 41]. Assuming this approximation error is bounded by some $\delta \geq 0$, we can derive a more efficient update rule and a corresponding regret bound. Formally:

**Assumption 4 (Symmetric optimality).** For any $q \in \Delta^{M-1}$, $\left\| q - \arg\min_{p \in \Delta^{M-1}} \ell(p; q) \right\|_2 \leq \delta$.

We define the more efficient Follow The History update rule $A_{\text{fth}}$ as

$$p_{t+1} = \frac{1}{t} \sum_{\tau=1}^{t} \hat{q}_\tau,$$

which is a simple average of the estimates $\hat{q}_\tau$ from all previous iterations of the algorithm. FTH coincides with FTL when Assumption 4 holds with $\delta = 0$.
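The FTH update is just a running average of the per-step estimates $\hat{q}_\tau$, which the following sketch (ours) maintains incrementally in $O(M)$ time per step:

```python
import numpy as np

class FollowTheHistory:
    """FTH: p_{t+1} = (1/t) * sum_{tau <= t} q_hat_tau, kept as a running mean."""
    def __init__(self, q0):
        # p_1 corresponds to the training marginals q_0, so f_1 = f_0.
        self.p = np.asarray(q0, dtype=float).copy()
        self.t = 0

    def update(self, q_hat):
        self.t += 1
        self.p += (q_hat - self.p) / self.t   # incremental running mean
        return self.p                         # this is p_{t+1}
```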
In the following theorem, we derive the regret bound for FTH when $\delta = 0$; the general case of $\delta \geq 0$ is proved in the appendix. The theorem relies on a Lipschitz-ness assumption slightly different from Assumption 3, which we state formally below.

**Assumption 5 (Lipschitz-ness for FTH).** $\sup_{p, q \in \Delta^{M-1}} \left\| \nabla_p \ell(p; q) \right\|_2$ is finite.

**Theorem 3.** Under Assumptions 4 and 5 with $\delta = 0$, with probability at least $1 - 2MT^{-7}$ over samples $(x_t, y_t) \sim Q_t$ for $t = 1, \dots, T$, $A_{\text{fth}}$ satisfies

$$\frac{1}{T} \sum_{t=1}^{T} \ell(p_t; q_t) - \inf_{p \in \Delta^{M-1}} \frac{1}{T} \sum_{t=1}^{T} \ell(p; q_t) \;\leq\; cL\sqrt{\frac{2 \ln T}{T}},$$

where $c = 2 \max_{i=1,\dots,M} \left\| C_{f_0,Q_0}^{-\top} e_i \right\|_2$ and $L = \sup_{p, q \in \Delta^{M-1}} \left\| \nabla_p \ell(p; q) \right\|_2$.

## 5 Experiments

In this section, we empirically evaluate the proposed algorithms on datasets with both simulated and real-world online label shifts. Our simulated label shift experiments are performed on CIFAR-10 [20], where we vary the shift process and explore the robustness of different algorithms. For real-world label shift, we evaluate on the arXiv dataset for paper categorization, where label shift occurs naturally over years of paper submissions due to changing interest in different academic disciplines.

### 5.1 Experiment set-up

**Online algorithm set-up.** We evaluate both the OGD and FTH algorithms from Section 4. For OGD, we use the learning rate $\eta = \frac{1}{L}\sqrt{\frac{2}{T}}$ suggested by Theorem 2, where $L$ is estimated by taking the maximum of the gradient norm in Assumption 3 over $\{e_y : y \in Y\}$ and 100 vectors $p$ uniformly sampled from $\Delta^{M-1}$. The gradient estimate can be derived either with the finite-difference method in Algorithm 3 or by directly differentiating the surrogate loss $\ell_{\text{prob}}$; we evaluate both methods in our experiments.

For FTH, we evaluate both the algorithm $A_{\text{fth}}$ defined in Section 4.2 and a heuristic algorithm $A_{\text{ftfwh}}$ which we call Follow The Fixed Window History (FTFWH). Different from FTH, where $p_{t+1}$ is the simple average of $\hat{q}_\tau$ across all previous time steps $\tau = 1, \dots, t$, FTFWH averages the previous estimates $\hat{q}_\tau$ over a fixed window of size $w$:

$$p_{t+1} = \frac{1}{\min\{w, t\}} \sum_{\tau=\max\{1,\, t-w+1\}}^{t} \hat{q}_\tau.$$

Intuitively, FTFWH treats the distribution $Q_t$ as fixed for $w$ time steps and solves the offline label shift adaptation problem for the next time step. We use three different window lengths $w = 100, 1000, 10000$ in our experiments. We will show that FTFWH can be optimal at times but is inconsistent in performance, especially when the window size $w$ coincides with the periodicity of the label distribution shift.
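FTFWH replaces FTH's running mean with a sliding-window mean; a minimal sketch (ours), using a bounded deque:

```python
from collections import deque
import numpy as np

class FollowTheFixedWindowHistory:
    """FTFWH: p_{t+1} = average of q_hat over the last min(w, t) time steps."""
    def __init__(self, w):
        self.window = deque(maxlen=w)   # the deque silently drops estimates older than w steps

    def update(self, q_hat):
        self.window.append(q_hat)
        return np.mean(self.window, axis=0)   # this is p_{t+1}
```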
**Baselines.** In addition, we consider the following baseline classifiers as benchmarks for the online adaptation algorithms. *Base Classifier (BC)* refers to the classifier $f_0$ without any online adaptation, which serves as a reference point for evaluating the performance of online adaptation algorithms. *Optimal Fixed Classifier (OFC)* refers to the best-in-class classifier in $G(f_0, q_0)$, i.e., the optimum in Equation 6. We denote the re-weighting vector that achieves this optimum by $p_{\text{opt}}$. Under simulated label shift, we can define the ground-truth label marginal probability vector $q_t$ and optimize for $p_{\text{opt}}$ directly. For the experiment on arXiv, we derive the optimum using the empirical loss: $p_{\text{opt}} = \arg\min_{p \in \Delta^{M-1}} \ell\left(p;\, \frac{1}{T}\sum_{t=1}^{T} e_{y_t}\right)$, where $y_t$ is the ground-truth label at time $t$. Note that OFC is not a practical algorithm, since $y_t$ is not observed, but it can be used to benchmark different adaptation algorithms and estimate their empirical regret.

**Evaluation metric.** Computing the actual regret requires access to $q_t$ for all $t$, which we do not observe in the real-world dataset. To make all evaluations consistent, we report the average error $\frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\left(g(x_t; f_0, q_0, p_t) \neq y_t\right)$ as an approximation of $\frac{1}{T}\sum_{t=1}^{T} \ell(p_t; q_t)$. This approximation is valid for large $T$ due to its exponential concentration rate by the Azuma-Hoeffding inequality.

### 5.2 Evaluation on CIFAR-10 under simulated shift

**Dataset and model.** We conduct our simulated label shift experiments on CIFAR-10 [20] with a ResNet-18 [13] classifier. We divide the original training set into a training and a validation split at a ratio of 3:2. The training split is used to train the base model $f_0$, and the validation split $D_0$ is used both for temperature scaling calibration [12] and to estimate the confusion matrix. The original test set is used for online data sampling. Additional training details are provided in the appendix.

**Simulated shift processes.** Let $q^{(1)}, q^{(2)} \in \Delta^{M-1}$ be two fixed probability vectors. We define the following simulated shift processes for the test-time label marginal probability vector $q_t$ (a generator sketch follows below):

- *Constant shift:* $q_t = q^{(1)}$ for all $t$, which coincides with the setting of offline label shift.
- *Monotone shift:* $q_t$ interpolates linearly from $q^{(1)}$ to $q^{(2)}$, i.e., $q_t := \left(1 - \frac{t}{T}\right) q^{(1)} + \frac{t}{T} q^{(2)}$.
- *Periodic shift:* $q_t$ alternates between $q^{(1)}$ and $q^{(2)}$ at a fixed period $T_p$. We test three different periods $T_p = 100, 1000, 10000$.
- *Exponential periodic shift:* $q_t$ alternates between $q^{(1)}$ and $q^{(2)}$ with an exponentially growing period. Formally, for $t \in [k^{2i}, k^{2i+1})$, $q_t := q^{(1)}$; for $t \in [k^{2i-1}, k^{2i})$, $q_t := q^{(2)}$. We use $k = 2, 5$ in our experiments.

In our experiments, $q^{(1)}$ and $q^{(2)}$ are defined to concentrate on the dog and cat classes, respectively. That is, $q^{(1)}[\text{dog}] = 0.55$ and $q^{(1)}[y] = 0.05$ for every other class $y$, and similarly for $q^{(2)}$. The end time $T$ is set to 100,000 for all simulation experiments. All results are repeated using three different random seeds that randomize the samples drawn at each time step $t$.
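For reproducibility, here is a sketch (ours; the interval conventions for the exponential schedule follow our reading of the definitions above) that generates the sequence $q_1, \dots, q_T$ for the four shift processes:

```python
import numpy as np

def shift_schedule(q1, q2, T, kind, Tp=1000, k=2):
    """Generate q_t for t = 1, ..., T under the simulated shift processes."""
    qs = np.empty((T, len(q1)))
    for t in range(1, T + 1):
        if kind == "constant":
            q = q1
        elif kind == "monotone":        # linear interpolation from q1 to q2
            q = (1 - t / T) * q1 + (t / T) * q2
        elif kind == "periodic":        # alternate between q1 and q2 every Tp steps
            q = q1 if (t // Tp) % 2 == 0 else q2
        elif kind == "exp_periodic":    # alternate on intervals [k^i, k^{i+1});
            # floating-point rounding at exact powers of k is ignored in this sketch
            i = int(np.log(t) // np.log(k))
            q = q1 if i % 2 == 0 else q2
        else:
            raise ValueError(kind)
        qs[t - 1] = q
    return qs

# Example: q1 concentrated on "dog", q2 on "cat" (indices 5 and 3 in the
# standard CIFAR-10 class ordering), 0.55 mass on one class and 0.05 elsewhere.
q1 = np.full(10, 0.05); q1[5] = 0.55
q2 = np.full(10, 0.05); q2[3] = 0.55
qs = shift_schedule(q1, q2, T=100_000, kind="periodic", Tp=1000)
```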
**Results.** Table 1 shows the average error of various adaptation algorithms when applied to the simulated label shift processes. All adaptation algorithms outperform the base classifier $f_0$ (except FTFWH for periodic shift with $T_p = 100$), which serves as a sanity check that the algorithms are indeed adapting to the test distribution.

| Method | Constant | Monotone | Periodic $T_p=100$ | Periodic $T_p=1000$ | Periodic $T_p=10000$ | Exp. Periodic $k=2$ | Exp. Periodic $k=5$ |
|---|---|---|---|---|---|---|---|
| Base Classifier ($f_0$) | 12.43 ± 0.04 | 11.63 ± 0.08 | 11.63 ± 0.09 | 11.62 ± 0.08 | 11.63 ± 0.10 | 11.67 ± 0.07 | 11.81 ± 0.11 |
| Opt. Fixed Classifier | 7.78 ± 0.10 | 10.24 ± 0.08 | 10.25 ± 0.08 | 10.24 ± 0.08 | 10.24 ± 0.08 | 10.27 ± 0.07 | 10.25 ± 0.09 |
| FTH | 7.68 ± 0.11 | 10.36 ± 0.06 | 10.27 ± 0.08 | 10.25 ± 0.06 | 10.33 ± 0.10 | 10.23 ± 0.06 | 10.25 ± 0.04 |
| FTFWH $w=10^2$ | *8.71 ± 0.10* | 10.19 ± 0.05 | *12.15 ± 0.01* | **8.85 ± 0.04** | **8.46 ± 0.11** | **8.52 ± 0.06** | **8.54 ± 0.05** |
| FTFWH $w=10^3$ | 7.74 ± 0.07 | **9.52 ± 0.10** | 10.25 ± 0.07 | *11.16 ± 0.12* | **7.84 ± 0.09** | **7.77 ± 0.06** | **7.67 ± 0.08** |
| FTFWH $w=10^4$ | 7.67 ± 0.08 | **9.53 ± 0.10** | 10.26 ± 0.07 | 10.26 ± 0.06 | *10.83 ± 0.07* | **8.93 ± 0.07** | **8.46 ± 0.07** |
| OGD (finite diff.) | 8.08 ± 0.08 | 9.79 ± 0.09 | 10.71 ± 0.10 | 10.62 ± 0.09 | 10.11 ± 0.06 | **8.99 ± 0.05** | **8.56 ± 0.12** |
| OGD (surrogate loss) | 7.78 ± 0.11 | 9.75 ± 0.07 | 10.24 ± 0.06 | 10.21 ± 0.07 | 10.05 ± 0.09 | **8.92 ± 0.04** | **8.50 ± 0.11** |

Table 1: Average error (%) for different adaptation algorithms under simulated label shift on CIFAR-10. Standard deviation is computed across three runs. Results that are better than the OFC benchmark by 0.5 or more are shown in bold, and results that are worse than the OFC benchmark by 0.5 or more are shown in italics.

Comparison with the optimal fixed classifier (OFC) reveals more insightful characteristics of the different algorithms, which we discuss separately as follows.

The performance of **Follow The History (FTH)** is guaranteed by Theorem 3 to be competitive with OFC when the base model $f_0$ is well-calibrated. Indeed, as shown in the table, the average error of FTH is close to that of OFC for all shift processes. However, it is also very conservative and never beats OFC by a margin larger than 0.5.

**Follow The Fixed Window History (FTFWH)** with a suitably chosen window size performs very well empirically, especially for constant shift, monotone shift, and exponential periodic shift. However, when encountering periodic shift with periodicity $T_p$ equal to the window size $w$ (shown in italics in Table 1), FTFWH is consistently subpar compared to OFC, and sometimes even worse than the base classifier $f_0$. Since real-world distribution shifts are often periodic in nature (e.g., seasonal trends for flu and hay fever), this result suggests that deploying FTFWH may require knowledge of the periodicity in advance, which may not be feasible. In fact, we show in our experiment on real-world label shift on arXiv that FTFWH is never better than FTH and OGD.

**Online gradient descent (OGD)** with learning rate $\eta = \frac{1}{L}\sqrt{\frac{2}{T}}$ is likewise guaranteed by Theorem 2 to be as good as OFC, which we observe empirically as well. Moreover, unlike FTH, which only achieves an average error no worse than that of OFC, OGD is able to outperform OFC in certain scenarios, such as monotone shift, periodic shift with $T_p = 10000$, and exponential periodic shift with $k = 2, 5$. We also observe that OGD using the surrogate loss for gradient estimation is consistently better than OGD with finite-difference gradient estimation. Overall, OGD is the most reliable adaptation algorithm in the above simulations, as it is uniformly as good as OFC and sometimes achieves even better results.

### 5.3 Evaluation on arXiv under real-world distribution shift

**Dataset and model.** We experiment on the arXiv dataset for categorization of papers from the Computer Science domain into 23 refined categories (the domain actually contains 40 categories, from which we select the 23 most populated). There are a total of 233,748 papers spanning the years 1991 to 2020, which we sort by submission time and divide at a ratio of 2:1:1 into training, validation, and test sets. For each paper, we compute the tf-idf vector of its abstract as the feature $x$, and we use the first category in its primary category set as the true label $y$. The base model $f_0$ is an L2-regularized multinomial regressor. As in the simulated shift experiments, the validation set $D_0$ is used to calibrate the base model $f_0$ and estimate the confusion matrix. Additional details on data processing and training are given in the appendix.

**Results.** Table 2 shows the average error of each adaptation algorithm over the test set, which consists of papers sorted by submission time with end time $T = 58{,}437$.

| Method | Base ($f_0$) | Opt. Fixed | FTH | FTFWH $w=10^2$ | FTFWH $w=10^3$ | FTFWH $w=10^4$ | OGD (finite diff.) | OGD (surr. loss) |
|---|---|---|---|---|---|---|---|---|
| Avg. Error (%) | 27.21 | 25.56 | 25.62 | 30.14 | 26.09 | 25.87 | 25.52 | 25.70 |

Table 2: Average test error (%) on the arXiv dataset.

In contrast to the simulated shift experiments in Section 5.2, FTFWH is consistently worse than the optimal fixed classifier (OFC), especially for window size $w = 100$, where it is even worse than the base classifier $f_0$. This result shows that despite the good performance of FTFWH for simulated label shifts on CIFAR-10, it encounters significant challenges when deployed in the real world. On the other hand, FTH and OGD both achieve an average error close to that of OFC, which again validates the theoretical regret bounds in Theorems 2 and 3.
Given the empirical performance of OGD on both simulated and real-world label shift, as well as its conceptual simplicity and ease of implementation, we recommend it as a practical go-to solution for online label shift adaptation in real-world settings.

## 6 Related Work

Label shift adaptation has received much attention lately. The seminal work by Saerens et al. [30] made the critical observation that the label shift condition of $p(x|y)$ being stationary implies the optimality of re-weighted classifiers. They defined an Expectation-Maximization (EM) procedure that utilizes this insight to learn the re-weighting vector maximizing likelihood on the test distribution. Alexandari et al. [2] studied this approach further and discovered that calibrating the model can lead to a significant improvement in performance. Another prominent strategy for learning the re-weighting vector is inverting the confusion matrix [22], which inspired our approach for the online setting. Extensions of this method include regularizing the re-weighting vector [4], generalizing the label shift condition to feature space rather than input space [37], and unifying the confusion matrix approach with the EM approach [10].

Another type of test-time distribution shift that has been widely studied is covariate shift. In contrast to the label shift assumption that $p(x|y)$ is constant, covariate shift assumes that $p(y|x)$ is constant and $p(x)$ changes between the training and test distributions. Earlier work by Lin et al. [21] relied on the assumption that the density function $p(x)$ is known, while subsequent work relaxed this assumption by estimating the density from data [11, 17, 34, 40]. Online extensions of the covariate shift problem have also been considered. Solutions to this problem either rely on knowledge of the test labels [18] or use unsupervised test-time adaptation methods tailored to visual applications [14, 24, 36].

More broadly, both label shift adaptation and covariate shift adaptation can be categorized under the general problem of domain adaptation. This problem is much more challenging because a test sample may not necessarily belong to the support of the training distribution. In this setting, there is a large body of theoretical work proving performance guarantees under bounded divergence between the training and test distributions [5, 15, 23, 33], as well as empirical methods for domain adaptation in the realm of deep learning [9, 19, 29, 35].

## 7 Conclusion

We presented a rigorous framework for studying online label shift adaptation, which addresses limitations in prior work on offline label shift. Under our framework, we showed that it is possible to obtain unbiased estimates of the expected 0-1 loss and its gradient without observing a single label at test time. This reduction enables the application of classical techniques from online learning to define practical adaptation algorithms. We showed that these algorithms admit rigorous theoretical guarantees on performance, while at the same time performing very well empirically and adapting to a variety of challenging label shift scenarios. One potential direction for future work is to relax Assumption 2 to a weak convexity assumption. With this relaxation, one would need to take a closer look at online learning techniques for weakly convex losses and apply them to the online label shift problem through the reduction introduced in this paper, possibly with additional reductions.
Another extension follows from the fact that, while we focused on re-weighting algorithms for online label shift adaptation in this paper, the reduction presented in Section 3 technically enables a larger set of solutions. Indeed, one possible future direction is to directly apply OGD to update the base classifier $f_0$, which may further improve the adaptation algorithm's performance.

## Acknowledgments and Disclosure of Funding

RW and KQW are supported by grants from the National Science Foundation (III-1618134, III-1526012, IIS-1149882, and IIS-1724282), the Bill and Melinda Gates Foundation, the Cornell Center for Materials Research with funding from the NSF MRSEC program (DMR-1719875), and SAP America.

## References

[1] Karsten Ahnert and Markus Abel. Numerical differentiation of experimental data: local versus global methods. Computer Physics Communications, 177(10):764-774, 2007.

[2] Amr Alexandari, Anshul Kundaje, and Avanti Shrikumar. Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In International Conference on Machine Learning, pages 222-232. PMLR, 2020.

[3] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[4] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations, 2019.

[5] Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, volume 19, page 137, 2007.

[6] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[7] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

[8] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268-9277, 2019.

[9] Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and Nicolas Courty. DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 447-463, 2018.

[10] Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary C. Lipton. A unified view of label shift estimation. arXiv preprint arXiv:2003.07554, 2020.

[11] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.

[12] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321-1330. PMLR, 2017.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[14] Judy Hoffman, Trevor Darrell, and Kate Saenko. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 867-874, 2014.
[15] Judy Hoffman, Mehryar Mohri, and Ningshan Zhang. Algorithms and theory for multiple-source adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8256-8266, 2018.

[16] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375-5384, 2016.

[17] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, volume 19, pages 601-608, 2006.

[18] Vidit Jain and Erik Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In CVPR 2011, pages 577-584. IEEE, 2011.

[19] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G. Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4893-4902, 2019.

[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[21] Yi Lin, Yoonkyung Lee, and Grace Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1):191-202, 2002.

[22] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122-3130. PMLR, 2018.

[23] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In 22nd Conference on Learning Theory, COLT 2009, 2009.

[24] Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. Online model distillation for efficient video inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3573-3582, 2019.

[25] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[26] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625-632, 2005.

[27] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[28] Joaquin Quiñonero-Candela, Masashi Sugiyama, Neil D. Lawrence, and Anton Schwaighofer. Dataset Shift in Machine Learning. MIT Press, 2009.

[29] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):801-814, 2019.

[30] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21-41, 2002.

[31] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris M. Mooij. On causal and anticausal learning. In ICML, 2012.

[32] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

[33] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[34] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
[35] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2058-2065, 2016.

[36] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229-9248. PMLR, 2020.

[37] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J. Gordon. Domain adaptation with conditional distribution matching and generalized label shift. In Advances in Neural Information Processing Systems, volume 33, 2020.

[38] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(5), 2007.

[39] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 7032-7042, 2017.

[40] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, page 114, 2004.

[41] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In International Conference on Machine Learning, volume 1, pages 609-616, 2001.

[42] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819-827. PMLR, 2013.