# Test-Time Adaptation with Source Based Auxiliary Tasks

Published in Transactions on Machine Learning Research (01/2025)

Motasem Alfarra (malfarra@qti.qualcomm.com), Qualcomm AI Research
Alvaro H.C. Correia (acorreia@qti.qualcomm.com), Qualcomm AI Research
Bernard Ghanem (bernard.ghanem@kaust.edu.sa), King Abdullah University of Science and Technology (KAUST)
Christos Louizos (clouizos@qti.qualcomm.com), Qualcomm AI Research

Reviewed on OpenReview: https://openreview.net/forum?id=XWAXcxNg4n

Abstract

This work tackles a key challenge in Test Time Adaptation (TTA): adapting on limited data. This challenge arises naturally from two scenarios. (i) Current TTA methods are limited by the bandwidth at which the stream reveals data, since conducting several adaptation steps on each revealed batch leads to overfitting. (ii) In many realistic scenarios, the stream reveals insufficient data for the model to fully adapt to a given distribution shift. We tackle the first scenario with auxiliary tasks in which we leverage unlabeled data from the training distribution. In particular, we propose distilling the predictions of the originally pretrained model on clean data during adaptation. We find that our proposed auxiliary task significantly accelerates adaptation to distribution shifts. We report a performance improvement over the state of the art of 1.5% and 6% on average across all corruptions on ImageNet-C under episodic and continual evaluation, respectively. To combat the second scenario of limited data, we analyze the effectiveness of combining federated adaptation with our proposed auxiliary task across different models, even when different clients observe different distribution shifts. We find that federated averaging not only enhances adaptation, but that combining it with our auxiliary task provides a notable 6% performance gain over previous TTA methods.
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

1 Introduction

Deep Neural Networks (DNNs) have achieved remarkable success, attaining state-of-the-art results in several applications (Ranftl et al., 2021; He et al., 2016; Deng et al., 2009). Still, their performance severely deteriorates whenever a shift exists between the training and testing distributions (Hendrycks et al., 2021a;b). Such distribution shifts are not unlikely in real-world settings. Changes in weather conditions (Hendrycks & Dietterich, 2019), camera parameters (Kar et al., 2022), data compression, or even adversarial perturbations (Goodfellow et al., 2015) are all examples of distribution shifts that might impact model performance. Needless to say, adapting to or mitigating the negative effects of distribution shifts is crucial to the safe deployment of DNNs in many cases, e.g., in self-driving cars. Test Time Adaptation (TTA) (Sun et al., 2020; Liu et al., 2021) methods adapt a pretrained model to the test distribution with the goal of mitigating drops in performance caused by distribution shifts. In practice, this typically translates into optimizing a proxy objective function on a stream of unlabeled test data in an online fashion (Wang et al., 2021). The TTA approach has shown great success in improving performance under distribution shifts in several scenarios (Niu et al., 2022; Wang et al., 2022; Yuan et al., 2023). However, to prevent overfitting, all TTA methods in the literature conduct a single adaptation step on each received batch at test time (Niu et al., 2023; Nguyen et al., 2023). This limits the efficacy of TTA methods by the bandwidth of the stream, thus hampering their online performance. Furthermore, the current paradigm of TTA focuses on updating a single model at a time, assuming the stream will reveal enough data to capture the underlying distribution shift.
Yet, in many realistic settings, the stream of data accessible to an individual model might be too scarce to enable adequate adaptation. In such scenarios, we might accelerate adaptation by leveraging other models being adapted to similar domain shifts in a collaborative fashion (Jiang & Lin, 2023). In this work, we tackle the aforementioned lack of data in TTA by proposing an auxiliary task that can be optimized at test time. Since the amount of data from a given distribution shift is limited by the bandwidth of the stream, we follow Kang et al. (2023); Gao et al. (2022); Niu et al. (2022) in leveraging unlabeled data from the training distribution. First, we show that one can enhance current TTA methods and accelerate adaptation to distribution shifts by introducing a simple auxiliary task consisting of the same proxy objective of previous TTA methods but applied to unlabeled clean data. Based on this observation, we propose DISTA (Distillation-based TTA), a better auxiliary objective that distills the predictions of the original pretrained model on clean unlabeled data during adaptation. Our empirical results on two benchmarks and three evaluation protocols show that DISTA produces significant improvements in performance. In summary, our contributions are threefold: (i) We present a methodology to analyze the effectiveness of auxiliary tasks in accelerating adaptation under distribution shift through lookahead analysis (Fifty et al., 2021). We show that one can leverage clean unlabeled data to better adapt to distribution shifts. (ii) We propose DISTA, a TTA method with a distillation-based auxiliary task. We conduct a comprehensive experimental analysis on the two standard large-scale TTA benchmarks ImageNet-C (Hendrycks & Dietterich, 2019) and ImageNet-3DCC (Kar et al., 2022), where we show how DISTA improves performance over state-of-the-art methods by a significant margin (1.5% under episodic evaluation and 6-8% under continual evaluation).
(iii) We further analyze a novel and realistic scenario where each individual model is presented with an insufficient amount of data for adaptation. We demonstrate how federated learning facilitates adaptation in this case, even when the observed distribution shift varies among clients. Moreover, we observe that DISTA provides a large performance gain (6% on ImageNet-C) over state-of-the-art methods in this federated setup.

2 Methodology

Preliminaries. Test Time Adaptation (TTA) studies the practical problem of adapting pretrained models to unlabeled streams of data from an unknown distribution that potentially differs from the training one. Let $f_\theta : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ be a classifier parametrized by $\theta$ that maps a given input $x \in \mathcal{X}$ to a probability simplex over $k$ labels (i.e., $f^i_\theta(x) \geq 0$ and $\|f_\theta(x)\|_1 = 1$). During the training phase, $f_\theta$ is trained on some source data $\mathcal{D}_s \subseteq \mathcal{X} \times \mathcal{Y}$, but at test time, it is presented with a stream of data $\mathcal{S}$ that might be differently distributed from $\mathcal{D}_s$. In this work, we focus on covariate shifts, i.e., changes in the distribution over the input space $\mathcal{X}$ due to, for instance, visual corruptions caused by changes in weather conditions faced by self-driving systems. TTA defines a learner $g(\theta, x)$ that adapts the network parameters $\theta$ and/or the received unlabeled input $x$ at test time to enhance model performance under distribution shifts. Formally, and following the online learning notation (Shalev-Shwartz, 2011; Cai et al., 2021; Ghunaim et al., 2023; Alfarra et al., 2024), we describe the interaction at a time step $t \in \{0, 1, \dots\}$ between a TTA method $g$ and the stream of unlabeled data $\mathcal{S}$ as:

1. The stream $\mathcal{S}$ reveals a sample $x_t$.
2. The learner $g$ adapts $x_t$ to $\hat{x}_t$ and $\theta_t$ to $\hat{\theta}_t$ before issuing a prediction $\hat{y}_t = f_{\hat{\theta}_t}(\hat{x}_t)$.
3. The learner $g$ updates the model parameters with $\theta_{t+1} = \alpha \theta_t + (1 - \alpha)\hat{\theta}_t$, for $0 \leq \alpha \leq 1$.

Importantly, TTA is concerned with online evaluation, meaning the learner must issue its prediction $\hat{y}_t$ immediately after observing $x_t$.
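The three-step interaction above can be sketched as a minimal online loop. This is an illustrative sketch, not the paper's implementation: `adapt` and `predict` are hypothetical placeholders standing in for a concrete TTA method and classifier.

```python
def online_tta(stream, theta0, adapt, predict, alpha=0.0):
    """Minimal sketch of the online TTA protocol.

    stream  -- iterable of unlabeled test samples x_t
    adapt   -- maps (theta_t, x_t) -> (theta_hat_t, x_hat_t)   (step 2)
    predict -- maps (theta, x) -> y_hat
    alpha   -- interpolation weight for the parameter update
               theta_{t+1} = alpha * theta_t + (1 - alpha) * theta_hat_t
    """
    theta = theta0
    predictions = []
    for x_t in stream:                                # step 1: stream reveals x_t
        theta_hat, x_hat = adapt(theta, x_t)          # step 2: adapt params and/or input
        predictions.append(predict(theta_hat, x_hat)) # prediction issued immediately
        # step 3: interpolate previous and adapted parameters
        theta = [alpha * p + (1.0 - alpha) * q for p, q in zip(theta, theta_hat)]
    return predictions, theta
```

With `alpha = 0` the learner simply keeps the adapted parameters, which matches the common case where the adapted model carries over to the next step.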
The main paradigm in TTA employs an unsupervised objective function that is optimized on the fly at test time to circumvent performance drops caused by domain shift. Wang et al. (2021) observed a strong correlation between the entropy of the output predictions for a given batch of inputs and the error rate.

Figure 1: Lookahead Analysis and Pipeline. (a) Running mean of lookahead over observed batches when employing Tent on both data revealed from the stream and $\mathcal{D}_s$ (Aux-Tent, equation 2). (b) Running mean of lookahead over observed batches using DISTA (equation 4). (c) Pipeline for our proposed DISTA (equation 5).

Based on that, Wang et al. (2021) proposed to minimize the entropy of the output predictions for a given batch of inputs at test time through:

$$\theta_{t+1} = \arg\min_{\theta} \, \mathbb{E}_{x_t \sim \mathcal{S}} \left[ E(f_\theta(x_t)) \right] \quad \text{with} \quad E(f_\theta(x_t)) = -\sum_i f^i_\theta(x_t) \log f^i_\theta(x_t). \quad (1)$$

In practice, the optimization problem is usually solved with a single gradient descent step to avoid overfitting the network parameters on each received batch. It is noteworthy that this approach is only effective when the received batches (i) have diverse sets of labels and (ii) relate to a single type of domain shift (Niu et al., 2023). In previous work, Niu et al. (2022) attempted to accommodate these drawbacks by deploying a data selection procedure, while Yuan et al. (2023) leveraged a balanced episodic memory that holds inputs with a diverse set of labels.

2.1 Test Time Adaptation with Auxiliary Tasks

TTA imposes many challenges due to its realistic setup, where the learner needs to adapt the model to unlabeled data revealed by the stream in an online manner.
The amount of data available for adaptation is thus fairly limited, as the learner only has access to the data revealed by the stream. Yet, the speed of adaptation matters: the faster the learner adapts to the distribution shift, the better its online performance. However, most TTA methods in the literature conduct a single adaptation step to prevent overfitting the model parameters to each received batch. That is, even when new batches are revealed slowly enough to allow multiple optimization steps, the learner $g$ cannot benefit from this additional time. This naturally begs the question: can we enhance the adaptation speed of TTA methods in this setting? In this work, we address this question through the lens of auxiliary tasks. Auxiliary tasks (Liebel & Körner, 2018; Lyle et al., 2021) are additional loss terms that indirectly optimize the desired objective function. In fact, the simple TTA objective in equation 1 can already be seen as an auxiliary loss of sorts, but unfortunately it is susceptible to overfitting. We take a step back and ask the following question: what could an adaptation method access at step $t$ other than $x_t$? EATA (Niu et al., 2022), for instance, leveraged source data $\mathcal{D}_s$ in an anti-forgetting regularizer, while DDA (Gao et al., 2022) used $\mathcal{D}_s$ to train a diffusion model that projects $x_t$ into the source domain. More recently, Kang et al. (2023) condensed $\mathcal{D}_s$ to construct a set of labeled examples per class used for adaptation. While one could potentially access labeled samples from $\mathcal{D}_s$ for the aforementioned approaches, several applications do not allow access to this labeled distribution (e.g., the training procedure can be outsourced with private training data). Note, however, that one could obtain unlabeled data from this distribution cheaply.
For example, one could store a few unlabeled data examples captured in clear weather conditions (for autonomous driving applications) in an episodic memory as a proxy for the source distribution before deploying the model, following Yuan et al. (2023). Having said that, a natural question arises: how can we use unlabeled samples from $\mathcal{D}_s$ to better adapt to distribution shifts in $\mathcal{S}$? We first examine a simple auxiliary task. During test time, we adapt the model not only on the data revealed from the stream (i.e., $x_t$) but also on a sample $x_s \sim \mathcal{D}_s$. For example, for the entropy minimization approach in equation 1, we get the following objective function:

$$\min_\theta \left[ \mathbb{E}_{x_t \sim \mathcal{S}} \, E(f_\theta(x_t)) + \mathbb{E}_{x_s \sim \mathcal{D}_s} \, E(f_\theta(x_s)) \right]. \quad (2)$$

At first glance, it is unclear whether the additional term in the loss function would effectively facilitate adaptation to domain shifts in $\mathcal{S}$. Therefore, to better analyze the effect of the auxiliary term, we break the optimization problem in equation 2 into two steps as follows:

$$\theta^c_t = \theta_t - \gamma \nabla_\theta E(f_{\theta_t}(x_t)), \qquad \theta_{t+1} = \theta^c_t - \gamma \nabla_\theta E(f_{\theta^c_t}(x_s)). \quad (3)$$

Note that the gradients in the first and second SGD steps are evaluated at $\theta_t$ and $\theta^c_t$, respectively. Now, we can study the effect of our auxiliary task by measuring the improvement in entropy after optimizing the auxiliary task via the notion of lookahead (Fifty et al., 2021), defined as

$$\text{Lookahead}(\%) = 100 \left( 1 - E(f_{\theta_{t+1}}(x_t)) \, / \, E(f_{\theta^c_t}(x_t)) \right).$$

The higher the lookahead, the better the auxiliary task is at minimizing the desired objective. We conduct experiments on the ImageNet-C benchmark (Hendrycks & Dietterich, 2019), where we fix $f_{\theta_0}$ to be a ResNet-50 (He et al., 2016) pretrained on the ImageNet dataset (Deng et al., 2009). We measure the lookahead over samples revealed from the stream when $\mathcal{S}$ contains one of three domain shifts (Gaussian Noise, Motion Blur, and Snow), and we take $\mathcal{D}_s$ to be a subset of unlabeled images from the training set.
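A small numerical sketch of the two-step update in equation 3 and of the lookahead metric, assuming entropies are computed from probability vectors and using a generic SGD update rather than the paper's exact training code:

```python
import math

def entropy(probs):
    """E(f_theta(x)) = -sum_i f^i log f^i, as in equation 1."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sgd_step(theta, grad, gamma):
    """One step of equation 3: theta <- theta - gamma * grad."""
    return [t - gamma * g for t, g in zip(theta, grad)]

def lookahead_pct(entropy_after_aux, entropy_before_aux):
    """Lookahead(%) = 100 * (1 - E(f_{theta_{t+1}}(x_t)) / E(f_{theta_t^c}(x_t))).

    entropy_before_aux -- entropy on x_t under theta_t^c (after the stream step only)
    entropy_after_aux  -- entropy on x_t under theta_{t+1} (after the auxiliary step)
    Positive values mean the auxiliary step further reduced entropy on x_t.
    """
    return 100.0 * (1.0 - entropy_after_aux / entropy_before_aux)
```

For instance, if the auxiliary step lowers the entropy on $x_t$ from 1.0 to 0.9, the lookahead is 10%, i.e., the auxiliary task helped the desired objective.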
For each received batch $x_t$ from the stream $\mathcal{S}$, we sample a batch $x_s$ from $\mathcal{D}_s$ of the same size for simplicity. Figure 1a summarizes the results. We observe that the simple auxiliary task of minimizing the entropy of predictions on source data has, surprisingly, a positive impact on the desired task (i.e., minimizing entropy on corrupted data). This hints that one could accelerate the convergence of adaptation on corrupted data by leveraging unsupervised auxiliary tasks on source data. We highlight that, through our lookahead analysis, one can analyze the effectiveness of different auxiliary tasks in TTA. We confirm the performance improvement hinted at by our lookahead analysis experimentally in Section 4.1 and Table 1. That is, by allowing existing TTA methods (such as Tent (Wang et al., 2021) and SHOT (Liang et al., 2020b)) to leverage source data, one can improve and accelerate their adaptation by adapting on source data as an auxiliary task. Next, we describe our proposed auxiliary task.

2.2 DISTA: Distillation Based Test Time Adaptation

In Section 2.1, we analyzed the positive impact of one example of an auxiliary task, observing that entropy minimization on source data does improve adaptation to domain shifts. Next, we propose a better and more powerful auxiliary task. We distill a saved copy of the original pretrained model $f_{\theta_0}$ during adaptation on samples from the source distribution. More precisely, we replace the entropy minimization term on the source data with a cross-entropy loss between the predictions of $f_{\theta_t}$ and $f_{\theta_0}$. We also use a data selection scheme similar to that of Niu et al. (2022) whereby we only update the model on samples with low entropy. Our overall objective function can be described as follows:

$$\min_\theta \; \mathbb{E}_{x_t \sim \mathcal{S}} \, \lambda_t(x_t) E(f_\theta(x_t)) + \mathbb{E}_{x_s \sim \mathcal{D}_s} \, \lambda_s(x_s) \, \text{CE}(f_\theta(x_s), f_{\theta_0}(x_s)) \quad (4)$$

where $\lambda_t(x) = \mathbb{1}\{E(f_{\theta_t}(x)) < E_0\}$, and $\lambda_s$ is defined analogously on source samples, filtering out unreliable examples whose prediction entropy exceeds the threshold $E_0$.
Further, we combine the aforementioned approach with the filtering approach of not updating the model on unreliable examples, where we observe another performance boost of 1%. At last, we replace SHOT as an auxiliary task with our proposed distillation scheme from Section 2.2, while maintaining the SHOT objective on corrupted data. In this case, we observe another significant performance boost, corroborating the superiority of our proposed auxiliary task and the orthogonality of our components to the adaptation method.

C.3.4 Components of DISTA

At last, we ablate the effect of each component of DISTA on the performance gain. Note that DISTA reduces to EATA if we remove the proposed auxiliary task. To that end, we report in Table 11 the error rate of EATA and its enhanced versions through our proposed auxiliary task. First, we analyze the effect of introducing our distillation scheme via Cross Entropy (CE) on clean data without filtering. We observe a 0.5% reduction in the average error rate, with the performance gain reaching 0.8% on the motion blur corruption. Further, we analyze combining the aforementioned approach with filtering unreliable samples (by employing $\lambda_s(x_s)$), observing another 0.4% performance boost. Finally, we include sample reweighting and increase the filtering margin $E_0$ to $0.5 \log(1000)$, resulting in another boost in accuracy (reduction in error rate). We note that we set the best hyperparameters for EATA, as recommended by the authors, with $E_0 = 0.4 \log(1000)$.

Table 10: Episodic Evaluation on ImageNet-C of SHOT with different auxiliary components with ResNet-50. We experiment with auxiliary components when combined with SHOT. (Aux.) represents applying SHOT on both clean and corrupted data. (Fil.) adds filtering unreliable examples. (DIS) replaces SHOT as an auxiliary task with our distillation task.

| Method | Gauss | Shot | Impul | Defoc | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contr | Elastic | Pixel | Jpeg | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SHOT | 73.1 | 69.8 | 72.0 | 76.9 | 75.9 | 58.5 | 52.7 | 53.3 | 62.2 | 43.8 | 34.6 | 82.6 | 46.0 | 42.3 | 48.9 | 59.5 |
| + Aux. | 67.1 | 64.9 | 65.7 | 69.0 | 69.9 | 55.5 | 49.8 | 50.7 | 58.7 | 42.3 | 33.3 | 68.2 | 44.4 | 41.1 | 46.5 | 55.1 |
| + Fil. | 66.2 | 64.1 | 64.3 | 68.5 | 68.7 | 54.9 | 49.0 | 50.0 | 56.7 | 41.7 | 32.7 | 64.2 | 44.0 | 40.6 | 45.9 | 54.1 |
| + DIS | 64.9 | 62.6 | 62.7 | 67.1 | 66.9 | 52.9 | 47.9 | 48.6 | 55.4 | 40.5 | 32.4 | 61.8 | 42.9 | 39.3 | 44.7 | 52.7 |

Table 11: Ablating DISTA with Episodic Evaluation on ImageNet-C with ResNet-50. We ablate each component of DISTA, where (CE) represents the distillation via Cross Entropy, (Fil.) represents the filtering, and DISTA is an improved version with a better hyperparameter setting ($E_0 = 0.5 \log(1000)$). Note that each proposed component provides a consistent performance boost.

| Method | Gauss | Shot | Impul | Defoc | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contr | Elastic | Pixel | Jpeg | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EATA | 64.0 | 62.1 | 62.5 | 66.9 | 66.9 | 52.5 | 47.4 | 48.2 | 54.2 | 40.2 | 32.2 | 54.6 | 42.2 | 39.2 | 44.7 | 51.9 |
| + CE | 63.2 | 61.2 | 61.6 | 66.3 | 66.3 | 51.7 | 46.9 | 47.9 | 53.9 | 39.7 | 31.9 | 54.3 | 41.9 | 39.1 | 44.4 | 51.4 |
| + Fil. | 62.9 | 60.7 | 61.4 | 65.8 | 65.9 | 51.2 | 46.5 | 47.6 | 53.7 | 39.3 | 31.7 | 54.3 | 41.6 | 38.5 | 44.1 | 51.0 |
| DISTA | 62.2 | 59.9 | 60.6 | 65.3 | 65.3 | 50.4 | 46.2 | 46.6 | 53.1 | 38.7 | 31.7 | 53.2 | 40.8 | 38.1 | 43.5 | 50.4 |

C.4 Ablation on the Size of the Source Dataset

We complement our results with an ablation study on the effect of the size of the source dataset $\mathcal{D}_s$ on the performance of DISTA. To that end, let $\mathcal{D}_s$ be a random subset of the validation set (unlabeled images). We conduct episodic evaluation on ImageNet-C using ResNet-50 for this ablation and report the results in Table 12, where we observe that DISTA is robust against variations in the size of $\mathcal{D}_s$. In particular, we observe that even with 10% of the validation set (i.e., storing 5000 unlabeled images), DISTA improves over EATA by 1.4% on average across all corruptions.
Furthermore, with only 1% of the validation dataset (500 unlabeled images), DISTA still improves on EATA by 1% on shot and impulse noise.

C.5 Limitations of DISTA

In our experiments, we showed how DISTA is effective across multiple evaluation protocols, two datasets, and four different architectures. We note here that the performance improvement of DISTA comes at the cost of a memory burden (storing data samples from $\mathcal{D}_s$). However, our experiments in Table 12 show that even with a very small set of unlabeled examples, DISTA is still effective in improving performance. In addition, we experimented with DISTA for when the source data is not available in Section 4.4.3, where DISTA is still very effective in enhancing performance over EATA. At last, one limitation of our federated TTA setting is the assumption that all clients have access to data from the source distribution. This makes our setting more applicable to the cross-silo setting, where the number of clients is not too large, leaving the exploration of other federated settings for future work.

Table 12: Effect of the Size of $\mathcal{D}_s$. We report the error rate of DISTA under episodic evaluation on ImageNet-C when $\mathcal{D}_s$ is a sub-sampled set of the validation set of ImageNet. We observe that DISTA is robust under varying the size of $\mathcal{D}_s$. Ratio represents the sub-sampling coefficient (i.e., a ratio of 25% means that DISTA only leverages 25% of the validation set as $\mathcal{D}_s$).

| Ratio (%) | Gauss | Shot | Impul | Defoc | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contr | Elastic | Pixel | Jpeg | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EATA (0.0%) | 64.0 | 62.1 | 62.5 | 66.9 | 66.9 | 52.5 | 47.4 | 48.2 | 54.2 | 40.2 | 32.2 | 54.6 | 42.2 | 39.2 | 44.7 | 51.9 |
| DISTA (1.0%) | 63.1 | 61.1 | 61.1 | 66.7 | 65.8 | 50.9 | 46.7 | 47.3 | 53.7 | 39.1 | 31.9 | 54.1 | 41.5 | 38.6 | 44.1 | 51.1 |
| DISTA (2.5%) | 62.6 | 60.8 | 60.9 | 65.7 | 65.8 | 50.9 | 46.6 | 47.2 | 53.4 | 39.1 | 31.7 | 54.0 | 41.5 | 38.7 | 43.8 | 50.8 |
| DISTA (5.0%) | 62.4 | 60.4 | 60.9 | 65.5 | 66.0 | 50.5 | 46.3 | 46.9 | 53.2 | 38.9 | 31.8 | 53.6 | 41.0 | 38.3 | 43.8 | 50.6 |
| DISTA (7.5%) | 62.6 | 60.3 | 60.8 | 65.4 | 65.3 | 50.4 | 46.4 | 46.8 | 53.3 | 38.9 | 31.7 | 53.8 | 41.2 | 38.2 | 43.7 | 50.6 |
| DISTA (10%) | 62.4 | 60.3 | 60.2 | 65.5 | 65.5 | 50.6 | 46.3 | 46.7 | 53.1 | 38.8 | 31.7 | 53.5 | 41.1 | 38.2 | 43.8 | 50.5 |
| DISTA (25%) | 62.2 | 60.4 | 60.6 | 65.8 | 65.5 | 50.5 | 46.3 | 46.7 | 53.1 | 38.6 | 31.7 | 53.3 | 40.9 | 38.2 | 43.6 | 50.5 |
| DISTA (50%) | 62.3 | 60.4 | 60.4 | 65.1 | 65.7 | 50.6 | 46.2 | 46.7 | 53.3 | 38.7 | 31.7 | 53.2 | 40.9 | 38.3 | 43.4 | 50.5 |
| DISTA (75%) | 62.3 | 59.9 | 60.5 | 64.8 | 65.2 | 50.4 | 46.0 | 46.8 | 53.1 | 38.7 | 31.7 | 53.7 | 40.9 | 38.1 | 43.5 | 50.4 |
| DISTA (100%) | 62.2 | 59.9 | 60.6 | 65.3 | 65.3 | 50.4 | 46.2 | 46.6 | 53.1 | 38.7 | 31.7 | 53.2 | 40.8 | 38.1 | 43.5 | 50.4 |

Table 13: Episodic Evaluation on ImageNet-C Benchmark. We compare the performance of EATA, DISTA, and leveraging labeled data for DISTA instead of the distillation task. We replace the distillation task with a cross-entropy loss between the predictions and the ground-truth labels. We observe that our unsupervised distillation scheme outperforms both EATA and leveraging labeled data. Nevertheless, DISTA+Labels still outperforms EATA by 0.8% on average.

| Method | Gauss | Shot | Impul | Defoc | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contr | Elastic | Pixel | Jpeg | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EATA | 64.0 | 62.1 | 62.5 | 66.9 | 66.9 | 52.5 | 47.4 | 48.2 | 54.2 | 40.2 | 32.2 | 54.6 | 42.2 | 39.2 | 44.7 | 51.9 |
| DISTA + Labels | 62.7 | 60.9 | 60.9 | 66.0 | 66.1 | 50.7 | 46.9 | 47.4 | 53.6 | 39.2 | 31.9 | 54.9 | 41.5 | 38.6 | 44.2 | 51.0 |
| DISTA | 62.2 | 59.9 | 60.6 | 65.3 | 65.3 | 50.4 | 46.2 | 46.6 | 53.1 | 38.7 | 31.7 | 53.2 | 40.8 | 38.1 | 43.5 | 50.4 |

C.6 Leveraging Labeled Source Data

At last, we study a variation of DISTA for when labeled data from the source distribution is available.
In this setting, one could replace the distillation loss in equation 4 with a supervised loss function. To that end, we analyze one variant of DISTA where we replace the distillation loss with a cross-entropy loss between the prediction of $f_{\theta_t}$ and the ground-truth labels. The modified objective function can be expressed as:

$$\min_\theta \; \mathbb{E}_{x_t \sim \mathcal{S}} \, \lambda_t(x_t) E(f_\theta(x_t)) + \mathbb{E}_{(x_s, y_s) \sim \mathcal{D}_s} \, \lambda_s(x_s) \, \text{CE}(f_\theta(x_s), y_s).$$

We experiment with this labeled variant of DISTA and report the results on ImageNet-C in Table 13 under episodic evaluation using the ResNet-50 architecture. We observe that leveraging hard (ground-truth) labels does not improve the results over our unsupervised distillation loss. Nevertheless, this supervised variant enhances performance over the previous state-of-the-art method, EATA. We provide the following hypothesis as to why the labeled variant of DISTA underperforms. The distillation auxiliary task regularizes the adapted model not to stray too far from the original model in function space. We hypothesize that this anti-forgetting regularization improves the stability of the model during adaptation, which facilitates the optimization problem and improves overall performance on corrupted data. When using the labels from the source data and optimizing the cross-entropy, we still have a regularizer with the same anti-forgetting motivation, but in this case we might not get the same stability during adaptation, due to the imperfect performance of $f_{\theta_0}$. In practice, for data points that are incorrectly classified by the model, the auxiliary loss term will be high and might dominate the TTA objective, thus slowing down adaptation and possibly pushing the model to less favorable regions of the loss landscape.

C.7 Impact of DISTA on Overfitting in TTA

Next, we assess the impact of our proposed task on reducing overfitting when adapting with a TTA method with multiple adaptation steps.
The following table reports the error rate on ImageNet-C under episodic evaluation, where EATA-X/DISTA-X represents adapting with EATA/DISTA using X adaptation steps.

Table 14: Episodic Evaluation on ImageNet-C Benchmark under a Larger Number of Adaptation Steps for EATA vs. DISTA. We compare the performance (error rate) when adapting with either EATA or DISTA with multiple adaptation steps. Our proposed auxiliary task in DISTA slows down overfitting when adapting to the revealed batch with multiple adaptation steps.

| Num. Steps | Gauss | Shot | Impul | Defoc | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contr | Elastic | Pixel | Jpeg | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EATA-1 | 64.0 | 62.1 | 62.5 | 66.9 | 66.9 | 52.5 | 47.4 | 48.2 | 54.2 | 40.2 | 32.2 | 54.6 | 42.2 | 39.2 | 44.7 | 51.9 |
| EATA-2 | 68.3 | 63.8 | 65.9 | 72.6 | 72.4 | 53.7 | 48.6 | 49.3 | 55.7 | 40.5 | 32.9 | 58.2 | 42.9 | 39.8 | 45.7 | 54.0 |
| EATA-3 | 74.1 | 70.5 | 75.6 | 86.9 | 81.2 | 58.9 | 52.0 | 51.8 | 60.2 | 41.7 | 34.2 | 74.4 | 45.4 | 41.5 | 48.0 | 59.8 |
| EATA-4 | 90.4 | 82.1 | 85.7 | 96.8 | 91.5 | 67.1 | 52.9 | 55.5 | 67.2 | 43.0 | 35.2 | 95.7 | 46.1 | 42.0 | 49.0 | 66.7 |
| EATA-5 | 95.3 | 92.9 | 93.2 | 97.1 | 96.2 | 70.6 | 56.6 | 56.9 | 74.7 | 45.1 | 35.4 | 97.3 | 47.9 | 44.4 | 51.2 | 70.3 |
| DISTA-1 | 62.2 | 59.9 | 60.6 | 65.3 | 65.3 | 50.4 | 46.2 | 46.6 | 53.1 | 38.7 | 31.7 | 53.2 | 40.8 | 38.1 | 43.5 | 50.4 |
| DISTA-2 | 67.5 | 63.2 | 63.9 | 70.6 | 70.9 | 53.0 | 48.4 | 48.2 | 55.9 | 39.9 | 32.2 | 60.2 | 42.4 | 39.2 | 44.9 | 53.4 |

We first compare EATA-2 with DISTA-1, as DISTA conducts two sequential adaptation steps, making it directly comparable with EATA-2. We report an average error rate of 50.4% for DISTA-1 compared with 54.0% for EATA-2. Further, we compare DISTA-2 with EATA-4, where the performance gap becomes much larger (53.4% for DISTA-2 compared to 66.7% for EATA-4). We also note that DISTA-2, with four total adaptation steps, still outperforms EATA-2, showing that DISTA reduces overfitting through the proposed auxiliary task.

C.8 Evolution of $\lambda_s$ and $\lambda_t$

At last, we report the evolution of the data selection functions $\lambda_t$ and $\lambda_s$ throughout adaptation with DISTA.
Figure 4 summarizes the evolution. We observe that: (i) as the number of batches increases, $\lambda_t$ increases, as the model grows more confident in predicting data from the domain shift; (ii) on the other hand, $\lambda_s$ remains stable at a higher level than $\lambda_t$, owing to the model's originally high confidence on the source domain along with our proposed distillation loss.

Figure 4: Evolution of $\lambda_t$ and $\lambda_s$ during adaptation (running means over observed batches for Gaussian Noise, Motion Blur, and Snow).
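The curves in Figure 4 track, per batch, the fraction of samples passing each entropy filter. A minimal sketch of how such a selection rate can be computed (illustrative only; the actual thresholds and model predictions come from the experiments above):

```python
import math

def entropy(p):
    """E(f(x)) = -sum_i f^i log f^i."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def selection_rate(batch_probs, e0):
    """Fraction of samples in a batch with lambda(x) = 1, i.e. E(f(x)) < e0."""
    kept = sum(1 for p in batch_probs if entropy(p) < e0)
    return kept / len(batch_probs)
```

Averaging this rate over the test stream (for $\lambda_t$) and over the source batches (for $\lambda_s$) yields curves of the kind shown in Figure 4.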