# adaptive_testtime_personalization_for_federated_learning__07441881.pdf

Adaptive Test-Time Personalization for Federated Learning

Wenxuan Bao1 , Tianxin Wei1 , Haohan Wang1, Jingrui He1

1University of Illinois Urbana-Champaign {wbao4,twei10,haohanw,jingrui}@illinois.edu

Personalized federated learning algorithms have shown promising results in adapting models to various distribution shifts. However, most of these methods require labeled data on testing clients for personalization, which is usually unavailable in real-world scenarios. In this paper, we introduce a novel setting called test-time personalized federated learning (TTPFL), where clients locally adapt a global model in an unsupervised way without relying on any labeled data during test-time. While traditional test-time adaptation (TTA) can be used in this scenario, most of them inherently assume training data come from a single domain, while they come from multiple clients (source domains) with different distributions. Overlooking these domain interrelationships can result in suboptimal generalization. Moreover, most TTA algorithms are designed for a specific kind of distribution shift and lack the flexibility to handle multiple kinds of distribution shifts in FL. In this paper, we find that this lack of flexibility partially results from their pre-defining which modules to adapt in the model. To tackle this challenge, we propose a novel algorithm called ATP to adaptively learns the adaptation rates for each module in the model from distribution shifts among source domains. Theoretical analysis proves the strong generalization of ATP. Extensive experiments demonstrate its superiority in handling various distribution shifts including label shift, image corruptions, and domain shift, outperforming existing TTA methods across multiple datasets and model architectures. Our code is available at https://github.com/baowenxuan/ATP.

1 Introduction

Federated learning (FL) is a distributed learning system where multiple clients collaborate to train a machine learning model under the orchestration of the central server, while keeping their data decentralized [31, 18]. However, clients in FL typically exhibit distinct data distributions. For example, in the context of animal image classification, users tend to capture pictures of various animals prevalent in their respective regions, introducing label shift [51] to the local image dataset. Meanwhile, even when capturing images of the same species, the visual appearance can be influenced by the environment and camera settings, introducing feature shift [34]. It is crucial that each client can adapt the model to align with its unique data distribution [44]. Previous personalized federated learning (PFL) works have mainly focused on improving the performance on clients participating in training [41, 37, 25, 4] or generalization to new clients [8, 7, 6], assuming the availability of labeled data. However, in many real-world scenarios, clients do not have labeled data for personalization, which limits the application of PFL algorithms. For example, when employing an animal image classifier to mobile phones, their users may capture images of various animals, but without any accompanying labels indicating the species of the animal.

*Equal contribution.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

In this paper, we introduce a novel setting named test-time personalized federated learning (TTPFL). During the training phase, a global model is trained using source clients. During the testing phase, each target client downloads the global model and locally personalizes the model with its unlabeled data during test-time. This setting is particularly well-suited for cross-device FL, especially when generalizing to a large number of target clients that have not participated in the training phase and lack labeled data for supervised personalization. Compared to global FL, which trains a shared global model for all clients, TTPFL enables model adaptation to individual target clients facing complex distribution shifts. Compared to standard PFL, TTPFL does not necessitate additional labeled data from target clients for adaptation.

Test-time adaptation (TTA), which adapts a pretrained model from the source domain to an unlabeled target domain, could be a solution for TTPFL. However, applying current TTA methods to FL poses two challenges. First, most TTA methods assume training data are sampled from a single domain [16, 52]. In FL, where source data are distributed across multiple clients, this simplification neglects interrelationships among source domains, impacting generalization. Furthermore, the current TTA methods are usually customized for specific distribution shifts and lack the flexibility to address diverse types of distribution shifts in FL. The inflexibility of existing TTA algorithms largely results from their predefined selection of modules to adapt (e.g., feature extractor [28, 43], final linear layer [16, 36], batch normalization layers [38, 45]). However, different modules encode varying semantic information levels, and adapting specific modules may be effective for certain shifts but not others [20]. Meanwhile, although the distribution shifts among source and target clients cannot be directly inspected, the same type of distribution shifts is likely to exist among source clients. We argue that

Which modules to adapt should depend on the type of distribution shifts among clients, which can be inferred from source clients.

Motivated by this, we propose a new Adaptive Test-time Personalization algorithm called ATP to learn the adaptation rates from distribution shifts among source clients. During training, each source client simulates unsupervised adaptation and refine the adaptation rates of each module to maximize the effect of unsupervised adaptation. The server aggregates local adaptation rates periodically to improve generalization. During testing, each target client leverages learned adaptation rates to locally adapt the global model, and cumulatively averages adapted models from previous batches to enhance the performance for online TTA. Theoretical analysis confirms ATP s robust generalization due to its utilization of multiple sources and low-dimensional adaptation rates. Extensive experiments demonstrate its superiority in addressing various distribution shifts scenarios, including label shift, image corruptions, and domain shift, consistently outperforming existing TTA methods across multiple datasets and model architectures. We summarize our contributions as follows.

We consider TTPFL, a new learning setting in FL, addressing the challenge of generalizing to new unlabeled clients under complex distribution shifts. (Section 3)

We introduce ATP, which adaptively learns the adaptation rate for each module, enabling it to handle different types of distribution shifts. (Section 4)

We provide theoretical analysis confirming ATP s robust generalization. (Section 5)

We empirically evaluate ATP over various distribution shifts scenarios, using a wide range of datasets and models. (Section 6)

2 Related works

Federated learning (FL) is a distributed learning system where multiple clients collaborate to train a machine learning model under a central server s orchestration while keeping data decentralized [18].

Personalized federated learning (PFL) extends this framework by allowing each client to personalize the model to its own local data. The most straightforward PFL method is fine-tuning the global model with a few steps of gradient descent [48, 8, 7]. Similarly, another line of works use the global model as a regularizer [22] during local training. Fed THE [17] focuses on evolving local testing set, and proposes a test-time adaptation algorithm for FL that adaptively combines global and personalized models. However, all these methods require labeled data to construct personalized models. Fed-Ro D [6] uses hypernetworks to generate personalized model, relaxing the requirements for labeled data. But it still requires the label distribution of the client. Fed UL [30] trains a global model with only

unlabeled clients. However, it is limited to label shift where each client shares the label-conditional feature distribution p(x|y). Our setting is mostly similar to OD-PFL [2], which also focuses on generalization to new unlabeled client. It uses an unsupervised client encoder and a hypernetwork [39] to generate personalized model. However, OD-PFL requires re-training a large hypernetwork, while our TTPFL setting focuses on adapting an existing global model.

Test-time adaptation (TTA) aims to adapt a machine learning model to a testing set with dataset shift during test-time without re-accessing training data. Most of the TTA methods focus on either feature shift or label shift. For feature shift (same p(y|x), different p(x)), entropy minimization is frequently used to adapt the model in the unsupervised fashion. Tent [45] minimizes the average prediction entropy by adapting the batch normalization layers [15]. MEMO [52] minimizes the marginal entropy over different augmentations of the sample input image by adjusting all model parameters. SHOT [28] exploits information maximization and pseudo-labeling to achieve target-specific feature extraction. Differently, T3A [16] adjusts the final classification layer, but it is also shown to implicitly reduce the entropy. It is important to notice that all these methods pre-define which modules to be adapted in the network. For label shift (same p(x|y), different p(y)), most of the previous works focus on estimating the shifted label distribution. EM [36, 1] iteratively uses model predictions to estimate the label prior distribution and uses label prior distribution to adjust model predictions. BBSE [29, 3] constructs a confusion matrix on the validation dataset, and uses the prediction distribution to estimate the ground-truth label distribution. The estimated label distribution is used for re-training a model with importance sampling. [49] generalizes these methods to the online dataset shift setting where the label distribution for testing data is evolving over time. However, all these methods heavily rely on the assumption of the same p(x|y), which can be violated in real applications.

Comparison with Fed THE [17] Recently, Fed THE also explored TTA in FL. However, Fed THE focus on the test-time distribution shift for clients that participate in FL training, while we focus on improving the performance on novel clients. Moreover, Fed THE fuses global head and personalized head to get robust prediction. It cannot be easily generalized to target clients which does not have labeled data to train the personalized head.

Our paper is also related to partial fine-tuning and hyperparameter optimization. We discuss these works in Appendix A.1 in detail.

3 Motivation

In this section, we first introduce the setting of test-time personalized federated learning, and then show that current TTA methods lack the flexibility to various types of distribution shifts in TTPFL.

3.1 Test-time personalized federated learning

Preliminary We consider a standard setting for cross-device FL [46] and domain generalization [47]. Considering an FL system with N source clients {Si}N i=1 and M target clients {Tj}M j=1. Each source client Si has its own labeled dataset DSi with ni samples {(x Si 1 , y Si 1 ), , (x Si ni, y Si ni)} i.i.d. drawn from its distribution P Si(x, y), where x is the input and y is its corresponding label. Each target client Tj has its own unlabeled dataset XTj = {x Tj 1 , , x Tj mj} i.i.d. drawn from its distribution P Tj(x, y), while the corresponding labels {y Tj 1 , , y Tj mj} cannot be accessed. The distributions for different source/target clients are different, sampled from a meta-distribution Q, i.e., distribution of distributions. Global federated learning (GFL) aims to find a single global model minimizing the expected loss over client population [46]: L(w G) = EP QLP (w G), where LP (w G) = E(x,y) P ℓ(f(x; w G); y) (1)

where ℓrepresents the loss function and f represents model. GFL enforces that each client uses the same global model for prediction, which does not allow for adaptation to each client s unique data distribution. In contrast, personalized federated learning (PFL) personalizes the global model w G using its labeled data, and uses the personalized model for prediction, replacing the w G in Eq. (1). However, most of the PFL algorithms [8, 7, 22] require the assumption that the target client also possesses additional labeled data, which is a stronger assumption compared to GFL.

Test-time personalized federated learning In this paper, we introduce a novel setting named test-time personalized federated learning (TTPFL), and compare it with the standard GFL and PFL

Personalized

Personalization

Personalized FL

Personalized

Test-time Personalization

Test-time Personalized FL (Ours)

Figure 1: Comparison between the testing phase of GFL, PFL, and TTPFL. TTPFL enables model personalization without requiring labeled data.

in Figure 1. TTPFL focuses on how to adapt a trained global model to each target client s data distributions during test-time, with an adaptation rule A only using unlabeled data. The objective function can be formulated as

L(w G, A) = EP QLP (w G, A), where LP (w G, A) = E(x,y) P ℓ(f(x; A(w G, X)); y) (2)

which can be unbiasedly estimated by the average loss over M target clients unseen during training

ˆL(w G, A) = 1

j=1 ˆLP Tj (w G, A), where ˆLP Tj (w G, A) = 1 mj

r=1 ℓ(f(x Tj r ; A(w G, X Tj r )); y Tj r ) (3)

The adaptation rule A adapts a the global model with unlabeled samples XTj r . We consider two standard settings: test-time batch adaptation (TTBA) and online test-time adaptation (OTTA) [27]. TTBA individually adapts the global model to each batch of unlabeled samples, where XTj r is the data batch that x Tj r belongs. OTTA adapts the global model in an online manner, where XTj r contains all the data batches arriving before or together with x Tj r .

3.2 Limitation of test-time adaptation

As the precursor to TTPFL, TTA [45, 52, 29] studies how to adapt a trained model to target dataset under certain types of dataset shifts. Since TTA methods only require unlabeled target data for adaptation, they can be applied in TTPFL. We test state-of-the-art TTA methods with Res Net-18 on CIFAR-10 under two types of distribution shifts: label shift and feature shift, with results presented in Figure 2. As expected, each algorithm can boost the model s accuracy under the distribution shift it is designed for. However, most algorithms improve their performance in one scenario while simultaneously impairing it in another scenario, demonstrating a trade-off in their performance on feature shift and label shift. Moreover, when facing a more complex hybrid of distribution shifts, most TTA methods fail to introduce satisfactory performance gain (Table 1). Therefore, TTA methods are not suitable for TTPFL given the variety of distribution shifts in FL client.

The inflexibility of TTA algorithms largely results from their predefined selection of modules to adapt, e.g., batch normalization (BN) layers [38, 45], the feature extractor [28, 43], or the last linear layer [16, 36]. However, which modules to adapt is closely related to the type of distribution shift. For example, adapting the last linear layer can encode the label shift (Proposition 3.1), while it may fail when the extracted features are already corrupted due to feature shift. Similarly, adapting the BN layers can improve the performance under feature shift by distribution alignment (Proposition 3.2), while distribution alignment can harm the performance under label shift [53]. Proposition 3.1 (Adapting the last layer to handle label shift). Consider two distribution p, q with p(x|y) = q(x|y) and p(y) = q(y). When a neural network is calibrated on p, i.e., f(x; w) = p( |x), it is calibrated on q after adding log q(y)

p(y) to the bias term of the final last layer.

Proposition 3.2 (Adapting the BN layer to handle feature shift [38]). When the feature shift only causes differences in the first and second order moments of the feature activations z = g(x) where g

50 55 60 65 70 75 80 Label Shift Accuracy

Feature Shift Accuracy

Algorithm No adapt BN SHOT Tent

T3A MEMO EM BBSE

Figure 2: Performance trade-off of existing TTA methods under two distribution shifts.

64 66 68 70 72 74 76 78 Label Shift Accuracy

Feature Shift Accuracy

Algorithm no adapt input conv block1 block2 block3 block4 final linear conv.weight

bn.running_mean (m=0.1) bn.running_mean (m=-0.1) bn.running_var (m=0.1) bn.running_var (m=-0.1) bn.weight bn.bias linear.weight linear.bias

Figure 3: Performance trade-off of entropy minimization when adapting different modules.

is the combination of layers before the BN layer, the feature shift can be removed by adapting running mean and variance of the BN layer.

To verify the connection between distribution shift and the selection of modules for adaptation, we experiment with adapting different subsets of modules within the network to minimize the entropy loss [45]. In Figure 3, we observe a similar performance trade-off between feature shift and label shift: while adapting certain modules can boost the accuracy under one distribution shift, it is less likely to succeed under the other shift. To break the performance trade-off, it is essential to adaptively choose which modules to adapt according to the present type of distribution shift. Moreover, while [20] suggests adapting different blocks in the network, we find it more important to decide (1) which module type to adapt and (2) what is the adaptation rate (i.e., learning rate for adaptation). For example, adapting all BN running means significantly outperforming adapting any one block under feature shift. Meanwhile, employing positive or negative adaptation rates for running means yields contrasting outcomes, favoring adaptation in the presence of label shift or feature shift while impairing the other. These observations motivate us to choose which module to adapt (instead of blocks) while optimizing the adaptation rates for each module.

4 ATP: adaptive test-time personalization

In this section, we propose ATP that automatically learns the adaptation rates for each module. We introduce the training and testing phase of ATP in subsection 4.1 and 4.2, respectively.

4.1 Training phase: learn to adapt with source clients

In this part, we introduce how ATP learns adaptation rates from source clients without sharing local data. ATP uses the communication protocol of Fed Avg [31] to optimize adaptation rates. In each communication round, each source client first simulates unsupervised adaptation with the current adaptation rates, and then refines the adaptation rates to maximize the effect of adaptation. After local computation, the local adaptation rates are then aggregated on the server to ensure better generalization to target clients. Algorithm 1 gives the overview of the training phase of ATP. We then explain each step in detail.

Unsupervised adaptation We consider a neural network model f( ; w G) with global model parameter w G RD. Similar to previous works [45, 38], we consider the model processes a data batch XSi k = {x Si k,b}B b=1 at a time where B is the batch size, i is the client index and k is the batch index. In the following, we omit the superscript Si for clarity, e.g. XSi k Xk, as unsupervised adaptation and supervised refinement operate identically across all source clients. The network has d modules, with corresponding parameters w[1], , w[d]. Typically we have d D. During unsupervised adaptation, we allow each module w[l] to have a different adaptation rate α[l]. ATP learns to adapt both trainable parameters and running statistics for batch normalization (BN) [15] layers. To achieve more precise control of adaptation, the module in ATP is slightly more finegrained than the layer . For example, each BN layer has four modules: running mean, running variance, weight, and bias.

Update trainable parameters A common strategy for updating trainable parameters is performing one step of gradient descent to minimize the cross-entropy loss. Since label information are unavailable for computing cross-entropy, we instead minimize the entropy loss ℓH( ˆY ) = 1 B PB b=1( P

c ˆyb,c log ˆyb,c), where ˆY is the prediction probabilities over the label space of a data batch. Entropy quantifies the uncertainty of the model prediction, and is frequently used in previous TTA algorithms [45, 52, 28]. For each trainable parameter module w[l], the corresponding unsupervised update direction for each client is the negative gradient direction, i.e.,

h[l] k = w[l]ℓH(f(Xk; w G)) (4)

Update running statistics The running statistics (mean/variance) in BN layers are not updated by gradient descent. Instead, they are updated by running average.

w[l] k (1 m)w[l] G + m ˆw[l] k = w[l] G + m( ˆw[l] k w[l] G)

where w[l] G is the running statistics and ˆw[l] k is the statistic for the current batch of inputs. In previous works, the momentum1 m is usually a fixed hyperparameter in [0, 1]. In ATP, we consider the momentum for each module as an adaptation rate (α[l] R) to be learned. We define the corresponding update direction as

h[l] k = ˆw[l] k w[l] G (5)

After computing the update direction, each module will be updated along the update direction with its corresponding adaptation rate, i.e., w[l] k w[l] G + α[l]h[l] k . Expressed in a compact form,

wk w G + (Aα) hk (6)

where is the element-wise product, hk RD is the concatenation of {h[l] k }d l=1, α = [α[1], , α[d]] and A RD d is a 0-1 assignment matrix that maps each adaptation rate α[l] to the indices of l-th module s parameters in w G.

Supervised refinement After unsupervised adaptation, we refine the adaptation rates on each source client with label information to minimize ℓCE(f(Xk, wk), Y k), where ℓCE is the cross-entropy loss. We use gradient descent to optimize α, i.e.,

α α η αℓCE(f(Xk; wk), Y k) (7)

where η is the learning rate of adaptation rates. Notice that the gradient of α can be computed as

αℓCE(f(Xk; wk), Y k) = ℓCE(f(Xk; wk), Y k)

α = A (hk wkℓCE(f(Xk; wk), Y k))

To estimate the gradient of α, each training client only needs to adjacently compute the unsupervised and supervised gradient, and compute their module-wise inner products. Different from many meta-learning algorithms [9, 26], ATP is computationally very efficient since it requires no secondorder derivatives. In the practical implementation, since each module in the model has significantly different number of parameters, the raw gradient for each α[l] usually has different scales. Therefore we normalize the gradient with the square root of the number of parameters in the corresponding module.

Server aggregation To incorporate adaptation knowledge from multiple source clients and enhance generalization to the clients population, ATP use standard federated aggregation [31] to periodically aggregates the local adaptation rates. In each communication rounds, after each client locally update α for a few iterations, the local adaptation rates are uploaded to the server for averaging (as shown in line 6 of Algorithm 1), and then sent to source clients for the next round of training. With server aggregation, ATP learn the adaptation rates that enables successful adaptation to all source clients in average.

Communication cost Notice that ATP only optimizes the adaptation rates α without changing the global model w G. Therefore, only the adaptation rates are kept transmitted between the server and each client, while the global model parameter is only broadcasted once at the start of the ATP training. Such design significantly reduces the communication cost from 2TD (for standard Fed Avg) to D + 2Td.

1Some literatures consider (1 m) as the momentum. Here we follow the definition in Py Torch.

Algorithm 1 ATP Training

Server Train(w G, α0 G = 0) 1: Broadcast w G to all source clients 2: for communication round t = 1 to T do 3: St (random set of C source clients) 4: for source client Si St in parallel do 5: αt i Client Train(Si, αt 1 G ) 6: αt G = 1

Si St αt i 7: return αT G

Client Train(Si, α) # Run on source client Si 8: for local epoch e = 1 to E do 9: BSi (split DSi into KSi batches of size B) 10: for batch k = 1 to KSi do 11: (XSi k , Y Si k ) (k-th labeled batch in BSi) 12: Estimate update direction h Si k with unlabeled XSi k according to Eq. (4) and (5) 13: w Si k w G + (Aα) h Si k 14: α α η αℓCE(f(XSi j ; w Si k ), Y Si k ) 15: return α

Algorithm 2 ATP Testing

Client Test(Tj, w G, α) # Run on target client Tj 1: BTj (split XTj into KTj batches of size B) 2: hhistory 0 # Cumulative moving average 3: for batch k = 1 to KTj do 4: Estimate update direction h Tj k with unlabeled X Tj k according to Eq. (4) and (5) 5: if TTBA then 6: w Tj k w G + (Aα) h Tj k 7: else if OTTA then 8: hhistory k 1

k hhistory + 1

kh Tj k 9: w Tj k w G + (Aα) hhistory 10: Make prediction: ˆY Tj k = f(X Tj k ; w Tj k )

4.2 Testing phase: exploit adaptation rates on target clients

During testing, each target client downloads both the global model and the adaptation rates. We propose two versions of ATP: ATP-batch for test-time batch adaptation (TTBA) and ATP-online for online test-time adaptation (OTTA). We summarize the testing phase in Algorithm 2.

ATP-batch For TTBA, each target client makes independent predictions on each batch. For each batch of target data, ATP-batch first conducts the unsupervised adaptation identical to source clients, and then makes prediction.

ATP-online For OTTA, data comes in a stream of batches [XTj 1 , XTj 2 , ]. Previous works [43, 45] usually keep updating the model batch after batch. However, such accumulative adaptation can introduce severe batch dependency problem, i.e., each batch is evaluated when the model takes different number of update steps [54]. For the first few batches, the model has not adapted to the local distribution well; while for the last few batches, the model may over-minimize the entropy but increase the cross-entropy loss. To avoid batch dependency, we propose an averaged adaptation mechanism for online adaptation, whose scale of adaptation is stable during online adaptation.

For each batch XTj k in the data stream, we always compute the update direction h Tj k starting with the fixed global model w G according to Eq. (4) and (5). Subsequently, instead of using only the current update direction to adapt the model, we average all the stored update direction to update the model, i.e.,

w Tj k w G + (Aα)

By using the average of previous updates, we simulate updating with larger batch size to utilize historical data, while controlling the number of update steps to be one. In the practical implementation, we use cumulative moving average (as shown in line 8 of Algorithm 2), whose space complexity does not increase with the increment of step k.

5 Theoretical analysis

In this section, we show that ATP enjoys good generalization guarantees because of the low dimensionality of adaptation rates. Formal definitions, assumptions and full proofs are provided in Appendix B.3. We also show in Appendix B.2 that ATP has convergence guarantee similar to Fed Avg [31, 46]. Theorem 5.1 (Generalization). Let H = {α : α 2 R} be the hypothesis space (space of adaptation rates), N be the number of source clients, and K be the number of data batches on each

source client. Assuming (1) L-Lipschitz model, and (2) H-upper-bounded 2-norms for each module s update. For any fixed global model w G and any ϵ > 0, we have

Pr( sup α H |ε(α) ˆε(α)| ϵ) 12LHR

d 4 exp NKϵ2

where ˆε(α) is the average post-adaptation error rate on source clients, and ε(α) is the expected post-adaptation error rate on clients population.

Theorem 5.1 shows that, although ATP improves the model expressiveness by adapting the model to each client s distribution, ATP can still provably generalize well to the clients population. Especially, this generalization benefit from low dimensionality of adaptation rates, since the bound get looser when d increases. Moreover, this bound shows the importance of learning adaptation rates from multiple source clients: if we merge all N source domains with K batches into one domain with NK batches, then the bound will be much looser.

6 Experiments

In this section, we design experiments to answer the following research questions:

RQ1: Can ATP handle different distribution shift and outperform prior TTA methods? RQ2: Does ATP learn adaptation rates specific to distribution shift?

Setup We evaluate ATP on a variety of models, datasets and distribution shifts. We first evaluate on CIFAR-10(-C) with a standard three-way split [50]: we randomly split the dataset to 300 clients: 240 source clients and 60 target clients. Each source client has 160 training samples and 40 validation samples, while each target client has 200 unlabeled testing samples. We simulate three kinds of distribution shifts: feature shift, label shift, and hybrid shift. For feature shift, we follow [12, 17], randomly apply 15 different kinds of corruptions to the source clients, and 4 new kinds of corruptions to the target clients to test the generalization of ATP. For label shift, we use the step partition [5], where each client has 8 minor classes with 5 images per class, and 2 major classes with 80 images per class. For the hybrid shift, we apply both step partition and feature perturbations. To test ATP under more challenging domain shifts, we then evaluate ATP on two domain generalization datasets: Digits-5 [25] and PACS [21]. We adopt the leave-one-domain-out evaluation protocol [10], i.e., one domain is chosen to construct target clients, and the remaining domains are used to construct source clients. We follow similar data preprocessing in [25], while additionally applying step partition to inject label shift. Each domain is divided into 10 clients, leading to 40/10 source/target clients for Digits-5 and 30/10 source/target clients for PACS. For the experiments above, we use Res Net-18 [11] as a common choice in FL experiments [42, 14, 33]. We also test ATP with two different architectures: a five-layer CNN on CIFAR-10(-C) and Res Net-50 on CIFAR-100(-C). Detailed experiment settings are given in Appendix C.1.

6.1 RQ1: Can ATP handle different distribution shift?

Table 1: Accuracy (mean s.d. %) on target clients under various distribution shifts on CIFAR-10

Method Feature shift Label shift Hybrid shift Avg. Rank

No adaptation 69.42 0.13 72.98 0.24 63.68 0.24 7.7 BN-Adapt 73.52 0.22 54.54 0.10 50.42 0.39 7.0 SHOT 71.76 0.17 48.13 0.18 44.68 0.32 9.3 Tent 71.76 0.09 50.13 0.21 46.05 0.26 8.3 T3A 69.53 0.08 71.70 0.32 62.17 0.17 8.0 MEMO 72.43 0.22 77.30 0.15 68.07 0.28 4.3 EM 65.18 0.12 80.73 0.18 69.85 0.43 5.0 BBSE 63.98 0.17 79.30 0.17 67.96 0.43 6.7 Surgical 69.85 0.22 76.00 0.17 66.94 0.43 6.3 ATP-batch 73.68 0.10 79.90 0.22 73.05 0.35 2.3 ATP-online 74.06 0.18 81.96 0.14 75.37 0.22 1.0

We compare ATP with three kinds of baseline TTA methods. For feature shift methods, we compare to BN-Adapt [38] and Tent [45] which adjusts the batch normalization layers, SHOT [28] which adjusts the feature extractor, T3A [16] which adjusts the final classifier, and MEMO [52] which uses augmentation to adjust the whole network. For label shift, we compare to EM [36] which adjusts the label priori unsupervisedly with expectationmaximization, and BBSE [29] which uses the validation data to construct a confusion matrix to estimate the label priori. Since re-training a model with different label weights for each client is not realistic in FL. We use the estimated label distribution to adjust the output of a classifier.

Table 2: Accuracy (mean s.d. %) on target clients under hybrid shift on Digits-5 and PACS

Method Digits-5 PACS

MNIST SVHN USPS Synth Digits MNIST-M Art Cartoon Photo Sketch

No adaptation 95.47 0.22 52.28 1.45 89.62 0.44 79.75 0.69 55.62 0.80 71.57 1.16 74.71 0.70 90.25 0.75 74.20 0.72 BN-Adapt 94.90 0.29 57.57 0.53 89.51 0.39 75.34 0.48 59.68 0.44 73.55 0.51 71.54 0.55 92.07 0.26 70.92 0.53 SHOT 94.69 0.31 57.91 0.23 89.55 0.69 76.43 0.34 60.19 0.69 69.32 0.67 67.77 0.40 86.97 0.60 59.40 0.91 Tent 95.48 0.29 60.67 0.49 91.65 0.61 78.56 0.45 62.49 0.73 71.59 0.71 71.03 0.97 88.06 0.24 63.15 1.10 T3A 94.63 0.61 49.90 1.10 88.46 0.75 75.47 1.14 51.25 1.55 72.15 0.72 75.02 0.78 91.51 0.62 70.14 1.21 MEMO 95.92 0.19 52.85 1.09 89.84 0.44 80.12 0.90 55.48 1.13 71.47 1.29 75.57 0.98 90.65 0.90 76.30 0.65 EM 96.64 0.31 57.21 1.65 92.29 0.32 85.69 0.46 62.08 0.60 73.96 1.85 78.91 0.92 92.30 0.92 80.82 1.52 BBSE 94.47 0.58 57.26 1.47 91.34 0.39 85.54 0.46 61.59 0.91 74.33 1.78 78.69 1.00 91.82 0.68 80.15 1.42 Surgical 97.35 0.13 59.93 2.01 94.19 0.40 86.06 0.44 65.87 0.78 74.59 2.69 77.48 0.64 92.34 0.78 80.90 3.42 ATP-batch 97.81 0.27 62.18 1.71 95.41 0.26 87.91 0.45 69.98 1.96 82.92 0.96 79.64 0.75 95.40 0.41 82.28 1.57 ATP-online 97.81 0.23 62.64 1.92 95.56 0.23 88.33 0.47 70.78 2.36 83.51 0.84 79.46 0.77 95.52 0.40 82.80 1.69

conv.weight

bn.running_mean

bn.running_var

linear.weight

linear.bias

Module type

Adaptation rate

Dataset shift

feature label hybrid

last linear

Adaptation rate

Dataset shift

feature label hybrid

Figure 4: Adaptation rates learned by ATP with different distribution shifts on CIFAR-10

We also compare to Surgical [20] which uses the validation data to decide which blocks to adapt. For all baselines we use the validation data to select hyperparameters.

ATP can handle different types of distribution shifts Table 1 shows the results on CIFAR-10. Under feature and label shifts, most TTA methods suffer from performance trade-off as they improve the performance on one distribution shift while harm the other. The only exception is MEMO, which utilizes data augmentation to robustify the model prediction. However, it also introduces significant computational cost during inference. As an adaptive framework simpler than ours, Surgical also introduces accuracy gain across all distribution shifts. However, its coarse-grained adaptation rule prevents further improvement on the accuracy. ATP reaches great performance comparable to the strongest baseline TTA method under both feature and label shifted. Under the more complex hybrid shift, ATP achieves the highest performance gain with a significant margin. Meanwhile, ATP-online can further improve the performance of ATP-batch by using information from previous batches.

ATP can handle more challenging domain shifts Table 2 shows the results on two domain generalization datasets with a hybrid of domain and label shifts. Compared to baselines, ATP consistently achieves higher accuracy across all domains.

ATP is compatible to multiple model architectures Finally, we evaluate ATP on more model architectures: Shallow-CNN as smaller model and Res Net-50 as larger model. As shown in Table 5 in Appendix C.2, ATP has uniformly good performance on two new models.

6.2 RQ2: Does ATP learn adaptation rates specific to distribution shift?

Besides ATP s good performance, we are also interested in whether ATP successfully learns adaptation rates specific to the type distribution shift. To explore this, we group the adaptation rates by their corresponding block and module type under three kinds of distribution shifts. As shown in Figure 4, ATP learns significantly different adaptation rates under different distribution shifts. In Figure 4 (left), ATP learns to adapt the last linear layer under label shift, while mainly adapt the former layers under feature shift. More interestingly, we notice in Figure 4 (right) that the adaptation rates for batch norm running statistics are positive under feature shift, but negative under label shift. Negative adaptation rate is usually counter-intuitive, since it disaligns the training and testing distributions. However, it benefits the model under label shifts because it explicitly adapts the label prior distribution towards

15 30 60 120 240 Cohort size C

ATP-batch ATP-online

5 10 20 40 80 160 Batch size B

No adaptation ATP-batch ATP-online

Figure 5: Effect of cohort size and batch size

the prediction distribution. We use a toy example in Appendix C.5 to show why negative adaptation rate can improve the model performance under label distribution.

Table 3: Train and test adaptation rates with different distribution shifts, accuracy (mean s.d. %)

Feature shift Label shift Hybrid shift

No adaptation 69.42 0.13 72.98 0.24 63.68 0.24 Feature shift 73.68 0.10 65.05 1.82 60.64 1.43 Label shift 67.99 0.28 79.90 0.22 69.50 0.52 Hybrid shift 72.69 0.14 78.92 0.34 73.05 0.35

Moreover, we examine whether the learn adaptation rates are specific to the type of distribution shift by training on one distribution shift, but testing on another. We observe in Table 3 that, ATP performs the best when trained and tested with the same type of distribution shifts. However, the adaptation rates trained on feature/label shift fails to boost the performance on the other distribution shift. The adaptation rates trained on hybrid shift can generalize to feature shift and label shift, but still worse than the adaptation rates trained with the same type of distribution shifts. These results show that the learn adaptation rates are specific to the type of distribution shift.

6.3 Further discussion

Table 4: Ablation study, accuracy (mean s.d. %)

Method Feature shift Label shift Hybrid shift

No adaptation 69.42 0.13 72.98 0.24 63.68 0.24 ATP-params 69.23 0.27 78.29 0.14 68.05 0.54 ATP-stats 71.27 0.17 74.03 0.18 64.78 0.27 ATP-batch 73.71 0.14 79.90 0.22 73.05 0.35

Ablation study We present two variants of ATP to study how trainable parameters and running statistics contribute to the adaptability of ATP. ATP-params only learns to adapt the trainable parameters, while ATP-stats focuses solely on adapting the running statistics. As shown in Table 4, adapting trainable parameters and running statistics both play critical roles in achieving successful adaptation. More specifically, ATP-params primarily facilitate adaptation to label shift, whereas ATP-stats essentially aid in adapting to feature shift.

Hyperparameter sensitivity Figure 5 shows the effects of cohort size and batch size with CIFAR10 under the hybrid shift, where cohort size refers to the number of clients sampled at each round. ATP demonstrates remarkable consistency in accuracy across different cohort sizes, indicating its robustness. For batch size, we optimize the adaptation rates with B = 20 and subsequently evaluate the algorithm with different batch sizes. We find that ATP consistently improves the model s accuracy across different batch sizes, with larger batch sizes yielding greater benefits for the model. ATP-online is more robust to batch size than ATP-batch since it can utilizes information from previous batches.

7 Conclusion

In this paper, we propose ATP that unsupervisedly learns the adaptation rate for each module to handle various types of distribution shifts encountered in test-time personalized federated learning. As a potential future direction, incorporating the training of the global model could offer advantages in terms of facilitating easier and better personalization.

Acknowledgments and Disclosure of Funding

This work is supported by National Science Foundation under Award No. IIS-1947203, IIS-2117902, IIS-2137468, IIS-2002540, Agriculture and Food Research Initiative (AFRI) grant no. 2020-6702132799/project accession no.1024178 from the USDA National Institute of Food and Agriculture, the U.S. Department of Homeland Security under Grant Award Number, 17STQAC00001-06-00, and IBM-Illinois Discovery Accelerator Institute - a new model of an academic-industry partnership designed to increase access to technology education and skill development to spur breakthroughs in emerging areas of technology. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.

[1] Amr Alexandari, Anshul Kundaje, and Avanti Shrikumar. Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 222 232. PMLR, 2020.

[2] Ohad Amosy, Gal Eyal, and Gal Chechik. On-demand unlabeled personalized federated learning, 2022.

[3] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019.

[4] Wenxuan Bao, Haohan Wang, Jun Wu, and Jingrui He. Optimizing the collaboration structure in cross-silo federated learning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 1718 1736. PMLR, 2023.

[5] Hong-You Chen and Wei-Lun Chao. Fedbe: Making bayesian model ensemble applicable to federated learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021.

[6] Hong-You Chen and Wei-Lun Chao. On bridging generic and personalized federated learning for image classification. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022.

[7] Canh T. Dinh, Nguyen Hoang Tran, and Tuan Dung Nguyen. Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems, 2020.

[8] Alireza Fallah, Aryan Mokhtari, and Asuman E. Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. In Advances in Neural Information Processing Systems, 2020.

[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1126 1135. PMLR, 2017.

[10] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770 778. IEEE Computer Society, 2016.

[12] Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019.

[13] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963.

[14] Samuel Horváth, Stefanos Laskaridis, Mário Almeida, Ilias Leontiadis, Stylianos I. Venieris, and Nicholas D. Lane. Fjord: Fair and accurate federated learning under heterogeneous targets with ordered dropout. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 12876 12889, 2021.

[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 448 456. JMLR.org, 2015.

[16] Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 2427 2440, 2021.

[17] Liangze Jiang and Tao Lin. Test-time robust personalization for federated learning. In The Eleventh International Conference on Learning Representations, 2023.

[18] Peter Kairouz, H. Brendan Mc Mahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaïd Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Koneˇcný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Hang Qi, Daniel Ramage, Ramesh Raskar, Mariana Raykova, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. Found. Trends Mach. Learn., 14(1-2):1 210, 2021.

[19] Mikhail Khodak, Renbo Tu, Tian Li, Liam Li, Maria-Florina Balcan, Virginia Smith, and Ameet Talwalkar. Federated hyperparameter tuning: Challenges, baselines, and connections to weight-sharing. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 19184 19197, 2021.

[20] Yoonho Lee, Annie S Chen, Fahim Tajwar, Ananya Kumar, Huaxiu Yao, Percy Liang, and Chelsea Finn. Surgical fine-tuning improves adaptation to distribution shifts. In The Eleventh International Conference on Learning Representations, 2023.

[21] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5543 5551. IEEE Computer Society, 2017.

[22] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 6357 6368. PMLR, 2021.

[23] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Inderjit S. Dhillon, Dimitris S. Papailiopoulos, and Vivienne Sze, editors, Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org, 2020.

[24] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair resource allocation in federated learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020.

[25] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. Fedbn: Federated learning on noniid features via local batch normalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021.

[26] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly for few shot learning. Co RR, abs/1707.09835, 2017.

[27] Jian Liang, Ran He, and Tieniu Tan. A comprehensive survey on test-time adaptation under distribution shifts. Co RR, abs/2303.15361, 2023.

[28] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 6028 6039. PMLR, 2020.

[29] Zachary C. Lipton, Yu-Xiang Wang, and Alexander J. Smola. Detecting and correcting for label shift with black box predictors. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 3128 3136. PMLR, 2018.

[30] Nan Lu, Zhao Wang, Xiaoxiao Li, Gang Niu, Qi Dou, and Masashi Sugiyama. Federated learning from only unlabeled data with class-conditional-sharing clients. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022.

[31] Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1273 1282. PMLR, 2017.

[32] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, 2012.

[33] Jaehoon Oh, Sangmook Kim, and Se-Young Yun. Fedbabu: Toward enhanced representation for federated image classification. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022.

[34] Amirhossein Reisizadeh, Farzan Farnia, Ramtin Pedarsani, and Ali Jadbabaie. Robust federated learning: The case of affine distribution shifts. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020.

[35] Youngmin Ro and Jin Young Choi. Autolr: Layer-wise pruning and auto-tuning of learning rates in fine-tuning of deep networks. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 2486 2494. AAAI Press, 2021.

[36] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Comput., 14(1):21 41, 2002.

[37] Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Trans. Neural Networks Learn. Syst., 32(8):3710 3722, 2021.

[38] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020.

[39] Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 9489 9502. PMLR, 2021.

[40] Zhiqiang Shen, Zechun Liu, Jie Qin, Marios Savvides, and Kwang-Ting Cheng. Partial is better than all: Revisiting fine-tuning strategy for few-shot learning. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 9594 9602. AAAI Press, 2021.

[41] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4424 4434, 2017.

[42] Benyuan Sun, Hongxing Huo, Yi Yang, and Bo Bai. Partialfed: Cross-domain personalized federated learning via partial initialization. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 23309 23320, 2021.

[43] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 9229 9248. PMLR, 2020.

[44] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. Co RR, abs/2103.00710, 2021.

[45] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno A. Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021.

[46] Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan Mc Mahan, Blaise Agüera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas N. Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horváth, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Koneˇcný, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtárik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake E. Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, and Wennan Zhu. A field guide to federated optimization. Co RR, abs/2107.06917, 2021.

[47] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip S. Yu. Generalizing to unseen domains: A survey on domain generalization. IEEE Trans. Knowl. Data Eng., 35(8):8052 8072, 2023.

[48] Kangkang Wang, Rajiv Mathews, Chloé Kiddon, Hubert Eichner, Françoise Beaufays, and Daniel Ramage. Federated evaluation of on-device personalization. Co RR, abs/1910.10252, 2019.

[49] Ruihan Wu, Chuan Guo, Yi Su, and Kilian Q. Weinberger. Online adaptation to label distribution shift. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 11340 11351, 2021.

[50] Honglin Yuan, Warren Richard Morningstar, Lin Ning, and Karan Singhal. What do we mean by generalization in federated learning? In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022.

[51] Jie Zhang, Zhiqi Li, Bo Li, Jianghe Xu, Shuang Wu, Shouhong Ding, and Chao Wu. Federated learning with label distribution skew via logits calibration. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 26311 26329. PMLR, 2022.

[52] Marvin Zhang, Sergey Levine, and Chelsea Finn. MEMO: test time robustness via adaptation and augmentation. In Neur IPS, 2022.

[53] Han Zhao and Geoffrey J. Gordon. Inherent tradeoffs in learning fair representations. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 15649 15659, 2019.

[54] Hao Zhao, Yuejiang Liu, Alexandre Alahi, and Tao Lin. On pitfalls of test-time adaptation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 42058 42080. PMLR, 2023.

[55] Yi Zhou, Parikshit Ram, Theodoros Salonidis, Nathalie Baracaldo, Horst Samulowitz, and Heiko Ludwig. Single-shot general hyper-parameter optimization for federated learning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023.

A More discussions

A.1 More related works

Partial fine-tuning, i.e., updating a subset of modules of a pretrained network on a new dataset, has been studied in supervised settings [26, 35, 40]. In FL, Partial Fed [42] adaptively decides whether each parameter is shared or personalized. However, it cannot generalize to testing clients that do not participate in the training. Recently, surgical fine-tuning [20] selectively fine-tunes a subset of blocks with a similar intuition that the type of distribution shift influences which part of the network to be adapted. Different from their method, we focus on the unsupervised setting and propose to refine the adaptation rate for each module.

Hyperparameter optimization is also related to our algorithm if considering adaptation rates as a set of hyperparameters. [19] first investigates the problem of federated hyperparameter tuning and proposed Fed EX that leverages weight-sharing from neural architecture search to efficiently tune hyperparameters. [55] introduces Flo RA that addresses use cases of tabular data and enables single-shot federated hyperparameter tuning. While these methods focus on improving the efficiency of hyperparameter optimization, our paper focuses on finding the optimal adaptation rates that benefit test-time personalization.

A.2 Broader impacts and limitations

Broader impacts We are not aware of any potential negative societal impacts regarding our work to the best of our knowledge. For all the used data sets, there is no private personally identifiable information or offensive content.

Limitations One possible limitation is that we consider a fix global model for lower communication cost and better generalization, while it might be beneficial to also train a global model for easier personalization, which could be a promising future direction.

B Theoretical analysis

In this section, we give theoretical proofs of convergence, and generalization of ATP.

B.1 Approximation analysis

In this subsection, we give detailed proofs of Proposition 3.1 and 3.2 in Section 3 of the main text. These propositions show why certain types of distribution shifts can be handled by adapting certain layers in a neural network.

B.1.1 Proof of Proposition 3.1

Proposition 3.1 (Adapting the last layer to handle label shift). Consider two distribution p, q with p(x|y) = q(x|y) and p(y) = q(y). When a neural network is calibrated on p, i.e., f(x; w) = p( |x), it is calibrated on q after adding log q(y)

p(y) to the bias term of the final last layer.

Proof. W.l.o.g., assuming the last layer of the neural network is a linear layer. Denoting g(x; wg) as the input of the last layer, where x is the input and wg is the model parameters for the feature extractor (i.e., all layers except for the last classification layer). Denote w1, , w K as the weights of the last layer and b1, , b K as the bias terms of the last layer, assuming K classes. Then we have

f(x; w)c = exp(w c g(x; wg) + bc) PK c =1 exp(w c g(x; wg) + bc )

Since the neural network is calibrated on p, for all class index c = 1, , K, we have

f(x, w)c = p(y = ec|x)

where ec is an one-hot vector with its c-th element as one. For distribution q with the same conditional distribution and different priori, by Bayes theorem, x, y

q(y|x) = q(x|y)q(y) P y q(x|y)q(y) = p(x|y)q(y) P y p(x|y)q(y) = p(y|x) q(y)

y p(y|x) q(y)

Therefore, we can calibrate the neural network on distribution q simply by adding log q(y)

p(y) to the bias terms, i.e.,

fcal(x; wcal)c = exp(w c g(x; wg) + bc + log q(ec)

p(ec)) PK c =1 exp(w c g(x; wg) + bc + log q(ec )

= exp(w c g(x; wg) + bc) q(ec)

p(ec) PK c =1 exp(w c g(x; wg) + bc ) q(ec )

= p(ec|x) q(ec)

p(ec) PK c =1 p(ec |x) q(ec )

p(ec ) = q(y = ec|x)

B.1.2 Proof of Proposition 3.2

Proposition 3.2 (Adapting the BN layer to handle feature shift [38]). When the feature shift only causes differences in the first and second order moments of the feature activations z = g(x) where g is the combination of layers before the BN layer, assuming independent activations, the feature shift can be removed by adapting running mean and variance of the BN layer.

Proof. Denote the source and target feature (marginal) distributions to be p(x) and q(x). Given independent, activations, we only need to test the marginal distribution of each z z = g(x). For each z, since the feature shift only introduces differences in the first and second order moments, there exists and r > 0, s.t., zt R

Pr x q(z zt) = Pr x p

which indicates that the distribution of z is first shifted by and then scaled by r. Such distribution shift in the feature activation can be removed by adapting the running mean µp and variance σ2 p

µq = r µp + σq = σp r

As a result, for all t R

σq t = Pr x q(z µq + σq t)

= Pr x p(z µq + σq t

= Pr x p(z µp + σp t)

which indicates that the feature shift is removed after normalization with running statistics µq, σq.

B.2 Convergence analysis

In this part, we show that ATP has the same convergence guarantee as Fed Avg [31]. We first show in Lemma B.5 and B.10 that ATP preserves convexity and smoothness, which are two important conditions in the analysis of convergence. Then we formally prove the convergence of ATP in Theorem B.11.

B.2.1 Definitions: local and global objective

For clarity, we first formally define the data generation process, and local/global objectives for optimization.

Meta-distribution (distribution of distributions)

Data distribution (each one is a client)

Data batches

Figure 6: Data generation process

Data generation We consider a two-stage sampling process as illustrated in Figure 6.

There are N source clients distributions P S1, P S2, , P SN and M target clients distribution P T 1, P T2, , P TM i.i.d. drawn from a meta-distribution Q.

For each source client i s distribution P Si, there are K data batches (XSi 1 , Y Si 1 ), (XSi 2 , Y Si 2 ), , (XSi K , Y Si K ) drawn i.i.d. from P Si.

Each batch consists of B samples, (XSi k , Y Si k ) = {(x Si k,b, y Si k,b)}B b=1 where B is the batch size. For simplicity, we assume that all source client has the same number of batches K and batch size B. Definition B.1 (Batch objective). Define the batch objective of the k-th batch on client Si to be

b=1 ℓCE(f(x Si k,b, w Si k , y Si k,b)

where w Si k = w G + (Aα) h Si k and h Si k is the update direction computed with XSi k = {x Si k,b}B b=1 with Eq. (4) and (5). Definition B.2 (Local objective). Define the local objective of client i to be

Definition B.3 (Global objective). Define the global objective to be

B.2.2 ATP preserves convexity and smoothness

In this part, we show that ATP preserves convexity and smoothness, which are two important conditions in the analysis of convergence. Definition B.4 (Convexity). A function f : RD R is convex if for all x1, x2 RD and λ [0, 1] f(λx1 + (1 λ)x2) λf(x1) + (1 λ)f(x2) Lemma B.5 (Convexity preserving). If ℓCE(f(x; w), y) is convex w.r.t. w given any data sample (x, y), then Fi(α) is convex w.r.t. α.

Proof. Noticing that w Si k = w G + (Aα) h Si k is linear to α, linear transformation preserves convexity. For any update direction h Si k and data sample (x Si k,b, y Si k,b), we find that

ℓCE(f(x Si k,b; w G + (A(λα1 + (1 λ)α2)) h Si k , y Si k,b)

= ℓCE(f(x Si k,b; λ h w G + (Aα1) h Si k i + (1 λ) h w G + (Aα2) h Si k i , y Si k,b)

λℓCE(f(x Si k,b; w G + (Aα1) h Si k , y Si k,b) + (1 λ)ℓCE(f(x Si k,b; w G + (Aα2) h Si k , y Si k,b)

i.e., ℓCE(f(x Si k,b; w G + (Aα) h Si k ), y Si k,b) is convex w.r.t. α.

Finally, since

Fi(α) = 1 KB

b=1 ℓCE(f(x Si k,b; w G + (Aα) h Si k ), y Si k,b)

which is the average of KB convex functions, we have that Fi(α) is also convex to α.

Definition B.6 (β-smoothness). A function f : RD R is L-smoothness with β > 0 if for all x1, x2 RD, f(x1) f(x2) 2 β x1 x2 2 Definition B.7 (H-module-wise-bounded update direction). The update direction is H-module-wisebounded for a data batch XSi k if

(h Si k )[l] 2 H, l = 1, , d

where (h Si k )[l] is the update direction corresonding to the l-th module and d is the number of modules in the neural network. Lemma B.8 (Lipschitz parameter). If the update direction is H-module-wise-bounded for a data batch XSi k . Given two adaptation rates α1, α2 and the global model w G, we have

w Si k (α1) w Si k (α2) 2 H α1 α2 2

where w Si k (α1) = w G + (Aα1) h Si k is the personalized model updated with α1 as the adaptation rate.

w Si k (α1) w Si k (α2) 2 = (w G + (Aα1) h Si k ) (w G + (Aα2) h Si k ) 2

= (A(α1 α2)) h Si k 2

= h Si k (A(α1 α2)) 2

(h Si k )[l] 2

α[l] 1 α[l] 2 2

l=1 H2 α[l] 1 α[l] 2 2

= H α1 α2 2

Remark B.9. Lemma B.8 indicates that when the adaptation rate is perturbed by a little, the personalized model parameter w Si k is also only perturbed by a little.

Lemma B.10 (Smoothness preserving). If (1) ℓCE(f(x; w), y) is β-smooth w.r.t. w given any data sample (x, y), and (2) the update direction h Si k is H-module-wise-bounded for all data batches XSi k , then Fi(α) is (H2β)-smoothness w.r.t. α.

Proof. We first give an upper bound of A diag(h Si k ) 2 when h Si k is H-module-wise-bounded. The update direction h Si k RD is the concatenation of update directions for each module {(h Si k )[l]}d l=1, i.e., h Si k = (h Si k )[1] , , (h Si k )[d]

where (h Si k )[l] is a column vector representing the update direction of the l-th module in the model. Similarly, any other vector v RD can be correspondingly expressed as

v = v[1] , , v[d]

A diag(h Si k ) 2 = sup v RD A diag(h Si k )v 2 v 2

= sup v RD A (h Si k v) 2 v 2

v u u u u t

(h Si k )[l] v[l] 2

Pd l=1 v[l] 2 2

Pd l=1 h (h Si k )[l] 2 v[l] 2

Pd l=1 v[l] 2 2

Pd l=1 H v[l] 2 2

Pd l=1 v[l] 2 2 (Definition B.7)

We then prove that for any H-module-wise-bounded update direction h Si k and data sample (x Si k,b, y Si k,b), we have ℓCE(f(x Si k,b; w Si k (α)), y Si k,b) is H2β-smoothness w.r.t. α.

α1ℓCE(f(x Si k,b; w Si k (α1)), y Si k,b) α2ℓCE(f(x Si k,b; w Si k (α2)), y Si k,b) 2

= A (h Si k w Si k (α1)ℓCE(f(x Si k,b; w Si k (α1)), y Si k,b) A (h Si k w Si k (α2)ℓCE(f(x Si k,b; w Si k (α2)), y Si k,b) 2

A diag(h Si k ) 2 w Si k (α1)ℓCE(f(x Si k,b; w Si k (α1)), y Si k,b) w Si k (α2)ℓCE(f(x Si k,b; w Si k (α2)), y Si k,b) 2

A diag(h Si k ) 2 β w Si k (α1) w Si k (α2) 2 (Definition B.6)

A diag(h Si k ) 2 β H α1 α2 2 (Lemma B.8)

H2 β α1 α2 2

B.2.3 Convergence of ATP under Fed Avg framework

Finally, we show that with preservation of convexity and smoothness, ATP shares the same convergence guarantee as Fed Avg [31]. We apply the proof in [46]. Theorem B.11 (Convergence of ATP). Assume that

1. At any round t, each client takes τ SGD steps with learning rate η.

2. Full participation, i.e., each source client participates every round

3. ℓCE(f(x; w), y) is convex and β-smooth w.r.t. w given any data sample (x, y).

4. The update direction h Si k is H-module-wise-bounded for all i, j

5. Bounded inner variance: for any α and client i,

Ej αFij(α) = αFi(α), Ej αFij(α) αFi(α) 2 2 σ2

6. Bounded outer variance: for any α and client i,

αFi(α) αF(α) 2 2 ζ2

If the client learning rate satisfies η 1 4H2β , then one has

k=1 F( αt,k) F(α )

α0 G α 2 2 2ητT + ησ2

N + 4τη2H2βσ2 + 18τ 2η2H2βζ2

where α = arg minα F(α) and αt,k = 1 N PN i=1 αt,k i . αt,k i is the local adaptation rates after t communication rounds and k local epochs.

Proof. The optimization process of ATP is similar as Fed Avg [31], where the difference is that ATP adapts the adaptation rates instead of model parameter. Lemma B.5 and B.10 that ATP preserves convexity and smoothness, i.e., for each client i, Fi(α) is convex and (H2β)-smoothness. Therefore, we can apply Theorem 1 in [46] to complete the proof.

Remark B.12. The convergence rate of ATP is O( 1

B.3 Generalization analysis

In this part, we studied how an adaptation rate α learned by ATP that performs well on source clients can generalize to target clients. More specifically, we are interested in how many different source clients are required to ensure a certain generalization error. Similar to most of the other generalization analysis, we (1) derive generalization bound for any fixed hypothesis (α), and (2) quantify the size of hypothesis space.

B.3.1 Definitions: data generation and error rates

We first formally define the error rates.

Definition B.13 (Error rate for one data sample). Let f( ; w Si k ) : X |Y| 1 be the neural network with model parameters w Si k that takes one data sample x Si k,b as input and outputs a probability distribution over the label space, i.e., f(x Si k,b; w Si k ) 0 and 1 f(x Si k,b; w Si k ) = 1. Given adapted model parameters w Si k , define the error rate on one data sample (x Si k,b, y Si k,b) to be

ˆeikb(w Si k ) := 1 (y Si k,b) f(x Si k,b; w Si k )

Remark B.14. Definition B.13 is equivalent to the expected misclassification rate if when making random decision based on the output probability f(x Si k,b; w Si k ).

Definition B.15 (Error rate for one data batch). Given global model parameter w G, adaptation rate α, and a batch of data XSi k = {x Si k,b}B b=1, Y Si k = {y Si k,b}B b=1, define the error rate for one data batch

ˆεik(α) := ˆeik(w Si k ) := 1

b=1 ˆeikb(w Si k )

w Si k = w G + (Aα) h Si k

and h Si k is the update direction computed with XSi k . Definition B.16 (Error rates for one client). Given global model parameter w G, adaptation rate α, and a source client Si with K data batches {(XSi k , Y Si k )}K k=1, define the empirical error rate for source client Si

ˆεi(α) := 1

k=1 ˆεik(α)

Also, define the expected error rate for source client Si

εi(α) := E(X Si k ,Y Si k ) P Si [ˆεik(α) ]

Remark B.17. ˆεi(α) quantifies the error rate on client Si s finite dataset DSi = {(XSi k , Y Si k )}K k=1. εi(α) quantifies the expected error rate on a new data batch from client Si. Notice that the same definition applies to target clients. Definition B.18 (Source error rate and expected target error rate). Given global model parameter w G, adaptation rate α, and N source client S1, , SN, each with K data batches {(XSi k , Y Si k )}K k=1, define the training error rate

Also, define the expected testing error rate

ε(α) := EP Si Q εi(α) | P Si = EP Si QE(X Si k ,Y Si k ) P Si [ˆεik(α)]

Remark B.19. ˆε(α) quantifies the averaged error rate across source clients finite samples. ε(α) quantifies the expected error rate on a new data batch from a new client (target client). Noting that both error rates are defined with respect to the personalized model after adaptation.

B.3.2 Generalization bound for one hypothesis

Next, we derive generalization bounds for one fixed adaptation rate α. Since we consider fixed α, for clarity, we denote

Zik := ˆεik(α)

k=1 Zik = ˆεi(α)

µi := E(X Si k ,Y Si k ) P Si[Zik] = εi(α)

i=1 Zi = ˆϵ(α)

µ := EP Si Qµi = ϵ(α)

Intuitively, with enough number of source clients and number of batches, we have Z µ µ.

Lemma B.20 (Hoeffding s inequality). Let X1, , Xn be independent random variables such that ai Xi bi almost surely. Consider the sum of these random variables Sn = X1 + + Xn. For all ϵ > 0,

Pr(Sn E[Sn] ϵ) exp 2ϵ2 Pn i=1(bi ai)2

Proof. Please refer to [13]

Lemma B.21 (Concentration of averaged client expected error rates). For any ϵ > 0, we have

Pr( µ µ ϵ) exp( 2Nϵ2)

Proof. Notice that µ1, , µN are independent given Q. For all i = 1, , N, Eµi = µ and 0 µi 1. Therefore,

Pr( µ µ ϵ) = Pr

i=1 µi Nµ Nϵ

exp 2 (Nϵ)2

(Hoeffding s inequality)

= exp( 2Nϵ2)

Lemma B.22 (Concentration of client empirical error rate). For any ϵ > 0,

Pr Z µ ϵ exp( 2NKϵ2)

Proof. Given distributions P S1, , P SN , we have Z11, , Z1K, Z21, , ZNK are independent. For any i = 1, , N and k = 1, , K, we have E(X Si k ,Y Si k ) P Si Zik = µi and 0 Zik 1. Therefore,

Pr Z µ ϵ | P S1, , P SN = Pr

i=1 Kµi NKϵ

P S1, , P SN !

P S1, , P SN !

exp 2 (NKϵ)2

(Hoeffding s inequality)

= exp( 2NKϵ2)

Then, we use the tower property,

Pr( Z µ ϵ) = EP S1, ,P SN i.i.d. Q Pr( Z µ ϵ | P S1, , P SN )

sup P S1, ,P SN Pr( Z µ ϵ | P S1, , P SN )

exp( 2NKϵ2)

Proposition B.23 (Generalization for one hypothesis). For any fixed global model w G and adaptation rate α, for any ϵ > 0, we have

Pr (|ˆε(α) ε(α)| ϵ) 4 exp 2NKϵ2

Proof. For any ϵ > 0, we have

Pr( Z µ ϵ) = Pr(( Z µ ) + ( µ µ) ϵ)

inf ϵ Pr(( Z µ ϵ ) ( µ µ ϵ ϵ ))

inf ϵ Pr( Z µ ϵ ) + Pr( µ µ ϵ ϵ )

inf ϵ [exp( 2NK(ϵ )2) + exp( 2N(ϵ ϵ )2)] (Lemma B.21 and B.22)

To make the bound clear (although not optimal), we choose ϵ = 1

K+1ϵ and thus ϵ ϵ =

K+1ϵ. Then the bound becomes,

Pr( Z µ ϵ) 2 exp 2NKϵ2

Similarly we can show that

Pr( Z µ ϵ) 2 exp 2NKϵ2

Pr (|ˆε(α) ε(α)| ϵ) = Pr( Z µ ϵ) 4 exp 2NKϵ2

Remark B.24. The RHS is function of both (1) N, the number of source clients and (2) K, the number of data batches on each client.

When N , given any fixed K 1, the RHS 0, indicating that ˆϵ(α) p ϵ(α) (convergence in probability).

However, given a fixed finite N, when K , the RHS does not limit to zero. Intuitively, sampling more batches on finite source clients only help the algorithm learn finite distribution P S1, , P SN . However, more data batches on existing clients does not help further exploration of the meta-distribution Q and generalization to novel target clients. Actually, given a fixed finite N, when K , ˆϵ(α) p 1

N PN i=1 ϵi(α) = ϵ(α).

If we put data from N sources (each with K batches) together as one source with NK batches. The generalization bound is looser.

B.3.3 Generalization bound for hypothesis space (proof of Theorem 5.1)

Finally, we derive the generalization bound for the hypothesis space. We first show in Lemma B.27 that ˆεij(α) is (LH)-Lipschitz to α, then we apply standard generalization analysis in Theorem 5.1 based on covering number [32]. Definition B.25 (L-Lipschitz). The neural network f(x; w) is L-Lipschitz w.r.t. w, if x and w1, w2.

f(x; w1) f(x; w2) 2 L w1 w2 2 Definition B.26 (H-module-wise-bounded update direction). The update direction is H-modulewise-bounded for a data batch XSi k if

(h Si k )[l] 2 H, l = 1, , d

where (h Si k )[l] is the update direction corresonding to the l-th module and d is the number of modules in the neural network. Lemma B.27 (Lipschitz error rate). Given a data batch (XSi k , Y Si k ), if the update direction is H-module-wise-bounded, given any two adaptation rates α1, α2 and the global model w G, we have

|ˆεij(α1) ˆεij(α2)| LH α1 α2 2

|ˆεij(α1) ˆεij(α2)|

1 (y Si k,b) f(x Si k,b; w Si k (α1)) !

1 (y Si k,b) f(x Si k,b; w Si k (α2)) !

b=1 (y Si k,b) f(x Si k,b; w Si k (α1)) f(x Si k,b; w Si k (α2))

b=1 y Si k,b 2 f(x Si k,b; w Si k (α1)) f(x Si k,b; w Si k (α2)) 2

b=1 y Si k,b 2 L w Si k (α1) w Si k (α2) 2 (L-Lipschitz model)

b=1 y Si k,b 2 L H α1 α2 2 (Lemma B.8)

= LH α1 α2 2

Remark B.28. Intuitively, Lemma B.27 shows that small change in α will result in bounded change on ˆεij(α). Corollary B.29. ˆϵ(α) and ϵ(α) are (LH)-Lipschitz w.r.t. α.

Proof. ˆϵ(α) and ϵ(α) are expectations of ˆεij(α) given the empirical and expected distribution of (XSi k , Y Si k ). Lipschitz property is preserved.

Theorem 5.1 (Generalization for hypothesis space). Let H = {α : α 2 R} be the hypothesis space (space of adaptation rates), N be the number of source clients, and K be the number of data batches on each source client. Assuming (1) L-Lipschitz model, and (2) H-module-wise-bounded update direction. For any fixed global model w G and any ϵ > 0, we have

Pr( sup α H |ε(α) ˆε(α)| ϵ) 12LHR

d 4 exp NKϵ2

where ˆε(α) is the average post-adaptation error rate on source clients, and ε(α) is the expected post-adaptation error rate on clients population.

Proof. We use covering number to derive the generalization bound [32]. Define estimation error

ϵ(α) = ε(α) ˆε(α)

| ϵ(α1) ϵ(α2)| = |[ε(α1) ˆε(α1)] [ε(α2) ˆε(α2)]| |ε(α1) ε(α2)| + |ˆε(α1) ˆε(α2)| 2LH α1 α2 2 (Corollary B.29)

A = {α : α 2 R, α Rd} can be covered by K = N2(R, r) L2 balls with radius r = ϵ 4LH . Lemma 6.27 in [32] shows that

S = N2 (R, r) 3R

Denote these L2 balls to be B1, , BS,

Pr sup α A |ε(α) ˆε(α)| ϵ

s=1 Pr sup α Bs |ε(α) ˆε(α)| ϵ

For each ball Bs, s = 1, , S, denote the center to be αs. For any α Bs, we have α αs ϵ 4LH , therefore

| ϵ(αs) ϵ(α)| 2LH α αs ϵ

2 Intuitively, every α Bs has similar error rate. Therefore, the error rate for the whole ball is upper bounded, as long as the center αs has a small error rate

Pr sup α Bs |ε(α) ˆε(α)| ϵ = Pr sup α Bs | ϵ(α)| ϵ

Pr sup α Bs [| ϵ(αs)| + | ϵ(αs) ϵ(α)|] ϵ

Pr sup α Bs | ϵ(αs)| + ϵ

= Pr |ε(αs) ˆε(αs)| ϵ

Finally, by Proposition B.23, for each αs

Pr |ε(αs) ˆε(αs)| ϵ

Put all together

Pr sup α A |ε(α) ˆε(α)| ϵ 12LHR

d 4 exp NKϵ2

C Additional experiments

C.1 Detailed experiment settings

C.1.1 CIFAR-10 experiments

Data preparation We use a benchmarking three-way split [50]: we randomly split the dataset to 300 clients, 240 of them are source clients and 60 are target clients. Each source client has 160 training samples and 40 validation samples, while each target client has 200 testing samples. We simulate three kinds of distribution shifts: feature shift, label shift, and hybrid shift. For feature shift, we follow [12, 17], randomly apply 15 different kinds of corruptions to the source clients (Figure 7a), and 4 new kinds of corruptions to the target clients (Figure 7b) to test the generalization of ATP. The corruption severity is randomly selected from {1, 2, 3, 4, 5}. For label shift, we use the step partition [5], where each client has 8 minor classes with 5 images per class, and 2 major classes with 80 images per class. For the hybrid shift, we apply both step partition and feature corruptions.

Gaussian Noise

Impulse Noise

Defocus Blur

Frosted Glass Blur

Motion Blur

(a) 15 corruptions for training clients

Speckle Noise

Gaussian Blur

(b) 4 corruptions for testing clients

Figure 7: 15 + 4 different corruptions we use to construct feature shift

Global model training We first train a global model with Fed Avg [31] over the training sets of source clients.

Res Net-18: The global model is Res Net-18 with Image Net pretrained parameter (provided by torchvision). We train the global model for T = 200 communication rounds with full participation (cohort size C = 240), local epochs E = 1, learning rate η = 0.01 and batch size B = 20. Shallow CNN: The global model is a randomly initialized 5-layer CNN. We train the global model for T = 200 communication rounds with full participation (cohort size C = 240), local epochs E = 1, learning rate η = 0.1 and batch size B = 20.

ATP training We initialize the adaptation rates as a all-zero vector, and optimize it over the validation sets of source clients. We optimize the adaptation rates for T = 200 (for Res Net-18) or 400 (for Shallow CNN) communication rounds with partial participation (cohort size C = 60), learning rate η = 0.1 and batch size B = 20.

ATP testing We test the optimized adaptation rates on each target client. We use batch size B = 20 by default, and test different batch size in Subsection 6.3.

C.1.2 CIFAR-100 experiments

Data preparation The data preparation is similar to CIFAR-10 experiments. The only difference is for label shift, each client has 98 minor classes with 1 image per class, and 2 major classes with 51 images per class. Same partition is applied to hybrid shift.

Global model training We first train a global model with Fed Avg [31] over the training sets of source clients. The global model is Res Net-18 with Image Net pretrained parameter (provided by torchvision). We train the global model for T = 200 communication rounds with full participation (cohort size C = 240), local epochs E = 1, learning rate η = 0.01 and batch size B = 20.

ATP training We initialize the adaptation rates as a all-zero vector, and optimize it over the validation sets of source clients. We optimize the adaptation rates for T = 200 communication rounds with partial participation (cohort size C = 60), learning rate η = 0.1 and batch size B = 20.

ATP testing We test the optimized adaptation rates on each target client. We use batch size B = 20.

C.1.3 Digits-5 experiments

Data preparation Digits-5 dataset contains five domains: MNIST, SVHN, USPS, Synth Digits, and MNIST-M. We adopt the leave-one-domain-out evaluation protocol [10], i.e., one domain is chosen as the held-out testing domain, and the remaining domains are regarded as source training domains. We follow the data preprocessing in [25], while additionally applying step partition to inject label shift. Each domain is divided into 10 clients, leading to a total of 40 source clients and 10 target clients. Consequently, each client ends up with approximately 743 images spread across 10 classes. Each source client has 80% of its samples as training set and the remained 20% as testing set. Each client has 2 major classes and 8 minor class, where the ratio of images per class is approximately 16 : 1 (the same as our CIFAR-10 experiments). Since there is already domain shift, we do not add corruptions.

Global model training We first train a global model with Fed Avg [31] over the training sets of source clients. The global model is Res Net-18 with Image Net pretrained parameter (provided by torchvision). We train the global model for T = 200 communication rounds with full participation (cohort size C = 50), local epochs E = 1, learning rate η = 0.01 and batch size B = 20.

ATP training We initialize the adaptation rates as a all-zero vector, and optimize it over the validation sets of source clients. We optimize the adaptation rates for T = 200 communication rounds with partial participation (cohort size C = 10), learning rate η = 0.5 and batch size B = 200.

ATP testing We test the optimized adaptation rates on each target client. We use batch size B = 200.

C.1.4 PACS experiments

Data preparation PACS dataset contains four domains: art, cartoon, photo, and sketch. We adopt the leave-one-domain-out evaluation protocol [10], i.e., one domain is chosen as the held-out testing domain, and the remaining domains are regarded as source training domains. We follow the data preprocessing in [10], while additionally applying step partition to inject label shift. Each domain is divided into 7 clients, leading to a total of 21 source clients and 7 target clients. Each source client has 80% of its samples as training set and the remained 20% as testing set. Each client has 2 major classes and 5 minor class, where the ratio of images per class is approximately 16 : 1 (the same as our CIFAR-10 experiments). Since there is already domain shift, we do not add corruptions.

Global model training We first train a global model with Fed Avg [31] over the training sets of source clients. The global model is Res Net-18 with Image Net pretrained parameter (provided by torchvision). We train the global model for T = 200 communication rounds with full participation (cohort size C = 21), local epochs E = 1, learning rate η = 0.05 and batch size B = 20.

ATP training We initialize the adaptation rates as a all-zero vector, and optimize it over the validation sets of source clients. We optimize the adaptation rates for T = 500 communication rounds with full participation (cohort size C = 21), learning rate η = 0.5 and batch size B = 200.

ATP testing We test the optimized adaptation rates on each target client. We use batch size B = 200.

C.1.5 Algorithm details

Assignment matrix A In the main test, we mentioned that A RD d is a 0 1 assignment matrix that maps each adaptation rate α[l] to the indices of the l-th module s parameters in w. Mathematically,

Akl = 1, if the k-th parameter in w belongs to the l-th module 0, otherwise

If there are d = 3 modules, each with 1, 2, and 3 parameters, so D = 1+2+3 = 6, the corresponding assignment matrix will be

1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1

Computation

We did our experiments with single NVIDIA Tesla V100 GPU. However, our experiment should only require less than 2GB of GPU memory.

C.2 Compatibility to model architecture (RQ1)

In this part, we evaluate ATP with two more model architectures: a 5-layer Shallow CNN as a smaller model and Res Net-50 as a larger model.

Table 5: ATP with different model architectures, accuracy (mean s.d. %) on target clients

Method Shallow CNN on CIFAR-10 Res Net-50 on CIFAR-100

Feature shift Label shift Hybrid shift Avg. Rank Feature shift Label shift Hybrid shift Avg. Rank

No adaptation 64.39 0.18 69.33 0.37 61.99 0.47 7.3 45.31 0.30 51.63 0.15 40.01 0.17 7.3 BN-Adapt 66.46 0.22 54.99 0.38 50.40 0.43 7.0 47.75 0.29 34.85 0.26 30.31 0.09 7.3 SHOT 65.60 0.18 49.98 0.29 45.95 0.47 9.0 45.42 0.30 31.06 0.32 27.44 0.14 9.3 Tent 65.61 0.24 50.12 0.25 45.91 0.49 8.7 45.91 0.46 31.34 0.11 27.93 0.31 8.3 T3A 64.31 0.27 66.96 0.43 59.65 0.58 8.3 45.31 0.30 51.42 0.15 39.89 0.20 7.7 MEMO 65.89 0.31 71.95 0.25 64.17 0.47 5.3 48.42 0.14 55.19 0.28 42.53 0.20 3.7 EM 61.74 0.25 76.28 0.29 67.54 0.41 5.0 43.00 0.31 59.34 0.15 44.82 0.27 5.0 BBSE 56.92 0.53 75.99 0.44 66.64 0.53 6.3 37.26 0.64 56.97 0.20 40.09 0.51 7.0 Surgical 64.45 0.12 73.75 0.42 65.67 0.44 5.7 45.18 0.38 54.83 0.26 42.50 0.33 6.7 ATP-batch 66.90 0.05 76.23 0.32 68.88 0.35 2.3 48.35 0.45 58.06 0.53 46.82 0.32 2.7 ATP-online 67.13 0.17 78.56 0.32 71.52 0.51 1.0 49.08 0.26 61.86 0.25 49.51 0.23 1.0

From Table 5, we observe that under the new model architecture (and the new dataset), the performance of ATP is highly similar to the results of the Res Net-18 + CIFAR10 experiment in Table 1 in Subsection 6.1. ATP, in all three scenarios, can handle various types of distribution shifts and surpass baseline methods. This suggests that ATP is compatible with multiple model architectures.

C.3 Robustness to global model

In this subsection, we design experiments to answer the following question: is ATP robust to the choice of global model? Specifically, we have three sub-questions:

Is ATP robust to the parameter of global model? (C.3.1) Is ATP robust to the algorithm to train global model? (C.3.2)

C.3.1 Robustness to the parameter of global model (online updated global model)

In the main text, we primarily focused on the scenario where the global model remains fixed. However, in practical FL systems, the global model may also undergo continuous online updates. Therefore, after obtaining the adaptation rates through ATP training, the global model might have been further updated for several rounds. This raises a question: Are the outdated adaptation rates still effective after several rounds of updates to the global model?

Table 6: Accuracy (%), ATP can learn adaptation rates that generalize to global models with different numbers of communication rounds under hybrid shift on CIFAR-10

Method 200 + 0 Rounds +10 Rounds +20 Rounds +50 Rounds +100 Rounds

No adaptation 63.68 0.24 63.88 0.20 64.03 0.13 64.30 0.08 64.56 0.11 ATP-batch 73.05 0.35 73.20 0.40 73.25 0.37 73.47 0.48 73.61 0.28 ATP-online 75.37 0.22 75.61 0.23 75.69 0.20 75.80 0.15 75.83 0.28

We design experiment to apply the outdated adaptation rates to the global model that has undergone additional updates for several rounds, to see if they can still improve the test-time accuracy of the global model. Specifically, we optimize the adaptation rates α with w T G where T = 200, but test the adaptation rates with w T + T G with T = 10, 20, 50, 100 rounds. We use the same setting of hybrid shift on CIFAR-10 experiments. As shown in Table 6, while further optimizing the global model can marginally improve the accuracy, both ATP-batch and ATP-online can effectively enhance the test-time accuracy through personalization, even when α is trained using an outdated version of the global model.

C.3.2 Robustness to the algorithm to train global model

In the main text, we used Fed Avg [31] to train the global model. However, in real-world FL systems, other FL algorithms may be employed for training the global model, considering stability optimization or fairness. Therefore, we aim to investigate whether ATP can also be applied to other commonly used FL algorithms.

Table 7: Accuracy (%), ATP enhances different global models under hybrid shift on CIFAR-10

Method Fed Avg Fed Prox (µ = 0.01) q-FFL (q = 1)

No adaptation 63.68 0.24 63.77 0.25 63.87 0.23 ATP-batch 73.05 0.35 72.95 0.33 73.15 0.21 ATP-online 75.37 0.22 75.51 0.19 75.79 0.15

In particular, we use Fed Prox [23], an FL algorithm designed to handle heterogeneous setting, and q-FFL [24], an FL algorithm enhancing performance fairness among participating clients. For all global model, we use the same setting of hybrid shift on CIFAR-10 experiments. As shown in Table 7, both ATP-batch and ATP-online can consistently improve the test-time accuracy across different FL algorithms to train global models.

C.4 Convergence and generalization

In Section 5, Appendix B.2 and B.3, we theoretically show that ATP has good convergence and generalization guarantees. In this section, we visualize the training and testing loss curves to verify the fast convergence and superior generalization of ATP under different cohort size C. The results are shown in Figure 8.

0 25 50 75 100 125 150 175 200 Round

(a) C = 240

0 25 50 75 100 125 150 175 200 Round

(b) C = 120

0 25 50 75 100 125 150 175 200 Round

0 25 50 75 100 125 150 175 200 Round

0 25 50 75 100 125 150 175 200 Round

Figure 8: Loss curves of ATP under different cohort size C

Convergence Under full participation (C = 240), both the training and testing loss converge stably and fast, indicating the reliable convergence of ATP. With partial participation, as the cohort size decreases (C = 120, 60, 30, 15), the training loss curve exhibits greater fluctuations, primarily due to sampling different subsets of clients in each communication round. However, the testing loss curve still converge stably with similar speed, indicating that ATP is robust to partial participation.

Generalization Under full participation (C = 240), the training and testing loss curves decrease synchronously without any overfitting. This implies that our algorithm exhibits excellent generalization. Similar observations can be made for partial participation (C = 120, 60, 30, 15). Additionally, it is worth noting that the test loss is lower than the train loss, which may seem counterintuitive. This is primarily due to the use of different corruptions between the testing and source clients. The accuracy of clients varies significantly under different corruptions, as evidenced by the fluctuations in the training curve when C = 15. However, we can still analyze the generalization performance by comparing the trends of the two curves.

C.5 Toy example for negative adaptation rate (RQ2)

In Section 6.2, we notice that ATP learns negative adaptation rates for running means and variance under label shift. In this subsection, we use a toy example to show why negative adaptation rate can improve the model performance under label distribution shift.

We consider a binary classification problem with input x R and binary output y { 1, +1}, where 1 is the negative class and +1 is the positive class. Let the feature for negative samples (x|y = 1) N( 1, 0.82) and for positive samples (x|y = +1) N(+1, 0.82). Let the label distribution Pr(y = 1) = 1

2 for training set, and Pr(y = 1) = 5

6 for testing set. Therefore, for the training distribution, we have

Ex = Pr(y = 1)E(x|y = 1) + Pr(y = +1)E(x|y = +1) = 0 V ar(x) = E[V ar(x|y)] + V ar(E[x|y]) = 1.64

We consider a simple network with only one BN layer, with both normalization and affine transformation (as a linear classifier). There are four modules, each is a scalar: running mean µ, running variance σ2, weight γ, bias β.

3 2 1 0 1 2 3 z

decision pos neg

(a) Train, Acc=0.89

3 2 1 0 1 2 3 z

decision pos neg

(b) α = 1, Acc=0.73

3 2 1 0 1 2 3 z

decision pos neg

(c) α = 0.5, Acc=0.83

3 2 1 0 1 2 3 z

decision pos neg

(d) α = 0, Acc=0.89

3 2 1 0 1 2 3 z

decision pos neg

(e) α = 0.5, Acc=0.92

Figure 9: Adapting batch norm running statistics under label shift.

Training During training, given enough training data, we have µtrain = Ex = 0 and σ2 train = 1.64. Figure 9a shows the histogram of z = x µ

σ , i.e., the intermediate feature after normalization before the transformation. By comparing the histograms of z of two classes, we notice that the optimal decision boundary is z = 0, which indicate that βtrain = 0 and γtrain > 0. We store the corresponding µtrain, σ2 train, γtrain, βtrain, and only update running statistics µtrain, σ2 train during testing.

Testing without updating running statistics (α = 0) Figure 9d shows the testing result when we do not update the running statistics, i.e., α = 0. Since two conditional feature distributions are symmetric, the accuracy will not change.

Testing with α > 0 Positive adaptation rates align the intermediate feature distribution. When we use α = 1, the distribution of z will be centralized. As shown in Figure 9b, such alignment greatly reduces the accuracy. Similar result is also observed with any positive α, e.g., α = 0.5 in Figure 9c.

Testing with α < 0 While α = 0 has stable accuracy under label shift, by comparing the histograms of z of two classes in Figure 9d, we notice that z = 0 is not the optimal decision boundary anymore, because there are less negative samples than positive samples. By using negative adaptation rate α < 0, the normalization layer can further disalign the intermediate feature, which can further improve the accuracy, as shown in Figure 9e.