Published as a conference paper at ICLR 2024

GRADUAL DOMAIN ADAPTATION VIA GRADIENT FLOW

Zhan Zhuang1,2, Yu Zhang1,*, Ying Wei3,*
1Southern University of Science and Technology, 2City University of Hong Kong, 3Nanyang Technological University
12250063@mail.sustech.edu.cn, yu.zhang.ust@gmail.com, ying.wei@ntu.edu.sg
*Corresponding authors.

ABSTRACT

Domain shift degrades classification models on new data distributions. Conventional unsupervised domain adaptation (UDA) aims to learn features that bridge labeled source and unlabeled target domains. In contrast to feature learning, gradual domain adaptation (GDA) leverages extra continuous intermediate domains with pseudo-labels to boost the source classifier. However, real intermediate domains are sometimes unavailable or ineffective. In this paper, we propose Gradual Domain Adaptation via Gradient Flow (GGF) to generate intermediate domains while preserving labels, thereby enabling a fine-tuning approach to GDA. We employ the Wasserstein gradient flow of the Kullback-Leibler divergence to transport samples from the source to the target domain. To simulate the dynamics, we utilize the Langevin algorithm. Since the Langevin algorithm disregards label information and introduces diffusion noise, we introduce classifier-based and sample-based potentials to avoid label switching and dramatic deviations in the sampling process. For the proposed GGF model, we analyze its generalization bound. Experiments on several benchmark datasets demonstrate the superiority of the proposed GGF method over state-of-the-art baselines.

1 INTRODUCTION

Unsupervised Domain Adaptation (UDA) stands as a fundamental and classical problem in machine learning (Pan & Yang, 2010). Its primary objective is to transfer knowledge from a well-labeled source domain to a related yet unlabeled target domain, thereby reducing the need for time-consuming manual labeling and data preprocessing. Early discrepancy-based UDA methods (Borgwardt et al., 2006; Pan et al., 2010; Tzeng et al., 2014; Courty et al., 2014; Long et al., 2015; Ganin & Lempitsky, 2015; Sun & Saenko, 2016; Ganin et al., 2016; Courty et al., 2017; Tzeng et al., 2017) have shown promise in learning domain-invariant feature representations. However, recent studies (Zhao et al., 2019; Chen et al., 2019; Tang & Jia, 2020; Yang et al., 2020) reveal that simply aligning the source and target domains is likely to reduce discriminability. To address this issue, transport-based (Kirchmeyer et al., 2022; Gao et al., 2023; Xiao et al., 2023) and synthetic sample-based (Cui et al., 2020; Xu et al., 2020; Wu et al., 2020; Choi et al., 2020; Dai et al., 2021; Na et al., 2021; Jing et al., 2022; Na et al., 2022) approaches have been developed. Apart from them, Kumar et al. (2020) proposed a learning paradigm called Gradual Domain Adaptation (GDA), which resorts to an extra sequence of continuous unlabeled samples as intermediate domains to adapt the source classifier to the target domain via self-training instead of feature alignment. However, in many real-world scenarios only unlabeled data from the target domain are available, and constructing appropriate intermediate domains that would transform such a UDA setting into a GDA setting remains an open question (Wang et al., 2022).
Prior studies have demonstrated that generating additional intermediate domains through generative models or interpolation algorithms is a viable approach that improves performance on the target domain. For instance, Sagawa & Hino (2022) utilized Continuous Normalizing Flows (CNF) (Chen et al., 2018) to interpolate between the source and target domains. GIFT (Abnar et al., 2021) and GOAT (He et al., 2023) use optimal transport and linear interpolation within a mini-batch to form geodesic paths.

[Figure 1: Illustration of the proposed GGF method. (a) The three designed energies (distribution-based, classifier-based, and sample-based) collectively define a gradient flow in latent space; particles follow $z_{t+1} = z_t - \eta\,\nabla_{W_2}\mathcal{E}_{\mu_t}(z_t)$, moving from high to low energy (labeled source: 1900s portraits; unlabeled target: 1970s portraits). (b) The training strategy of GGF involves gradual fine-tuning of the classifier by utilizing the intermediate domains constructed in the GGF method.]

Those methods share a similar spirit with transport-based and synthetic sample-based UDA methods. Unfortunately, they suffer from two limitations. Firstly, one-step interpolation and CNFs with minimal constraints can result in ambiguous labels for synthetic samples. Secondly, self-training with multiple pseudo-labeling iterations may lead to training instability (Chen et al., 2022).

To address the limitations above, we propose Gradual domain adaptation via Gradient Flow (GGF), a novel method to construct intermediate domains when real ones are absent. Gradient flow refers to the continuous process of gradient descent, which follows the steepest descent direction to minimize a given energy function. As shown in Figure 1, in the proposed GGF method, the designed gradient flow transports feature representations from the source domain to the target domain along a curve that minimizes the following three designed energies: (1) a distribution-based energy that shifts the features from the source domain to the target domain, (2) a classifier-based energy that preserves label information, and (3) a sample-based energy that avoids large noise in generated samples. To the best of our knowledge, we are the first to propose constructing intermediate domains that preserve labels, which meanwhile allows the source classifier to be gradually fine-tuned to align with the target distribution. We provide a theoretical analysis on the generalization bound of the proposed GGF method, showing that a gradient flow with smaller continuous discriminability and transport loss contributes to a lower target error. We conduct comprehensive experiments to evaluate the proposed GGF on various domain adaptation scenarios, where GGF outperforms state-of-the-art methods.

2 PRELIMINARIES AND BACKGROUND

2.1 SETTING AND NOTATION

In the context of GDA, we consider the data of the labeled source domain, the $T$ unlabeled intermediate domains, and the unlabeled target domain to be sampled from the distributions $\mu_0$, $\{\mu_t\}_{t=1}^{T}$, and $\mu_{T+1}$ over $X$, respectively. We denote the target distribution $\mu_{T+1}$ by $\pi$. Consider the hypothesis class $\mathcal{H}$, where any classifier $h \in \mathcal{H}$, $h: X \to Y$, maps inputs to predictions. We assume that there exists a labeling function in each domain: $f_{\mu_t} = f_t \in \mathcal{H}$. Given a loss function $\ell(\cdot,\cdot)$, the generalization error is defined as $\epsilon_\mu(h) = \epsilon_\mu(h, f_\mu) = \mathbb{E}_{x\sim\mu}\,\ell(h(x), f_\mu(x))$.
The source classifier $h_0$ can be learned with minimal error $\epsilon_{\mu_0}(h_0)$ using supervised learning, and the objective of GDA is to evolve this classifier from $h_0$ to $h_T$ over the intermediate domains so as to minimize the target error $\epsilon_\pi(h_T)$. The UDA problem shares the same objective and can be converted into a GDA problem by generating intermediate domains.

Self-training Self-training (ST) (French et al., 2017; Zou et al., 2019; Prabhu et al., 2021; Liu et al., 2021) (a.k.a. pseudo-labeling) is a domain adaptation method that does not rely on feature learning. We use the Wasserstein-2 distance $W_2(\mu, \nu)$ (see Appendix A.1) to quantify the domain shift. Under the assumption that the shift between adjacent domains is small, previous studies (Kumar et al., 2020; Wang et al., 2022) have shown that ST can effectively update the classifier $h_{t-1}$ to adapt to the next domain with samples $S_t$, as expressed mathematically below:
$$h_t = \mathrm{ST}(h_{t-1}, S_t) = \arg\min_{h\in\mathcal{H}} \sum_{x\in S_t} \ell(h(x), h_{t-1}(x)), \quad (1)$$
where $h_{t-1}(x)$ denotes the pseudo label of $x$ given by $h_{t-1}$. Kumar et al. (2020) proposed Gradual Self-Training (GST) as the baseline algorithm for GDA, which updates the classifier using ST with pseudo-labels on the intermediate domains in sequence (see the illustrative sketch at the end of this section). In this paper, we generate intermediate domains gradually while preserving labels, enabling us to update the classifier in a distinct manner by fine-tuning it directly on the transformed samples.

2.2 WASSERSTEIN GRADIENT FLOW (WGF)

A flow describes a time-dependent diffeomorphic map of particles in a system. Let $\Phi_t(x): [0, 1]\times\mathbb{R}^n \to \mathbb{R}^n$ denote a flow, so that a vector field of position and time $u_t(x)$ defines a flow via an ordinary differential equation (ODE), i.e.,
$$\frac{d}{dt}\Phi_t(x) = u_t(\Phi_t(x)), \quad \Phi_0(x) = x. \quad (2)$$
Let $\mathcal{P}_2(\mathbb{R}^n)$ denote the space of probability measures on $\mathbb{R}^n$ with finite second moments. For the metric space $(\mathcal{P}_2(\mathbb{R}^n), W_2)$, Otto (2001) first proposed WGF to solve the porous medium equation. The vector field of WGF is defined as $u_t = -\nabla_{W_2}\mathcal{E}(\mu_t)$, where a general energy $\mathcal{E}(\mu_t)$ consisting of the following three terms is associated with a distribution $\mu_t$ or a group of particles (Santambrogio, 2017):
$$\mathcal{E}(\mu_t) = \int H(\mu_t(x))\,dx + \int V(x)\,d\mu_t(x) + \frac{1}{2}\iint W(x-y)\,d\mu_t(x)\,d\mu_t(y). \quad (3)$$
The three terms represent the internal, (external) potential, and interaction energy, respectively. Simply put, the internal energy (e.g., the entropy $H(\mu) = \mu\log\mu$) is related to the distribution density, the potential energy is related to the potential field $V$ in Euclidean space, and the interaction energy captures the interactions between particles. Correspondingly, the vector field can be calculated as
$$u_t = -\nabla_{W_2}\mathcal{E}(\mu_t) = -\nabla\mathcal{E}'(\mu_t) = -\nabla\big(H'(\mu_t)\big) - \nabla V - (\nabla W) * \mu_t, \quad (4)$$
where $*$ denotes the convolution operator. We can conceptualize the Kullback-Leibler (KL) divergence, the maximum mean discrepancy (MMD) (Arbel et al., 2019), or other related metrics (Mroueh & Rigotti, 2020; Korba et al., 2021; Glaser et al., 2021) as the energy functional. Leveraging the descent property of gradient flows, we can establish a progressive flow that reduces the difference between distributions over time. This approach proves valuable for dataset transformation and data augmentation (Alvarez-Melis & Fusi, 2021; Hua et al., 2023). In the next section, we delve into our proposed method in the context of domain adaptation.
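To make the GST baseline of Eq. (1) concrete, the following is a minimal PyTorch-style sketch of gradual self-training. It assumes a classifier `model` that outputs logits, an `optimizer` for it, and a list `domains` of unlabeled feature tensors for the intermediate and target domains; these names and the epoch count are illustrative assumptions, not part of any released code.

```python
import torch
import torch.nn.functional as F

def self_train_step(model, optimizer, X_next, n_epochs=5):
    """One ST update (Eq. 1): pseudo-label the next domain with the current
    classifier, then refit the classifier to its own pseudo-labels."""
    model.eval()
    with torch.no_grad():
        pseudo = model(X_next).argmax(dim=1)           # h_{t-1}(x)
    model.train()
    for _ in range(n_epochs):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(X_next), pseudo)  # ell(h(x), h_{t-1}(x))
        loss.backward()
        optimizer.step()
    return model

def gradual_self_training(model, optimizer, domains):
    """GST baseline: apply ST over the unlabeled domains S_1, ..., S_{T+1} in sequence."""
    for X_t in domains:
        model = self_train_step(model, optimizer, X_t)
    return model
```

GGF departs from this scheme by replacing the pseudo-labels `pseudo` with the labels carried over from the source samples, which is what makes direct fine-tuning possible.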
3 GRADUAL DOMAIN ADAPTATION VIA GRADIENT FLOW (GGF)

In this section, we present the novel GGF method, designed to construct intermediate domains along the gradient flow of three energy functions and gradually adapt a source classifier to the target domain. We commence by providing a comprehensive exposition of the three energies and their corresponding sampling techniques. Subsequently, we unify these components into a cohesive framework. To keep our approach versatile and broadly applicable in both the data and latent spaces, we use the symbol "x" to represent both samples and features in the subsequent sections.

3.1 DISTRIBUTION-BASED ENERGY FOR SHIFTING FEATURES FROM SOURCE TO TARGET

Considering the high computation cost of MMD (Arbel et al., 2019) and KSD (Mroueh & Rigotti, 2020), we opt for $\mathrm{KL}(\mu_t\,|\,\pi)$ as the distribution-based energy, with the resulting vector field given as
$$u_t = -\nabla_{W_2}\mathrm{KL}(\mu_t\,|\,\pi) = -\nabla\frac{\delta\,\mathrm{KL}(\mu_t\,|\,\pi)}{\delta\mu_t} = -\nabla\log\mu_t + \nabla\log\pi. \quad (5)$$
This vector field, consisting of the above two log-density components, drives the distribution $\mu_t$ towards the target distribution $\pi$ under the influence of both diffusion and drift (Jordan et al., 1998). We simulate the associated flow following the common practice of Langevin Monte Carlo (LMC) (Roberts & Tweedie, 1996) as a time-discretized sampling method. The LMC guarantees convergence in the Wasserstein distance, as described in Eq. (32) (Dalalyan & Karagulyan, 2019). Concretely, given a set of samples $x_0$ from the source distribution $\mu_0$, the LMC iteratively updates the samples as
$$x_{t+1} = x_t + \eta_1 \nabla_{x_t}\log\pi(x_t) + \sqrt{2\eta_1}\,\xi, \quad (6)$$
where $\pi$, $\xi$, and $\eta_1$ represent the distribution of the target domain, a noise sampled from a standard Gaussian distribution, and the step size, respectively. The log-density gradient $\nabla_x\log\pi(x)$, also known as the (Stein) score function $s_\pi(x)$, can be estimated via various methods (Hyvärinen & Dayan, 2005; Vincent, 2011; Song et al., 2020), even when the probability density function is not available. Here, we resort to Denoising Score Matching (DSM) (Vincent, 2011) with a neural network $s(x;\phi)$ to model the score function. The core idea of DSM is to minimize the discrepancy between the score and the model's predicted score after injecting noise into clean $x$, whose objective follows
$$J_{\mathrm{DSM}} = \mathbb{E}_{q_\sigma(\tilde{x}, x)}\Big[\tfrac{1}{2}\,\big\|s(\tilde{x};\phi) - \nabla_{\tilde{x}}\log q_\sigma(\tilde{x}\,|\,x)\big\|^2\Big]. \quad (7)$$
Simple augmentation methods such as Gaussian noise $\epsilon\sim\mathcal{N}(0,\sigma^2 I)$ can be employed to compute the latter term (i.e., $\nabla_{\tilde{x}}\log q_\sigma(\tilde{x}\,|\,x) = -\frac{\tilde{x}-x}{\sigma^2}$), where $\sigma^2$ denotes the noise variance. By using $q_\sigma(\tilde{x}, x) = q_\sigma(\tilde{x}\,|\,x)\,p(x)$, we can efficiently estimate the scores of perturbed data $\tilde{x}$ in the low-density areas of the target distribution. Once the score network has been trained, we can readily construct intermediate domains following the sampling process in Eq. (6).

Challenges There still remain two main challenges with the sampled intermediate domains. Firstly, the sampling disregards the label information of the samples, which likely introduces either a conditional shift $P_{\mu_0}(y\,|\,x) \neq P_\pi(y\,|\,x)$ or a prior shift $P_{\mu_0}(y) \neq P_\pi(y)$ between the source and target domains, and thereby puts the decision boundary in danger of collapse. Secondly, the noise (a.k.a. diffusion) term introduced by LMC may cause some samples to deviate drastically from the true data distribution, given that the score network cannot accurately estimate the shifted points. To address both issues, we propose the following classifier-based and sample-based energies.
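As a concrete illustration of Eqs. (6)-(7), below is a minimal PyTorch-style sketch of the DSM objective and one Langevin step. It assumes a score network `score_net` mapping features to estimated target scores; the network name, the noise scale, and the batch layout are illustrative assumptions rather than the paper's released implementation.

```python
import torch

def dsm_loss(score_net, x, sigma=0.1):
    """Denoising score matching (Eq. 7): perturb target samples x with Gaussian
    noise and regress the predicted score onto grad log q_sigma(x_tilde | x)."""
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / sigma**2            # = -(x_tilde - x) / sigma^2
    diff = score_net(x_tilde) - target
    return 0.5 * diff.pow(2).sum(dim=1).mean()

@torch.no_grad()
def lmc_step(x, score_net, eta1):
    """One Langevin Monte Carlo update (Eq. 6) using the learned score s_pi."""
    return x + eta1 * score_net(x) + (2 * eta1) ** 0.5 * torch.randn_like(x)
```

Training minimizes `dsm_loss` on unlabeled target features only; the labeled source samples are then pushed towards the target distribution by repeatedly applying `lmc_step`.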
3.2 CLASSIFIER-BASED ENERGY FOR PRESERVING LABEL INFORMATION

We use the cross-entropy loss and the entropy of the logits as the classifier-based potential energies:
$$L_{\mathrm{CE}}(\mu, h, y) = -\int y(x)\log h(x)\,d\mu(x), \qquad L_{\mathrm{H}}(\mu, h) = -\int h(x)\log h(x)\,d\mu(x), \quad (8)$$
where $y(x)$ denotes the corresponding label of input $x$, and $h(x)$ denotes its prediction. For the cross-entropy loss, lower energy ensures that the predictions of the shifted samples remain consistent with the original labels. Meanwhile, a lower entropy of the logits yields higher prediction confidence, avoiding excessively smooth predictions. By introducing a coefficient $\lambda$, we balance the two potentials to accommodate the different distribution characteristics of datasets. The flow that implements this classifier-based energy is exactly the gradient with respect to the inputs computed via backpropagation, i.e.,
$$x_{t+1} = x_t - \eta_2\big((1-\lambda)\,\nabla_{x_t}L_{\mathrm{CE}}(x_t, h_t, y_t) + \lambda\,\nabla_{x_t}L_{\mathrm{H}}(x_t, h_t)\big). \quad (9)$$

Observation Our experiments show that the LMC monotonically reduces the Wasserstein distance between the constructed intermediate domains and the target domain. Including the classifier-based energy, however, makes the distance first decrease to a minimum and then increase, indicating a trade-off between feature shifting and label preservation. We resolve this issue in Section 3.4 by setting a proper number of intermediate domains using this stationary point.

3.3 SAMPLE-BASED ENERGY FOR REDUCING NOISE AND RECTIFYING FLOW

To mitigate the noise of generated samples, we borrow the idea from stochastic interpolants (Liu et al., 2023; Liu, 2022; Albergo & Vanden-Eijnden, 2023; Lipman et al., 2023; Albergo et al., 2023), where flows connecting two distributions do not deviate too much. Those methods use interpolation techniques to create interpolants $x_\tau$, based on which they directly estimate the vector field of the samples through a neural network $v_\theta$, different from WGF, which derives a flow from a prior energy. Specifically, we adopt the rectified flow approach (Liu et al., 2023; Liu, 2022) to generate interpolants at time $\tau\in[0,1]$ as $x_\tau = (1-\tau)x_0 + \tau x_1$, and subsequently estimate the vector field $v_\theta(x_\tau)$ with the following flow matching objective:
$$J_{\mathrm{FM}} = \mathbb{E}_{x_0\sim\mu_0,\, x_1\sim\pi}\left[\int_0^1 \big\|v_\theta(x_\tau) - v(x\,|\,x_\tau)\big\|^2\,d\tau\right], \quad (10)$$
where the conditional vector field is $v(x\,|\,x_\tau) = \dot{x}_\tau = x_1 - x_0$. The estimated vector field $v_\theta(x_\tau)$ defines a flow, which can be discretized using the Euler method as $x_{t+1} = x_t + \eta_3 v_\theta(x_t)$. Even without an explicit prior energy as in WGF, this flow guarantees high energy for source samples and low energy for target samples for each interpolation pair.

3.4 COMPLETE ALGORITHM

We summarize the three energies above into a unified Wasserstein gradient flow. To balance the three terms, we use step sizes $\eta_1, \eta_2, \eta_3$, and halt the iterated sampling process once the minimum Wasserstein distance to the target domain is reached. We construct each intermediate domain with $\alpha$ iterations, resulting in a total of $\alpha T$ iterations across $T$ intermediate domains. Each iteration follows
$$x_{t+1} = x_t + \underbrace{\eta_1 s_\pi(x_t;\phi) + \sqrt{2\eta_1}\,\xi}_{\text{Distribution-based}} \;-\; \underbrace{\eta_2\nabla_{x_t}L(x_t, h_t, y_t)}_{\text{Classifier-based}} \;+\; \underbrace{\eta_3 v_\theta(x_t)}_{\text{Sample-based}}. \quad (11)$$
A comprehensive algorithm and complexity analysis are described in Appendix C.

4 THEORETICAL ANALYSIS

This section provides a theoretical analysis of the proposed GGF method. The primary distinction between our analysis and prior theoretical works (Kumar et al., 2020; Wang et al., 2022) is that we do not use any given intermediate domains.
Instead, we design Wasserstein gradient flows to generate the intermediate domains, which incrementally shift the source distribution while preserving labels. We set the labeling function $f_T$ of the last intermediate domain and the target domain to be the same, which is reasonable considering the synthetic nature of the intermediate domains. Our analysis, with detailed proofs in Appendix B, has two main steps: first, we offer an upper bound of the target error in Lemma 1, using the risk on the last intermediate domain $\epsilon_{\mu_T}(h_T)$ and the final Wasserstein distance $W_2(\mu_T, \pi)$. Then, we derive the two terms and provide a generalization bound in Theorem 1. To facilitate our analysis, we adopt assumptions consistent with prior works (Dalalyan & Karagulyan, 2019; Kumar et al., 2020; Wang et al., 2022):

Assumption 1 Each predictor function $h\in\mathcal{H}$ and labeling function $f\in\mathcal{H}$ is $R$-Lipschitz in the $\ell_2$ norm, i.e., $\forall x, x'\in X: |h(x) - h(x')| \le R\,\|x - x'\|_2$.

Assumption 2 The loss function $\ell$ is $\rho$-Lipschitz, i.e., $\forall y, y'\in Y: |\ell(y,\cdot) - \ell(y',\cdot)| \le \rho\,\|y - y'\|_2$ and $|\ell(\cdot,y) - \ell(\cdot,y')| \le \rho\,\|y - y'\|_2$.

Assumption 3 The Rademacher complexity (Bartlett & Mendelson, 2002) of the hypothesis class $\mathcal{H}$, denoted by $\Re_n(\mathcal{H})$, is bounded by $\frac{B}{\sqrt{n}}$, i.e., $\Re_n(\mathcal{H}) \le \frac{B}{\sqrt{n}}$.

Assumption 4 The potential $V$ is $m$-strongly convex and $M$-Lipschitz smooth, i.e., $\forall x, x'\in X$:
1. $V(x) - V(x') - \langle \nabla V(x'), x - x'\rangle \ge \frac{m}{2}\|x - x'\|^2$ ($m$-strongly convex);
2. $V(x) \le V(x') + \langle \nabla V(x'), x - x'\rangle + \frac{M}{2}\|x - x'\|^2$ ($M$-Lipschitz smooth).

Assumption 5 The vector field $u_t$ is bounded, i.e., $\|u_t\| \le U$.

With Assumptions 1 and 2 as well as the shared labeling function $f_T$, we present Lemma 1. Despite the seeming similarity with Lemma 1 in (Wang et al., 2022), we note that this paper considers the Wasserstein distance over $X$.

Lemma 1 For any classifier $h\in\mathcal{H}$, the generalization error on the target domain is bounded by the error on the last generated intermediate domain and the Wasserstein distance:
$$\epsilon_\pi(h) \le \epsilon_{\mu_T}(h) + 2\rho R\, W_1(\mu_T, \pi). \quad (12)$$

Our method gradually refines the source classifier $h_0$ from the source domain $\mu_0$ to the last intermediate domain $\mu_T$, enabling a per-iteration analysis. Therefore, we introduce Proposition 1, which bounds the change of the generalization error after updating a classifier on a domain.

Proposition 1 (The Stability of GGF) Consider $\{(x_i, y_i)\}_{i=1}^n$ i.i.d. samples from a domain with distribution $\mu$, and let $h_\mu$ be a classifier. GGF provides a map $T_\mu^\nu$ that transports $x_i$ to the next domain $\nu$ and updates the classifier to $h_\nu$ by empirical risk minimization (ERM) on the shifted samples as $h_\nu = \arg\min_{h\in\mathcal{H}} \sum_{i=1}^n \ell\big(h(T_\mu^\nu(x_i)), y_i\big)$. Denote the labeling functions of the two domains by $f_\mu$ and $f_\nu$. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following bound holds true:
$$|\epsilon_\mu(h_\mu) - \epsilon_\nu(h_\nu)| \le \rho\big(\mathbb{E}_{x\sim\mu}|f_\mu(x) - f_\nu(x)| + R\,\Delta_\mu\big) + O\!\left(\frac{\rho B + \sqrt{\log(1/\delta)}}{\sqrt{n}}\right), \quad (13)$$
where $\Delta_\mu = \frac{1}{n}\sum_{i=1}^n \|x_i - T_\mu^\nu(x_i)\|$ denotes the average shift distance of the $x_i$ under $T_\mu^\nu$.

In comparing the stability of the GST algorithm (Wang et al., 2022) with ours, we note that the primary difference lies in the first term. Specifically, the GST algorithm bounds the error difference by the Wasserstein distance $W_p(\mu, \nu)$ over the joint distribution between adjacent domains. For our method, the error is bounded using two parts: 1) an expected difference in labeling functions $\mathbb{E}_{x\sim\mu}|f_\mu(x) - f_\nu(x)|$, bounded by the Continuous Discriminability defined in Appendix B.2, and 2) a transport cost $\Delta_\mu$, bounded by the vector-field limit $U$.
Theorem 1 (Generalization Bound for GGF) Under Assumptions 1-5, if $h_T$ is the classifier on the last intermediate domain updated by GGF, then for any $\delta\in(0,1)$, with probability at least $1-\delta$, the generalization error of $h_T$ on the target domain is upper bounded as
$$\epsilon_\pi(h_T) \le \epsilon_{\mu_0}(h_0) + \epsilon_{\mu_0}(h_T^*) + \rho R\left(\eta\alpha T U + 2(1-\eta m)^{\alpha T} W_2(\mu_0, \pi) + \frac{3.3M}{m}\sqrt{\eta p}\right) + O\!\left(\frac{\rho B + \sqrt{\log(1/\delta)}}{\sqrt{n}}\,T\right). \quad (14)$$

Based on the above results, we derive the generalization bound for GGF in Theorem 1. The first two terms $\epsilon_{\mu_0}(h_0)$ and $\epsilon_{\mu_0}(h_T^*)$ represent the source generalization errors of the source classifier and of the target labeling function. Besides, the $\epsilon_{\mu_0}(h_T^*)$ term can also be replaced with a smaller term $\rho K$, representing the continuous discriminability mentioned above. The $\eta\alpha T U$ term corresponds to the transport cost of samples across the intermediate domains through $\alpha T$ iterations with a step size of $\eta_1$, abbreviated as $\eta$. Meanwhile, $W_2(\mu_0, \pi)$ denotes the Wasserstein distance between the source and target domains over $X$.

Remark: Prior theoretical works (Kumar et al., 2020; Wang et al., 2022; Dong et al., 2022) link the generalization bound to the Wasserstein distance over $X\times Y$ across given intermediate domains. For the GST algorithm, disregarding the $O(1/\sqrt{nT})$ term introduced in the online learning framework for improved sample complexity, the bound proposed in (Wang et al., 2022) simplifies to
$$\epsilon_\pi(h_T) \le \epsilon_{\mu_0}(h_0) + \rho\sqrt{R^2+1}\,\sum_{t=1}^{T} W_p(\mu_{t-1}, \mu_t) + O\!\left(\frac{\rho B + \sqrt{\log(1/\delta)}}{\sqrt{nT}}\right), \quad (15)$$
which suggests that the optimal intermediate domains should lie along the Wasserstein geodesic path from the source to the target domain, which is unfortunately non-trivial to pinpoint for unlabeled domains. Instead, the novel upper bound we establish in Eq. (14) provides a promising and practical error bound for analyzing GDA when intermediate domains are generated along the Wasserstein gradient flow. Moreover, our analysis offers insights into minimizing the upper bound through the optimization of the hyperparameters $\alpha$, $T$, and $\eta$: by fixing two of them, the optimal value of the remaining one is determined by the lowest upper bound.

5 EXPERIMENTS

In Section 5.2, we evaluate the effectiveness of the proposed GGF on three GDA benchmarks and on pre-trained features of the UDA task Office-Home. The experimental results demonstrate that GGF constructs practical intermediate domains and gradually updates the initial classifier more accurately. In Section 5.3, we conduct ablation studies on the three kinds of energy, pseudo-labeling, multiple iterations, and the number of intermediate domains.

[Figure 3: Accuracy (%) and W2 distance to the target domain over intermediate domains on Portraits, comparing GGF, GOAT, and CNF.]

[Figure 4: An adaptation example of the "pen" category samples from the Art (source) to the Clipart (target) domain; accuracy improves from 36% to 53%.]

[Figure 2: Two toy examples. The dashed lines in the first and other columns correspond to the initial and updated classifiers, respectively.]

To aid understanding, we construct two toy datasets: a mixture of Gaussian blobs and two-moon (details in Appendix D.1). Figure 2 illustrates how GGF transports source samples to the target and updates the classifier.

5.1 EXPERIMENTAL SETTINGS

Datasets We mainly evaluate our method on five datasets. Portraits is a gender (binary) classification dataset with 37,921 front-facing portraits taken from 1905 to 2013. We follow the chronological split from Kumar et al.
(2020), creating a source domain (the first 2000 images), intermediate domains (14000 images, not used here), and a target domain (the next 2000 images). Rotated MNIST is a variant of the MNIST dataset (LeCun, 1998) consisting of 4000 source images and 4000 target images rotated by 45 or 60 degrees, as per He et al. (2023) and Kumar et al. (2020). Office-Home (Venkateswara et al., 2017) is a well-known UDA dataset with 65 categories across four domains: Artistic (Ar), Clipart (Cl), Product (Pr), and Real-World (Rw). VisDA-2017 (Peng et al., 2017) is a large-scale dataset including a simulation-to-real UDA task with 152,397 synthetic training images and 72,372 real-world test images across 12 categories.

Table 1: Accuracy (%) on GDA tasks.
Method | Portraits | MNIST 45 | MNIST 60
Source | 76.50 | 57.06 | 37.47
Self-Train | 76.99 | 60.22 | 39.81
GST (4) (Kumar et al., 2020) | 81.45 | 66.45 | 56.87
GOAT (He et al., 2023) | 83.17 | 61.45 | 46.29
CNF (Sagawa & Hino, 2022) | 84.57 | 62.55 | 42.18
GGF (Ours) | 86.16 | 67.72 | 54.21

Implementation Following the setting in (Sagawa & Hino, 2022), we use semi-supervised UMAP (McInnes et al., 2018) as the feature extractor to reduce the dimensionality of the input data while preserving class discriminability. For the UDA datasets, we use the extracted features as input for experiments. Further implementation details are deferred to Appendix E.

5.2 MAIN RESULTS

Results in Table 1 reveal that all three methods can generate effective intermediate domains for GDA, and the proposed GGF outperforms the others, especially for large domain shifts. Notably, our method is competitive with the baseline GST algorithm, which uses about four real intermediate domains. In Figure 3, we observe the accuracy evolution with the number of intermediate domains on Portraits for each method. GOAT creates samples along the Wasserstein geodesic but yields smaller improvements on datasets with small shifts. CNF's fluctuating accuracy indicates some ineffective domains, while our gradient flow-based method consistently reduces domain differences and improves accuracy. We evaluate GGF on Office-Home tasks and notice that it further improves the accuracy by up to 0.5%, as shown in Table 2. Additionally, we provide visual evidence of source sample adaptation in Figure 4, demonstrating that GGF proficiently transfers source features to the target domain, achieving superior feature alignment.

Table 2: Classification accuracy (%) on the Office-Home dataset with ResNet-50. The best accuracy is indicated in bold, and the second best is underlined.
Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg.
DANN (Ganin & Lempitsky, 2015) | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6
MSTN (Xie et al., 2018) | 49.8 | 70.3 | 76.3 | 60.4 | 68.5 | 69.6 | 61.4 | 48.9 | 75.7 | 70.9 | 55.0 | 81.1 | 65.7
GVB-GD (Cui et al., 2020) | 57.0 | 74.7 | 79.8 | 64.6 | 74.1 | 74.6 | 65.2 | 55.1 | 81.0 | 74.6 | 59.7 | 84.3 | 70.4
RSDA (Gu et al., 2020) | 53.2 | 77.7 | 81.3 | 66.4 | 74.0 | 76.5 | 67.9 | 53.0 | 82.0 | 75.8 | 57.8 | 85.4 | 70.9
LAMDA (Le et al., 2021) | 57.2 | 78.4 | 82.6 | 66.1 | 80.2 | 81.2 | 65.6 | 55.1 | 82.8 | 71.6 | 59.2 | 83.9 | 72.0
SENTRY (Prabhu et al., 2021) | 61.8 | 77.4 | 80.1 | 66.3 | 71.6 | 74.7 | 66.8 | 63.0 | 80.9 | 74.0 | 66.3 | 84.1 | 72.2
FixBi (Na et al., 2021) | 58.1 | 77.3 | 80.4 | 67.7 | 79.5 | 78.1 | 65.8 | 57.9 | 81.7 | 76.4 | 62.9 | 86.7 | 72.7
CST (Liu et al., 2021) | 59.0 | 79.6 | 83.4 | 68.4 | 77.1 | 76.7 | 68.9 | 56.4 | 83.0 | 75.3 | 62.2 | 85.1 | 73.0
CoVi (Na et al., 2022) | 58.5 | 78.1 | 80.0 | 68.1 | 80.0 | 77.0 | 66.4 | 60.2 | 82.1 | 76.6 | 63.6 | 86.5 | 73.1
RSDA + GGF | 60.1 | 77.9 | 82.2 | 68.4 | 78.3 | 77.2 | 67.8 | 60.3 | 82.5 | 75.8 | 61.0 | 85.2 | 73.1
CoVi + GGF | 59.2 | 79.0 | 80.4 | 69.3 | 80.1 | 78.1 | 66.8 | 61.7 | 83.1 | 76.2 | 62.8 | 86.5 | 73.6

In Appendix D, we provide GGF's performance on large-scale datasets (D.2), along with a comparative analysis of different feature extractors (D.3) and visualizations (D.5).

5.3 ABLATION STUDY

Different Energy Functions We conduct an ablation study on the three types of energy in GGF, as outlined in Table 3. Our findings highlight the importance of both the classifier-based and sample-based energies in achieving optimal performance with GGF. Notably, the classifier-based energy, functioning as a regularization term, significantly enhances the performance on the MNIST 60 task by preserving category information. This is crucial given the substantial domain shift between the source and target domains, as well as the unsatisfactory predictions of the initial classifier in the target domain.

Table 3: Ablation study on different energies. Accuracy (%); a check mark (✓) indicates that the corresponding energy is used.
Distribution-based | Classifier-based | Sample-based | Portraits | MNIST 45 | MNIST 60
✓ | | | 82.69 (±0.38) | 66.19 (±0.99) | 46.95 (±0.77)
✓ | ✓ | | 84.02 (±1.42) | 66.88 (±0.56) | 53.15 (±0.73)
✓ | | ✓ | 84.60 (±0.21) | 67.45 (±0.60) | 50.82 (±0.33)
✓ | ✓ | ✓ | 86.16 (±0.19) | 67.72 (±0.34) | 54.21 (±0.86)

Sensitivity Analysis on Step Sizes η We perform a sensitivity analysis by rescaling each step size while keeping the other two fixed on the Portraits dataset. The results, shown in Table 4, demonstrate that the performance stays numerically close across rescalings, indicating the robustness of our hyperparameters. Notably, we observe a significant decrease in performance only when reducing η1 or increasing η2, i.e., when the classifier-based energy dominates the distribution-based energy. This suggests that the former then induces larger velocity components, pushing samples away from the decision boundary. Fortunately, our hyperparameter optimization method avoids converging to solutions where the classifier-based energy dominates, as such solutions do not lead to a reduction in the Wasserstein distance from the samples to the target domain.

Table 4: Sensitivity analysis on step sizes η. The default rescale ratio is 100%.
Step size | 25% | 50% | 75% | 100% | 150% | 200% | 400%
η1 = 0.03 | 79.00 | 79.15 | 86.45 | 86.35 | 84.85 | 84.75 | 84.10
η2 = 0.08 | 84.75 | 85.80 | 86.15 | 86.35 | 74.45 | 78.90 | 68.00
η3 = 0.01 | 85.85 | 86.05 | 86.50 | 86.35 | 86.40 | 86.15 | 85.80

Comparison of Self-Training (ST) and Fine-Tuning (FT) Figure 5 compares the two updating methods of GGF: ST with pseudo-labels and FT with preserved labels. In the Portraits task with smaller α, ST with a smaller confidence threshold outperforms FT, but as α increases, FT is stable and better. In the MNIST 60 task, ST performs poorly with small α due to cumulative instability.
These results demonstrate that ST is less effective with larger domain gaps and more pseudo-labeling rounds, highlighting the advantages of our label-preserving approach. Additionally, a sufficiently large α corresponds to one-step adaptation. As shown in the figures, gradually updating the classifier as in the GDA setting works better than updating the classifier only on the last intermediate domain.

[Figure 5: Accuracy comparison of the two updating methods with varying hyperparameter α on the Portraits and MNIST 60 tasks, using different confidence thresholds c for self-training. With fixed sampling iterations αT, the number of intermediate domains T decreases as α increases.]

6 RELATED WORK

Unsupervised Domain Adaptation (UDA) UDA aims to facilitate knowledge transfer from a source domain to a target one with only unlabeled instances. Early UDA methods, divided into the two major categories of discrepancy-based and adversarial-based, center around learning domain-invariant representations. The discrepancy being minimized in discrepancy-based methods includes the Maximum Mean Discrepancy (MMD) (Borgwardt et al., 2006; Long et al., 2015), second-order statistics (Sun & Saenko, 2016), and optimal transport distances (Courty et al., 2014; 2017; Shen et al., 2018; Li et al., 2020; Kerdoncuff et al., 2021). Instead, adversarial-based methods pursue domain-invariant features that are indistinguishable by a domain discriminator (Ganin & Lempitsky, 2015; Tzeng et al., 2017; Long et al., 2018; Liu et al., 2019). Recent research (Chen et al., 2019; Yang et al., 2020; Tang & Jia, 2020) suggests a negative impact on discriminability when dealing with two wildly dissimilar domains, which motivates the transport- and synthetic sample-based methods. Transport-based methods preserve domain-specific features by training a feature extractor without a transfer loss and then transferring these features from the target to the source domain using optimal transport or sampling techniques. For instance, Kirchmeyer et al. (2022) and Fan & Alvarez-Melis (2023) trained an optimal transport map between conditional distributions in feature space and performed reweighting for label matching and robust pre-training. Recently, Gao et al. (2023) and Xiao et al. (2023) leveraged diffusion and energy-based models to shift target data for test-time input adaptation and to preserve the image structure and latent variables related to class information. Synthetic sample-based methods create intermediate samples at the input or feature levels by employing techniques such as geodesic flow kernels (Gong et al., 2012), adaptors (Choi et al., 2020), generative models (Hoffman et al., 2018; Gong et al., 2019; Hsu et al., 2020; Cui et al., 2020), or data augmentation (Xu et al., 2020; Wu et al., 2020; Fatras et al., 2022; Dai et al., 2021; Na et al., 2021; Jing et al., 2022; Na et al., 2022; Sahoo et al., 2023). The generated samples are utilized for improved feature learning, consistency regularization, or robust discriminator training.

Gradual Domain Adaptation (GDA) Unlike UDA methods prioritizing feature learning, GDA updates the model using unlabeled sequential intermediate domains, allowing more fine-grained adaptation. Kumar et al.
(2020) first introduced the setting with a straightforward gradual self-training (GST) algorithm. Meanwhile, Wang et al. (2020) proposed an adversarial-based method for a similar setting with continuously indexed domains. Based on the assumption of gradually shifting distributions, prior works (Kumar et al., 2020; Dong et al., 2022; Wang et al., 2022) provide and improve the upper generalization bound. Researchers have explored various algorithms for challenging scenarios to obtain the ideal intermediate domains. For instance, when the indexing of the extra unlabeled data is unavailable, Chen & Chao (2021) proposed an Intermediate Domain Labeler (IDOL) module to index them. In cases where intermediate domains are insufficient, prior studies have shown that generating additional intermediate domains through continuous normalizing flows (Sagawa & Hino, 2022) or optimal transport and linear interpolation (Abnar et al., 2021; He et al., 2023) can reduce the distance between adjacent domains and improve the performance of GST.

7 CONCLUSION

We introduce gradual domain adaptation via gradient flow (GGF), a novel approach to create continuous intermediate domains and incrementally fine-tune the classifier. Theoretically, we offer an upper-bound analysis of the target error. Empirically, we demonstrate that GGF outperforms prior synthetic-based GDA methods and enhances performance compared to state-of-the-art UDA baselines. Our results underscore GGF's ability to generalize effectively across various pre-trained features.

ACKNOWLEDGEMENTS

This work is supported by NSFC key grant under grant no. 62136005, NSFC general grant under grant no. 62076118, Hong Kong RMGS 9229111, and Shenzhen fundamental research program JCYJ20210324105000003.

REFERENCES

Samira Abnar, Rianne van den Berg, Golnaz Ghiasi, Mostafa Dehghani, Nal Kalchbrenner, and Hanie Sedghi. Gradual domain adaptation in the wild: When intermediate distributions are absent. arXiv preprint arXiv:2106.06080, 2021.

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. 2023.

Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.

David Alvarez-Melis and Nicolò Fusi. Dataset dynamics via gradient flows in probability space. In International Conference on Machine Learning. PMLR, 2021.

Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, 2019.

Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, 2006.

Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.

Baixu Chen, Junguang Jiang, Ximei Wang, Pengfei Wan, Jianmin Wang, and Mingsheng Long. Debiased self-training for semi-supervised learning. In Advances in Neural Information Processing Systems, 2022.

Hong-You Chen and Wei-Lun Chao. Gradual domain adaptation without indexed intermediate domains. In Advances in Neural Information Processing Systems, 2021.
Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018. Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin Wang. Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In International Conference on Machine Learning. PMLR, 2019. Xiang Cheng and Peter Bartlett. Convergence of langevin mcmc in kl-divergence. In Algorithmic Learning Theory. PMLR, 2018. Jongwon Choi, Youngjoon Choi, Jihoon Kim, Jinyeop Chang, Ilhwan Kwon, Youngjune Gwon, and Seungjai Min. Visual domain adaptation by consensus-based transfer to intermediate domain. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part I 14. Springer, 2014. Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, 2017. Published as a conference paper at ICLR 2024 Shuhao Cui, Shuhui Wang, Junbao Zhuo, Chi Su, Qingming Huang, and Qi Tian. Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. Yongxing Dai, Jun Liu, Yifan Sun, Zekun Tong, Chi Zhang, and Ling-Yu Duan. Idm: An intermediate domain module for domain adaptive person re-id. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin monte carlo and gradient descent. In Conference on Learning Theory. PMLR, 2017. Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278 5311, 2019. Jing Dong, Shiji Zhou, Baoxiang Wang, and Han Zhao. Algorithms and theory for supervised gradual domain adaptation. Transactions on Machine Learning Research, 2022. Alain Durmus and Eric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. 2019. Alain Durmus, Szymon Majewski, and Bła zej Miasojedow. Analysis of langevin monte carlo via convex optimization. Journal of Machine Learning Research, 20(1):2666 2711, 2019. Jiaojiao Fan and David Alvarez-Melis. Generating synthetic datasets by interpolating along generalized geodesics. In Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, 2023. Kilian Fatras, Hiroki Naganuma, and Ioannis Mitliagkas. Optimal transport meets noisy label robust loss and mixup regularization for domain adaptation. In Conference on Lifelong Learning Agents. PMLR, 2022. Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, et al. Pot: Python optimal transport. Journal of Machine Learning Research, 22(1):3571 3578, 2021. Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. ar Xiv preprint ar Xiv:1706.05208, 2017. Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning. PMLR, 2015. 
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096 2030, 2016. Jin Gao, Jialing Zhang, Xihui Liu, Trevor Darrell, Evan Shelhamer, and Dequan Wang. Back to the source: Diffusion-driven adaptation to test-time corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. Pierre Glaser, Michael Arbel, and Arthur Gretton. Kale flow: A relaxed kl gradient flow for probabilities with disjoint support. In Advances in Neural Information Processing Systems, 2021. Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012. Rui Gong, Wen Li, Yuhua Chen, and Luc Van Gool. Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. Xiang Gu, Jian Sun, and Zongben Xu. Spherical space domain adaptation with robust pseudo-label loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. Yifei He, Haoxiang Wang, Bo Li, and Han Zhao. Gradual domain adaptation: Theory and algorithms. ar Xiv preprint ar Xiv:2310.13852, 2023. Published as a conference paper at ICLR 2024 Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning. PMLR, 2018. Han-Kai Hsu, Chun-Han Yao, Yi-Hsuan Tsai, Wei-Chih Hung, Hung-Yu Tseng, Maneesh Singh, and Ming-Hsuan Yang. Progressive domain adaptation for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020. Xinru Hua, Truyen Nguyen, Tam Le, Jose Blanchet, and Viet Anh Nguyen. Dynamic flows on curved space generated by labeled data. In International Joint Conference on Artificial Intelligence, 2023. Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005. Mengmeng Jing, Lichao Meng, Jingjing Li, Lei Zhu, and Heng Tao Shen. Adversarial mixup ratio confusion for unsupervised domain adaptation. IEEE Transactions on Multimedia, 2022. Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker planck equation. SIAM journal on mathematical analysis, 29(1):1 17, 1998. Tanguy Kerdoncuff, Rémi Emonet, and Marc Sebban. Metric learning in optimal transport for domain adaptation. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021. Matthieu Kirchmeyer, Alain Rakotomamonjy, Emmanuel de Bezenac, and patrick gallinari. Mapping conditional distributions for domain adaptation under generalized target shift. In International Conference on Learning Representations, 2022. Anna Korba, Pierre-Cyril Aubin-Frankowski, Szymon Majewski, and Pierre Ablin. Kernel stein discrepancy descent. In International Conference on Machine Learning. PMLR, 2021. Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning. PMLR, 2020. Trung Le, Tuan Nguyen, Nhat Ho, Hung Bui, and Dinh Phung. Lamda: Label matching deep domain adaptation. 
In International Conference on Machine Learning. PMLR, 2021. Yann Le Cun. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998. Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991. Mengxue Li, Yi-Ming Zhai, You-Wei Luo, Peng-Fei Ge, and Chuan-Xian Ren. Enhanced transport distance for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. 2023. Hong Liu, Mingsheng Long, Jianmin Wang, and Michael Jordan. Transferable adversarial training: A general approach to adapting deep classifiers. In International Conference on Machine Learning. PMLR, 2019. Hong Liu, Jianmin Wang, and Mingsheng Long. Cycle self-training for domain adaptation. In Advances in Neural Information Processing Systems, 2021. Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. ar Xiv preprint ar Xiv:2209.14577, 2022. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. 2023. Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning. PMLR, 2015. Published as a conference paper at ICLR 2024 Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, 2018. Leland Mc Innes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. ar Xiv preprint ar Xiv:1802.03426, 2018. Youssef Mroueh and Mattia Rigotti. Unbalanced sobolev descent. In Advances in Neural Information Processing Systems, 2020. Jaemin Na, Heechul Jung, Hyung Jin Chang, and Wonjun Hwang. Fixbi: Bridging domain spaces for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. Jaemin Na, Dongyoon Han, Hyung Jin Chang, and Wonjun Hwang. Contrastive vicinal space for unsupervised domain adaptation. In European Conference on Computer Vision, 2022. Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. 2001. Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345 1359, 2010. Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE transactions on neural networks, 22(2):199 210, 2010. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12:2825 2830, 2011. Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge. ar Xiv preprint ar Xiv:1710.06924, 2017. Viraj Prabhu, Shivam Khare, Deeksha Kartik, and Judy Hoffman. Sentry: Selective entropy optimization via committee consistency for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. Gareth O Roberts and Richard L Tweedie. 
Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, pp. 341 363, 1996. Shogo Sagawa and Hideitsu Hino. Gradual domain adaptation via normalizing flows. ar Xiv preprint ar Xiv:2206.11492, 2022. Aadarsh Sahoo, Rameswar Panda, Rogerio Feris, Kate Saenko, and Abir Das. Select, label, and mix: Learning discriminative invariant feature representations for partial domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023. Adil Salim, Anna Korba, and Giulia Luise. The wasserstein proximal gradient algorithm. In Advances in Neural Information Processing Systems, 2020. Filippo Santambrogio. {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bulletin of Mathematical Sciences, 7:87 154, 2017. Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence. PMLR, 2020. Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In Computer Vision ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14. Springer, 2016. Hui Tang and Kui Jia. Discriminative adversarial domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. Published as a conference paper at ICLR 2024 Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. ar Xiv preprint ar Xiv:1412.3474, 2014. Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems, 2019. Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661 1674, 2011. Hao Wang, Hao He, and Dina Katabi. Continuously indexed domain adaptation. In International Conference on Machine Learning. PMLR, 2020. Haoxiang Wang, Bo Li, and Han Zhao. Understanding gradual domain adaptation: Improved analysis, optimal path and beyond. In International Conference on Machine Learning. PMLR, 2022. Yuan Wu, Diana Inkpen, and Ahmed El-Roby. Dual mixup regularized learning for adversarial domain adaptation. In European Conference on Computer Vision, 2020. Zehao Xiao, Xiantong Zhen, Shengcai Liao, and Cees G. M. Snoek. Energy-based test sample adaptation for domain generalization. In International Conference on Learning Representations, 2023. Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning. PMLR, 2018. Minghao Xu, Jian Zhang, Bingbing Ni, Teng Li, Chengjie Wang, Qi Tian, and Wenjun Zhang. Adversarial domain adaptation with domain mixup. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. 
Jianfei Yang, Han Zou, Yuxun Zhou, Zhaoyang Zeng, and Lihua Xie. Mind the discriminability: Asymmetric adversarial domain adaptation. In European Conference on Computer Vision, 2020.

Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning. PMLR, 2019.

Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning. PMLR, 2019.

Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.

A MATHEMATICAL BACKGROUND

A.1 WASSERSTEIN METRIC

Let $\mathcal{P}_2(\mathbb{R}^n)$ denote the space of probability measures on $\mathbb{R}^n$ with finite second moments, i.e., $\mathcal{P}_2(\mathbb{R}^n) = \{\mu\in\mathcal{P}(\mathbb{R}^n): \int \|x\|^2\,d\mu(x) < \infty\}$. Given $\mu, \nu\in\mathcal{P}_2(\mathbb{R}^n)$, the Wasserstein-p distance between them is defined as
$$W_p(\mu, \nu) = \left(\inf_{\pi\in\Gamma(\mu,\nu)} \int_{\mathbb{R}^n\times\mathbb{R}^n} \|x - y\|^p\,d\pi(x, y)\right)^{1/p}, \quad (16)$$
where $\Gamma(\mu, \nu)$ is the set of all possible couplings between $\mu$ and $\nu$. For every joint distribution $\pi\in\Gamma(\mu, \nu)$, we have $\mu(x) = \int_{\mathbb{R}^n} \pi(x, y)\,dy$ and $\nu(y) = \int_{\mathbb{R}^n} \pi(x, y)\,dx$. The integral on the right-hand side is also regarded as the cost of the optimal transport (OT) problem (Kantorovich's formulation), and the minimizing $\pi$ is the optimal transport plan. Besides, the monotonicity of the Wasserstein-p distance can be easily shown using Jensen's inequality, which implies that $W_p(\mu, \nu) \le W_q(\mu, \nu)$ for $1\le p\le q$. In the measurable space $(\mathcal{P}_2(\mathbb{R}^n), W_2)$ (a.k.a. the Wasserstein space), the inner product is defined as
$$\langle \mu, \nu\rangle_{\mu_t} = \int_{\mathbb{R}^n} \langle \mu(x), \nu(x)\rangle_{\mathbb{R}^n}\,d\mu_t(x). \quad (17)$$

A.2 DESCENT PROPERTY OF WASSERSTEIN GRADIENT FLOW

For a measurable space $(\mathcal{P}_2(\mathbb{R}^n), W_2)$, the vector field of a Wasserstein gradient flow is defined as $u_t = -\nabla_{W_2}\mathcal{E}(\mu_t)$. The descent property is a fundamental property of the Wasserstein gradient flow, demonstrating that the rate of change of an energy functional $\mathcal{E}(\mu_t)$ over time is always non-positive along the gradient flow. This property can be derived as follows:
$$\frac{d\mathcal{E}(\mu_t)}{dt} = \Big\langle \nabla_{W_2}\mathcal{E}(\mu_t), \frac{d\mu_t}{dt}\Big\rangle_{\mu_t} = \big\langle \nabla_{W_2}\mathcal{E}(\mu_t), -\nabla_{W_2}\mathcal{E}(\mu_t)\big\rangle_{\mu_t} = -\big\|\nabla_{W_2}\mathcal{E}(\mu_t)\big\|^2_{\mu_t} \le 0. \quad (18)$$
The first equality is derived using the chain rule, while the second equality is obtained by applying the definition of the Wasserstein gradient flow.

A.3 DISCRETIZATION SCHEMES OF WASSERSTEIN GRADIENT FLOW

There are two main discretization methods for the Wasserstein gradient flow. The forward scheme utilizes gradient descent in the Wasserstein space to determine the steepest movement direction. Given an energy functional $\mathcal{E}(\mu_t)$ and a step size $\gamma$, the forward scheme updates the distribution as follows:
$$\mu_{t+1} = \big(I - \gamma\nabla_{W_2}\mathcal{E}(\mu_t)\big)_{\#}\,\mu_t. \quad (19)$$
The backward scheme, also known as the Jordan-Kinderlehrer-Otto (JKO) scheme (Jordan et al., 1998), is a well-known discretization method for the Wasserstein gradient flow. It involves solving an optimization problem to obtain the updated distribution $\mu_{t+1}$, and is represented as follows:
$$\mu_{t+1} = \arg\min_{\mu\in\mathcal{P}_2(\mathbb{R}^d)} \mathcal{E}(\mu) + \frac{1}{2\gamma}W_2^2(\mu, \mu_t). \quad (20)$$
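For intuition, here is a minimal particle-level sketch of the forward scheme in Eq. (19), under the simplifying assumption that the energy is a pure potential term $\mathcal{E}(\mu) = \int V(x)\,d\mu(x)$, so that the Wasserstein gradient reduces to $\nabla V$; the quadratic potential below is an illustrative choice, not one used in the paper.

```python
import numpy as np

def forward_wgf(particles, grad_V, gamma=0.1, n_steps=50):
    """Forward (Euler) discretization of a Wasserstein gradient flow, Eq. (19):
    the empirical measure is pushed forward by the map I - gamma * grad V."""
    x = particles.copy()
    for _ in range(n_steps):
        x = x - gamma * grad_V(x)   # (I - gamma * grad_{W2} E)_# mu_t on particles
    return x

# Illustrative potential V(x) = 0.5 * ||x - c||^2, whose gradient is x - c,
# so the flow transports all particles towards the point c.
c = np.array([2.0, -1.0])
x0 = np.random.randn(256, 2)
xT = forward_wgf(x0, grad_V=lambda x: x - c, gamma=0.1, n_steps=50)
```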
B.1 PROOF OF LEMMA 1 (ERROR DIFFERENCE OVER LAST INTERMEDIATE AND TARGET DOMAIN)

Lemma 1 For any classifier $h\in\mathcal{H}$, the generalization error on the target domain is bounded by the error on the last generated intermediate domain and the Wasserstein distance:
$$\epsilon_\pi(h) \le \epsilon_{\mu_T}(h) + 2\rho R\, W_1(\mu_T, \pi). \quad (12)$$

Note that the generalization error is defined as $\epsilon_\mu(h) = \epsilon_\mu(h, f_\mu) = \mathbb{E}_{x\sim\mu}\,\ell(h(x), f_\mu(x))$, where $h$ represents the predictor and $f_\mu$ represents the labeling function defined on the data distribution $\mu$.

Proof. We start by using the definition of the generalization error to get
$$|\epsilon_{\mu_T}(h) - \epsilon_\pi(h)| = \Big|\mathbb{E}_{x\sim\mu_T}[\ell(h(x), f_T(x))] - \mathbb{E}_{x'\sim\pi}[\ell(h(x'), f_T(x'))]\Big| = \Big|\int \ell(h(x), f_T(x))\,d\mu_T - \int \ell(h(x'), f_T(x'))\,d\pi\Big| = \Big|\int \big(\ell(h(x), f_T(x)) - \ell(h(x'), f_T(x'))\big)\,d\gamma\Big|,$$
where $\gamma$ is any coupling (joint distribution) of $\mu_T$ and $\pi$. Then we have
$$\begin{aligned}
|\epsilon_{\mu_T}(h) - \epsilon_\pi(h)| &\le \int \big|\ell(h(x), f_T(x)) - \ell(h(x'), f_T(x'))\big|\,d\gamma \\
&\le \int \Big(\big|\ell(h(x), f_T(x)) - \ell(h(x'), f_T(x))\big| + \big|\ell(h(x'), f_T(x)) - \ell(h(x'), f_T(x'))\big|\Big)\,d\gamma \\
&\le \int \rho\,\big(|h(x) - h(x')| + |f_T(x) - f_T(x')|\big)\,d\gamma \\
&\le \int 2\rho R\,\|x - x'\|\,d\gamma.
\end{aligned}$$
In the above inequalities, we have used the triangle inequality and the Lipschitz continuity of the loss function $\ell$ and the predictor $h$ (labeling function $f$) to derive the desired result. After considering the arbitrary coupling $\gamma$, we can use the definition of the Wasserstein distance to get
$$\epsilon_\pi(h) \le \epsilon_{\mu_T}(h) + \inf_\gamma \int 2\rho R\,\|x - x'\|\,d\gamma = \epsilon_{\mu_T}(h) + 2\rho R\,W_1(\mu_T, \pi) \le \epsilon_{\mu_T}(h) + 2\rho R\,W_2(\mu_T, \pi).$$

B.2 DEFINITION OF CONTINUOUS DISCRIMINABILITY

Definition 1 (Continuous Discriminability) Let $f_0 = h_0^*$ and $f_T = h_T^*$ be the Bayes optimal predictors on the source and target domains. For $R$-Lipschitz labelers $\{f_t\}_{t=1}^{T}$ on the intermediate domains, we define the $f_t$ to satisfy the following minimization objective and denote the minimum as $K$. In particular, we let $K_0$ be the value of $K$ when all labeling functions within the intermediate domains are set to $h_T^*$, so that $K \le K_0$:
$$\sum_{t=0}^{T-1} \mathbb{E}_{x\sim\mu_t}\,|f_t(x) - f_{t+1}(x)| \;\le\; \mathbb{E}_{x\sim\mu_0}\,|h_0^*(x) - h_T^*(x)| = K_0. \quad (24)$$

In this study, only $f_0$ and $f_T$ are determined based on the true distributions of the source and target domains. For the theoretical analysis, we need to set appropriate labeling functions $f_t$ for the intermediate domains. As stated in Definition 1, we minimize the bound above to specify the labeling functions and obtain the minimum value $K$. It is worth noting that $K$ depends on the distribution of the generated intermediate domains. The discriminability of the feature representations can be measured by the term $\lambda = \epsilon_\mu(h^*) + \epsilon_\nu(h^*)$, as described in the classic UDA theory (Ben-David et al., 2006). When only one intermediate domain exists, $K$ equals $\lambda$. For more intermediate domains, we can define $K$ to represent the Continuous Discriminability of the model across domains. In an ideal scenario, minimizing the value $K$ to zero is possible. Figure 1(b) provides such an example, where $\mathbb{E}_{x\sim\mu_t}|f_t(x) - f_{t+1}(x)| = 0$ for any two adjacent domains, indicating that while two adjacent labeling functions may differ, they perform equally well on the distribution of the former domain. However, since $K$ is non-deterministic, we utilize the deterministic upper bound $K_0$ for further analysis, which corresponds to the source generalization error of the target labeling function $\epsilon_{\mu_0}(f_T)$.

B.3 PROOF OF PROPOSITION 1 (THE STABILITY OF GGF)

Proposition 1 (The Stability of GGF) Consider $\{(x_i, y_i)\}_{i=1}^n$ i.i.d. samples from a domain with distribution $\mu$, and let $h_\mu$ be a classifier.
GGF provides a map $T_\mu^\nu$ that transports $x_i$ to the next domain $\nu$ and updates the classifier to $h_\nu$ by empirical risk minimization (ERM) on the shifted samples, i.e., $h_\nu = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^n \ell(h(T_\mu^\nu(x_i)), y_i)$. Denote the labeling functions of the two domains by $f_\mu$ and $f_\nu$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following bound holds:
$$|\epsilon_\mu(h_\mu) - \epsilon_\nu(h_\nu)| \le \rho \left( \mathbb{E}_{x \sim \mu} |f_\mu(x) - f_\nu(x)| + R \Delta_\mu \right) + O\!\left( \sqrt{\frac{\log(1/\delta)}{n}} \right),$$
where $\Delta_\mu = \frac{1}{n} \sum_{i=1}^n \| x_i - T_\mu^\nu(x_i) \|$ denotes the average shift distance of the $x_i$ under $T_\mu^\nu$.

Proof. We first introduce $\epsilon_\mu(h_\mu, f_\nu)$ and use the triangle inequality to split the left-hand side (LHS) into two terms:
$$|\epsilon_\mu(h_\mu) - \epsilon_\nu(h_\nu)| = |\epsilon_\mu(h_\mu, f_\mu) - \epsilon_\nu(h_\nu, f_\nu)| \le |\epsilon_\mu(h_\mu, f_\mu) - \epsilon_\mu(h_\mu, f_\nu)| + |\epsilon_\mu(h_\mu, f_\nu) - \epsilon_\nu(h_\nu, f_\nu)|. \quad (25)$$
The first term can be expressed as a difference in the labeling functions:
$$|\epsilon_\mu(h_\mu, f_\mu) - \epsilon_\mu(h_\mu, f_\nu)| = \left| \mathbb{E}_{x \sim \mu}\, \ell(h_\mu(x), f_\mu(x)) - \mathbb{E}_{x \sim \mu}\, \ell(h_\mu(x), f_\nu(x)) \right| \le \mathbb{E}_{x \sim \mu} \left| \ell(h_\mu(x), f_\mu(x)) - \ell(h_\mu(x), f_\nu(x)) \right| \le \rho\, \mathbb{E}_{x \sim \mu} |f_\mu(x) - f_\nu(x)|. \quad (26)$$
For the second term, we introduce the empirical error $\hat\epsilon_\mu(h, f_\mu) = \frac{1}{n} \sum_{i=1}^n \ell(h(x_i), f_\mu(x_i))$, where $x_i$ is the $i$-th sample drawn from the distribution $\mu$. Using the triangle inequality, we obtain
$$|\epsilon_\mu(h_\mu, f_\nu) - \epsilon_\nu(h_\nu, f_\nu)| \le |\epsilon_\mu(h_\mu, f_\nu) - \hat\epsilon_\mu(h_\mu, f_\nu)| + |\hat\epsilon_\mu(h_\mu, f_\nu) - \hat\epsilon_\nu(h_\nu, f_\nu)| + |\hat\epsilon_\nu(h_\nu, f_\nu) - \epsilon_\nu(h_\nu, f_\nu)|. \quad (27)$$
To bound the difference between the generalization and empirical errors, we utilize the Rademacher complexity (Bartlett & Mendelson, 2002); this bound is also described in Lemma A.1 of Kumar et al. (2020). We then get
$$|\epsilon_\mu(h_\mu, f_\nu) - \epsilon_\nu(h_\nu, f_\nu)| \le |\hat\epsilon_\mu(h_\mu, f_\nu) - \hat\epsilon_\nu(h_\nu, f_\nu)| + O\!\left( \mathfrak{R}_n(\ell \circ \mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{n}} \right) \le |\hat\epsilon_\mu(h_\mu, f_\nu) - \hat\epsilon_\nu(h_\nu, f_\nu)| + O\!\left( \rho\, \mathfrak{R}_n(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{n}} \right). \quad (28)$$
The second inequality follows from Talagrand's Contraction Lemma (Ledoux & Talagrand, 1991), which bounds the Rademacher complexity of the composition of a hypothesis set $\mathcal{H}$ and a Lipschitz function $\ell$ by $\rho\, \mathfrak{R}_n(\mathcal{H})$, where $\rho$ is the Lipschitz constant of $\ell$ and $\mathfrak{R}_n(\mathcal{H})$ is the Rademacher complexity of $\mathcal{H}$ with respect to the given dataset.

We rely on the fine-tuning strategy for updating the classifier to bound the first term of the inequality in Eq. (28). This strategy updates the classifier on the transported samples with preserved labels; specifically, we have $h_\mu(x_i) = h_\nu(T_\mu^\nu(x_i))$. Then we get
$$\begin{aligned}
|\hat\epsilon_\mu(h_\mu, f_\nu) - \hat\epsilon_\nu(h_\nu, f_\nu)| &= \left| \frac{1}{n} \sum_{i=1}^n \ell(h_\mu(x_i), f_\nu(x_i)) - \frac{1}{n} \sum_{i=1}^n \ell\big(h_\nu(T_\mu^\nu(x_i)), f_\nu(T_\mu^\nu(x_i))\big) \right| \\
&\le \frac{1}{n} \sum_{i=1}^n \left| \ell(h_\mu(x_i), f_\nu(x_i)) - \ell\big(h_\nu(T_\mu^\nu(x_i)), f_\nu(T_\mu^\nu(x_i))\big) \right| && \text{(triangle inequality)} \\
&\le \frac{1}{n} \sum_{i=1}^n \left| \ell(h_\mu(x_i), f_\nu(x_i)) - \ell\big(h_\nu(T_\mu^\nu(x_i)), f_\nu(x_i)\big) \right| + \left| \ell\big(h_\nu(T_\mu^\nu(x_i)), f_\nu(x_i)\big) - \ell\big(h_\nu(T_\mu^\nu(x_i)), f_\nu(T_\mu^\nu(x_i))\big) \right| && \text{(triangle inequality)} \\
&= \frac{1}{n} \sum_{i=1}^n \left| \ell\big(h_\nu(T_\mu^\nu(x_i)), f_\nu(x_i)\big) - \ell\big(h_\nu(T_\mu^\nu(x_i)), f_\nu(T_\mu^\nu(x_i))\big) \right| && \text{(using } h_\mu(x_i) = h_\nu(T_\mu^\nu(x_i))\text{)} \\
&\le \frac{1}{n} \sum_{i=1}^n \rho \left| f_\nu(x_i) - f_\nu(T_\mu^\nu(x_i)) \right| && \text{(}\rho\text{-Lipschitz loss)} \\
&\le \frac{1}{n} \sum_{i=1}^n \rho R \, \| x_i - T_\mu^\nu(x_i) \| && \text{(}R\text{-Lipschitz labeling function)} \\
&= \rho R \Delta_\mu && \text{(definition of } \Delta_\mu\text{)} \\
&\le \rho R \eta \alpha U. && \text{(boundedness of } u_t\text{)} \quad (29)
\end{aligned}$$
By plugging Eqs. (26)-(29) into Eq. (25), we reach the conclusion.

B.4 PROOF OF THEOREM 1

Theorem 1 (Generalization Bound for GGF). Under Assumptions 1-5, if $h_T$ is the classifier on the last intermediate domain updated by GGF, then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the generalization error of $h_T$ on the target domain is upper bounded as
$$\epsilon_\pi(h_T) \le \epsilon_{\mu_0}(h_0) + \epsilon_{\mu_0}(h_T^*) + \rho R \left( \eta \alpha T U + 2(1 - \eta m)^{\alpha T} W_2(\mu_0, \pi) + 3.3 M \sqrt{\eta p} \right) + O\!\left( T \sqrt{\frac{\log(1/\delta)}{n}} \right).$$

Proof. By Lemma 1, we have the inequality $\epsilon_\pi(h_T) \le \epsilon_{\mu_T}(h_T) + 2\rho R\, W_2(\mu_T, \pi)$. To proceed, we bound the first term $\epsilon_{\mu_T}(h_T)$ by accumulating Proposition 1, and bound the second term $W_2(\mu_T, \pi)$ using the properties of the gradient flow.
We apply Proposition 1 recursively to get
$$|\epsilon_{\mu_0}(h_0) - \epsilon_{\mu_T}(h_T)| \le \sum_{t=1}^{T} \left| \epsilon_{\mu_{t-1}}(h_{t-1}) - \epsilon_{\mu_t}(h_t) \right| \le \sum_{t=1}^{T} \mathbb{E}_{x \sim \mu_{t-1}} \left| \ell(h_{t-1}(x), f_{t-1}(x)) - \ell(h_{t-1}(x), f_t(x)) \right| + \rho R \eta \alpha T U + O\!\left( T \sqrt{\frac{\log(1/\delta)}{n}} \right). \quad (30)$$
Only the labeling functions $f_0$ and $f_T$ are determined by the true distributions of the source and target domains. To select the labeling functions $f_t$ for the intermediate domains, we minimize the first term on the right-hand side of Eq. (30), similarly to the definition of $K$ and $K_0$. Therefore, for the simple strategy of setting all labeling functions $f_t$ on the intermediate domains to $f_T$, we have¹
$$\begin{aligned}
\sum_{t=1}^{T} \mathbb{E}_{x \sim \mu_{t-1}} \left| \ell(h_{t-1}(x), f_{t-1}(x)) - \ell(h_{t-1}(x), f_t(x)) \right| &\le \mathbb{E}_{x \sim \mu_0} \left| \ell(h_0(x), f_0(x)) - \ell(h_0(x), f_T(x)) \right| \\
&= \mathbb{E}_{x \sim \mu_0}\, \ell(h_0(x), f_T(x)) - \mathbb{E}_{x \sim \mu_0}\, \ell(h_0(x), f_0(x)) \\
&\le \mathbb{E}_{x \sim \mu_0}\, \ell(h_0(x), f_0(x)) + \mathbb{E}_{x \sim \mu_0}\, \ell(f_0(x), f_T(x)) - \mathbb{E}_{x \sim \mu_0}\, \ell(h_0(x), f_0(x)) \\
&= \mathbb{E}_{x \sim \mu_0}\, \ell(f_0(x), f_T(x)) = \epsilon_{\mu_0}(f_T) = \epsilon_{\mu_0}(h_T^*). \quad (31)
\end{aligned}$$

¹We can use $\rho K$ to bound this term, where $K$ is defined as $\inf \sum_{t=0}^{T-1} \mathbb{E}_{x \sim \mu_t} |f_t(x) - f_{t+1}(x)|$ and represents the continuous discriminability of the intermediate domains. However, since $K$ is non-deterministic, we analyze the deterministic upper bound $\epsilon_{\mu_0}(h_T^*)$.

To bound $W_2(\mu_T, \pi)$, we use the convergence properties of the LMC in the Wasserstein metric, which have been investigated extensively (Durmus & Moulines, 2019; Dalalyan, 2017; Cheng & Bartlett, 2018; Dalalyan & Karagulyan, 2019; Durmus et al., 2019; Vempala & Wibisono, 2019). Suppose the potential is $m$-strongly convex and $M$-Lipschitz smooth. Then by Theorem 1 of Dalalyan & Karagulyan (2019), we have
$$W_2(\mu_T, \pi) \le (1 - m\eta)^{\alpha T} W_2(\mu_0, \pi) + 1.65 (M/m) \sqrt{\eta p}, \quad (32)$$
where $\eta \le \frac{2}{m+M}$ is the constant step size and $p$ is the dimension of the samples. This theorem shows that the LMC enjoys exponential convergence in the Wasserstein distance. Moreover, replacing the LMC with the JKO scheme (Jordan et al., 1998) would further eliminate the second term (Salim et al., 2020). Finally, by combining Eqs. (30)-(32), we have
$$\epsilon_\pi(h_T) \le \epsilon_{\mu_T}(h_T) + 2\rho R\, W_2(\mu_T, \pi) \le \epsilon_{\mu_0}(h_0) + \epsilon_{\mu_0}(h_T^*) + \rho R \eta \alpha T U + O\!\left( T \sqrt{\frac{\log(1/\delta)}{n}} \right) + 2\rho R\, W_2(\mu_T, \pi) \le \epsilon_{\mu_0}(h_0) + \epsilon_{\mu_0}(h_T^*) + \rho R \left( \eta \alpha T U + 2(1 - \eta m)^{\alpha T} W_2(\mu_0, \pi) + 3.3 M \sqrt{\eta p} \right) + O\!\left( T \sqrt{\frac{\log(1/\delta)}{n}} \right),$$
which concludes the proof.

C COMPLETE ALGORITHM AND COMPLEXITY ANALYSIS

The process for constructing intermediate domains is outlined in Algorithm 1. Assuming that $N$ is the size of the source-domain dataset, the space and time complexities are $O(N)$ and $O(\alpha N)$, respectively. In practice, we divide the dataset into multiple batches, which keeps the complexities unchanged.

Algorithm 1: ConstructNextIntermediateDomain($x_t$, $y_t$, $s_{\pi,\phi}$, $h_t$, $v_\theta$, $\alpha$, $\eta$)
Input: samples $(x_t, y_t)$, classifier $h_t$, score network $s_\pi(x_t; \phi)$, rectified flow $v_\theta(x)$, hyperparameters $\alpha$, $\eta$, $\lambda$
Output: adapted samples $(x_{t+1}, y_{t+1})$
# Simplify the classifier-based potential
$L(x_t, h_t, y) = (1 - \lambda) L_{CE}(x_t, h_t, y) + \lambda L_{H}(x_t, h_t)$
# Update samples using the three energy functions and construct the next domain after $\alpha$ iterations
repeat $\alpha$ times
    $x_t \leftarrow x_t + \underbrace{\eta_1 s_\pi(x_t; \phi) + \sqrt{2\eta_1}\,\xi}_{\text{distribution-based}} - \underbrace{\eta_2 \nabla_{x_t} L(x_t, h_t, y_t)}_{\text{classifier-based}} + \underbrace{\eta_3 v_\theta(x_t)}_{\text{sample-based}}$
end
$(x_{t+1}, y_{t+1}) \leftarrow (x_t, y_t)$
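To connect Algorithm 1 to code, here is a minimal PyTorch-style sketch of its inner loop; it is not the released GGF implementation. `score_net`, `classifier`, and `flow_net` are assumed callables standing in for $s_{\pi,\phi}$, $h_t$, and $v_\theta$, and the default step sizes and $\lambda$ are illustrative.

```python
import torch
import torch.nn.functional as F

def construct_next_domain(x, y, score_net, classifier, flow_net,
                          alpha=10, etas=(0.03, 0.08, 0.01), lam=0.5):
    """One call produces one new intermediate domain (alpha update steps).

    score_net(x)  : estimate of the target score grad_x log pi(x)   (distribution-based)
    classifier(x) : logits of the current classifier h_t            (classifier-based)
    flow_net(x)   : rectified-flow velocity field v_theta(x)        (sample-based)
    Labels y are preserved, as in Algorithm 1.
    """
    eta1, eta2, eta3 = etas
    for _ in range(alpha):
        x = x.detach().requires_grad_(True)
        logits = classifier(x)
        # Classifier-based potential: (1 - lam) * cross-entropy + lam * entropy.
        ce = F.cross_entropy(logits, y)
        probs = logits.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
        potential = (1.0 - lam) * ce + lam * entropy
        grad_cls = torch.autograd.grad(potential, x)[0]
        with torch.no_grad():
            noise = torch.randn_like(x)
            x = (x
                 + eta1 * score_net(x) + (2.0 * eta1) ** 0.5 * noise  # distribution-based
                 - eta2 * grad_cls                                    # classifier-based
                 + eta3 * flow_net(x))                                # sample-based
    return x.detach(), y
```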
Algorithm 2: Complete Algorithm with Fixed Hyperparameters
Input: source samples $(x_{src}, y_{src})$, target samples $x_{tgt}$, expected number of intermediate domains $T$, fixed hyperparameters $\eta$ and $\alpha$, learning rate $k$, evaluation frequency $K$
Output: target classifier $h_T$
Train the initial classifier $h_0$ with the cross-entropy loss $L_{CE}$ on the source samples;
Train the score network $s_{\pi,\phi}$ for the distribution-based energy with $J_{DSM}$;
Train the rectified flow $v_\theta$ for the sample-based energy with $J_{FM}$;
$t \leftarrow 0$, $x_0 \leftarrow x_{src}$, $dis \leftarrow W_2(x_0, x_{tgt})$;
# OR is a short-circuit operator
while $t \bmod K \neq 0$ OR $W_2(x_t, x_{tgt}) \le dis$ do
    if $t \bmod K = 0$ then $dis \leftarrow W_2(x_t, x_{tgt})$;
    $(x_{t+1}, y_{t+1}) \leftarrow$ ConstructNextIntermediateDomain($x_t$, $y_t$, $s_{\pi,\phi}$, $h_t$, $v_\theta$, $\alpha$, $\eta$);
    $h_{t+1} \leftarrow \arg\min_{h \in \mathcal{H}} \ell(x_{t+1}, y_{t+1})$;
    $t \leftarrow t + 1$;
end

Algorithm 3: Complete Algorithm with Bilevel Optimization
Input: source samples $(x_{src}, y_{src})$, target samples $x_{tgt}$, expected number of intermediate domains $T$, initialized hyperparameters $\eta$ and $\alpha$, learning rate $k$
Output: target classifier $h_T$
Train the initial classifier $h_0$, the score network $s_{\pi,\phi}$, and the rectified flow $v_\theta$.
for epoch $\leftarrow 0$ to MaxEpoch do
    $t \leftarrow 0$, $x_0 \leftarrow x_{src}$, $dis \leftarrow W_2(x_0, x_{tgt})$;
    # Fix $\eta$, update $\alpha$
    while $t \bmod K \neq 0$ OR $W_2(x_t, x_{tgt}) \le dis$ do
        if $t \bmod K = 0$ then $dis \leftarrow W_2(x_t, x_{tgt})$;
        $(x_{t+1}, y_{t+1}) \leftarrow$ ConstructNextIntermediateDomain($x_t$, $y_t$, $s_{\pi,\phi}$, $h_t$, $v_\theta$, $\alpha$, $\eta$);
        $h_{t+1} \leftarrow \arg\min_{h \in \mathcal{H}} \ell(x_{t+1}, y_{t+1})$;
        $t \leftarrow t + 1$;
    end
    Update $\alpha$ as $\alpha \leftarrow \alpha t / T$;
    # Fix $\alpha$, update $\eta$
    for $t \leftarrow 0$ to $T$ do
        $(x_{t+1}, y_{t+1}) \leftarrow$ ConstructNextIntermediateDomain($x_t$, $y_t$, $s_{\pi,\phi}$, $h_t$, $v_\theta$, $\alpha$, $\eta$);
        $h_{t+1} \leftarrow \arg\min_{h \in \mathcal{H}} \ell(x_{t+1}, y_{t+1})$;
    end
    Update $\eta$ as $\eta \leftarrow \eta - k \nabla_\eta W_2(x_T, x_{tgt})$;
end

Algorithm 2 provides a comprehensive overview of our complete algorithm. This approach dynamically determines the number of iterations based on the Wasserstein distance within a mini-batch. To compute this distance, we employ the Sinkhorn solver from the POT toolbox (Flamary et al., 2021), which has nearly $O(n^2)$ space and time complexities, where $n$ denotes the batch size. To further alleviate the computational burden of the distance calculation, we introduce an evaluation frequency parameter $K$. Consequently, the space and time complexities become $O(N + n^2)$ and $O(\alpha TN + Tn^2/K)$, respectively. If we instead fix the hyper-parameter $T$, the space and time complexities reduce to $O(N)$ and $O(\alpha TN)$, respectively.

Due to the monotonicity property, our analysis in Eq. (14) suggests that the hyper-parameter value attaining the lowest upper bound is determined once the other hyper-parameters are held constant. We therefore introduce a bi-level optimization approach, detailed in Algorithm 3, to find appropriate hyper-parameter settings. In our experiments, we apply this optimization exclusively to the GDA datasets. We do not perform hyperparameter optimization for the Office-Home and VisDA-2017 datasets due to gradient vanishing issues caused by the large number of iterations; instead, we use grid search to determine the hyper-parameters.

D ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS

D.1 TOY DATASET

Figure 6 illustrates the two toy datasets. The first dataset is a mixture of three Gaussian blobs (3 classes), while the second is the two-moon dataset (2 classes) implemented with scikit-learn (Pedregosa et al., 2011). Both datasets consist of 1000 samples per class in the source domain. To construct the target domain, we sample an additional 1000 samples from the same distribution and then shift them: for the Gaussian-blob dataset we change the mean and standard deviation to create a shifted distribution, and for the two-moon dataset we rotate the samples around the origin by 45 degrees.
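For concreteness, the snippet below sketches how these two toy source/target pairs can be built with scikit-learn and NumPy. The specific blob centers, the shift and scale applied to the target blobs, and the noise level of the moons are illustrative assumptions; only the 1000-samples-per-class setting and the 45-degree rotation come from the description above.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_moons

# Gaussian-blob dataset: 3 classes, 1000 samples per class.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])   # illustrative class means
Xs_blob, ys_blob = make_blobs(n_samples=3000, centers=centers,
                              cluster_std=1.0, random_state=0)
# Target domain: resample from the same distribution, then shift the mean and
# scale the standard deviation (shift/scale values here are illustrative).
Xt_blob, yt_blob = make_blobs(n_samples=3000, centers=centers,
                              cluster_std=1.0, random_state=1)
Xt_blob = Xt_blob * 1.5 + np.array([4.0, -2.0])

# Two-moon dataset: 2 classes; the target domain is rotated 45 degrees about the origin.
Xs_moon, ys_moon = make_moons(n_samples=2000, noise=0.05, random_state=0)
Xt_moon, yt_moon = make_moons(n_samples=2000, noise=0.05, random_state=1)
theta = np.deg2rad(45.0)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
Xt_moon = Xt_moon @ rot.T
```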
Figure 6: (a) A mixture of Gaussian blobs; (b) two moons. The left panels show the source-domain data and the initial classifier, the middle panels show the target-domain data and the probability density estimated by denoising score matching, and the right panels show an intuitive comparison between the source and target domains.

D.2 EVALUATION ON THE LARGE-SCALE DATASET

Table 5: Accuracy (%) on VisDA-2017 (ResNet-50), Synthetic → Real.
Method                      Synthetic → Real
CDAN (Long et al., 2018)    70.0
MDD (Zhang et al., 2019)    74.6
RSDA (Gu et al., 2020)      75.8
RSDA + GGF                  77.6

To further validate the effectiveness of our algorithm, we conducted an additional experiment on the VisDA-2017 dataset. We used the state-of-the-art RSDA method for feature extraction and applied GGF in the feature space. As shown in Tables 5 and 6, the results demonstrate notable improvements, with an average accuracy increase of 2.3% across classes and a 1.8% improvement in overall accuracy.

Table 6: Detailed per-class accuracy (%) on VisDA-2017 compared with RSDA.
Method      aero  bicycle  bus   car   horse  knife  motor  person  plant  skate  train  truck  Avg.   Acc.
RSDA        92.4  87.3     88.9  74.0  94.2   0.06   89.1   44.8    93.4   93.1   84.5   42.4   73.72  75.75
RSDA + GGF  92.4  85.2     87.1  73.3  94.1   0.07   89.7   70.0    94.0   90.6   84.5   43.9   75.98  77.56

D.3 COMPARATIVE EVALUATION OF VARIOUS FEATURE EXTRACTORS

Table 7: Accuracy (%) on the Office-Home dataset with ResNet-50; the Δ rows give the change from adding GGF (shown in red in the original paper for increased accuracy).
Method                  Ar→Cl  Ar→Pr  Ar→Rw  Cl→Ar  Cl→Pr  Cl→Rw  Pr→Ar  Pr→Cl  Pr→Rw  Rw→Ar  Rw→Cl  Rw→Pr  Avg.
ResNet-50 (fine-tuned)  46.2   71.5   74.0   58.4   68.1   69.7   55.8   42.3   72.9   66.9   47.3   76.0   62.4
ResNet-50 + GGF         48.5   74.0   76.7   61.2   70.0   71.4   60.3   44.2   75.4   69.1   49.5   77.9   64.9
Δ for ResNet-50         +2.3   +2.5   +2.7   +2.8   +1.9   +1.7   +4.5   +1.9   +2.5   +2.2   +2.2   +1.9   +2.5
MSTN                    57.5   70.9   77.0   60.4   71.0   69.2   61.4   56.3   79.6   70.9   54.4   80.4   67.4
MSTN + GGF              58.1   75.9   79.7   66.5   75.7   75.1   65.6   58.6   81.5   74.3   60.0   84.2   71.3
Δ for MSTN              +0.6   +5.0   +2.7   +6.1   +4.7   +5.9   +4.2   +2.3   +1.9   +3.4   +5.6   +4.2   +3.9
RSDA                    53.2   77.7   81.3   66.4   74.0   76.5   67.9   53.0   82.0   75.8   57.8   85.4   70.9
RSDA + GGF              60.1   77.9   82.2   68.4   78.3   77.2   67.8   60.3   82.5   75.8   61.0   85.2   73.1
Δ for RSDA              +6.9   +0.2   +0.9   +2.0   +4.3   +0.7   -0.1   +6.7   +0.5   0.0    +3.2   -0.2   +2.2
CoVi                    58.5   78.1   80.0   68.1   80.0   77.0   66.4   60.2   82.1   76.6   63.6   86.5   73.1
CoVi + GGF              59.2   79.0   80.4   69.3   80.1   78.1   66.8   61.7   83.1   76.2   62.8   86.5   73.6
Δ for CoVi              +0.7   +0.9   +0.4   +1.2   -0.1   +0.9   +0.4   +1.4   +1.0   -0.4   -0.9   0.0    +0.5

In Table 7, we reiterate the performance of our method on the Office-Home dataset with three different feature extractors. Our primary objective and contribution is the generalization of pre-trained features, produced by any feature extractor and readily accessible nowadays, to target domains. This comprehensive empirical evaluation provides compelling evidence of the consistent superiority of GGF, even on pre-trained features generated by CoVi, which already achieves state-of-the-art performance in reducing the distribution shift.

D.4 COMPUTATIONAL OVERHEAD ANALYSIS

Table 8: Running time (s) on GDA tasks.
Dataset     GOAT   CNF    GGF (Ours)
Portraits   2.84   5.25   5.76
MNIST 45    7.95   14.23  29.53

In this section, we detail the computational overhead of our approach. The training process of GGF can be broadly divided into two stages.
The first stage trains the score network and the rectified flow, and its running time depends on the number of training epochs. The second stage generates the intermediate domains and gradually fine-tunes the classifier. GGF discretizes the Wasserstein gradient flow with a forward scheme, which leads to a time complexity that is linear in the number of intermediate domains and the scale of the dataset. During this stage, the time consumed for generating intermediate domains and updating the classifier is comparable to that of the baseline methods GOAT and CNF. Table 8 reports the running time of each method on the two GDA datasets.

D.5 VISUALIZATION

To demonstrate the process of generating intermediate domains with the proposed GGF method, we visualize the evolution of source samples using t-SNE on Portraits and MNIST 45 in Figures 7 and 8, respectively. We selected a small number of equally spaced intermediate domains for visualization. These figures show that the source samples evolve over the intermediate domains, indicating that the gradual domain adaptation algorithm effectively adapts the source distribution to the target distribution.

In the UDA task on Office-Home, which involves 65 categories, we conduct a detailed analysis of the t-SNE visualizations for specific classes. As shown in Figures 9 and 10, we visualize the categories where performance is improved (success cases) and degraded (failure cases). In the success cases, we observe that same-class samples in the source domain tend to cluster together with small variance, while their corresponding target samples are more widely distributed and do not form a single cluster. Additionally, the intermediate domains generated by GGF better cover the target distribution. In the failure cases, we observe that the performance degradation is mainly due to source-domain samples diffusing toward other, similar categories. For instance, in Figure 10(c), the source samples in the Bottle category shift in the opposite direction of the target-domain distribution, resulting in a 6% decrease in accuracy for this category. However, all of those samples are classified into the similar category "Soda".

Figure 7: t-SNE of features from the source domain, intermediate domains (#20, #40, #60, #80, #100, #120), and the target domain on Portraits.

Figure 8: t-SNE of features from the source domain, intermediate domains (#600 through #6000), and the target domain on MNIST 45.
Figure 9: Visualization of success cases, with source samples (yellow), the last shifted intermediate samples (red), and target samples (blue). Each subfigure corresponds to one category in a single task and shows the accuracy change for that category: Desk_Lamp (Art → Product, 22% → 83%), Table (Art → Real_World, 42% → 59%), Clipboards (Clipart → Product, 74% → 83%), Couch (Clipart → Real_World, 67% → 76%), Flipflops (Product → Art, 70% → 78%), and Laptop (Real_World → Clipart, 56% → 59%).

Figure 10: Visualization of failure cases, with the same settings as Figure 9: Monitor (Art → Clipart, 44% → 37%), TV (Clipart → Product, 63% → 50%), and Bottle (Clipart → Real_World, 82% → 76%).

Figure 11: t-SNE visualization of generated (grey) and existing intermediate domains (colored) on Portraits.

To further validate the effectiveness of the generated intermediate domains, we present visualizations of the relationship between the transported examples and the pre-existing examples in the intermediate domains of the two GDA datasets. The visualizations on Portraits and rotated MNIST are shown in Figure 11 and Figure 12, respectively. In these figures, ellipses represent the variance of the samples, while transparency indicates the domain index (the source domain is the most transparent and the target domain the least transparent); each intermediate domain is represented solely by its mean and variance.

For Portraits, due to the imperfect alignment of facial features over the years, previous work (Chen & Chao, 2021) has shown limited overall improvement when using the existing intermediate domains. As shown, while our transported examples lie along a straight trajectory in latent space from source to target, they do not precisely match the real intermediate examples. By generating a smoother transport, our method achieves more stable accuracy improvements and even surpasses the accuracy obtained with the existing examples. For rotated MNIST, in many categories our transported examples align almost perfectly with the existing examples in the t-SNE visualizations, indicating that our model successfully learned the "rotation" features. However, for some categories the alignment is poorer, leading to lower performance than gradual self-training on the real intermediate samples.

Figure 12: t-SNE visualization of generated (grey) and existing domains (colored) on rotated MNIST. First row (classes 1, 4, 5, 8): categories well aligned with the real domains. Second row (classes 0, 3, 6, 9): categories with weaker alignment.

In summary, our generated examples are comparable or even superior to the existing intermediate samples, effectively bridging the gap between the source and target domains.
Table 9: Hyperparameters of GGF on different datasets.
Dataset     Confidence threshold   λ    α    T    η1     η2     η3
Portraits   0.05                   0    10   20   0.03   0.08   0.01
MNIST 45    0.2                    1    100  60   0.01   0.005  0.002
MNIST 60    n/a                    0.8  10   500  0.1    0.02   0.005
Ar→Cl       n/a                    1    10   10   0.1    0.05   0.001
Ar→Pr       n/a                    1    10   20   0.03   0.01   0.001
Ar→Rw       n/a                    1    10   20   0.05   0.02   0.0001
Cl→Ar       n/a                    1    10   20   0.01   0.01   0.001
Cl→Pr       n/a                    1    10   20   0.01   0.001  0.001
Cl→Rw       n/a                    1    10   20   0.001  0.001  0.001
Pr→Ar       n/a                    1    10   20   0.03   0.06   0.001
Pr→Cl       n/a                    1    100  5    0.2    0.05   0.001
Pr→Rw       n/a                    1    100  20   0.005  0.005  0.0001
Rw→Ar       n/a                    1    10   50   0.2    0.2    0.001
Rw→Cl       n/a                    1    10   10   0.1    0.1    0.001
Rw→Pr       n/a                    1    10   10   0.01   0.01   0.001

E IMPLEMENTATION

E.1 TRAINING SETTINGS

We use the official implementations of the GOAT (He et al., 2023) and CNF (Sagawa & Hino, 2022) methods² ³ and use UMAP (McInnes et al., 2018) to reduce the dimensionality of the three GDA datasets to 8. These experiments are conducted on a single NVIDIA 2080Ti GPU. For the two UDA datasets, Office-Home and VisDA-2017, we re-implement RSDA (Gu et al., 2020) and CoVi (Na et al., 2022) as feature extractors on a single NVIDIA V100 GPU and then apply our method in the latent space in the same way. For the rectified flow (Liu et al., 2023; Liu, 2022), we adopt the official implementation⁴. Our code is publicly available at https://github.com/zwebzone/ggf.

²https://github.com/yifei-he/GOAT
³https://github.com/ssgw320/gdacnf
⁴https://github.com/gnobitab/RectifiedFlow

E.2 HYPERPARAMETERS

Figure 13: Accuracy comparison with varying hyperparameters on Portraits.

In our experiments, we train three neural networks (the score network, the rectified flow, and the classifier), each consisting of three fully connected layers. The classifier is trained in two steps: we first train the initial classifier and then update it using the intermediate domains. We use the SGD optimizer with a learning rate of $10^{-4}$ for training all modules and for updating the classifier, and the batch size for each domain is set to 1024. To improve the training of the score network and the rectified flow, we weight the target-domain samples by the classification confidence of the initial classifier. During the gradual adaptation phase, we update the classifier in each intermediate domain using self-training or fine-tuning for five epochs. For self-training, we set a confidence threshold $c$ to filter out the least confident examples, which means that $c \times 100\%$ of the samples are removed when updating the classifier. For fine-tuning, we use all generated samples with their preserved labels, and the confidence threshold $c$ is not applicable in this case.

We discussed how to optimize the hyperparameters in Appendix C. Here, we provide the hyperparameters in Table 9, where $\lambda$ is the balancing weight between the cross-entropy and the entropy in the classifier-based energy function, and $\alpha$, $T$, and $\eta$ are used in the sampling process. As shown in Figure 13, our method is not particularly sensitive to these hyperparameters when $\alpha$ is set to a small integer.
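As a minimal sketch of the training setup described in E.1 and E.2 (not the released implementation), the snippet below builds the three three-layer fully connected modules and the confidence-threshold filter applied before self-training. The hidden width, the class count, and the time-conditioning of the rectified-flow network are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden_dim, out_dim):
    """Three fully connected layers, as described for each module."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, out_dim),
    )

# Feature dimension 8 matches the UMAP-reduced GDA features; the rest is illustrative.
feat_dim, num_classes, hidden = 8, 10, 256
classifier = mlp(feat_dim, hidden, num_classes)
score_net  = mlp(feat_dim, hidden, feat_dim)      # outputs a score (vector-field) estimate
flow_net   = mlp(feat_dim + 1, hidden, feat_dim)  # assumed to also take a time input

# SGD with learning rate 1e-4, as stated above.
optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-4)

def filter_by_confidence(x, logits, c=0.2):
    """Drop the c * 100% least confident samples before self-training."""
    conf, pseudo_labels = logits.softmax(dim=-1).max(dim=-1)
    keep = conf.argsort(descending=True)[: int((1.0 - c) * x.size(0))]
    return x[keep], pseudo_labels[keep]
```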